Python at Google.notes
Friday, March 25, 2005
TITLE OF PAPER: Python at Google
URL OF PRESENTATION: --not available--
PRESENTED BY: Greg Stein
REPRESENTING: Google
CONFERENCE: PyCon 2005
DATE: March 25, 2005
LOCATION: Marvin Theater
--------------------------------------------------------------------------
REAL-TIME NOTES / ANNOTATIONS OF THE PAPER:
{If you've contributed, add your name, e-mail & URL at the bottom}
[ A new copy of the O'Reilly Python Success Stories booklet will be produced.
Contact Stephan Deibel @ pythonology.org ]
"Python has been an important part of Google since the beginning, and remains so as the system grows and evolves. Today dozens of Google engineers use Python, and we're looking for more people with skills in this language."
-- Peter Norvig, Director of Search Quality at Google
My background
Python developer
10 years
Contributed to Python itself
Authored a number of modules and applications
ViewCVS
Open Source Guy
Contributed to numerous projects (including Python)
Current chairman of the Apache Software Foundation
ViewCVS, written entirely in Python
Contributed to Subversion, Apache server
"We consider Python to be our 'secret sauce'"
--Paul Everitt, talking about Digital Creations, circa 1996
This is a recognition of how Python can help a business.
My view of Python in the workplace
Python at eShop
1995 "What in the world is Python?"
1996 "This is great stuff."
(MS acquired eShop in '96)
Python at Microsoft
1996: "It's called what?"
1997: "You actually shipped Python code?" (MerchantServer 1.0)
1998: "Nice prototype. We'll rewrite it in the next version." And they
did, in C++.
Python in the workplace (continued)
Python at CollabNet
2001: "No, we don't really use Python here." (they used Java)
2003: "Definitely! Write that in Python"
Python caught on here like a virus, moving from developer to developer.
Python at Google
2004 "Of *course* we use Python. Why wouldn't we?"
Changing attitudes over time
Small companies eventually "Got it" ahead of the curve
Champion was needed
Larger Companies follow Python's growth curve
Supporting environment was needed
A number of factors made Python possible in larger organizations:
It is now possible. Here's why:
Python had to grow for it to become "business acceptable"
Large enough talent pool - "where are we going to be able to find these people?"
Support services: Books, Consulting, World Wide Web
Follow the trailblazers
Python passed the tipping point years ago
Not a problem to incorporate it into your business, lots of support,
consulting
Business advantage
"These are some of the reasons we use Python at Google"
Highly adaptable
Changing requirements
- You need a language that is very flexible, so you can adapt your tools during development
Changes in computing environment
Rapid development
For new and experienced developers
The market moves very, very quickly; you want to be able to keep up with it. If it takes two years for you to respond to something that is needed today, you're behind the curve.
Easy to maintain - most important point in Greg's view
You can come back a year later, look at that code, and understand what
is going on
Google's programming environment
Primary Languages
C++
Java
Python
If you want to write a piece of something else, like Perl, you have to
almost get special permission. (Exceptions in ops, but for actual
product stuff, see above)
Miscellaneous
Some Perl used by Operations (others almost have to get permission to use Perl)
PHP creeps in for internal webapps
Saw Ruby sneaking around
Small amount of C#
In actual product stuff: C++, Java, Python
SWIG is your friend
SWIG: Simplified Wrapper Interface Generator
www.swig.org
Started by David Beazley
Multi-language environment
A lot of people at Google don't know Python and produce C++ code.
SWIG pulls these "islands together"--they have a lot of stuff lying
around written in various languages. SWIG examines a C++ header file
and auto-generates Python bindings
So for all of our libraries that we have - for parsing HTML,
crawling HTTP and so on - they are made available to Python
using SWIG.
Good for Google programmers who use C++ but don't know Python
Very fast mechanism for integration
Integrated into build system
Makes it very easy for us to add a rule into our build system to just add a library into our Python dependency module
Where do we use it?
Across our internal network
Across a system lifecycle
Live Services
Basic Network
<diagram of development pushing through infrastructure to (1000) servers>
Some usage to support development
Wrappers for Version control (Perforce) (JB note: Perforce can output
marshalled Python objects -- very cool, extremely useful for scripting. Also see svn SWIG mention in Q&A)
They improved branch management.
Running unit tests on checkin
People "earn" their ability to check in after they understand code
guidelines, etc.
Automatically enforce style guidelines
Build System (itself written in Python)
Packaging
We've got giant bundles of code and giant bundles of data which need to
be delivered up to the servers.
Packaging system is built in Python
Third generation of this system
Ability to roll back a version
We can keep iterating and moving forward because we're building all this stuff in Python
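The Perforce note above mentions output as marshalled Python objects. A minimal sketch of how a wrapper script might read such a stream, assuming the records arrive as back-to-back marshalled dictionaries (which is what the p4 client's -G option is documented to emit). The function name read_marshalled_records is hypothetical and not from the talk; nothing here reflects Google's actual wrappers.

```python
import io
import marshal


def read_marshalled_records(stream):
    """Read back-to-back marshalled objects from a binary stream.

    Assumption: each record is one marshalled dict, and the stream ends
    cleanly after the last record (marshal.load raises EOFError there).
    """
    records = []
    while True:
        try:
            records.append(marshal.load(stream))
        except EOFError:
            break
    return records


def fake_stream(dicts):
    """Build an in-memory stand-in for tool output (illustration only)."""
    return io.BytesIO(b"".join(marshal.dumps(d) for d in dicts))
```

In a real script the stream would be the stdout pipe of a subprocess rather than the in-memory stand-in built by fake_stream.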
Some usage in the network infrastructure
Binary/data pusher
Figures out best way to send stuff from one place to
another -- dev to data center, etc
We're on third/fourth generation of this, keep increasing the scale of
the problem. Python's making that possible - able to iterate quickly
Package repository
Some usage on production servers
Monitoring
Is this thing still alive? Is it running? Does it think it's healthy? Is
it seeing problems with the hard disk? Is the CPU temperature fine?
All of this information is gathered with a little Python program running on the server, then collected by another Python program.
Auto-restart
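A rough sketch of the two-program pattern described above: one small function that would run on each server and report its health, and one that a collector could run over the gathered reports. Everything here is an assumption made for illustration (the names local_health_report and hosts_needing_attention, the report fields, the thresholds); it only checks free disk space, not the hard disk errors or CPU temperature mentioned in the talk.

```python
import shutil
import socket
import time


def local_health_report(path="/", min_free_fraction=0.05):
    """Build a small health report for this machine (hypothetical format)."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "disk_free_fraction": free_fraction,
        "healthy": free_fraction >= min_free_fraction,
    }


def hosts_needing_attention(reports, max_age_seconds=60.0, now=None):
    """Return hosts whose report is unhealthy or too old to trust.

    A stale report is treated like a dead server, which is the case an
    auto-restart mechanism would presumably care about.
    """
    now = time.time() if now is None else now
    flagged = []
    for report in reports:
        stale = now - report["timestamp"] > max_age_seconds
        if stale or not report["healthy"]:
            flagged.append(report["host"])
    return flagged
```

How the reports travel from the servers to the collector is left out on purpose; the notes do not say how that was done.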
Complete the Lifecycle
Log reporting
We generate a "large" amount of log information
Data is pulled back from the servers
Analyzed using lots of Python tools
Ad group needs to spot fraudulent clicks. This is a constant cat-and-mouse
game with the script kiddies writing fraudulent ad clickers.
Easy to alter the reports based on ever-changing needs
Every time we find some way people are fraudulently clicking our ads, we
patch that hole. It's a continuous process.
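To make the "easy to alter the reports" point concrete, here is a small sketch of the kind of log report that is quick to write and change in Python. The log format (timestamp, IP address, ad id separated by whitespace), the threshold, and the name suspicious_clickers are all invented for this example; the talk gave no details of the real reports or of how fraud is actually detected.

```python
from collections import Counter


def suspicious_clickers(log_lines, max_clicks_per_ad=3):
    """Flag (ip, ad_id) pairs that clicked the same ad too many times.

    Assumed line format: "<timestamp> <ip> <ad_id>". Lines that do not
    match are skipped rather than treated as errors.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        _timestamp, ip, ad_id = parts
        counts[(ip, ad_id)] += 1
    return sorted(key for key, n in counts.items() if n > max_clicks_per_ad)
```

Changing the rule (for example, counting per time window instead of per log file) is a few lines of edits, which is the property the notes emphasize.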
Python-based services
Google Groups
"Python Old-timers" David Jeske and Brandon Long (of eGroups and
Neotonic/ClearSilver) are the leads on Groups.
All built using Python code
Highly pythonic
They didn't use that giant mountain of C++ stuff
code.google.com
Stein and DiBona
Others? We have so much going on...
How code.google.com was built (block diagram)
/\ \/
Front end Stuff
/\ \/
code.google.com
SWIG
Google Stuff
The funky front end stuff deals with denial of service attacks, reporting, blocking IPs known to be bad
We get to take advantage because we've wrapped this
The HTTP server it's built on has all of the reporting and monitoring things on it - the "Google Stuff"
code.google.com
goopy package - support for functional style programming
Functional stuff to start with
Place to put future modules
Closing
We have a lot of Python code, covering a broad range of needs.
Python has helped Google for many, many years.
SWIG is underrated.
I saw a little rant on Guido's blog (Guido shakes head) - it's kind of difficult to get your head wrapped around it, but when you need access to some library of functionality from Python you don't need to go and build it yourself - you can use SWIG to wrap it automatically. This fits the Python ideal of smart reuse.
We are now starting to open-source some of the pile.
Questions and Answers (a good 25 minutes for these)
Q: When are you going to open source the build system? (Guido)
A: I don't know. If I recall, Greg has talked about it
Chris DiBona: We're thinking of releasing some of our wrappers around
Perforce first
Q: About SWIG, have you looked at the Boost::Python library?
A: I did see that come up recently; I don't think we use it a lot but it has
been mentioned. I'll take a closer look at it.
Q: What about ctypes?
A: I saw that a while ago on a different project. As far as I know we don't use
it, SWIG works well with our build system
Q: elaborates on ctypes/SWIG differences. While SWIG will build a
Python wrapper for a given C lib, ctypes will let you dynamically load up a C
lib and call its functions.
A: calldll does something similar for windows environment
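The ctypes approach described in the question can be sketched as follows. This is only an illustration of "dynamically load a C lib and call its functions"; ctypes was a third-party package at the time of the talk and joined the standard library later, in Python 2.5. The helper names load_libc and c_strlen are made up for the example, and the fallback to CDLL(None) assumes a POSIX system.

```python
import ctypes
import ctypes.util


def load_libc():
    """Load the C runtime at run time; no wrapper code is generated."""
    path = ctypes.util.find_library("c")
    # On POSIX, CDLL(None) exposes symbols already loaded into the
    # process, which includes the C library.
    return ctypes.CDLL(path) if path else ctypes.CDLL(None)


libc = load_libc()
# Declare the C signature by hand: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Call the C library's strlen on a Python bytes object."""
    return libc.strlen(data)
```

The contrast with SWIG is that the function signature is declared by hand in Python here, whereas SWIG derives the bindings from the header file at build time.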
Q: Do you do anything in regard to network monitoring / SNMP with Python?
A: We do have a very large internal network, lots of traffic, the Ops guys do
have monitors to watch the flow, have to schedule moving large (100 GB or 1
TB-size) files.
Q: (Alex Martelli - who is starting at Google in three days) Back to the
wrapping issue. SWIG and ctypes will not help at all with C++ templates -
Boost is better in this regard. SWIG has been extended to support templates
recently.
A: We do use some templates, but we normally try to avoid them and use SWIG. In
that sense, SWIG works well for us. Some of the template stuff I'd like
better access to, and I end up having to do some extra goo to get things
working.
Q: What is missing from the Python ecosystem?
A: (Anna Ravenscroft, Alex's wife, yells "Alex") But we've solved that problem.
Today they are mostly using Python 2.2, trying to figure out how to use
Python 2.3 -- big upgrade problem
Q: How do you evangelize people who are happy with C++ and SQL and don't seem to
want to try Python?
A: We make it easy to use any of the languages, and don't really force people to
use a different language. The different applications are based on what the
team understands best. We make it easy for all of these things to interact -
if you have a server written in Java we have a custom RPC system that helps
bridge the gap and communicate with other servers.
Q: How many software engineers roughly does Google employ (Steve Holden)?
A: I do know that the public employee count is over 3,000 employees as of
December, but I don't know the break-out in terms of numbers of engineers.
It's hundreds of engineers but I can't really say any more.
Some of the apps written in Java (blogger) can communicate with C++ using
RPC, so not using Python is not a problem
Q: You must have masses of linguistic data (terabytes). How do you access that
data so fast?
A: Yes. I don't know, I don't work in that area. As far as speed, "we just
throw servers at it."
Q: Within Google, is there anything for which Python is considered inappropriate?
A: Is there anything where Python is not appropriate? Well yeah, something like
our indexing system where we scan the web pages and produce an index. Python
is good, and fast, and IronPython is even faster, but it's not fast enough.
We use C for that.
For other things, it's based on the engineering team. We make it possible for
the teams to use what language they like.
Personally, I'd like to see more Python, so some of the things I've been
doing have been working on enabling that.
Q: What kind of bug-tracking system do you use?
A: Bug tracking. Our system is not that good.
We have one, anybody in their right mind has one
Bugzilla derivative
MS has an awesome bug tracking system
Even what I had at collab.net was better
Google's looking at different options for fixing that system.
Q: I want to jump in with another comment on wrapping. I have a plotting library
in C++ with heavy use of templates and I tried wrapping it in three different
things (cxx, Boost, and SWIG). SWIG is actually pretty good now, swig
template support is much better than it used to be. Boost makes things way,
way too big.
A: Based on this feedback it seems like Boost is capable in certain environments
and is definitely worth looking at. Need to evaluate before using.
Q: SWIG performance in real time environment?
A: It is a non-issue. However, I was challenged about this at MS: someone said
"Python won't be fast enough!" I said, "how fast does it have to be? 1000
pages per second?" He couldn't say. So I said "then just don't worry about
it unless it proves too slow."
We did go ahead and rewrite some of the Python stuff into ActiveX COM objects
and ASP and... it was slower (laughter and applause).
Much time in Python is spent outside the interpreter loop; much time is
spent, e.g., in the String object, which is written in C.
[On code.google.com] There's still that Global Interpreter Lock in there, but
I still saw some SERIOUS page performance on that thing. Don't be afraid of
bringing Python into your projects. Your bottleneck will be the network
bandwidth (some person on a 56kbps line), not Python.
Q: Mentioned a number of languages used at Google. We use Python because it's
terser (among other reasons). Can you speculate on lines of code in various
languages at Google? (Do you even know total lines of code at Google?)
A: I have no idea. It's a LOT.
Joke from audience: the code counter is still running!
C++ is probably the majority, probably followed by Python.
C++, Python, Java - gut feeling
Q: Five years from now, if people are right about Moore's law, more
multiprocessor systems. What about the getting rid of the Global Interpreter
Lock project that you did a few years ago?
A: Wow. Yeah, that was a few years ago. Back in '96 I made a few patches to
Python 1.4 to get rid of the GIL. We used that at MS to make free threaded
COM objects. We were getting a lot of lock contention. We had to protect
different data structures - like in Python there are pools of frame objects
which had to be protected (??). Things were blocking around those pools. For
2 processors there was a bonus, but for 3 or 4 it was actually slower.
Free threading - Python's thread state was one of the benefits from that set
of patches. sys.exc_info was another.
The Global Interpreter Lock hasn't actually been a problem.
Q: Every once in a while, you are going to introduce a bug into the system. How
do you guys debug across the language boundaries?
A: We don't have any particular tools, or anything like that. Have libraries for
logging. My favorite technique is adding print statements
(applause/laughter). It would be wonderful if we had special tools but we
don't.
Some people ask what IDE they should use for cross language Java/Python
development. Eclipse is quite good, but even that doesn't have any
cross-language stuff.
Q: Do you have any current hobby projects that you are working on that you can
talk about?
A: Stuff outside Google they can't tell me not to talk about.
Subversion based wiki (subwiki)
svn exposes its libraries to Python via SWIG
You could build a new svn client or interact with a server from Python
ViewCVS does this
subwiki uses the svn repository to store the wiki pages
Googly stuff - mostly code.google.com
Q: What does Google have to say about web application frameworks?
A: It's a tough one. Lot of stuff set up in C++. code.google.com was not built
using an off-the-shelf framework; we used Google's custom HTTP server.
GMail is not written in Python. I don't actually know if it's C++ or Java. (Chris DiBona: it's Java.)
Q: Followup - is there anything that Google can contribute (via open source) in the web framework arena?
A: Got a lot of stuff we've been talking about moving into the open source arena. Stuff tends to build on itself; trying to get it untangled. Stuff relies on Google-specific stuff, won't be interesting outside of Google.
Q: Tim O'Reilly talked about Google redefining applications. In this view we're
sort of moving away from Google 1.0. When you upgrade, what sort of staging
environment do you have?
A: We definitely have staging environments. One of the things built in to the
systems I talked about for moving things out. The main web server -
www.google.com - is a BIG chunk of code and data - because we have
translations and stuff for everything. In any case, they're called canary
servers (chuckles from crowd) - we put stuff on the canary servers and see if
they're going to fall over. Also, because we get so much traffic we can turn
a knob and expose something like 1% of our traffic to those servers. If they
don't fall over, we expose some more.
The "turning the knob" is a little command line tool written in Python.
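The canary rollout described in this answer could be sketched roughly like this. It is not the actual tool: the hashing scheme, the doubling policy, and the names routes_to_canary and next_canary_percent are assumptions chosen to illustrate "expose about 1% of traffic, then expose some more if nothing falls over".

```python
import hashlib


def routes_to_canary(request_id: str, canary_percent: float) -> bool:
    """Decide deterministically whether a request goes to a canary server.

    The request id is hashed into one of 10,000 buckets, so the same id
    always gets the same answer for a given knob setting.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10000
    return bucket < canary_percent * 100


def next_canary_percent(current: float, canary_healthy: bool,
                        step: float = 2.0, ceiling: float = 100.0) -> float:
    """Turn the knob: pull back to 0 on failure, otherwise open it further."""
    if not canary_healthy:
        return 0.0
    return min(ceiling, current * step)
```

Starting at 1.0 and calling next_canary_percent after each healthy interval gives 2, 4, 8, ... percent until the ceiling is reached.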
Q: (Alex Martelli) Prompted by your mention of unwrapping pieces so they can be
open source. It actually sounds like something that's a very good software
engineering exercise, because it forces decoupling from your proprietary
stuff. Even if we never open source the actual pieces, just having done the
unwrapping seems like a big advantage.
A: It would be a big advantage if we were distributing code. For us, a 50 MB
executable is not a problem, though you'd never try to push that to a client
too often. While it would be an interesting engineering exercise and would
improve the code it has not been a priority.
Chris DiBona followup: Opening your code tends to make it better, for example in our (?)malloc library we said it worked faster for these situations, and when we looked at it we found a bug in our code.
--------------------------------------------------------------------------
REFERENCES: {as documents / sites are referenced add them below}
http://www.swig.org
http://code.google.com
--------------------------------------------------------------------------
QUOTES:
"We don't do that at Microsoft; we ship C++ code"
"Python passed the tipping point years ago"
"[You can] read [Python] in 2 hours, program in it in 2 days, be productive for the company in 2 weeks."
"We use a LOT of SWIG"
"We've got quite a few servers..." (laughter)
"I've worked in large environments before, but nothing on the order of this"
"We have a lot of log data"
"Today we're using primarily Python 2.2 deployed on our servers, but we're trying to work out how to move to Python 2.3."
"Our bug tracking system is not that good"
"Pushing bits out to some guy on a 56k modem IS your bottleneck. Pulling records out of a database is your bottleneck. It's very rarely going to be Python."
"I think we probably have more Python code than we have Java" - a guess
"I think we probably have more Python than we do Java, because of all of those tools and things for supporting the environment, wrappers and all these things."
"Mr. Ascher. That's Dr. Ascher, to you."
"My favourite debugging environment is PRINT."
--------------------------------------------------------------------------
CONTRIBUTORS: {add your name, e-mail address and URL below}
Ted Leung <twl@sauria.com> <http://www.sauria.com/blog>
Linden Wright <lwright@mac.com>
Erik Rose <corp@grinchcentral.com>
Andy Wright
Nicholas Riley <nriley@sabi.net> <http://njr.pycs.net/>
Simon Willison <cs1spw@bath.ac.uk> <http://simon.incutio.com/>
Jonathan Blocksom <blocksom@gollygee.com>
Abhay Saxena <ark3@email.com>
--------------------------------------------------------------------------
Copyright shared between all the participants unless otherwise stated...