Fast Networking with Python.notes
Wednesday, March 30, 2005
TITLE OF PAPER: Fast Networking with Python
URL OF PRESENTATION: _URL_of_powerpoint_presentation_
PRESENTED BY: Itamar Shtull-Trauring
REPRESENTING: _name_of_the_company_they_represent_
CONFERENCE: PyCon 2005
DATE: March 25, 2005
LOCATION: Grand Ballroom, GWU
--------------------------------------------------------------------------
REAL-TIME NOTES / ANNOTATIONS OF THE PAPER:
{If you've contributed, add your name, e-mail & URL at the bottom}
What do we mean by "fast"?
I/O bound
CPU bound
I/O Bound
Not enough bandwidth (no point in optimizing your code)
High latency (same comment about optimizing code)
Solution: improve protocol
CPU Bound
Can't keep up with nework
Networking code is slow, or
Business/application logic is slow
Example: Batch processing
If the processing of the data your sending is slow, then it doesn't
matter how fast you can send it
Our Goal
Network code fast enough
if you can keep up with the network thats good enough, you don't need to be faster than the data being received over the wire - you might as well be doing something else
I/O bound
Lazy Optimization
Upgrade to Python 2.4
Buy new hardware
(sometimes cheaper than new code)
Algorithms
Python can be faster than C
if the algorithm is designed better in python than C it's faster
Example: Scheduling
while True:
for t in getScheduledEvents():
t.run()
sleep(getTimeUntilNextEvent())
Want to only get the events that are going to happen now, because you don't care about the rest right now.
Example: Scheduling
glib: iterate over all timers O(n) [looks at every event, thousands of times a second]
Twisted: heapq of timers, only get newest O(k) [because the event list is sorted by time you can stop iterating over the events when you hit something that is scheduled for later time]
Twisted is faster for large number of timers
Data Copying
Python strings are immutable
Adding two strings copies data
(When you concatenate strings in Python, you're copying them and allocating more memory; doing this the wrong way can be extremely expensive.)
Bad String Concatenation
s = ""
for i in chunksOfData:
s += i
(Actually in 2.4 this special case is optimized, but not if 's' were an attribute of an object)
(If you do 1000 writes, it's actually doing half million copies of your string length, and it's really, really slow.)
Do this Instead
l = []
for i in chunksOfData:
l.append(i)
s = "".join(l)
(For n chunks, it's really inexpensive.)
Comparison graph:
(green line on floor, horizontal, orange line linear increasing: message: use lists)
0.08 /
0.06 /
0.04 /
0.02 /
--------
Lower-Level Optimization
Reduce allocation
Reduce memory copying
(Networking code shouldn't be copying the strings - you are just pushing bytes - you should pass the memory chunk around and process it)
Buffer() is your friend (Kinda)
buffer() is like char*
Non-copying reference to substring
(will look at but will not copy)
Buffer() vs string copy
(scatter plot: near linear increasing, vs flatline)
(jump around 250 or so is probably due to imalloc())
Writing Without Buffer()
def doWrite(self):
amountWritten = write(self.data)
if amountWritten:
self.data = self.data[amountWritten:]
(you have large chunk to write, won't write all at once; make copy of remaining data, write it out next time -- copying over and over)
Writeing With Buffer()
def doWrite(self):
slice = buffer(self.data, self.offset, len(self.data) - self.offset)
amountWriten = write(slice)
self.offset += amountWritten
if self.offset == len(self.data);
self.data = ""; self.ooffset = 0
(can take slice of buffer, only parts we haven't written yet: writing without having to do any copying; then update offset. Real code probably won't be done just like this -- you might have a MB buffer which sits around during the write process, sucking up memory for no good reason.)
Improving Python
Control over memory allocation
Better buffer type
Changes to existing built-ins
(want better control over memory allocation
allocate a chunk of buffer, and only copy if you really need it)
Buffer API
C API for getting at char*
Implemented by strings, buffer(), mmap
Suported by many APIs, e.. file.write()
Needs to be supported more
>>> s = "10lala"; b = buffer(s, 0 ,2)
>>> str(b)
'10'
>>> int(b)
ValueError: invalid literal for int() ...
(it's complaining about original string, not what the buffer is pointing to;
python needs much better support for Buffer API)
Array.Array
Can be used as mutable array of bytes
Very expensive to str()
Can't do non-copy slices
Missing .split(), .find(), etc.
(all kinds of annoying limitations)
(Buffer doesn't have find() or split() either)
(expensive to take Array and make it a string)
Example: Non-Copying Split()
str.split() copies into new strings
Modified C code that returns buffer()
Non-copying version is 8% faster
No improvement for small strings
Sockets and Files
.file.read(), socket.recv() allocate on read
.file.readinto() - unsupported (undocumented, may not be in next release)
.socket.recvinto() - nonexistent
(there are other design issues with mutable buffers; you might want to use one as a key in a dict without making a copy of it, but you can't have mutable objects in a dict; if you do dict key +1 you don't know if it modify the key or if it will modify the original copy of the key or will it create a new copy and lose the connection)
Other Options
C/Pyrex/C++ extensions
Fusion: C++ protocols for Twisted (http://itamarst.org/software/
Flumotion technique (Twisted + glib C code)
(do most of their client requests in Twisted, C for the rest, data intensive operations)
If these problems are interesting to you, we're hiring. :-)
Q. There was some talk about the buffer() being deprecated
A. I'm not the one to ask but it may have been the buffer type but he isn't really the one to ask
Q. about what size do you start seeing the benefit being less
A. I didn't do any systematic timings, the 8% was done using a 500 byte string into 70 byte chunks
Q. Numarray - is it useful for this
A. Python would not be quite fast enough to read/write to the c++ numarray library - python would be fast enough
--------------------------------------------------------------------------
REFERENCES: {as documents / sites are referenced add them below}
http://itamarst.org
--------------------------------------------------------------------------
QUOTES:
"... you could be doing something else, like video games"
--------------------------------------------------------------------------
CONTRIBUTORS: {add your name, e-mail address and URL below}
Chris Shenton <chris@shenton.org>
Erik Rose <corp@grinchcentral.com>
--------------------------------------------------------------------------
E-MAIL BOUNCEBACK: {add your e-mail address separated by commas }
--------------------------------------------------------------------------
NOTES ON / KEY TO THIS TEMPLATE:
A headline (like a field in a database) will be CAPITALISED
This differentiates from the text that follows
A variable that you can change will be surrounded by _underscores_
Spaces in variables are also replaced with under_scores
This allows people to select the whole variable with a simple double-click
A tool-tip is lower case and surrounded by {curly brackets / parentheses}
These supply helpful contextual information.
--------------------------------------------------------------------------
Copyright shared between all the participants unless otherwise stated...