pycurl CurlMulti mini-HOWTO

In the course of writing a little Python command line RSS engine, I naturally came to a point where I needed to download the RSS feeds to store them and work with them.

My options looked like this:

  • Use urllib2.urlopen, which would download the feeds serially. When you have hundreds of feeds in your OPML file, that takes too long. Besides, what do I have ADSL for?
  • Use Twisted. But, my application is not a web app and frankly, switching the whole thing over to the Twisted event-driven model is overkill and unnecessarily complicated.
  • Use wget with threads. The problem with that would have been that I'd have to jump through hoops to pass the data from the threads back to the main application, possibly with messy kludges like temporary files. No thanks!
  • Use libcurl through the pycurl library. Ah, yes, perfect. Or so I thought.

So, the documentation for pycurl is all well and fine except for the small fact that the documentation for the CurlMulti object is obtuse, and the example code they provide raises more questions than it answers. Look at this:

01 import pycurl
02 c = pycurl.Curl()
03 c.setopt(pycurl.URL, "http://curl.haxx.se")
04 m = pycurl.CurlMulti()
05 m.add_handle(c)
06 while 1:
07     ret, num_handles = m.perform()
08     if ret != pycurl.E_CALL_MULTI_PERFORM: break
09 while num_handles:
10     apply(select.select, m.fdset() + (1,))
11     while 1:
12         ret, num_handles = m.perform()
13         if ret != pycurl.E_CALL_MULTI_PERFORM: break

In lines 6-8, and then again in lines 11-13, the code repeatedly runs perform() on the pycurl.CurlMulti() object until the return value ret is not pycurl.E_CALL_MULTI_PERFORM. Why does it do this twice? Here's how this works.

You can get through lines 1-5 pretty well with the pycurl documentation, but in short:

  • Line 1 obviously imports the library
  • Line 2 sets up a new Curl object. A Curl object holds one item to be downloaded.
  • Line 3 tells the Curl object what precisely to download
  • Line 4 declares a CurlMulti object. This is quite simply a container object which holds one or more Curl objects.
  • Line 5 adds the Curl object into the container CurlMulti object
  • Lines 6-8: This loop goes on until the if condition in line 8 is met. Line 7 calls perform() on the CurlMulti object, which is basically a "go fetch now" command. perform() is non-blocking, which means it does whatever work it can without waiting on the network and then hands control straight back, freeing up your application to do other things in between calls. Since we don't really have much else to do in this specific case until we get our bloody feeds, we keep on calling perform() like a whining child until we are told that, "Dude, I've done all I can for the moment. Get off my back". Which pycurl tells us by setting the variable ret as returned by perform() to a value which is something other than pycurl.E_CALL_MULTI_PERFORM. So in a nutshell, this whole loop is about nagging the CurlMulti object until it tells us in its own way that it has nothing more to do right now. That is not the same thing as the downloads being finished.
  • Lines 9-13: Wait a second, lines 11-13 are exactly the same as lines 6-8. What gives? I'll tell you what gives. What gives is that lines 6-8 were in truth not about actually getting any honest-to-$DEITY data out of any application; they were about initializing the variable num_handles. Yes, we're braindead. We used the loop in lines 6-8 to initialize num_handles so we could loop over it in lines 9-13, calling select() on the file descriptors which were blossoming out of the CurlMulti object as a consequence of calling perform() on it ad nauseam.
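
Line 10 deserves a footnote of its own. apply(f, args) is just the old spelling of f(*args), and m.fdset() returns a tuple of three lists (readable, writable, exceptional), so the line boils down to select.select(rlist, wlist, xlist, 1): wait up to one second for something to happen on those descriptors. Here is a rough sketch of that behaviour with no pycurl involved at all. The socket pair is purely my stand-in for the descriptors libcurl would hand out, and the 0.1 second timeout is an arbitrary choice to keep the demo quick.

```python
import select
import socket

# A connected socket pair stands in for the descriptors libcurl would give us.
reader, writer = socket.socketpair()

# Same shape as the tuple from m.fdset(): (read list, write list, exception list).
fdset = ([reader], [], [])

# Nothing has been written yet, so select() waits out the timeout and reports
# nothing ready. fdset + (0.1,) mirrors m.fdset() + (1,) with a shorter wait.
ready_before, _, _ = select.select(*(fdset + (0.1,)))

writer.send(b"feed data")

# Now there is something to read, so select() comes back right away.
ready_after, _, _ = select.select(*(fdset + (0.1,)))

print(len(ready_before), len(ready_after))

reader.close()
writer.close()
```

So select() is the polite part of the loop: instead of hammering perform() in a tight circle, you sleep until a socket actually has something for you, or until the timeout runs out.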

And if you're thinking now that we could rip out lines 6-8 and just count how many actual Curl objects we stuck so carnally into our CurlMulti object, you would be spot on the money. So, this code works too and is a hell of a lot less obtuse:

import pycurl
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://curl.haxx.se")
m = pycurl.CurlMulti()
m.add_handle(c)
# I'm not handwaving with this next line
# if you're adding several Curl objects you can
# damn well count them and initialize accordingly
num_handles = 1
while num_handles:
	while 1:
		ret, num_handles = m.perform()
		if ret != pycurl.E_CALL_MULTI_PERFORM:
			break
	m.select(1.0)
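
Since the whole point was to fetch a pile of feeds at once and get the data back into the main application, here is a minimal sketch of the same loop with several Curl objects, each writing into its own in-memory buffer. The function name fetch_all, its timeout parameter and the buffer bookkeeping are my own inventions for illustration, not anything pycurl prescribes. It assumes every URL in the list is unique, and it does no error checking: a failed transfer simply comes back as an empty string.

```python
import io
import pycurl

def fetch_all(urls, timeout=1.0):
    """Download all urls concurrently and return a dict of {url: body bytes}."""
    multi = pycurl.CurlMulti()
    handles = []
    buffers = {}
    for url in urls:
        buf = io.BytesIO()
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        # Each chunk of the response body is handed to buf.write().
        c.setopt(pycurl.WRITEFUNCTION, buf.write)
        multi.add_handle(c)
        handles.append(c)  # keep our own reference to every handle
        buffers[url] = buf

    # Count what we added instead of running a warm-up perform() loop.
    num_handles = len(handles)
    while num_handles:
        while 1:
            ret, num_handles = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        if num_handles:
            # Sleep until a socket is ready or the timeout expires.
            multi.select(timeout)

    for c in handles:
        multi.remove_handle(c)
        c.close()
    multi.close()
    return dict((url, buf.getvalue()) for url, buf in buffers.items())
```

Calling fetch_all(["http://curl.haxx.se"]) should hand back a dict with the page body under that key. No threads, no temporary files, and the data ends up right where the main application can reach it.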

$DEITY, what awful documentation!