Wednesday, December 23, 2009

Asynchronous HTTP Request

Note (4/2/11): Please see my recent post detailing asynchronous HTTP requests using Twisted.

Note (3/13/11): I originally wrote this post while looking for callback-style HTTP request functionality in python. I made the mistake of thinking that "callback-style" is the same as "asynchronous". The following details my efforts to achieve a callback-style HTTP request using urllib2. The final (updated) code example illustrates how to use threads to achieve asynchronicity. I'd recommend using a thread pool if you plan more than just a handful of requests. And, as others have noted, Twisted is really the best python framework for asynchronous programming. Also, I'd like to thank the commenters for pointing out my mistakes; I'm sorry for not realizing my errors sooner.
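To make the thread-pool recommendation above concrete, here's a minimal sketch using `concurrent.futures` (Python 3, so not available when this post was written). The `fetch_url` function and URL list are hypothetical stand-ins; in real use, `fetch_url` would call something like `urllib.request.urlopen(url).read()`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_url(url):
    # Stand-in for a real request, e.g. urllib.request.urlopen(url).read().
    return "response for %s" % url

urls = ["http://example.com/%d" % i for i in range(5)]  # hypothetical URLs

# A pool of 3 worker threads issues the requests in parallel; the main
# thread is free to do other work until it collects the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(fetch_url, u): u for u in urls}
    results = {futures[f]: f.result() for f in as_completed(futures)}

print(len(results))  # 5
```

The pool caps the number of concurrent sockets, which matters once you go beyond a handful of requests.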

You might think it would be easy to write python code to achieve a callback-style web request. It ought to be as simple as providing a url and a callback function to some python library routine, no? Well, technically, it is that simple. But somehow, the documentation makes the task surprisingly difficult.

One option, of course, is Twisted. But, reading through the (sparse, fractured) documentation made me think there had to be something easier. This led me to urllib2. The short answer is that, yes, urllib2 does what I want. But, the documentation is sufficiently backwards that it took me over an hour to figure out how to accomplish the task.
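For reference, the blocking baseline really is a one-liner. The sketch below uses the Python 3 names (`urllib.request.urlopen` rather than urllib2's `urlopen`, which behaves the same) and spins up a throwaway local server so it doesn't depend on the network; the server and its "hello" body are just scaffolding for the example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Hello(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Hello)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/" % server.server_port
response = urllib.request.urlopen(url)  # blocks until the response arrives
text = response.read()                  # file-like: read(), info(), geturl()
server.shutdown()
```

The call blocks the caller completely; everything below is about getting out from under that.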

Making a simple blocking HTTP request with urllib2 is easy, and the documentation reflects that: use urlopen. The return value of urlopen provides the response and additional information in a file-like object. The problem is how to achieve the same result in a callback-style manner. One would think urlopen could simply take an additional handler object, called with the response as its only argument when the request completes. Ha! build_opener looked vaguely promising since it accepts handlers. This led me to create a class inheriting from BaseHandler that defined protocol_response. No dice. As I later realized, protocol_response takes three arguments (self, req, response), not two, and its name changes depending on the protocol. At that point, I was at a loss as to how the protocol name was determined (the BaseHandler documentation ignores the issue), and the examples were useless since they all used standard handlers. Next, I tried inheriting from HTTPHandler and overriding http_response with a method that simply prints the url, info, and response text. This almost worked: it successfully retrieved the web page and printed it. But then it raised the following exception:

Traceback (most recent call last):
  File "./webtest.py", line 14, in <module>
    o.open('http://www.google.com/')
  File "/usr/lib/python2.6/urllib2.py", line 389, in open
    response = meth(req, response)
  File "/usr/lib/python2.6/urllib2.py", line 496, in http_response
    code, msg, hdrs = response.code, response.msg, response.info()
AttributeError: 'NoneType' object has no attribute 'code'
After much searching, I finally realized that I had failed to return a response-like object from my http_response method. This seems like an odd requirement for a callback method. And, it could have been easily clarified in the documentation with an example.

Alas, after all that, I was able to use urllib2 to successfully make an asynchronous HTTP request, so I can't complain too much. Here's the code for anyone who's interested:

#!/usr/bin/env python

import urllib2
import threading

class MyHandler(urllib2.HTTPHandler):
    def http_response(self, req, response):
        print "url: %s" % (response.geturl(),)
        print "info: %s" % (response.info(),)
        for l in response:
            print l
        # Must return a response-like object; returning None causes the
        # AttributeError shown in the traceback above.
        return response

o = urllib2.build_opener(MyHandler())
t = threading.Thread(target=o.open, args=('http://www.google.com/',))
t.start()
print "I'm asynchronous!"

Update (3/12/11): My comment before the sample code indicated that the sample code was asynchronous. But, it wasn't. I've updated it to be asynchronous. When originally writing this post, I intended the example code to show the urllib2 handler approach.

12 comments:

  1. What makes you think this is asynchronous? The socket is what blocks, and it still does here, even with your own handler.

    Twisted remains the best choice for all things async/network related.

    If you don't like that:

    Eventlet has a urllib2 replacement that uses non-blocking sockets. Example here:
    http://eventlet.net/doc/

    ReplyDelete
  2. Because I'm running my own servers behind a load balancer and the queries take multiple seconds and total time consumed is appx. 1/n when I use n servers and other python code runs while the queries are being served.

    Thanks for the eventlet link. Looks like the perfect middle-ground between twisted and urllib2.

    ReplyDelete
  3. A simple test would be doing some computing+printing after the o.open() line and see if that comes out earlier than the printing in the handler. I doubt it does.

    Your "1/n evidence" probably comes from multi-tasking of the underlying operating system.

    ReplyDelete
  4. Yes, I'm able to spawn many queries and/or perform computing+printing after the o.open() line but before the first response.

    ReplyDelete
  5. I didn't know who to believe, so I did a test.

    I added: print "Main thread continues" at the bottom of the code, and ran it.

    The "Main thread continues" did not appear until the URL was completely fetched, suggesting the commenters are correct.

    ReplyDelete
  6. By asynchronous, I was referring to being able to issue and process multiple HTTP requests in parallel, not that the main thread would continue immediately after the HTTP requests are spawned. If you want the HTTP requests to run "in the background", create a separate thread for that work.

    ReplyDelete
  7. As other commentators have stated, this code doesn't do anything but print some response information after the blocking response has finished. This code does nothing more asynchronously than any other part of urllib2. Can you remove or edit this post so that it will no longer show up in search results for python asynchronous http?

    (I'd opt remove, it's embarrassing that you think this is asynchronous...)

    ReplyDelete
  8. Marr75: I realize now that my comment before the sample code is in error. I meant to indicate that the example is working urllib2 callback-style code. Instead, I implied that the sample code itself is async. I've modified the example code to be async.

    ReplyDelete
  9. can you give a code example with session id modified and post variables added to webrequest on your app, please?

    for example:
    t = threading.Thread(target=o.open, args=('http://www.google.com/','mycustomsessid=3j45h3jk5h',{'post1=val1','postcolumn2=val2'},))

    thanks

    ReplyDelete
  10. Dougkan: I'm not very familiar with setting session id and post variables in this context. I was working on simple one-shot get requests when I wrote this post. It appears you've already provided at least a partial example. What else are you looking for?
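Picking up Dougkan's question: with urllib2 you don't pass the session id and POST fields as extra positional arguments to open(); you wrap them in a Request object. A hedged sketch, using the Python 3 names (urllib.request rather than urllib2; the Request API is the same) and the made-up cookie and field values from the question — this builds the request without sending it:

```python
import urllib.parse
import urllib.request

# POST fields travel in the body as a urlencoded byte string.
post_fields = {"post1": "val1", "postcolumn2": "val2"}  # hypothetical fields
data = urllib.parse.urlencode(post_fields).encode("ascii")

# A session id usually rides along as a Cookie header.
headers = {"Cookie": "mycustomsessid=3j45h3jk5h"}

req = urllib.request.Request("http://www.google.com/",
                             data=data, headers=headers)
# Supplying data makes urllib switch the method to POST automatically.
print(req.get_method())  # POST
```

To send it in the background, pass the Request instead of a bare url, e.g. threading.Thread(target=o.open, args=(req,)).start() with the opener from the post.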

    ReplyDelete
  11. Regardless of what the other guys said, this code was very useful and allowed me to spawn the requests I needed in parallel as prescribed. Thank you!!

    As is often the case with geeks, if somebody has a good idea and implements it well; a whole bunch of people are waiting to pick it to pieces.. keep up the good work dude!!

    ReplyDelete