You are currently browsing the monthly archive for October 2009.

Continuing our middleware series, we now cover authentication. There is a ton of authentication and authorization WSGI middleware out there, as well as a basic authentication example used in the WSGI documentation. Some of it is out of date, and much of the rest is tightly integrated with other parts of a particular framework’s request handling. It would have been easy enough to roll my own basic authentication, but I really hate reinventing a wheel I don’t have to.

I decided to investigate AuthKit, part of Pylons, to service my authentication needs, and struggled through a lack of documentation and fairly large code base, all for your pleasure.

Authentication with AuthKit

AuthKit assumes that a lot of the setup for your middleware follows Pylons conventions. It was a struggle for me to make heads or tails of the examples, not being familiar with Pylons application configuration and how requests are routed. The secret sauce to actually make AuthKit work with Bottle is to realize that there are multiple levels of AuthKit middleware that you have to invoke to get the authorization chain to even start up. Here is how you go about it in Bottle:

from authkit import authenticate, authorize 
from authkit.permissions import RemoteUser

from bottle import *

# bottle exposed function
@default() # maps to root URL
def hello():
    return "hello"

# get the default bottle application
app = default_app()

# set up an authorization permission for 
# basic authentication of a remote user
app = authorize.middleware(app, RemoteUser())

# A simple authentication function (demo only: accepts any
# user whose password is the same as the username)
def basic_auth(environ, username, password):
    return username == password

# now activate the authentication
auth_config = {
    'authkit.setup.method':'basic',
    'authkit.basic.realm':'Test Realm',
    'authkit.basic.authenticate.function':basic_auth,
    'authkit.setup.enable':'True'
}
app = authenticate.middleware(app,app_conf=auth_config)

# run the application
run(app=app)

To make this work for App Engine, you need to include the AuthKit sources and account for deploying Bottle applications on GAE, which is covered in the Bottle docs and other posts.
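To make the two layers a bit less mysterious, here is a rough stand-in for what the authorize/authenticate pair does, written for Python 3 with only the standard library. This is not AuthKit's code: the names `require_remote_user`, `basic_authenticate`, and the `get` helper are made up for illustration, and the real middleware handles many more cases. The inner layer refuses requests without a `REMOTE_USER`, while the outer layer decodes the Basic credentials and adds the challenge header to any 401 response.

```python
import base64


def require_remote_user(app):
    """Inner layer (hypothetical): refuse requests that have no REMOTE_USER."""
    def middleware(environ, start_response):
        if not environ.get("REMOTE_USER"):
            start_response("401 Unauthorized", [("Content-Type", "text/plain")])
            return [b"authentication required"]
        return app(environ, start_response)
    return middleware


def basic_authenticate(app, realm, check):
    """Outer layer (hypothetical): decode credentials, challenge on 401."""
    def middleware(environ, start_response):
        scheme, _, encoded = environ.get("HTTP_AUTHORIZATION", "").partition(" ")
        if scheme.lower() == "basic" and encoded:
            try:
                decoded = base64.b64decode(encoded).decode("utf-8")
            except (ValueError, UnicodeDecodeError):
                decoded = ""
            username, sep, password = decoded.partition(":")
            if sep and check(environ, username, password):
                environ["REMOTE_USER"] = username

        def challenge(status, headers, exc_info=None):
            # Ask the browser for credentials whenever an inner layer says 401.
            if status.startswith("401"):
                headers = list(headers) + [
                    ("WWW-Authenticate", 'Basic realm="%s"' % realm)]
            return start_response(status, headers, exc_info)

        return app(environ, challenge)
    return middleware


def hello(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def same_as_username(environ, username, password):
    # Same demo-only rule as basic_auth in the AuthKit example.
    return username == password


def get(app, authorization=None):
    """Test helper: issue a fake GET and return (status, headers, body)."""
    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/"}
    if authorization:
        environ["HTTP_AUTHORIZATION"] = authorization
    seen = {}

    def start_response(status, headers, exc_info=None):
        seen["status"], seen["headers"] = status, dict(headers)

    body = b"".join(app(environ, start_response))
    return seen["status"], seen["headers"], body


app = basic_authenticate(require_remote_user(hello), "Test Realm", same_as_username)
good = "Basic " + base64.b64encode(b"angel:angel").decode("ascii")
bad = "Basic " + base64.b64encode(b"angel:wrong").decode("ascii")
```

Note the ordering: the permission check is wrapped first and the authentication layer goes on the outside, which is the same order used with AuthKit above.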

Bottle is a great little Web application framework, but in its quest for simplicity, it left out a couple of key components that are needed for cu3w0rx: HTTP method overriding and basic authentication. Luckily Python’s WSGI middleware can fulfill this role.

WSGI Middleware

WSGI middleware is a handy way to add functionality to an application by adding layers to the request/response chain in between the client request and your application. Incidentally, Ruby’s Rack project took inspiration from WSGI middleware.
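As a minimal sketch of the layering idea (Python 3, standard library only), here is a tiny application wrapped by a made-up `AddHeader` middleware. The class name, the `X-Layer` header, and the `call_app` harness are purely illustrative and are not part of Bottle or any other library.

```python
def hello_app(environ, start_response):
    """Innermost WSGI application: always answers with plain text."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


class AddHeader(object):
    """Hypothetical middleware that appends one header to every response."""

    def __init__(self, app, name, value):
        self.app = app
        self.name = name
        self.value = value

    def __call__(self, environ, start_response):
        def wrapped_start_response(status, headers, exc_info=None):
            # Add our header, then defer to the outer start_response.
            headers = list(headers) + [(self.name, self.value)]
            return start_response(status, headers, exc_info)

        # The wrapped application never knows a layer sits in front of it.
        return self.app(environ, wrapped_start_response)


def call_app(app, environ):
    """Tiny harness: call a WSGI app and collect status, headers and body."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


app = AddHeader(hello_app, "X-Layer", "middleware")
status, headers, body = call_app(app, {"REQUEST_METHOD": "GET", "PATH_INFO": "/"})
```

Because a middleware is itself a WSGI callable, layers can be stacked in any number, which is exactly what the method override and authentication examples rely on.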

Method Overriding

A while back I bemoaned the fact that CherryPy’s RoutedDispatcher could not handle PUT and DELETE requests from forms submitted by your typical web browser, which most often only does GET and POST requests. I submitted a patch to the CherryPy project, but I now believe that this is the wrong approach, and that middleware can handle altering the REQUEST_METHOD value in the WSGI environ in response to a submitted form with a hidden “_method” parameter, as is the accepted convention. Here is some middleware for the server side application which will reside on GAE:

# method_override.py
# WSGI middleware to set REQUEST_METHOD in the WSGI environ from a submitted
# form that contains a "_method" hidden variable.

from google.appengine.ext import webapp

class MethodOverride(object):
  def __init__(self, app):
    self.app = app

  def __call__(self, environ, start_response):
    # webapp.Request gives access to the submitted form parameters
    method = webapp.Request(environ).get('_method')
    if method:
      environ['REQUEST_METHOD'] = method.upper()
    return self.app(environ, start_response)

And here is how you would wrap your Bottle application with it:

from bottle import *
from google.appengine.ext.webapp import util
from method_override import MethodOverride

@route("/test_put",method="PUT")
def testput():
  return "PUT success"

@route("/test_delete",method="DELETE")
def testdelete():
  return "DELETE success"

# run in GAE
def main():
  # wrap the default Bottle application in the method override middleware
  app = MethodOverride(default_app())

  util.run_wsgi_app(app)

if __name__ == '__main__':
  main()

Now forms that specify the desired HTTP method in a hidden “_method” param will be routed to the correct function.
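If you want to try the idea outside of GAE, here is a rough variant that does not depend on webapp, written for Python 3 with only the standard library. The class name `FormMethodOverride`, the `ALLOWED` whitelist, and the `post_form` helper are my own assumptions for this sketch. Unlike the version above, it only looks at url-encoded POST bodies and only honors PUT and DELETE.

```python
import io
from urllib.parse import parse_qs


class FormMethodOverride(object):
    """Hypothetical stdlib-only take on the "_method" override middleware."""

    ALLOWED = ("PUT", "DELETE")  # assumption: only these may be tunneled

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        content_type = environ.get("CONTENT_TYPE", "")
        if (environ.get("REQUEST_METHOD") == "POST"
                and content_type.startswith("application/x-www-form-urlencoded")):
            length = int(environ.get("CONTENT_LENGTH") or 0)
            body = environ["wsgi.input"].read(length)
            # Put the body back so the wrapped application can still read it.
            environ["wsgi.input"] = io.BytesIO(body)
            fields = parse_qs(body.decode("utf-8", errors="replace"))
            method = fields.get("_method", [""])[0].upper()
            if method in self.ALLOWED:
                environ["REQUEST_METHOD"] = method
        return self.app(environ, start_response)


def echo_method(environ, start_response):
    """Demo application: reports the request method it was called with."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["REQUEST_METHOD"].encode("ascii")]


def post_form(app, form_body):
    """Test helper: simulate a browser form POST and return the body."""
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_TYPE": "application/x-www-form-urlencoded",
        "CONTENT_LENGTH": str(len(form_body)),
        "wsgi.input": io.BytesIO(form_body),
    }
    return b"".join(app(environ, lambda status, headers, exc_info=None: None))


wrapped = FormMethodOverride(echo_method)
```

Restricting the override to a whitelist is a design choice rather than part of the convention; it simply keeps a form from turning a POST into something unexpected.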

One of the great aspects of CloudCrowd is that the code base is so small and readable. A big part of that comes by virtue of using the Sinatra Web application framework for the master and slave daemon processes. Sinatra is much closer to the GAE webapp framework than CherryPy (by default), in that you define methods that correspond to the HTTP verbs (GET, PUT, POST, DELETE and HEAD). Sinatra differs in that the method’s first argument is the route that the method services. The GAE webapp framework instead forces you to define a class that defines the HTTP verbs, and later map those classes to some route. Let’s take a look at “Hello World!” to illustrate the difference. Here is the GAE webapp version:

from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class HelloWorld(webapp.RequestHandler):
    def get(self):
        self.response.out.write("Hello World!")

# map the root URL to the handler class
app = webapp.WSGIApplication([("/", HelloWorld)])

def main():
    run_wsgi_app(app)

if __name__ == "__main__":
    main()

Now here is the version in Sinatra:

require "rubygems" # RubyGems is optional, depending on your setup
require "sinatra"
get '/' do
  "Hello World!"
end

Much more readable and concise. Here is a version of HelloWorld in CherryPy, using the default MethodDispatcher, for comparison:

import cherrypy

class HelloWorld:
    exposed = True
    def GET(self):
        return "Hello World!"

app = HelloWorld()

d = cherrypy.dispatch.MethodDispatcher()
conf = {'/': {'request.dispatch': d}}
cherrypy.tree.mount(app, "/", conf)

Better, but you are still keeping the mapping of the root URL “/” to the HelloWorld class outside of the class. I started digging around the interwebz for Python frameworks that would work closer to the lightweight Sinatra and found Bottle. Bottle is small, self-contained and uses Python decorators to map a function to a route. Brilliant. Here is the example using Bottle:

from bottle import route, run

@route("/") # assumes GET method
def hello():
	return "Hello world!"

run() # This starts the HTTP server

Now that’s what I’m talking about! Deploying a Bottle application to GAE is covered in their documentation. We’ll be using Bottle for cu3w0rx to create the master and slave daemons in later posts.
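For the curious, the decorator trick of tying a function to a route can be sketched in a few lines of plain Python. This is not Bottle's implementation, just the general idea: the `ROUTES` table, this toy `route` decorator, and the `dispatch` helper are invented for illustration.

```python
ROUTES = {}  # maps (method, path) -> handler function


def route(path, method="GET"):
    """Decorator factory: register the decorated function for (method, path)."""
    def decorator(func):
        ROUTES[(method.upper(), path)] = func
        return func  # the function itself is left unchanged
    return decorator


@route("/")  # assumes GET method
def hello():
    return "Hello world!"


@route("/jobs", method="POST")
def create_job():
    return "job created"


def dispatch(method, path):
    """Look up the handler for a request and return (status, body)."""
    handler = ROUTES.get((method.upper(), path))
    if handler is None:
        return "404 Not Found", ""
    return "200 OK", handler()
```

The route lives right next to the function it serves, which is the readability win over mapping classes to URLs somewhere else in the file.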

cu3w0rx. Lovely name, eh? Moving on …

In this series we will implement a simple Map-Reduce framework that closely models the design and implementation of CloudCrowd, which is written in Ruby. CloudCrowd makes some nice design choices: its small size (~1,800 LOC), its use of JSON for message transport, and its emphasis on HTTP as the protocol. It is also pretty to look at, and the interface is entirely AJAX driven, so it relies on the same service calls as the rest of the suite. I think it is worth our while to set the stage for the project. Here is a list of things we will be taking directly from CloudCrowd:

  1. The App Engine application will serve as the central master resource.
  2. It will serve as the master work queue. All jobs will be submitted to it for processing.
  3. All communication will be via JSON messages. As in CC, the web site will make use of the JSON returned from AJAX requests to the resource handlers.
  4. There will be a clear specification for the top-level properties of a Job message, but handlers will be responsible for vetting the provided options to handle the request.
  5. Operations on inputs are assumed to happen on local disk. E.g. jobs will be staged onto the worker nodes’ scratch space.
  6. Jobs will have the option of a callback URL on success.
  7. For simplicity’s sake, authentication to the master, and between master and slaves, will use the same HTTP basic authentication credentials.
  8. A worker node will accept work item requests based on the machine’s load.
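To give a feel for points 3, 4 and 6, here is a rough sketch of what vetting the top-level properties of a Job message might look like (Python 3, standard library only). The property names `action`, `inputs`, `options` and `callback_url`, and the `parse_job` function, are assumptions made for this sketch, not a finished specification.

```python
import json

# Assumed top-level properties of a Job message (illustrative names only).
REQUIRED_FIELDS = ("action", "inputs")
OPTIONAL_FIELDS = ("options", "callback_url")


def parse_job(raw):
    """Decode a JSON job message and check only its top-level properties."""
    job = json.loads(raw)
    if not isinstance(job, dict):
        raise ValueError("a job message must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in job]
    if missing:
        raise ValueError("missing job properties: " + ", ".join(missing))
    unknown = sorted(set(job) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS))
    if unknown:
        raise ValueError("unknown job properties: " + ", ".join(unknown))
    if not isinstance(job["inputs"], list) or not job["inputs"]:
        raise ValueError("'inputs' must be a non-empty list")
    # The contents of 'options' are deliberately not inspected here:
    # each action handler is responsible for vetting its own options.
    job.setdefault("options", {})
    job.setdefault("callback_url", None)
    return job


example = json.dumps({
    "action": "word_count",  # hypothetical action name
    "inputs": ["http://example.com/a.txt", "http://example.com/b.txt"],
    "options": {"min_length": 3},
})
job = parse_job(example)
```

The master only guards the envelope; anything inside `options` is passed through untouched for the handler to accept or reject.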

There will be several points that stray from CloudCrowd’s implementation as well:

  1. Nodes will publish their capabilities to the master. It will not be assumed that all nodes have the same capabilities. In this respect cu3w0rx will more closely resemble the nanite project (Ruby + Erlang).
  2. Worker nodes will not share the same code base as the master. The major reason for this is that worker nodes will not run on GAE, hence it makes no sense to hamstring them with the restraints that GAE imposes on Python.
  3. Workers will maintain their own state in a local database. This will serve to keep track of capabilities, number of jobs processed, monitoring statistics, and results from previous jobs.
  4. Map and reduce are implemented as two separate jobs, as far as the worker nodes are concerned.
  5. Since this is a demonstration project, we will not implement a scheme to save result files to non-volatile storage. Instead we will provide a way to give authenticated access to result files from worker nodes. Result files from Reduce phases will live only at the final destination host (that is, the hosts that have checked back into the master as having successfully completed a particular job).
  6. The queue is not necessarily FIFO. I am actually not sure if this is also true of CloudCrowd, but it’s worth mentioning here.
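Points 1 and 6 can be illustrated with a small sketch (Python 3, standard library only). Everything here is an assumption for illustration: the node and job dictionaries, the `max_load` threshold, and the idea that the master sees each node's last reported load. In the real design the load check could just as well live on the worker.

```python
def can_run(node, action, max_load=1.0):
    """A node qualifies if it advertises the action and is not too busy."""
    return action in node["capabilities"] and node["load"] < max_load


def next_assignment(queue, nodes, max_load=1.0):
    """Return (job, node_name) for the first queued job that some node can
    take right now, or None. Jobs nobody can run yet are skipped, which is
    why the queue is not strictly FIFO."""
    for job in queue:
        candidates = [name for name, node in nodes.items()
                      if can_run(node, job["action"], max_load)]
        if candidates:
            # Prefer the least loaded of the capable nodes.
            best = min(candidates, key=lambda name: nodes[name]["load"])
            return job, best
    return None


# Hypothetical capabilities published by two worker nodes.
nodes = {
    "linux-1": {"capabilities": {"word_count", "thumbnail"}, "load": 0.4},
    "win-1": {"capabilities": {"word_count"}, "load": 0.1},
}
queue = [
    {"id": 1, "action": "transcode"},   # no node advertises this yet
    {"id": 2, "action": "word_count"},
    {"id": 3, "action": "thumbnail"},
]
assignment = next_assignment(queue, nodes)
```

Here job 1 stays in the queue until a node that can transcode checks in, while job 2 is handed to the least busy node that advertises `word_count`.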

I would also like to implement a way to provision and configure cloud VMs using the excellent libcloud library, but I think that is outside the scope of a demonstration project. If you see anything missing (or think some stuff can be left out) leave a comment!

I just tried for 10 minutes to get an unused application identifier that would make sense for the new example app. Adding insult to injury, every single supposedly taken ID I tried to access via the <appID>.appspot.com URL was a 404. Perhaps there is a bit of squatting going on?

Yeah, you can map to a domain name, but it makes writing tutorials that much more of a pain ;)

After a long hiatus, I am trying to pick up the series of GAE posts. One small problem is that by day I am a Ruby programmer and switching context to Python for posts is a bit of extra work. That and a friendly prod from Ilya Grigorik for Ruby programmers to start writing about JRuby on Google App Engine has me thinking that I should play to my strengths more.

Having said that, I plan on doing one more series for Pythonistas, in order to implement a simple Map-Reduce work queue system using GAE as the master node. This comes from a direct need for us to support MapReduce-type workflows on both Windows and Linux machines, and an existing Ruby project that I would have loved to use (CloudCrowd) does not work on Windows. In general, any Ruby project that assumes fork() is available on the system tends to have problems in a Windows environment. The project is small enough that it will not be too much work to port the concepts over to Python.

“But Angel, doesn’t disco already fit the MapReduce void for Python?”

Technically, yes it does, but it relies on Erlang for communication between master and nodes, which is obviously a no-go for GAE.

“But Angel, if you are a Ruby guy, why don’t you just fork CloudCrowd and make it run on JRuby + GAE?”

Maybe one day I will, but Windows worker node compatibility is a must-have for us, and as I researched CloudCrowd, the code base kept getting more and more Unix-centric. I did make a branch of the codebase that uses Ruby threads to overcome the use of fork(), but the solution is non-optimal and broke when I tried to merge it back into the master branch, which had added even more Unix dependencies for node CPU and memory statistics. Then other priorities took over and I have not looked back on that project.

Plus it will be an interesting project to cover on AppMecha!
