simple s-expression parser in python
A few weeks back, I boasted in a bar that anyone could write a simple s-expression parser from scratch (no flex, no yacc, no bison, no ANTLR, etc.). Well, I could not boast without doing it myself, so I finally did. My s-expression lexer/parser weighs in at about 170 lines without tests (450 lines with them), and parses integers and symbols. I even created a little repl which takes an s-expression and converts it into a pretty-printed python list consisting of objects representing symbols and numbers.
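To illustrate the round trip, here's a minimal sketch. The Symbol and Number names follow the post, but this tiny recursive parser is my own stand-in, not the author's 170-line version:

```python
# Sketch of the repl's round trip: s-expression string in, nested
# python list of Symbol/Number objects out. Implementation is a
# stand-in, not the post's actual code.

class Symbol:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"Symbol({self.name!r})"
    def __eq__(self, other):
        return isinstance(other, Symbol) and self.name == other.name

class Number:
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return f"Number({self.value})"
    def __eq__(self, other):
        return isinstance(other, Number) and self.value == other.value

def parse(text):
    """Parse one s-expression into nested lists of Symbol/Number."""
    # Pad parens with spaces so a plain split() tokenizes the input.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        token = tokens[pos]
        if token == "(":
            items = []
            pos += 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1  # skip the ")"
        if token.lstrip("-").isdigit():
            return Number(int(token)), pos + 1
        return Symbol(token), pos + 1

    result, _ = read(0)
    return result

print(parse("(+ 1 (* 2 3))"))
# → [Symbol('+'), Number(1), [Symbol('*'), Number(2), Number(3)]]
```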
I think the design is fairly readable, if a bit unusual. There is a class for each kind of token (instances of which represent both fully and partially formed tokens), and we “add” these tokens together to form a larger token, ultimately raising a next_token_exception that allows the lexer to yield the largest, greediest, most fully formed token possible.
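The token-adding idea above might be sketched like this. The post names a next_token_exception; the class and method names here are my own guesses at the shape, not the author's actual implementation:

```python
# Sketch of exception-driven, greedy lexing: each token instance
# absorbs characters until it can't, then signals the lexer to
# yield it and start a fresh token. Names are assumptions.

class NextTokenException(Exception):
    """Raised when the current token cannot absorb the next character."""

class Token:
    def __init__(self, text=""):
        self.text = text
    def accepts(self, char):
        raise NotImplementedError
    def __add__(self, char):
        # "Adding" a character grows the token, or tells the lexer
        # this token is as big as it gets.
        if not self.accepts(char):
            raise NextTokenException
        return type(self)(self.text + char)
    def __repr__(self):
        return f"{type(self).__name__}({self.text!r})"

class ParenToken(Token):
    def accepts(self, char):
        return self.text == "" and char in "()"

class NumberToken(Token):
    def accepts(self, char):
        return char.isdigit()

class SymbolToken(Token):
    def accepts(self, char):
        return char.isalpha() or char in "+-*/"

class WhitespaceToken(Token):
    def accepts(self, char):
        return char.isspace()

TOKEN_CLASSES = [ParenToken, NumberToken, SymbolToken, WhitespaceToken]

def start_token(char):
    for cls in TOKEN_CLASSES:
        if cls().accepts(char):
            return cls(char)
    raise ValueError(f"unexpected character: {char!r}")

def lex(text):
    current = None
    for char in text:
        if current is None:
            current = start_token(char)
            continue
        try:
            current = current + char  # grow the token greedily
        except NextTokenException:
            yield current             # fully formed; can't grow further
            current = start_token(char)
    if current is not None:
        yield current

print(list(lex("(ab 12)")))
# → [ParenToken('('), SymbolToken('ab'), WhitespaceToken(' '),
#    NumberToken('12'), ParenToken(')')]
```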
After lexing, we’re left with a stream of fully formed tokens; it’s a simple matter to filter out the whitespace tokens and start yielding instances of Symbol() and Number() classes with proper, pythonic types for the data.
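That filtering-and-conversion step could look something like this; the (kind, text) tuple representation of tokens is an assumption for illustration, not the post's actual classes:

```python
# Sketch of the post-lex step: drop whitespace tokens and turn the
# rest into objects carrying proper python types. Tokens here are
# assumed to be simple (kind, text) tuples.

class Symbol:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"Symbol({self.name!r})"

class Number:
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return f"Number({self.value})"

def refine(tokens):
    """Filter whitespace; yield Symbol/Number objects (parens pass through)."""
    for kind, text in tokens:
        if kind == "whitespace":
            continue
        if kind == "number":
            yield Number(int(text))  # proper python int
        elif kind == "symbol":
            yield Symbol(text)
        else:
            yield text               # '(' and ')' pass through unchanged

tokens = [("paren", "("), ("symbol", "add"), ("whitespace", " "),
          ("number", "12"), ("paren", ")")]
print(list(refine(tokens)))
# → ['(', Symbol('add'), Number(12), ')']
```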
Although I used regexes (quite unnecessarily) and generators, this little thing could probably be ported to C. It’d be the fastest, most useless parser/lexer around :)
Happy Hacking!