website | twitter

Friday, September 24, 2010

Tamacola (5)

Tamacola is not just another LISP language, it is designed as a meta-language to make a new language. I'll explain this feature today. Today's goal is to design a subset of lisp language. If you think that a lisp is too simple to keep your passion, sorry, be patient, simple thing first.

Prepare your Tamacola environment

To setup Tamacola environment, you need to download both Tamacola distribution and Tamarin VM. Those are available on http://www.vpri.org/vp_wiki/index.php/Tamacola. You need add the PATH environment variable to find the avmshell command, and also it would be useful to set the PATH to bin/ in the tamacola tree. To make sure Tamacola works, plese type:
make run-example
It runs all of the examples in the Tamacola distribution as well as recompile the compiler. If you don't find any error, you are ready to go. Otherwise, please let me know the problem.

Tamacola command

Tamacola command read a tamacola program and run immediately. If you want to make a Flash contents, another command tamacc (Tamacola Compiler) is more suitable. Now we are playing with an interactive shell of tamacola command, so I'll give you a brief explanation. The interactive shell starts with minus (-) option. Let's try a simple arithmetic. If you didn't setup PATH environment, please specify the directory name, too.
$ tamacola -
Cola/Tamarin
> (+ 3 4)
7
You can also give Tamacola source files as well as compiled binary names. Typically, source code ends with .k, and a binary ends with .abc. Tamacola is smart enough to detect newer file between .k and .abc.

Match against a string constant

Suppose you are on some working directory, and you have already set PATH environment to the bin/ directory. And then, we are going to write a very simple language, greeting:

;; greeting.g - A simple PEG example

greeting = "morning" -> "Good Morning!"
         | "evening" -> "Good Evening!"

This stupid example answers "Good Morning!" if you say "morning", and it answers "Good Evening!" if you say "evening". This PEG syntax is easy to understand. The right hand side of = is a rule name. A rule name is translated as a function once it is built. -> means an action rule, where if the left hand is matched the right hand side is returned. | is an Ordered options. In this case, the parser tries the first case "morning", and tries the second case "evening" only if the first case fails.

Save this syntax named "greeting.g". To test this language, type those commands:

$ mkpeg greeting.g
$ tamacola greeting.k -
compiling:  greeting.k
Cola/Tamarin
> (parse-collection greeting "morning")
"Good Morning!"
> (parse-collection greeting "evening tokyo")
"Good Evening!"

Mkpeg command converts grammar file (greeting.g) to tamacola source (greeting.k), a rule "greeting" the result can be read by tamacola shell. Greeting.k is built on the fly and the command prompt is shown.

Parse-collection's first argument is a parser name (in this case "greeting"), and the second is a input collection. As the name implies, it accepts any collection as the input stream.

The second case shows an interesting property of PEG syntax. Although the second rule matches the beginning part of the input "evening tokyo", still the input remains more string " tokyo". PEG doesn't care if the input is competely consumed or not. If you really want to make sure that the entire input is matched, you need to explicitly tell the Parser the point where end of the file.

Number parser

The last example only matched a predefined constant, but we make a parser for any integer number here.

;; number.g -- A number parser

digit   = [0123456789]
number  = digit+

We also convert the grammar specification into the tamacola program, but in this case, we give -n option to tell the namespace. A namespace is useful when you want to use a common name as a rule name like "number". Because "number" is already used in the system, you can not use it without namespace.

The grammar itself is easy to understand if you have an experience with regular expressoins. Brackets ([]) matches one of characters inside, and postfixed plus (+) repeats previous expression with one-or-many times.

$ mkpeg -n number number.g 
$ tamacola number.k -
compiling:  number.k
Cola/Tamarin
> (parse-collection number/number "xyz")
FAIL
> (parse-collection number/number "345")
{token-group:
(53 52 51)}

Because we use the namespace "number", we need specify the namespace before slash(/) in the function name.

As you might notice, this parser correctly rejects a non-number like "xyz", and accepts "345". But the result is not so useful. The return value of plus is a special object named "token-group", but we would want a number represented by the string, instead. So we put a conversion function to get the value.

number  = digit+:n      -> (string->number (->string n))
$ tamacola number.k -
compiling:  number.k
Cola/Tamarin
> (parse-collection number/number "345")
345

Now parser returns a number conveniently. Perhaps you might think that it is somewhat cheating. As the string->number function itself is a kind of number parser, we should have write a number parser without string->number! Yes we could. But it leads more interesting topic about left and right recursion, so I leave it for later.

S-expression parser

Now we are going to write a parser for almost real S-expression. This parser can only handle number and list, but it is useful enough to explain the essence of Tamacola.

;; sexp.g
;; Lexical Parser

spaces  = [ \t\r\n]*

digit   = [0123456789]
number  = digit+ :n spaces              -> (string->number (->string n))

char    = [+-*/abcdefghijklmnopqrstuvwxyz]
symbol  = char+ :s spaces               -> (intern (->string s))
        
sexp    = symbol
        | number
        | "(" sexp*:e ")"               -> (->list e)

In this grammar, only new operator is the postfix star (*) which repeats zero-or-many times. Rest is straightforward. To test this grammar, we use Tamacola's simple test framework. Writing test case is better than the interactive shell, because you don't have to type same expression many times.

;; sexp-test.k

(check (parse-collection sexp/spaces "    ")            => 'SPACES)
(check (parse-collection sexp/digit "0")                => 48)
(check (parse-collection sexp/number "345")             => 345)
(check (parse-collection sexp/char "a")                 => 97)
(check (parse-collection sexp/symbol "hello")           => 'hello)

(check (parse-collection sexp/sexp "345")               => 345)
(check (parse-collection sexp/sexp "hello")             => 'hello)
(check (parse-collection sexp/sexp "(hello world)")     => '(hello world))
(check (parse-collection sexp/sexp "(3 4)")             => '(3 4))
(check (parse-collection sexp/sexp "(print 4)")         => '(print 4))

The check function comes from SRFI-78. This function complains only if the left hand value and the right hand value differ. Otherwise, does nothing. I like this UNIX stile conciseness.

As a convention, a test program is added a postfix "-test" with the main program's name. I borrowed this custom from Go language.

Make sure this program do nothing.

$ tamacola sexp.k sexp-test.k 

Lisp Compiler

The PEG parser can handle any list structure as well as string. It allows you to write compiler in PEG. In a string parser, the input is a string and the output is some object (a list in our case), but in a compiler, the input is a lisp program and the output is a assembler code.

;; Compiler

arity   = .*:x                          -> (length (->list x))
insts   = inst* :xs                     -> (concatenate (->list xs)) 
                                        
inst    = is-number:x                   -> `((pushint ,x))
        | is-symbol:x                   -> `((getlex ((ns "") ,(symbol->string x))))
        | '( '+ inst:x inst:y )         -> `(,@x ,@y (add))
        | '( '- inst:x inst:y )         -> `(,@x ,@y (subtract))
        | '( '* inst:x inst:y )         -> `(,@x ,@y (multiply))
        | '( '/ inst:x inst:y )         -> `(,@x ,@y (divide))
        | '( inst:f &arity:n insts:a )  -> `(,@f (pushnull) ,@a (call ,n))

There are some new elements in the grammar. Quoted list '( ) matches a list structure, and a quoted symbol matches a symbol.

A prefix ampersand (&) prevents to consume the stream even if the rule matches. For example, &arity rule examine the rest of the list, but the contents are matched again by the insts rule later.

Is-number is matched against number, and is-symbol is for a symbol. Those rule can not be described as PEG grammar, but as a lisp function.

(define is-number
  (lambda (*stream* *parser*)
    (if (number? (peek *stream*))
        (begin (set-parser-result *parser* (next *stream*))
               #t)
        #f)))

(define is-symbol
  (lambda (*stream* *parser*)
    (if (symbol? (peek *stream*))
        (begin (set-parser-result *parser* (next *stream*))
               #t)
        #f)))

A rule is a function which receives the stream and the parser (an object which store the result). The rule function returns #t if it matches, and #f if it fails.

I think it is easier to see the test code than read my explanation.


(check (parse-collection sexp/arity '(a b c))   => 3)

(check (parse-collection sexp/insts '(3 4)      => '((pushint 3)
                                                     (pushint 4)))

(check (parse-collection sexp/inst '(3))        => '((pushint 3)))

(check (parse-collection sexp/inst '((+ 3 4)))  => '((pushint 3)
                                                     (pushint 4)
                                                     (add)))

(check (parse-collection sexp/inst '((f 3 4)))  => '((getlex ((ns "") "f"))
                                                     (pushnull)
                                                     (pushint 3)
                                                     (pushint 4)
                                                     (call 2)))

Put it in an envelope

We still need a little bit to construct a real assembler code. This detail topic is out of the context, so I simply show the code.

program = inst:x  -> `(asm
                       (method (((signature
                                  ((return_type *) (param_type ()) (name "program")
                                   (flags 0) (options ()) (param_names ())))
                                 (code ((getlocal 0)
                                        (pushscope)
                                        ,@x
                                       (returnvalue))))))
                      (script (((init (method 0)) (trait ())))))

And the test case.

(check (parse-collection sexp/program '((print 42)))
       => '(asm
            (method
             (((signature ((return_type *) (param_type ()) (name "program")
                           (flags 0) (options ()) (param_names ())))
               (code ((getlocal 0)
                      (pushscope)
                      (getlex ((ns "") "print"))
                      (pushnull)
                      (pushint 42)
                      (call 1)
                      (returnvalue))))))
            (script (((init (method 0)) (trait ()))))))

You can read the entire program in example/sexp.g in the Tamacola distribution. To try the program, please enter:

make -C example test-sexp

Left recursion

We left an interesting topic about left and right recursion. Let me show you our number parser again.

digit   = [0123456789]
number  = digit+:n               -> (string->number (->string n))

If we don't want to use string->number function, I would write the parser as:

;; Use fold-left
digit1   = [0123456789]:d        -> (- d 48)
number1  = digit1:x digit1*:xs   -> (fold-left
                                      (lambda (n d) (+ (* n 10) d))
                                      x
                                      (->list xs))

Digit1 rule converts the ascii value of the the digit character, and number1 rule construct a decimal number. As you see, you need to use fold-left function to construct a number because a number notation is essentially left recursion. For example, a number 34567 actually means:

(((3 * 10 + 4) * 10 + 5) * 10 + 6) * 10 + 7

However, PEG parser doesn't parse left recursion grammar in general. So I had to reconstruct the left recursion structure by fold-left. This is not hard at all if you familiar with functional programming. In functional programming, a list is considered as a right recursive data structure and it is even natural that a list is parsed by a right recursive way. However, I admit that it looks awkward for some people.

Yoshiki Ohshima provides a very useful extension to support a direct left recursion. To use his extension, the number parser is written as:

;; Use left-recursion

digit2   = [0123456789]:d        -> (- d 48)
number2  = number2:n digit2:d    -> (+ (* n 10) d)
         | digit2
number2s = number2

You need to load runtime/peg-memo-macro.k to use this extension.

$ tamacola ../runtime/peg-memo-macro.k number.k -
Cola/Tamarin
> (parse-collection number/number2s "345")
345

The real parser and compiler are bigger than presented grammars here, but I explained all of the essential ideas. I hope it helps you to make your own language!

2 comments:

  1. Hi Takashi.. thanks for these interesting posts. Could you say briefly how the new tool(s) compares with the previous cola stuff, and how complete/usable is it at this stage ?

    ReplyDelete
  2. Hi Simon, Tamacola is implemented some of Ian Piumarta's ideas, but it doesn't have all COLA's features. The good news is that Tamacola runs wherever Tamarin VM runs. It's really easier to setup the environment than COLA. The bad news is it lacks some interesting features like configurable object model, continuation, etc.

    How complete/usable is a difficult question. At least, one of my colleagues has happily used. I myself think Tamacola is easier to use than ActionScript (it's very biased opinion). But I don't recommend to use it unless you are a lisp hacker.

    ReplyDelete

 
Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.