TAKASHI'S WORKSPACE: 2010-09

Friday, September 24, 2010

Tamacola (5)

Tamacola is not just another LISP language, it is designed as a meta-language to make a new language. I'll explain this feature today. Today's goal is to design a subset of lisp language. If you think that a lisp is too simple to keep your passion, sorry, be patient, simple thing first.

Prepare your Tamacola environment

To setup Tamacola environment, you need to download both Tamacola distribution and Tamarin VM. Those are available on http://www.vpri.org/vp_wiki/index.php/Tamacola. You need add the PATH environment variable to find the avmshell command, and also it would be useful to set the PATH to bin/ in the tamacola tree. To make sure Tamacola works, plese type:

make run-example

It runs all of the examples in the Tamacola distribution as well as recompile the compiler. If you don't find any error, you are ready to go. Otherwise, please let me know the problem.

Tamacola command

Tamacola command read a tamacola program and run immediately. If you want to make a Flash contents, another command tamacc (Tamacola Compiler) is more suitable. Now we are playing with an interactive shell of tamacola command, so I'll give you a brief explanation. The interactive shell starts with minus (-) option. Let's try a simple arithmetic. If you didn't setup PATH environment, please specify the directory name, too.

$ tamacola -
Cola/Tamarin
> (+ 3 4)
7

You can also give Tamacola source files as well as compiled binary names. Typically, source code ends with .k, and a binary ends with .abc. Tamacola is smart enough to detect newer file between .k and .abc.

Match against a string constant

Suppose you are on some working directory, and you have already set PATH environment to the bin/ directory. And then, we are going to write a very simple language, greeting:

;; greeting.g - A simple PEG example

greeting = "morning" -> "Good Morning!"
         | "evening" -> "Good Evening!"

This stupid example answers "Good Morning!" if you say "morning", and it answers "Good Evening!" if you say "evening". This PEG syntax is easy to understand. The right hand side of = is a rule name. A rule name is translated as a function once it is built. -> means an action rule, where if the left hand is matched the right hand side is returned. | is an Ordered options. In this case, the parser tries the first case "morning", and tries the second case "evening" only if the first case fails.

Save this syntax named "greeting.g". To test this language, type those commands:

$ mkpeg greeting.g
$ tamacola greeting.k -
compiling:  greeting.k
Cola/Tamarin
> (parse-collection greeting "morning")
"Good Morning!"
> (parse-collection greeting "evening tokyo")
"Good Evening!"

Mkpeg command converts grammar file (greeting.g) to tamacola source (greeting.k), a rule "greeting" the result can be read by tamacola shell. Greeting.k is built on the fly and the command prompt is shown.

Parse-collection's first argument is a parser name (in this case "greeting"), and the second is a input collection. As the name implies, it accepts any collection as the input stream.

The second case shows an interesting property of PEG syntax. Although the second rule matches the beginning part of the input "evening tokyo", still the input remains more string " tokyo". PEG doesn't care if the input is competely consumed or not. If you really want to make sure that the entire input is matched, you need to explicitly tell the Parser the point where end of the file.

Number parser

The last example only matched a predefined constant, but we make a parser for any integer number here.

;; number.g -- A number parser

digit   = [0123456789]
number  = digit+

We also convert the grammar specification into the tamacola program, but in this case, we give -n option to tell the namespace. A namespace is useful when you want to use a common name as a rule name like "number". Because "number" is already used in the system, you can not use it without namespace.

The grammar itself is easy to understand if you have an experience with regular expressoins. Brackets ([]) matches one of characters inside, and postfixed plus (+) repeats previous expression with one-or-many times.

$ mkpeg -n number number.g 
$ tamacola number.k -
compiling:  number.k
Cola/Tamarin
> (parse-collection number/number "xyz")
FAIL
> (parse-collection number/number "345")
{token-group:
(53 52 51)}

Because we use the namespace "number", we need specify the namespace before slash(/) in the function name.

As you might notice, this parser correctly rejects a non-number like "xyz", and accepts "345". But the result is not so useful. The return value of plus is a special object named "token-group", but we would want a number represented by the string, instead. So we put a conversion function to get the value.

number  = digit+:n      -> (string->number (->string n))

$ tamacola number.k -
compiling:  number.k
Cola/Tamarin
> (parse-collection number/number "345")
345

Now parser returns a number conveniently. Perhaps you might think that it is somewhat cheating. As the string->number function itself is a kind of number parser, we should have write a number parser without string->number! Yes we could. But it leads more interesting topic about left and right recursion, so I leave it for later.

S-expression parser

Now we are going to write a parser for almost real S-expression. This parser can only handle number and list, but it is useful enough to explain the essence of Tamacola.

;; sexp.g
;; Lexical Parser

spaces  = [ \t\r\n]*

digit   = [0123456789]
number  = digit+ :n spaces              -> (string->number (->string n))

char    = [+-*/abcdefghijklmnopqrstuvwxyz]
symbol  = char+ :s spaces               -> (intern (->string s))
        
sexp    = symbol
        | number
        | "(" sexp*:e ")"               -> (->list e)

In this grammar, only new operator is the postfix star (*) which repeats zero-or-many times. Rest is straightforward. To test this grammar, we use Tamacola's simple test framework. Writing test case is better than the interactive shell, because you don't have to type same expression many times.

;; sexp-test.k

(check (parse-collection sexp/spaces "    ")            => 'SPACES)
(check (parse-collection sexp/digit "0")                => 48)
(check (parse-collection sexp/number "345")             => 345)
(check (parse-collection sexp/char "a")                 => 97)
(check (parse-collection sexp/symbol "hello")           => 'hello)

(check (parse-collection sexp/sexp "345")               => 345)
(check (parse-collection sexp/sexp "hello")             => 'hello)
(check (parse-collection sexp/sexp "(hello world)")     => '(hello world))
(check (parse-collection sexp/sexp "(3 4)")             => '(3 4))
(check (parse-collection sexp/sexp "(print 4)")         => '(print 4))

The check function comes from SRFI-78. This function complains only if the left hand value and the right hand value differ. Otherwise, does nothing. I like this UNIX stile conciseness.

As a convention, a test program is added a postfix "-test" with the main program's name. I borrowed this custom from Go language.

Make sure this program do nothing.

$ tamacola sexp.k sexp-test.k

Lisp Compiler

The PEG parser can handle any list structure as well as string. It allows you to write compiler in PEG. In a string parser, the input is a string and the output is some object (a list in our case), but in a compiler, the input is a lisp program and the output is a assembler code.

;; Compiler

arity   = .*:x                          -> (length (->list x))
insts   = inst* :xs                     -> (concatenate (->list xs)) 
                                        
inst    = is-number:x                   -> `((pushint ,x))
        | is-symbol:x                   -> `((getlex ((ns "") ,(symbol->string x))))
        | '( '+ inst:x inst:y )         -> `(,@x ,@y (add))
        | '( '- inst:x inst:y )         -> `(,@x ,@y (subtract))
        | '( '* inst:x inst:y )         -> `(,@x ,@y (multiply))
        | '( '/ inst:x inst:y )         -> `(,@x ,@y (divide))
        | '( inst:f &arity:n insts:a )  -> `(,@f (pushnull) ,@a (call ,n))

There are some new elements in the grammar. Quoted list '( ) matches a list structure, and a quoted symbol matches a symbol.

A prefix ampersand (&) prevents to consume the stream even if the rule matches. For example, &arity rule examine the rest of the list, but the contents are matched again by the insts rule later.

Is-number is matched against number, and is-symbol is for a symbol. Those rule can not be described as PEG grammar, but as a lisp function.

(define is-number
  (lambda (*stream* *parser*)
    (if (number? (peek *stream*))
        (begin (set-parser-result *parser* (next *stream*))
               #t)
        #f)))

(define is-symbol
  (lambda (*stream* *parser*)
    (if (symbol? (peek *stream*))
        (begin (set-parser-result *parser* (next *stream*))
               #t)
        #f)))

A rule is a function which receives the stream and the parser (an object which store the result). The rule function returns #t if it matches, and #f if it fails.

I think it is easier to see the test code than read my explanation.


(check (parse-collection sexp/arity '(a b c))   => 3)

(check (parse-collection sexp/insts '(3 4)      => '((pushint 3)
                                                     (pushint 4)))

(check (parse-collection sexp/inst '(3))        => '((pushint 3)))

(check (parse-collection sexp/inst '((+ 3 4)))  => '((pushint 3)
                                                     (pushint 4)
                                                     (add)))

(check (parse-collection sexp/inst '((f 3 4)))  => '((getlex ((ns "") "f"))
                                                     (pushnull)
                                                     (pushint 3)
                                                     (pushint 4)
                                                     (call 2)))

Put it in an envelope

We still need a little bit to construct a real assembler code. This detail topic is out of the context, so I simply show the code.

program = inst:x  -> `(asm
                       (method (((signature
                                  ((return_type *) (param_type ()) (name "program")
                                   (flags 0) (options ()) (param_names ())))
                                 (code ((getlocal 0)
                                        (pushscope)
                                        ,@x
                                       (returnvalue))))))
                      (script (((init (method 0)) (trait ())))))

And the test case.

(check (parse-collection sexp/program '((print 42)))
       => '(asm
            (method
             (((signature ((return_type *) (param_type ()) (name "program")
                           (flags 0) (options ()) (param_names ())))
               (code ((getlocal 0)
                      (pushscope)
                      (getlex ((ns "") "print"))
                      (pushnull)
                      (pushint 42)
                      (call 1)
                      (returnvalue))))))
            (script (((init (method 0)) (trait ()))))))

You can read the entire program in example/sexp.g in the Tamacola distribution. To try the program, please enter:

make -C example test-sexp

Left recursion

We left an interesting topic about left and right recursion. Let me show you our number parser again.

digit   = [0123456789]
number  = digit+:n               -> (string->number (->string n))

If we don't want to use string->number function, I would write the parser as:

;; Use fold-left
digit1   = [0123456789]:d        -> (- d 48)
number1  = digit1:x digit1*:xs   -> (fold-left
                                      (lambda (n d) (+ (* n 10) d))
                                      x
                                      (->list xs))

Digit1 rule converts the ascii value of the the digit character, and number1 rule construct a decimal number. As you see, you need to use fold-left function to construct a number because a number notation is essentially left recursion. For example, a number 34567 actually means:

(((3 * 10 + 4) * 10 + 5) * 10 + 6) * 10 + 7

However, PEG parser doesn't parse left recursion grammar in general. So I had to reconstruct the left recursion structure by fold-left. This is not hard at all if you familiar with functional programming. In functional programming, a list is considered as a right recursive data structure and it is even natural that a list is parsed by a right recursive way. However, I admit that it looks awkward for some people.

Yoshiki Ohshima provides a very useful extension to support a direct left recursion. To use his extension, the number parser is written as:

;; Use left-recursion

digit2   = [0123456789]:d        -> (- d 48)
number2  = number2:n digit2:d    -> (+ (* n 10) d)
         | digit2
number2s = number2

You need to load runtime/peg-memo-macro.k to use this extension.

$ tamacola ../runtime/peg-memo-macro.k number.k -
Cola/Tamarin
> (parse-collection number/number2s "345")
345

The real parser and compiler are bigger than presented grammars here, but I explained all of the essential ideas. I hope it helps you to make your own language!

Wednesday, September 15, 2010

Tamacola (4)

Tamacola in Tamacola

After I made the Tamacola compiler written in COLA, next thing to do was to implement it in Tamacola itself. A language is called self-hosting if the language is written in the language itself. This implies various advantage.

First, once self-hosting is done, you don't need to use COLA anymore, you can improve or modify any language aspects on Tamarin VM. If I carefully design the environment, it would be possible to do language design only on the Web browser (it needs server side help for security reason, so it hasn't done yet).

Second, self hosting is a benchmark for the language to tell that it is good enough. Scheme is especially simple language, so there are a lot of people who implement toy-Scheme. But because my Tamacola is now self-hosting, I could proudly claim that this is not a toy! Well, this is rather self satisfaction, though.

Third, it provides a rich library including "eval" function. A compiler uses various programming techniques, and those must be useful for other programs, too.

To make it self-hosting, there were two key problem which are macros and eval.

Bootstrapping Macros

I heavily used macros in my compiler, for example, the parser written in PEG was converted a bunch of macro expressions. The problem is, expanding macros requires eval function but I wasn't able to make eval before the parser was done. It's a deadlock! Here is a typical macro written in COLA:

(define-form begin e (cons 'let (cons '() e)))

This is how the macro works. When the compiler find a expression like:

(begin
  (print "Hello")
  (print "World"))

Expressions inside begin is bound to e, the body (cons 'let (cons '() e)) is executed in compile time and the expression is expanded to:

(let ()
  (print "Hello")
  (print "World"))

Such expansion is impossible without eval function because the compiler need to evaluate a list (cons 'let (cons '() e)) given by user. What I would do when I didn't have eval yet. But I realized that macros only include basic list functions like car, cdr, and cons in many cases. And a more complicated macro could be hard corded as a special form in the compiler. So I invented a pattern base macros.

(define-pattern ((begin . e) (let () . e)))

Basically this is a subset of Scheme's syntax-rule. If the compiler finds an expression starting with begin, rest of the expression is bound to e and substituted as a right hand side. Those expansion requires only limited set of list functions, so the compiler doesn't have to provide full eval function. This macro syntax made my compiler readable, and I was able to continue happily.

Even after I implemented more dynamic traditional macro with eval function, I keep using this pattern base macros mainly.

Eval

To implement eval function, you need to understand the dynamic code loading facility provided by the VM. Note that this is not part of AVM2 specification, and Avmshell (a console Tamarin shell program) and Adobe Flash have different API.

Avmshell has straightforward API. You give compiled byte code, and the function returns the value. Because Tamacola is now written in Tamacola, you can invoke the compiler as a library function and get byte code you want to execute.

avmplus.Domain.loadBytes(byteArray:ByteArray)

You can get the domain object by Domain.currentDomain() static method. Those useful functions in Avmshell are found shell/ directory in the Tamarin-central repository.

Flash Player has somewhat tricky API for dynamic code loading. The signature is normal.

flash.display.Loader.loadBytes(bytes:ByteArray, context:LoaderContext = null):void

There are two problems for our purpose. First, this method is not designed mainly for dynamic loading, it only accepts SWF, JPG, PNG, or GIF files, and byte code happen to be accepted inside a SWF file. So I had to construct SWF file to load code. In case if you don't know about SWF file, SWF file is a kind of container format. You can embedded vector graphics, mp3 sounds, and ActionScript byte code. Making a SWF file is not particularly difficult though, it needs nasty bit fiddling.

Second, this is far more problematic, is that this method works as asynchronously. In other words, this doesn't return the result value. Instead, you need to give it a callback function to wait to finish the code. Additionally, this method doesn't return value at all, so if you want the return value, you need to setup some explicit return mechanism by yourself.

Practically, this cause a problem if you want to write a traditional macro definition and use the macro in a same source code. Because a traditional macro need to evaluate a lisp expression in a compile time, but the eval function doesn't return before the compilation thread is done. I could solve the problem by setting up compilation queue or something, but it would cost performance penalty which I don't want. And now I simply gave up.

I have explained pretty much all interesting aspect of the self hosting compiler. I'll talk about how to make a new language on the Tamacola environment later.

Tuesday, September 14, 2010

Tamacola (3)

How a lisp program is compiled to Tamarin VM

Now I'm going to talk a bit about how a lisp (almost Scheme) program is compiled into Tamarin's byte code. This topic is especially interesting if you are curious to make your own language or VM.

Tamarin VM is made for ActionScript, so its byte code is also specifically designed for ActionScript. In other words, it is a slightly tricky to implement other language than ActionScript. In case if you don't know about ActionScript, it is almost identical as JavaScript in the execution model. Difference between them is about optimization with explicit type notion and static field.

ActionScript and Scheme are common for those aspects:

Lexical scope.
Function object.
Variable arguments with no curring.
Dynamic typing (a value has type, not variable).

But there are significant difference.

ActionScript doesn't have a simple function call. Everything is a method call.
In ActionScript, a function has a scope. No scope block or let expression.
Tail call optimization is not supported.
Call stack can not be accessed.

Those limitations sound like that Tamarin VM is inferior. But no, actually those limitations come from Tamarin VM's advantage and optimization. If you happen to have a chance to design your VM, please learn from the lesson. There ain't no such thing as a free optimization. Any optimization kills some generality. I'll explain each case.

ActionScript doesn't have a simple function call neither Tamarin VM. This is rather harmless though. When you see a function like expression like trace("hello"), this is actually mean (the global object).trace("hello"), and eventually, the receiver passes to the function as the first argument. In other words, if you want to construct a function call with two arguments, you need to make three arguments where the first argument is "this" object. A slightly tricky part is primitive operators like + or -, which don't have "this" object. Those primitives are special case.

ActionScript also has lexical scope, but only a function has a scope. So I have to be careful when I compile let expression in Scheme. Most simplest way to implement a let expression is to use a function. A let expression can be always translated to a lambda in theory though, this is a huge performance disadvantage. So I use "with" expression in ActionScript. "With" expression is an unpopular syntax in ActionScript, but you can use any object as a scope object. I borrowed this idea from Happy-ABC project http://github.com/mzp/scheme-abc.

Lack of the tail call optimization in Tamarin VM was the most disappointed thing to me. It prevents a functional programming style. I simply gave up it. Tail call optimization is not difficult topic at all. If the target were a native code like x86, it would be a matter of swapping stack and jump. But Tamarin VM doesn't allow direct access of stack or jump to other function. I understand that it might cause a security issue though, it would be wonderful if VM would provide special byte code for tail call.

Finally, you can't access the call stack directly, therefore you can't implement call/cc. The reason why I can't call Tamacola as Scheme is the lack of tail call optimization and call/cc. It prevents many experimental language features like generator, process, or so. But considering rich libraries provided by the Flash API, I would say Tamacola will be a reasonably useful language eventually.

I'll tell you convolved self hosting process and macros tomorrow.

Tamacola (2)

COLA

Once the assembler was done, I was able to test verious Tamarin VM's features, even I wrote a tiny GUI application on Adobe Flash in the assembler. Then next step is the compiler.

Another goal of the project was to port Ian Piumarta's COLA framework to Tamarin (the project name came from this). And perhaps this is only the real technical contribution of the project. COLA is a meta language (a programming language to design another language) which resembles Scheme. COLA has a nice sub-language called Parser Expression Grammar that makes parser very terse. My plan was to write a boot-strappng compiler in PEG and COLA, then to implement COLA library, and to write the real compiler in PEG and Tamacola itself.

I won't give you the detail of PEG. But briefly, it is as simple as a regular expression and as powerful as a context free grammar.

When that time I started writing the compiler, COLA has no library at all except PEG framework, so I needed to write necessary libraries by myself from scratch. Fortunately COLA has quite a powerful external function call feature (a kind of FFI), macro sysytem, and a flexible object oriented framework. So writing library is not so hard. But I tried not to use COLA specific features as possible because it would be a problem when I rewrite the compiler in Tamacola itself later.

To implement the library, I borrowed function specifications from R6RS as well as possible to avoid unnecessary confusion. There were exception because COLA treat a slash "/" character as special for namespaces, I took PLT's function names in this case.

Writing lisp libraries is interesting puzzle to me because there were some requirements and constrain for the domain. Those requiments are:

Unit testing framework.
Library framework.
List manipulations.
String functions.
Bit operations and streams.
Pretty printer for debugging.

These requirements were carefully chosen. Because COLA has only modest debugging facility, the unit test framework must be there. So my first goal was to implement all functions needed by the unit testing. I needed a pretty printer for debugging, too.

Another "must have" library was bit operators, and file / in-memory streams that is needed to the assembler. Interestingly enough, R6RS doesn't define enough functions to support those. For example, there are no portable way to specify a stream to be binary or text. So I needed a bit creativity.

Eventually, I wrote all libraries and the compiler. And I got a pretty good sense about a minimun set of functions needed for compiler, which are testing framework, pretty printer, bit operators, and streams. In other words, if your language has those features, your language can be self-hosting.

The real puzzle part was the order. Those requirements must be precisely ordered by need. For example, the pretty printer must follow stream and string functions because the pretty printer uses those functions. Although you can write functions in random order as you like in Lisp, precise order makes testing and debugging is easy. I kept this discipline. I even implemented the test library twice, the first one was concise assert function, and the second one has more friendly fail message by the pretty printer.

It took a few weeks to build a simple compiler, but still there were long way up to the point where self-hosting can be done. One thing that I had learned from the stage was, even without debugger, debugging is not so hard if you have enough test cases and a good pretty printer.

Sunday, September 12, 2010

Tamacola (1)

Intro

I have published the source code of Tamacola, a lisp compiler which runs on Adobe Flash / Tamarin VM (or Adobe Virtual Machine 2) http://github.com/propella/tamacola. I'm pretty sure that the current version is useless if you are just looking for a lisp implementation on Tamarin (las3r and scheme-abc are much better), but Tamacola includes abundant tips if you are interested in making a self-hosting compiler on Tamarin VM. That's why I decided to publish it as-is.

I'm also working on a presentation slide for S3 conference http://www.hpi.uni-potsdam.de/hirschfeld/s3/s3-10/ to show it. I'm writing random thoughts about the compiler here so that I will compile them to a thread of talk.

I've already written the motivation on the paper (perhaps I will paste the URL in a month) so I don't repeat it. But in short, I wanted make a tiny language which bootstraps and runs on Adobe Flash.

A tiny language and bootstrapping seem contradicting idea as bootstrapping requires various language functions which tends to be large. On the other hand, this is practically a nice constrain because it keeps the language from too simple or too fat. Choosing Scheme like language as a target is natural to me because I wanted to concentrate basic implementation technique instead of language design.

Well, as one reviewer of the paper said, this is not particularly surprising or dramatically different in comparison with previous systems in the area, but some of the stories from the compiler should interest you!

How I started the assembler

In the beginning I created the assembler. Honestly, I wanted to avoid the task because writing assembler seemed not quite an interesting job. But in that time, I couldn't find a nice AVM2 assembler that suite my project. So I've done it. In retrospect, this was not bad at all. I could understand what avm2overview.pdf (the AVM2 specification) said quite well, and I got self confidence.

I wrote my assembler in PLT-Scheme because Ian Piumarta's COLA (Tamacola was supposed to be written in COLA and Tamacola itself, I'll tell you this later) is not finished yet in that time and Duncan Mak, a friend of mine, recommend it. This was actually a good choice. This is my first Scheme application and PLT's good documentation helped me a lot.

An interesting part of PLT-Scheme was it encourages a functional programming style, even PLT doesn't suppport set-car! and set-cdr! in the default library. So it was natural that my assembler was written without side-effect except I/O. This is the first key of the development of the assembler. Unfortunately, because Tamarin doesn't support tail-recursion optimazion and Tamarin's stack size is small, I gave up to eliminate all side-effect later. But the implementation was pure functional up to the time, and it was quite clean.

Indeed, it had to be clean considering boot-strapping. I wanted to make the assembler run in my language itself even before enough debugging facility is not ready. If it were not clean, a tiny bug would cause a few days of debugging. I avoided the nightmare with a functional style and Test Driven Development.

Test Driven Development is the second key. I virtually wrote every test case for each function even if it looks silly. Scheme has a couple of options of testing frame work. I chose SRFI-78. It only report assertion failer only something happen, otherwise it keeps silence. I somewhat like this UNIX taste terse.

The third key was to write an assembler and a disassembler in a same time. It sounds like an unnecessary job because I only needed an assembler eventually. But I had to analyze an output from asc (an asembler in Adobe Flex) and learn how an ActionScript program was converted to the Tamarin byte-code. The disassembler was very helpful to read the byte-code as well as debugging. If output of the disassembler generates the original byte-code by the assembler, there is high chance that my imprementation is correct, unless my understanding is wrong.

The assembler is named ABCSX http://github.com/propella/abcsx and it was ported to Gauche, COLA, and Tamacola later. I ported it to Gauche because I was curious about portability of Scheme language.

I had realized there are many places where I could reduce code redundancy in the assembler. An assembler tends to include repetitive process, but some of them are not quite captured well by function abstraction. I would be effective to apply macro and domain specific language in those part. I didn't have tried to solve it yet, but I want to solve it later.

(to be continued)

TAKASHI'S WORKSPACE