This site runs best with JavaScript enabled.

Build your own AST

Ruairidh Wynne-McHardy

16 May, 2020


Introduction

This follows on from Part One of our 'Build Your Own Programming Language' series where we built a basic lexer. If you haven't already, go back and read that first.

So now we have a way of splitting our code into its relevant tokens, we need to know what it actually means. So now we need to parse them (though technically parsing is the combination of both lexical and syntactic analysis).

Our code has been split into tokens by the lexer, now we need to arrange the tokens in a a way which means something. To achieve this, we turn them into what's known as an 'intermediate representation' or an abstract syntax tree.

That's to say, it's not our end product. Rather, it's a data-structure which we can manipulate. To get a feel for what an AST can look like, check out AST Explorer. You can see that everything gets split into a meaningful structure.

The AST we will make will be heavily inspired by Babel to keep things simple, but you could make an AST look like anything you'd like. In our case though, we'll be keeping close to the ESTree Spec which is also used by tools like Prettier or ESLint to understand the context of the code you write.

How to build an AST

To build an AST, we need to do a few things, namely:

  • Iterate through our tokens created by our lexer

  • For each primitive (number, string, etc) we add that token to the same level of our tree

  • For each function (known as a CallExpression) we collect its parameters and recurse down into the function body.

So let's check this out in code. Firstly, we'll make some functional utilities to see the beginning and end values of an array:

// ./utilities.js
const peek = array => array[0]
const pop = array => array.shift()
export { peek, pop }
// ./parse.js
import helpers from "./helpers"
import { peek, pop } from "./utilities"
const parenthesize = () => {}
const parse = tokens => {
const token = pop(tokens)
if (helpers.isNumber(token)) {
return {
type: "NumericLiteral",
value: token.value,
}
}
}

So it's kind of just a leaf right now, rather than a tree, but we'll keep building it out!

Let's add some more primitives:

// ./parse.js
import helpers from "./helpers"
import { peek, pop } from "./utilities"
const parenthesize = () => {}
const parse = (tokens) => {
const token = pop(tokens)
if (helpers.isNumber(token)) {
return {
type: "NumericLiteral",
value: token.value,
}
}
if (token.type === 'String') => {
return {
type: 'StringLiteral',
value: token.value,
}
}
if (token.type === 'Name') => {
return {
type: 'Identifier',
name: token.value,
}
}
}

Great. So now we can handle single values like strings, numbers, and names. But what about more complex things like functions? Well, we're going to take advantage of recurion and handle these. Check it out:

// ./parse.js
import helpers from "./helpers"
import { peek, pop } from "./utilities"
const parenthesize = (tokens) => {
const token = pop(tokens);
if (helpers.isOpeningParenthesis(token.value)) {
const expression = [];
while (!helpers.isClosingParenthesis(peek(tokens).value)) {
expression.push(parenthesize(token));
}
pop(tokens);
return expression;
}
return token;
}
const parse = (tokens) => {
if (Array.isArray(tokens)) {
const [first, ...rest] = tokens;
return {
type: 'CallExpression',
name: first.value,
arguments: rest.map(parse),
};
}
const token = tokens;
if (helpers.isNumber(token)) {
return {
type: "NumericLiteral",
value: token.value,
}
}
if (token.type === 'String') => {
return {
type: 'StringLiteral',
value: token.value,
}
}
if (token.type === 'Name') => {
return {
type: 'Identifier',
name: token.value,
}
}
}

We take the tokens and get the first one there. If it's an opening parenthesis, we keep going ahead until we find a closing parenthesis. For each value which isn't a closing parenthesis, we push it to our expression. If we don't hit an opening parenthesis, then we have hit a leaf node and just skip this.

Conclusion

Great, we have a basic AST in place! Not that hard, right? Using this, we'll build a REPL (Read Evaluate Print Loop) next and we'll be able to play around with our little language.

Share article

Join the Newsletter



Ruairidh Wynne-McHardy © 2020