Shunting Yard (Part 1)

The Shunting Yard Algorithm is, quite possibly, my favorite algorithm. It transforms a series of tokens into an abstract syntax tree, taking into account operator type, precedence, and parenthesis.

Through the next series of articles, I want to build a general purpose expression parser using the algorithm. By the end, we’ll have an algorithm that can be easily extended to add many different types of operators… any arity (unary, binary, ternary, etc), prefix, infix, and postfix.

Series: Part 1, Part 2, Part 3

The Problem

First: what is the problem we’re trying to solve?

We’ll save lexical analysis for another article. For now, let’s assume we have a series of tokens. For example:

``````// tokens for the input string: "3 + 7 * -10"
var tokens = [
{ type: 'num', value:  3  }, // the number 3
{ type: 'sym', value: '+' }, // the symbol '+'
{ type: 'num', value:  7  }, // the number 7
{ type: 'sym', value: '*' }, // the symbol '*'
{ type: 'sym', value: '-' }, // the symbol '-'
{ type: 'num', value: 10  }  // the number 10
];
``````

Remember: the lexer is very dumb. It is simply scanning the input string from left to right, saying, “I see the number `3`… I see the `+` symbol… I see the number `7`… I see the `*` symbol…”

We need to convert this into an abstract sytax tree (AST), like this: Abstract syntax tree of: 3 + 7 * -10

Which we’ll represent in JavaScript as:

``````var ast = {
parms: [{
type: 'val', // the value 3
value: 3
}, {
type: 'mul', // the multiplication operator
parms: [{
type: 'val', // the value 7
value: 7
}, {
type: 'neg', // the negation operator
parms: [{
type: 'val', // the value 10
value: 10
}
]
}
]
}
]
};
``````

The idea can be represented in code many different ways, but the core is the same: each node in the AST is either a terminal or a non-terminal.

A terminal is something that terminates the tree – it is a leaf node. In our example, `3`, `7`, and `10` are all terminals. They don’t have any children nodes. These are represented in our AST as objects with `type` set to `'val'`.

Non-terminals are things that contain more nodes. In the example, `add`, `mul`, and `neg` are non-terminals. They operate on their parameters and produce a value that is passed up along the tree.

Shunting Yard Skeleton

Where to even start? First, let’s build the most basic Shunting Yard function, so we can inspect it fully:

``````function SimpleShuntingYard(tokens){
// helper function to get next token and validate its type
var tokenIndex = 0;
function NextToken(requiredType){
if (tokenIndex >= tokens.length)
throw 'invalid expression: missing token';
var tok = tokens[tokenIndex];
tokenIndex++;
if (tok.type != requiredType)
throw 'invalid expression: expecting ' + requiredType;
}

// helper function to apply binary op based on a symbol
function ApplyBinOp(symbol, left, right){
var sym_to_type = {
'-': 'sub',
'*': 'mul',
'/': 'div'
};
if (!sym_to_type[symbol])
throw 'invalid expression: unknown operator ' + symbol;
return {
type: sym_to_type[symbol],
parms: [left, right]
};
}

var tok, nextValue;
var lastOp = false;
var lastValue;

function ApplyLastOp(){
// if we don't have a prev operator, then use the terminal
if (lastOp === false)
lastValue = nextValue;
else // otherwise, we have a pending operator, so apply it
lastValue = ApplyBinOp(lastOp, lastValue, nextValue);
}

while (1){
// grab a terminal (which is always type 'num', for now)
tok = NextToken('num');

// create the terminal node
nextValue = {
type: 'val',
value: tok.value
};

// if we're out of tokens, then exit
if (tokenIndex >= tokens.length)
break;
// otherwise, we must have a binary operator

// grab an operator (which is always type 'sym', for now)
tok = NextToken('sym');

// apply any shunted operators
ApplyLastOp();

// save the new operator to be applied later
lastOp = tok.value;
}

// apply any remaining shunted operators
ApplyLastOp();

// all done, so return the top of the tree
return lastValue;
}
``````

What the hell is even going on in this piece of code! Does it actually work?

This may seem like a lot, but removing error checking, comments, and blank lines, the core algorithm is about 50 lines.

This simple algorithm can build ASTs for some expressions… but not all. Here are the results for the expression `2 * 3 + 4`:

``````var tokens = [
{ type: 'num', value:  2  },
{ type: 'sym', value: '*' },
{ type: 'num', value:  3  },
{ type: 'sym', value: '+' },
{ type: 'num', value:  4  }
];
SimpleShuntingYard(tokens);
=> {
parms: [{
type: 'mul',
parms: [{
type: 'val',
value: 2
}, {
type: 'val',
value: 3
}]
}, {
type: 'val',
value: 4
}
]
}
``````

Hey, that’s right..! How did it do that?! Let’s start by inspecting the easier parts, then building up (HAHA! get it? …no? ;-P).

NextToken

The `NextToken` function is responsible for getting the next token and validating it against the required type:

``````var tokenIndex = 0;
function NextToken(requiredType){
if (tokenIndex >= tokens.length)
throw 'invalid expression: missing token';
var tok = tokens[tokenIndex];
tokenIndex++;
if (tok.type != requiredType)
throw 'invalid expression: expecting ' + requiredType;
}
``````

Notice that `tokenIndex` starts at `0`, and continues counting up as tokens are requested. If there aren’t any more tokens, or if the token is the wrong type, then errors are thrown. Otherwise, the function simply returns the next token.

ApplyBinOp

The `ApplyBinOp` is also fairly straight forward:

``````function ApplyBinOp(symbol, left, right){
var sym_to_type = {
'-': 'sub',
'*': 'mul',
'/': 'div'
};
if (!sym_to_type[symbol])
throw 'invalid expression: unknown operator ' + symbol;
return {
type: sym_to_type[symbol],
parms: [left, right]
};
}
``````

Given a symbol, left node, and right node, we need to construct a binary operator node. This is simple. First, have a map between symbols and binary operators. If the given symbol isn’t in the map, then the user has provided an invalid operator. Otherwise, return the node object, with `type` set to the operator, and `parms` set to a list containing the left and right nodes.

Easy. Now for the meat:

The Shunt

Next, let’s look at the `lastOp` variable. This is the most important variable – it is the spirit of the Shunting Yard algorithm.

The Shunting Yard algorithm works by postponing application of operators. Why do we need to postpone? Because we haven’t gathered all the parameters yet!

We are scanning the tokens, from left to right. When we hit a binary operator, we need to wait before creating the node, because we don’t have all the information just yet. We need to grab the next terminal before we can create the node.

Once we grab the next terminal, we are free to create the node. That’s what this line does:

``````lastValue = ApplyBinOp(lastOp, lastValue, nextValue);
``````

Remember: `nextValue` is always the terminal we just grabbed:

``````// create the terminal node
nextValue = {
type: 'val',
value: tok.value
};
``````

And `lastOp` contains the symbol we collected right before looping around:

``````// save the new operator to be applied later
lastOp = tok.value;
``````

So: what is `lastValue` doing in there?

The Tree

The `lastValue` variable is the magic that builds the AST. Before we can understand what it’s doing, we need to widen our inspection:

``````function ApplyLastOp(){
// if we don't have a prev operator, then use the terminal
if (lastOp === false)
lastValue = nextValue;
else // otherwise, we have a pending operator, so apply it
lastValue = ApplyBinOp(lastOp, lastValue, nextValue);
}
``````

When we go through the loop the first time, we don’t have a `lastOp` – we indicate this by initializing `lastOp` to `false`. Therefore, the first time through the loop, `lastValue` is simply set to `nextValue`, which is the terminal.

But the second time around, `lastOp` contains a symbol and `lastValue` will contain the previous terminal. We will load the next terminal into `nextValue`. It’s at this point we execute:

``````lastValue = ApplyBinOp(lastOp, lastValue, nextValue);
``````

What happens? The binary operator node is created, with the previous operator, the previous value, and the terminal we just read. Great – but why assign the results to `lastValue`?

We assign the resulting node to `lastValue` so that when the loop goes around a third time, `lastValue` will contain the binary operator (which, in turn, contains the terminals). This is what builds our tree. We are moving from terminals (leaf nodes), upwards along the branches, building larger and larger trees, using the binary operator nodes to collect everything.

In fact, once you understand these two concepts (`lastOp` for postponing operator application, and `lastValue` for tree building), you understand the core of the Shunting Yard algorithm.