Shunting Yard (Part 3)

Let’s continue adding to our Shunting Yard code base.

In Part 2, we created operator precedence. This allowed us to delay application of operators indefinitely, using a stack. Next, we’ll focus on adding parentheses, so that the end-user can manually override precedence.

Series: Part 1, Part 2, Part 3, …

Recursion

The secret sauce that will allow us to handle parentheses correctly is recursion.

And if you think about it, it makes sense. Imagine we have a function that can correctly parse an expression, to include parentheses, and returns the root node of the AST. Inside that function, if we were to come across a parenthesis, then we would want to parse that inner expression, and use the result (the root node of the inner expression’s AST) as a node in the larger AST.

There are two things we’ll need to keep in mind:

1. Our function must be able to parse an expression starting at the current location in the token stream. We can’t assume that we’re starting at the first token anymore, because when we call ourselves recursively, we’ll be in the middle of the expression.
2. Our function must exit when we run out of tokens (old behavior), but also when we see a `)`. Why? Because the function must exit when it is done reading an inner expression, which ends with `)`.

New Symbols

First, we would normally modify our lexer to correctly handle parentheses. Since this series is only focused on the parser, we’ll assume the lexer transforms a string like this:

``````2 * (3 + 4) - 10 / 2
``````

Into the following token stream:

``````var tokens = [
{ type: 'num', value:  2  },
{ type: 'sym', value: '*' },
{ type: 'sym', value: '(' },
{ type: 'num', value:  3  },
{ type: 'sym', value: '+' },
{ type: 'num', value:  4  },
{ type: 'sym', value: ')' },
{ type: 'sym', value: '-' },
{ type: 'num', value: 10  },
{ type: 'sym', value: '/' },
{ type: 'num', value:  2  }
];
``````

The Code

Here’s the code, which we’ll inspect in further detail below:

``````function ParenShuntingYard(tokens){
// helper function to get next token and validate its type
var tokenIndex = 0;
function NextToken(){
var args = [].slice.call(arguments);
if (tokenIndex >= tokens.length)
throw 'invalid expression: missing token';
var tok = tokens[tokenIndex];
tokenIndex++;
if (args.indexOf(tok.type) < 0)
throw 'invalid expression: expecting ' + args.join(' ');
}

// helper function to apply binary op based on symbol
function ApplyBinOp(symbol, left, right){
var sym_to_type = {
'-': 'sub',
'*': 'mul',
'/': 'div'
};
if (!sym_to_type[symbol])
throw 'invalid expression: unknown operator ' + symbol;
return {
type: sym_to_type[symbol],
parms: [left, right]
};
}

// determines if we apply left operator before right
function ApplyLeftFirst(leftSymbol, rightSymbol){
var sym_to_prec = { // map of symbol to precedence
'+': 0,
'-': 0,
'*': 1,
'/': 1
};
return sym_to_prec[leftSymbol] >= sym_to_prec[rightSymbol];
}

// parse expression at current location and return root node
function ParseExpr(){
var tok, nextValue;
var opStack = [];
var valStack = [];

function ApplyTopOp(){
var op = opStack.shift();
var rightValue = valStack.shift();
var leftValue = valStack.shift();
valStack.unshift(ApplyBinOp(op, leftValue, rightValue));
}

while (1){
// grab a terminal OR a left paren
tok = NextToken('num', 'sym');
if (tok.type == 'num'){
// create the terminal node
nextValue = {
type: 'val',
value: tok.value
};
}
else{ // tok.type is 'sym'
if (tok.value == '('){
// if we have left paren, call ourselves recursively
nextValue = ParseExpr();
// ensure we have a right paren
tok = NextToken('sym');
if (tok.value != ')')
throw 'missing right parenthesis';
}
else
throw 'invalid expression';
}

// save the value to be operated on
valStack.unshift(nextValue);

// if we're out of tokens, then exit
if (tokenIndex >= tokens.length)
break;
// if our next token is a right paren, then exit
if (tokens[tokenIndex].type == 'sym' &&
tokens[tokenIndex].value == ')')
break;
// otherwise, we must have a binary operator

// grab an operator
tok = NextToken('sym');

// make sure it isn't an open paren
if (tok.value == '(')
throw 'invalid expression';

// check to see if we have previous operations to apply
while (opStack.length > 0 &&
ApplyLeftFirst(opStack[0], tok.value))
ApplyTopOp();

// save the new operator to be applied later
opStack.unshift(tok.value);
}

// apply any left over operators
while (opStack.length > 0)
ApplyTopOp();

// all done, so return the top of the tree
return valStack[0];
}

// kick everything off
var ast = ParseExpr();

// make sure we consumed the entire token stream
if (tokenIndex < tokens.length)
throw 'invalid expression';

// return the parsed expression
return ast;
}
``````

This seems like a hairy beast at first, but if you look at our previous code, you’ll see there aren’t many changes.

Let’s get the small stuff out of the way:

NextToken

The `NextToken` function now allows a variable number of arguments to be passed to it. It validates that the token type is one of the arguments.

Don’t be scared. This line looks strange if you’ve never seen it:

``````var args = [].slice.call(arguments);
``````

This is nothing more than converting `arguments` into an array (called `args`). This is necessary for JavaScript, because the `arguments` variable behaves really goofy according to the specification, which is a design flaw that we’re stuck with. Oh well.

Other than that, we simply validate that `tok.type` is inside the `args` array, and return the token.

Kicking off ParseExpr

The next biggest change is that the main loop is now inside a function, `ParseExpr`. This is the piece that we will call recursively… but before we get to that, let’s jump to the end:

``````// kick off everything
var ast = ParseExpr();

// make sure we consumed the entire token stream
if (tokenIndex < tokens.length)
throw 'invalid expression';

// return the parsed expression
return ast;
``````

We parse the expression by simply calling our magic `ParseExpr` function, and storing the result in `ast`. Simple.

We also have an extra check to make sure that we’ve consumed the entire token stream. The reason for this is because `ParseExpr` will exit when there is a right parenthesis – so if a user has extra right parentheses, then `ParseExpr` will return before consuming everything. Without the check, `ParenShuntingYard` would succeed for expressions like `2 + 3) 1 2 3 4`, which would be bad.

ParseExpr Body

Now for the meat.

Calling Ourselves

An inner expression can exist anywhere a terminal can exist. Thankfully, there is only one location where we read terminals – at the top of the main loop. Therefore, the first major change is to accept numbers or symbols:

``````var tok = NextToken('num', 'sym');
``````

If `tok.type` is `'num'`, then we’ve received a terminal, and our previous code works the same:

``````if (tok.type == 'num'){
// create the terminal node
nextValue = {
type: 'val',
value: tok.value
};
}
...
``````

However, if `tok.type` is `'sym'`, and `tok.value` is `'('`, then we have an inner expression.

``````...
else{ // tok.type is 'sym'
if (tok.value == '('){
// if we have left paren, call ourselves recursively
nextValue = ParseExpr();
// ensure we have a right paren
tok = NextToken('sym');
if (tok.value != ')')
throw 'missing right parenthesis';
}
else
throw 'invalid expression';
}
``````

Notice that this code expects `ParseExpr` to return before consuming the right parenthesis. That means this piece needs to consume and confirm it exists, and error otherwise.

At the end of this section of code, our node is stored in `nextValue`, whether from a terminal, or inner expression.

Exiting Correctly

Next, we exit the main loop if we are out of tokens – but now we also have to exit if we find a right parenthesis:

``````// if we're out of tokens, then exit
if (tokenIndex >= tokens.length)
break;
// if our next token is a right paren, then exit
if (tokens[tokenIndex].type == 'sym' &&
tokens[tokenIndex].value == ')')
break;
// otherwise, we must have a binary operator
``````

Remember that we need to exit if the next token is a right parenthesis, but we shouldn’t consume the token with `NextToken`, because the previous section of code is responsible for that. That simply means that we peek ahead using `tokens[tokenIndex]`, without increasing `tokenIndex`.

Checking for Errors

Lastly, because both operators and parenthesis share a token type of `'sym'`, we can no longer assume that `NextToken('sym')` is guaranteed to read an operator. We add some extra error checking to ensure we really read an operator:

``````// make sure it isn't an open paren
if (tok.value == '(')
throw 'invalid expression';
``````

We don’t need to check if it’s a right parenthesis, because we already did that while peeking above.

All Done!

And… that’s it! Not so bad, really.

Does it work?

``````var tokens = [
{ type: 'num', value:  2  },
{ type: 'sym', value: '*' },
{ type: 'sym', value: '(' },
{ type: 'num', value:  3  },
{ type: 'sym', value: '+' },
{ type: 'num', value:  4  },
{ type: 'sym', value: ')' },
{ type: 'sym', value: '-' },
{ type: 'num', value: 10  },
{ type: 'sym', value: '/' },
{ type: 'num', value:  2  }
];
ParenShuntingYard(tokens);
=> {
type: 'sub',
parms: [{
type: 'mul',
parms: [{
type: 'val',
value: 2
}, {
parms: [{
type: 'val',
value: 3
}, {
type: 'val',
value: 4
}
]
}
]
}, {
type: 'div',
parms: [{
type: 'val',
value: 10
}, {
type: 'val',
value: 2
}
]
}
]
}
``````

This is really great!

We are really close to having a fully featured expression parser. We can’t support unary operators yet, and ternary is a little ways off, but it won’t be long now. Stay tuned for the next article in the series!

posted 15 Nov 2014 by Sean
tags: shunting yard, javascript