Two-Phase Parsing in the Shift JavaScript Parser

Today, we merged the two-phase-parsing branch of the Shift Parser. This branch implemented an experiment with a significant change to the parser’s architecture. To understand it, we first need to understand ECMAScript early errors, and how they are typically handled.

Early Errors

The ECMAScript language is defined using a formal grammar, specifically a context-free grammar. A context-free grammar consists of a number of productions, each associating a symbol called a nonterminal with a (possibly empty) sequence of nonterminal and terminal symbols. Section 5.1 of the ECMAScript 6 specification explains the meaning of its grammars in great detail, but I will summarise it below.

There are two ECMAScript grammars: the lexical grammar, given in section 11, and the syntactic grammar, given in sections 11 through 15. The terminal symbols for the lexical grammar are Unicode code points (the text of your program), while the terminal symbols for the syntactic grammar are tokens, sequences of Unicode code points in the language defined by the lexical grammar.

When a program is parsed as a Script or Module (the two goal symbols of the syntactic grammar), it is first converted to a stream of tokens by repeatedly applying the lexical grammar to the remainder of the program. The token stream only represents an ECMAScript program if the token stream can be parsed by a single application of the syntactic grammar with no tokens left over.

An early error is an additional condition that must hold when a grammar production is matched. From section 5.3 of the ECMAScript 6 specification,

A conforming implementation must, prior to the first evaluation of a Script, validate all of the early error rules of the productions used to parse that Script. If any of the early error rules are violated the Script is invalid and cannot be evaluated.

Let’s take a look at an early error definition from the specification.

12.5.1 Static Semantics: Early Errors

UnaryExpression :
  ++ UnaryExpression
  -- UnaryExpression

It is an early Reference Error if IsValidSimpleAssignmentTarget of UnaryExpression is false.

This early error prevents constructions like ++0 because the named static semantic rule IsValidSimpleAssignmentTarget is true only for identifiers (a), static member access (a.b), computed member access (a[b]), and any of the previous productions enclosed in parentheses ((a)). Notably absent is Literal, the production that matches 0.

The final version of the ECMAScript 6 specification has over one hundred early errors. I know this because I had to write tests for every single one of them. It took a while.

Typical Early Error Handling

Not all early errors are as simple as the one above, where you can immediately know of the existence of an early error. For example, an object literal must not have more than one __proto__ data property. A label cannot be contained within another label with the same name (without crossing function boundaries). As you can imagine, tracking all of the information required to make these decisions can require a lot of complexity. Typical parsers like esprima and acorn use the following strategies:

Lookup Tables

Esprima has an object called state with a field named labelSet. Label names are added to this object when encountered, but if the label already exists, an early error is thrown. Because this state is unique only per parser instance, a stack of labelSet objects must be maintained (in this case, they are using the call stack) so that a fresh labelSet may be generated as each function is entered.

Wrapped Return Values

When parsing function parameters, some early errors must be collected and saved for later, in case it is determined that the parameters are in strict mode and violate a strict-mode-only early error. Remember that a function with a "use strict" directive in its body is strict mode code – including its parameters. Because of this, the function that parses a parameter must wrap all of its return values in a wrapper object that contains both the parameter node and any early errors that would be produced in a strict mode context.

Out Parameters

To avoid wrapper objects in some esprima parsing functions, an object like { value: false } will be passed to the function as an argument. The parsing function will then be able to mutate its parameter as well as return a value, essentially returning more than one value. This strategy is used when the same object can be re-used multiple times, such as the indicator that a data property with the name __proto__ has been parsed by the object property parsing function (the mutation hasProto.value = true is idempotent).

Two-Phase Parsing

We initially implemented our Shift parser in the same way as these others, but it ended up just as unruly and difficult to maintain. So we decided to remove all of the early error handling from the parser, and add an optional second pass that would detect the early errors. This means that the first pass would produce an AST for any input that passed the syntactic grammar. There were some risks with this: we needed to make sure we preserved enough information in the AST to determine if there should be an early error, and we needed to make sure the AST could represent all grammatically correct programs that would otherwise be invalid due to early errors.

The first step was to separate the failure tests that failed due to early errors into their own test suite. We then ensured that we had failure tests for each and every early error in the specification. We then used our new Shift reducer 2.0 to reduce the AST in a single pass into a list of early errors. The reduction state object needed fewer states than we had initially thought it would. It tracks label names, break/continue statements, labeled break/continue statements, new.target expressions, names in binding position, lexical declarations, vardeclarations, var declarations in for-of heads, function declarations, exported names, exported bindings, super calls, and super member accesses. All early errors relate to one or more of those pieces of information and some context.

There weren’t really many difficulties with this approach. There were a few cases that we labeled as “early grammar errors”, which were grammatically correct productions that our AST could not represent: 0 = 0because AssignmentExpression requires a Binding on its left-hand side, (...a) because ArrowExpression must have a body, etc. Additionally, we labeled some early errors as “early tokenisation errors”, including a\u0000, an identifier with a unicode escape sequence whose value is disallowed by an early error.

What We’ve Gained

So was all this trouble worth it? We certainly think it was! First off, the parser is much cleaner, easier to read, and easier to maintain. But if we wanted to, we wouldn’t even have to maintain the parser. Now that there’s no special grammar exceptions, we can replace the entire hand-written parser with one generated by a parser generator. This should be even more easily maintainable and more performant.

Speaking of performance, the parser with early errors disabled is over twice as fast as it was before. In fact, it is faster than any other parser we tested. For our business use, we will almost always want to run in this “loose” mode – we don’t have to worry about being very strict with the inputs we accept. Now one can choose between extremely correct or extremely fast parsing (but even in fast mode, you will never fail to parse a valid program).

Error reporting has also improved with the two-phase parser, since the separate early error checker will collect and report all problems due to early errors in a syntactically valid program instead of just reporting the first one. Finally, this opens up the possibility for the first phase to be extensible, allowing one to add support for JSX, macros, etc. through plugins instead of forks!

Summary

We took a pretty big risk with an experimental new parsing strategy, and it has paid off. We’re pretty excited here at Shape about the new possibilities this has opened for us, and will be working with the maintainers of other popular parsers to popularise this approach.

Author: Shape Security

Shape Security defends Global 2000 corporations from increasingly sophisticated automated cyber-attacks, including large-scale account takeover, credential stuffing, content scraping and content aggregation attacks on web and mobile applications. Shape has deflected over $1B in fraud losses for major retailers, financial institutions, airlines, and government agencies. Shape Security is headquartered in Silicon Valley and backed by Kleiner Perkins Caufield & Byers, Norwest Venture Partners, Venrock, Baseline Ventures, Google Ventures, and other prominent investors. Read our blog to get insights.