Announcing SuperPack

Shape Security is proud to announce the release of SuperPack, a language-agnostic schemaless binary data serialisation format.

First of all, what does it mean to be schemaless?

Data serialisation formats like JSON or MessagePack encode values in such a way that the structure of those values (their schema) can be determined simply by inspecting the encoded value. Formats like these, SuperPack included, are said to be “schemaless”.

In contrast, a schema-driven serialisation format such as Protocol Buffers makes use of ahead-of-time knowledge of the schema to pack the encoded values into one extremely efficient byte sequence free of any schema markers. Schema-driven encodings have some obvious downsides. The schema must remain fixed (ignoring versioning), and if the encoding party is not also the decoding party, the schema must be shared between them and kept in sync.

Choose the right tool for the job. Usually, it is better to choose a schema-driven format if it is both possible and convenient. For other occasions, we have a variety of schemaless encodings.

What separates it from the others?

In short, SuperPack payloads are very compact without losing the ability to represent any type of data you desire.

Extensibility

The major differentiator between SuperPack and JSON or bencode is that SuperPack is extensible. Almost everyone has had to deal with JSON and its very limited set of data types. When you try to JSON serialise a JS undefined value, a regular expression, a date, a typed array, or any of countless more exotic data types, your JSON encoder will either throw an error or produce an encoding that will not decode back to the input value. You will never have that problem with SuperPack.

SuperPack doesn’t have a very rich set of built-in data types. Instead, it is extensible. Say we wanted to encode/decode (aka transcode) regular expressions, a data type that is not natively supported by SuperPack. This is all you have to do:

SuperPackTranscoder.extend(
  // extension point: 0 through 127
  0,
  // detect values which require this custom serialisation
  x => x instanceof RegExp,
  // serialiser: return an intermediate value which will be encoded instead
  r => [r.source, r.flags],
  // deserialiser: from the intermediate value, reconstruct the original value
  ([pattern, flags]) => RegExp(pattern, flags),
);

And if we want to transcode TypedArrays:

SuperPackTranscoder.extend(
  1,
  ArrayBuffer.isView,
  a => [a[Symbol.toStringTag], a.buffer],
  ([ctor, buffer]) => new self[ctor](buffer),
);

Compactness

The philosophy behind SuperPack is that, even if you cannot predict your data’s schema in advance, the data likely has structures or values that are repeated many times in a single payload. Also, some values are just very common and should have efficient representations.

Numbers between -15 and 63 (inclusive) are a single byte; so are booleans, null, undefined, empty arrays, empty maps, and empty strings. Strings which don’t contain a null (\0) character can avoid storing their length by using a C-style null terminator. Boolean-valued arrays and maps use a single bit per value.

When an encoder sees multiple strings with the same value, it will store them in a lookup table, and each reference will only be an additional two bytes. Note that this string deduplication optimisation could have been taken further to allow deduplication of arbitrary structures, but that would allow encoders to create circular references, which is something we’d like to avoid.

When an encoder sees multiple maps with the same set of keys, it can make an optional optimisation that is reminiscent of the schema-directed encoding approach but with the schema included in the payload. Instead of storing the key names once for each map, it can use what we call a “repeated keyset optimisation” to refer back to the object shape and encode its values as a super-efficient contiguous byte sequence.
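
As a rough illustration (this is a conceptual view only, not SuperPack’s actual wire format), a payload containing many maps with the same shape only needs to describe that shape once:

// Conceptual illustration only, not the actual SuperPack byte layout.
var input = [
  { name: 'ada', age: 36 },
  { name: 'grace', age: 85 },
];
// With the repeated keyset optimisation, the keys are recorded once and each
// map contributes only its values:
var conceptualLayout = {
  keysets: [['name', 'age']],
  values: [['ada', 36], ['grace', 85]],
};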

The downside of this compactness is that, unlike JSON, YAML, or edn, SuperPack payloads are not human-readable.

Conclusion

After surveying existing data serialisation formats, we knew we could design one that would be better suited to our particular use case. And our use case is not so rare as to make SuperPack only useful to us; it is very much a general purpose serialisation format. If you want to create very small payloads for arbitrary data of an unknown schema in an environment without access to a lossless data compression algorithm, SuperPack is for you. If you want to see a more direct comparison to similar formats, see the comparison table in the specification.

I’m sold. How do I use it?

As of now, we have an open-source JavaScript implementation of SuperPack.

$ npm install --save superpack
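
From there, usage looks roughly like the sketch below; the encode and decode entry points shown here are assumptions for illustration rather than documented API names, so check the package’s README for the real ones.

// Hypothetical usage sketch; the exact exported names may differ.
var SuperPackTranscoder = require('superpack');

var payload = SuperPackTranscoder.encode({ greeting: 'hello', tags: ['a', 'a', 'a'] });
var roundTripped = SuperPackTranscoder.decode(payload);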

Announcing Bandolier

Today Shape Security is releasing Bandolier, a Java library that bundles JavaScript written with ES2015 module syntax.

Bandolier takes JavaScript code like this:

import { b } from './foo.js'
console.log(42 + b);

where the foo module is defined as:

// foo.js
export var b = 100;

and produces a single script without ES2015 module syntax that can run in a JavaScript environment that does not yet support import/export:

(function(global) {
  "use strict";

  function require(file, parentModule) {
    // eliding the definition of require
    // ...
  }

  require.define("1", function(module, exports, __dirname, __filename) {
    var __resolver = require("2", module);
    var b = __resolver["b"];
    console.log(42 + b);
  });
  require.define("2", function(module, exports, __dirname, __filename) {
    var b = 100;
    exports["b"] = b;
  });
  return require("1");
}.call(this, this));

Bandolier is a good example of a non-trivial project built using the Shift AST: it essentially takes a set of Module ASTs that contain import and export declarations and merges them appropriately into a single Script AST.

Bandolier works by first parsing the given JavaScript file into a Module AST using the Shift Java parser. It then transforms the AST by resolving each import declaration’s module specifier (e.g. converting import foo from "some/module" to import foo from "/full/path/to/some/module"). Once all the imports are resolved, each imported module is recursively loaded and stored in memory.

Finally, the bundled script is created by generating the module loading boilerplate (the function wrapper and the require function); each loaded module is then transformed by converting import declarations into require calls and export declarations into assignments to the exports object.

One particularly useful feature of Bandolier is that both the resolving and loading phases are pluggable. Bandolier comes with a few choices built in, including:

  • a FileSystemResolver that just normalizes relative paths
  • a NodeResolver that follows the node require.resolve algorithm
  • a FileLoader for loading resources from the file system
  • a ClassResourceLoader for loading resources inside a JAR.

Writing your own custom loader or resolver is as simple as implementing the IResolver and IResourceLoader interfaces.

Note that Bandolier is not a full transpiler like babel; it only transforms import and export statements. That said, the Shift parser fully supports ES2015 so you can, for example, use ES2015 classes and the bundled output will work in any JavaScript environment that supports classes (e.g. recent versions of node).

Also note that Bandolier only bundles ES2015 modules, so if you need to do something more complex, like bundling CommonJS modules, you will probably be happier with something like browserify, CommonJS Everywhere, or webpack.

What sets Bandolier apart from similar projects, and why we built it at Shape, is that it allows you to easily integrate JavaScript bundling into a Java application. We use it to dynamically generate and bundle our JavaScript resources on the fly inside a Java server. So, if you have similar needs (or are just interested in how to use the Shift AST), check out the project on GitHub.

Salvation is Coming (to CSP)

CSP (Content Security Policy) is a W3C candidate recommendation for a policy language that can be used to declare content restrictions for web resources, commonly delivered through the Content-Security-Policy header. Serving a CSP policy helps to prevent exploitation of cross-site scripting (XSS) and other related vulnerabilities. CSP has wide browser support according to caniuse.com.

[Browser support tables from caniuse.com: Content Security Policy 1.0 and Content Security Policy Level 2]

There’s no downside to starting to use CSP today. Browsers that do not recognise the header, or that encounter future additions to the specification, will safely ignore what they do not understand, retaining the current website behaviour. Policies that use deprecated features will also continue to work, as the standard is being developed in a backward-compatible way. Unfortunately, our results from scanning the Alexa top 50K websites for CSP headers align with other reports, which show that only major web properties like Twitter, Dropbox, and GitHub have adopted CSP. Smaller properties have not been as quick to do so, despite how relatively little effort is needed for a potentially significant security benefit. We would be happy to see CSP adoption grow among smaller websites.

Writing correct content security policies is not always straightforward, and mistakes make it into production. Browsers will not always tell you that you’ve made a typo in your policy. This can provide a false sense of security.
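
For example, the following (hypothetical) policy looks reasonable at a glance, but the misspelled directive is silently ignored by browsers:

Content-Security-Policy: default-src 'self'; scrpt-src 'none'

Because scrpt-src is not a recognised directive, scripts fall back to default-src 'self', so same-origin scripts remain allowed even though the author intended to block scripts entirely.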

Announcing Salvation

Today, Shape Security is releasing Salvation, a FOSS general purpose Java library for working with Content Security Policy. Salvation can help with:

  • parsing CSP policies into an easy-to-use representation
  • answering questions about what a CSP policy allows or restricts
  • warning about nonsensical CSP policies and deprecated or nonstandard features
  • safely creating, manipulating, and merging CSP policies
  • rendering and optimising CSP policies

We created Salvation with the goal of making it the easiest and most reliable standalone tool available for managing CSP policies. Using this library, you will not have to worry about the tricky cases you might encounter when manipulating CSP policies. Working on this project helped us to identify several bugs in both the CSP specification and its implementations in browsers.

Try It Out In Your Browser

We have also released cspvalidator.org, which exposes a subset of Salvation’s features through a web interface. You can validate and inspect policies found on a public web page or provided as text input. Additionally, you can try merging CSP policies using one of the following two strategies:

  • Intersection combines policies in such a way that the result will behave similarly to how browsers enforce each policy individually. To better understand how it works, try to intersect default-src a b with default-src; script-src *; style-src c.
  • Union is useful when crafting a policy: start with a restrictive policy and allow each resource that is needed. See how union merging is not simply concatenation by merging script-src * with script-src a in the validator.

Contribute

You can check out the source code for Salvation on GitHub or start using it today by adding a dependency from Maven Central. We welcome contributions to this open source project.

Two-Phase Parsing in the Shift JavaScript Parser

Today, we merged the two-phase-parsing branch of the Shift Parser. This branch was an experiment with a significant change to the parser’s architecture. To understand it, we first need to understand ECMAScript early errors and how they are typically handled.

Early Errors

The ECMAScript language is defined using a formal grammar, specifically a context-free grammar. A context-free grammar consists of a number of productions, each associating a symbol called a nonterminal with a (possibly empty) sequence of nonterminal and terminal symbols. Section 5.1 of the ECMAScript 6 specification explains the meaning of its grammars in great detail, but I will summarise it below.

There are two ECMAScript grammars: the lexical grammar, given in section 11, and the syntactic grammar, given in sections 11 through 15. The terminal symbols for the lexical grammar are Unicode code points (the text of your program), while the terminal symbols for the syntactic grammar are tokens, sequences of Unicode code points in the language defined by the lexical grammar.

When a program is parsed as a Script or Module (the two goal symbols of the syntactic grammar), it is first converted to a stream of tokens by repeatedly applying the lexical grammar to the remainder of the program. The token stream only represents an ECMAScript program if it can be parsed by a single application of the syntactic grammar with no tokens left over.
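
For example, a tiny program like var answer = 42; is first broken into tokens before the syntactic grammar ever sees it:

var answer = 42;
// lexical grammar output (roughly): var (keyword), answer (identifier),
// = (punctuator), 42 (numeric literal), ; (punctuator)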

An early error is an additional condition that must hold when a grammar production is matched. From section 5.3 of the ECMAScript 6 specification,

A conforming implementation must, prior to the first evaluation of a Script, validate all of the early error rules of the productions used to parse that Script. If any of the early error rules are violated the Script is invalid and cannot be evaluated.

Let’s take a look at an early error definition from the specification.

12.5.1 Static Semantics: Early Errors

UnaryExpression :
  ++ UnaryExpression
  -- UnaryExpression

It is an early Reference Error if IsValidSimpleAssignmentTarget of UnaryExpression is false.

This early error prevents constructions like ++0 because the named static semantic rule IsValidSimpleAssignmentTarget is true only for identifiers (a), static member access (a.b), computed member access (a[b]), and any of the previous productions enclosed in parentheses ((a)). Notably absent is Literal, the production that matches 0.

The final version of the ECMAScript 6 specification has over one hundred early errors. I know this because I had to write tests for every single one of them. It took a while.

Typical Early Error Handling

Not all early errors are as simple as the one above, where you can immediately know of the existence of an early error. For example, an object literal must not have more than one __proto__ data property. A label cannot be contained within another label with the same name (without crossing function boundaries). As you can imagine, tracking all of the information required to make these decisions can require a lot of complexity. Typical parsers like esprima and acorn use the following strategies:

Lookup Tables

Esprima has an object called state with a field named labelSet. Label names are added to this object as they are encountered, and if a label already exists, an early error is thrown. Because there is only one such state object per parser instance, a stack of labelSet objects must be maintained (in this case, via the call stack) so that a fresh labelSet can be created as each function is entered.
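
A minimal, self-contained sketch of that strategy might look like the following (hypothetical code, not esprima’s actual implementation):

// Duplicate labels within the same function throw an early error immediately.
function declareLabel(labelSet, name) {
  if (labelSet.has(name)) {
    throw new SyntaxError("Label '" + name + "' has already been declared");
  }
  labelSet.add(name);
}

var outerLabels = new Set();
declareLabel(outerLabels, 'loop');
var innerLabels = new Set();        // a fresh labelSet for a nested function
declareLabel(innerLabels, 'loop');  // fine: labels do not cross function boundaries
declareLabel(outerLabels, 'loop');  // throws: duplicate label in the same function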

Wrapped Return Values

When parsing function parameters, some early errors must be collected and saved for later, in case it is determined that the parameters are in strict mode and violate a strict-mode-only early error. Remember that a function with a "use strict" directive in its body is strict mode code – including its parameters. Because of this, the function that parses a parameter must wrap all of its return values in a wrapper object that contains both the parameter node and any early errors that would be produced in a strict mode context.
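
A rough, self-contained sketch of that approach (with hypothetical names, not esprima’s actual code):

// The parameter parser returns both the node and any errors that only apply
// if the surrounding function later turns out to be strict mode code.
function parseParam(name) {
  var strictErrors = [];
  if (name === 'eval' || name === 'arguments') {
    strictErrors.push("Parameter name '" + name + "' is restricted in strict mode");
  }
  return { node: { type: 'Identifier', name: name }, strictErrors: strictErrors };
}

var wrapped = parseParam('arguments');
// ... the body is parsed and a "use strict" directive is found, so only now:
if (wrapped.strictErrors.length > 0) {
  throw new SyntaxError(wrapped.strictErrors[0]);
}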

Out Parameters

To avoid wrapper objects in some esprima parsing functions, an object like { value: false } will be passed to the function as an argument. The parsing function will then be able to mutate its parameter as well as return a value, essentially returning more than one value. This strategy is used when the same object can be re-used multiple times, such as the indicator that a data property with the name __proto__ has been parsed by the object property parsing function (the mutation hasProto.value = true is idempotent).
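
And a self-contained sketch of the out-parameter strategy (again with hypothetical names):

// The caller passes a mutable flag object; the callee both returns a value and
// records, via the out parameter, that a __proto__ data property was seen.
function noteDataProperty(name, hasProto) {
  if (name === '__proto__') {
    if (hasProto.value) {
      throw new SyntaxError('Duplicate __proto__ data property');
    }
    hasProto.value = true; // idempotent mutation, safe to repeat
  }
  return { name: name };
}

var hasProto = { value: false };
noteDataProperty('a', hasProto);
noteDataProperty('__proto__', hasProto);
noteDataProperty('__proto__', hasProto); // throws: duplicate __proto__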

Two-Phase Parsing

We initially implemented our Shift parser in the same way as these others, but it ended up just as unruly and difficult to maintain. So we decided to remove all of the early error handling from the parser, and add an optional second pass that would detect the early errors. This means that the first pass would produce an AST for any input that passed the syntactic grammar. There were some risks with this: we needed to make sure we preserved enough information in the AST to determine if there should be an early error, and we needed to make sure the AST could represent all grammatically correct programs that would otherwise be invalid due to early errors.

The first step was to separate the failure tests that failed due to early errors into their own test suite. We then ensured that we had failure tests for each and every early error in the specification. Next, we used our new Shift reducer 2.0 to reduce the AST in a single pass into a list of early errors. The reduction state object needed to track less information than we had initially thought it would. It tracks label names, break/continue statements, labeled break/continue statements, new.target expressions, names in binding position, lexical declarations, var declarations, var declarations in for-of heads, function declarations, exported names, exported bindings, super calls, and super member accesses. All early errors relate to one or more of those pieces of information and some context.

There weren’t really many difficulties with this approach. There were a few cases that we labeled as “early grammar errors”, which were grammatically correct productions that our AST could not represent: 0 = 0 because AssignmentExpression requires a Binding on its left-hand side, (...a) because ArrowExpression must have a body, etc. Additionally, we labeled some early errors as “early tokenisation errors”, including a\u0000, an identifier with a Unicode escape sequence whose value is disallowed by an early error.

What We’ve Gained

So was all this trouble worth it? We certainly think it was! First off, the parser is much cleaner, easier to read, and easier to maintain. But if we wanted to, we wouldn’t even have to maintain the parser. Now that there are no special grammar exceptions, we can replace the entire hand-written parser with one generated by a parser generator. This should be even more easily maintainable and more performant.

Speaking of performance, the parser with early errors disabled is over twice as fast as it was before. In fact, it is faster than any other parser we tested. For our business use, we will almost always want to run in this “loose” mode – we don’t have to worry about being very strict with the inputs we accept. One can now choose between extremely correct and extremely fast parsing (but even in fast mode, you will never fail to parse a valid program).

Error reporting has also improved with the two-phase parser, since the separate early error checker will collect and report all problems due to early errors in a syntactically valid program instead of just reporting the first one. Finally, this opens up the possibility for the first phase to be extensible, allowing one to add support for JSX, macros, etc. through plugins instead of forks!

Summary

We took a pretty big risk with an experimental new parsing strategy, and it has paid off. We’re pretty excited here at Shape about the new possibilities this has opened for us, and will be working with the maintainers of other popular parsers to popularise this approach.

Reducing with the Shift Reducer

What is a Reducer?

A reducer takes something large and turns it into something smaller. In programming, it is a construct that recursively applies a function over a data structure in order to produce a single value.

In JavaScript, you could reduce an array of integers to a single sum with the following code.

var integers = [1, 2, 3, 4, 5, 6];
var sum = integers.reduce(function(memo, next){ return memo + next; }, 0);
// sum === 21

Shift Reducer

Shape Security has provided a reducer to use in building tooling for the Shift format AST. The reducer folds a Shift format AST into a summary value, much like Array.prototype.reduce folds an array. Of course, reducing an array is much less complex than reducing an AST. Only one function is required to reduce an array, while reducing an AST requires one function for each different type of node.

Shape’s reducer library exposes a single function that runs the reduction, and two base structures that are meant to be extended with your own reducing behaviors: Reducer and MonoidalReducer.

Reducer

Use Reducer when the reduction requires unique behavior for each different type of node. It is a clean slate. Extending Reducer requires every single reduction method (reduceXXX) to be overridden. Code generation or AST serialisation are examples of when it is appropriate to base your reducer on Reducer.

MonoidalReducer

Most tools built on the Shift AST will benefit from basing their reducer on MonoidalReducer. Extending MonoidalReducer requires that the summary value returned by each reduction method is a Monoid. Its default implementations of the reduction methods take advantage of the monoidal structure of your summary value, so you only need to override the reduction methods for the nodes you care about. For all others, the Monoid’s identity will be used.

That may have been a lot to take in. Don’t worry if you’re not familiar with the terminology! As a programmer, you likely run into Monoids every single day, but the term can cause confusion. Let’s see if we can clear up the term a little bit.

Monoids

A monoid is a structure that relates the elements of a set with a closed, associative, binary operation that we will call append, coupled with one special element of that set that we will call the identity element.

Let’s break monoids down a little further.

What is a binary operation?

A binary operation is an operation that operates on two inputs.

0 + 1; // + is a binary operator
function append(a, b){} // append is a binary function

What is a closed operation?

An operation is closed if performing the operation on members of a set always produces a member of the same set.
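
For example:

1 + 2;   // closed: adding two integers always produces another integer
7 / 2;   // not closed over the integers: the result, 3.5, is not an integer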

What is associativity?

We learned the concept of associativity back in elementary school and it is core to our understanding of algebraic operations. Associativity, as the name implies, means that grouping of operations does not affect the result.

(a + b) + c === a + (b + c)

Remember, associativity is not commutativity. That would mean that the order of the values given to the operation does not affect the result.

a + b + c === c + b + a

What is an identity?

An identity is a value for a specific operation that, when passed to that operation along with any other value, returns the other value. You may remember this via the additive identity, 0, or the multiplicative identity, 1.

x + 0 === x;
0 + x === x;

x * 1 === x;
1 * x === x;

append(x, identity) === x;
append(identity, x) === x;

Putting it all together

Using the examples above, we can write out the Sum Monoid using arithmetic notation.

sumIdentity = 0
sumAppend(x, y) = x + y

Or we could write a JavaScript implementation of the Sum Monoid. For this, we will use the conventional Fantasy Land names: empty and concat.

//es6
class Sum {
  constructor(number) {
    this.value = number;
  }
  // always return the identity element
  static empty() {
    return new Sum(0);
  }
  // the binary operation acts on its `this` value and its parameter
  concat(other) {
    return new Sum(this.value + other.value);
  }
}

new Sum(5).concat(new Sum(2)).value; // 7

Walkthrough: Making something with the MonoidalReducer

Now that we understand monoids, let’s walk through making a small program with the Shift MonoidalReducer that counts how many identifiers are in a program.

Setup

Install dependencies.

$ npm install --save shift-reducer shift-parser 6to5

Making an Identifier counter

First we need to flesh out our basic program.

//es6
import parse from "shift-parser";
import reduce, {MonoidalReducer} from "shift-reducer";

// a monoid over integers and addition
class Sum {
  constructor(number) {
    this.value = number;
  }
  // by default reduce any node to the identity, zero
  static empty() {
    return new Sum(0);
  }
  // combine Sum instances by summing their values
  concat(other) {
    return new Sum(this.value + other.value);
  }
}

class IdentifierCounter extends MonoidalReducer {
  constructor() {
    // let MonoidalReducer know that we're going to use Sum as our monoid
    super(Sum);
  }

  // a convenience function for performing the reduction and extracting a result
  static count(program) {
    return reduce(new this, program).value;
  }

  // add 1 to the count for each IdentifierExpression node
  reduceIdentifierExpression(node) {
    return new Sum(1);
  }

  /*
    In this case, the only node we care about overriding is the
    IdentifierExpression node; the rest can be reduced using the default
    methods from MonoidalReducer.
  */
}

// test program code
var program = "function f() { hello(world); }";
console.dir(IdentifierCounter.count(parse(program)));

Run it!

$ node_modules/.bin/6to5-node count-identifiers.js

Wrapping Up

Let’s walk through what’s been done. We’ve created a new reducer by extending MonoidalReducer, overridden the necessary reduction methods (in this case only reduceIdentifierExpression), parsed a program, and run our new reducer over it.

We wrote this example in ES6 because we believe it’s clearer. An ES5 version of the identifier counter is available in this gist.

Taking it Further

At this point, we’ve used a fairly trivial example in order to expose the fundamentals of using the MonoidalReducer. Next, we will look at the design of a more significant project that makes use of the MonoidalReducer: the Shift Validator.

Shift Validator

The Shift Validator validates a Shift format AST from the bottom up. But how does it do this when many of the restrictions it enforces are context-sensitive? The ValidationContext object that the Validator uses allows possible errors to be registered on it, and, if we later determine (with new information about the possible error’s context) that the error does not actually apply, it can clear those possible errors as well. Only when we are certain an error will not be cleared do we move it from its temporary error list to the official errors list in the ValidationContext object. Let’s look at a concrete example:

When the Validator reduces a ReturnStatement, we call the addFreeReturnStatement helper method of our ValidationContext state object, giving it an error that this ReturnStatement must be contained within a function (top-level return is illegal in JavaScript). We don’t know whether this ReturnStatement is actually in an illegal position, but we assume it is until we better understand its context. In the reduction methods for FunctionDeclaration, FunctionExpression, Getter, and Setter nodes, we then call the clearFreeReturnStatements helper method of our ValidationContext state object, clearing out all of the ReturnStatement errors we collected while reducing ReturnStatement nodes below us in the AST. Finally, when we reduce a Script node (the head of the AST), we move the ReturnStatement errors from their temporary holding list to the confirmed errors list using the enforceFreeReturnStatementErrors helper method. We do this at this point because we know we won’t be reducing any more functions that could cancel out a ReturnStatement error.
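
A stripped-down sketch of that pattern (illustrative only, not the actual Shift Validator source) could look like this:

// Possible errors are parked until the surrounding context confirms or clears them.
class ValidationContextSketch {
  constructor() {
    this.freeReturnStatements = []; // possible errors
    this.errors = [];               // confirmed errors
  }
  addFreeReturnStatement(error) {      // called when reducing a ReturnStatement
    this.freeReturnStatements.push(error);
  }
  clearFreeReturnStatements() {        // called when reducing a function or accessor
    this.freeReturnStatements = [];
  }
  enforceFreeReturnStatementErrors() { // called when reducing the Script node
    this.errors = this.errors.concat(this.freeReturnStatements);
    this.freeReturnStatements = [];
  }
}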

Final Round Up

To pull it all together, we’ve gone over the Shift Reducer and MonoidalReducer. Both can be used to build tooling based on the Shift AST. We’ve covered the fundamentals behind the MonoidalReducer and explored both a simple MonoidalReducer example and a more complex one, the Shift Validator. Hopefully, you now feel comfortable building your own tools based on Shift’s AST.

A Technical Comparison of the Shift and SpiderMonkey AST Formats

Since publishing our announcement of the Shift AST specification, many developers have asked for more details about how the Shift AST format compares to the SpiderMonkey AST format. We should first enumerate what we consider to be the properties of a good AST format.

A good AST format…

  • minimizes the number of inhabitants that do not represent a program.
  • is at least partially homogeneous to allow for a simple and efficient visitor.
  • does not impede moving, copying, or replacing subtrees.
  • discourages duplication in code that operates on it.

Improvements

The following is a list of differences that we consider improvements over the SpiderMonkey AST format.

  • The top-level node returned from any successful parse is named Script, not Program, to ease upgrade to ECMAScript 6. ECMAScript 6 parsers need two modes: one mode that produces a Script and one mode that produces a Module. Modules allow import/export declarations at the top level and are always in strict mode.
  • Functions (including getters/setters) represent their body using a FunctionBody, not a BlockStatement, to support directives and because a function’s body is neither a generic statement position nor a block.
  • Script contains a FunctionBody, not a [Statement], to support top-level directives and for uniform handling of the shared FunctionBody structure.
  • The concepts of BlockStatement and Block have been separated. A BlockStatement contains a Block, not a [Statement]. A Block contains a [Statement]. This is better for transformation: a BlockStatement may be replaced by any other Statement, but a Block must be replaced only by another Block. Block is also used to represent the body and finalizer of a TryFinallyStatement and body of a CatchClause (all of which cannot be arbitrary statements).
  • Similarly, the concepts of VariableDeclarationStatement and VariableDeclaration have been separated. A VariableDeclaration is used within for and for-in statements (both of which cannot contain arbitrary statements in that position).
  • The VariableDeclaration declarators list is required to be non-empty.
  • The concepts of IdentifierExpression and Identifier have been separated. An IdentifierExpression contains an Identifier in expression position. Identifiers are also used for function names, break labels, and static member access property names.
  • MemberExpression has been separated into StaticMemberExpression and ComputedMemberExpression so that the computed flag and the type of property cannot conflict.
  • SwitchStatementWithDefault has been separated out of SwitchStatement to guarantee that a SwitchStatement has no more than one default clause.
  • TryStatement has been separated into TryFinallyStatement (try/catch/finally and try/finally) and TryCatchStatement (try/catch) to disallow a TryStatement with no handler and no finalizer.
  • SequenceExpression and LogicalExpression are just BinaryExpressions. AssignmentExpression remains separate in preparation for ECMAScript 6, where its left operand will need to be a Binding.
  • Separated Literal into LiteralBooleanExpression, LiteralNullExpression, LiteralNumericExpression, LiteralRegExpExpression, and LiteralStringExpression (see the example after this list). The SpiderMonkey Literal node is overloaded to the point that it is not used anywhere without qualifying that only a subset of its values may be used.
  • LiteralRegExpExpression is represented by a string, not a RegExp. This allows for JSON serialization and a simpler equivalence definition.
  • Property has been separated into Getter, Setter, and DataProperty, each of which has a PropertyName. PropertyName has a kind (“identifier”, “string”, or “number”) and a string value. FunctionExpressions are much too permissive to represent getters/setters, and the Property kind could conflict with the value.
  • Added UseStrictDirective and UnknownDirective nodes to represent directives. These nodes will be replaced with a single Directive node in the future.
  • Removed support for SpiderMonkey-specific language extensions (expression closures, multiple catch clauses, for-each-in, etc.) other than block-scoped declarations.
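
To make the Literal split above concrete, here is how the expression 0 might be represented in each format (ancillary fields such as source locations elided):

// SpiderMonkey format: a single overloaded Literal node
{ "type": "Literal", "value": 0 }

// Shift format: a dedicated node type per kind of literal
{ "type": "LiteralNumericExpression", "value": 0 }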

Insignificant Differences

The following is a list of differences from the SpiderMonkey AST format that we consider insignificant.

  • SourceLocation format. Source position information was not originally part of the Shift specification because it was not important for any of Shape Security’s usages. Support for source position tracking was only recently added with this experimental interface. If a use case for tracking end position without source content is identified, that information may be added to SourceLocation.
  • Names. We tried to be internally consistent with names like binding, value, and body. We made no effort to carry over SpiderMonkey naming conventions.
  • UpdateExpression and UnaryExpression are replaced by PrefixExpression and PostfixExpression. Ignoring the fact that the prefix flag on SpiderMonkey’s UnaryExpression is unnecessary, there are pros and cons to each way this set of operations is grouped. For example, during scope analysis, it is easier to group the increment/decrement operators together to generate write references, but during code generation it is easier to group the prefix/postfix operators separately.

Deviations From ECMAScript 5

The following is a list of intentional supported extensions to ECMAScript 5.

  • VariableDeclarationKind contains let and const, which should only be allowed in ECMAScript 6, but popular implementations had widespread support for these declaration kinds long before they had support for any other ECMAScript 6 feature. Because of this, many people consider them to be an unofficial extension to ECMAScript 5.
  • Similarly, FunctionDeclarations in arbitrary Statement position were allowed by many ECMAScript 5 implementations (with varying semantics), so the Declaration interface was removed, and FunctionDeclaration was moved to Statement.

Remaining Problems

The following is a list of restrictions that must be checked in addition to structural correctness to ensure that a Shift AST represents an ECMAScript program. Ideally, this list would be as small as possible, but because of the context sensitivity inherent in the design of the ECMAScript language, these additional restrictions are either impossible or infeasible to enforce in the AST structure. The reason we want this list to be small is that each program operating on a Shift format AST needs either to be aware of all of these restrictions and handle them gracefully, or to require consumers to guarantee that the input AST is valid.

Luckily for developers, at the time of the initial Shift specification announcement we made available the Shift validator, which both validates a given Shift format AST and enumerates its validation errors. This makes it very easy to ensure that a Shift format AST does not include any of the problems listed below, and to debug those problems when they are detected.

  • BreakStatement without a label must be nested within an IterationStatement, SwitchCase, or SwitchDefault. BreakStatement with a label must be nested within a LabeledStatement with an equivalent label (without crossing a function boundary).
  • ContinueStatement without a label must be nested within an IterationStatement. ContinueStatement with a label must be nested within an IterationStatement that is labeled with an equivalent label (without crossing a function boundary).
  • LiteralRegExpExpression value must represent a valid RegExp.
  • Identifier name must always be an IdentifierName, and must not be a reserved word in any position other than a StaticMemberExpression property.
  • An IfStatement with alternate must not have another IfStatement without alternate nested within its consequent in a way that does not represent a valid program. See isProblematicIfStatement in estools/esutils for more details.
  • LabeledStatement must not be nested within a LabeledStatement with the same label name.
  • LiteralNumericExpression value must be non-negative, finite, and non-NaN.
  • ObjectExpression cannot contain data/getter or data/setter properties with the same name.
  • ObjectExpression cannot contain more than one data property with the name `__proto__`.
  • A PropertyName with kind of “number” can have a non-numeric value, and a PropertyName with a kind of “identifier” can have a non-IdentifierName value. It is possible this may one day be fixed.
  • ReturnStatement must be nested within a FunctionExpression, FunctionDeclaration, Getter, or Setter.
  • VariableDeclaration as ForInStatement left must have exactly one declarator. This can (and likely will) be fixed.
  • In strict mode (in other words, nested within a FunctionBody that has a UseStrictDirective), function names, function parameters, catch bindings, setter parameters, prefix/postfix increment/decrement operands, assignment bindings, and variable declaration bindings must not be restricted words (arguments or eval).
  • In strict mode, function parameters must be unique within their containing parameter list.
  • In strict mode, Identifier name must not be a future reserved word in expression position.
  • In strict mode, a PrefixExpression must not have both a delete operator and an Identifier operand.
  • In strict mode, WithStatement is disallowed.

Hopefully this has cleared up the questions you had about the Shift AST. If you think of anything we haven’t, or if you have additional questions, leave a comment below so that others may benefit from the discussion.

Edit: A previous version of this post did not distinguish BreakStatement/ContinueStatement nodes with a label from those without.

Announcing the Shift JavaScript AST Specification

In time for the holidays, we are happy to release Shape Security’s first open source contributions: a new JavaScript AST specification named Shift, and a suite of tools to help you get started working with it.

What is an AST?

An Abstract Syntax Tree is simply a tree representation of a program’s source code. The nodes in an AST represent individual aspects of the language such as identifiers, statements, and literals. This structure is commonly the result of a successful parse of source code.
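
For example, the expression a + 1 parses to a tree whose root is a binary expression node with an identifier and a numeric literal as children. In the Shift format described below, that tree looks roughly like this (simplified; positional and ancillary fields omitted):

{
  type: "BinaryExpression",
  operator: "+",
  left: { type: "IdentifierExpression", identifier: { type: "Identifier", name: "a" } },
  right: { type: "LiteralNumericExpression", value: 1 }
}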

What can I do with it?

Having an easy to use data structure that represents a program’s source code allows you to write programs that treat code as they would any other piece of data. You can reliably generate new source, transform between languages, replace subtrees, analyze, lint, and auto-format code. ASTs are used by anything that needs to operate on code: IDEs, parsers, linters, analyzers, optimizers, compilers, and more. AST formats that are publicly standardized enable developers to centralize their efforts over a common structure, reducing duplicate work and allowing tools to be composed together.

This doesn’t exist already?

Mozilla exposed the SpiderMonkey Reflect.parse API in 2010 to encourage better tooling for JavaScript. This proved to be incredibly useful to the JavaScript community, enabling the creation of parsers like Esprima and Acorn and catalyzing a vast ecosystem of tools. Hundreds of projects rely upon these tools, including eslint, plato, istanbul, jscs, browserify, and many more.

However, the SpiderMonkey AST format was not specifically created for these tools. The SpiderMonkey AST originated as the internal representation of a JavaScript program in the SpiderMonkey engine, which was intended to be used only for interpretation. As tools were created and more use cases for a standard AST were recognized, many difficulties in dealing with SpiderMonkey format ASTs surfaced.

The SpiderMonkey AST and its ecosystem of tools and parsers are formidable, and we don’t take deviation lightly. Our work at Shape Security has presented us with many problems that involve deep analysis and transformation of JavaScript. We have been forced to rethink what it means to represent and transform a JavaScript program, and in doing so we developed this alternative AST format. The main advantages of the Shift AST format are that it makes it much more difficult to accidentally perform a transformation that creates an invalid AST, and that its nodes align more closely with the syntactic features they represent.

More than just the AST

An AST specification doesn’t have much value without a surrounding ecosystem. We’ve open-sourced JavaScript and Java implementations of the foundational tooling necessary to foster development of a supporting ecosystem around the Shift AST format. The following tools have been made available for both environments.

  • AST Node Constructors
  • Parser
  • Code Generator
  • Reducer
  • Validator
  • Scope Analyzer

In addition, we’ve released a tool for converting back and forth between the Shift and SpiderMonkey AST formats. All of these are available on the Shape Security GitHub account.

The road forward

We will continue to develop tooling based on the Shift AST format and will iterate on the existing libraries, optimize for performance, and add ECMAScript 6 support.

The Shift AST format was developed with ECMAScript 6 in mind. The es6 branches of both the specification and the JavaScript AST constructors already include full support for ECMAScript 6, and we plan to add support to all of the tooling we have released so far.

Contributors

Some of the developers behind the Shift AST format and associated tools are active contributors and maintainers of JavaScript language tools that are popular in the JavaScript community. Work on those tools is not ending, nor does the work here immediately affect any future plans for those tools.