Abstract Syntax Tree Patterns

1.) Matching nodes

When dealing with AST's (Abstract Syntax Trees) one usually finds himself writing Visitor classes that walk the trees and have methods that look similar to the one below:
    

public Object visit(ASTCompositeReference node, Object data){
SimpleNode scope = this.getScope();

if(scope instanceof ASTProgram){
if(node.jjtGetNumChildren() == 3){
SimpleNode aNode = (SimpleNode) node.jjtGetChild(0);

if((aNode instanceof ASTIdentifier) && (((ASTIdentifier) aNode).getName().equals("CXClass"))){
aNode = (SimpleNode) node.jjtGetChild(1);

if(aNode instanceof ASTPropertyIdentifierReference){
aNode = (SimpleNode) aNode.jjtGetChild(0);

if((aNode instanceof ASTIdentifier)
&& ((((ASTIdentifier) aNode).getName().equals("declare")) ||
(((ASTIdentifier) aNode).getName().equals("declareInterface")))){
aNode = (SimpleNode) node.jjtGetChild(2);

if(aNode instanceof ASTFunctionCallParameters){
aNode = (SimpleNode) aNode.jjtGetChild(0);

if(aNode instanceof ASTLiteral){
String className = (String) ((ASTLiteral) aNode).getValue();

Comment comment = this.getComment((SimpleNode) node);
JavaClass javaClass = this.getJavaClass(className);

if(comment != null){
javaClass.setComment(comment);
}

javaClass.setOriginatingFile((JSFile) data);
}
}
}
}
}
}
}

Most of the code above deals with finding out if the node currently visited is interesting for data extraction and/or further manipulation. Let's explain exactly what it means when a node "is interesting". Nodes in an AST belong to one of the AST types generated by jjtree (the tree generating tool of javacc), they are inside a given scope, have or don't have children, have or don't have a parent (of a certain type), have a given value for an attribute etc. All these properties are differentiating one node from another. Visitors usually want to operate only on nodes with properties that fullfill some defined conditions. Example:

A visitor that wants to print out all the local variables of a program. It is therefore interested in AST nodes whose properties match the following conditions:

- node type is ASTVariableDeclaration
- node scope is of type ASTBlock (that's the scope for local stuff)
- number of children is 3
- first child is of type ASTIdentifier
- etc

The if-statements in the example above test for conditions like these. But all these nested conditions look awkward and are probably hard to modify and to maintain.

There is clearly a need to simplify the task of matching nodes according to defined conditions.

The idea behind tree patterns is defining the conditions in a declarative fashion in a simple language. These declarations get transformed by a parser into tree template data structures that are used by a general tree matcher to match nodes. The result is that the if-statements get replaced by an easier to read declaration of conditions and the tree matching code of the general tree matcher gets reused in all instances where locating nodes according to conditions is needed.

2.) TreeTemplates

TreeTemplates are the data structures that the general tree matcher uses to match nodes.

A tree template consists of a map of key-value pairs and a list of child tree templates. The map defines how the node should look like and the list of child templates defines how its children look like.

The following keys are available for the node map:

- type : the type of the node, value is an ASTLiteral with the unqualified class name of the node as string
- parent : a tree template or ASTLiteral.NULL
- previous : a tree template or ASTLiteral.NULL
- next : a tree template or ASTLiteral.NULL
- scope : the scope in which the node lives, value is the string "local" or "global"
- value : the node's value (only defined for nodes representing literals or identifiers), value is an ASTLiteral

If the tree template is defining a child (ie it is in the list of child templates of another tree template) then one more key is available:
- multi : a quantifier specifying how many children should match this template, value is an instance of Multi

The keys listed above all define tree template conditions that a node has to match. They are content keys.

In addition to the content keys there are two other keys available. These two keys don't define conditions but control the behavior of the tree matcher and are therefore control keys :
- name : identifies a node from the subtree that matched a template, useful to easily get to children, grandchildren etc of matched nodes
- stop : tells the tree matcher to not go deeper (ie ignore the children of this node)

Example:

Same visitor that prints out local variables of a program:

// define template

ASTLiteral literal;

TreeTemplate treeTemplate = new TreeTemplate();
treeTemplate.node = new HashMap();

literal = new ASTLiteral();
literal.setValue("ASTVariableDeclaration");
treeTemplate.node.put("type", literal);

treeTemplate.node.put("scope", "global");

TreeTemplate childTemplate;

treeTemplate.children = new ArrayList();

childTemplate = new TreeTemplate();
childTemplate.node = new HashMap();

literal = new ASTLiteral();
literal.setValue("ASTIdentifier");
childTemplate.node.put("type", literal);

literal = new ASTLiteral();
literal.setValue("theIdentifier");
childTemplate.node.put("name", literal);

childTemplate.node.put("stop", Boolean.TRUE);

treeTemplate.children.add(childTemplate);

// etc.....

// match
TreeMatcher treeMatcher = new TreeMatcher();

HashMap matchedNodes = new HashMap();
if(treeMatcher.match(simpleNode, scope, treeTemplate, matchedNodes)){
//matched
ASTIdentifier theIdentifier = (ASTIdentifier) matchedNodes.get("theIdentifier");
}




Note how the "name" key was used to mark a child node and then retrieve it from the matchedNodes map. For templates with the multi key in them the matchedNodes map contains a list of nodes under the name specified in the template.

3.) TreePattern language

It's not hard to see that constructing the TreeTemplate data structures by hand can get tiresome very quickly. An easier method is to define the tree templates in a simple and concise language and let a parser construct the tree template data structure from a "program" in that language. What follows is an informal description of the tree pattern language. For a more detailed and exact definition look at the javacc grammar file (TreePattern.jj) describing the language.

The lexical tokens of the language include all the javascript literals:
- strings (supported are both quoting types "foo", 'foo', just like in javascript)
- numbers
- true, false, null
- regular expression literals

White-space is the same as in javascript, but no comments are supported.

The reserved words are value, type, name, stop , scope, parent, next, previous, local , global and multi.

The separators are {, }, [, ], (, ), : and comma itself.

The operators are =, <, >, !=, /=,<=, >=, + and *.

The grammatical productions are

Pattern ::= ( ( <LPAREN> Node ( <COMMA> ( Pattern )? )* <RPAREN> ) | Node )
Node ::= <LBRACE> ( FieldList )? <RBRACE>
FieldList ::= Field ( <COMMA> ( Field )? )*
Field ::= FieldKey Comparator Literal

| <MULTI> MExpr

| NodeRefKey <EQUAL> ( Pattern | <NULL> )

| <STOP>

| <NAME> <COLON> Literal

| <SCOPE> <EQUAL> ( <LOCAL> | <GLOBAL> )
FieldKey ::= ( <TYPE> | <VALUE> )
NodeRefKey ::= ( <PREV> | <NEXT> | <PARENT> )
Comparator ::= ( <EQUAL> | <NEQUAL> | <REGEQUAL> | <GT> | <LT> | <LE> | <GE> )
MExpr ::= ( ( <GT> | <LT> | <GE> | <LE> ) <DECIMAL_LITERAL> | <EQUAL> MultiEqualFactor )
MultiEqualFactor ::= <DECIMAL_LITERAL>

| <PLUS>

| <STAR>

| <LBRACKET> <DECIMAL_LITERAL> <COMMA> <DECIMAL_LITERAL> <RBRACKET>
Literal ::= ( <DECIMAL_LITERAL> | <HEX_LITERAL> | <FLOATING_POINT_LITERAL> | <STRING_LITERAL> | <TRUE> | <FALSE> | <NULL> | <REGEX_LITERAL> | <UNTERMINATED_STRING_LITERAL> )


Let's see some examples:

First, the local variable example again:

( { type = 'ASTVariableDeclaration' , scope = local }, { type = 'ASTIdentifier', name : 'theIdentifier', stop}, {multi = *, stop} )

The last child template basically tells the matcher "match up any remaining - zero or more - children, we're not interested in them".

Here's a more complex example: we want function declaration nodes that have between 2 and 5 formal arguments and that represent functions named "foo":

( {type='ASTFunctionDeclaration'}, {type='ASTIdentifier', value='foo'}, ( {type= 'ASTFormalArgumentList '}, {multi=[2,5], stop} ), {type='ASTBlock' , stop})


4.) Multi

A word about the quantifier specification used with key multi.

The tree matcher doesn't use a backtracking algorithm to resolve multi constraints, instead it uses a greedy behavior and matches as many child nodes as possible to the current child template that has a multi key, and then proceeds to the next child template (it is a little more intelligent than that, if the multi constraint for the current child template has been met it tries to leave a minimum amount of children for the following templates). That means that one has to be a little careful when specifying patterns with multi in them.

A good rule of thumb to deal with the present tree matcher is to define as detailed as possible each child node template that has multi quantifiers. That minimizes the possibility of ambiguous mappings of children to child templates.