Pattern Matching in Trees

This article aims to describe the technique of using Aho-Corasick to perform tree matching, from Pattern Matching in Trees by Hoffmann and O’Donnell.

The Idea

The primary idea of the algorithm is that you can apply string matching algorithms to structures that can be characterised by strings. Trees, for example, are characterised by their root-to-leaf paths.

We will adopt a prefix notation when describing labelled trees. For example, consider the matching problem and solutions in the diagram below:

The diagram above illustrates the matching of pattern tree a(a(b, _), c) against subject tree f(a(a(b, a(a(b, d), c)), c), z). The highlighted regions signify a match, rooted at the highest node in the region. Notice that patterns involving wildcards can be arbitrarily nested in a subject tree.

Algorithm

The general steps of the algorithm are outlined below.

Construct a trie from a pattern tree.

For this, I collected the root-to-leaf paths using a depth-first traversal and then populated the trie. Below is the trie for the pattern a(a(b, _), c).

Note that, while you can actually construct a trie directly from a pattern tree, I chose not to take this approach because the root-to-leaf solution is more easily adapted to supporting multiple different pattern trees (just requires more book-keeping).

Construct an Aho-Corasick automaton from the trie.

This follows exactly from the original paper, Efficient String Matching: An Aid to Bibliographic Search.

For example, the automaton computed from the trie above is:

For clarity in presentation, I’ve omitted all edges that go to the root node (labelled 0 in the above diagram). So, you can assume all missing transitions go to the root node.

Traverse the subject tree with a traversal stack, using the automaton to identify matches.

The algorithm uses a stack to traverse the subject tree so that matches can be annotated where they’re rooted (as opposed to the node where a match is discovered). To do this, upon entering a matching state (a state in the automaton with outputs), each output is iterated and its length used to index the traversal stack. In the original paper, advances on the child indices are not recorded on the stack - so the length of the corresponding output path string, without the index symbols, is used to index the stack. If by the end, the number of matches equals the number of path strings (leaves in the pattern tree), then the entire pattern tree matches at the relevant node in the subject tree.

Applications

Instruction selection: In Code generation using tree matching and dynamic programming, Aho et al. combine the algorithm described in this article with dynamic programming to produce a tree-matching instruction selection framework called TWIG. The algorithm could also be used more generally to implement something similar to Go .rules (example here) or GCC machine description (.md) files (example here). In general, term rewriting can be implemented with this algorithm quite easily.
Signature scanning: game cheats often employ Aho-Corasick to identify linear code sequences (as bytes of their opcodes) or structures in memory. Since this usage of the algorithm is aimed at trees, you can imagine that structures whose members refer to other structures (by means of pointers) can be incorporated.
Matching JSON and other records: the algorithm can easily be adapted to ignore child sub-tree order and exhaustivity by normalisation. For example, you could accept a JSON record as a pattern and then inform the algorithm to sort each field lexicographically and only traverse fields of the subject tree that are present as fields in the pattern tree. The effect of this would be that you could list the fields in any order and attain partial matches (omitting of fields).

Demo

Below you can play around with the algorithm to see it work.

Strict

The “strict” mode changes the algorithm to incorporate the arity of nodes in their labels. The effect of this is that trees must match exactly. It’s tempting to think that the prefix notation f(a, b, c) implies must have exactly 3 children, however the original algorithm is more general, it would find a match in f(a, b, c, d) (since such a sub-tree is actually present). The semantics of this are not what people assume would be the case from the prefix notation, so I’ve enabled a stricter matching criterion by default.

Implementation

Basic implementations of this algorithm can be found on my GitHub. One, in OCaml (powering this blog article’s demo), here, and a Java version, here.