Efficient Top-k Approximate Subtree Matching in Small Memory
Nikolaus Augsten, Denilson Barbosa, Michael Böhlen, Themis Palpanas - TKDE'11
We consider the Top-k Approximate Subtree Matching
(TASM
) problem: finding the k best matches of a small query
tree within a large document tree using the canonical tree edit distance
as a similarity measure between subtrees. Evaluating the tree edit distance
for large XML trees is difficult: the best known algorithms have cubic
runtime and quadratic space complexity, and, thus, do not scale.
Our solution is TASM
-postorder, a memory-efficient and scalable
TASM
algorithm. We prove an upper bound for the maximum
subtree size for which the tree edit distance needs to be evaluated. The
upper bound depends on the query and is independent of the document size
and structure. A core problem is to efficiently prune subtrees that are
above this size threshold. We develop an algorithm based on the prefix ring
buffer that allows us to prune all subtrees above the threshold in a single
postorder scan of the document. The size of the prefix ring buffer is
linear in the threshold. As a result, the space complexity of
TASM
-postorder depends only on k and the query size,
and the runtime of TASM
-postorder is linear in the size of the
document. Our experimental evaluation on large synthetic and real XML
documents confirms our analytic results.