Top-k Subtree Similarity Queries for Unordered Trees

(master thesis or bachelor project)

A top-k subtree similarity query finds the k subtrees in a large document tree that are most similar to a given reference tree (the query tree). The similarity between two trees is assessed using the so-called tree edit distance (TED), which is the minimum number of node edit operations that transform one tree into another. The tree edit distance for ordered trees (OTED) can be computed in cubic time and quadratic space in the number of tree nodes. A tree is ordered if all siblings follow a specific order, otherwise the tree is considered unordered.

The Database research group recently proposed an efficient solution for top-k subtree similarity queries for ordered labeled trees [1]. However, the problem remains unsolved for unordered trees. In general, computing the TED for unordered trees (UTED) is NP-complete. The purpose of this project is to develop an algorithm for finding a top-k ranking for unordered trees based on our solution for ordered trees [1]. We plan to leverage the fact that the OTED is an upper bound for the UTED and to avoid as many UTED computations as possible. The project also involves an extensive evaluation of the solution with respect to different characteristics (e.g., time and space requirements, effectiveness).

There are also other possibilities to extend on our recent work, for example:

Improve our approach to exploit the structure of subtrees
Design a parallel/distributed version of our algorithm
Improve the underlying data structure to be more cache-friendly
Adapt the algorithm to take a parameter that bounds the max. amount of memory to be consumed

Prerequisites and Interests:

Basic knowledge in C++
Basic knowledge in the design of algorithms and data structures
Master's class "Similarity Search" is of advantage (but not a must)

Contact: Daniel Kocher, Nikolaus Augsten

Algorithmik

Programmierung

Benchmarking

Innovation

[1] Kocher and Augsten. A Scalable Index for Top-k Subtree Similarity Queries. ACM SIGMOD. 2019.