## Top-k Subtree Similarity Queries for Unordered Trees

### (master thesis or bachelor project)

A *top-k subtree similarity query* finds the k subtrees in a large
document tree that are most similar to a given reference tree (the query
tree). The similarity between two trees is assessed using the so-called
*tree edit distance* (TED), which is the minimum number of node
edit operations that transform one tree into another. The tree edit
distance for ordered trees (OTED) can be computed in cubic time and
quadratic space in the number of tree nodes. A tree is *ordered* if
all siblings follow a specific order, otherwise the tree is considered
*unordered*.

The Database research group recently proposed an efficient solution for top-k subtree similarity queries for ordered labeled trees [1]. However, the problem remains unsolved for unordered trees. In general, computing the TED for unordered trees (UTED) is NP-complete. The purpose of this project is to develop an algorithm for finding a top-k ranking for unordered trees based on our solution for ordered trees [1]. We plan to leverage the fact that the OTED is an upper bound for the UTED and to avoid as many UTED computations as possible. The project also involves an extensive evaluation of the solution with respect to different characteristics (e.g., time and space requirements, effectiveness).

There are also other possibilities to extend on our recent work, for example:

- Improve our approach to exploit the structure of subtrees
- Design a parallel/distributed version of our algorithm
- Improve the underlying data structure to be more cache-friendly
- Adapt the algorithm to take a parameter that bounds the max. amount of memory to be consumed

Prerequisites and Interests:

- Basic knowledge in C++
- Basic knowledge in the design of algorithms and data structures
- Master's class "Similarity Search" is of advantage (but not a must)

**Contact:**
Daniel Kocher,
Nikolaus Augsten

[1] Kocher and Augsten. A Scalable Index for Top-k Subtree Similarity Queries. ACM SIGMOD. 2019.