Evaluating non-standard algorithms for the tree edit distance

Trees are common structures for representing hierarchical data. Think of, for example, genealogical trees: each data item is a node, and nodes have parents, children, and siblings. Such tree structures are used to represent a variety of data: RNA structures, neuronal cells, carbohydrates, sentence structures, websites, and many others. An interesting problem is to measure the similarity of two tree structures. A standard measure for this is the tree edit distance.

The tree edit distance is the minimum number of basic operations needed to transform one tree into another: delete a node, insert a node, and rename the label of a node. The more operations that have to be performed, the more dissimilar the trees are.
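To make the three operations concrete, here is a minimal sketch of a recursive tree edit distance with unit costs, written as a memoized transcription of the classical recursion on forests (always removing the rightmost root). The nested-tuple representation and the function names are illustrative, not taken from any particular library, and this sketch is meant for small trees only, not as a replacement for the optimized algorithms from the literature.

```python
from functools import lru_cache

# A tree is a tuple (label, children), where children is itself a tuple of
# trees; a forest is a tuple of trees. All unit costs (delete = insert =
# rename = 1, matching identical labels = 0).

@lru_cache(maxsize=None)
def forest_dist(f, g):
    if not f and not g:
        return 0
    if not g:
        # Only deletions remain: delete the rightmost root of f,
        # its children become roots of the remaining forest.
        _, children = f[-1]
        return forest_dist(f[:-1] + children, g) + 1
    if not f:
        # Only insertions remain: insert the rightmost root of g.
        _, children = g[-1]
        return forest_dist(f, g[:-1] + children) + 1
    (lv, cv), (lw, cw) = f[-1], g[-1]
    return min(
        forest_dist(f[:-1] + cv, g) + 1,   # delete rightmost root of f
        forest_dist(f, g[:-1] + cw) + 1,   # insert rightmost root of g
        forest_dist(cv, cw)                # match the two rightmost roots:
        + forest_dist(f[:-1], g[:-1])      # solve subtrees and the rest
        + (0 if lv == lw else 1),          # rename only if labels differ
    )

def tree_edit_distance(t1, t2):
    """Distance between two single trees, each wrapped as a one-tree forest."""
    return forest_dist((t1,), (t2,))
```

For example, transforming f(d(a, c(b)), e) into f(c(d(a, b)), e) costs 2: delete node c, then reinsert it above d.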

There are many algorithms to compute the tree edit distance, and they vary in many aspects. Some of them implement a classical recursive formula dating back to 1979. These algorithms are well studied, and it is possible to judge which of them performs best under which circumstances. However, there are also algorithms that deviate from this standard approach. They come from the algorithmics community and use various tricks to improve performance, either in runtime or in memory consumption. Unfortunately, some of these algorithms have not yet been implemented and compared to the standard approaches.

The goal of the thesis is to evaluate how well the non-standard algorithms perform compared to the classical approaches. This involves several steps:

Contact: Mateusz Pawlik, Nikolaus Augsten

Algorithmics
Programming
Benchmarking
Innovation