Experimental comparison of upper bounds for the tree edit distance

(master thesis or bachelor project)

The tree edit distance measure is usually the first choice for comparing tree-structured data. It is defined as the minimum number of simple edit operations (delete, insert, and relabel) to transform one tree into another. However, due to its cubic runtime complexity it is disregarded in some applications. Efficient approximation techniques are required, especially for large input instances. Examples of large trees (having often millions of nodes) are merger trees of galaxies in astrophysics, or assets and bills of material in enterprise resource planning.

More efficient upper bound approximations of the tree edit distance have been proposed. They impose additional restrictions on when and where in the trees the edit operations are allowed. These solutions have quadratic runtime complexity and return a distance value which is larger or equal to the tree edit distance. Most upper bounds have been developed in the algorithmic community which results in no, or very simplified, empirical evaluation. This makes it hard to choose a suitable algorithm.

The main goal of this thesis is to implement the upper bound algorithms for the tree edit distance and perform an exhaustive experimental evaluation. The following tasks are foreseen:

Contact: Mateusz Pawlik, Nikolaus Augsten