Experimental comparison of upper bounds for the tree edit distance
(master thesis or bachelor project)
The tree edit distance measure is usually the first choice for comparing tree-structured data. It is defined as the minimum number of simple edit operations (delete, insert, and relabel) to transform one tree into another. However, due to its cubic runtime complexity it is disregarded in some applications. Efficient approximation techniques are required, especially for large input instances. Examples of large trees (having often millions of nodes) are merger trees of galaxies in astrophysics, or assets and bills of material in enterprise resource planning.
More efficient upper bound approximations of the tree edit distance have been proposed. They impose additional restrictions on when and where in the trees the edit operations are allowed. These solutions have quadratic runtime complexity and return a distance value which is larger or equal to the tree edit distance. Most upper bounds have been developed in the algorithmic community which results in no, or very simplified, empirical evaluation. This makes it hard to choose a suitable algorithm.
The main goal of this thesis is to implement the upper bound algorithms for the tree edit distance and perform an exhaustive experimental evaluation. The following tasks are foreseen:
- study the related work concerning the upper bounds for the tree edit distance and deeply understand differences between the proposed methods,
- implement all algorithms in a common way which allows for objective testing,
- specify suitable measures and evaluate the algorithms against artificial and real-world datasets.
Contact: Mateusz Pawlik, Nikolaus Augsten