Visualization of edit mappings for large trees

(master thesis or bachelor project)

In the world of rapid data growth, similarity search has emerged. Interesting questions are for example: Do we have in our dataset two different data items which represent the same real world entity? Which data items from two different datasets represent the same real world object? Are two data items similar and, if yes, how?

The similarity search is of interest in various domains. In the context of the thesis we focus on the similarity of tree-structured data. Trees are used to represent data spanning from RNA structures and carbohydrates to websites and sentence syntax. One way to show the similarity of trees is the tree edit mapping. Once computed, it specifies how to transform one tree into another by the means of simple edit operations on nodes. Excessive nodes are deleted, missing nodes are inserted and similar nodes are mapped together.

There are many algorithms to compute the tree edit mapping. One of them is the tree edit distance which computes the minimal mapping in the number of required edit operations. Unfortunately, many of these algorithms result only in raw (not visual) data. There is little research of how to represent the mappings and how to visualize them. In many domains, like biology, it would be interesting to see the actual results.

As for small trees the mappings can be easily shown, the larger the trees the more difficult it becomes. Another interesting problem is to decide on which edit mapping to show. One tree can be transformed to another in multiple ways.

The goal of the thesis is to exploit the above problems and develop a visualization tool which scales for large inputs. In particular the following steps are foreseen:

Contact: Mateusz Pawlik, Nikolaus Augsten