##
The *pq*-Gram Distance between Ordered Labeled Trees

*Nikolaus Augsten, Michael Böhlen, Johann Gamper* - TODS'10

When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching has to be approximate since the representations in the sources differ.

We propose *pq*-grams to approximately match hierarchical data from
autonomous sources and define the *pq*-gram distance between
ordered labeled trees as an effective and efficient approximation of the
fanout weighted tree edit distance. We prove that the *pq*-gram
distance is a lower bound of the fanout weighted tree edit distance and
give a normalization of the *pq*-gram distance for which the
triangle inequality holds. Experiments on synthetic and real world data
(residential addresses and XML) confirm the scalability of our approach
and show the effectiveness of *pq*-grams.