The pq-Gram Distance between Ordered Labeled Trees

Repeatability package

Nikolaus Augsten, Michael Böhlen, Johann Gamper. ACM Transactions on Database Systems (TODS) , 35(1):1-36, 2010

Here you find the package to repeat the experiments from our TODS'10 article.

Quickstart

: Repeat our experiments in five simple steps.
  1. Download and unzip the repeatability package.
  2. Set host, database, username, and password for your database in the configuration file augsten-tods10/config.txt.
  3. Change to the directory of the experiment that you would like to repeat (augsten-tods10/exp/*).
  4. Execute the following commands in this order:
    ./load.sh   # load experimental data to database
    ./run.sh    # run the experiment
    ./plot.sh   # draw the figures
    Note: Some experiments do not have a load.sh or a plot.sh command.
  5. The result of each experiment is stored in the respective "log" directory, the figures are stored in the "eps" directory.
System requirements: The experiments run on Linux with Java 1.6. Some of the experiments require a MySQL database. The figures are drawn with gnuplot.

The package contains a readme file with further information.

Download: Repeatability Package [ZIP] [README]

Experimental Data

Here you can download our real world datasets.

Bolzano Address Trees:

The residential address data of the real world experiment in Section 9.3 is owned by the Municipality of Bolzano and was provided to the authors in the context of the eBZ Initialtive. By courtesy of the Municipality of Bolzano you may download the Bolzano Address Trees under the following conditions:

  1. You use the data for research purpose only.
  2. You explicitly acknowledge the Municipality of Bolzano.

The Bolzano Address Trees come in two text files (L.trees, R.trees) encoded with braces.
For example, 30:{cesare abba strasse{1}{2}{3{{1}{3}}}{11}} is the address tree with ID 30, its root node has the label "cesare abba strasse" and the children of the root are labeled 1, 2, 3, 11; 3 has a child with an empty string label, which in turn has two children with labels 1 and 3. The IDs of L are aligned to R by hand such that matching address trees have the same ID. All street names are lowercased.

Download: Bolzano Address Trees [ZIP]

XML:

In Section 9.4 we use three large XML databases, for example, the DBLP bibliography. You can download the snapshot that we used in our experiments and you find a link to the original dataset.

Dataset Snapshot Original Dataset
DBLP dblp.xml.zip [55M] http://dblp.uni-trier.de/xml
SwissProt sprot.xml.zip [14M] http://www.expasy.ch/sprot/
Treebank treebank.xml.zip [31M] http://www.cis.upenn.edu/~treebank

A short description for each of the data sets can be found at the UW XML Repository.

Source Code

The source code of our implementation comes with the jar files tods10.jar and approxlib_v1.0.jar that are included in the the repeatability package above.

  1. tods10.jar contains the executables that run the experiments. tods10.jar requires approxlib_v1.0.jar.
  2. approxlib_v1.0.jar is our approximate matching library that implements the pq-gram distance and the tree edit distance. More info about this library can be found here.

Extract the source code from the jar files with the following commands:

unzip -x tods10.jar *.java -d tods10
unzip -x approxlib_v1.0.jar *.java -d approxlib_v1.0