The pq-Gram Distance between Ordered Labeled Trees
Repeatability package
Nikolaus Augsten, Michael Böhlen, Johann Gamper. ACM Transactions on Database Systems (TODS) , 35(1):1-36, 2010
Here you find the package to repeat the experiments from our TODS'10 article.
Quickstart
: Repeat our experiments in five simple steps.- Download and unzip the repeatability package.
-
Set host, database, username, and password for your database in the
configuration file
augsten-tods10/config.txt
. - Change to the directory of the experiment that you would like to repeat (augsten-tods10/exp/*).
-
Execute the following commands in this order:
Note: Some experiments do not have a load.sh or a plot.sh command../load.sh # load experimental data to database ./run.sh # run the experiment ./plot.sh # draw the figures
- The result of each experiment is stored in the respective "log" directory, the figures are stored in the "eps" directory.
The package contains a readme file with further information.
Download: Repeatability Package [ZIP] [README]
Experimental Data
Here you can download our real world datasets.
Bolzano Address Trees:
The residential address data of the real world experiment in Section 9.3 is owned by the Municipality of Bolzano and was provided to the authors in the context of the eBZ Initialtive. By courtesy of the Municipality of Bolzano you may download the Bolzano Address Trees under the following conditions:
- You use the data for research purpose only.
- You explicitly acknowledge the Municipality of Bolzano.
The Bolzano Address Trees come in two text files (L.trees
,
R.trees
) encoded with braces.
For example, 30:{cesare abba strasse{1}{2}{3{{1}{3}}}{11}}
is
the address tree with ID 30, its root node has the label
"cesare abba strasse"
and the children of the root
are labeled 1, 2, 3, 11
; 3
has a child with an
empty string label, which in turn has two children with labels
1
and 3
. The IDs of L
are aligned to
R
by hand such that matching address trees have the same ID.
All street names are lowercased.
Download: Bolzano Address Trees [ZIP]
XML:
In Section 9.4 we use three large XML databases, for example, the DBLP bibliography. You can download the snapshot that we used in our experiments and you find a link to the original dataset.
Dataset | Snapshot | Original Dataset |
---|---|---|
DBLP | dblp.xml.zip [55M] | http://dblp.uni-trier.de/xml |
SwissProt | sprot.xml.zip [14M] | http://www.expasy.ch/sprot/ |
Treebank | treebank.xml.zip [31M] | Not available. |
A short description for each of the data sets can be found at the UW XML Repository.
Source Code
The source code of our implementation comes with the jar files
tods10.jar
and approxlib_v1.0.jar
that are
included in the the repeatability package above.
-
tods10.jar
contains the executables that run the experiments.tods10.jar
requiresapproxlib_v1.0.jar
. -
approxlib_v1.0.jar
is our approximate matching library that implements the pq-gram distance and the tree edit distance. More info about this library can be found here.
Extract the source code from the jar files with the following commands:
unzip -x tods10.jar *.java -d tods10
unzip -x approxlib_v1.0.jar *.java -d approxlib_v1.0