-----------------------------------------------------------
README for augsten-tods10.zip
published on: http://www.inf.unibz.it/~augsten/publ/tods10/
-----------------------------------------------------------

=========
OVERVIEW:
=========

This ZIP contains the data and the code that allow you to repeat the
experiments of the following paper:

  The pq-Gram Distance between Ordered Labeled Trees
  N. Augsten, M. Böhlen, and J. Gamper.
  ACM Transactions on Database Systems (TODS), 2009.

===========
QUICKSTART:
===========

1. Unzip augsten-tods10.zip

2. Set host, database, username, and password for your database in
   the configuration file "augsten-tods10/config.txt".

3. Change to the directory of the experiment that you would like to
   repeat (augsten-tods10/exp/*, for a list of experiments see below).

4. Execute the following commands in this order:

     ./clean.sh
     ./load.sh
     ./run.sh
     ./plot.sh

   Note: Some experiments do not have a load.sh or a plot.sh command.

5. The result of each experiment is stored in the respective "log"
   directory, the figures are stored in the "eps" directory.

List of Experiments:

- 9.1 Scalability
  augsten-tods10/exp/scalability

- 9.2 Sensitivity to Structure Change
  augsten-tods10/exp/structure

- 9.3 Real World: Street Matching
  augsten-tods10/exp/streetmatching

- 9.4 Real World: Matching XML Data
  augsten-tods10/exp/xmlmatching

====================
SYSTEM REQUIREMENTS:
====================

You need Linux to run the shell scripts (e.g., ./run.sh). The Java
code is written for Sun Java 1.6 and the relational database
MySQL 5.0 (http://dev.mysql.com). We use gnuplot to draw the figures.

We access MySQL with the JDBC driver v3.0.11 (included) and use
Xerces 2.9.1 to parse XML files (included).

If you are an Ubuntu or Debian user, execute the following shell
command to install all required software:

  sudo apt-get install sun-java6-jdk mysql-server gnuplot

============
SOURCE CODE:
============

The source code of our implementation comes with the jar files
tods10.jar and approxlib_v1.0.jar that are included in this ZIP file.

- tods10.jar contains the executables that run the experiments.
  tods10.jar requires approxlib_v1.0.jar.

- approxlib_v1.0.jar is our approximate matching library that
  implements the pq-gram distance and the tree edit distance.
  More info about this library can be found at
  http://www.inf.unibz.it/~augsten/src.

Extract the source code from the jar files with the following
commands:

  unzip -x tods10.jar *.java -d tods10
  unzip -x approxlib_v1.0.jar *.java -d approxlib_v1.0

=========
CONTENTS:
=========

README.TXT   This file.

config.txt   Here you configure host, database, user, and password
             of your MySQL database. The syntax is

               host=
               db=
               user=
               pwd=

lib/         All jar files required to run the experiments.
             For tods10.jar and approxlib_v1.0.jar see Section
             "Source Code".

exp/*        Directories for the individual experiments.
             They all have a similar structure:

               data/     experimental data
               log/      log files (experimental results)
               eps/      eps figures
               clean.sh  remove log files and figures
               load.sh   load the experimental data from files to
                         the database
               run.sh    execute the experiment (results written to
                         log files)
               plot.sh   use gnuplot to draw eps figures from log
                         files

             Note: config.txt and lib/ are symbolic links.

=========================
RESIDENTIAL ADDRESS DATA:
=========================

(section included from exp/streetmatching/data/README)

The residential address data (Bolzano Address Trees) of the real
world experiment in Section 9.3 (streetmatching) is owned by the
Municipality of Bolzano and was provided to the authors in the
context of the eBZ Initiative.

By courtesy of the Municipality of Bolzano you may download the
Bolzano Address Trees under the following conditions:

1. You use the data for research purposes only.

2. You explicitly acknowledge the Municipality of Bolzano.

The Bolzano Address Trees come in two text files (L.trees, R.trees)
encoded with braces. For example,

  30:{cesare abba strasse{1}{2}{3{{1}{3}}}{11}}

is the address tree with ID 30. Its root node has the label
"cesare abba strasse", and the children of the root are labeled
1, 2, 3, 11. Node 3 has a child with an empty string label, which in
turn has two children with labels 1 and 3.

The IDs of L.trees are aligned to R.trees by hand such that matching
address trees have the same ID. All street names are lowercased.

==================
DOWNLOAD XML DATA:
==================

(section included from exp/xmlmatching/data/README.TXT)

For the experiment "xmlmatching" in Section 9.4 we use large XML
files. This ZIP includes only templates for the large files. The
experiments can be executed with the templates, but the results will
differ from those reported in the paper.

You can download the exact version of the files that we use in the
paper from:

  http://www.inf.unibz.it/~augsten/publ/tods10/

Place the XML files into the directory

  augsten-tods10/exp/xmlmatching/data

replacing the symbolic links dblp.xml, sprot.xml, and treebank.xml,
respectively.
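
As an illustration of the brace encoding described under RESIDENTIAL
ADDRESS DATA, the following is a minimal sketch of a reader for one line
of L.trees or R.trees. It is not taken from tods10.jar or
approxlib_v1.0.jar; the class name BraceTreeSketch, the Node type, and
the methods parseId and parseTree are hypothetical names chosen for this
example. The sketch assumes that a line has the form "ID:{...}" and that
labels never contain the characters '{' or '}' (the README does not say
how such characters would be escaped).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the distributed jar files. */
public class BraceTreeSketch {

    /** A tree node: a label plus an ordered list of children. */
    public static final class Node {
        public final String label;
        public final List<Node> children = new ArrayList<Node>();
        Node(String label) { this.label = label; }
    }

    private final String s;  // the brace-encoded tree, without the "ID:" prefix
    private int pos = 0;     // current read position in s

    private BraceTreeSketch(String s) { this.s = s; }

    /** Returns the tree ID, i.e. the number before the first ':'. */
    public static int parseId(String line) {
        return Integer.parseInt(line.substring(0, line.indexOf(':')).trim());
    }

    /** Parses the brace-encoded tree that follows the first ':'. */
    public static Node parseTree(String line) {
        String tree = line.substring(line.indexOf(':') + 1).trim();
        BraceTreeSketch p = new BraceTreeSketch(tree);
        Node root = p.node();
        if (p.pos != p.s.length()) {
            throw new IllegalArgumentException("trailing input at position " + p.pos);
        }
        return root;
    }

    // Assumed grammar: node := '{' label node* '}', where label may be empty.
    private Node node() {
        expect('{');
        int start = pos;
        while (pos < s.length() && s.charAt(pos) != '{' && s.charAt(pos) != '}') {
            pos++;
        }
        Node n = new Node(s.substring(start, pos));
        while (pos < s.length() && s.charAt(pos) == '{') {
            n.children.add(node());  // children keep their left-to-right order
        }
        expect('}');
        return n;
    }

    private void expect(char c) {
        if (pos >= s.length() || s.charAt(pos) != c) {
            throw new IllegalArgumentException("expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        String line = "30:{cesare abba strasse{1}{2}{3{{1}{3}}}{11}}";
        Node root = parseTree(line);
        // Prints the ID, the root label, and the number of root children.
        System.out.println(parseId(line) + " " + root.label + " " + root.children.size());
    }
}
```

Applied to the example line above, parseTree yields a root labeled
"cesare abba strasse" with four children; the third child (label 3) has
a single child with an empty label, which in turn has the two children
1 and 3. The recursive-descent structure mirrors the nesting of the
braces directly, which keeps the sketch short; a reader for very deep
trees might prefer an explicit stack instead of recursion.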