## 2019

#### Abstract

Given a query tree Q, the top-k subtree similarity query retrieves the k subtrees in a large document tree T that are closest to Q in terms of tree edit distance. The classical solution scans the entire document, which is slow. The state-of-the- art approach precomputes an index to reduce the query time. However, the index is large (quadratic in the document size), building the index is expensive, updates are not supported, and data-specific tuning is required. We present a scalable solution for the top-k subtree similarity problem that does not assume specific data types, nor does it require any tuning. The key idea is to process promising subtrees first. A subtree is promising if it shares many labels with the query. We develop a new technique based on inverted lists that efficiently retrieves subtrees in the required order and supports incremental updates of the document. To achieve linear space, we avoid full list materialization but build relevant parts of a list on the fly. In an extensive empirical evaluation on synthetic and real- world data, our technique consistently outperforms the state- of-the-art index w.r.t. memory usage, indexing time, and the number of candidates that must be verified. In terms of query time, we clearly outperform the state of the art and achieve runtime improvements of up to five orders of magnitude.

#### Abstract

The tree similarity join computes all similar pairs in a collection of trees. Two trees are similar if their edit distance falls within a user-defined threshold. Previous algorithms, which are based on a filter-verify approach, suffer from the following two issues. First, ineffective filters produce a large number of candidates that must be further verified. Second, the candidates are verified by computing the tree edit distance, which is cubic in the number of tree nodes. Thus, these techniques fail to scale to large tree collections and are infeasible even for small collections when the trees are large.

In this paper, we present a scalable solution for the tree similarity join that is based on (1) an effective indexing technique that leverages both the labels and the structure of trees to reduce the number of candidates, (2) an efficient upper bound filter that moves many of the candidates directly to the join result without additional verification, (3) a linear time verification technique for the remaining candidates that avoids the expensive tree edit distance. We are the first to propose an efficient, non-trivial upper bound and linear time verification for the tree edit distance. Unlike previous solutions, our approach scales to collections with millions of trees. We improve the overall join time by up to two orders of magnitude w.r.t. the state of the art.

## 2018

#### Abstract

Set similarity joins, which compute pairs of similar sets, constitute an important operator primitive in a variety of applications, including applications that must process large amounts of data. To handle these data volumes, several distributed set similarity join algorithms have been proposed. Unfortunately, little is known about the relative performance, strengths and weaknesses of these techniques. Previous comparisons are limited to a small subset of relevant algorithms, and the large differences in the various test setups make it hard to draw overall conclusions. In this paper we survey ten recent, distributed set similarity join algorithms, all based on the MapReduce paradigm. We empirically compare the algorithms in a uniform test environment on twelve datasets that expose different characteristics and represent a broad range of applications. Our experiments yield a surprising result: All algorithms in our test fail to scale for at least one dataset and are sensitive to long sets, frequent set elements, low similarity thresholds, or a combination thereof. Interestingly, some algorithms even fail to handle the small datasets that can easily be processed in anon-distributed setting. Our analytic investigation of the algorithms pinpoints the reasons for the poor performance and targeted experiments confirm our analytic findings. Based on our investigation, we suggest directions for future research in the area.

#### Abstract

Despite many research efforts, similarity queries are still poorly supported by current systems. We analyze the main stream research in processing similarity queries and argue that a general-purpose query processor for similarity queries is required. We identify three goals for the evaluation of similarity queries (declarative, efficient, combinable) and identify the main research challenges that must be solved to achieve these goals.

## 2017

#### Abstract

The tree edit distance (TED), defined as the minimum-cost sequence of node operations that transform one tree into another, is a well-known distance measure for hierarchical data. Thanks to its intuitive definition, TED has found a wide range of diverse applications like software engineering, natural language processing, and bioinformatics. The state-of-the-art algorithms for TED recursively decompose the input trees into smaller subproblems and use dynamic programming to build the result in a bottom-up fashion. The main line of research deals with efficient implementations of a recursive solution introduced by Zhang in the late 1980s. Another more recent recursive solution by Chen found little attention. Its relation to the other TED solutions has never been studied and it has never been empirically tested against its competitors. In this paper we fill the gap and revisit Chen’s TED algorithm. We analyse the recursion by Chen and compare it to Zhang’s recursion. We show that all subproblems generated by Chen can also origin from Zhang’s decomposition. This is interesting since new algorithms that combine the features of both recursive solutions could be developed. Moreover, we revise the runtime complexity of Chen’s algorithm and develop a new traversal strategy to reduce its memory complexity. Finally, we provide the first experimental evaluation of Chen’s algorithm and identify tree shapes for which Chen’s solution is a promising competitor.

#### Abstract

In this paper an application is developed that functions similar to a recommender system and allows to find appropriate Open-StreetMap (OSM) tags by querying co-occurring keys and tags, as well as similar sets of tags in the database. A user may enter key(s) or key-value pair(s), even using wildcard substitution for both, in order to find keys or key-value pairs that are used in combination with the entered ones. Moreover, the top-k matching tag sets are also presented. The results are then top-k ranked, based on the frequency of the occurrence of each distinct set in the database. This information may enable a user to find the most comprehensive and best fitting tag set for an OSM element. This assumption is examined in an evaluation where the precision and recall metrics for both approaches are computed and compared. Our approach helps discovering combinations of tags and their usage frequency in contrast to common recommender systems that focus on classifying or clustering elements and finding the most accurate (single) class or cluster rather than sets of tags.

#### Abstract

We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams of sets. A prototypical example setting is that of tweets. A tweet is a set of words, and Twitter emits about half a billion tweets per day. Our solution makes it possible to efficiently maintain the top-k most similar tweets from a pair of rapid Twitter streams, e.g., to discover similar trends in two cities if the streams concern cities. Using a sliding window model, the top-k result changes as new sets in the stream enter the window or existing ones leave the window. Maintaining the top-k result under rapid streams is challenging. First, when a set arrives, it may form a new pair for the top-k result with any set already in the window. Second, when a set leaves the window, all its pairings in the top-k are invalidated and must be replaced. It is not enough to maintain the k most similar pairs, as less similar pairs may eventually be promoted to the top-k result. A straightforward solution that pairs every new set with all sets in the window and keeps all pairs for maintaining the top-k result is memory intensive and too slow. We propose SWOOP, a highly scalable stream join algorithm that solves these issues. Novel indexing techniques and sophisticated filters efficiently prune useless pairs as new sets enter the window. SWOOP incrementally maintains a stock of similar pairs to update the top-k result at any time, and the stock is shown to be minimal. Our experiments confirm that SWOOP can deal with stream rates that are orders of magnitude faster than the rates of existing approaches.

## 2016

#### Abstract

Given a query point in a spatial network, the reachability query retrieves all points of interest that are reachable from the query point in a specific amount of time respecting net- work constraints. Reachability queries have a number of interesting applications, and for road networks efficient solutions have been proposed. Road networks are time-independent, i.e., the cost for traversing an edge is constant over time. Efficient algorithms for road networks heavily rely on pre- computing shortest paths. In public transport networks, however, the edge costs depend on schedules, which renders most solutions for road networks inefficient. In particular, shortest paths between node pairs cannot be easily precom- puted because they change over time.

The goal of this work is to develop efficient solutions for reachability queries in public transport networks. The core idea is to partition the network into cells and compute time upper and lower bounds to traverse a cell. At query time, the reachable region is expanded cell by cell (instead of expanding edge by edge). All points of interest that are reachable using upper bound expansion are part of the result; all points that are not reachable in a lower bound expansion can safely be discarded; all other nodes are candidates and must be verified. This paper presents the expansion algorithm and and discusses interesting research directions regarding good network partitions, effective bounds, and efficient candidate verification.

#### Abstract

Set similarity joins compute all pairs of similar sets from two collections of sets. We conduct extensive experiments on seven state-of-the-art algorithms for set similarity joins. These algorithms adopt a filter-verification approach. Our analysis shows that verification has not received enough attention in previous works. In practice, efficient verification inspects only a small, constant number of set elements and is faster than some of the more sophisticated filter techniques. Although we can identify three winners, we find that most algorithms show very similar performance. The key technique is the prefix filter, and AllPairs, the first algorithm adopting this techniques is still a relevant competitor. We repeat experiments from previous work and discuss diverging results. All our claims are supported by a detailed analysis of the factors that determine the overall runtime.

#### Abstract

Hierarchical data are often modelled as trees. An interesting query
identifies pairs of similar trees. The standard approach to tree
similarity is the tree edit distance, which has successfully been
applied in a wide range of applications. In terms of runtime, the
state-of-the-art algorithm for the tree edit distance is RTED, which
is guaranteed to be fast independent of the tree shape. Unfortunately,
this algorithm requires up to twice the memory of its competitors.
The memory is quadratic in the tree size and is a bottleneck for the
tree edit distance computation.

In this paper we present a new, memory efficient algorithm for the tree
edit distance, AP-TED (All Path Tree Edit Distance). Our algorithm runs
at least as fast as RTED without trading in memory efficiency. This is
achieved by releasing memory early during the first step of the
algorithm, which computes a decomposition strategy for the actual
distance computation. We show the correctness of our approach and prove
an upper bound for the memory usage. The strategy computed by AP-TED
is optimal in the class of all-path strategies, which subsumes the class
of LRH strategies used in RTED. We further present the AP-TED+ algorithm,
which requires less computational effort for very small subtrees and
improves the runtime of the distance computation. Our experimental
evaluation confirms the low memory requirements and the runtime
efficiency of our approach.

## 2015

#### Abstract

We consider the classical tree edit distance between ordered labelled
trees, which is defined as the minimum-cost sequence of node edit
operations that transform one tree into another. The state-of-the-art
solutions for the tree edit distance are not satisfactory. The main
competitors in the field either have optimal worst-case complexity but
the worst case happens frequently, or they are very efficient for some
tree shapes but degenerate for others. This leads to unpredictable and
often infeasible runtimes. There is no obvious way to choose between the
algorithms.

In this article we present RTED, a robust tree edit distance algorithm.
The asymptotic complexity of our algorithm is smaller than or equal to
the complexity of the best competitors for any input instance, that is,
our algorithm is both efficient and worst-case optimal. This is achieved
by computing a dynamic decomposition strategy that depends on the input
trees. RTED is shown optimal among all algorithms that use LRH
(left-right-heavy) strategies, which include RTED and the fastest tree
edit distance algorithms presented in literature. In our experiments on
synthetic and real-world data we empirically evaluate our solution and
compare it to the state-of-the-art.

## 2014

SIGMOD '14 Proceedings of the 2014 ACM SIGMOD international conference on Management of data, 2014

#### Abstract

Token similarity joins represent data items as sets of tokens, for
example, strings are represented as sets of q-grams (sub-strings of
length q). Two items are considered similar and match if their token
sets have a large overlap. Previous work on similarity joins in
databases mainly focuses on expressing the overlap computation with
relational operators. The tokens are assumed to preexist in the
database, and the token generation cannot be expressed as part of the
query.

Our goal is to efficiently compute token similarity joins on-the-fly,
i.e., without any precomputed tokens or indexes. We define
*tokenize*, a new relational operator that generates tokens and
allows the similarity join to be fully integrated into relational
databases. This allows us to (1) optimize the token generation as part
of the query plan, (2) provide the query optimizer with cardinality
estimates for tokens, (3) choose efficient algorithms based on the query
context. We discuss algebraic properties, cardinality estimates, and
an efficient iterator algorithm for tokenize. We implemented our
operator in the kernel of PostgreSQL and empirically evaluated its
performance for similarity joins.

#### Abstract

Hierarchical data are often modelled as trees. An interesting query
identifies pairs of similar trees. The standard approach to tree
similarity is the tree edit distance, which has successfully been
applied in a wide range of applications. In terms of runtime, the
state-of-the-art for the tree edit distance is the RTED algorithm, which
is guaranteed to be fast independently of the tree shape. Unfortunately,
this algorithm uses twice the memory of the other, slower algorithms.
The memory is quadratic in the tree size and is a bottleneck for the
tree edit distance computation.

In this paper we present a new, memory efficient algorithm for the tree
edit distance. Our algorithm runs at least as fast as RTED, but requires
only half the memory. This is achieved by systematically releasing
memory early during the first step of the algorithm, which computes a
decomposition strategy and is the main memory bottleneck. We show the
correctness of our approach and prove an upper bound for the memory
usage. Our empirical evaluation confirms the low memory requirements
and shows that in practice our algorithm performs better than the
analytic guarantees suggest.

#### Abstract

Set similarity joins compute all pairs of similar sets from two
collections of sets. Set similarity joins are typically implemented
in a filter-verify framework: a filter generates candidate pairs,
possibly including false positives, which must be verified to produce
the final join result. Good filters produce a small number of false
positives, while they reduce the time they spend on hopeless candidates.
The best known algorithms generate candidates using the so-called prefix
filter in conjunction with length- and position-based filters.

In this paper we show that the potential of length and position have
only partially been leveraged. We propose a new filter, the
*position-enhanced length filter*, which exploits the matching
position to incrementally tighten the length filter; our filter
identifies hopeless candidates and avoids processing them. The filter
is very efficient, requires no change in the data structures of most
prefix filter algorithms, and is particularly effective for foreign
joins, i.e., joins between two different collections of sets.

## 2013

#### Abstract

State-of-the-art database systems manage and process a variety of complex objects, including strings and trees. For such objects equality comparisons are often not meaningful and must be replaced by similarity comparisons. This book describes the concepts and techniques to incorporate similarity into database systems. We start out by discussing the properties of strings and trees, and identify the edit distance as the de facto standard for comparing complex objects. Since the edit distance is computationally expensive, token-based distances have been introduced to speed up edit distance computations. The basic idea is to decompose complex objects into sets of tokens that can be compared efficiently. Token-based distances are used to compute an approximation of the edit distance and prune expensive edit distance calculations. A key observation when computing similarity joins is that many of the object pairs, for which the similarity is computed, are very different from each other. Filters exploit this property to improve the performance of similarity joins. A filter preprocesses the input data sets and produces a set of candidate pairs. The distance function is evaluated on the candidate pairs only. We describe the essential query processing techniques for filters based on lower and upper bounds. For token equality joins we describe prefix, size, positional and partitioning filters, which can be used to avoid the computation of small intersections that are not needed since the similarity would be too low.

#### Abstract

Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster a so-called proximity graph is computed, and the cluster border is detected based on the proximity graph. However, the computation of the proximity graph is expensive and the state-of-the-art GPC algorithms only approximate the proximity graph using a sampling technique. Further, the quality of GPC clusters has never been compared to standard clustering techniques like k-means, density-based, or hierarchical clustering. In this article the authors propose two efficient algorithms, PG-DS and PG-SM, for the exact computation of proximity graphs. The authors experimentally show that our solutions are faster even if the sampling-based algorithms use very small sample sizes. The authors provide a thorough experimental evaluation of GPC and conclude that it is very efficient and shows good clustering quality in comparison to the standard techniques. These results open a new perspective on string clustering in settings, where no knowledge about the input data is available.

#### Abstract

Different databases often store information about the same or related objects in the real world. To enable collaboration between these databases, data items that refer to the same object must be identified. Residential addresses are data of particular interest as they often provide the only link between related pieces of information in different databases. Unfortunately, residential addresses that describe the same location might vary considerably and hence need to be synchronized. Non-matching street names and addresses stored at different levels of granularity make address synchronization a challenging task. Common approaches assume an authoritative reference set and correct residential addresses according to the reference set. Often, however, no reference set is available, and correcting addresses with different granularity is not possible. We present the address connector, which links residential addresses that refer to the same location. Instead of correcting addresses according to an authoritative reference set, the connector defines a lookup function for residential addresses. Given a query address and a target database, the lookup returns all residential addresses in the target database that refer to the same location. The lookup supports addresses that are stored with different granularity. To align the addresses of two matching streets, we use a global greedy address-matching algorithm that guarantees a stable matching. We define the concept of address containment that allows us to correctly link addresses with different granularity. The evaluation of our solution on real-world data from a municipality shows that our solution is both effective and efficient.

CIKM '13 Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, 2013

#### Abstract

The problem of generating a cost-minimal edit script between two trees
has many important applications. However, finding such a cost-minimal
script is computationally hard, thus the only methods that scale are
approximate ones. Various approximate solutions have been proposed
recently. However, most of them still show quadratic or worse runtime
complexity in the tree size and thus do not scale well either. The only
solutions with log-linear runtime complexity use simple matching
algorithms that only find corresponding subtrees as long as these
subtrees are equal. Consequently, such solutions are not robust at all,
since small changes in the leaves which occur frequently can make all
subtrees that contain the changed leaves unequal and thus prevent the
matching of large portions of the trees. This problem could be avoided
by searching for similar instead of equal subtrees but current
similarity approaches are too costly and thus also show quadratic
complexity. Hence, currently no robust log-linear method exists.

We propose the random walks similarity (RWS) measure which can be used
to find similar subtrees rapidly. We use this measure to build the
RWS-Diff algorithm that is able to compute an approximately cost-minimal
edit script in log-linear time while having the robustness of a
similarity-based approach. Our evaluation reveals that random walk
similarity indeed increases edit script quality and robustness
drastically while still maintaining a runtime comparable to simple
matching approaches.

## 2012

#### Abstract

We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. Common to all approaches is a two-step procedure: first, we extract and linearize the structural information from documents, and then, we use similarity measures that are based on, respectively, Kolmogorov complexity and Shannon entropy to determine the distance between the documents. Compared to other approaches, we are able to achieve a linear run-time complexity and demonstrate in an experimental evaluation that the results of our technique in terms of clustering quality are on a par with or even better than those of other, slower approaches.

#### Abstract

In data integration applications, a join matches elements that are common to two data sources. Since elements are represented slightly different in each source, an approximate join must be used to do the matching. For XML data, most existing approximate join strategies are based on some ordered tree matching technique, such as the tree edit distance. In data-centric XML, however, the sibling order is irrelevant, and two elements should match even if their subelement order varies. Thus, approximate joins for data-centric XML must leverage unordered tree matching techniques. This is computationally hard since the algorithms cannot rely on a predefined sibling order. In this paper, we give a solution for approximate joins based on unordered tree matching. The core of our solution are windowed pq-grams which are small subtrees of a specific shape. We develop an efficient technique to generate windowed pq-grams in a three-step process: sort the tree, extend the sorted tree with dummy nodes, and decompose the extended tree into windowed pq-grams. The windowed pq-grams distance between two trees is the number of pq-grams that are in one tree decomposition only. We show that our distance is a pseudo-metric and empirically demonstrate that it effectively approximates the unordered tree edit distance. The approximate join using windowed pq-grams can be efficiently implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Experiments with synthetic and real world data confirm the analytic results and show the effectiveness and efficiency of our technique.

#### Abstract

MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is being used increasingly in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution while scientific data sets are often highly skewed. The resulting load imbalance, which raises the processing time, is even amplified by high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. An accurate cost estimation is the basis for adaptive load balancing algorithms and requires to gather statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets approximating the global data distribution.

#### Abstract

The popularity of MapReduce systems for processing large data sets in
both industry and science has increased drastically over the last years.
While sample applications often found in literature, for example, word
count, are rather simple, e-science applications tend to be complex,
thereby posing new challenges to MapReduce systems. The high runtime
complexity of e-science applications on the one hand, and skewed data
distributions often encountered in scientific data sets on the other
hand, lead to highly varying reducer execution times. These, in turn,
cause high overall execution times and poor resource utilisation.

In this paper, we tackle the challenge of balancing the workload on
the reducers, considering both complex reduce tasks and skewed data.
We define the partition cost model which takes into account non-linear
reducer tasks, and provide an algorithm for efficient cost estimation
in a distributed environment. Finally, we present two load balancing
approaches, fine partitioning and dynamic fragmentation, based on our
partition cost model. Both these approaches can be seamlessly integrated
into existing MapReduce systems like Hadoop. We evaluate our solutions
using both synthetic, and real e-science data.

## 2011

#### Abstract

We consider the classical tree edit distance between ordered labeled
trees, which is defined as the minimum-cost sequence of node edit
operations that transform one tree into another. The state-of-the-art
solutions for the tree edit distance are not satisfactory. The main
competitors in the field either have optimal worst-case complexity, but
the worst case happens frequently, or they are very efficient for some
tree shapes, but degenerate for others. This leads to unpredictable and
often infeasible runtimes. There is no obvious way to choose between
the algorithms.

In this paper we present RTED, a robust tree edit distance algorithm.
The asymptotic complexity of RTED is smaller or equal to the complexity
of the best competitors for any input instance, i.e., RTED is both
effcient and worst-case optimal. We introduce the class of LRH
(Left-Right-Heavy) algorithms, which includes RTED and the fastest tree
edit distance algorithms presented in literature. We prove that RTED
outperforms all previously proposed LRH algorithms in terms of runtime
complexity. In our experiments on synthetic and real world data we
empirically evaluate our solution and compare it to the
state-of-the-art.

#### Abstract

We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree within a large document tree using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity, and, thus, do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper bound for the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.

#### Abstract

In many applications, for example, in data integration scenarios,
strings must be matched if they are similar. String similarity joins,
which match all pairs of similar strings from two datasets, are of
particular interest and have recently received much attention in the
database research community. Most approaches, however, assume a global
similarity threshold; all string pairs that exceed the threshold form
a match in the join result. The global threshold approach has two major
problems: (a) the threshold depends on the (mostly unknown) data
distribution, (b) often there is no single threshold that is good for
all string pairs.

In this paper we propose the *PG-Join* algorithm, a novel string
similarity join that requires no configuration and uses an
*adaptive threshold*. PG-Join computes a so-called proximity
graph to derive an individual threshold for each string. Computing the
proximity graph efficiently is essential for the scalability of PG-Join.
To this end we develop a new and fast algorithm, PG-I, that computes
the proximity graph in two steps: First an efficient approximation is
computed, then the approximation error is fixed incrementally until
the adaptive threshold is stable. Our extensive experiments on
real-world and synthetic data show that PG-I is up to five times faster
than the state-of-the-art algorithm and suggest that PG-Join is a useful
and effective join paradigm.

#### Abstract

MapReduce systems have become popular for processing large data sets
and are increasingly being used in e-science applications. In contrast
to simple application scenarios like word count, e-science applications
involve complex computations which pose new challenges to MapReduce
systems. In particular, (a) the runtime complexity of the reducer task
is typically high, and (b) scientific data is often skewed. This leads
to highly varying execution times for the reducers. Varying execution
times result in low resource utilisation and high overall execution
time since the next MapReduce cycle can only start after all reducers
are done.

In this paper we address the problem of efficiently processing MapReduce
jobs with complex reducer tasks over skewed data. We define a new cost
model that takes into account non-linear reducer tasks and we provide
an algorithm to estimate the cost in a distributed environment. We
propose two load balancing approaches, fine partitioning and dynamic
fragmentation, that are based on our cost model and can deal with both
skewed data and complex reduce tasks. Fine partitioning produces a fixed
number of data partitions, dynamic fragmentation dynamically splits
large partitions into smaller portions and replicates data if necessary.
Our approaches can be seamlessly integrated into existing MapReduce
systems like Hadoop. We empirically evaluate our solution on both
synthetic data and real data from an e-science application.

#### Abstract

String data is omnipresent and appears in a wide range of applications.
Often string data must be partitioned into clusters of similar strings,
for example, for cleansing noisy data. A promising string clustering
approach is the recently proposed Graph Proximity Cleansing (GPC). A
distinguishing feature of GPC is that it automatically detects the
cluster borders without knowledge about the underlying data, using the
so-called proximity graph. Unfortunately, the computation of the
proximity graph is expensive. In particular, the runtime is high for
long strings, thus limiting the application of the state-of-the-art
GPC algorithm to short strings.

In this work we present two algorithms, PG-Skip and PG-Binary, that
efficiently compute the GPC cluster borders and scale to long strings.
PG-Skip follows a prefix pruning strategy and does not need to compute
the full proximity graph to detect the cluster border. PG-Skip is much
faster than the state-of-the-art algorithm, especially for long strings,
and computes the exact GPC borders. We show the optimality of PG-Skip
among all prefix pruning algorithms. PG-Binary is an efficient
approximation algorithm, which uses a binary search strategy to detect
the cluster border. Our extensive experiments on synthetic and
real-world data confirm the scalability of PG-Skip and show that
PG-Binary approximates the GPC clusters very effectively.

## 2010

#### Abstract

Graph Proximity Cleansing (GPC) is a string clustering algorithm that
automatically detects cluster borders and has been successfully used
for string cleansing. For each potential cluster a so-called proximity
graph is computed, and the cluster border is detected based on the
proximity graph. Unfortunately, the computation of the proximity graph
is expensive and the state-of-the-art GPC algorithms only approximate
the proximity graph using a sampling technique.

In this paper we propose two efficient algorithms for the exact
computation of proximity graphs. The first algorithm, PG-DS, is based
on a divide-skip technique for merging inverted lists, the second
algorithm, PG-SM, uses a sort-merge join strategy to compute the
proximity graph. While the state-of-the-art solutions only approximate
the correct proximity graph, our algorithms are exact. We experimentally
evaluate our solution on large real world datasets and show that our
algorithms are faster than the sampling-based approximation algorithms,
even for very small sample sizes.

#### Abstract

We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree, e.g., a DBLP article with 15 nodes, in a large document tree, e.g., DBLP with 26M nodes, using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity, and, thus, do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper-bound for the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.

#### Abstract

String data is omnipresent and appears in a wide range of applications.
Often string data must be partitioned into clusters of similar strings,
for example, for cleansing noisy data. A promising string clustering
approach is the recently proposed Graph Proximity Cleansing (GPC). A
distinguishing feature of GPC is that it automatically detects the
cluster borders without knowledge about the underlying data, using the
so-called proximity graph. Unfortunately, the computation of the
proximity graph is expensive. In particular, the runtime is high for
long strings, thus limiting the application of the state-of-the-art
GPC algorithm to short strings.

In this work we present two algorithms, PG-Skip and PG-Binary, that
efficiently compute the GPC cluster borders and scale to long strings.
PG-Skip follows a prefix pruning strategy and does not need to compute
the full proximity graph to detect the cluster border. PG-Skip is much
faster than the state-of-the-art algorithm, especially for long strings,
and computes the exact GPC borders. We show the optimality of PG-Skip
among all prefix pruning algorithms. PG-Binary is an efficient
approximation algorithm, which uses a binary search strategy to detect
the cluster border. Our extensive experiments on synthetic and
real-world data confirm the scalability of PG-Skip and show that
PG-Binary approximates the GPC clusters very effectively.

## 2008

#### Abstract

In data integration applications, a join matches elements that are common to two data sources. Often, however, elements are represented slightly different in each source, so an approximate join must be used. For XML data, most approximate join strategies are based on some ordered tree matching technique. But in data-centric XML the order is irrelevant: two elements should match even if their subelement order varies. In this paper we give a solution for the approximate join of unordered trees. Our solution is based on windowed pq-grams. We develop an efficient technique to systematically generate windowed pq-grams in a three-step process: sorting the unordered tree, extending the sorted tree with dummy nodes, and computing the windowed pq-grams on the extended tree. The windowed pq-gram distance between two sorted trees approximates the tree edit distance between the respective unordered trees. The approximate join algorithm based on windowed pq-grams is implemented as an equality join on strings which avoids the costly computation of the distance between every pair of input trees. Our experiments with synthetic and real world data confirm the analytic results and suggest that our technique is both useful and scalable.

#### Abstract

The goal of this thesis is to design, develop, and evaluate new methods
for the approximate matching of hierarchical data represented as labeled
trees. In approximate matching scenarios two items should be matched
if they are similar. Computing the similarity between labeled trees is
hard as in addition to the data values also the structure must be
considered. A well-known measure for comparing trees is the tree edit
distance. It is computationally expensive and leads to a prohibitively
high run time.

Our solution for the approximate matching of hierarchical data are
*pq-grams*. The *pq*-grams of a tree are all its
subtrees of a particular shape. Intuitively, two trees are similar if
they have many *pq*-grams in common. The *pq*-gram
distance is an efficient and effective approximation of the tree edit
distance. We analyze the properties of the *pq*-gram distance
and compare it with the tree edit distance and alternative
approximations. The *pq*-grams are stored in the
*pq*-gram index which is implemented as a relation and represents
a *pq*-gram by a fixed-length string. The *pq*-gram index
offers efficient approximate lookups and reduces the approximate
*pq*-gram join to an equality join on strings. We formally proof
that the *pq*-gram index can be incrementally updated based on
the log of edit operations without reconstructing intermediate tree
versions. The incremental update is independent of the data size and
scales to a large number of changes in the data. We introduce windowed
*pq*-grams for the approximate matching of unordered trees.
Windowed *pq*-grams are generated from sorted trees in linear
time. Sorting trees is not possible for common ordered tree distances
such as the edit distance.

We present the address connector, a new system for synchronizing
residential addresses that implements a *pq*-gram based distance
between streets, introduces a global greedy matching that guarantees
stable pairs, and links addresses that are stored with different
granularity. The connector has been successfully tested with public
administration databases. Our extensive experiments on both synthetic
and real world data confirm the analytic results and suggest that
*pq*-grams are both useful and scalable.

## 2006

#### Abstract

Several recent papers argue for approximate lookups in hierarchical
data and propose index structures that support approximate searches in
large sets of hierarchical data. These index structures must be updated
if the underlying data changes. Since the performance of a full index
reconstruction is prohibitive, the index must be updated incrementally.

We propose a *persistent* and *incrementally*
maintainable index for approximate lookups in hierarchical data. The
index is based on small tree patterns, called pq-grams. It supports
efficient updates in response to structure and value changes in
hierarchical data and is based on the log of tree edit operations. We
prove the correctness of the incremental maintenance for sequences of
edit operations. Our algorithms identify a small set of
*pq*-grams that must be updated to maintain the index. The
experimental results with synthetic and real data confirm the
scalability of our approach.

## 2005

#### Abstract

When integrating data from autonomous sources, exact matches of data
items that represent the same real world object often fail due to a lack
of common keys. Yet in many cases structural information is available
and can be used to match such data. As a running example we use
residential address information. Addresses are hierarchical structures
and are present in many databases. Often they are the best, if not only,
relationship between autonomous data sources. Typically the matching
has to be approximate since the representations in the sources differ.

We propose *pq*-grams to approximately match hierarchical
information from autonomous sources. We define the *pq*-gram
distance between ordered labeled trees as an effective and efficient
approximation of the well-known tree edit distance. We analyze the
properties of the *pq*-gram distance and compare it with the edit
distance and alternative approximations. Experiments with synthetic
and real world data confirm the analytic results and the scalability
of our approach.

## 2004

#### Abstract

The integration of data from distributed sources within and among different public authorities is an important aspect on the agenda of e-government. In this paper we present a specific data integration scenario at the Municipality of Bozen-Bolzano, where data from different sources have to be joined via addresses of citizens. We analyze in detail the problems which arise in this scenario. We show that addresses can be represented in a tree structure, and hence, the problem of matching addresses can be reduced to approximate tree matching.

## 2003

#### Abstract

Since a few years digital government is becoming an active research area with lots of promises to revolutionise government and its interaction with citizens and businesses. A crucial point for the success of e-government is the integration and sharing of services and information provided by different authorities. We argue that Web services are a promising technology to solve this problem.