## 2023

#### Abstract

The w-event framework is the current standard for ensuring differential privacy on continuously monitored data streams. Following the proposition of 𝑤-event differential privacy, various mechanisms to implement the framework are proposed. Their comparability in empirical studies is vital for both practitioners to choose a suitable mechanism, and researchers to identify current limitations and propose novel mechanisms. By conducting a literature survey, we observe that the results of existing studies are hardly comparable and partially intrinsically inconsistent. To this end, we formalize an empirical study of 𝑤-event mechanisms by re-occurring elements found in our survey. We introduce requirements on these elements that ensure the comparability of experimental results. Moreover, we propose a benchmark that meets all requirements and establishes a new way to evaluate existing and newly proposed mechanisms. Conducting a large-scale empirical study, we gain valuable new insights into the strengths and weaknesses of existing mechanisms. An unexpected – yet explainable – result is a baseline supremacy, i.e., using one of the two baseline mechanisms is expected to deliver good or even the best utility. Finally, we provide guidelines for practitioners to select suitable mechanisms and improvement options for researchers.

#### Abstract

Given two collections of sets, the set similarity join reports all pairs of sets that are within a given distance threshold. State-of-the-art solutions employ an inverted list index and several heuristics to compute the join result efficiently. Prefix-based solutions benefit from infrequent set elements, known as tokens, and spend considerable time scanning long lists if the token frequency is not sufficiently skewed. Partition-based methods are less sensitive to the token distribution but suffer from a significantly larger memory footprint, limiting their applicability as the threshold or the set sizes grow. Solutions from the domain of metric-based similarity search are designed to reduce the overall number of distance computations. Generic metric techniques cannot compete with state-of-the-art similarity joins tailored to sets, which in turn do not exploit metric filter opportunities. We propose MetricJoin, the first exact set similarity join technique that leverages the metric properties of set distance functions. In contrast to its competitors, MetricJoin is robust, i.e., datasets with different characteristics can be joined efficiently in terms of runtime and memory. Our algorithm embeds sets in vector space, organizes long inverted lists in spatial indexes, and employs an effective metric filter to prune unqualified sets. MetricJoin requires only linear space in the collection size and substantially reduces the number of sets that must be considered. In our performance studies, MetricJoin outperforms state-of-the-art solutions by up to an order of magnitude in runtime and generates up to five orders of magnitude fewer candidates.

#### Abstract

Density-based clustering is a popular concept to find groups of similar objects (i.e., clusters) in a dataset. It is applied in various domains, e.g., process mining and anomaly detection and comes with two user parameters (𝜀, 𝑀𝑖𝑛𝑃𝑡𝑠) that determine the clustering result, but are typically unknown in advance. Thus, users need to interactively test various settings until satisfying clusterings are found. This requires efficient algorithms, which are currently only available for specific data models, e.g., vector space data. We identify the following limitations of data model-independent approaches: (a) Expensive neighborhood computations are ineffectively pruned. (b) Existing indexes only return approximations, where objects are falsely labeled noise. (c) Existing indexes are inflexible as they restrict users to specify density only via 𝜀 whereas 𝑀𝑖𝑛𝑃𝑡𝑠 is constant, which limits explorable clusterings. We propose FINEX, a linear-space index that overcomes these limitations. Our index provides exact clusterings and can be queried with either of the two parameters. FINEX avoids neighborhood computations where possible and reduces the complexity of the remaining computations by leveraging fundamental properties of density-based clusters. Hence, our solution is efficient, flexible, and data model-independent. Moreover, FINEX respects the orginal and straightforward notion of density-based clustering. In our experiments with 8 large real-world datasets from various domains, FINEX shows runtime improvements of at least one order of magnitude over the state-of-the-art technique for exact clustering.

#### Abstract

Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into two main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these two categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness, the time efficiency and the scalability of the considered techniques.

#### Abstract

We study the top-k set similarity search problem using semantic overlap. While vanilla overlap requires exact matches between set elements, semantic overlap allows elements that are syntactically different but semantically related to increase the overlap. The semantic overlap is the maximum matching score of a bipartite graph, where an edge weight between two set elements is defined by a user-defined similarity function, e.g., cosine similarity between embeddings. Common techniques like token indexes fail for semantic search since similar elements may be unrelated at the character level. Further, verifying candidates is expensive (cubic versus linear for syntactic overlap), calling for highly selective filters. We propose KOIOS, the first exact and efficient algorithm for semantic overlap search. KOIOS leverages sophisticated filters to minimize the number of required graph-matching calculations. Our experiments show that for medium to large sets less than 5% of the candidate sets need verification, and more than half of those sets are further pruned without requiring the expensive graph matching. We show the efficiency of our algorithm on four real datasets and demonstrate the improved result quality of semantic over vanilla set similarity search.

#### Abstract

Data Science deals with the discovery of information from large volumes of data. The data studied by scientists in the humanities include large textual corpora. An important objective is to study the ideas and expectations of a society regarding specific concepts, like “freedom” or “democracy”, both for today’s society and even more for societies of the past. Studying the meaning of words using large corpora requires efficient systems for text analysis, so-called distant reading systems. Making such systems efficient calls for a specification of the necessary functionality and clear expectations regarding typical workloads. But this currently is unclear, and there is no benchmark to evaluate distant reading systems. In this article, we propose such a benchmark, with the following innovations: As a first step, we collect and structure the various information needs of the target users. We then formalize the notion of word context to facilitate the analysis of specific concepts. Using this notion, we formulate queries in line with the information needs of users. Finally, based on this, we propose concrete benchmark queries. To demonstrate the benefit of our benchmark, we conduct an evaluation, with two objectives. First, we aim at insights regarding the content of different corpora, i.e., whether and how their size and nature (e.g., popular and broad literature or specific expert literature) affect results. Second, we benchmark different data management technologies. This has allowed us to identify performance bottlenecks.

## 2022

#### Abstract

Today, continuous publishing of differentially private query results is the de-facto standard. However, even today's most advanced privacy frameworks for streams are not customizable enough to consider that privacy goals of humans change as quickly as human life. We name this time-dependent relevance of privacy goals. Instead, upon design time, one needs to estimate the worst case. Then, one hopes that this protection is sufficient and accepts that one protects against this case all the time, even if it is currently not relevant. Designing a privacy framework being aware of time-dependent relevance implies two effects, which - properly exploited { allow to tune data utility beyond incremental design of a novel privacy mechanism for an existing framework. In this paper, we propose such a new framework, named Swellfish Privacy. We also introduce two tools for designing Swellfish-private mechanisms, namely, time-variant sensitivity and a composition theorem, each implying one effect a mechanism can exploit for improving data utility. In a realistic case study, we show that exploiting both effects improves data utility by one to three orders of magnitude compared to state-of-the-art w-event DP mechanisms. Finally, we generalize the case study by showing how to estimate the strength of the effects for arbitrary use cases.

#### Abstract

The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data.
In this paper, we address the problem of JSON similarity lookup queries: given a query document and a distance threshold τ, retrieve all JSON documents that are within τ from the query document. Due to its recursive definition, JSON data is naturally represented as trees. Different from other hierarchical formats such as XML, JSON supports both ordered and unordered sibling collections within a single document. This feature poses a new challenge to the tree model and distance computation. We propose JSON tree, a lossless tree representation of JSON, and define the JSON Edit Distance (JEDI), the first edit-based distance measure for JSON documents. We develop an algorithm, called QuickJEDI, for computing JEDI by leveraging a new technique to prune expensive sibling matchings. It outperforms a baseline algorithm by an order of magnitude in runtime. To boost the performance of JSON similarity queries, we introduce an index called JSIM and a highly effective upper bound based on tree sorting. Our algorithm for the upper bound runs in O(n τ) time and O(n + τ log n) space, which substantially improves the previous best bound of O(n^{2}) time and space (where n is the tree size). Our experimental evaluation shows that our solution scales to databases with millions of documents and can handle large JSON trees with tens of thousands of nodes.

#### Abstract

Many RDF stores treat graphs as simple sets of vertices and edges without a conceptual schema. The statistics in the schema-less RDF stores are based on the cardinality of the keys representing the constants in triple patterns. In this paper, we explore the effects of storing knowledge graphs in an RDF store on the structure of the space of queries and, consequently, on the definition of the framework for the computation of the statistics. We propose a formal framework for an RDF store with a complete conceptual schema. The poset of schema triples defines the structure of the types of triple patterns and, therefore, the structure of the query space. The set of schema triples, together with the ontology of classes and predicates, form the conceptual schema of a knowledge graph, referred to as a schema graph. We present an algorithm for computing the statistics of a schema graph that consists of the schema triples from the stored schema graph and the schema triples that are more general/specific than the stored schema triples up to a user-defined level.

#### Abstract

Computing path queries such as the shortest path in public transport networks is challenging because the path costs between nodes change over time. A reachability query from a node at a given start time on such a network retrieves all points of interest (POIs) that are reachable within a given cost budget. Reachability queries are essential building blocks in many applications, for example, group recommendations, ranking spatial queries, or geomarketing. We propose an efficient solution for reachability queries in public transport networks. Currently, there are two options to solve reachability queries. (1) Execute a modified version of Dijkstra’s algorithm that supports time-dependent edge traversal costs; this solution is slow since it must expand edge by edge and does not use an index. (2) Issue a separate path query for each single POI, i.e., a single reachability query requires answering many path queries. None of these solutions scales to large networks with many POIs. We propose a novel and lightweight reachability index. The key idea is to partition the network into cells. Then, in contrast to other approaches, we expand the network cell by cell. Empirical evaluations on synthetic and real-world networks confirm the efficiency and the effectiveness of our index-based reachability query solution.

#### Abstract

#### Abstract

Entity Resolution suffers from quadratic time complexity. To increase its time efficiency, three kinds of filtering techniques are typically used for restricting its search space: (i) blocking workflows, which group together entity profiles with identical or similar signatures, (ii) string similarity join algorithms, which quickly detect entities more similar than a threshold, and (iii) nearest-neighbor methods, which convert every entity profile into a vector and quickly detect the closest entities according to the specified distance function. Numerous methods have been proposed for each type, but the literature lacks a comparative analysis of their relative performance. As we show in this work, this is a non-trivial task, due to the significant impact of configuration parameters on the performance of each filtering technique. We perform the first systematic experimental study that investigates the relative performance of the main methods per type over 10 real-world datasets. For each method, we consider a plethora of parameter configurations, optimizing it with respect to recall and precision. For each dataset, we consider both schema-agnostic and schema-based settings. The experimental results provide novel insights into the effectiveness and time efficiency of the considered techniques, demonstrating the superiority of blocking workflows and string similarity joins.

## 2021

#### Abstract

Finding similar trajectories is an important task in moving object databases. However, classical similarity models face several limitations, including scalability and robustness. Recently, an approach named t2vec proposed transforming trajectories into points in a high dimensional vector space, and this transformation approximately keeps distances between trajectories. t2vec overcomes that scalability limitation: Now it is possible to cluster millions of trajectories. However, the semantics of the learned similarity values – and whether they are meaningful – is an open issue. One can ask: How does the configuration of t2vec affect the similarity values of trajectories? Is the notion of similarity in t2vec similar, different, or even superior to existing models? As for any neural-network-based approach, inspecting the network does not help to answer these questions. So the problem we address in this paper is how to assess the meaningfulness of similarity in deep trajectory representations. Our solution is a methodology based on a set of well-defined, systematic experiments. We compare t2vec to classical models in terms of robustness and their semantics of similarity, using two real-world datasets. We give recommendations which model to use in possible application scenarios and use cases. We conclude that using t2vec in combination with classical models may be the best way to identify similar trajectories. Finally, to foster scientific advancement, we give the public access to all trained t2vec models and experiment scripts. To our knowledge, this is the biggest collection of its kind.

#### Abstract

Efficient knn computation for high-dimensional data is an important,
yet challenging task. Today, most information systems use a
column-store back-end for relational data. For such systems,
multi-dimensional indexes accelerating selections are known. However,
they cannot be used to accelerate knn queries. Consequently, one
relies on sequential scans, specialized knn indexes, or trades result
quality for speed. To avoid storing one specialized index per query
type, we envision multipurpose indexes allowing to efficiently
compute multiple query types. In this paper, we focus on additionally
supporting knn queries as first step towards this goal. To this end,
we study how to exploit total orders for accelerating knn queries
based on the *sub-space distance equalities observation*. It
means that non-equal points in the full space, which are projected to
the same point in a sub space, have the same distance to every other
point in this sub space. In case one can easily find these equalities
and tune storage structures towards them, this offers two effects one
can exploit to accelerate knn queries. The first effect allows
pruning of point groups based on a cascade of lower bounds. The
second allows to re-use previously computed sub-space distances
between point groups. This results in a worst-case execution bound,
which is independent of the distance function. We present knn
algorithms exploiting both effects and show how to tune a storage
structure already known to work well for multi-dimensional
selections. Our investigations reveal that the effects are robust to
increasing, e.g., the dimensionality, suggesting generally good knn
performance. Comparing our knn algorithms to well-known competitors
reveals large performance improvements up to one order of magnitude.
Furthermore, the algorithms deliver at least comparable performance
as the next fastest competitor suggesting that the algorithms are
only marginally affected by the curse of dimensionality.

#### Abstract

Estimating the cost of a query plan is one of the hardest problems in query optimization. This includes cardinality estimates of string search patterns, of multi-word strings like phrases or text snippets in particular. At first sight, suffix trees address this problem. To curb the memory usage of a suffix tree, one often prunes the tree to a certain depth. But this pruning method “takes away” more information from long strings than from short ones. This problem is particularly severe with sets of long strings, the setting studied here. In this article, we propose respective pruning techniques. Our approaches remove characters with low information value. The various variants determine a character’s information value in different ways, e.g., by using conditional entropy with respect to previous characters in the string. Our experiments show that, in contrast to the well-known pruned suffix tree, our technique provides significantly better estimations when the tree size is reduced by 60% or less. Due to the redundancy of natural language, our pruning techniques yield hardly any error for tree-size reductions of up to 50%.

#### Abstract

Using heterogeneous processing devices, like GPUs, to accelerate relational database operations is a well-known strategy. In this context, the group by operation is highly interesting for two reasons. Firstly, it incurs large processing costs. Secondly, its results (i.e. , aggregates) are usually small reducing data movement costs whose compensation is a major challenge for heterogeneous computing. Generally for group by computation on GPUs, one relies either on sorting or hashing. Today, empirical results suggest that hash-based approaches are superior. However by concept, hashing induces an unpredictable memory access pattern being in conflict with the architecture of GPUs. This motivates studying why current sort-based approaches are generally inferior. Our results indicate that current sorting solutions cannot exploit the full parallel power of modern GPUs. Experimentally, we show that the issue arises from the need to synchronize parallel threads that access the shared memory location containing the aggregates via atomics. Our quantification of the optimal performance motivates us to investigate how to minimize the overhead of atomics. This results in different variants using atomics, where the best variants almost mitigate the atomics overhead entirely. The results of a large-scale evaluation reveal that our approach achieves a 3x speed-up over existing sort-based approaches and up to 2x speed-up over hash-based approaches.

#### Abstract

Many people take the same path every day, such as taking a specific autobahn to get home from work. However, one needs to frequently divert from this path, e.g., to visit a Point of Interest (POI) from a category like the category of restaurants or ATMs. Usually, people want to minimize not only their overall travel cost but also their detour cost, i.e., one wants to return to the known path as fast as possible. Finding such a POI minimizing both costs efficiently is highly challenging in case one considers time-dependent road networks which are the case in real-world scenarios. For such road networks time decency means the time a user needs to traverse a road, heavily depends on the user's arrival time on that road. Prior works have several limitations, such as assuming that travel costs are coming from a metric space and do not change over time. Both assumptions hardly match real-world requirements: Just think of traffic jams at the rush hour. To overcome these limitations, we study how to solve this problem considering time-dependent road networks relying on linear skylines. Our main contribution is an efficient algorithm called STACY to find all non-dominated paths. A large-scale empirical evaluation on real-world data reveals that STACY is accurate, efficient and effective in real-world settings.

#### Abstract

We study techniques for clustering large collections of sets into \mbox{DBSCAN} clusters. Sets are often used as a representation of complex objects to assess their similarity. The similarity of two objects is then computed based on the overlap of their set representations, for example, using Hamming distance. Clustering large collections of sets is challenging. A baseline that executes the standard DBSCAN algorithm suffers from poor performance due to the unfavorable neighborhood-by-neighborhood order in which the sets are retrieved. The DBSCAN order requires the use of a symmetric index, which is less effective than its asymmetric counterpart. Precomputing and materializing the neighborhoods to gain control over the retrieval order often turns out to be infeasible due to excessive memory requirements. We propose a new, density-based clustering algorithm that processes data points in any user-defined order and does not need to materialize neighborhoods. Instead, so-called backlinks are introduced that are sufficient to achieve a correct clustering. Backlinks require only linear space while there can be a quadratic number of neighbors. To the best of our knowledge, this is the first DBSCAN-compliant algorithm that can leverage asymmetric indexes in linear space. Our empirical evaluation suggests that our algorithm combines the best of two worlds: it achieves the runtime performance of materialization-based approaches while retaining the memory efficiency of non-materializing techniques.

## 2020

#### Abstract

Finding similar words with the help of word embedding models, such as Word2Vec or GloVe, computed on large-scale digital libraries has yielded meaningful results in many cases. However, the underlying notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we analyze the statistical distribution of similarity values systematically, conducting two series of experiments. The first one examines how the distribution of similarity values depends on the different embedding model algorithms and parameters. The second one starts by showing that intuitive similarity thresholds do not exist. We then propose a method stating which similarity values and thresholds actually are meaningful for a given embedding model. Based on these results, we calculate how these thresholds, when taken into account during evaluation, change the evaluation scores of the models in similarity test sets. In more abstract terms, our insights give way to a better understanding of the notion of similarity in embedding models and to more reliable evaluations of such models.

#### Abstract

Language is constantly evolving. As part of diachronic linguistics,
semantic change analysis examines how the meanings of words evolve
over time. Such semantic awareness is important to retrieve content
from digital libraries. Recent research on semantic change analysis
relying on word embeddings has yielded significant improvements over
previous work. However, a recent, but somewhat neglected observation
so far is that the rate of semantic shift negatively correlates with
word-usage frequency. In this article, we therefore propose SCAF,
*Semantic Change Analysis with Frequency*. It abstracts from the
concrete embeddings and includes word frequencies as an orthogonal
feature. SCAF allows using different combinations of embedding type,
optimization algorithm and alignment method. Additionally, we
leverage existing approaches for time series analysis, by using
change detection methods to identify semantic shifts. In an
evaluation with a realistic setup, SCAF achieves better detection
rates than prior approaches, 95% instead of 51%. On the Google Books
Ngram data set, our approach detects both known and yet unknown
shifts for popular words.

#### Abstract

In the sciences and elsewhere, the use of relational databases has become ubiquitous. An important challenge is finding hot spots of user interests. In principle, one can discover user interests by clustering the queries in the query log. Such a clustering requires a notion of query similarity. This, in turn, raises the question of what features of SQL queries are meaningful. We have studied the query representations proposed in the literature and corresponding similarity functions and have identified shortcomings of all of them. To overcome these limitations, we propose new similarity functions for SQL queries. They rely on the so-called access area of a query and, more specifically, on the overlap and the closeness of the access areas. We have carried out experiments systematically to compare the various similarity functions described in this article. The first series of experiments measures the quality of clustering and compares it to a ground truth. In the second series, we focus on the query log from the well-known SkyServer database. Here, a domain expert has interpreted various clusters by hand. We conclude that clusters obtained with our new measures of similarity seem to be good indicators of user interests.

#### Abstract

Reproducibility and generalizability are important criteria for today’s data management society. Hence, stand-alone solutions that work well in isolation, but cannot convince at system level lead to a frustrating user experience. As a consequence, in our demo, we take the step of accelerating queries on scientific data by integrating the multi-dimensional index structure Elf into the main-memory-optimized database management system MonetDB. The overall intention is to show that the stand-alone speed ups of using Elf can also be observed when integrated into a holistic system storing scientific data sets. In our prototypical implementation, we demonstrate the performance of an Elf-backed MonetDB on the standard OLAP-benchmark, TPC-H, and the genomic multi-dimensional range query benchmark from the scientific data community. Queries can be run live on both benchmarks by the audience, while they are able to create different indexes to accelerate selection performance.

#### Abstract

Analytical queries often require a mixture of relational and linear algebra operations applied to the same data. This poses a challenge to analytic systems that must bridge the gap between relations and matrices. Previous work has mainly strived to fix the problem at the implementation level. This paper proposes a principled solution at the logical level. We introduce the relational matrix algebra (RMA), which seamlessly integrates linear algebra operations into the relational model and eliminates the dichotomy between matrices and relations. RMA is closed: All our relational matrix operations are performed on relations and result in relations; no additional data structure is required. Our implementation in MonetDB shows the feasibility of our approach, and empirical evaluations suggest that in-database analytics performs well for mixed workloads.

#### Abstract

There exist large amounts of numerical data that are stored in databases and must be analyzed. Database tables come with a schema and include non-numerical attributes; this is crucial contextual information that is needed for interpreting the numerical values. We propose relational matrix operations that support the analysis of data stored in tables and that preserve contextual information. The result of our approach are precisely defined relational matrix operations and a system implementation in MonetDB that illustrates the seamless integration of relational matrix operations into a relational DBMS.

#### Abstract

Hierarchically structured data are commonly represented as trees and have given rise to popular data formats like XML or JSON. An interesting query computes the difference between two versions of a tree, expressed as the minimum set of node edits (deletion, insertion, label rename) that transform one tree into another, commonly known as the tree edit distance. Unfortunately, the fastest tree edit distance algorithms run in cubic time and quadratic space and are therefore not feasible for large inputs. In this paper, we leverage the fact that the difference between two versions of a tree is typically much smaller than the overall tree size. We propose a new tree edit distance algorithm that is linear in the tree size for similar trees. Our algorithm is based on the new concept of top node pairs and avoids redundant distance computations, the main issue with previous solutions for tree diffs. We empirically evaluate the runtime of our algorithm on large synthetic and real-world trees; our algorithm clearly outperforms the state of the art, often by orders of magnitude.

#### Abstract

Computing path queries such as the shortest path in public transport networks is challenging because the path costs between nodes change over time. A reachability query from a node at a given start time on such a network retrieves all points of interest (POIs) that are reachable within a given cost budget. Reachability queries are essential building blocks in many applications, for example, group recommendations, ranking spatial queries, or geomarketing. We propose an efficient solution for reachability queries in public transport networks. Currently, there are two options to solve reachability queries. (1) Execute a modified version of Dijkstra’s algorithm that supports time-dependent edge traversal costs; this solution is slow since it must expand edge by edge and does not use an index. (2) Issue a separate path query for each single POI, i.e., a single reachability query requires answering many path queries. None of these solutions scales to large networks with many POIs. We propose a novel and lightweight reachability index. The key idea is to partition the network into cells. Then, in contrast to other approaches, we expand the network cell by cell. Empirical evaluations on synthetic and real-world networks confirm the efficiency and the effectiveness of our index-based reachability query solution.

#### Abstract

Background

Molecular characters have been added in integrative taxonomic approaches in recent years. Nevertheless, taxon diagnoses are still widely restricted to morphological characters. The inclusion of molecular characters into taxon diagnoses is not only hampered by problems, such as their definition and the designation of their positions in a reference alignment, but also by the technical effort.

Results

DeSignate is a tool for character-based taxon diagnoses that includes a novel ranking scheme. It detects and classifies individual and combined signature characters (diagnostic molecular characters) based on so-called character state vectors. An intuitive web application guides the user through the analysis process and provides the results at a glance. Further, formal definitions and a uniform terminology of characters are introduced.

Conclusions

DeSignate facilitates the inclusion of diagnostic molecular characters and their positions to complement taxon diagnoses. Compared to previous solutions, the tool simplifies the workflow and improves reproducibility and traceability of the results. The tool is freely available as a web application at (https://designate.dbresearch.uni-salzburg.at/) and is open source (https://github.com/DatabaseGroup/DeSignate/).

## 2019

#### Abstract

Although many works in the database community use open data in their experimental evaluation, repeating the empirical results of previous works remains a challenge. This holds true even if the source code or binaries of the tested algorithms are available. In this paper, we argue that providing access to the raw, original datasets is not enough. Real-world datasets are rarely processed without modification. Instead, the data is adapted to the needs of the experimental evaluation in the data preparation process. We showcase that the details of the data preparation process matter and subtle differences during data conversion can have a large impact on the outcome of runtime results. We introduce a data reproducibility model, identify three levels of data reproducibility, report about our own experience, and exemplify our best practices.

#### Abstract

Given a query tree Q, the top-k subtree similarity query retrieves the k subtrees in a large document tree T that are closest to Q in terms of tree edit distance. The classical solution scans the entire document, which is slow. The state-of-the- art approach precomputes an index to reduce the query time. However, the index is large (quadratic in the document size), building the index is expensive, updates are not supported, and data-specific tuning is required. We present a scalable solution for the top-k subtree similarity problem that does not assume specific data types, nor does it require any tuning. The key idea is to process promising subtrees first. A subtree is promising if it shares many labels with the query. We develop a new technique based on inverted lists that efficiently retrieves subtrees in the required order and supports incremental updates of the document. To achieve linear space, we avoid full list materialization but build relevant parts of a list on the fly. In an extensive empirical evaluation on synthetic and real- world data, our technique consistently outperforms the state- of-the-art index w.r.t. memory usage, indexing time, and the number of candidates that must be verified. In terms of query time, we clearly outperform the state of the art and achieve runtime improvements of up to five orders of magnitude.

#### Abstract

The tree similarity join computes all similar pairs in a collection of trees. Two trees are similar if their edit distance falls within a user-defined threshold. Previous algorithms, which are based on a filter-verify approach, suffer from the following two issues. First, ineffective filters produce a large number of candidates that must be further verified. Second, the candidates are verified by computing the tree edit distance, which is cubic in the number of tree nodes. Thus, these techniques fail to scale to large tree collections and are infeasible even for small collections when the trees are large.

In this paper, we present a scalable solution for the tree similarity join that is based on (1) an effective indexing technique that leverages both the labels and the structure of trees to reduce the number of candidates, (2) an efficient upper bound filter that moves many of the candidates directly to the join result without additional verification, (3) a linear time verification technique for the remaining candidates that avoids the expensive tree edit distance. We are the first to propose an efficient, non-trivial upper bound and linear time verification for the tree edit distance. Unlike previous solutions, our approach scales to collections with millions of trees. We improve the overall join time by up to two orders of magnitude w.r.t. the state of the art.

## 2018

#### Abstract

Set similarity joins, which compute pairs of similar sets, constitute an important operator primitive in a variety of applications, including applications that must process large amounts of data. To handle these data volumes, several distributed set similarity join algorithms have been proposed. Unfortunately, little is known about the relative performance, strengths and weaknesses of these techniques. Previous comparisons are limited to a small subset of relevant algorithms, and the large differences in the various test setups make it hard to draw overall conclusions. In this paper we survey ten recent, distributed set similarity join algorithms, all based on the MapReduce paradigm. We empirically compare the algorithms in a uniform test environment on twelve datasets that expose different characteristics and represent a broad range of applications. Our experiments yield a surprising result: All algorithms in our test fail to scale for at least one dataset and are sensitive to long sets, frequent set elements, low similarity thresholds, or a combination thereof. Interestingly, some algorithms even fail to handle the small datasets that can easily be processed in anon-distributed setting. Our analytic investigation of the algorithms pinpoints the reasons for the poor performance and targeted experiments confirm our analytic findings. Based on our investigation, we suggest directions for future research in the area.

#### Abstract

Despite many research efforts, similarity queries are still poorly supported by current systems. We analyze the main stream research in processing similarity queries and argue that a general-purpose query processor for similarity queries is required. We identify three goals for the evaluation of similarity queries (declarative, efficient, combinable) and identify the main research challenges that must be solved to achieve these goals.

## 2017

#### Abstract

The tree edit distance (TED), defined as the minimum-cost sequence of node operations that transform one tree into another, is a well-known distance measure for hierarchical data. Thanks to its intuitive definition, TED has found a wide range of diverse applications like software engineering, natural language processing, and bioinformatics. The state-of-the-art algorithms for TED recursively decompose the input trees into smaller subproblems and use dynamic programming to build the result in a bottom-up fashion. The main line of research deals with efficient implementations of a recursive solution introduced by Zhang in the late 1980s. Another more recent recursive solution by Chen found little attention. Its relation to the other TED solutions has never been studied and it has never been empirically tested against its competitors. In this paper we fill the gap and revisit Chen’s TED algorithm. We analyse the recursion by Chen and compare it to Zhang’s recursion. We show that all subproblems generated by Chen can also origin from Zhang’s decomposition. This is interesting since new algorithms that combine the features of both recursive solutions could be developed. Moreover, we revise the runtime complexity of Chen’s algorithm and develop a new traversal strategy to reduce its memory complexity. Finally, we provide the first experimental evaluation of Chen’s algorithm and identify tree shapes for which Chen’s solution is a promising competitor.

#### Abstract

In this paper an application is developed that functions similar to a recommender system and allows to find appropriate Open-StreetMap (OSM) tags by querying co-occurring keys and tags, as well as similar sets of tags in the database. A user may enter key(s) or key-value pair(s), even using wildcard substitution for both, in order to find keys or key-value pairs that are used in combination with the entered ones. Moreover, the top-k matching tag sets are also presented. The results are then top-k ranked, based on the frequency of the occurrence of each distinct set in the database. This information may enable a user to find the most comprehensive and best fitting tag set for an OSM element. This assumption is examined in an evaluation where the precision and recall metrics for both approaches are computed and compared. Our approach helps discovering combinations of tags and their usage frequency in contrast to common recommender systems that focus on classifying or clustering elements and finding the most accurate (single) class or cluster rather than sets of tags.

#### Abstract

We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams of sets. A prototypical example setting is that of tweets. A tweet is a set of words, and Twitter emits about half a billion tweets per day. Our solution makes it possible to efficiently maintain the top-k most similar tweets from a pair of rapid Twitter streams, e.g., to discover similar trends in two cities if the streams concern cities. Using a sliding window model, the top-k result changes as new sets in the stream enter the window or existing ones leave the window. Maintaining the top-k result under rapid streams is challenging. First, when a set arrives, it may form a new pair for the top-k result with any set already in the window. Second, when a set leaves the window, all its pairings in the top-k are invalidated and must be replaced. It is not enough to maintain the k most similar pairs, as less similar pairs may eventually be promoted to the top-k result. A straightforward solution that pairs every new set with all sets in the window and keeps all pairs for maintaining the top-k result is memory intensive and too slow. We propose SWOOP, a highly scalable stream join algorithm that solves these issues. Novel indexing techniques and sophisticated filters efficiently prune useless pairs as new sets enter the window. SWOOP incrementally maintains a stock of similar pairs to update the top-k result at any time, and the stock is shown to be minimal. Our experiments confirm that SWOOP can deal with stream rates that are orders of magnitude faster than the rates of existing approaches.

## 2016

#### Abstract

Given a query point in a spatial network, the reachability query retrieves all points of interest that are reachable from the query point in a specific amount of time respecting net- work constraints. Reachability queries have a number of interesting applications, and for road networks efficient solutions have been proposed. Road networks are time-independent, i.e., the cost for traversing an edge is constant over time. Efficient algorithms for road networks heavily rely on pre- computing shortest paths. In public transport networks, however, the edge costs depend on schedules, which renders most solutions for road networks inefficient. In particular, shortest paths between node pairs cannot be easily precom- puted because they change over time.

The goal of this work is to develop efficient solutions for reachability queries in public transport networks. The core idea is to partition the network into cells and compute time upper and lower bounds to traverse a cell. At query time, the reachable region is expanded cell by cell (instead of expanding edge by edge). All points of interest that are reachable using upper bound expansion are part of the result; all points that are not reachable in a lower bound expansion can safely be discarded; all other nodes are candidates and must be verified. This paper presents the expansion algorithm and and discusses interesting research directions regarding good network partitions, effective bounds, and efficient candidate verification.

#### Abstract

Set similarity joins compute all pairs of similar sets from two collections of sets. We conduct extensive experiments on seven state-of-the-art algorithms for set similarity joins. These algorithms adopt a filter-verification approach. Our analysis shows that verification has not received enough attention in previous works. In practice, efficient verification inspects only a small, constant number of set elements and is faster than some of the more sophisticated filter techniques. Although we can identify three winners, we find that most algorithms show very similar performance. The key technique is the prefix filter, and AllPairs, the first algorithm adopting this techniques is still a relevant competitor. We repeat experiments from previous work and discuss diverging results. All our claims are supported by a detailed analysis of the factors that determine the overall runtime.

#### Abstract

Hierarchical data are often modelled as trees. An interesting query
identifies pairs of similar trees. The standard approach to tree
similarity is the tree edit distance, which has successfully been
applied in a wide range of applications. In terms of runtime, the
state-of-the-art algorithm for the tree edit distance is RTED, which
is guaranteed to be fast independent of the tree shape. Unfortunately,
this algorithm requires up to twice the memory of its competitors.
The memory is quadratic in the tree size and is a bottleneck for the
tree edit distance computation.

In this paper we present a new, memory efficient algorithm for the tree
edit distance, AP-TED (All Path Tree Edit Distance). Our algorithm runs
at least as fast as RTED without trading in memory efficiency. This is
achieved by releasing memory early during the first step of the
algorithm, which computes a decomposition strategy for the actual
distance computation. We show the correctness of our approach and prove
an upper bound for the memory usage. The strategy computed by AP-TED
is optimal in the class of all-path strategies, which subsumes the class
of LRH strategies used in RTED. We further present the AP-TED+ algorithm,
which requires less computational effort for very small subtrees and
improves the runtime of the distance computation. Our experimental
evaluation confirms the low memory requirements and the runtime
efficiency of our approach.

## 2015

#### Abstract

We consider the classical tree edit distance between ordered labelled
trees, which is defined as the minimum-cost sequence of node edit
operations that transform one tree into another. The state-of-the-art
solutions for the tree edit distance are not satisfactory. The main
competitors in the field either have optimal worst-case complexity but
the worst case happens frequently, or they are very efficient for some
tree shapes but degenerate for others. This leads to unpredictable and
often infeasible runtimes. There is no obvious way to choose between the
algorithms.

In this article we present RTED, a robust tree edit distance algorithm.
The asymptotic complexity of our algorithm is smaller than or equal to
the complexity of the best competitors for any input instance, that is,
our algorithm is both efficient and worst-case optimal. This is achieved
by computing a dynamic decomposition strategy that depends on the input
trees. RTED is shown optimal among all algorithms that use LRH
(left-right-heavy) strategies, which include RTED and the fastest tree
edit distance algorithms presented in literature. In our experiments on
synthetic and real-world data we empirically evaluate our solution and
compare it to the state-of-the-art.

## 2014

SIGMOD '14 Proceedings of the 2014 ACM SIGMOD international conference on Management of data, 2014

#### Abstract

Token similarity joins represent data items as sets of tokens, for
example, strings are represented as sets of q-grams (sub-strings of
length q). Two items are considered similar and match if their token
sets have a large overlap. Previous work on similarity joins in
databases mainly focuses on expressing the overlap computation with
relational operators. The tokens are assumed to preexist in the
database, and the token generation cannot be expressed as part of the
query.

Our goal is to efficiently compute token similarity joins on-the-fly,
i.e., without any precomputed tokens or indexes. We define
*tokenize*, a new relational operator that generates tokens and
allows the similarity join to be fully integrated into relational
databases. This allows us to (1) optimize the token generation as part
of the query plan, (2) provide the query optimizer with cardinality
estimates for tokens, (3) choose efficient algorithms based on the query
context. We discuss algebraic properties, cardinality estimates, and
an efficient iterator algorithm for tokenize. We implemented our
operator in the kernel of PostgreSQL and empirically evaluated its
performance for similarity joins.

#### Abstract

Hierarchical data are often modelled as trees. An interesting query
identifies pairs of similar trees. The standard approach to tree
similarity is the tree edit distance, which has successfully been
applied in a wide range of applications. In terms of runtime, the
state-of-the-art for the tree edit distance is the RTED algorithm, which
is guaranteed to be fast independently of the tree shape. Unfortunately,
this algorithm uses twice the memory of the other, slower algorithms.
The memory is quadratic in the tree size and is a bottleneck for the
tree edit distance computation.

In this paper we present a new, memory efficient algorithm for the tree
edit distance. Our algorithm runs at least as fast as RTED, but requires
only half the memory. This is achieved by systematically releasing
memory early during the first step of the algorithm, which computes a
decomposition strategy and is the main memory bottleneck. We show the
correctness of our approach and prove an upper bound for the memory
usage. Our empirical evaluation confirms the low memory requirements
and shows that in practice our algorithm performs better than the
analytic guarantees suggest.

#### Abstract

Set similarity joins compute all pairs of similar sets from two
collections of sets. Set similarity joins are typically implemented
in a filter-verify framework: a filter generates candidate pairs,
possibly including false positives, which must be verified to produce
the final join result. Good filters produce a small number of false
positives, while they reduce the time they spend on hopeless candidates.
The best known algorithms generate candidates using the so-called prefix
filter in conjunction with length- and position-based filters.

In this paper we show that the potential of length and position have
only partially been leveraged. We propose a new filter, the
*position-enhanced length filter*, which exploits the matching
position to incrementally tighten the length filter; our filter
identifies hopeless candidates and avoids processing them. The filter
is very efficient, requires no change in the data structures of most
prefix filter algorithms, and is particularly effective for foreign
joins, i.e., joins between two different collections of sets.

## 2013

#### Abstract

State-of-the-art database systems manage and process a variety of complex objects, including strings and trees. For such objects equality comparisons are often not meaningful and must be replaced by similarity comparisons. This book describes the concepts and techniques to incorporate similarity into database systems. We start out by discussing the properties of strings and trees, and identify the edit distance as the de facto standard for comparing complex objects. Since the edit distance is computationally expensive, token-based distances have been introduced to speed up edit distance computations. The basic idea is to decompose complex objects into sets of tokens that can be compared efficiently. Token-based distances are used to compute an approximation of the edit distance and prune expensive edit distance calculations. A key observation when computing similarity joins is that many of the object pairs, for which the similarity is computed, are very different from each other. Filters exploit this property to improve the performance of similarity joins. A filter preprocesses the input data sets and produces a set of candidate pairs. The distance function is evaluated on the candidate pairs only. We describe the essential query processing techniques for filters based on lower and upper bounds. For token equality joins we describe prefix, size, positional and partitioning filters, which can be used to avoid the computation of small intersections that are not needed since the similarity would be too low.

#### Abstract

Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster a so-called proximity graph is computed, and the cluster border is detected based on the proximity graph. However, the computation of the proximity graph is expensive and the state-of-the-art GPC algorithms only approximate the proximity graph using a sampling technique. Further, the quality of GPC clusters has never been compared to standard clustering techniques like k-means, density-based, or hierarchical clustering. In this article the authors propose two efficient algorithms, PG-DS and PG-SM, for the exact computation of proximity graphs. The authors experimentally show that our solutions are faster even if the sampling-based algorithms use very small sample sizes. The authors provide a thorough experimental evaluation of GPC and conclude that it is very efficient and shows good clustering quality in comparison to the standard techniques. These results open a new perspective on string clustering in settings, where no knowledge about the input data is available.

#### Abstract

Different databases often store information about the same or related objects in the real world. To enable collaboration between these databases, data items that refer to the same object must be identified. Residential addresses are data of particular interest as they often provide the only link between related pieces of information in different databases. Unfortunately, residential addresses that describe the same location might vary considerably and hence need to be synchronized. Non-matching street names and addresses stored at different levels of granularity make address synchronization a challenging task. Common approaches assume an authoritative reference set and correct residential addresses according to the reference set. Often, however, no reference set is available, and correcting addresses with different granularity is not possible. We present the address connector, which links residential addresses that refer to the same location. Instead of correcting addresses according to an authoritative reference set, the connector defines a lookup function for residential addresses. Given a query address and a target database, the lookup returns all residential addresses in the target database that refer to the same location. The lookup supports addresses that are stored with different granularity. To align the addresses of two matching streets, we use a global greedy address-matching algorithm that guarantees a stable matching. We define the concept of address containment that allows us to correctly link addresses with different granularity. The evaluation of our solution on real-world data from a municipality shows that our solution is both effective and efficient.

CIKM '13 Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, 2013

#### Abstract

The problem of generating a cost-minimal edit script between two trees
has many important applications. However, finding such a cost-minimal
script is computationally hard, thus the only methods that scale are
approximate ones. Various approximate solutions have been proposed
recently. However, most of them still show quadratic or worse runtime
complexity in the tree size and thus do not scale well either. The only
solutions with log-linear runtime complexity use simple matching
algorithms that only find corresponding subtrees as long as these
subtrees are equal. Consequently, such solutions are not robust at all,
since small changes in the leaves which occur frequently can make all
subtrees that contain the changed leaves unequal and thus prevent the
matching of large portions of the trees. This problem could be avoided
by searching for similar instead of equal subtrees but current
similarity approaches are too costly and thus also show quadratic
complexity. Hence, currently no robust log-linear method exists.

We propose the random walks similarity (RWS) measure which can be used
to find similar subtrees rapidly. We use this measure to build the
RWS-Diff algorithm that is able to compute an approximately cost-minimal
edit script in log-linear time while having the robustness of a
similarity-based approach. Our evaluation reveals that random walk
similarity indeed increases edit script quality and robustness
drastically while still maintaining a runtime comparable to simple
matching approaches.

## 2012

#### Abstract

We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. Common to all approaches is a two-step procedure: first, we extract and linearize the structural information from documents, and then, we use similarity measures that are based on, respectively, Kolmogorov complexity and Shannon entropy to determine the distance between the documents. Compared to other approaches, we are able to achieve a linear run-time complexity and demonstrate in an experimental evaluation that the results of our technique in terms of clustering quality are on a par with or even better than those of other, slower approaches.

#### Abstract

In data integration applications, a join matches elements that are common to two data sources. Since elements are represented slightly different in each source, an approximate join must be used to do the matching. For XML data, most existing approximate join strategies are based on some ordered tree matching technique, such as the tree edit distance. In data-centric XML, however, the sibling order is irrelevant, and two elements should match even if their subelement order varies. Thus, approximate joins for data-centric XML must leverage unordered tree matching techniques. This is computationally hard since the algorithms cannot rely on a predefined sibling order. In this paper, we give a solution for approximate joins based on unordered tree matching. The core of our solution are windowed pq-grams which are small subtrees of a specific shape. We develop an efficient technique to generate windowed pq-grams in a three-step process: sort the tree, extend the sorted tree with dummy nodes, and decompose the extended tree into windowed pq-grams. The windowed pq-grams distance between two trees is the number of pq-grams that are in one tree decomposition only. We show that our distance is a pseudo-metric and empirically demonstrate that it effectively approximates the unordered tree edit distance. The approximate join using windowed pq-grams can be efficiently implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Experiments with synthetic and real world data confirm the analytic results and show the effectiveness and efficiency of our technique.

#### Abstract

MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is being used increasingly in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution while scientific data sets are often highly skewed. The resulting load imbalance, which raises the processing time, is even amplified by high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. An accurate cost estimation is the basis for adaptive load balancing algorithms and requires to gather statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets approximating the global data distribution.

#### Abstract

The popularity of MapReduce systems for processing large data sets in
both industry and science has increased drastically over the last years.
While sample applications often found in literature, for example, word
count, are rather simple, e-science applications tend to be complex,
thereby posing new challenges to MapReduce systems. The high runtime
complexity of e-science applications on the one hand, and skewed data
distributions often encountered in scientific data sets on the other
hand, lead to highly varying reducer execution times. These, in turn,
cause high overall execution times and poor resource utilisation.

In this paper, we tackle the challenge of balancing the workload on
the reducers, considering both complex reduce tasks and skewed data.
We define the partition cost model which takes into account non-linear
reducer tasks, and provide an algorithm for efficient cost estimation
in a distributed environment. Finally, we present two load balancing
approaches, fine partitioning and dynamic fragmentation, based on our
partition cost model. Both these approaches can be seamlessly integrated
into existing MapReduce systems like Hadoop. We evaluate our solutions
using both synthetic, and real e-science data.

## 2011

#### Abstract

We consider the classical tree edit distance between ordered labeled
trees, which is defined as the minimum-cost sequence of node edit
operations that transform one tree into another. The state-of-the-art
solutions for the tree edit distance are not satisfactory. The main
competitors in the field either have optimal worst-case complexity, but
the worst case happens frequently, or they are very efficient for some
tree shapes, but degenerate for others. This leads to unpredictable and
often infeasible runtimes. There is no obvious way to choose between
the algorithms.

In this paper we present RTED, a robust tree edit distance algorithm.
The asymptotic complexity of RTED is smaller or equal to the complexity
of the best competitors for any input instance, i.e., RTED is both
effcient and worst-case optimal. We introduce the class of LRH
(Left-Right-Heavy) algorithms, which includes RTED and the fastest tree
edit distance algorithms presented in literature. We prove that RTED
outperforms all previously proposed LRH algorithms in terms of runtime
complexity. In our experiments on synthetic and real world data we
empirically evaluate our solution and compare it to the
state-of-the-art.

#### Abstract

We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree within a large document tree using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity, and, thus, do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper bound for the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.

#### Abstract

In many applications, for example, in data integration scenarios,
strings must be matched if they are similar. String similarity joins,
which match all pairs of similar strings from two datasets, are of
particular interest and have recently received much attention in the
database research community. Most approaches, however, assume a global
similarity threshold; all string pairs that exceed the threshold form
a match in the join result. The global threshold approach has two major
problems: (a) the threshold depends on the (mostly unknown) data
distribution, (b) often there is no single threshold that is good for
all string pairs.

In this paper we propose the *PG-Join* algorithm, a novel string
similarity join that requires no configuration and uses an
*adaptive threshold*. PG-Join computes a so-called proximity
graph to derive an individual threshold for each string. Computing the
proximity graph efficiently is essential for the scalability of PG-Join.
To this end we develop a new and fast algorithm, PG-I, that computes
the proximity graph in two steps: First an efficient approximation is
computed, then the approximation error is fixed incrementally until
the adaptive threshold is stable. Our extensive experiments on
real-world and synthetic data show that PG-I is up to five times faster
than the state-of-the-art algorithm and suggest that PG-Join is a useful
and effective join paradigm.

#### Abstract

MapReduce systems have become popular for processing large data sets
and are increasingly being used in e-science applications. In contrast
to simple application scenarios like word count, e-science applications
involve complex computations which pose new challenges to MapReduce
systems. In particular, (a) the runtime complexity of the reducer task
is typically high, and (b) scientific data is often skewed. This leads
to highly varying execution times for the reducers. Varying execution
times result in low resource utilisation and high overall execution
time since the next MapReduce cycle can only start after all reducers
are done.

In this paper we address the problem of efficiently processing MapReduce
jobs with complex reducer tasks over skewed data. We define a new cost
model that takes into account non-linear reducer tasks and we provide
an algorithm to estimate the cost in a distributed environment. We
propose two load balancing approaches, fine partitioning and dynamic
fragmentation, that are based on our cost model and can deal with both
skewed data and complex reduce tasks. Fine partitioning produces a fixed
number of data partitions, dynamic fragmentation dynamically splits
large partitions into smaller portions and replicates data if necessary.
Our approaches can be seamlessly integrated into existing MapReduce
systems like Hadoop. We empirically evaluate our solution on both
synthetic data and real data from an e-science application.

#### Abstract

String data is omnipresent and appears in a wide range of applications.
Often string data must be partitioned into clusters of similar strings,
for example, for cleansing noisy data. A promising string clustering
approach is the recently proposed Graph Proximity Cleansing (GPC). A
distinguishing feature of GPC is that it automatically detects the
cluster borders without knowledge about the underlying data, using the
so-called proximity graph. Unfortunately, the computation of the
proximity graph is expensive. In particular, the runtime is high for
long strings, thus limiting the application of the state-of-the-art
GPC algorithm to short strings.

In this work we present two algorithms, PG-Skip and PG-Binary, that
efficiently compute the GPC cluster borders and scale to long strings.
PG-Skip follows a prefix pruning strategy and does not need to compute
the full proximity graph to detect the cluster border. PG-Skip is much
faster than the state-of-the-art algorithm, especially for long strings,
and computes the exact GPC borders. We show the optimality of PG-Skip
among all prefix pruning algorithms. PG-Binary is an efficient
approximation algorithm, which uses a binary search strategy to detect
the cluster border. Our extensive experiments on synthetic and
real-world data confirm the scalability of PG-Skip and show that
PG-Binary approximates the GPC clusters very effectively.

## 2010

#### Abstract

Graph Proximity Cleansing (GPC) is a string clustering algorithm that
automatically detects cluster borders and has been successfully used
for string cleansing. For each potential cluster a so-called proximity
graph is computed, and the cluster border is detected based on the
proximity graph. Unfortunately, the computation of the proximity graph
is expensive and the state-of-the-art GPC algorithms only approximate
the proximity graph using a sampling technique.

In this paper we propose two efficient algorithms for the exact
computation of proximity graphs. The first algorithm, PG-DS, is based
on a divide-skip technique for merging inverted lists, the second
algorithm, PG-SM, uses a sort-merge join strategy to compute the
proximity graph. While the state-of-the-art solutions only approximate
the correct proximity graph, our algorithms are exact. We experimentally
evaluate our solution on large real world datasets and show that our
algorithms are faster than the sampling-based approximation algorithms,
even for very small sample sizes.

#### Abstract

We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree, e.g., a DBLP article with 15 nodes, in a large document tree, e.g., DBLP with 26M nodes, using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity, and, thus, do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper-bound for the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.

#### Abstract

String data is omnipresent and appears in a wide range of applications.
Often string data must be partitioned into clusters of similar strings,
for example, for cleansing noisy data. A promising string clustering
approach is the recently proposed Graph Proximity Cleansing (GPC). A
distinguishing feature of GPC is that it automatically detects the
cluster borders without knowledge about the underlying data, using the
so-called proximity graph. Unfortunately, the computation of the
proximity graph is expensive. In particular, the runtime is high for
long strings, thus limiting the application of the state-of-the-art
GPC algorithm to short strings.

In this work we present two algorithms, PG-Skip and PG-Binary, that
efficiently compute the GPC cluster borders and scale to long strings.
PG-Skip follows a prefix pruning strategy and does not need to compute
the full proximity graph to detect the cluster border. PG-Skip is much
faster than the state-of-the-art algorithm, especially for long strings,
and computes the exact GPC borders. We show the optimality of PG-Skip
among all prefix pruning algorithms. PG-Binary is an efficient
approximation algorithm, which uses a binary search strategy to detect
the cluster border. Our extensive experiments on synthetic and
real-world data confirm the scalability of PG-Skip and show that
PG-Binary approximates the GPC clusters very effectively.

## 2008

#### Abstract

In data integration applications, a join matches elements that are common to two data sources. Often, however, elements are represented slightly different in each source, so an approximate join must be used. For XML data, most approximate join strategies are based on some ordered tree matching technique. But in data-centric XML the order is irrelevant: two elements should match even if their subelement order varies. In this paper we give a solution for the approximate join of unordered trees. Our solution is based on windowed pq-grams. We develop an efficient technique to systematically generate windowed pq-grams in a three-step process: sorting the unordered tree, extending the sorted tree with dummy nodes, and computing the windowed pq-grams on the extended tree. The windowed pq-gram distance between two sorted trees approximates the tree edit distance between the respective unordered trees. The approximate join algorithm based on windowed pq-grams is implemented as an equality join on strings which avoids the costly computation of the distance between every pair of input trees. Our experiments with synthetic and real world data confirm the analytic results and suggest that our technique is both useful and scalable.

#### Abstract

The goal of this thesis is to design, develop, and evaluate new methods
for the approximate matching of hierarchical data represented as labeled
trees. In approximate matching scenarios two items should be matched
if they are similar. Computing the similarity between labeled trees is
hard as in addition to the data values also the structure must be
considered. A well-known measure for comparing trees is the tree edit
distance. It is computationally expensive and leads to a prohibitively
high run time.

Our solution for the approximate matching of hierarchical data are
*pq-grams*. The *pq*-grams of a tree are all its
subtrees of a particular shape. Intuitively, two trees are similar if
they have many *pq*-grams in common. The *pq*-gram
distance is an efficient and effective approximation of the tree edit
distance. We analyze the properties of the *pq*-gram distance
and compare it with the tree edit distance and alternative
approximations. The *pq*-grams are stored in the
*pq*-gram index which is implemented as a relation and represents
a *pq*-gram by a fixed-length string. The *pq*-gram index
offers efficient approximate lookups and reduces the approximate
*pq*-gram join to an equality join on strings. We formally proof
that the *pq*-gram index can be incrementally updated based on
the log of edit operations without reconstructing intermediate tree
versions. The incremental update is independent of the data size and
scales to a large number of changes in the data. We introduce windowed
*pq*-grams for the approximate matching of unordered trees.
Windowed *pq*-grams are generated from sorted trees in linear
time. Sorting trees is not possible for common ordered tree distances
such as the edit distance.

We present the address connector, a new system for synchronizing
residential addresses that implements a *pq*-gram based distance
between streets, introduces a global greedy matching that guarantees
stable pairs, and links addresses that are stored with different
granularity. The connector has been successfully tested with public
administration databases. Our extensive experiments on both synthetic
and real world data confirm the analytic results and suggest that
*pq*-grams are both useful and scalable.

## 2006

#### Abstract

Several recent papers argue for approximate lookups in hierarchical
data and propose index structures that support approximate searches in
large sets of hierarchical data. These index structures must be updated
if the underlying data changes. Since the performance of a full index
reconstruction is prohibitive, the index must be updated incrementally.

We propose a *persistent* and *incrementally*
maintainable index for approximate lookups in hierarchical data. The
index is based on small tree patterns, called pq-grams. It supports
efficient updates in response to structure and value changes in
hierarchical data and is based on the log of tree edit operations. We
prove the correctness of the incremental maintenance for sequences of
edit operations. Our algorithms identify a small set of
*pq*-grams that must be updated to maintain the index. The
experimental results with synthetic and real data confirm the
scalability of our approach.

## 2005

#### Abstract

When integrating data from autonomous sources, exact matches of data
items that represent the same real world object often fail due to a lack
of common keys. Yet in many cases structural information is available
and can be used to match such data. As a running example we use
residential address information. Addresses are hierarchical structures
and are present in many databases. Often they are the best, if not only,
relationship between autonomous data sources. Typically the matching
has to be approximate since the representations in the sources differ.

We propose *pq*-grams to approximately match hierarchical
information from autonomous sources. We define the *pq*-gram
distance between ordered labeled trees as an effective and efficient
approximation of the well-known tree edit distance. We analyze the
properties of the *pq*-gram distance and compare it with the edit
distance and alternative approximations. Experiments with synthetic
and real world data confirm the analytic results and the scalability
of our approach.

## 2004

#### Abstract

The integration of data from distributed sources within and among different public authorities is an important aspect on the agenda of e-government. In this paper we present a specific data integration scenario at the Municipality of Bozen-Bolzano, where data from different sources have to be joined via addresses of citizens. We analyze in detail the problems which arise in this scenario. We show that addresses can be represented in a tree structure, and hence, the problem of matching addresses can be reduced to approximate tree matching.

## 2003

#### Abstract

Since a few years digital government is becoming an active research area with lots of promises to revolutionise government and its interaction with citizens and businesses. A crucial point for the success of e-government is the integration and sharing of services and information provided by different authorities. We argue that Web services are a promising technology to solve this problem.