Funded Research Projects

DESQ - Declarative and Efficient Similarity Queries

Duration: 2022 - 2026
Funding: 394'616 EUR (Austrian Science Fund - FWF, P 34962-N Einzelprojekte)
DESQ -- Declarative and Efficient Similarity Queries

Database systems, which deal with storing and querying large amounts of data, are indispensable in almost any application software context. A salient feature of many database systems is declarative query processing: the user describes the answer to the query (what?) rather than the techniques to compute the answer (how?). When a query is based on exact matches (e.g., find all orders of a customer given her customer ID), the database system transparently translates the query into an efficient program that computes the required answer. Unfortunately, this development has not happened for so-called similarity queries.

In a similarity query, two data objects "match" if they are similar. Similarity queries are required in scenarios where equality and exact matches are not effective, for example, when dealing with errors and inconsistencies in the data. This frequently happens when data must be integrated from multiple sources, for example, when in-house data should be enriched from external sources.

Past research efforts have focused on developing effective similarity measures for various application domains and the efficient processing of specific types of similarity queries. However, these techniques remain isolated solutions and their integration into a database system has received little attention. Thus, applications that require advanced similarity features cannot rely on general-purpose systems that transparently handle data storage and querying. Instead, similarity queries must be dealt with in an ad-hoc way, for example, by manually extending the database system or developing custom software. Both approaches are cumbersome, cost-intensive, and inefficient.

In this project we bridge this gap and study similarity queries from a broader systems perspective. We want to develop a deep understanding of all aspects of similarity queries that are required to build a general-purpose query processor for this query type. The overall goal is the integration of similarity queries into declarative databases and their efficient processing in a systems context.

The key ideas for integrating similarity queries into systems are (a) the decomposition of the similarity queries into small, atomic operators, (b) the automatic generation of alternative query plans using efficient processing techniques available in the database, (c) the cost assessment of the plan alternatives and the execution of the cheapest plan.

When successful, this project will provide a basis for building general-purpose database systems that can efficiently deal with declarative similarity queries. Database users will no longer need to write ad-hoc code. Instead, similarity queries can be efficiently answered also by non-expert users.

BOSS 1.0: Biblical Online Synopsis Salzburg 1.0

Duration: 2021 - 2024
Partners: Kristin De Troyer (FB Bibelwissenschaft und Kirchengeschichte, Univ. Salzburg),
Funding: 249'721 EUR (Federal State of Salzburg, Austria)
BOSS 1.0: Biblical Online Synopsis Salzburg 1.0

Dieses Projekt ist Teil eines internationalen Vorhabens ein open-access online Tool (BOS) zu entwickeln, welches es erlaubt, alte Versionen biblischer Schriften auf Text, Syntax und Semantikebene miteinander zu vergleichen. Bisher wurden Synopsen, deren Informationsinhalt auf eine Doppelseite begrenzt ist, nur in Manuskriptform oder als gedruckte Edition erstellt. BOS wird nicht nur der erste frei-online zugängliche Hub für alle digitalisierten Bibeltexte werden, sondern bietet selbst Analysemöglichkeiten auf Basis neuester Ansätze aus dem Bereich des maschinellen Lernens. Im Projekt wird anfänglich eine manuelle Synopse für das Buch Josua erstellt. Die Synopse bildet die Grundlage für den Kooperationspartner aus der Informatik. Die Idee besteht darin, Teile der manuellen Analyse mithilfe maschinellen Lernens sowie statistischen Methoden zu automatisieren und dann mittels der manuell erzeugten Ergebnisse zu verifizieren. Derartige Ansätze wurden bereits bei der Auswertung philosophischer Texte getestet, jedoch noch nie systematisch auf Bibeltexte angewendet. Die Schwierigkeit liegt darin, dass die Texte in verschiedenen altertümlichen Sprachen vorliegen und zwischen ihrer Entstehung hunderte von Jahren liegen. Dies erfordert eine systematische Evaluierung existierender Techniken und, mit großer Wahrscheinlichkeit, eine Erweiterung um neuartige Ansätze. Diese Problematik ist weit verbreitet in den Humanities Disziplinen. Im Kern sollen die Ergebnisse aus der Informatik den Forschungsprozess zum einen massiv beschleunigen und zum anderen Analysen ermöglichen, die derzeit aufgrund der begrenzten menschlichen Lesegeschwindigkeit und Merkvermögens nicht möglich sind. Das Projektergebnis wir dabei sowohl die Art verändern, wie Gelehrte die biblischen Texte betrachten, als auch die Vielfältigkeit der Texte einer breiten Öffentlichkeit zugänglich machen. Somit leisten wir einen wichtigen Beitrag dazu, die Bibelwissenschaften in das digitale Zeitalter zu bringen.

FFTED - Fast and Flexible Tree Edit Distance

Duration: 2017 - 2022
Funding: 399'699 EUR (Austrian Science Fund - FWF, P 29859 Einzelprojekte)
FFTED - Fast and Flexible Tree Edit Distance

Data that are structured in hierarchies are called trees. An example of tree data is the organizational hierarchy of a company with departments, their managers, and subordinate employees. Other examples include enterprise assets, bills of material, natural or computer language syntax trees, directories in file systems, or molecular structures in bioinformatics.

When tree data change, new tree versions are generated. An important task is to compute the difference (called diff) between two version of a tree. A good diff is as concise as possible. A common approach to compute diffs for trees is the so-called tree edit distance (TED), which computes the minimal diff.

The computation of the minimal diff is challenging. Current solutions suffer from two problems that make them impractical in many scenarios: (1) Efficiency: The computation of the minimal diff is too slow. When the tree size doubles, the runtime increase by a factor of four or more. Current techniques are applicable to hierarchies with about ten thousand elements, but many interesting trees have millions of elements (e.g., bills of material). (2) Flexibility: TED computes the most concise diff and disregards any application semantics, which may lead to non-intuitive results. For example, TED may turn a department into an employee to keep the diff small.

Various attempts have been made to overcome these problems, but all solutions suffer from at least one of the following issues: quality is traded off against speed (i.e., the diffs are too verbose), often without any guarantee about the deviation from the optimal result; the application semantics are hard-wired in the solution, thus restricting its scope to very specific applications; or only one of the issues (efficiency or flexibility) is addressed.

The goal of this project is to solve the efficiency and flexibility problems of TED without trading off quality. We will allow users to specify constraints that fit their application semantics as part of the input (e.g., departments cannot be turned into employees), and the minimal diff that respects these constraints will be computed. This will make our solution applicable to a wide range of different domains. In order to gain efficiency, we will exploit two observations that have received little attention in the past. First, changes between tree versions typically affect only a small portion of the tree; by focusing the computational effort on the parts that actually changed we hope to achieve considerable runtime improvements. Second, user constraints limit the number of allowable diffs, and fewer options must be considered; we plan to leverage this fact as well to gain performance.

If successful, our solution will be the first one to deal with user-defined constraints and compute the minimal diff efficiently.

DK GIScience

Website: dk-giscience.zgis.net
Duration: 2015 - 2019
Partners: Z_GIS, Dept. of Geographie and Geology (Univ. Salzburg), ZAMG, Louisiana State University
Funding: 2'133'000 EUR (Austrian Science Fund - FWF)

A doctoral college in Geographic Information Sciences at the University of Salzburg. In the context of this project, the database research group works on extend GIS capabilities of database systems with a focus on two main topics:

Isochrone Queries for SQL

An isochrone in a spatial network is a possibly disconnected subgraph that covers all space points from which a specific point of interest is reachable within a given time period at a specific point in time. Isochrones have interesting applications, for example, geomarketing (e.g., positioning of franchise stores) or urban planing (e.g., finding spots in the city that are hard to reach by public transportation). Isochrones are particularly interesting as part of a larger query that combines spatial and non-spatial aspects. Answering such queries is challenging, in particular when multimodal networks are involved. The goal of the project is to extend SQL with the notion of isochrones and develop efficient evaluation strategies for isochrones in multimodal networks.

Large Scale Similarity Queries for Geocoding

Data often have an implicit spatial aspect (e.g., a restaurant has a location even if the location is not stored in the database). Enriching data with explicit spatial references, called geocoding, adds value to the data. In order to introduce spatial references into a data set, non-spatial attributes must be used to link the non-geocoded data to pre-existing geocoded data, i.e., a join must be computed. Since in geocoding the joined datasets often originate from different sources, there may be no common key value. Then, computing the join is challenging: exact join conditions (which are efficient and well studied) will fail since data items that represent the same real world object may differ. The goal of this project is to advance the state of the art in processing similarity queries on large data volumes in GIS-enabled relational database systems.


Software Projects

Tree Edit Distance

On the following website, we list our publications on similarity queries over hierarchical data with a focus on node edit distances as a similarity measure for trees. Further, the websites provides links to the datasets and source that were used in our experimental evaluations.

Website: tree-edit-distance.dbresearch.uni-salzburg.at

SSJoin

We empirically evaluated set similarity join techniques. We provide an extended version of our PVLDB paper as well as the runnable source code for the experiments.

Website: ssjoin.dbresearch.uni-salzburg.at