DESQ - Declarative and Efficient Similarity Queries
Abstract
Database systems, which store and query large amounts of data, are indispensable in almost every software application. A salient feature of many database systems is declarative query processing: the user describes the answer to the query (what?) rather than the techniques to compute the answer (how?). When a query is based on exact matches (e.g., find all orders of a customer given her customer ID), the database system transparently translates the query into an efficient program that computes the required answer. Unfortunately, so-called similarity queries do not yet enjoy this kind of transparent, declarative support.
In a similarity query, two data objects "match" if they are similar. Similarity queries are required in scenarios where equality and exact matches are not effective, for example, when dealing with errors and inconsistencies in the data. This frequently happens when data must be integrated from multiple sources, for example, when in-house data should be enriched from external sources.
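The difference between an exact match and a similarity match can be illustrated with a minimal sketch. It assumes edit distance as the similarity measure and a threshold of one edit; the project description prescribes neither. The names in_house and external and the sample records are hypothetical.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity_join(left: List[str], right: List[str],
                    max_distance: int) -> List[Tuple[str, str]]:
    """Pair up all strings whose edit distance is within the threshold."""
    return [(l, r) for l in left for r in right
            if edit_distance(l, r) <= max_distance]


# Hypothetical sample data: in-house records and an external source.
in_house = ["John Smith", "Maria Garcia"]
external = ["Jon Smith", "Maria Garcia", "Wei Chen"]

# An exact-match join misses the misspelled name.
exact = [(l, r) for l in in_house for r in external if l == r]

# A similarity join with a threshold of one edit also finds it.
similar = similarity_join(in_house, external, max_distance=1)
```

Here the exact join pairs only the two identical "Maria Garcia" records, while the similarity join additionally pairs "John Smith" with "Jon Smith", which differ by a single deleted character.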
Past research efforts have focused on developing effective similarity measures for various application domains and on the efficient processing of specific types of similarity queries. However, these techniques remain isolated solutions, and their integration into a database system has received little attention. Thus, applications that require advanced similarity features cannot rely on general-purpose systems that transparently handle data storage and querying. Instead, similarity queries must be dealt with in an ad-hoc way, for example, by manually extending the database system or developing custom software. Both approaches are cumbersome, costly, and inefficient.
In this project, we bridge this gap and study similarity queries from a broader systems perspective. We want to develop a deep understanding of all aspects of similarity queries that are required to build a general-purpose query processor for this query type. The overall goal is to integrate similarity queries into declarative databases and to process them efficiently in a systems context.
The key ideas for integrating similarity queries into systems are (a) the decomposition of similarity queries into small, atomic operators, (b) the automatic generation of alternative query plans using efficient processing techniques available in the database, and (c) the cost assessment of the plan alternatives and the execution of the cheapest plan.
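The three ideas can be sketched in miniature as follows. This is only an illustration under stated assumptions, not the project's design: the two plans, the Plan class, the choose_plan function, and especially the cost formulas (including the constant 0.3) are hypothetical placeholders for a real operator set and cost model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pairs = List[Tuple[str, str]]


def edit_distance(a: str, b: str) -> int:
    """Atomic verification operator: Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def nested_loop_plan(left: List[str], right: List[str], k: int) -> Pairs:
    """Plan 1: verify every pair."""
    return [(l, r) for l in left for r in right if edit_distance(l, r) <= k]


def length_filter_plan(left: List[str], right: List[str], k: int) -> Pairs:
    """Plan 2: group by length first. Strings whose lengths differ by
    more than k cannot be within edit distance k, so they are skipped."""
    by_length = defaultdict(list)
    for r in right:
        by_length[len(r)].append(r)
    result = []
    for l in left:
        for n in range(len(l) - k, len(l) + k + 1):
            for r in by_length.get(n, []):
                if edit_distance(l, r) <= k:
                    result.append((l, r))
    return result


@dataclass
class Plan:
    name: str
    cost: Callable[[int, int], float]  # hypothetical cost estimate
    run: Callable[[List[str], List[str], int], Pairs]


# Assumed cost model: the nested loop verifies n * m pairs; the length
# filter pays m to build its groups and verifies an assumed 30% of pairs.
PLANS = [
    Plan("nested_loop", lambda n, m: n * m, nested_loop_plan),
    Plan("length_filter", lambda n, m: m + 0.3 * n * m, length_filter_plan),
]


def choose_plan(n: int, m: int) -> Plan:
    """Pick the plan with the lowest estimated cost."""
    return min(PLANS, key=lambda p: p.cost(n, m))


def similarity_join(left: List[str], right: List[str], k: int) -> Pairs:
    """Declarative entry point: the caller never names a plan."""
    return choose_plan(len(left), len(right)).run(left, right, k)
```

Both plans return the same pairs and differ only in how much work they do. Under the assumed cost model, the nested loop is chosen for very small inputs, where building the length groups does not pay off, and the length filter is chosen otherwise.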
If successful, this project will provide a basis for building general-purpose database systems that can efficiently deal with declarative similarity queries. Database users will no longer need to write ad-hoc code. Instead, even non-expert users will be able to pose similarity queries and have them answered efficiently.