Courses
Thesis Proposals
A top-k subtree similarity query finds the k subtrees in a large
document tree that are most similar to a given reference tree (the query
tree). The similarity between two trees is assessed using the so-called
tree edit distance (TED), which is the minimum number of node
edit operations that transform one tree into another. The tree edit
distance for ordered trees (OTED) can be computed in cubic time and
quadratic space in the number of tree nodes. A tree is ordered if
all siblings follow a specific order, otherwise the tree is considered
unordered.
The Database research group recently proposed an efficient solution for
top-k subtree similarity queries for ordered labeled trees [1]. However,
the problem remains unsolved for unordered trees.
[...Read more...]
This thesis is in collaboration with Munich-based company CELONIS, world market
leader in process mining. CELONIS motivates the problem as follows:
Celonis entwickelt eine in C++ und Java geschriebene Process Mining Engine, die von weltweit
führenden Unternehmen eingesetzt wird, um Ihre Unternehmensprozesse zu analysieren und
zu verbessern. Die Engine verarbeitet Queries in der von uns entwickelten, proprietären
DSL PQL, die für Process Mining und die dabei üblicherweise verwendeten Snowflake-Schemata
optimiert wurde.
Um die Engine für weitere Tools zu öffnen, möchten wir im Rahmen einer
Master- oder Bachelorarbeit die Möglichkeiten zur Unterstützung von SQL
evaluieren.
[...Read more...]
Former Thesis Proposals
This thesis is in collaboration with FINDOLOGIC, a search engine for
online shops. FINDOLOGIC motivates the problem as follows:
Eine Suche nach "Hemd" in einem Bekleidungs-Shop findet massenhaft Hemden,
die alle gleich relevant sind. Im schlechtesten Fall kommen auf der ersten
Seite der Ergebnisse nur blaue Herren-Hemden. Um für solche
allgemeinen Anfragen einen guten Überblick über das
Artikelsortiment zu liefern ist eine Durchmischung der Artikel
wünschenswert, so dass z.B. Hemden in verschiedenen Farben gelistet
werden.
[...Read more...]
In the world of rapid data growth, similarity search has emerged.
Interesting questions are for example: Do we have in our dataset two
different data items which represent the same real world entity? Which
data items from two different datasets represent the same real world
object? Are two data items similar and, if yes, how?
[...Read more...]
The tree edit distance measure is usually the first choice for comparing
tree-structured data. It is defined as the minimum number of simple edit
operations (delete, insert, and relabel) to transform one tree into
another. However, due to its cubic runtime complexity it is disregarded
in some applications. Efficient approximation techniques are required,
especially for large input instances. Examples of large trees (having
often millions of nodes) are merger trees of galaxies in astrophysics,
or assets and bills of material in enterprise resource planning.
[...Read more...]
Many applications deal with hierarchical data. Examples are XML and JSON
files which can be represented as trees. In data integration, a join
matches elements that are common to two data sources. Since elements are
represented slightly different in each source, an approximate join must
be used to do the matching. Such a join searches for pairs of similar
items. When dealing with massive datasets or large tree instances,
efficient techniques must be developed to compute the join in reasonable
time.
[...Read more...]
The tree edit distance (TED) is a commonly used measure for comparing
rooted ordered labelled trees. TED is defined as the minimum number of
simple node operations (delete, insert, relabel) that transform one tree
into another. The first recursive formula for TED has been introduced in
the late 1970's by Tai. Since then, several algorithms implementing that
formula have been proposed. The recent and most efficient development is
the RTED algorithm by Pawlik and Augsten (2011). Unfortunately, Demaine
et al. have shown that all algorithms that implement the recursive formula
by Tai have at least cubic runtime in the worst case.
[...Read more...]
This thesis is in collaboration with FINDOLOGIC, a search engine for
online shops. FINDOLOGIC motivates the problem as follows:
Eine große Herausforderung bei der Suche in Online-Shops und der
Suche allgemein ist es, die Absicht des Suchenden zu verstehen. Für
einen Menschen ist klar, dass bei einer Suche nach "Smartphone" das
Samsung Galaxy oder iPhone gewünscht ist. Vielfach werden aber
Artikel wie "Smartphone Cover" und andere Zubehörteile gefunden, da
in diesen das Wort "Smartphone" im Titel vorkommt, beim iPhone nicht.
Gleiches gilt für "Festplatte" (findet den "Computer mit Festplatte",
aber nicht die "Festplatte") oder den "Anzug" (findet den "Schlafanzug",
obwohl sicherlich ein "Herren-Anzug" gemeint ist). Die spannende Frage
ist hier, wie erkannt werden kann, dass ein iPhone ein "Smartphone" ist
und der "Bugatti blau" ein "Herren-Anzug". Lösungen hierfür
böten einen extremen Mehrwert für Online-Shops.
[...Read more...]