Similarity Search in Large Databases (2021 WS)

Similarity Search in Large Databases

Lecture:

Nikolaus Augsten

Proseminar:

Nikolaus Augsten

Language:

German / English

Office hours:

Slack

Semester:

WS 2021/2022

Course Evaluations:

WS 2015/16

PlusOnline:

Lecture
Proseminar

News

Question answering session on Wed Jan 19, 10:15, online in a Webex meeting.
Report deadline extended: the proseminar reports are due within 3 weeks after the talk.
A template for the reports is now available.
The program of the proseminar presentations is online.

Lecture

The lecture will be held online. Videos with explainations on the slides will be made available in Blackboard.

Questions and discussions

For questions and discussions (also among students) regarding course specific topics please use the Slack channel #ssdb-vo-2021ws (Workspace dbteaching.slack.com).

Slack registration: Students register with their university email hier: https://dbteaching.slack.com/signup

Slides

The lecture will be held online. Videos with explainations on the slides will be made available.

Each set of slides treats a specific topic area and will be discussed in one or more lecture units. Slides that have not yet been discussed during the lecture may be subject to change. Once a slide has been treated in the lecture only bug fixes will be applied. Slide sets have a version (date) on the title page.

The slides and their discussion during the lecture are essential for the exam.

Note: The slide version of last year is already online to give you an overview, but this version may be subject to changes. The chapter numbers refer to the book "Database System Concepts" (see above).

	Thema	Folien
0.	Course Organisation and Demo	[1up] [4up]
1.	Introduction to Similarity Search	[1up] [4up]
2.	Edit Distance: Definition, Brute Force Algorithm, Dynamic Programming Algorithm, Edit Distance Variants	[1up] [4up]
3.	q-Gram Distance: Approximate String Join, Lower Bound Filtering, Length Filter, q-Gram Count Filter, q-Gram Position Filter, q-Gram Distance, Experiments	[1up] [4up]
4.	Trees: Tree Definition	[1up] [4up]
5.	Tree Edit Distance: Definition, Edit Cost, Edit Mapping, Deriving the Recursive Formulas, Dynamic Programming Algorithm, Complexity	[1up] [4up]
6.	Pruning: Traversal String Lower Bound, Constrained Edit Distance Upper Bound	[1up] [4up]
7.	Reference Sets: Pruning with Reference Sets	[1up] [4up]
8.	Binary Branch Distance: Algorithm, Lower Bound Proof, Complexity	[1up] [4up]
9.	pq-Gram Distance: Algorithm and Lower Bound Theorem	[1up] [4up]

Literature

The following book treats selected lecture topics:

N. Augsten, M. H. Böhlen. Similarity joins in relational database systems.
Synthesis lectures on data management. Morgan Claypool Publishers, 2013.

The book is available online from our university library.

Exam

The exam will be oral (by default). If an oral exam is not feasible due to the high number of registrations, the exam mode will be changed to written. In this case, the students will be timely informed.

The language of the oral exam can be chosen by the student (English or German).

Oral exam rules

Oral exam: The exam lasts for about 30 minutes. The questions cover all topics treated during the lecture, and some of them are very specific: You are expected to proof theorems, show the execution steps of algorithms on example instances (that you come up with), and explain each of the steps.

The grading considers:

the technical correctness of the answer,
the profoundnes of the answer, i.e., the level of detail that the answer provides,
the completeness of the answer, i.e., have all relevant aspects be covered?,
correct use of the terminology,
the ability to answer in-depth questions.

No documents are allowed during the oral exam.

Ablauf mündliche Prüfung

Mündliche Prüfung Die Prüfung dauert ca. 30 Minuten. Die Fragen decken den gesamten Vorlesungsstoff ab und sind zum Teil sehr konkret: Es wird erwartet, dass Sie Beweise führen sowie Algorithmen an Beispielen (die Sie teilweise selbst erfinden müssen) durchrechnen und erklären können.

Bewertet werden

die technische Korrektheit der Antwort,
der Detailgrad der Antwort, d.h. wie tief-schürfend wurde das Thema behandelt,
die Vollständigkeit der Antwort, d.h. wurden alle Aspekte berücksichtigt,
der Gebrauch der passenden Fachterminologie,
die Fähigkeit, auf vertiefende Fragen einzugehen.

Zur mündlichen Prüfung können keine Unterlagen verwendet werden.

You may register or cancel the registration up to 48 hours before the exam. If you do not show up for a registered exam, you will be blocked for this exam according to the university regulations §15(7).

Proseminar

The slides of the kickoff meeting outline the organization of the proseminar.

Important dates

~~Wed Oct 13: form groups~~
~~Thu Oct 14: online meeting (Webex)~~
~~Wed Oct 20: groups pick topic (register topic on Slack #ssdb-ps-2021ws - first come, first served)~~
Presentations start on Nov 4, 2021: detailed program

List of papers

Exact set similarity techniques

(AllPairs) R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proc. WWW, pages 131–140, 2007
(PartEnum) Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006: 918-929
(Partition) D. Deng, G. Li, H. Wen, and J. Feng. An efficient partition based method for exact set similarity joins. PVLDB, 9(4):360–371, 2015
(SkipJoin) X. Wang, L. Qin, X. Lin, Y. Zhang, and L. Chang. Leveraging set relations in exact set similarity join. Proceedings of the VLDB Endowment, 10(9), 2017
(SizeAware) Dong Deng, Yufei Tao, Guoliang Li: Overlap Set Similarity Joins with Theoretical Guarantees. SIGMOD Conference 2018: 905-920
(JOSIE) Erkang Zhu, Dong Deng, Fatemeh Nargesian, Renée J. Miller: JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. SIGMOD Conference 2019: 847-864
(Top-k join) Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang: Top-k Set Similarity Joins. ICDE 2009: 916-927
(Exact LSH) Rasmus Pagh: Locality-sensitive Hashing without False Negatives. SODA 2016: 1-9
(Set Clustering) Daniel Kocher, Nikolaus Augsten, Willi Mann: Scaling Density-Based Clustering to Large Collections of Sets. EDBT 2021: 109-120

Approximate set similarity techniques

(MinHash) Andrei Z. Broder: On the resemblance and containment of documents. SEQUENCES 1997: 21-29
(LSH foundations) Piotr Indyk, Rajeev Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998: 604-613
(ChosenPath – theory) Tobias Christiani, Rasmus Pagh: Set similarity search beyond MinHash. STOC 2017: 1094-1107
(ChosenPath – applied) Tobias Christiani, Rasmus Pagh, Johan Sivertsen: Scalable and Robust Set Similarity Join. ICDE 2018: 1240-1243

Tree similarity techniques

(Join) Thomas Hütter, Mateusz Pawlik, Robert Loschinger, Nikolaus Augsten: Effective Filters and Linear Time Verification for Tree Similarity Joins. ICDE 2019: 854-865
(Top-K) Daniel Kocher, Nikolaus Augsten: A Scalable Index for Top-k Subtree Similarity Queries. SIGMOD Conference 2019: 1624-1641
(JSON Diff) Thomas Hütter, Nikolaus Augsten, Christoph Kirsch, Michael J. Carey, Chen Li. Similarity Joins for Large Scale Document Stores. Technical Report.

Presentation Program

There will be eleven presentations, the first presentation will take place on Nov 4, 2021: detailed program

Report Template

Please use this template for your report. You find the instructions in the template.

Grading Scheme

You can earn points from the presentation, the reports, and the participation in the discussion after the talks and in the Slack channel.

Presentation: max 10 points
Reports: max 12 points (6 points per graded report)
Participation: max 2 points

Grading scheme: For a positive grade you must (1) give a group presentation, (2) submit at least 4 reports (=extended abstracts), (3) attend at least 8 talks (including your own ← this is new). Then the grade is computed based on the points that your earned as follows:

13-15 points: 4
16-18 points: 3
19-21 points: 2
22-24 points: 1