## Similarity Search in Large Databases

## News

- Question answering session on Wed Jan 19, 10:15, online in a Webex meeting.
**Report deadline extended**: the proseminar reports are due**within 3 weeks**after the talk.- A template for the reports is now available.
- The program of the proseminar presentations is online.

## Lecture

The lecture will be held **online**. Videos with explainations on the slides will be made available in Blackboard.

### Questions and discussions

For questions and discussions (also among students) regarding course specific topics please use the Slack channel #ssdb-vo-2021ws (Workspace dbteaching.slack.com).

*Slack registration:* Students register with their university email hier: https://dbteaching.slack.com/signup

### Slides

The lecture will be held online. Videos with explainations on the slides will be made available.

Each set of slides treats a specific topic area and will be discussed in one or more lecture units. Slides that have not yet been discussed during the lecture may be subject to change. Once a slide has been treated in the lecture only bug fixes will be applied. Slide sets have a version (date) on the title page.

The slides and their discussion during the lecture are essential for the exam.

*Note:* The slide version of last year is already
online to give you an overview, but this version may be subject to
changes. The chapter numbers refer to the book
"Database System Concepts" (see above).

Thema | Folien | ||||||
---|---|---|---|---|---|---|---|

0. | Course Organisation and Demo |
[1up] [4up] | |||||

1. | Introduction to Similarity Search |
[1up] [4up] | |||||

2. | Edit Distance: Definition, Brute Force Algorithm, Dynamic Programming Algorithm, Edit Distance Variants |
[1up] [4up] | |||||

3. | q-Gram Distance: Approximate String Join, Lower Bound Filtering, Length Filter, q-Gram Count Filter, q-Gram Position Filter, q-Gram Distance, Experiments |
[1up] [4up] | |||||

4. | Trees: Tree Definition |
[1up] [4up] | |||||

5. | Tree Edit Distance: Definition, Edit Cost, Edit Mapping, Deriving the Recursive Formulas, Dynamic Programming Algorithm, Complexity |
[1up] [4up] | |||||

6. | Pruning: Traversal String Lower Bound, Constrained Edit Distance Upper Bound |
[1up] [4up] | |||||

7. | Reference Sets: Pruning with Reference Sets |
[1up] [4up] | |||||

8. | Binary Branch Distance: Algorithm, Lower Bound Proof, Complexity |
[1up] [4up] | |||||

9. | pq-Gram Distance: Algorithm and Lower Bound Theorem |
[1up] [4up] |

### Literature

The following book treats selected lecture topics:

N. Augsten, M. H. Böhlen. Similarity joins in relational database systems.

Synthesis lectures on data management. Morgan Claypool Publishers, 2013.

The book is available online from our university library.

### Exam

The exam will be oral (by default). If an oral exam is not feasible due to the high number of registrations, the exam mode will be changed to written. In this case, the students will be timely informed.

The language of the oral exam can be chosen by the student (English or German).

*Oral exam:*
The exam lasts for about 30 minutes. The questions cover all topics treated during the lecture, and some of them are very specific: You are expected to proof theorems, show the execution steps of algorithms on example instances (that you come up with), and explain each of the steps.

The grading considers:

- the technical correctness of the answer,
- the profoundnes of the answer, i.e., the level of detail that the answer provides,
- the completeness of the answer, i.e., have all relevant aspects be covered?,
- correct use of the terminology,
- the ability to answer in-depth questions.

No documents are allowed during the oral exam.

*Mündliche Prüfung*
Die Prüfung dauert ca. 30 Minuten. Die Fragen decken den gesamten Vorlesungsstoff ab und sind zum Teil sehr konkret: Es wird erwartet, dass Sie Beweise führen sowie Algorithmen an Beispielen (die Sie teilweise selbst erfinden müssen) durchrechnen und erklären können.

Bewertet werden

- die technische Korrektheit der Antwort,
- der Detailgrad der Antwort, d.h. wie tief-schürfend wurde das Thema behandelt,
- die Vollständigkeit der Antwort, d.h. wurden alle Aspekte berücksichtigt,
- der Gebrauch der passenden Fachterminologie,
- die Fähigkeit, auf vertiefende Fragen einzugehen.

Zur mündlichen Prüfung können keine Unterlagen verwendet werden.

## Proseminar

The slides of the kickoff meeting outline the organization of the proseminar.

### Important dates

~~Wed Oct 13: form groups~~~~Thu Oct 14: online meeting (Webex)~~~~Wed Oct 20: groups pick topic (register topic on Slack #ssdb-ps-2021ws - first come, first served)~~- Presentations start on Nov 4, 2021: detailed program

### List of papers

#### Exact set similarity techniques

- (AllPairs) R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proc. WWW, pages 131–140, 2007
- (PartEnum) Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006: 918-929
- (Partition) D. Deng, G. Li, H. Wen, and J. Feng. An efficient partition based method for exact set similarity joins. PVLDB, 9(4):360–371, 2015
- (SkipJoin) X. Wang, L. Qin, X. Lin, Y. Zhang, and L. Chang. Leveraging set relations in exact set similarity join. Proceedings of the VLDB Endowment, 10(9), 2017
- (SizeAware) Dong Deng, Yufei Tao, Guoliang Li: Overlap Set Similarity Joins with Theoretical Guarantees. SIGMOD Conference 2018: 905-920
- (JOSIE) Erkang Zhu, Dong Deng, Fatemeh Nargesian, Renée J. Miller: JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. SIGMOD Conference 2019: 847-864
- (Top-k join) Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang: Top-k Set Similarity Joins. ICDE 2009: 916-927
- (Exact LSH) Rasmus Pagh: Locality-sensitive Hashing without False Negatives. SODA 2016: 1-9
- (Set Clustering) Daniel Kocher, Nikolaus Augsten, Willi Mann: Scaling Density-Based Clustering to Large Collections of Sets. EDBT 2021: 109-120

#### Approximate set similarity techniques

- (MinHash) Andrei Z. Broder: On the resemblance and containment of documents. SEQUENCES 1997: 21-29
- (LSH foundations) Piotr Indyk, Rajeev Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998: 604-613
- (ChosenPath – theory) Tobias Christiani, Rasmus Pagh: Set similarity search beyond MinHash. STOC 2017: 1094-1107
- (ChosenPath – applied) Tobias Christiani, Rasmus Pagh, Johan Sivertsen: Scalable and Robust Set Similarity Join. ICDE 2018: 1240-1243

#### Tree similarity techniques

- (Join) Thomas Hütter, Mateusz Pawlik, Robert Loschinger, Nikolaus Augsten: Effective Filters and Linear Time Verification for Tree Similarity Joins. ICDE 2019: 854-865
- (Top-K) Daniel Kocher, Nikolaus Augsten: A Scalable Index for Top-k Subtree Similarity Queries. SIGMOD Conference 2019: 1624-1641
- (JSON Diff) Thomas Hütter, Nikolaus Augsten, Christoph Kirsch, Michael J. Carey, Chen Li. Similarity Joins for Large Scale Document Stores. Technical Report.

### Presentation Program

There will be eleven presentations, the first presentation will take place on Nov 4, 2021: detailed program

### Report Template

Please use this template for your report. You find the instructions in the template.

### Grading Scheme

You can earn points from the presentation, the reports, and the participation in the discussion after the talks and in the Slack channel.

- Presentation: max 10 points
- Reports: max 12 points (6 points per graded report)
- Participation: max 2 points

Grading scheme: For a positive grade you must (1) give a group presentation, (2) submit at least 4 reports (=extended abstracts), (3) attend at least 8 talks (including your own ← this is new). Then the grade is computed based on the points that your earned as follows:

- 13-15 points: 4
- 16-18 points: 3
- 19-21 points: 2
- 22-24 points: 1