Similarity Search in Large Databases
News
- The due date for worksheet 06 has been extended to Tuesday Jan 14th 2025 - 23:59.
About this Course
This course comes in two variants:
- Variant A for Computer Science students: combined lecture and lab (termed UV - Übung mit Vorlesung) with 5 ECTS.
- Variant B for Data Science students: lecture (VO, 2.5 ECTS) + lab (PS, 2.5 ECTS)
Data Science students are encouraged to enroll into Variant A (5 ECTS UV rather than VO+PS) to take advantage of the midterms for the lecture part of the course. The recognition of Variant A as Variant B (as required for Data Science students) is ensured. For students enrolled into Variant B the midterms will be part of the lab grade (PS), and an additional final exam will be required to pass the lecture (VO).
Lecture
Questions and discussions
For questions and discussions (also among students) regarding course specific topics please use the Slack channel #ssdb-uv-ps (Workspace dbteaching.slack.com).
Slack registration: Students register with their university email here: https://dbteaching.slack.com/signup
Schedule
Schedule of the course according to PlusOnline. Deviations will be communicated explicitly in the Slack channel #ssdb-uv-ps and/or the course website.Slides
Each set of slides treats a specific topic area and will be discussed in one or more lecture units. Slides that have not yet been discussed during the lecture may be subject to change. Once a slide set has been discussed in class, only bug fixes will be applied. Slide sets have a version (date) on the title page.
The slides and their discussion during the lecture are essential for the exam perparation.
Note: The slide version of last year is already online to give you an overview, but this version may be subject to change.
Topics | Slides | ||
---|---|---|---|
0. | Course Organisation and Demo | [1up] [4up] | |
1. | Introduction to Similarity Search | [1up] [4up] | |
2. | Edit Distance: Definition, Brute Force Algorithm, Dynamic Programming Algorithm, Edit Distance Variants | [1up] [4up] | |
3. | q-Gram Distance: Approximate String Join, Lower Bound Filtering, Length Filter, q-Gram Count Filter, q-Gram Position Filter, q-Gram Distance, Experiments | [1up] [4up] | |
4. | Trees: Tree Definition | [1up] [4up] | |
5. | Tree Edit Distance: Definition, Edit Cost, Edit Mapping, Deriving the Recursive Formulas, Dynamic Programming Algorithm, Complexity | [1up] [4up] | |
6. | Pruning: Traversal String Lower Bound, Constrained Edit Distance Upper Bound | [1up] [4up] | |
7. | Token-Based Tree Distances: Tree Tokens, Binary Branches, pq-Grams | [1up] [4up] | |
8. | Set Similarity Join: Signatures for Overlap and Hamming Distance, Implementation | [1up] [4up] | |
Literature
The following book treats selected lecture topics:
N. Augsten, M. H. Böhlen. Similarity joins in relational database systems.
Synthesis lectures on data management. Morgan Claypool Publishers, 2013.
The book is available online from our university library.