Similarity Search in Large Databases
News
- The kick-off meeting for both the UV and PS formats will take place on in HS I – Christian Doppler. Attendance is mandatory.
- For students enrolled in the PS (Lab) who are unable to attend at 14:15, an alternative kick-off meeting will be held at the regular PS time on .
About this Course
This course comes in two variants:
- Variant A for Computer Science students: combined lecture and lab (termed UV – Übung mit Vorlesung) with 5 ECTS.
- Variant B for Data Science students: lecture (VO, 2.5 ECTS) + lab (PS, 2.5 ECTS).
Data Science students are encouraged to enroll in Variant A (5 ECTS UV rather than VO+PS) to take advantage of the midterms for the lecture part of the course. Recognition of Variant A as equivalent to Variant B (as required for Data Science students) is ensured. Students enrolled in Variant B must pass the lab (PS), and an additional final exam will be required to pass the lecture (VO).
Questions and discussions
For questions and discussions (also among students) regarding course-specific topics, please use the Slack channel #ssdb-uv-ps (Workspace dbteaching.slack.com).
Slack registration: Students can register with their university email here: https://dbteaching.slack.com/signup
Schedule
Schedule of the course according to PlusOnline. Deviations will be communicated explicitly in the Slack channel #ssdb-uv-ps and/or on the course website.Lecture Material
Slides
Each set of slides treats a specific topic area and will be discussed in one or more lecture units. Slides that have not yet been discussed during the lecture may be subject to change. Once a slide set has been discussed in class, only bug fixes will be applied. Slide sets have a version (date) on the title page.
The slides and their discussion during the lecture are essential for exam preparation.
Note: The slide version of last year is already online to give you an overview, but this version may be subject to change.
Topics | Slides | ||
---|---|---|---|
0. | Course Organisation and Demo | [1up] [4up] | |
1. | Introduction to Similarity Search | [1up] [4up] | |
2. | Edit Distance: Definition, Brute Force Algorithm, Dynamic Programming Algorithm, Edit Distance Variants | [1up] [4up] | |
3. | q-Gram Distance: Approximate String Join, Lower Bound Filtering, Length Filter, q-Gram Count Filter, q-Gram Position Filter, q-Gram Distance, Experiments | [1up] [4up] | |
4. | Trees: Tree Definition | [1up] [4up] | |
5. | Tree Edit Distance: Definition, Edit Cost, Edit Mapping, Deriving the Recursive Formulas, Dynamic Programming Algorithm, Complexity | [1up] [4up] | |
6. | Pruning: Traversal String Lower Bound, Constrained Edit Distance Upper Bound | [1up] [4up] | |
7. | Token-Based Tree Distances: Tree Tokens, Binary Branches, pq-Grams | [1up] [4up] | |
8. | Set Similarity Join: Signatures for Overlap and Hamming Distance, Implementation | [1up] [4up] | |
Literature
The following book treats selected lecture topics:
N. Augsten, M. H. Böhlen. Similarity joins in relational database systems.
Synthesis lectures on data management. Morgan Claypool Publishers, 2013.
The book is available online from our university library.
Grading Lecture (VO)
The lecture will be graded in a single, final exam.
The exam will be written. In the case of very few registrations for an exam date, the exam may be held as an oral exam; in this case students will be notified before the exam.
The exam dates will be announced on PlusOnline.
How to prepare for the exam?
- Study the relevant chapters in the textbook (Database System Concepts).
- Attend the lecture and study the slides.
- Actively participate and solve the homework exercises in the lab.
Written exam rules: At the written exam you are allowed to use one A4 sheet with your personal notes (both sides, handwritten or printed).
Previous written exams: 30.01.2023, 17.03.2023, 31.01.2024, 28.02.2024
Oral exam rules: The exam lasts for about 30 minutes. The questions cover all topics treated during the lecture, and some of them are very specific: You are expected to prove theorems, show the execution steps of algorithms on example instances (that you provide), and explain each of the steps.
The grading considers:
- the technical correctness of the answer,
- the depth of the answer, i.e., the level of detail that the answer provides,
- the completeness of the answer, i.e., have all relevant aspects been covered?,
- correct use of terminology,
- the ability to answer in-depth questions.
No documents are allowed during the oral exam.
Registration and withdrawal: Registration or withdrawal from the exam is possible up to 48 hours before the scheduled start. Failure to appear at the exam without prior withdrawal will result in a ban in accordance with §15(7) of the university regulations.
Grading Lab (PS) and UV (Combined Lecture and Lab)
The grading of the course is based on:
- Three midterms: You will write three midterm exams.
- Homework: You will solve exercises at home. By presenting your solutions in class, you can earn extra points.
Midterm Exams
The midterm exams take place during the lab hours and are planned for:
- Wed Nov 5, 16:00
- Wed Dec 10, 16:00
- Wed Jan 28, 16:00
The midterm exam lasts for 45 minutes and you can get a total of 90 points for the three midterms.
Cheat sheet: You may use one A4 sheet with your personal notes (single-sided, handwritten or printed).
Homework
You will solve worksheets at home during the semester. The worksheets must be solved by the due date indicated on the worksheet. The homework worksheets will be available at least 6 days before the due date.
Worksheet Presentation Schedule
Date | Worksheet Nr. |
---|---|
2025-10-15 | Worksheet 01 |
2025-10-29 | Worksheet 02 |
2025-11-19 | Worksheet 03 |
2025-12-03 | Worksheet 04 |
2026-01-07 | Worksheet 05 |
2026-01-21 | Worksheet 06 |
Presentations in Class
In class, students can volunteer to present their solution. If no student volunteers, the lecturer will pick a student for each exercise of the worksheet and ask the student to present the solution.
The quality of the presentation will be rated A, B, or C. Quality criteria for the presentation are:
- the technical correctness of the solution,
- the clarity of the presentation,
- the correct use of terminology, and
- the ability to answer in-depth questions.
Quality ratings of the presentation:
- A: excellent presentation and in-depth understanding of the solution
- B: small errors in the solution or presentation issues, but sufficient understanding of the solution
- C: poor presentation or insufficient understanding of the solution
Each worksheet presentation contributes as follows to the overall score:
- If your presentation was rated A, then the presented exercise of the worksheet counts towards the overall score.
- If your presentation was rated B, you will receive only half the points for the exercise you presented.
- If your presentation was rated C, you will not earn any points for the presented exercise.
The maximum achievable score is 10 points.
Note: There is no guarantee that you will be picked when you volunteer to present your solution, even if you need the worksheet points to pass the course. It is wise to earn worksheet points early enough in the semester.
Scores and Grades
The overall score is the sum of the midterm scores and the presentation scores. The maximum overall score is 100.
The final grade is computed from the overall score as follows:
Score | Grade |
---|---|
81-100 | 1 |
71-80 | 2 |
61-70 | 3 |
50-60 | 4 |
0-49 | 5 |
Attendance and unenrolment: Students must attend at least 75% of the lab sessions to achieve a positive grade. Unenrolments are possible only until before the 3rd lab unit, i.e., all students who are still enrolled at the time of the 3rd lab unit will be graded.