Similarity Search in Large Databases

Lecture:
Language:
German / English
Office hours:
Semester:
WS 2024/2025

News

About this Course

This course comes in two variants:

Data Science students are encouraged to enroll into Variant A (5 ECTS UV rather than VO+PS) to take advantage of the midterms for the lecture part of the course. The recognition of Variant A as Variant B (as required for Data Science students) is ensured. For students enrolled into Variant B the midterms will be part of the lab grade (PS), and an additional final exam will be required to pass the lecture (VO).

Lecture

Questions and discussions

For questions and discussions (also among students) regarding course specific topics please use the Slack channel #ssdb-uv-ps (Workspace dbteaching.slack.com).

Slack registration: Students register with their university email here: https://dbteaching.slack.com/signup

Schedule

Schedule of the course according to PlusOnline. Deviations will be communicated explicitly in the Slack channel #ssdb-uv-ps and/or the course website.

Slides

Each set of slides treats a specific topic area and will be discussed in one or more lecture units. Slides that have not yet been discussed during the lecture may be subject to change. Once a slide set has been discussed in class, only bug fixes will be applied. Slide sets have a version (date) on the title page.

The slides and their discussion during the lecture are essential for the exam perparation.

Note: The slide version of last year is already online to give you an overview, but this version may be subject to change.

Topics Slides
0. Course Organisation and Demo [1up] [4up]
1. Introduction to Similarity Search [1up] [4up]
2. Edit Distance: Definition, Brute Force Algorithm, Dynamic Programming Algorithm, Edit Distance Variants [1up] [4up]
3. q-Gram Distance: Approximate String Join, Lower Bound Filtering, Length Filter, q-Gram Count Filter, q-Gram Position Filter, q-Gram Distance, Experiments [1up] [4up]
4. Trees: Tree Definition [1up] [4up]
5. Tree Edit Distance: Definition, Edit Cost, Edit Mapping, Deriving the Recursive Formulas, Dynamic Programming Algorithm, Complexity [1up] [4up]
6. Pruning: Traversal String Lower Bound, Constrained Edit Distance Upper Bound [1up] [4up]
7. Token-Based Tree Distances: Tree Tokens, Binary Branches, pq-Grams [1up] [4up]
8. Set Similarity Join: Signatures for Overlap and Hamming Distance, Implementation [1up] [4up]

Literature

The following book treats selected lecture topics:

N. Augsten, M. H. Böhlen. Similarity joins in relational database systems.
Synthesis lectures on data management. Morgan Claypool Publishers, 2013.

The book is available online from our university library.

Grading

The grading of the course is based on:

  1. Two midterms: You will write two midterm exams.
  2. Homework: You will solve exercises at home and present your solution in class. The grading depends on the number of homework exercises that you solve and the quality of your presentions in class.

Midterm Exams

The midterm exams take place during the lecture hours and are planned for:

  1. Wed Dez 4th, 14:15
  2. Wed Jan 29th, 14:15

You can get a total of 70 points for the two midterm exams.

Homework

You will solve worksheets at home during the semester. The worksheets must be solved by the due date indicated on the worksheet.

By the due date of the worksheet, you also tick the exercises that you solved in Blackboard. Ticking an exercises means that

Be sure not to miss the deadline for ticking exercises: there will be no extensions.

A score (number of points) is assigned to each homework exercise. The score depends on the complexity and difficulty of the exercise. For a single worksheet, at most 10 points can be achieved.

The homework worksheets will be available at least 6 days before the due date.

Presentations in Class

In class, the lecturer will pick a student for each exercise of the homework and ask the student to present the solution. The lecturer (and the fellow students) will ask questions about the solution. The quality of the presentation will be rated A, B, or C. Quality criteria for the presentation are

Quality ratings of the presentation:

Scores for the Homework

You get points for each of the homework worksheets. The overall score is the sum over the points for all worksheets.

Each worksheet contributes as follows to the overall score:

Scores and Grades

The overall score is the sum of the midterm score and the homework score. The maximum overall score is 140.

You need to achieve a midterm score of at least 35 points and a homework score of at least 35 points to pass the course.

If both the midterm score and the worksheet score are are at least 35 each, the final grade is computed from the overall score as follows:

Percent    Score    Grade
81-100 113-140 1
71-80 99-112 2
61-70 85-98 3
50-60 70-84 4
0-49 0-69 5

Attendance and unenrolment: The students must attend at least 75% of the lab lectures to achieve a positive grade. Unenrolements are possible only until before the 3rd lab unit, i.e., all students that are still enrolled at the time of the 3rd lab unit will be graded.