## Similarity Search in Large Databases

## News

**Lab (PS)**: The lab classes will take place**Wed 12:00-13:00**, room HS I - Christian Doppler. The kick-off meeting will be on**Oct 11**(no lab on Oct 4).**Lecture (VO)**: The lectures will take place**Wed 13:15-15:45**, room HS I - Christian Doppler.

## Curriculum Notes

The winter term 2023/24 is the last one in which the course "Similarity Search in Large Databases" will be held as a lecture (2 VO) with an associated lab (1 PS).

Starting next year (2023/24), the course will be held as an exercise with lecture (3 UV - "Übung mit Vorlesung"): attendance will be mandatory and there will be no exam dates after the end of the course. If you only need one of the two, either the lecture or the lab, please make sure you attend this semester.

The course held this semester (2 VO + 1 PS) can be recognized as 3 UV for those already enrolled in the new MSc Computer Science curriculum (2023).

## Lecture (VO)

### Questions and discussions

For questions and discussions (also among students) regarding course specific topics please use the Slack channel #ssdb-vo-ps (Workspace dbteaching.slack.com).

*Slack registration:* Students register with their university email here: https://dbteaching.slack.com/signup

### Schedule

Schedule of the lecture according to PlusOnline. Deviations will be communicated explicitly in the Slack channel #ssdb-vo-ps and/or the course website. The exam dates will be announced in PlusOnline.### Slides

Each set of slides treats a specific topic area and will be discussed in one or more lecture units. Slides that have not yet been discussed during the lecture may be subject to change. Once a slide set has been discussed in class, only bug fixes will be applied. Slide sets have a version (date) on the title page.

The slides and their discussion during the lecture are essential for the exam perparation.

*Note:* The slide version of last year is already
online to give you an overview, but this version may be subject to
change.

Topics | Slides | ||
---|---|---|---|

0. | Course Organisation and Demo |
[1up] [4up] | |

1. | Introduction to Similarity Search |
[1up] [4up] | |

2. | Edit Distance: Definition, Brute Force Algorithm, Dynamic Programming Algorithm, Edit Distance Variants |
[1up] [4up] | |

3. | q-Gram Distance: Approximate String Join, Lower Bound Filtering, Length Filter, q-Gram Count Filter, q-Gram Position Filter, q-Gram Distance, Experiments |
[1up] [4up] | |

4. | Trees: Tree Definition |
[1up] [4up] | |

5. | Tree Edit Distance: Definition, Edit Cost, Edit Mapping, Deriving the Recursive Formulas, Dynamic Programming Algorithm, Complexity |
[1up] [4up] | |

6. | Pruning: Traversal String Lower Bound, Constrained Edit Distance Upper Bound |
[1up] [4up] | |

7. | Token-Based Tree Distances: Tree Tokens, Binary Branches, pq-Grams |
[1up] [4up] | |

8. | Set Similarity Join: Signatures for Overlap and Hamming Distance, Implementation |
[1up] [4up] | |

### Literature

The following book treats selected lecture topics:

N. Augsten, M. H. Böhlen. Similarity joins in relational database systems.

Synthesis lectures on data management. Morgan Claypool Publishers, 2013.

The book is available online from our university library.

### Exam

The exam will be written. In the case of very few registrations for an exam date, the exam may be held as an oral exam; in this case the students will be notified before the exam.

The exam dates will be announced on PlusOnline.

At the written exam you are allowed to use one A4 sheet with your personal notes (both sides, hand written or printed). During the oral exam no notes are allowed.

The exam may be held online. The kind of exercises, notes allowed during the exam, and the exam duration are not affected by whether the written exam is held online.

You may register or cancel the registration up to 48 hours before the exam. If you do not show up for a registered exam, you will be blocked for this exam according to the university regulations §15(7).

**Previous exams: **
30.01.2023, 17.03.2023

## Lab (PS)

You will solve exercises at home and present your solution in class. The grading depends on the number of homework exercises that you solve and the quality of your presentions in class.

### Homework

You will solve worksheets at home during the semester. The worksheets must be solved by the due date indicated on the worksheet.

By the due date of the worksheet, you also **tick the exercises that you solved** in Blackboard. Ticking an exercises means that

- you have solved the exercise,
- you understand the solution,
- you will be able to present the solution,
- you will be able to answer questions regarding the solution.

A score (number of points) is assigned to each homework exercise. The score depends on the complexity and difficulty of the exercise. For a single worksheet, at most 10 points can be achieved.

The homework worksheets will be available at least 6 days before the due date.

### Presentations in class

In class, the lecturer will pick a student for each exercise of the homework and ask the student to present the solution. The lecturer (and the fellow students) will ask questions about the solution. The quality of the presentation will be rated A, B, or C. Quality criteria for the presentation are

- the technical correctness of the solution,
- the clearness of the presentation,
- the correct use of terminology, and
- the ability to answer in-depth questions.

Quality ratings of the presentation:

**A**: good presentation and in-depth understanding of the solutions**B**: solution not correct or presentation issues, but sufficient understanding of the solution**C**: insufficient understanding of the solution

### Grading

You get points for each of the homework worksheets. The overall score is the sum over the points for all worksheets.

Each worksheet contributes as follows to the overall score:

- If your presentation was rated A
*or*you were not asked to present any of the solutions, then all ticked excercises of the worksheet count towards the overall score. - If your presentation was rated B, you will not receive the points for the exercise you presented, but all other ticked excercises of the worksheet count towards the overall score.
- If your presentation was rated C:
- For the first C-rated presentation, you will get zero points for the worksheet (i.e., none of the excercises of the worksheet counts).
- For further C-rated presentations, the points of all ticked exercises of the worksheet will be subtracted from the overall score.

- If your absent from class, you will not get points for the respective worksheet. If you are not in class while a specific exercise is being presented (coming late or leaving early), the points for that exercise will not count.

The **maximum overall score** is **80**. The final grade is computed from the overall score as follows:

Percent | Score | Grade |
---|---|---|

81-100 | 65-80 | 1 |

71-80 | 57-64 | 2 |

61-70 | 49-56 | 3 |

51-60 | 41-48 | 4 |

0-50 | 0-40 | 5 |

**Attendance and unenrolment:** The students must attend at least 75% of the lab lectures to achieve a positive grade. Unenrolements are possible only until *before* the 3rd lab unit, i.e., all students that are still enrolled at the time of the 3rd lab unit will be graded.