DB2 - WS 2015/16 - Assignment 5 (Estimating selection cost) =========================================================== 1) Download. Download the file 'assignment5.zip' and unpack it. Rename nnnnnnn_5.py to matrikel_5.py, such that if your matrikel number is 1234567 the file is named 1234567_5.py In subdirectory 'test' we provided some test cases for you to evaluate your solutions. There are 8 json files with different parameters. For each json file there are 7 output files with the results for a specific query and attribute values listed in the file name. For example: - all files with prefix '0_' in their file names correspond to the parameters in '0.json', - the value in file '0_6_208000_415000_out' is the minimum number of blocks needed to evaluate query Q6 for atribute values 208000 415000 and parameters in '0.json' file. You may test your code by calling cat in.json | python nnnnnnn_5.py Q a [b] where Q is the query type for which you evaluate the cost, a and b are the attribute values. Depending on the query type, one (Q1-Q3) or two attribute values (Q4-Q7) are required. IMPORTANT: Please stick to the naming convention because your assignments are fetched, processed, and verified automatically. If the file name is incorrect we may miss your assignment and you may not pass. 2) Assignment description. In this assignment we implement a cost estimation for specific query types. The goal is to estimate the number of blocks that have to be accessed in order to evaluate each query type with the most efficient strategy. We limit our scenario to one relation with two attributes, A and B. Both attributes get non-negative integer values with duplicates allowed. As queries, we consider selections only of the following types. Query types: Q1 : A = a Q2 : A >= a Q3 : A <= a Q4 : A < a OR b < A Q5 : a < A AND A < b Q6 : A < a AND B > b Q7 : A = a AND B < b In order to support the cost estimation, several parameters are provided as input. For any possible input, you should consider the best strategy. However, we will not test corner cases, e.g., a relation fits on one block. The parameters are provided in a JSON file. Parameters: relsize : size of the relation rangeA : range of values of attribute A (uniform distribution) rangeB : range of values of attribute B (uniform distribution) indexA : index type on attribute A indexB : index type on attribute B nodedegree : number of pointers in B+-tree blocksize : block size in Bytes tuplesize : tuple size in Bytes We assume that there can be an index on each attribute. If there is an index, it is a B+-tree. We limit the indices to the following types. Index types: sparse : there is a sparse index on the attribute (relation is physically sorted on that attribute) dense : there is a dense index on the attribute (no sorting can be assumed) none : there is no index on the attribute According to the allowed index types, the following index combinations on attributes A and B are possible. The only not allowed case is to have a sparse index on both attributes. Index combinations: | sparse on B | dense on B | none on B sparse on A | - | + | + dense on A | + | + | + none on A | + | + | + The program takes as input a JSON file with parameters, query type as integer, and, depending on the query type, one or two attribute values (see Query types section). The output of the program is the number of blocks required to evaluate the specified query type most efficiently for the provided parameters and attribute values. TIP: The parameters influencing the chosen strategy most are: index type, ranges of values, and the searched attribute values. In some cases using a linear scan may be more profitable than using an index, especially when using an index causes reading the same blocks multiple times. 3) Task: Fill in missing function definitions. You are given a source code with input and output operations already implemented. For each query type, there is a separate method with missing logic. Your task is to implement logic for each query method, that is calculate the minimum number of blocks needed to evaluate the query (using best strategy). TIP: In the evaluation, we will avoid the cases when the number of tuples returned by a *range* query is smaller than the number of relation blocks. This means that, for example for a dense index, you don't have to verify if a linear scan is more efficient than using the index. We recommend that whenever a part of code is used for multiple queries, you delegate it to a separate function. This will reduce the number of possible errors. All programming details are explained directly in the source code. For this assignment you can obtain 2 points: - 1 point for implementing queries Q1, Q2, and Q3, - 1 point for implementing queries Q4, Q5, Q6, and Q7. While evaluating, due to possible rounding differences in the formulas, we will accept solutions in which the number of blocks varies from our implementation by 3 blocks. IMPORTANT: Please do not change the output format of the program. Modifying the output may cause failing the tests, disregarding the correctness of the solution. 4) Submission. Use our dbabgabe system to submit your solutions. Submissions by email are not supported any more. 5) Python. To install python on your personal computer, follow the guidelines on https://www.python.org/ To edit python code, use any text editor. We recommend vim, emacs, sublime text. For our purposes, using any sophisticated IDE is simply overwhelming. Though, we don't discourage to do that. The source code, that we provide for each assignment, should be self-explained and no python-specific knowledge is required to understand it (assuming experience with any other programming language). If you need more help, a great source of information is the python website: - documentation https://docs.python.org/2/ - tutorial https://docs.python.org/2/tutorial/index.html - beginners guide https://wiki.python.org/moin/BeginnersGuide If you do not have python installed (or cannot install) on your personal computer, the machines in the computer lab room and the 'sshstud' machine have it installed.