DB2 - WS 2015/16 - Assignment 5 (Estimating selection cost)
===========================================================

1) Download.

   Download the file 'assignment5.zip' and unpack it.

   Rename nnnnnnn_5.py to matrikel_5.py, such that if your matrikel number is 
   1234567 the file is named 1234567_5.py

   In subdirectory 'test' we provided some test cases for you to evaluate your 
   solutions. There are 8 json files with different parameters. For each json 
   file there are 7 output files with the results for a specific query and  
   attribute values listed in the file name. For example:
     - all files with prefix '0_' in their file names correspond to the 
       parameters in '0.json',
     - the value in file '0_6_208000_415000_out' is the minimum number of blocks
       needed to evaluate query Q6 for atribute values 208000 415000 and 
       parameters in '0.json' file.

   You may test your code by calling

   cat in.json | python nnnnnnn_5.py Q a [b]

   where Q is the query type for which you evaluate the cost, a and b are the 
   attribute values. Depending on the query type, one (Q1-Q3) or two attribute 
   values (Q4-Q7) are required.

   IMPORTANT: Please stick to the naming convention because your assignments 
   are fetched, processed, and verified automatically. If the file name is 
   incorrect we may miss your assignment and you may not pass.

2) Assignment description.

   In this assignment we implement a cost estimation for specific query types. 
   The goal is to estimate the number of blocks that have to be accessed in 
   order to evaluate each query type with the most efficient strategy. 

   We limit our scenario to one relation with two attributes, A and B. Both
   attributes get non-negative integer values with duplicates allowed. 

   As queries, we consider selections only of the following types.

   Query types:
     Q1 : A = a
     Q2 : A >= a
     Q3 : A <= a
     Q4 : A < a OR b < A
     Q5 : a < A AND A < b
     Q6 : A < a AND B > b
     Q7 : A = a AND B < b

   In order to support the cost estimation, several parameters are provided as 
   input. For any possible input, you should consider the best strategy. 
   However, we will not test corner cases, e.g., a relation fits on one block.
   The parameters are provided in a JSON file.

   Parameters:
     relsize    : size of the relation
     rangeA     : range of values of attribute A (uniform distribution)
     rangeB     : range of values of attribute B (uniform distribution)
     indexA     : index type on attribute A
     indexB     : index type on attribute B
     nodedegree : number of pointers in B+-tree
     blocksize  : block size in Bytes
     tuplesize  : tuple size in Bytes

   We assume that there can be an index on each attribute. If there is an index,
   it is a B+-tree. We limit the indices to the following types. 

   Index types:
     sparse : there is a sparse index on the attribute 
              (relation is physically sorted on that attribute)
     dense  : there is a dense index on the attribute
              (no sorting can be assumed)
     none   : there is no index on the attribute

   According to the allowed index types, the following index combinations on
   attributes A and B are possible. The only not allowed case is to have 
   a sparse index on both attributes.

   Index combinations:

                  | sparse on B | dense on B | none on B
     sparse on A  |      -      |      +     |     +
     dense on A   |      +      |      +     |     +
     none on A    |      +      |      +     |     +

   The program takes as input a JSON file with parameters, query type as 
   integer, and, depending on the query type, one or two attribute values (see 
   Query types section).

   The output of the program is the number of blocks required to evaluate the
   specified query type most efficiently for the provided parameters and 
   attribute values.

   TIP: The parameters influencing the chosen strategy most are: index type, 
   ranges of values, and the searched attribute values. In some cases using a 
   linear scan may be more profitable than using an index, especially when using 
   an index causes reading the same blocks multiple times.

3) Task: Fill in missing function definitions.

   You are given a source code with input and output operations already 
   implemented.

   For each query type, there is a separate method with missing logic. Your task
   is to implement logic for each query method, that is calculate the minimum 
   number of blocks needed to evaluate the query (using best strategy).

   TIP: In the evaluation, we will avoid the cases when the number of tuples
   returned by a *range* query is smaller than the number of relation blocks. 
   This means that, for example for a dense index, you don't have to verify if a 
   linear scan is more efficient than using the index.

   We recommend that whenever a part of code is used for multiple queries, you
   delegate it to a separate function. This will reduce the number of possible 
   errors.

   All programming details are explained directly in the source code.

   For this assignment you can obtain 2 points:
     - 1 point for implementing queries Q1, Q2, and Q3,
     - 1 point for implementing queries Q4, Q5, Q6, and Q7.

   While evaluating, due to possible rounding differences in the formulas, we 
   will accept solutions in which the number of blocks varies from our 
   implementation by 3 blocks.

   IMPORTANT: Please do not change the output format of the program. Modifying 
   the output may cause failing the tests, disregarding the correctness of the 
   solution.

4) Submission.

   Use our dbabgabe system to submit your solutions. Submissions by email are
   not supported any more.

5) Python.

   To install python on your personal computer, follow the guidelines on 
   https://www.python.org/

   To edit python code, use any text editor. We recommend vim, emacs, sublime 
   text. For our purposes, using any sophisticated IDE is simply overwhelming. 
   Though, we don't discourage to do that.

   The source code, that we provide for each assignment, should be 
   self-explained and no python-specific knowledge is required to understand it 
   (assuming experience with any other programming language). 

   If you need more help, a great source of information is the python website:
   - documentation https://docs.python.org/2/
   - tutorial https://docs.python.org/2/tutorial/index.html
   - beginners guide https://wiki.python.org/moin/BeginnersGuide

   If you do not have python installed (or cannot install) on your personal 
   computer, the machines in the computer lab room and the 'sshstud' machine 
   have it installed.