DB2 - WS 2015/16 - Assignment 6 (External merge-sort) ===================================================== 1) Download. Download the file 'assignment6.zip' and unpack it. Rename nnnnnnn_6.py to matrikel_6.py, such that if your matrikel number is 1234567 the file is named 1234567_6.py In subdirectory 'test' we provided some test cases and the output files to evaluate your solutions. There are 20 input files of increasing size. For the first half of them, we provide the output files. You may test your code by calling seq 1 10 | shuf --random-source <(echo 12345123541) | ./nnnnnnn_6_solution.py 3 2 where 'seq 1 10' generates a sequence of numbers from 1 to 10, 'shuf --random-source <(echo 12345123541)' shuffles the sequence randomly with a seed '12345123541', the parameters '3' and '2' say that 3 blocks fit in the buffer and two tuples fit per block. IMPORTANT: Please stick to the naming convention because your assignments are fetched, processed, and verified automatically. If the file name is incorrect we may miss your assignment and you may not pass. 2) Assignment description. The goal of this assignment of this assignment is to implement external merge-sort algorithm. We only imitate the algorithm in main memory. The program takes as input: - a list of tuples (non-negative integer numbers), - size of buffer (in blocks), - number of tuples that fit in a block. The output is a sequence of the following operations corresponding to consecutive steps of the algorithm: - run N, - bl X, - bw Y, - ta Z. run N: N is the algorithm's run (starting with 0). This operation is issued at the beginning of each run of the algorithm. bl X: block load - issued when a block is loaded to buffer. X are the tuples of a block, for example, 'bl 1,2' means that the block with tuples '1' and '2' has been loaded to the buffer. bw Y: block write - issued when a block is output from the buffer. Y are the tuples of a block, for example, 'bw 1,2' means that the block with tuples '1' and '2' has been output from the buffer. This operation is issued immediately after the output block gets full. ta Z: tuple add - issued when a tuple is added to the output block. Z is a tuple, for example, 'ta 1' means that the tuple '1' has been added to the output block in the buffer. This operation cannot be issued in the first run of the algorithm when only sorting is performed. Example. We sort the sequence of consecutive numbers from 10 to 1. 3 blocks fit in the buffer, and 2 tuples fit per block. If more than one tuple fit in a block, the input numbers are split into blocks in order, that is, the sequence of tuples 10,9,8,7,6,5,4,3,2,1 is split into five blocks [10,9], [8,7], [6,5], [4,3], [2,1]. In the first run (run 0) only sorting is performed, and in the second run (run 1) groups of consecutive sorted blocks are merged. Command: seq 10 1 | python nnnnnnn_6_solution.py 3 2 Output: run 0 bl 10,9 bl 8,7 bl 6,5 bw 5,6 bw 7,8 bw 9,10 bl 4,3 bl 2,1 bw 1,2 bw 3,4 run 1 bl 5,6 bl 1,2 ta 1 ta 2 bw 1,2 bl 3,4 ta 3 ta 4 bw 3,4 ta 5 ta 6 bw 5,6 bl 7,8 ta 7 ta 8 bw 7,8 bl 9,10 ta 9 ta 10 bw 9,10 IMPORTANT: Changing the format of the output will cause test failures. There are three cases of this assignment, each corresponding to a single point which you can obtain. Case 3 is optional and gives you a bonus point. Case 1: the buffer has a fixed size of 3 blocks and 1 tuple fits per block. Case 2: we allow the buffer to have sizes greater or equal 3 and 1 tuple fits per block. Case 3: we allow the buffer to have sizes greater or equal 3 and more than 1 tuple can fit per block. 3) Task: Fill in missing function definitions. You are given a source code where input processing is already implemented. Your task is to implement the method ext_sort(blocks, buffer_size, tuples_per_block) where 'blocks' is a lost of blocks to sort, 'buffer_size' is the size of buffer in blocks, and 'tuples_per_block' is the number of tuples that fit in a block. You are provided the 'Block' class which represents a single block and holds a list of tuples. This class provides you also some methods to produce the output. However, you have to decide when the output methods are called. IMPLEMENTATION DETAILS: - For the first run of the algorithm use sort method from python. Load blocks to buffer. Load their tuples into a temporal list. Sort the list. Split the tuples into blocks. Write the blocks. - In merging steps, merge the groups of tuples top-down, as in the exercise we solved in class. The order matters for the output. - In a merge step, if there is a single block left, its processing should be postponed to the next run. This block should cause no output. We recommend to start with the implementation of Case 1 and then extend it to other cases. 4) Submission. Use our dbabgabe system (https://abgaben.cosy.sbg.ac.at/) to submit your solutions. Submissions by email are not supported any more. In case of incorrect output, the system will provide you with a diff for the first failed test, similarly to the B+-tree assignment. 5) Python. To install python on your personal computer, follow the guidelines on https://www.python.org/ To edit python code, use any text editor. We recommend vim, emacs, sublime text. For our purposes, using any sophisticated IDE is simply overwhelming. Though, we don't discourage to do that. The source code, that we provide for each assignment, should be self-explained and no python-specific knowledge is required to understand it (assuming experience with any other programming language). If you need more help, a great source of information is the python website: - documentation https://docs.python.org/2/ - tutorial https://docs.python.org/2/tutorial/index.html - beginners guide https://wiki.python.org/moin/BeginnersGuide If you do not have python installed (or cannot install) on your personal computer, the machines in the computer lab room and the 'sshstud' machine have it installed.