Released: Tuesday, 4/18/2023
Project due: Sunday, 5/21/2023, 01:00 am EDT
Write-up due: Sunday, 5/14/2023, 01:00 am EDT
Last updated: 3/30/2023
Update (5/17/23): Bonus tests are run four times a day at 1:11 am, 7:11 am, 1:11 pm and 7:11 pm until the project deadline.
Update (5/15/23): Bonus tests are run twice a day at 1:11 am and 1:11 pm until the project deadline.
Note: the reference implementation for the queries in the bonus project is by no means the best possible physical plan. You may need to be creative in finding a better plan than the reference implementation.
Note 2: you may want to adjust the memory limit and time limit when you test the bonus project locally because of the differences in the runtime environment.
Note 3: we pass --buffer_pool_size=65536 --sort_nways=4096 to the bonus tests. You should pass those arguments to the test binaries when running them locally in order to evaluate your plan with the same configuration as the offline tests.
This project builds upon your code for all previous projects. If you need the solution code for projects 1, 2, 3, 4, and/or 5, please download it from UBBox (link released on Piazza, under Homework Solutions, on 4/27). To get started with Project 6, extract /local/Spring_2023/cse562/lab6.tar.xz on the student servers and import the code as you did in previous projects. If you have not imported the TPC-H data during the Project 5 setup, please run ./import_data.sh in your repository root (for those who are using turing or cerf), or download the TPC-H data using this UBBox link.
The code should build without compilation errors once the
supplemental files are imported, but most of the additional tests are
likely to fail. You may list all the tests in your build directory using
the ctest -N
command. There are 40 additional basic test cases
in total and some bonus tests.
A few useful hints: the additional test cases can be found in the tests/ directory.
In this project, you will work on two additional QP operators in TacoDB: (sort) merge join and index nested-loop join. They are structured in the same way as the basic QP operators in Project 5. In addition, we will provide you with a large database generated with the TPC-H benchmark as well as a few complex SQL queries. Your task is to manually plan, optimize, and execute them in TacoDB, using all the QP operators you have implemented so far.
After successfully completing this project, you will be able to run equi-joins and band-joins much more efficiently in Taco-DB. The basic tests will check whether your implementation of the two join operators works on some small data. If you choose to complete the bonus project, we will also stress-test your implementation using the larger-scale database.
About grading: the total grade for Project 6 is 8 points for the basic tests plus 3 points for the final write-up. There are an additional 6.6 points in the bonus project, which will be added directly to your total score.
About testing: as before, the basic tests are run on Autograder. You may submit your branch/tag on the Project 6 page on Autolab as usual to get grading results on demand. The bonus project will be tested offline nightly. We will post the test results on a scoreboard similar to the Project 5 large tests. Note that you will only get the bonus points if your query finishes and produces the correct results within the predefined time and memory limits.
About write-up: the write-up is due on May 14, at 1:00 am EDT, much earlier than in previous projects! The write-up questions will be more open-ended and will be used to provide feedback to you prior to the final exam.
include/plan/MergeJoin.h
include/execution/MergeJoinState.h
src/plan/MergeJoin.cpp
src/execution/MergeJoinState.cpp
src/execution/*.cpp
Sort merge join is an efficient algorithm for equi-joins and band-joins, as we discussed in class. In TacoDB, we would like to implement an equi-join version of it, and you may assume its input execution states produce records in the desired order (i.e., the child plans produce their results sorted on the join column(s)). This design allows us to decouple the sort operator from the merge join operator, so that the sort operator is only added to the plan when needed. In addition to implementing the plan and execution state for merge join, you also have to implement rewind(saved_position) and save_position() for all other operators in Projects 5 and 6 (including merge join itself) to ensure everything works.
Task 1: Implement the save_position() and rewind(saved_position) functions for all previous physical query plan execution states. Only actions such as TableInsert and TableDelete do not need to support them.
Generally, you can save the essential information in a Datum. You can create a new Datum through the different Datum::From() overrides (which can encapsulate all basic types with a size of less than 8 bytes). If you need to store more information than a single Datum can hold, you can use the DataArray utilities. You may get some hints on how to use them by looking at the example in CartesianProductState::save_position(). The crux of this problem is to extract all the essential information about the iterator position so that it can later be reconstructed when rewind(saved_position) is called.
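To make the idea concrete, here is a small, self-contained C++ sketch (this is not TacoDB code; all names in it are made up for illustration) of an iterator that packs its position into a single value and later reconstructs it. Conceptually, this is what save_position() and rewind(saved_position) have to do, with the saved value wrapped in a Datum (or a DataArray when more fields are needed).

#include <cstddef>
#include <iostream>
#include <vector>

// Standalone illustration only -- not TacoDB code. The key point is that the
// saved "position" must contain everything needed to rebuild the iterator.
struct ToyIterator {
    const std::vector<int>* data;
    std::size_t pos;

    bool next(int& out) {
        if (pos >= data->size()) return false;
        out = (*data)[pos++];
        return true;
    }

    // Analogous to save_position(): pack the essential state into one value.
    std::size_t save_position() const { return pos; }

    // Analogous to rewind(saved_position): rebuild the iterator state.
    void rewind(std::size_t saved) { pos = saved; }
};

int main() {
    std::vector<int> v{10, 20, 30, 40};
    ToyIterator it{&v, 0};
    int x;
    it.next(x);                            // consumes 10
    std::size_t saved = it.save_position();
    it.next(x); it.next(x);                // consumes 20 and 30
    it.rewind(saved);                      // back to just after 10
    it.next(x);
    std::cout << x << std::endl;           // prints 20 again
}

In TacoDB you would wrap the equivalent of pos with one of the Datum::From() overrides and decode it again inside rewind().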
Task 2: Implement merge join plans and execution logic.
Similar to what you did in Project 5, you need to implement everything in the MergeJoin physical plan and the MergeJoinState execution state. The physical plan part is pretty much the same as CartesianProduct, with a bit more state to remember. In particular, besides its children, you have to provide the expressions on which the two sides are joined and the comparison functions for these join expressions (hint: you can use lambda expressions in C++ to create a closure, i.e., an anonymous function that captures the variables in scope). The execution state should assume both sides of the join are sorted on their join expressions. To iterate repeatedly over a group of items with the same key in the inner relation, you will need to use save_position() and rewind(saved_position) on the inner plan. Since the result of a merge join can serve as an input of another one, make sure to implement save_position() and rewind(saved_position) for MergeJoinState as well.
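To make the control flow concrete, here is a small, self-contained C++ sketch of an equi merge join over two sorted integer vectors (again, not the actual MergeJoinState code). The saved index group_start plays the role of the Datum returned by save_position() on the inner child, and resetting j to it plays the role of rewind(saved_position).

#include <cstddef>
#include <iostream>
#include <vector>

// Standalone illustration of an equi merge join over two sorted key lists.
void merge_join(const std::vector<int>& outer, const std::vector<int>& inner) {
    std::size_t i = 0, j = 0;
    while (i < outer.size() && j < inner.size()) {
        if (outer[i] < inner[j]) { ++i; continue; }
        if (outer[i] > inner[j]) { ++j; continue; }
        // Keys match: remember where this group of equal inner keys starts.
        std::size_t group_start = j;            // ~ inner->save_position()
        while (i < outer.size() && outer[i] == inner[group_start]) {
            j = group_start;                    // ~ inner->rewind(saved)
            while (j < inner.size() && inner[j] == outer[i]) {
                std::cout << outer[i] << " joins " << inner[j] << std::endl;
                ++j;
            }
            ++i;
        }
    }
}

int main() {
    merge_join({1, 2, 2, 3}, {2, 2, 4});  // prints four matching pairs
}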
include/plan/IndexNestedLoop.h
include/execution/IndexNestedLoopState.h
src/plan/IndexNestedLoop.cpp
src/execution/IndexNestedLoopState.cpp
Indexes, specifically B+-trees in TacoDB, can be leveraged to perform certain types of joins. We will implement the index nested loop join for equi-joins in this project. As in System-R style query plans, the inner relation of an index nested loop join is always a table with an index built on a list of fields (provided as an index descriptor).
Task 3: Implement index nested loop join plans and execution logic.
The general idea of the index nested loop join is to use each output record of the outer plan to probe the provided index for the matching records from the inner table.
Note: you may notice that the index nested loop interface is designed for band joins. However, we only test equi-joins in the basic tests. If you would like to attempt the bonus project, you may need to implement general band joins as well.
The current interface works as follows: the first k-1 inner fields are compared to the corresponding outer expressions for equality, while the final (k-th) field is checked against a range defined by two outer expressions. Thus, to form an equi-join through this interface, you need to set the k-th outer (lower) expression and the upper expression to be equal, with lower_isstrict and upper_isstrict both set to false. Please refer to the hints provided in include/plan/IndexNestedLoop.h for the rationale.
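To see why equal lower and upper expressions with both strictness flags set to false reduce to an equality check on the last field, consider the band predicate written out in plain C++ below. The flag names mirror lower_isstrict/upper_isstrict from the plan interface; everything else is illustrative only and not TacoDB code.

#include <cassert>

// The band predicate on the last (k-th) field: lower <(=) key <(=) upper.
bool band_matches(int inner_key, int lower, bool lower_isstrict,
                  int upper, bool upper_isstrict) {
    bool lower_ok = lower_isstrict ? (inner_key > lower) : (inner_key >= lower);
    bool upper_ok = upper_isstrict ? (inner_key < upper) : (inner_key <= upper);
    return lower_ok && upper_ok;
}

int main() {
    // With lower == upper == 42 and both flags false, only key == 42 matches,
    // which is exactly the equi-join condition on the last field.
    assert(band_matches(42, 42, false, 42, false));
    assert(!band_matches(41, 42, false, 42, false));
    assert(!band_matches(43, 42, false, 42, false));
}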
Another tricky thing to keep in mind relates to the save_position() and rewind(saved_position) functions for IndexNestedLoopState. You still need to implement them, since the index nested loop operator may serve as an input of a merge join. Note that we do not have an interface for the index to start a scan at a given (key, recid) pair for now, so you have to handle rewinding to a specific data entry specially in the index nested loop operator. You can start the scan with a saved search key and do a short linear scan to find the data entry with the heap record id that it is supposed to be rewound to.
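Here is a small, self-contained C++ sketch of that rewinding idea, with the index modeled as a sorted vector of (key, recid) pairs (this is not the TacoDB index API; it only illustrates the restart-then-linear-scan strategy).

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// (key, heap record id) data entries, sorted by key.
using Entry = std::pair<int, std::uint64_t>;

// Restart the scan at the first entry with the saved key (a real B-tree scan
// would do this with a search-key lookup), then advance linearly within the
// duplicate group until the saved heap record id is found.
std::size_t rewind_to(const std::vector<Entry>& index,
                      int saved_key, std::uint64_t saved_recid) {
    std::size_t pos = 0;
    while (pos < index.size() && index[pos].first < saved_key) ++pos;
    while (pos < index.size() && index[pos].first == saved_key &&
           index[pos].second != saved_recid) ++pos;
    return pos;  // position of the saved data entry (or end if not found)
}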
tests/execution/BonusTestTPCHQ3Plan.cpp
tests/execution/BonusTestTPCHQ5Plan.cpp
tests/execution/BonusTestTPCHQSPlan.cpp
Finally, to give you a hint on how the entire query processing pipeline
(planning, optimization, and execution) works, we will manually plan
and optimize three queries for TPC-H, which is a common
benchmark used to measure DBMS query performance. You can
find the table schemas and built indexes in the database by inspecting
TPCHTest::CreateAndLoadTPCHTables()
defined in
tests/execution/TPCHTest.cpp
.
You can learn how to construct expressions and physical query plans by reading the unit tests in tests/execution/*.cpp. You can also
find baseline plans in the handout in
tests/execution/BonusTestTPCHQ?Plan.cpp
, where ?
is one of
3
, 5
, S
. Note that these are not the
standard benchmark queries as we do not support the full SQL syntax in Taco-DB.
Task 4 (Bonus): Replace the baseline plan with your own optimized query plan so that it runs faster.
Here are the three queries in the bonus project (also available in the plan files):
-- TPC-H Q3 (based on benchmark Q3)
SELECT o_orderkey, o_orderdate, o_totalprice, o_shippriority
FROM customer, orders
WHERE c_mktsegment = '[SEGMENT]'
AND c_custkey = o_custkey
AND o_orderdate < date '[DATE]'
AND o_orderstatus <> 'F'
ORDER BY o_shippriority DESC, o_totalprice DESC, o_orderdate ASC
LIMIT 100;
-- [SEGMENT] is one of 'AUTOMOBILE', 'BUILDING', 'FURNITURE', 'HOUSEHOLD', 'MACHINERY'
-- [DATE] is some date between '1992-01-01' and '1998-12-01'
-- (Note this range is larger than what's allowed in the original TPC-H Q3.)
-- TPC-H Q5 (based on benchmark Q5)
SELECT SUM(l_extendedprice * (1 - l_discount)), COUNT(*)
FROM customer, orders, lineitem, supplier, nation, region
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
AND r_name = '[REGION]'
AND o_orderdate >= date '[DATE]'
AND o_orderdate < date '[DATE]' + interval '365' day;
-- [REGION] is one of 'AFRICA', 'AMERICA', 'ASIA', 'EUROPE', 'MIDDLE EAST'
-- [DATE] is some date between '1992-01-01' and '1998-12-01'
-- (Note this range is larger than what's allowed in the original TPC-H Q5.)
-- TPC-H QS (a non-benchmark query)
SELECT SUM(l1.l_extendedprice * (1 - l1.l_discount) -
           l2.l_extendedprice * (1 - l2.l_discount)), COUNT(*)
FROM lineitem l1, orders o1, lineitem l2, orders o2
WHERE l1.l_orderkey = o1.o_orderkey
AND l2.l_orderkey = o2.o_orderkey
AND l1.l_receiptdate > l2.l_shipdate
AND l1.l_receiptdate < l2.l_shipdate + interval [INTERVAL] day
AND o1.o_custkey = o2.o_custkey
AND l1.l_returnflag = 'R'
AND l2.l_returnflag <> 'R';
-- [INTERVAL] is some integer in [1, 90].
Note that the baseline plan, despite being semantically correct, will not even run over TPC-H scale factor 1 in a reasonable amount of time, so you have to apply at least some optimization to get bonus points. There is a smaller subset of the TPC-H scale factor 1 data, named tpch_s01, which is configured as the default test database for these bonus tests on your local machine. You can use it for debugging and test runs. Once you finish testing, you may run a query over the larger TPC-H data of scale factor 1. To set up the data, run ./import_code.sh, or do so manually by running cd data && tar xf tpch_s01.tar.xz && tar xf tpch_s01_ans.tar.xz and, optionally, tar xf tpch_s1.tar.xz && tar xf tpch_s1_ans.tar.xz.
Make sure the following paths are listed in your .gitignore file:
/data/tpch_s1.tar.xz
/data/tpch_s1
/data/tpch_s01.tar.xz
/data/tpch_s01
If not, add them to .gitignore so that you will not accidentally push these large files to your repository.
To build in Release mode, run:
cmake -Bbuild.Release -DCMAKE_BUILD_TYPE=Release .
cd build.Release && make
Then run a bonus test from build.Release through the RunTest.sh wrapper, e.g.:
cd build.Release
./tests/RunTest.sh ./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096 --gtest_filter='*MACHINERY_19930310'
By default, the test uses the data/tpch_s01 directory for the TPC-H database and data/tpch_s01_ans for the reference query result. The default timeout is 1000 seconds, and the default memory limit (for the data segment only) is 64 MB. You may override them with the TIMEOUT and/or MEMLIMIT environment variables. The following is an example where we set the timeout to 5 seconds and the memory limit to 500000 KB:
TIMEOUT=5 MEMLIMIT=500000 ./tests/RunTest.sh ./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096 --gtest_filter='*MACHINERY_19930310'
To run a test over the larger database in data/tpch_s1 (scale factor 1) with its reference query result in data/tpch_s1_ans, use the --test_db_path and --test_ans_path arguments:
TIMEOUT=5 MEMLIMIT=30000 ./tests/RunTest.sh ./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096 --test_db_path=../data/tpch_s1 --test_ans_path=../data/tpch_s1_ans --gtest_filter='*MACHINERY_19930310'
To dump the query result to a file, use the --test_res_prefix parameter to provide a prefix to the result path. Note the prefix is prepended to the file name and it may contain an existing directory in the path. For instance, to dump the query result into the qres directory in build.Release, enter the following (make sure to enter the slash after qres!):
mkdir -p qres && TIMEOUT=5 MEMLIMIT=30000 ./tests/RunTest.sh ./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096 --test_db_path=../data/tpch_s1 --test_ans_path=../data/tpch_s1_ans --test_res_prefix=qres/ --gtest_filter='*MACHINERY_19930310'
The query result will be written to a file such as qres/TPCHQ3_MACHINERY_19930310_2022-04-18T10-33-45EDT.csv.
To list all the available test cases, pass the --gtest_list_tests parameter to the test binary, e.g.:
./tests/execution/BonusTestTPCHQ3 --gtest_list_tests
./tests/execution/BonusTestTPCHQ5 --gtest_list_tests
./tests/execution/BonusTestTPCHQS --gtest_list_tests
Each student needs to complete the write-up independently and submit it to UBLearns with answers to the following questions. Please find the due date of the write-up at the beginning of this web page and submit a PDF document to UBLearns by the deadline. Please do not submit any other formats such as .docx, .doc, or .txt -- we will not grade any non-PDF submission.
Question 1: assume we use a simplified cost model that only counts the number of I/Os in a query plan. What is the cost of the baseline query plan we provided for Q3 on the TPC-H SF 1 data, if the parameters are ('BUILDING', '19950310')? You may ignore the buffer effect and assume we use a 4096-way merge. You should write your own test programs to extract the relevant statistics from the preloaded database (up to a 20% difference in your final answer is allowed).
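(As a reminder of the cost formulas from lecture, not a plug-in answer: under this model, a sequential scan of a table with N pages costs N I/Os, and an external merge sort with a k-way merge costs about 2N * (1 + ceil(log_k(r))) I/Os, where r is the number of initial runs, since each pass reads and writes every page once. You still need to extract the actual page counts from the database to compute the cost of the baseline plan.)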
Question 2: Write down your optimized physical plan for Q3 as a plan tree and analyze its cost. If you have not started working on the bonus project yet, this is a good time to preview it and sketch a rough solution. Any query plan with a cost of at most 10% of the baseline plan's cost is acceptable.
Question 3: Suppose you were to implement a cost-based query optimizer in this system in a similar fashion to the System-R query optimizer. Which access paths are available for the Selection relational algebra operator in this system? How would you estimate their costs (in particular, selectivity) based on a particular query? You may answer this question using Q3 as a concrete example.
Question 4: Can the join reordering algorithm we discussed in the lecture reduce the cost of Q3 in this particular system? If so, by how much? If not, please elaborate on the reason.
Question 5: Taco-DB is a single-threaded database with
no transaction support so far. Imagine we were to support
concurrent transaction processing in this system with a
standard 2-PL based concurrency control protocol. For that, we
modify the table access method (in its iterator) to acquire
appropriate shared or exclusive tuple locks before accessing or
modifying tuples. The locks are dropped at the end of each
transaction when the user makes a call to a new function
Database::CommitTransaction()
(or Database::AbortTransaction()). Does this system
always admit serializable schedules? Briefly explain why or
why not.
Question 6: Suppose we also want to support crash
recovery in this system and we implement the standard ARIES
algorithm. More specifically, we
record the following information for each update: LSN,
prevLSN, XID, type, pid, offset, length, before-image,
after-image
. In addition, we also record
transaction begin
, transaction end
,
transaction abort
, and Compensation Log Record (CLR)
as we discussed in the lecture. Assuming we always enforce the constraints required by the Write-Ahead Logging protocol, can this system always recover from a crash? Briefly explain why or why not.