CSE 462/562: Database Systems (Spring 2025, Taco-DB Project)

Bonus Project 6: Writing optimized query plans

Released: Tuesday, 3/25/2025
Project due: Tuesday, 5/13/2025, 23:59:59 EDT
Last updated: 1/26/2025

Note: the reference implementation for the queries in this bonus project is by no means the best possible physical plan. You may want to get creative and find a better plan than the reference implementation.

Note 2: we pass --buffer_pool_size=65536 --sort_nways=4096 to the bonus tests. You should pass the same arguments to the test binaries when running them locally so that your plan is evaluated with the same configuration as the offline tests.

Note 3: make sure to create a release build (optimized build) before testing the bonus project solution.

0. Getting Started

This project builds upon your code for the previous projects. If you need the solution code for them, please extract /ro-data/labs/lab<i>_sol.tar.xz (where <i> is the project number) into your repository root and run ./import_supplemental_files.sh. If you are importing the solution code for more than one earlier project, import them in project-number order, as a later import may overwrite some files from an earlier one. If you run into any errors, please read the paragraph above "a few hints" in bold font.

To get started with project 6, extract /ro-data/labs/lab6.tar.xz and import the supplemental files.

The code should build without compilation errors once the supplemental files are imported, but all the bonus tests are likely to time out. You may list all the tests in your build directory using the ctest -N command.

1. Manual Query Planning (Bonus Project, 12 pts)

Source files to modify: tests/execution/BonusTestTPCHQ?Plan.cpp, where ? is one of 3, 5, S.

To provide hands-on experience with how the entire query processing pipeline (planning, optimization, and execution) works in a DBMS, we will manually plan and optimize three queries over TPC-H, a common benchmark used to measure DBMS query performance. You can find the table schemas and the indexes built in the database by inspecting TPCHTest::CreateAndLoadTPCHTables(), defined in tests/execution/TPCHTest.cpp.

You can learn how to construct expressions and physical query plans by reading the unit tests in tests/execution/*.cpp. You can also find the baseline plans provided in the handout in tests/execution/BonusTestTPCHQ?Plan.cpp, where ? is one of 3, 5, S. Note that these are not the standard benchmark queries, as Taco-DB does not support the full SQL syntax.
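
For concreteness, physical plans are built bottom-up: create the leaf operators first, then wrap them in parent operators that take ownership of their children. The fragment below sketches a filtered and projected scan of customer in the style of the unit tests; the factory names (TableScan::Create, Selection::Create, Projection::Create, Variable::Create, Literal::Create, BinaryOperator::Create), the Datum and type-oid helpers, and the field indexes are assumptions, so verify the actual signatures in the plan/expression headers and in tests/execution/*.cpp before copying anything.

    // Sketch only: the names and signatures below are assumptions modeled on
    // the lab unit tests, not guaranteed Taco-DB APIs.
    // customer_tabdesc: the TableDesc for customer, e.g., looked up from the
    // catalog cache in the test fixture.

    // 1. Leaf operator: a heap-file scan of customer.
    std::unique_ptr<PlanNode> plan = TableScan::Create(customer_tabdesc);

    // 2. The predicate c_mktsegment = 'MACHINERY' as an expression tree over
    //    the scan's output schema (field index 6 assumes the TPC-H column order).
    std::unique_ptr<ExprNode> pred = BinaryOperator::Create(
        OPTYPE_EQ,
        Variable::Create(plan->get_output_schema(), /*c_mktsegment=*/6),
        Literal::Create(Datum::FromCString("MACHINERY"), initoids::TYP_VARCHAR));

    // 3. Parent operators take ownership of their children via std::move.
    plan = Selection::Create(std::move(plan), std::move(pred));

    // 4. Project down to the columns needed by the operators above.
    std::vector<std::unique_ptr<ExprNode>> proj_exprs;
    proj_exprs.emplace_back(Variable::Create(plan->get_output_schema(), /*c_custkey=*/0));
    plan = Projection::Create(std::move(plan), std::move(proj_exprs));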

Task 5 (Bonus): Replace the baseline plan with your own optimized query plan so that it runs faster.

Here are the three queries in the bonus project (also available in the plan files):

    -- TPC-H Q3 (based on benchmark Q3)
    SELECT o_orderkey, o_orderdate, o_totalprice, o_shippriority
    FROM customer, orders
    WHERE c_mktsegment = '[SEGMENT]'
    AND c_custkey = o_custkey
    AND o_orderdate < date '[DATE]'
    AND o_orderstatus <> 'F'
    ORDER BY o_shippriority DESC, o_totalprice DESC, o_orderdate ASC
    LIMIT 100;
    -- [SEGMENT] is one of 'AUTOMOBILE', 'BUILDING', 'FURNITURE', 'HOUSEHOLD', 'MACHINERY'    
    -- [DATE] is some date between '1992-01-01' and '1998-12-01'
    -- (Note this range is larger than what's allowed in the original TPC-H Q3.) 

    -- TPC-H Q5 (based on benchmark Q5) 
    SELECT SUM(l_extendedprice * (1 - l_discount)), COUNT(*)
    FROM customer, orders, lineitem, supplier, nation, region
    WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
    AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
    AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
    AND r_name = '[REGION]'
    AND o_orderdate >= date '[DATE]'
    AND o_orderdate < date '[DATE]' + interval '365' day;
    -- [REGION] is one of 'AFRICA', 'AMERICA', 'ASIA', 'EUROPE', 'MIDDLE EAST'    
    -- [DATE] is some date between '1992-01-01' and '1998-12-01'
    -- (Note this range is larger than what's allowed in the original TPC-H Q5.) 

    -- TPC-H QS (a non-benchmark query)
    SELECT SUM(l1.l_extendedprice * (1 - l1.l_discount) -
               l2.l_extendedprice * (1 - l2.l_discount)), COUNT(*)
    FROM lineitem l1, orders o1, lineitem l2, orders o2
    WHERE l1.l_orderkey = o1.o_orderkey
    AND l2.l_orderkey = o2.o_orderkey
    AND l1.l_receiptdate > l2.l_shipdate
    AND l1.l_receiptdate < l2.l_shipdate + interval [INTERVAL] day
    AND o1.o_custkey = o2.o_custkey
    AND l1.l_returnflag = 'R'
    AND l2.l_returnflag <> 'R';
    -- [INTERVAL] is some integer in [1, 90].   
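
If the baseline plan evaluates a join as a cartesian product followed by a selection (as an unoptimized plan typically does), a natural first improvement for Q3 is to push each single-table predicate below the join, join only the filtered inputs on c_custkey = o_custkey with a real join operator, and place the Sort and Limit for the ORDER BY ... LIMIT 100 on top. The fragment below sketches just the join subtree; MakeSegmentPredicate, MakeOrderPredicate, and CreateEquiJoin are hypothetical helpers you would write yourself (CreateEquiJoin could wrap your sort-merge join, or an index nested-loop join over an index on o_custkey), and the TableScan/Selection factory names are assumptions modeled on the lab unit tests rather than guaranteed signatures.

    // Sketch only: operator names, Create() signatures, and field indexes are
    // assumptions; MakeSegmentPredicate, MakeOrderPredicate, and CreateEquiJoin
    // are hypothetical helpers you would implement with your own executor.
    std::unique_ptr<ExprNode> MakeSegmentPredicate(const Schema *customer_schema);
    std::unique_ptr<ExprNode> MakeOrderPredicate(const Schema *orders_schema);
    std::unique_ptr<PlanNode> CreateEquiJoin(std::unique_ptr<PlanNode> left,
                                             std::unique_ptr<PlanNode> right,
                                             FieldId left_key, FieldId right_key);

    std::unique_ptr<PlanNode>
    BuildQ3JoinSubtree(std::shared_ptr<const TableDesc> customer,
                       std::shared_ptr<const TableDesc> orders) {
        // Selection pushdown: filter customer on c_mktsegment before the join.
        std::unique_ptr<PlanNode> cust = TableScan::Create(customer);
        const Schema *cust_schema = cust->get_output_schema();
        cust = Selection::Create(std::move(cust), MakeSegmentPredicate(cust_schema));

        // Selection pushdown: filter orders on o_orderdate and o_orderstatus.
        std::unique_ptr<PlanNode> ord = TableScan::Create(orders);
        const Schema *ord_schema = ord->get_output_schema();
        ord = Selection::Create(std::move(ord), MakeOrderPredicate(ord_schema));

        // Join the two filtered inputs on c_custkey = o_custkey; the Sort
        // (o_shippriority DESC, o_totalprice DESC, o_orderdate ASC) and
        // Limit 100 then go on top of this subtree.
        return CreateEquiJoin(std::move(cust), std::move(ord),
                              /*c_custkey=*/0, /*o_custkey=*/1);
    }

The same ideas (filtering each table as early as possible, avoiding cartesian products, and choosing a join order that keeps intermediate results small) apply to Q5 and QS as well.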

Note that the baseline plan, despite being semantically correct, will not finish on TPC-H scale factor 1 in a reasonable amount of time, so you have to apply at least some of the query optimizations we discussed in the lectures to earn bonus points. A smaller subset of the TPC-H data, named tpch_s01, is configured as the default test database for these bonus tests on your local machine; use it for debugging and test runs. Once you finish testing, you may run a query over the larger scale-factor-1 TPC-H data by following the instructions below:

How to run bonus tests in your dev container:

  1. Import the TPC-H data by running ./import_data.sh (skip this step if you have already completed the data import).
  2. Make sure the following lines exist in your .gitignore file:
     
        /data/tpch_s1.tar.xz
        /data/tpch_s1
        /data/tpch_s01.tar.xz
        /data/tpch_s01
    If not, add them to .gitignore so that you will not accidentally push these large files to your repository.
  3. Compile your code in release mode using VSCode by selecting the Release build variant, or do so with the command line as follows:
    a. cmake -Bbuild.Release -DCMAKE_BUILD_TYPE=Release .
    b. cd build.Release && make
  4. To run a test, say TPC-H Q3 with parameters ('MACHINERY', '1993-03-10'), a buffer pool size of 64MB, and 4096-way merges in external sorting:
    a. cd build.Release
    b. ./tests/RunTest.sh ./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096 --gtest_filter='*MACHINERY_19930310'
    By default, the test uses the data/tpch_s01 directory for the TPC-H database and data/tpch_s01_ans for the reference query results. The default timeout is 1000 seconds, and the default memory limit (for the data segment only) is 64 MB.
    c. To modify the time and/or memory limits, set the TIMEOUT and/or MEMLIMIT environment variables. The following example sets the timeout to 5 seconds and the memory limit to 500000 KB:
    TIMEOUT=5 MEMLIMIT=500000 ./tests/RunTest.sh ./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096 --gtest_filter='*MACHINERY_19930310'
    d. To use a different database, say data/tpch_s1 (scale factor 1) with its reference query result in data/tpch_s1_ans, use the --test_db_path and --test_ans_path arguments:
    TIMEOUT=5 MEMLIMIT=30000 ./tests/RunTest.sh ./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096 --test_db_path=../data/tpch_s1 --test_ans_path=../data/tpch_s1_ans --gtest_filter='*MACHINERY_19930310'
    e. To retain the query results in a file for debugging, use the --test_res_prefix parameter to provide a prefix for the result path. Note that the prefix is prepended to the file name and may contain an existing directory in the path. For instance, to dump the query result into the qres directory under build.Release, enter the following (make sure to include the slash after qres!):
    mkdir -p qres && TIMEOUT=5 MEMLIMIT=30000 ./tests/RunTest.sh ./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096 --test_db_path=../data/tpch_s1 --test_ans_path=../data/tpch_s1_ans --test_res_prefix=qres/ --gtest_filter='*MACHINERY_19930310'
    This will create a file similar to qres/TPCHQ3_MACHINERY_19930310_2022-04-18T10-33-45EDT.csv.
    f. To list which tests are available, append the --gtest_list_tests parameter to the test binary, e.g.,
    ./tests/execution/BonusTestTPCHQ3 --gtest_list_tests
    ./tests/execution/BonusTestTPCHQ5 --gtest_list_tests
    ./tests/execution/BonusTestTPCHQS --gtest_list_tests