Assigned: Tuesday, 4/26/2022
Project due: Sunday, 5/22/2022, 11:59 pm EST
Write-up due:Sunday, 5/22/2022, 11:59 pm EST
Last updated: 5/16/2022
Updates (5/16/22): We will run the offline tests twice a day, at 1:00 am and 1:00 pm until the project deadline. We may not be able to further increase the test frequence due to the amount of time needed to run the tests over all the submissions. Please consider submitting your code early.
Updates (5/4/22): The score board is now online but the daily offline testing will not start until 5/7. We will run the tests 1x per day and possibly increase the frequency to 2x per day in the week before the deadline.
Note: the reference implementation for the queries in the bonus project is by no means the best possible physical plan. You might want to be creative for finding a better plan than the reference implementation for them.
Note 2: you may want to adjust the memory limit and time limit when you test the bonus project locally because of the differences in the runtime environment.
Note 3: we pass --buffer_pool_size=65536
--sort_nways=4096
to the bonus tests. You should
pass those arguments to the test binaries when running them
locally in order to evaluate your plan with the same
configuration as the offline tests.
This project will be built upon your code fo all previous projects3. We provide a tarball with some new starter code which you need to import into your project 4 repository. Note that you do not have to wait until finishing project 4 to import the code as they are fairly independent.
git status
is cleanlab5.tar.xz
and untar it inside your repo. You
should now have a directory called lab5
inside your
repo./lab5/import_supplemental_files.sh
data/
directory. The code should build without compilation errors once the
supplemental files are imported, but most of the additional tests are
likely to fail. You may list all the tests in your build directory using
the ctest -N
command. There are 40 additional basic test cases
in total and some bonus tests.
A few useful hints:
tests/
directory.
In this project, you will work on two additional QP operators in TacoDB, namely (sort) merge join and index nested-loop join. They are structured in the same way as the basic QP operators in project 4. Besides, we will provide you a real-world database and a few queries. You need to manually plan, optimize, and execute them in TacoDB.
After successfully completing this project, you will be able to run equi-join and band-join much more efficiently in Taco-DB. The basic tests will test whether your implementation of the two join operators work on small data. We will stress test your implementation using a larger-scale database, if you'd like to complete the bonus project.
About grading: the total grade for project 5 basic tests is 9 points + 2 points in the final write-up. We will assign 6 points to each of the join operators, and any excess points you get beyond 9 points will be considered as a bonus directly added to your final grade. Hence, the highest score you can get for project 5 basic tests is 12 + 2 points. We will also assign 6 points for the bonus project, with 2 points for each query (see details below).
About testing: As before, the basic tests are run on Autograder. You may submit your branch/tag to project 5 page on Autolab as usual to get grading result on demand. The bonus projects will be tested offline daily, and we will use the latest submission to project 5 by [cut-off time TDB] for the offline tests. We will post the test results on a scoreboard similar to project 4 large tests (upcoming...). Note that, you will only get the bonus points if your query finishes and produces the correct results within the predefined time and memory limits.
include/plan/MergeJoin.h
include/execution/MergeJoinState.h
src/plan/MergeJoin.cpp
src/execution/MergeJoinState.cpp
src/execution/*.cpp
Sort merge join is an efficient algorithm for equi-joins or band-joins as we
discussed in the class. In TacoDB, you have to implement a equi-join verison of
it, and you can assume its input execution states produce records in your
desired order. This design allows us to decouple the sort operator and the merge join
operator, such that the sort operator is only added to the plan when needed. Other
than implementing the plan and execution state for merge join, you also have to
implement the
rewind(saved_position)
and save_position()
for
all other operators in project 4 and 5 (including merge join itself) to ensure everything works.
Task 1: Implementing save_position()
and rewind(saved_position)
functions for all previous
physical query plan states. Only actions like TableInsert
and TableDelete
do not need to support it.
Generally, you can save essential information in a Datum. You can
create new Datum through different Datum::From()
overrides
(which can encapsulate all basic types with a size less than 8 bytes).
If you need to store more information than a single Datum can chold, you can
use DataArray
utilities (make sure you have imported
lab4_extsort_test_update.tar.xz at this point, which contains an important bug
fix for DataArray). You might get some hint on
how to use it by looking at the example in
CartesianProducteState::save_position()
. The crux of this
problem is to extract all essential information of iterator position so
that later on it can be reconstructed when
rewind(saved_position)
is called.
Task 2: Implement merge join plans and execution logic.
Similar to what you have done with project 4, you need to implement
everything in
MergeJoin
physical plan and
MergeJoinState
execution state. The physical plan part is
pretty much the same as CartesianProduct
with a few more
states to remember. In particular, other than its children, you have to
provide the expressions on which both sides will be joined upon and the
comparison functions for these join expressions (hint: you can use lambda
expressions in C++ to create a closure, i.e., an anonymous function that
captures the variables in scope). The execution
state assumes both sides of the join are sorted based on their join
expressions. To reiterate over a group of items with the same key in
the inner relation, you will need to use save_position()
and
rewind(saved_position)
on the inner plan.
Since the result of a merge join can serve as an input of another
one, you have to make sure to implement save_position
and
rewind(save_position)
for MergeJoinState
as
well.
include/plan/IndexNestedLoop.h
include/execution/IndexNestedLoopState.h
src/plan/IndexNestedLoop.cpp
src/execution/IndexNestedLoopState.cpp
Indexes, specifically B+-Trees in TacoDB, can be leveraged to perform certain types of joins. We will implement the index nested loop join for equi-joins in this project. As in the System-R style query plan, the inner relation of an index nested loop is always a table with an index built on top of a list of fields (provided as an index descriptor).
Task 3: Implement index nested loop join plans and execution logic.
The general idea of the index nested loop join is to use each output record of the outer plan to probe the provided index for the matching records from the inner table.
Note: You may find the index nested loop interface is for band join. However, we only test equi-joins in basic tests. If you would like to attemp the bonus Task 4, you may (or may not) need to implement regular band join as well.
The current interface works as the following: the first
k-1
inner fields are compared to the corresponding outer expressions
for equality, while the final (k
-th) field is checked
against a range defined by two outer expressions. Thus, to form an
equi-join through this interface, you need to set equal k
-th
outer (lower) expression and upper expression, with
lower_isstrict
and upper_isstrict
both set to
false
. Please refer to hints provided in
include/plan/IndexNestedLoop.h
for rationale.
Another tricky thing to keep in mind is related to
save_position
and rewind(saved_position)
functions for IndexNestedLoopState
. You still need to
implement them since the index nested loop operator may serve as an input of
a merge join. Note that we do not have an interface for the index to
start a scan at a given (key, recid)
for now. So you have to
specially handle the rewinding to a specific data entry in the index nested
loop operator. You can start the scan with a saved search key and do a short linear
scan
to find the data entry with the heap record id that it is supposed to be
rewinded to.
tests/execution/BonusTestTPCHQ3Plan.cpp
tests/execution/BonusTestTPCHQ5Plan.cpp
tests/execution/BonusTestTPCHQSPlan.cpp
Finally, to give you a hint on how the entire query processing pipeline
(planning, optimization, and execution) works, we will manually plan
and optimize three queries for TPC-H (scale factor 1), which is a common
benchmark used to measure DBMS query performance. You can
find the table schemas and built indexes in the database by inspecting
TPCHTest::CreateAndLoadTPCHTables()
defined in
tests/execution/TPCHTest.cpp
.
You can find how to construct expressions and physical query plan by
reading unit tests in tests/execution/*.cpp
. You can also
find baseline plans in the handout in
tests/execution/BonusTestTPCHQ?Plan.cpp
, where ?
is one of
3
, 5
, S
. Note that these are not the
standard benchmark queries as we do not support the full SQL standard in Taco-DB.
Task 4 (Bonus): Replace the baseline plan with your own optmized query plan so that it can run faster.
Here are the three queries in the bonus project (also available in the plan files):
-- TPC-H Q3 (based on benchmark Q3)
SELECT o_orderkey, o_orderdate, o_totalprice, o_shippriority
FROM customer, orders
WHERE c_mktsegment = '[SEGMENT]'
AND c_custkey = o_custkey
AND o_orderdate < date '[DATE]'
AND o_orderstatus <> 'F'
ORDER BY o_shippriority DESC, o_totalprice DESC, o_orderdate ASC
LIMIT 100;
-- [SEGMENT] is one of 'AUTOMOBILE', 'BUILDING', 'FURNITURE', 'HOUSEHOLD', 'MACHINERY'
-- [DATE] is some date between '1992-01-01' and '1998-12-01'
-- (Note this range is larger than what's allowed in the original TPC-H Q3.)
-- TPC-H Q5 (based on benchmark Q5)
SELECT SUM(l_extendedprice * (1 - l_discount)), COUNT(*)
FROM customer, orders, lineitem, supplier, nation, region
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
AND r_name = '[REGION]'
and o_orderdate >= date '[DATE]'
AND o_orderdate < date '[DATE]' + interval '365' day;
-- [REGION] is one of 'AFRICA', 'AMERICA', 'ASIA', 'EUROPE', 'MIDDLE EAST'
-- [DATE] is some date between '1992-01-01' and '1998-12-01'
-- (Note this range is larger than what's allowed in the original TPC-H Q5.)
-- TPC-H QS (a non-benchmark query)
SELECT SUM(l1.l_extendedprice * (1 - l1.l_discount) -
* (1 - l2.l_discount)), COUNT(*)
l2.l_extendedprice FROM lineitem l1, orders o1, lineitem l2, orders o2
WHERE l1.l_orderkey = o1.o_orderkey
AND l2_l_orderkey = o2.o_orderkey
AND l1.l_receiptdate > l2.l_shipdate
AND l1.l_receiptdate < l2.l_shipdate + interval [INTERVAL] day
AND o1.o_custkey = o2.o_custkey
and l1.l_returnflag = 'R'
and l2.l_returnflag <> 'R'
-- [INTERVAL] is some integer in [1, 90].
Note that the baseline plan we provide to you would not even be able
to run over TPC-H scale factor 1 in a reasonable amount of time. So you have
to do at least some optimization to get bonus points. There is a smaller subset
of the TPC-H data of scale factor 1, named as tpch_s01
. They
are configured as the default test database for these bonus tests on your local machine.
You can use them for debugging and test runs. Once you finish testing, you may
run a query on TPC-H over the larger data of scale factor 1 with the following instruction:
cd data
&& tar xf tpch_s01.tar.xz && tar xf
tpch_s01_ans.tar.xz
, optionally, tar
xf tpch_s1.tar.xz && tar xf
tpch_s1_ans.tar.xz
..gitignore
file:If not, add them to
/data/tpch_s1.tar.xz
/data/tpch_s1
/data/tpch_s01.tar.xz
/data/tpch_s01
.gitignore
so that you will not accidentally
push these large files to your repository.cmake -Bbuild.release -DCMAKE_BUILD_TYPE=Release
.
cd build.release && make
cd
build.release
./tests/RunTest.sh
./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096
--gtest_filter='*MACHINERY_19930310'
data/tpch_s01
directory for TPC-H database, and
data/tpch_s01_ans
for the reference query result. The default
timeout is 1000 seconds, and the default memory limit (for data segment
only) is 64 MB.TIMEOUT
and/or MEMLIMIT
(env) variables. The
following is an example where we set the timeout to be 5 seconds and the
memory limit to 500000KB:TIMEOUT=5 MEMLIMIT=500000
./tests/RunTest.sh ./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096
--gtest_filter='*MACHINERY_19930310'
data/tpch_s1
(scale factor 1) with its reference
query result in data/tpch_s1_ans
, use the
--test_db_path
and --test_ans_path
arguments:TIMEOUT=5 MEMLIMIT=30000 ./tests/RunTest.sh
./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096 --test_db_path=../data/tpch_s1
--test_ans_path=../data/tpch_s1_ans
--gtest_filter='*MACHINERY_19930310'
--test_res_prefix
parameter to provide a prefix. Note the prefix is prepended to the file
name and it may contain an existing directory in the path. For instance, to
dump the query result in qres
directory in
build.release
, enter the following (must keep the
slash after qres
!):mkdir -p qres
&& TIMEOUT=5 MEMLIMIT=30000 ./tests/RunTest.sh
./tests/execution/BonusTestTPCHQ3 --buffer_pool_size=65536 --sort_nways=4096 --test_db_path=../data/tpch_s1
--test_ans_path=../data/tpch_s1_ans --test_res_prefix=qres/
--gtest_filter='*MACHINERY_19930310'
qres/TPCHQ3_MACHINERY_19930310_2022-04-18T10-33-45EDT.csv
.--gtest_list_tests
parameter to the test binary, e.g.,./tests/execution/BonusTestTPCHQ3 --gtest_list_tests
./tests/execution/BonusTestTPCHQ5 --gtest_list_tests
./tests/execution/BonusTestTPCHQS --gtest_list_tests
Each individual student needs to submit an independent write-up to UBLearns for the project design, division of implementation responsibility (no more than 1 page in letter size for design and coding responsibility). Please find the due date of the write-up at the beginning of this web page and submit a PDF document to UBLearns by the deadline. Please find a more detailed description of the requirements and a sample for the write-up on UBLearns.