Release date: Thursday, 9/19/2024
Project due: Wednesday, 10/9/2024, 23:59:59 EDT
Last updated: 9/24/2024
This project builds upon your code for the previous projects. If you need
the solution code for projects 1 and/or 2 (available on 9/18/2024),
please extract /ro-data/labs/lab<i>_sol.tar.xz (where <i> is the project
number) into your repository root and run ./import_supplemental_files.sh.
If you are importing solution code for more than one earlier project, you
may want to import them in project-number order, as a later import may
overwrite some files. If you run into any errors, please read the
paragraph above "a few hints" in bold font.
To get started with project 3, extract /ro-data/labs/lab3.tar.xz within
your local repository root in your dev container. Then import the
supplemental files using ./import_supplemental_files.sh. The code should
build without compilation errors once the supplemental files are imported,
but most of the additional tests are likely to fail. In addition, existing
test cases may also fail due to the introduction of the additional system
components; they should pass once you correctly implement the components
in project 3. You may list all the tests in your build directory using the
ctest -N command, or view them in the Run and Debug window in VSCode.
Please also run ./import_data.sh in the repository root to import data if
you're working on the student server. If you're not working on the student
server, please copy the data tarballs into the data subdirectory (which
may not exist yet) under the repository root, and extract all of them.
Once you successfully import the data, you should find the following
directories under data/:
(IMPORTANT!) Before committing, make sure the following lines are added to
.gitignore in your repository to avoid committing these large files to
GitHub.
/data/preloaded_btree
/data/tpch_s01
/data/tpch_s01_ans
/data/tpch_s1
/data/tpch_s1_ans
Note: your repository must be free of any uncommitted or untracked files
before running the import script. If you run into an error during import,
please remove the lab2 directory and the import_supplemental_files.sh
script; commit your repo; and retry.
A few hints:
You may replace your own VarlenDataPage with the reference implementation
that came with the Project 2 solution.
The additional test cases are located in the tests/ directory.
In this project, you will work on the basic operators for query processing in Taco-DB. The goal of this project is to implement the physical plans and execution states for the rest of the basic operators for query processing. In addition to the query operators, there are also two additional operators for inserting and deleting records from the tables. Note that they must maintain all the indexes over the table if any record is inserted or deleted.
To reduce your workload, we provide the implementation of a few operators for your reference. You might find it useful to read them before starting on the tasks. The operators that you DO NOT need to implement are listed below:
The following is a list of the query operators that you need to implement.
At the end of this project, you should be able to programmatically construct query plans and execute them. Technically, you will have an (almost) fully functioning single-threaded database (without a SQL parser or a query optimizer) after this project. The main theme of this project is quite different from the previous projects: it focuses more on common engineering patterns and memory management. After finishing the first few operators, you will find that all the others follow the same design pattern with subtle differences.
There are basic test cases for each of these operators (including the ones
you do not need to implement), which are not meant to be a comprehensive
coverage test of your implementation. There will be test cases with a
slightly more complex query plan
(BasicTestCompositePlan.TestAggregationQuery) that contain these basic
operators to further test the correctness of the implementation.
Source files to read:
include/utils/tree/TreeNode.h
include/expr/ExprNode.h
include/expr/optypes.h
include/expr/*.h and src/expr/*.cpp
Source files to modify:
include/expr/Variable.h
src/expr/Variable.cpp
include/expr/BinaryOperator.h
src/expr/BinaryOperator.cpp
One of the core structures we use in query processing is the tree. Its
basic interfaces and utilities are defined in the TreeNode class. All
expressions, physical plans, and execution states inherit from it. To give
a full example of how to utilize this infrastructure, we provide a
reference implementation of most of the expression trees. The following
description assumes that you understand the basic concepts of expression
trees from a compiler or programming language course, but you may also
refer to any compiler textbook to understand what an expression tree is.
Expression trees represent expressions in the SQL language (such as
boolean expressions student.sid = enrollment.sid, aggregation expressions
AVG(student.adm_year - 1900), arithmetic expressions
DATE '1995-01-01' + interval '5' day, etc.), which can be evaluated over a
record from a certain schema. Unlike regular programming languages, where
there are global and local variables, database expressions can refer to a
column (in Taco-DB, denoted as a Variable) or a plain literal (e.g., an
integer 1900). Based on that, we can construct compound expressions with
operators and function calls. More specifically, each operator in the
expression tree is represented as a subclass that inherits from ExprNode.
To ensure memory safety (i.e., no memory leaks, no double free, no access
after freeing), we always wrap expression nodes in unique pointers. We use
an interpretation-based strategy to evaluate the expression trees. We
first call the ExprNode::Eval() function at the expression tree root with
a record from its outer query plan (e.g., the outer plan of an expression
in the target list of a projection operator is the projection operator).
If this is a non-leaf expression node, it will recursively call the
ExprNode::Eval() functions on its child nodes (denoting the
subexpressions) and apply its own operation over the subexpression values
to produce its evaluation result. A leaf node of an expression tree is
always either a Literal (which evaluates to a constant) or a Variable
(which evaluates to a field value in the record from the outer plan). We
provide two overloads of the ExprNode::Eval function, on serialized
records (const char*) and deserialized records
(std::vector<NullableDatumRef>). The function always returns a Datum with
full ownership (i.e., the Datum itself manages the memory for any pointers
to its data). To ensure that, you can see there are DeepCopy() calls in a
few Eval() implementations. This ensures we make a Datum with full memory
ownership, which will always self-manage its own memory for memory safety.
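For illustration, here is a minimal sketch of building and evaluating such a tree. The factory-style Create functions, OPTYPE_SUB, and the exact signatures below are assumptions for illustration only; check include/expr/*.h and src/expr/*.cpp for the real interfaces.

// Hypothetical sketch (not the exact Taco-DB API): building and evaluating
// the expression "adm_year - 1900" over one deserialized record.
std::unique_ptr<ExprNode> var = Variable::Create(schema, /*field_id=*/1);
std::unique_ptr<ExprNode> lit = Literal::Create(Datum::From(1900));
std::unique_ptr<ExprNode> expr =
    BinaryOperator::Create(OPTYPE_SUB, std::move(var), std::move(lit));

// Eval() recursively evaluates the children, applies the subtraction, and
// returns a Datum with full ownership, so nothing dangles after the call.
Datum result = expr->Eval(record /* std::vector<NullableDatumRef> */);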
Memory management is likely the most challenging part of this project.
You may also find the C++ concepts of rvalue references, move semantics,
and smart pointers very useful. You can find much useful information about
C++ on CppReference, in particular the pages on references, std::move(),
std::unique_ptr, and std::shared_ptr.
Hint: you might want to read these references when you're
going through the provided implementation and spot something
that you don't understand. Try to reason about when memory
allocation/deallocation/shallow object copying/deep object
copying happens when we use these constructs, and ask questions
on Piazza if you are unable to understand the provided code.
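As a generic refresher on where copies and moves actually happen (plain standard C++, not tied to any Taco-DB type):

#include <memory>
#include <utility>
#include <vector>

struct Blob { std::vector<char> bytes; };

int main() {
    // Exclusive ownership: exactly one unique_ptr owns the Blob.
    std::unique_ptr<Blob> a = std::make_unique<Blob>();

    // unique_ptr has no copy constructor; ownership must be transferred
    // explicitly. After the move, `a` is empty (nullptr).
    std::unique_ptr<Blob> b = std::move(a);

    // Moving a vector steals its heap buffer (cheap, no deep copy)...
    std::vector<char> v1(1024, 'x');
    std::vector<char> v2 = std::move(v1);

    // ...while copying duplicates every element (a deep copy).
    std::vector<char> v3 = v2;  // allocates and copies 1024 bytes

    // b's Blob is destroyed exactly once, when b goes out of scope.
}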
Task 1: Implement Variable and BinaryOperator.
Your code will likely pass all the expression node tests in
BasicTestExpression at this point.
Source files to read:
include/plan/PlanNode.h
include/execution/PlanExecState.h
include/plan/X.h, include/execution/XState.h, src/plan/X.cpp, and
src/execution/XState.cpp, where X is one of the following:
TableScan
Projection
Limit
TempTable
TableInsert
TableDelete
Source files to modify:
include/plan/X.h, include/execution/XState.h, src/plan/X.cpp, and
src/execution/XState.cpp, where X is one of the following:
This part of the work contains most of the single-table query operators
and table update actions available in Taco-DB. There will be a few
additional operators not covered here, which we will work
on in Projects 4 and 5.
For each
operator/action, there will be a physical plan (node) and its
corresponding execution state (node). The physical plan of an
operator/action stores all the essential information during
query plan construction
(e.g., its child plan, selection predicate, etc.), while
the plan execution state stores run-time information for evaluating
the physical plan over the current database instance. A new execution
state is created for every execution of a physical plan.
As you can see in the abstract interfaces defined in PlanNode and
PlanExecState, a physical plan encodes the information about the query
itself and always contains an output schema, while an execution state is
essentially an iterator in the volcano model and stores the internal state
of the iterator.
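To make that division of labor concrete, a typical driver loop in the volcano model looks roughly like the sketch below. The method names (create_exec_state, init, next_tuple, get_record, close) are assumptions based on the interfaces described here; check PlanNode.h and PlanExecState.h for the actual signatures.

// Hypothetical volcano-model driver loop (method names are assumptions).
std::unique_ptr<PlanNode> plan = /* construct the physical plan tree */;

// A fresh execution state is created for every execution of the plan.
std::unique_ptr<PlanExecState> exec_state = plan->create_exec_state();
exec_state->init();
while (exec_state->next_tuple()) {
    // The deserialized output record; its fields must remain valid until
    // the next call into the execution state.
    std::vector<NullableDatumRef> rec = exec_state->get_record();
    // ... consume rec ...
}
exec_state->close();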
Once again, the biggest challenge in implementing these plans and execution states is to ensure memory safety. In particular, you need to make sure: (1) if a piece of memory can still be accessed, it must be valid for access (i.e., not deallocated); (2) when you finish using a piece of memory, it should be freed by its owner; (3) the total memory usage of the system should be bounded (we will limit the memory you can use in the tests, and exceeding the limit may get the test process terminated).
As a hint, when you are implementing these classes, you need to take some special care on the memory management of these objects:
Deserialized Records: Deserialized records are represented by
std::vector<NullableDatumRef>. You have to make sure all the fields (all
instances of NullableDatumRef) are pointing to a valid Datum kept alive by
its owner (usually that means it is stored in a local Datum variable or
Datum container that is still in scope, or a member Datum variable or
Datum container of some execution state object). You can ensure this by
caching the current record pointed to by an execution state internally
through a std::vector<Datum> when necessary (see the sketch after this
list). Note that in many cases, you don't have to cache these fields if
you know the NullableDatumRef refers to some Datum that is owned by its
child execution state and will be kept alive when the reference is used.
Derived Schema: You may need a new schema in some physical operators. In
these cases, the plan node should create a new schema and own the object
by storing it as an std::unique_ptr<Schema>. You should always call
Schema::ComputeLayout() on a newly created schema, because it can only
serialize and deserialize records after that call. Do not use
Schema::CollectTypeInfo(), because the current design of the physical plan
and execution state interfaces requires any output schema to be able to
serialize and deserialize records. Similarly, you don't always need to
create a new schema for a plan if you can directly borrow it from its
child.
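Here is a minimal sketch of the record-caching pattern mentioned above, assuming hypothetical member and method names (m_fields, get_record()); the real interfaces live in the Datum and PlanExecState headers.

// Hypothetical sketch: cache the current record inside the execution
// state so the returned NullableDatumRefs point at memory we own.
class SomeExecState : public PlanExecState {
    // Owns the Datums of the current record; lives as long as this state
    // and is overwritten on each next_tuple() call.
    std::vector<Datum> m_fields;

public:
    std::vector<NullableDatumRef> get_record() /* override */ {
        std::vector<NullableDatumRef> refs;
        refs.reserve(m_fields.size());
        for (Datum &d : m_fields) {
            // Each ref points into m_fields, which stays alive until the
            // next next_tuple() call, so no reference dangles.
            refs.emplace_back(d);
        }
        return refs;
    }
};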
You are given full freedom in how to implement the internals of these
classes, including class members, debugging utilities, memory management
strategy, etc., as long as you do not change the abstract interfaces of
PlanNode for physical plans and PlanExecState for execution states.
Note: please read the inlined specifications above the definitions of these classes in the source code for more implementation hints.
Also note that you don't have to implement the save_position() function
and the one-parameter overload of rewind(DatumRef) right now. They will
only be needed in project 5. But you do need to implement the
zero-parameter overload of rewind().
We recommend the following order for this project:
Task 1.2.1: Read TempTable and TempTableState for in-memory temporary
tables.
Task 1.2.2: Read TableScan and TableScanState for heap file based table
scan.
Task 1.2.3: Implement Selection and SelectionState for selection.
Task 1.2.4: Read Projection and ProjectionState for projection.
Task 1.2.5: Implement Aggregation and AggregationState for aggregation
without group-by (we will not implement group-by aggregation this
semester).
Task 1.2.6: Read Limit and LimitState for limiting output to the first N
rows.
Task 1.2.7: Read TableInsert, TableDelete, TableInsertState, and
TableDeleteState for table insert and delete actions.
Task 1.2.8: Before proceeding with the implementation of Sort and
SortState for sorting, please read and complete Tasks 1.3.1 and 1.3.2 in
Section 1.3.
A side note for the sort operator: please use the value of
absl::GetFlag(FLAGS_sort_nways) in src/execution/SortState.cpp as the
default number of sorted runs to merge at a time for external sorting.
After finishing each of the implementation tasks above, it is very likely that you will be able to pass the corresponding tests for that operator.
After you pass all the individual operator/action tests, a final group of tests will verify that they work together seamlessly. This ensures that memory safety is satisfied across the plan/execution state tree. Specifically, they are:
BasicTestCompositePlan.TestAggregationQuery
Finally, there will be a few hidden system tests running query plans over the TPC-H data. They will not be run on the Autograder. Rather, we will run the tests on your latest submission every day and post the results on an anonymous scoreboard.
Source file to read:
include/extsort/ItemIterator.h
Source files to modify:
include/extsort/ExternalSort.h
src/extsort/ExternalSort.cpp
src/storage/VarlenDataPage.cpp
Hint: you will also need to implement an additional function
VarlenDataPage::InsertRecordAt() to pass the associated tests
(BasicTestSortedDataPage), and to continue with the rest of this
subsection.
External sorting in Taco-DB is implemented as a separate component, since
it can be used in various scenarios. Specifically, it sorts all items (raw
byte arrays) provided by an ItemIterator, and returns a unique pointer to
an ItemIterator that can be used to iterate through all the items from the
input in sort order. There are two arguments given to configure the
behavior of external sorting: comp provides the comparator between two
general items, while merge_ways provides the number of ways to merge at a
time, which also means the memory budget for the buffers used in the merge
algorithm is (merge_ways + 1) * PAGE_SIZE.
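As a rough usage sketch (the constructor and Sort() signatures here are assumptions for illustration; see include/extsort/ExternalSort.h for the actual interface):

// Hypothetical usage of the external sort component; the exact
// constructor and Sort() signatures are assumptions (check ExternalSort.h).
auto comp = [](const char *item1, size_t len1,
               const char *item2, size_t len2) -> int {
    // memcmp-style result: <0, 0, or >0 (plain lexicographic byte order).
    int r = memcmp(item1, item2, std::min(len1, len2));
    return (r != 0) ? r : static_cast<int>(len1) - static_cast<int>(len2);
};

ExternalSort extsort(comp, /*merge_ways=*/8);

// `input` is an ItemIterator over the unsorted items; Sort() returns an
// iterator over the same items in sort order.
std::unique_ptr<ItemIterator> sorted_iter = extsort.Sort(input);
while (sorted_iter->Next()) {
    // ... consume sorted_iter->GetCurrentItem() ...
}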
Task 1.3.1: Implement the external merge sort.
You should read the two files ExternalSort.h and ExternalSort.cpp. To
reduce your workload, you only need to implement the following functions:
ExternalSort::Sort()
ExternalSort::MergeInternalRuns()
There are two stages in external sorting: (1) packing initial runs,
sorting them, and writing them to a temporary file; (2) recursively
reading sorted runs from the last pass and using a multi-way merge
algorithm to generate a new pass. Specifically, you should limit the total
amount of memory you allocate for input/output buffers to
(N + 1) * PAGE_SIZE (N input buffers and 1 output buffer for each merge).
You only need two temporary files in the whole procedure: one for writing
out the newly generated pass, and the other for remembering the last pass
as input. You also need to remember the boundaries of the sorted runs in
the input pass so that you know where to start in merging. Every time a
new pass is generated and the old pass from which it was generated is no
longer needed, you can simply continue the next round by flipping the
input and output files. (Hint: you can use m_current_pass and
1 - m_current_pass to do this.)
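That flipping idiom might look roughly like the skeleton below; everything other than m_current_pass (which the hint names) is a hypothetical placeholder.

// Hypothetical merge-phase skeleton: two temporary files, indexed by
// m_current_pass and 1 - m_current_pass, swap roles after each pass.
while (num_runs > 1) {
    // Merge up to merge_ways runs at a time, reading from the file at
    // index m_current_pass and writing longer runs to 1 - m_current_pass,
    // tracking the new run boundaries as you go.
    num_runs = MergeInternalRuns(/* run boundaries of the input pass */);

    // The old input pass is no longer needed; its file becomes the
    // output of the next round.
    m_current_pass = 1 - m_current_pass;
}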
In the implementation of external sorting, you should completely bypass
the buffer manager and allocate your own n_merge_ways + 1 pages of buffer
internally. The ExternalSort object is responsible for performing all
reads/writes on the temporary files directly through the File interface.
This is because the buffer manager only manages data access to main files.
Besides, you want to use VarlenDataPage to help you lay records out in
sorted runs (so you don't have to reinvent a new page layout). During the
merge, you will need to use a priority queue. You can directly leverage
the implementation provided by the C++ Standard Template Library. Its
detailed documentation can be found here. (Hint: the STL priority queue
puts items with higher priority at the front by default.)
Note: not every sorted run is full, and don't forget the incomplete run at
the end.
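Because std::priority_queue is a max-heap by default, a multi-way merge needs a "greater than" comparator so that the smallest item surfaces at the top. The following is plain standard C++ (independent of any Taco-DB types) illustrating the pattern:

#include <queue>
#include <vector>

// Each heap entry remembers which input run it came from, so after a pop
// we know which run to pull the next item from.
struct HeapEntry {
    int key;        // stand-in for the comparator's view of the item
    int run_index;  // which sorted run this item belongs to
};

int main() {
    // std::priority_queue is a MAX-heap by default; a "greater than"
    // comparator turns it into a MIN-heap, so top() is the smallest item.
    auto gt = [](const HeapEntry &a, const HeapEntry &b) {
        return a.key > b.key;
    };
    std::priority_queue<HeapEntry, std::vector<HeapEntry>, decltype(gt)>
        min_heap(gt);

    min_heap.push({42, 0});
    min_heap.push({7, 1});
    min_heap.push({19, 2});

    // Pops in ascending key order: 7, 19, 42.
    while (!min_heap.empty()) {
        HeapEntry e = min_heap.top();
        min_heap.pop();
        // ... emit item e.key, then push the next item from run
        //     e.run_index if that run is not yet exhausted ...
    }
}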
Task 1.3.2: Implement the output iterator for external sorting. Specifically, you need to implement the following functions:
ExternalSort::OutputIterator::OutputIterator()
ExternalSort::OutputIterator::~OutputIterator()
ExternalSort::OutputIterator::Next()
ExternalSort::OutputIterator::GetCurrentItem()
ExternalSort::OutputIterator::SavePosition()
ExternalSort::OutputIterator::Rewind()
ExternalSort::OutputIterator::EndScan()
This iterator implementation will be very similar to the Table::Iterator
and Index::Iterator you have already implemented. The only extra thing you
want to pay attention to is that this iterator should support rewinding to
any previously saved position, which will be useful for implementing
Sort::rewind(DatumRef saved_position) in the next project. Note that the
position information should be encoded in a uint64_t, and these integers
do not have to be consecutive even when two rewind locations are right
next to each other.
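For instance, one possible encoding (purely illustrative, not a required scheme) packs a page number and a slot number into a single uint64_t:

#include <cstdint>

// Purely illustrative position encoding: high 48 bits = page number in
// the final sorted file, low 16 bits = slot within that page. Any scheme
// works, since the spec only requires that a position fit in a uint64_t;
// the values need not be consecutive across adjacent positions.
static inline uint64_t EncodePosition(uint64_t page_id, uint16_t slot_id) {
    return (page_id << 16) | slot_id;
}

static inline void DecodePosition(uint64_t pos,
                                  uint64_t &page_id, uint16_t &slot_id) {
    page_id = pos >> 16;
    slot_id = static_cast<uint16_t>(pos & 0xffff);
}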
After finishing Tasks 1.3.1 and 1.3.2, you should be able to pass the following tests and continue with Task 1.2.8:
BasicTestExternalSort.TestSortUniformInt64Asc
BasicTestExternalSort.TestSortUniformInt64Desc
BasicTestExternalSort.TestSortUniformStringAsc
BasicTestExternalSort.TestSortUniformStringDesc
How to run large tests locally: if you run the large tests
(BasicTestExternalSort) with the debug build, they will likely time out,
because no compiler optimization is enabled in a debug build. To enable
compiler optimization, you'll need to create a new build directory as
follows (assuming the new folder is called build.Release):
# in repository root
cd <dir-to-local-repository>
cmake -Bbuild.Release -DCMAKE_BUILD_TYPE=Release .
cd build.Release
make -j 8
If you're using VSCode, you can select a build type by either clicking on
the build variant button on the status bar at the bottom, or opening the
command palette and searching for CMake: Select Variant. After that,
perform a rebuild of the project.
How to run large tests locally: we provide a few scripts for running the
external sorting basic tests with the default timeout and memory limits
set to a higher number. To run the tests with the defaults, run
ctest -R test_name, use the VSCode run configuration, or:
cd build.Release
./tests/extsort/BasicTestExternalSort.TestXXX.sh # replace XXX with the test names
How to specify memory limits and/or timeouts: while you won't be able to change the memory limits or timeouts we set in the offline tests, you may need to change these values in your local environment to pass the tests locally. If you need to specify a different memory limit and/or timeout locally during debugging, you must do so by running the test through the command line:
cd build.Release
# set the memory limit to 50000KB and timeout to 100 seconds in the test
MEMLIMIT=50000 TIMEOUT=100 ./tests/extsort/BasicTestExternalSort.TestXXX.sh # replace XXX with the test names
You may find the memory limit and timeout settings in test_lab3.conf.sh,
or use the following command to print them to screen:
( . /ro-data/testconfs/test_lab3.conf.sh && echo "$TEST_SPEC" ) | jq .