CSE462/562: Database Systems (Spring 2025)

CSE 462/562: Database Systems (Spring 2025, Taco-DB Project)

Project 5: Join Algorithms

Released: Friday, 3/21/2025
Project due: Monday, 5/5/2025, 23:59:59 am EDT
Last updated: 1/26/2025

0. Getting Started

This project will be built upon your code for previous projects. If you need the solution code for them, plesae extract /ro-data/labs/lab<i>_sol.tar.xz (where <i> is the project number) into your repository root and run ./import_supplemental_files.sh. If you are importing more than the solution code for an earlier project, you may want to import them in the project number order as the latter one may be overwritting some files. If run into any errors, plesae read the paragraph above "a few hints" in bold font.

To get started with project 5, extract /ro-data/labs/lab5.tar.xz when your current working directory is your local repository root, and import the supplemental files as your did in previous projects.

The code should build without compilation errors once the supplemental files are imported, but most of the additional tests are likely to fail. You may list all the tests in your build directory using the ctest -N command.

A few useful hints:

Some parts of this project will heavily rely on your B+-Tree index and external sorting implementation. Even though this project will be relatively easy with fewer lines of code to write comparing with previous ones, it still takes some time and you have to finish the basic tests before you may be able to attempt the next (bonus) project. Please, START EARLY.
The description below is meant to be a brief overview of the tasks. Always refer to the special document blocks in the header and source files for a more detailed specification of how the functions should behave.
You may add any definition/declaration to the classes or functions you are implementing, but do not change the prototype of any public function. In addition, DO NOT edit those source files not listed in source files to modify below, as they will be replaced during tests.
You may ignore any function/class that’s related to multi-threading and concurrency control since we are building a single-threaded DBMS this semester. In other words, your implementation doesn’t have to be thread-safe.
We provide a list of tasks throughout this document, which is a suggestion of the order of implementation such that your code may pass a few more tests after you complete a task. It, however, is not the only possible order, and you may find an alternative order that makes more sense to you. It is also possible that your code still could pass certain tests, which is supposed to because of some unexpected dependencies on later tasks, in which case you might want to refer to the test code implementation in the tests/ directory.

1. Project Overview

In this project, you will work on three additional QP operators in TacoDB: cartesian product (which can be used in conjuction with selection to perform block nested loop join), (sort) merge join and index nested-loop join. They are structured in the same way as the basic QP operators in project 4. Besides, we will provide you a large database generated with the TPC-H benchmark as well as a few complex SQL queries. Your task is to manually plan, optimize, and execute them in TacoDB, using any composition of the available QP operators you have implemented so far.

After successfully completing this project, you will be able to run equi-join and band-join much more efficiently in Taco-DB. The basic tests will test whether your implementation of the two join operators work on some small data. We will stress test your implementation using the larger-scale database, if you'd like to complete the next (bonus) project.

About grading: the total grade for project 5 basic tests is 12.7 points. There are the additional 12 points in the next (bonus) project, which will be added directly to your total score.

1.1 Cartesian Product

Source files to modify:

include/plan/CartesianProduct.h
src/plan/CartesianProduct.cpp
include/execution/CartesianProductState.h
src/execution/CartesianProductState.cpp

The cartesian production operator concatenates all possible pairs of tuples from two input operators, which is the simplest binary relational operator and one of the basic relational algebra operators. In this task, you should implement the text-book version of the cartesian product under the standard SQL semantics (bag semantics). With this operator, you should be able to form inner join query plans by placing a selection operator on top of a cartesian production operator.

Task 1: Implement CartesianProduct and CartesianProductState for Cartesian products.

Your code is likely to pass the tests BasicTestCartesianProduct.*, and BasicTestCompositionPlan.TestJoinQuery.

1.2 (Sort) Merge Join

Source files to modify:

include/plan/MergeJoin.h
include/execution/MergeJoinState.h
src/plan/MergeJoin.cpp
src/execution/MergeJoinState.cpp
src/execution/*.cpp

Sort merge join is an efficient algorithm for equi-joins or band-joins as we discussed in the class. In TacoDB, we would like to implement an equi-join verison of it, and you may assume its input execution states produce records in your desired order (i.e., the child plan produces the results in the sorted order in terms of the join column(s)). This design allows us to decouple the sort operator and the merge join operator, such that the sort operator is only added to the plan when needed. Other than implementing the plan and execution state for merge join, you also have to implement the rewind(saved_position) and save_position() for all other operators in project 4 and 5 (including merge join itself) to ensure everything works.

Task 2: Implementing save_position() and rewind(saved_position) functions for all previous physical query plan states. Only actions like TableInsert and TableDelete do not need to support it.

Generally, you can save essential information in a Datum. You can create new Datum through different Datum::From() overloads (which can encapsulate all basic types with a size less than 8 bytes). If you need to store more information than a single Datum can hold, you can use DataArray utilities. You might get some hint on how to use it by looking at the example in CartesianProducteState::save_position(). The crux of this problem is to extract all essential information of iterator position so that later on it can be reconstructed when rewind(saved_position) is called.

Task 3: Implement merge join plans and execution logic.

Similar to what you have done with project 5, you need to implement everything in MergeJoin physical plan and MergeJoinState execution state. The physical plan part is pretty much the same as CartesianProduct with a few more states to remember. In particular, other than its children, you have to provide the expressions on which both sides will be joined upon and the comparison functions for these join expressions (hint: you can use lambda expressions in C++ to create a closure, i.e., an anonymous function that captures the variables in scope). The execution state should assume both sides of the join are sorted based on their join expressions before this operator is applied, meaning no additional sorting should be done within MergeJoin. To reiterate over a group of items with the same key in the inner relation, you will need to use save_position() and rewind(saved_position) on the inner plan.

Since the result of a merge join can serve as an input of another one, you have to make sure to implement save_position and rewind(save_position) for MergeJoinState as well.

1.3 Index Nested Loop Join

Source files to modify:

include/plan/IndexNestedLoop.h
include/execution/IndexNestedLoopState.h
src/plan/IndexNestedLoop.cpp
src/execution/IndexNestedLoopState.cpp

Indexes, specifically B+-Trees in TacoDB, can be leveraged to perform certain types of joins. We will implement the index nested loop join for equi-joins in this project. As in the System-R style query plan, the inner relation of an index nested loop is always a table with an index built on top of a list of fields (provided as an index descriptor).

Task 4: Implement index nested loop join plans and execution logic.

The general idea of the index nested loop join is to use each output record of the outer plan to probe the provided index for the matching records from the inner table.

Note: You may find the index nested loop interface is for band join. However, we only test equi-joins in basic tests. If you would like to attempt the next (bonus) project, you may need to implement regular band join as well.

The current interface works as the following: the first k-1 inner fields are compared to the corresponding outer expressions for equality, while the final (k-th) field is checked against a range defined by two outer expressions. Thus, to form an equi-join through this interface, you need to set equal k-th outer (lower) expression and upper expression, with lower_isstrict and upper_isstrict both set to false. Please refer to hints provided in include/plan/IndexNestedLoop.h for rationale.

Another tricky thing to keep in mind is related to save_position and rewind(saved_position) functions for IndexNestedLoopState. You still need to implement them since the index nested loop operator may serve as an input of a merge join. Note that we do not have an interface for the index to start a scan at a given (key, recid) for now. So you have to specially handle the rewinding to a specific data entry in the index nested loop operator. You can start the scan with a saved search key and do a short linear scan to find the data entry with the heap record id that it is supposed to be rewinded to.

At this point, you should also be able to pass one additional system test over the TPC-H data: SystemTestTPCHQX.