Release date: 2026-03-20 00:00:00-04
Final Due date: 2026-05-04 23:59:59-04
Final Submission enabled until: 2026-05-06 23:59:59-04
Last updated: 3/23/2026
Project 3 will be a half-semester open project to build an log-structured merge (LSM) tree based on the original 1996 O'Neil paper. The suggested order for implementation is to first develop an external merge sort implementation; then B-tree bulk loading on sorted data; and finally the LSM tree itself. In this project, you need to account for concurrency control but it is ok to simply protect part of the trees using mutually exclusive locks as described in the O'Neil paper. Alternatively, you may perform any optimization or make tweaks to the original design as you see fit. You should consider implementing a simple I/O buffer and make changes to the B-tree code to make use of the I/O buffers.
Your implementation should contain APIs for a set of
minimal functionalities of each component. For external sorting, you should
provide an interface that takes in an unsorted collection of key-value pairs
(as iterator, file, or other means that do not require full materialization in
memory) and produce a sorted collection of key-value pairs (again, as iterator,
or file that do not require full materialization in memory). For B-tree bulk
loading, you should provide an interface that takes a sorted collection of
key-value pairs (as iterator or files), and builds the internal levels such
that it is a valid B+-tree in a single file. In addition, your interface must
take a specific fill factor f such that all but a logrithmic
number of nodes are $f$-filled (meaning at least $f$ portion of the total node
space is occupied in each node). For the final LSM-tree project, you should
support insertion, deletion and range lookups with a set of similar interfaces
to those in the BTree class.
In this project, we will not provide test cases for you. Instead, you need to write test cases to validate your design, and the grading will be based on live demo and code reviews.
To get started, please import /ro-data/labs/lab3503.tar.xz, which
will disable linking against the pre-built binary solution. From this point on,
your codebase will only build and link against your own source code. You may
make any changes to the codebase as needed.
In addition to the final deadline, we will have two checkpoints:
You need to make submissions by the checkpoint and final deadlines in order to have your code reviewed. You will also be required to demonstrate your code in the recitation after each deadline. The grade will be based on the correctness and efficiency of your implementation, and the completeness of your test cases.
(Grading Rubrics) For each checkpoint, we will grade your last submission based on the following criteria: