CSE462/562: Database Systems (Spring 2022)

CSE 462/562: Database Systems (Spring 2022)

Project 3: Data Layout and Access Methods

Release date: Thursday, 2/16/2023
Project due: Tuesday, 2/28/2023, 01:00 am EST
Write-up due:Thursday, 3/2/2023, 01:00 am EST
Last updated: 2/16/2023

0. Getting Started

This project will be built upon your code for project 2. If you need the solution code for project 2, please download it from UBBox (link released on Piazza, under Homework Solutions). To get started, extract /local/Spring_2023/cse562/lab3.tar.xz on the student servers and import the supplemental files as your did in previous projects. The code should build without compilation errors once the supplemental files are imported, but most of the additional tests and BasicTestRepoCompilesAndRuns.TestShouldAlwaysSucceed are likely to fail. You may list all the tests in your build directory using the ctest -N command. As before, you may attempt any number of code submissions at Autolab, project 3 (the same hourly submission rate limit applies as before). To do so, please tag or create a new branch of the commit you want to submit, and provide the tag or branch name in Autolab.

Grading policy: Your grade will be based on the testing result of your code, and an independently completed write-up (see write-up section for more details). There is a hidden test suite in project 3 that is only tested on Autolab and during the offline tests (read Section 1.2 for details). Once you submit your code on Autolab, it will run the tests on its VM and give you a list of scores for individual test cases. Similar to project 2, we will also run an offline test on your last submission after the project due and take the higher score for each test case.

A few useful hints:

We strongly encourage you to read through the entire document before you start writing code. You might also find it useful to review the lecture slides for the data storage layout and the Project 2 web page. The project requires you to write a fair amount of code, so you might want to start early.
The description below is meant to be a brief overview of the tasks. Always refer to the special document blocks in the headers and source files for a more detailed specification of how the functions should behave.
You may add any definitions/declarations to the classes or functions you're implementing, but do no change the signatures of the public fucntions. Also do not edit those source files not listed in the source files to modify below, as they will be replaced during tests.
You may ignore any function/class that's related to multi-threading and concurrency control since we are building a single-threaded DBMS and your implementation doesn't have to be thread safe.

1. Project overview

Figure 1: Taco-DB Storage Layer Overview

In this project, you will continue on the rest of the storage layer of Taco-DB (see Figure 1). More specifically, the goal is to implement the data page and the heap file. There are a few questions to guide you through the project You need to answer them in your project write-up.

1.1 Data Page

[Back to Project Overview]

Source files to modify:

include/storage/VarlenDataPage.h
src/storage/VarlenDataPage.cpp

How to run parameterized tests individually: The tests for the data page are parameterized (meaning each named test case is actually a suite of test cases instantiated from a template with different parameter values). You won't be able to run each individual test case if running them with ctest. Instead, ctest will run a test template for multiple rounds with all the parameter values. To run a test with a specific parameter, e.g., the page initialization test with min_reclen = 8, max_reclen = 8 and usr_data_sz = 0,

(for command line) run it directly using the test executable with command: ./tests/storage/BasicTestVarlenDataPage --gtest_filter='*TestInitialize/8_8_0'. If you are running the test case using gdb in command line, start it as usual: gdb ./tests/storage/BasicTestVarlenDataPage. Then enter r --gtest_filter='*TestInitialize/8_8_0' in GDB to start the program with a specific test case.
(for VSCode) open Run and Debug (Ctrl + Shift + d). Locate the test suite (e.g., */BasicTestVarlenDataPage.TestInitialize/*) from the drop down menu and click on the settings button right next to it. This will open .vscode/launch.json in the text editor and the cursor will be located at the entry for that specific test case. Update the first argument in the args field with the specific parameter that you want to run (e.g., */BasicTestVarlenDataPage.TestInitialize/8_8_0).

Brief overview of data pages: Taco-DB manages its data in the fixed-size pages (of PAGE_SIZE bytes), where we may store a number of records as well as some additional information needed by the database. Taco-DB supports variable-length data types (e.g., VARCHAR), as well as nullable table fields, and thus it needs to have the capability of storing variable-length records on the data pages using a slotted data page design. In this project, a record is simply an array of bytes of certain length. Note that a record must be aligned to some 8-byte address (you may use MAXALIGN() for that).

Figure 5: Data Page that supports variable-length records

The layout of a data page is shown in Figure 5. The page begins with a 24-byte header VarlenDataPageHeader, which includes the PageHeaderData at offset 0. Following that is an optional fixed-size user data area (you'll need it for the later B-tree project). Slots on the page are indexed by SlotId typed integers, from 1 upwards (with INVALID_SID=0). The slot array starts at the end of the page and grows backwards. Hence the SlotData of slot sid may be accessed as ((SlotData*)(pagebuf + PAGE_SIZE))[-(int) sid], if char *pagebuf points to the beginning of the page. The actual record payload for slot sid is stored somewhere between the end of the user data area and the begin offset of the free space. To insert a record, find a free existing slot or create a new slot in the slot array, and find some empty space to copy the record payload. To erase a record, set its slot to be invalid and possibly remove all the invalid slot at the end of the slot array if any. You may refer to the class and function documentation in the header file for a more detailed specification.

Before you get started with coding, please answer the following questions by reading the function specifications in include/storage/VarlenDataPage.h:

Question 1: Let's assume there is a pinned buffer frame pagebuf and we have invoked VarlenDataPage::InitializePage(pagebuf). What are the values of the following C++ expressions: (1) ((VarlenDataPageHeader *) pagebuf)->m_ph_sz; (2) ((VarlenDataPageHeader *) pagebuf)-> m_nslots ?

Question 2: Suppose we create an instance of the data page as VarlenDataPage dp1(pagebuf) and call dp1.InsertRecord() with some parameter. What are the values of the following C++ expressions: (1) ((VarlenDataPageHeader *) pagebuf)->m_ph_sz; (2) ((VarlenDataPageHeader *) pagebuf)-> m_nslots ; (3) dp1.GetMaxSlotId()?

Question 3: Suppose we mark the buffer frame as dirty and unpin the buffer frame. Then pin it again in a buffer frame pagebuf2. What are the values of the following C++ expressions: (1) ((VarlenDataPageHeader *) pagebuf2)->m_ph_sz; (2) ((VarlenDataPageHeader *) pagebuf2)-> m_nslots?

Question 4: Suppose we create an instance of the data page as VarlenDataPage dp2(pagebuf2). What are the values of the following C++ expressions: (1) ((VarlenDataPageHeader *) pagebuf2)->m_ph_sz; (2) ((VarlenDataPageHeader *) pagebuf2)-> m_nslots ; (3) dp2.GetMaxSlotId()?

Question 5: Suppose we invoke VarlenDataPage::InitializePage(pagebuf2, 8) at this time. What are the values of the following C++ expressions: (1) ((VarlenDataPageHeader *) pagebuf2)->m_ph_sz; (2) ((VarlenDataPageHeader *) pagebuf2)-> m_nslots ; (3) dp2.GetMaxSlotId()?

Task 1: Implement the page initialization and a few page meta information functions:

VarlenDataPage::InitializePage()
the constructor VarlenDataPage::VarlenDataPage()
VarlenDataPage::GetUserData()
VarlenDataPage::GetMaxSlotId()
VarlenDataPasge::GetRecordCount()

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestInitialize/* at this point.

Task 2: Implement the slot query and insertion functions: VarlenDataPage::InsertRecord(), VarlenDataPage::IsOccupied(), and VarlenDataPage::GetRecordBuffer().

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestInsertRecord/* at this point.

Task 3: Implement the record erasure function: VarlenDataPage::EraseRecord().

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestEraseRecord/* at this point.

Task 4: Implement the record update function: VarlenDataPage::UpdateRecord().

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestUpdateRecord/* and */BasicTestVarlenDataPage.TestSidNotValidErrors/* at this point.

As you might have noticed, there could be unoccupied spaces between offsets m_ph_sz and m_fs_begin (i.e., holes) after erasure or update. Hence, the data page may have to compact those space by moving the record payloads to be consecutive. During a page compaction, the SlotData in the slot array may have to be updated, but the slots themselves may not be moved. In other words, a valid slot remains valid and still refers to the same record; an invalid slot remains invalid; and the total number of slots remains unchanged after a compaction. To avoid trying to compact a page every time an insertion or an out-of-place update happens, you should maintain a bit m_has_hole in the page header to indicate whether it is possible to (not must) have holes in the occupied space. In the tests, we assume the data page performs lazy compaction, which means the compaction process will not be triggered until it becomes necessary. Please refer to the class documentation for more details.

Task 5: Implement the page compaction and update the page update functions accordingly.

You should pass all the tests in BasicTestVarlenDataPage at this point.

Optional: there are four functions for maintainng a sorted array of records and computing the available space: VarlenDataPage::InsertRecordAt(), VarlenDataPage::RemoveSlot(), VarlenDataPage::ShiftSlots(), and VarlenDataPage::ComputeFreeSpace. These are not required in this project and we do not provide the tests for them yet. You'll need to implement them in the next project of implementing B-trees.

1.2 Heap File (Table interface)

[Back to Project Overview]

Source files to modify:

include/storage/Table.h
src/storage/Table.cpp

Tips on the tests:

There are two versions of each tests for the heap files: BasicTestTable and SystemTestTable. You might want to debug your implementation using the basic test first. The code for both tests is actually the same (see tests/storage/TestTable_common.inc). The difference is that the basic test uses the an in-memory catalog implementation BootstrapCatCache to create the table descriptors in tests, and thus do not have dependencies on your heap file implementation. The hidden system test uses the a heap file based catalog implementation PersistentCatCache, which initializes a catalog using your heap file implementation, and thus is more likely to fail to initialize if your implementation is buggy.

The test BasicTestRepoCompilesAndRuns.TestShouldAlwaysSucceed, which also uses PersistentCatCache, is a good indicator for whether your heap file implementation is buggy. If it passes, your code may or may not pass the system test. However, it probably will not pass the system tests if this test fails.

Brief overview of heap file: A heap file is simply an unordered (multi-)set of records. It is the most basic way of organizing records in DBMS. In Taco-DB, a heap file is implemented on top of the file manager managed virtual files and stores records in some arbitrary order in the list of pages. A table descriptor TabDesc stores essential information about a table, which includes the file ID of the table. The Table class should open the file through the global file manager g_fileman, make calls to the global buffer manager g_bufman to access the pages, and updates or retrives records on the page using VarlenDataPage. It should support inserting, updating, deleting and iterating over records. To simplify the implementation, you only need to allocate pages when there's no space for insertion, and do not have to reuse free space on previous pages (but you could if you'd like to). In addition, you should free the page in the file manager when they become empty. The Table::Iterator class provides forward-only iterator over all the records, and it should only keep at most one page pin at a time.

Note: in a more realistic DBMS, this may cause an issue of trying to deallocating a page that is actively being scanned (which may be in the same transaction/thread, such as an UPDATE statement). You do not have to worry about the tricky case in this project because we ensure that there will be either no concurrent scans over the file when a deletion or update happens, or the concurrent scan may only cause in-place update of records (which can never result in empty pages).

You might want to pay attention to the specifications in the header file in order to pass all the tests, especially the system tests.

Before you start working on the code, please answer the following questions by reading the specifications in include/storage/Table.h.

Question 6: Suppose there is a table descriptor tabdesc that describes a new table over a newly allocated file with file ID tabdesc->tabfid(), and we invoke Table::Initialize(tabdesc). If we create an instance of the table as auto tab = Table::Create(tabdesc) and iterate through all the records in it using the Table::Iterator obtained from tab.StartScan(), how many records will we find?

Question 7: Suppose we invoke tab.InsertRecord() on some record that fits into a data page and then iterate through all the records with tab again. How many records will we find?

Question 8: Suppose we create another instance of the table as auto tab2 = Table::Create(tabdesc) and invoke tab2.InsertRecord() on some record that fits into a data page. If we iterate through all the records with tab2, how many records will we find?

Question 9: If we iterate through all the records again with tab, how many records will we find?

Task 6: Implement the table initialization and construction functions:

Table::Initialize()
Table::Create()
Table::Table()
Table::~Table()

Your code is likely to pass BasicTestTable.TestEmptyTable at this point.

Task 7: Implement Table::InsertRecord(), Table::StartScan, Table::StartScanBefore and the Table::Iterator class. You may change, delete, or add to the existing members in the Table and Table::Iterator if necessary, and the signature of any private functions -- they only exist as a suggestion.

You code is likely to pass BasicTestTable.TestInsertRecord and BasicTestTable.TestStartScanBefore at this point.

Task 8: Implement Table::EraseRecord().

Your code is likely to pass BasicTestTable.TestEraseRecord at this point.

Question 10: Obviously, the pages in a heap file can become empty because of deletion of records. You should free the page when that happens. The question is whether a page can become empty because of update of records? If so, when (please give a concrete example as a sequence of operations on the heap file)? If not, why (please give a proof based on the function specifications)?

Task 9: Implement Table::UpdateRecord().

Your code should pass all the tests in BasicTestTable and SystemTestTable at this point.

2. Write-up requirement

Each individual student needs to complete a write-up independently and submit it to UBLearns with answers to Questions 1 - 15. Please find the due date of the write-up at the beginning of this web page and submit a PDF document to UBLearns by the deadline. Please do not submit any other formats such as .docx, .doc, .txt -- we will not grade any non-PDF submission.