CSE 462/562: Database Systems (Fall 2024)

Project 2: Data Layout

Released: Thursday, 9/5/2024
Project due: Monday, 9/16/2024, 23:59:59 EDT
Last updated: 9/12/2024

0. Getting Started

This project builds upon your code for project 1 (FSFile). If your code does not pass the tests and you'd like to use the solution code for Project 1, please extract /ro-data/labs/lab1_sol.tar.xz (to be released after the Project 1 grace period on 9/11) within your local repository root in your dev container. Then, run ./import_supplemental_files.sh. If you run into any errors, please read the third paragraph below in bold font.

To get started with project 2, extract /ro-data/labs/lab2.tar.xz within your local repository root in your dev container. Then import the supplemental files using ./import_supplemental_files.sh. The code should build without compilation errors once the supplemental files are imported, but most of the additional tests are likely to fail. In addition, existing test cases may also fail due to the introduction of the additional system components; they should pass once you correctly implement the components in project 2. You may list all the tests in your build directory using the ctest -N command, or view them in the Run and Debug window in VSCode.

Note: your repository must be free of any uncommitted or untracked files before running the import script. If you run into an error during import, please remove the lab2 directory and import_supplemental_files.sh script; commit your repo; and retry.

As before, you may attempt any number of code submissions, subject to the same submission rate limit of 10/hour/group. To submit, tag the commit you want to submit, then submit it using the following command: submit test 2 <git-commit-id-or-tag-name>.

To list your submissions, please enter submit list-subs 2.

To list the submission details of a submission with a specific sequence number, please enter: submit list-sub-details 2 <seqno>.

To find the project deadline and last date to submit, please enter: submit list-labs.

Grading policy: your grade will be based on the test results of your code. There is a hidden test suite in project 2, SystemTestTable, whose source code is not available in the lab tarball. Once you submit your code, the hidden test cases will also run on your code, and any scores you obtain from them will also count towards your grade.

A few useful hints:

  1. We strongly encourage you to read through the entire document for the high-level design of the storage layer in Taco-DB before you start writing code. The project requires you to write a fair amount of code, so you might want to start early.
  2. The description below is meant to be a brief overview of the tasks. Always refer to the special documentation blocks in the headers and source files for a more detailed specification of how the functions should behave.
  3. You may add any definitions/declarations to the classes or functions you're implementing, but do not change the signatures of the public functions. Also, do not edit any source files not listed in the "source files to modify" lists below, as they will be replaced during tests.
  4. You may ignore any function/class that's related to multi-threading and concurrency control since we are building a single-threaded DBMS. Your implementation doesn't have to be thread safe.
  5. We provide a list of tasks throughout this document, which suggests an order of implementation such that your code may pass a few more tests after you complete each task. It is, however, not the only possible order, and you may find an alternative order that makes more sense to you. It is also possible that your code still cannot pass certain tests it is supposed to because of some unexpected dependencies on prior/later tasks, in which case you might want to refer to the test code in the tests/ directory.
  6. If you're working in a team of two, you might want to first go through the list of tasks, agree on a big picture of the design, and divide the implementation into independent pieces. One strategy is to implement tasks that may be tested independently. For instance, the data page tasks (Section 1.4) and the heap file tasks (Section 1.5) can be implemented largely independently as long as you both agree on the same interfaces and data layouts. Alternatively, you could do pair programming. You may also come up with better ways of collaborating that make more sense to you. In any case, the grade for the code implementation will be based on the test results of your group's final submission.

1. Project overview


Figure 1: Taco-DB Storage Layer Overview

Figure 1 shows the storage layer of Taco-DB. We have provided the implementation of the file manager, which manages disk space in a set of data files on the file system. You don't have to modify any code of the file manager, but you should read Section 1.1 in order to know its APIs. On top of that, we provide a buffer manager that implements the buffer pool we discussed in class, in order to reduce I/Os and improve data access latency. You do not need to implement anything there either, but you should read Section 1.2 and the associated source code to understand the implementation and interfaces.

The goal of project 2 is to implement the (slotted) data page (Section 1.4) and the heap file (Section 1.5). There is also an exercise to help you better understand the record layout in a DBMS: reading the record layout and Schema object implementation in Taco-DB described in Section 1.3 (no submission needed).

1.1. File Manager

[Back to Project Overview]

Source files to modify: none.

Source files to read:

You don't have to implement anything in the file manager. You only need to skim through the file manager API at this stage. You may always revisit this section and the implementation when it's needed.


Figure 2: File Manager

Taco-DB manages its data in a flat main data storage space organized as fixed-size pages (see Figure 2), through the FileManager interface defined in include/storage/FileManager.h. The size of each page is fixed at PAGE_SIZE = 4096 bytes. Page numbers are represented as the unsigned 32-bit integer type PageNumber, where page number 0 (INVALID_PID) is an invalid page number reserved for file manager use. Recall that you have implemented an interface FSFile for file I/O over the file system. The file manager breaks up the entire space into up to 256 files stored on your file system (usually in a tmpd.XXXXXX/main directory when you're running tests). There is no need to worry about every detail of the file manager, but it is useful to understand how it manages pages for debugging purposes.

Note: the tmpd.XXXXXX directory is removed upon normal exit of a test case so that you will not deplete the shared storage space on minsky. However, it may remain if the program exits abnormally (e.g., SEGFAULT, stopped during debugging in VSCode/GDB, ...). The build script will also remove data directories created more than 1 day ago when you build your code.

There is a single global instance of file manager g_fileman declared in include/dbmain/Database.h.

Note: the FileManager also manages the temporary data files, but each temporary file has its own page number space that is independent of the main data storage space. They are not used in this project and we will cover the details in later projects that use temporary files. Within the context of this project, a file page refers to a page in the main data storage space.


Figure 3: File manager managed (virtual) file

The file manager further divides the entire main data storage space into (virtual) files (see Figure 3). Each file is simply a linked list of pages in the main data storage space, which may be created and opened through the file manager. An open virtual file is returned as an object of the File class, through which you may find the first page, find the last page, or extend the file. Each data page has a fixed-size header PageHeaderData at the beginning, which is managed by the file manager and stores the next and previous page numbers and some other information internal to the file manager. When you access a page in Taco-DB, you must not modify the page header.
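Since a virtual file is just a linked list of pages, walking it amounts to following the next-page links in each page header. The sketch below illustrates this; the accessor names (GetFirstPageNumber(), GetNextPageNumber()) and the ReadPage() argument order are assumptions for illustration only, so check include/storage/FileManager.h for the real File and PageHeaderData interfaces.

    // A hedged sketch of walking a virtual file page by page.
    void WalkFilePages(File *f) {
        // O_DIRECT I/O requires a 512-byte aligned buffer (see Section 1.2).
        alignas(512) char pagebuf[PAGE_SIZE];
        PageNumber pid = f->GetFirstPageNumber();       // assumed accessor
        while (pid != INVALID_PID) {
            g_fileman->ReadPage(pid, pagebuf);          // assumed argument order
            // PageHeaderData sits at offset 0 and links the pages together;
            // it is owned by the file manager: read it, never modify it.
            auto *hdr = reinterpret_cast<PageHeaderData *>(pagebuf);
            pid = hdr->GetNextPageNumber();             // assumed accessor
        }
    }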

1.2. Buffer Manager

[Back to Project Overview]

Source files to modify: none

Source files to read:

While one can read/write pages directly through FileManager::ReadPage() and FileManager::WritePage() on demand, that is not very efficient, as each read/write incurs a disk I/O. As we covered in the lecture, a DBMS utilizes a buffer manager to keep a set of pages in memory in order to reduce I/O requests and improve data access throughput and latency. In Taco-DB, the BufferManager is built on top of the FileManager. It handles all the page access requests from other system components and maintains a buffer of disk pages in a fixed number of buffer frames, issuing I/O calls on the global file manager instance g_fileman.

There is a single global instance of the buffer manager, g_bufman, declared in include/dbmain/Database.h (except in the buffer manager tests).


Figure 4: Buffer Manager PinPage() call

Figure 4 is a sketch of how the BufferManager works. It allocates a number of consecutive page-sized buffer frames in memory, indexed from 0 by the unsigned integer type BufferId, with INVALID_BUFID defined as std::numeric_limits<BufferId>::max(). The buffer frames must be aligned to 512-byte boundaries (using aligned_alloc()), because the FileManager uses O_DIRECT I/Os over the file system. For each buffer frame, the buffer manager maintains some meta information in a separate structure, including the current pin count, a dirty bit, the page number the frame currently holds, as well as any information needed by the eviction policy. In addition, the buffer manager maintains a hash table that maps page numbers to buffer IDs to speed up lookups.

Upon a PinPage() call, the buffer manager looks up the page number in the hash table. If the page is found in some buffer frame, it increments the frame's pin count and returns its buffer ID. Otherwise, the buffer manager selects an unpinned buffer frame using some eviction policy (e.g., LRU, MRU, Clock, etc., but not a random eviction policy!) and writes back the page currently in the selected frame if it is dirty. Once the frame is clean, the requested page is read into it, and the frame is returned to the caller with its pin count incremented. Note that a buffer frame may have a pin count greater than 1 (think: why and when?). An UnPinPage() call simply decrements the specified buffer frame's pin count.
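To make the calling convention concrete, here is a minimal usage sketch. It assumes PinPage() returns the BufferId and hands back a pointer to the frame through an out parameter, and that a MarkDirty() call exists for flagging modified frames; verify the exact signatures in the buffer manager header before relying on them.

    // A hedged usage sketch of the buffer manager.
    void TouchPage(PageNumber pid) {
        char *frame;
        BufferId bufid = g_bufman->PinPage(pid, &frame);  // I/O only on a miss
        // ... read or modify the PAGE_SIZE bytes at `frame` ...
        g_bufman->MarkDirty(bufid);     // assumed: flag the frame as dirty
        g_bufman->UnPinPage(bufid);     // drop our pin; the page stays cached
    }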

If there are a sufficient number of pin-page requests over distinct pages, the buffer manager will eventually run out of free frames. In that case, PinPage() will not work unless an eviction policy is implemented for choosing a victim page to evict from the buffer pool. We have discussed several basic eviction policies (LRU, MRU, Clock) in the lecture; any of them may be used here.
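For reference, the following is a minimal sketch of the Clock policy over per-frame metadata like that described above, assuming Taco-DB's PageNumber/BufferId typedefs; the struct and its field names are illustrative rather than the actual members in the provided code.

    #include <cstddef>
    #include <cstdint>

    // Illustrative per-frame metadata; the real member names differ.
    struct FrameMeta {
        PageNumber pid;        // page currently held by the frame
        uint32_t   pin_count;  // pinned frames are never evicted
        bool       dirty;      // must be written back before reuse
        bool       ref;        // Clock's "second chance" bit, set on access
    };

    // Sweep the clock hand; return a victim frame, or INVALID_BUFID if every
    // frame is pinned. Two full sweeps suffice: the first clears reference
    // bits, so the second must find a victim if any frame is unpinned.
    BufferId ClockPickVictim(FrameMeta *meta, size_t nframes, size_t &hand) {
        for (size_t scanned = 0; scanned < 2 * nframes; ++scanned) {
            hand = (hand + 1) % nframes;
            if (meta[hand].pin_count > 0)
                continue;                  // pinned: skip
            if (meta[hand].ref) {
                meta[hand].ref = false;    // give it a second chance
                continue;
            }
            return (BufferId) hand;        // unpinned and not recently used
        }
        return INVALID_BUFID;
    }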

The buffer manager in Taco-DB does not flush pages back unless a page is evicted or the database shuts down, as opposed to a more sophisticated DBMS where the buffer pool may flush pages from time to time for performance and recovery considerations. In Taco-DB, the BufferManager::Flush() function flushes all the dirty pages back to disk. It should only be called during database shutdown from BufferManager::Destroy(). When that happens, all the buffer frames (whether clean or dirty) should have zero pin counts, and a fatal error is thrown if some pin count is not zero, because a non-zero pin count indicates bugs in other components.

1.3 Schema and Record Layout

[Back to Project Overview]

Source files to modify: none.

Source files to read:

In this section, we will explore how fields are laid out given a particular schema in Taco-DB. This is a variant of one of the record layouts discussed in the lectures. Your task is to understand the details and to implement a few functions that access the fields in a record correctly.

Before we talk about the Schema objects and the record layout, let's have a brief overview of data types and data representation in Taco-DB. You are probably familiar with data types in common programming languages (e.g., int, float, std::string, ... in C++) -- a data type usually defines a set of possible values, a set of allowed operations, as well as the representation of these values in terms of low-level primitives such as bits, bytes, or words. Since we are implementing an RDBMS, it has a different type system (that of SQL) than the language it is implemented in (i.e., C++) -- an important distinction to make.

Each SQL type (e.g., INT4, FLOAT, VARCHAR(256)) has its own valid set of values and permitted operations. A datum of a SQL type is always stored as a fixed-length or variable-length object (a sequence of bytes in memory or on disk) in Taco-DB, without the type information attached. For instance, an INT4 is stored as a 4-byte object with its initial address aligned (cf. Data Alignment in Records of Lecture 4) to 4-byte memory address/file offset boundaries. There is a set of functions and operators that can be applied to INT4, mostly defined in src/utils/typsupp/int4.cpp. It may be conveniently stored in an int32_t in C++ to simplify the implementation. Another example is VARCHAR(n), a variable-length type, which is stored as an object of the same length as the string itself, up to n characters, with no special alignment (i.e., alignment = 1). Most of its functions and operators are defined in src/utils/typsupp/varchar.cpp. In C++, we pass around a pointer (char*) to the beginning of a VARCHAR(n) object when it is in memory. You may wonder how we can determine the length of a VARCHAR string, since we store neither a length with the object nor a null terminator ('\0' as in a C string). The short answer is that the length is encoded in the record layout, and you will find out how once you finish reading this section.

Because the database has to operate on data whose types are specified dynamically at run time, the system must know the type information of a datum at run time in order to operate on it. Each type has its own requirements and allowed operators and functions, stored in the database catalog (a set of special tables internal to the system), and we use a unique Type ID to denote each type. Whenever we process a table in a database, we obtain a Schema object, which stores the type ID of each field. That is one of the two main purposes of the Schema object -- to keep track of the types of the data.

Since different data types have different representations and types are dynamic at run time, we use a Datum object (see include/base/datum.h) to store an object of any type in memory (as a sequence of bytes). You may think of it as a union of values of different types (e.g., integers, floats, strings), but internally it has different ways of storing the data. In general, a fixed-length object of 1, 2, 4 or 8 bytes is stored inline in a Datum object (e.g., int32_t, int64_t), while a variable-length object (e.g., a string) or a fixed-length object of any other size is stored in a buffer, and the Datum keeps a pointer to it (a char*, which can be cast to a pointer to a specific type structure). Depending on where this buffer is located (on a buffer page or in some malloc-allocated memory), the buffer may or may not be owned by the Datum. Note that Datum does not have a copy constructor or a copy assignment operator because of that (more on this in project 5). To reference a Datum, you may use either const Datum& or the DatumRef class. When an object is stored on a data page, it is always stored as a flat array of bytes without pointers or references. Note that a fixed-length object may be stored as a variable-length datum (e.g., CHAR(5)), which should be accessed using Datum::GetVarlenBytes(), Datum::GetVarlenSize() or Datum::GetVarlenAsStringView() rather than Datum::GetFixedlenBytes().
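The sketch below shows the distinction in code. The accessors are the ones named above; the construction calls (Datum::From(), Datum::FromVarlenBytes()) are assumed names for illustration only -- check include/base/datum.h for the real ones.

    #include <cstdint>
    #include "base/datum.h"

    // Hedged sketch: fixed-length vs. variable-length Datum.
    void DatumExamples() {
        // A 4-byte INT4 value is small enough to be stored inline.
        Datum d_int = Datum::From((int32_t) 42);        // assumed constructor

        // A VARCHAR payload is a plain byte buffer; the Datum only keeps a
        // pointer to it (and may or may not own the buffer).
        const char *s = "hello";
        Datum d_str = Datum::FromVarlenBytes(s, 5);     // assumed constructor

        auto sv  = d_str.GetVarlenAsStringView();   // view over the raw bytes
        auto len = d_str.GetVarlenSize();           // length in bytes: 5
        (void) d_int; (void) sv; (void) len;

        // No copy constructor/assignment: pass by const Datum& or DatumRef.
    }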

On the other hand, Taco-DB relies on the type catalog to interpret the values. The type information is looked up from the catalog table Type. These types are SQL types rather than C++ types, and they can be used to find information about how to manipulate these objects. A few important fields of the Type catalog table (<builddir>/generated_source/catalog/systables/Type.h -- you must have built the code to see this file) are explained below:

A built-in function is a function implemented in C++ that accepts a list of Datum of certain SQL types and returns a Datum of a certain SQL type. These built-in functions are managed by a centralized function catalog and can be looked up/called dynamically through the FunctionCall method. For instance, the typlenfunc for a predefined SQL type is a built-in function that accepts a UINT8 Datum and returns an INT2 length: it calculates the binary length of the given type based on the type parameter. You may find the details of built-in functions in include/base/fmgr.h and include/utils/builtin_funcs.h. As an example, for the fixed-length string type CHAR(n), its typlenfunc returns n, because we always store it as an n-byte object.

Finally, you are ready to understand how the Schema object computes the record layout -- its second main purpose. The record layout is briefly described in the documentation comment block of the class in include/catalog/Schema.h, and the implementation is in Schema::ComputeLayoutImpl() and Schema::WritePayloadToBufferImpl(). You should read the code to figure out the specification of the layout. As an exercise, your task is:

Task 1.3.1: Implement the record field accessor functions:

Your code is likely to pass the tests BasicTestSchema.* at this point.

Hint: you might find the test cases in tests/catalog/BasicTestSchema.cpp and tests/catalog/TestSchema.h useful examples to help you understand the layout implementation.

1.4 Data Page

[Back to Project Overview]

Source files to modify:

How to run parameterized tests individually: The tests for the data page are parameterized (meaning each named test case is actually a suite of test cases instantiated from a template with different parameter values). You won't be able to run each individual test case with ctest; instead, ctest runs a test template for multiple rounds with all the parameter values. To run a test with a specific parameter, e.g., the page initialization test with min_reclen = 8, max_reclen = 8 and usr_data_sz = 0, invoke the test binary directly with a --gtest_filter pattern that selects that instantiation (running the binary with --gtest_list_tests lists the exact parameterized test names).

Brief overview of data pages: Taco-DB manages its data in fixed-size pages (of PAGE_SIZE bytes), where we may store a number of records as well as some additional information needed by the database. Taco-DB supports variable-length data types (e.g., VARCHAR) as well as nullable table fields, and thus it needs the capability of storing variable-length records on the data pages, using a slotted data page design. In this project, a record is simply an array of bytes of a certain length. Note that a record must be aligned to an 8-byte address (you may use MAXALIGN() for that).
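MAXALIGN() rounds a size or offset up to the next multiple of 8. If the codebase's macro matches the usual definition, it is equivalent to the following (an assumption worth verifying against the actual header):

    #include <cstddef>

    // Round a length up to a multiple of 8; presumably what MAXALIGN() does.
    constexpr size_t MaxAlign(size_t n) {
        return (n + 7) & ~((size_t) 7);
    }
    static_assert(MaxAlign(13) == 16, "13 rounds up to 16");
    static_assert(MaxAlign(16) == 16, "aligned values are unchanged");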


Figure 5: Data Page that supports variable-length records

The layout of a data page is shown in Figure 5. The page begins with a 24-byte header VarlenDataPageHeader, which includes the PageHeaderData at offset 0. Following that is an optional fixed-size user data area (you'll need it for the later B-tree project). The user of a VarlenDataPage may request reserving a user data area of a specific size at initialization, and your implementation must make sure the bytes in this area are not modified by the VarlenDataPage. Slots on the page are indexed by SlotId-typed integers, from 1 upwards (with INVALID_SID = 0). The slot array starts at the end of the page and grows backwards. Hence the SlotData of slot sid may be accessed as ((SlotData*)(pagebuf + PAGE_SIZE))[-(int) sid], if char *pagebuf points to the beginning of the page (see the sketch below). The actual record payload for slot sid is stored somewhere between the end of the user data area and the begin offset of the free space. To insert a record, find a free existing slot or create a new slot in the slot array, and find some empty space to copy the record payload into. To erase a record, set its slot to be invalid and possibly remove all the invalid slots at the end of the slot array, if any. You may refer to the class and function documentation in the header file for a more detailed specification.
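The following hedged helper spells out the slot-array addressing from the paragraph above. The SlotData member names (m_off, m_len) are assumptions for illustration; the real ones are in the VarlenDataPage header.

    // Hedged sketch of locating the payload of slot `sid` (1-based).
    char *GetRecordBufferSketch(char *pagebuf, SlotId sid, int *p_len) {
        // The slot array starts at the end of the page and grows backwards:
        // slot 1 occupies the last sizeof(SlotData) bytes of the page, slot 2
        // the chunk just before it, and so on.
        SlotData *slot = &((SlotData *) (pagebuf + PAGE_SIZE))[-(int) sid];
        if (p_len)
            *p_len = slot->m_len;       // assumed: payload length in bytes
        return pagebuf + slot->m_off;   // assumed: payload begin offset
    }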

Task 1.4.1: Implement the page initialization and a few page meta information functions:

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestInitialize/* at this point.

Task 1.4.2: Implement the slot query and insertion functions: VarlenDataPage::InsertRecord(), VarlenDataPage::IsOccupied(), and VarlenDataPage::GetRecordBuffer().

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestInsertRecord/* at this point.

Task 1.4.3: Implement the record erasure function: VarlenDataPage::EraseRecord().

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestEraseRecord/* at this point.

Task 1.4.4: Implement the record update function: VarlenDataPage::UpdateRecord().

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestUpdateRecord/* and */BasicTestVarlenDataPage.TestSidNotValidErrors/* at this point.

As you might have noticed, there could be unoccupied space between offsets m_ph_sz and m_fs_begin (i.e., holes) after an erasure or update. Hence, the data page may have to compact that space by moving the record payloads to be consecutive. During a page compaction, the SlotData entries in the slot array may have to be updated, but the slots themselves may not be moved. In other words, a valid slot remains valid and still refers to the same record; an invalid slot remains invalid; and the total number of slots remains unchanged after a compaction. To avoid trying to compact a page every time an insertion or an out-of-place update happens, you should maintain a bit m_has_hole in the page header to indicate whether it is possible (but not certain) for the occupied space to have holes. In the tests, we assume the data page performs lazy compaction, meaning the compaction process is not triggered until it becomes necessary (see the sketch below). Please refer to the class documentation for more details.
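Here is a minimal sketch of how lazy compaction might gate an insertion, under the invariants just stated (only payloads move; slots keep their SlotId). The helper names (ContiguousFreeSpace(), HasHole(), CompactPage()) are illustrative, not the actual interface.

    // Hedged sketch: compact only when an insertion would otherwise fail
    // and the m_has_hole bit says reclaimable holes may exist.
    bool TryInsertSketch(VarlenDataPage &pg, const char *rec, int reclen) {
        int need = MAXALIGN(reclen);    // record payloads are 8-byte aligned
        if (pg.ContiguousFreeSpace() < need && pg.HasHole()) {
            // Move payloads together; SlotData offsets are updated, but a
            // valid slot keeps its SlotId and invalid slots stay invalid.
            pg.CompactPage();
        }
        if (pg.ContiguousFreeSpace() < need)
            return false;               // truly full: caller tries another page
        // ... claim a slot and copy the record payload ...
        return true;
    }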

Task 1.4.5: Implement the page compaction and update the insertion/update functions accordingly.

You should pass all the tests in BasicTestVarlenDataPage at this point.

You might also have noticed there are four additional functions for maintaining a sorted array of records and computing the available space: VarlenDataPage::InsertRecordAt(), VarlenDataPage::RemoveSlot(), VarlenDataPage::ShiftSlots(), and VarlenDataPage::ComputeFreeSpace(). These are not required in this project and we do not provide tests for them yet. You will need to implement some of them in future projects, and we will provide additional test cases when you are required to do so.

1.5 Heap File (Table interface)

[Back to Project Overview]

Source files to modify:

Tips on the tests:

There are two versions of each test for the heap files: BasicTestTable and SystemTestTable. You might want to debug your implementation using the basic tests first. The code for both is actually the same (see tests/storage/TestTable_common.inc). The difference is that the basic tests use an in-memory catalog implementation, BootstrapCatCache, to create the table descriptors, and thus do not depend on your heap file implementation. The hidden system tests use a heap-file-based catalog implementation, PersistentCatCache, which initializes the catalog using your heap file implementation, and thus are more likely to fail to initialize if your implementation is buggy.

The test BasicTestRepoCompilesAndRuns.TestShouldAlwaysSucceed, which also uses PersistentCatCache, is a good indicator of whether your heap file implementation is buggy. If it passes, your code may or may not pass the system tests; if it fails, your code almost certainly will not pass them.

Brief overview of heap files: a heap file is simply an unordered (multi-)set of records. It is the most basic way of organizing records in a DBMS. In Taco-DB, a heap file is implemented on top of the file manager's virtual files and stores records in some arbitrary order in the list of pages. A table descriptor TabDesc stores essential information about a table, including the file ID of the table. The Table class should open the file through the global file manager g_fileman, make calls to the global buffer manager g_bufman to access the pages, and update or retrieve records on a page using VarlenDataPage. It should support inserting, updating, deleting, and iterating over records. To simplify the implementation, you only need to allocate pages when there is no space for an insertion, and you do not have to reuse free space on previous pages (but you may if you'd like to). In addition, you should free pages in the file manager when they become empty. The Table::Iterator class provides a forward-only iterator over all the records, and it should keep at most one page pinned at a time.

Note: in a more realistic DBMS, freeing pages this way may cause the issue of deallocating a page that is actively being scanned (possibly in the same transaction/thread, such as by an UPDATE statement). You do not have to worry about this tricky case in this project, because we ensure that either there are no concurrent scans over the file when a deletion or update happens, or the concurrent scan only causes in-place updates of records (which can never result in empty pages).

You might want to pay attention to the specifications in the header file in order to pass all the tests, especially the system tests.
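To tie the layers together before you dive into the header, here is one possible (hedged) outline of an insertion path; every call name below is an assumption for illustration, and the specifications in include/storage/Table.h take precedence.

    // Hedged outline of Table::InsertRecord(): try the last page first and
    // extend the file only when that page is full.
    void InsertRecordSketch(File *file, const char *rec, int reclen) {
        char *frame;
        PageNumber pid = file->GetLastPageNumber();        // assumed accessor
        BufferId bufid = g_bufman->PinPage(pid, &frame);
        VarlenDataPage pg(frame);
        if (pg.InsertRecord(rec, reclen) == INVALID_SID) { // assumed: page full
            g_bufman->UnPinPage(bufid);
            pid = file->AllocatePage();                    // assumed: extends file
            bufid = g_bufman->PinPage(pid, &frame);
            VarlenDataPage::InitializePage(frame);         // assumed initializer
            VarlenDataPage newpg(frame);
            newpg.InsertRecord(rec, reclen);
        }
        g_bufman->MarkDirty(bufid);                        // assumed
        g_bufman->UnPinPage(bufid);
    }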

Before you start working on the code, please answer the following questions by reading the specifications in include/storage/Table.h.

Task 1.5.6: Implement the table initialization and construction functions:

Your code is likely to pass BasicTestTable.TestEmptyTable at this point.

Task 1.5.7: Implement Table::InsertRecord(), Table::StartScan(), Table::StartScanBefore(), and the Table::Iterator class. You may change, delete, or add to the existing members of Table and Table::Iterator if necessary, as well as the signatures of any private functions -- they only exist as a suggestion.

Your code is likely to pass BasicTestTable.TestInsertRecord and BasicTestTable.TestStartScanBefore at this point.

Task 1.5.8: Implement Table::EraseRecord().

Your code is likely to pass BasicTestTable.TestEraseRecord at this point.

Task 1.5.9: Implement Table::UpdateRecord().

Your code should pass all the tests in BasicTestTable and SystemTestTable at this point.