CSE 462/562: Database Systems (Spring 2024)

Project 2: Buffer Manager and Data Layout

Released: Tuesday, 2/7/2024
Project due: Sunday, 2/25/2024, 23:59 PM EST
Last updated: 2/17/2024

0. Getting Started

This project will be built upon your code for project 1 (FSFile). If you need the solution code for project 1, please download it from Autolab under Project 2, attachments (link). To apply the solution to your code base, upload the tarball to minsky (if you're using the student server), extract it under your repository root, and invoke ./import_supplemental_files.sh. To get started with project 2, extract /local/Spring_2024/cse562/lab2.tar.xz on the student servers and import the supplemental files using ./import_supplemental_files.sh. The code should build without compilation errors once the supplemental files are imported, but most of the additional tests are likely to fail. In addition, existing test cases may also fail due to the introduction of the additional system components; they should pass once you correctly implement the components in project 2. You may list all the tests in your build directory using the ctest -N command, or view them in the Run and Debug window in VSCode. As before, you may attempt any number of code submissions on Autolab, project 2 (the same hourly submission rate limit applies as before). To do so, please tag the commit you want to submit, and provide the tag in Autolab.

Grading policy: Your grade will be based on the testing result of your code. There is a hidden test suite in project 2 that will not be run on Autolab, but only during the offline tests. Once you submit your code on Autolab, it will run the tests on its VM and give you a list of scores for individual test cases. As some of the tests may time out on the VM due to its unstable performance, we will run an offline test on your final submission after the deadline. Your final score for each test case will be the higher one between the last submission on Autolab and the offline test. For example, if your code passes all the tests except BasicTestEvictionPolicy.TestMemoryLimit on Autolab (which may randomly fail due to timeout from time to time) but passes it in the offline test, you will still get the full score for that test case.

A few useful hints:

  1. We strongly encourage you to read through the entire document for the high-level design of the storage layer in Taco-DB before you start writing code. The project requires you to write a fair amount of code, so you might want to start early.
  2. The description below is meant to be a brief overview of the tasks. Always refer to the special documentation blocks in the headers and source files for a more detailed specification of how the functions should behave.
  3. You may add any definitions/declarations to the classes or functions you're implementing, but do not change the signatures of the public functions. Also do not edit source files that are not listed under the "source files to modify" headings below, as they will be replaced during tests.
  4. You may ignore any function/class that's related to multi-threading and concurrency control since we are building a single-threaded DBMS. Your implementation doesn't have to be thread safe.
  5. We provide a list of tasks throughout this document, which is a suggestion of the order of implementation such that your code may pass a few more tests after you complete each task. It is not the only possible order, however, and you may find an alternative order that makes more sense to you. It is also possible that your code still cannot pass certain tests that it is supposed to, because of some unexpected dependencies on prior/later tasks, in which case you might want to refer to the test code implementation in the tests/ directory.
  6. If you're working in a team of two, you might want to first go through the list of tasks, agree on a big picture of the design, and divide the implementation into independent pieces. One strategy is to implement tasks that may be tested independently. For instance, the buffer eviction policy and the buffer pin logic can be implemented independently as long as you both agree on the same data layouts. Alternatively, you could also do pair programming. You may also come up with better ways of collaboration that make more sense to you. In any case, the grade for the code implementation will be based on the test result of your group's final submission.

1. Project overview

Taco-DB storage layer overview

Figure 1: Taco-DB Storage Layer Overview

Figure 1 shows the storage layer of Taco-DB. We have provided the implementation of the file manager, which manages disk space in a set of data files on the file system. You don't have to modify any code of the file manager, but you should read Section 1.1 in order to know the APIs. The goal of project 2 is to implement the buffer manager (see Section 1.2) and the (slotted) data page (Section 1.4). There is also a reading exercise to help you better understand record layout in a DBMS: Section 1.3 covers the record layout and the Schema object implementation in Taco-DB (no submission needed).

1.1. File Manager

[Back to Project Overview]

Source files to modify: none.

Source files to read:

You don't have to implement anything in the file manager. You only need to skim through the file manager API at this stage. You may always revisit this section and the implementation when it's needed.

file manager

Figure 2: File Manager

Taco-DB manages its data in a flat main data storage space organized as fixed-size pages (see Figure 2), through the FileManager interface defined in include/storage/FileManager.h. The size of each page is fixed to PAGE_SIZE = 4096 bytes. Page numbers are represented by the unsigned 32-bit integer type PageNumber, where page number 0 (INVALID_PID) is an invalid page number reserved for file manager use. Recall that you have implemented an interface FSFile for file I/O over the file system. The file manager breaks up the entire space into up to 256 files stored on your file system (usually in a tmpd.XXXXXX/main when you're running tests, or /local/Spring_2024/your-ubit-name/cse562_tmps/tmpd.XXXXXX/main if you're running the tests on minsky). There is no need to worry about every detail of the file manager, but it is useful to understand how the file manager manages pages, for debugging purposes.

Note: the tmpd.XXXX directory is removed upon normal exit of a test case so that your quota on the student servers will not be depleted very quickly. However, it may remain if the program exits abnormally (e.g., SEGFAULT, stopped during debugging in VSCode/GDB, ...). The build script will also remove data directories created more than 1 day ago when you build your code. Starting from now, all the temporary data will be automatically placed into a local directory /local/Spring_2024/your-ubit-name/cse562_tmps, if you're using minsky, so that you will less likely run out of the quota in your home directory.

There is a single global instance of file manager g_fileman declared in include/dbmain/Database.h.

Note: the FileManager also manages the temporary data files, but each temporary file has its own page number space that is independent of the main data storage space. They are not used in this project and we will cover the details in later projects that use temporary files. Within the context of this project, a file page refers to a page in the main data storage space.

file manager managed (virtual) file

Figure 3: File manager managed (virtual) file

The file manager further divides the entire main data storage space into (virtual) files (see Figure 3). Each file is simply a linked list of pages in the main data storage space, which may be created and opened through the file manager. An open virtual file is returned as an object of the File class, through which you may find the first page, find the last page, or extend the file. Each data page has a fixed-size header PageHeaderData at the beginning that is managed by the file manager and stores the next and previous page numbers along with some other information internal to the file manager. When you access a page in Taco-DB, you must not modify the page header.
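The linked-list structure of a virtual file can be illustrated with a toy model. Everything below is illustrative only: the map stands in for the next-page links stored in PageHeaderData, and WalkFile is a hypothetical helper, not part of the Taco-DB File API.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy model of a file-manager virtual file: each page's header stores the
// next page number, with 0 (INVALID_PID) terminating the list.
using PageNumber = uint32_t;
constexpr PageNumber INVALID_PID = 0;

// Collects the page numbers of a virtual file by following the next links,
// starting from the first page.
std::vector<PageNumber> WalkFile(
        const std::unordered_map<PageNumber, PageNumber> &next,
        PageNumber first) {
    std::vector<PageNumber> pages;
    for (PageNumber pid = first; pid != INVALID_PID; pid = next.at(pid))
        pages.push_back(pid);
    return pages;
}
```

Note that page numbers along the chain need not be consecutive: the pages of one virtual file may be scattered anywhere in the main data storage space.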

1.2. Buffer Manager

[Back to Project Overview]

Source files to modify:

While one can read/write pages directly through FileManager::ReadPage() and FileManager::WritePage() on demand, doing so is not very efficient, as each read/write incurs a disk I/O. As we covered in the lecture, a DBMS utilizes a page buffer manager to keep a set of pages in memory in order to reduce I/O requests and improve data access throughput and latency. In Taco-DB, the BufferManager is built on top of the FileManager. It handles all the page access requests from other system components and maintains a buffer of disk pages in a fixed number of buffer frames, issuing I/O calls on the global file manager instance g_fileman.

There is a single global instance of buffer manager g_bufman declared in include/dbmain/Database.h except in buffer manager tests.

buffer manager

Figure 4: Buffer Manager PinPage() call

Figure 4 is a sketch of how BufferManager works. It allocates a number of consecutive page-sized frames in memory, indexed by an unsigned integer type BufferId from 0, with INVALID_BUFID defined as std::numeric_limits<BufferId>::max(). The buffer frames must be aligned to 512-byte boundaries (using aligned_alloc()), because the FileManager uses O_DIRECT I/Os over the file system. For each buffer frame, the buffer manager maintains some meta information in a separate structure, including the current pin count, a dirty bit, the page number it is currently holding, as well as any information needed by the eviction policy. In addition, the buffer manager should maintain a hash table that maps page numbers to buffer IDs to speed up lookups.

Task 1: Add the necessary class members for the data structures described above to BufferManager and implement:

Your code should still pass BasicTestBufferManager.TestNormalInitAndDestroy at this point (it passes before you modify any source code).

Upon a PinPage() call, the buffer manager should look up the page number in the hash table. If it is found in some buffer frame, it increments the pin count and returns the frame's buffer ID. Otherwise, the buffer manager selects an unpinned buffer frame using some eviction policy (e.g., LRU, MRU, Clock, etc., but not a random eviction policy!) and writes back the page currently in the selected frame if it holds a dirty page. Once the frame is clean, you may read the requested page into the selected buffer frame. It is then returned to the caller with its pin count incremented. Note that a buffer frame may have a pin count greater than 1 (think: why and when?). An UnpinPage() call simply decrements the specified buffer frame's pin count.
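The pin/unpin bookkeeping above can be sketched with a simplified standalone model. This is not the actual BufferManager interface: MiniBufferPool and Frame are made-up names, and eviction, disk I/O, and the dirty-page write-back are all omitted here.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>
#include <vector>

using PageNumber = uint32_t;   // 0 models INVALID_PID (no page held)
using BufferId = size_t;

struct Frame {
    PageNumber pid = 0;   // page currently held in this frame
    int pin_count = 0;
    bool dirty = false;
};

class MiniBufferPool {
public:
    explicit MiniBufferPool(size_t nframes) : frames_(nframes) {}

    // Returns the buffer id holding `pid`, "reading" the page into a free
    // frame on a miss. A second pin of a cached page only bumps the count.
    BufferId PinPage(PageNumber pid) {
        auto it = lookup_.find(pid);
        if (it != lookup_.end()) {          // hit: increment pin count
            frames_[it->second].pin_count++;
            return it->second;
        }
        for (BufferId i = 0; i < frames_.size(); ++i) {
            if (frames_[i].pid == 0) {      // free frame found
                frames_[i] = Frame{pid, 1, false};
                lookup_[pid] = i;
                return i;
            }
        }
        throw std::runtime_error("out of frames (eviction not sketched)");
    }

    void UnpinPage(BufferId bufid) {
        assert(frames_[bufid].pin_count > 0);
        frames_[bufid].pin_count--;
    }

    int PinCount(BufferId bufid) const { return frames_[bufid].pin_count; }

private:
    std::vector<Frame> frames_;
    std::unordered_map<PageNumber, BufferId> lookup_;
};
```

A frame's pin count exceeds 1 exactly when the same page is pinned again before being unpinned, which is why the hash-table hit path must not allocate a new frame.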

Task 2: Implement BufferManager::PinPage() and BufferManager::UnpinPage() without a working eviction policy.

Your code is likely to pass the following tests at this point:

If there are a sufficient number of pin page requests over distinct pages, we will eventually run out of free frames. In that case, the current PinPage() implementation will not work unless we implement an eviction policy for choosing a victim page to evict from the buffer pool. We have discussed several basic eviction policies (LRU, MRU, Clock) in the lecture; you may choose any of them to implement.
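As one concrete option, the clock (second-chance) policy can be sketched as below. The ClockState struct and ClockEvict function are illustrative names standing in for whatever per-frame metadata you add to BufferManager; they are not part of the provided code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-frame state for the clock policy: `pinned` models pin_count > 0,
// `ref` is the reference bit set on each page access.
struct ClockState {
    std::vector<bool> pinned;
    std::vector<bool> ref;
    size_t hand = 0;   // current clock-hand position
};

// Returns a victim frame id, or SIZE_MAX if every frame is pinned.
// Two full sweeps suffice: the first clears reference bits, so the
// second is guaranteed to find an unpinned frame if one exists.
size_t ClockEvict(ClockState &s) {
    size_t n = s.pinned.size();
    for (size_t step = 0; step < 2 * n; ++step) {
        size_t i = s.hand;
        s.hand = (s.hand + 1) % n;
        if (s.pinned[i])
            continue;            // pinned frames are never victims
        if (s.ref[i]) {
            s.ref[i] = false;    // second chance: clear bit, move on
            continue;
        }
        return i;                // unpinned and not recently referenced
    }
    return SIZE_MAX;             // all frames pinned
}
```

The bounded sweep also gives you the error condition for free: if ClockEvict reports no victim, every frame is pinned and PinPage() must fail.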

Task 3: Implement a reasonably good eviction policy of your choice (e.g., LRU, MRU, Clock, etc.).

Your code is likely to pass the following tests at this point. Note that these tests do not test the efficiency of your eviction policy yet.

The buffer manager in Taco-DB does not flush pages back unless a page is evicted or the database shuts down, as opposed to a more sophisticated DBMS where the buffer pool may flush pages from time to time for performance and recovery considerations. In Taco-DB, there is a BufferManager::Flush() function that flushes all the dirty pages back to disk. It should only be called during database shutdown from BufferManager::Destroy(). When that happens, all the buffer frames (whether clean or dirty) should have zero pin counts. You should throw a fatal error if some pin count is not zero, because a non-zero pin count indicates bugs in other components.

Task 4: Implement BufferManager::Flush() and add a call to it in BufferManager::Destroy().

Your code should pass all the tests in BasicTestBufferManager and BasicTestEvictionPolicy at this point. The two tests in BasicTestEvictionPolicy check whether your eviction policy is working and reasonably good (i.e., not random eviction). They take a bit longer to run (tens of seconds in a debug build), so their timeouts are set to higher values on Autolab and in the offline tests. If they run for more than 1 minute in the debug build, there is a high chance that your implementation is not efficient enough to pass the tests.

1.3 Schema and Record Layout

[Back to Project Overview]

Source files to modify: none.

Source files to read:

In this section, we will explore how fields are laid out given a particular schema in Taco-DB. This is an implementation of one possible record layout discussed in the lectures. You do not have to modify the code, but you will need to read it to answer the questions in your project write-up.

Before we talk about the Schema objects and record layout, let's have a brief overview of data types and data representation in Taco-DB. You might be quite familiar with data types in common programming languages (e.g., int, float, std::string, ... in C++) -- a data type usually defines a set of possible values, a set of allowed operations, as well as the representation of these values in terms of low-level primitives such as bits, bytes, or words. Since we are implementing an RDBMS, the system has a type system of its own (SQL's) that is distinct from that of the language it is implemented in (i.e., C++) -- an important distinction to keep in mind.

Each SQL type (e.g., INT4, FLOAT, VARCHAR(256)) has its own valid set of values and permitted operations. A datum of a SQL type is always stored as a fixed-length or variable-length object (a sequence of bytes in memory or on disk) in Taco-DB without the type information attached. For instance, an INT4 is stored as a 4-byte object with its initial address aligned (cf. Data Alignment in Records of Lecture 4) to 4-byte memory address/file offset boundaries. There is a set of functions and operators that can be applied to INT4, mostly defined in src/utils/typsupp/int4.cpp. It may be conveniently stored in an int32_t in C++ to simplify the implementation. Another example is VARCHAR(n), a variable-length type, which is stored as an object whose length equals that of the string itself (up to n characters) with no special alignment (i.e., alignment = 1). Most of its functions and operators are defined in src/utils/typsupp/varchar.cpp. In C++, we pass around a pointer (char*) to the beginning of a VARCHAR(n) object when it's in memory. You may wonder how we can determine the length of a VARCHAR string since we store neither a length with the object nor a null-terminator ('\0' as in a C string). The short answer is that the length is encoded in the record layout, and you will find out how once you finish reading this section.

Because the database has to operate on data with dynamic types specified during runtime, the system must know the type information of a datum during run time in order to operate on it. Each type has its own requirements and allowed operators and functions stored in a database catalog (a set of special tables internal to the system), and we use a unique Type ID to denote each type. Whenever we process a table in a database, we obtain a Schema object, which stores the type IDs for each field. That is one of the two main purposes of the Schema objects -- to keep track of the types of data.

Since different data types have different representations and types are dynamic during run time, we use a Datum object (see include/base/datum.h) to store an object of any type in memory (as a sequence of bytes). You may think of it as a union of values of different types (e.g., integers, floats, strings), but internally it has different ways of storing data. In general, a fixed-length object of length 1, 2, 4 or 8 bytes is stored inline in a Datum object (e.g., int32_t, int64_t), while a variable-length object (e.g., strings) or a fixed-length object with another number of bytes is stored in a buffer and Datum keeps a pointer (char*, which can be cast to any other pointer to a specific type structure). Depending on where this buffer is located (on a buffer page or in some malloc-allocated memory), the buffer may or may not be owned by the Datum. Note that Datum does not have a copy constructor or a copy assignment operator because of that (more on this in project 5). To reference a Datum, you may either use const Datum& or the DatumRef class. When an object is stored on a data page, it is always stored as a flat array of bytes without pointers or references. Note that a fixed-length object may be stored as a variable-length datum (e.g., CHAR(5)), which should be accessed using Datum::GetVarlenBytes(), Datum::GetVarlenSize() or Datum::GetVarlenAsStringView() rather than Datum::GetFixedlenBytes().

On the other hand, Taco-DB relies on the type catalog information to interpret these values. The type information is looked up from the catalog table Type. These types are SQL types rather than C++ types, and they can be used to find information about how to manipulate these objects. There are a few important fields of the Type catalog table (<builddir>/generated_source/catalog/systables/Type.h -- you must have built the code to see this file) explained below:

A built-in function is some function implemented in C++ that accepts a list of Datum of certain SQL types and returns a Datum of a certain SQL type. These built-in functions are managed by a centralized function catalog, and can be looked up/called dynamically through the FunctionCall method. For instance, the typlenfunc for a predefined SQL type is a built-in function that accepts a UINT8 Datum and returns an INT2 length. It calculates the binary length of the given type based on the type parameter. You may find the details of built-in functions in include/base/fmgr.h and include/utils/builtin_funcs.h. As an example, for the fixed-length string CHAR(n), its corresponding typlenfunc will return n because we always store it as an n-byte object.

Finally, you are ready to understand how the Schema object computes the record layout -- its second main purpose. The record layout is briefly described in the documentation comment block of the class in include/catalog/Schema.h, and the implementation is in Schema::ComputeLayoutImpl(), Schema::WritePayloadToBufferImpl(), and Schema::GetOffsetAndLength(). We will leave this as an (ungraded) exercise to you. No submission is needed for these questions.

Hint: if a non-negative integer x is of some integral type T (e.g., T = int) and d is a non-negative integer, then the following are always true as long as the result does not overflow:

  1. x << d == x * (T) pow(2,d)
  2. x >> d == x / (T) pow(2,d)
  3. x & ((1 << d) - 1) == x % (1 << d)
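These shift/mask identities are worth internalizing before reading the layout code, since alignment arithmetic is commonly written with them. A quick concrete check (illustrative only, not project code):

```cpp
#include <cstdint>

// Verifies the three identities from the hint for T = uint32_t.
// pow(2, d) is written as (1u << d) to stay in integer arithmetic.
bool CheckIdentities(uint32_t x, unsigned d) {
    uint32_t p = 1u << d;                 // 2^d
    return (x << d) == x * p              // left shift multiplies by 2^d
        && (x >> d) == x / p              // right shift divides by 2^d
        && (x & (p - 1)) == x % p;        // low-bit mask is mod 2^d
}
```

For example, with x = 13 and d = 3: 13 << 3 == 104 == 13 * 8, 13 >> 3 == 1 == 13 / 8, and 13 & 7 == 5 == 13 % 8.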

Question 1: How would you get a field as a Datum from a record in memory, given a pointer char *rec to it, its corresponding schema object Schema *sch and its field ID FieldId fid? Write down a single C++ expression for this question.

Question 2: Assuming the Datum you get in Q1 has the type INT4, how would you get its value as an int32_t in C++? Write down a single C++ expression for this question using rec, sch and fid from Q1. (Note: this will be useful for debugging code in later projects.)

Consider the following table schema defined in SQL for questions 3 - 9.


  CREATE TABLE A(
    f1 CHAR(2),
    f2 VARCHAR(100) NOT NULL,
    f3 CHAR(5) NOT NULL,
    f4 INT2 NOT NULL,
    f5 VARCHAR(2),
    f6 INT4
  );
				

Question 3: For a record in table A that does not have any NULL fields, what is the order of all the fields sorted by their offsets?

For questions 4 - 9, let t1 be a record in the table with the following values: (f1 = 'xy', f2 = '1', f3 = 'abcde', f4 = 78, f5 = NULL, f6 = 201)

Question 4: What is the offset of field f1 in record t1? Is it always at the same offset for any record in table A?

Question 5: What is the offset of field f2 in record t1? Is it always at the same offset for any record in table A?

Question 6: What is the offset of field f3 in record t1? Is it always at the same offset for any record in table A?

Question 7: What is the offset of field f4 in record t1? Is it always at the same offset for any record in table A?

Question 8: What is the offset of field f6 in record t1? Is it always at the same offset for any record in table A?

Question 9: What is the total length of record t1? Is the total length always the same as the offset of the end of the last field (i.e., the offset of the last field + the length of the last field) you listed in Q3, for any record in table A?

1.4 Data Page

[Back to Project Overview]

Source files to modify:

How to run parameterized tests individually: The tests for the data page are parameterized (meaning each named test case is actually a suite of test cases instantiated from a template with different parameter values). You won't be able to run each individual test case when running them with ctest; instead, ctest will run a test template for multiple rounds with all the parameter values. To run a test with a specific parameter, e.g., the page initialization test with min_reclen = 8, max_reclen = 8 and usr_data_sz = 0,

Brief overview of data pages: Taco-DB manages its data in fixed-size pages (of PAGE_SIZE bytes), where we may store a number of records as well as some additional information needed by the database. Taco-DB supports variable-length data types (e.g., VARCHAR) as well as nullable table fields, and thus it needs the capability of storing variable-length records on data pages using a slotted data page design. In this project, a record is simply an array of bytes of a certain length. Note that a record must be aligned to an 8-byte address (you may use MAXALIGN() for that).

data page that supports variable-length records

Figure 5: Data Page that supports variable-length records

The layout of a data page is shown in Figure 5. The page begins with a 24-byte header VarlenDataPageHeader, which includes the PageHeaderData at offset 0. Following that is an optional fixed-size user data area (you'll need it for the later B-tree project). Slots on the page are indexed by SlotId-typed integers, from 1 upwards (with INVALID_SID = 0). The slot array starts at the end of the page and grows backwards. Hence the SlotData of slot sid may be accessed as ((SlotData*)(pagebuf + PAGE_SIZE))[-(int) sid], if char *pagebuf points to the beginning of the page. The actual record payload for slot sid is stored somewhere between the end of the user data area and the begin offset of the free space. To insert a record, find a free existing slot or create a new slot in the slot array, and find some empty space to copy the record payload into. To erase a record, set its slot to be invalid and possibly remove all the invalid slots at the end of the slot array, if any. You may refer to the class and function documentation in the header file for a more detailed specification.
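The negative-index expression above can be sanity-checked with a simplified two-field SlotData; the real struct is defined in the headers and may differ. MaxAlign below mirrors what MAXALIGN() presumably does (round a length up to the next 8-byte boundary), which is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t PAGE_SIZE = 4096;
using SlotId = uint16_t;

// Simplified slot: where the record payload starts and how long it is.
struct SlotData {
    uint16_t off;   // offset of the record payload on the page
    uint16_t len;   // length of the record payload
};

// Slot `sid` (1-based) lives at the end of the page; the slot array
// grows backwards, hence the negative index.
SlotData *GetSlot(char *pagebuf, SlotId sid) {
    return &((SlotData *)(pagebuf + PAGE_SIZE))[-(int) sid];
}

// Rounds n up to an 8-byte boundary, as record payloads must be
// 8-byte aligned on the page.
constexpr size_t MaxAlign(size_t n) { return (n + 7) & ~(size_t) 7; }
```

So slot 1 occupies the last sizeof(SlotData) bytes of the page, slot 2 the sizeof(SlotData) bytes before that, and so on, while record payloads grow forward from the other end.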

Task 1: Implement the page initialization and a few page meta information functions:

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestInitialize/* at this point.

Task 2: Implement the slot query and insertion functions: VarlenDataPage::InsertRecord(), VarlenDataPage::IsOccupied(), and VarlenDataPage::GetRecordBuffer().

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestInsertRecord/* at this point.

Task 3: Implement the record erasure function: VarlenDataPage::EraseRecord().

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestEraseRecord/* at this point.

Task 4: Implement the record update function: VarlenDataPage::UpdateRecord().

Your code is likely to pass the tests */BasicTestVarlenDataPage.TestUpdateRecord/* and */BasicTestVarlenDataPage.TestSidNotValidErrors/* at this point.

As you might have noticed, there could be unoccupied space between offsets m_ph_sz and m_fs_begin (i.e., holes) after an erasure or update. Hence, the data page may have to compact that space by moving the record payloads so that they are consecutive. During a page compaction, the SlotData entries in the slot array may have to be updated, but the slots themselves may not be moved. In other words, a valid slot remains valid and still refers to the same record; an invalid slot remains invalid; and the total number of slots remains unchanged after a compaction. To avoid trying to compact a page every time an insertion or an out-of-place update happens, you should maintain a bit m_has_hole in the page header to indicate whether the occupied space may (not must) contain holes. In the tests, we assume the data page performs lazy compaction, meaning the compaction process is not triggered until it becomes necessary. Please refer to the class documentation for more details.
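The invariant above (valid slots keep their indices, only their offsets change) can be modeled with a simplified compaction pass. This sketch is not the Taco-DB implementation: Slot and Compact are made-up names, len == 0 stands in for an invalid slot, and the 8-byte payload alignment the real page must preserve is omitted for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

struct Slot { size_t off; size_t len; };   // len == 0 marks an invalid slot

// Packs all valid payloads in `buf` to the front and returns the new
// begin-of-free-space offset. Slot indices never change.
size_t Compact(char *buf, std::vector<Slot> &slots) {
    // Process valid slots in increasing payload offset, so each memmove
    // never overwrites a payload that has not been copied yet.
    std::vector<size_t> order;
    for (size_t i = 0; i < slots.size(); ++i)
        if (slots[i].len > 0) order.push_back(i);
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return slots[a].off < slots[b].off; });
    size_t fs = 0;   // next free offset after packing
    for (size_t i : order) {
        std::memmove(buf + fs, buf + slots[i].off, slots[i].len);
        slots[i].off = fs;       // slot index unchanged, offset updated
        fs += slots[i].len;
    }
    return fs;
}
```

Because only offsets are rewritten, any record ID that embeds a slot number stays valid across a compaction, which is exactly why the tests require the slots themselves not to move.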

Task 5: Implement the page compaction and update the page update functions accordingly.

You should pass all the tests in BasicTestVarlenDataPage at this point.

Optional: there are four functions for maintaining a sorted array of records and computing the available space: VarlenDataPage::InsertRecordAt(), VarlenDataPage::RemoveSlot(), VarlenDataPage::ShiftSlots(), and VarlenDataPage::ComputeFreeSpace(). These are not required in this project and we do not provide tests for them yet. You'll need to implement them in the next project, where you will implement B-trees.

1.5 Heap File (Table interface)

[Back to Project Overview]

Source files to modify:

Tips on the tests:

There are two versions of each test for the heap files: BasicTestTable and SystemTestTable. You might want to debug your implementation using the basic tests first. The code for both tests is actually the same (see tests/storage/TestTable_common.inc). The difference is that the basic test uses an in-memory catalog implementation, BootstrapCatCache, to create the table descriptors in tests, and thus does not depend on your heap file implementation. The hidden system test uses a heap-file-based catalog implementation, PersistentCatCache, which initializes a catalog using your heap file implementation, and thus is more likely to fail to initialize if your implementation is buggy.

The test BasicTestRepoCompilesAndRuns.TestShouldAlwaysSucceed, which also uses PersistentCatCache, is a good indicator for whether your heap file implementation is buggy. If it passes, your code may or may not pass the system test. However, it probably will not pass the system tests if this test fails.

Brief overview of heap file: A heap file is simply an unordered (multi-)set of records. It is the most basic way of organizing records in a DBMS. In Taco-DB, a heap file is implemented on top of the file-manager-managed virtual files and stores records in some arbitrary order in the list of pages. A table descriptor TabDesc stores essential information about a table, which includes the file ID of the table. The Table class should open the file through the global file manager g_fileman, make calls to the global buffer manager g_bufman to access the pages, and update or retrieve records on the pages using VarlenDataPage. It should support inserting, updating, deleting, and iterating over records. To simplify the implementation, you only need to allocate pages when there is no space for insertion, and you do not have to reuse free space on previous pages (but you may if you'd like to). In addition, you should free pages in the file manager when they become empty. The Table::Iterator class provides a forward-only iterator over all the records, and it should keep at most one page pinned at a time.

Note: in a more realistic DBMS, this may cause an issue of trying to deallocate a page that is actively being scanned (possibly in the same transaction/thread, such as in an UPDATE statement). You do not have to worry about this tricky case in this project because we ensure that either there will be no concurrent scans over the file when a deletion or update happens, or the concurrent scan may only cause in-place updates of records (which can never result in empty pages).

You might want to pay attention to the specifications in the header file in order to pass all the tests, especially the system tests.

Before you start working on the code, please answer the following questions by reading the specifications in include/storage/Table.h.

Task 6: Implement the table initialization and construction functions:

Your code is likely to pass BasicTestTable.TestEmptyTable at this point.

Task 7: Implement Table::InsertRecord(), Table::StartScan(), Table::StartScanBefore() and the Table::Iterator class. You may change, delete, or add to the existing members in Table and Table::Iterator if necessary, as well as the signatures of any private functions -- they only exist as a suggestion.

Your code is likely to pass BasicTestTable.TestInsertRecord and BasicTestTable.TestStartScanBefore at this point.

Task 8: Implement Table::EraseRecord().

Your code is likely to pass BasicTestTable.TestEraseRecord at this point.

Task 9: Implement Table::UpdateRecord().

Your code should pass all the tests in BasicTestTable and SystemTestTable at this point.