Release date: Thursday, 2/16/2023
Project due: Tuesday, 2/28/2023, 01:00 am EST
Write-up due:Thursday, 3/2/2023, 01:00 am EST
Last updated: 2/16/2023
This project will be built upon your code for project 2. If
you need the solution code for project 2, please download it
from UBBox (link released on Piazza, under Homework Solutions). To get started,
extract /local/Spring_2023/cse562/lab3.tar.xz
on
the student servers and import the supplemental files as your
did in previous projects. The code should build without
compilation errors once the supplemental files are imported,
but most of the additional tests and
BasicTestRepoCompilesAndRuns.TestShouldAlwaysSucceed
are likely to fail. You may list all the tests in your build
directory using the ctest -N
command. As before,
you may attempt any number of code submissions at Autolab,
project 3 (the same hourly submission rate limit applies as
before). To do so, please tag or create a new branch of the
commit you want to submit, and provide the tag or branch name
in Autolab.
Grading policy: Your grade will be based on the testing result of your code, and an independently completed write-up (see write-up section for more details). There is a hidden test suite in project 3 that is only tested on Autolab and during the offline tests (read Section 1.2 for details). Once you submit your code on Autolab, it will run the tests on its VM and give you a list of scores for individual test cases. Similar to project 2, we will also run an offline test on your last submission after the project due and take the higher score for each test case.
A few useful hints:
In this project, you will continue on the rest of the storage layer of Taco-DB (see Figure 1). More specifically, the goal is to implement the data page and the heap file. There are a few questions to guide you through the project You need to answer them in your project write-up.
Source files to modify:
How to run parameterized tests individually: The tests
for the data page are parameterized (meaning each named test case
is actually a suite of test cases instantiated from a template with
different parameter values).
You won't be able to run each individual test case if running
them with ctest. Instead, ctest
will run a test
template for multiple rounds with all the parameter values.
To run a test with a specific
parameter, e.g., the page initialization test with
min_reclen = 8
, max_reclen = 8
and
usr_data_sz = 0
,
./tests/storage/BasicTestVarlenDataPage
--gtest_filter='*TestInitialize/8_8_0'
.
If you are running the test case using gdb
in
command line, start it as usual: gdb
./tests/storage/BasicTestVarlenDataPage
. Then enter
r --gtest_filter='*TestInitialize/8_8_0'
in GDB to
start the program with a specific test case.Run and Debug
(Ctrl + Shift + d
). Locate the test suite
(e.g., */BasicTestVarlenDataPage.TestInitialize/*
)
from the drop down menu and click on the settings
button right next to it. This will open .vscode/launch.json
in the text editor and the cursor will be located at the entry for
that specific test case. Update the first argument in the
args
field with the specific parameter that you
want to run (e.g., */BasicTestVarlenDataPage.TestInitialize/8_8_0
).
Brief overview of data pages:
Taco-DB manages its data in the fixed-size pages (of PAGE_SIZE
bytes), where we may
store a number of records as well as some additional
information needed by the database. Taco-DB supports
variable-length data types (e.g., VARCHAR), as well as nullable
table fields, and thus it needs to have the capability of
storing variable-length records on the data pages using a
slotted data page design. In this project, a
record is simply an array of bytes of certain
length. Note that a record must be aligned to
some 8-byte address (you may use MAXALIGN()
for
that).
The layout of a data page is shown in Figure 5. The page begins
with a 24-byte header VarlenDataPageHeader
, which
includes the PageHeaderData
at offset 0.
Following that is an optional fixed-size user data area (you'll
need it for the later B-tree project). Slots on the page
are indexed by SlotId
typed integers, from 1 upwards (with
INVALID_SID=0
). The slot array starts at the end
of the page and grows backwards. Hence the
SlotData
of slot sid
may be accessed
as ((SlotData*)(pagebuf + PAGE_SIZE))[-(int) sid]
,
if char *pagebuf
points to the beginning of the
page. The actual record payload for slot sid
is
stored somewhere between the end of the user data area and the
begin offset of the free space. To insert a record, find a free
existing slot or create a new slot in the slot array, and find
some empty space to copy the record payload. To erase a record,
set its slot to be invalid and possibly remove all the invalid
slot at the end of the slot array if any. You may refer to the
class and function documentation in the header file for a more
detailed specification.
Before you get started with coding, please answer the following
questions by reading the function specifications in
include/storage/VarlenDataPage.h
:
Question 1: Let's assume there is a pinned buffer frame
pagebuf
and we have invoked
VarlenDataPage::InitializePage(pagebuf)
. What
are the values of the following C++ expressions: (1)
((VarlenDataPageHeader *) pagebuf)->m_ph_sz
;
(2) ((VarlenDataPageHeader *) pagebuf)-> m_nslots
?
Question 2: Suppose we create an instance of the
data page as VarlenDataPage dp1(pagebuf)
and call
dp1.InsertRecord()
with some parameter. What are
the values of the following C++ expressions: (1)
((VarlenDataPageHeader *) pagebuf)->m_ph_sz
;
(2) ((VarlenDataPageHeader *) pagebuf)-> m_nslots
;
(3) dp1.GetMaxSlotId()
?
Question 3: Suppose we mark the buffer frame as dirty
and unpin the buffer frame. Then pin it again in a buffer frame
pagebuf2
. What are the values of the following C++
expressions: (1) ((VarlenDataPageHeader *)
pagebuf2)->m_ph_sz
; (2) ((VarlenDataPageHeader
*) pagebuf2)-> m_nslots
?
Question 4: Suppose we create an instance of the
data page as VarlenDataPage dp2(pagebuf2)
. What are
the values of the following C++ expressions: (1)
((VarlenDataPageHeader *) pagebuf2)->m_ph_sz
;
(2) ((VarlenDataPageHeader *) pagebuf2)-> m_nslots
;
(3) dp2.GetMaxSlotId()
?
Question 5: Suppose we invoke
VarlenDataPage::InitializePage(pagebuf2, 8)
at this time.
What are the values of the following C++ expressions: (1)
((VarlenDataPageHeader *) pagebuf2)->m_ph_sz
;
(2) ((VarlenDataPageHeader *) pagebuf2)-> m_nslots
;
(3) dp2.GetMaxSlotId()
?
Task 1: Implement the page initialization and a few page meta information functions:
VarlenDataPage::InitializePage()
VarlenDataPage::VarlenDataPage()
VarlenDataPage::GetUserData()
VarlenDataPage::GetMaxSlotId()
VarlenDataPasge::GetRecordCount()
Your code is likely to pass the tests
*/BasicTestVarlenDataPage.TestInitialize/*
at this
point.
Task 2: Implement the slot query and insertion functions:
VarlenDataPage::InsertRecord()
,
VarlenDataPage::IsOccupied()
, and
VarlenDataPage::GetRecordBuffer()
.
Your code is likely to pass the tests
*/BasicTestVarlenDataPage.TestInsertRecord/*
at
this point.
Task 3: Implement the record erasure function:
VarlenDataPage::EraseRecord()
.
Your code is likely to pass the tests
*/BasicTestVarlenDataPage.TestEraseRecord/*
at this
point.
Task 4: Implement the record update function:
VarlenDataPage::UpdateRecord()
.
Your code is likely to pass the tests
*/BasicTestVarlenDataPage.TestUpdateRecord/*
and
*/BasicTestVarlenDataPage.TestSidNotValidErrors/*
at this point.
As you might have noticed, there could be unoccupied spaces
between offsets m_ph_sz
and
m_fs_begin
(i.e., holes) after erasure or update.
Hence, the data page may have to compact those space by moving
the record payloads to be consecutive. During a page
compaction, the SlotData in the slot array may have to be
updated, but the slots themselves may not be moved. In other
words, a valid slot remains valid and still refers to the same
record; an invalid slot remains invalid; and the total number
of slots remains unchanged after a compaction. To avoid trying
to compact a page every time an insertion or an out-of-place
update happens, you should maintain a bit
m_has_hole
in the page header to indicate whether
it is possible to (not must) have holes in the occupied
space. In the tests, we assume the data page performs lazy
compaction, which means the compaction process will not be
triggered until it becomes necessary. Please refer to the class
documentation for more details.
Task 5: Implement the page compaction and update the page update functions accordingly.
You should pass all the tests in
BasicTestVarlenDataPage
at this point.
Optional: there are four functions for maintainng a sorted array
of records and computing the available space:
VarlenDataPage::InsertRecordAt()
,
VarlenDataPage::RemoveSlot()
,
VarlenDataPage::ShiftSlots()
, and
VarlenDataPage::ComputeFreeSpace
. These are not
required in this project and we do not provide the tests for
them yet. You'll need to implement them in the next project of
implementing B-trees.
Source files to modify:
Tips on the tests:
There are two versions of each tests
for the heap files: BasicTestTable
and
SystemTestTable
. You might want to debug your
implementation using the basic test first. The code
for both tests
is actually the same (see
tests/storage/TestTable_common.inc
). The
difference is that the basic test uses the
an in-memory catalog implementation
BootstrapCatCache
to create the table descriptors
in tests, and thus do not have dependencies on your heap file
implementation. The hidden system test uses the a heap file
based catalog implementation PersistentCatCache
,
which initializes a catalog using your heap file
implementation, and thus is more likely to fail to initialize
if your implementation is buggy.
The test
BasicTestRepoCompilesAndRuns.TestShouldAlwaysSucceed
,
which also uses PersistentCatCache
, is a good
indicator for whether your heap file implementation is buggy.
If it passes, your code may or may not pass the system test.
However, it probably will not pass the system tests if this
test fails.
Brief overview of heap file:
A heap file is simply an unordered (multi-)set of records. It
is the most basic way of organizing records in DBMS. In
Taco-DB, a heap file is implemented on top of the file manager
managed virtual files and stores records in some arbitrary
order in the list of pages. A table descriptor
TabDesc
stores essential information about a
table, which includes the file ID of the table. The
Table
class should open the file through the
global file manager g_fileman
, make calls to the
global buffer manager g_bufman
to access the
pages, and updates or retrives records on the page using
VarlenDataPage
. It should support inserting,
updating, deleting and iterating over records. To simplify the
implementation, you only need to allocate pages when there's no
space for insertion, and do not have to reuse free space on
previous pages (but you could if you'd like to). In addition,
you should free the page in the file manager when they become
empty. The Table::Iterator
class provides forward-only
iterator over all the records, and it should only keep at most
one page pin at a time.
Note: in a more realistic DBMS, this may cause an issue of trying to deallocating a page that is actively being scanned (which may be in the same transaction/thread, such as an UPDATE statement). You do not have to worry about the tricky case in this project because we ensure that there will be either no concurrent scans over the file when a deletion or update happens, or the concurrent scan may only cause in-place update of records (which can never result in empty pages).
You might want to pay attention to the specifications in the header file in order to pass all the tests, especially the system tests.
Before you start working on the code, please answer the
following questions by reading the specifications in
include/storage/Table.h
.
Question 6: Suppose there is a table descriptor
tabdesc
that describes a new table over a newly
allocated file with file ID tabdesc->tabfid()
,
and we invoke Table::Initialize(tabdesc)
. If we
create an instance of the table as auto tab =
Table::Create(tabdesc)
and iterate through all the
records in it using the Table::Iterator
obtained from
tab.StartScan()
, how many records will we find?
Question 7: Suppose we invoke
tab.InsertRecord()
on some record that fits into a
data page and then iterate through all the records with
tab
again. How many records will we find?
Question 8: Suppose we create another instance of the
table as auto tab2 = Table::Create(tabdesc)
and
invoke tab2.InsertRecord()
on some record that fits
into a data page. If we iterate through all the records
with tab2
, how many records will we find?
Question 9: If we iterate through all the records again
with tab
, how many records will we find?
Task 6: Implement the table initialization and construction functions:
Table::Initialize()
Table::Create()
Table::Table()
Table::~Table()
Your code is likely to pass
BasicTestTable.TestEmptyTable
at this point.
Task 7: Implement Table::InsertRecord()
,
Table::StartScan
,
Table::StartScanBefore
and the
Table::Iterator
class. You may change, delete, or
add to the existing members in the Table
and
Table::Iterator
if necessary, and the signature of
any private functions -- they only exist as a suggestion.
You code is likely to pass
BasicTestTable.TestInsertRecord
and
BasicTestTable.TestStartScanBefore
at this point.
Task 8: Implement Table::EraseRecord()
.
Your code is likely to pass
BasicTestTable.TestEraseRecord
at this point.
Question 10: Obviously, the pages in a heap file can become empty because of deletion of records. You should free the page when that happens. The question is whether a page can become empty because of update of records? If so, when (please give a concrete example as a sequence of operations on the heap file)? If not, why (please give a proof based on the function specifications)?
Task 9: Implement Table::UpdateRecord()
.
Your code should pass all the tests in
BasicTestTable
and SystemTestTable
at
this point.
Each individual student needs to complete a write-up independently and submit it to UBLearns with answers to Questions 1 - 15. Please find the due date of the write-up at the beginning of this web page and submit a PDF document to UBLearns by the deadline. Please do not submit any other formats such as .docx, .doc, .txt -- we will not grade any non-PDF submission.