-
Notifications
You must be signed in to change notification settings - Fork 6
Table files
Original notes on table format
LevelDB stor data in stored table files in levels 0..max-level. Each level increase the table file size by a factor of 10. Table files use the extension .tbland they are numbered sequentially and increased automatically. Files are merged, and old files removed when compacting happens.
A table file contain several sections of data in the following layout
[data block 1]
[data block 2]
...
[data block N]
[meta block 1]
...
[meta block K]
[metaindex block]
[index block]
[Footer] (fixed size; starts at file_size - sizeof(Footer))
The format in formal specification (individual types described below)
table_file :=
data = data_block*
metaindex = metaindex
index = block_index
footer = footer;
A block handle points to a position and lenght of a block of data in the file.
block_handle :=
offset: varint64
size: varint64
The varints are regular formatted google style varints from protobuf.
The sequence of key/value pairs in the file are stored in sorted order and partitioned into a sequence of data blocks. These blocks come one after another with start at the beginning of the file.
data_block:=
block_data = data*;
restart_index = restart_index;
data :=
shared = varint64;
non_shared = varint64;
data_size = varint64;
user_data = byte[data_size];
restart_index :=
restart_points = uint32[count];
count = uint32;
The restart index is used to compress key stored in the data block. Since data is stored in sorted order it allows for parts of the keys to share bytes with the previous key. To make scanning fast, restartpoints are stored at even intervalls, and at that point the full key is stored. This can then be used to jump between blocks of data for faster search.
After data blocks we find a set of meta blocks. The supported meta block types are described in the original documentation and I won't go into it here.
The metaindex block contains one entry for every other meta block where the key is the name of the meta block and the value is a BlockHandle pointing to that meta block. This block follow immediate afer all the meta blocks themselves.
Notice that the key in the index in this case is an UTF8 encoded string containing the name of the meta block.
The index block contains one entry per data block, where the key is a string >= last key in that data block and before the first key in the successive data block. The value is the BlockHandle for the data block. It ends with a restart index.
block_index :=
indexes = index*;
restart_index = restart_index;
restart_index :=
restart_points = uint32[count];
count = uint32;
index :=
shared = varint64;
non_shared = varint64;
size = varint64;
data_handle = block_handle;
This is essentially the same format as for a regular data block above.
At the very end of the file is a fixed length footer that contains the BlockHandle of the metaindex and index blocks as well as a magic number.
footer :=
metaindex_handle: block_handle; // Block handle for metaindex
index_handle: block_handle; // Block handle for index
padding: zero*; // zeroed bytes to make fixed length
// (40==2*sizeof(BlockHandle max length)
magic: fixed64; // == 0xdb4775248b80fb57
When reading a table file, this is what we start with. From this we get the index handle that we use to search for a key (using min, max search), and then read the actual data block.