-
Notifications
You must be signed in to change notification settings - Fork 4
Unused series unloading #110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: pp
Are you sure you want to change the base?
Conversation
gshigin
commented
Jun 2, 2025
- Loader/Unloader classes for unused series manipulation
- Test coverage
# Conflicts: # pp/bare_bones/encoding.h
@@ -321,6 +325,13 @@ class PROMPP_ATTRIBUTE_PACKED CompactBitSequence : public CompactBitSequenceBase | |||
public: | |||
[[nodiscard]] PROMPP_ALWAYS_INLINE BitSequenceReader reader() const noexcept { return {Base::memory_, size_in_bits_}; }; | |||
|
|||
PROMPP_ALWAYS_INLINE void trim_lower_bytes(uint32_t bytes_count) { | |||
assert(Bit::to_bits(bytes_count) <= Base::size_in_bits()); | |||
memset(Base::memory_, '\0', bytes_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Invalid zeroing bytes: you should zero bytes in region Base::size_in_bytes() - bytes_count
template <OutputStream S> | ||
PROMPP_ALWAYS_INLINE void write_to(S& stream) const noexcept { | ||
const uint32_t data_size_in_bits = size(); | ||
const uint32_t data_size_in_bytes = ((data_size_in_bits + 63) >> 6) * 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can create method:
bare_bones/bit.h
template <class T>
PROMPP_ALWAYS_INLINE constexpr T used_bytes(T bits) noexcept {
return to_bytes(bits + kByteBits - 1);
}
And use it here. Also you can refactor similar code in CompactBitSequenceBase::size_in_bytes
and in Loader
class
|
||
[[nodiscard]] PreparedSequences prepare_sequences() const noexcept { | ||
PreparedSequences result{}; | ||
result.total_bitseqs_size = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to initialize this field in PreparedSequences
:
struct PreparedSequences {
...
uint32_t total_bitseqs_size{};
}
|
||
for (const auto ls_id : storage_.unused_series_bitmap) { | ||
const auto encoding_type = storage_.open_chunks[ls_id].encoding_state.encoding_type; | ||
if (!storage_.open_chunks[ls_id].is_empty() && is_unloadable_encoder(encoding_type)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!storage_.open_chunks[ls_id].is_empty() && is_unloadable_encoder(encoding_type)) { | |
if (!storage_.open_chunks[ls_id].is_empty() && is_unloadable_encoder(encoding_type)) { |
if (!storage_.open_chunks[ls_id].is_empty() && is_unloadable_encoder(encoding_type)) { | |
if (is_unloadable_encoder(encoding_type)) { |
Because unloadable chunk can't be empty
for (const auto ls_id : storage_.unused_series_bitmap) { | ||
const auto encoding_type = storage_.open_chunks[ls_id].encoding_state.encoding_type; | ||
if (!storage_.open_chunks[ls_id].is_empty() && is_unloadable_encoder(encoding_type)) { | ||
result.ls_id_bitmap.resize(ls_id + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You call result.ls_id_bitmap
inside loop. It degrades performance.
You can find max unloadable chunk ls_id (use reverse iterator) and call result.ls_id_bitmap.resize
before loop
const BareBones::Bitset& ls_id_bitmap, | ||
const EncodingChunkLengthSequence& chunk_length_sequence, | ||
const EncodingChunkIDSequence& chunk_id_sequence) noexcept { | ||
ls_id_bitmap.write_to(stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can reserve size in stream: calculate needed memory size in prepare_sequences
and call stream.reserve
. See how it implemented in series_data::serialization::Serializer::serialize_impl
} | ||
} | ||
|
||
void load_next(std::span<const uint8_t> buffer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer std::string_view
for parsing because he has remove_prefix
method. If you will use std::string_view
you don't need a offset parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possible subspan()
|
||
auto inner = typename EncodedSequence::sequence_type::DecodeIterator(compact_data, compact_data + EncodedSequence::sequence_type::kMaxKeySize, elem_count); | ||
|
||
static thread_local typename EncodedSequence::encoder_type encoder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better make encoder
a class member
|
||
class Loader { | ||
using EncodingChunkLengthSequence = | ||
BareBones::EncodedSequence<BareBones::Encoding::DeltaDeltaZigZag<BareBones::StreamVByte::CompactSequence<BareBones::StreamVByte::Codec0124Frequent0>>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This data types must be common for loader ad unloader. Maybe create common.h/types.h
and place this declarations in it?
} | ||
|
||
DataStorage& storage_; | ||
std::map<uint32_t, SeriesToLoadInfo> series_to_load_tmp_bitseqs_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Building and find in btree map is expensive operations. I think phmap::flat_hash_map
will be better for performance and phmap::flat_hash_map
has reserve
method (you known ids count in constructor)
} | ||
|
||
const uint32_t index = get_bitset_iterator_index(bitset_it, it); | ||
const uint32_t chunk_id_snapshot = nth_value_in_encoded_sequence<EncodingChunkIDSequence>(id_it, index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You write data for every ls_id sequentially. You don't need to do nth_value_in_encoded_sequence
: you need iteratively read data from sequences, check ls_id in series_to_load_tmp_bitseqs_
and do load logic
} | ||
|
||
void load_chunk_id(uint32_t ls_id, SeriesToLoadInfo& info) const { | ||
const auto chunk_data = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto chunk_data = | |
const auto& chunk_data = |
return storage_.finalized_data_streams[chunk_data.chunk().encoder.external_index]; | ||
}(); | ||
|
||
auto chunk_bit_sequence_reader = chunk_bit_sequence.reader(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead all of this you can use std::memcpy
because we write data in bytes, not in bits
|
||
PROMPP_ALWAYS_INLINE static void skip_bytes(BareBones::BitSequenceReader& reader, uint32_t count) noexcept { reader.ff(BareBones::Bit::to_bits(count)); } | ||
|
||
PROMPP_ALWAYS_INLINE static void read_data(BareBones::BitSequenceReader& reader, uint32_t count, encoder::CompactBitSequence& output) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of BitSequenceReader
you can use std::memcpy
} | ||
} | ||
|
||
Encoder<> encoder{storage_}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type of Encoder
should be template type