Skip to content

Unused series unloading #110

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 32 commits into
base: pp
Choose a base branch
from
Open

Unused series unloading #110

wants to merge 32 commits into from

Conversation

gshigin
Copy link
Collaborator

@gshigin gshigin commented Jun 2, 2025

  • Loader/Unloader classes for unused series manipulation
  • Test coverage

@gshigin gshigin requested a review from cherep58 June 2, 2025 16:02
@gshigin gshigin self-assigned this Jun 2, 2025
glebshigin added 2 commits June 2, 2025 19:07
# Conflicts:
#	pp/bare_bones/encoding.h
@gshigin gshigin marked this pull request as ready for review June 3, 2025 07:49
@gshigin gshigin requested a review from vporoshok as a code owner June 3, 2025 07:49
@@ -321,6 +325,13 @@ class PROMPP_ATTRIBUTE_PACKED CompactBitSequence : public CompactBitSequenceBase
public:
[[nodiscard]] PROMPP_ALWAYS_INLINE BitSequenceReader reader() const noexcept { return {Base::memory_, size_in_bits_}; };

PROMPP_ALWAYS_INLINE void trim_lower_bytes(uint32_t bytes_count) {
assert(Bit::to_bits(bytes_count) <= Base::size_in_bits());
memset(Base::memory_, '\0', bytes_count);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid zeroing bytes: you should zero bytes in region Base::size_in_bytes() - bytes_count

template <OutputStream S>
PROMPP_ALWAYS_INLINE void write_to(S& stream) const noexcept {
const uint32_t data_size_in_bits = size();
const uint32_t data_size_in_bytes = ((data_size_in_bits + 63) >> 6) * 8;
Copy link
Collaborator

@cherep58 cherep58 Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can create method:

bare_bones/bit.h

template <class T>
PROMPP_ALWAYS_INLINE constexpr T used_bytes(T bits) noexcept {
  return to_bytes(bits + kByteBits - 1);
}

And use it here. Also you can refactor similar code in CompactBitSequenceBase::size_in_bytes and in Loader class


[[nodiscard]] PreparedSequences prepare_sequences() const noexcept {
PreparedSequences result{};
result.total_bitseqs_size = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to initialize this field in PreparedSequences:

struct PreparedSequences {
   ...
  uint32_t total_bitseqs_size{};
}


for (const auto ls_id : storage_.unused_series_bitmap) {
const auto encoding_type = storage_.open_chunks[ls_id].encoding_state.encoding_type;
if (!storage_.open_chunks[ls_id].is_empty() && is_unloadable_encoder(encoding_type)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!storage_.open_chunks[ls_id].is_empty() && is_unloadable_encoder(encoding_type)) {
if (!storage_.open_chunks[ls_id].is_empty() && is_unloadable_encoder(encoding_type)) {
Suggested change
if (!storage_.open_chunks[ls_id].is_empty() && is_unloadable_encoder(encoding_type)) {
if (is_unloadable_encoder(encoding_type)) {

Because unloadable chunk can't be empty

for (const auto ls_id : storage_.unused_series_bitmap) {
const auto encoding_type = storage_.open_chunks[ls_id].encoding_state.encoding_type;
if (!storage_.open_chunks[ls_id].is_empty() && is_unloadable_encoder(encoding_type)) {
result.ls_id_bitmap.resize(ls_id + 1);
Copy link
Collaborator

@cherep58 cherep58 Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You call result.ls_id_bitmap inside loop. It degrades performance.
You can find max unloadable chunk ls_id (use reverse iterator) and call result.ls_id_bitmap.resize before loop

const BareBones::Bitset& ls_id_bitmap,
const EncodingChunkLengthSequence& chunk_length_sequence,
const EncodingChunkIDSequence& chunk_id_sequence) noexcept {
ls_id_bitmap.write_to(stream);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can reserve size in stream: calculate needed memory size in prepare_sequences and call stream.reserve. See how it implemented in series_data::serialization::Serializer::serialize_impl

}
}

void load_next(std::span<const uint8_t> buffer) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer std::string_view for parsing because he has remove_prefix method. If you will use std::string_view you don't need a offset parameter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possible subspan()


auto inner = typename EncodedSequence::sequence_type::DecodeIterator(compact_data, compact_data + EncodedSequence::sequence_type::kMaxKeySize, elem_count);

static thread_local typename EncodedSequence::encoder_type encoder;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better make encoder a class member


class Loader {
using EncodingChunkLengthSequence =
BareBones::EncodedSequence<BareBones::Encoding::DeltaDeltaZigZag<BareBones::StreamVByte::CompactSequence<BareBones::StreamVByte::Codec0124Frequent0>>>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This data types must be common for loader ad unloader. Maybe create common.h/types.h and place this declarations in it?

}

DataStorage& storage_;
std::map<uint32_t, SeriesToLoadInfo> series_to_load_tmp_bitseqs_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building and find in btree map is expensive operations. I think phmap::flat_hash_map will be better for performance and phmap::flat_hash_map has reserve method (you known ids count in constructor)

}

const uint32_t index = get_bitset_iterator_index(bitset_it, it);
const uint32_t chunk_id_snapshot = nth_value_in_encoded_sequence<EncodingChunkIDSequence>(id_it, index);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You write data for every ls_id sequentially. You don't need to do nth_value_in_encoded_sequence: you need iteratively read data from sequences, check ls_id in series_to_load_tmp_bitseqs_ and do load logic

}

void load_chunk_id(uint32_t ls_id, SeriesToLoadInfo& info) const {
const auto chunk_data =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const auto chunk_data =
const auto& chunk_data =

return storage_.finalized_data_streams[chunk_data.chunk().encoder.external_index];
}();

auto chunk_bit_sequence_reader = chunk_bit_sequence.reader();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead all of this you can use std::memcpy because we write data in bytes, not in bits


PROMPP_ALWAYS_INLINE static void skip_bytes(BareBones::BitSequenceReader& reader, uint32_t count) noexcept { reader.ff(BareBones::Bit::to_bits(count)); }

PROMPP_ALWAYS_INLINE static void read_data(BareBones::BitSequenceReader& reader, uint32_t count, encoder::CompactBitSequence& output) noexcept {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of BitSequenceReader you can use std::memcpy

}
}

Encoder<> encoder{storage_};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type of Encoder should be template type

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants