
Conversation

Collaborator

@phoebusm phoebusm commented Oct 30, 2025

Reference Issues/PRs

https://man312219.monday.com/boards/7852509418/pulses/18298965201

What does this implement or fix?

read

`batch_read_keys` used to read the index keys of individual leaf nodes one by one while read tasks were being submitted. This PR makes that step run in parallel in the C++ layer.
It shows a read performance improvement, especially on slow networks or on data with more leaf nodes:

Read Time (s)

| | Remote AWS (Before) | Remote AWS (After) | Local S3 Storage, moto (Before) | Local S3 Storage, moto (After) |
| --- | --- | --- | --- | --- |
| 200 Large DataFrame | 98.4112 | 50.547 | 27.7294 | 25.2147 |
| 2000 Small DataFrame | 159.712 | 9.73144 | 33.0835 | 10.7383 |
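The core idea of the change can be sketched in Python as a hedged analogy to the C++ implementation: instead of fetching each leaf node's index key sequentially, all fetches are submitted at once so network latency overlaps. The function names here are illustrative, not ArcticDB's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def read_index_key(key):
    # Stand-in for a storage read; in ArcticDB this is a C++-level
    # S3/LMDB fetch, not a Python call.
    return f"index-data-for-{key}"

def read_keys_sequential(keys):
    # Old behaviour: one blocking read per leaf node.
    return [read_index_key(k) for k in keys]

def read_keys_parallel(keys, max_workers=8):
    # New behaviour: submit all reads at once so slow-network latency
    # overlaps; results come back in the original key order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_index_key, keys))

keys = [f"leaf_{i}" for i in range(4)]
assert read_keys_parallel(keys) == read_keys_sequential(keys)
```

The parallel version returns the same results in the same order; only the latency characteristics change, which is why the benefit is largest on slow remote storage.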
batch_read

It has been changed to share the same code path as `read`: node keys are now read in the same chain as the root keys.
As expected, performance has neither improved nor regressed.

Read Time (s)

| | Remote AWS (Before) | Remote AWS (After) | Local S3 Storage, moto (Before) | Local S3 Storage, moto (After) |
| --- | --- | --- | --- | --- |
| 2000 Symbols × 200 DataFrame | 7.379 | 7.161 | 7.224 | 7.252 |

Any other comments?

The ASV benchmark fails because of unreliable arrow and peakmem tests; these failures can be ignored.

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@phoebusm phoebusm changed the title Feature/recursive normalizer optimization Improve recursive normalized data read performance Oct 31, 2025
@phoebusm phoebusm added the minor Feature change, should increase minor version label Oct 31, 2025
@phoebusm phoebusm marked this pull request as ready for review October 31, 2025 18:51
@phoebusm phoebusm marked this pull request as draft November 5, 2025 13:54
@phoebusm phoebusm force-pushed the feature/recursive_normalizer_optimization branch 3 times, most recently from 49d4f83 to 8d9256b Compare November 7, 2025 18:15
@phoebusm phoebusm force-pushed the feature/recursive_normalizer_optimization branch from 8d9256b to 785f385 Compare November 11, 2025 14:02
@phoebusm phoebusm force-pushed the feature/recursive_normalizer_optimization branch from a52e95d to 0b72b38 Compare November 12, 2025 16:10
@phoebusm phoebusm marked this pull request as ready for review November 12, 2025 20:05
@phoebusm phoebusm added patch Small change, should increase patch version and removed minor Feature change, should increase minor version labels Nov 12, 2025
manager.log_info()

def get_symbol_name(self, symbol_idx=0):
return f"nested_dict_{self.params[0]}_sym{symbol_idx}"
Collaborator

`self.params[0]` is `[1000]`, so this doesn't make much sense?

Collaborator

I think the point was slightly missed. If we add another num_dict_entries parameter (e.g. so it's [1000, 2000]), then we will want separate symbol names for the different numbers of dict entries.
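The reviewer's concern can be illustrated with a small sketch (parameter and function names here are illustrative, modelled on the benchmark snippet above, not the actual ArcticDB benchmark code): if the symbol name encodes the parameter value, adding a second value to the parameter list automatically yields distinct symbols per parametrization.

```python
# Hypothetical ASV-style benchmark parameters (names are illustrative).
params = ([1000, 2000],)          # candidate num_dict_entries values
param_names = ["num_dict_entries"]

def get_symbol_name(num_dict_entries, symbol_idx=0):
    # Include the parameter value in the name so symbols for 1000 vs 2000
    # dict entries don't collide if the data were ever persisted.
    return f"nested_dict_{num_dict_entries}_sym{symbol_idx}"

assert get_symbol_name(1000) != get_symbol_name(2000)
```

With only a single parameter value (`[1000]`), embedding `self.params[0]` in the name contributes nothing, which is the ambiguity the thread resolves.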

Collaborator Author

@phoebusm, Nov 19, 2025

I have rewritten this bit to make it less ambiguous, since the data is not persisted anyway.

Collaborator

Are there already lmdb benchmarks for reading recursively normalized data?
What about for writing?

Collaborator Author

Yeah, why not LMDB.
I didn't touch write, but yeah, it's worth adding the write ASV test.

Collaborator

OK cool, we can remove the real_ prefix from this filename now then.

@phoebusm phoebusm force-pushed the feature/recursive_normalizer_optimization branch from 6f9a6d3 to 407257a Compare November 19, 2025 13:03
}
}

inline ReadResult create_python_read_result(
Collaborator Author

Just moving an existing function to a new place to resolve a circular dependency issue.

return versioned_item;
}

auto get_read_and_process_fn(

@phoebusm phoebusm force-pushed the feature/recursive_normalizer_optimization branch from 407257a to a7c4926 Compare November 19, 2025 13:19
@phoebusm phoebusm force-pushed the feature/recursive_normalizer_optimization branch from bb9f7c4 to 8471058 Compare November 20, 2025 11:35
}
}

inline ReadResult create_python_read_result(
Collaborator

Can this be moved to a .cpp file?

Collaborator Author

It has been marked `inline` explicitly, so I'd prefer not to touch that.

Comment on lines 37 to 44
py::list node_results;
for (auto& node_result : ret.node_results) {
node_results.append(py::make_tuple(
node_result.symbol_,
std::move(node_result.frame_data_),
python_util::pb_to_python(node_result.norm_meta_)
));
}
Collaborator

This seems duplicated in python_utils.hpp; can it be extracted into a function?

for (const auto& key : keys) {
node_futures.emplace_back(read_frame_for_version(store(), key, read_query, read_options, handler_data));
}
auto node_trys = folly::collectAll(node_futures).get();
Collaborator

Since we're not wrapping exceptions as in the batch path, can't we use folly::collect to fail faster? In that case we must be very careful not to have segfaults on failures, and make all parameters for the futures shared_ptrs.

Collaborator Author

Alex and I discussed this, and since failing fast is not that important here, we think this implementation is preferred.
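The trade-off in this thread (folly::collectAll waits for every future and yields a per-future result/error, while folly::collect propagates the first failure immediately) can be sketched in Python with the standard library; this is an analogy, not the actual C++ code.

```python
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

def task(i):
    # Task 1 fails; the others succeed.
    if i == 1:
        raise ValueError("boom")
    return i

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(task, i) for i in range(3)]
    # collectAll-style: wait for *all* futures to finish, then inspect
    # each one's result or exception individually (like folly::Try).
    wait(futures, return_when=ALL_COMPLETED)
    results = []
    for f in futures:
        try:
            results.append(f.result())
        except ValueError:
            results.append(None)

assert results == [0, None, 2]
```

A fail-fast variant would use `return_when=FIRST_EXCEPTION` and stop waiting as soon as any task raises; the thread concludes the wait-for-all behaviour is acceptable here since failing fast matters little.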


Labels

patch Small change, should increase patch version


4 participants