Support recursive normalizer in v2 API write #2736
base: master
Conversation
    metadata=metadata,
    prune_previous_version=prune_previous_versions,
    pickle_on_failure=True,
    parallel=staged,
The V2 API allows staging non-natively normalized data in write_pickle by passing staged=True. Maybe this combination should be blocked in a future major release.
Agreed. I think there is a ticket to remove the staged argument from write_pickle, as it doesn't make sense. Technically it's an API break though, so it needs to wait for a better reason to do 7.0.0.
    index_column: Optional[str], default=None
        Optional specification of the timeseries index column if data is an Arrow table. Ignored if data is not
        an Arrow table.
    recursive_normalizers: bool, default None
I think we should be a bit more descriptive in the docstring. Even though it works in the V1 API, almost no one in the outside world knows about it. IMO we should treat it as a new feature. We should also improve the description of the PR, as it will go in the release notes.
I have given a bit more detail in the PR description.
I have also created a notebook for the feature. The docstring will refer users to it for more details.
    if is_recursive_normalizers_enabled:
        if staged:
            raise ArcticUnsupportedDataTypeException(
                "Staged data must be of a type that can be natively normalized"
I'd be a bit more explicit here and tell the user that they're trying to use recursive normalizers for staged data and that it's not allowed. It will be more useful for the user and for us when a support request comes in.
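To illustrate the suggestion, here is a hedged sketch of what a more explicit guard message could look like. The exception class is stubbed locally so the snippet is self-contained, and the message wording is illustrative, not the PR's actual text:

```python
# Stub of the exception so this sketch runs standalone; in ArcticDB the
# real class comes from the library's exception hierarchy.
class ArcticUnsupportedDataTypeException(Exception):
    pass


def check_staged_with_recursive(is_recursive_normalizers_enabled: bool, staged: bool) -> None:
    """Reject the staged + recursive-normalizer combination with an explicit message.

    Illustrative only: names and wording are assumptions, not ArcticDB code.
    """
    if is_recursive_normalizers_enabled and staged:
        raise ArcticUnsupportedDataTypeException(
            "Recursive normalizers cannot be used with staged data (staged=True): "
            "staged data must be of a type that can be natively normalized. "
            "Either write the data unstaged, or disable recursive normalization."
        )


# Allowed combinations pass silently:
check_staged_with_recursive(True, False)
check_staged_with_recursive(False, True)
```

Naming the rejected combination in the message makes both user debugging and support triage easier, as the comment above suggests.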
        See documentation on `write`.
    recursive_normalizers: bool, default None
        See documentation on `write`.
        If the leaf nodes cannot be natively normalized, they will be pickled,
Interesting, did I miss a discussion on this? I'd expect write_pickle to pickle everything all the time, regardless of normalizers.
Yes, good spot. If both are enabled, the recursive normalizer takes priority over pickling. I have amended the docstring to explain the priority.
This follows the V1 API.
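The priority described here (recursive normalization first, per-leaf pickling only for leaves that fail native normalization) can be shown with a toy sketch. This is not ArcticDB code: `normalize_leaf`, the tuple tags, and the dict-only recursion are all invented for illustration.

```python
import pickle


def normalize_leaf(obj):
    """Toy stand-in for native normalization: only lists of numbers succeed."""
    if isinstance(obj, list) and all(isinstance(x, (int, float)) for x in obj):
        return ("native", obj)
    raise TypeError("cannot natively normalize")


def write_pickle_sketch(data, recursive=False):
    """Illustrate the priority: with recursion on, descend into containers
    (dicts only, for brevity) and pickle only the leaves that fail native
    normalization; with recursion off, a failed leaf is pickled wholesale."""
    if recursive and isinstance(data, dict):
        return {k: write_pickle_sketch(v, recursive) for k, v in data.items()}
    try:
        return normalize_leaf(data)
    except TypeError:
        return ("pickled", pickle.dumps(data))


out = write_pickle_sketch({"a": [1, 2], "b": object()}, recursive=True)
# "a" is natively normalized; only "b" falls back to pickling.
```

With `recursive=False`, the whole dict fails `normalize_leaf` and is pickled as one blob, which is the behaviour the reviewer above expected from write_pickle.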
Reference Issues/PRs
https://man312219.monday.com/boards/7852509418/pulses/18298965201
What does this implement or fix?
`write` and `write_pickle` in the v2 API now support recursive normalizers, allowing nested structures (`dict`, `list`, `tuple`) of dataframes and arrays to be written without pickling the entire structure. As discussed, in the v2 API the recursive normalizer setting in `LibraryOptions` will be respected. The default library option for the recursive normalizer is `False`. Existing libraries are unaffected.
Any other comments?
`batch_write` in the v1 API doesn't support recursive normalizers. Ticket: https://man312219.monday.com/boards/7852509418/pulses/7855436309. Therefore, support for recursive normalizers in the corresponding v2 API is not covered in this PR.
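For readers new to the feature, the core idea of a recursive normalizer can be sketched in plain Python: decompose a nested structure into natively writable leaves keyed by path, plus a structure descriptor used to rebuild it on read. All names here are invented for illustration and do not reflect ArcticDB's actual storage format.

```python
def flatten(data, prefix="sym"):
    """Decompose a nested dict/list/tuple into leaf payloads keyed by path,
    plus a structure descriptor for reconstruction. Toy illustration only."""
    leaves = {}

    def walk(node, path):
        if isinstance(node, dict):
            return {"type": "dict", "keys": {k: walk(v, f"{path}__{k}") for k, v in node.items()}}
        if isinstance(node, (list, tuple)):
            t = "list" if isinstance(node, list) else "tuple"
            return {"type": t, "items": [walk(v, f"{path}__{i}") for i, v in enumerate(node)]}
        leaves[path] = node  # a leaf: would be natively normalized and written
        return {"type": "leaf", "key": path}

    return walk(data, prefix), leaves


def rebuild(struct, leaves):
    """Reassemble the original nested structure from descriptor + leaves."""
    if struct["type"] == "dict":
        return {k: rebuild(s, leaves) for k, s in struct["keys"].items()}
    if struct["type"] in ("list", "tuple"):
        items = [rebuild(s, leaves) for s in struct["items"]]
        return items if struct["type"] == "list" else tuple(items)
    return leaves[struct["key"]]
```

The payoff is that each leaf (e.g. a dataframe) can be stored and normalized independently, so the container itself never has to be pickled.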
Pickling
Pickling is a bit of a mess in the V1 API. For `arrow` and `pandas` data, if normalization fails, it can fall back to msgpack and pickling, depending on whether `pickle_on_failure` is `True`. For other kinds of data, however, it almost certainly falls back to msgpack and pickling. The only option that prevents pickling is the library config `strict_mode`, but I don't see any API to enable it, either in V1/V2 or internally.
Checklist
Checklist for code changes...