Unable to create bloomfilter when writing duplicate values for a field #1698

Open
asfimport opened this issue May 28, 2024 · 2 comments

Comments

@asfimport
Collaborator

I'm unable to create a bloom filter for a field when I write repeating values. When I read such a Parquet file back, the bloom filter returned is null. If there are no repeating values, the bloom filter is created without any issue.
The working and non-working cases are captured in the repo below:

https://github.com/MaheshGPai/parquet-mr-test
https://github.com/MaheshGPai/parquet-mr-test/blob/main/src/test/java/com/mahesh/test/AppTest.java#L73

Reporter: Mahesh Pai

Note: This issue was originally created as PARQUET-2484. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Gang Wu / @wgtmac:
By default, the bloom filter is disabled if dictionary encoding is applied. Please set withDictionaryEncoding(false) when creating the Parquet writer and check whether this issue still happens.
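
A minimal sketch of that suggestion, assuming parquet-hadoop 1.12 or later and its Hadoop dependencies are on the classpath. The schema, the column name `name`, and the output path are hypothetical and not taken from the reporter's repo. It writes only repeated values with dictionary encoding turned off, then reads the footer back to see whether a bloom filter is present.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class BloomFilterWithoutDictionary {
    public static void main(String[] args) throws IOException {
        // Hypothetical single-column schema, for illustration only.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example { required binary name (UTF8); }");
        // Hypothetical output location.
        Path file = new Path("/tmp/bloom-example.parquet");

        try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(file)
                .withType(schema)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .withDictionaryEncoding(false)          // the setting suggested above
                .withBloomFilterEnabled("name", true)   // request a bloom filter for this column
                .build()) {
            SimpleGroupFactory groups = new SimpleGroupFactory(schema);
            // Write the same value many times, as in the failing case.
            for (int i = 0; i < 1000; i++) {
                writer.write(groups.newGroup().append("name", "repeated-value"));
            }
        }

        // Read the footer back and report, per column chunk, whether a bloom filter exists.
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(file, new Configuration()))) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                for (ColumnChunkMetaData column : block.getColumns()) {
                    BloomFilter filter = reader.readBloomFilter(column);
                    System.out.println(column.getPath() + " bloom filter present: " + (filter != null));
                }
            }
        }
    }
}
```

The builder also has a per-column overload, `withDictionaryEncoding(String columnPath, boolean enabled)`, which may be preferable if only one column needs a bloom filter.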

@asfimport
Collaborator Author

Mahesh Pai:
Thanks @wgtmac!
When I disable dictionary encoding, the bloom filter is created. But I'm curious why this works when distinct values are written: with distinct values and dictionary encoding enabled, the writer still generates the bloom filter.
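
One possible explanation, offered as an assumption and not confirmed in this thread: the writer may give up on dictionary encoding when the dictionary does not save space, as with all-distinct values, and skip the bloom filter only when dictionary encoding is actually kept. The toy model below illustrates that reasoning. The names `dictionaryKept` and `bloomFilterWritten` and the size heuristic are hypothetical, and the code does not call parquet-mr.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BloomFilterDecisionSketch {

    // Toy heuristic (an assumption): keep dictionary encoding only if
    // dictionary bytes plus index bytes are smaller than the raw bytes.
    static boolean dictionaryKept(List<String> values) {
        long rawBytes = 0;
        long dictionaryBytes = 0;
        Set<String> seen = new HashSet<>();
        for (String value : values) {
            int length = value.getBytes(StandardCharsets.UTF_8).length;
            rawBytes += length;
            if (seen.add(value)) {
                dictionaryBytes += length; // each distinct value is stored once
            }
        }
        long indexBytes = values.size(); // assume one byte per dictionary index
        return dictionaryBytes + indexBytes < rawBytes;
    }

    // Assumed rule: the bloom filter is skipped only when dictionary
    // encoding is enabled and was kept for the data.
    static boolean bloomFilterWritten(List<String> values, boolean dictionaryEnabled) {
        boolean dictionaryApplied = dictionaryEnabled && dictionaryKept(values);
        return !dictionaryApplied;
    }

    public static void main(String[] args) {
        List<String> repeated = Collections.nCopies(1000, "same-value");
        List<String> distinct = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            distinct.add("value-" + i);
        }
        System.out.println(bloomFilterWritten(repeated, true));  // false
        System.out.println(bloomFilterWritten(distinct, true));  // true
        System.out.println(bloomFilterWritten(repeated, false)); // true
    }
}
```

Under these assumptions the model reproduces all three observations in this thread: repeated values with dictionary encoding produce no bloom filter, while distinct values, or any values with dictionary encoding disabled, do.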
