Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet Filter predicate storing nested string causing OOM's #403

Open
asfimport opened this issue Dec 29, 2022 · 3 comments
Open

Parquet Filter predicate storing nested string causing OOM's #403

asfimport opened this issue Dec 29, 2022 · 3 comments

Comments

@asfimport
Copy link
Collaborator

Each Instance of ColumnFilterPredicate stores the filter values in toString variable eagerly. Which is not useful

static abstract class ColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable  {
  private final Column<T> column;
  private final T value;
  private final String toString; 


protected ColumnFilterPredicate(Column<T> column, T value) {
  this.column = Objects.requireNonNull(column, "column cannot be null");

  // Eq and NotEq allow value to be null, Lt, Gt, LtEq, GtEq however do not, so they guard against
  // null in their own constructors.
  this.value = value;

  String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
  this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + value + ")";
}

 

 

If your filter predicate is too long/nested this can take a lot of memory while creating Filter.
We have seen in our productions this can go upto 4gbs of space while opening multiple parquet readers

Same thing is replicated in BinaryLogicalFilterPredicate. Where toString is eagerly calculated and stored in string and lot of duplication is happening while making And/or filter.

I did not find use case of storing it so eagerly

Reporter: Abhishek Jain

Note: This issue was originally created as PARQUET-2220. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Abhishek Jain:
Can anyone from parquet contributors take a look on this ? 

@asfimport
Copy link
Collaborator Author

Abhishek Jain:
very sorry for tagging @gszadovszky @theosib-amazon . Just want to get this noticed

@asfimport
Copy link
Collaborator Author

Gabor Szadovszky / @gszadovszky:
[~abhiSumo304], I agree eagerly storing the toString value is not a good idea. I don't think it has proper use case either. toString should be used for debugging purposes anyway so eagerly storing the value does not really make sense. Unfortunately, I don't work on the Parquet code base actively anymore. Feel free to put up a PR to fix this and I'll try to review it in time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant