Memory consumption? #48
Open
dforsber opened this issue Apr 20, 2021 · 8 comments
@dforsber
Collaborator

Does this extension require memory to (uncompress) Parquet fully into the memory? Or can it "stream" the Parquet by row groups? E.g. in case of SELECT * FROM parquet('file.parquet').

In other words, does the engine need to have memory available for a full uncompressed Parquet data blob?

@cldellow
Owner

The intent was that it doesn't uncompress the Parquet file fully into memory, although I don't think that I specifically tested that.

The library itself doesn't create any buffers for more than a row of data at once.

IIRC, the underlying parquet-cpp library loads an entire row group's worth of data into memory at once for each column that is referenced.
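To make that concrete, here is a rough sketch of what that row-group behaviour implies for a query's memory footprint (the column names are hypothetical, and the parquet('file.parquet') form is taken from the question above):

    -- Touches every column, so ~one row group is buffered per column in the file.
    SELECT * FROM parquet('file.parquet');

    -- Touches only two (hypothetical) columns, so only those columns'
    -- row-group buffers need to be held at once.
    SELECT col_a, col_b FROM parquet('file.parquet');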

Sorry I can't be more helpful here!

@dforsber
Collaborator Author

That was very helpful, thank you. I guess I need to play with some PRAGMA settings to get to the point where SQLite does not OOM when just transferring a big Parquet file into an SQLite database.

The process is quite slow, probably due to single-threaded SQLite IO (it is also CPU heavy, so I guess it does decompression and IO on the same thread). Even with Snappy.

Also, I don't think .copy is available for Parquet files.
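A minimal sketch of the kind of bulk-load tuning meant here, assuming the goal is a one-off copy into a destination table (dest_table is a placeholder, and the PRAGMA values are illustrative, not tuned):

    -- Reduce journalling/fsync overhead during the one-off import.
    PRAGMA journal_mode = OFF;   -- or WAL, if crash safety still matters
    PRAGMA synchronous = OFF;

    -- Do the copy inside a single transaction.
    BEGIN;
    INSERT INTO dest_table SELECT * FROM parquet('file.parquet');
    COMMIT;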

@dforsber
Collaborator Author

Do you think it would make sense to launch a separate thread for decompression, to get more CPU power out of multicore machines?

@cldellow
Owner

Separate threads would make sense - I think someone else mentioned that since each column is compressed separately, there's a lot of opportunity for parallelism. If we did add that, it'd probably make sense to make the parallelism configurable.

I wonder if SELECT * might actually load everything into memory. When we emit values, we emit them with SQLITE_TRANSIENT (see

    sqlite3_result_text(ctx, (const char*)rv->ptr, rv->len, SQLITE_TRANSIENT);

), which tells SQLite that it can't rely on the pointer being valid for the duration of the query, and so it must do a defensive copy. This is described at https://sqlite.org/c3ref/result_blob.html

It sounds like you're attaching the Parquet file and then trying to ingest it into a regular SQLite table? I can imagine that if you were doing INSERT INTO xyz SELECT * FROM abc, SQLite could do that with a fixed amount of memory, but maybe it doesn't. If that's the case, a workaround might be to run N queries whose WHERE clauses split the table into N disjoint sets.
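A sketch of that workaround, assuming the source table has some column (a hypothetical id here) whose range is known and can be split into disjoint slices:

    -- Each statement only scans and copies one slice, bounding the work
    -- (and, hopefully, the memory) per query. Boundaries are made up.
    INSERT INTO xyz SELECT * FROM abc WHERE id <  1000000;
    INSERT INTO xyz SELECT * FROM abc WHERE id >= 1000000 AND id < 2000000;
    INSERT INTO xyz SELECT * FROM abc WHERE id >= 2000000;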

@cldellow
Owner

BTW - this repo is largely inactive now. In fact, I see that you have a PR from a few months ago that I failed to notice :(

Would you like to be added as a collaborator so you can directly merge stuff?

@dforsber
Collaborator Author

> BTW - this repo is largely inactive now. In fact, I see that you have a PR from a few months ago that I failed to notice :(
>
> Would you like to be added as a collaborator so you can directly merge stuff?

That would be nice. I could refine the PR and merge it.

@cldellow
Owner

Great, invite sent!

@dforsber
Collaborator Author

The issue was with using a UNION ALL VIEW over two Parquet files, which caused the whole VIEW to be materialised when accessed. I resolved the issue by not using the VIEW in the first place.
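For reference, a sketch of the two shapes (the file names, the view name, and the parquet() form are assumptions, not taken from the actual setup):

    -- Problematic shape: a UNION ALL VIEW over two Parquet files; querying
    -- the VIEW caused the whole thing to be materialised.
    CREATE VIEW combined AS
      SELECT * FROM parquet('part1.parquet')
      UNION ALL
      SELECT * FROM parquet('part2.parquet');
    SELECT count(*) FROM combined;

    -- Workaround: skip the VIEW and query each file directly.
    SELECT count(*) FROM parquet('part1.parquet');
    SELECT count(*) FROM parquet('part2.parquet');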
