Memory consumption? #48
The intent was that it doesn't uncompress the Parquet file fully into memory, although I don't think that I specifically tested that. The library itself doesn't create any buffers for more than a row of data at once. IIRC, the underlying parquet-cpp library loads an entire row group's worth of data into memory at once for each column that is referenced. Sorry I can't be more helpful here!
That was very helpful, thank you. I guess I need to play with some PRAGMA settings to get to the point where sqlite does not OOM when just transferring a big Parquet file to a sqlite database. The process is quite slow, probably due to single-threaded sqlite IO (it is also CPU heavy, so I guess it does decompression and IO on the same thread), even with snappy, as I don't think .copy is available for Parquet files.
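To make the bulk-transfer idea above concrete, here is a minimal sketch of a batched copy with bulk-load PRAGMAs, using Python's stdlib `sqlite3`. A plain table stands in for the attached parquet virtual table, and all table/column names and batch sizes are hypothetical, not from this repo:

```python
import sqlite3

# Hypothetical sketch: copy a large source table into a destination
# database in bounded batches instead of one giant INSERT ... SELECT.
# A plain table stands in for the attached parquet virtual table.

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
src.executemany("INSERT INTO events VALUES (?, ?)",
                [(i, f"row-{i}") for i in range(10_000)])

dst = sqlite3.connect(":memory:")
# PRAGMAs that trade durability for speed during a one-off bulk load.
dst.execute("PRAGMA journal_mode = OFF")
dst.execute("PRAGMA synchronous = OFF")
dst.execute("PRAGMA cache_size = -64000")  # cap page cache at ~64 MB
dst.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

BATCH = 1_000
cur = src.execute("SELECT id, payload FROM events")
while True:
    rows = cur.fetchmany(BATCH)   # only BATCH rows held in memory at once
    if not rows:
        break
    with dst:                     # one transaction per batch
        dst.executemany("INSERT INTO events VALUES (?, ?)", rows)

print(dst.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 10000
```

Batching the copy keeps the Python-side working set bounded regardless of source size; the PRAGMAs only make sense for a throwaway load where crash safety doesn't matter.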
Do you think it would make sense to launch a separate thread for decompression to get more CPU power on multicore machines?
Separate threads would make sense - I think someone else mentioned that since each column is compressed separately, there's a lot of opportunity for parallelism. If we did add that, it'd probably make sense to make the parallelism configurable. I wonder if `sqlite-parquet-vtable/parquet/parquet.cc` (line 294 in d44c88a) …
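The parallelism idea above can be sketched in a few lines. This is not the extension's actual code: zlib stands in for Snappy, the "column chunks" are synthetic byte blobs, and `max_workers` plays the role of the configurable parallelism knob. Threads genuinely help here because zlib releases the GIL while (de)compressing:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Sketch only: each Parquet column chunk is compressed independently,
# so decompression can be fanned out across threads. zlib stands in
# for Snappy; the "columns" below are synthetic.
columns = {
    "id":      zlib.compress(b"0123456789" * 10_000),
    "name":    zlib.compress(b"alice,bob," * 10_000),
    "payload": zlib.compress(b"x" * 100_000),
}

def decompress_column(item):
    name, blob = item
    return name, zlib.decompress(blob)  # zlib releases the GIL here

# max_workers is the hypothetical configurable parallelism setting.
with ThreadPoolExecutor(max_workers=4) as pool:
    decoded = dict(pool.map(decompress_column, columns.items()))

print(len(decoded["payload"]))  # 100000
```

In the real extension this fan-out would happen in C++ per referenced column of a row group, but the shape of the work — independent compressed chunks mapped over a bounded worker pool — is the same.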
It sounds like you're attaching the parquet file and then trying to ingest it into a regular SQLite table? I can imagine that if you were doing …
BTW - this repo is largely inactive now. In fact, I see that you have a PR from a few months ago that I failed to notice :( Would you like to be added as a collaborator so you can directly merge stuff?
That would be nice. I could refine the PR and merge it. |
Great, invite sent! |
The issue was with using a UNION ALL VIEW over two Parquet files causing the whole VIEW to be materialised when accessed. Resolved the issue by not using the VIEW in the first place. |
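The workaround described above — querying the underlying tables directly instead of going through a UNION ALL view — can be illustrated with stdlib `sqlite3`. Plain tables stand in for the two parquet virtual tables, and all names here are hypothetical:

```python
import sqlite3

# Plain tables stand in for the two parquet virtual tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE part1 (id INTEGER, v TEXT)")
db.execute("CREATE TABLE part2 (id INTEGER, v TEXT)")
db.executemany("INSERT INTO part1 VALUES (?, ?)", [(i, "a") for i in range(5)])
db.executemany("INSERT INTO part2 VALUES (?, ?)", [(i, "b") for i in range(5)])

# The problematic pattern: a UNION ALL view, which the planner may
# materialise in full when the view is referenced.
db.execute("CREATE VIEW both_parts AS "
           "SELECT * FROM part1 UNION ALL SELECT * FROM part2")

# The workaround: skip the view and scan each source directly, one at
# a time, so only one underlying file is touched per query.
total = 0
for table in ("part1", "part2"):
    total += db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
print(total)  # 10
```

Iterating over the sources in application code keeps each scan independent, at the cost of losing the single-query convenience the view provided.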
Does this extension require enough memory to uncompress the Parquet file fully into memory? Or can it "stream" the Parquet file by row groups? E.g. in the case of
SELECT * FROM parquet('file.parquet')
In other words, does the engine need to have memory available for the full uncompressed Parquet data blob?