Memory consumption? #48
Open
dforsber opened this issue Apr 20, 2021 · 8 comments
@dforsber
Collaborator

Does this extension require memory to (uncompress) Parquet fully into the memory? Or can it "stream" the Parquet by row groups? E.g. in case of SELECT * FROM parquet('file.parquet').

In other words, does the engine need to have memory available for a full uncompressed Parquet data blob?

@cldellow
Owner

The intent was that it doesn't uncompress the Parquet file fully into memory, although I don't think that I specifically tested that.

The library itself doesn't create any buffers for more than a row of data at once.

IIRC, the underlying parquet-cpp library loads an entire row group's worth of data into memory at once for each column that is referenced.
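To make that concrete, here is a rough sketch of what that row-group behaviour implies for a query's memory footprint (the column names are hypothetical, and the parquet('file.parquet') form is taken from the question above):

    -- Touches every column, so ~one row group is buffered per column in the file.
    SELECT * FROM parquet('file.parquet');

    -- Touches only two (hypothetical) columns, so only those columns'
    -- row-group buffers need to be held at once.
    SELECT col_a, col_b FROM parquet('file.parquet');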

Sorry I can't be more helpful here!

@dforsber
Collaborator Author

That was very helpful, thank you. I guess I need to play with some PRAGMA settings to get to the point where SQLite does not OOM when just transferring a big Parquet file into an SQLite database.

The process is quite slow, probably due to single-threaded SQLite IO (it is also CPU heavy, so I guess it does decompression and IO on the same thread). Even with Snappy.

Also, I don't think .copy is available for Parquet files.
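A minimal sketch of the kind of bulk-load tuning meant here, assuming the goal is a one-off copy into a destination table (dest_table is a placeholder, and the PRAGMA values are illustrative, not tuned):

    -- Reduce journalling/fsync overhead during the one-off import.
    PRAGMA journal_mode = OFF;   -- or WAL, if crash safety still matters
    PRAGMA synchronous = OFF;

    -- Do the copy inside a single transaction.
    BEGIN;
    INSERT INTO dest_table SELECT * FROM parquet('file.parquet');
    COMMIT;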

@dforsber
Collaborator Author

Do you think it would make sense to launch a separate thread for decompression, to get more CPU power out of multicore machines?

@cldellow
Owner

Separate threads would make sense - I think someone else mentioned that since each column is compressed separately, there's a lot of opportunity for parallelism. If we did add that, it'd probably make sense to make the parallelism configurable.

I wonder if SELECT * might actually load everything into memory. When we emit values, we emit them with SQLITE_TRANSIENT (see

    sqlite3_result_text(ctx, (const char*)rv->ptr, rv->len, SQLITE_TRANSIENT);

), which tells SQLite that it can't rely on the pointer being valid for the duration of the query, and so it must do a defensive copy. This is described at https://sqlite.org/c3ref/result_blob.html

It sounds like you're attaching the Parquet file and then trying to ingest it into a regular SQLite table? I can imagine that if you were doing INSERT INTO xyz SELECT * FROM abc, SQLite could do that with a fixed amount of memory, but maybe it doesn't. If that's the case, a workaround might be to run N queries whose WHERE clauses split the table into N disjoint sets.
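A sketch of that workaround, assuming the source table has some column (a hypothetical id here) whose range is known and can be split into disjoint slices:

    -- Each statement only scans and copies one slice, bounding the work
    -- (and, hopefully, the memory) per query. Boundaries are made up.
    INSERT INTO xyz SELECT * FROM abc WHERE id <  1000000;
    INSERT INTO xyz SELECT * FROM abc WHERE id >= 1000000 AND id < 2000000;
    INSERT INTO xyz SELECT * FROM abc WHERE id >= 2000000;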

@cldellow
Owner

BTW - this repo is largely inactive now. In fact, I see that you have a PR from a few months ago that I failed to notice :(

Would you like to be added as a collaborator so you can directly merge stuff?

@dforsber
Collaborator Author

> BTW - this repo is largely inactive now. In fact, I see that you have a PR from a few months ago that I failed to notice :(
>
> Would you like to be added as a collaborator so you can directly merge stuff?

That would be nice. I could refine the PR and merge it.

@cldellow
Owner

Great, invite sent!

@dforsber
Collaborator Author

The issue was with using a UNION ALL VIEW over two Parquet files, which caused the whole VIEW to be materialised when accessed. I resolved the issue by not using the VIEW in the first place.
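For reference, a sketch of the two shapes (the file names, the view name, and the parquet() form are assumptions, not taken from the actual setup):

    -- Problematic shape: a UNION ALL VIEW over two Parquet files; querying
    -- the VIEW caused the whole thing to be materialised.
    CREATE VIEW combined AS
      SELECT * FROM parquet('part1.parquet')
      UNION ALL
      SELECT * FROM parquet('part2.parquet');
    SELECT count(*) FROM combined;

    -- Workaround: skip the VIEW and query each file directly.
    SELECT count(*) FROM parquet('part1.parquet');
    SELECT count(*) FROM parquet('part2.parquet');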
