Advice request on how to yield pandas DataFrames in chunks. #1782

yohplala · 2021-12-18T20:00:13Z

Hi,

I am implementing a function that yields pandas DataFrames from vaex with variable chunk sizes.

Because the chunk size varies, I cannot rely directly on vdf.to_pandas_df(chunk_size=50_000_000).
Instead, I am using yield vdf[start:end].to_pandas_df(), with start and end updated in a for loop.
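
For readers following along, here is a minimal sketch of the slicing approach described above. The names chunk_bounds, iter_pandas_chunks and sizes are hypothetical and not from my actual code; vdf is assumed to be a vaex DataFrame, and only len(), row slicing and to_pandas_df() are used on it.

```python
from typing import Iterable, Iterator, Tuple


def chunk_bounds(n_rows: int, sizes: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) row bounds for consecutive chunks of varying size."""
    start = 0
    for size in sizes:
        if size <= 0:
            raise ValueError("chunk sizes must be positive")
        if start >= n_rows:
            break
        # Clip the last chunk so it never runs past the end of the data.
        end = min(start + size, n_rows)
        yield start, end
        start = end


def iter_pandas_chunks(vdf, sizes: Iterable[int]):
    """Yield one pandas DataFrame per chunk by slicing the vaex DataFrame.

    Assumption: `vdf` supports len(), vdf[start:end] and .to_pandas_df(),
    as a vaex DataFrame does.
    """
    for start, end in chunk_bounds(len(vdf), sizes):
        yield vdf[start:end].to_pandas_df()
```

If sizes runs out before the last row is reached, the remaining rows are not yielded, so the caller has to supply enough sizes to cover the whole DataFrame.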

Do you see any bottleneck or performance issue with this approach?
(I am asking because vaex sometimes holds surprises :))
Thanks in advance for your feedback!
Best regards,

maartenbreddels (Maintainer) · 2021-12-18T20:48:43Z

What is the data source, and is the data frame filtered? (from mobile phone)


yohplala Dec 18, 2021
Author

What is the data source, and is the data frame filtered?

Short term, the data source is pyarrow files. I will process the data with vaex, aggregating (groupby) to reduce it, and then I would like to yield it as mentioned above.
Yes, the data can be filtered, to select only 'new data'. Is it recommended to do an extract after the filtering, as mentioned in the FAQ?
"To be able to manually add a new column to the filtered df2 DataFrame, one needs to use the df2.extract() method first."

Mid term, the data source will be parquet files, again with some data processing in vaex, then yielding in pandas format.
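
To make the extract question concrete, this is roughly what I have in mind: call extract() once after filtering, then slice the result. It is only a sketch; iter_extracted_chunks and bounds are hypothetical names, vdf is assumed to be a (possibly filtered) vaex DataFrame, and I do not know yet whether the extract step is actually needed or helps performance here.

```python
from typing import Iterable, Iterator, Tuple


def iter_extracted_chunks(vdf, bounds: Iterable[Tuple[int, int]]) -> Iterator:
    """Extract the filtered rows once, then yield a pandas chunk per (start, end).

    Assumption: `vdf` is a (possibly filtered) vaex DataFrame.
    """
    # extract() returns a DataFrame holding only the rows that pass the filter,
    # so the slices below index into the filtered rows directly.
    extracted = vdf.extract()
    for start, end in bounds:
        yield extracted[start:end].to_pandas_df()
```

The extract is done once, outside the loop, so each chunk only pays for its own slice and conversion.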
