Possibility to subsample when loading the binary ? #13

Open
ReHoss opened this issue Mar 10, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

ReHoss commented Mar 10, 2023

Hello,

Is it possible to subsample the event file while loading? What do you recommend when there is not enough RAM to read the whole event file, for example in a Jupyter notebook?

I know that TensorBoard uses a subsampling strategy.

Thanks for your consideration.

j3soon (Owner) commented Mar 14, 2023

Hi,

I would like to know more details about your use case. What event types are you loading, and how large is the event file? Does your use case require iterating through all events, or does it only need to process certain filtered events?

tbparse is designed to load all events directly into the system memory, and currently does not support subsampling. However, it may be possible to add a feature for pre-filtering the events in the future, given valid use cases.

If you simply want to iterate through the events, maybe you can try out the raw method by TensorBoard/TensorFlow as documented here.

ReHoss (Author) commented Mar 14, 2023

From: https://github.com/tensorflow/tensorboard/blob/master/README.md

Is my data being downsampled? Am I really seeing all the data?

TensorBoard uses reservoir sampling to downsample your data so that it can be loaded into RAM. You can modify the number of elements it will keep per tag by using the --samples_per_plugin command line argument (ex: --samples_per_plugin=scalars=500,images=20). See this Stack Overflow question for some more information.

And according to the help command:

--samples_per_plugin: An optional comma separated list of plugin_name=num_samples pairs to explicitly specify how many samples to keep per tag for that plugin. For unspecified plugins, TensorBoard randomly downsamples logged summaries to reasonable values to prevent out-of-memory errors for long running jobs. This flag allows fine control over that downsampling. Note that 0 means keep all samples of that type. For instance, "scalars=500,images=0" keeps 500 scalars and all images. Most users should not need to set this flag. (default: '')
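
For readers unfamiliar with the technique named in the README excerpt, here is a minimal sketch of classic reservoir sampling (Algorithm R) with an explicit seed. It is not TensorBoard's own implementation, and the function name `reservoir_sample` and the `(step, value)` event layout are assumptions made for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream.

    The stream is consumed once, and only k items are held in memory.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    rng = random.Random(seed)  # private RNG, so the seed is explicit and local
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream of (step, value) scalar events, generated lazily.
events = ((step, step * 0.5) for step in range(100_000))
sample = sorted(reservoir_sample(events, k=500, seed=0))
```

Sorting at the end restores step order, since the reservoir itself is not ordered once replacements start.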

For instance, the asker in the Stack Overflow thread trains over 20M steps.

I train over 1e6 steps but run 100 experiments. If I log the training score at every step, I end up with an extremely large DataFrame.

It would be nice to have an option to downsample either randomly (with a way to set the seed) or evenly. Ideally, the same time steps are kept for all n training curves.
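
To make the request concrete, here is a rough sketch of what "same time steps for all curves" could look like, written in plain Python rather than against tbparse. The names `choose_steps` and `downsample_curves`, and the representation of a curve as a list of `(step, value)` pairs, are hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Curve = List[Tuple[int, float]]  # (step, value) pairs


def choose_steps(steps: Sequence[int], n_keep: int, seed: Optional[int] = None) -> List[int]:
    """Pick n_keep steps: evenly spaced if seed is None, seeded random otherwise."""
    if n_keep < 1:
        raise ValueError("n_keep must be at least 1")
    ordered = sorted(steps)
    if n_keep >= len(ordered):
        return ordered
    if seed is not None:
        return sorted(random.Random(seed).sample(ordered, n_keep))
    if n_keep == 1:
        return [ordered[0]]
    last = len(ordered) - 1
    # Evenly spaced positions that always include the first and last step.
    return [ordered[round(i * last / (n_keep - 1))] for i in range(n_keep)]


def downsample_curves(curves: Dict[str, Curve], n_keep: int,
                      seed: Optional[int] = None) -> Dict[str, Curve]:
    """Keep the same subset of steps in every curve."""
    if not curves:
        return {}
    # Only steps present in every curve are candidates.
    common = set.intersection(*(set(s for s, _ in c) for c in curves.values()))
    kept = set(choose_steps(list(common), n_keep, seed))
    return {name: [(s, v) for s, v in c if s in kept] for name, c in curves.items()}


# Three hypothetical runs logged at every step from 0 to 999.
curves = {f"run_{r}": [(s, float(s + r)) for s in range(1000)] for r in range(3)}
small_even = downsample_curves(curves, n_keep=11)
small_random = downsample_curves(curves, n_keep=11, seed=123)
```

Choosing the steps once and filtering every curve with them is what keeps the curves aligned; sampling each curve independently would not guarantee that.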

Thank you for your consideration,
Best,

j3soon (Owner) commented Mar 19, 2023

Thanks for providing the detailed information. I think reservoir sampling is a useful feature and won't be too hard to implement. However, I'm not sure if we can manually set the RNG seed...

This feature may be implemented by modifying the code here. I'll see if I can add this feature in my free time.

Meanwhile, I suggest loading each experiment individually and downsampling it yourself. You can get deterministic results by stacking the downsampled experiments.
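
A rough sketch of that workaround follows: load one experiment at a time, downsample it with a fixed seed, and stack the results. The helper names (`stack_downsampled`, `load_run_with_tbparse`, `fake_load`) are hypothetical. The tbparse loader assumes that `SummaryReader(log_dir).scalars` returns a long-format DataFrame with `step`, `tag`, and `value` columns; please check this against the tbparse documentation before relying on it.

```python
import random
from typing import Callable, Iterable, List, Sequence, Tuple

Row = Tuple[int, float]             # (step, value)
TaggedRow = Tuple[str, int, float]  # (run name, step, value)


def load_run_with_tbparse(log_dir: str, tag: str) -> List[Row]:
    """Assumed loader for one run; not exercised in the demo below."""
    from tbparse import SummaryReader  # lazy import, requires tbparse

    df = SummaryReader(log_dir).scalars  # assumed columns: step, tag, value
    df = df[df["tag"] == tag]
    return list(zip(df["step"].tolist(), df["value"].tolist()))


def stack_downsampled(run_names: Iterable[str],
                      load_run: Callable[[str], Sequence[Row]],
                      n_keep: int,
                      seed: int = 0) -> List[TaggedRow]:
    """Load runs one by one, keep n_keep rows of each, and stack them."""
    stacked: List[TaggedRow] = []
    for name in run_names:
        rows = load_run(name)  # only one full run is held in memory at a time
        if len(rows) > n_keep:
            # Re-seeding per run yields the same row indices for equal-length runs.
            rng = random.Random(seed)
            keep = sorted(rng.sample(range(len(rows)), n_keep))
            rows = [rows[i] for i in keep]
        stacked.extend((name, step, value) for step, value in rows)
    return stacked


def fake_load(name: str) -> List[Row]:
    """Stand-in loader so the sketch runs without any event files."""
    return [(s, float(s)) for s in range(10_000)]


table = stack_downsampled(["exp_a", "exp_b"], fake_load, n_keep=100, seed=42)
# With real logs, something like this might work (directory and tag names are made up):
# table = stack_downsampled(dirs, lambda d: load_run_with_tbparse(d, "score"), 1000)
```

Because the seed is fixed, running the loop again gives the same stacked table. Peak memory is roughly one full run plus the downsampled rows collected so far.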

j3soon added the enhancement label on Mar 19, 2023