Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Will cache eviction logic take previously-existing shards into account? #844

Open
jamin-chen opened this issue Dec 5, 2024 · 3 comments

Comments

@jamin-chen
Copy link

If I run training twice with the same local cache directory, will the second run be able to evict shards that were downloaded by the first run?

@snarayan21
Copy link
Collaborator

@jamin-chen Great question -- yes this should be the case since StreamingDataset tracks the cache usage even for locally present shards. Are you seeing behavior contrary to this?

@jamin-chen
Copy link
Author

Maybe, but there are a few confounding factors on our end so I need to test further. In the meantime could you point me to the code where StreamingDataset lists all pre-existing shards in the local directory?

@snarayan21
Copy link
Collaborator

@jamin-chen Sorry for the delay in responding to this. So in StreamingDataset's prepare_shard function here, all shard states should start as REMOTE. Then, the particular Stream's prepare_shard function is called here, which fetches the shard and also returns the delta of the increase in cache limit space (see here). Importantly, this code path is followed even for locally present shard files, so the delta in cache usage that is returned by prepare_shard is the same even if the file is already locally present. So the cache usage of locally present shards should be correctly tracked.

Does this help with your custom use case / issues you're seeing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants