
Very long process of getting repr of huge objects in torch #461

Open · seniorsolt opened this issue Aug 2, 2024 · 7 comments

@seniorsolt commented Aug 2, 2024

viztracer --tracer_entries 10000000 --ignore_frozen --ignore_c_function --log_func_args --max_stack_depth 30 -- scripts\train_ui.py

[screenshot: 2024-08-02_18-05-19]

It's not infinite recursion like in #338, just a very long process. Maybe an additional flag limiting the repr process would solve both issues.
The max stack depth option helps if it prevents entering these objects, but if we are already inside, it will not help us get out.
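For reference, a minimal sketch of the behavior being reported. `ExpensiveStorage` is a hypothetical stand-in for a torch storage object whose `__repr__` walks every element; it is not torch code, just an illustration of why logging function arguments becomes slow for such objects:

```python
# Minimal reproduction sketch (hypothetical stand-in, not torch code).
# Run under: viztracer --log_func_args -- repro.py
import time


class ExpensiveStorage:
    """Mimics a storage whose __repr__ iterates over all of its elements."""

    def __init__(self, n):
        self.data = list(range(n))

    def __repr__(self):
        # Formatting millions of elements takes seconds.
        return "ExpensiveStorage([" + ", ".join(str(x) for x in self.data) + "])"


def train_step(storage):
    # With --log_func_args, the tracer serializes this argument, which triggers __repr__.
    return len(storage.data)


if __name__ == "__main__":
    storage = ExpensiveStorage(2_000_000)
    start = time.time()
    repr(storage)  # the same work the tracer would trigger
    print(f"repr took {time.time() - start:.2f}s")
    train_step(storage)
```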

@gaogaotiantian (Owner)

VizTracer is not able to limit repr - repr can do whatever it wants. If repr itself takes too long, that's not something VizTracer can deal with. In the worst case, the whole repr process is written in C and VizTracer will not even be triggered. However, the tracer itself should be paused while the repr is being calculated.

@seniorsolt (Author) commented Aug 3, 2024

I didn't understand the part about repr in C, where you say that viztracer will not be triggered (invoked?). It's viztracer that calls repr, isn't it?

I think it is possible to implement this using execution isolation (running repr in a separate thread, process, or asynchronous task to limit execution with a timeout) - see the sketch below. However, it sounds like a lot of effort just for handling torch, which can simply be filtered out through exclude files.
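A rough sketch of that timeout idea, assuming a thread-based executor (all names here are illustrative, not VizTracer internals). Note the caveat in the final comment, which is part of why this approach is problematic:

```python
# Illustrative sketch only - not VizTracer code.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_executor = ThreadPoolExecutor(max_workers=1)


def bounded_repr(obj, timeout=0.1):
    """Return repr(obj), or a placeholder if it takes longer than `timeout` seconds."""
    future = _executor.submit(repr, obj)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        # Caveat: the worker thread cannot be cancelled, so the slow repr
        # keeps running in the background even after we give up on it.
        return f"<repr of {type(obj).__name__} skipped: timeout>"
```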

Another option I can think of: use sys.getsizeof() before calling repr and skip large objects (a rough sketch of that idea is below).
Anyway, the issue is not that important; maybe the best solution is to do nothing. Let this issue stay here just for information.

Upd: One more idea - print a warning about an unusually long repr while processing <file.py>, to show the user which file they might want to exclude.
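And a sketch of the sys.getsizeof() idea mentioned above (again purely illustrative; as noted in the reply below, getsizeof is shallow and does not reflect how expensive the repr actually is):

```python
# Illustrative sketch only - not VizTracer code.
import sys


def size_limited_repr(obj, max_bytes=1 << 20):
    """Skip repr for objects whose shallow size exceeds max_bytes."""
    # Note: sys.getsizeof is shallow - a container holding millions of items,
    # or an object referencing a big buffer, may still report a small size.
    if sys.getsizeof(obj) > max_bytes:
        return f"<{type(obj).__name__} too large, repr skipped>"
    return repr(obj)
```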

@gaogaotiantian (Owner)

Yes, it's viztracer that calls repr, but it's the user that asks viztracer to call it. The time of repr has nothing to do with the size of the file/object - it only depends on how the object implements it. Running it in a separate executor is not a solution: there will either be racing issues, or it will get stuck (if not concurrent).

Overall, if an object decides to make its repr slow, there's nothing viztracer can do.

@seniorsolt (Author)

The time of repr does depend on the size of the object if the repr implementation iterates over all of its contents, as in the case of torch storage. However, I agree that's a corner case, and we can't tell whether the time of repr depends on size without invoking repr.

About running repr in a separate executor: why would there be a racing issue or a hang? And why don't we see it now? Can't we find an implementation that doesn't introduce new issues?

@seniorsolt (Author) commented Aug 10, 2024

I think I found one more option - using the base class method instead of the overloaded method of the torch tensor:
[screenshot of the proposed change]
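In other words, something along these lines - a sketch of the idea, assuming the goal is to bypass the subclass's expensive __repr__ (torch is assumed to be installed, as in the original report):

```python
import torch

t = torch.randn(10_000_000)

# The overloaded repr formats the tensor/storage data, which can be slow:
# repr(t)

# The base class method only reports the type and address, so it is O(1):
print(object.__repr__(t))  # e.g. "<torch.Tensor object at 0x7f...>"
```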

@gaogaotiantian (Owner)

> About running repr in a separate executor: why would there be a racing issue or a hang? And why don't we see it now? Can't we find an implementation that doesn't introduce new issues?

Concurrency is not a silver bullet for everything that's slow. VizTracer needs the result of repr at that point; it's not helpful to calculate the result in a separate thread/process because VizTracer will be blocked there waiting for the result.

Is it possible to just move forward without the result? Maybe. But first of all, it's just not worth it: you'd need to change a lot of the current structure to make it work. Then, what if the object changes during the process? That's the racing issue I talked about.

So, overall, doing it in a separate executor is infeasible and not worth it.

Also, we will not do anything special just for torch - it's not a TorchTracer. We could potentially provide a way to solve all similar issues at once (like I mentioned, use objprint, or provide a way to customize all of your repr requests), so any solution specific to torch is not a solution.
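For illustration, a customizable repr hook could look roughly like this. This is purely hypothetical, not an existing VizTracer option; objprint is the real library mentioned above, and safe_repr is an invented name:

```python
# Hypothetical sketch of a user-customizable repr - not a VizTracer API.
from objprint import objstr  # pip install objprint


def safe_repr(obj):
    # Users could register cheap representations for types they know are expensive.
    type_name = f"{type(obj).__module__}.{type(obj).__qualname__}"
    if type_name.startswith("torch."):
        return object.__repr__(obj)  # O(1) fallback
    return objstr(obj, depth=1)      # shallow, attribute-based string
```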

@seniorsolt (Author)

Okay, that sounds reasonable. Thank you for your time!
