Unambiguous compression setup to resume properly #682
Conversation
Are you going to introduce a common NNCFNetwork class in this PR?
Introduce a new way of saving and loading the NNCF compression state:

```python
model_state_dict = compression_model.state_dict()
compression_state = ctrl.get_compression_state()
...
create_compressed_model(model, config, compression_state=compression_state)
load_state(model, model_state_dict, is_strict=True)
```

Instead of:

```python
model_state_dict = compression_model.state_dict()
ctrl_state = ctrl.get_state()
create_compressed_model(model, config, resuming_state_dict=model_state_dict)
ctrl.load_state(ctrl_state)
```

Previously, it was done in the standard PyTorch way. As we discussed, for unambiguously restoring a compressed model we need 2 more custom structures besides torch tensors - the builder and controller states. For instance, `QuantizerSetup`, which describes where to insert FQs, their dependencies and parameters:

nncf/nncf/torch/quantization/quantizer_setup.py, lines 97 to 103 in f2cc93d
Ideally, we would like to override `state_dict()` to return

```python
{
    "model_state": super().state_dict(),
    "ctrl_state": ctrl.get_state(),
    "builder_state": builder_state
}
```

However, DDP hangs again, because it heavily relies on the parameters of modules and expects only PyTorch tensors, so the checkpoint has to be retrieved outside of `state_dict()`:

```python
if isinstance(module, DataParallel):
    module = module.module
checkpoint = module.get_checkpoint()
```
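A minimal sketch of what such a retrieval method could look like; `get_checkpoint_dict` is a hypothetical helper, and the exact split between controller and builder state is illustrative:

```python
from torch.nn import DataParallel

def get_checkpoint_dict(compression_model, ctrl, builder_state):
    # Unwrap the DataParallel wrapper so the underlying module is used;
    # DP/DDP expect state_dict() to contain only torch.Tensors, which is
    # why the extra structures are kept outside of state_dict().
    module = compression_model
    if isinstance(module, DataParallel):
        module = module.module
    return {
        "model_state": module.state_dict(),
        "ctrl_state": ctrl.get_state(),
        "builder_state": builder_state,
    }
```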
During debugging I found out that json can't handle the scheduler state when its values are too big, for example the `current_step` counter. We have to cast all values in the state to JSON-serializable types.
This comment is obsolete, since a standard Python `int` is enough to represent very long numbers; we just need to cast `numpy.int64` to `int`.
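For illustration, a sketch of the cast (the state keys here are made up; the actual fix in the PR may differ):

```python
import json
import numpy as np

scheduler_state = {"current_step": np.int64(2 ** 40), "current_epoch": np.int64(3)}

# json.dumps() raises TypeError on numpy integer types; plain Python ints
# have arbitrary precision and serialize fine, however large the counter is.
serializable = {key: int(value) if isinstance(value, np.integer) else value
                for key, value in scheduler_state.items()}
print(json.dumps(serializable))
```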
Jenkins please retry a build
Could you please add a comment, or mark the spots in the PR with comments of your own, that illustrate:
- the changes to the user flow that are mandatory after this PR in order for nothing to break (it would be good if there were no such changes at all)
- the exact way in which the user is supposed to save an NNCF checkpoint in their flow
- the exact way in which the user is supposed to load the NNCF checkpoint
- the additional operations that an NNCF algo developer should perform in general to mark one or another part of their algorithm's data as save-able and load-able from such checkpoints
I think that illustrating these points would help with the review.
Definitely makes sense, will do it shortly.
Are you going to add support for this feature in the TF backend? According to the offline discussion, the issue with saving the compression state in the checkpoint is solved, and I don't see any concerns about supporting it in TF.
Let's do it iteratively if there are no concerns about the API; otherwise we would need extra effort to keep this branch merged with upcoming changes in develop.
I don't have this in my plans.
SOTA eval validation has FAILED 🤕 because of breaking changes in the builder-related classes. Need to re-run the TF eval and update the checkpoints for PT.
Jenkins please retry a build
SOTA eval validation is WIP
The most recent changes, just FYI:
tests/tensorflow/data/configs/mask_rcnn_coco2017_magnitude_sparsity_int8.json
...torch/classification/configs/sparsity_quantization/resnet50_imagenet_rb_sparsity50_int8.json
SOTA validation is green for PT and TF.
🎉 🎉 🎉
Post-build SOTA eval validation is green (PT - 415, TF - 279).
Introduced a new way of resuming compression for PyTorch and TensorFlow.
The idea is to restore the compression state instead of building it from scratch according to the config.
This is essential for AutoQ/HAWQ/NAS-like algorithms that are not deterministic and depend on the input data.
Therefore there is a chance that a checkpoint saved after AutoQ in one NNCF run will not be loadable/resumable in another NNCF run. Complete information on how the quantizers are set up in the model should be saved along with the checkpoints, so that a quantized checkpoint can be loaded for evaluation at all times.
This information is saved by the `CompressionState` class.
PyTorch NOW
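A sketch of the new flow, based on the API shown in this PR; the checkpoint key names and `path` are illustrative:

```python
import torch

# Save: the compression state goes into the checkpoint next to the weights.
torch.save({
    "model_state_dict": compressed_model.state_dict(),
    "compression_state": compression_ctrl.get_compression_state(),
}, path)

# Load/resume: the compressed model is rebuilt from the saved compression
# state rather than re-derived from the config, then the weights are loaded.
checkpoint = torch.load(path)
compression_ctrl, compressed_model = create_compressed_model(
    model, nncf_config, compression_state=checkpoint["compression_state"])
load_state(compressed_model, checkpoint["model_state_dict"], is_strict=True)
```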
PyTorch BEFORE
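The old flow, reconstructed from the snippet earlier in the thread (a sketch; key names are illustrative):

```python
import torch

# Save: only the weights and the controller state were stored.
torch.save({
    "model_state_dict": compressed_model.state_dict(),
    "ctrl_state": compression_ctrl.get_state(),
}, path)

# Load/resume: the model was rebuilt from the config alone, which is
# ambiguous for non-deterministic algorithms such as AutoQ/HAWQ.
checkpoint = torch.load(path)
compression_ctrl, compressed_model = create_compressed_model(
    model, nncf_config, resuming_state_dict=checkpoint["model_state_dict"])
compression_ctrl.load_state(checkpoint["ctrl_state"])
```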
TensorFlow NOW
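A sketch of the TensorFlow counterpart, assuming trackable wrappers along the lines of `TFCompressionState`/`TFCompressionStateLoader` (these helper names are an assumption and may differ from the final API):

```python
import tensorflow as tf

# Save: wrap the compression state as a trackable object so that
# tf.train.Checkpoint stores it next to the model variables.
checkpoint = tf.train.Checkpoint(
    model=compressed_model,
    compression_state=TFCompressionState(compression_ctrl))  # assumed helper
save_path = checkpoint.save(ckpt_prefix)

# Load/resume: restore the compression state first, rebuild the compressed
# model from it, then restore the variables into the rebuilt model.
loader = tf.train.Checkpoint(compression_state=TFCompressionStateLoader())
loader.restore(save_path).expect_partial()
compression_ctrl, compressed_model = create_compressed_model(
    model, nncf_config, compression_state=loader.compression_state.state)
tf.train.Checkpoint(model=compressed_model).restore(save_path).expect_partial()
```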
TensorFlow BEFORE
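And one plausible form of the old TensorFlow flow, where only the variables were checkpointed and the compression setup was always re-derived from the config (a sketch):

```python
import tensorflow as tf

# Save: only the model variables.
save_path = tf.train.Checkpoint(model=compressed_model).save(ckpt_prefix)

# Load/resume: rebuild the compressed model from the config alone, then
# restore the variables; this is ambiguous for data-dependent algorithms.
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)
tf.train.Checkpoint(model=compressed_model).restore(save_path)
```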