Unambiguous compression setup to resume properly #682

Merged 46 commits into openvinotoolkit:develop on Jul 6, 2021

Contributor

@ljaljushkin ljaljushkin commented Apr 27, 2021

Introduced a new way of resuming compression for PyTorch and TensorFlow.
The idea is to restore compression state instead of building it from scratch according to the config.

This is essential for AutoQ/HAWQ/NAS-like algorithms, which are not deterministic and depend on the input data.
As a result, a checkpoint saved after AutoQ in one NNCF run may not be loadable or resumable in another NNCF run.
Complete information about how the quantizers are set up in the model should be saved along with the checkpoints,
so that a quantized checkpoint can always be loaded for evaluation.

This information is stored by the CompressionState class.
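
To illustrate why restoring a saved state matters, here is a minimal sketch in plain Python. `ToyBuilder` and its `bitwidths` field are hypothetical stand-ins, not NNCF classes; the random choice only mimics a data-dependent initialization such as AutoQ.

```python
import json
import random


class ToyBuilder:
    """Hypothetical stand-in for a compression algorithm builder (not an NNCF class)."""

    def __init__(self, layer_names, state=None):
        if state is not None:
            # Resume: reuse the setup that was saved with the checkpoint.
            self.bitwidths = dict(state["bitwidths"])
        else:
            # Fresh build: the setup is non-deterministic, so a second run
            # may choose different bitwidths for the same config.
            self.bitwidths = {name: random.choice([4, 8]) for name in layer_names}

    def get_state(self):
        # JSON-compatible description of the chosen setup.
        return {"bitwidths": self.bitwidths}


layers = ["conv1", "conv2", "fc"]
first_run = ToyBuilder(layers)

# Save the state next to the weights, then restore it in another run.
saved = json.dumps(first_run.get_state())
resumed = ToyBuilder(layers, state=json.loads(saved))
```

Building `ToyBuilder(layers)` again without the saved state may pick different bitwidths, in which case weights saved by the first run would no longer match the model structure.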

PyTorch NOW

model_state_dict = compression_model.state_dict()
compression_state = ctrl.get_compression_state()
...
create_compressed_model(model, config, compression_state=compression_state)
load_state(model, model_state_dict, is_strict=True)

PyTorch BEFORE

model_state_dict = compression_model.state_dict()
ctrl_state = ctrl.get_state()

create_compressed_model(model, config, resuming_state_dict=model_state_dict)
ctrl.load_state(ctrl_state)

TensorFlow NOW

checkpoint = tf.train.Checkpoint(compression_state=TFCompressionStateLoader())
load_checkpoint(checkpoint, ckpt_path)
compression_state = checkpoint.compression_state.state

compression_ctrl, compress_model = create_compressed_model(model, nncf_config, compression_state)
checkpoint = tf.train.Checkpoint(model=compress_model, compression_state=TFCompressionState(compression_ctrl))
load_checkpoint(checkpoint=checkpoint, ckpt_path=config.ckpt_path)

TensorFlow BEFORE

compression_ctrl, compress_model = create_compressed_model(model, nncf_config, should_init=not resume_training)
checkpoint = tf.train.Checkpoint(model=compress_model, compression_ctrl=compression_ctrl)
load_checkpoint(checkpoint=checkpoint, ckpt_path=config.ckpt_path)

@ljaljushkin ljaljushkin marked this pull request as draft April 27, 2021 08:31
Contributor

@alexsu52 alexsu52 left a comment

Are you going to introduce a common NNCFNetwork class in this PR?

@ljaljushkin
Contributor Author

ljaljushkin commented Apr 27, 2021

  • backward compatibility tests
  • doc strings
  • more tests for corner cases (not matching configs on resume)
  • fix CI (tests and pylint)
  • fix patches of 3rd party integration

Introduce a new way of saving and loading NNCF Compression State:

model_state_dict = compression_model.state_dict()
compression_state = ctrl.get_compression_state()
...
create_compressed_model(model, config, compression_state=compression_state)
load_state(model, model_state_dict, is_strict=True)

Instead of:

model_state_dict = compression_model.state_dict()
ctrl_state = ctrl.get_state()

create_compressed_model(model, config, resuming_state_dict=model_state_dict)
ctrl.load_state(ctrl_state)

Previously, this was done in the standard PyTorch way via a state_dict() call, which is defined for torch.nn.Module and its wrappers: NNCFNetwork and Distributed/DataParallel (DDP, DP).
It returns a dictionary with string keys and PyTorch tensors as values.

As we discussed, to restore a compressed model unambiguously we need two more custom structures besides the torch tensors: the builder and controller states. One example is QuantizerSetup, which describes where to insert FQs, their dependencies, and their parameters.

class QuantizerSetupBase:
    def __init__(self):
        self.quantization_points = {}  # type: Dict[QuantizationPointId, QuantizationPointBase]
        self.unified_scale_groups = {}  # type: Dict[int, Set[QuantizationPointId]]
        self.shared_input_operation_set_groups = {}  # type: Dict[int, Set[QuantizationPointId]]
        self._next_unified_scale_gid = 0
        self._next_shared_inputs_gid = 0
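
A structure like this holds integer keys and sets, neither of which survives a JSON round trip unchanged. The sketch below shows one way such a setup could be turned into a JSON-compatible state and back. `ToySetup`, `get_state` and `from_state` are simplified, hypothetical names for illustration and are not the actual NNCF implementation.

```python
import json
from typing import Dict, Set


class ToySetup:
    """Simplified, hypothetical analogue of a quantizer setup."""

    def __init__(self):
        self.quantization_points: Dict[int, str] = {}
        self.unified_scale_groups: Dict[int, Set[int]] = {}

    def get_state(self) -> dict:
        # JSON objects only allow string keys and have no set type,
        # so keys become strings and sets become sorted lists.
        return {
            "quantization_points": {
                str(k): v for k, v in self.quantization_points.items()
            },
            "unified_scale_groups": {
                str(g): sorted(ids) for g, ids in self.unified_scale_groups.items()
            },
        }

    @classmethod
    def from_state(cls, state: dict) -> "ToySetup":
        # Undo the conversions made in get_state().
        setup = cls()
        setup.quantization_points = {
            int(k): v for k, v in state["quantization_points"].items()
        }
        setup.unified_scale_groups = {
            int(g): set(ids) for g, ids in state["unified_scale_groups"].items()
        }
        return setup


setup = ToySetup()
setup.quantization_points = {0: "conv1/weight", 1: "conv2/weight"}
setup.unified_scale_groups = {0: {0, 1}}

restored = ToySetup.from_state(json.loads(json.dumps(setup.get_state())))
```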

Ideally, we would like to override state_dict to include these two structures.
But all the approaches I am aware of lead to hangs in DDP:

  1. We can encode the builder/ctrl states into a ByteTensor (object -> json-compatible dict -> json str -> bytes -> ByteTensor).
    But we can't register a buffer for this tensor in NNCFNetwork's init, which DDP requires. We can't do that because the builders are not known at that moment; they are applied later by design. And as soon as we register a buffer outside of init, DDP hangs on broadcasting.
    Moreover, we can't guarantee that these tensors have identical sizes on each GPU (in case of sophisticated initialization), which DDP also requires.
  2. We can override state_dict() to return a dict like this:
{
  "model_state": super().state_dict(),
  "ctrl_state": ctrl.get_state(),
  "builder_state": builder_state
}

    However, DDP hangs again, because it relies heavily on module parameters and expects only PyTorch tensors.
    Hence, state_dict can't be overridden to include the builder and controller states.
  3. A new method of NNCFNetwork (e.g. get_checkpoint) is also unacceptable, because DDP and DP don't have it, and the user would need to extract the NNCFNetwork each time:

if isinstance(module, DataParallel):
  module = module.module
checkpoint = module.get_checkpoint()

  4. A new method of the controller is the only remaining approach:

nncf_checkpoint = ctrl.get_nncf_checkpoint()
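
The encoding chain from approach 1 can be sketched in plain Python. This is only an illustration under assumptions: `encode_state` and `decode_state` are hypothetical helpers, the example states are made up, and plain `bytes` stand in for the ByteTensor so that the sketch runs without PyTorch.

```python
import json


def encode_state(state: dict) -> bytes:
    """object -> json-compatible dict -> json str -> bytes."""
    return json.dumps(state, sort_keys=True).encode("utf-8")


def decode_state(payload: bytes) -> dict:
    """bytes -> json str -> dict."""
    return json.loads(payload.decode("utf-8"))


# Made-up states, as two processes might produce them after a
# data-dependent initialization.
state_gpu0 = {"bitwidths": {"conv1": 8, "conv2": 4}}
state_gpu1 = {"bitwidths": {"conv1": 8, "conv2": 4, "fc": 8}}

payload0 = encode_state(state_gpu0)
payload1 = encode_state(state_gpu1)

# The round trip is lossless, but the payload length follows the content.
print(len(payload0), len(payload1))
```

Because the payload length depends on the state, buffers built from such payloads can differ in size between processes, which is the size mismatch described in approach 1.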

@daniil-lyakhov
Collaborator

During debugging I found out that json can't handle the scheduler state when it is too big. For example, if current_step == 251199 has type <class 'numpy.int64'>, I get TypeError: Object of type 'int64' is not JSON serializable

We have to either cast all values in the get_state methods from NumPy to native Python types, or extend the json functionality to handle numpy int64.

@ljaljushkin
Contributor Author

During debugging I found out that json can't handle the scheduler state when it is too big. For example, if current_step == 251199 has type <class 'numpy.int64'>, I get TypeError: Object of type 'int64' is not JSON serializable

We have to either cast all values in the get_state methods from NumPy to native Python types, or extend the json functionality to handle numpy int64.

This comment is obsolete: a standard Python int is enough to represent very large numbers, so we just need to cast numpy.int64 values.
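
Both options mentioned in this exchange can be sketched with the standard `json` module. `to_native` and `FakeInt64` are hypothetical names: `FakeInt64` only mimics an integer-like scalar such as numpy.int64 (which also implements `__index__`), so the sketch runs without NumPy installed.

```python
import json
import operator


class FakeInt64:
    """Stand-in for an integer-like scalar such as numpy.int64 (assumption)."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        # Integer-like objects expose their value through __index__.
        return self._value


def to_native(obj):
    """`default` hook for json.dumps: convert integer-like objects to int."""
    try:
        return operator.index(obj)
    except TypeError:
        raise TypeError(
            f"Object of type {type(obj).__name__} is not JSON serializable"
        ) from None


state = {"current_step": FakeInt64(251199)}

# Option 1: cast explicitly when building the state.
casted = {"current_step": int(state["current_step"])}
casted_json = json.dumps(casted)

# Option 2: let json call the hook for types it does not know.
hooked_json = json.dumps(state, default=to_native)
```

Casting inside `get_state` keeps the saved state free of non-native types, while the hook centralizes the conversion at serialization time.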

@vshampor
Contributor

Jenkins please retry a build

@ljaljushkin ljaljushkin marked this pull request as ready for review May 26, 2021 08:42
@ljaljushkin ljaljushkin requested review from alexsu52 and vshampor May 26, 2021 09:08
Contributor

@vshampor vshampor left a comment

Could you please add a comment, or mark the spots in the PR by comments of your own, that illustrate:

  1. the changes to the user flow that are mandatory after this PR in order for nothing to break (it would be good if there were no such changes at all)
  2. the exact way in which the user is supposed to save an NNCF checkpoint in their flow
  3. the exact way in which the user is supposed to load the NNCF checkpoint
  4. the additional operations that the NNCF algo developer should do in general in order to mark some or the other part of their algorithm data to become save-able and load-able from such checkpoints

I think that illustrating these points would help with the review.

@ljaljushkin
Contributor Author

Could you please add a comment, or mark the spots in the PR by comments of your own, that illustrate:

  1. the changes to the user flow that are mandatory after this PR in order for nothing to break (it would be good if there were no such changes at all)
  2. the exact way in which the user is supposed to save an NNCF checkpoint in their flow
  3. the exact way in which the user is supposed to load the NNCF checkpoint
  4. the additional operations that the NNCF algo developer should do in general in order to mark some or the other part of their algorithm data to become save-able and load-able from such checkpoints

I think that illustrating these points would help with the review.

Definitely makes sense, will do it shortly.

Contributor

@alexsu52 alexsu52 left a comment

Are you going to add support for this feature in the TF backend? According to the offline discussion, the issue with saving compression state in the checkpoint is solved and I don't see any concerns to support it in the TF.

@ljaljushkin
Contributor Author

Are you going to add support for this feature in the TF backend? According to the offline discussion, the issue with saving compression state in the checkpoint is solved and I don't see any concerns to support it in the TF.

Let's do it iteratively if there's no concern about API, otherwise we would need extra effort to keep this branch merged with upcoming changes in develop.
BTW, wasn't it planned to involve @daniil-lyakhov to the TF part?

@alexsu52
Contributor

Are you going to add support for this feature in the TF backend? According to the offline discussion, the issue with saving compression state in the checkpoint is solved and I don't see any concerns to support it in the TF.

Let's do it iteratively if there's no concern about API, otherwise we would need extra effort to keep this branch merged with upcoming changes in develop.
BTW, wasn't it planned to involve @daniil-lyakhov to the TF part?

I don't have this in plans.

@ljaljushkin
Contributor Author

ljaljushkin commented Jul 5, 2021

SOTA eval validation has FAILED🤕 because of breaking changes in the builder-related classes. Need to update and re-run the TF eval and correct the checkpoints for PT.

  • sota eval for TF [build 275]
  • sota eval for PT [build 411]

@daniil-lyakhov
Collaborator

Jenkins please retry a build

@ljaljushkin
Contributor Author

Jenkins please retry a build

@ljaljushkin
Contributor Author

ljaljushkin commented Jul 6, 2021

SOTA eval validation is WIP

  • sota eval for PT [build 413]
  • sota eval for PT [locally]
  • sota eval for TF [build 278]
  • sota eval for TF [locally]

Contributor Author

@ljaljushkin ljaljushkin left a comment

the most recent changes, just FYI

@ljaljushkin
Contributor Author

SOTA eval validation is WIP

  • sota eval for PT [build 413]
  • sota eval for PT [locally]
  • sota eval for TF [build 278]
  • sota eval for TF [locally]

SOTA validation is green for PT and TF

@vshampor vshampor merged commit 9ca51e2 into openvinotoolkit:develop Jul 6, 2021
@ljaljushkin
Contributor Author

🎉 🎉 🎉
@alexsu52 @andrey-churkin @daniil-lyakhov @vshampor Thank you for the thorough and responsible review! 👍

@ljaljushkin
Contributor Author

ljaljushkin commented Jul 7, 2021

Post build sota eval validation is green (PT - 415, TF - 279).
The only exceptions are errors with the pruning algorithm in TF, which were already happening before the merge:

ERROR:nncf:Invalid NNCF config supplied! jsonschema.exceptions.ValidationError: For algorithm: 'filter_pruning

@evgeniya-egupova @alexsu52

kshpv pushed a commit to kshpv/nncf that referenced this pull request Oct 11, 2022
Introduced a new way of resuming compression for PyTorch and TensorFlow.
The idea is to restore compression state instead of building it from scratch according to the config.
Labels
NNCF Common Pull request that updates NNCF Common NNCF PT Pull requests that updates NNCF PyTorch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Save unambiguous quantizer setup data into NNCF checkpoints
7 participants