Replace models with composite calculator pipeline #34

zubatyuk wants to merge 4 commits into NVIDIA:main
Conversation
Signed-off-by: Roman Zubatyuk <rzubatiuk@nvidia.com>
Reviews (2): Last reviewed commit: "docs: add design codument"
```python
all_resolved: set[str] = set()
for request in raw_requests:
    all_resolved |= request
expand_keys = all_resolved & self.pipeline_contract.additive_result_keys

if not expand_keys:
    return raw_requests

expanded_requests: list[frozenset[str]] = []
for request, step in zip(raw_requests, steps, strict=True):
    step_expand = expand_keys & step.profile.result_keys
    expanded_requests.append(request | step_expand)
```
**Additive-key expansion causes force double-counting in hybrid pipelines**
The expansion logic on lines 293–301 asks every step that can produce an additive key (e.g. "forces") to do so, regardless of whether an EnergyDerivativesStep already accounts for that step's energy contribution.
Concrete scenario — pipeline `[NL_mlip, MLIP, NL_dsf, DSF, AutoGrad]` with `active_outputs = {"forces", "energies"}`:

- Backward resolution assigns `{"forces"}` to `AutoGrad` and `{"energies"}` to `DSF`; `MLIP` is re-added via expansion only.
- `expand_keys = {"forces", "energies"} ∩ additive_result_keys` → `{"forces", "energies"}`.
- Because `"forces"` is in `DSFCoulombPotential.profile.result_keys`, DSF is expanded to produce direct forces (`compute_forces=True` inside `DSFCoulombPotential.compute`).
- MLIP's `prepare()` writes gradient-tracked `positions_scaled` into `runtime_state.input_overrides["positions"]`. DSF calls `require_input(batch, "positions", ctx)`, which resolves to `positions_scaled`, so DSF's energy is part of the autograd graph.
- `EnergyDerivativesStep` differentiates the accumulated total energy `(E_mlip + E_dsf)` through `positions_scaled`, producing forces `−∂(E_mlip + E_dsf)/∂r`.
- The additive merge then sums DSF's direct forces with the autograd forces: `F_final = F_dsf_direct + F_autograd(E_mlip + E_dsf) ≈ F_dsf + F_mlip + F_dsf = 2 × F_dsf + F_mlip`.

The same double-counting occurs for stresses.
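The double-counting arithmetic can be reproduced in a few lines. This is a hypothetical 1-D sketch with stand-in quadratic energies; the variable names are illustrative and are not the PR's actual step classes:

```python
import torch

# Two quadratic "energies" stand in for E_mlip and E_dsf at x = 1.
x = torch.tensor([1.0], requires_grad=True)
e_mlip = (2.0 * x**2).sum()  # dE/dx = 4x
e_dsf = (3.0 * x**2).sum()   # dE/dx = 6x

# DSF expanded to "direct" mode emits its own forces: -dE_dsf/dx
f_dsf_direct = -6.0 * x.detach()

# The derivative step differentiates the accumulated total energy.
(grad,) = torch.autograd.grad(e_mlip + e_dsf, x)
f_autograd = -grad  # -(4x + 6x) = -10x

# The additive merge sums both contributions, so DSF is counted twice.
f_merged = f_dsf_direct + f_autograd  # -16x instead of the correct -10x
f_true = -10.0 * x.detach()
print(f_merged.item(), f_true.item())  # -16.0 -10.0
```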
The root cause is that the expansion step does not know whether a given step's energy is already captured in the autograd graph of a later derivative step. Consider distinguishing "energy-only" mode (when an EnergyDerivativesStep follows) from "direct-output" mode so that a direct-force potential can participate in one way or the other, but not both simultaneously.
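One possible shape for that distinction, sketched with illustrative names (`StepPlan` and `assign_modes` are not part of the PR): a backward pass over the pipeline marks every direct-capable step as "energy-only" whenever a derivative step follows it, and "direct" otherwise.

```python
from dataclasses import dataclass

@dataclass
class StepPlan:
    """Hypothetical per-step planning record, not the PR's actual type."""
    name: str
    direct_capable: bool = False
    is_derivative_step: bool = False

def assign_modes(steps: list[StepPlan]) -> dict[str, str]:
    """Assign 'direct' when no derivative step follows, else 'energy-only'."""
    modes: dict[str, str] = {}
    derivative_follows = False
    for step in reversed(steps):
        if step.is_derivative_step:
            derivative_follows = True
        elif step.direct_capable:
            modes[step.name] = "energy-only" if derivative_follows else "direct"
    return modes

pipeline = [
    StepPlan("MLIP"),
    StepPlan("DSF", direct_capable=True),
    StepPlan("AutoGrad", is_derivative_step=True),
]
print(assign_modes(pipeline))  # {'DSF': 'energy-only'}
```

With this split, DSF would contribute energy only (letting the derivative step produce its forces), and would switch to direct forces only in pipelines with no trailing derivative step.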
The reported double-counting does not occur for DSF or Ewald as currently implemented. Correctness of differentiation is intentionally the responsibility of the calculator-step author, not something the composite planner infers automatically.
```diff
 def model_is_conservative(self) -> bool:
-    """Returns whether or not the model uses conservative forces"""
-    return self.model_card.forces_via_autograd
+    """Return a best-effort conservative-force indicator."""
+    steps = getattr(self.model, "steps", None)
+    if steps is None:
+        return False
+    return any(
+        type(step).__name__ == "EnergyDerivativesStep" for step in steps.values()
+    )
```
**`model_is_conservative` uses fragile class-name string matching**

```python
type(step).__name__ == "EnergyDerivativesStep"
```

This returns `False` for any subclass of `EnergyDerivativesStep`, and also returns `False` for non-`CompositeCalculator` callables that happen to derive conservative forces through other means, since `getattr(self.model, "steps", None)` would be `None`.
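A minimal illustration of the failure mode, using stand-in classes rather than the real ones:

```python
# Stand-ins for the real step classes.
class EnergyDerivativesStep:
    pass

class CustomDerivativesStep(EnergyDerivativesStep):
    pass

step = CustomDerivativesStep()
# The name check misses the subclass; isinstance handles it.
print(type(step).__name__ == "EnergyDerivativesStep")  # False
print(isinstance(step, EnergyDerivativesStep))         # True
```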
A more robust approach is to use isinstance(). The only thing preventing it is a potential circular import (autograd.py → base.py), which can be handled with a TYPE_CHECKING-guarded or deferred import:
```python
@property
def model_is_conservative(self) -> bool:
    from nvalchemi.models.autograd import EnergyDerivativesStep  # deferred

    steps = getattr(self.model, "steps", None)
    if steps is None:
        return False
    return any(isinstance(step, EnergyDerivativesStep) for step in steps.values())
```

```diff
         in the MRO (for cooperative multiple inheritance).
         """
         super().__init__(**kwargs)
-        if not isinstance(model, BaseModelMixin):
+        if not callable(model):
             raise TypeError(
-                f"Expected a `BaseModelMixin` instance, got {type(model).__name__}."
-                " Please wrap your model with a `BaseModelMixin` subclass."
+                f"Expected a callable dynamics calculator, got {type(model).__name__}."
             )
         self.model = model
```
**Validation weakened from structural check to `callable()` only**

The old guard was `isinstance(model, BaseModelMixin)`, which verified the full structural contract (model card, `compute_embeddings`, etc.). The replacement is:

```python
if not callable(model):
    raise TypeError(...)
```

`DynamicsCalculator` is already `@runtime_checkable`, so it supports `isinstance` checks at runtime (verifying both the `model_card` attribute and the `__call__` signature). Using it here would keep validation meaningful without re-introducing the old coupling:
```python
if not isinstance(model, DynamicsCalculator):
    raise TypeError(
        f"Expected a DynamicsCalculator, got {type(model).__name__}. "
        "The model must be callable and expose a `model_card` attribute."
    )
```

```python
def __call__(
    self,
    batch: Batch,
    *,
    outputs: Iterable[str] | None = None,
) -> ModelOutputs:
    """Compute model outputs for one batch."""
    ...
```
**Protocol return type annotation is structurally inconsistent with `CompositeCalculator`**

`DynamicsCalculator.__call__` is declared to return `ModelOutputs` (an `OrderedDict[str, Tensor | None]`), but `CompositeCalculator.forward` returns `CalculatorResults` (a `MutableMapping[str, Any]`). While both types support `.get()` at runtime and the dynamics code works correctly, static type checkers will flag any `CompositeCalculator` instance as not conforming to `DynamicsCalculator`. Updating the protocol to use the common base type would make both nominal and structural conformance correct:
```diff
 def __call__(
     self,
     batch: Batch,
     *,
     outputs: Iterable[str] | None = None,
-) -> ModelOutputs:
+) -> Mapping[str, Any]:
     """Compute model outputs for one batch."""
     ...
```
```diff
 Fmax: TypeAlias = Float[torch.Tensor, "B 1"]  # noqa: F722
 ModelOutputs: TypeAlias = OrderedDict[
-    str, Energy | Forces | Hessian | Stress | Virials | Dipole | None
+    str, Energy | Forces | Stress | Virials | Dipole | NodeCharges | NodeSpins | None
```
**`Hessian` silently removed from `ModelOutputs`**

The previous definition included `Hessian`:

```python
# before
ModelOutputs: TypeAlias = OrderedDict[
    str, Energy | Forces | Hessian | Stress | Virials | Dipole | None
]
```

The new definition drops `Hessian` entirely:
```diff
 Fmax: TypeAlias = Float[torch.Tensor, "B 1"]  # noqa: F722
 ModelOutputs: TypeAlias = OrderedDict[
-    str, Energy | Forces | Stress | Virials | Dipole | NodeCharges | NodeSpins | None
+    str, Energy | Forces | Hessian | Stress | Virials | Dipole | NodeCharges | NodeSpins | None
 ]  # noqa: F722
```
Any downstream model or step that currently produces Hessian outputs (second-order derivatives) and declares its return value as ModelOutputs will silently violate the type contract after this change. If removing Hessian support from the public API is intentional, it should be called out explicitly in the PR description and changelog so callers know to migrate. If it was an accidental omission, adding it back to the union is the safer choice.
ALCHEMI Toolkit Pull Request
Description
This PR replaces the previous monolithic `nvalchemi.models` wrapper architecture with a composable calculator API. The main motivation is to make the physical pipeline explicit instead of hiding it inside wrappers. The central model flow is now naturally expressed as a sequence of steps, like `NL -> MLIP -> NL -> Coulomb -> AutoGrad`, with direct physical terms such as DFT-D3 and DSF fitting into the same ordered step model. This improves both composability and correctness. In the previous monolithic implementation, compositions like `(NL -> MLIP -> AutoGrad) -> Coulomb` produced incorrect force semantics for charge-coupled Coulomb terms. The new pipeline implementation makes the derivative path explicit and easier to reason about.

Type of Change
Related Issues
#22
Changes Made
- `nvalchemi.models` surface with `CompositeCalculator`, `EnergyDerivativesStep`, cards / profiles / metadata / results / registry, and concrete built-in calculator steps.
- `EnergyDerivativesStep`, keep DSF as the hybrid Coulomb term, and expose PME / Ewald `direct` vs `autograd` participation explicitly.
- `neighbor_requirement` and expose `neighbor_list_builder_config(...)` as the user-facing bridge from contract to builder config.
- `aimnet2` and `mace-mp-0b3-medium`.
- `BaseModelMixin`, and replace the demo wrapper path with `DemoPotential`.
- `nvalchemi.models` story in `docs/userguide/models.md`, `docs/modules/models.rst`, and `docs/models/index.md`, and replace the old wrapper-oriented `test/models/*` architecture surface with a smaller root-package test surface.

Testing
- `PYTHONDONTWRITEBYTECODE=1 .venv/bin/pytest -q test/models/test_composable.py test/models/test_mace.py test/models/test_neighbor_filter.py test/models/test_physical_models.py test/models/test_direct_potentials.py test/models/test_registry.py`
- `make lint`

Checklist
Additional Notes
The main change is the explicit pipeline model. Neighbor lists and cached tensors `k_vectors`/`k_squared` can already be reused in the new pipeline implementation when they are present on the incoming batch / inputs and the step is configured to reuse them. In the previous monolithic implementation, that state was stored and invalidated inside the wrapper. In the new pipeline implementation, roles are separated more cleanly: calculators consume and optionally reuse provided state, while hooks are the natural place to manage invalidation.

One important current limitation remains: stages exchange named values by key, but there is no enforced shape / dtype contract between stages yet.
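A sketch of what a minimal per-key contract check could look like; the key names, specs, and `check_contract` helper below are illustrative, not part of the PR:

```python
import torch

# Hypothetical per-key (dtype, expected ndim) contract between stages.
CONTRACT: dict[str, tuple[torch.dtype, int]] = {
    "energies": (torch.float64, 1),
    "forces": (torch.float32, 2),
}

def check_contract(results: dict[str, torch.Tensor]) -> None:
    """Raise TypeError if a declared key violates its dtype/ndim spec."""
    for key, tensor in results.items():
        spec = CONTRACT.get(key)
        if spec is None:
            continue  # keys without a declared spec pass through unchecked
        dtype, ndim = spec
        if tensor.dtype != dtype or tensor.ndim != ndim:
            raise TypeError(
                f"{key}: expected dtype={dtype}, ndim={ndim}; "
                f"got dtype={tensor.dtype}, ndim={tensor.ndim}"
            )

check_contract({"forces": torch.zeros(4, 3)})  # passes: float32, ndim=2
```

A check like this could run between stages (or only in a debug mode) so that key-name typos and silent dtype drift surface at the stage boundary rather than deep inside a later step.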
The implementation is also intentionally future-proof in a few important ways. The same composite model structure can support training-oriented objectives that include charges, forces, and stresses in the loss. The step/result model also leaves room for MLIP outputs beyond energies, such as embeddings and other learned properties, without redesigning the pipeline again.
Tip
This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.