
Replace models with composite calculator pipeline#34

Open
zubatyuk wants to merge 4 commits into NVIDIA:main from zubatyuk:models-composite-pipeline

Conversation


@zubatyuk zubatyuk commented Mar 31, 2026

ALCHEMI Toolkit Pull Request

Description

This PR replaces the previous monolithic nvalchemi.models wrapper architecture with a composable calculator API. The main motivation is to make the physical pipeline explicit instead of hiding it inside wrappers. The central model flow is now naturally expressed as a sequence of steps, like NL -> MLIP -> NL -> Coulomb -> AutoGrad, with direct physical terms such as DFT-D3 and DSF fitting into the same ordered step model. This improves both composability and correctness. In the previous monolithic implementation, compositions like (NL -> MLIP -> AutoGrad) -> Coulomb produced incorrect force semantics for charge-coupled Coulomb terms. The new pipeline implementation makes the derivative path explicit and easier to reason about.
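As an illustration of the ordered-step idea, a pipeline can be reduced to steps that each read accumulated results and contribute their own. This is toy code only: the names and dict-based plumbing below are invented for this sketch and are not the actual nvalchemi API.

```python
# Illustrative sketch of an ordered calculator pipeline (NL -> MLIP -> ...).
# Step names and the dict-based result passing are NOT the real nvalchemi API.
from typing import Callable

Step = Callable[[dict], dict]

def run_pipeline(steps: list[Step], inputs: dict) -> dict:
    """Run steps in order; each step reads prior results and adds its own."""
    results: dict = dict(inputs)
    for step in steps:
        results.update(step(results))
    return results

def neighbor_list(r: dict) -> dict:
    # Placeholder neighbor construction: pair each atom index with the next.
    n = r["num_atoms"]
    return {"pairs": [(i, i + 1) for i in range(n - 1)]}

def mlip(r: dict) -> dict:
    # Toy "MLIP" energy: one unit of energy per neighbor pair.
    return {"energy": float(len(r["pairs"]))}

out = run_pipeline([neighbor_list, mlip], {"num_atoms": 4})
print(out["energy"])  # 3.0
```

Because each step only sees the accumulated results dict, the ordering NL before MLIP is expressed directly in the list, which is the explicitness the PR is after.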

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Performance improvement
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or infrastructure change

Related Issues

#22

Changes Made

  • Models: replace the old wrapper-oriented nvalchemi.models surface with CompositeCalculator, EnergyDerivativesStep, cards / profiles / metadata / results / registry, and concrete built-in calculator steps.
  • Models: make force and stress derivation explicit through EnergyDerivativesStep, keep DSF as the hybrid Coulomb term, and expose PME / Ewald direct vs autograd participation explicitly.
  • Models: make external neighbor requirements explicit through neighbor_requirement and expose neighbor_list_builder_config(...) as the user-facing bridge from contract to builder config.
  • Models: add registry-backed artifact resolution for known names such as aimnet2 and mace-mp-0b3-medium.
  • Dynamics: update dynamics to consume calculator-style models directly instead of BaseModelMixin, and replace the demo wrapper path with DemoPotential.
  • Docs and tests: replace the public docs entrypoints so reviewers see the new nvalchemi.models story in docs/userguide/models.md, docs/modules/models.rst, and docs/models/index.md, and replace the old wrapper-oriented test/models/* architecture surface with a smaller root-package test surface.

Testing

  • Unit tests pass locally (PYTHONDONTWRITEBYTECODE=1 .venv/bin/pytest -q test/models/test_composable.py test/models/test_mace.py test/models/test_neighbor_filter.py test/models/test_physical_models.py test/models/test_direct_potentials.py test/models/test_registry.py)
  • Linting passes (make lint)
  • New tests added for new functionality; coverage expectations met

Checklist

  • I have read and understand the Contributing Guidelines
  • I have updated the CHANGELOG.md
  • I have performed a self-review of my code
  • I have added docstrings to new functions/classes
  • I have updated the documentation (if applicable)

Additional Notes

The main change is the explicit pipeline model. Neighbor lists and cached tensors k_vectors / k_squared can already be reused in the new pipeline implementation when they are present on the incoming batch / inputs and the step is configured to reuse them. In the previous monolithic implementation, that state was stored and invalidated inside the wrapper. In the new pipeline implementation, roles are separated more cleanly: calculators consume and optionally reuse provided state, while hooks are the natural place to manage invalidation.

One important current limitation remains: stages exchange named values by key, but there is no enforced shape / dtype contract between stages yet.
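For illustration, such a contract could eventually look like a per-key spec validated at stage boundaries. This is purely hypothetical: `KeySpec` and `check_contract` do not exist in this PR, and the nested-list "tensor" stands in for real tensor metadata.

```python
# Hypothetical sketch of a per-key shape/dtype contract between stages.
# Nothing like this exists in the PR yet; names are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class KeySpec:
    dtype: type
    ndim: int  # expected number of dimensions of the nested-list "tensor"

def check_contract(results: dict, contract: dict[str, KeySpec]) -> None:
    for key, spec in contract.items():
        value = results[key]
        # Walk nested lists to count the dimensions of the toy "tensor".
        ndim, inner = 0, value
        while isinstance(inner, list):
            ndim, inner = ndim + 1, inner[0]
        if ndim != spec.ndim or not isinstance(inner, spec.dtype):
            raise TypeError(f"{key}: expected ndim={spec.ndim} of {spec.dtype}")

contract = {"forces": KeySpec(dtype=float, ndim=2)}
check_contract({"forces": [[0.0, 0.0, 0.0]]}, contract)  # passes silently
```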

The implementation is also intentionally future-proof in a few important ways. The same composite model structure can support training-oriented objectives that include charges, forces, and stresses in the loss. The step/result model also leaves room for MLIP outputs beyond energies, such as embeddings and other learned properties, without redesigning the pipeline again.

Tip

This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor are they an indication that the PR will be accepted or rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.

Signed-off-by: Roman Zubatyuk <rzubatiuk@nvidia.com>

copy-pr-bot bot commented Mar 31, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


greptile-apps bot commented Mar 31, 2026

Greptile Summary

This PR replaces the monolithic nvalchemi.models wrapper architecture with an explicit composable calculator pipeline. The central idea is that the physical computation is now expressed as an ordered sequence of _CalculationStep instances — e.g. [NeighborListBuilder, MACEPotential, EnergyDerivativesStep] — collected inside a CompositeCalculator, which handles backward dependency analysis, additive result merging, and derivative-target setup in a single forward() pass. The dynamics layer is updated in parallel to consume the new DynamicsCalculator protocol instead of BaseModelMixin.

  • Core pipeline (composite.py, autograd.py, base.py): CompositeCalculator._resolve_child_requests performs backward dependency analysis and additive-key expansion cleanly. EnergyDerivativesStep and the ensure_derivative_targets helper make the autograd differentiation path explicit.
  • Contracts (contracts.py): Layered Pydantic models (StepCard → StepProfile, PotentialCard → PotentialProfile, NeighborListCard → NeighborListProfile) provide frozen, validated per-instance contracts with clear inheritance.
  • Registry (registry.py): Refactored to a family-keyed registry with ArtifactProcessor/ArtifactResolver protocols; SHA-256 streaming is in place but MD5 verification still loads entire files via read_bytes().
  • ModelOutputs type alias (_typing.py): Hessian is silently removed from the union and NodeCharges | NodeSpins is added — this is a breaking change not called out in the PR description.
  • _validate_external_inputs (composite.py): Uses hasattr(batch, key) for dotted neighbor-list keys, which always returns False in Python and can raise false KeyErrors.
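The hasattr pitfall from the last bullet is easy to reproduce in isolation; `batch` and `has_dotted` below are illustrative stand-ins, not the actual nvalchemi objects.

```python
# Demonstrates the flagged pitfall: hasattr(obj, "a.b") never resolves a
# dotted path, so dot-namespaced neighbor-list keys always look "missing".
from types import SimpleNamespace

batch = SimpleNamespace(neighbors=SimpleNamespace(pairs=[(0, 1)]))

print(hasattr(batch, "neighbors.pairs"))  # False: the dot is not traversed

def has_dotted(obj, dotted: str) -> bool:
    """Resolve each dotted component with getattr instead."""
    for part in dotted.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True

print(has_dotted(batch, "neighbors.pairs"))  # True
```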

Important Files Changed

  • nvalchemi/_typing.py — ModelOutputs type alias drops Hessian and adds NodeCharges | NodeSpins.
  • nvalchemi/models/composite.py — New CompositeCalculator implementing the ordered pipeline composition; backward dependency analysis and additive-key expansion are well-structured, but _validate_external_inputs uses hasattr for dotted neighbor-list keys, which will not resolve correctly if data is stored directly on the batch.
  • nvalchemi/models/autograd.py — New EnergyDerivativesStep that materialises forces and stresses from accumulated energy via torch.autograd.grad; derivative target setup is delegated cleanly to the prepare() phase; no issues found.
  • nvalchemi/models/contracts.py — New Pydantic-based step/potential/neighbor-list card and profile hierarchy with cross-field validators; definitions are clean and internally consistent.
  • nvalchemi/models/base.py — Major refactoring from BaseModelMixin to _CalculationStep/Potential/_NeighborListStep; introduces RuntimeState, ForwardContext, and the singleton _UNSET sentinel; solid implementation with no critical issues found.
  • nvalchemi/models/registry.py — Registry refactored from flat ModelRegistryEntry to family-keyed KnownArtifactEntry with ArtifactProcessor/ArtifactResolver protocols; SHA-256 uses streaming, but MD5 verification uses read_bytes(), which loads entire files into memory.
  • nvalchemi/models/results.py — New thin MutableMapping-backed CalculatorResults container with additive merge semantics; straightforward and correct.
  • nvalchemi/dynamics/base.py — Replaces the BaseModelMixin dependency with the DynamicsCalculator protocol; model_is_conservative uses string-name matching (flagged in prior threads); validation weakened to callable() (also in prior threads); clean overall migration.
  • nvalchemi/models/__init__.py — Switches from TYPE_CHECKING-guarded lazy imports to eager imports; safe because optional third-party deps (aimnet, mace) are deferred inside the respective class constructors, not at module level.
  • nvalchemi/models/dsf.py — New DSF Coulomb potential step; correctly dispatches between COO and matrix neighbor formats, handles periodic/non-periodic cases, and computes virial-to-stress conversion; no issues found.
  • nvalchemi/models/neighbors.py — New explicit neighbor-list builder step with NeighborListBuilderConfig/AdaptiveNeighborListConfig; reuse-if-available optimisation and dot-namespaced result keys are well-implemented.
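On the registry finding above, MD5 verification can mirror the streaming SHA-256 path using hashlib's incremental API. The function name `md5_streaming` is illustrative, not the registry's actual function.

```python
# Sketch of streaming MD5 verification so large artifacts never sit fully
# in memory (mirroring the SHA-256 path the review says already streams).
import hashlib
from pathlib import Path

def md5_streaming(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # iter(callable, sentinel) keeps reading until fh.read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The same pattern works for any hashlib algorithm, so both checksum paths could share one helper parameterised by `hashlib.new(name)`.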

Reviews (2): Last reviewed commit: "docs: add design codument"

Comment on lines +290 to +301
all_resolved: set[str] = set()
for request in raw_requests:
    all_resolved |= request
expand_keys = all_resolved & self.pipeline_contract.additive_result_keys

if not expand_keys:
    return raw_requests

expanded_requests: list[frozenset[str]] = []
for request, step in zip(raw_requests, steps, strict=True):
    step_expand = expand_keys & step.profile.result_keys
    expanded_requests.append(request | step_expand)

P1 Additive-key expansion causes force double-counting in hybrid pipelines

The expansion logic on lines 293–301 asks every step that can produce an additive key (e.g. "forces") to do so, regardless of whether an EnergyDerivativesStep already accounts for that step's energy contribution.

Concrete scenario — pipeline [NL_mlip, MLIP, NL_dsf, DSF, AutoGrad] with active_outputs = {"forces", "energies"}:

  1. Backward resolution assigns {"forces"} to AutoGrad and {"energies"} to DSF; MLIP is re-added via expansion only.

  2. expand_keys = {"forces", "energies"} ∩ additive_result_keys = {"forces", "energies"}.

  3. Because "forces" is in DSFCoulombPotential.profile.result_keys, DSF is expanded to produce direct forces (compute_forces=True inside DSFCoulombPotential.compute).

  4. MLIP's prepare() writes gradient-tracked positions_scaled into runtime_state.input_overrides["positions"]. DSF calls require_input(batch, "positions", ctx), which resolves to positions_scaled, so DSF's energy is part of the autograd graph.

  5. EnergyDerivativesStep differentiates the accumulated total energy (E_mlip + E_dsf) through positions_scaled, producing forces −∂(E_mlip + E_dsf)/∂r.

  6. The additive merge then sums DSF's direct forces with the autograd forces:

    F_final = F_dsf_direct + F_autograd(E_mlip + E_dsf)
            ≈ F_dsf + F_mlip + F_dsf
            = 2 × F_dsf + F_mlip
    

The same double-counting occurs for stresses.

The root cause is that the expansion step does not know whether a given step's energy is already captured in the autograd graph of a later derivative step. Consider distinguishing "energy-only" mode (when an EnergyDerivativesStep follows) from "direct-output" mode so that a direct-force potential can participate in one way or the other, but not both simultaneously.
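The double count is easy to verify numerically with scalar stand-ins for the energy terms (toy values, not real DSF/MLIP physics): with E_mlip = a·r² and E_dsf = b·r², each force is −dE/dr = −2c·r.

```python
# Numeric toy reproducing the flagged double count with scalar energies
# E_mlip = a*r^2 and E_dsf = b*r^2, so each force term is -dE/dr = -2*c*r.
a, b, r = 1.0, 2.0, 3.0

f_mlip = -2.0 * a * r                  # force from the MLIP energy alone
f_dsf_direct = -2.0 * b * r            # DSF's direct (compute_forces=True) force
f_autograd_total = -2.0 * (a + b) * r  # derivative of the *total* energy

# The additive merge sums the direct force with the autograd force:
f_merged = f_dsf_direct + f_autograd_total
f_correct = f_mlip + f_dsf_direct

print(f_merged)              # -30.0: the DSF contribution is counted twice
print(f_merged - f_correct)  # -12.0, exactly one extra copy of f_dsf_direct
```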

Collaborator Author

@zubatyuk zubatyuk Mar 31, 2026


In the current implementation, the reported double-counting does not occur for DSF or Ewald as implemented. Correctness of differentiation is intentionally the responsibility of the calculator-step author, not something the composite planner infers automatically or implicitly.

Comment on lines 1469 to +1477
 def model_is_conservative(self) -> bool:
-    """Returns whether or not the model uses conservative forces"""
-    return self.model_card.forces_via_autograd
+    """Return a best-effort conservative-force indicator."""
+
+    steps = getattr(self.model, "steps", None)
+    if steps is None:
+        return False
+    return any(
+        type(step).__name__ == "EnergyDerivativesStep" for step in steps.values()
+    )

P2 model_is_conservative uses fragile class-name string matching

type(step).__name__ == "EnergyDerivativesStep"

This returns False for any subclass of EnergyDerivativesStep, and also returns False for non-CompositeCalculator callables that happen to derive conservative forces through other means, since getattr(self.model, "steps", None) would be None.

A more robust approach is to use isinstance(). The only thing preventing it is a potential circular import (autograd.py → base.py), which can be handled with a TYPE_CHECKING-guarded or deferred import:

@property
def model_is_conservative(self) -> bool:
    from nvalchemi.models.autograd import EnergyDerivativesStep  # deferred

    steps = getattr(self.model, "steps", None)
    if steps is None:
        return False
    return any(isinstance(step, EnergyDerivativesStep) for step in steps.values())

Comment on lines 1444 to 1451
     in the MRO (for cooperative multiple inheritance).
     """
     super().__init__(**kwargs)
-    if not isinstance(model, BaseModelMixin):
+    if not callable(model):
         raise TypeError(
-            f"Expected a `BaseModelMixin` instance, got {type(model).__name__}."
-            " Please wrap your model with a `BaseModelMixin` subclass."
+            f"Expected a callable dynamics calculator, got {type(model).__name__}."
         )
     self.model = model

P2 Validation weakened from structural check to callable() only

The old guard was isinstance(model, BaseModelMixin), which verified the full structural contract (model card, compute_embeddings, etc.). The replacement is:

if not callable(model):
    raise TypeError(...)

DynamicsCalculator is already @runtime_checkable, so it supports isinstance checks at runtime (verifying both the model_card attribute and the __call__ signature). Using it here would keep validation meaningful without re-introducing the old coupling:

if not isinstance(model, DynamicsCalculator):
    raise TypeError(
        f"Expected a DynamicsCalculator, got {type(model).__name__}. "
        "The model must be callable and expose a `model_card` attribute."
    )
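A minimal, self-contained demonstration of the runtime-checkable-Protocol isinstance pattern the suggestion relies on — `Calculator`, `Good`, and `Bad` are illustrative names, not the real DynamicsCalculator:

```python
# Demo of @runtime_checkable Protocol isinstance checks: the check verifies
# that each declared member (attribute or method) is present on the instance.
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class Calculator(Protocol):
    model_card: Any

    def __call__(self, batch: Any) -> Any: ...

class Good:
    model_card = {"name": "demo"}

    def __call__(self, batch):
        return {}

class Bad:
    pass  # has neither model_card nor __call__

print(isinstance(Good(), Calculator))  # True
print(isinstance(Bad(), Calculator))   # False
```

Note that runtime_checkable isinstance only checks member presence, not signatures, so it is a structural guard rather than a full contract check.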

Comment on lines +158 to +165
def __call__(
    self,
    batch: Batch,
    *,
    outputs: Iterable[str] | None = None,
) -> ModelOutputs:
    """Compute model outputs for one batch."""
    ...

P2 Protocol return type annotation is structurally inconsistent with CompositeCalculator

DynamicsCalculator.__call__ is declared to return ModelOutputs (an OrderedDict[str, Tensor | None]), but CompositeCalculator.forward returns CalculatorResults (a MutableMapping[str, Any]). While both types support .get() at runtime and the dynamics code works correctly, static type checkers will flag any CompositeCalculator instance as not conforming to DynamicsCalculator. Updating the protocol to use the common base type would make both nominal and structural conformance correct:

Suggested change
 def __call__(
     self,
     batch: Batch,
     *,
     outputs: Iterable[str] | None = None,
-) -> ModelOutputs:
+) -> Mapping[str, Any]:
     """Compute model outputs for one batch."""
     ...

…cstrings

Signed-off-by: Roman Zubatyuk <rzubatiuk@nvidia.com>
Signed-off-by: Roman Zubatyuk <rzubatiuk@nvidia.com>
Comment on lines 91 to +93
 Fmax: TypeAlias = Float[torch.Tensor, "B 1"]  # noqa: F722
 ModelOutputs: TypeAlias = OrderedDict[
-    str, Energy | Forces | Hessian | Stress | Virials | Dipole | None
+    str, Energy | Forces | Stress | Virials | Dipole | NodeCharges | NodeSpins | None

P1 Hessian silently removed from ModelOutputs

The previous definition included Hessian:

# before
ModelOutputs: TypeAlias = OrderedDict[
    str, Energy | Forces | Hessian | Stress | Virials | Dipole | None
]

The new definition drops Hessian entirely:

Suggested change
 Fmax: TypeAlias = Float[torch.Tensor, "B 1"]  # noqa: F722
 ModelOutputs: TypeAlias = OrderedDict[
-    str, Energy | Forces | Stress | Virials | Dipole | NodeCharges | NodeSpins | None
+    str, Energy | Forces | Hessian | Stress | Virials | Dipole | NodeCharges | NodeSpins | None
 ]  # noqa: F722

Any downstream model or step that currently produces Hessian outputs (second-order derivatives) and declares its return value as ModelOutputs will silently violate the type contract after this change. If removing Hessian support from the public API is intentional, it should be called out explicitly in the PR description and changelog so callers know to migrate. If it was an accidental omission, adding it back to the union is the safer choice.

