[Feature Request]Warn on missing custom OP during dp --pt convert/dp --pt compress and allow overriding SHARED_LIB_DIR #5368
Description
Summary
Checklist
[x] I have searched existing issues to make sure this feature has not been requested before.
[x] I have checked the documentation of DeepMD-kit.
Description of the problem
When users compile a custom libdeepmd_op_pt.so (e.g., to match specific CUDA/GCC environments) and place it in a custom directory (not the default deepmd/lib/), the PyTorch backend silently fails to load it during dp --pt convert-backend or dp --pt compress.
The root cause is in deepmd/pt/cxx_op.py:
SHARED_LIB_DIR = Path(deepmd.lib.__path__[0])
module_file = (SHARED_LIB_DIR / (prefix + module_name)).with_suffix(ext).resolve()
if module_file.is_file():
# loads the library
It strictly looks for the .so in the Python package installation path and ignores environment variables like LD_LIBRARY_PATH. If the file is missing, ENABLE_CUSTOMIZED_OP becomes False, and the model is serialized with dummy placeholder functions (e.g., a tabulate_fusion_se_a that just raises NotImplementedError).
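An illustrative sketch (not the actual DeepMD-kit code) of the failure mode: when the shared library is not found, a dummy placeholder stands in for the real OP, so conversion succeeds silently and the error only surfaces at inference time.

```python
# Sketch of the silent-failure path described above. The placeholder body
# here is illustrative; the real one lives in the DeepMD-kit PT backend.
ENABLE_CUSTOMIZED_OP = False  # set when libdeepmd_op_pt.so failed to load


def tabulate_fusion_se_a(*args, **kwargs):
    """Dummy placeholder serialized into the model when the real OP is absent."""
    raise NotImplementedError(
        "tabulate_fusion_se_a requires libdeepmd_op_pt.so, which was not loaded."
    )

# convert/compress never actually call the OP, so nothing fails during
# serialization; the first real caller (e.g., LAMMPS inference) crashes.
```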
This results in a broken .pth model that passes conversion without any warnings, but crashes immediately in LAMMPS.
Currently, the only workaround is to manually symlink the .so into deepmd/lib/, which is non-intuitive and breaks upon environment updates.
Detailed Description
Describe the solution
- Add a critical warning/error during model serialization (convert/compress/freeze)
When a model architecture requires customized OPs (e.g., uses tabulate_fusion_se_a for compressed se_a) but ENABLE_CUSTOMIZED_OP is False, dp --pt convert and dp --pt compress should not proceed silently.
They should raise an explicit error or a prominent warning like:
[ERROR] The current model requires customized PyTorch OPs (e.g., tabulate_fusion_se_a), but libdeepmd_op_pt.so was not loaded. The exported model will fail during inference. Please ensure the custom OP library is installed correctly.
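A minimal sketch of such a guard; `check_custom_ops` and `REQUIRED_CUSTOM_OPS` are hypothetical names, not existing DeepMD-kit APIs:

```python
import logging

log = logging.getLogger(__name__)

# Hypothetical registry of OPs that only exist in libdeepmd_op_pt.so.
REQUIRED_CUSTOM_OPS = {"tabulate_fusion_se_a"}


def check_custom_ops(model_ops, enable_customized_op, strict=True):
    """Fail fast (or warn loudly) before serializing a model whose OPs
    require the custom library that was not loaded."""
    missing = set(model_ops) & REQUIRED_CUSTOM_OPS
    if missing and not enable_customized_op:
        msg = (
            f"The current model requires customized PyTorch OPs {sorted(missing)}, "
            "but libdeepmd_op_pt.so was not loaded. The exported model will fail "
            "during inference. Please ensure the custom OP library is installed."
        )
        if strict:
            raise RuntimeError(msg)
        log.error(msg)
```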
- Allow overriding SHARED_LIB_DIR via an environment variable
In deepmd/pt/cxx_op.py, the path resolution logic should fall back to an environment variable (e.g., DEEPMD_OP_DIR) if the hardcoded SHARED_LIB_DIR does not contain the library.
Proposed logic
import os

module_file = (SHARED_LIB_DIR / (prefix + module_name)).with_suffix(ext).resolve()
if not module_file.is_file():
    # Check environment variable override
    env_dir = os.environ.get("DEEPMD_OP_DIR")
    if env_dir:
        module_file = (Path(env_dir) / (prefix + module_name)).with_suffix(ext).resolve()

This would allow users to point to their custom-compiled OP libraries without modifying source code or creating symlinks.
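For reference, a self-contained version of the proposed fallback (DEEPMD_OP_DIR is the proposed environment variable, not one DeepMD-kit currently reads; the function name is illustrative):

```python
import os
from pathlib import Path


def resolve_op_library(shared_lib_dir, prefix, module_name, ext):
    """Resolve the custom OP library, preferring the packaged location and
    falling back to the proposed DEEPMD_OP_DIR environment variable."""
    module_file = (Path(shared_lib_dir) / (prefix + module_name)).with_suffix(ext).resolve()
    if module_file.is_file():
        return module_file
    env_dir = os.environ.get("DEEPMD_OP_DIR")
    if env_dir:
        candidate = (Path(env_dir) / (prefix + module_name)).with_suffix(ext).resolve()
        if candidate.is_file():
            return candidate
    # Caller would then set ENABLE_CUSTOMIZED_OP = False and emit a warning.
    return None
```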
Describe alternatives you’ve considered
Manually symlinking the .so to site-packages/deepmd/lib/ (Current workaround, fragile).
Modifying cxx_op.py source code directly (Gets overwritten on updates).
Setting LD_LIBRARY_PATH (does not work: cxx_op.py resolves an explicit absolute path via Path.is_file() instead of calling torch.ops.load_library and letting the dynamic loader search LD_LIBRARY_PATH).
Additional context
DeepMD-kit version: v3.1.0 (and likely affects v2.x PT backend as well)
PyTorch version: Built with _GLIBCXX_USE_CXX11_ABI=0 (conda-forge)
How to reproduce:
Compile libdeepmd_op_pt.so in a custom directory (e.g., ~/deepmd-kit/lib/).
Set export LD_LIBRARY_PATH=~/deepmd-kit/lib/:$LD_LIBRARY_PATH.
Run dp --pt convert-backend in.pb out.pth (the .pb model may contain OPs such as se_a).
Observe no warnings. Check ENABLE_CUSTOMIZED_OP -> it is False.
Run dp --pt compress out.pth compress.pth
Observe no warnings. Check ENABLE_CUSTOMIZED_OP -> it is False.
Run LAMMPS with out.pth: this works. But run LAMMPS with compress.pth -> NotImplementedError (somewhat like #4530).
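To make the checks in the steps above easy to repeat, here is a small diagnostic that reports the flag without crashing when DeepMD-kit is absent (a convenience sketch, not part of DeepMD-kit itself):

```python
from typing import Optional


def custom_op_enabled() -> Optional[bool]:
    """Return ENABLE_CUSTOMIZED_OP from the PT backend, or None if
    DeepMD-kit's PyTorch backend is not importable in this environment."""
    try:
        from deepmd.pt.cxx_op import ENABLE_CUSTOMIZED_OP
    except ImportError:
        return None
    return bool(ENABLE_CUSTOMIZED_OP)


print(custom_op_enabled())  # False reproduces the silent-failure path
```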
Further Information, Files, and Links
No response