Error While Training the Model as per README Instructions. #11

Open
NakataKoo opened this issue May 26, 2024 · 1 comment


@NakataKoo

First of all, thank you so much for this great project. You have done a tremendous job, and I am looking forward to exploring and using this model.

I was following the instructions provided in the README to train the model, including the environment setup and dataset acquisition, but I encountered an error during the process. Below are the details:

Error:

  1. After building the environment and acquiring the dataset, I ran the following command:
    PYTHONPATH=. python train.py exp=train_msdm

  2. Then I got the following error:

Traceback (most recent call last):
  File "/raid/m236866/multi-source-diffusion-models/train.py", line 4, in <module>
    import pytorch_lightning as pl
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/pytorch_lightning/__init__.py", line 34, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/pytorch_lightning/callbacks/__init__.py", line 25, in <module>
    from pytorch_lightning.callbacks.progress import ProgressBarBase, RichProgressBar, TQDMProgressBar
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/pytorch_lightning/callbacks/progress/__init__.py", line 22, in <module>
    from pytorch_lightning.callbacks.progress.rich_progress import RichProgressBar  # noqa: F401
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/pytorch_lightning/callbacks/progress/rich_progress.py", line 20, in <module>
    from torchmetrics.utilities.imports import _compare_version
ImportError: cannot import name '_compare_version' from 'torchmetrics.utilities.imports' (/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/torchmetrics/utilities/imports.py)
  3. Then I installed torchmetrics and ran the command again, and the following error occurred:
/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_cuda.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Enter run name: first-test-of-train-0525
[2024-05-25 21:11:39,620][main.utils][INFO] - Disabling python warnings! <config.ignore_warnings=True>
Global seed set to 12345
[2024-05-25 21:11:39,622][__main__][INFO] - Instantiating datamodule <main.module_base.DatamoduleWithValidation>.
Error executing job with overrides: ['exp=train_msdm']
Error locating target 'main.data.MultiSourceDataset', see chained exception above.
full_key: datamodule.train_dataset

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

  4. Finally, I tried the following and ran the command again:
conda install -c conda-forge cudatoolkit=11.3 cudnn=8.2
conda install -c conda-forge libjpeg libpng
pip uninstall torchvision
pip install torchvision
export HYDRA_FULL_ERROR=1
  5. The following error then occurred:
/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_cuda.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Enter run name: test
[2024-05-25 21:13:40,990][main.utils][INFO] - Disabling python warnings! <config.ignore_warnings=True>
Global seed set to 12345
[2024-05-25 21:13:40,992][__main__][INFO] - Instantiating datamodule <main.module_base.DatamoduleWithValidation>.
Error executing job with overrides: ['exp=train_msdm']
Traceback (most recent call last):
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 639, in _locate
    obj = getattr(obj, part)
AttributeError: module 'main' has no attribute 'data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 645, in _locate
    obj = import_module(mod)
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/raid/m236866/multi-source-diffusion-models/main/data.py", line 11, in <module>
    import av
ModuleNotFoundError: No module named 'av'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 134, in _resolve_target
    target = _locate(target)
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 648, in _locate
    raise ImportError(
ImportError: Error loading 'main.data.MultiSourceDataset':
ModuleNotFoundError("No module named 'av'")
Are you sure that 'data' is importable from module 'main'?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/raid/m236866/multi-source-diffusion-models/train.py", line 99, in <module>
    main()
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/main.py", line 90, in decorated_main
    _run_hydra(
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/raid/m236866/multi-source-diffusion-models/train.py", line 28, in main
    datamodule = hydra.utils.instantiate(config.datamodule, _convert_="partial")
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 222, in instantiate
    return instantiate_node(
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 334, in instantiate_node
    value = instantiate_node(
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 325, in instantiate_node
    _target_ = _resolve_target(node.get(_Keys.TARGET), full_key)
  File "/home/m236866/.conda/envs/msdm/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 139, in _resolve_target
    raise InstantiationException(msg) from e
hydra.errors.InstantiationException: Error locating target 'main.data.MultiSourceDataset', see chained exception above.
full_key: datamodule.train_dataset

I would greatly appreciate any guidance or solutions you could provide to resolve this issue.

Thank you once again for your incredible work.
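
A note on reading the last traceback in this report: it is a chain of exceptions linked by "The above exception was the direct cause of the following exception", so the failure closest to the root is the `ModuleNotFoundError` for `av` raised from `main/data.py`, not the Hydra `InstantiationException` printed last. The sketch below shows one way to surface that innermost cause programmatically. The helper names (`root_cause`, `load_dataset_module`, `instantiate`) are hypothetical, and the raised exceptions are stand-ins that only imitate the chain shown above; this is not Hydra's code.

```python
def root_cause(exc):
    """Follow explicit `raise ... from ...` links back to the first exception."""
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


def load_dataset_module():
    # Stand-in for `import av` failing inside main/data.py (hypothetical helper).
    raise ModuleNotFoundError("No module named 'av'")


def instantiate():
    # Stand-in for a framework wrapping the import failure twice,
    # imitating the ImportError and InstantiationException in the log.
    try:
        try:
            load_dataset_module()
        except ModuleNotFoundError as e:
            raise ImportError("Error loading 'main.data.MultiSourceDataset'") from e
    except ImportError as e:
        raise RuntimeError("Error locating target 'main.data.MultiSourceDataset'") from e


try:
    instantiate()
except RuntimeError as err:
    cause = root_cause(err)
    print(f"{type(cause).__name__}: {cause}")
```

The helper follows only `__cause__` and deliberately ignores `__context__`. In the log above, the `AttributeError` ("module 'main' has no attribute 'data'") is attached as context ("During handling of the above exception, another exception occurred"), and following it would hide the missing `av` module behind a less informative error.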

@syl4356

syl4356 commented Jun 10, 2024

Hi, I ran into the same problem. Installing av resolved it:

pip install av
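
Since the errors in this thread surfaced one missing or mismatched package at a time, it may help to check up front which modules the environment can actually find before launching `train.py`. The following is a minimal sketch, assuming the top-level module names seen in the tracebacks above are the relevant ones; the list is illustrative, not the project's official requirements, and `missing_modules` is a hypothetical helper name.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the names from `names` that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Top-level modules that appear in the tracebacks in this thread
    # (an assumption, not a complete dependency list for the project).
    required = ["av", "torchmetrics", "pytorch_lightning", "torchvision", "hydra"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

Note that `find_spec` only reports whether a module can be located. It does not detect version mismatches such as the `_compare_version` import error in the first traceback, which occurs even though `torchmetrics` is installed.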
