bug: export fails when reloading models with fewer GPUs than they were trained on #730

fchouteau · 2024-11-21T11:27:36Z

When running scripts.export.py with an existing config file on a machine that has fewer GPUs than the one the model was trained on, the export fails.
Example:

python scripts/export_model.py \
  --train_config ${MODELS_ROOT}/some_model/train_config.json \
  --model_in_file ${MODELS_ROOT}/some_model/latest_net_G_A.pth \
  --model_out_file ${MODELS_ROOT}/models/some_model.pt \
  --export_type=jit \
  --img_size=256 \
  --cuda

This happens because the config file contains something like:

{
    "checkpoints_dir": "/data1/beniz/checkpoints/deepzoom/pan/",
    "dataroot": "/data1/beniz/data/deepzoom/202407/pan/dz_phrsimus_pan_256/",
    "ddp_port": "12355",
    "gpu_ids": "2"
}

The saved gpu_ids value points at device index 2, which does not exist on the export machine, so you get:

RuntimeError: CUDA error: invalid device ordinal

One ugly fix would be to override the value in scripts/export.py, L173:

    if args.train_config:
        with open(args.train_config) as train_config_file:
            train_config_json = json.load(train_config_file)
            train_config_json["gpu_ids"] = "0" # force the GPU ID before parsing the train options
            opt = TrainOptions().parse_json(train_config_json)
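
A slightly less blunt variant of the override above would keep any saved GPU index that actually exists on the export machine and only fall back to "0" when none does. This is a minimal sketch, not the project's code: clamp_gpu_ids and load_train_config are hypothetical helper names, the comma-separated gpu_ids format is assumed from the config excerpt, and the sketch assumes at least one GPU is present (the command above passes --cuda).

```python
import json


def clamp_gpu_ids(gpu_ids: str, device_count: int) -> str:
    """Drop GPU indices that do not exist on this machine.

    Hypothetical helper: falls back to "0" when none of the saved
    indices is valid. Assumes device_count >= 1.
    """
    wanted = [int(tok) for tok in gpu_ids.split(",") if tok.strip()]
    valid = [i for i in wanted if 0 <= i < device_count]
    if not valid:
        valid = [0]
    return ",".join(str(i) for i in valid)


def load_train_config(path: str, device_count: int) -> dict:
    """Load a train config and rewrite gpu_ids before it is parsed."""
    with open(path) as train_config_file:
        config = json.load(train_config_file)
    saved = str(config.get("gpu_ids", "0"))
    config["gpu_ids"] = clamp_gpu_ids(saved, device_count)
    return config
```

In the export script, device_count could come from torch.cuda.device_count(), and the returned dict would then be passed to TrainOptions().parse_json(...) as in the snippet above.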
