
Conversation

@jmduarte
Contributor

Supersedes #309

@jmduarte
Contributor Author

@jpata

The Habana software stack requires numpy < 2, and numba 0.60.1 is incompatible with that constraint, so I downgraded numba to 0.60.0.
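For reference, the version pin described above might look like the following (the versions are from the comment; the single-command form is an assumption):

```shell
# Pin numpy below 2 (required by the Habana stack) and numba to 0.60.0,
# a release that still supports numpy < 2.
pip install "numpy<2" "numba==0.60.0"
```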

Now I get past that issue, but when I try to convert the dataset with tfds, it complains that TensorFlow is not installed. I was under the impression that TensorFlow is not needed for tfds. Do you know if it's possible to get around this?

@jpata
Owner

jpata commented Mar 16, 2025

The tfds documentation says that "TensorFlow is no longer a dependency to read datasets": https://www.tensorflow.org/datasets/tfless_tfds

The CI job also tries to create the dataset, so potentially we'd want to skip the CI for the habana branch. In any case, to run the ML training, you would use a pre-existing dataset.
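For reference, the TensorFlow-free read path needs a recent TFDS release; the install sketch below is an assumption (the version bound and the `array_record` package are not from this thread):

```shell
# Assumed versions: tensorflow-datasets gained TensorFlow-free reading
# (tfds.data_source) around 4.9; array_record provides the random-access
# file format that this read path uses.
pip install "tensorflow-datasets>=4.9" array_record
```

With that in place, `tfds.data_source("dataset_name", split="train")` returns a random-access source of numpy examples without importing TensorFlow, provided the dataset was prepared in the ArrayRecord format.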

@jmduarte
Contributor Author

OK, thanks! I managed to install a compatible TensorFlow, but I agree it isn't strictly necessary, since I can read a pre-made dataset.

I now see another minor error during validation/plotting, but the training runs (on CPU). Will try to get the training to run on HPU now.

Traceback (most recent call last):
  File "/particleflow/mlpf/pipeline.py", line 183, in <module>
    main()
  File "/particleflow/mlpf/pipeline.py", line 179, in main
    device_agnostic_run(config, world_size, outdir)
  File "/particleflow/mlpf/model/training.py", line 864, in device_agnostic_run
    run(rank, world_size, config, outdir, logfile)
  File "/particleflow/mlpf/model/training.py", line 757, in run
    train_all_epochs(
  File "/particleflow/mlpf/model/training.py", line 384, in train_all_epochs
    losses_valid = eval_epoch(
  File "/particleflow/mlpf/model/training.py", line 252, in eval_epoch
    validation_plots(batch, ypred_raw, ytarget, ypred, tensorboard_writer, epoch, outdir)
  File "/particleflow/mlpf/model/plots.py", line 68, in validation_plots
    plt.hist2d(etarget, epred, bins=b, cmap="hot", norm=matplotlib.colors.LogNorm())
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/_api/deprecation.py", line 453, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/pyplot.py", line 3526, in hist2d
    __ret = gca().hist2d(
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/_api/deprecation.py", line 453, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/__init__.py", line 1521, in inner
    return func(
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/axes/_axes.py", line 7504, in hist2d
    h, xedges, yedges = np.histogram2d(x, y, bins=bins, range=range,
  File "<__array_function__ internals>", line 180, in histogram2d
  File "/usr/local/lib/python3.10/dist-packages/numpy/lib/twodim_base.py", line 825, in histogram2d
    hist, edges = histogramdd([x, y], bins, range, normed, weights, density)
  File "<__array_function__ internals>", line 180, in histogramdd
  File "/usr/local/lib/python3.10/dist-packages/numpy/lib/histograms.py", line 1031, in histogramdd
    raise ValueError(
ValueError: The dimension of bins must be equal to the dimension of the  sample x.
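For context, numpy raises this exact error when the `bins` argument reaching `histogramdd` is a sequence whose length does not match the two dimensions of the sample. A minimal reproduction (variable names are illustrative, not taken from `plots.py`):

```python
import numpy as np

rng = np.random.default_rng(0)
etarget = rng.uniform(0, 10, 100)
epred = rng.uniform(0, 10, 100)
edges = np.linspace(0, 10, 21)

# A plain 1-D array of edges is applied to both axes -- this works:
h, xe, ye = np.histogram2d(etarget, epred, bins=edges)
assert h.shape == (20, 20)

# But wrapping the edges in a length-1 sequence makes histogramdd see
# one bin spec for a 2-D sample, reproducing the error in the traceback:
try:
    np.histogram2d(etarget, epred, bins=[edges])
except ValueError as e:
    print(e)
```

So the fix is likely to ensure `b` is either an integer, a plain array of edges, or a length-2 sequence `[x_edges, y_edges]`.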

@jmduarte
Contributor Author

I made it to this error, but I don't currently see where it's coming from. It seems num_classes=-1 should be allowed in torch.nn.functional.one_hot (and it's the default).

Traceback (most recent call last):
  File "/particleflow/mlpf/pipeline.py", line 160, in <module>
    main()
  File "/particleflow/mlpf/pipeline.py", line 156, in main
    device_agnostic_run(config, world_size, experiment_dir, args.habana)
  File "/particleflow/mlpf/model/training.py", line 855, in device_agnostic_run
    run(rank, world_size, config, outdir, logfile)
  File "/particleflow/mlpf/model/training.py", line 740, in run
    train_all_epochs(
  File "/particleflow/mlpf/model/training.py", line 369, in train_all_epochs
    losses_train = train_epoch(
  File "/particleflow/mlpf/model/training.py", line 144, in train_epoch
    loss_opt, loss, _, _, _ = model_step(batch, model, mlpf_loss)
  File "/particleflow/mlpf/model/training.py", line 75, in model_step
    loss_opt, losses_detached = loss_fn(ytarget, ypred, batch)
  File "/particleflow/mlpf/model/losses.py", line 115, in mlpf_loss
    was_input_true = torch.concat([torch.nn.functional.one_hot((y["cls_id"] != 0).to(torch.long)), y["momentum"]], axis=-1) * batch.mask.unsqueeze(
RuntimeError: Number of classes cannot be -1
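I'm not certain what triggers this on the Habana stack, but with the default `num_classes=-1` the class count is inferred from the data at runtime, and not every backend supports that inference (an empty tensor also defeats it). Since `(y["cls_id"] != 0)` is a binary indicator, the count is known ahead of time and can be passed explicitly; a sketch of that workaround (the tensor contents here are made up):

```python
import torch
import torch.nn.functional as F

cls_id = torch.tensor([[0, 3, 1], [2, 0, 0]])

# Default num_classes=-1 infers the class count from the data, which some
# backends reject; the indicator below is 0/1, so the count is exactly 2
# and can be stated explicitly, sidestepping the inference:
onehot = F.one_hot((cls_id != 0).to(torch.long), num_classes=2)
assert onehot.shape == (2, 3, 2)
```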
