
Conversation

@jmduarte
Contributor

Supersedes #309

@jmduarte
Contributor Author

@jpata

The Habana software stack requires numpy < 2, and numba 0.60.1 is incompatible with that constraint, so I downgraded numba to 0.60.0.
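For reference, the version pin described above might look like the following (the versions are from the comment; the single-command form is an assumption):

```shell
# Pin numpy below 2 (required by the Habana stack) and numba to 0.60.0,
# a release that still supports numpy < 2.
pip install "numpy<2" "numba==0.60.0"
```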

Now I get past that issue, but when I try to convert the dataset with tfds, it complains that TensorFlow is not installed. I was under the impression that TensorFlow is not needed for tfds. Do you know if it's possible to get around this?

@jpata
Owner

jpata commented Mar 16, 2025

The tfds documentation says that "TensorFlow is no longer a dependency to read datasets": https://www.tensorflow.org/datasets/tfless_tfds

The CI job also tries to create the dataset, so potentially we'd want to skip the CI for the habana branch. In any case, to run the ML training, you would use a pre-existing dataset.
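For reference, the TensorFlow-free read path needs a recent TFDS release; the install sketch below is an assumption (the version bound and the `array_record` package are not from this thread):

```shell
# Assumed versions: tensorflow-datasets gained TensorFlow-free reading
# (tfds.data_source) around 4.9; array_record provides the random-access
# file format that this read path uses.
pip install "tensorflow-datasets>=4.9" array_record
```

With that in place, `tfds.data_source("dataset_name", split="train")` returns a random-access source of numpy examples without importing TensorFlow, provided the dataset was prepared in the ArrayRecord format.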

@jmduarte
Contributor Author

OK, thanks! I managed to install a compatible TensorFlow, but I agree it isn't strictly necessary, since I can read a pre-made dataset.

I now see another minor error during validation/plotting, but the training runs (on CPU). Will try to get the training to run on HPU now.

Traceback (most recent call last):
  File "/particleflow/mlpf/pipeline.py", line 183, in <module>
    main()
  File "/particleflow/mlpf/pipeline.py", line 179, in main
    device_agnostic_run(config, world_size, outdir)
  File "/particleflow/mlpf/model/training.py", line 864, in device_agnostic_run
    run(rank, world_size, config, outdir, logfile)
  File "/particleflow/mlpf/model/training.py", line 757, in run
    train_all_epochs(
  File "/particleflow/mlpf/model/training.py", line 384, in train_all_epochs
    losses_valid = eval_epoch(
  File "/particleflow/mlpf/model/training.py", line 252, in eval_epoch
    validation_plots(batch, ypred_raw, ytarget, ypred, tensorboard_writer, epoch, outdir)
  File "/particleflow/mlpf/model/plots.py", line 68, in validation_plots
    plt.hist2d(etarget, epred, bins=b, cmap="hot", norm=matplotlib.colors.LogNorm())
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/_api/deprecation.py", line 453, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/pyplot.py", line 3526, in hist2d
    __ret = gca().hist2d(
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/_api/deprecation.py", line 453, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/__init__.py", line 1521, in inner
    return func(
  File "/usr/local/lib/python3.10/dist-packages/matplotlib/axes/_axes.py", line 7504, in hist2d
    h, xedges, yedges = np.histogram2d(x, y, bins=bins, range=range,
  File "<__array_function__ internals>", line 180, in histogram2d
  File "/usr/local/lib/python3.10/dist-packages/numpy/lib/twodim_base.py", line 825, in histogram2d
    hist, edges = histogramdd([x, y], bins, range, normed, weights, density)
  File "<__array_function__ internals>", line 180, in histogramdd
  File "/usr/local/lib/python3.10/dist-packages/numpy/lib/histograms.py", line 1031, in histogramdd
    raise ValueError(
ValueError: The dimension of bins must be equal to the dimension of the  sample x.
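For context, numpy raises this exact error when the `bins` argument reaching `histogramdd` is a sequence whose length does not match the two dimensions of the sample. A minimal reproduction (variable names are illustrative, not taken from `plots.py`):

```python
import numpy as np

rng = np.random.default_rng(0)
etarget = rng.uniform(0, 10, 100)
epred = rng.uniform(0, 10, 100)
edges = np.linspace(0, 10, 21)

# A plain 1-D array of edges is applied to both axes -- this works:
h, xe, ye = np.histogram2d(etarget, epred, bins=edges)
assert h.shape == (20, 20)

# But wrapping the edges in a length-1 sequence makes histogramdd see
# one bin spec for a 2-D sample, reproducing the error in the traceback:
try:
    np.histogram2d(etarget, epred, bins=[edges])
except ValueError as e:
    print(e)
```

So the fix is likely to ensure `b` is either an integer, a plain array of edges, or a length-2 sequence `[x_edges, y_edges]`.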

@jmduarte
Contributor Author

I made it to this error, but I don't currently see where it's coming from. It seems num_classes=-1 should be allowed in torch.nn.functional.one_hot (and it's the default).

Traceback (most recent call last):
  File "/particleflow/mlpf/pipeline.py", line 160, in <module>
    main()
  File "/particleflow/mlpf/pipeline.py", line 156, in main
    device_agnostic_run(config, world_size, experiment_dir, args.habana)
  File "/particleflow/mlpf/model/training.py", line 855, in device_agnostic_run
    run(rank, world_size, config, outdir, logfile)
  File "/particleflow/mlpf/model/training.py", line 740, in run
    train_all_epochs(
  File "/particleflow/mlpf/model/training.py", line 369, in train_all_epochs
    losses_train = train_epoch(
  File "/particleflow/mlpf/model/training.py", line 144, in train_epoch
    loss_opt, loss, _, _, _ = model_step(batch, model, mlpf_loss)
  File "/particleflow/mlpf/model/training.py", line 75, in model_step
    loss_opt, losses_detached = loss_fn(ytarget, ypred, batch)
  File "/particleflow/mlpf/model/losses.py", line 115, in mlpf_loss
    was_input_true = torch.concat([torch.nn.functional.one_hot((y["cls_id"] != 0).to(torch.long)), y["momentum"]], axis=-1) * batch.mask.unsqueeze(
RuntimeError: Number of classes cannot be -1
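I'm not certain what triggers this on the Habana stack, but with the default `num_classes=-1` the class count is inferred from the data at runtime, and not every backend supports that inference (an empty tensor also defeats it). Since `(y["cls_id"] != 0)` is a binary indicator, the count is known ahead of time and can be passed explicitly; a sketch of that workaround (the tensor contents here are made up):

```python
import torch
import torch.nn.functional as F

cls_id = torch.tensor([[0, 3, 1], [2, 0, 0]])

# Default num_classes=-1 infers the class count from the data, which some
# backends reject; the indicator below is 0/1, so the count is exactly 2
# and can be stated explicitly, sidestepping the inference:
onehot = F.one_hot((cls_id != 0).to(torch.long), num_classes=2)
assert onehot.shape == (2, 3, 2)
```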
