Floating point exception (core dumped) problem #273

@wykk00

Description

I ran into a problem while trying to reproduce the GIANT paper code. I used my own text-attributed graph dataset and followed GIANT's data processing instructions.

Strangely, the problem occurs at training level 1, while training level 0 completes without error.
I tried to debug this issue, and the only lead I could find is that the crash may occur in the sparse_matmul() function in matcher._predict().
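
A floating point exception terminates the process through SIGFPE, so no Python traceback is printed, which makes the sparse_matmul() suspicion hard to confirm. One possible way to narrow it down is to run the same training module in-process with the standard-library faulthandler enabled, so that the Python stack is dumped at the moment of the crash. The sketch below is only an illustration under that assumption: the helper name run_module_with_fault_traceback is hypothetical and not part of PECOS, and the module name and arguments in the usage comment are simply those of the command in this report.

```python
import faulthandler
import runpy
import sys


def run_module_with_fault_traceback(module_name, argv, file=None):
    """Run `python -m module_name argv...` in-process with faulthandler on.

    Hypothetical helper for debugging, not part of PECOS.
    """
    # faulthandler installs handlers for fatal signals, including SIGFPE,
    # and writes the Python traceback of each thread to `file` on a crash.
    # The file object must stay open for as long as the handler is enabled.
    faulthandler.enable(file=file if file is not None else sys.stderr,
                        all_threads=True)

    saved_argv = sys.argv
    sys.argv = [module_name] + list(argv)
    try:
        # Equivalent to `python -m module_name`, but inside this interpreter.
        runpy.run_module(module_name, run_name="__main__", alter_sys=True)
    finally:
        sys.argv = saved_argv


# Hypothetical usage, mirroring the command in this report:
# run_module_with_fault_traceback(
#     "pecos.xmc.xtransformer.train",
#     ["-t", "X.trn.txt", "-x", "X.trn.tfidf.npz", "-y", "Y.trn.npz",
#      "-m", "xrt_models", "--batch-gen-workers", "0"],
# )
```

The same effect should be available without a wrapper by passing CPython's -X faulthandler option to python3 in the original command. Either way, the dump only shows which Python call was active when the signal arrived; it does not by itself identify the cause inside the native code.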

Steps to reproduce

The command is

CUDA_VISIBLE_DEVICES=1 python3 -m pecos.xmc.xtransformer.train -t X.trn.txt -x X.trn.tfidf.npz -y Y.trn.npz -m xrt_models --batch-gen-workers 0

Error message or code output

12/29/2023 13:02:58 - INFO - pecos.xmc.xtransformer.matcher - | [   5/   5][  7150/  7220] | 1373/1444 batches | ms/batch 451.6586 | train_loss 7.300417e-01 | lr 9.695291e-07
12/29/2023 13:03:24 - INFO - pecos.xmc.xtransformer.matcher - | [   5/   5][  7200/  7220] | 1423/1444 batches | ms/batch 451.0563 | train_loss 7.260027e-01 | lr 2.770083e-07
12/29/2023 13:03:24 - INFO - pecos.xmc.xtransformer.matcher - | **** saving model (avg_prec=0) to /tmp/tmpo8wg3j8h at global_step 7200 ****
12/29/2023 13:03:26 - INFO - pecos.xmc.xtransformer.matcher - -----------------------------------------------------------------------------------------
12/29/2023 13:03:36 - INFO - pecos.xmc.xtransformer.matcher - Reload the best checkpoint from /tmp/tmpo8wg3j8h
Floating point exception (core dumped)

Environment

  • Operating system: Ubuntu-22.04.1 (X86)
  • Python version: 3.9.18
  • PECOS version: 1.2.2
  • torch: 1.13.1
  • numpy: 1.26.2
  • scipy: 1.11.4
  • transformers: 4.36.2
