I am currently trying to verify the correctness of my installation. In order to handle different nodes, my test script differs from the original script in the following lines.
However, the test fails and I get an assertion error when comparing with the expected tensor. Here, I get different error messages when repeating the test. For example, either the following error message occurs:
======================================================================
FAIL: test_compressed_exact (__main__.CGXTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_qmpi.py", line 95, in test_compressed_exact
    self.assertEqual(t, expected, "Parameters. bits {},buffer size: {}".format(q, t.numel()))
AssertionError: Tensors are not equal: tensor([2.], device='cuda:0', dtype=torch.float16) != tensor([3.], device='cuda:0', dtype=torch.float16). Parameters. bits 2,buffer size: 1
In the two cases shown, the assertion fails at a different step while iterating over the tensor lengths. Do you possibly have an idea what could cause this?
As I understand it, the README passes the local rank when dist.init_process_group is called. Does this assume that there is only one node?
Thanks!
@ly-muc Thank you for filing the issue!
The problem was in the code. It is fixed in the commit and the new release.
The test had only been run on a single node, but it should also work in a multi-node setting.
I think it is sufficient to have dist.init_process_group(backend="cgx", init_method="env://", rank=self.rank). The rank is taken from OMPI_COMM_WORLD_RANK, which is supposed to be the global rank, not the local one.
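To illustrate the difference between the global and the local rank, here is a minimal sketch that reads both from the environment variables Open MPI's mpirun exports for each process. The helper ranks_from_env and the MpiRanks tuple are hypothetical names for this example and are not part of the test script or of the library. The commented call at the end simply mirrors the one suggested above.

```python
import os
from typing import Mapping, NamedTuple


class MpiRanks(NamedTuple):
    """Hypothetical container for the ranks of one MPI process."""
    global_rank: int  # unique across all nodes in the job
    local_rank: int   # index of the process within its own node
    world_size: int   # total number of processes in the job


def ranks_from_env(env: Mapping[str, str] = os.environ) -> MpiRanks:
    """Read the ranks that Open MPI's mpirun sets for each launched process."""
    return MpiRanks(
        global_rank=int(env["OMPI_COMM_WORLD_RANK"]),
        local_rank=int(env["OMPI_COMM_WORLD_LOCAL_RANK"]),
        world_size=int(env["OMPI_COMM_WORLD_SIZE"]),
    )


# Example: the only process on the second of two nodes.
# Its local rank is 0, but its global rank is 1.
example = ranks_from_env({
    "OMPI_COMM_WORLD_RANK": "1",
    "OMPI_COMM_WORLD_LOCAL_RANK": "0",
    "OMPI_COMM_WORLD_SIZE": "2",
})
print(example)

# In a real multi-node script (not executed here), the local rank would only
# select the GPU, while the global rank identifies the process in the group:
#   torch.cuda.set_device(ranks.local_rank)
#   dist.init_process_group(backend="cgx", init_method="env://", rank=ranks.global_rank)
```

With one process per node, every process has local rank 0, so passing the local rank to dist.init_process_group would give all processes the same rank. That only goes unnoticed on a single node, where the local and global ranks coincide.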
For reference, I execute the test with the following command:
mpirun -np 2 -x PATH --hostfile hostfile --tag-output --allow-run-as-root -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca coll ^hcoll -- python test/test_qmpi.py --masterhost=$MASTER_HOST