Unable to run Unittest

I am trying to verify that my installation works correctly. To run across multiple nodes, my test script differs from the original script in the following lines:

```python
os.environ['MASTER_ADDR'] = args.masterhost
os.environ['MASTER_PORT'] = '4040'
os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]

dist.init_process_group(backend="cgx",  init_method="env://", rank=self.rank % torch.cuda.device_count())
``` 

I execute the test with the following line:

```console
mpirun -np 2 -x PATH --hostfile hostfile --tag-output --allow-run-as-root -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca coll ^hcoll -- python test/test_qmpi.py --masterhost=$MASTER_HOST
```

However, the test fails with an assertion error when the result is compared with the expected tensor. The error message differs between runs. For example, I get either the following:

```console
======================================================================
FAIL: test_compressed_exact (__main__.CGXTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_qmpi.py", line 95, in test_compressed_exact
    self.assertEqual(t, expected, "Parameters. bits {},buffer size: {}".format(q, t.numel()))
AssertionError: Tensors are not equal: tensor([2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2.], device='cuda:0', dtype=torch.float16) != tensor([3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3.], device='cuda:0', dtype=torch.float16). Parameters. bits 2,buffer size: 128
```
or this one:

```console
======================================================================
FAIL: test_compressed_exact (__main__.CGXTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_qmpi.py", line 95, in test_compressed_exact
    self.assertEqual(t, expected, "Parameters. bits {},buffer size: {}".format(q, t.numel()))
AssertionError: Tensors are not equal: tensor([2.], device='cuda:0', dtype=torch.float16) != tensor([3.], device='cuda:0', dtype=torch.float16). Parameters. bits 2,buffer size: 1
```

In the two cases shown, the assertion fails at a different step of the loop over tensor lengths (buffer size 128 in the first run, buffer size 1 in the second). Do you have an idea what could cause this?

For my understanding: in the README, the local rank is passed when `dist.init_process_group` is called. Does this assume that there is only one node?
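
To make the distinction I am asking about concrete, here is a minimal sketch of how the global rank and the local rank could be read from the environment variables that Open MPI's `mpirun` exports. The helper name `rank_info_from_ompi` and the `RankInfo` tuple are hypothetical names of my own and are not part of the test script or the README.

```python
import os
from typing import Mapping, NamedTuple


class RankInfo(NamedTuple):
    rank: int        # global rank, unique across all nodes
    local_rank: int  # rank within one node, usable as a GPU index
    world_size: int  # total number of processes in the job


def rank_info_from_ompi(env: Mapping[str, str] = os.environ) -> RankInfo:
    """Read the rank layout from the variables exported by Open MPI's mpirun.

    Hypothetical helper for illustration; raises KeyError when the process
    was not launched through mpirun.
    """
    return RankInfo(
        rank=int(env["OMPI_COMM_WORLD_RANK"]),
        local_rank=int(env["OMPI_COMM_WORLD_LOCAL_RANK"]),
        world_size=int(env["OMPI_COMM_WORLD_SIZE"]),
    )


# Example: the second process of a 2-process job with one process per node.
info = rank_info_from_ompi({
    "OMPI_COMM_WORLD_RANK": "1",
    "OMPI_COMM_WORLD_LOCAL_RANK": "0",
    "OMPI_COMM_WORLD_SIZE": "2",
})
```

With such a helper, I would expect to pass `rank=info.rank` to `dist.init_process_group` and to select the GPU with `torch.cuda.set_device(info.local_rank)`. I am not sure whether the cgx backend expects the same, which is why I am asking.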

Thanks!
