Hivemind strategy fails #15

the-beee · 2022-12-22T20:25:14Z

Bug description

I'm trying to use HivemindStrategy to train a ResNet model on CIFAR-10 using two machines (one with a GPU, the other without).
I start the CPU machine first, and training starts without a problem. Then I copy the initial_peers value to the GPU machine and start training there, but it fails.

How to reproduce the bug

import os

import pandas as pd
import seaborn as sn
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.callbacks.progress import TQDMProgressBar
from pytorch_lightning.loggers import CSVLogger
from torch.optim.lr_scheduler import OneCycleLR
from torch.optim.swa_utils import AveragedModel, update_bn
from torchmetrics.functional import accuracy

seed_everything(7)

PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
BATCH_SIZE = 256 if torch.cuda.is_available() else 16
NUM_WORKERS = int(os.cpu_count() / 2)


train_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomCrop(32, padding=4),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

test_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

cifar10_dm = CIFAR10DataModule(
    data_dir=PATH_DATASETS,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    train_transforms=train_transforms,
    test_transforms=test_transforms,
    val_transforms=test_transforms,
)

def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model


class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}


model = LitResnet(lr=0.05)


from pytorch_lightning.strategies import HivemindStrategy

trainer = Trainer(
    max_epochs=30,
    accelerator="auto",
    devices=1 if torch.cuda.is_available() else None,
    # On the machine without a GPU (started first):
    strategy=HivemindStrategy(target_batch_size=2048),
    # On the machine with a GPU, use this instead:
    # strategy=HivemindStrategy(
    #     target_batch_size=2048,
    #     initial_peers='/ip4/135.181.202.15/tcp/34483/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ,/ip4/135.181.202.15/udp/51862/quic/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ',
    # ),
)

trainer.fit(model, cifar10_dm)

Error messages and logs

The first machine (without a GPU) trains normally. Here's a sample of its output:

Other machines can connect running the same command:
INITIAL_PEERS=/ip4/135.181.202.15/tcp/34483/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ,/ip4/135.181.202.15/udp/51862/quic/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ python ...
or passing the peers to the strategy:
HivemindStrategy(initial_peers='/ip4/135.181.202.15/tcp/34483/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ,/ip4/135.181.202.15/udp/51862/quic/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ')

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Files already downloaded and verified
Files already downloaded and verified

  | Name  | Type   | Params
---------------------------------
0 | model | ResNet | 11.2 M
---------------------------------
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
44.696    Total estimated model params size (MB)
Epoch 0:   0%|                                                                                                                                                                            | 0/3125 [00:00<?, ?it/s]Found per machine batch size automatically from the batch: 16
Epoch 0:   2%|██▏                                                                                                                                            | 48/3125 [00:11<12:03,  4.25it/s, loss=2.37, v_num=7]
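
The output above names two ways to hand the peer list to a second machine: the INITIAL_PEERS environment variable, or the initial_peers argument. A minimal sketch of driving both roles from one script could look like the following. `strategy_kwargs` is a hypothetical helper, not part of Lightning; it only builds the keyword arguments, so it can be checked without starting a DHT.

```python
import os
from typing import Mapping, Optional


def strategy_kwargs(
    env: Optional[Mapping[str, str]] = None,
    target_batch_size: int = 2048,
) -> dict:
    """Build keyword arguments for HivemindStrategy (hypothetical helper).

    If INITIAL_PEERS is set and non-empty, it is passed on explicitly.
    Otherwise it is omitted, so this machine starts as the first peer.
    """
    env = os.environ if env is None else env
    kwargs = {"target_batch_size": target_batch_size}
    peers = env.get("INITIAL_PEERS", "").strip()
    if peers:
        kwargs["initial_peers"] = peers
    return kwargs
```

With this in place, the Trainer call on either machine would read `strategy=HivemindStrategy(**strategy_kwargs())`, and only the environment differs between the two.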

The other machine, however, fails:

/opt/conda/lib/python3.10/site-packages/pl_bolts/callbacks/data_monitor.py:20: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  warn_missing_pkg("wandb")
/opt/conda/lib/python3.10/site-packages/pl_bolts/utils/semi_supervised.py:15: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  warn_missing_pkg("sklearn", pypi_name="scikit-learn")
/opt/conda/lib/python3.10/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:35: UnderReviewWarning: The feature generate_power_seq is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  "lr_options": generate_power_seq(LEARNING_RATE_CIFAR, 11),
/opt/conda/lib/python3.10/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:93: UnderReviewWarning: The feature FeatureMapContrastiveTask is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  contrastive_task: Union[FeatureMapContrastiveTask] = FeatureMapContrastiveTask("01, 02, 11"),
/opt/conda/lib/python3.10/site-packages/pl_bolts/losses/self_supervised_learning.py:234: UnderReviewWarning: The feature AmdimNCELoss is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  self.nce_loss = AmdimNCELoss(tclip)
/opt/conda/lib/python3.10/site-packages/pl_bolts/datamodules/experience_source.py:18: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  warn_missing_pkg("gym")
/opt/conda/lib/python3.10/site-packages/pl_bolts/datamodules/sklearn_datamodule.py:15: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  warn_missing_pkg("sklearn")
Global seed set to 7
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
Traceback (most recent call last):
  File "/workspace/cifar10.py", line 122, in <module>
    strategy=HivemindStrategy(target_batch_size=2048,
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/hivemind.py", line 142, in __init__
    self.dht = hivemind.DHT(
  File "/opt/conda/lib/python3.10/site-packages/hivemind/dht/dht.py", line 88, in __init__
    self.run_in_background(await_ready=await_ready)
  File "/opt/conda/lib/python3.10/site-packages/hivemind/dht/dht.py", line 148, in run_in_background
    self.wait_until_ready(timeout)
  File "/opt/conda/lib/python3.10/site-packages/hivemind/dht/dht.py", line 151, in wait_until_ready
    self._ready.result(timeout=timeout)
  File "/opt/conda/lib/python3.10/site-packages/hivemind/utils/mpfuture.py", line 258, in result
    return super().result(timeout)
  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: Daemon failed to start: 2022/12/22 20:13:57 failed to parse multiaddr "": empty multiaddr
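
The daemon reports that it was handed an empty multiaddr (`""`). Which argument ended up empty is not established here. As a first sanity check, the sketch below splits a comma-separated peer string like the one in the script and flags empty or malformed entries. `check_peers` is a hypothetical helper, and the leading-slash rule is only a rough approximation of multiaddr syntax, not a full parser.

```python
from typing import List


def check_peers(initial_peers: str) -> List[str]:
    """Split a comma-separated peer string and reject suspicious entries.

    Hypothetical diagnostic helper: it checks only that every entry is
    non-empty and starts with "/", which is a rough multiaddr heuristic.
    """
    parts = [p.strip() for p in initial_peers.split(",")]
    problems = []
    for i, part in enumerate(parts):
        if not part:
            problems.append(f"entry {i} is empty")
        elif not part.startswith("/"):
            problems.append(f"entry {i} does not start with '/': {part!r}")
    if problems:
        raise ValueError("; ".join(problems))
    return parts
```

If the string passed to initial_peers clears this check, the empty value is more likely to come from another address argument or from the environment, which would narrow down where to look next.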

Environment

CPU machine:

* CUDA:
	- GPU:               None
	- available:         False
	- version:           11.7
* Lightning:
	- lightning-bolts:   0.6.0.post1
	- lightning-lite:    1.8.0
	- lightning-utilities: 0.3.0
	- pytorch-lightning: 1.8.0
	- torch:             1.13.1
	- torchmetrics:      0.10.0
	- torchvision:       0.14.1
* Packages:
	- absl-py:           1.3.0
	- accelerate:        0.15.0
	- aiohttp:           3.8.3
	- aiosignal:         1.3.1
	- asttokens:         2.2.1
	- async-timeout:     4.0.2
	- attrs:             21.2.0
	- automat:           20.2.0
	- babel:             2.8.0
	- backcall:          0.2.0
	- base58:            2.1.1
	- bcrypt:            3.2.0
	- blinker:           1.4
	- cachetools:        5.2.0
	- certifi:           2020.6.20
	- chardet:           4.0.0
	- charset-normalizer: 2.1.1
	- click:             8.0.3
	- cloud-init:        22.4.2
	- colorama:          0.4.4
	- command-not-found: 0.3
	- configargparse:    1.5.3
	- configobj:         5.0.6
	- constantly:        15.1.0
	- contourpy:         1.0.6
	- cryptography:      3.4.8
	- cycler:            0.11.0
	- datasets:          2.8.0
	- dbus-python:       1.2.18
	- decorator:         5.1.1
	- diffusers:         0.11.1
	- dill:              0.3.6
	- distro:            1.7.0
	- distro-info:       1.1build1
	- executing:         1.2.0
	- filelock:          3.8.2
	- fire:              0.5.0
	- fonttools:         4.38.0
	- frozenlist:        1.3.3
	- fsspec:            2022.11.0
	- ftfy:              6.1.1
	- google-auth:       2.15.0
	- google-auth-oauthlib: 0.4.6
	- grpcio:            1.51.1
	- grpcio-tools:      1.48.2
	- hivemind:          1.1.4
	- httplib2:          0.20.2
	- huggingface-hub:   0.11.1
	- hyperlink:         21.0.0
	- idna:              3.3
	- importlib-metadata: 4.6.4
	- incremental:       21.3.0
	- ipython:           8.7.0
	- jedi:              0.18.2
	- jeepney:           0.7.1
	- jinja2:            3.0.3
	- jsonpatch:         1.32
	- jsonpointer:       2.0
	- jsonschema:        3.2.0
	- keyring:           23.5.0
	- kiwisolver:        1.4.4
	- launchpadlib:      1.10.16
	- lazr.restfulclient: 0.14.4
	- lazr.uri:          1.0.6
	- lgg:               0.2.4
	- lightning-bolts:   0.6.0.post1
	- lightning-lite:    1.8.0
	- lightning-utilities: 0.3.0
	- markdown:          3.4.1
	- markupsafe:        2.1.1
	- matplotlib:        3.6.2
	- matplotlib-inline: 0.1.6
	- more-itertools:    8.10.0
	- msgpack:           1.0.4
	- multiaddr:         0.0.9
	- multidict:         6.0.3
	- multiprocess:      0.70.14
	- netaddr:           0.8.0
	- netifaces:         0.11.0
	- numpy:             1.24.0
	- nvidia-cublas-cu11: 11.10.3.66
	- nvidia-cuda-nvrtc-cu11: 11.7.99
	- nvidia-cuda-runtime-cu11: 11.7.99
	- nvidia-cudnn-cu11: 8.5.0.96
	- oauthlib:          3.2.0
	- packaging:         22.0
	- pandas:            1.5.2
	- parso:             0.8.3
	- pexpect:           4.8.0
	- pickleshare:       0.7.5
	- pillow:            9.3.0
	- pip:               22.0.2
	- prefetch-generator: 1.0.3
	- prompt-toolkit:    3.0.36
	- protobuf:          3.20.1
	- psutil:            5.9.4
	- ptyprocess:        0.7.0
	- pure-eval:         0.2.2
	- pyarrow:           10.0.1
	- pyasn1:            0.4.8
	- pyasn1-modules:    0.2.1
	- pydantic:          1.10.2
	- pygments:          2.13.0
	- pygobject:         3.42.1
	- pyhamcrest:        2.0.2
	- pyjwt:             2.3.0
	- pymultihash:       0.8.2
	- pyopenssl:         21.0.0
	- pyparsing:         2.4.7
	- pyrsistent:        0.18.1
	- pyserial:          3.5
	- python-apt:        2.3.0+ubuntu2.1
	- python-dateutil:   2.8.2
	- python-debian:     0.1.43ubuntu1
	- python-magic:      0.4.24
	- pytorch-lightning: 1.8.0
	- pytz:              2022.1
	- pyyaml:            5.4.1
	- regex:             2022.10.31
	- requests:          2.25.1
	- requests-oauthlib: 1.3.1
	- responses:         0.18.0
	- rsa:               4.9
	- scipy:             1.9.3
	- seaborn:           0.12.1
	- secretstorage:     3.3.1
	- service-identity:  18.1.0
	- setuptools:        59.6.0
	- six:               1.16.0
	- sortedcontainers:  2.4.0
	- sos:               4.4
	- ssh-import-id:     5.11
	- stack-data:        0.6.2
	- systemd-python:    234
	- tensorboard:       2.11.0
	- tensorboard-data-server: 0.6.1
	- tensorboard-plugin-wit: 1.8.1
	- tensorboardx:      2.5.1
	- termcolor:         2.1.1
	- tokenizers:        0.13.2
	- torch:             1.13.1
	- torchmetrics:      0.10.0
	- torchvision:       0.14.1
	- tqdm:              4.64.1
	- traitlets:         5.8.0
	- transformers:      4.25.1
	- twisted:           22.1.0
	- typing-extensions: 4.4.0
	- ubuntu-advantage-tools: 27.12
	- ubuntu-drivers-common: 0.0.0
	- ufw:               0.36.1
	- unattended-upgrades: 0.1
	- urllib3:           1.26.5
	- uvloop:            0.17.0
	- varint:            1.0.2
	- wadllib:           1.3.6
	- wcwidth:           0.2.5
	- werkzeug:          2.2.2
	- wheel:             0.37.1
	- xkit:              0.0.0
	- xxhash:            3.1.0
	- yarl:              1.8.2
	- zipp:              1.0.0
	- zope.interface:    5.4.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- 
	- processor:         x86_64
	- python:            3.10.6
	- version:           #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022

GPU machine:

* CUDA:
        - GPU:
                - NVIDIA GeForce RTX 3090
        - available:         True
        - version:           11.6
* Lightning:
        - lightning-bolts:   0.6.0.post1
        - lightning-utilities: 0.5.0
        - pytorch-lightning: 1.8.6
        - torch:             1.13.1
        - torchelastic:      0.2.2
        - torchmetrics:      0.10.0
        - torchtext:         0.14.1
        - torchvision:       0.14.1
* Packages:
        - absl-py:           1.3.0
        - accelerate:        0.15.0
        - aiohttp:           3.8.3
        - aiosignal:         1.3.1
        - anyio:             3.6.2
        - argon2-cffi:       21.3.0
        - argon2-cffi-bindings: 21.2.0
        - arrow:             1.2.3
        - asttokens:         2.0.5
        - astunparse:        1.6.3
        - async-timeout:     4.0.2
        - attrs:             22.1.0
        - babel:             2.11.0
        - backcall:          0.2.0
        - base58:            2.1.1
        - bash-kernel:       0.9.0
        - beautifulsoup4:    4.11.1
        - bleach:            5.0.1
        - brotlipy:          0.7.0
        - cachetools:        5.2.0
        - certifi:           2022.9.24
        - cffi:              1.15.1
        - chardet:           4.0.0
        - charset-normalizer: 2.0.4
        - comm:              0.1.2
        - conda:             22.11.1
        - conda-build:       3.23.3
        - conda-package-handling: 1.9.0
        - configargparse:    1.5.3
        - contourpy:         1.0.6
        - cryptography:      38.0.1
        - cycler:            0.11.0
        - datasets:          2.8.0
        - debugpy:           1.6.4
        - decorator:         5.1.1
        - defusedxml:        0.7.1
        - diffusers:         0.11.1
        - dill:              0.3.6
        - dnspython:         2.2.1
        - entrypoints:       0.4
        - exceptiongroup:    1.0.4
        - executing:         0.8.3
        - expecttest:        0.1.4
        - fastjsonschema:    2.16.2
        - filelock:          3.6.0
        - flit-core:         3.6.0
        - fonttools:         4.38.0
        - fqdn:              1.5.1
        - frozenlist:        1.3.3
        - fsspec:            2022.11.0
        - ftfy:              6.1.1
        - future:            0.18.2
        - glob2:             0.7
        - google-auth:       2.15.0
        - google-auth-oauthlib: 0.4.6
        - grpcio:            1.51.1
        - grpcio-tools:      1.48.2
        - hivemind:          1.1.4
        - huggingface-hub:   0.11.1
        - hypothesis:        6.61.0
        - idna:              3.4
        - importlib-metadata: 5.2.0
        - iniconfig:         1.1.1
        - ipykernel:         6.19.4
        - ipython:           8.7.0
        - ipython-genutils:  0.2.0
        - ipywidgets:        8.0.3
        - isoduration:       20.11.0
        - jedi:              0.18.1
        - jinja2:            3.1.2
        - json5:             0.9.10
        - jsonpointer:       2.3
        - jsonschema:        4.17.3
        - jupyter:           1.0.0
        - jupyter-archive:   3.3.3
        - jupyter-client:    7.4.8
        - jupyter-console:   6.4.4
        - jupyter-core:      5.1.0
        - jupyter-events:    0.5.0
        - jupyter-http-over-ws: 0.0.8
        - jupyter-server:    1.23.4
        - jupyter-server-terminals: 0.4.3
        - jupyterlab:        3.5.2
        - jupyterlab-pygments: 0.2.2
        - jupyterlab-server: 2.16.5
        - jupyterlab-widgets: 3.0.4
        - kiwisolver:        1.4.4
        - lgg:               0.2.4
        - libarchive-c:      2.9
        - lightning-bolts:   0.6.0.post1
        - lightning-utilities: 0.5.0
        - markdown:          3.4.1
        - markupsafe:        2.1.1
        - matplotlib:        3.6.2
        - matplotlib-inline: 0.1.6
        - mistune:           2.0.4
        - mkl-fft:           1.3.1
        - mkl-random:        1.2.2
        - mkl-service:       2.4.0
        - mpmath:            1.2.1
        - msgpack:           1.0.4
        - multiaddr:         0.0.9
        - multidict:         6.0.3
        - multiprocess:      0.70.14
        - nbclassic:         0.4.8
        - nbclient:          0.7.2
        - nbconvert:         7.2.7
        - nbformat:          5.7.1
        - nbzip:             0.1.0
        - nest-asyncio:      1.5.6
        - netaddr:           0.8.0
        - notebook:          6.5.2
        - notebook-shim:     0.2.2
        - numpy:             1.22.3
        - oauthlib:          3.2.2
        - packaging:         22.0
        - pandas:            1.5.2
        - pandocfilters:     1.5.0
        - parso:             0.8.3
        - pexpect:           4.8.0
        - pickleshare:       0.7.5
        - pillow:            9.3.0
        - pip:               22.3.1
        - pkginfo:           1.8.3
        - platformdirs:      2.6.0
        - pluggy:            1.0.0
        - prefetch-generator: 1.0.3
        - prometheus-client: 0.15.0
        - prompt-toolkit:    3.0.20
        - protobuf:          3.20.1
        - psutil:            5.9.0
        - ptyprocess:        0.7.0
        - pure-eval:         0.2.2
        - pyarrow:           10.0.1
        - pyasn1:            0.4.8
        - pyasn1-modules:    0.2.8
        - pycosat:           0.6.4
        - pycparser:         2.21
        - pydantic:          1.10.2
        - pygments:          2.11.2
        - pymultihash:       0.8.2
        - pyopenssl:         22.0.0
        - pyparsing:         3.0.9
        - pyrsistent:        0.19.2
        - pysocks:           1.7.1
        - pytest:            7.2.0
        - python-dateutil:   2.8.2
        - python-etcd:       0.4.5
        - python-json-logger: 2.0.4
        - pytorch-lightning: 1.8.6
        - pytz:              2022.1
        - pyyaml:            6.0
        - pyzmq:             24.0.1
        - qtconsole:         5.4.0
        - qtpy:              2.3.0
        - regex:             2022.10.31
        - requests:          2.28.1
        - requests-oauthlib: 1.3.1
        - responses:         0.18.0
        - rfc3339-validator: 0.1.4
        - rfc3986-validator: 0.1.1
        - rsa:               4.9
        - ruamel.yaml:       0.17.21
        - ruamel.yaml.clib:  0.2.6
        - scipy:             1.9.3
        - seaborn:           0.12.1
        - send2trash:        1.8.0
        - setuptools:        65.5.0
        - six:               1.16.0
        - sniffio:           1.3.0
        - sortedcontainers:  2.4.0
        - soupsieve:         2.3.2.post1
        - stack-data:        0.2.0
        - sympy:             1.11.1
        - tensorboard:       2.11.0
        - tensorboard-data-server: 0.6.1
        - tensorboard-plugin-wit: 1.8.1
        - tensorboardx:      2.5.1
        - terminado:         0.17.1
        - tinycss2:          1.2.1
        - tokenizers:        0.13.2
        - toml:              0.10.2
        - tomli:             2.0.1
        - toolz:             0.12.0
        - torch:             1.13.1
        - torchelastic:      0.2.2
        - torchmetrics:      0.10.0
        - torchtext:         0.14.1
        - torchvision:       0.14.1
        - tornado:           6.2
        - tqdm:              4.64.1
        - traitlets:         5.7.1
        - transformers:      4.25.1
        - types-dataclasses: 0.6.6
        - typing-extensions: 4.4.0
        - uri-template:      1.2.0
        - urllib3:           1.26.13
        - uvloop:            0.17.0
        - varint:            1.0.2
        - wcwidth:           0.2.5
        - webcolors:         1.12
        - webencodings:      0.5.1
        - websocket-client:  1.4.2
        - werkzeug:          2.2.2
        - wheel:             0.37.1
        - widgetsnbextension: 4.0.4
        - xxhash:            3.1.0
        - yarl:              1.8.2
        - zipp:              3.11.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - 
        - processor:         x86_64
        - python:            3.10.8
        - version:           #142~18.04.1-Ubuntu SMP Thu Sep 1 16:25:16 UTC 2022
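
The two listings are not identical: for example, pytorch-lightning is 1.8.0 on the CPU machine and 1.8.6 on the GPU machine, and lightning-utilities is 0.3.0 versus 0.5.0. Whether that matters for this failure is not established. A small sketch for surfacing such differences automatically is given below; `parse_versions` and `version_mismatches` are hypothetical helpers that assume the `- name: version` line format used above.

```python
import re
from typing import Dict, Tuple

# Matches lines such as "\t- hivemind:          1.1.4" (a single-token value).
_LINE = re.compile(r"^\s*-\s+([A-Za-z0-9_.\-]+):\s+(\S+)\s*$")


def parse_versions(listing: str) -> Dict[str, str]:
    """Collect name -> value pairs from an environment listing."""
    versions = {}
    for line in listing.splitlines():
        match = _LINE.match(line)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


def version_mismatches(a: str, b: str) -> Dict[str, Tuple[str, str]]:
    """Return entries present in both listings whose values differ."""
    va, vb = parse_versions(a), parse_versions(b)
    return {
        name: (va[name], vb[name])
        for name in sorted(va.keys() & vb.keys())
        if va[name] != vb[name]
    }
```

Passing the CPU listing as `a` and the GPU listing as `b` gives a dictionary of `(cpu_value, gpu_value)` pairs for every entry that differs.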

More info

The code I used for training is here.
This CIFAR-10 example worked perfectly fine.

carmocca added the bug Something isn't working label Dec 23, 2022

carmocca mentioned this issue Jan 17, 2023

Remove the HivemindStrategy Lightning-AI/pytorch-lightning#16407

Merged

Borda transferred this issue from Lightning-AI/pytorch-lightning May 3, 2023

Borda assigned justusschock Sep 1, 2023

Lightning-Universe deleted a comment from github-actions bot Jul 3, 2024
