Unable to utilize multiple gpus config_.device < getNumDevices()' failed: Invalid GPU device #3550

Open
2 of 4 tasks
FalsoMoralista opened this issue Jun 25, 2024 · 3 comments

FalsoMoralista commented Jun 25, 2024

I'm trying to replace the CPU index with a GPU one, but I can't get it to work in a distributed (multi-process) context.

Faiss version:
faiss 1.8.0 pypi_0 pypi
faiss-gpu 1.8.0 py3.12_h4c7d538_0_cuda12.1.1 pytorch

Installed from miniconda.

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Context

After initializing the K-means centroids for each value of K, I try to replace the default (CPU) index with a GPU one. This works with a single GPU device but fails when using multiple devices.

import faiss
import torch


class KMeansModule:

    def __init__(self, nb_classes, dimensionality=256, n_iter=50, tol=1e-4, k_range=(2, 3, 4, 5), resources=None, config=None):
        self.k_range = list(k_range)
        self.d = dimensionality
        self.max_iter = n_iter
        self.tol = tol

        # For each class, create n K-means objects (one for each value of K), where n = len(k_range)
        # (this will be used to select the best K). The list is always nested, so
        # self.n_kmeans[class_id][k] is valid even when len(k_range) == 1.
        verbose = len(self.k_range) == 1
        self.n_kmeans = [
            [faiss.Kmeans(d=dimensionality, k=k, niter=1, verbose=verbose, min_points_per_centroid=1) for k in self.k_range]
            for _ in range(nb_classes)
        ]

    def initialize_centroids(self, batch_x, class_id, resources, rank, device, config, cached_features):
        image_list = cached_features[class_id] # Use the features cached from the previous epoch                
        batch_x = torch.stack(image_list)

        # For each K (model selection)
        for k in range(len(self.k_range)):
            self.n_kmeans[class_id][k].train(batch_x.detach().cpu()) # Train K-means model for one iteration to initialize centroids 

            # Replace the regular index by a gpu one
            index_flat = self.n_kmeans[class_id][k].index

            gpu_index_flat = faiss.index_cpu_to_gpu(resources, rank, index_flat)
            self.n_kmeans[class_id][k].index = gpu_index_flat

res = faiss.StandardGpuResources()
k_means_module.initialize_centroids(batch_x=None, class_id=class_id, resources=res, rank=rank, device=device, config=None, cached_features=cached_features)

Each rank (0, 1, ..., 7) is used as the GPU device id of the corresponding process.

Output

RuntimeError: Error in faiss::gpu::GpuIndex::GpuIndex(std::shared_ptr<faiss::gpu::GpuResources>, int, faiss::MetricType, float, faiss::gpu::GpuIndexConfig) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244513520/work/faiss/gpu/GpuIndex.cu:65: Error: 'config_.device < getNumDevices()' failed: Invalid GPU device 7
Process Process-4:
Traceback (most recent call last):
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/main_deeper_cluster.py", line 52, in process_main
    app_main(args=params)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/engine_deeper_cluster.py", line 401, in main
    k_means_module.init(resources=res, rank=rank, cached_features=cached_features_last_epoch, config=cfg, device=device) # E-step
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 98, in init
    self.initialize_centroids(batch_x=None,
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 92, in initialize_centroids
    gpu_index_flat = faiss.index_cpu_to_gpu(resources, rank, index_flat)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/site-packages/faiss/swigfaiss_avx2.py", line 12799, in index_cpu_to_gpu
    return _swigfaiss_avx2.index_cpu_to_gpu(provider, device, index, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error in faiss::gpu::GpuIndex::GpuIndex(std::shared_ptr<faiss::gpu::GpuResources>, int, faiss::MetricType, float, faiss::gpu::GpuIndexConfig) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244513520/work/faiss/gpu/GpuIndex.cu:65: Error: 'config_.device < getNumDevices()' failed: Invalid GPU device 3
Process Process-7:
Traceback (most recent call last):
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/main_deeper_cluster.py", line 52, in process_main
    app_main(args=params)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/engine_deeper_cluster.py", line 401, in main
    k_means_module.init(resources=res, rank=rank, cached_features=cached_features_last_epoch, config=cfg, device=device) # E-step
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 98, in init
    self.initialize_centroids(batch_x=None,
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 92, in initialize_centroids
    gpu_index_flat = faiss.index_cpu_to_gpu(resources, rank, index_flat)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/site-packages/faiss/swigfaiss_avx2.py", line 12799, in index_cpu_to_gpu
    return _swigfaiss_avx2.index_cpu_to_gpu(provider, device, index, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error in faiss::gpu::GpuIndex::GpuIndex(std::shared_ptr<faiss::gpu::GpuResources>, int, faiss::MetricType, float, faiss::gpu::GpuIndexConfig) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244513520/work/faiss/gpu/GpuIndex.cu:65: Error: 'config_.device < getNumDevices()' failed: Invalid GPU device 6
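
The failing check, `config_.device < getNumDevices()`, compares the requested device id with the number of GPUs the current process can see. A minimal sketch of guarding against that, assuming the global rank can be mapped onto a visible device by taking it modulo `faiss.get_num_gpus()` (the helper names `local_device_for_rank` and `cpu_index_to_rank_gpu` are hypothetical, not part of Faiss):

```python
def local_device_for_rank(rank: int, num_visible_gpus: int) -> int:
    """Map a global rank to a device id that exists in this process."""
    if num_visible_gpus <= 0:
        raise RuntimeError("no GPU is visible to this process")
    # Assumption: ranks are spread round-robin over the visible devices.
    return rank % num_visible_gpus


def cpu_index_to_rank_gpu(cpu_index, resources, rank: int):
    """Clone cpu_index onto the GPU chosen for this rank."""
    import faiss  # imported here so the helper above stays usable without Faiss

    device = local_device_for_rank(rank, faiss.get_num_gpus())
    return faiss.index_cpu_to_gpu(resources, device, cpu_index)
```

If a process reports only one visible GPU, this always selects device 0 for it, whatever its rank. Whether that is the right mapping depends on how the processes are launched.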

Attempts

I have also tried the approach from DeeperCluster (https://github.com/facebookresearch/DeeperCluster/blob/main/src/distributed_kmeans.py#L182), expecting that each process would initialize its own resources and set the device number accordingly, but the same error occurs.

resources = faiss.StandardGpuResources()
config = faiss.GpuIndexFlatConfig()
config.device = rank

# Replace the regular index with a GPU one
index_flat = self.n_kmeans[class_id][k].index
gpu_index_flat = faiss.GpuIndexFlatL2(resources, self.d, config)
gpu_index_flat.add(index_flat.reconstruct_n(0, index_flat.ntotal))  # copy the trained centroids over
self.n_kmeans[class_id][k].index = gpu_index_flat

Output

  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/main_deeper_cluster.py", line 52, in process_main
    app_main(args=params)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/engine_deeper_cluster.py", line 401, in main
    logger.info('Initializing centroids...')
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 98, in init
    self.initialize_centroids(batch_x=None,
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 91, in initialize_centroids
    gpu_index_flat = faiss.GpuIndexFlatL2(resources, self.d, config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/site-packages/faiss/swigfaiss_avx2.py", line 11575, in __init__
    _swigfaiss_avx2.GpuIndexFlatL2_swiginit(self, _swigfaiss_avx2.new_GpuIndexFlatL2(*args))
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error in faiss::gpu::GpuIndex::GpuIndex(std::shared_ptr<faiss::gpu::GpuResources>, int, faiss::MetricType, float, faiss::gpu::GpuIndexConfig) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244513520/work/faiss/gpu/GpuIndex.cu:65: Error: 'config_.device < getNumDevices()' failed: Invalid GPU device 7
Process Process-6:
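
Since the same assertion fires even with per-process resources, it may help to log what each worker can actually see. `CUDA_VISIBLE_DEVICES` restricts the GPUs exposed to a process and renumbers them from 0, so a launcher that sets it per process would leave only device 0 valid in each worker. Whether that is happening here is an assumption to verify. A small diagnostic sketch (the function names are hypothetical):

```python
import os


def parse_visible_devices(value):
    """Return the entries of a CUDA_VISIBLE_DEVICES string, or None if it is unset."""
    if value is None:
        return None  # unset: every GPU on the machine is visible
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def describe_gpu_visibility(rank):
    """Summarize what this process can see, for logging in each worker."""
    import faiss  # imported lazily so parse_visible_devices works without Faiss

    visible = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))
    return (
        f"rank={rank} CUDA_VISIBLE_DEVICES={visible} "
        f"faiss.get_num_gpus()={faiss.get_num_gpus()}"
    )
```

Calling `describe_gpu_visibility(rank)` in each process just before the index is built shows whether `rank` is below the reported GPU count.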

I have also tried the solution proposed in the Faiss wiki (https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU):

resources = [faiss.StandardGpuResources() for _ in range(world_size)]

index_flat = self.n_kmeans[class_id][k].index
gpu_index_flat = faiss.index_cpu_to_gpu_multiple(resources, devices=[0,1,2,3,4,5,6,7], index=index_flat)
self.n_kmeans[class_id][k].index = gpu_index_flat

This raises:

  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/engine_deeper_cluster.py", line 401, in main
    k_means_module.init(resources=resources, rank=rank, cached_features=cached_features_last_epoch, config=None, device=device) # E-step
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 100, in init
    self.initialize_centroids(batch_x=None,
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 93, in initialize_centroids
    gpu_index_flat = faiss.index_cpu_to_gpu_multiple(resources, devices=[0,1,2,3,4,5,6,7],index=index_flat)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/site-packages/faiss/swigfaiss_avx2.py", line 12802, in index_cpu_to_gpu_multiple
    return _swigfaiss_avx2.index_cpu_to_gpu_multiple(provider, devices, index, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Wrong number or type of arguments for overloaded function 'index_cpu_to_gpu_multiple'.
  Possible C/C++ prototypes are:
    faiss::gpu::index_cpu_to_gpu_multiple(std::vector< faiss::gpu::GpuResourcesProvider * > &,std::vector< int > &,faiss::Index const *,faiss::gpu::GpuMultipleClonerOptions const *)
    faiss::gpu::index_cpu_to_gpu_multiple(std::vector< faiss::gpu::GpuResourcesProvider * > &,std::vector< int > &,faiss::Index const *)
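
The prototypes listed in the `TypeError` take C++ vectors, so plain Python lists (and the `devices=`/`index=` keywords) do not match the SWIG-level function. The Faiss Python package also ships a wrapper, `faiss.index_cpu_to_gpu_multiple_py`, which builds those vectors from Python lists. A minimal sketch, assuming every id in `gpu_ids` is visible to the calling process (`check_gpu_ids` and `clone_to_gpus` are hypothetical helpers):

```python
def check_gpu_ids(gpu_ids, num_visible_gpus):
    """Reject device ids that the current process cannot address."""
    bad = [g for g in gpu_ids if not 0 <= g < num_visible_gpus]
    if bad:
        raise ValueError(f"GPU ids {bad} are not visible ({num_visible_gpus} GPUs found)")
    return list(gpu_ids)


def clone_to_gpus(cpu_index, gpu_ids):
    """Clone cpu_index across several GPUs via the Python-level wrapper."""
    import faiss  # imported lazily so check_gpu_ids works without Faiss

    gpus = check_gpu_ids(gpu_ids, faiss.get_num_gpus())
    # One StandardGpuResources object per target device.
    resources = [faiss.StandardGpuResources() for _ in gpus]
    return faiss.index_cpu_to_gpu_multiple_py(resources, cpu_index, gpus=gpus)
```

For example, `clone_to_gpus(index_flat, range(faiss.get_num_gpus()))` spreads the index over all visible devices. `faiss.index_cpu_to_all_gpus(index_flat)` is a shorter route to a similar result.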

FalsoMoralista commented Jun 25, 2024

I also tried this gist (from issue #878), without success: https://gist.github.com/mdouze/bfa06e7dc0869f0c0495928aab25800f

brendon-ribeiro918 commented

It depends on which devices you use.

FalsoMoralista commented

> It depends on which devices you use.

What do you mean?
