Evaluate impact of JIT deserialization during CUDA object spilling #162

Open
beckernick opened this issue Jan 13, 2021 · 4 comments

@beckernick commented Jan 13, 2021

This functionality is now optionally available in nightlies, and we should evaluate how this affects performance (particularly with UCX).
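
For reference, here is a minimal sketch of turning the feature on when building a cluster; the `jit_unspill` keyword on `LocalCUDACluster` is assumed here to be the programmatic equivalent of exporting `DASK_JIT_UNSPILL=True` in the worker environment.

```python
# Minimal sketch: start a CUDA cluster with JIT unspill enabled.
# The jit_unspill keyword is assumed to mirror DASK_JIT_UNSPILL=True.
from dask_cuda import LocalCUDACluster
from distributed import Client

cluster = LocalCUDACluster(jit_unspill=True)
client = Client(cluster)
```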

@beckernick changed the title from "Evaluate memory impact of JIT deserialization during CUDA object spilling" to "Evaluate impact of JIT deserialization during CUDA object spilling" on Jan 13, 2021
@beckernick self-assigned this on Jan 22, 2021
@beckernick commented Jan 22, 2021

As a baseline, we ran every query 5 times on a standard cluster of 8 GPUs on a DGX-2, with a 15GB device memory limit and a 30GB RMM pool.
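
Roughly, that configuration corresponds to the sketch below (not the exact launch command used on the DGX-2; the keyword values are just the ones quoted above).

```python
# Rough sketch of the baseline worker setup: 8 GPUs of a DGX-2, each worker
# with a 15GB device memory limit and a 30GB RMM pool; jit_unspill is the
# knob being compared across runs.
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7",  # 8 of the DGX-2's 16 GPUs
    device_memory_limit="15GB",
    rmm_pool_size="30GB",
    jit_unspill=False,  # baseline; flipped to True for the JIT-unspill runs
)
```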

With DASK_JIT_UNSPILL=True, Q02 hit memory issues in some runs with UCX, and hit the following error in other runs with TCP:

QUERY=02; cd queries/q$QUERY; python tpcx_bb_query_$QUERY\.py --config_file ../../benchmark_runner/benchmark_config.yaml ; cd ../../
Using default arguments
{
  "type": "Scheduler",
  "id": "Scheduler-d57bdb26-1dd2-4478-82f7-92fb51a39c09",
  "address": "tcp://10.33.228.70:8786",
  "services": {
    "dashboard": 8787
  },
  "started": 1611349272.8086827,
  "workers": {}
}
Connected!
Encountered Exception while running query
Traceback (most recent call last):
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/xbb_tools/utils.py", line 280, in run_dask_cudf_query
    config=config,
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/xbb_tools/utils.py", line 61, in benchmark
    result = func(*args, **kwargs)
  File "tpcx_bb_query_02.py", line 143, in main
    result_df = result_df.head(q02_limit)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/dask/dataframe/core.py", line 1036, in head
    return self._head(n=n, npartitions=npartitions, compute=compute, safe=True)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/dask/dataframe/core.py", line 1069, in _head
    result = result.compute()
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/dask/base.py", line 279, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/dask/base.py", line 561, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/client.py", line 2681, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/client.py", line 1996, in gather
    asynchronous=asynchronous,
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/client.py", line 837, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/client.py", line 1855, in _gather
    raise exception.with_traceback(traceback)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/dask/dataframe/shuffle.py", line 1162, in shuffle_group
    ind = hash_object_dispatch(df[cols] if cols else df, index=False)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/series.py", line 906, in __getitem__
    return self._get_with(key)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/series.py", line 946, in _get_with
    return self.loc[key]
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/indexing.py", line 879, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/indexing.py", line 1099, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/indexing.py", line 1037, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/indexing.py", line 1298, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['wcs_user_sk'], dtype='object')] are in the [index]"
conda list | grep "rapids\|blazing\|dask\|distr\|pandas"
# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122:
blazingsql                0.18.0a0                 pypi_0    pypi
cudf                      0.18.0a210122   cuda_10.2_py37_g6c116e382f_191    rapidsai-nightly
cuml                      0.18.0a210122   cuda10.2_py37_g29f7d08e9_82    rapidsai-nightly
dask                      2021.1.0           pyhd8ed1ab_0    conda-forge
dask-core                 2021.1.0           pyhd8ed1ab_0    conda-forge
dask-cuda                 0.18.0a201211           py37_39    http://conda-mirror.gpuci.io/rapidsai-nightly
dask-cudf                 0.18.0a210122   py37_g6c116e382f_191    http://conda-mirror.gpuci.io/rapidsai-nightly
distributed               2021.1.0         py37h89c1867_1    conda-forge
faiss-proc                1.0.0                      cuda    http://conda-mirror.gpuci.io/rapidsai-nightly
libcudf                   0.18.0a210122   cuda10.2_g6c116e382f_191    rapidsai-nightly
libcuml                   0.18.0a210122   cuda10.2_g29f7d08e9_82    rapidsai-nightly
libcumlprims              0.18.0a201203   cuda10.2_gff080f3_0    http://conda-mirror.gpuci.io/rapidsai-nightly
librmm                    0.18.0a210122   cuda10.2_g1502058_24    rapidsai-nightly
pandas                    1.1.5            py37hdc94413_0    conda-forge
rmm                       0.18.0a210122   cuda_10.2_py37_g1502058_24    http://conda-mirror.gpuci.io/rapidsai-nightly
ucx                       1.9.0+gcd9efd3       cuda10.2_0    http://conda-mirror.gpuci.io/rapidsai-nightly
ucx-proc                  1.0.0                       gpu    http://conda-mirror.gpuci.io/rapidsai-nightly
ucx-py                    0.18.0a210122   py37_gcd9efd3_10    http://conda-mirror.gpuci.io/rapidsai-nightly

@madsbk commented Jan 27, 2021

Sorry for the late reply, I wasn't aware of this issue :/
For some reason rapidsai-nightly contains an old version of dask-cuda (0.18.0a201211), so setting DASK_JIT_UNSPILL=True uses the old JIT spilling from last year.
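
A quick sanity check for which dask-cuda build an environment actually resolved (the 0.18.0a201211 build mentioned above predates the new JIT unspill):

```python
# Print the dask-cuda version actually installed in the environment.
import dask_cuda
print(dask_cuda.__version__)
```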

Having said that, I am debugging a deadlock with the new JIT spilling that Q02 triggers when the device limit is 15GB (as opposed to the 20GB, which I have been using when testing). Will let you know when I have a fix.

@VibhuJawa

CC: @ChrisJar for awareness.

rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this issue Jan 27, 2021
Fixes a deadlock where multiple threads access `ProxifyHostFile.maybe_evict()` but none of them can acquire both the `ProxifyHostFile` lock and the `ProxyObject` lock simultaneously.

Should fix rapidsai/gpu-bdb#162 (comment)

Authors:
  - Mads R. B. Kristensen (@madsbk)

Approvers:
  - Peter Andreas Entschev (@pentschev)

URL: #501
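
Not the actual dask-cuda code, but a minimal illustration of the lock-ordering problem the commit describes, and of the usual remedy of always taking the two locks in the same order:

```python
# Hedged sketch: two locks standing in for the ProxifyHostFile lock and a
# ProxyObject lock. If one thread took host_file_lock first while another took
# proxy_lock first, neither could acquire the second lock it needs and both
# would block forever. Taking the locks in one fixed order avoids that.
import threading

host_file_lock = threading.Lock()  # stands in for the ProxifyHostFile lock
proxy_lock = threading.Lock()      # stands in for a ProxyObject lock

def evict():
    # Container lock first, then the per-object lock.
    with host_file_lock:
        with proxy_lock:
            pass  # e.g. spill the proxied device buffer to host memory

def unspill():
    # Same order here; taking proxy_lock first could deadlock against evict().
    with host_file_lock:
        with proxy_lock:
            pass  # e.g. bring the proxied data back onto the device

threads = [threading.Thread(target=f) for f in (evict, unspill)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
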
@madsbk commented Jan 28, 2021

> Having said that, I am debugging a deadlock with the new JIT spilling that Q02 triggers when the device limit is 15GB (as opposed to the 20GB, which I have been using when testing). Will let you know when I have a fix.

The deadlock issue should be fixed in the latest version of dask-cuda (rapidsai/dask-cuda#501).
