[BUG] len after fillna operation uses way more memory than expected #2283

@ayushdg

Description

I have a DataFrame that occupies ~700-800 MB when persisted. When I fill all the nulls in the DataFrame using fillna and then call len on the result, I see an explosion in memory usage.

Reproducer:

# Create a dataframe with nulls in every column and write it to Parquet
import pandas as pd
import dask.dataframe

pdf = pd.DataFrame()
for i in range(80):
    pdf[str(i)] = pd.Series([12, None] * 100000)

ddf = dask.dataframe.from_pandas(pdf, npartitions=1)
ddf.to_parquet('temp_data.parquet')

# Read the dataframe back from the Parquet files with cudf,
# one delayed read per part file
import os
import dask
import dask_cudf
import cudf

path = 'temp_data.parquet/'
files = [fn for fn in os.listdir(path) if fn.endswith('.parquet')]
parts = [dask.delayed(cudf.io.parquet.read_parquet)(path=path + fn)
         for fn in files]

temp = dask_cudf.from_delayed(parts)
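
(As an aside, a possibly simpler way to build the same collection would be dask_cudf's own Parquet reader, assuming dask_cudf.read_parquet exists in this build; I have not verified it on these commits.)

# Hypothetical alternative: let dask_cudf discover the part files itself.
# Assumes dask_cudf.read_parquet is available in the installed version.
temp = dask_cudf.read_parquet('temp_data.parquet/')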

Now when I call len(temp), nvidia-smi shows memory usage peaking at:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   47C    P0    28W /  70W |    685MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   33C    P8    10W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   32C    P8    10W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   32C    P8     9W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    404162      C   /conda/envs/rapids/bin/python                841MiB |
+-----------------------------------------------------------------------------+

Now for the fillna operation:

%%time
for col in temp.columns:
    temp[col] = temp[col].fillna(-1)

CPU times: user 35.6 s, sys: 1.26 s, total: 36.8 s
Wall time: 38.7 s (which is slow)

(There is no change in memory usage, which leads me to believe the fillna calls are only being recorded lazily in the task graph and have not yet been executed on the full data.)
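
As a sanity check, here is a sketch (assuming DataFrame-level fillna and persist work with cudf-backed partitions the same way they do in plain dask.dataframe) that fills every column in a single call and forces execution immediately, so the real memory cost would show up at this step instead of at the later len():

# Sketch only: single graph-level fillna instead of a per-column loop.
filled = temp.fillna(-1)
# persist() forces the computation now; any memory blow-up should appear here.
filled = filled.persist()
print(len(filled))  # should be cheap once the data is materialized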

Finally:
len(temp)

Nvidia-smi usage

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   46C    P0    28W /  70W |  13681MiB / 15079MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   33C    P8    10W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   32C    P8     9W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   32C    P8     9W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    404162      C   /conda/envs/rapids/bin/python              13755MiB |
+-----------------------------------------------------------------------------+

This is more than a 16x spike in memory usage (the Python process goes from 841 MiB to 13755 MiB). I'm not sure if my approach is wrong or if there is some other underlying issue.

Environment Info
cudf: built from source at commit 79af3a8806bbe01a
dask-cudf: built from source at commit 24798dd8cf9502
