Description
I have a dataframe that occupies ~700-800 MB when persisted. I fill all the nulls in the dataframe with fillna and then call len on the new dataframe, and I notice an explosion in GPU memory usage.
Reproducer:
# Create a dataframe and write to file
import numpy as np
import pandas as pd
import dask.dataframe
pdf = pd.DataFrame()
for i in range(80):
    pdf[str(i)] = pd.Series([12, None] * 100000)
ddf = dask.dataframe.from_pandas(pdf, npartitions=1)
ddf.to_parquet('temp_data.parquet')
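For reference, the host-side footprint of this reproducer frame can be checked before writing it out; it comes to roughly 128 MB, since the None values force every column to float64 (80 columns x 200,000 rows x 8 bytes):
print(pdf.memory_usage(deep=True).sum() / 1e6, 'MB')  # ~128 MB for the reproducer frame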
# Read the dataframe from file
import os
import dask
import dask_cudf
import cudf
path = 'temp_data.parquet/'
files = [fn for fn in os.listdir(path) if fn.endswith('.parquet')]
parts = [dask.delayed(cudf.io.parquet.read_parquet)(path=path + fn)
         for fn in files]
temp = dask_cudf.from_delayed(parts)
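(As a side note: assuming the installed dask_cudf build exposes read_parquet, the delayed construction above could presumably be replaced with a single call; this is just a sketch, not what the run below used.)
temp = dask_cudf.read_parquet(path)  # hypothetical one-call alternative to the delayed loop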
Now when I run len(temp), nvidia-smi reports usage peaking at the following state:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:3B:00.0 Off | 0 |
| N/A 47C P0 28W / 70W | 685MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:5E:00.0 Off | 0 |
| N/A 33C P8 10W / 70W | 10MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:AF:00.0 Off | 0 |
| N/A 32C P8 10W / 70W | 10MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:D8:00.0 Off | 0 |
| N/A 32C P8 9W / 70W | 10MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 404162 C /conda/envs/rapids/bin/python 841MiB |
+-----------------------------------------------------------------------------+
Now for the fillna operation:
%%time
for col in temp.columns:
    temp[col] = temp[col].fillna(-1)
CPU times: user 35.6 s, sys: 1.26 s, total: 36.8 s
Wall time: 38.7 s (which is slow)
(There is no change in memory usage here, which leads me to believe this operation is only applied at the task-graph/metadata level and not to the complete data.)
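Since dask only builds a task graph at this point, a minimal variant (my assumption about how to force the work eagerly, not what the run above did) is to fill the whole frame in one call and persist it, so the cost of fillna shows up here rather than inside the later len:
temp = temp.fillna(-1)   # one graph rewrite instead of 80 per-column assignments
temp = temp.persist()    # forces execution now and materializes the filled partitions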
Finally:
len(temp)
nvidia-smi usage:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:3B:00.0 Off | 0 |
| N/A 46C P0 28W / 70W | 13681MiB / 15079MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:5E:00.0 Off | 0 |
| N/A 33C P8 10W / 70W | 10MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:AF:00.0 Off | 0 |
| N/A 32C P8 9W / 70W | 10MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:D8:00.0 Off | 0 |
| N/A 32C P8 9W / 70W | 10MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 404162 C /conda/envs/rapids/bin/python 13755MiB |
+-----------------------------------------------------------------------------+
That is more than a 16x spike in memory usage (841 MiB → 13755 MiB for the same Python process). I am not sure whether my approach is wrong or whether there is some other underlying issue.
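For completeness, the numbers above were read off nvidia-smi by hand; a small sketch (assuming pynvml is available in the environment) for capturing the same figure programmatically around the len call:
import pynvml
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, the one running the Python process
used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
print(f"GPU 0 memory used: {used_mib:.0f} MiB")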
Environment Info
cudf: Built from source at commit: 79af3a8806bbe01a
dask-cudf: Built from source at commit 24798dd8cf9502