Labels
bug (Something isn't working)
Description
Describe the bug
I frequently run benchmarks in k8s and noticed that, on the PR for upgrading to DataFusion (DF) 52, the pod is being killed due to OOM:
comet-pr-3470-c3886859859-9c2jd 0/1 OOMKilled
I grep'd the log for errors and warnings and saw this output:
26/02/11 21:01:14 WARN DAGScheduler: Broadcasting large task binary with size 1077.4 KiB
26/02/11 21:01:21 WARN DAGScheduler: Broadcasting large task binary with size 1077.5 KiB
26/02/11 21:01:38 WARN DAGScheduler: Broadcasting large task binary with size 1204.7 KiB
The final few lines in the log were:
26/02/11 21:02:02 INFO CometExecIterator: memoryPoolType=fair_unified, offHeapSize=24576 MB, memoryFraction=1.0, memoryLimit=24576 MB, memoryLimitPerTask=3072 MB
26/02/11 21:02:02 INFO CometExecIterator: memoryPoolType=fair_unified, offHeapSize=24576 MB, memoryFraction=1.0, memoryLimit=24576 MB, memoryLimitPerTask=3072 MB
26/02/11 21:02:02 INFO CometExecIterator: memoryPoolType=fair_unified, offHeapSize=24576 MB, memoryFraction=1.0, memoryLimit=24576 MB, memoryLimitPerTask=3072 MB
26/02/11 21:02:02 INFO CometExecIterator: memoryPoolType=fair_unified, offHeapSize=24576 MB, memoryFraction=1.0, memoryLimit=24576 MB, memoryLimitPerTask=3072 MB
26/02/11 21:02:02 INFO ShuffleBlockFetcherIterator: Getting 48 (1843.0 KiB) non-empty blocks including 48 (1843.0 KiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
26/02/11 21:02:02 INFO ShuffleBlockFetcherIterator: Getting 48 (1846.8 KiB) non-empty blocks including 48 (1846.8 KiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
26/02/11 21:02:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
26/02/11 21:02:02 INFO ShuffleBlockFetcherIterator: Getting 48 (1850.7 KiB) non-empty blocks including 48 (1850.7 KiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
26/02/11 21:02:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
26/02/11 21:02:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
26/02/11 21:02:02 INFO ShuffleBlockFetcherIterator: Getting 48 (1846.8 KiB) non-empty blocks including 48 (1846.8 KiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
26/02/11 21:02:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
In this instance, it was running q21 when the OOM happened.
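As a quick sanity check on the CometExecIterator lines above, the following sketch parses one of them and compares the pool limit with the per-task limit. The helper names `parse_comet_memory` and `mb` are made up for this illustration and are not part of Comet. The ratio comes out to 8, which matches `spark.driver.cores=8`, though this does not by itself confirm how Comet derives the per-task value.

```python
import re

# One of the CometExecIterator lines quoted above, copied verbatim.
LOG_LINE = (
    "26/02/11 21:02:02 INFO CometExecIterator: memoryPoolType=fair_unified, "
    "offHeapSize=24576 MB, memoryFraction=1.0, memoryLimit=24576 MB, "
    "memoryLimitPerTask=3072 MB"
)


def parse_comet_memory(line: str) -> dict:
    """Extract the key=value fields from a CometExecIterator log line.

    Hypothetical helper for this illustration only.
    """
    fields = {}
    for key, value in re.findall(r"(\w+)=([^,]+)", line):
        fields[key] = value.strip()
    return fields


def mb(value: str) -> int:
    """Convert a string such as '24576 MB' to an integer number of MB."""
    return int(value.split()[0])


fields = parse_comet_memory(LOG_LINE)
limit_mb = mb(fields["memoryLimit"])
per_task_mb = mb(fields["memoryLimitPerTask"])

# Ratio between the whole pool and the per-task share.
print(f"pool={limit_mb} MB, per task={per_task_mb} MB, "
      f"ratio={limit_mb // per_task_mb}")
```

The same parser can be pointed at other lines from the log to check whether the reported limits change between queries.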
Steps to reproduce
Spark is running in local mode (local[*]) against local TPC-H Parquet files at scale factor 100 (sf=100).
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--jars $jar \
--driver-class-path $jar \
--conf spark.driver.memory=32G \
--conf spark.driver.cores=8 \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=24g \
--conf spark.driver.extraClassPath=$jar \
--conf spark.plugins=org.apache.spark.CometPlugin \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.comet.exec.replaceSortMergeJoin=true \
--conf spark.comet.expression.Cast.allowIncompatible=true \
$EXTRA_CONF_ARGS \
tpcbench.py \
--name comet \
--benchmark tpch \
--data $TPCH_DATA \
--queries $TPCH_QUERIES \
--output "$output_dir" \
--iterations $ITERATIONS \
--format parquet
pod settings:
Limits:
cpu: 8
memory: 64Gi
Requests:
cpu: 8
memory: 64Gi
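For context, here is a rough budget of the configured memory against the pod limit. It is only a sketch: it assumes the pod runs a single JVM (local mode), that `spark.driver.memory=32G` and `spark.memory.offHeap.size=24g` are both read as GiB, and the function name `headroom_gib` is made up for this example.

```python
def headroom_gib(pod_limit_gib: int, heap_gib: int, off_heap_gib: int) -> int:
    """Memory left in the pod after the JVM heap and Spark off-heap pool.

    Hypothetical helper: it only subtracts the configured sizes and does
    not measure actual usage.
    """
    return pod_limit_gib - heap_gib - off_heap_gib


# Values taken from the spark-submit command and pod settings above.
POD_LIMIT_GIB = 64   # Limits: memory: 64Gi
HEAP_GIB = 32        # spark.driver.memory=32G
OFF_HEAP_GIB = 24    # spark.memory.offHeap.size=24g

remaining = headroom_gib(POD_LIMIT_GIB, HEAP_GIB, OFF_HEAP_GIB)
print(f"configured: {HEAP_GIB + OFF_HEAP_GIB} GiB, headroom: {remaining} GiB")
```

Under these assumptions, 56 GiB is explicitly configured and 8 GiB remains for everything else in the pod, such as JVM overhead and any native allocations that are not tracked by the off-heap pool.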
Expected behavior
No response
Additional context
No response