What happened?
Sometimes Spark jobs fail because of OOM (there are other open issues related to the OOM itself, e.g. this one).
Even worse: when the job runs as a Kubernetes Pod in embedded Spark mode (no dedicated Spark master), it does not fail but keeps running forever without completing.
If the Spark job fails, the main process should fail as well instead of just catching the OOM exception without any handling. Otherwise it blocks resources without the failure ever being noticed.
Unfortunately I am not familiar enough with the Spark framework to suggest an implementable solution.
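I am not sure how the project wires its Spark entry point, so this is only a minimal sketch of one possible direction, not the project's actual API: either add the HotSpot flag -XX:+ExitOnOutOfMemoryError to the job's JVM options, or install a process-wide uncaught exception handler early in the main class so that an OutOfMemoryError thrown from any Spark background thread (as in the log below) terminates the JVM with a non-zero exit code. The class name FatalErrorHandler is illustrative.

```java
public final class FatalErrorHandler {

    // Illustrative sketch (not the project's actual code): ensure that an
    // OutOfMemoryError thrown from any Spark background thread kills the JVM
    // with a non-zero exit code instead of leaving the pod running forever.
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (error instanceof Error) {
                System.err.println("Fatal error in thread " + thread.getName() + ": " + error);
                // halt() rather than exit(): shutdown hooks may themselves hang
                // or fail once the heap is exhausted.
                Runtime.getRuntime().halt(1);
            }
        });
    }
}
```

Calling something like FatalErrorHandler.install() as the first statement of the job's main method (or simply setting -XX:+ExitOnOutOfMemoryError) would let Kubernetes see the container as failed and restart it.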
Steps to reproduce
Deploy the spark-dependencies job
Collect a huge amount of traces (one way to generate such load is sketched below)
Wait for the spark-dependencies job to "fail" because of OOM
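For step 2, any sufficiently aggressive span generator will do. As one assumed setup (the endpoint, span names, and counts are placeholders, and the Jaeger collector must have OTLP ingestion enabled), a small OpenTelemetry Java loop can grow the daily jaeger-span-* index until the dependencies job exhausts its heap:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class TraceLoadGenerator {

    public static void main(String[] args) {
        // Placeholder endpoint: adjust to the collector's OTLP/gRPC address in your cluster.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://jaeger-collector:4317")
                .build();
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();
        Tracer tracer = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .build()
                .getTracer("load-generator");

        // Emit a large volume of parent/child spans so the daily span index
        // grows big enough for the dependencies job to run out of heap.
        // The batch processor may drop spans under this load, so let the
        // loop run as long as needed.
        for (int i = 0; i < 5_000_000; i++) {
            Span parent = tracer.spanBuilder("parent-op").startSpan();
            Span child = tracer.spanBuilder("child-op")
                    .setParent(Context.current().with(parent))
                    .startSpan();
            child.end();
            parent.end();
        }
        tracerProvider.close();
    }
}
```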
Expected behavior
The main process should exit with a non-zero error code, so that the container fails and can be restarted.
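In other words (a minimal sketch, assuming a wrapper around whatever the real entry point is; runDependenciesJob() is a hypothetical stand-in, not the project's actual API), any failure in the job should propagate to a non-zero process exit status so that Kubernetes marks the container as failed and can restart it:

```java
public final class Main {

    public static void main(String[] args) {
        try {
            runDependenciesJob();          // hypothetical stand-in for the real Spark job invocation
        } catch (Throwable t) {            // Throwable also covers OutOfMemoryError
            t.printStackTrace();
            // Non-zero exit code: the container terminates as failed, so the
            // pod/CronJob controller can restart or re-schedule it.
            System.exit(1);
        }
    }

    private static void runDependenciesJob() {
        // placeholder body for the sketch
    }
}
```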
Relevant log output
2023-03-23T10:30:01.200696521Z WARNING: An illegal reflective access operation has occurred
2023-03-23T10:30:01.200724780Z WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/app/jaeger-spark-dependencies-0.0.1-SNAPSHOT.jar) to method java.nio.Bits.unaligned()
2023-03-23T10:30:01.200729036Z WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
2023-03-23T10:30:01.200732490Z WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
2023-03-23T10:30:01.200735314Z WARNING: All illegal access operations will be denied in a future release
2023-03-23T10:30:01.480102373Z 23/03/23 10:30:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-03-23T10:30:02.292442909Z 23/03/23 10:30:02 INFO ElasticsearchDependenciesJob: Running Dependencies job for 2023-03-23T00:00Z, reading from jaeger-span-2023-03-23 index, result storing to jaeger-dependencies-2023-03-23
2023-03-23T11:04:55.809920514Z
2023-03-23T11:04:55.809944477Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RemoteBlock-temp-file-clean-thread"
2023-03-23T11:05:24.292046946Z
2023-03-23T11:05:24.292075347Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Spark Context Cleaner"
2023-03-23T11:07:32.650964109Z
2023-03-23T11:07:32.650988857Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "netty-rpc-env-timeout"
2023-03-23T11:08:58.216093703Z
2023-03-23T11:08:58.216120176Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "driver-heartbeater"
2023-03-23T11:09:55.117794169Z
2023-03-23T11:09:55.117816339Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "heartbeat-receiver-event-loop-thread"
2023-03-23T11:11:50.397346152Z Exception in thread "Executor task launch worker for task 8" java.lang.OutOfMemoryError: Java heap space
Screenshot
No response
Additional context
No response
Jaeger backend version
1.35.2
SDK
No response
Pipeline
No response
Storage backend
No response
Operating system
No response
Deployment model
Kubernetes via Jaeger Operator
Deployment configs