
[Bug]: Main process does not fail on OOM of spark-job #131

Open
Phil1602 opened this issue Mar 28, 2023 · 1 comment

@Phil1602

What happened?

Sometimes Spark jobs fail because of OOM (there are other open issues related to the OOM itself, e.g. this one).

Even worse: the job, running as a Kubernetes Pod in embedded Spark mode (no dedicated Spark master), does not fail but runs forever without completing.

If the Spark job fails, the main process should fail as well instead of swallowing the OutOfMemoryError without any handling. Otherwise it blocks resources without the failure ever being noticed.

Unfortunately I'm not familiar enough with the Spark framework to suggest an implementable solution.

Steps to reproduce

  1. Deploy the spark-dependencies Job
  2. Collect a huge amount of traces
  3. Wait for the spark-dependencies Job to "fail" because of OOM

Expected behavior

The main process should exit with a non-zero exit code, so that the container fails and can be restarted.
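
For illustration only, a minimal sketch of that fail-fast behavior; the class name and runDependenciesJob() are hypothetical placeholders, not the actual spark-dependencies entrypoint:

// Minimal sketch only: FailFastMain and runDependenciesJob() are hypothetical
// placeholders, not the real spark-dependencies code.
public final class FailFastMain {

    public static void main(String[] args) {
        // Also terminate when an error is thrown on a background thread
        // (e.g. "Spark Context Cleaner" or "driver-heartbeater" as in the log below).
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            error.printStackTrace();
            // halt() skips shutdown hooks, which may themselves fail once the heap is exhausted
            Runtime.getRuntime().halt(1);
        });

        try {
            runDependenciesJob(args);   // hypothetical stand-in for the actual Spark job
        } catch (Throwable t) {         // Throwable also covers java.lang.OutOfMemoryError
            t.printStackTrace();
            System.exit(1);             // non-zero exit code lets Kubernetes restart the Pod
        }
    }

    private static void runDependenciesJob(String[] args) {
        throw new UnsupportedOperationException("illustrative placeholder");
    }
}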

Relevant log output

2023-03-23T10:30:01.200696521Z WARNING: An illegal reflective access operation has occurred
2023-03-23T10:30:01.200724780Z WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/app/jaeger-spark-dependencies-0.0.1-SNAPSHOT.jar) to method java.nio.Bits.unaligned()
2023-03-23T10:30:01.200729036Z WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
2023-03-23T10:30:01.200732490Z WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
2023-03-23T10:30:01.200735314Z WARNING: All illegal access operations will be denied in a future release
2023-03-23T10:30:01.480102373Z 23/03/23 10:30:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-03-23T10:30:02.292442909Z 23/03/23 10:30:02 INFO ElasticsearchDependenciesJob: Running Dependencies job for 2023-03-23T00:00Z, reading from jaeger-span-2023-03-23 index, result storing to jaeger-dependencies-2023-03-23
2023-03-23T11:04:55.809920514Z 
2023-03-23T11:04:55.809944477Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RemoteBlock-temp-file-clean-thread"
2023-03-23T11:05:24.292046946Z 
2023-03-23T11:05:24.292075347Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Spark Context Cleaner"
2023-03-23T11:07:32.650964109Z 
2023-03-23T11:07:32.650988857Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "netty-rpc-env-timeout"
2023-03-23T11:08:58.216093703Z 
2023-03-23T11:08:58.216120176Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "driver-heartbeater"
2023-03-23T11:09:55.117794169Z 
2023-03-23T11:09:55.117816339Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "heartbeat-receiver-event-loop-thread"
2023-03-23T11:11:50.397346152Z Exception in thread "Executor task launch worker for task 8" java.lang.OutOfMemoryError: Java heap space

Screenshot

No response

Additional context

No response

Jaeger backend version

1.35.2

SDK

No response

Pipeline

No response

Storage backend

No response

Operating system

No response

Deployment model

Kubernetes via Jaeger Operator

Deployment configs

- name: JAVA_OPTS
  value: -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XX:-UseCompressedOops
    -XX:+UseG1GC -XX:+UseContainerSupport
----
Image: ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:latest
@frittentheke
Contributor

Does instructing the JVM to exit / crash on OOM help?

-XX:+ExitOnOutOfMemoryError
-XX:+CrashOnOutOfMemoryError

(see e.g. https://www.baeldung.com/java-shutting-down-outofmemoryerror#killing-the-jvm)

Otherwise the JVM turns into a zombie, with the OOM exception (sic) never being handled.
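
Applied to the deployment config above, that could look roughly like this (untested for this job; it is just the existing JAVA_OPTS value with the suggested flag appended):

- name: JAVA_OPTS
  value: -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XX:-UseCompressedOops
    -XX:+UseG1GC -XX:+UseContainerSupport -XX:+ExitOnOutOfMemoryError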
