What happened?
Sometimes Spark jobs fail because of OOM (there are other open issues related to the OOM itself, e.g. this one).
Even worse: when the job runs as a Kubernetes Pod in embedded Spark mode (no dedicated Spark master), it does not fail but keeps running forever without completing.
If the Spark job fails, the main process should fail as well instead of just catching the OOM exception without any handling. Otherwise it blocks resources without the failure ever being noticed.
Unfortunately I am not familiar enough with the Spark framework to suggest an implementable solution.
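I am not sure how the project wires its Spark entry point, so this is only a minimal sketch of one possible direction, not the project's actual API: either add the HotSpot flag -XX:+ExitOnOutOfMemoryError to the job's JVM options, or install a process-wide uncaught exception handler early in the main class so that an OutOfMemoryError thrown from any Spark background thread (as in the log below) terminates the JVM with a non-zero exit code. The class name FatalErrorHandler is illustrative.

```java
public final class FatalErrorHandler {

    // Illustrative sketch (not the project's actual code): ensure that an
    // OutOfMemoryError thrown from any Spark background thread kills the JVM
    // with a non-zero exit code instead of leaving the pod running forever.
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (error instanceof Error) {
                System.err.println("Fatal error in thread " + thread.getName() + ": " + error);
                // halt() rather than exit(): shutdown hooks may themselves hang
                // or fail once the heap is exhausted.
                Runtime.getRuntime().halt(1);
            }
        });
    }
}
```

Calling something like FatalErrorHandler.install() as the first statement of the job's main method (or simply setting -XX:+ExitOnOutOfMemoryError) would let Kubernetes see the container as failed and restart it.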
Steps to reproduce
Deploy the spark-dependencies job
Collect a huge amount of traces (one way to generate such load is sketched below)
Wait for the spark-dependencies job to "fail" because of OOM
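For step 2, any sufficiently aggressive span generator will do. As one assumed setup (the endpoint, span names, and counts are placeholders, and the Jaeger collector must have OTLP ingestion enabled), a small OpenTelemetry Java loop can grow the daily jaeger-span-* index until the dependencies job exhausts its heap:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class TraceLoadGenerator {

    public static void main(String[] args) {
        // Placeholder endpoint: adjust to the collector's OTLP/gRPC address in your cluster.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://jaeger-collector:4317")
                .build();
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();
        Tracer tracer = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .build()
                .getTracer("load-generator");

        // Emit a large volume of parent/child spans so the daily span index
        // grows big enough for the dependencies job to run out of heap.
        // The batch processor may drop spans under this load, so let the
        // loop run as long as needed.
        for (int i = 0; i < 5_000_000; i++) {
            Span parent = tracer.spanBuilder("parent-op").startSpan();
            Span child = tracer.spanBuilder("child-op")
                    .setParent(Context.current().with(parent))
                    .startSpan();
            child.end();
            parent.end();
        }
        tracerProvider.close();
    }
}
```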
Expected behavior
The main process should exit with a non-zero error code, so that the container fails and can be restarted.
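In other words (a minimal sketch, assuming a wrapper around whatever the real entry point is; runDependenciesJob() is a hypothetical stand-in, not the project's actual API), any failure in the job should propagate to a non-zero process exit status so that Kubernetes marks the container as failed and can restart it:

```java
public final class Main {

    public static void main(String[] args) {
        try {
            runDependenciesJob();          // hypothetical stand-in for the real Spark job invocation
        } catch (Throwable t) {            // Throwable also covers OutOfMemoryError
            t.printStackTrace();
            // Non-zero exit code: the container terminates as failed, so the
            // pod/CronJob controller can restart or re-schedule it.
            System.exit(1);
        }
    }

    private static void runDependenciesJob() {
        // placeholder body for the sketch
    }
}
```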
Relevant log output
2023-03-23T10:30:01.200696521Z WARNING: An illegal reflective access operation has occurred
2023-03-23T10:30:01.200724780Z WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/app/jaeger-spark-dependencies-0.0.1-SNAPSHOT.jar) to method java.nio.Bits.unaligned()
2023-03-23T10:30:01.200729036Z WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
2023-03-23T10:30:01.200732490Z WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
2023-03-23T10:30:01.200735314Z WARNING: All illegal access operations will be denied in a future release
2023-03-23T10:30:01.480102373Z 23/03/23 10:30:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-03-23T10:30:02.292442909Z 23/03/23 10:30:02 INFO ElasticsearchDependenciesJob: Running Dependencies job for 2023-03-23T00:00Z, reading from jaeger-span-2023-03-23 index, result storing to jaeger-dependencies-2023-03-23
2023-03-23T11:04:55.809920514Z
2023-03-23T11:04:55.809944477Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RemoteBlock-temp-file-clean-thread"
2023-03-23T11:05:24.292046946Z
2023-03-23T11:05:24.292075347Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Spark Context Cleaner"
2023-03-23T11:07:32.650964109Z
2023-03-23T11:07:32.650988857Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "netty-rpc-env-timeout"
2023-03-23T11:08:58.216093703Z
2023-03-23T11:08:58.216120176Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "driver-heartbeater"
2023-03-23T11:09:55.117794169Z
2023-03-23T11:09:55.117816339Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "heartbeat-receiver-event-loop-thread"
2023-03-23T11:11:50.397346152Z Exception in thread "Executor task launch worker for task 8" java.lang.OutOfMemoryError: Java heap space
Screenshot
No response
Additional context
No response
Jaeger backend version
1.35.2
SDK
No response
Pipeline
No response
Storage backend
No response
Operating system
No response
Deployment model
Kubernetes via Jaeger Operator
Deployment configs