Download OOMs in driver #392

Open
marcos-lg opened this issue Feb 26, 2025 · 2 comments

marcos-lg commented Feb 26, 2025

Some downloads fail because the Spark driver runs out of memory (OOM). The task logs only show errors like the ones below; the OOM itself can only be seen in the pod (see https://github.com/gbif/gbif-airflow-dags/issues/16):

[2025-02-25, 21:13:45 UTC] {rest.py:231} DEBUG - response body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within the allotted timeout","reason":"Timeout","details":{},"code":504}
[2025-02-25, 21:13:45 UTC] {extended_stackable_spark_sensor.py:128} WARNING - 
                K8s API Response: None.
                Underlying exception: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'Audit-Id': '67306bb0-6628-4c10-a6d6-e890f2d05958', 'Cache-Control': 'no-cache, private', 'Date': 'Tue, 25 Feb 2025 21:13:45 GMT', 'Content-Length': '152', 'Content-Type': 'text/plain; charset=utf-8'})
HTTP response body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within the allotted timeout","reason":"Timeout","details":{},"code":504}
                
[2025-02-25, 21:13:45 UTC] {taskinstance.py:2698} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/stackable/app/git/current/dags/gbif_modules/sensors/extended_stackable_spark_sensor.py", line 117, in poke
    response = self.hook.get_custom_object(
  File "/stackable/app/lib64/python3.9/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 332, in get_custom_object
    response = api.get_namespaced_custom_object(
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/api/custom_objects_api.py", line 1484, in get_namespaced_custom_object
    return self.get_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs)  # noqa: E501
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/api/custom_objects_api.py", line 1591, in get_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/rest.py", line 240, in GET
    return self.request("GET", url,
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'Audit-Id': '67306bb0-6628-4c10-a6d6-e890f2d05958', 'Cache-Control': 'no-cache, private', 'Date': 'Tue, 25 Feb 2025 21:13:45 GMT', 'Content-Length': '152', 'Content-Type': 'text/plain; charset=utf-8'})
HTTP response body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within the allotted timeout","reason":"Timeout","details":{},"code":504}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/stackable/app/lib64/python3.9/site-packages/airflow/models/taskinstance.py", line 433, in _execute_task
    result = execute_callable(context=context, **execute_callable_kwargs)
  File "/stackable/app/lib64/python3.9/site-packages/airflow/sensors/base.py", line 265, in execute
    raise e
  File "/stackable/app/lib64/python3.9/site-packages/airflow/sensors/base.py", line 247, in execute
    poke_return = self.poke(context)
  File "/stackable/app/git/current/dags/gbif_modules/sensors/extended_stackable_spark_sensor.py", line 135, in poke
    raise AirflowException(f"SparkApplication failed cause: {ex}")
airflow.exceptions.AirflowException: SparkApplication failed cause: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'Audit-Id': '67306bb0-6628-4c10-a6d6-e890f2d05958', 'Cache-Control': 'no-cache, private', 'Date': 'Tue, 25 Feb 2025 21:13:45 GMT', 'Content-Length': '152', 'Content-Type': 'text/plain; charset=utf-8'})
HTTP response body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within the allotted timeout","reason":"Timeout","details":{},"code":504}

e.g.:

@marcos-lg

I suggest setting the Spark driver memory overhead parameter: https://github.com/gbif/gbif-airflow-dags/pull/14

When I monitored some of these downloads in Grafana, the driver had 6GB of memory allocated and Spark was using 3.5GB, which makes me think the problem is not the heap memory.
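To make the overhead suggestion concrete, here is a rough sketch of how the driver container's memory is derived from the heap size plus the overhead. It assumes the documented Spark defaults (overhead factor of 0.10 with a 384MiB minimum, unless `spark.driver.memoryOverhead` is set explicitly); the function name and the example numbers are hypothetical and are not taken from the linked PR.

```python
from typing import Optional


def driver_container_memory_mib(
    driver_memory_mib: int,
    overhead_factor: float = 0.10,       # assumed default overhead factor
    min_overhead_mib: int = 384,         # assumed default minimum overhead
    explicit_overhead_mib: Optional[int] = None,  # models spark.driver.memoryOverhead
) -> int:
    """Estimate the memory (MiB) requested for the driver container.

    Heap (spark.driver.memory) plus non-heap overhead. An explicit overhead
    takes precedence over the factor-based default.
    """
    if explicit_overhead_mib is not None:
        overhead = explicit_overhead_mib
    else:
        overhead = max(int(driver_memory_mib * overhead_factor), min_overhead_mib)
    return driver_memory_mib + overhead


# Hypothetical figures, for illustration only.
default_total = driver_container_memory_mib(1024)
explicit_total = driver_container_memory_mib(6144, explicit_overhead_mib=2048)
print(default_total, explicit_total)
```

If off-heap usage (native buffers, metaspace, thread stacks) exceeds the overhead, the container can be OOM-killed even when the heap is far from full, which would match the Grafana observation above.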


muttcg commented Feb 26, 2025

I think large memory settings for the driver indicate that the issue is in the app itself. The default is 1GB, which should be enough in almost all cases, except when the job uses broadcast joins. So I would keep it at 1-2GB.

An OOMKill can have different causes. One of them is an overcommitted K8s node. K8s uses the pod's QoS class to decide which pods to evict first. I guess that in the case of downloads the node could be overcommitted, and if the driver has the BestEffort class, it is evicted first.

https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#besteffort
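As a rough illustration of how the QoS class follows from the pod spec, the sketch below classifies a pod from its containers' CPU and memory requests and limits, following the rules in the Kubernetes docs linked above. The function and the dict layout are hypothetical; quantities are compared as plain strings, so `"1Gi"` and `"1024Mi"` are treated as different, which is a simplification.

```python
def pod_qos_class(containers):
    """Return 'Guaranteed', 'Burstable' or 'BestEffort' for a pod.

    `containers` is a list of dicts, each with optional 'requests' and
    'limits' dicts keyed by 'cpu' and 'memory' (hypothetical layout).
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            req = requests.get(resource)
            lim = limits.get(resource)
            if req is not None or lim is not None:
                any_set = True
            # Guaranteed needs a limit for every resource; a missing request
            # defaults to the limit, so only an explicit mismatch breaks it.
            if lim is None or (req is not None and req != lim):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Hypothetical driver specs, for illustration only.
print(pod_qos_class([{}]))
print(pod_qos_class([{"requests": {"memory": "1Gi"}}]))
print(pod_qos_class([{
    "requests": {"cpu": "1", "memory": "2Gi"},
    "limits": {"cpu": "1", "memory": "2Gi"},
}]))
```

Checking the driver pod's `status.qosClass` would confirm which class it actually gets; if it is BestEffort, setting requests and limits on the driver would move it out of the first group to be evicted.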
