Download OOMs in driver #392

marcos-lg · 2025-02-26T09:33:31Z

Some downloads fail because of OOMs in the driver. The Airflow logs show errors like the ones below; the OOM error itself can only be seen in the pod (see https://github.com/gbif/gbif-airflow-dags/issues/16):

[2025-02-25, 21:13:45 UTC] {rest.py:231} DEBUG - response body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within the allotted timeout","reason":"Timeout","details":{},"code":504}
[2025-02-25, 21:13:45 UTC] {extended_stackable_spark_sensor.py:128} WARNING - 
                K8s API Response: None.
                Underlying exception: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'Audit-Id': '67306bb0-6628-4c10-a6d6-e890f2d05958', 'Cache-Control': 'no-cache, private', 'Date': 'Tue, 25 Feb 2025 21:13:45 GMT', 'Content-Length': '152', 'Content-Type': 'text/plain; charset=utf-8'})
HTTP response body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within the allotted timeout","reason":"Timeout","details":{},"code":504}
                
[2025-02-25, 21:13:45 UTC] {taskinstance.py:2698} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/stackable/app/git/current/dags/gbif_modules/sensors/extended_stackable_spark_sensor.py", line 117, in poke
    response = self.hook.get_custom_object(
  File "/stackable/app/lib64/python3.9/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 332, in get_custom_object
    response = api.get_namespaced_custom_object(
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/api/custom_objects_api.py", line 1484, in get_namespaced_custom_object
    return self.get_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs)  # noqa: E501
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/api/custom_objects_api.py", line 1591, in get_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/rest.py", line 240, in GET
    return self.request("GET", url,
  File "/stackable/app/lib64/python3.9/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'Audit-Id': '67306bb0-6628-4c10-a6d6-e890f2d05958', 'Cache-Control': 'no-cache, private', 'Date': 'Tue, 25 Feb 2025 21:13:45 GMT', 'Content-Length': '152', 'Content-Type': 'text/plain; charset=utf-8'})
HTTP response body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within the allotted timeout","reason":"Timeout","details":{},"code":504}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/stackable/app/lib64/python3.9/site-packages/airflow/models/taskinstance.py", line 433, in _execute_task
    result = execute_callable(context=context, **execute_callable_kwargs)
  File "/stackable/app/lib64/python3.9/site-packages/airflow/sensors/base.py", line 265, in execute
    raise e
  File "/stackable/app/lib64/python3.9/site-packages/airflow/sensors/base.py", line 247, in execute
    poke_return = self.poke(context)
  File "/stackable/app/git/current/dags/gbif_modules/sensors/extended_stackable_spark_sensor.py", line 135, in poke
    raise AirflowException(f"SparkApplication failed cause: {ex}")
airflow.exceptions.AirflowException: SparkApplication failed cause: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'Audit-Id': '67306bb0-6628-4c10-a6d6-e890f2d05958', 'Cache-Control': 'no-cache, private', 'Date': 'Tue, 25 Feb 2025 21:13:45 GMT', 'Content-Length': '152', 'Content-Type': 'text/plain; charset=utf-8'})
HTTP response body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within the allotted timeout","reason":"Timeout","details":{},"code":504}

Examples of affected downloads:

0000443-250225085111116 small download: https://airflow.gbif.org/log?execution_date=2025-02-25T12%3A50%3A56.488494%2B00%3A00&task_id=download_monitor&dag_id=gbif_occurrence_small_download_dag&map_index=-1
0000065-250225202704447 big download: https://airflow.gbif.org/log?execution_date=2025-02-25T21%3A09%3A41.101717%2B00%3A00&task_id=download_query_monitor&dag_id=gbif_occurrence_download_dag&map_index=-1


marcos-lg · 2025-02-26T10:00:01Z

I suggest setting the Spark driver memory overhead parameter: https://github.com/gbif/gbif-airflow-dags/pull/14

When I monitored some of these downloads in Grafana, the driver had 6GB of memory allocated and Spark was using 3.5GB, which makes me think the problem is not the heap memory.
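For context, a rough sketch of the arithmetic behind the overhead setting. On Kubernetes, Spark sizes the driver pod's memory as `spark.driver.memory` plus `spark.driver.memoryOverhead`; when the overhead is not set explicitly, the documented default for JVM jobs is 10% of the driver memory with a 384 MiB floor. The function name below is hypothetical, and using 6GB as the heap setting is an assumption (the figure above may refer to the whole container).

```python
from typing import Optional

# Documented Spark defaults for JVM jobs; assumed not to be overridden
# in this deployment.
DEFAULT_OVERHEAD_FACTOR = 0.10
MIN_OVERHEAD_MIB = 384


def driver_pod_memory_mib(driver_memory_mib: int,
                          overhead_mib: Optional[int] = None,
                          overhead_factor: float = DEFAULT_OVERHEAD_FACTOR) -> int:
    """Approximate memory (MiB) requested for the driver pod: heap + overhead."""
    if overhead_mib is None:
        # Default overhead: a fraction of the heap, but never below the floor.
        overhead_mib = max(int(driver_memory_mib * overhead_factor), MIN_OVERHEAD_MIB)
    return driver_memory_mib + overhead_mib


if __name__ == "__main__":
    # Default overhead versus an explicit 2 GiB overhead (illustrative values).
    print(driver_pod_memory_mib(6 * 1024))
    print(driver_pod_memory_mib(6 * 1024, overhead_mib=2048))
```

With the default, only about a tenth of the heap size is left for off-heap usage (thread stacks, direct buffers, native libraries), which is why a pod can be OOM-killed while the heap itself looks healthy.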

muttcg · 2025-02-26T10:26:41Z

I think large memory settings for the driver indicate that the issue is with the app. The default is 1GB, and that should be enough for almost all cases, except when the job uses broadcast joins. So I would keep it at 1-2GB.

OOMKill can have different causes. One of them is an overcommitted K8s node. K8s uses QoS classes to decide which pods to evict first. I guess that, in the case of downloads, the node could be overcommitted, and if the driver has the BestEffort class, it is evicted first.

https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#besteffort
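To make the QoS point concrete, here is a minimal sketch of how the class follows from a pod's CPU and memory requests and limits, based on the rules in the linked page. It is not the kubelet's implementation: the function name and the plain-dict input shape are made up for illustration, and quantities are compared as literal strings (so `"1"` and `"1000m"` would not be treated as equal). The class actually assigned to a running driver can be read from the pod's `status.qosClass` field.

```python
from typing import Dict, List

# Hypothetical input shape: {"requests": {...}, "limits": {...}} per container.
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod as Guaranteed, Burstable or BestEffort."""
    any_set = False
    all_guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # Kubernetes defaults a missing request to the limit when a limit is set.
        requests = {**limits, **c.get("requests", {})}
        for res in ("cpu", "memory"):
            req, lim = requests.get(res), limits.get(res)
            if req is not None or lim is not None:
                any_set = True
            # Guaranteed needs a limit for both resources, equal to the request.
            if lim is None or req != lim:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{}]))
    print(qos_class([{"requests": {"memory": "1Gi"}}]))
    print(qos_class([{"limits": {"cpu": "1", "memory": "2Gi"}}]))
```

A driver pod that declares any memory request would be Burstable or Guaranteed rather than BestEffort, so checking the real pod is a quick way to confirm or rule out this explanation.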

marcos-lg self-assigned this Feb 27, 2025

marcos-lg added a commit that referenced this issue Feb 27, 2025

#392 added spark driver memory overhead to download launcher

9473e60
