When I monitored some of these downloads in Grafana, the driver had 6 GB of memory allocated and Spark was using 3.5 GB, which makes me think the problem is not the heap memory.
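For context, the container memory that Spark requests for the driver pod is the JVM heap plus a non-heap overhead, so an OOMKill can still happen even when heap usage stays well below the heap size. A rough sketch of that accounting, assuming Spark 3.x defaults (a 10% overhead factor with a 384 MiB floor); the 6 GB figure is only illustrative:

```python
# Rough sketch of how Spark sizes the driver pod on Kubernetes (Spark 3.x defaults assumed).
# The container request/limit is the heap (spark.driver.memory) plus spark.driver.memoryOverhead,
# which defaults to max(0.1 * heap, 384 MiB) for JVM jobs.

MIN_OVERHEAD_MIB = 384
OVERHEAD_FACTOR = 0.10  # default spark.driver.memoryOverheadFactor for JVM workloads

def driver_pod_memory_mib(driver_memory_mib: int) -> int:
    """Approximate memory request/limit (MiB) that Spark asks for on the driver pod."""
    overhead = max(int(driver_memory_mib * OVERHEAD_FACTOR), MIN_OVERHEAD_MIB)
    return driver_memory_mib + overhead

if __name__ == "__main__":
    heap_mib = 6 * 1024  # e.g. a 6 GB spark.driver.memory setting
    print(driver_pod_memory_mib(heap_mib))  # ~6758 MiB: heap + off-heap/JVM overhead
```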
I think large memory settings for the driver indicate that the issue is with the app itself. The default is 1 GB, and that should be enough for almost all cases, except when the job uses broadcast joins. So I would keep it to 1-2 GB.
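As a concrete sketch of keeping the driver small, the relevant properties could be passed to spark-submit like this (the property names are standard Spark settings, but the values and the way these jobs are actually submitted are only illustrative):

```python
# Driver-side Spark configuration as it might be passed via spark-submit --conf
# (standard Spark property names; the values are illustrative, not the project's actual settings).
spark_conf = {
    "spark.driver.memory": "2g",             # 1-2 GB is usually enough for the driver
    "spark.driver.memoryOverhead": "512m",   # extra room for off-heap/JVM overhead on the pod
    "spark.sql.autoBroadcastJoinThreshold": str(32 * 1024 * 1024),  # cap auto-broadcast joins at 32 MB
}

submit_args = " ".join(f"--conf {key}={value}" for key, value in spark_conf.items())
print(submit_args)
```

Capping `spark.sql.autoBroadcastJoinThreshold` keeps broadcast joins (the main case that needs a bigger driver heap) from forcing the driver memory up.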
An OOMKill can have different causes. One of them is an overcommitted K8s node: under memory pressure, K8s uses the pods' QoS classes to decide which ones to evict first. I guess that in the case of downloads the node could be overcommitted, and if the driver pod has the BestEffort class, it is evicted first.
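A quick way to check this hypothesis on a failed download, assuming access to the cluster and the official `kubernetes` Python client (the namespace and pod name below are only placeholders), is to read the driver pod's QoS class and its last termination reason:

```python
from kubernetes import client, config

# Sketch: inspect a driver pod's QoS class and last termination reason.
# "BestEffort" means no requests/limits were set, so the pod is the first candidate
# for eviction under node memory pressure; a reason of "OOMKilled" confirms the
# container was killed for exceeding memory.
config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="download-driver", namespace="spark")  # placeholder names
print("QoS class:", pod.status.qos_class)

for cs in pod.status.container_statuses or []:
    last = cs.last_state.terminated
    if last is not None:
        print(cs.name, "last terminated with reason:", last.reason)  # e.g. "OOMKilled"
```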
Some downloads fail because of OOMs in the driver and produce errors in the logs (the OOM error itself can only be seen on the pod, see https://github.com/gbif/gbif-airflow-dags/issues/16), e.g.: