-
Apache Airflow Provider(s): cncf-kubernetes
Versions of Apache Airflow Providers: apache-airflow-providers-cncf-kubernetes==3.1.1
Apache Airflow version: 2.2.2
Operating System: CentOS 7
Deployment: Official Apache Airflow Helm Chart
Deployment details: Airflow deployed on Kubernetes

What happened
These tasks have been running for several months. When trying to upgrade to the latest Kubernetes provider, the KubernetesPodOperator matches the worker pod itself, so it does not start a new pod. The task just hangs until it times out.

[2022-03-23 09:49:49,950] {kubernetes_pod.py:525} INFO - Creating pod engines-distant-sharer.5bfdc82ad54d4dc1b58367f3d6d6a94f with labels: {'dag_id': 'tag_engine_user', 'task_id': 'engines_distant_sharer', 'run_id': 'manual__2022-03-23T094931.4187280000-fe99c2456', 'try_number': '1'}
[2022-03-23 09:49:49,972] {kubernetes_pod.py:336} INFO - Found matching pod tagengineuserenginesdistantsharer.1e4548b0558448f9a7aaa69da5f1e69d with labels {'airflow-worker': '2258913', 'airflow_version': '2.2.2', 'component': 'worker', 'dag_id': 'tag_engine_user', 'kubernetes_executor': 'True', 'release': 'airflow', 'run_id': 'manual__2022-03-23T094931.4187280000-fe99c2456', 'task_id': 'engines_distant_sharer', 'tier': 'airflow', 'try_number': '1'}

What you think should happen instead
The task should run successfully.

How to reproduce
Run a KubernetesPodOperator with the pod namespace the same as the Airflow deployment namespace.

Anything else
I see that the label "execution_date" has been changed to "run_id". This is most likely the cause.
{'dag_id': 'tag_engine', 'task_id': 'engines_stationary_sharer', 'execution_date': '2022-03-23T0730000000-b016b00b9', 'try_number': '1'}
{'dag_id': 'tag_engine', 'task_id': 'engines_stationary_sharer', 'run_id': 'scheduled__2022-03-23T0930000000-fdfd231b4', 'try_number': '1'}
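To illustrate why the same-namespace detail matters: the second log line shows the operator's pod lookup matching the KubernetesExecutor worker pod, which carries the same dag_id/task_id/run_id labels as the pod the operator wants to create. Below is a minimal sketch of that kind of label-based lookup using the official kubernetes Python client; the selector string and namespace are illustrative, not the provider's exact code.

```python
from kubernetes import client, config

# Assumption: in-cluster access from the worker pod; "airflow" is a placeholder
# for the deployment namespace used here.
config.load_incluster_config()
v1 = client.CoreV1Api()

# Selector built from the task identity labels shown in the log above. With the
# KubernetesExecutor, the worker pod running the task carries the same dag_id /
# task_id / run_id labels, so a same-namespace lookup like this can match the
# worker pod itself instead of finding nothing and creating a new pod.
selector = (
    "dag_id=tag_engine_user,"
    "task_id=engines_distant_sharer,"
    "run_id=manual__2022-03-23T094931.4187280000-fe99c2456"
)
for pod in v1.list_namespaced_pod("airflow", label_selector=selector).items:
    print(pod.metadata.name, pod.metadata.labels.get("component"))
```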
-
I can't reproduce with
The operator can re-attach to the running pod if I restart the scheduler.
Could you give us more context on the KPO (what kind of K8S it uses, in_cluster? ...) and the FULL Airflow logs of the error?
-
No special annotations. Using the default template.
-
It looks like you have a 1-hour timeout, can you check?
-
I do have a one-hour timeout for the task. This task takes 30s~1:30s when I downgrade to
It is supposed to create a new pod called
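For context, a minimal sketch of how such a per-task timeout is usually configured; `execution_timeout` is a standard operator argument, and the name/image values here are placeholders rather than the actual task definition:

```python
from datetime import timedelta

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Placeholder task: with execution_timeout=1h, Airflow fails the task if the
# pod (or, in this report, the hanging pod lookup) has not finished in an hour.
task = KubernetesPodOperator(
    task_id="engines_distant_sharer",
    name="engines-distant-sharer",
    namespace="airflow",
    image="debian",
    cmds=["bash", "-c", "sleep 30"],
    execution_timeout=timedelta(hours=1),
)
```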
-
You had this problem just after updating apache-airflow-providers-cncf-kubernetes? Scenario:
1. start a pod with apache-airflow-providers-cncf-kubernetes=3.0.2
2. restart Airflow with apache-airflow-providers-cncf-kubernetes=3.1.1
3. the scheduler does not re-attach correctly to the pod and the task times out
?
-
Only clarification is that, regarding (3), the scheduler is able to start a worker pod, which then fails to start the task pod.
-
Didn't see the edit. The scenario you laid out is correct.
-
so your issue is that
So the provider should have been bumped to v4, because it's a breaking change? (@potiuk)
-
I don't think this is related to backward compatibility. Maybe the correction to (3) is that the KPO is failing to start the pod, not to re-attach.
-
Hmm. My question: is it always the case with 3.1.1, or is it something specific to @sushi30's setup?
-
I cannot identify anything unique in my setup. The tasks in this setup have been working without fault for the past year or so. This broke with the change to 3.1.1.
-
I am able to reproduce this with a minimal-configuration helm chart on minikube. The latest apache/airflow image uses provider version 3.0.2, so you need to build a custom image with the new provider version. This DAG works fine with 3.0.2 and hangs with 3.1.1:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="example", start_date=datetime(2022, 1, 1), schedule_interval="@once"
) as dag:
    k = KubernetesPodOperator(
        namespace="airflow",
        name="hello",
        image="debian",
        cmds=["bash", "-cx"],
        arguments=["echo", "10"],
        labels={"foo": "bar"},
        task_id="dry_run_demo",
    )
```

```Dockerfile
# Dockerfile
FROM apache/airflow
RUN pip install apache-airflow-providers-cncf-kubernetes==3.1.1
```
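As a quick sanity check that the custom image actually ships the provider version under test, something like this can be run inside the container; a small sketch using only the standard library, with the package name taken from the Dockerfile above:

```python
from importlib.metadata import version

# Should print 3.1.1 for the image built from the Dockerfile above,
# and 3.0.2 for the stock apache/airflow image.
print(version("apache-airflow-providers-cncf-kubernetes"))
```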
-
If
-
This has nothing to do with pods started in
-
@sushi30 - it is very likely something environmental for you. People often argue that "it worked before, so it must be a backwards compatibility problem" where in fact there might be other, environmental factors: misconfiguration or a wrong deployment caused things to "work" (or rather masked the problem) before, only to be revealed when, for example, a new library performs a more thorough check. Or maybe a library change causes more resource usage and you simply need to increase resources (memory/disk/the like). There are many things that could go wrong. I would not jump to the conclusion that this is a backwards-compatibility issue. It might be, but it does not have to be, and it is not at all obvious. It would be rather surprising if this were a general problem - we do not see other people reporting problems like this one.

Do you have any logs telling us more about what's happening? Maybe you can take a look at the logs of K8S creating the pods and they will tell you what's wrong. The information that pods are "hanging" makes it impossible to diagnose - without more details we have even less information than you have. And looking at the logs of what happens when it fails is something that only you can do, I am afraid. It would also be great to get some more information - which K8S version you have, for example.

Can you also try the 2.2.5rc1 release of Airflow (we just put it up for voting)? The images we have in Dockerhub contain both the latest Airflow and the latest cncf.kubernetes provider, so if you could try it and see if the problem persists, that would be helpful.
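For the "look at the K8S logs" suggestion, here is one hedged way to pull pod status and recent events for the Airflow namespace with the official kubernetes Python client; the namespace and label selector are placeholders matching the repro DAG above, not a prescribed procedure:

```python
from kubernetes import client, config

# Assumption: run from a machine with kubeconfig access to the cluster;
# "airflow" is a placeholder namespace.
config.load_kube_config()
v1 = client.CoreV1Api()

# Pod phases for everything created for the example DAG.
for pod in v1.list_namespaced_pod("airflow", label_selector="dag_id=example").items:
    print(pod.metadata.name, pod.status.phase)

# Recent events often show why a pod never started (image pull, RBAC, quota...).
for event in v1.list_namespaced_event("airflow").items:
    print(event.involved_object.name, event.reason, event.message)
```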
-
Converting it into a discussion until we have more information.
-
BTW. Yes, the label has been changed to "run_id", but I do not think it has anything to do with it (however, maybe @dstandish could tell).
-
Another question @sushi30 - does your deployment rely in any way on the "execution_date" label?
-
@sushi30 could you try to run the pod in another namespace to see if it works? (to isolate the problem)
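To make the isolation test concrete, it amounts to pointing the operator at a namespace other than the one Airflow runs in. A sketch of the changed task from the repro DAG; the namespace name is a placeholder, and the target namespace must exist and be reachable by Airflow's service account:

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Same task as the repro DAG, but the pod is created outside the Airflow
# namespace so it cannot collide with the executor's worker pods.
k = KubernetesPodOperator(
    namespace="kpo-test",  # placeholder: any namespace other than "airflow"
    name="hello",
    image="debian",
    cmds=["bash", "-cx"],
    arguments=["echo", "10"],
    task_id="dry_run_demo_other_ns",
)
```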
-
I can confirm this also occurs with apache/airflow:2.2.5rc2. When trying this in another namespace, the pod fails. This is expected because the out-of-box helm chart does not support starting pods in different namespaces; this requires enabling multiNamespaceMode.
-
I've faced the same issue on Airflow 2.3.4. I've deployed Airflow on Kubernetes and used the Kubernetes executor. Any DAG with a KubernetesPodOperator launched the task pod but couldn't create the desired pod, because it matched the task pod itself, so it was deadlocked. I've found PR #23371, which fixes this issue - but unfortunately it is not included in the latest Airflow release at this moment (2.3.4). So I've cherry-picked commit 8e3abe4 on top of the v2-3-stable branch and built a prod docker image (with the option to install providers from sources). I've tested this approach and it works fine.