-
Apache Airflow Provider(s): cncf-kubernetes
Versions of Apache Airflow Providers: apache-airflow-providers-cncf-kubernetes==3.1.1
Apache Airflow version: 2.2.2
Operating System: CentOS 7
Deployment: Official Apache Airflow Helm Chart
Deployment details: Airflow deployed on Kubernetes

What happened
These tasks have been running for several months. When trying to upgrade to the latest Kubernetes provider, the KubernetesPodOperator matches the worker pod itself, so it does not start a new pod. The task just hangs until it times out.

[2022-03-23 09:49:49,950] {kubernetes_pod.py:525} INFO - Creating pod engines-distant-sharer.5bfdc82ad54d4dc1b58367f3d6d6a94f with labels: {'dag_id': 'tag_engine_user', 'task_id': 'engines_distant_sharer', 'run_id': 'manual__2022-03-23T094931.4187280000-fe99c2456', 'try_number': '1'}
[2022-03-23 09:49:49,972] {kubernetes_pod.py:336} INFO - Found matching pod tagengineuserenginesdistantsharer.1e4548b0558448f9a7aaa69da5f1e69d with labels {'airflow-worker': '2258913', 'airflow_version': '2.2.2', 'component': 'worker', 'dag_id': 'tag_engine_user', 'kubernetes_executor': 'True', 'release': 'airflow', 'run_id': 'manual__2022-03-23T094931.4187280000-fe99c2456', 'task_id': 'engines_distant_sharer', 'tier': 'airflow', 'try_number': '1'}

What you think should happen instead
The task should run successfully.

How to reproduce
Run a KubernetesPodOperator with the pod namespace the same as the Airflow deployment namespace.

Anything else
I see that the label "execution_date" has been changed to "run_id". This is most likely the cause.
{'dag_id': 'tag_engine', 'task_id': 'engines_stationary_sharer', 'execution_date': '2022-03-23T0730000000-b016b00b9', 'try_number': '1'}
{'dag_id': 'tag_engine', 'task_id': 'engines_stationary_sharer', 'run_id': 'scheduled__2022-03-23T0930000000-fdfd231b4', 'try_number': '1'}
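To illustrate why the same-namespace detail matters: the second log line shows the operator's pod lookup matching the KubernetesExecutor worker pod, which carries the same dag_id/task_id/run_id labels as the pod the operator wants to create. Below is a minimal sketch of that kind of label-based lookup using the official kubernetes Python client; the selector string and namespace are illustrative, not the provider's exact code.

```python
from kubernetes import client, config

# Assumption: in-cluster access from the worker pod; "airflow" is a placeholder
# for the deployment namespace used here.
config.load_incluster_config()
v1 = client.CoreV1Api()

# Selector built from the task identity labels shown in the log above. With the
# KubernetesExecutor, the worker pod running the task carries the same dag_id /
# task_id / run_id labels, so a same-namespace lookup like this can match the
# worker pod itself instead of finding nothing and creating a new pod.
selector = (
    "dag_id=tag_engine_user,"
    "task_id=engines_distant_sharer,"
    "run_id=manual__2022-03-23T094931.4187280000-fe99c2456"
)
for pod in v1.list_namespaced_pod("airflow", label_selector=selector).items:
    print(pod.metadata.name, pod.metadata.labels.get("component"))
```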
-
I can't reproduce with
The operator can re-attach to the running pod if I restart the scheduler.
Could you give us more context on the KPO (what kind of K8S it uses, in_cluster? ...) and the FULL Airflow logs of the error?
-
No special annotations. Using the default template.
-
It looks like you have a 1-hour timeout, can you check?
-
I do have a one-hour timeout for the task. This task takes 30s~1:30s when I downgrade to
It is supposed to create a new pod called
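For context, a minimal sketch of how such a per-task timeout is usually configured; `execution_timeout` is a standard operator argument, and the name/image values here are placeholders rather than the actual task definition:

```python
from datetime import timedelta

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Placeholder task: with execution_timeout=1h, Airflow fails the task if the
# pod (or, in this report, the hanging pod lookup) has not finished in an hour.
task = KubernetesPodOperator(
    task_id="engines_distant_sharer",
    name="engines-distant-sharer",
    namespace="airflow",
    image="debian",
    cmds=["bash", "-c", "sleep 30"],
    execution_timeout=timedelta(hours=1),
)
```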
-
You had this problem just after updating apache-airflow-providers-cncf-kubernetes? Scenario:
1. start a pod with apache-airflow-providers-cncf-kubernetes=3.0.2
2. restart Airflow with apache-airflow-providers-cncf-kubernetes=3.1.1
3. the scheduler does not re-attach correctly to the pod and the task times out
?
-
Only clarification is that, regarding (3), the scheduler is able to start a worker pod, which then fails to start the task pod.
-
Didn't see the edit. The scenario you laid out is correct.
-
so your issue is that
So the provider should have been bumped to v4, because it's a breaking change? (@potiuk)
-
I don't think this is related to backward compatibility. Maybe the correction to (3) is that the KPO is failing to start the pod, not to re-attach.
-
Hmm. My question: is it always the case with 3.1.1, or is it something specific to @sushi30's setup?
-
I cannot identify anything unique in my setup. The tasks in this setup have been working without fault for the past year or so. This broke with the change to 3.1.1.
-
I am able to reproduce this with a minimal-configuration helm chart on minikube. The latest apache/airflow image uses provider version 3.0.2, so you need to build a custom image with the new provider version. This DAG works fine with 3.0.2 and hangs with 3.1.1:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="example", start_date=datetime(2022, 1, 1), schedule_interval="@once"
) as dag:
    k = KubernetesPodOperator(
        namespace="airflow",
        name="hello",
        image="debian",
        cmds=["bash", "-cx"],
        arguments=["echo", "10"],
        labels={"foo": "bar"},
        task_id="dry_run_demo",
    )
```

```Dockerfile
# Dockerfile
FROM apache/airflow
RUN pip install apache-airflow-providers-cncf-kubernetes==3.1.1
```
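As a quick sanity check that the custom image actually ships the provider version under test, something like this can be run inside the container; a small sketch using only the standard library, with the package name taken from the Dockerfile above:

```python
from importlib.metadata import version

# Should print 3.1.1 for the image built from the Dockerfile above,
# and 3.0.2 for the stock apache/airflow image.
print(version("apache-airflow-providers-cncf-kubernetes"))
```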
-
If
-
This has nothing to do with pods started in
-
@sushi30 - it is very likely something environmental for you. People often argue that "it worked before, so it must be a backwards compatibility problem" where in fact there might be other, environmental factors: misconfiguration or a wrong deployment caused things to "work" (or rather masked the problem) before, only to be revealed when, for example, a new library performs a more thorough check. Or maybe a library change causes more resource usage and you simply need to increase resources (memory/disk/the like). There are many things that could go wrong. I would not jump to the conclusion that this is a backwards-compatibility issue. It might be, but it does not have to be, and it is not at all obvious. It would be rather surprising if this were a general problem - we do not see other people reporting problems like this one.

Do you have any logs telling us more about what's happening? Maybe you can take a look at the logs of K8S creating the pods and they will tell you what's wrong. The information that pods are "hanging" makes it impossible to diagnose - without more details we have even less information than you have. And looking at the logs of what happens when it fails is something that only you can do, I am afraid. It would also be great to get some more information - which K8S version you have, for example.

Can you also try the 2.2.5rc1 release of Airflow (we just put it up for voting)? The images we have in Dockerhub contain both the latest Airflow and the latest cncf.kubernetes provider, so if you could try it and see if the problem persists, that would be helpful.
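For the "look at the K8S logs" suggestion, here is one hedged way to pull pod status and recent events for the Airflow namespace with the official kubernetes Python client; the namespace and label selector are placeholders matching the repro DAG above, not a prescribed procedure:

```python
from kubernetes import client, config

# Assumption: run from a machine with kubeconfig access to the cluster;
# "airflow" is a placeholder namespace.
config.load_kube_config()
v1 = client.CoreV1Api()

# Pod phases for everything created for the example DAG.
for pod in v1.list_namespaced_pod("airflow", label_selector="dag_id=example").items:
    print(pod.metadata.name, pod.status.phase)

# Recent events often show why a pod never started (image pull, RBAC, quota...).
for event in v1.list_namespaced_event("airflow").items:
    print(event.involved_object.name, event.reason, event.message)
```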
-
Converting it into a discussion until we have more information.
-
BTW. Yes, the label has been changed to "run_id", but I do not think it has anything to do with it (however, maybe @dstandish could tell).
-
Another question @sushi30 - does your deployment rely in any way on the "execution_date" label?
-
@sushi30 could you try to run the pod in another namespace to see if it works? (to isolate the problem)
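To make the isolation test concrete, it amounts to pointing the operator at a namespace other than the one Airflow runs in. A sketch of the changed task from the repro DAG; the namespace name is a placeholder, and the target namespace must exist and be reachable by Airflow's service account:

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Same task as the repro DAG, but the pod is created outside the Airflow
# namespace so it cannot collide with the executor's worker pods.
k = KubernetesPodOperator(
    namespace="kpo-test",  # placeholder: any namespace other than "airflow"
    name="hello",
    image="debian",
    cmds=["bash", "-cx"],
    arguments=["echo", "10"],
    task_id="dry_run_demo_other_ns",
)
```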
-
I can confirm this also occurs with apache/airflow:2.2.5rc2. When trying this in another namespace, the pod fails. This is expected because the out-of-box helm chart does not support starting pods in different namespaces; this requires enabling multiNamespaceMode.
-
I've faced the same issue on Airflow 2.3.4. I've deployed Airflow on Kubernetes and used the Kubernetes executor. Any DAG with a KubernetesPodOperator launched the task pod but couldn't create the desired pod, because it matched the task pod itself, so it was deadlocked. I've found PR #23371, which fixes this issue - but unfortunately it is not included in the latest Airflow release at this moment (2.3.4). So I've cherry-picked commit 8e3abe4 on top of the v2-3-stable branch and built a prod docker image (with the option to install providers from sources). I've tested this approach and it works fine.