
bug(AL2023): Pre-nodeadm script doesn't run and post-nodeadm prevents nodes from joining #2123

Open
darox opened this issue Jan 24, 2025 · 11 comments
Labels
bug Something isn't working

Comments

@darox

darox commented Jan 24, 2025

What happened:

I have to run the following script at boot to configure the interfaces for XDP; it doesn't matter to me whether it runs pre- or post-nodeadm.

With post-nodeadm

cat /var/lib/cloud/instance/scripts/part-003 
#!/usr/bin/env bash
ip link set dev ens5 mtu 3498
ethtool -L ens5 combined 2

In this case nodes don't join the cluster.

The status of Kubelet:

[root@ip-10-1-0-222 ec2-user]# service kubelet status
Redirecting to /bin/systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet
     Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; preset: disabled)
     Active: activating (auto-restart) (Result: resources) since Fri 2025-01-24 07:53:17 UTC; 3s ago
       Docs: https://github.com/kubernetes/kubernetes
        CPU: 0

Kubelet service logs:

Jan 24 08:06:51 ip-10-1-0-222.eu-central-1.compute.internal systemd[1]: kubelet.service: Failed with result 'resources'.
Jan 24 08:06:51 ip-10-1-0-222.eu-central-1.compute.internal systemd[1]: Failed to start kubelet.service - Kubernetes Kubelet.
Jan 24 08:06:56 ip-10-1-0-222.eu-central-1.compute.internal systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 442.
Jan 24 08:06:56 ip-10-1-0-222.eu-central-1.compute.internal audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=kubelet comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 24 08:06:56 ip-10-1-0-222.eu-central-1.compute.internal audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=kubelet comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 24 08:06:56 ip-10-1-0-222.eu-central-1.compute.internal systemd[1]: Stopped kubelet.service - Kubernetes Kubelet.
Jan 24 08:06:56 ip-10-1-0-222.eu-central-1.compute.internal systemd[1]: kubelet.service: Failed to load environment files: No such file or directory
Jan 24 08:06:56 ip-10-1-0-222.eu-central-1.compute.internal systemd[1]: kubelet.service: Failed to run 'start-pre' task: No such file or directory

The user data is as follows:

cat /var/lib/cloud/instance/user-data.txt
Content-Type: multipart/mixed; boundary="MIMEBOUNDARY"
MIME-Version: 1.0

--MIMEBOUNDARY
Content-Transfer-Encoding: 7bit
Content-Type: application/node.eks.aws
Mime-Version: 1.0

---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: <redacted>
    apiServerEndpoint: <redacted>
    certificateAuthority: <redacted>
    cidr: 172.20.0.0/16

--MIMEBOUNDARY
Content-Transfer-Encoding: 7bit
Content-Type: text/x-shellscript
Mime-Version: 1.0

#!/usr/bin/env bash

ip link set dev ens5 mtu 3498
ethtool -L ens5 combined 2
--MIMEBOUNDARY--

The script ran, because we can see the changed MTU:

2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 3498 qdisc mq state UP group default qlen 1000

With pre-nodeadm

sudo cat /var/lib/cloud/instance/scripts/part-003 
#!/usr/bin/env bash

ip link set dev ens5 mtu 3498
ethtool -L ens5 combined 2

In this case nodes join the cluster, but the interface MTU is still the same:

2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000

What you expected to happen:

I expect the pre- or post-nodeadm scripts to run successfully and the nodes to join the cluster.

How to reproduce it (as minimally and precisely as possible):

Define pre- and post-nodeadm scripts, then check whether the scripts ran and whether the nodes joined the cluster.

Environment: EKS

  • AWS Region: eu-central-1
  • Instance Type(s): m5n.xlarge
  • Cluster Kubernetes version: 1.30
  • Node Kubernetes version: 1.30
  • AMI Version: amazon-eks-node-al2023-x86_64-standard-1.30-v20250116
@darox darox added the bug Something isn't working label Jan 24, 2025
@ndbaker1
Member

Hi @darox, sorry, I'm not quite grokking the pre/post script setup. If you're relying on cloud-init to execute user data scripts, then everything will run before nodeadm has completed. The bootstrap is split into two parts, so it looks something like nodeadm-config > cloud-init > nodeadm-run.

Have you checked the logs for those services via journalctl -u nodeadm-run?
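
For reference, a rough sketch of how to inspect each stage of the bootstrap (assuming the standard AL2023 log locations):

# the two nodeadm systemd units
journalctl -u nodeadm-config --no-pager
journalctl -u nodeadm-run --no-pager
# cloud-init output, which runs in between the two
cat /var/log/cloud-init-output.log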

@darox
Author

darox commented Jan 27, 2025

@ndbaker1 thank you for having a look.

Yes, the output for journalctl -u nodeadm-run is:

[root@ip-10-1-1-47 ec2-user]# journalctl -u nodeadm-run
Jan 27 07:50:55 localhost systemd[1]: Dependency failed for nodeadm-run.service - EKS Nodeadm Run.
Jan 27 07:50:55 localhost systemd[1]: nodeadm-run.service: Job nodeadm-run.service/start failed with result 'dependency'.

In this case I specified a cloudinit_post_nodeadm of:

#!/usr/bin/env bash

ip link set dev ens5 mtu 3498
ethtool -L ens5 combined 2

@ndbaker1
Member

ndbaker1 commented Jan 28, 2025

That service not completing would explain why the nodes aren't joining the cluster. Based on the dependency failure, you should also pull up journalctl -u nodeadm-config to see the earlier issue in the process. I can't yet gauge whether this is a user-data formatting issue, but in the case where the MTU is unchanged, it sounds like the user-data document isn't being recognized by cloud-init and the script isn't being executed 🤔
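
One quick way to check whether cloud-init actually picked up the shell-script part (a sketch, assuming the default cloud-init layout on AL2023):

# script parts that cloud-init extracted from the user data
ls -l /var/lib/cloud/instance/scripts/
# whether and when those parts were executed
grep -i part- /var/log/cloud-init.log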

@MageshSrinivasulu

MageshSrinivasulu commented Jan 28, 2025

Facing similar issues but my approach is different #2128

@MageshSrinivasulu

@darox Were you able to fix the issue?

@cartermckinnon
Member

@darox can you check journalctl -u nodeadm-config?

@maiconrocha

maiconrocha commented Feb 2, 2025

Facing the same issue after building CIS hardened AMIs.
Nodes are not joining the cluster.
Script used to build the AMI:

make k8s=1.31 os_distro=al2023 aws_region=$AWS_REGION source_ami_id=$AMI_ID source_ami_owners=XXXXXXXX source_ami_filter_name="CIS Amazon Linux 2023 Benchmark - Level 2*" subnet_id=$SUBNET_ID associate_public_ip_address=true remote_folder=/home/ec2-user ami_name=$ami_name pause_container_image=602401143452.dkr.ecr.$AWS_REGION.amazonaws.com/eks/pause:3.10 iam_instance_profile=packer-role-CIS_AL2023

As requested by @cartermckinnon, this is the error:

# journalctl -u nodeadm-config
Feb 02 03:47:49 localhost systemd[1]: Starting nodeadm-config.service - EKS Nodeadm Config...
Feb 02 03:47:49 localhost (nodeadm)[1748]: nodeadm-config.service: Failed to locate executable /usr/bin/nodeadm: Permission denied
Feb 02 03:47:49 localhost (nodeadm)[1748]: nodeadm-config.service: Failed at step EXEC spawning /usr/bin/nodeadm: Permission denied
Feb 02 03:47:49 localhost systemd[1]: nodeadm-config.service: Main process exited, code=exited, status=203/EXEC
Feb 02 03:47:49 localhost systemd[1]: nodeadm-config.service: Failed with result 'exit-code'.
Feb 02 03:47:49 localhost systemd[1]: Failed to start nodeadm-config.service - EKS Nodeadm Config.

Checking permissions for /usr/bin/nodeadm

# ls -ltrh /usr/bin/nodeadm
-rwxr-xr-x. 1 root root 66M Feb  2 03:18 /usr/bin/nodeadm

Checking if the config is valid:

# /usr/bin/nodeadm config check
{"level":"info","ts":1738470807.365282,"caller":"config/check.go:27","msg":"Checking configuration","source":"imds://user-data"}
{"level":"info","ts":1738470807.3677034,"caller":"config/check.go:36","msg":"Configuration is valid"}

Trying to start nodeadm

# /usr/bin/nodeadm init
{"level":"info","ts":1738470419.8272219,"caller":"init/init.go:51","msg":"Checking user is root.."}
{"level":"info","ts":1738470419.8280857,"caller":"init/init.go:59","msg":"Loading configuration..","configSource":"imds://user-data"}

....

{"level":"info","ts":1738470419.838661,"caller":"init/init.go:70","msg":"Enriching configuration.."}
{"level":"info","ts":1738470419.838675,"caller":"init/init.go:152","msg":"Fetching instance details.."}
SDK 2025/02/02 04:26:59 DEBUG attempting waiter request, attempt count: 1
...
{"level":"info","ts":1738470420.0136263,"caller":"init/init.go:170","msg":"Fetching default options..."}
{"level":"info","ts":1738470420.0136392,"caller":"init/init.go:174","msg":"Default options populated","defaults":{"sandboxImage":"localhost/kubernetes/pause"}}
{"level":"info","ts":1738470420.0136523,"caller":"init/init.go:75","msg":"Validating configuration.."}
{"level":"info","ts":1738470420.0136793,"caller":"init/init.go:80","msg":"Creating daemon manager.."}
{"level":"info","ts":1738470420.0166996,"caller":"init/init.go:98","msg":"Configuring daemons..."}
{"level":"info","ts":1738470420.0167263,"caller":"init/init.go:105","msg":"Configuring daemon...","name":"containerd"}
{"level":"info","ts":1738470420.0167384,"caller":"containerd/base_runtime_spec.go:20","msg":"Writing containerd base runtime spec...","path":"/etc/containerd/base-runtime-spec.json"}
{"level":"info","ts":1738470420.0174632,"caller":"containerd/runtime_config.go:51","msg":"No instance specific containerd runtime configuration needed..","instanceType":"m6i.large"}
{"level":"info","ts":1738470420.0174797,"caller":"containerd/runtime_config.go:63","msg":"Configuring default runtime.."}
{"level":"info","ts":1738470420.0174983,"caller":"containerd/config.go:57","msg":"Writing containerd config to file..","path":"/etc/containerd/config.toml"}
{"level":"info","ts":1738470420.0175738,"caller":"init/init.go:109","msg":"Configured daemon","name":"containerd"}
{"level":"info","ts":1738470420.0175889,"caller":"init/init.go:105","msg":"Configuring daemon...","name":"kubelet"}
{"level":"info","ts":1738470420.0183494,"caller":"kubelet/config.go:299","msg":"Detected kubelet version","version":"v1.31.4"}
{"level":"info","ts":1738470420.0192964,"caller":"kubelet/config.go:210","msg":"Setup IP for node","ip":"XXXXXXX"}
{"level":"info","ts":1738470420.0197873,"caller":"kubelet/config.go:371","msg":"Writing kubelet config to file..","path":"/etc/kubernetes/kubelet/config.json"}
{"level":"info","ts":1738470420.0201876,"caller":"init/init.go:109","msg":"Configured daemon","name":"kubelet"}
{"level":"info","ts":1738470420.0202048,"caller":"init/init.go:114","msg":"Setting up system aspects..."}
{"level":"info","ts":1738470420.0202284,"caller":"init/init.go:117","msg":"Setting up system aspect..","name":"local-disk"}
{"level":"info","ts":1738470420.0202591,"caller":"system/local_disk.go:26","msg":"Not configuring local disks!"}
{"level":"info","ts":1738470420.0202699,"caller":"init/init.go:121","msg":"Set up system aspect","name":"local-disk"}
{"level":"info","ts":1738470420.0202813,"caller":"init/init.go:117","msg":"Setting up system aspect..","name":"networking"}
{"level":"info","ts":1738470420.0203004,"caller":"system/networking.go:79","msg":"writing eks_primary_eni_only network configuration"}
{"level":"info","ts":1738470420.0266056,"caller":"init/init.go:121","msg":"Set up system aspect","name":"networking"}
{"level":"info","ts":1738470420.0266244,"caller":"init/init.go:130","msg":"Ensuring daemon is running..","name":"containerd"}
{"level":"info","ts":1738470420.028408,"caller":"init/init.go:134","msg":"Daemon is running","name":"containerd"}
{"level":"info","ts":1738470420.0284274,"caller":"init/init.go:136","msg":"Running post-launch tasks..","name":"containerd"}
{"level":"info","ts":1738470420.0284398,"caller":"init/init.go:140","msg":"Finished post-launch tasks","name":"containerd"}
{"level":"info","ts":1738470420.0284522,"caller":"init/init.go:130","msg":"Ensuring daemon is running..","name":"kubelet"}
--->STUCK HERE

Checking kubelet logs:

journalctl -u kubelet.service
Feb 02 04:20:21 systemd[1]: kubelet.service: Failed to load environment files: No such file or directory
Feb 02 04:20:21 : kubelet.service: Failed to run 'start-pre' task: No such file or directory
Feb 02 04:20:21 : kubelet.service: Failed with result 'resources'.
Feb 02 04:20:21 : Failed to start kubelet.service - Kubernetes Kubelet.

It seems containerd is having issues starting:

#nerdctl --namespace k8s.io ps -a
FATA[0000] cannot access containerd socket "/run/containerd/containerd.sock": no such file or directory

#sudo nerdctl images -a
FATA[0000] cannot access containerd socket "/run/containerd/containerd.sock": no such file or directory
[root@ip-10-0-38-115 ~]#
[root@ip-10-0-38-115 ~]#
[root@ip-10-0-38-115 ~]#
[root@ip-10-0-38-115 ~]# journalctl -u containerd
-- No entries --
[root@ip-10-0-38-115 ~]# journalctl -u containerd.service
-- No entries --
[root@ip-10-0-38-115 ~]# systemctl status containerd
○ containerd.service - containerd container runtime
     Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; preset: disabled)
    Drop-In: /etc/systemd/system/containerd.service.d
             └─00-runtime-slice.conf
     Active: inactive (dead)
       Docs: https://containerd.io

@maiconrocha

Sorry, my issue is related to SELINUX=enforcing from the base AMI (CIS Amazon Linux 2023 Benchmark - Level 2).
Setting SELINUX=permissive seems to fix the issue.
Sorry for jumping on this thread; I thought the issues were related.
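
For anyone else hitting this, a minimal sketch of the change (assuming the standard /etc/selinux/config location):

# check the current mode
getenforce
# switch the running system to permissive
sudo setenforce 0
# persist the change across reboots
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config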

@bluerockjp

I am running into a similar issue. Running journalctl -u nodeadm-config.service:
Feb 05 01:55:46 ip-172-31-28-111.ec2.internal systemd[1]: Starting nodeadm-config.service - EKS Nodeadm Config...
Feb 05 01:55:51 ip-172-31-28-111.ec2.internal nodeadm[7702]: {"level":"info","ts":1738720551.576076,"caller":"init/init.go:52","msg":"Checking user is root.."}
Feb 05 01:55:51 ip-172-31-28-111.ec2.internal nodeadm[7702]: {"level":"info","ts":1738720551.577003,"caller":"init/init.go:60","msg":"Loading configuration..","configSource":"imds://user-data"}
Feb 05 01:55:51 ip-172-31-28-111.ec2.internal nodeadm[7702]: {"level":"info","ts":1738720551.6408865,"caller":"init/init.go:69","msg":"Loaded configuration","config":{"metadata":{"creationTimestamp":null},"spec":{"cluster":{},"containerd":{},"instance":{"localStorage":{}},"kubelet":{}},"status":{"instance":{},"default":{}}}}
Feb 05 01:55:51 ip-172-31-28-111.ec2.internal nodeadm[7702]: {"level":"info","ts":1738720551.6412463,"caller":"init/init.go:71","msg":"Enriching configuration.."}
Feb 05 01:55:51 ip-172-31-28-111.ec2.internal nodeadm[7702]: {"level":"info","ts":1738720551.6412528,"caller":"init/init.go:153","msg":"Fetching instance details.."}
Feb 05 01:55:52 ip-172-31-28-111.ec2.internal nodeadm[7702]: SDK 2025/02/05 01:55:52 DEBUG attempting waiter request, attempt count: 1
Feb 05 01:55:53 ip-172-31-28-111.ec2.internal nodeadm[7702]: {"level":"info","ts":1738720553.2977772,"caller":"init/init.go:170","msg":"Instance details populated","details":{"id":"i-0e63fcecdbcc1cb7d","region":"us-east-1","type":"c5.metal","availabilityZone":"us-east-1b","mac":"0e:76:66:f9:2c:ff","privateDnsName":"ip-172-31-34-252.ec2.internal"}}
Feb 05 01:55:53 ip-172-31-28-111.ec2.internal nodeadm[7702]: {"level":"info","ts":1738720553.297869,"caller":"init/init.go:171","msg":"Fetching default options..."}
Feb 05 01:55:53 ip-172-31-28-111.ec2.internal nodeadm[7702]: {"level":"info","ts":1738720553.2988718,"caller":"init/init.go:179","msg":"Default options populated","defaults":{"sandboxImage":"602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"}}
Feb 05 01:55:53 ip-172-31-28-111.ec2.internal nodeadm[7702]: {"level":"info","ts":1738720553.298916,"caller":"init/init.go:76","msg":"Validating configuration.."}
Feb 05 01:55:53 ip-172-31-28-111.ec2.internal nodeadm[7702]: {"level":"fatal","ts":1738720553.2989328,"caller":"nodeadm/main.go:36","msg":"Command failed","error":"Name is missing in cluster configuration","stacktrace":"main.main\n\t/workdir/cmd/nodeadm/main.go:36\nruntime.main\n\t/root/sdk/go1.23.2/src/runtime/proc.go:272"}
Feb 05 01:55:53 ip-172-31-28-111.ec2.internal systemd[1]: nodeadm-config.service: Main process exited, code=exited, status=1/FAILURE
Feb 05 01:55:53 ip-172-31-28-111.ec2.internal systemd[1]: nodeadm-config.service: Failed with result 'exit-code'.
Feb 05 01:55:53 ip-172-31-28-111.ec2.internal systemd[1]: Failed to start nodeadm-config.service - EKS Nodeadm Config.

The relevant line would appear to be:
Feb 05 01:55:51 ip-172-31-28-111.ec2.internal nodeadm[7702]: {"level":"info","ts":1738720551.6408865,"caller":"init/init.go:69","msg":"Loaded configuration","config":{"metadata":{"creationTimestamp":null},"spec":{"cluster":{},"containerd":{},"instance":{"localStorage":{}},"kubelet":{}},"status":{"instance":{},"default":{}}}}

Which leads to:
{"level":"fatal","ts":1738720553.2989328,"caller":"nodeadm/main.go:36","msg":"Command failed","error":"Name is missing in cluster configuration","stacktrace":"main.main\n\t/workdir/cmd/nodeadm/main.go:36\nruntime.main\n\t/root/sdk/go1.23.2/src/runtime/proc.go:272"}

However, nodeadm config check shows that the config is valid:
{"level":"info","ts":1738721771.338757,"caller":"config/check.go:27","msg":"Checking configuration","source":"imds://user-data"} {"level":"info","ts":1738721771.3415587,"caller":"config/check.go:36","msg":"Configuration is valid"}

And grabbing the user data with curl produces the expected output:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: application/node.eks.aws

--
APIVersion: node.eks.aws/v1alpha1
Kind: NodeConfig
Spec:
Cluster:
Name: demo-24-52-0_cluster
APIServerEndpoint: 'https://4FA4874DFF6B3ADE99B5541E7ADC2D95.gr7.us-east-1.eks.amazonaws.com'
CIDR: 10.100.0.0/16
(truncated for brevity)

If I dump the user data into a local YAML file and run nodeadm config check -c file://./user-data.yaml, it also shows that the config is valid. When I run nodeadm init -c file://./user-data.yaml it fails with the same errors as when the user data is pulled from IMDS.

@ndbaker1
Member

ndbaker1 commented Feb 5, 2025

Hi @bluerockjp, the shape of your NodeConfig is what's causing the issues. The YAML formatting needs to be fixed, and I've scrounged up an example based on the user data you provided:

Like you mentioned, this won't work:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: application/node.eks.aws

---
APIVersion: node.eks.aws/v1alpha1
Kind: NodeConfig
Spec:
Cluster:
Name: foo

--//--
{"level":"fatal","ts":1738746121.5297415,"caller":"nodeadm/main.go:36","msg":"Command failed","error":"Name is missing in cluster configuration","stacktrace":"main.main\n\t/home/nbakerd/workspace/ami/nodeadm/cmd/nodeadm/main.go:36\nruntime.main\n\t/home/nbakerd/go/pkg/mod/golang.org/[email protected]/src/runtime/proc.go:272"}

However, if you fix up the casing and the indentation, you'll have a valid name (and will move on to validating the rest of the fields, the next being the API server endpoint).

-Spec:
-Cluster:
-Name: demo-24-52-0_cluster
+spec:
+  cluster:
+    name: foo
{"level":"fatal","ts":1738746125.708913,"caller":"nodeadm/main.go:36","msg":"Command failed","error":"Apiserver endpoint is missing in cluster configuration","stacktrace":"main.main\n\t/home/nbakerd/workspace/ami/nodeadm/cmd/nodeadm/main.go:36\nruntime.main\n\t/home/nbakerd/go/pkg/mod/golang.org/[email protected]/src/runtime/proc.go:272"}

Take a look at our docs for more examples of valid NodeConfigs: https://awslabs.github.io/amazon-eks-ami/nodeadm/

It's true that the config check behavior is poor/incorrect here, though; I've put up #2138 to address that.
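
Putting it together, a minimal corrected user-data document would look roughly like this (placeholder cluster values, adjust for your environment):

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: application/node.eks.aws

---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: demo-cluster
    apiServerEndpoint: https://EXAMPLE.gr7.us-east-1.eks.amazonaws.com
    certificateAuthority: <base64-encoded CA bundle>
    cidr: 10.100.0.0/16

--//--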

@bluerockjp

@ndbaker1 The indentation must have gotten messed up during copy/paste, but the casing was definitely wrong. I changed it per your guidance and the issue is resolved. Thank you!
