AWS Pacemaker awsvip failing with different errors #1876
Comments
Try running |
Thank you. The debug command completed without any errors. Is there anything else to check? |
You can run The trace files will be available in /var/lib//heartbeat/trace_ra/. |
Thank you. I will enable the trace. |
Hello good people,
This thread is a bit old, but I have the same error:
Oct 29 08:30:22 [1749] auto-2.dhsscegypt.local lrmd: notice: operation_finished: awsvip_start_0:6361:stderr [ ocf-exit-reason:instance_id not found. Is this a EC2 instance? ]
Has anyone found a solution for this? |
It sounds like the AWS metadata service isn't replying to the requests. You can try running Does it happen every time, or just at random? |
No, it's just random, and there are no indications of any errors of any kind before it happens. |
I did use the link, and it returns the instance ID successfully. Sometimes it doesn't, though, which causes this error. Do you know of any reason for that? |
Ah. If it's random, the requests are probably getting throttled. You should check whether there's a resource-agents update for your distro, as retry functionality has been added to avoid this: If the latest version for your distro still has the issue, you should report the bug to them and provide the link to the fix, so they can fix it. |
Thanks man. So just to make sure: that's a resource-agents error, not something on the AWS side? |
The issue can be caused by a hiccup in the AWS metadata service or the network, or simply by the service throttling requests when it receives too many over a short period. The fix makes the agent retry a set number of times before failing, and lets the user configure the number of retries and the sleep between retries to suit their setup. |
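To illustrate the retry-and-sleep idea described above, here is a minimal Python sketch. It is not the agent's actual code (the agent is a shell script); the function name `fetch_with_retries` and its parameters are hypothetical and only show the general pattern of retrying a metadata lookup a configurable number of times with a pause in between.

```python
import time


def fetch_with_retries(fetch, retries=3, sleep_s=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-empty value or retries run out.

    fetch   -- hypothetical callable that performs the metadata request
    retries -- total number of attempts before giving up
    sleep_s -- seconds to wait between attempts
    sleep   -- injectable sleep function (handy for testing)
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            value = fetch()
            if value:
                return value
            last_error = RuntimeError("empty response")
        except OSError as exc:  # connection refused, timeouts, URLError, ...
            last_error = exc
        if attempt < retries:
            sleep(sleep_s)
    raise RuntimeError(f"gave up after {retries} attempts: {last_error}")
```

With a pattern like this, a single throttled or refused request no longer fails the monitor operation outright. The total retry time still has to fit inside the operation timeout configured in the cluster.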
Thanks man. Great help. |
Hi All,
We are running a two-node Pacemaker cluster in AWS and use the "awsvip" resource type to configure the VIP. The configuration is below:
pcs resource show privip_node1
 Resource: privip_node1 (class=ocf provider=heartbeat type=awsvip)
  Attributes: secondary_private_ip=10.x.x.x
  Operations: migrate_from interval=0s timeout=30s (privip_node1-migrate_from-interval-0s)
              migrate_to interval=0s timeout=30s (privip_node1-migrate_to-interval-0s)
              monitor interval=20s timeout=30s (privip_node1-monitor-interval-20s)
              start interval=0s timeout=30s (privip_node1-start-interval-0s)
              stop interval=0s timeout=30s (privip_node1-stop-interval-0s)
              validate interval=0s timeout=10s (privip_node1-validate-interval-0s)
pcs resource show node1_vip
 Resource: node1_vip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.x.x.x
  Operations: monitor interval=10s timeout=20s (node1_vip-monitor-interval-10s)
              start interval=0s timeout=20s (node1_vip-start-interval-0s)
              stop interval=0s timeout=20s (node1_vip-stop-interval-0s)
The EC2 instance is configured to use IMDSv2. The fence_aws agent and resource-agents have also been upgraded to the most recent versions, which support IMDSv2. Additionally, the resource is set up to use the IAM profile credentials.
fence-agents-aws-4.2.1-41.el7_9.3.x86_64
python-s3transfer-0.1.13-1.0.1.el7.noarch
resource-agents-4.1.1-61.el7_9.15.x86_64
pip list | grep -i boto
boto3 (1.10.0)
botocore (1.13.50)
aws --version
aws-cli/2.9.4 Python/3.9.11 Linux/3.10.0-1160.80.1.0.1.el7.x86_64 exe/x86_64.oracle.7 prompt/off
pip3 list | grep -i boto
boto3 1.23.10
botocore 1.26.10
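For reference, the IMDSv2 lookup mentioned above (request a session token first, then pass it when reading the instance ID) can be sketched in Python as follows. This is only an illustration for checking the metadata service by hand, not what the awsvip agent runs; the helper names `token_request`, `instance_id_request` and `get_instance_id` are hypothetical.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def token_request(ttl_seconds=21600):
    """Build the PUT request that asks IMDSv2 for a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def instance_id_request(token):
    """Build the GET request for the instance ID, authenticated by token."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
    )


def get_instance_id(timeout=2):
    """Fetch the instance ID; only works when run on an EC2 instance."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(instance_id_request(token), timeout=timeout) as resp:
        return resp.read().decode().strip()
```

Calling `get_instance_id()` repeatedly on the affected node is one way to see whether the metadata service intermittently refuses or delays requests.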
The privip resource consistently fails with different errors:
pengine: warning: unpack_rsc_op_failure: Processing failed monitor of privip_node2 on node2: unknown error | rc=1
Apr 13 11:09:54 node2 lrmd[3773]: warning: privip_node2_monitor_20000 process (PID 109357) timed out
Apr 13 11:09:54 node2 lrmd[3773]: warning: privip_node2_monitor_20000 process (PID 109357) timed out
Apr 13 11:09:54 node2 lrmd[3773]: warning: privip_node2_monitor_20000:109357 - timed out after 30000ms
Jun 16 10:01:43 node2 lrmd[36967]: notice: privip_node2_monitor_20000:13042:stderr [ Unable to locate credentials. You can configure credentials by running "aws configure". ]
Jun 16 10:01:43 node2 crmd[36970]: notice: privip_node2_monitor_20000:91 [ % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 359 100 359 0 0 37513 0 --:--:-- --:--:-- --:--:-- 39888\n\nUnable to locate credentials. You can configure credentials by running "aws configure".\n ]
Jun 22 10:10:10 node1 lrmd[12465]: notice: privip_node1_monitor_20000:105561:stderr [ #15 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed connect to 169.254.169.254:80; Connection refused ]
Jun 22 10:10:10 node1 lrmd[12465]: notice: privip_node1_monitor_20000:105561:stderr [ #15 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed connect to 169.254.169.254:80; Connection refused ]
Jun 22 10:10:10 node1 lrmd[12465]: notice: privip_node1_monitor_20000:105561:stderr [ An error occurred (MissingParameter) when calling the DescribeInstances operation: The request must contain the parameter InstanceId ]
Failed Resource Actions:
last-rc-change='Fri May 26 07:27:46 2023', queued=0ms, exec=6597ms
Any advice would be great.