updated mount_fsx.sh to configure FSxL EFA client #849
case "$instance_type" in
p6-b200.*|p6e-gb200.*|p4d.*|p5.*|p5e.*|p5en.*|trn1.*|trn1n.*|c5n.*|c6gn.*|c6in.*|c7gn.*|g6.*|g6e.*|m5dn.*|m5n.*|m5zn.*|m6a.*|m6i.*|m6in.*|r5dn.*|r5n.*|r6in.*|x2gd.*)
Not all p5.*, g* instances support EFA (e.g. p5.4xl, g6.xl)?
Agreed, oversight on my side, thank you. I will update the function to check for an EFA device and drivers on the instance instead, and will address the other concerns from Matt as well.
return 0
fi

# EFA-supported instance types
While I agree with this method of fetching the instance type, a customer using an EFA-supported instance has not necessarily deployed an EFA-enabled FSxL file system. Both need to be true.
We check the instance type to make sure it supports EFA and is supported by EFA-enabled FSxL before configuring the EFA<>FSxL client. There is no drawback to having this client installed. We can't check whether the FSxL file system was created with EFA enabled unless we make the describe FSx API call.
While I agree that both need to be true for comms to work over EFA, that is not required for client installation. We only need to make sure the instance is supported and EFA drivers are installed before configuring the client.
I'll update the function to check for EFA drivers on the instance.
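For reference, the describe call mentioned above could look roughly like the sketch below. It is not part of this PR. The function name fsx_efa_enabled is hypothetical, and the sketch assumes the AWS CLI is installed, the instance role allows fsx:DescribeFileSystems, and the file system's LustreConfiguration exposes an EfaEnabled field.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): prints "true" or "false" depending on
# whether the FSx for Lustre file system reports EFA as enabled.
# Assumes the AWS CLI and fsx:DescribeFileSystems permission are available.
fsx_efa_enabled() {
    local fs_id="$1" region="$2" enabled
    enabled=$(aws fsx describe-file-systems \
        --file-system-ids "$fs_id" \
        --region "$region" \
        --query 'FileSystems[0].LustreConfiguration.EfaEnabled' \
        --output text 2>/dev/null) || return 2   # 2 = API call failed
    # With --output text the CLI prints True/False, or None if the field is absent.
    case "$enabled" in
        True|true) echo "true" ;;
        *)         echo "false" ;;
    esac
}
```

A caller could skip the client configuration when the function prints "false" or returns non-zero, at the cost of one extra API call per node.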
p6-b200.*|p6e-gb200.*|p4d.*|p5.*|p5e.*|p5en.*|trn1.*|trn1n.*|c5n.*|c6gn.*|c6in.*|c7gn.*|g6.*|g6e.*|m5dn.*|m5n.*|m5zn.*|m6a.*|m6i.*|m6in.*|r5dn.*|r5n.*|r6in.*|x2gd.*)
echo "[INFO] Instance type $instance_type supports EFA"

# Download EFA configuration script with validation
What happens if FSxL is not EFA enabled? Is there potential for this install to create issues if the customer is using a non-EFA-enabled file system?
If the customer is using a non-EFA-enabled FSxL, the comms fall back to TCP automatically. Installing the EFA client on the instance has no drawback.
echo "Mount_fsx called with fsx_dns_name: $FSX_DNS_NAME, fsx_mountname: $FSX_MOUNTNAME"
echo "Using mount_point: $MOUNT_POINT"
echo "LUSTRE CLIENT CONFIGURATION $(print_lustre_version)"
configure_efa_lustre
Same comment as above, what if FSxL FileSystem is not EFA enabled?
If the customer is using a non-EFA-enabled FSxL, the comms fall back to TCP automatically. Installing the EFA client on the instance has no drawback.
Also, do we have any use case where customers can benefit from FSxL + EFA? I have not seen one so far.
ansible localhost -m ansible.builtin.unarchive -a "src=/tmp/configure-efa-fsx-lustre-client.zip dest=/tmp remote_src=yes"

# Make script executable and run it
ansible localhost -b -m ansible.builtin.file -a "path=/tmp/configure-efa-fsx-lustre-client.sh mode='0755'"
localhost | FAILED! => {
"changed": false,
"msg": "file (/tmp/configure-efa-fsx-lustre-client.sh) is absent, cannot continue",
"path": "/tmp/configure-efa-fsx-lustre-client.sh",
"state": "absent"
}
The path should be /tmp/configure-efa-fsx-lustre-client/setup.sh
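A minimal sketch of guarding the script invocation against a wrong path, assuming the archive extracts to /tmp/configure-efa-fsx-lustre-client/ with setup.sh inside, as the comment above states. The function name run_efa_client_setup is hypothetical, and the PR itself drives these steps through ansible rather than plain shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): run the extracted setup script,
# failing early with a clear message if it is not where we expect it.
run_efa_client_setup() {
    # Default path taken from the review comment; override via the first argument.
    local script="${1:-/tmp/configure-efa-fsx-lustre-client/setup.sh}"
    if [ ! -f "$script" ]; then
        echo "[ERROR] $script not found; was the archive extracted?" >&2
        return 1
    fi
    # Make the script executable, then run it.
    chmod 0755 "$script" && "$script"
}
```

Checking for the file before running it turns the "file is absent" failure into an explicit error that names the expected path. In the real lifecycle script the call would likely need root privileges.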
good catch, thank you.
Trying... mount -t lustre -o noatime,flock,_netdev,x-systemd.automount,x-systemd.requires=network-online.target,net=efa fs-05b6aafaa26ba451a.fsx.eu-north-1.amazonaws.com@tcp:/xduwpbev /mnt/leo2
no no, so
One tip: you need to disable this when the Lustre AZ is different from the instance AZ. Otherwise writes to /fsx hang, and I had to wait 60-90 minutes until they timed out. I added a check to mount-fsx to fix it.
…not and if the FSx is in the same AZ as the instance. Cross-AZ EFA comms are not supported, and validating these ensures the client is only configured when needed, reducing overhead. 2. Also updated the configure client function to check for an EFA device and provider on the instance instead of hardcoding instance types for EFA.
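The same-AZ validation described above could be sketched as follows. This is an illustration, not the code from the PR or from the earlier comment. The function names instance_az, fsx_az, and same_az are hypothetical, and the sketch assumes IMDSv2 is reachable, the AWS CLI is installed, and the instance role allows fsx:DescribeFileSystems and ec2:DescribeSubnets.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the PR) for a same-AZ check before enabling EFA.

# Print the instance's Availability Zone using IMDSv2.
instance_az() {
    local token
    token=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
    curl -s -H "X-aws-ec2-metadata-token: $token" \
        "http://169.254.169.254/latest/meta-data/placement/availability-zone"
}

# Print the Availability Zone of the file system's first subnet.
fsx_az() {
    local fs_id="$1" region="$2" subnet
    subnet=$(aws fsx describe-file-systems --file-system-ids "$fs_id" \
        --region "$region" --query 'FileSystems[0].SubnetIds[0]' \
        --output text) || return 1
    aws ec2 describe-subnets --subnet-ids "$subnet" --region "$region" \
        --query 'Subnets[0].AvailabilityZone' --output text
}

# Succeed only when both AZ names are non-empty and identical.
same_az() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}
```

A caller might run `same_az "$(instance_az)" "$(fsx_az "$fs_id" "$region")"` and mount over TCP only when it fails, which avoids the long write timeouts reported for cross-AZ setups.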
@mayankgupta14, LGTM.
ack, sure.
@mayankgupta14 is this good to test or still being worked on?
It's good to be tested and merged.
Please review the EFA checks, as using this LCS is blocking the deployment of compute nodes.
# Check if instance has EFA drivers installed and configured
local efa_output
efa_output=$(fi_info -p efa 2>/dev/null)
This check is failing. I suggest you either use the full path /opt/amazon/efa/bin/fi_info or check for EFA using lsmod | grep efa (driver loaded) or lspci | grep EFA (PCI device exists).
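The three suggestions above can be combined into one detection function, sketched below. This is not the code from the latest commit. The function name has_efa and the FI_INFO_BIN override are hypothetical, and the default path is the one quoted in the comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): return 0 if EFA looks usable, 1 otherwise.
# FI_INFO_BIN is an assumed override; the default is the path from the review.
FI_INFO_BIN="${FI_INFO_BIN:-/opt/amazon/efa/bin/fi_info}"

has_efa() {
    # 1. Query the libfabric efa provider via the full path, since
    #    /opt/amazon/efa/bin may not be on PATH when the script runs.
    if [ -x "$FI_INFO_BIN" ] && "$FI_INFO_BIN" -p efa >/dev/null 2>&1; then
        return 0
    fi
    # 2. Fall back to checking that the efa kernel module is loaded.
    if lsmod 2>/dev/null | grep -qw efa; then
        return 0
    fi
    # 3. Fall back to checking that an EFA PCI device exists.
    if lspci 2>/dev/null | grep -qi efa; then
        return 0
    fi
    return 1
}
```

The ordering is a judgment call: fi_info confirms the provider actually works, while lsmod and lspci only show that the driver or device is present.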
As discussed, the latest commit should address this.
I've run the example on a working SLURM cluster with an EFA-enabled FSx Lustre file system. I confirm it works.
Issue #, if available:
Description of changes:
Updated the mount_fsx.sh to install and configure EFA for FSxL on supported instance types.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.