mayankgupta14
Collaborator

Issue #, if available:

Description of changes:

Updated the mount_fsx.sh to install and configure EFA for FSxL on supported instance types.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Comment on lines 67 to 68
case "$instance_type" in
p6-b200.*|p6e-gb200.*|p4d.*|p5.*|p5e.*|p5en.*|trn1.*|trn1n.*|c5n.*|c6gn.*|c6in.*|c7gn.*|g6.*|g6e.*|m5dn.*|m5n.*|m5zn.*|m6a.*|m6i.*|m6in.*|r5dn.*|r5n.*|r6in.*|x2gd.*)
Contributor

Not all p5.* and g* instance sizes support EFA (e.g. p5.4xl, g6.xl), do they?

Collaborator Author

Agreed, that was an oversight on my side, thank you. I'll update the function to check for the EFA device and drivers on the instance instead of matching on instance type. I'll address the other concerns from Matt as well.
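To illustrate the reviewer's point, here is a minimal sketch of why family-level globs over-match: a pattern such as `p5.*` in a `case` statement accepts every size in the family, including the sizes flagged above. The helper name `matches_family_glob` is hypothetical, and only three of the PR's patterns are reproduced.

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the instance type matches a family-level glob.
# It says nothing about whether that particular size actually has an EFA device.
matches_family_glob() {
    local instance_type="$1"
    case "$instance_type" in
        p5.*|g6.*|p4d.*) return 0 ;;
        *) return 1 ;;
    esac
}

# The glob cannot tell sizes within a family apart.
for t in p5.48xlarge p5.4xlarge g6.xlarge m5.large; do
    if matches_family_glob "$t"; then
        echo "$t: matched"
    else
        echo "$t: not matched"
    fi
done
```

Because every size in a listed family matches, a family glob alone cannot answer "does this instance have EFA?", which is why checking for the device and driver on the host is the more reliable approach.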

return 0
fi

# EFA-supported instance types
Contributor

While I agree with this methodology for fetching the instance type, just because a customer is using an EFA-supported instance doesn't mean they have deployed an EFA-enabled FSxL file system. Both need to be true.

Collaborator Author

@mayankgupta14 mayankgupta14 Sep 12, 2025

So, we are checking the instance type to make sure it is supported by EFA, and by EFA-enabled FSxL, before configuring the EFA<>FSxL client. There is no drawback to having this client installed. We can't check whether the FSxL file system was created with EFA enabled unless we make the describe FSx API call.
While I agree that both need to be true for comms to work over EFA, that isn't required for client installation. We need to make sure the instance is supported and the EFA drivers are installed on the instance before configuring the client.

I'll update the function to check for EFA drivers on the instance.
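For reference, the describe call mentioned above could look roughly like the sketch below. It assumes the AWS CLI is installed, that the instance role allows `fsx:DescribeFileSystems`, and that the file system ID is available to the script (the PR only shows `FSX_DNS_NAME`, so passing an ID is an assumption). The function name `fsx_efa_enabled` is hypothetical, and the `EfaEnabled` field under `LustreConfiguration` is what the FSx API returns to my understanding.

```bash
#!/usr/bin/env bash
# Sketch: ask the FSx API whether a Lustre file system was created with EFA enabled.
# Returns 0 if enabled, 1 if not enabled or unknown, 2 if the API call itself failed.
fsx_efa_enabled() {
    local fs_id="$1"
    local enabled
    enabled=$(aws fsx describe-file-systems \
        --file-system-ids "$fs_id" \
        --query 'FileSystems[0].LustreConfiguration.EfaEnabled' \
        --output json 2>/dev/null) || return 2
    case "$enabled" in
        true|True) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller could then gate the client setup with something like `fsx_efa_enabled "$fs_id" && configure_efa_lustre`, where `fs_id` is a placeholder for however the script obtains the file system ID.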

p6-b200.*|p6e-gb200.*|p4d.*|p5.*|p5e.*|p5en.*|trn1.*|trn1n.*|c5n.*|c6gn.*|c6in.*|c7gn.*|g6.*|g6e.*|m5dn.*|m5n.*|m5zn.*|m6a.*|m6i.*|m6in.*|r5dn.*|r5n.*|r6in.*|x2gd.*)
echo "[INFO] Instance type $instance_type supports EFA"

# Download EFA configuration script with validation
Contributor

What happens if the FSxL file system is not EFA enabled? Is there potential for this install to create issues if the customer is using a non-EFA-enabled file system?

Collaborator Author

If the customer is using a non-EFA-enabled FSxL file system, comms fall back to TCP automatically. Installing the EFA client on the instance has no drawback.

echo "Mount_fsx called with fsx_dns_name: $FSX_DNS_NAME, fsx_mountname: $FSX_MOUNTNAME"
echo "Using mount_point: $MOUNT_POINT"
echo "LUSTRE CLIENT CONFIGURATION $(print_lustre_version)"
configure_efa_lustre
Contributor

Same comment as above: what if the FSxL file system is not EFA enabled?

Collaborator Author

If the customer is using a non-EFA-enabled FSxL file system, comms fall back to TCP automatically. Installing the EFA client on the instance has no drawback.

@nghtm nghtm requested a review from paragao September 11, 2025 01:10
@KeitaW
Contributor

KeitaW commented Sep 11, 2025

Also, do we have any use case where customers (Cx) can benefit from FSxL + EFA? I have not seen one so far.

ansible localhost -m ansible.builtin.unarchive -a "src=/tmp/configure-efa-fsx-lustre-client.zip dest=/tmp remote_src=yes"

# Make script executable and run it
ansible localhost -b -m ansible.builtin.file -a "path=/tmp/configure-efa-fsx-lustre-client.sh mode='0755'"

localhost | FAILED! => {
"changed": false,
"msg": "file (/tmp/configure-efa-fsx-lustre-client.sh) is absent, cannot continue",
"path": "/tmp/configure-efa-fsx-lustre-client.sh",
"state": "absent"
}

The path should be /tmp/configure-efa-fsx-lustre-client/setup.sh

Collaborator Author

good catch, thank you.
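A small guard like the one below would surface this kind of path mismatch with a clear message before anything tries to run the script. This is only a sketch: `require_setup_script` is a hypothetical helper, and it uses plain `chmod` rather than the Ansible `file` module the script uses.

```bash
#!/usr/bin/env bash
# Sketch: fail early if the extracted setup script is not where we expect it,
# then make it executable.
require_setup_script() {
    local script_path="$1"
    if [ ! -f "$script_path" ]; then
        echo "[ERROR] Expected setup script not found: $script_path" >&2
        return 1
    fi
    chmod 0755 "$script_path"
}
```

With the corrected path from this thread, the call would be `require_setup_script /tmp/configure-efa-fsx-lustre-client/setup.sh`.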

@elgalu

elgalu commented Sep 12, 2025

Trying...

mount -t lustre -o noatime,flock,_netdev,x-systemd.automount,x-systemd.requires=network-online.target,net=efa fs-05b6aafaa26ba451a.fsx.eu-north-1.amazonaws.com@tcp:/xduwpbev /mnt/leo2

@elgalu

elgalu commented Sep 12, 2025

mount -t lustre -o noatime,flock,_netdev,x-systemd.automount,x-systemd.requires=network-online.target,net=efa fs-05b6aafaa26ba451a.fsx.eu-north-1.amazonaws.com@tcp:/xduwpbev /mnt/leo2

No, so net=efa doesn't exist. I guess we don't need to pass any EFA-specific option to mount -t lustre ..., right?

@elgalu

elgalu commented Sep 16, 2025

One tip: you need to disable this when the Lustre AZ is different from the instance AZ; otherwise writes to /fsx hang, and I had to wait 60-90 minutes until they timed out. I added a check to mount-fsx to fix it:

mount-fsx: Cross-AZ detected: EFA mount will be disabled

…not and if the FSx file system is in the same AZ as the instance. Cross-AZ EFA comms are not supported, and validating these makes sure the client is only configured when needed, reducing overhead.

2. Also updated the configure client function to check for the EFA device and provider on the instance instead of hardcoding instance types for EFA.
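
The same-AZ validation described in this update could be sketched as follows. This is not the code from the PR: the function names are hypothetical, the instance AZ is read from IMDSv2, and the file system's AZ is taken from its first subnet via the AWS CLI. It assumes the instance and the file system are in the same account, so AZ names are comparable.

```bash
#!/usr/bin/env bash
# Sketch: compare the instance's AZ with the AZ of the file system's subnet.

# Read the instance AZ from the instance metadata service (IMDSv2).
get_instance_az() {
    local token
    token=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
    curl -sf -H "X-aws-ec2-metadata-token: $token" \
        "http://169.254.169.254/latest/meta-data/placement/availability-zone"
}

# Look up the AZ of the file system's first subnet.
get_fsx_az() {
    local fs_id="$1"
    local subnet_id
    subnet_id=$(aws fsx describe-file-systems --file-system-ids "$fs_id" \
        --query 'FileSystems[0].SubnetIds[0]' --output text) || return 1
    aws ec2 describe-subnets --subnet-ids "$subnet_id" \
        --query 'Subnets[0].AvailabilityZone' --output text
}

# Returns 0 if both AZs are known and equal, 1 if they differ, 2 on lookup failure.
is_same_az() {
    local fs_id="$1"
    local instance_az fsx_az
    instance_az=$(get_instance_az) || return 2
    fsx_az=$(get_fsx_az "$fs_id") || return 2
    [ -n "$instance_az" ] && [ "$instance_az" = "$fsx_az" ]
}
```

When `is_same_az` fails, the script can log a message along the lines of the one quoted earlier in the thread and skip the EFA client configuration.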
@allela-roy
Contributor

@mayankgupta14 , LGTM.
However, can you make the cross-AZ check happen first in verify_fsx_efa_compatibility() to avoid the timeouts?

@mayankgupta14
Collaborator Author

> @mayankgupta14 , LGTM. However, can you make the cross-AZ check happen first in verify_fsx_efa_compatibility() to avoid the timeouts?

ack, sure.
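
The ordering requested here can be sketched as below. The body of `verify_fsx_efa_compatibility()` is not shown in this thread, so this is an assumed shape: the three `check_*` functions are hypothetical placeholders, and the only point is that the cross-AZ check runs first and short-circuits the rest.

```bash
#!/usr/bin/env bash
# Hypothetical placeholder checks; replace the bodies with real logic.
check_same_az()         { return 0; }
check_fsx_efa_enabled() { return 0; }
check_local_efa()       { return 0; }

# Sketch: run the cross-AZ check first and stop at the first failing check,
# so nothing EFA-related is attempted when the file system is in another AZ.
verify_fsx_efa_compatibility() {
    if ! check_same_az; then
        echo "[INFO] Cross-AZ detected: skipping EFA client configuration"
        return 1
    fi
    check_fsx_efa_enabled || return 1
    check_local_efa || return 1
}
```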

@paragao paragao changed the title from "updated mount_fsx.sh to configure FSxL EFS client" to "updated mount_fsx.sh to configure FSxL EFA client" Oct 3, 2025
@bluecrayon52
Contributor

@mayankgupta14 is this good to test, or is it still being worked on?

@mayankgupta14
Collaborator Author

> @mayankgupta14 is this good to test or still being worked?

It's good to be tested and merged.

@paragao paragao left a comment

Please review the EFA checks, as using this LCS is blocking the deployment of compute nodes.


# Check if instance has EFA drivers installed and configured
local efa_output
efa_output=$(fi_info -p efa 2>/dev/null)

This check is failing. I suggest you either use the full path /opt/amazon/efa/bin/fi_info, or check for EFA using lsmod | grep efa (driver loaded) or lspci | grep EFA (PCI device exists).
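
The alternatives suggested in this comment can be combined into one detection function, sketched below. The name `has_efa` and the `FI_INFO_BIN` override are hypothetical; the fi_info path is the one given in the comment. Treating any one of the three signals as sufficient is an assumption of this sketch, not something the thread settles.

```bash
#!/usr/bin/env bash
# Sketch: detect EFA without relying on fi_info being on PATH.
# FI_INFO_BIN is a hypothetical override; the default is the path from the review.
FI_INFO_BIN="${FI_INFO_BIN:-/opt/amazon/efa/bin/fi_info}"

has_efa() {
    # 1. libfabric reports an EFA provider (full path, since PATH may be minimal
    #    when the lifecycle script runs).
    if [ -x "$FI_INFO_BIN" ] && "$FI_INFO_BIN" -p efa >/dev/null 2>&1; then
        return 0
    fi
    # 2. The efa kernel module is loaded.
    if lsmod 2>/dev/null | grep -qw '^efa'; then
        return 0
    fi
    # 3. An EFA PCI device is present.
    if lspci 2>/dev/null | grep -q 'EFA'; then
        return 0
    fi
    return 1
}
```

Note that a bare `fi_info -p efa 2>/dev/null` hides a "command not found" error, so a missing PATH entry looks the same as "no EFA". Checking the binary with `-x` first keeps those two cases apart.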

@mayankgupta14
Collaborator Author

> please review the EFA checks as using this LCS is blocking the deployment of compute nodes.

As discussed, the latest commit should address this.

@paragao paragao left a comment

I've run the example on a working Slurm cluster with an EFA-enabled FSx for Lustre file system. I can confirm it works.

@paragao paragao merged commit 59a0c1b into main Oct 16, 2025
4 checks passed
@paragao paragao deleted the fsx_efa_mayankpg branch October 16, 2025 15:43
7 participants