When creating infrastructure for distributed ML or HPC systems, networking between the nodes of a cluster becomes important. Distributed applications, such as distributed training of deep learning models, typically use the Message Passing Interface (MPI) or the NVIDIA Collective Communications Library (NCCL), which interface with the Libfabric API. Libfabric bypasses the operating system kernel and communicates directly with the EFA device to put packets on the network, reducing overhead and enabling the distributed application to run more efficiently.
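As a quick sanity check of this stack, you can ask NCCL to log which network transport it selects at startup. A minimal sketch, assuming the aws-ofi-nccl plugin is installed and `train.py` is a placeholder for your own training script:

```
# Ask NCCL to log its network selection at init time; with the aws-ofi-nccl
# plugin installed you should see it pick the Libfabric EFA provider.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Placeholder launch command for a 2-node, 8-GPU-per-node job
torchrun --nnodes 2 --nproc_per_node 8 train.py
```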
EFA is supported on many CPU- and GPU-based instance types. Please see the list of supported instance types.
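If you would rather check programmatically, the EC2 API can filter instance types by EFA support; this is a standard AWS CLI query:

```
# List all instance types in the current region that support EFA
aws ec2 describe-instance-types \
  --filters Name=network-info.efa-supported,Values=true \
  --query "InstanceTypes[*].InstanceType" \
  --output text
```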
Clusters created by following the steps here are automatically EFA-enabled and need no further action: https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/2.aws-parallelcluster
Once the cluster is created, you can install the AWS EFA Kubernetes Device Plugin as follows:

```
helm repo add eks https://aws.github.io/eks-charts
helm install efa eks/aws-efa-k8s-device-plugin -n kube-system
```

Once this is done, you should see the following pods:
```
root@cb9511473ccc:/eks/deployment/efa-device-plugin# kubectl get pods -A
NAMESPACE     NAME                                        READY   STATUS    RESTARTS   AGE
kube-system   aws-efa-k8s-device-plugin-daemonset-78x4q   1/1     Running   0          38m
kube-system   aws-efa-k8s-device-plugin-daemonset-tgfbk   1/1     Running   0          38m
```

You can use the EKS Node Viewer tool to view the nodes in your cluster and their status. Once it is installed, simply type `eks-node-viewer` in the console, or `nv` in the aws-do-eks container, to get the following view:
```
3 nodes (650m/199290m) 0.3% cpu ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ $82.272/hour | $60058.195/month
21 pods (0 pending 21 running 21 bound)

ip-192-168-120-214.us-west-2.compute.internal cpu ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 4% (8 pods) c5.2xlarge/$0.3400     On-Demand - Ready
ip-192-168-165-37.us-west-2.compute.internal  cpu ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0% (7 pods) p4de.24xlarge/$40.9657 On-Demand - Ready
ip-192-168-164-33.us-west-2.compute.internal  cpu ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0% (6 pods) p4de.24xlarge/$40.9657 On-Demand - Ready
•
←/→ page • q: quit
```

Here the node viewer shows the IP addresses of my two p4de.24xlarge compute nodes. We can take one of these IP addresses and describe the node:
```
kubectl describe node ip-192-168-165-37.us-west-2.compute.internal
```

The above command prints a lot of detail about the node. To confirm that EFA is installed correctly, make sure the Allocatable section includes the vpc.amazonaws.com/efa resource:
```
Allocatable:
  cpu:                    95690m
  ephemeral-storage:      868645791124
  hugepages-1Gi:          0
  hugepages-2Mi:          21122Mi
  memory:                 1146004920Ki
  nvidia.com/gpu:         8
  pods:                   250
  vpc.amazonaws.com/efa:  4
```

For p4 nodes you will see `vpc.amazonaws.com/efa: 4`, and for p5.48xlarge nodes you should see `vpc.amazonaws.com/efa: 32`.
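To check this resource on every node at once, a standard kubectl custom-columns query works; note that the dots in the resource name must be escaped:

```
# Show the allocatable EFA device count per node
kubectl get nodes -o custom-columns="NAME:.metadata.name,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"
```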
Tip: If EFA is enabled in the node group, edit the security group that the nodes are attached to and add a rule that allows all outgoing traffic originating from the same security group. This is required for EFA to work.
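As a sketch of that rule with the AWS CLI (the security group ID is a placeholder for the one attached to your nodes; EFA also requires the matching inbound rule from the same group):

```
# Allow all outbound traffic to members of the same security group
aws ec2 authorize-security-group-egress \
  --group-id sg-0123456789abcdef0 \
  --ip-permissions 'IpProtocol=-1,UserIdGroupPairs=[{GroupId=sg-0123456789abcdef0}]'
```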
Tip: Do this if your pods are labeled.
First, SSH into a compute node and run `fi_info` as shown below. On a p4de.24xlarge node, you should see the following output; on a p5.48xlarge node, you should see 32 rdmapxxsxx-rdm domains. The numerical string following rdmap remains the same across nodes.
```
ubuntu@awsankur-p4de-st-awsankur-p4de-1:~$ fi_info -p efa -t FI_EP_RDM
provider: efa
    fabric: efa
    domain: rdmap16s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap32s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap144s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap160s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
```
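A quick way to count the EFA interfaces instead of reading through the full listing (plain shell over the same `fi_info` output):

```
# Count EFA RDM domains; expect 4 on p4de.24xlarge and 32 on p5.48xlarge
fi_info -p efa -t FI_EP_RDM | grep -c "domain:"
```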
To run the EFA burn test, follow the steps here: https://github.com/aws-samples/aws-do-eks/tree/main/Container-Root/eks/deployment/efaburn
To test whether the EFA security group is set up correctly, you can run a pingpong test over EFA. SSH into a compute node and start a server:

```
fi_pingpong -p efa
```
Then, in another shell, SSH into another compute node and run a client, passing the address of the node on which the server was started:

```
fi_pingpong -p efa <server-node-address>
```
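For example, if the server was started on the node at 192.168.165.37 (one of the node IPs from the cluster above), the client invocation would look like:

```
# Connect to the fi_pingpong server by IP address or resolvable hostname
fi_pingpong -p efa 192.168.165.37
```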
On a 2-node p4de.24xlarge cluster I see:

```
bytes   #sent   #ack    total   time    MB/sec  usec/xfer   Mxfers/sec
64      10      =10     1.2k    0.00s     1.88      34.10         0.03
256     10      =10     5k      0.00s    12.93      19.80         0.05
1k      10      =10     20k     0.00s    52.51      19.50         0.05
4k      10      =10     80k     0.00s   200.78      20.40         0.05
```
NCCL tests are a standard way to measure network bandwidth over EFA. Follow the steps here to run NCCL tests.
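As a sketch of what such a run looks like with Open MPI on two p4de nodes (the hostfile, binary path, and flag values are illustrative placeholders, not tuned settings):

```
# Run the all_reduce_perf benchmark from nccl-tests across 2 nodes x 8 GPUs,
# sweeping message sizes from 8 B to 1 GB
mpirun -np 16 -N 8 --hostfile ./hosts \
  -x FI_PROVIDER=efa \
  -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```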
| EC2 Instance | # EFA devices | Total network bandwidth | Max bandwidth per GPU | RDMA |
|---|---|---|---|---|
| p4de.24xlarge | 4 | 50 GB/s | | Only Reads |
| p5.48xlarge | 32 | 400 GB/s | | Only Writes (for now?) |
With RDMA, GPUs across nodes can directly read and write to each other's memory.
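Depending on your NCCL and aws-ofi-nccl versions, GPU RDMA over EFA may need to be enabled explicitly; treat the following as an assumption to verify against your stack rather than a required setting:

```
# On p4d/p4de, older aws-ofi-nccl versions required opting in to
# GPUDirect RDMA over EFA; newer versions enable it automatically.
export FI_EFA_USE_DEVICE_RDMA=1
```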
Huge pages are a memory-management technique used in modern computer systems to improve performance by using larger memory blocks than the default page size. Set FI_EFA_USE_HUGE_PAGE=0 when you see os.fork() fail with OSError: Cannot allocate memory, which typically happens with multi-process PyTorch data loaders. Disabling huge pages causes a minor performance hit, but it is needed to prevent fork failures caused by the operating system running out of huge pages.
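A minimal sketch of the workaround, set in the shell (or the pod spec) before launching the training job:

```
# Disable the EFA provider's huge-page allocations so that forked workers
# (e.g., a PyTorch DataLoader with num_workers > 0) do not fail in os.fork()
export FI_EFA_USE_HUGE_PAGE=0
```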