AWS HPC Blog

Multiple Availability Zones now supported in AWS ParallelCluster 3.4

Previous versions of AWS ParallelCluster limited HPC clusters to a single Amazon EC2 Availability Zone.

With AWS ParallelCluster 3.4.0, you can now create clusters that use multiple Availability Zones in a Region. This gives you more options to provision computing capacity for your HPC workloads.

What’s New

AWS ParallelCluster helps you build HPC clusters that can elastically scale to the size of your computing workload. The Amazon EC2 capacity in a single Availability Zone is sufficient for many customers’ HPC workloads. But larger scale-out workloads can require even bigger resource pools. Customers running such workloads have asked for the ability to combine Amazon EC2 capacity across Availability Zones.

For example, Electronic Design Automation (EDA) jobs typically spawn thousands of independent, loosely coupled tasks with no inter-process communication. That means they don’t need to be restricted to a single Availability Zone. Instead, they can benefit from combining Amazon EC2 Spot capacity from multiple Availability Zones. Monte Carlo simulations in the Financial Services sector often follow a similar pattern.

Some customers manage access to multiple Availability Zones by creating an HPC cluster in each. However, it can be simpler to manage these multi-zone resources as a single HPC cluster. That way, common tasks like user management, job accounting, and software installation only have to be done once, while the cluster retains access to instances running in multiple Availability Zones.

Tightly coupled workloads like Computational Fluid Dynamics (CFD) and weather modeling need many homogeneous instances connected by high-performance networking. You may still benefit from being able to launch instances across multiple Availability Zones, but you would need to take care to restrict any given multi-node job request to instances in a single Availability Zone.

Using Multi-Availability Zone Clusters

In November 2022, AWS ParallelCluster 3.3 added the ability to specify multiple instance types in a Slurm job queue as a way to aggregate compute capacity. AWS ParallelCluster 3.4 adds to this flexibility by allowing you to associate one or more subnets from different Availability Zones with each job queue. ParallelCluster can then launch instances of the requested type(s) in the requested Availability Zone(s) as it scales those queues up.

To get started with multi-zone clusters, you’ll need ParallelCluster 3.4.0 or higher. You can follow this online guide to help you upgrade. Next, edit your cluster configuration as described in the examples below and the AWS ParallelCluster documentation. Finally, create a cluster using the new configuration.
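As a minimal sketch, assuming the AWS ParallelCluster CLI is already installed and configured, creating a cluster from an updated configuration file looks like the following. The cluster name and configuration file name here are placeholders.

# Confirm you are running ParallelCluster 3.4.0 or later
pcluster version

# Create a cluster from a configuration that assigns multiple subnets to its queues
pcluster create-cluster \
  --cluster-name multi-az-demo \
  --cluster-configuration cluster-config.yaml

# Check on progress until the cluster reaches CREATE_COMPLETE
pcluster describe-cluster --cluster-name multi-az-demo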

Let’s look at two examples.

Example 1: Single Queue with Multiple Instance Types and Multiple Availability Zones

Here is a Slurm queue that aggregates On-Demand instance capacity from three instance types and three Availability Zones in a Region. This configuration creates a single pool of loosely coupled instances spanning Availability Zones. It could power large numbers of independent jobs, such as EDA or Monte Carlo simulations.

Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: q1
    ComputeResources:
    - Name: cr1
      Efa:
        Enabled: false
      Instances:
        - InstanceType: c6a.48xlarge
        - InstanceType: m6a.48xlarge
        - InstanceType: r6a.48xlarge
      MinCount: 0
      MaxCount: 128
    Networking:
      SubnetIds:
        - subnet-012abc01
        - subnet-012abc02
        - subnet-012abc03

If we log into this cluster and submit a few jobs, Amazon EC2 instances will be launched to handle the work. In Figure 1, we have submitted six jobs to the cluster and then inspected the state of the cluster queues with squeue. We can see that all six jobs are running (R).
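For illustration only, here is one way such a batch of jobs could be submitted from the head node. The sleep workload stands in for a real application and may not match what was actually run for the figure; the partition name q1 comes from the queue defined above.

# Submit six single-node placeholder jobs to the q1 partition
for i in $(seq 1 6); do
  sbatch --partition=q1 --nodes=1 --wrap "sleep 600"
done

# List the jobs and their states
squeue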

Figure 1: HPC jobs running in an autoscaling Slurm queue

Now, let’s look at the Amazon EC2 Console to see where the instances are running (Figure 2).

Figure 2: Elastically-scaled Slurm jobs running in three Availability Zones

Rather than running in a single Availability Zone, the instances are spread among us-east-2a, us-east-2b, and us-east-2c. But they behave as parts of a single HPC cluster: they can communicate with one another and with the cluster head node, and they each mount the cluster’s default Network File System (NFS) directories. Note that there are some limits to inter-node communication and filesystem access, which I outline later in this article when I discuss the implications of spreading instances across Availability Zones.

Example 2: Two Queues Using Multiple Instance Types and Different Availability Zones

In this second example, I configured one cluster with two Slurm queues, each associated with a single, different Availability Zone. This gives us two elastic pools of c6a.48xlarge instances, one in each Availability Zone.

Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: usea
    CapacityType: ONDEMAND
    ComputeResources:
    - Name: cr1
      Instances:
      - InstanceType: c6a.48xlarge
      MinCount: 0
      MaxCount: 128
      Efa:
        Enabled: true
    Networking:
      SubnetIds:
        - subnet-xxxxxaaa
  - Name: useb
    CapacityType: ONDEMAND
    ComputeResources:
    - Name: cr1
      Instances:
      - InstanceType: c6a.48xlarge
      MinCount: 0
      MaxCount: 128
      Efa:
        Enabled: true
    Networking:
      SubnetIds:
        - subnet-xxxxxbbb

Instances launched within a given queue will all be in the same Availability Zone and will be able to communicate with one another. However, rather than having a standalone HPC cluster in each Availability Zone, the user management, accounting, and other common infrastructure (such as the home filesystem) are consolidated into a single cluster. This configuration could be a good fit for tightly coupled workloads such as CFD.
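For example, a multi-node job can be pinned to a single Availability Zone by submitting it to the partition that corresponds to one of these queues. The application name and node count below are placeholders.

# Run an 8-node MPI job entirely inside the Availability Zone served by the usea queue
sbatch --partition=usea --nodes=8 --exclusive --wrap "srun ./my_cfd_solver"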

The Implications of Multiple Availability Zones

We’ve shown you how to configure multi-zone queues in a cluster. It’s equally important to discuss some of the limitations of this approach, including the following:

  • Elastic Fabric Adapter (EFA) cannot be enabled in queues that span Availability Zones.
  • Cluster placement groups are not supported in queues that span Availability Zones.
  • Network traffic between Availability Zones is subject to higher latency and incurs network charges.

Each of these can have implications for your HPC queue and storage architecture.

Queue architecture

We don’t recommend running tightly coupled workloads in a multi-zone queue like the one in Example 1, because they will experience higher network latency. This is driven by three factors. First, without EFA support, communications run over regular TCP/IP rather than the Scalable Reliable Datagram (SRD) protocol. Second, without cluster placement groups, there is no way to request that instances launch physically close together. Third, network traffic that crosses Availability Zones has higher latency due to longer network paths. We suggest using single-Availability-Zone queues for latency-sensitive applications, as illustrated in Example 2 above.
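For reference, here is a minimal sketch of a latency-sensitive queue confined to one Availability Zone, with EFA and a cluster placement group enabled. The queue name, subnet ID, and instance counts are placeholders.

Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: tightly-coupled
    Networking:
      SubnetIds:
        - subnet-0123456789abcdef0   # one subnet, so one Availability Zone
      PlacementGroup:
        Enabled: true                # ask EC2 to pack instances close together
    ComputeResources:
    - Name: cr1
      Instances:
      - InstanceType: c6a.48xlarge
      MinCount: 0
      MaxCount: 64
      Efa:
        Enabled: true                # SRD-based networking between instances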

Storage architecture

Two considerations determine how spreading instances across multiple Availability Zones can affect your shared storage architecture. First, network security groups and VPC routes must be configured to allow compute nodes to reach the shared filesystems they mount; in a multi-zone setup, this requires extra configuration. Second, filesystem network traffic between Availability Zones adds latency and cost. The nature of these impacts varies depending on how your HPC workloads use shared storage.

For example, AWS ParallelCluster automatically exports some /opt subdirectories and the /home directory from the cluster head node using NFS, and it configures your multi-zone cluster so that all compute nodes mount these shares. If those compute nodes are in a different Availability Zone from the cluster head node, filesystem I/O to the shared directories will have higher network latency and accrue additional costs. That might, in turn, degrade the performance of jobs running on those nodes. One way to avoid such impacts is to minimize reliance on the shared filesystems, using them only to hold common software and configuration files rather than temporary files or job outputs.

If you are using a configuration similar to Example 2, you can set up a shared filesystem in each Availability Zone. Amazon FSx for Lustre is a good option for this since it’s designed to provide high-performance filesystems in a single Availability Zone. AWS ParallelCluster can provision an Amazon FSx for Lustre filesystem for you. You can also create one in the AWS Console and configure your cluster to use it.
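As a sketch, a ParallelCluster-managed FSx for Lustre filesystem can be declared in the top-level SharedStorage section of the cluster configuration. The mount path, name, and capacity below are placeholders, and an existing filesystem could be attached instead by specifying FileSystemId under FsxLustreSettings.

SharedStorage:
  - MountDir: /fsx
    Name: scratch-fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200      # GiB
      DeploymentType: SCRATCH_2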

If you need compute instances in multiple Availability Zones to access the same shared storage, consider Amazon Elastic File System (Amazon EFS), which provides serverless, elastic file storage. Since Amazon EFS filesystems are redundant across Availability Zones, they are well-suited for use in a multi-zone cluster. You can either create an Amazon EFS filesystem yourself and configure your cluster to use it, or AWS ParallelCluster can create one for you automatically.
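Here is a minimal sketch of an Amazon EFS entry in the same SharedStorage section. The mount path and name are placeholders, and an existing filesystem can also be referenced with FileSystemId under EfsSettings.

SharedStorage:
  - MountDir: /shared
    Name: shared-efs
    StorageType: Efs
    EfsSettings:
      Encrypted: true
      PerformanceMode: generalPurpose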

Conclusion

With AWS ParallelCluster 3.4, it is now possible to build HPC clusters that combine Amazon EC2 capacity across multiple Availability Zones. This makes it easier to operate larger, more complex HPC clusters capable of handling even more challenging workloads. You’ll need to upgrade your ParallelCluster installation to 3.4.0 or higher and update your cluster configurations to take advantage of this new capability.

We are excited to hear what you think of multi-zone clusters in AWS ParallelCluster, and how we can improve this new feature. Reach out to us on Twitter at @TechHPC with your success stories, feedback, and ideas.

Matt Vaughn

Matt Vaughn is a Principal Developer Advocate for HPC and scientific computing. He has a background in life sciences and building user-friendly HPC and cloud systems for long-tail users. When not in front of his laptop, he’s drawing, reading, traveling the world, or playing with the nearest dog.