This reposity contains sample Terraform to deploy a cluster using AWS Parallel Computing Service.
The architure consistes of
- An LDAP server for authentication and authorization.
- A single login node.
- FSx OpenZFS file system for applications and users home directories, mounted at
/swand/home. - FSx Lustre parallel file system, mounted at
/fsxover EFA on the compute instances. - An X86 compute queue consisting of hpc6a.48xlarge and hpc7a.96xlarge instances (default partition).
- An ARM compute queue consisting of m8g.16xlarge instances.
A custom AMI is created from the AL2023 base AMI. This AMI has the following components installed
- PCS agent.
- EFA drivers.
- Lustre drivers.
- Slurm.
- SRSO mitigation turned off.
- CloudWatch agent.
- SSS set to do authentication and authorization against an LDAP server.
Top-level variables should be set in the terraform.tfvars file before deployment.
The most important ones are
projectcreated_dateregion&availability_zonessh_keyusers
The project is used as a toplevel name for all the resources, while the created_date is applied
as a tag to all resources.
The region and availability_zone are the AWS Region and Availability Zone the cluster should
be deployed in.
While the ssh_key is the public ssh key used for access to the
Amazon EC2 instances.
The users variable is a LDIF file name that contains all the
cluster users to add to LDAP. Users will ssh into the cluster
using only their ssh key, there is no password access.
Other variables are defined in variables.tf and of course can/should
be overridden in terraform.tfvars.
A user LDIF file needs to be created to grant access to the
cluster. This file is populated into LDAP via ldapadd.
The following is an example of a user definition.
dn: cn=Fred Smith,ou=people,dc=my-domain,dc=com
givenName: Fred
sn: Smith
cn: Fred Smith
uid: fsmith
uidNumber: 2000
gidNumber: 2000
homeDirectory: /home/fsmith
loginShell: /bin/bash
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: top
objectClass: ldapPublicKey
mail: fred.smith@example.com
sshPublicKey: ssh-ed25519 ABCDC3NzaC1lZDI1NTE5AAAAIA2dKyl6lAU+0gGefezaxqX8DtWCND6F/o/sjkS6zsvD
The wheel group has sudo privileges to to run ALL commands as ALL users.
If the user should be added to the wheel group, you need to add a
memberUid to the wheel distinguished name for the user.
The following entry adds fsmith to the wheel DN.
dn: cn=wheel,ou=groups,dc=my-domain,dc=com
changetype: modify
add: memberUid
memberUid: fsmith
Deploying the stack is done by the typical Terraform project.
terraform init
terraform plan
terraform apply -auto-approve
Once the stack has been deployed, the external IP addresses of the login nodes
will be listed in the outputs pcs_login_public_ips.
To create a dependency graph.
terraform graph -type=plan | dot -Tpng >graph.png
To tear it all down.
terraform destroy -auto-approve
The LDAP secret should be deleted by Terraform, if it isn't you can force a deletion with the following command.
aws secretsmanager delete-secret --secret-id ldap-password --force-delete-without-recovery --region us-east-2

