Later this spring, ACES – the new ‘composable’ supercomputer being stood up at Texas A&M University – will begin granting Phase One access to early users. Unlike traditional systems, whose architecture and components are mostly fixed, ACES will have a variety of nodes with a range of processors — CPUs, GPUs, FPGAs, specialized AI processors — that can be dynamically mixed and matched as needed for particular workflows.
Broadly, the NSF-funded ACES system is the first of its kind: a software-defined supercomputer whose secret sauce is Liqid’s Matrix software and fabric. Speaking at the HPC User Forum March 2022 meeting, held virtually last week, ACES principal Honggao Liu shared more details of the ACES (Accelerating Computing for Emerging Sciences) system and its deployment plans. Liu is executive director of high performance research computing (HPRC) at Texas A&M. The full system is scheduled to be available early next year.
“[The] Goals of the ACES platform are to remove the most significant bottleneck in advanced computing by introducing the flexibility to [combine] different components like a processor, accelerator, or memory as needed as the basis to solve complex problems,” said Liu in his presentation.
Next generation infrastructure will be dynamically configurable, he said. “In ACES, each server can dynamically pool CPUs or GPUs or storage from the resource pool using the composable fabric and software. Each server is dynamically configurable based on the workload of your research team. We plan to deploy the Dell Omnia software which will support both Slurm and Kubernetes schedulers and they will be integrated with the Liqid software and fabric.”
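To make the idea of composing a server from a shared pool concrete, here is a minimal toy sketch in Python. It is not Liqid’s Matrix API or the Omnia tooling; the class names, method names and device counts are all hypothetical, chosen only to illustrate how a server might attach devices from a pool for one workload and hand them back afterward.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ResourcePool:
    """Hypothetical pool of devices reachable over a composable fabric."""
    free: Dict[str, int]

    def claim(self, kind: str, count: int) -> None:
        available = self.free.get(kind, 0)
        if count > available:
            raise ValueError(f"only {available} {kind} free, {count} requested")
        self.free[kind] = available - count

    def give_back(self, kind: str, count: int) -> None:
        self.free[kind] = self.free.get(kind, 0) + count


@dataclass
class Server:
    """Hypothetical server that attaches pooled devices on demand."""
    name: str
    attached: Dict[str, int] = field(default_factory=dict)

    def compose(self, pool: ResourcePool, **request: int) -> None:
        # Validate the whole request first so a failure leaves the pool untouched.
        for kind, count in request.items():
            if count > pool.free.get(kind, 0):
                raise ValueError(f"not enough {kind} in the pool")
        for kind, count in request.items():
            pool.claim(kind, count)
            self.attached[kind] = self.attached.get(kind, 0) + count

    def release(self, pool: ResourcePool) -> None:
        # Return every attached device to the pool for the next workload.
        for kind, count in self.attached.items():
            pool.give_back(kind, count)
        self.attached.clear()


# Illustrative numbers only; these are not the ACES device counts.
pool = ResourcePool(free={"gpu": 8, "fpga": 4})
node = Server("node01")
node.compose(pool, gpu=4, fpga=1)
```

In this sketch, calling `node.release(pool)` after the job returns the four GPUs and one FPGA to the pool, which is the property that distinguishes a composable design from a fixed-configuration node.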
The ACES grant was awarded last September to three universities – Texas A&M University, the University of Illinois at Urbana-Champaign and the University of Texas at Austin – which have moved quickly to deploy the new system. Dell is the server supplier. Intel will supply CPUs (48-core Sapphire Rapids), FPGAs, GPUs (Ponte Vecchio), and Optane PCIe SSDs. NEC will supply its Vector Engine. Graphcore will supply its IPU, and NextSilicon will supply its co-processor, whose details are still being kept close to the vest.
Liu declined to elaborate on NextSilicon’s co-processor features. NextSilicon, of course, has been in stealth mode and hasn’t said much about its technology. Liu said, “We don’t have their devices at this point for ACES phase 1 and are in the process [of] purchasing some.” Presumably we’ll learn more as ACES Phase Two proceeds. Liu did briefly touch on the different kinds of workflows ACES would be able to accommodate and showed a couple of slides (below) matching accelerators to well-suited workflows.
Besides highlighting the different accelerator types available, he noted, “If you need large memory you can dynamically compose up to three terabytes of the memory and then you can run your application that needs [such] a large memory.”
“As I mentioned earlier, ACES will use Nvidia-Mellanox 400 Gigabits per second InfiniBand and ACES will include two petabytes of usable DDN Lustre storage. ACES will have a couple login nodes, three management nodes, and two data transfer nodes and we will have 100 Gigabits per second network adapters on the data transfer nodes,” said Liu, noting the latter will allow users to transfer large data sets to ACES at high speed.
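For a rough sense of what a 100 gigabit-per-second adapter means in practice, the back-of-the-envelope sketch below estimates transfer times. It is an idealized calculation, not a measurement of ACES: the function name and the `efficiency` parameter are assumptions introduced here, sizes use decimal units (1 TB = 10^12 bytes), and real transfers will be slower because of protocol overhead, storage speed and the remote end of the link.

```python
def transfer_seconds(size_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Idealized time to move size_bytes over a link of link_gbps gigabits per second.

    efficiency is an assumed fraction of line rate actually achieved (1.0 = perfect).
    """
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9 * efficiency)


TB = 1e12  # decimal terabyte, in bytes

# A 1 TB data set at the full 100 Gb/s line rate: 80 seconds.
one_tb_ideal = transfer_seconds(1 * TB, 100)

# The same data set if only 80% of line rate is achieved: about 100 seconds.
one_tb_at_80_percent = transfer_seconds(1 * TB, 100, efficiency=0.8)
```

Even under the more conservative assumption, a terabyte-scale data set moves in minutes rather than hours, which is the point of putting fast adapters on dedicated data transfer nodes.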
“The ACES configuration supports a broad range of application hardware preferences, depending on the research workflow or application. User[s] can choose to use different components, [such as] accelerators, that work best for the applications or workflows,” he said.
As on Aurora, the DOE exascale system being built at Argonne National Laboratory that also uses Intel CPUs and GPUs, ACES plans to use Intel’s oneAPI framework as the primary programming development tool. Intel has consistently pitched oneAPI as a vendor-neutral approach to porting code to diverse processor types.
About the ACES software environment, Liu said, “ACES will host all major HPC AI/ML software and frameworks. We will support the most widely used [and] recent application software, with support [for] JupyterHub. We plan to offer Intel’s oneAPI as the cross-architecture programming framework [for] CPUs, GPUs and FPGAs. A user can use the same code through oneAPI to run on CPUs, GPUs, and FPGAs.
We will support Slurm and Kubernetes. We will use Anaconda [and] EasyBuild to build your software, and we will provide Singularity and Charliecloud for container applications. On the system side, we will install the XSEDE software stack and also support HTCondor.” (see slide below)
As explained by Liu, ACES will have two phases. “The Phase One prototype, including the Graphcore IPU, Intel D5005 FPGAs, NEC Vector Engine and Liqid-Intel Optane card, will likely be available for early access in April or sometime in May,” he said.
He noted that ACES is being integrated with Texas A&M’s FASTER computer, which is now finishing installation and will become the center’s fastest system. The NSF FASTER award was announced in August of 2020, and the system shares several attributes with ACES, including use of Liqid’s composable fabric.
Here’s a description of FASTER from the Texas A&M website: “FASTER is a 184-node Intel cluster from Dell with an InfiniBand HDR-100 interconnect. A100 GPUs, A10 GPUs, A30 GPUs, A40 GPUs and T4 GPUs are distributed and composable via Liqid PCIe fabrics. All nodes are based on the Intel Ice Lake processor.” FASTER will become available to users in April.
For ACES, all of the Phase Two hardware is expected to arrive at Texas A&M in the June-July timeframe. “We will finish installation and testing acceptance of Phase Two by September 30, 2022,” said Liu. “Then we will start user testing operations for ACES system in the fourth quarter of this year and we plan to have allocation for users early next year.”
Slides courtesy of Dr. Honggao Liu