In the peer-to-peer data center, PCI-Express fabrics will be ubiquitous
Supercomputers are expensive, and they are getting more expensive all the time. While offering impressive performance gains over the past decade, modern HPC workloads require an incredible amount of performance, and this is especially true of any workload that mixes traditional HPC simulation and modeling with some kind of machine learning training and inference. This will almost certainly require GPU acceleration, and GPUs don't come cheap.
Gone are the days when supercomputer nodes were just homogeneous compute engines consisting of many processor cores, the absolute minimum amount of DRAM main memory, and one or two network interface cards. The situation has become much more complex, as different workloads require different ratios of CPU and accelerator compute and different combinations of network bandwidth and even fast access to flash storage. We are therefore fortunate that PCI-Express fabrics based on the very fast Generation 4.0, and soon on the even faster Generation 5.0 – which will also support the asymmetric CXL coherence protocol created by Intel and now adopted by the whole industry – will be ready when organizations need fast and wide PCI-Express networks to create composable collections of compute engines, storage, and network adapters.
As with many technologies, HPC centers and other businesses that run GPU-accelerated HPC and AI workloads are taking a measured approach to adopting PCI-Express switch fabrics as a way to make their infrastructure more malleable and improve its efficiency. While that might be a little disappointing to those of us here at The Next Platform who are a bit impatient about the future, it is not surprising given that people have decades of experience building distributed systems connected by Ethernet or InfiniBand switches and very little experience building infrastructure using PCI-Express fabrics.
All good things take time.
During the SC21 supercomputing conference, we chatted with Sumit Puri, co-founder and CEO of Liqid, one of the several startups that sell PCI-Express fabrics with composability software on top, and arguably the company that has gotten the most traction with this idea of infrastructure disaggregation and composability, which has been long in the making and is as inevitable as taxes and a lot more fun than death.
We were discussing with Puri the very high cost of GPU and FPGA accelerators, and how composable infrastructure like the Liqid Matrix software stack and its two PCI-Express switch enclosures – one of which launched at SC21, and which we will get back to in a minute – could potentially mean that many modest-sized university, government, and enterprise data centers could forgo InfiniBand or Ethernet entirely as the interconnect between their compute nodes. They could simply use a PCI-Express fabric, which is faster, is peer-to-peer across all devices linked to hosts, and costs less than using traditional networks.
"Organizations like the National Science Foundation and the Texas Advanced Computing Center, which are building a composable supercomputer called ACES, are excited to jump into a composable GPU environment," Puri told The Next Platform. "And we think we are going to get them to take advantage of the PCI-Express fabric as a network, which is the next logical step. We are definitely going to have those discussions, and the capability is there. It is just a discussion around architecture."
Here's the fun math. Adding composability to any clustered system adds between 5 percent and 10 percent to the cost of the cluster, according to Puri. But as we countered, in many organizations that might only have three or four racks of iron as their "supercomputer," a PCI-Express 4.0 switched fabric can do the job, and with PCI-Express 5.0 switching the bandwidth doubles, which means the radix can double, which could provide a direct interconnect across six to eight racks of equipment. That is a big system, especially a GPU-accelerated system, and at the very least is quite a hefty pod size, even for capacity class supercomputers that run a lot of workloads.
To some extent, we believe that capacity class supercomputers – those that run many relatively small jobs all the time, in parallel – need composability even more than capability class machines – those that run very large jobs across most of their infrastructure, more or less in serial. The diversity of workloads calls for composable accelerators – because different workloads will have different CPU/GPU ratios, and may even mix CPU hosts with GPU and FPGA accelerators – and the very high cost of those CPUs and accelerators requires that their utilization be pushed higher than is typical in today's data center. Average GPU utilization in most data centers is incredibly low – anecdotal evidence and complaints from people we have spoken to in recent years suggest that in many cases it is in the range of 10 percent to 20 percent. But by pooling GPUs on PCI-Express fabrics and dynamically sharing them using software like Liqid Matrix, there is a chance of getting that average utilization up to 60 percent or 70 percent – which is about as high as anyone could expect for a shared resource with somewhat unpredictable workloads.
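The utilization figures above make for a simple back-of-envelope calculation. As a sketch, using the midpoints of the ranges cited (the "60 GPU-hours per hour" demand figure is our own illustrative assumption, not from Puri):

```python
# Back-of-envelope math on GPU pooling, using the utilization figures
# cited above: roughly 10%-20% static vs. 60%-70% pooled.
# The demand figure is illustrative, not a measurement.

STATIC_UTIL = 0.15   # midpoint of the 10%-20% anecdotal range
POOLED_UTIL = 0.65   # midpoint of the 60%-70% pooled estimate

def gpus_needed(useful_gpu_hours: float, utilization: float) -> float:
    """GPUs required to deliver a given amount of useful work per hour."""
    return useful_gpu_hours / utilization

# Say a site needs 60 GPU-hours of useful work every hour.
work = 60.0
static_fleet = gpus_needed(work, STATIC_UTIL)   # 400 GPUs at 15% utilization
pooled_fleet = gpus_needed(work, POOLED_UTIL)   # ~92 GPUs at 65% utilization

print(f"static: {static_fleet:.0f} GPUs, pooled: {pooled_fleet:.0f} GPUs")
```

In other words, at those utilization levels a pooled fabric could deliver the same useful work with roughly a quarter of the GPU fleet – which is why the accelerator cost argument dominates.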
"This is the fundamental problem facing HPC centers," Puri continues. "Different researchers bring different workloads to the machines and have different hardware requirements. In the old model, the orchestration engine asks the cluster for four servers with two GPUs each, and if the best the cluster has is four servers with four GPUs each, half of the GPUs go unused. In the Liqid world, if SLURM asks for eleven servers with eleven GPUs each – and that is not a configuration any sane person would ask for, but it illustrates the principle – in ten seconds, Matrix composes those nodes, runs the job, and when it is done, returns the GPUs to the pool for reuse."
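The compose-run-release cycle Puri describes can be sketched as a toy pool allocator. To be clear, the class and method names below are hypothetical and are not the Liqid Matrix API; this only illustrates the bookkeeping:

```python
# Toy sketch of the compose/release cycle: an orchestration layer borrows
# GPUs from a shared pool to satisfy a job request, then hands them back
# when the job finishes. Hypothetical names, not the Liqid Matrix API.

class GpuPool:
    def __init__(self, total_gpus: int):
        self.free = total_gpus

    def compose(self, servers: int, gpus_per_server: int) -> int:
        """Attach GPUs to bare hosts over the fabric; returns GPUs granted."""
        needed = servers * gpus_per_server
        if needed > self.free:
            raise RuntimeError(f"pool has only {self.free} free GPUs")
        self.free -= needed
        return needed

    def release(self, gpus: int) -> None:
        """Return GPUs to the pool for the next job."""
        self.free += gpus

pool = GpuPool(total_gpus=160)
grant = pool.compose(servers=11, gpus_per_server=11)  # the 11x11 example
# ... run the SLURM job ...
pool.release(grant)
print(pool.free)  # back to 160
```

The point of the pattern is that the GPUs outlive any one node configuration: the same 121 devices can be eleven-per-host for one job and two-per-host for the next.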
While that is neat and useful, coming back to using PCI-Express as an interconnect fabric, Puri confirms our hunch that it is easy and transparent. The way it works is that the PCI-Express fabric puts a driver on every server node that presents itself to the operating system and applications as an InfiniBand connection, and underneath, the hosts just talk point to point with each other over the PCI-Express fabric. Importantly, the Message Passing Interface (MPI) protocol that most HPC applications use to scale workloads runs atop this InfiniBand driver and is none the wiser. The point is, the incremental cost of composability can be paid for by getting rid of actual InfiniBand in pod-sized supercomputers measured on the order of racks. For systems spanning one or more rows, not so much, but there is probably a way to pool racks on PCI-Express and then interconnect them across the rows with a two-tier InfiniBand network on top. Its usefulness depends, of course, on the nature of the workload. But other HPC interconnects make a similar kind of hierarchical network architecture, so this is nothing strange.
This brings us to Liqid’s EX-4400 series enclosures, which house PCI-Express switching and can accommodate ten or twenty PCI-Express accelerators or storage devices. They look like this:
The devices inside the enclosures can be GPUs, FPGAs, DPUs, or custom ASICs – all they have to do is speak PCI-Express 4.0 and fit in the slots. The chassis has four ports to connect to host systems and offers 64 GB/sec of duplex bandwidth per port. The chassis has four 2,400 watt power supplies, two for the devices and two for backup. The EX-4400 enclosures, which are based on Broadcom PLX PCI-Express switches, are used to consolidate and virtualize the links to the accelerators in the enclosures.
In the simplest case, the PCI-Express fabric is used to attach a large number of devices to a single host – more devices than would be possible on a direct PCI-Express bus. In the case of one customer running GPU-accelerated blockchain software, the customer needed 100 GPUs to run the blockchain workload, and at four GPUs per server that took 25 nodes. Using five EX-4420 enclosures, it took only five enclosures and five servers to field those 100 GPUs, which took 20 servers out of the mix. And now the GPUs can be dynamically attached to each other and to servers over the PCI-Express fabric, where before they were statically configured the old-fashioned way.
When you consider that it costs around $1 per watt per year to pay for electricity on top of the cost of the servers themselves and the space they consume, being able to get rid of servers while retaining functionality and making the infrastructure more flexible is a triple win.
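The blockchain customer's consolidation math, combined with the $1 per watt per year rule of thumb, works out roughly as follows. The per-server wattage below is our own assumption for illustration; only the GPU counts and the dollars-per-watt figure come from the article:

```python
# Consolidation math from the blockchain example: 100 GPUs at four per
# server needs 25 servers; five 20-GPU EX-4420 enclosures plus five host
# servers deliver the same 100 GPUs. SERVER_WATTS is an assumed figure.

GPUS_NEEDED = 100
GPUS_PER_SERVER = 4
GPUS_PER_ENCLOSURE = 20

servers_before = GPUS_NEEDED // GPUS_PER_SERVER    # 25 nodes the old way
enclosures = GPUS_NEEDED // GPUS_PER_ENCLOSURE     # 5 EX-4420 boxes
servers_after = 5                                  # one host per enclosure
servers_removed = servers_before - servers_after   # 20 servers out of the mix

SERVER_WATTS = 500           # assumed draw per host server (not from article)
DOLLARS_PER_WATT_YEAR = 1.0  # the article's rule of thumb

power_savings = servers_removed * SERVER_WATTS * DOLLARS_PER_WATT_YEAR
print(f"{servers_removed} servers removed, ~${power_savings:,.0f}/year in power")
```

Even before counting the capital cost of the 20 eliminated servers and the floor space they occupied, the electricity alone is a meaningful recurring saving.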
Perhaps the most important benefit, however, is the peer-to-peer nature of the PCI-Express fabric. "That's really the critical thing," Puri explains, "the ability of these devices to communicate without having to go back to the host processor. And when we do that for GPUs, for example, when we enable this data path, the performance benefits are phenomenal. We see a 5X improvement in bandwidth and a 90 percent reduction in latency. This peer-to-peer communication is well understood with GPUs, but it will work with other accelerators as well. And that is super important – it will allow GPUs and flash SSDs to communicate over that RDMA data path."
Liqid is working on benchmarks to show off this GPU-to-SSD link over the PCI-Express fabric, and the company lets customers lash together up to four EX-4400 boxes to create larger peer-to-peer pools. They can be made larger still, in a ring, star, or mesh configuration. The hop between PCI-Express switches is only 80 nanoseconds, so multi-tier PCI-Express fabrics can be created to build ever-larger device pools. And with each generation, the bandwidth of PCI-Express switches doubles.
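That 80 nanosecond hop figure is what makes multi-tier fabrics plausible. As a rough sketch – assuming a simple leaf/spine topology where a fabric of N tiers traverses at most 2N-1 switches, which is our simplification, not a Liqid topology – the cumulative switching latency stays well under a microsecond:

```python
# Rough arithmetic on the ~80ns-per-switch-hop figure cited above.
# The leaf/spine tier-to-hop mapping is a simplifying assumption:
# a two-tier fabric traverses up to three switches (leaf, spine, leaf).

HOP_NS = 80  # per-switch-hop latency from the article

def fabric_latency_ns(tiers: int) -> int:
    """Worst-case switch traversal latency for a fabric of the given depth."""
    hops = 2 * tiers - 1   # up through each tier and back down
    return hops * HOP_NS

for tiers in (1, 2, 3):
    print(f"{tiers}-tier fabric: {fabric_latency_ns(tiers)} ns worst case")
```

Even three tiers of switching adds only about 400 nanoseconds in this model, which is small compared with the microsecond-plus latencies typical of traditional network hops.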
It would be interesting to see the two PCI-Express switch makers – Broadcom and Microchip – push this a bit harder. Then all hell could break loose in the data center… literally and figuratively.
Liqid is adding a few customers a week and business is on the rise, according to Puri. Between 20 percent and 30 percent of deals are driven by traditional HPC accounts, and those deals tend to be somewhat larger, so HPC's contribution to revenue is somewhat bigger. The remaining 70 percent to 80 percent of the deals Liqid closes come from the disaggregation of AI systems.