
Four Lessons Data Center Managers Can Learn from the Most Powerful Supercomputers

If you were to ask a passerby what a supercomputer is, chances are they would name a movie “villain”: HAL 9000 from 2001: A Space Odyssey, VIKI from I, Robot, or Skynet from The Terminator.

In popular culture, supercomputers are often presented as sentient systems that evolved before turning against their creator.

This perception would draw a hearty laugh from researchers at Lawrence Livermore National Laboratory or the National Weather Service.

The fact is that today’s supercomputers are far from conscious, and their artificial intelligence is better compared to a search bar capable of scanning very large volumes of data.

Today, supercomputers power a multitude of cutting-edge applications, from oil and gas exploration to weather forecasting to financial markets and the development of new technologies.


Supercomputers are the Lamborghinis and Bugattis of the computing world, and Kingston is closely following the developments that are pushing the boundaries of computing.

From the use and tuning of DRAM to firmware improvements for storage array management to prioritizing consistent throughput and latency over peak values, our technologies are deeply influenced by cutting-edge supercomputer technologies.

Similarly, cloud and on-premises data center managers can learn valuable lessons from the supercomputing world about infrastructure design and management. They will be able to better select components that can keep up with future advances, avoiding a complete system overhaul.


1. Supercomputers are designed specifically for consistency

Unlike most cloud computing platforms, such as Amazon Web Services or Microsoft Azure, which are designed for a variety of applications that can use shared resources and infrastructure, most supercomputers are built to meet specific needs.
The most recent TOP500 list of the world’s fastest supercomputers (those that are publicly disclosed and declassified) shows not only their locations and speeds, but also their primary application areas.

Of the top twelve supercomputers, eleven are dedicated to energy research, nuclear testing, and defense applications. The only outlier is Frontera, a petaflop-class system funded by the National Science Foundation and housed at the Texas Advanced Computing Center at the University of Texas at Austin.

It provides academic resources to scientists and engineers in partner research projects. The next twenty supercomputers in the Top 500 are almost all used in government defense and intelligence applications. The 30th to 50th positions are systems used primarily by weather services. The second half of the Top 100 is made up of enterprise systems (NVIDIA, Facebook, etc.), midrange weather forecasting systems, and systems dedicated to space research programs, oil and gas exploration, education, and specific government applications.

These machines are not generic solutions. They are custom-developed with manufacturers such as Intel, Cray, HP, Toshiba and IBM to perform specific computations on particular data sets, in real time or asynchronously.

They have:

  • defined acceptable latency thresholds,
  • predefined computing resources leveraging millions of processing cores,
  • sustained performance between 18,000 and 200,000 teraFLOPS.

Storage capacities are expressed in exabytes, much larger than the petabytes of modern data warehouses.

Systems like Frontera are not built to sprint through a spike in computational load; they must read a large volume of data at a constant rate to produce a result. A burst of peak performance could actually introduce errors into the results. That is why the priority is placed on consistency.

Today’s data center manager must first answer the question “What will the system be used for?” in order to consider the architecture, manage resources, and integrate security features.

Managing a data center that runs a multitude of virtual desktops is not the same as managing a data center for an industrial plant or an air traffic control system. Needs, requirements, service level agreements, and budgets all differ, and the design must be adapted accordingly.

Similarly, you need to think about how to ensure consistent performance without having to rely on custom builds.

Companies like Amazon, Google, and Microsoft have the budgets to build custom storage or compute infrastructure, but most service providers have to find a way to choose standard hardware.

As a result, more and more data center managers need to set strict criteria for QoS benchmarks and ensure that consistency sits alongside compute speed and latency as the attributes that receive the most attention.
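As a rough illustration, the sketch below (plain Python, with illustrative threshold values rather than any standard benchmark) scores a device on latency consistency as well as speed, using the 99th-percentile latency and the coefficient of variation:

```python
# Minimal sketch: scoring a storage or compute node on latency *consistency*,
# not just average speed. Threshold values are illustrative, not prescriptive.
import statistics

def latency_consistency_report(samples_ms, p99_budget_ms=5.0, max_cv=0.25):
    """samples_ms: list of measured request latencies in milliseconds."""
    ordered = sorted(samples_ms)
    p99 = ordered[int(0.99 * (len(ordered) - 1))]            # approximate 99th percentile
    mean = statistics.fmean(samples_ms)
    cv = statistics.pstdev(samples_ms) / mean                 # coefficient of variation
    return {
        "mean_ms": round(mean, 3),
        "p99_ms": round(p99, 3),
        "cv": round(cv, 3),
        "meets_qos": p99 <= p99_budget_ms and cv <= max_cv,   # fast AND consistent
    }

# Example: two devices with similar means but very different consistency.
steady = [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 0.9, 1.1]
bursty = [0.2, 0.3, 0.2, 4.9, 0.3, 0.2, 1.7, 0.2]
print(latency_consistency_report(steady))
print(latency_consistency_report(bursty))
```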


2. “Real time” is a relative concept

In supercomputer applications, most real-time data carries high stakes. From shutting down a nuclear reaction to processing telemetry for a rocket launch, computational latency can have catastrophic consequences.
And the volumes of data involved are astronomical. The streams rarely come from a single source; they typically come from a network of data points.

But that data is ephemeral. In the case of real-time streams, most data is not retained indefinitely: it is written and then overwritten on a schedule of sequential writes and overwrites.

Real-time data changes continuously, and few applications would need every single bit stored for the ages. Data is processed in batches, subjected to calculations to obtain a result (whether it is an average, a statistical model, or an algorithm), and it is this result that is retained.
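The pattern is easy to sketch. The toy Python class below (illustrative names, not a real telemetry API) keeps raw samples in a fixed-size buffer that is overwritten in sequence and retains only the per-batch aggregate:

```python
# Minimal sketch of the "write, overwrite, keep only the result" pattern:
# raw samples live in a fixed-size ring buffer and are overwritten; only
# the per-batch aggregate is retained. Names are illustrative.
from collections import deque

class StreamBatcher:
    def __init__(self, batch_size=1000):
        self.buffer = deque(maxlen=batch_size)   # old samples get overwritten
        self.results = []                        # only aggregates are kept

    def ingest(self, sample):
        self.buffer.append(sample)
        if len(self.buffer) == self.buffer.maxlen:
            # The "calculation" here is a simple average; in practice it could
            # be a statistical model or any other reduction.
            self.results.append(sum(self.buffer) / len(self.buffer))
            self.buffer.clear()

batcher = StreamBatcher(batch_size=4)
for reading in [20.1, 20.4, 20.2, 20.6, 21.0, 20.8, 20.9, 21.1]:
    batcher.ingest(reading)
print(batcher.results)   # two retained averages; the raw readings are gone
```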

Consider the forecasts produced by the National Oceanic and Atmospheric Administration’s (NOAA) supercomputers. Weather factors change constantly: precipitation, air and ground temperatures, atmospheric pressure, time of day, the effects of the sun, and the wind and its interaction with terrain.

These changes happen every second and are reported via a real-time information feed. NOAA’s weather service doesn’t need to keep the raw data forever; it needs the forecast models. As the Global Forecast System model takes shape, new data is ingested, resulting in updated and more accurate forecasts.

What’s more, local meteorologists who share and receive data with the weather service don’t need access to the entire global weather data set.

They limit the models to local areas and can enrich the weather service’s data with readings from local weather stations to better understand microclimates and produce accurate local forecasts faster, following the same pattern: batch the data, run the calculations, and keep only the result.

The same is true for stock or financial models that use moving averages. Each uses its own indicators and built-in action triggers tied to parameters that define acceptable thresholds of market behavior.
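As a hedged illustration of that idea, here is a minimal Python sketch of a moving-average model with a built-in action trigger; the window length and threshold percentage are arbitrary placeholder parameters, not trading advice:

```python
# Minimal sketch of a moving-average model with a built-in action trigger.
from collections import deque

def moving_average_alerts(prices, window=5, threshold_pct=2.0):
    """Yield an alert whenever a price deviates from its moving average
    by more than threshold_pct percent."""
    recent = deque(maxlen=window)
    for price in prices:
        recent.append(price)
        if len(recent) == window:
            avg = sum(recent) / window
            deviation = 100.0 * (price - avg) / avg
            if abs(deviation) > threshold_pct:
                yield (price, round(avg, 2), round(deviation, 2))

ticks = [100, 101, 100, 102, 101, 108, 101, 100, 95, 101]
for price, avg, dev in moving_average_alerts(ticks):
    print(f"trigger: price={price} avg={avg} deviation={dev}%")
```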

A system that works with “real-time” data does not need to be designed to keep every bit of input. Instead, it makes sense to leverage non-volatile random access memory (NVRAM) and dynamic random access memory (DRAM) to cache and process the data on the fly, and then send only the result to the storage solution.


3. Latency Thresholds, NAND Flash and DRAM Tuning

In most cases, latency thresholds are set because of application requirements. In the context of stock trading, seconds are worth millions or even billions of dollars. In the context of hurricane forecasting and tracking, it could mean the difference between evacuating New Orleans or Houston.

Supercomputers operate under defined service levels for latency, compute resources, storage, and bandwidth.

Most of them practice fault-aware computing: they can redirect data streams to maintain optimal latency conditions, switch to asynchronous computing models, or prioritize resources to ensure sufficient computing power or bandwidth for critical jobs.
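A simplified sketch of that fault-aware behavior, with hypothetical node names and an arbitrary latency budget, might look like this in Python:

```python
# Minimal sketch of fault-aware dispatching: if no resource meets the
# latency budget, the job is queued for asynchronous processing instead.
# Node names and the threshold are hypothetical.

LATENCY_THRESHOLD_MS = 2.0

def dispatch(job, nodes, async_queue):
    """nodes: dict mapping node name -> current observed latency (ms)."""
    # Prefer the healthiest node that is still within the latency budget.
    healthy = {name: lat for name, lat in nodes.items() if lat <= LATENCY_THRESHOLD_MS}
    if healthy:
        target = min(healthy, key=healthy.get)
        return f"run {job} synchronously on {target}"
    # No node meets the budget: fall back to the asynchronous model.
    async_queue.append(job)
    return f"queued {job} for asynchronous processing"

queue = []
print(dispatch("weather-batch-42", {"node-a": 1.4, "node-b": 3.1}, queue))
print(dispatch("weather-batch-43", {"node-a": 2.8, "node-b": 3.1}, queue))
print(queue)
```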

Whether you are working with high-end workstations, big-iron servers, or high-performance computing and scientific workloads, big computation and big data require large amounts of DRAM. Supercomputers like Tianhe-2 pair enormous amounts of RAM with specialized accelerators. The way supercomputers tune the hardware and controller framework is unique to each application design.

Often, specific compute tasks where disk access creates a huge bottleneck have working sets too large for DRAM alone but small enough to fit in NAND flash. FPGA clusters are likewise tuned for each particular workload so that large data sets do not suffer a significant performance penalty when data must be retrieved by traditional means.

A collaboration between teams at the University of Utah, Lawrence Berkeley Lab, University of Southern California, and Argonne National Lab has demonstrated new models for automatic performance tuning (or autotuning) as effective methods for providing performance portability across architectures.

Rather than relying on the compiler to deliver optimal performance on newer multi-core architectures, autotuned kernels and applications tune themselves automatically based on the target processor, network, and programming model.
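To make the idea concrete, here is a minimal, hypothetical Python sketch of the autotuning loop: benchmark a small search space of candidate parameters on the target machine and keep the fastest, rather than trusting a compiler default. The toy kernel and block sizes stand in for a real tuned kernel:

```python
# Minimal autotuning sketch: time each candidate configuration on the
# target machine and select the fastest. The "kernel" is a toy stand-in.
import time

def kernel(data, block_size):
    """Toy kernel: sum the data in blocks of block_size."""
    total = 0.0
    for start in range(0, len(data), block_size):
        total += sum(data[start:start + block_size])
    return total

def autotune(data, candidate_block_sizes):
    timings = {}
    for block in candidate_block_sizes:
        start = time.perf_counter()
        kernel(data, block)
        timings[block] = time.perf_counter() - start
    best = min(timings, key=timings.get)   # keep the fastest configuration
    return best, timings

data = list(range(200_000))
best_block, timings = autotune(data, [64, 512, 4096, 32768])
print(f"selected block size: {best_block}")
```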

4. Multiple layers of security

Power distribution in high-performance computing data centers is an increasingly complex challenge, especially in infrastructures operated as shared resources. In dedicated or as-a-service infrastructures, data centers must ensure continuous operation and reduce the risk of damage to sensitive hardware components in the event of power outages, surges, or changes in peak consumption.

Architects adopt a mix of:

  • low-loss distribution transformers,
  • DC power distribution and inverters,
  • trigeneration (combined production of electricity, heat, and cooling, which can also serve as emergency power),
  • active monitoring.

Most data centers today operate with a high-level RAID structure to ensure continuous and near-simultaneous writes across multiple storage arrays. In addition, HPC infrastructures leverage a large amount of NVRAM to cache data during processing, whether live data streams that never touch the storage arrays or information processed in parallel, much like a scratch disk, to free up additional compute resources. The Frontera system mentioned above has a total working storage capacity of 50PB. Users with very high bandwidth or IOPS requirements can request an allocation on an all-NVMe (non-volatile memory express) file system with a capacity of approximately 3PB and a bandwidth of approximately 1.2TB/s.

Constant RAID protection for storage and constant caching in NVMe buffers depend on the total I/O threshold of the device’s controllers and the total bandwidth available for remote storage and backup.

Most HPC infrastructures also eliminate the risk of mechanical failure in spinning disks by adopting SSDs and flash storage blocks. These storage solutions deliver consistent IOPS and predictable latencies within application-specific latency limits. Many supercomputers also leverage large tape libraries (with capacities of an exabyte or more) to provide a reliable archive for every bit processed and stored.

And to avoid problems if all else fails, some integrate power-loss capacitors, also known as power-loss protection, into SSDs and DRAM modules. With these capacitors, the drives (standalone or in an array) can complete write operations already in progress, reducing the amount of data lost in the event of a catastrophic failure.

Conclusion

Customization is certainly essential in the supercomputing world, but the first priority is to identify the needs before building the data center. This step is also essential for achieving the most consistent performance possible. Regardless of its size, why not treat your data center like a supercomputer when it comes to generating, storing, and sharing data? Evaluating these factors will allow architects to design high-performance architectures that can keep pace with tomorrow’s technological advances, even when built from standard components.
