
Solidigm SSDs’ Role in Advancing AI Storage

As artificial intelligence advances rapidly to meet humanity's ambitions, computing power has had to grow as well. Fueled by high-throughput, low-latency networks and deep learning models, GPU clusters with thousands of nodes are appearing everywhere. This evolving marketplace prompts deep reflection from AI architects. One of the most important questions is: what AI storage infrastructure can keep AI accelerators (GPUs, CPUs, etc.) and network devices running at full capacity without idle time?

Phases of an AI project cycle

An analysis of industry practices reveals that a typical AI project cycle consists of three main phases: 

  1. Data ingestion and preparation
  2. Model development (training)
  3. Model deployment (inference) 

An optional fourth phase involves iterative refinement of the model based on actual inference results and new data. To understand the storage requirements for AI, it is essential to understand the nature of the primary input/output (I/O) operations in each phase and to consider them collectively to form a comprehensive view.

Phase 1: Data Ingestion and Preparation

Before diving into training, it is important to thoroughly prepare the data that will be fed into the training cluster.

1. Data transformation: discovery, extraction, and preprocessing

The raw data used to create AI models inherits the classic big data characteristics of the “3Vs”: volume, velocity, and variety. Sources range from event logs, transaction records, and IoT inputs to CRM, ERP, social media, satellite imagery, and economic and stock-trading data. Data from these diverse sources must be extracted and consolidated into a temporary staging area within the data pipeline. This step is usually called “extraction.”

Data is then transformed into a format suitable for further analysis. In the original source systems, the data is chaotic and difficult to interpret, so part of the goal of transformation is to improve data quality. Typical quality improvements include: 

  1. Cleaning up invalid data
  2. Removing duplicate data
  3. Standardizing units
  4. Organizing data by type

During the transformation phase, data is structured and reformatted to fit a specific business purpose – this step is called “transformation.”
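The cleanup steps above can be sketched with pandas. This is a minimal illustration on a hypothetical sensor dataset; the column names and the -999.0 invalid-value marker are assumptions, not part of the original article:

```python
import pandas as pd

# Hypothetical raw extract: duplicate rows, an invalid reading, mixed units
raw = pd.DataFrame({
    "sensor_id": [1, 1, 2, 3, 3],
    "temp":      [21.5, 21.5, -999.0, 70.0, 70.0],  # -999.0 marks invalid data
    "unit":      ["C", "C", "C", "F", "F"],
})

# 1. Clean up invalid data
df = raw[raw["temp"] != -999.0]

# 2. Remove duplicate rows
df = df.drop_duplicates().copy()

# 3. Standardize units: convert Fahrenheit readings to Celsius
is_f = df["unit"] == "F"
df.loc[is_f, "temp"] = (df.loc[is_f, "temp"] - 32) * 5 / 9
df.loc[is_f, "unit"] = "C"

# 4. Organize data by type: enforce explicit dtypes
df = df.astype({"sensor_id": "int64", "temp": "float64"})
print(df)
```

Real pipelines run these same four steps at scale with tools like Spark; the logic is the same.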

2. Data exploration and dataset splitting

Data analysts use visualization and statistical techniques to describe the characteristics of a dataset, such as its scale, volume, and precision. Through exploration, they identify and explore relationships between different variables, the structure of the dataset, the presence of anomalies, and the distribution of values. Data exploration allows analysts to dig deep into the raw data.

It helps identify obvious errors, better understand patterns in the data, detect outliers and unusual events, and uncover interesting relationships between variables. Once data exploration is complete, the dataset is typically split into training and testing subsets, which are used separately during model development for training and testing purposes.
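Once exploration is complete, the split itself is a one-liner in scikit-learn. The data here is synthetic, purely for illustration:

```python
from sklearn.model_selection import train_test_split

# Synthetic dataset: 1000 samples with 2 features each, binary labels
X = [[i, i % 7] for i in range(1000)]
y = [i % 2 for i in range(1000)]

# Hold out 20% for testing; fix the seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 800 200
```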

3. Feature Extraction, Feature Selection, and Pattern Mining

The success of an AI model depends on whether the selected features can effectively represent the classification problem in your research. 

For example, consider selecting members for a choir: candidate features include gender, height, skin color, and education level, but also vocal range. 

Unlike the first four features, vocal range bears directly on the problem: a representation built on it needs far fewer dimensions and far less data, yet classifies candidates more accurately. 

The process of identifying the most effective features, which reduces feature dimensionality to avoid the curse of dimensionality and lower computational complexity, is known as feature selection.

The process of uncovering the essential relationships and logic among feature sequences, such as which ones are mutually exclusive and which ones coexist, is called pattern mining.
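Feature selection can be sketched with scikit-learn, which the article lists among the tools for this stage. This example uses synthetic data where only a few of many features are informative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic problem: 20 features, only 5 of which are informative
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Keep the 5 features that best separate the classes (ANOVA F-score)
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (500, 20) -> (500, 5)
```

Dropping the 15 weak features shrinks both the dataset and the downstream compute cost, exactly the motivation given above.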

4. Data Conversion 

The need to transform data may arise for a variety of reasons: a desire to align one dataset with another, to improve compatibility, to migrate part of the data to another system, to establish connections with other datasets, or to aggregate information within the data. 

Common aspects of data transformation include converting types, changing semantics, adjusting value ranges, changing granularity, splitting tables or datasets, transforming rows and columns, etc.
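A few of these transformations, sketched with pandas on a toy sales table (the table and its columns are illustrative, not from the article):

```python
import pandas as pd

sales = pd.DataFrame({
    "date":   ["2024-01-01", "2024-01-01", "2024-01-02"],
    "region": ["east", "west", "east"],
    "amount": [100, 150, 120],
})

# Type conversion: parse strings into proper datetimes
sales["date"] = pd.to_datetime(sales["date"])

# Rows -> columns: pivot region values into columns
wide = sales.pivot(index="date", columns="region", values="amount")

# Granularity change: aggregate per-day rows to one total per region
totals = sales.groupby("region")["amount"].sum()
print(totals)
```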

Thanks to a mature open source project community, there are plenty of reliable tools at your disposal for the data ingestion and preparation stages. These tools allow you to perform ETL (extract, transform, load) or ELT (extract, load, transform) tasks. Examples include:

  • Kafka
  • Sqoop 
  • Flume
  • Spark
  • Snowflake

Additionally, for tasks such as creating large sets of features, you can leverage tools such as:

  • Spark
  • Pandas
  • NumPy
  • Spark MLlib
  • scikit-learn
  • XGBoost

5. Storage characteristics suitable for data ingestion and preparation phase 

During the data ingestion and preparation phase, a typical workflow is to read data randomly and write processed items sequentially. It is essential for the storage infrastructure to provide low latency for small random reads while simultaneously achieving high sequential write throughput.
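That mixed pattern can be pictured in a few lines of Python. This is only a schematic of the access pattern, not a benchmark; a tool like fio is the right way to actually measure a storage system:

```python
import os
import random
import tempfile

BLOCK = 4096   # small random-read size
ITEMS = 256

# Stage a source file to read from randomly
src = tempfile.NamedTemporaryFile(delete=False)
src.write(os.urandom(BLOCK * ITEMS))
src.close()

out = tempfile.NamedTemporaryFile(delete=False)
with open(src.name, "rb") as f:
    for _ in range(ITEMS):
        # Random small read: seek to an arbitrary block and read it
        f.seek(random.randrange(ITEMS) * BLOCK)
        chunk = f.read(BLOCK)
        # Sequential write: append the "processed" item to the output
        out.write(chunk)
out.close()

print(os.path.getsize(out.name))  # 1048576 (ITEMS * BLOCK)
```

Low latency governs the random-read half of the loop; sequential write throughput governs the other half.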

Phase 2: Model development and training

Once the training dataset is prepared, the next phase is model development, training, and hyperparameter tuning. The choice of algorithm is determined by the characteristics of the use case, and the model is trained using the dataset.

1. AI Framework 

The model's efficiency is evaluated against the test dataset, adjusted where necessary, and the model is finally deployed. AI frameworks continue to evolve; popular frameworks include:

  • TensorFlow
  • PyTorch
  • MXNet
  • scikit-learn
  • H2O
  • others

At this stage, the demands on compute resources are very high, and storage matters because feeding those resources data faster and more efficiently becomes the priority for eliminating idle hardware.

During model development, datasets grow continuously and are often accessed simultaneously by many data scientists from different workstations, who dynamically augment them with thousands of varied entries to prevent overfitting.

2. Storage capacity expandability and data sharing 

At this stage storage capacity starts to become important, but as the number of concurrent data access operations increases, scalable performance becomes the key to success. Data sharing between workstations and servers is an essential storage feature, along with fast and seamless capacity expansion.

As training progresses, the size of the dataset increases, often reaching several petabytes. Each training job typically involves random reads, and the entire process consists of many concurrent jobs accessing the same dataset. Multiple jobs competing for data access intensifies the overall random I/O workload.

The transition from model development to training requires storage that can scale without disruption to accommodate billions of data items, as well as fast multi-host random access, and especially high random read performance. 

Training jobs often involve decompressing input data, augmenting or perturbing it, and randomizing the input order; with billions of items, even enumerating the training data becomes a storage query in its own right.
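In outline, that enumerate-shuffle-decompress-augment loop might look like the following. This is a pure-Python stand-in for a real framework data loader; the corpus and the "augmentation" are placeholders:

```python
import gzip
import random

# Hypothetical corpus: compressed records keyed by item id
corpus = {i: gzip.compress(f"item-{i}".encode()) for i in range(10)}

def load_epoch(corpus, seed):
    # Enumerate item ids, then randomize the input order for this epoch
    ids = sorted(corpus)
    rng = random.Random(seed)
    rng.shuffle(ids)
    for item_id in ids:
        # Decompress the stored record
        record = gzip.decompress(corpus[item_id]).decode()
        # "Augment" by perturbing the item slightly (placeholder step)
        yield record + f"+aug{rng.randrange(100)}"

epoch = list(load_epoch(corpus, seed=1))
print(len(epoch))  # 10
```

Every pass touches every item in a different order, which is exactly why the storage system sees sustained random reads.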

3. Checkpoint Creation: Large Sequential Write Bursts

The sheer scale of training creates new demands: training jobs today can run for days or even months, so most jobs write periodic checkpoints to quickly recover from failures, minimizing the need to restart from scratch. 

Thus, the primary workload during training consists of random reads, which may be interrupted by large sequential writes during checkpointing. The storage system must be able to sustain the intensive random access required by concurrent training jobs, even during the burst of large sequential writes during checkpointing.
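Schematically, checkpointing inside a training loop looks like this; the JSON format, cadence, and state layout are illustrative only:

```python
import json
import os
import tempfile

CKPT_EVERY = 100  # steps between checkpoints (illustrative cadence)
ckpt_dir = tempfile.mkdtemp()

state = {"step": 0, "weights": [0.0] * 4}

def save_checkpoint(state, path):
    # Write to a temp file, then rename: the checkpoint is either
    # complete or absent, never half-written after a crash
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

for step in range(1, 301):
    state["step"] = step
    state["weights"] = [w + 0.01 for w in state["weights"]]
    if step % CKPT_EVERY == 0:
        # One large sequential write burst amid the random-read workload
        save_checkpoint(state, os.path.join(ckpt_dir, f"ckpt_{step}.json"))

print(sorted(os.listdir(ckpt_dir)))
```

In real frameworks the checkpoint is gigabytes of model and optimizer state, so the sequential write burst is large enough to contend with the concurrent random reads.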

4. Summary of the model development phase 

In summary, developing an AI model is a highly iterative process, where successive experiments confirm or refute hypotheses. As the model evolves, data scientists use example datasets to train the model, often through tens of thousands of iterations. 

With each iteration, data items are augmented and slightly randomized to prevent overfitting, creating a model that is accurate for the training dataset but can also adapt to live data. As training progresses, the dataset grows, moving from the data scientist’s workstation to servers in a data center with greater computing and storage power.

Phase 3: Model deployment and inference

Once a model is developed, it is deployed and put into production. During the inference phase, real-world data is fed into the model, and ideally, its output provides valuable insights. Models are often continuously fine-tuned; new real-world data imported into the model during the inference phase is incorporated into the retraining process to improve performance.

1. Fine-tuning for real applications

Your AI storage infrastructure must operate seamlessly around the clock throughout the lifecycle of your project, so it must be self-healing to handle component failures and enable expansion and upgrades without downtime.

Data scientists need production data to fine-tune their models and explore changing patterns and goals. This highlights the importance of a unified platform, a single storage system that serves all phases of a project. Such a system gives development, training, and production easy access to dynamically evolving data.

2. Preparing the model for production 

Once a model produces consistently accurate results, it is deployed to a production environment. The focus then shifts from improving the model to maintaining a robust IT environment. Production can take many forms, such as interactive or batch-oriented. Continuous use of new data helps refine the model to increase its accuracy, and data scientists regularly update the training dataset while analyzing the model output.

The table below summarizes each phase of an AI project cycle and their respective I/O characteristics and associated storage requirements.

Phase | I/O characteristics | Storage requirements | Impact
---|---|---|---
Data ingestion and preparation | Random reads of raw data; sequential writes of preprocessed items | Low latency for small random reads; high sequential write throughput | Optimized storage lets the pipeline deliver more data for training, leading to more accurate models
Model development (training) | Random reads across many concurrent jobs; large sequential writes during checkpointing | Performance and capacity scalability; optimized random reads; high sequential write throughput for checkpointing | Optimized storage improves utilization of expensive training resources (GPUs, TPUs, CPUs)
Model deployment (inference) | Random read/write mix; if the model is continuously fine-tuned, the same characteristics as the training phase | Self-healing to handle component failures; non-disruptive expansion and upgrades | High availability, maintainability, and reliability for the business

Table 1. AI project cycles by I/O characteristics and subsequent storage requirements

Key Storage Characteristics for AI Deployments

AI projects that start as single-chassis systems during initial model development need to become more flexible as data requirements grow during training and more live data accumulates on the production floor. Two key strategies are employed at the infrastructure level to achieve high capacity: increasing individual disk capacity and expanding cluster sizes of storage enclosures. 

1. Capacity

Increasing the capacity of individual disks and improving the horizontal scalability of storage nodes are key factors. At the disk level, products such as the Solidigm D5-P5336 QLC SSD have reached capacities up to 61.44TB. At the storage enclosure level, the Enterprise and Datacenter Standard Form Factor (EDSFF) shows unparalleled storage density.

For U.2 15mm form factor drives, a typical 2U enclosure accommodates 24 to 26 disks; with 24 of the 61.44TB drives, that is about 1.47PB of capacity. With the move to the E1.L 9.5mm form factor, a 1U enclosure can accommodate 32 disks, as shown in Figure 1. Per 2U, that is roughly 2.67x the storage density of a 2U U.2 enclosure. A comparison is shown in Table 2.

Form factor | 61.44TB drives per 2U rack space | Capacity per 2U rack space
---|---|---
Legacy U.2 15mm | 24 | 1.47PB
E1.L 9.5mm | 64 | 3.93PB

Table 2. 2U rack unit capacity based on drive form factor
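The figures in Table 2 follow directly from drive capacity times slot count:

```python
DRIVE_TB = 61.44  # Solidigm D5-P5336 top capacity, in TB

u2_2u  = 24 * DRIVE_TB  # legacy U.2 15mm: 24 slots per 2U
e1l_2u = 64 * DRIVE_TB  # E1.L 9.5mm: 32 slots per 1U -> 64 per 2U

print(f"U.2 2U:  {u2_2u / 1000:.2f} PB")       # 1.47 PB
print(f"E1.L 2U: {e1l_2u / 1000:.2f} PB")      # 3.93 PB
print(f"Density gain: {e1l_2u / u2_2u:.2f}x")  # 2.67x
```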

Notably, for the same capacity, high storage density in a single enclosure significantly reduces the rack space storage nodes occupy, the number of network ports required, and the power, cooling, spare parts, and manpower needed to operate them.

2. Data sharing function

Given the multi-team collaboration described above and the desire to train on more data before deployment, the data sharing capability of storage is of paramount importance. This is reflected in the high IOPS, low latency, and bandwidth of the storage network. In addition, multipath support is essential so that network services continue to operate even if a network component fails. Over time, storage networks have consolidated onto Ethernet and InfiniBand. InfiniBand offers a range of data rates, excellent bandwidth and latency, and native support for RDMA, which has made it a powerful network for AI storage. On the Ethernet side, the most popular speeds today are 25Gbps, 40Gbps, and 100Gbps, and NVIDIA also offers products supporting 200Gbps and 400Gbps RDMA with low latency. For east-west data flows between compute and storage, nodes are equipped with storage VLANs.

3. Adaptability to Various I/O

AI storage performance must be consistent across all types of I/O operations. All files and objects, whether a small 1KB item label or a 50MB image, must be accessible in roughly the same amount of time, so that time-to-first-byte (TTFB) remains consistent.
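One way to sanity-check TTFB consistency is to time the first byte of a read across very different object sizes. This is a rough local-filesystem sketch, not a calibrated storage benchmark:

```python
import os
import tempfile
import time

def time_to_first_byte(path):
    # Time how long it takes to open the object and read its first byte
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        f.read(1)
    return time.perf_counter() - t0

results = {}
for size in (1024, 50 * 1024 * 1024):  # 1KB label vs 50MB image
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.write(os.urandom(size))
    tmp.close()
    results[size] = time_to_first_byte(tmp.name)

# On well-behaved storage, TTFB should be roughly constant across sizes
for size, ttfb in results.items():
    print(f"{size:>9} bytes: {ttfb * 1e6:.0f} us")
```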

4. Parallel Network File Operations

AI projects demand efficient parallel network file operations for common tasks such as bulk copy, enumeration, and property modification. These operations greatly accelerate AI model development. Originally developed by Sun Microsystems in 1984, NFS (Network File System) remains the most popular network file system protocol today. NFS over Remote Direct Memory Access (NFS over RDMA) is particularly well suited for compute-intensive workloads that transfer large amounts of data. RDMA's data movement offload reduces unnecessary data copies, improving efficiency.
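The benefit of parallelism for bulk operations can be sketched with a thread pool; local temp directories stand in for a network file system here:

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

src_dir = Path(tempfile.mkdtemp())
dst_dir = Path(tempfile.mkdtemp())

# Stage a batch of files to copy
for i in range(32):
    (src_dir / f"item_{i}.bin").write_bytes(os.urandom(4096))

def copy_one(name):
    shutil.copy(src_dir / name, dst_dir / name)
    return name

# Bulk copy in parallel; on a network file system each worker issues
# independent requests, keeping many operations in flight at once
with ThreadPoolExecutor(max_workers=8) as pool:
    done = list(pool.map(copy_one, os.listdir(src_dir)))

print(len(done))  # 32
```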

5. Summary of Key AI Storage Characteristics

AI storage solutions must provide sufficient capacity, robust data sharing capabilities, consistent performance across a variety of I/O types, and support for parallel network file operations. These requirements ensure that AI projects can effectively manage growing datasets and meet the performance demands of AI model development and deployment.

Conclusion

AI development continues to exceed our lofty expectations. With computing giants under pressure to process more data at higher speeds, there is no room for idle processing time or wasted power. Solidigm offers drives in a variety of form factors, densities, and price points to meet the needs of diverse AI deployments. High-density QLC SSDs have proven their strength in performance, capacity, reliability, and cost. 


Figure 1. Transitioning from a TLC-only solution to SLC/TLC+QLC.

Combining the Cloud Storage Acceleration Layer (CSAL) with Solidigm D7-P5810 SLC SSDs gives customers the ability to tailor their deployments for performance, cost, and capacity.1 With an innovative full-stack, open-source storage solution, Solidigm SSDs have clear advantages for accelerating advancements in AI storage.

Figure 2. CSAL architecture: traditional write caching vs. write-shaping caching with CSAL

About the Author

Sarika Mehta is a Storage Solutions Architect at Solidigm with over 15 years of experience in the storage industry, focusing on optimizing storage solutions for both cost and performance by working closely with Solidigm’s customers and partners. 

Wayne Gao is a Principal Engineer and Storage Solutions Architect at Solidigm. Wayne has worked on CSAL research and development from pathfinding through its commercial release at Alibaba.

Before that, Wayne was a member of the Dell EMC ECS All-Flash Object Storage team; he has over 20 years of storage development experience, four US patent applications/grants, and is a EuroSys paper author. 

Yi Wang is a Field Applications Engineer at Solidigm. Prior to joining Solidigm, he held technical positions at Intel, Cloudera, and NCR. He is a Cisco Certified Network Professional, a Microsoft Certified Solutions Expert, and a Cloudera Data Platform Administrator.
