At re:Invent 2023, AWS introduced a new capability within its AI stack called Amazon SageMaker HyperPod. The product was developed in response to customer feedback about the pain points of working with AI foundation models. AWS says the solution can reduce training time by up to 40%. The solution enables the following:
Automatic cluster health check and repair. If an instance becomes defective during a training workload, SageMaker HyperPod automatically detects and swaps faulty nodes with healthy ones. To detect faulty hardware, SageMaker HyperPod regularly runs an array of health checks for graphics processing unit (GPU) and network integrity.
Streamlined, distributed training for large training clusters. SageMaker HyperPod is preconfigured with Amazon SageMaker distributed training libraries, enabling users to automatically split their models and training data sets across AWS cluster instances to help them efficiently scale training workloads.
Optimized utilization of clusters’ compute, memory, and network resources. Amazon SageMaker distributed training libraries optimize training jobs for AWS network infrastructure and cluster topology through two techniques: model parallelism and data parallelism. Model parallelism splits a model too large to fit on a single GPU into smaller parts and distributes them across multiple GPUs for training, while data parallelism splits large data sets into portions that are trained on concurrently to improve training speed.
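The difference between the two techniques can be illustrated with a minimal sketch. This is plain Python standing in for the SageMaker distributed training libraries, whose actual APIs differ; every name below (`shard`, `local_gradient`, `all_reduce_mean`, `split_stages`, `forward`) is invented for illustration.

```python
# Conceptual sketch only -- not the SageMaker distributed training libraries' API.

# --- Data parallelism: each "GPU" receives a shard of the data set, computes
# gradients locally, and the gradients are averaged (an all-reduce step) so
# every replica applies the same synchronized update.
def shard(data, num_gpus):
    """Split a data set into roughly equal shards, one per GPU."""
    return [data[i::num_gpus] for i in range(num_gpus)]

def local_gradient(w, shard_data):
    # Gradient of mean squared error for the toy model y = w * x on one shard.
    return sum(2 * (w * x - y) * x for x, y in shard_data) / len(shard_data)

def all_reduce_mean(grads):
    return sum(grads) / len(grads)

data = [(x, 3.0 * x) for x in range(1, 9)]  # toy data generated by y = 3x
w = 0.0
for _ in range(50):
    grads = [local_gradient(w, s) for s in shard(data, num_gpus=4)]
    w -= 0.01 * all_reduce_mean(grads)  # same synchronized update everywhere
# w now converges toward 3.0

# --- Model parallelism: a model too large for one GPU is split into
# contiguous stages, each placed on its own GPU; activations flow from
# stage to stage during the forward pass.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]

def split_stages(layers, num_gpus):
    """Assign contiguous groups of layers to each GPU."""
    per_gpu = len(layers) // num_gpus
    return [layers[i * per_gpu:(i + 1) * per_gpu] for i in range(num_gpus)]

def forward(stages, v):
    for stage in stages:      # each stage would run on a different GPU
        for layer in stage:
            v = layer(v)
    return v

stages = split_stages(layers, num_gpus=2)
```

In the data-parallel loop, averaging the per-shard gradients recovers (for equal-sized shards) the gradient over the full data set, which is why the replicas stay in sync; in the model-parallel split, only activations cross the stage boundary, so no single "GPU" ever holds all the layers.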
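The detect-and-swap behavior described in the first item above can also be sketched. This is a toy illustration, not the managed service's logic; `Node`, `run_health_checks`, and `repair_cluster` are hypothetical names invented for this example.

```python
# Illustrative sketch only: HyperPod's health checks and node replacement are
# managed by AWS; all names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    healthy: bool = True  # stand-in for GPU and network integrity state

def run_health_checks(node: Node) -> bool:
    """Pretend to run the GPU and network integrity checks on one node."""
    return node.healthy

def repair_cluster(cluster: list[Node], spares: list[Node]) -> list[Node]:
    """Swap any node that fails its health checks with a healthy spare."""
    repaired = []
    for node in cluster:
        if run_health_checks(node):
            repaired.append(node)
        elif spares:
            repaired.append(spares.pop())  # replace the faulty node
        # with no spares left, the faulty node is simply dropped in this sketch
    return repaired
```

Running `repair_cluster` over a cluster containing one unhealthy node returns the same-sized cluster with the faulty node replaced by a spare, which is the behavior that spares training teams from babysitting hardware.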
Enterprise Strategy Group spoke with Shane Robbins, Senior Manager, Product Marketing – AI and Machine Learning, about what AWS is learning from customer use of Amazon SageMaker HyperPod. “First, our customers are looking to get rid of some of the heavy lifting around model training,” said Robbins. “Compute workloads, in that sense, are huge, and the automatic detection and swap of faulty nodes the product offers means teams don’t have to dedicate resources to make it work. But in addition to that, many of our customers want more control, so they are building their own foundation models. The only path to such a thing is to have a cost-effective way to get there.”
Robbins noted that customers using HyperPod include AI foundation model makers such as Perplexity and Stability AI, as well as enterprises such as Thomson Reuters, Hugging Face, IBM, and Salesforce.