In the latest round of MLPerf training benchmark scores, Google showed four overall winning scores out of eight benchmarks. Nvidia claimed wins on two benchmarks versus Google on a per-accelerator basis, plus a further four workloads that were uncontested.
This round of benchmarking attracted some of the biggest cutting-edge hardware and systems in the world, including systems with 4096 Google TPUv4s or 4216 Nvidia A100s, as well as latest-gen hardware from Graphcore and Intel’s Habana Labs. There were also interesting software-only submissions from MosaicML.
Nvidia did not enter any submissions using its latest H100 hardware, saying that H100 will appear in future rounds of benchmarking. This means the latest-gen hardware from Google, Graphcore, and Habana was up against the two-year-old Nvidia A100.
Overall, this round of scores showed significant improvement across the board. MLPerf executive director David Kanter quoted Peter Drucker: “What gets measured gets improved.”
“It’s important to start measuring performance and measure the right thing,” Kanter said. “If we’re trying to drive the industry true north, we’re probably 5 or 6 degrees off, but since we’re traveling together, we’re all going to go pretty fast.”
In the time since MLPerf began measuring training benchmark scores, we might have expected an improvement of 3.5× purely from Moore’s Law. But the latest round of scores shows the industry outpacing Moore’s Law by 10× over the same time frame, based on hardware and software innovation. Kanter’s analysis also showed the fastest training results had improved 1.88× versus the last round of scores for the biggest systems, while 8-accelerator systems improved up to 50%.
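As a quick illustration of that arithmetic, assuming the classic two-year doubling period (the ~3.6-year elapsed time below is back-calculated from the article’s ~3.5× figure, not taken from the source):

```python
# Sketch of the Moore's-law baseline arithmetic. Assumes a two-year
# doubling period; the ~3.6-year span is back-calculated from the
# article's ~3.5x figure rather than sourced.

def moores_law_factor(years: float, doubling_period: float = 2.0) -> float:
    """Expected speedup from transistor scaling alone."""
    return 2.0 ** (years / doubling_period)

hardware_only = moores_law_factor(3.6)
print(round(hardware_only, 2))        # ~3.48x from process scaling alone
print(round(hardware_only * 10, 1))   # implied total gain if the industry
                                      # outpaces that baseline by 10x
```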
“As a barometer of progress for the industry, things are looking pretty good,” he said.
As usual, submitting hardware companies showed how the same set of results proved each of them the winner. Here’s a run-down of the scores they showed and what they mean.
Google submitted two results for its 4096-TPUv4 system in the cloud, which the company said is publicly available today. The system in question, in Google’s data center in Oklahoma, operates on 90% carbon-free energy with a power usage effectiveness (PUE) of 1.1, making it one of the most energy-efficient data centers in the world.
For the 4096-TPUv4 system, its winning times were 0.191 min for ResNet (versus Nvidia’s 4216 A100s, which did it in 0.319 min) and 0.179 min for BERT (versus 4096 Nvidia A100s, which did it in 0.206 min).
With smaller TPU systems, the cloud giant also won RetinaNet (the new object detection benchmark) in 2.343 min and Mask R-CNN in 2.253 min.
Google submitted scores for five of the eight benchmarks, adding that the scores represented a “significant improvement” over its previous submissions. Google’s figures put its average speedup at 1.42× the next-fastest non-Google submission, and 1.5× versus its own June 2021 results.
The internet giant said it has been doing a great deal of work to improve the TPU’s software stack. Scalability and performance optimizations have been made in the TPU compiler and runtime, which includes faster embedding lookups and improved model weight distribution across multiple TPUs.
Google is reportedly moving towards JAX (away from TensorFlow) for internal development teams, but there was no indication of any move in this round of scores. All Google’s submissions in this round were on TensorFlow. Last year’s results did include both TensorFlow and JAX scores, but not in the same workload categories. The next round may provide some insight into whether JAX is more efficient.
Nvidia was the only company to submit results for all eight benchmarks in the closed division. As in previous rounds, Nvidia hardware dominated the list, with 90% of all submissions using Nvidia GPUs, from both Nvidia and its OEM partners.
Nvidia said its A100 was fastest on six of the eight benchmarks when normalized to per-accelerator scores (it conceded RetinaNet to Google and ResNet to Habana Labs on a per-accelerator basis).
This was Nvidia’s fourth time submitting scores for its Ampere A100 GPU, which gave us an insight into how much work Nvidia has done on the software side in the last couple of years. Most improved were the scores for the DGX SuperPod A100 systems on DLRM, which had improved almost 6×. For DGX-A100 systems, the biggest improvement was on BERT, which had improved a little over 2×. Nvidia put these improvements down to extensive work on CUDA graphs, optimized libraries, enhanced pre-processing, and full-stack networking improvements.
Nvidia’s outright wins were for training 3D U-Net in 1.216 min on 768 A100s, RNN-T in 2.151 min on 1536 A100s, DLRM in 0.588 min on 112 A100s (Google had a system with 128 TPUv4s that could do it in 0.561 min, but it is not commercially available), and MiniGo in 16.231 min with 1792 A100s.
Industry observers waiting eagerly to see the H100 benchmarked against the A100 and the competition were disappointed. Shar Narasimhan, director of product management for accelerated computing at Nvidia, said the H100 would feature in future rounds of MLPerf training scores.
“Our focus at Nvidia is getting our customers to deploy AI in production in the real world today,” Narasimhan said. “A100 already has a wide installed base and it’s widely available at all clouds and from every major server maker… it has the highest performance on all the MLPerf tests. Since we got great performance, we wanted to focus on what was commercially available, and that’s why we submitted on the A100.”
Narasimhan said it is important to submit results for every benchmarked workload because this more accurately reflects real-world applications. His example, a user speaking a request to identify a plant from an image on their smartphone, required a pipeline of 10 different workloads, including speech-to-text, image classification, and recommendation.
“That’s why it’s so important to submit to [every benchmark of] MLPerf — when you want to deliver AI in the real world, you need to have that versatility,” he said.
Other customer needs include frequent retraining at scale, infrastructure fungibility (using the same hardware for training and inference), future-proofing, and maximizing productivity per dollar (data science and engineering teams can be the majority of the cost of deploying AI for some companies, he added).
Graphcore submitted results for its latest Bow IPU hardware training ResNet and BERT. ResNet was about 30% faster across system sizes compared to the last round of MLPerf training (December 2021) and BERT was about 37% faster.
“These scores are a combination of our work at the application layer, the hardware as we take advantage of our new Bow system, and at the core SDK level which continues to improve in terms of performance,” said Matt Fyles, senior vice president of software at Graphcore.
Chinese internet giant Baidu submitted two MLPerf scores for latest-gen Graphcore Bow IPU hardware; one was on the PyTorch framework and the other on PaddlePaddle, Baidu’s own open-source AI framework, which is widely used by its cloud customers.
“Our China group worked closely with the Baidu team to do the submission,” said Fyles. “[PaddlePaddle] is incredibly popular as a framework in China… We want to work in as much of the ecosystem as possible, not just with the American machine learning frameworks, also the ones in the rest of the world. It’s also good validation that our software stack can plug into different things.”
Fyles would not reveal whether Baidu is a Graphcore customer, saying only that the two companies had partnered.
Graphcore’s own submissions showed BERT training results for a 16-IPU system on Graphcore’s PopART framework and on PaddlePaddle, with very similar results (20.654 and 20.747 min respectively). This points to consistent performance for IPUs across frameworks, Graphcore said.
The company also pointed out that Graphcore’s scores for 16- and 64-IPU systems on PaddlePaddle were almost identical to what Baidu could achieve with the same hardware and framework (Baidu’s 20.810 min and 6.740 min, versus Graphcore’s 20.747 min and 6.769 min).
“We are happy Graphcore made a submission with PaddlePaddle on IPUs with outstanding performance,” a statement from Baidu’s team read. “As for BERT training performance on IPUs, PaddlePaddle is in line with Graphcore’s PopART framework. It shows PaddlePaddle’s hardware ecosystem is expanding, and PaddlePaddle performs excellently on more and more AI accelerators.”
Fyles also mentioned that Graphcore sees the industry heading towards lower-precision floating-point formats such as FP8 for AI training. (Nvidia has already announced this capability for its upcoming Hopper architecture.)
“This is an area which, because we have a very general programmable processor, we can do a lot of work in software to do things such as FP8 support, and supporting algorithmic work for different precisions,” he said. “I think that’s a testament to the programmability of the processor that we can do some very interesting things at the application level to bring things like time to train down on these tough applications.”
Intel Habana Labs
Habana was another company to show off what its new silicon can do. The company submitted scores for its second-gen Gaudi2 accelerator in an 8-chip system, as well as scaled-up systems for its first-gen Gaudi chips (128- and 256-chip systems).
Habana’s 8-chip Gaudi2 system comfortably beat Nvidia’s 8-chip A100 system, training ResNet in 18.362 min versus Nvidia’s 28.685 min. Gaudi2’s BERT score was also faster than the A100’s: 17.209 min to train on Gaudi2 versus 18.442 min for the Nvidia A100.
Relative to first-gen Gaudi performance from previous rounds, ResNet training improved 3.4× and BERT training improved 4.9×. The company said these speedups were achieved by transitioning to a 7nm process technology from 16nm in the first gen, Gaudi2’s 96GB of HBM2E memory with 2.45 TB/s bandwidth, and other architecture advances.
Scores for the larger first-gen Gaudi systems were 5.884 min to train BERT on 128 chips, and 3.479 min for the 256-chip system. The company noted that this represents near-linear scaling with the number of accelerators.
No scale–out results were submitted for Gaudi2.
Server maker Supermicro submitted scores for first-gen Gaudis in 8- and 16-accelerator configurations, the first OEM server scores for Habana hardware.
As in previous rounds, Habana stated that training scores were achieved “out of the box”; that is, without special software manipulations that differ from its commercial software stack, SynapseAI. This is intended to reassure customers that these results are easily repeatable.
Habana’s supporting material noted that Gaudi2 includes support for training with FP8 datatypes, but that this was not applied in the benchmark results presented in this round.
Pricing for Gaudi2 systems was described by Habana as “very competitive”.
Startup MosaicML submitted two results in the open division designed to show off its algorithmic methods for speeding up AI training.
The results went to the open division (where submitters are allowed to make changes to the model used) because the company focused on a version of ResNet-50 that it says uses a standard set of hyperparameters widely used in research today. The baseline training time before optimization was 110.513 min, which the company’s open-source deep learning library, Composer, sped up 4.5× to 23.789 min.
Heavily optimized results on similar hardware setups from the closed division, albeit with a slightly different model, were Nvidia’s 28.685 min and Dell’s 28.679 min; Mosaic’s version was about 17% faster.
“We’re focused on making ML training more efficient specifically through algorithms,” said Hanlin Tang, MosaicML co-founder and CTO. “By deploying some of our algorithms that actually changed how the training gets done, we’re able to speed up the efficiency of training quite significantly.”
Mosaic’s Composer library is designed to make it easy to add up to 20 of the company’s algorithmic methods for vision and NLP and compose them into novel recipes that can speed up training.
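The composition idea can be sketched in plain Python. This is a hypothetical illustration of stacking speedup methods into a recipe, not Composer’s actual API; the names below are invented for illustration (though label smoothing and progressive resizing are among the kinds of methods Mosaic describes):

```python
# Hypothetical sketch of composing training-speedup methods into a recipe.
# Not MosaicML Composer's real API: SpeedupMethod, apply_recipe, and the
# example methods are invented for illustration.
from typing import Callable

SpeedupMethod = Callable[[dict], dict]  # transforms a training configuration

def label_smoothing(cfg: dict) -> dict:
    """Soften one-hot targets (a common regularization/speedup method)."""
    return {**cfg, "label_smoothing": 0.1}

def progressive_resizing(cfg: dict) -> dict:
    """Start training at a smaller image size to cut early-epoch compute."""
    return {**cfg, "initial_image_size": cfg.get("image_size", 224) // 2}

def apply_recipe(cfg: dict, recipe: list[SpeedupMethod]) -> dict:
    """Compose methods left-to-right over the training configuration."""
    for method in recipe:
        cfg = method(cfg)
    return cfg

cfg = apply_recipe({"image_size": 224}, [label_smoothing, progressive_resizing])
print(cfg)
# {'image_size': 224, 'label_smoothing': 0.1, 'initial_image_size': 112}
```

The point of the design is that each method is independent, so methods can be mixed into novel recipes without rewriting the training loop.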
Hazy Research’s submission was the work of a single grad student, Tri Dao. BERT training on an 8-A100 system was finished in 17.402 min, compared to 18.442 min for an Nvidia system with the same accelerators and framework.
Hazy Research has been working on a method to speed up training of transformer networks such as BERT based on a new way of performing the computation associated with the attention mechanism.
Attention, the basis of all transformers, becomes much more compute- and memory-intensive as sequence length increases.
“Many approximate attention methods aimed at alleviating these issues do not display wall–clock speedup against standard attention, as they focus on FLOPS reduction and tend to ignore overheads from memory access (IO),” a statement from Hazy said.
Hazy has made attention IO-aware by taking memory access to SRAM and HBM into account. The group’s FlashAttention algorithm computes exact attention with fewer HBM accesses by splitting the softmax computation into tiles and avoiding storage of large intermediate matrices for the backward pass. According to the group, FlashAttention runs 4× faster and uses 10× less memory than PyTorch’s standard attention.
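The tiling idea can be illustrated with a minimal pure-Python sketch. This shows the online-softmax principle behind FlashAttention, not Hazy’s CUDA implementation: processing keys and values block by block, with only running statistics kept per query, gives exactly the same output as the naive version that materializes the full N×N score matrix.

```python
import math

def attention_naive(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    n, d = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(n):
        scores = [scale * sum(Q[i][k] * K[j][k] for k in range(d)) for j in range(n)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(exps[j] * V[j][k] for j in range(n)) / z for k in range(d)])
    return out

def attention_tiled(Q, K, V, block=2):
    """Exact attention computed tile by tile (online softmax): only a
    running max, running denominator, and running output accumulator are
    kept per query, so the N x N score matrix is never stored."""
    n, d = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(n):
        m, z, acc = float("-inf"), 0.0, [0.0] * d
        for start in range(0, n, block):          # one key/value tile at a time
            for j in range(start, min(start + block, n)):
                s = scale * sum(Q[i][k] * K[j][k] for k in range(d))
                m_new = max(m, s)
                c = math.exp(m - m_new)           # rescale earlier statistics
                w = math.exp(s - m_new)           # weight for the new entry
                z = z * c + w
                acc = [a * c + w * V[j][k] for k, a in enumerate(acc)]
                m = m_new
        out.append([a / z for a in acc])
    return out
```

Both functions return identical results on the same inputs; the tiled version is where the memory saving comes from at long sequence lengths, since intermediate results stay in fast on-chip memory.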
Hazy Research has open–sourced its implementation of FlashAttention, which it says can be applied to all transformer networks.
British consultancy Krai, MLPerf inference veterans, submitted a ResNet training score of 284.038 min for a system with two Nvidia RTX A5000 GPUs. This entry-level option may be compared with one of Nvidia’s results for a 2-A30 system, which managed the training in 235.574 min, Krai said, pointing out that while the A5000s consumed 39% more power and were 20% slower, they are also 2-3× cheaper. Another option would be to compare with a single A100; the A5000s compare favorably on speed and cost but use more power.
Given these comparisons, the dual A5000 system may still be an attractive option for smaller companies, Krai said.
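The trade-off can be checked with quick arithmetic. The minute figures are the reported results; the 2.5× price factor is an assumed midpoint of the article’s rough “2-3× cheaper” claim, not a quoted price:

```python
# Quick arithmetic behind Krai's comparison. Training times are the
# reported results; the 2.5x price ratio is an assumed midpoint of the
# article's "2-3x cheaper" claim, not a quoted price.
a5000_min = 284.038   # 2x RTX A5000 system
a30_min = 235.574     # Nvidia 2x A30 system

slowdown = a5000_min / a30_min - 1.0
print(f"A5000 system is ~{slowdown:.1%} slower")       # ~20.6% slower

price_ratio = 2.5  # assumed: A5000 pair ~2.5x cheaper
throughput_per_dollar = (a30_min / a5000_min) * price_ratio
print(f"~{throughput_per_dollar:.1f}x training throughput per dollar")  # ~2.1x
```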