MLCommons has published the latest round of MLPerf Inference benchmark scores. In the MLPerf Tiny division, U.S. startup Syntiant shone with impressive keyword spotting latency and energy consumption, while Nvidia and Qualcomm battled it out in the edge and data center categories once again.
Syntiant’s NDP120 ran the tinyML keyword spotting benchmark in 1.80 ms, the clear winner for that benchmark (the next nearest result was 19.50 ms for an Arm Cortex-M7 device). This result used 49.59 µJ of energy (for the system) at 1.1 V/100 MHz. Turning the supply voltage down to 0.9 V (and reducing clock frequency to 30 MHz) reduced Syntiant’s energy to 35.29 µJ, but increased latency to 4.30 ms.
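The trade-off between the two operating points can be made concrete with a little arithmetic. The sketch below divides energy per inference by latency to estimate average power during an inference. It assumes the reported energy covers the same window as the reported latency, which the published figures quoted here do not confirm; the function and variable names are illustrative.

```python
# Operating points quoted above for the NDP120 keyword spotting run.
FAST = {"latency_ms": 1.80, "energy_uj": 49.59}  # 1.1 V / 100 MHz
SLOW = {"latency_ms": 4.30, "energy_uj": 35.29}  # 0.9 V / 30 MHz


def average_power_mw(point):
    """Energy divided by time; microjoules per millisecond equals milliwatts.

    Assumption: the energy figure spans exactly the latency window.
    """
    return point["energy_uj"] / point["latency_ms"]


# Energy saved per inference at the lower-voltage setting, in percent.
energy_saving_pct = (FAST["energy_uj"] - SLOW["energy_uj"]) / FAST["energy_uj"] * 100.0

# How many times longer an inference takes at the lower-voltage setting.
latency_factor = SLOW["latency_ms"] / FAST["latency_ms"]

print(f"1.1 V/100 MHz: {average_power_mw(FAST):.2f} mW average")
print(f"0.9 V/30 MHz:  {average_power_mw(SLOW):.2f} mW average")
print(f"energy saving: {energy_saving_pct:.1f}%")
print(f"latency factor: {latency_factor:.2f}X")
```

Under that assumption, the lower-voltage setting trades roughly 2.4 times the latency for a little under 29% less energy per inference.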
The Syntiant NDP120 is based on Syntiant’s second generation AI accelerator core alongside a Tensilica HiFi DSP for feature extraction and an Arm Cortex-M0 core for system management. Since it is designed for voice control, Syntiant did not enter any other benchmark scores for the NDP120.
Elsewhere in the MLPerf Tiny division, STMicroelectronics showed off scores for a range of its STM32 microcontrollers based on Arm Cortex-M4, -M33, and -M7 cores. Across the benchmarked workloads, ST’s highest performance device, the M7-based STM32H7, completed the tasks in the shortest time. However, ST’s M33-based part (STM32U5), with a less powerful core, completed the same inference tasks using the least energy of all the ST entries.
ST’s Matthieu Durnerin explained that the M33 part is optimized for energy efficiency while the M7 is from ST’s high performance range.
“We have a huge family of STM32s, and we are demonstrating that there are many compromises,” he said. “If you want to improve your inference time, but still have good energy efficiency, choose the H7, but for pure energy efficiency, you would typically look at the U5.”
Plumerai, the European startup working on its own inference engine for microcontrollers, entered scores for several hardware setups. On an identical STMicro M4-based part, Plumerai beat ST’s own latency scores across the workloads by 13–39%. Plumerai’s engine is designed to reduce memory footprint, and it supports any Arm Cortex-M hardware.
Renesas entered scores from two parts: an Arm Cortex-M33 device with a floating point unit, and another based on its own 32-bit RXv2 core. The company’s intent was to demonstrate that general-purpose microcontrollers can be used in various AI applications with appropriate latency and energy.
Silicon Labs entered its MG24, a multi-protocol SoC for IoT applications which includes an in-house developed AI accelerator block alongside an Arm Cortex-M33 core. The company’s marketing material states “up to 4x faster processing and up to 6x lower power consumption for ML workloads compared to running the same workloads on the Cortex-M33 core”. In the benchmark scores, the SoC performed the same tasks using 1.8X to 3.3X less energy than an STMicro M33-based power-optimized microcontroller, though with longer latencies. Silicon Labs’ scores also demonstrated its new end-to-end ML development platform, which supports existing Series 1 and 2 wireless SoCs as well as the new families with AI acceleration.
Andes, the RISC-V core designer, entered several SoCs with cores from its AndesCore range, based on the company’s AndeStar v5 instruction set architecture. The D25F and D45 cores with RISC-V DSP and SIMD extensions are geared towards AI in IoT devices, while the NX27V core is a higher performance solution with RISC-V vector extensions for edge or cloud applications.
The mobile division had two entries.
Qualcomm ran the mobile benchmark suite on a Xiaomi MI12 smartphone with a Snapdragon 8 Gen 1 processor (including a Hexagon tensor processor). This was up against Samsung’s Galaxy S22 with its Exynos 2200 application processor (Samsung used the chip’s NPU and DSP for all workloads except natural language processing).
Though the results were close, Qualcomm won every benchmark in this round. Overall, this round of results showed an average 2X performance gain over the previous round.
Nvidia dominated both the edge and data center inference divisions with hundreds of scores submitted both by Nvidia and by Nvidia system partners including Lenovo, Gigabyte, Supermicro, and many more.
Nvidia chose to show off its Jetson Orin SoC, just unveiled at GTC last month, targeting applications including robotics and autonomous driving. The company said the AGX Orin is up to 5X more performant and up to 2.3X more energy efficient than its predecessor, the AGX Xavier.
Orin’s main competitor in these scores is Qualcomm’s Cloud AI100. Qualcomm limited the chip’s TDP to 20 W in this division. (Gigabyte also submitted scores on Qualcomm’s Cloud AI100, but limited to 75 W TDP.)
Nvidia claimed victory for single stream performance on a per-accelerator basis. For example, Bert single stream latency was 7.64 ms for Nvidia Orin, versus 15.41 ms for Qualcomm’s Cloud AI100. Qualcomm won a couple of workloads over Orin, including ResNet-50 multi-stream latency and offline samples processed per second.
Qualcomm’s best single stream showing was in ResNet-50, where its latency was 0.89 ms versus 0.69 ms for Nvidia Orin (a 1.27X win for Nvidia). Comparing the same two systems shows Qualcomm did better in multi-stream performance for ResNet-50 (1.2X faster than Nvidia Orin) and also won on offline samples/sec, by 1.59X.
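The per-accelerator comparisons above boil down to ratios of latencies. As a minimal sketch, the helper below (the name `speedup` is illustrative and not part of any MLPerf tooling) reproduces the roughly 2X gap from the Bert single stream latencies quoted earlier.

```python
def speedup(baseline_latency_ms, candidate_latency_ms):
    """Return how many times faster the candidate is than the baseline.

    Latency is a lower-is-better metric, so the baseline goes on top.
    """
    if baseline_latency_ms <= 0 or candidate_latency_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_latency_ms / candidate_latency_ms


# Bert single stream latencies quoted in the article.
bert_orin_ms = 7.64
bert_ai100_ms = 15.41

bert_speedup = speedup(bert_ai100_ms, bert_orin_ms)
print(f"Bert single stream: Orin is {bert_speedup:.2f}X faster")
```

For throughput metrics such as offline samples/sec, where higher is better, the ratio would be taken the other way around.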
Korean startup FuriosaAI entered its Warboy AI accelerator chip for ResNet and SSD edge benchmarks. Its SSD-Small results are particularly notable; latency results have improved 15%, with offline throughput more than doubled compared to the same silicon in the last round, due to compiler enhancements. The company said it is working on a next-gen AI inference system with 10X the performance of its current Warboy chip.
Data center division
Most of the performance benchmarks for data center systems were won outright by Inspur’s 12x Nvidia A100 system, which was the biggest system entered in terms of number of accelerators. The recommendation (DLRM) benchmark was won by an Inspur 8x Nvidia A100 system.
Submissions run on the Qualcomm Cloud AI100, including partner scores, have more than doubled in this round to more than 200. Compared to the last round, Qualcomm said it has improved its ResNet-50 power efficiency by 17% to 230 inferences/sec/Watt for 8x accelerator systems. Offline performance for a 16x AI100 system is 371,473 ResNet-50 queries/sec from a 2U Gigabyte server.
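Two rough figures can be derived from these numbers. The sketch below assumes that throughput divides evenly across the 16 accelerators and that "improved by 17%" means the new efficiency is 1.17 times the old one; neither assumption is confirmed by the article, and the variable names are illustrative.

```python
# Figures quoted above for Qualcomm Cloud AI100 systems running ResNet-50.
offline_queries_per_sec = 371_473   # 16x AI100 system in a 2U Gigabyte server
accelerators = 16
efficiency_now = 230.0              # inferences/sec/W, 8x accelerator systems
improvement = 0.17                  # 17% better than the previous round

# Assumption: every accelerator contributes an equal share of throughput.
per_accelerator_qps = offline_queries_per_sec / accelerators

# Assumption: new efficiency = previous efficiency * (1 + improvement).
efficiency_previous = efficiency_now / (1.0 + improvement)

print(f"per accelerator: {per_accelerator_qps:,.0f} queries/sec")
print(f"implied previous efficiency: {efficiency_previous:.1f} inferences/sec/W")
```

Neither derived figure appears in the published results; both follow only from the numbers quoted above.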
Qualcomm’s power efficiency (performance per watt, regardless of number of accelerators) was better than Nvidia’s A100 on the benchmarks it entered, which were ResNet, SSD, and Bert.
Nvidia also submitted scores designed to show off the performance of its Triton inference server software on Intel CPUs and Amazon Inferentia accelerators.
Startup Neuchips is still working on its RecAccel DLRM accelerator for the data center, again submitting FPGA-based scores. However, performance of the most recent design has improved 40% compared to the last round. The company plans to release samples of its ASIC in the second half of this year.
Notable in the data center open division (where companies are allowed to change the models) was Deci. This Israel-based company uses a technique called neural architecture search to optimize models in order to improve accuracy and compute performance. The company ran its models on Intel CPUs. Compared to the baseline ResNet-50 model, Deci’s model improved throughput 2.8X to 4X (depending on the hardware) while also improving accuracy.
The full set of scores is available here.