This article provides performance results for a set of well-known or reference pre-trained Neural Network models.

Performance metrics verified by the MLCommons association have been published in the MLPerf™ Tiny v1.1 benchmark. Below are additional performance metrics measured by STMicroelectronics, which have not been verified by MLCommons ^{[ST 1]}.

Information

ST Edge AI Core^{[ST 2]} is a free-of-charge desktop tool to evaluate, optimize, and compile edge AI models for multiple ST products, including microcontrollers, microprocessors, and smart sensors with ISPU and MLC. It is delivered under the SLA0104 - Rev 1 software license agreement^{[ST 3]}.
The inference time, current and energy measurement process described is not done in a certified laboratory but can be reproduced by any user. The results are average values, which may vary depending on the input data (random data are currently used), the temperature, and the STM32 device itself.
Published data in this article is not contractual.
Copyright STMicroelectronics - All rights reserved. Do not publish the following data without written consent of STMicroelectronics.

1. Performance results

1.1. STM32N6x7

The STM32N6x7 is a high-performance MCU embedding an Arm Cortex^®-M55 core together with the Neural-ART Accelerator™ Neural Processing Unit (NPU), which delivers 600 GOPS at a power efficiency of 3 TOPS/W. The STM32N6 has 4.2 Mbytes of embedded RAM.

The following measurements were made on the STM32N657 Discovery kit with STM32Cube.AI 10.0.1, delivered in ST Edge AI Core v2.0, and with STM32CubeIDE 1.15.0.

The Yolov8n and TinyYolov2 models are object detection models trained to recognize persons. The models are quantized in int8 per-channel using the TensorFlow™ Lite converter.
The MobileNet v2 1.0 model is an image classification model trained on ImageNet and quantized in int8 per-channel using the TensorFlow™ Lite converter.
The Yamnet 1024 model is an audio event detector trained on the ESC-10 dataset and quantized in int8 per-channel using ONNX Runtime.

The performance table fields and the measurement process are detailed in the section Measurement process for the STM32N6x7.

In nominal mode (VOS low) the settings used are the following ones: Cortex^®-M55 configured at 600 MHz, Neural-ART Accelerator configured at 800 MHz, NPU RAMS (AXISRAM 3/4/5/6) access at 800 MHz, CPU RAM (AXISRAM 1/2) access at 400 MHz, V_DDCORE set to 0.81 V, and V_DDIO set to 1.8 V.

Model Source/Link	Flash Weights (Mbyte)	RAM Activations (Mbyte)	Proc Time (ms)	inf/s	Total energy (mJ)	Total average power (mW)	V_DDCORE (mJ)	V_DDCORE (mW)	V_DDA1V8 (mJ)	V_DDA1V8 (mW)	External memories (mJ)	External memories (mW)
Yolov8n 256x256x3 [1]	2.91 MB	1.59 MB	35.6 ms	28	8 mJ	225 mW	6.4 mJ	177 mW	0.26 mJ	7.3 mW	1.4 mJ	40 mW
Yolov8n 320x320x3 [2]	2.91 MB	2.12 MB	47.1 ms	21	11 mJ	232 mW	9 mJ	191 mW	0.32 mJ	6.75 mW	1.6 mJ	34 mW
TinyYolov2 224x224x3 [3]	10.55 MB	0.38 MB	31.4 ms	32	10.8 mJ	343 mW	6 mJ	188 mW	0.5 mJ	16 mW	4.34 mJ	138 mW
MobileNet v2 1.0 224x224x3 [4]	4.13 MB	2.01 MB	23.2 ms	43	6 mJ	257 mW	3.85 mJ	166 mW	0.25 mJ	10.8 mW	1.86 mJ	80 mW
Yamnet 1024 64x96 [5]	3.41 MB	0.14 MB	9.9 ms	101	2.72 mJ	275 mW	1.23 mJ	124 mW	0.16 mJ	16 mW	1.34 mJ	135 mW

In performance mode (VOS high) the settings used are the following ones: Cortex^®-M55 configured at 800 MHz, Neural-ART Accelerator configured at 1 GHz, NPU RAMS (AXISRAM 3/4/5/6) access at 900 MHz, CPU RAM (AXISRAM 1/2) access at 400 MHz, V_DDCORE set to 0.89 V, and V_DDIO set to 1.8 V.

Model Source/Link	Flash Weights (Mbyte)	RAM Activations (Mbyte)	Proc Time (ms)	inf/s	Total energy (mJ)	Total average power (mW)	V_DDCORE (mJ)	V_DDCORE (mW)	V_DDA1V8 (mJ)	V_DDA1V8 (mW)	External memories (mJ)	External memories (mW)
Yolov8n 256x256x3 [6]	2.91 MB	1.587 MB	29.1 ms	34	9.4 mJ	322 mW	7.8 mJ	267 mW	0.26 mJ	9 mW	1.34 mJ	46 mW
Yolov8n 320x320x3 [7]	2.91 MB	2.12 MB	38.7 ms	26	12.85 mJ	332 mW	11 mJ	285 mW	0.32 mJ	8.36 mW	1.5 mJ	39 mW
TinyYolov2 224x224x3 [8]	10.55 MB	0.38 MB	30.6 ms	33	12.8 mJ	417 mW	8 mJ	259 mW	0.52 mJ	17 mW	4.32 mJ	141 mW
MobileNet v2 1.0 224x224x3 [9]	4.13 MB	2.01 MB	20.7 ms	48	7 mJ	337 mW	4.88 mJ	236 mW	0.26 mJ	12.6 mW	1.82 mJ	88 mW
Yamnet 1024 64x96 [10]	3.41 MB	0.14 MB	9.7 ms	103	3.2 mJ	332 mW	1.7 mJ	176 mW	0.17 mJ	17 mW	1.34 mJ	138 mW

1.2. STM32 High Performance MCUs

STM32 High Performance MCUs at 3.3 V: inference time, memory footprint, and energy reported in millijoules (mJ):

STM32 Board	STM32 characteristics	Model Source/Link	Flash total (Kbyte)	RAM total (Kbyte)	Proc Time (ms)	Cur. (mA)	Energy (mJ) 3.3 V	Version
STM32H735 SMPS STM32H735G-DK	Flash 1 Mbyte RAM 564 Kbytes (432) Freq 550 MHz	FDMobileNet 0.25 224x224x3 quant tfl [11]	185 Kbytes	166 Kbytes	38.5 ms	101 mA	12.8 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H735 SMPS STM32H735G-DK	Flash 1 Mbyte RAM 564 Kbytes (432) Freq 550 MHz	MobileNet v2 128x128x3 quant tfl [12]	514 Kbytes	255 Kbytes	67.7 ms	104 mA	23.2 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H735 SMPS STM32H735G-DK	Flash 1 Mbyte RAM 564 Kbytes (432) Freq 550 MHz	Object Detector SSD MobileNet v1 0.25 192x192x3 [13]	533 Kbytes	270 Kbytes	105.8 ms	103 mA	36 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H735 SMPS STM32H735G-DK	Flash 1 Mbyte RAM 564 Kbytes (432) Freq 550 MHz	Yamnet 256 quant tfl [14]	187 Kbytes	117 Kbytes	50.4 ms	99 mA	16.5 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H735 SMPS STM32H735G-DK	Flash 1 Mbyte RAM 564 Kbytes (432) Freq 550 MHz	CNN2D_ST_HandPosture VL53L8CX 8 postures float keras [15]	25 Kbytes	3 Kbytes	0.14 ms	92 mA	0.043 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H735 SMPS STM32H735G-DK	Flash 1 Mbyte RAM 564 Kbytes (432) Freq 550 MHz	HAR IGN float keras [16]	26 Kbytes	4 Kbytes	0.38 ms	92 mA	0.115 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H735 SMPS STM32H735G-DK	Flash 1 Mbyte RAM 564 Kbytes (432) Freq 550 MHz	Anomaly Detection MLPerf™Tiny quant tfl [17]	277 Kbytes	6.39 Kbytes	0.667 ms	95 mA	0.209 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H735 SMPS STM32H735G-DK	Flash 1 Mbyte RAM 564 Kbytes (432) Freq 550 MHz	Key Word Spotting MLPerf™Tiny quant tfl [18]	65 Kbytes	24 Kbytes	8.16 ms	96 mA	2.59 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H735 SMPS STM32H735G-DK	Flash 1 Mbyte RAM 564 Kbytes (432) Freq 550 MHz	Image Classif MLPerf™Tiny quant tfl [19]	127 Kbytes	49 Kbytes	21.7 ms	96 mA	6.9 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H735 SMPS STM32H735G-DK	Flash 1 Mbyte RAM 564 Kbytes (432) Freq 550 MHz	Visual Wake Word MLPerf™Tiny quant tfl [20]	96 Kbytes	56 Kbytes	14.54 ms	96 mA	4.6 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H747 SMPS STM32H747I-DISCO	Cortex^®-M7 Flash 2 Mbytes RAM 1 Mbyte (0.5) Freq 400 MHz⁽¹⁾	FDMobileNet 0.25 224x224x3 quant tfl [21]	185 Kbytes	166 Kbytes	53.8 ms	63 mA	11.2 mJ	STM32Cube.AI 8.1.0 STM32CubeIDE 1.12.1
STM32H747 SMPS STM32H747I-DISCO	Cortex^®-M7 Flash 2 Mbytes RAM 1 Mbyte (0.5) Freq 400 MHz⁽¹⁾	MobileNet v2 128x128x3 quant tfl [22]	514 Kbytes	255 Kbytes	94.9 ms	65 mA	20.4 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H747 SMPS STM32H747I-DISCO	Cortex^®-M7 Flash 2 Mbytes RAM 1 Mbyte (0.5) Freq 400 MHz⁽¹⁾	Object Detector SSD MobileNet v1 0.25 192x192x3 [23]	533 Kbytes	270 Kbytes	149.5 ms	64 mA	31.6 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H747 SMPS STM32H747I-DISCO	Cortex^®-M7 Flash 2 Mbytes RAM 1 Mbyte (0.5) Freq 400 MHz⁽¹⁾	Yamnet 256 quant tfl [24]	187 Kbytes	117 Kbytes	77.3 ms	64 mA	16.3 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H747 SMPS STM32H747I-DISCO	Cortex^®-M7 Flash 2 Mbytes RAM 1 Mbyte (0.5) Freq 400 MHz⁽¹⁾	CNN2D_ST_HandPosture VL53L8CX 8 postures float keras [25]	25 Kbytes	3 Kbytes	0.2 ms	62 mA	0.041 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H747 SMPS STM32H747I-DISCO	Cortex^®-M7 Flash 2 Mbytes RAM 1 Mbyte (0.5) Freq 400 MHz⁽¹⁾	HAR IGN float keras [26]	26 Kbytes	4 Kbytes	0.53 ms	59 mA	0.103 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H747 SMPS STM32H747I-DISCO	Cortex^®-M7 Flash 2 Mbytes RAM 1 Mbyte (0.5) Freq 400 MHz⁽¹⁾	Anomaly Detection MLPerf™Tiny quant tfl [27]	277 Kbytes	6.39 Kbytes	1.08 ms	63 mA	0.193 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H747 SMPS STM32H747I-DISCO	Cortex^®-M7 Flash 2 Mbytes RAM 1 Mbyte (0.5) Freq 400 MHz⁽¹⁾	Key Word Spotting MLPerf™Tiny quant tfl [28]	65 Kbytes	24 Kbytes	11.44 ms	64 mA	2.4 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H747 SMPS STM32H747I-DISCO	Cortex^®-M7 Flash 2 Mbytes RAM 1 Mbyte (0.5) Freq 400 MHz⁽¹⁾	Image Classif MLPerf™Tiny quant tfl [29]	127 Kbytes	49 Kbytes	33 ms	64 mA	6.4 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H747 SMPS STM32H747I-DISCO	Cortex^®-M7 Flash 2 Mbytes RAM 1 Mbyte (0.5) Freq 400 MHz⁽¹⁾	Visual Wake Word MLPerf™Tiny quant tfl [30]	96 Kbytes	56 Kbytes	20.7 ms	64 mA	4.4 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H7A3 SMPS NUCLEO-H7A3ZI-Q	Flash 2 Mbytes RAM 1.4 Mbyte (1.18) Freq 280 MHz	FDMobileNet 0.25 224x224x3 quant tfl [31]	185 Kbytes	166 Kbytes	76 ms	43 mA	10.8 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H7A3 SMPS NUCLEO-H7A3ZI-Q	Flash 2 Mbytes RAM 1.4 Mbyte (1.18) Freq 280 MHz	MobileNet v2 128x128x3 quant tfl [32]	514 Kbytes	255 Kbytes	132 ms	44 mA	19.2 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H7A3 SMPS NUCLEO-H7A3ZI-Q	Flash 2 Mbytes RAM 1.4 Mbyte (1.18) Freq 280 MHz	Object Detector SSD MobileNet v1 0.25 192x192x3 [33]	533 Kbytes	270 Kbytes	208 ms	44 mA	30.2 mJ	STM32Cube.AI 8.1.0 STM32CubeIDE 1.12.1
STM32H7A3 SMPS NUCLEO-H7A3ZI-Q	Flash 2 Mbytes RAM 1.4 Mbyte (1.18) Freq 280 MHz	Yamnet 256 quant tfl [34]	187 Kbytes	117 Kbytes	105 ms	45 mA	15.6 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H7A3 SMPS NUCLEO-H7A3ZI-Q	Flash 2 Mbytes RAM 1.4 Mbyte (1.18) Freq 280 MHz	CNN2D_ST_HandPosture VL53L8CX 8 postures float keras [35]	25 Kbytes	3 Kbytes	0.29 ms	41 mA	0.039 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H7A3 SMPS NUCLEO-H7A3ZI-Q	Flash 2 Mbytes RAM 1.4 Mbyte (1.18) Freq 280 MHz	HAR IGN float keras [36]	26 Kbytes	4 Kbytes	0.76 ms	40 mA	0.1 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H7A3 SMPS NUCLEO-H7A3ZI-Q	Flash 2 Mbytes RAM 1.4 Mbyte (1.18) Freq 280 MHz	Anomaly Detection MLPerf™Tiny quant tfl [37]	277 Kbytes	6.39 Kbytes	1.24 ms	44 mA	0.184 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H7A3 SMPS NUCLEO-H7A3ZI-Q	Flash 2 Mbytes RAM 1.4 Mbyte (1.18) Freq 280 MHz	Key Word Spotting MLPerf™Tiny quant tfl [38]	65 Kbytes	24 Kbytes	16.3 ms	44 mA	2.4 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32H7A3 SMPS NUCLEO-H7A3ZI-Q	Flash 2 Mbytes RAM 1.4 Mbyte (1.18) Freq 280 MHz	Image Classif MLPerf™Tiny quant tfl [39]	127 Kbytes	49 Kbytes	43 ms	44 mA	6.2 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.1
STM32H7A3 SMPS NUCLEO-H7A3ZI-Q	Flash 2 Mbytes RAM 1.4 Mbyte (1.18) Freq 280 MHz	Visual Wake Word MLPerf™Tiny quant tfl [40]	96 Kbytes	56 Kbytes	29.3 ms	44 mA	4.3 mJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0

⁽¹⁾ The Cortex^®-M7 core runs at 400 MHz in SMPS mode, instead of the 480 MHz maximum available in LDO mode. The Cortex^®-M4 core runs a while(1) infinite loop.

For a given STM32 in a fixed configuration, the current consumption is in the same range regardless of the model. It might, however, vary depending on the complexity and topology of the model. The following table provides the average current consumption of the models listed in the table above (excluding the Anomaly Detection model, which has a specific topology). These data can be used as a first estimate of the current and energy consumption of a new model from just the measurement of its inference time: given an average inference time of t seconds and an average current of i amperes at an input voltage of u volts, the average energy is computed as (t x i x u) in joules.

STM32 Board	STM32H735 550 MHz SMPS	STM32H747 400 MHz SMPS	STM32H7A3 280 MHz SMPS
Average current (mA)	98	63	43
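As an illustration of the (t x i x u) estimate, here is a minimal Python sketch. The function name and the 50 ms inference time are hypothetical and chosen only for the example; the 98 mA average current and the 3.3 V supply come from the tables above.

```python
def estimate_energy_mj(t_ms: float, i_ma: float, u_v: float) -> float:
    """Estimate the energy of one inference in millijoules.

    Applies E = t x i x u with t in seconds, i in amperes and u in volts.
    """
    t_s = t_ms / 1000.0      # milliseconds -> seconds
    i_a = i_ma / 1000.0      # milliamperes -> amperes
    energy_j = t_s * i_a * u_v
    return energy_j * 1000.0  # joules -> millijoules


# Hypothetical new model with a measured inference time of 50 ms
# on an STM32H735 at 550 MHz (SMPS), using the 98 mA average current.
energy = estimate_energy_mj(t_ms=50.0, i_ma=98.0, u_v=3.3)
print(f"Estimated energy: {energy:.2f} mJ")
```

Applied to a measured row such as FDMobileNet 0.25 on the STM32H735 (38.5 ms, 101 mA, 3.3 V), the same function gives a value close to the 12.8 mJ reported in the table, which is a quick way to sanity-check the units.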

STM32Cube.AI (X-CUBE-AI) can also generate a TensorFlow™ Lite for Microcontrollers (TFLm) runtime implementation (based on TensorFlow™ version 2.10, sha-1 = 79f6de, for STM32Cube.AI v8.1.0). The following table compares the TFLm runtime to the X-CUBE-AI runtime. The Flash and RAM footprints include the code / runtime footprint on top of the weights and activation buffer.

STM32 Board	STM32 characteristics	Model Source/Link	Runtime	Flash (Kbyte)	RAM (Kbyte)	Proc Time (ms)	Version
STM32H7A3 SMPS NUCLEO-H7A3ZI-Q	Flash 2 Mbytes RAM 1.4 Mbyte (1.18) Freq 280 MHz	Image Classif MLPerf™Tiny [41]	X-CUBE-AI	127 Kbytes	49 Kbytes	43 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
		Image Classif MLPerf™Tiny [41]	TFLm	160 Kbytes	55 Kbytes	98 ms	TFLm sha-1 = 79f6de STM32CubeIDE 1.12.1
		Visual Wake Word MLPerf™Tiny [42]	X-CUBE-AI	96 Kbytes	56 Kbytes	29.3 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
		Visual Wake Word MLPerf™Tiny [42]	TFLm	392 Kbytes	101 Kbytes	67 ms	TFLm sha-1 = 79f6de STM32CubeIDE 1.12.1

1.3. STM32 Ultra Low Power MCUs

STM32 Ultra Low Power MCUs at 1.8 V: inference time, memory footprint, and energy reported in microjoules (uJ):

STM32 Board	STM32 characteristics	Model Source/Link	Flash total (Kbyte)	RAM total (Kbyte)	Proc Time (ms)	Cur. (mA)	Energy (uJ) 1.8 V	Version
STM32U575 SMPS NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	FDMobileNet 0.25 224x224x3 quant tfl [43]	186 Kbytes	166 Kbytes	188 ms	13.8 mA	4670 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32U575 SMPS NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	MobileNet v2 128x128x3 quant tfl [44]	516 Kbytes	255 Kbytes	345 ms	14.2 mA	8818 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32U575 SMPS NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	Object Detector SSD MobileNet v1 0.25 192x192x3 [45]	534 Kbytes	270 Kbytes	549 ms	14.1 mA	13934 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32U575 SMPS NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	Yamnet 256 quant tfl [46]	190 Kbytes	117 Kbytes	282 ms	14.1 mA	7157 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32U575 SMPS NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	CNN2D_ST_HandPosture VL53L8CX 8 postures float keras [47]	24.4 Kbytes	3.2 Kbytes	0.67 ms	14.2 mA	17 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32U575 SMPS NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	HAR IGN float keras [48]	25 Kbytes	4 Kbytes	2.25 ms	12.4 mA	50 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32U575 SMPS NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	Anomaly Detection MLPerf™Tiny quant tfl [49]	278 Kbytes	6.39 Kbytes	3.47 ms	13.6 mA	85 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32U575 SMPS NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	Key Word Spotting MLPerf™Tiny quant tfl [50]	67 Kbytes	24 Kbytes	44.9 ms	13.7 mA	1107 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32U575 SMPS NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	Image Classif MLPerf™Tiny quant tfl [51]	129 Kbytes	49 Kbytes	116 ms	14 mA	2923 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32U575 SMPS NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	Visual Wake Word MLPerf™Tiny quant tfl [52]	98 Kbytes	56 Kbytes	74 ms	14.1 mA	1878 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32L4R5 LDO NUCLEO-L4R5ZI	Flash 2 Mbytes RAM 640 Kbytes Freq 120 MHz	FDMobileNet 0.25 224x224x3 quant tfl [53]	186 Kbytes	166 Kbytes	291 ms	23.2 mA	12152 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32L4R5 LDO NUCLEO-L4R5ZI	Flash 2 Mbytes RAM 640 Kbytes Freq 120 MHz	MobileNet v2 128x128x3 quant tfl [54]	516 Kbytes	255 Kbytes	545 ms	23.2 mA	22759 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32L4R5 LDO NUCLEO-L4R5ZI	Flash 2 Mbytes RAM 640 Kbytes Freq 120 MHz	Object Detector SSD MobileNet v1 0.25 192x192x3 [55]	534 Kbytes	270 Kbytes	851 ms	23.3 mA	35691 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32L4R5 LDO NUCLEO-L4R5ZI	Flash 2 Mbytes RAM 640 Kbytes Freq 120 MHz	Yamnet 256 quant tfl [56]	190 Kbytes	117 Kbytes	426 ms	23.6 mA	18096 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32L4R5 LDO NUCLEO-L4R5ZI	Flash 2 Mbytes RAM 640 Kbytes Freq 120 MHz	CNN2D_ST_HandPosture VL53L8CX 8 postures [57]	24.5 Kbytes	3.2 Kbytes	1.14 ms	23.1 mA	47 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32L4R5 LDO NUCLEO-L4R5ZI	Flash 2 Mbytes RAM 640 Kbytes Freq 120 MHz	HAR IGN float [58]	25 Kbytes	4 Kbytes	3.5 ms	21.9 mA	137 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32L4R5 LDO NUCLEO-L4R5ZI	Flash 2 Mbytes RAM 640 Kbytes Freq 120 MHz	Anomaly Detection MLPerf™Tiny quant tfl [59]	278 Kbytes	6.39 Kbytes	5.33 ms	21.8 mA	209 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32L4R5 LDO NUCLEO-L4R5ZI	Flash 2 Mbytes RAM 640 Kbytes Freq 120 MHz	Key Word Spotting MLPerf™Tiny quant tfl [60]	67 Kbytes	24 Kbytes	71.2 ms	22.9 mA	2937 uJ	STM32Cube.AI 8.1.0 STM32CubeIDE 1.12.1
STM32L4R5 LDO NUCLEO-L4R5ZI	Flash 2 Mbytes RAM 640 Kbytes Freq 120 MHz	Image Classif MLPerf™Tiny quant tfl [61]	129 Kbytes	49 Kbytes	179 ms	23.3 mA	7507 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32L4R5 LDO NUCLEO-L4R5ZI	Flash 2 Mbytes RAM 640 Kbytes Freq 120 MHz	Visual Wake Word MLPerf™Tiny quant tfl [62]	98 Kbytes	56 Kbytes	119.5 ms	23.3 mA	5012 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32G474 LDO NUCLEO-G474REI	Flash 512 Kbytes RAM 128 Kbytes Freq 170 MHz	Anomaly Detection MLPerf™Tiny quant tfl [63]	278 Kbytes	6.39 Kbytes	4.06 ms	32 mA	429 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32G474 LDO NUCLEO-G474REI	Flash 512 Kbytes RAM 128 Kbytes Freq 170 MHz	Key Word Spotting MLPerf™Tiny quant tfl [64]	67 Kbytes	24 Kbytes	65.6 ms	35 mA	7578 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32G474 LDO NUCLEO-G474REI	Flash 512 Kbytes RAM 128 Kbytes Freq 170 MHz	Image Classif MLPerf™Tiny quant tfl [65]	129 Kbytes	49 Kbytes	126.8 ms	35 mA	14645 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32G474 LDO NUCLEO-G474REI	Flash 512 Kbytes RAM 128 Kbytes Freq 170 MHz	Visual Wake Word MLPerf™Tiny quant tfl [66]	98 Kbytes	56 Kbytes	84.88 ms	34 mA	9524 uJ	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0

The following table provides the average current consumption of the models listed in the table above (excluding the Anomaly Detection model, which has a specific topology). These data can be used as a first estimate of the current and energy consumption of a new model from just the measurement of its inference time: given an average inference time of t seconds and an average current of i amperes at an input voltage of u volts, the average energy is computed as (t x i x u) in joules.

STM32 Board	STM32U575 160 MHz SMPS	STM32L4R5 120 MHz LDO Single Bank	STM32G474 170 MHz LDO
Average current (mA)	14	23	34

STM32Cube.AI (X-CUBE-AI) can also generate a TensorFlow™ Lite for Microcontrollers (TFLm) runtime implementation (based on TensorFlow™ version 2.10, sha-1 = 79f6de, for STM32Cube.AI v8.1.0). The following table compares the TFLm runtime to the X-CUBE-AI runtime. The Flash and RAM footprints include the code / runtime footprint on top of the weights and activation buffer.

STM32 Board	STM32 characteristics	Model Source/Link	Runtime	Flash (Kbyte)	RAM (Kbyte)	Proc Time (ms)	Version
STM32U575 SMPS NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	Image Classif MLPerf™Tiny [67]	X-CUBE-AI	129 Kbytes	49 Kbytes	116 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
		Image Classif MLPerf™Tiny [67]	TFLm	161 Kbytes	55 Kbytes	251 ms	TFLm sha-1 = 79f6de STM32CubeIDE 1.12.1
		Visual Wake Word MLPerf™Tiny [68]	X-CUBE-AI	96 Kbytes	56 Kbytes	74 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
		Visual Wake Word MLPerf™Tiny [68]	TFLm	393 Kbytes	101 Kbytes	176 ms	TFLm sha-1 = 79f6de STM32CubeIDE 1.12.1

2. SMPS vs LDO

Inference time, memory footprint and energy for SMPS and LDO power configurations at 3.3 V:

STM32 Board	STM32 characteristics	Model Source/Link	PWR config	Cur. (mA)	Energy (uJ)	Proc Time (ms)	Version
STM32U575 NUCLEO-U575ZI-Q	Flash 2 Mbytes RAM 786 Kbytes Freq 160 MHz	Image Classif MLPerf™Tiny [69]	SMPS	8.7	3330	116 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
		Image Classif MLPerf™Tiny [69]	LDO	18.1	6929	116 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
		Visual Wake Word MLPerf™Tiny [70]	SMPS	8.8	2149	74 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
		Visual Wake Word MLPerf™Tiny [70]	LDO	18.3	4469	74 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
STM32L4R5ZI-P NUCLEO-L4R5ZI-P	Flash 2 Mbytes RAM 640 Kbytes Freq 120 MHz	Image Classif MLPerf™Tiny [71]	SMPS (external)	10.2	6025	179 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
		Image Classif MLPerf™Tiny [71]	LDO	22.8	13468	179 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
		Visual Wake Word MLPerf™Tiny [72]	SMPS (external)	10.4	4101	119.5 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
		Visual Wake Word MLPerf™Tiny [72]	LDO	23.3	9188	119.5 ms	STM32Cube.AI 9.1.0 STM32CubeIDE 1.16.0
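To make the comparison easier to read, the energy saving obtained with the SMPS can be expressed as a percentage of the LDO energy. The following Python sketch does this with the energy values copied from the table above; the dictionary layout and the function name are illustrative choices, not part of any ST tool.

```python
# (SMPS energy, LDO energy) in uJ, copied from the table above.
measurements = {
    "STM32U575 Image Classif": (3330, 6929),
    "STM32U575 Visual Wake Word": (2149, 4469),
    "STM32L4R5 Image Classif": (6025, 13468),
    "STM32L4R5 Visual Wake Word": (4101, 9188),
}


def smps_saving_percent(smps_uj: float, ldo_uj: float) -> float:
    """Return the energy saved with SMPS relative to LDO, in percent."""
    return 100.0 * (1.0 - smps_uj / ldo_uj)


for name, (smps_uj, ldo_uj) in measurements.items():
    saving = smps_saving_percent(smps_uj, ldo_uj)
    print(f"{name}: {saving:.1f} % less energy with SMPS")
```

Since the inference time is identical in both power configurations, the same ratio applies to the average current.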

3. Measurement process for the STM32N6x7

Only the machine learning model inference processing is reported in these performance figures. In a complete application, the sensor acquisition, data conditioning, and pre-processing must also be considered.

The Neural-ART compiler options used are -O3 with the epoch controller enabled: --all-buffers-info --no-hw-sw-parallelism --cache-maintenance --enable-virtual-mem-pools --native-float --optimization 3 --Os --Omax-ca-pipe 4 --Ocache-opt --enable-epoch-controller. For the Yamnet ONNX model, the stedgeai options --input-data-type int8 --output-data-type float32 --inputs-ch-position chfirst --outputs-ch-position chlast were used to optimize the input / output format for the end application. The C toolchain is GCC.

The column Model Source/Links indicates the pre-trained ML model and the source, either how it was built / trained or where it can be downloaded.

The column Flash Weights (Mbyte) reports the Flash occupancy of the model weights, which are stored in the external Flash.

The column RAM Activations (Mbyte) reports the RAM occupancy of the model activations as well as the input and output buffers. They are stored in the internal RAM. Note that the input buffer is not counted separately, as it can generally be integrated within the activation buffer using the "Use activation buffer for input buffer" and "Use activation buffer for the output buffer" options.

The column Proc Time (ms) reports the model inference processing time.

The column inf/s reports the number of model inferences per second.

The column Total energy (mJ) is the total energy spent during an inference of the model on the STM32N6 discovery kit including the external Flash and RAM memories.

The column Total average Power (mW) is the average power consumption during an inference of the model on the STM32N6 discovery kit including the external Flash and RAM memories.
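The relationship between these columns can be sketched in a few lines of Python. This is a rough cross-check rather than the measurement method itself: it assumes that inferences run back to back, so that inf/s is the inverse of the processing time, and that the average power is the total energy divided by the processing time. The function names are hypothetical.

```python
def inferences_per_second(proc_time_ms: float) -> float:
    """Number of inferences per second, assuming back-to-back inferences."""
    return 1000.0 / proc_time_ms


def average_power_mw(energy_mj: float, proc_time_ms: float) -> float:
    """Average power over one inference.

    mJ divided by ms gives watts, so the result is scaled to milliwatts.
    """
    return energy_mj / proc_time_ms * 1000.0


# Yolov8n 256x256x3 in nominal mode: 35.6 ms and 8 mJ per inference.
rate = inferences_per_second(35.6)
power = average_power_mw(8.0, 35.6)
print(f"{rate:.1f} inf/s, {power:.0f} mW")
```

For this row the results land close to the 28 inf/s and 225 mW reported in the nominal mode table; small differences are expected because the published values are rounded.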

The columns V_DDCORE are the energy and power consumption spent through the V_DDCORE power supply including V_DDCSI, set to 0.81 V for nominal mode and 0.89 V for performance mode.

The columns V_DDA1V8 are the energy and power consumption spent through the V_DDA1V8 1.8 V power supplies: V_DDA18PLL1, V_DDA18CSI, V_IO2 (Hexa SPI IOs), V_DDIO3 (Octo SPI IOs).

The columns External memories are the energy and power consumption spent by the external memories Flash and RAM used on the STM32N6 discovery board. Please refer to the STM32N657 discovery kit user manual for details on the external memories reference.

The application package x-cube-n6-ai-power-measurement provides the means to measure V_DDCORE on the STM32N6 discovery kit. Details are provided in the Readme.md of the package.

4. Measurement process for other STM32

Only the machine learning model inference processing is reported in these performance figures. In a complete application, the sensor acquisition, data conditioning, and pre-processing must also be considered.

To give developers more control over their applications, ST introduced a new setting in STM32Cube.AI v8.1.0 to define priorities. If users choose the “Time” setting, the algorithm will take more RAM but have faster inference times. On the other hand, choosing “RAM” will have the smallest memory footprint and the slowest times. Finally, the default “Balanced” parameter finds the middle ground between the two approaches, providing a good compromise. For these measurements, the setting "balanced" is used.

The STM32 Board column indicates the STM32 reference and the board used for measurement. By default, the STM32 is configured in maximum performance configuration, so with maximum frequency and especially HCLK / AXI clock at maximal frequency. When a different setting is used it is specified (for instance lower frequency to use a different Voltage Scale or for STM32H7, lower HCLK/AXI frequency). Many STM32 embed a powerful switched-mode power supply (SMPS) that can be used to improve power efficiency when the supply voltage is high enough. When used instead of the integrated low-dropout regulator (LDO), power consumption is optimized by a factor equal to the ratio of the internal VCORE supply voltage to the VDD voltage. The improvement due to the SMPS depends only upon the SMPS efficiency and the VDD voltage. When SMPS is indicated it means that the internal voltage regulator used is the SMPS step-down converter instead of the LDO.

The STM32 characteristics column provides the available internal Flash size, the full internal RAM size, and the frequency. The RAM size includes the different kinds of memories and banks (TCM, SRAM, and so on). For the time being, the buffers used by X-CUBE-AI must be placed in a contiguous memory area; the maximum RAM size available in a contiguous area is given between "()" if it is not equal to the full size. The frequency indicated is the operating frequency used for the test, which is generally the maximum frequency. The only exception is the STM32H747 Discovery kit (STM32H747I-DISCO), which operates by default in SMPS power mode and is therefore limited to 400 MHz instead of 480 MHz. Data are rounded to 3 decimals.

The column Model Source/Links indicates the pre-trained ML model and the source, either how it was built / trained or where it can be downloaded. tfl stands for TensorFlow™ Lite .tflite model, h5 stands for Keras .h5 model, and quant stands for models quantized on 8 bits.

The memory footprints are the ones reported by X-CUBE-AI using the "Analyze" function (the version of X-CUBE-AI used is mentioned in the table).
The column Flash reports the Flash occupancy including the model weights, the runtime code generated by X-CUBE-AI to run the neural network and its constants (including the initialized tables).
The column RAM reports the occupancy of the RAM buffers used to store the model activations as well as the input and output buffers, plus the RAM required by the runtime to run inference on the model. Note that to save RAM space the "Use activation buffer for input buffer" and "Use activation buffer for the output buffer" options are selected (through the X-CUBE-AI Advanced Settings panel).
For X-CUBE-AI runtime, the total Flash and RAM memory footprints are reported after an "Analyze" operation on the main panel by the fields Used Flash and Used RAM. The compiler used is gcc embedded in STM32CubeIDE.
For TensorFlow™ Lite for microcontroller runtime, the Flash and RAM memory footprints related to the runtime/code execution are computed from the memory map of the validation project of the given model built with STM32CubeIDE. The runtime/code part is computed taking into account all the modules used by tflite_micro. The STM32CubeIDE build options for TensorFlow™ Lite for microcontroller are the optimal ones (best compromise between speed and code size), -Ofast for GCC compiler and -Osize for G++ Compiler.

The column Proc Time reports the model inference processing time. When the current / energy is indicated, the measurement is obtained through the X-CUBE-AI "System Performance" application following the process described in the wiki article on power measurement. Otherwise the "Validation on target" application is used. In all cases, HSI is selected as the clock source when generating the application: X-CUBE-AI first generates the optimal clock settings, and the clock source is then set to HSI. STM32CubeMX then autonomously reconfigures the clock settings.

Cur. and Energy are the current and energy computed following the process described in the wiki article on power measurement. For STM32 Ultra Low Power microcontrollers, the measurement is done with the X-NUCLEO-LPM01A power shield as described in section 4.3.1 "Measure process when current is below 50 mA". For STM32 High Performance microcontrollers, the measurement is done with the Qoitech Otii Arc power analyzer as described in section 4.3.2 "Measure process when current is above 50 mA". In both cases, a 10 s window is used for averaging and HSI is selected as the clock source.

Accuracy is not reported. X-CUBE-AI does not modify the DL/ML model topology, so the impact on accuracy should be limited. Through the "Validation" application, X-CUBE-AI provides a way to measure the accuracy either on x86 or on the target, which can be used to check any possible impact on accuracy. When running the "Validation on target" application several metrics are computed; one of them is the X-Cross, which provides error metrics between the original model executed in Python™ and the C model executed on the target. Random data can be used to compute the RMSE/MAE/L2R errors; however, it is recommended to use true data to get the final accuracy. For more details on the metrics, refer to the STEdgeAI-Core documentation.

Note that the accuracy check is important when comparing a float model with a quantized model, or when using the Weight compression feature of X-CUBE-AI for float models.

5. STMicroelectronics references

[ST 1] MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
[ST 2] ST Edge AI Core
[ST 3] SLA0104

STM32Cube.AI model performances