How to benchmark your NN model on STM32MPU

Applicable for

STM32MP13x lines, STM32MP15x lines, STM32MP21x lines, STM32MP23x lines, STM32MP25x lines

This article describes how to measure the performance of a Neural Network (NN) model on all STM32MPU platforms using the X-LINUX-AI unified benchmark.

1. Description

The X-LINUX-AI unified benchmark is a common benchmark application that can benchmark NBG (Network Binary Graph), TensorFlow^TM Lite, and ONNX^TM models with a single binary. The aim of this tool is to simplify NN model performance evaluation on STM32MPU platforms.

The model type (NBG, TFLite^TM or ONNX^TM) is abstracted behind a high-level common API. In concrete terms, any supported model type can be benchmarked with the same command. This makes it possible to benchmark a complete directory containing different types of models and to compare them.

The X-LINUX-AI unified benchmark provides several options and useful information, which are detailed below, to easily compare models and determine whether a model is correctly optimized to run on the current target.

2. Installation

2.1. Installing from the OpenSTLinux AI package repository

Warning

The software package is provided AS IS, and by downloading it, you agree to be bound to the terms of the software license agreement (SLA0048). The detailed content licenses can be found here.

After configuring the AI OpenSTLinux package, proceed to the installation of X-LINUX-AI components for this application.

The minimum package required is:

x-linux-ai -i x-linux-ai-benchmark

3. How to use the X-LINUX-AI unified benchmark tool

3.1. Executing with the command line

The x-linux-ai-benchmark tool binary is installed at: /usr/bin/x-linux-ai-benchmark

Because /usr/bin is normally part of the PATH, the tool can be launched from any directory using the following command:

x-linux-ai-benchmark

It accepts the following input parameters:

usage: x-linux-ai-benchmark [-h] (-d MODELS_DIRECTORY | -m MODEL_PATH) [--cpu_cores CPU_CORES] [--minimal_serial]
                            [--export_results]

options:
  -h, --help            show this help message and exit
  -d MODELS_DIRECTORY, --models_directory MODELS_DIRECTORY
                        path to models directory to benchmark
  -m MODEL_PATH, --model_path MODEL_PATH
                        path to the model to benchmark
  --cpu_cores CPU_CORES
                        number of CPU cores used for the benchmark, by default the benchmark automatically detect
                        the maximum of CPU cores available
  --minimal_serial      use this option to display result on a serial terminal
  --export_results      use this option to export benchmark results in a JSON file

The X-LINUX-AI unified benchmark is designed to be as simple as possible. Only one argument is mandatory to run the benchmark, and it must be one of the following two mutually exclusive options:

-m, --model_path: This option specifies the path to the NN model to be tested.
-d, --models_directory: This option benchmarks all the models contained in the same directory. Model types can be mixed in the directory. The unified benchmark parses the files in the directory and skips every file that is not an NN model with a known extension.
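The directory filtering described above can be previewed with a short Python sketch before launching a long run. This is not the tool's own code: the helper name list_benchmarkable_models is hypothetical, the extension list is taken from the "supported extension" message shown in the console logs of this article, and the non-recursive, case-sensitive matching is an assumption.

```python
from pathlib import Path

# Extensions listed as supported in the benchmark console log.
SUPPORTED_EXTENSIONS = {".tflite", ".onnx", ".nb"}


def list_benchmarkable_models(directory):
    """Return the files of `directory` that carry a known model extension.

    Assumption: only the top level of the directory is inspected and the
    extension match is case-sensitive.
    """
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix in SUPPORTED_EXTENSIONS
    )
```

Calling list_benchmarkable_models("/usr/local/x-linux-ai/image-classification/models/mobilenet/") on the target would then list the candidate models, leaving out label files and other non-model files.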

Concerning the execution engine used to run the benchmark, the unified benchmark automatically selects the best possible solution, depending on the board and the model type used:

For STM32MP2 series boards with an AI hardware accelerator, an NBG model is benchmarked on the NPU/GPU. A TFLite or ONNX model is benchmarked twice: first on the NPU/GPU, using the TFLite VX-delegate or the VSINPU execution provider respectively, and then on the CPU.

For STM32 MPU boards without AI hardware accelerator, the benchmark always runs on CPU.

In both cases, the number of CPU cores used is automatically set to the maximum available if the optional argument --cpu_cores is not set. Otherwise, the benchmark uses the specified number of cores.
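The selection rules above can be summarized as a small decision function. The sketch below only restates the behavior described in this article and is not the tool's implementation: the function name planned_runs and the "gpu/npu" and "cpu" labels are illustrative, and returning no run for an NBG model on a board without accelerator is an assumption, since that case is not described here.

```python
import os


def planned_runs(model_path, has_ai_accelerator):
    """Return, in order, the engines on which a model would be benchmarked."""
    extension = os.path.splitext(model_path)[1]
    if extension not in (".nb", ".tflite", ".onnx"):
        return []  # unknown file type: skipped by the benchmark
    if not has_ai_accelerator:
        # Assumption: an NBG file targets the NPU/GPU and has no CPU fallback.
        return [] if extension == ".nb" else ["cpu"]
    if extension == ".nb":
        return ["gpu/npu"]
    # TFLite and ONNX models: accelerated run first, then a CPU run.
    return ["gpu/npu", "cpu"]
```

For example, planned_runs("model.tflite", True) yields two runs, which matches the two rows per TFLite model shown in the result tables of STM32MP2 series boards.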

The benchmark also provides two more convenient options:

--export_results: This option exports the benchmark results to a JSON file named "x-linux-ai-benchmark-results.json". This JSON file is composed of a JSON object named "board_information", which contains all the board configuration information, and one JSON object for each model tested.
--minimal_serial: The benchmark uses some graphic libraries to format outputs. When using serial links, the formatting may not render correctly, so a lighter version is available with this option.
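When the benchmark is driven from a script, the argument rules above can be encoded once. The following Python sketch builds the argument list from the options documented in the usage text. The helper name build_benchmark_command is hypothetical, and the sketch only assembles the command, it does not run it.

```python
def build_benchmark_command(model_path=None, models_directory=None,
                            cpu_cores=None, minimal_serial=False,
                            export_results=False):
    """Build the x-linux-ai-benchmark argument list from documented options."""
    # -m and -d are mutually exclusive and exactly one of them is mandatory.
    if (model_path is None) == (models_directory is None):
        raise ValueError("give exactly one of model_path or models_directory")
    command = ["x-linux-ai-benchmark"]
    if model_path is not None:
        command += ["-m", str(model_path)]
    else:
        command += ["-d", str(models_directory)]
    if cpu_cores is not None:
        command += ["--cpu_cores", str(cpu_cores)]
    if minimal_serial:
        command.append("--minimal_serial")
    if export_results:
        command.append("--export_results")
    return command
```

On the target, the returned list can be passed to subprocess.run(command, check=True) from the Python standard library.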

The benchmark output is organized in tables, whose number depends on the type of model used.
The first table displays the characteristics of the board used for the benchmark.

Some of these characteristics are common to STM32MP1 series and STM32MP2 series boards: the X-LINUX-AI version, the board name, the number of CPU cores available, and the CPU frequency.
Additional characteristics are displayed for STM32MP2 series boards with an AI hardware accelerator: the GPU/NPU driver version and the GPU/NPU frequency.

The second table summarizes the relevant information on the reference models.

Inference time refers to the time taken by a machine learning model to process input data and produce an output prediction. It is reported in milliseconds.
CPU, GPU, NPU % refers to the percentage of the inference executed on each execution engine.
- For STM32MP2 series boards with an AI hardware accelerator, all the execution engines are available.
- For STM32 MPU boards without an AI hardware accelerator, only the CPU is available, which is why "NA" is displayed for GPU and NPU.
Peak RAM refers to the maximum amount of RAM needed on the target to execute an inference of a specific NN model.

On STM32MP2 series boards with an AI hardware accelerator, a "Non-Optimal models" table can also be displayed. As its name suggests, this table lists the models that are not correctly optimized for an STM32MP2x target with an AI hardware accelerator. If your model appears in this list, it means that the model is not quantized, or that it is quantized with an unsupported quantization scheme such as per-channel. In such a case, refer to the article How to deploy your NN model on STM32MPU.

The example below shows an output that contains a "Non-Optimal models" table:

+----------------------------------------------------------------------------+
|                             Non-Optimal models                             |
+-------------------------+--------------------------------------------------+
|        model name       |                     comments                     |
+-------------------------+--------------------------------------------------+
| blazeface_128x128_quant | GPU usage is 87.88% compared to NPU usage 12.12% |
|                         | please verify if the model is quantized or that  |
|                         | the quantization scheme used is the 8-bits per-  |
|                         |                      tensor                      |
+-------------------------+--------------------------------------------------+
+---------------------------------------------------------------------------------------+
|                            TensorFlow Lite models benchmark                           |
+-------------------------+---------------------+-------+-------+-------+---------------+
|        Model Name       | Inference Time (ms) | CPU % | GPU % | NPU % | Peak RAM (MB) |
+-------------------------+---------------------+-------+-------+-------+---------------+
| blazeface_128x128_quant |         8.73        |  0.0  | 87.88 | 12.12 |     103.08    |
| blazeface_128x128_quant |         12.7        | 100.0 |   0   |   0   |     36.96     |
+-------------------------+---------------------+-------+-------+-------+---------------+
Note: Peak RAM information is only APPROXIMATE to the actual memory footprint of the model at runtime.
Take the information at your discretion.
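The comment in the table above compares GPU usage with NPU usage. The exact criterion applied by the tool is not documented in this article, so the Python sketch below assumes the simplest possible rule, namely that an accelerated run is suspicious when the GPU share exceeds the NPU share. The function name and the row format are illustrative.

```python
def non_optimal_models(rows):
    """Return the names of accelerated runs where the GPU does more work than the NPU.

    `rows` is a list of (model_name, cpu_percent, gpu_percent, npu_percent)
    tuples copied from a result table. The "GPU share greater than NPU share"
    rule is an assumption, not the tool's documented criterion.
    """
    return [
        name
        for name, cpu, gpu, npu in rows
        if cpu < 100.0 and gpu > npu
    ]


rows = [
    ("blazeface_128x128_quant", 0.0, 87.88, 12.12),   # accelerated run
    ("blazeface_128x128_quant", 100.0, 0.0, 0.0),     # CPU run
]
print(non_optimal_models(rows))  # prints ['blazeface_128x128_quant']
```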

4. How to benchmark a single model

4.1. On STM32MP2x board with AI hardware accelerator

For this demonstration, the NN model used is yolov8n_256_quant_pt_uf_pose_coco-st.nb, a YOLOv8n model that has been processed and converted to a network binary graph (NBG) to run on the NPU.

The model used in this example can be installed from the following package:

x-linux-ai -i pose-estimation-models-yolov8n

Information

The same demonstration can also be carried out with TFLite^TM or ONNX^TM models.

Use the following command to launch the benchmark on a single model:

x-linux-ai-benchmark -m /usr/local/x-linux-ai/pose-estimation/models/yolov8n_pose/yolov8n_256_quant_pt_uf_pose_coco-st.nb

After running the benchmark, this is the output on the console:

+------------------------------------------------+
|     X-LINUX-AI unified NN model benchmark      |
+----------------------------+-------------------+
|          Machine           |  STM32MP257F-EV1  |
|         CPU cores          |         2         |
|    CPU Clock frequency     |       1.5GHz      |
|  GPU/NPU Driver Version    |       6.4.19      |
|  GPU/NPU Clock frequency   |      800 MHZ      |
|    X-LINUX-AI Version      |       v6.0.0      |
+----------------------------+-------------------+
For hardware accelerated models, computation engine used for benchmark is NPU running at 800 MHZ
For other models, computation engine uses for benchmark is CPU with 2 cores at :  1.5GHz
+----------------------------------------------------------------------------------------------------+
|                                        NBG models benchmark                                        |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
|              Model Name              | Inference Time (ms) | CPU % | GPU % | NPU % | Peak RAM (MB) |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
| yolov8n_256_quant_pt_uf_pose_coco-st |        15.31        |  0.0  | 14.06 | 85.94 |     23.03     |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
Note: Peak RAM information is only APPROXIMATE to the actual memory footprint of the model at runtime.
Take the information at your discretion.

The first table is dedicated to target information, and the second is dedicated to benchmark results.

4.2. On STM32MPU board without AI hardware accelerator

For this demonstration, the NN model used is mobilenet_v1_0.5_128_quant.tflite, downloaded from TensorFlow Hub^[1]. It is a lite model trained for image classification.

The model used in this example can be installed from the following package:

x-linux-ai -i img-models-mobilenetv1-05-128

Information

The same demonstration can also be carried out with ONNX^TM models.

To launch the benchmark on a single model use the following command:

 x-linux-ai-benchmark -m /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite

After running the benchmark, this is the output on the console:

+------------------------------------------------+
|     X-LINUX-AI unified NN model benchmark      |
+--------------------------+---------------------+
|         Machine          |   STM32MP157F-DK2   |
|        CPU cores         |          2          |
|   CPU Clock frequency    |        0.8GHz       |
|   X-LINUX-AI Version     |        v6.0.0       |
|                          |                     |
|                          |                     |
+--------------------------+---------------------+
Computation engine used for benchmark is CPU with 2 cores at  0.8GHz
+--------------------------------------------------------------------------+
|                     TensorFlow Lite models benchmark                     |
+----------------------------+---------------------+-------+---------------+
|         Model Name         | Inference Time (ms) | CPU % | Peak RAM (MB) |
+----------------------------+---------------------+-------+---------------+
| mobilenet_v1_0.5_128_quant |        28.84        | 100.0 |     27.93     |
+----------------------------+---------------------+-------+---------------+
Note: Peak RAM information is only APPROXIMATE to the actual memory footprint of the model at runtime.
Take the information at your discretion.

The first table is dedicated to target information, and the second is dedicated to benchmark results.

5. How to benchmark multiple models

With the X-LINUX-AI unified benchmark, it is possible to benchmark multiple models located in the same directory. This makes it easy to compare the performance of models with different architectures and model types.

5.1. On STM32MP2x board with AI hardware accelerator

This demonstration uses image classification models. The benchmark runs on NBG, TensorFlow^TM Lite, and ONNX^TM models using all the compute engines available on the board.

The model used in this example can be installed from the following package:

x-linux-ai -i img-models-mobilenetv2-10-224

Use the following command to launch the benchmark of multiple models stored in the same directory:

 x-linux-ai-benchmark -d /usr/local/x-linux-ai/image-classification/models/mobilenet/

After running the benchmark, this is the output on the console:

+------------------------------------------------+
|     X-LINUX-AI unified NN model benchmark      |
+----------------------------+-------------------+
|          Machine           |  STM32MP257F-EV1  |
|         CPU cores          |         2         |
|    CPU Clock frequency     |       1.5GHz      |
|  GPU/NPU Driver Version    |       6.4.19      |
|  GPU/NPU Clock frequency   |      800 MHZ      |
|    X-LINUX-AI Version      |       v6.0.0      |
|                            |                   |
|                            |                   |
+----------------------------+-------------------+
For hardware accelerated models, computation engine used for benchmark is NPU running at 800 MHZ
For other models, computation engine uses for benchmark is CPU with 2 cores at :  1.5GHz
model extension : .txt not supported, model skipped => supported extension are : .tflite, .onnx, .nb 
model extension : .txt not supported, model skipped => supported extension are : .tflite, .onnx, .nb 
Benchmark of the mobilenet_v2_1.0_224_int8_per_tensor.onnx failed model not supported with VSINPU execution provider, retrying on CPU...
model extension :  not supported, model skipped => supported extension are : .tflite, .onnx, .nb 
+----------------------------------------------------------------------------------------------------+
|                                        NBG models benchmark                                        |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
|              Model Name              | Inference Time (ms) | CPU % | GPU % | NPU % | Peak RAM (MB) |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
| mobilenet_v2_1.0_224_int8_per_tensor |        12.01        |  0.0  |  6.45 | 93.55 |      39.3     |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
+----------------------------------------------------------------------------------------------------+
|                                  TensorFlow Lite models benchmark                                  |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
|              Model Name              | Inference Time (ms) | CPU % | GPU % | NPU % | Peak RAM (MB) |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
|      mobilenet_v1_0.5_128_quant      |         2.06        |  0.0  |  3.54 | 96.46 |     57.33     |
|      mobilenet_v1_0.5_128_quant      |        13.12        | 100.0 |   0   |   0   |     37.48     |
| mobilenet_v2_1.0_224_int8_per_tensor |        13.12        |  0.0  | 29.55 | 70.45 |     118.99    |
| mobilenet_v2_1.0_224_int8_per_tensor |        170.74       | 100.0 |   0   |   0   |     38.73     |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
+----------------------------------------------------------------------------------------------------+
|                                       ONNX models benchmark                                        |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
|              Model Name              | Inference Time (ms) | CPU % | GPU % | NPU % | Peak RAM (MB) |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
|      mobilenet_v1_0.5_128_quant      |         2.16        |  0.0  |  5.52 | 94.48 |     78.18     |
|      mobilenet_v1_0.5_128_quant      |        17.93        | 100.0 |   0   |   0   |     32.77     |
| mobilenet_v2_1.0_224_int8_per_tensor |        180.57       | 100.0 |   0   |   0   |     55.14     |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
Note: Peak RAM information is only APPROXIMATE to the actual memory footprint of the model at runtime.
Take the information at your discretion.

Benchmark results on multiple models are grouped in different tables, depending on the model type: one table each for NBG, TensorFlow^TM Lite, and ONNX^TM models. As mentioned earlier in this article, a "Non-Optimal models" table is also displayed when a model is not quantized or is quantized per-channel. For further information on these specific points, refer to the article How to deploy your NN model on STM32MPU.
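Because accelerated and CPU runs of the same model appear side by side, the gain brought by the NPU/GPU can be estimated as the ratio of the two inference times. The Python sketch below applies this ratio to values copied from the tables above. The helper name speedup is illustrative, and the figures are only valid for this particular run.

```python
def speedup(cpu_ms, accelerated_ms):
    """Return how many times faster the accelerated run is than the CPU run."""
    return cpu_ms / accelerated_ms


# (model, type): (accelerated inference time, CPU inference time), in ms
results = {
    ("mobilenet_v1_0.5_128_quant", "tflite"): (2.06, 13.12),
    ("mobilenet_v2_1.0_224_int8_per_tensor", "tflite"): (13.12, 170.74),
    ("mobilenet_v1_0.5_128_quant", "onnx"): (2.16, 17.93),
}

for (name, model_type), (accelerated_ms, cpu_ms) in results.items():
    print(f"{name} ({model_type}): {speedup(cpu_ms, accelerated_ms):.2f}x")
# mobilenet_v1_0.5_128_quant (tflite): 6.37x
# mobilenet_v2_1.0_224_int8_per_tensor (tflite): 13.01x
# mobilenet_v1_0.5_128_quant (onnx): 8.30x
```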

Information

Files that are not NN models and that are present in the benchmarked directory are skipped, with a log in the console

5.2. On STM32MPU board without AI hardware accelerator

This demonstration uses image classification models. The benchmark runs on TensorFlow^TM Lite and ONNX^TM models.

The models used in this example can be installed from the following package:

x-linux-ai -i img-models-mobilenetv1-05-128

Use the following command to launch the benchmark of multiple models stored in the same directory:

 x-linux-ai-benchmark -d /usr/local/x-linux-ai/image-classification/models/mobilenet/

After running the benchmark, this is the output on the console:

+------------------------------------------------+
|     X-LINUX-AI unified NN model benchmark      |
+--------------------------+---------------------+
|         Machine          |   STM32MP157F-DK2   |
|        CPU cores         |          2          |
|   CPU Clock frequency    |        0.8GHz       |
|   X-LINUX-AI Version     |        v6.0.0       |
+--------------------------+---------------------+
Computation engine use for benchmark : CPU with 2 cores at :  0.8GHz

+--------------------------------------------------------------------------+
|                     TensorFlow Lite models benchmark                     |
+----------------------------+---------------------+-------+---------------+
|         Model Name         | Inference Time (ms) | CPU % | Peak RAM (MB) |
+----------------------------+---------------------+-------+---------------+
| mobilenet_v1_0.5_128_quant |        28.54        | 100.0 |     27.31     |
+----------------------------+---------------------+-------+---------------+
+--------------------------------------------------------------------------+
|                          ONNX models benchmark                           |
+----------------------------+---------------------+-------+---------------+
|         Model Name         | Inference Time (ms) | CPU % | Peak RAM (MB) |
+----------------------------+---------------------+-------+---------------+
| mobilenet_v1_0.5_128_quant |        61.08        | 100.0 |     17.26     |
+----------------------------+---------------------+-------+---------------+

Benchmark results on multiple models are classified in different tables, depending on the model type.

One table is dedicated to TensorFlow^TM Lite models and one to ONNX^TM models.

Information

Files that are not NN models and that are present in the benchmarked directory are skipped, with a log in the console.

6. How to export benchmark results

To export the benchmark results, add the optional argument --export_results. At the end of the benchmark, a JSON file named x-linux-ai-benchmark-result.json is generated in the directory from which the benchmark was launched.

The JSON result file is built around different structures:

One dedicated to the board information:

    "board_information": {
        "name": "STM32MP257",
        "nb_cpu_core": 2,
        "cpu clock": 1500000000.0,
        "gpu version": "6.4.19",
        "gpu clock": 800000000
    },

One structure per model tested:

    "mobilenet_v2_1.0_224_int8_per_tensor_nbg": {
        "nn_name": "mobilenet_v2_1.0_224_int8_per_tensor",
        "model_type": "nbg",
        "execution_engine": "gpu/npu",
        "cpu_core_used": "2",
        "inference_time": 11.74,
        "cpu_usage": 0.0,
        "gpu_usage": 6.81,
        "gpu_layer_list": [
            "DepthwiseConvLayer",
            "Softmax2Layer"
        ],
        "npu_usage": 93.19,
        "npu_layer_list": [
            "TensorTranspose",
            "ConvolutionReluPoolingLayer2",
            "FullyConnectedReluLayer",
            "TensorCopy"
        ],
        "ram_usage": "NA",
        "macc_usage": "NA"
    },
    "mobilenet_v2_1.0_224_int8_per_tensor_onnx": {
        "nn_name": "mobilenet_v2_1.0_224_int8_per_tensor",
        "model_type": "onnx",
        "execution_engine": "cpu",
        "cpu_core_used": "2",
        "inference_time": 177.94,
        "cpu_usage": 100.0,
        "gpu_usage": "NA",
        "gpu_layer_list": [
            "NA"
        ],
        "npu_usage": "NA",
        "npu_layer_list": [
            "NA"
        ],
        "ram_usage": "44228608",
        "macc_usage": "NA"
    },
    "mobilenet_v2_1.0_224_int8_per_tensor_tflite": {
        "nn_name": "mobilenet_v2_1.0_224_int8_per_tensor",
        "model_type": "tflite",
        "execution_engine": "cpu",
        "cpu_core_used": "2",
        "inference_time": 119.77,
        "cpu_usage": 100.0,
        "gpu_usage": "NA",
        "gpu_layer_list": [
            "NA"
        ],
        "npu_usage": "NA",
        "npu_layer_list": [
            "NA"
        ],
        "ram_usage": 37902300,
        "macc_usage": "NA"
    }

If multiple models are tested, each model has a dedicated structure containing its benchmark results. The information listed in each structure may vary depending on the model type and the target used.
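The exported file can then be post-processed on a host machine. The Python sketch below is one possible way to do it, assuming that the file is a single top-level JSON object whose keys are "board_information" plus one entry per tested model, as the fragments above suggest. The function names are illustrative, and the path of the result file is supplied by the caller.

```python
import json


def load_results(path):
    """Load an exported benchmark result file."""
    with open(path, encoding="utf-8") as result_file:
        return json.load(result_file)


def as_number(value):
    """Convert a field to float, or None when it is "NA" or not numeric.

    Some fields appear either as numbers or as strings in the export.
    """
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def summarize(results):
    """Return (name, type, engine, inference time in ms) rows, fastest first."""
    rows = [
        (entry["nn_name"], entry["model_type"],
         entry["execution_engine"], as_number(entry["inference_time"]))
        for key, entry in results.items()
        if key != "board_information"
    ]
    return sorted(rows, key=lambda row: row[3])
```

A typical use is summarize(load_results(path)), followed by a loop that prints each row or writes it to a CSV file for comparison across boards.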

7. Going further

The X-LINUX-AI unified benchmark is built on top of the NBG, TensorFlow^TM Lite, and ONNX^TM benchmark utilities available in the X-LINUX-AI expansion package. To keep things simple, the unified benchmark does not expose all the options provided by those utilities.

To go further on a specific benchmark, refer to the following articles:

For NBG benchmark: How to measure the performance of NBG-based models
For TFLite^TM benchmark: How to measure performance of your NN models using TensorFlow^TM Lite runtime
For ONNX^TM benchmark: How to measure performance of your models using ONNX Runtime

[1] TensorFlow Hub