This article describes how to measure the performance of a Neural Network (NN) model on all STM32MPU platforms using the X-LINUX-AI unified benchmark.
1. Description
The X-LINUX-AI unified benchmark is a common benchmark application that can benchmark NBG (Network Binary Graph), TensorFlow™ Lite, and ONNX™ models with a single binary. The aim of this tool is to simplify NN model performance evaluation on STM32MPU platforms.
The model type (NBG, TFLite, or ONNX™) is abstracted behind a high-level common API. In concrete terms, any supported model type can be benchmarked with the same command. This makes it possible to benchmark a complete directory containing different types of models and compare them.
The X-LINUX-AI unified benchmark provides several options and useful information, which are detailed below, to easily compare models and determine whether a model is correctly optimized to run on the current target.
2. Installation
2.1. Installing from the OpenSTLinux AI package repository
After configuring the AI OpenSTLinux package, install the X-LINUX-AI components for this application.
The minimum package required is:
x-linux-ai -i x-linux-ai-benchmark
3. How to use the X-LINUX-AI unified benchmark tool
3.1. Executing with the command line
The x-linux-ai-benchmark tool binary is located in the userfs partition: /usr/bin/x-linux-ai-benchmark
It can therefore be accessed from anywhere in the file system using the following command:
x-linux-ai-benchmark
It accepts the following input parameters:
usage: x-linux-ai-benchmark [-h] (-d MODELS_DIRECTORY | -m MODEL_PATH)
                            [--cpu_cores CPU_CORES] [--minimal_serial]
                            [--export_results]

options:
  -h, --help            show this help message and exit
  -d MODELS_DIRECTORY, --models_directory MODELS_DIRECTORY
                        path to models directory to benchmark
  -m MODEL_PATH, --model_path MODEL_PATH
                        path to the model to benchmark
  --cpu_cores CPU_CORES
                        number of CPU cores used for the benchmark, by default
                        the benchmark automatically detect the maximum of CPU
                        cores available
  --minimal_serial      use this option to display result on a serial terminal
  --export_results      use this option to export benchmark results in a JSON
                        file

The X-LINUX-AI unified benchmark is designed to be as simple as possible. Only one option is mandatory to run the benchmark, and it must be chosen from the following two mutually exclusive arguments:
- -m, --model_path: This option is used to specify the path to the NN model to be tested.
- -d, --models_directory: This option is used to benchmark several models contained in the same directory. Model types can be mixed in the directory. The unified benchmark parses the files in the directory and skips every file that is not an NN model with a known extension.
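The option layout described above can be sketched with Python's standard argparse module. This is only an illustration of the documented interface, not the tool's source code: the function name build_parser is hypothetical, and the choice of an integer type for --cpu_cores is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="x-linux-ai-benchmark")

    # Exactly one of -d / -m must be given, as stated in the usage text.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("-d", "--models_directory",
                        help="path to models directory to benchmark")
    target.add_argument("-m", "--model_path",
                        help="path to the model to benchmark")

    # Optional arguments; the int type for --cpu_cores is an assumption.
    parser.add_argument("--cpu_cores", type=int, default=None)
    parser.add_argument("--minimal_serial", action="store_true")
    parser.add_argument("--export_results", action="store_true")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["-m", "model.tflite", "--export_results"])
print(args.model_path, args.models_directory, args.export_results)
```

With such a layout, passing both -m and -d, or neither, makes the parser exit with a usage error, which matches the "exactly one of the two" rule.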
The unified benchmark automatically selects the execution engine used to run the benchmark, depending on the board and the model type:
- On STM32MP2 series boards, if the model is an NBG, the benchmark runs on the NPU/GPU; otherwise it runs on the CPU.
- On STM32MP1 series boards, the benchmark always runs on the CPU.
In both cases, the number of CPU cores used is automatically set to the maximum if the optional argument --cpu_cores is not set. Otherwise, the benchmark uses the specified cores value.
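The selection rules above can be summarized in a short sketch. The function names, the board series labels ("STM32MP1", "STM32MP2"), and the model type strings are hypothetical; the actual tool may detect the board and the model type differently.

```python
import os
from typing import Optional


def select_engine(board_series: str, model_type: str) -> str:
    """Return the engine described in the text for a board series and model type."""
    # Only NBG models on STM32MP2 series boards run on the NPU/GPU.
    if board_series == "STM32MP2" and model_type == "nbg":
        return "gpu/npu"
    # Every other combination runs on the CPU.
    return "cpu"


def resolve_cpu_cores(requested: Optional[int] = None) -> int:
    """Use the requested core count, else fall back to all detected cores."""
    if requested is not None:
        return requested
    # os.cpu_count() can return None, so fall back to a single core.
    return os.cpu_count() or 1


print(select_engine("STM32MP2", "nbg"))     # gpu/npu
print(select_engine("STM32MP2", "tflite"))  # cpu
print(select_engine("STM32MP1", "nbg"))     # cpu
```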
The benchmark also provides two more convenient options:
- --export_results: This option can be used to export the benchmark results to a JSON file named "x-linux-ai-benchmark-results.json". This JSON file is composed of a JSON object named "board_information", which contains all the board configuration information, and one JSON object for each model tested.
- --minimal_serial: The benchmark uses some graphic libraries to format outputs. When using serial links, the formatting may not render correctly, so a lighter version is available with this option.
The benchmark output consists of several tables, which vary with the type of model used.
The first table displays the characteristics of the board used for the benchmark.
- Some of these characteristics are common to STM32MP1 series and STM32MP2 series boards: the X-LINUX-AI version, the board name, the number of CPU cores available, and the CPU frequency.
- More categories are available specifically for STM32MP2 series boards: the GPU/NPU driver version and the GPU/NPU frequency.
The second table summarizes the relevant information on the reference models.
- Inference time refers to the amount of time it takes for a machine learning model to process input data and produce an output prediction. It is reported in milliseconds.
- CPU %, GPU %, NPU %, and CORAL_TPU % refer to the share of the inference executed on each execution engine.
- Peak RAM refers to the maximum amount of RAM needed on the target to execute an inference of a specific NN model.
On STM32MP2 series boards, a non-optimal models table can additionally be displayed. As its name suggests, this table lists the models that are not correctly optimized for the STM32MP2x target. If your model appears in this list, it is either not quantized or quantized with an unsupported quantization scheme such as per-channel. In that case, refer to the article How to deploy your NN model on STM32MPU.
The example below contains a non-optimal models table:
+--------------------------------------------------------------------------------------------+
|                                    NBG models benchmark                                    |
+------------------------------+---------------------+-------+-------+-------+---------------+
|          Model Name          | Inference Time (ms) | CPU % | GPU % | NPU % | Peak RAM (MB) |
+------------------------------+---------------------+-------+-------+-------+---------------+
| movenet_singlepose_lightning |        65.23        |  0.0  | 93.76 |  6.24 |       NA      |
+------------------------------+---------------------+-------+-------+-------+---------------+
+--------------------------------------------------------------------------------+
|                               Non-Optimal models                               |
+------------------------------+-------------------------------------------------+
|          model name          |                     comments                    |
+------------------------------+-------------------------------------------------+
| movenet_singlepose_lightning | GPU usage is 93.76% compared to NPU usage 6.24% |
|                              | please verify if the model is quantized or that |
|                              | the quantization scheme used is the 8-bits per- |
|                              |                      tensor                     |
+------------------------------+-------------------------------------------------+
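The comment in the table above compares GPU usage with NPU usage. A minimal sketch of such a check is shown here, assuming a model is flagged when the GPU executes a larger share of the inference than the NPU. The exact rule and threshold used by the tool are not documented in this article, and the function name is_non_optimal is hypothetical.

```python
def is_non_optimal(gpu_usage: float, npu_usage: float) -> bool:
    """Assumed rule: flag the model when the GPU does more of the work than the NPU."""
    return gpu_usage > npu_usage


# Usage values taken from the movenet_singlepose_lightning row above.
print(is_non_optimal(gpu_usage=93.76, npu_usage=6.24))  # True

# A model running mostly on the NPU would not be flagged under this rule.
print(is_non_optimal(gpu_usage=6.81, npu_usage=93.19))  # False
```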
4. How to benchmark a single model
4.2. On STM32MP1x board
For this demonstration, the NN model mobilenet_v1_0.5_128_quant.tflite, downloaded from TensorFlow Hub[1], is used. It is a lite model trained for image classification.
The model used in this example can be installed from the following package:
x-linux-ai -i img-models-mobilenetv1-05-128
Information: The same demonstration can also be carried out with ONNX™ or Edge TPU™ models.
To launch the benchmark on a single model use the following command:
x-linux-ai-benchmark -m /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
After running the benchmark, this is the output on the console:
+------------------------------------------------+
|     X-LINUX-AI unified NN model benchmark      |
+--------------------------+---------------------+
|         Machine          |   STM32MP157F-DK2   |
|        CPU cores         |          2          |
|   CPU Clock frequency    |       0.8GHz        |
|    X-LINUX-AI Version    |       v5.1.0        |
+--------------------------+---------------------+
Computation engine use for benchmark : CPU with 2 cores at : 0.8GHz
+--------------------------------------------------------------------------+
|                     TensorFlow Lite models benchmark                     |
+----------------------------+---------------------+-------+---------------+
|         Model Name         | Inference Time (ms) | CPU % | Peak RAM (MB) |
+----------------------------+---------------------+-------+---------------+
| mobilenet_v1_0.5_128_quant |        28.31        | 100.0 |     27.37     |
+----------------------------+---------------------+-------+---------------+
The first table is dedicated to target information, and the second is dedicated to benchmark results.
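When the benchmark is driven from a script rather than typed by hand, the command line can be assembled programmatically. The sketch below uses only the options listed in the usage text. The helper names build_command and run_benchmark are hypothetical, and run_benchmark only works on a target where x-linux-ai-benchmark is installed.

```python
import shlex
import subprocess
from typing import List, Optional

BENCHMARK_BIN = "x-linux-ai-benchmark"


def build_command(model_path: Optional[str] = None,
                  models_directory: Optional[str] = None,
                  cpu_cores: Optional[int] = None,
                  export_results: bool = False,
                  minimal_serial: bool = False) -> List[str]:
    """Assemble the argument list; exactly one of the two targets is required."""
    if (model_path is None) == (models_directory is None):
        raise ValueError("give exactly one of model_path or models_directory")
    cmd = [BENCHMARK_BIN]
    if model_path is not None:
        cmd += ["-m", model_path]
    else:
        cmd += ["-d", str(models_directory)]
    if cpu_cores is not None:
        cmd += ["--cpu_cores", str(cpu_cores)]
    if export_results:
        cmd.append("--export_results")
    if minimal_serial:
        cmd.append("--minimal_serial")
    return cmd


def run_benchmark(cmd: List[str]) -> str:
    """Run the benchmark on the target and return its console output."""
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout


MODEL = ("/usr/local/x-linux-ai/image-classification/models/mobilenet/"
         "mobilenet_v1_0.5_128_quant.tflite")
cmd = build_command(model_path=MODEL, export_results=True)
print(shlex.join(cmd))  # the command is only printed here, not executed
```

Passing the arguments as a list avoids shell quoting issues if a model path contains spaces.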
5. How to benchmark multiple models
With the X-LINUX-AI unified benchmark, it is possible to benchmark multiple models located in the same directory. This method makes it easy to compare the performance of models with different architectures and model types.
5.2. On STM32MP1x board
For this demonstration, image classification models are used. The benchmark runs on TensorFlow™ Lite, ONNX™, and Coral Edge TPU™ models.
The models used in this example can be installed from the following package:
x-linux-ai -i img-models-mobilenetv1-05-128
Use the following command to launch the benchmark of multiple models stored in the same directory:
x-linux-ai-benchmark -d /usr/local/x-linux-ai/image-classification/models/mobilenet/
After running the benchmark, this is the output on the console:
+------------------------------------------------+
|     X-LINUX-AI unified NN model benchmark      |
+--------------------------+---------------------+
|         Machine          |   STM32MP157F-DK2   |
|        CPU cores         |          2          |
|   CPU Clock frequency    |       0.8GHz        |
|    X-LINUX-AI Version    |       v5.1.0        |
+--------------------------+---------------------+
Computation engine use for benchmark : CPU with 2 cores at : 0.8GHz
+--------------------------------------------------------------------------+
|                     TensorFlow Lite models benchmark                     |
+----------------------------+---------------------+-------+---------------+
|         Model Name         | Inference Time (ms) | CPU % | Peak RAM (MB) |
+----------------------------+---------------------+-------+---------------+
| mobilenet_v1_0.5_128_quant |        28.54        | 100.0 |     27.31     |
+----------------------------+---------------------+-------+---------------+
+--------------------------------------------------------------------------+
|                          ONNX models benchmark                           |
+----------------------------+---------------------+-------+---------------+
|         Model Name         | Inference Time (ms) | CPU % | Peak RAM (MB) |
+----------------------------+---------------------+-------+---------------+
| mobilenet_v1_0.5_128_quant |        61.08        | 100.0 |     17.26     |
+----------------------------+---------------------+-------+---------------+
Benchmark results on multiple models are classified in different tables, depending on the model type. One table is dedicated to TensorFlow™ Lite models, a second to Coral Edge TPU™ models, and the last one to ONNX™ models.
Information: If the benchmarked directory contains files that are not NN models, those files are skipped and a message is logged in the console.
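The directory handling described in this section (group models by type, skip everything else with a log message) can be sketched as follows. The extension-to-type mapping is an assumption made for illustration, in particular ".nb" for NBG files, and the function name group_models is hypothetical. The tool's own list of known extensions may differ.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Assumed mapping from file extension to model type (illustrative only).
KNOWN_EXTENSIONS = {".nb": "nbg", ".tflite": "tflite", ".onnx": "onnx"}


def group_models(directory: str) -> Dict[str, List[str]]:
    """Group model files by type and skip files with an unknown extension."""
    groups: Dict[str, List[str]] = {}
    for path in sorted(Path(directory).iterdir()):
        model_type = KNOWN_EXTENSIONS.get(path.suffix.lower())
        if not path.is_file() or model_type is None:
            # Mirror the documented behavior: log the file and move on.
            print(f"skipping {path.name}: not a known NN model type")
            continue
        groups.setdefault(model_type, []).append(path.name)
    return groups


# Demonstrate on a temporary directory with two models and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.tflite", "a.onnx", "labels.txt"):
        (Path(tmp) / name).touch()
    groups = group_models(tmp)

print(groups)
```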
6. How to export benchmark results
To export the benchmark results, use the optional argument --export_results. A JSON file named x-linux-ai-benchmark-result.json is generated at the end of the benchmark, in the directory where the benchmark was executed.
The JSON result file is built around different structures:
- One dedicated to the board information:
"board_information": {
    "name": "STM32MP257",
    "nb_cpu_core": 2,
    "cpu clock": 1500000000.0,
    "gpu version": "6.4.15.6.691815",
    "gpu clock": 800000000
},
- One structure per model tested:
"mobilenet_v2_1.0_224_int8_per_tensor_nbg": {
    "nn_name": "mobilenet_v2_1.0_224_int8_per_tensor",
    "model_type": "nbg",
    "execution_engine": "gpu/npu",
    "cpu_core_used": "2",
    "inference_time": 11.74,
    "cpu_usage": 0.0,
    "gpu_usage": 6.81,
    "gpu_layer_list": [
        "DepthwiseConvLayer",
        "Softmax2Layer"
    ],
    "npu_usage": 93.19,
    "npu_layer_list": [
        "TensorTranspose",
        "ConvolutionReluPoolingLayer2",
        "FullyConnectedReluLayer",
        "TensorCopy"
    ],
    "ram_usage": "NA",
    "macc_usage": "NA"
},
"mobilenet_v2_1.0_224_int8_per_tensor_onnx": {
    "nn_name": "mobilenet_v2_1.0_224_int8_per_tensor",
    "model_type": "onnx",
    "execution_engine": "cpu",
    "cpu_core_used": "2",
    "inference_time": 177.94,
    "cpu_usage": 100.0,
    "gpu_usage": "NA",
    "gpu_layer_list": [
        "NA"
    ],
    "npu_usage": "NA",
    "npu_layer_list": [
        "NA"
    ],
    "ram_usage": "44228608",
    "macc_usage": "NA"
},
"mobilenet_v2_1.0_224_int8_per_tensor_tflite": {
    "nn_name": "mobilenet_v2_1.0_224_int8_per_tensor",
    "model_type": "tflite",
    "execution_engine": "cpu",
    "cpu_core_used": "2",
    "inference_time": 119.77,
    "cpu_usage": 100.0,
    "gpu_usage": "NA",
    "gpu_layer_list": [
        "NA"
    ],
    "npu_usage": "NA",
    "npu_layer_list": [
        "NA"
    ],
    "ram_usage": 37902300,
    "macc_usage": "NA"
}
If multiple models are tested, each model has a dedicated structure containing its benchmark results. The information listed in each structure may vary depending on the model type and the target used.
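An exported file of this shape can be post-processed with a few lines of Python. The sketch below works on a trimmed excerpt of the example above so that it runs anywhere; on a target, the same summarize function could be applied to the loaded JSON file instead. The helper names are hypothetical. Note that in the example "ram_usage" appears as an integer, as a numeric string, and as "NA", so the sketch normalizes it defensively.

```python
import json

# Trimmed excerpt of the export shown above; only the fields used here are kept.
SAMPLE = """
{
  "board_information": {"name": "STM32MP257", "nb_cpu_core": 2},
  "mobilenet_v2_1.0_224_int8_per_tensor_nbg": {
    "model_type": "nbg", "inference_time": 11.74, "ram_usage": "NA"
  },
  "mobilenet_v2_1.0_224_int8_per_tensor_onnx": {
    "model_type": "onnx", "inference_time": 177.94, "ram_usage": "44228608"
  },
  "mobilenet_v2_1.0_224_int8_per_tensor_tflite": {
    "model_type": "tflite", "inference_time": 119.77, "ram_usage": 37902300
  }
}
"""


def to_int_or_none(value):
    """Convert an int or numeric string to int; return None for values like "NA"."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def summarize(results):
    """Return (model_type, inference_time_ms, ram_usage) rows, fastest first."""
    rows = []
    for key, entry in results.items():
        if key == "board_information":
            continue  # board description, not a model result
        rows.append((entry["model_type"],
                     entry["inference_time"],
                     to_int_or_none(entry["ram_usage"])))
    return sorted(rows, key=lambda row: row[1])


rows = summarize(json.loads(SAMPLE))
for model_type, time_ms, ram in rows:
    print(f"{model_type:<7} {time_ms:>8.2f} ms  ram_usage={ram}")
```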
7. Going further
The X-LINUX-AI unified benchmark is built on top of the individual NBG, TensorFlow™ Lite, Coral, and ONNX™ benchmarks available in the X-LINUX-AI expansion package. To keep things simple, not all of the options provided by those benchmark utilities are available in the unified benchmark.
To go further on a specific benchmark, refer to the following articles:
- For NBG benchmark: How to measure the performance of NBG-based models
- For TFLite™ benchmark: How to measure performance of your NN models using TensorFlow™ Lite runtime
- For ONNX™ benchmark: How to measure performance of your models using ONNX Runtime
- For Coral Edge TPU™ benchmark: How to measure performance of your NN models using the Coral Edge TPU™