This article describes how to measure the performance of a TensorFlow Lite neural network model on STM32MPU platforms.
1. Installation
1.1. Installing from the OpenSTLinux AI package repository
After having configured the AI OpenSTLinux package repository, install the X-LINUX-AI components for this application. The minimum package required is:
apt-get install tensorflow-lite-tools
The model used in this example can be installed from the following package:
apt-get install tflite-models-mobilenetv1
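To confirm that both packages landed on the target before running anything, a quick status check can be scripted; this is a sketch that assumes the Debian package tools (`dpkg`) are available on the image, and it degrades gracefully on hosts where they are not:

```shell
# Check that the two packages installed above are present.
# Falls back to "missing" when dpkg is unavailable or the package is absent.
STATUS=""
for pkg in tensorflow-lite-tools tflite-models-mobilenetv1; do
    if command -v dpkg >/dev/null 2>&1 && dpkg -s "$pkg" >/dev/null 2>&1; then
        STATUS="$STATUS $pkg:ok"
    else
        STATUS="$STATUS $pkg:missing"
    fi
done
echo "package check:$STATUS"
```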
2. How to use the Benchmark application
2.1. Executing with the command line
The benchmark_model C/C++ application is located in the userfs partition:
/usr/local/bin/tensorflow-lite-x.x.x/tools/benchmark_model
It accepts the following input parameters:
usage: ./benchmark_model <flags>
Flags:
--num_runs=50 int32 optional expected number of runs, see also min_secs, max_secs
--min_secs=1 float optional minimum number of seconds to rerun for, potentially s
--max_secs=150 float optional maximum number of seconds to rerun for, potentially .
--run_delay=-1 float optional delay between runs in seconds
--run_frequency=-1 float optional Execute at a fixed frequency, instead of a fixed del.
--num_threads=-1 int32 optional number of threads
--use_caching=false bool optional Enable caching of prepacked weights matrices in matr.
--benchmark_name= string optional benchmark name
--output_prefix= string optional benchmark output prefix
--warmup_runs=1 int32 optional minimum number of runs performed on initialization, s
--warmup_min_secs=0.5 float optional minimum number of seconds to rerun for, potentially s
--verbose=false bool optional Whether to log parameters whose values are not set. .
--dry_run=false bool optional Whether to run the tool just with simply loading the.
--report_peak_memory_footprint=false bool optional Report the peak memory footprint by periodically che.
--memory_footprint_check_interval_ms=50 int32 optional The interval in millisecond between two consecutive .
--graph= string optional graph file name
--input_layer= string optional input layer names
--input_layer_shape= string optional input layer shape
--input_layer_value_range= string optional A map-like string representing value range for *inte4
--input_layer_value_files= string optional A map-like string representing value file. Each item.
--allow_fp16=false bool optional allow fp16
--require_full_delegation=false bool optional require delegate to run the entire graph
--enable_op_profiling=false bool optional enable op profiling
--max_profiling_buffer_entries=1024 int32 optional max profiling buffer entries
--profiling_output_csv_file= string optional File path to export profile data as CSV, if not set .
--print_preinvoke_state=false bool optional print out the interpreter internals just before call.
--print_postinvoke_state=false bool optional print out the interpreter internals just before benc.
--release_dynamic_tensors=false bool optional Ensure dynamic tensor's memory is released when they.
--help=false bool optional Print out all supported flags if true.
--num_threads=-1 int32 optional number of threads used for inference on CPU.
--max_delegated_partitions=0 int32 optional Max number of partitions to be delegated.
--min_nodes_per_partition=0 int32 optional The minimal number of TFLite graph nodes of a partit.
--delegate_serialize_dir= string optional Directory to be used by delegates for serializing an.
--delegate_serialize_token= string optional Model-specific token acting as a namespace for deleg.
--external_delegate_path= string optional The library path for the underlying external.
--external_delegate_options= string optional A list of comma-separated options to be passed to th.
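These flags can be combined freely; for reproducible measurements it is convenient to assemble the command line in a small wrapper script. The sketch below only builds and prints the command (the binary path and model path are the ones used in the examples below; adjust the TensorFlow Lite version directory to your image):

```shell
# Hypothetical wrapper: assemble a benchmark_model invocation with
# op profiling enabled and a fixed number of measured runs.
BENCH=/usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model
MODEL=/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite

CMD="$BENCH --graph=$MODEL --num_threads=2 --num_runs=100 --enable_op_profiling=true"
echo "$CMD"   # on the target, execute $CMD instead of echoing it
```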
2.2. Testing with MobileNet V1
The model used for testing is mobilenet_v1_0.5_128_quant.tflite, downloaded from TensorFlow Hub[1]. It is a model used for image classification.
On the target, the model is located here:
/usr/local/demo-ai/image-classification/models/mobilenet/
2.2.1. Benchmark on NPU
Several types of delegation are possible with this benchmark to improve performance. This part shows how to use the benchmark with NPU acceleration.
To use the NPU acceleration, add an option that allows the benchmark to delegate the execution of the neural network; in this case the operations are delegated to the VX delegate. The option to use is --external_delegate_path=/usr/lib/libvx_delegate.so.2, which gives the following command:
/usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --num_threads=2 --external_delegate_path=/usr/lib/libvx_delegate.so.2
Information: when using the NPU, there is a warm-up time that can sometimes be quite long depending on the model used.
Console output:
STARTING!
Log parameter values verbosely: [0]
Num threads: [2]
Graph: [/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
#threads used for CPU inference: [2]
External delegate path: [/usr/lib/libvx_delegate.so.2]
Loaded model /usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
EXTERNAL delegate created.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 1.36451
Initialized session in 432.337ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
W [HandleLayoutInfer:287]Op 162: default layout inference pass.
count=1 curr=8364808
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=562 first=1906 curr=1760 min=1735 max=2316 avg=1765.5 std=25
Inference timings in us: Init: 432337, First inference: 8364808, Warmup (avg): 8.36481e+06, Inference (avg): 1765.5
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=7.97266 overall=47.6914
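The figures that usually matter in this output are the init time, the first inference (which includes the NPU warm-up) and the steady-state average. Pulling the average out of a saved log can be done with standard tools; the sample line below is copied from the output above:

```shell
# Extract the average inference time (in microseconds) from a
# benchmark_model timing line; the sample is taken from the log above.
LINE='Inference timings in us: Init: 432337, First inference: 8364808, Warmup (avg): 8.36481e+06, Inference (avg): 1765.5'
AVG_US=$(printf '%s\n' "$LINE" | sed -n 's/.*Inference (avg): \([0-9.]*\).*/\1/p')
echo "average inference: ${AVG_US} us"
```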
2.2.2. Benchmark on CPU
The easiest way to use the benchmark is to run it on the CPU.
To do this, run the benchmark with at least the --graph option. To go a little further, it can be interesting to set the number of CPU threads with the --num_threads option to improve performance. Here is the command to execute:
/usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --num_threads=2
Console output:
STARTING!
Log parameter values verbosely: [1]
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Number of prorated runs per second: [-1]
Num threads: [2]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Run w/o invoking kernels: [0]
Report the peak memory footprint: [0]
Memory footprint check interval (ms): [50]
Graph: [/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input value files: []
Allow fp16: [0]
Require full delegation: [0]
Enable op profiling: [1]
Max initial profiling buffer entries: [1024]
Allow dynamic increase on profiling buffer entries: [0]
CSV File to export profiling data to: []
Print pre-invoke interpreter state: [0]
Print post-invoke interpreter state: [0]
Release dynamic tensor memory: [0]
Optimize memory usage for large tensors: [0]
Disable delegate clustering: [0]
File path to export outputs layer to: []
print out all supported flags: [0]
#threads used for CPU inference: [2]
Max number of delegated partitions: [0]
Min nodes per partition: [0]
Directory for delegate serialization: []
Model-specific token/key for delegate serialization.: []
Use xnnpack: [0]
External delegate path: []
External delegate options: []
Loaded model /usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
The input model file size (MB): 1.36451
Initialized session in 36.021ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=38 first=16929 curr=13525 min=12995 max=20882 avg=13453.9 std=1369
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=73 first=13603 curr=13097 min=13020 max=19314 avg=13439.5 std=952
Inference timings in us: Init: 36021, First inference: 16929, Warmup (avg): 13453.9, Inference (avg): 13439.5
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=7.54688 overall=8.51953
Profiling Info for Benchmark Initialization:
============================== Run Order ==============================
[node type]             [first] [avg ms] [%]      [cdf%]   [mem KB] [times called] [Name]
ModifyGraphWithDelegate 31.231  31.231   99.111%  99.111%  2636.000 1              ModifyGraphWithDelegate/0
AllocateTensors         0.280   0.280    0.889%   100.000% 0.000    1              AllocateTensors/0
============================== Top by Computation Time ==============================
[node type]             [first] [avg ms] [%]      [cdf%]   [mem KB] [times called] [Name]
ModifyGraphWithDelegate 31.231  31.231   99.111%  99.111%  2636.000 1              ModifyGraphWithDelegate/0
AllocateTensors         0.280   0.280    0.889%   100.000% 0.000    1              AllocateTensors/0
Number of nodes executed: 2
============================== Summary by node type ==============================
[Node type]             [count] [avg ms] [avg %]  [cdf %]  [mem KB] [times called]
ModifyGraphWithDelegate 1       31.231   99.111%  99.111%  2636.000 1
AllocateTensors         1       0.280    0.889%   100.000% 0.000    1
Timings (microseconds): count=1 curr=31511
Memory (bytes): count=0
2 nodes observed

Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
[node type]           [first] [avg ms] [%]     [cdf%]   [mem KB] [times called] [Name]
SOFTMAX               0.026   0.027    0.202%  0.202%   0.000    1              [MobilenetV1/Predictions/Reshape_1]:30
RESHAPE               0.004   0.004    0.032%  0.234%   0.000    1              [MobilenetV1/Logits/SpatialSqueeze]:29
TfLiteXNNPackDelegate 0.254   0.279    2.080%  2.314%   0.000    1              [MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd]:32
AVERAGE_POOL_2D       0.028   0.026    0.195%  2.509%   0.000    1              [MobilenetV1/Logits/AvgPool_1a/AvgPool]:27
TfLiteXNNPackDelegate 13.241  13.054   97.491% 100.000% 0.000    1              [MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]:31
============================== Top by Computation Time ==============================
[node type]           [first] [avg ms] [%]     [cdf%]   [mem KB] [times called] [Name]
TfLiteXNNPackDelegate 13.241  13.054   97.491% 97.491%  0.000    1              [MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]:31
TfLiteXNNPackDelegate 0.254   0.279    2.080%  99.571%  0.000    1              [MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd]:32
SOFTMAX               0.026   0.027    0.202%  99.773%  0.000    1              [MobilenetV1/Predictions/Reshape_1]:30
AVERAGE_POOL_2D       0.028   0.026    0.195%  99.968%  0.000    1              [MobilenetV1/Logits/AvgPool_1a/AvgPool]:27
RESHAPE               0.004   0.004    0.032%  100.000% 0.000    1              [MobilenetV1/Logits/SpatialSqueeze]:29
Number of nodes executed: 5
============================== Summary by node type ==============================
[Node type]           [count] [avg ms] [avg %]  [cdf %]  [mem KB] [times called]
TfLiteXNNPackDelegate 2       13.332   99.574%  99.574%  0.000    2
SOFTMAX               1       0.027    0.202%   99.776%  0.000    1
AVERAGE_POOL_2D       1       0.026    0.194%   99.970%  0.000    1
RESHAPE               1       0.004    0.030%   100.000% 0.000    1
Timings (microseconds): count=73 first=13553 curr=13049 min=12971 max=19265 avg=13390.3 std=953
Memory (bytes): count=0
5 nodes observed

Delegate internal:
============================== Run Order ==============================
[node type]      [first] [avg ms] [%]    [cdf%]   [mem KB] [times called] [Name]
DelegateOpInvoke 0.752   0.699    5.287% 5.287%   0.000    1              Delegate/Convolution (NHWC, QU8) IGEMM:0
DelegateOpInvoke 0.515   0.484    3.659% 8.946%   0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:1
DelegateOpInvoke 0.625   0.631    4.777% 13.723%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:2
DelegateOpInvoke 0.222   0.245    1.854% 15.577%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:3
DelegateOpInvoke 0.589   0.529    4.006% 19.583%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:4
DelegateOpInvoke 0.392   0.397    3.001% 22.584%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:5
DelegateOpInvoke 0.954   0.985    7.449% 30.033%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:6
DelegateOpInvoke 0.104   0.109    0.823% 30.856%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:7
DelegateOpInvoke 0.477   0.486    3.675% 34.531%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:8
DelegateOpInvoke 0.191   0.193    1.464% 35.995%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:9
DelegateOpInvoke 1.159   0.931    7.043% 43.038%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:10
DelegateOpInvoke 0.055   0.056    0.423% 43.462%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:11
DelegateOpInvoke 0.484   0.536    4.052% 47.514%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:12
DelegateOpInvoke 0.094   0.096    0.726% 48.240%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:13
DelegateOpInvoke 1.017   0.930    7.037% 55.277%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:14
DelegateOpInvoke 0.096   0.096    0.728% 56.005%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:15
DelegateOpInvoke 0.904   0.956    7.232% 63.237%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:16
DelegateOpInvoke 0.094   0.098    0.743% 63.980%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:17
DelegateOpInvoke 0.231   0.259    1.959% 65.939%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:0
DelegateOpInvoke 0.902   0.942    7.131% 73.069%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:18
DelegateOpInvoke 0.095   0.096    0.727% 73.797%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:19
DelegateOpInvoke 0.930   0.937    7.086% 80.882%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:20
DelegateOpInvoke 0.097   0.096    0.730% 81.612%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:21
DelegateOpInvoke 0.921   0.921    6.972% 88.584%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:22
DelegateOpInvoke 0.028   0.028    0.214% 88.797%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:23
DelegateOpInvoke 0.462   0.493    3.727% 92.525%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:24
DelegateOpInvoke 0.050   0.050    0.380% 92.904%  0.000    1              Delegate/Convolution (NHWC, QU8) DWConv:25
DelegateOpInvoke 0.911   0.938    7.096% 100.000% 0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:26
============================== Top by Computation Time ==============================
[node type]      [first] [avg ms] [%]    [cdf%]   [mem KB] [times called] [Name]
DelegateOpInvoke 0.954   0.985    7.449% 7.449%   0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:6
DelegateOpInvoke 0.904   0.956    7.232% 14.681%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:16
DelegateOpInvoke 0.902   0.942    7.131% 21.812%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:18
DelegateOpInvoke 0.911   0.938    7.096% 28.907%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:26
DelegateOpInvoke 0.930   0.937    7.086% 35.993%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:20
DelegateOpInvoke 1.159   0.931    7.043% 43.036%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:10
DelegateOpInvoke 1.017   0.930    7.037% 50.073%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:14
DelegateOpInvoke 0.921   0.921    6.972% 57.045%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:22
DelegateOpInvoke 0.752   0.699    5.287% 62.332%  0.000    1              Delegate/Convolution (NHWC, QU8) IGEMM:0
DelegateOpInvoke 0.625   0.631    4.777% 67.110%  0.000    1              Delegate/Convolution (NHWC, QU8) GEMM:2
Number of nodes executed: 28
============================== Summary by node type ==============================
[Node type]      [count] [avg ms] [avg %]  [cdf %]  [mem KB] [times called]
DelegateOpInvoke 28      13.203   100.000% 100.000% 0.000    28
Timings (microseconds): count=73 first=13351 curr=12879 min=12802 max=19094 avg=13216.7 std=952
Memory (bytes): count=0
28 nodes observed
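Comparing the average inference times reported in the two console outputs (1765.5 µs with the VX delegate on the NPU versus 13439.5 µs on the CPU) gives a rough idea of the gain from delegation. A one-liner to compute the ratio, with the two figures copied from the logs:

```shell
# Speed-up of the NPU run over the CPU run, using the "Inference (avg)"
# values (in microseconds) reported in the two console outputs above.
CPU_AVG=13439.5
NPU_AVG=1765.5
SPEEDUP=$(awk -v cpu="$CPU_AVG" -v npu="$NPU_AVG" 'BEGIN { printf "%.1f", cpu / npu }')
echo "NPU speed-up: ${SPEEDUP}x"
```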
2.2.3. Benchmark on GPU
This part shows how to use the benchmark with GPU acceleration.
The procedure is similar to the NPU one, but it is necessary to export an environment variable to force the use of the GPU only. First, export the following environment variable:
export VIV_VX_DISABLE_TP_NN=1
Then, run the command:
/usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --num_threads=2 --external_delegate_path=/usr/lib/libvx_delegate.so.2
Console output:
STARTING!
Log parameter values verbosely: [0]
Num threads: [2]
Graph: [/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
#threads used for CPU inference: [2]
External delegate path: [/usr/lib/libvx_delegate.so.2]
Loaded model /usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
EXTERNAL delegate created.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 1.36451
Initialized session in 21.586ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
W [HandleLayoutInfer:287]Op 162: default layout inference pass.
count=1 curr=1050067
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=86 first=12273 curr=11654 min=11582 max=12273 avg=11671 std=103
Inference timings in us: Init: 21586, First inference: 1050067, Warmup (avg): 1.05007e+06, Inference (avg): 11671
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=9.94922 overall=80.4414
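The three configurations above differ only in the delegate option and one environment variable, so they can be driven from a single script. This sketch only assembles and prints the command line for each mode (the paths and the VIV_VX_DISABLE_TP_NN variable are taken from the sections above; on the target, execute each command instead of echoing it):

```shell
# Build the CPU, NPU and GPU benchmark command lines from the sections above.
BENCH=/usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model
MODEL=/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
DELEGATE=/usr/lib/libvx_delegate.so.2

for mode in cpu npu gpu; do
    case "$mode" in
        cpu) cmd="$BENCH --graph=$MODEL --num_threads=2" ;;
        npu) cmd="$BENCH --graph=$MODEL --num_threads=2 --external_delegate_path=$DELEGATE" ;;
        gpu) cmd="VIV_VX_DISABLE_TP_NN=1 $BENCH --graph=$MODEL --num_threads=2 --external_delegate_path=$DELEGATE" ;;
    esac
    echo "[$mode] $cmd"
done
```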
3. References