This article describes how to measure the performance of a TensorFlow Lite neural network model on STM32MP1x platforms.
1. Installation
1.1. Installing from the OpenSTLinux AI package repository
After configuring the OpenSTLinux AI package repository, install the X-LINUX-AI components for this application. The minimum required package is:
apt-get install tensorflow-lite-tools
The model used in this example can be installed from the following package:
apt-get install tflite-models-mobilenetv1
2. How to use the Benchmark application
2.1. Executing with the command line
The benchmark_model C/C++ application is located in the userfs partition:
/usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model
It accepts the following input parameters:
usage: ./benchmark_model <flags>
Flags:
--num_runs=50 int32 optional expected number of runs, see also min_secs, max_secs
--min_secs=1 float optional minimum number of seconds to rerun for, potentially making the actual number of runs to be greater than num_runs
--max_secs=150 float optional maximum number of seconds to rerun for, potentially making the actual number of runs to be less than num_runs. Note if --max-secs is exceeded in the middle of a run, the benchmark will continue to the end of the run but will not start the next run.
--run_delay=-1 float optional delay between runs in seconds
--run_frequency=-1 float optional Execute at a fixed frequency, instead of a fixed delay.Note if the targeted rate per second cannot be reached, the benchmark would start the next run immediately, trying its best to catch up. If set, this will override run_delay.
--num_threads=-1 int32 optional number of threads
--use_caching=false bool optional Enable caching of prepacked weights matrices in matrix multiplication routines. Currently implies the use of the Ruy library.
--benchmark_name= string optional benchmark name
--output_prefix= string optional benchmark output prefix
--warmup_runs=1 int32 optional minimum number of runs performed on initialization, to allow performance characteristics to settle, see also warmup_min_secs
--warmup_min_secs=0.5 float optional minimum number of seconds to rerun for, potentially making the actual number of warm-up runs to be greater than warmup_runs
--verbose=false bool optional Whether to log parameters whose values are not set. By default, only log those parameters that are set by parsing their values from the commandline flags.
--dry_run=false bool optional Whether to run the tool just with simply loading the model, allocating tensors etc. but without actually invoking any op kernels.
--report_peak_memory_footprint=false bool optional Report the peak memory footprint by periodically checking the memory footprint. Internally, a separate thread will be spawned for this periodic check. Therefore, the performance benchmark result could be affected.
--memory_footprint_check_interval_ms=50 int32 optional The interval in millisecond between two consecutive memory footprint checks. This is only used when --report_peak_memory_footprint is set to true.
--graph= string optional graph file name
--input_layer= string optional input layer names
--input_layer_shape= string optional input layer shape
--input_layer_value_range= string optional A map-like string representing value range for *integer* input layers. Each item is separated by ':', and the item value consists of input layer name and integer-only range values (both low and high are inclusive) separated by ',', e.g. input1,1,2:input2,0,254
--input_layer_value_files= string optional A map-like string representing value file. Each item is separated by ',', and the item value consists of input layer name and value file path separated by ':', e.g. input1:file_path1,input2:file_path2. In case the input layer name contains ':' e.g. "input:0", escape it with "\:". If the input_name appears both in input_layer_value_range and input_layer_value_files, input_layer_value_range of the input_name will be ignored. The file format is binary and it should be array format or null separated strings format.
--allow_fp16=false bool optional allow fp16
--require_full_delegation=false bool optional require delegate to run the entire graph
--enable_op_profiling=false bool optional enable op profiling
--max_profiling_buffer_entries=1024 int32 optional max initial profiling buffer entries
--allow_dynamic_profiling_buffer_increase=false bool optional allow dynamic increase on profiling buffer entries
--profiling_output_csv_file= string optional File path to export profile data as CSV, if not set prints to stdout.
--print_preinvoke_state=false bool optional print out the interpreter internals just before calling Invoke. The internals will include allocated memory size of each tensor etc.
--print_postinvoke_state=false bool optional print out the interpreter internals just before benchmark completes (i.e. after all repeated Invoke calls complete). The internals will include allocated memory size of each tensor etc.
--release_dynamic_tensors=false bool optional Ensure dynamic tensor's memory is released when they are not used.
--optimize_memory_for_large_tensors=0 int32 optional Optimize memory usage for large tensors with sacrificing latency.
--disable_delegate_clustering=false bool optional Disable delegate clustering.
--output_filepath= string optional File path to export outputs layer as binary data.
--help=false bool optional Print out all supported flags if true.
--num_threads=-1 int32 optional number of threads used for inference on CPU.
--max_delegated_partitions=0 int32 optional Max number of partitions to be delegated.
--min_nodes_per_partition=0 int32 optional The minimal number of TFLite graph nodes of a partition that has to be reached for it to be delegated.A negative value or 0 means to use the default choice of each delegate.
--delegate_serialize_dir= string optional Directory to be used by delegates for serializing any model data. This allows the delegate to save data into this directory to reduce init time after the first run. Currently supported by NNAPI delegate with specific backends on Android. Note that delegate_serialize_token is also required to enable this feature.
--delegate_serialize_token= string optional Model-specific token acting as a namespace for delegate serialization. Unique tokens ensure that the delegate doesn't read inapplicable/invalid data. Note that delegate_serialize_dir is also required to enable this feature.
--external_delegate_path= string optional The library path for the underlying external.
--external_delegate_options= string optional A list of comma-separated options to be passed to the external delegate. Each option is a colon-separated key-value pair, e.g. option_name:option_value.
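Because every flag listed above follows the same --name=value form, a benchmark command line can be assembled programmatically. The following is a minimal Python sketch, not part of the X-LINUX-AI package: the helper name build_benchmark_command is hypothetical, and the binary path is the one given in this article, so adjust it if your image ships another TensorFlow Lite version.

```python
import shlex

# Path taken from this article; it may differ on other X-LINUX-AI releases.
BENCHMARK_BIN = "/usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model"


def build_benchmark_command(graph, **flags):
    """Return the argument list for benchmark_model (hypothetical helper).

    Each keyword argument becomes one --name=value flag. Booleans are
    written as true/false, matching the defaults shown in the flag list.
    """
    args = [BENCHMARK_BIN, f"--graph={graph}"]
    for name, value in flags.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.append(f"--{name}={value}")
    return args


if __name__ == "__main__":
    cmd = build_benchmark_command(
        "/usr/local/demo-ai/computer-vision/models/mobilenet/"
        "mobilenet_v1_0.5_128_quant.tflite",
        num_threads=2,
        enable_op_profiling=True,
    )
    # Print a copy-pasteable shell command line.
    print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, capture_output=True, text=True) on the target to collect the console output for later processing.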
2.2. Testing with MobileNet V1
The model used for testing is mobilenet_v1_0.5_128_quant.tflite, downloaded from TensorFlow Hub[1].
It is a model used for image classification.
On the target, the model is located here:
/usr/local/demo-ai/computer-vision/models/mobilenet/
To launch the Benchmark application in its minimal configuration, use the following command:
/usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
Console output:
STARTING!
Log parameter values verbosely: [0]
Graph: [/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
Loaded model /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
The input model file size (MB): 1.36451
Initialized session in 7.11ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=6 first=86950 curr=83078 min=82943 max=86950 avg=83733.5 std=1447
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=83766 curr=87918 min=82910 max=87918 avg=83747.3 std=1083
Inference timings in us: Init: 7110, First inference: 86950, Warmup (avg): 83733.5, Inference (avg): 83747.3
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.52734 overall=3.96094
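The "Inference timings in us" line of this output carries the headline numbers, expressed in microseconds. As an illustration only, the Python sketch below extracts them and converts the average inference time to milliseconds and to inferences per second. The function name parse_timings is hypothetical, and the regular expression assumes the exact line layout shown in this article.

```python
import re

# Matches the summary line printed by benchmark_model in the output above.
TIMINGS_RE = re.compile(
    r"Inference timings in us: Init: (?P<init>[\d.]+), "
    r"First inference: (?P<first>[\d.]+), "
    r"Warmup \(avg\): (?P<warmup>[\d.]+), "
    r"Inference \(avg\): (?P<inference>[\d.]+)"
)


def parse_timings(log):
    """Return the four summary timings (in microseconds) as floats."""
    match = TIMINGS_RE.search(log)
    if match is None:
        raise ValueError("no 'Inference timings in us' line found")
    return {key: float(value) for key, value in match.groupdict().items()}


if __name__ == "__main__":
    # Summary line copied from the single-thread run above.
    log = (
        "Inference timings in us: Init: 7110, First inference: 86950, "
        "Warmup (avg): 83733.5, Inference (avg): 83747.3"
    )
    timings = parse_timings(log)
    avg_ms = timings["inference"] / 1000.0
    per_second = 1_000_000.0 / timings["inference"]
    print(f"average inference: {avg_ms:.2f} ms ({per_second:.2f} inferences/s)")
```

With the single-thread figures above, the average of 83747.3 us corresponds to roughly 83.75 ms per inference, or about 11.9 inferences per second.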
To obtain the best performance, use the num_threads flag to run the benchmark on more than one thread, depending on the hardware used.
/usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --num_threads=2
Console output:
STARTING!
Log parameter values verbosely: [0]
Num threads: [2]
Graph: [/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
Loaded model /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
The input model file size (MB): 1.36451
Initialized session in 6.484ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=12 first=49700 curr=43522 min=43522 max=50016 avg=45037.8 std=2285
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=43819 curr=44488 min=43451 max=58290 avg=45438.3 std=3255
Inference timings in us: Init: 6484, First inference: 49700, Warmup (avg): 45037.8, Inference (avg): 45438.3
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.52734 overall=4.17969
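Comparing this run with the single-thread run gives a rough idea of how well the model scales with the number of threads. The Python sketch below computes the speedup and the per-thread efficiency from the two "Inference (avg)" values reported in this article. The function names are hypothetical, and the result only reflects these two sample runs.

```python
def speedup(baseline_us, candidate_us):
    """Return how many times faster the candidate run is than the baseline."""
    if baseline_us <= 0 or candidate_us <= 0:
        raise ValueError("timings must be positive")
    return baseline_us / candidate_us


def parallel_efficiency(baseline_us, candidate_us, threads):
    """Return the speedup divided by the number of threads (1.0 is ideal)."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return speedup(baseline_us, candidate_us) / threads


if __name__ == "__main__":
    one_thread_us = 83747.3   # Inference (avg) from the default run
    two_threads_us = 45438.3  # Inference (avg) from the --num_threads=2 run
    print(f"speedup: {speedup(one_thread_us, two_threads_us):.2f}x")
    print(f"efficiency: {parallel_efficiency(one_thread_us, two_threads_us, 2):.0%}")
```

For the figures above, the two-thread run is about 1.84 times faster than the single-thread run, which is about 92% of the ideal two-fold gain.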
To display more information, use the verbose and enable_op_profiling flags.
/usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --enable_op_profiling=true --num_threads=2 --verbose=true
Console output:
STARTING!
Log parameter values verbosely: [1]
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Number of prorated runs per second: [-1]
Num threads: [2]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input value files: []
Allow fp16: [0]
Require full delegation: [0]
Enable op profiling: [1]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Print pre-invoke interpreter state: [0]
Print post-invoke interpreter state: [0]
Loaded model /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
The input model file size (MB): 1.36451
Initialized session in 7.048ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=12 first=47714 curr=44373 min=43478 max=47714 avg=44048.9 std=1131
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=45097 curr=44107 min=43532 max=58039 avg=45254.1 std=3243
Inference timings in us: Init: 7048, First inference: 47714, Warmup (avg): 44048.9, Inference (avg): 45254.1
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.62109 overall=4.20703
Profiling Info for Benchmark Initialization:
============================== Run Order ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
AllocateTensors 0.000 2.594 2.594 100.000% 100.000% 124.000 1 AllocateTensors/0
============================== Top by Computation Time ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
AllocateTensors 0.000 2.594 2.594 100.000% 100.000% 124.000 1 AllocateTensors/0
Number of nodes executed: 1
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called]
AllocateTensors 1 2.594 100.000% 100.000% 124.000 1
Timings (microseconds): count=1 curr=2594
Memory (bytes): count=0
1 nodes observed
Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
CONV_2D 0.034 4.257 3.922 8.702% 8.702% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_0/Relu6]:0
DEPTHWISE_CONV_2D 3.962 1.545 1.802 3.997% 12.700% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_1_depthwise/Relu6]:1
CONV_2D 5.766 4.550 4.562 10.122% 22.821% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]:2
DEPTHWISE_CONV_2D 10.334 1.093 1.073 2.381% 25.202% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_2_depthwise/Relu6]:3
CONV_2D 11.410 2.508 2.798 6.207% 31.409% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6]:4
DEPTHWISE_CONV_2D 14.213 1.895 1.827 4.053% 35.462% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6]:5
CONV_2D 16.043 3.550 3.551 7.878% 43.340% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]:6
DEPTHWISE_CONV_2D 19.599 0.476 0.518 1.149% 44.489% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_4_depthwise/Relu6]:7
CONV_2D 20.120 1.623 1.673 3.711% 48.201% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Relu6]:8
DEPTHWISE_CONV_2D 21.796 1.035 0.841 1.866% 50.067% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_5_depthwise/Relu6]:9
CONV_2D 22.639 2.491 2.543 5.642% 55.709% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]:10
DEPTHWISE_CONV_2D 25.186 0.234 0.240 0.532% 56.241% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_6_depthwise/Relu6]:11
CONV_2D 25.428 1.242 1.315 2.917% 59.158% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Relu6]:12
DEPTHWISE_CONV_2D 26.745 0.421 0.434 0.964% 60.121% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_7_depthwise/Relu6]:13
CONV_2D 27.182 2.385 2.179 4.834% 64.956% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]:14
DEPTHWISE_CONV_2D 29.364 0.410 0.405 0.898% 65.854% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_8_depthwise/Relu6]:15
CONV_2D 29.771 2.133 2.181 4.838% 70.692% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]:16
DEPTHWISE_CONV_2D 31.955 0.417 0.422 0.936% 71.628% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_9_depthwise/Relu6]:17
CONV_2D 32.378 2.307 2.242 4.974% 76.603% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]:18
DEPTHWISE_CONV_2D 34.623 0.421 0.475 1.055% 77.658% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_10_depthwise/Relu6]:19
CONV_2D 35.101 2.128 2.193 4.865% 82.523% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]:20
DEPTHWISE_CONV_2D 37.297 0.414 0.407 0.903% 83.426% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_11_depthwise/Relu6]:21
CONV_2D 37.706 2.355 2.157 4.786% 88.212% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6]:22
DEPTHWISE_CONV_2D 39.866 0.156 0.132 0.292% 88.504% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_12_depthwise/Relu6]:23
CONV_2D 40.000 1.263 1.277 2.833% 91.337% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Relu6]:24
DEPTHWISE_CONV_2D 41.281 0.211 0.195 0.433% 91.770% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_13_depthwise/Relu6]:25
CONV_2D 41.477 2.384 2.486 5.516% 97.285% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]:26
AVERAGE_POOL_2D 43.968 0.045 0.051 0.113% 97.399% 0.000 1 [MobilenetV1/Logits/AvgPool_1a/AvgPool]:27
CONV_2D 44.021 0.858 1.112 2.468% 99.867% 0.000 1 [MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd]:28
RESHAPE 45.137 0.008 0.009 0.019% 99.886% 0.000 1 [MobilenetV1/Logits/SpatialSqueeze]:29
SOFTMAX 45.147 0.050 0.051 0.114% 100.000% 0.000 1 [MobilenetV1/Predictions/Reshape_1]:30
============================== Top by Computation Time ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
CONV_2D 5.766 4.550 4.562 10.122% 10.122% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]:2
CONV_2D 0.034 4.257 3.922 8.702% 18.824% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_0/Relu6]:0
CONV_2D 16.043 3.550 3.551 7.878% 26.702% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]:6
CONV_2D 11.410 2.508 2.798 6.207% 32.909% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6]:4
CONV_2D 22.639 2.491 2.543 5.642% 38.551% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]:10
CONV_2D 41.477 2.384 2.486 5.516% 44.067% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]:26
CONV_2D 32.378 2.307 2.242 4.974% 49.042% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]:18
CONV_2D 35.101 2.128 2.193 4.865% 53.907% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]:20
CONV_2D 29.771 2.133 2.181 4.838% 58.745% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]:16
CONV_2D 27.182 2.385 2.179 4.834% 63.579% 0.000 1 [MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]:14
Number of nodes executed: 31
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called]
CONV_2D 15 36.182 80.306% 80.306% 0.000 15
DEPTHWISE_CONV_2D 13 8.763 19.450% 99.756% 0.000 13
SOFTMAX 1 0.051 0.113% 99.869% 0.000 1
AVERAGE_POOL_2D 1 0.051 0.113% 99.982% 0.000 1
RESHAPE 1 0.008 0.018% 100.000% 0.000 1
Timings (microseconds): count=50 first=44865 curr=43927 min=43359 max=57839 avg=45071.9 std=3241
Memory (bytes): count=0
31 nodes observed
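The "Summary by node type" table at the end of this output shows which operator types dominate the inference time (here CONV_2D, with about 80% of the total). As an illustration, the Python sketch below extracts the rows of the last such table from a captured log. The function name parse_node_summary is hypothetical, and the regular expression assumes the seven-column layout shown in this article.

```python
import re

# One table row: type, count, avg ms, avg %, cdf %, mem KB, times called.
ROW_RE = re.compile(
    r"\b(?P<node_type>[A-Za-z][A-Za-z0-9_]*)\s+(?P<count>\d+)\s+"
    r"(?P<avg_ms>[\d.]+)\s+(?P<avg_pct>[\d.]+)%\s+(?P<cdf_pct>[\d.]+)%\s+"
    r"(?P<mem_kb>[\d.]+)\s+(?P<times_called>\d+)"
)


def parse_node_summary(log):
    """Return the rows of the last 'Summary by node type' table in the log."""
    marker = "Summary by node type"
    position = log.rfind(marker)
    if position < 0:
        raise ValueError("no 'Summary by node type' section found")
    rows = []
    for match in ROW_RE.finditer(log[position + len(marker):]):
        rows.append({
            "node_type": match["node_type"],
            "count": int(match["count"]),
            "avg_ms": float(match["avg_ms"]),
            "avg_pct": float(match["avg_pct"]),
        })
    return rows


if __name__ == "__main__":
    # Two rows copied from the profiling output above.
    log = (
        "===== Summary by node type =====\n"
        "[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called]\n"
        "CONV_2D 15 36.182 80.306% 80.306% 0.000 15\n"
        "DEPTHWISE_CONV_2D 13 8.763 19.450% 99.756% 0.000 13\n"
    )
    for row in parse_node_summary(log):
        print(f"{row['node_type']}: {row['avg_ms']} ms ({row['avg_pct']}%)")
```

Because the pattern treats any whitespace as a column separator, it also works when the console output has been captured without its original line breaks.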
3. References