How to measure performance of your NN models using TensorFlow Lite runtime

Revision as of 16:59, 1 December 2023 by Registered User
Applicable for STM32MP13x lines, STM32MP15x lines

This article describes how to measure the performance of a TensorFlow Lite neural network model on STM32MP1x and STM32MP2x platforms.

1. Installation

1.1. Installing from the OpenSTLinux AI package repository

Warning
The software package is provided AS IS, and by downloading it, you agree to be bound to the terms of the software license agreement (SLA0048). The detailed content licenses can be found here.

After configuring the AI OpenSTLinux package, install the X-LINUX-AI components needed for this application. The minimum required package is:

 apt-get install tensorflow-lite-tools

The model used in this example can be installed from the following package:

 apt-get install tflite-models-mobilenetv1

2. How to use the Benchmark application

2.1. Executing with the command line

The benchmark_model C/C++ application is located in the userfs partition:

/usr/local/bin/tensorflow-lite-x.x.x/tools/benchmark_model
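
Because the directory name contains the installed TensorFlow Lite version (shown as x.x.x above), a script may prefer to resolve the path rather than hard-code it. The following is a minimal sketch of one way to do that; the helper name find_benchmark_model and its optional root argument are hypothetical and not part of X-LINUX-AI.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the first benchmark_model binary found
# under <root>/tensorflow-lite-*/tools/. The default root follows the path given
# in this article; the optional argument exists only to ease testing.
find_benchmark_model() {
    root="${1:-/usr/local/bin}"
    for candidate in "$root"/tensorflow-lite-*/tools/benchmark_model; do
        # An unmatched glob stays literal, so check that the file is executable.
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "benchmark_model not found under $root" >&2
    return 1
}
```

A typical use would be BENCH=$(find_benchmark_model) && "$BENCH" --help, which prints the list of supported flags.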

The benchmark_model application accepts the following input parameters:

usage: ./benchmark_model <flags>
Flags:
        --num_runs=50                           int32   optional        expected number of runs, see also min_secs, max_secs
        --min_secs=1                            float   optional        minimum number of seconds to rerun for, potentially s
        --max_secs=150                          float   optional        maximum number of seconds to rerun for, potentially .
        --run_delay=-1                          float   optional        delay between runs in seconds
        --run_frequency=-1                      float   optional        Execute at a fixed frequency, instead of a fixed del.
        --num_threads=-1                        int32   optional        number of threads
        --use_caching=false                     bool    optional        Enable caching of prepacked weights matrices in matr.
        --benchmark_name=                       string  optional        benchmark name
        --output_prefix=                        string  optional        benchmark output prefix
        --warmup_runs=1                         int32   optional        minimum number of runs performed on initialization, s
        --warmup_min_secs=0.5                   float   optional        minimum number of seconds to rerun for, potentially s
        --verbose=false                         bool    optional        Whether to log parameters whose values are not set. .
        --dry_run=false                         bool    optional        Whether to run the tool just with simply loading the.
        --report_peak_memory_footprint=false    bool    optional        Report the peak memory footprint by periodically che.
        --memory_footprint_check_interval_ms=50 int32   optional        The interval in millisecond between two consecutive .
        --graph=                                string  optional        graph file name
        --input_layer=                          string  optional        input layer names
        --input_layer_shape=                    string  optional        input layer shape
        --input_layer_value_range=              string  optional        A map-like string representing value range for *inte4
        --input_layer_value_files=              string  optional        A map-like string representing value file. Each item.
        --allow_fp16=false                      bool    optional        allow fp16
        --require_full_delegation=false         bool    optional        require delegate to run the entire graph
        --enable_op_profiling=false             bool    optional        enable op profiling
        --max_profiling_buffer_entries=1024     int32   optional        max profiling buffer entries
        --profiling_output_csv_file=            string  optional        File path to export profile data as CSV, if not set .
        --print_preinvoke_state=false           bool    optional        print out the interpreter internals just before call.
        --print_postinvoke_state=false          bool    optional        print out the interpreter internals just before benc.
        --release_dynamic_tensors=false         bool    optional        Ensure dynamic tensor's memory is released when they.
        --help=false                            bool    optional        Print out all supported flags if true.
        --num_threads=-1                        int32   optional        number of threads used for inference on CPU.
        --max_delegated_partitions=0            int32   optional        Max number of partitions to be delegated.
        --min_nodes_per_partition=0             int32   optional        The minimal number of TFLite graph nodes of a partit.
        --delegate_serialize_dir=               string  optional        Directory to be used by delegates for serializing an.
        --delegate_serialize_token=             string  optional        Model-specific token acting as a namespace for deleg.
        --external_delegate_path=               string  optional        The library path for the underlying external.
        --external_delegate_options=            string  optional        A list of comma-separated options to be passed to th.
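
As an illustration of how these flags combine, the sketch below wraps benchmark_model in a function that requests more runs and more warm-up runs than the defaults and enables peak memory reporting. Only flags from the listing above are used. The function name run_long_benchmark, the BENCHMARK_MODEL variable, and the chosen values are assumptions made for this example, not recommendations.

```shell
#!/bin/sh
# Hypothetical wrapper: run benchmark_model on one model with a longer
# measurement than the defaults. The values below are illustrative only.
# BENCHMARK_MODEL must hold the path to the benchmark_model binary.
run_long_benchmark() {
    model="$1"
    shift
    "${BENCHMARK_MODEL:?set BENCHMARK_MODEL to the benchmark_model path}" \
        --graph="$model" \
        --num_runs=200 \
        --warmup_runs=5 \
        --report_peak_memory_footprint=true \
        "$@"
}
```

Any extra arguments are passed through unchanged, so a call such as run_long_benchmark model.tflite --num_threads=2 appends --num_threads=2 to the fixed flags.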

2.2. Testing with MobileNet V1

The model used for testing is mobilenet_v1_0.5_128_quant.tflite, an image classification model downloaded from TensorFlow Hub[1].
On the target, the model is located here:

/usr/local/demo-ai/image-classification/models/mobilenet/

2.2.1. Benchmark on NPU

Several types of delegation are possible with this benchmark to improve performance. This section shows how to run the benchmark with NPU acceleration.

To use NPU acceleration, pass an option that lets the benchmark delegate the execution of the neural network; here the operations are delegated to the VX delegate. The option is --external_delegate_path=/usr/lib/libvx_delegate.so.2, which gives the following command:

 /usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --num_threads=2 --external_delegate_path=/usr/lib/libvx_delegate.so.2

Information
When using the NPU, the first (warm-up) inference can take much longer than the following ones, depending on the model used. In the output below it takes about 8.4 s, compared with an average of about 1.8 ms afterwards.

Console output:

STARTING!
Log parameter values verbosely: [0]
Num threads: [2]
Graph: [/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
#threads used for CPU inference: [2]
External delegate path: [/usr/lib/libvx_delegate.so.2]
Loaded model /usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
EXTERNAL delegate created.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 1.36451
Initialized session in 432.337ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
W [HandleLayoutInfer:287]Op 162: default layout inference pass.
count=1 curr=8364808

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=562 first=1906 curr=1760 min=1735 max=2316 avg=1765.5 std=25

Inference timings in us: Init: 432337, First inference: 8364808, Warmup (avg): 8.36481e+06, Inference (avg): 1765.5
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=7.97266 overall=47.6914
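
When several runs need to be compared, the figure of interest is usually the "Inference (avg)" value on the "Inference timings in us" line. The sketch below shows one possible way to extract it from saved output and convert it to inferences per second. The helper names inference_avg_us and us_to_ips are hypothetical, and the parsing assumes the line format shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read benchmark_model output on stdin and print the
# average inference time in microseconds, taken from the summary line
# "Inference timings in us: ... Inference (avg): <value>".
inference_avg_us() {
    sed -n 's/^Inference timings in us:.*Inference (avg): *\([0-9.e+]*\).*$/\1/p'
}

# Hypothetical helper: convert an average latency in microseconds into
# inferences per second, rounded to one decimal place.
us_to_ips() {
    awk -v us="$1" 'BEGIN { printf "%.1f\n", 1000000 / us }'
}
```

Applied to the NPU output above, inference_avg_us prints 1765.5, and us_to_ips 1765.5 prints 566.4, that is, roughly 566 inferences per second for this model.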

2.2.2. Benchmark on CPU

The simplest way to use the benchmark is to run it on the CPU.

At a minimum, the benchmark must be run with the --graph option. Setting --num_threads to the number of CPU cores can also improve performance. The command is:

 /usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --num_threads=2
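
To see how much the thread count actually helps on a given board, the same command can be repeated for each value of --num_threads. The sketch below is one possible way to do this. The function name sweep_threads and the BENCHMARK_MODEL variable are hypothetical, and the parsing assumes the "Inference timings in us" summary line printed by the tool.

```shell
#!/bin/sh
# Hypothetical sketch: run the benchmark once per thread count (1..max) and
# print "<threads> <average inference time in us>" for each run.
# BENCHMARK_MODEL must hold the path to the benchmark_model binary.
sweep_threads() {
    model="$1"
    max_threads="$2"
    threads=1
    while [ "$threads" -le "$max_threads" ]; do
        avg=$("$BENCHMARK_MODEL" --graph="$model" --num_threads="$threads" 2>&1 |
            sed -n 's/^Inference timings in us:.*Inference (avg): *\([0-9.e+]*\).*$/\1/p')
        printf '%s %s\n' "$threads" "$avg"
        threads=$((threads + 1))
    done
}
```

For example, with BENCHMARK_MODEL set to the benchmark_model path, sweep_threads /usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite 2 prints one line for one thread and one line for two threads.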

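Console output (the parameter dump and profiling tables show that this run was captured with --verbose=true and --enable_op_profiling=true added to the command above):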

STARTING!
Log parameter values verbosely: [1]
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Number of prorated runs per second: [-1]
Num threads: [2]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Run w/o invoking kernels: [0]
Report the peak memory footprint: [0]
Memory footprint check interval (ms): [50]
Graph: [/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input value files: []
Allow fp16: [0]
Require full delegation: [0]
Enable op profiling: [1]
Max initial profiling buffer entries: [1024]
Allow dynamic increase on profiling buffer entries: [0]
CSV File to export profiling data to: []
Print pre-invoke interpreter state: [0]
Print post-invoke interpreter state: [0]
Release dynamic tensor memory: [0]
Optimize memory usage for large tensors: [0]
Disable delegate clustering: [0]
File path to export outputs layer to: []
print out all supported flags: [0]
#threads used for CPU inference: [2]
Max number of delegated partitions: [0]
Min nodes per partition: [0]
Directory for delegate serialization: []
Model-specific token/key for delegate serialization.: []
Use xnnpack: [0]
External delegate path: []
External delegate options: []
Loaded model /usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
The input model file size (MB): 1.36451
Initialized session in 36.021ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=38 first=16929 curr=13525 min=12995 max=20882 avg=13453.9 std=1369

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=73 first=13603 curr=13097 min=13020 max=19314 avg=13439.5 std=952

Inference timings in us: Init: 36021, First inference: 16929, Warmup (avg): 13453.9, Inference (avg): 13439.5
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=7.54688 overall=8.51953
Profiling Info for Benchmark Initialization:
============================== Run Order ==============================
	             [node type]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	 ModifyGraphWithDelegate	   31.231	   31.231	 99.111%	 99.111%	  2636.000	        1	ModifyGraphWithDelegate/0
	         AllocateTensors	    0.280	    0.280	  0.889%	100.000%	     0.000	        1	AllocateTensors/0

============================== Top by Computation Time ==============================
	             [node type]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	 ModifyGraphWithDelegate	   31.231	   31.231	 99.111%	 99.111%	  2636.000	        1	ModifyGraphWithDelegate/0
	         AllocateTensors	    0.280	    0.280	  0.889%	100.000%	     0.000	        1	AllocateTensors/0

Number of nodes executed: 2
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	 ModifyGraphWithDelegate	        1	    31.231	    99.111%	    99.111%	  2636.000	        1
	         AllocateTensors	        1	     0.280	     0.889%	   100.000%	     0.000	        1

Timings (microseconds): count=1 curr=31511
Memory (bytes): count=0
2 nodes observed



Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
	             [node type]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	                 SOFTMAX	    0.026	    0.027	  0.202%	  0.202%	     0.000	        1	[MobilenetV1/Predictions/Reshape_1]:30
	                 RESHAPE	    0.004	    0.004	  0.032%	  0.234%	     0.000	        1	[MobilenetV1/Logits/SpatialSqueeze]:29
	   TfLiteXNNPackDelegate	    0.254	    0.279	  2.080%	  2.314%	     0.000	        1	[MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd]:32
	         AVERAGE_POOL_2D	    0.028	    0.026	  0.195%	  2.509%	     0.000	        1	[MobilenetV1/Logits/AvgPool_1a/AvgPool]:27
	   TfLiteXNNPackDelegate	   13.241	   13.054	 97.491%	100.000%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]:31

============================== Top by Computation Time ==============================
	             [node type]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	   TfLiteXNNPackDelegate	   13.241	   13.054	 97.491%	 97.491%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]:31
	   TfLiteXNNPackDelegate	    0.254	    0.279	  2.080%	 99.571%	     0.000	        1	[MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd]:32
	                 SOFTMAX	    0.026	    0.027	  0.202%	 99.773%	     0.000	        1	[MobilenetV1/Predictions/Reshape_1]:30
	         AVERAGE_POOL_2D	    0.028	    0.026	  0.195%	 99.968%	     0.000	        1	[MobilenetV1/Logits/AvgPool_1a/AvgPool]:27
	                 RESHAPE	    0.004	    0.004	  0.032%	100.000%	     0.000	        1	[MobilenetV1/Logits/SpatialSqueeze]:29

Number of nodes executed: 5
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	   TfLiteXNNPackDelegate	        2	    13.332	    99.574%	    99.574%	     0.000	        2
	                 SOFTMAX	        1	     0.027	     0.202%	    99.776%	     0.000	        1
	         AVERAGE_POOL_2D	        1	     0.026	     0.194%	    99.970%	     0.000	        1
	                 RESHAPE	        1	     0.004	     0.030%	   100.000%	     0.000	        1

Timings (microseconds): count=73 first=13553 curr=13049 min=12971 max=19265 avg=13390.3 std=953
Memory (bytes): count=0
5 nodes observed

Delegate internal: 
============================== Run Order ==============================
	             [node type]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	        DelegateOpInvoke	    0.752	    0.699	  5.287%	  5.287%	     0.000	        1	Delegate/Convolution (NHWC, QU8) IGEMM:0
	        DelegateOpInvoke	    0.515	    0.484	  3.659%	  8.946%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:1
	        DelegateOpInvoke	    0.625	    0.631	  4.777%	 13.723%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:2
	        DelegateOpInvoke	    0.222	    0.245	  1.854%	 15.577%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:3
	        DelegateOpInvoke	    0.589	    0.529	  4.006%	 19.583%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:4
	        DelegateOpInvoke	    0.392	    0.397	  3.001%	 22.584%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:5
	        DelegateOpInvoke	    0.954	    0.985	  7.449%	 30.033%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:6
	        DelegateOpInvoke	    0.104	    0.109	  0.823%	 30.856%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:7
	        DelegateOpInvoke	    0.477	    0.486	  3.675%	 34.531%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:8
	        DelegateOpInvoke	    0.191	    0.193	  1.464%	 35.995%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:9
	        DelegateOpInvoke	    1.159	    0.931	  7.043%	 43.038%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:10
	        DelegateOpInvoke	    0.055	    0.056	  0.423%	 43.462%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:11
	        DelegateOpInvoke	    0.484	    0.536	  4.052%	 47.514%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:12
	        DelegateOpInvoke	    0.094	    0.096	  0.726%	 48.240%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:13
	        DelegateOpInvoke	    1.017	    0.930	  7.037%	 55.277%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:14
	        DelegateOpInvoke	    0.096	    0.096	  0.728%	 56.005%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:15
	        DelegateOpInvoke	    0.904	    0.956	  7.232%	 63.237%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:16
	        DelegateOpInvoke	    0.094	    0.098	  0.743%	 63.980%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:17
	        DelegateOpInvoke	    0.231	    0.259	  1.959%	 65.939%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:0
	        DelegateOpInvoke	    0.902	    0.942	  7.131%	 73.069%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:18
	        DelegateOpInvoke	    0.095	    0.096	  0.727%	 73.797%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:19
	        DelegateOpInvoke	    0.930	    0.937	  7.086%	 80.882%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:20
	        DelegateOpInvoke	    0.097	    0.096	  0.730%	 81.612%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:21
	        DelegateOpInvoke	    0.921	    0.921	  6.972%	 88.584%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:22
	        DelegateOpInvoke	    0.028	    0.028	  0.214%	 88.797%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:23
	        DelegateOpInvoke	    0.462	    0.493	  3.727%	 92.525%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:24
	        DelegateOpInvoke	    0.050	    0.050	  0.380%	 92.904%	     0.000	        1	Delegate/Convolution (NHWC, QU8) DWConv:25
	        DelegateOpInvoke	    0.911	    0.938	  7.096%	100.000%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:26

============================== Top by Computation Time ==============================
	             [node type]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	        DelegateOpInvoke	    0.954	    0.985	  7.449%	  7.449%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:6
	        DelegateOpInvoke	    0.904	    0.956	  7.232%	 14.681%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:16
	        DelegateOpInvoke	    0.902	    0.942	  7.131%	 21.812%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:18
	        DelegateOpInvoke	    0.911	    0.938	  7.096%	 28.907%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:26
	        DelegateOpInvoke	    0.930	    0.937	  7.086%	 35.993%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:20
	        DelegateOpInvoke	    1.159	    0.931	  7.043%	 43.036%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:10
	        DelegateOpInvoke	    1.017	    0.930	  7.037%	 50.073%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:14
	        DelegateOpInvoke	    0.921	    0.921	  6.972%	 57.045%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:22
	        DelegateOpInvoke	    0.752	    0.699	  5.287%	 62.332%	     0.000	        1	Delegate/Convolution (NHWC, QU8) IGEMM:0
	        DelegateOpInvoke	    0.625	    0.631	  4.777%	 67.110%	     0.000	        1	Delegate/Convolution (NHWC, QU8) GEMM:2

Number of nodes executed: 28
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	        DelegateOpInvoke	       28	    13.203	   100.000%	   100.000%	     0.000	       28

Timings (microseconds): count=73 first=13351 curr=12879 min=12802 max=19094 avg=13216.7 std=952
Memory (bytes): count=0
28 nodes observed
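
The profiling output is long, and the "Summary by node type" tables are often the quickest part to read. The sketch below shows one possible way to reduce saved output to those rows, printing each node type with its average time in milliseconds. The helper name node_type_summary is hypothetical, and the parsing assumes the table layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read benchmark_model output on stdin and print
# "<node type> <avg ms>" for every row of each "Summary by node type" table.
node_type_summary() {
    awk '
        /Summary by node type/        { in_table = 1; next }   # table starts
        in_table && /^[[:space:]]*$/  { in_table = 0 }         # blank line ends it
        in_table && !/\[Node type\]/  { print $1, $3 }         # skip column header
    '
}
```

Applied to the output above, the helper prints the ModifyGraphWithDelegate and AllocateTensors rows from the initialization table first, then the rows of the two regular-run tables.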

2.2.3. Benchmark on GPU

This section shows how to run the benchmark with GPU acceleration.

The procedure is similar to the NPU one, except that an environment variable must be exported to force the use of the GPU only. First, export the following environment variable:

 export VIV_VX_DISABLE_TP_NN=1
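
An exported variable stays set for the rest of the shell session, so later runs with the VX delegate in the same shell would also be restricted to the GPU until the variable is unset. As an alternative, the sketch below sets the variable for a single command only. The wrapper name run_gpu_only is hypothetical, and the sketch assumes that the variable only needs to be present in the environment of the benchmark process.

```shell
#!/bin/sh
# Hypothetical wrapper: run any command with VIV_VX_DISABLE_TP_NN=1 set for
# that command only, leaving the calling shell's environment unchanged.
run_gpu_only() {
    env VIV_VX_DISABLE_TP_NN=1 "$@"
}
```

The benchmark command is then passed as arguments, for example run_gpu_only /usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model followed by the usual flags.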

Then, run the command:

 /usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --num_threads=2 --external_delegate_path=/usr/lib/libvx_delegate.so.2

Console output:

STARTING!
Log parameter values verbosely: [0]
Num threads: [2]
Graph: [/usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
#threads used for CPU inference: [2]
External delegate path: [/usr/lib/libvx_delegate.so.2]
Loaded model /usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
EXTERNAL delegate created.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 1.36451
Initialized session in 21.586ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
W [HandleLayoutInfer:287]Op 162: default layout inference pass.
count=1 curr=1050067

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=86 first=12273 curr=11654 min=11582 max=12273 avg=11671 std=103

Inference timings in us: Init: 21586, First inference: 1050067, Warmup (avg): 1.05007e+06, Inference (avg): 11671
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=9.94922 overall=80.4414
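
The three runs in this article can be compared by dividing the CPU average inference time by that of each accelerated run. The sketch below does this arithmetic with the averages copied from the console outputs above; the helper name speedup is hypothetical, and the resulting ratios apply only to this model and these particular runs.

```shell
#!/bin/sh
# Hypothetical helper: print how many times faster a candidate run is than a
# baseline run, given both average inference times in microseconds.
speedup() {
    awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.2f\n", base / cand }'
}

# Average inference times (us) copied from the console outputs in this article.
CPU_US=13439.5
NPU_US=1765.5
GPU_US=11671

printf 'NPU vs CPU: %sx\n' "$(speedup "$CPU_US" "$NPU_US")"
printf 'GPU vs CPU: %sx\n' "$(speedup "$CPU_US" "$GPU_US")"
```

With these figures the script reports 7.61x for the NPU and 1.15x for the GPU relative to the two-thread CPU run.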

3. References