How to measure performance of your NN models using TensorFlow Lite runtime

Applicable for

STM32MP13x lines, STM32MP15x lines, STM32MP21x lines, STM32MP23x lines, STM32MP25x lines

This article describes how to measure the performance of a TensorFlow Lite neural network model on STM32 MPU platforms.

1. Installation

1.1. Installing from the OpenSTLinux AI package repository

Warning

The software package is provided AS IS, and by downloading it, you agree to be bound to the terms of the software license agreement (SLA0048). The detailed content licenses can be found here.

After configuring the AI OpenSTLinux package, install the X-LINUX-AI components required for this application. The minimum package required is:

x-linux-ai -i tensorflow-lite-tools

The model used in this example can be installed from the following package:

x-linux-ai -i img-models-mobilenetv2-10-224

2. How to use the Benchmark application

2.1. Executing with the command line

The benchmark_model C/C++ application is located in the userfs partition:

/usr/local/bin/tensorflow-lite-*/tools/benchmark_model

It accepts the following input parameters:

usage: ./benchmark_model <flags>
Flags:
        --num_runs=50                           int32   optional        expected number of runs, see also min_secs, max_secs
        --min_secs=1                            float   optional        minimum number of seconds to rerun for, potentially s
        --max_secs=150                          float   optional        maximum number of seconds to rerun for, potentially .
        --run_delay=-1                          float   optional        delay between runs in seconds
        --run_frequency=-1                      float   optional        Execute at a fixed frequency, instead of a fixed del.
        --num_threads=-1                        int32   optional        number of threads
        --use_caching=false                     bool    optional        Enable caching of prepacked weights matrices in matr.
        --benchmark_name=                       string  optional        benchmark name
        --output_prefix=                        string  optional        benchmark output prefix
        --warmup_runs=1                         int32   optional        minimum number of runs performed on initialization, s
        --warmup_min_secs=0.5                   float   optional        minimum number of seconds to rerun for, potentially s
        --verbose=false                         bool    optional        Whether to log parameters whose values are not set. .
        --dry_run=false                         bool    optional        Whether to run the tool just with simply loading the.
        --report_peak_memory_footprint=false    bool    optional        Report the peak memory footprint by periodically che.
        --memory_footprint_check_interval_ms=50 int32   optional        The interval in millisecond between two consecutive .
        --graph=                                string  optional        graph file name
        --input_layer=                          string  optional        input layer names
        --input_layer_shape=                    string  optional        input layer shape
        --input_layer_value_range=              string  optional        A map-like string representing value range for *inte4
        --input_layer_value_files=              string  optional        A map-like string representing value file. Each item.
        --allow_fp16=false                      bool    optional        allow fp16
        --require_full_delegation=false         bool    optional        require delegate to run the entire graph
        --enable_op_profiling=false             bool    optional        enable op profiling
        --max_profiling_buffer_entries=1024     int32   optional        max profiling buffer entries
        --profiling_output_csv_file=            string  optional        File path to export profile data as CSV, if not set .
        --print_preinvoke_state=false           bool    optional        print out the interpreter internals just before call.
        --print_postinvoke_state=false          bool    optional        print out the interpreter internals just before benc.
        --release_dynamic_tensors=false         bool    optional        Ensure dynamic tensor's memory is released when they.
        --help=false                            bool    optional        Print out all supported flags if true.
        --num_threads=-1                        int32   optional        number of threads used for inference on CPU.
        --max_delegated_partitions=0            int32   optional        Max number of partitions to be delegated.
        --min_nodes_per_partition=0             int32   optional        The minimal number of TFLite graph nodes of a partit.
        --delegate_serialize_dir=               string  optional        Directory to be used by delegates for serializing an.
        --delegate_serialize_token=             string  optional        Model-specific token acting as a namespace for deleg.
        --external_delegate_path=               string  optional        The library path for the underlying external.
        --external_delegate_options=            string  optional        A list of comma-separated options to be passed to th.
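
When the same benchmark has to be launched with several flag combinations, the command line can be assembled programmatically. The following is a minimal Python sketch, assuming Python 3 is available on the target. The helper names find_benchmark and build_command are hypothetical and are not part of the X-LINUX-AI package; only flags that appear in the listing above are used.

```python
import glob

# Path pattern taken from this article; the wildcard stands for the installed version.
BENCHMARK_GLOB = "/usr/local/bin/tensorflow-lite-*/tools/benchmark_model"


def find_benchmark(pattern=BENCHMARK_GLOB):
    """Return the first path matching the pattern, or None if nothing matches."""
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


def build_command(binary, flags):
    """Turn a {flag_name: value} dict into an argument list of --name=value items."""
    argv = [binary]
    for name, value in flags.items():
        if isinstance(value, bool):
            # The flag listing shows booleans written as true/false.
            value = "true" if value else "false"
        argv.append(f"--{name}={value}")
    return argv
```

The resulting list can be passed to subprocess.run, for example build_command(find_benchmark(), {"graph": "model.tflite", "num_threads": 2}), where model.tflite is a placeholder file name.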

2.2. Testing with MobileNet

The model used for testing is mobilenet_v2_1.0_224_int8_per_tensor.tflite, an image classification model downloaded from the STM32 AI model zoo [1].
On the target, the model is located in:

/usr/local/x-linux-ai/image-classification/models/mobilenet/

The benchmark supports several types of delegation to improve performance.

2.2.1. Benchmark on NPU

This section shows how to run the benchmark with NPU acceleration.

To use NPU acceleration, the benchmark must delegate the execution of the neural network to an external delegate, here the VX delegate. To do so, add the option --external_delegate_path=/usr/lib/libvx_delegate.so.2, which gives the following command:

/usr/local/bin/tensorflow-lite-*/tools/benchmark_model --graph=/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite --num_threads=2 --external_delegate_path=/usr/lib/libvx_delegate.so.2

Information

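When using the NPU, the first inference includes a warm-up time that can be long, depending on the model used. In the console output below, the first inference takes about 21.8 s, whereas the following inferences take about 12.8 ms on average.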

Console output:

INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Num threads: [2]
INFO: Graph: [/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite]
INFO: #threads used for CPU inference: [2]
INFO: External delegate path: [/usr/lib/libvx_delegate.so.2]
INFO: Loaded model /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite
INFO: Vx delegate: allowed_cache_mode set to 0.
INFO: Vx delegate: device num set to 0.
INFO: Vx delegate: allowed_builtin_code set to 0.
INFO: Vx delegate: error_during_init set to 0.
INFO: Vx delegate: error_during_prepare set to 0.
INFO: Vx delegate: error_during_invoke set to 0.
INFO: EXTERNAL delegate created.
INFO: Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
INFO: The input model file size (MB): 3.59541
INFO: Initialized session in 348.762ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
INFO: count=1 curr=21792293

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=78 first=12990 curr=12820 min=12632 max=13711 avg=12758.7 std=133

INFO: Inference timings in us: Init: 348762, First inference: 21792293, Warmup (avg): 2.17923e+07, Inference (avg): 12758.7
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=10.625 overall=115.172
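
To post-process a benchmark run, the final "Inference timings in us" line can be parsed into named values. The following is a minimal Python sketch; the function name parse_summary is hypothetical, and the sample line is copied from the console output above. It assumes the summary line keeps the "label: value" layout shown in this article.

```python
SUMMARY_PREFIX = "Inference timings in us:"


def parse_summary(line):
    """Parse the summary line into a {label: microseconds} dictionary."""
    _, _, body = line.partition(SUMMARY_PREFIX)
    if not body:
        raise ValueError("not an 'Inference timings' summary line")
    timings = {}
    for item in body.split(","):
        # Each item looks like ' First inference: 21792293'.
        label, _, value = item.rpartition(":")
        timings[label.strip()] = float(value)
    return timings


# Sample line copied from the NPU console output above.
line = ("INFO: Inference timings in us: Init: 348762, First inference: 21792293, "
        "Warmup (avg): 2.17923e+07, Inference (avg): 12758.7")
timings = parse_summary(line)
print(f"first inference: {timings['First inference'] / 1e6:.2f} s")
print(f"average inference: {timings['Inference (avg)'] / 1e3:.2f} ms")
```

Converting to seconds and milliseconds makes the gap between the warm-up inference and the steady-state inferences easier to read.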

2.2.2. Benchmark on CPU

The simplest way to use the benchmark is to run it on the CPU.

Only the --graph option is required. Setting --num_threads to the number of CPU cores can also improve performance. The command to execute is:

/usr/local/bin/tensorflow-lite-*/tools/benchmark_model --graph=/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite --num_threads=2

Console output:

STARTING!
Log parameter values verbosely: [0]
Num threads: [2]
Graph: [/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite]
#threads used for CPU inference: [2]
Loaded model /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
The input model file size (MB): 3.59541
Initialized session in 273.952ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=5 first=133187 curr=119112 min=119112 max=133187 avg=122056 std=5566

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=120156 curr=119232 min=119081 max=128264 avg=119760 std=1422

Inference timings in us: Init: 273952, First inference: 133187, Warmup (avg): 122056, Inference (avg): 119760
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=13.4102 overall=19.6641
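
The "count=... avg=... std=..." line can be turned into a throughput figure and a relative jitter. The following is a minimal Python sketch; the function names parse_stats and inferences_per_second are hypothetical, and the sample line is copied from the CPU console output above. It assumes the values on that line are in microseconds, which matches the "Inference timings in us" summary printed for the same run.

```python
import re


def parse_stats(line):
    """Extract key=value pairs such as count=50 or avg=119760 as floats."""
    return {key: float(value)
            for key, value in re.findall(r"(\w+)=([0-9.eE+-]+)", line)}


def inferences_per_second(avg_us):
    """Convert an average latency in microseconds into inferences per second."""
    return 1e6 / avg_us


# Sample line copied from the CPU console output above.
stats = parse_stats("count=50 first=120156 curr=119232 min=119081 "
                    "max=128264 avg=119760 std=1422")
print(f"throughput: {inferences_per_second(stats['avg']):.2f} inferences/s")
print(f"relative jitter: {100 * stats['std'] / stats['avg']:.2f} %")
```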

2.2.3. Benchmark on GPU

This section shows how to run the benchmark with GPU acceleration.

The procedure is similar to the NPU one, except that an environment variable must be exported to force the use of the GPU only. First, export the following environment variable:

export VIV_VX_DISABLE_TP_NN=1

Then, run the command:

/usr/local/bin/tensorflow-lite-*/tools/benchmark_model --graph=/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite --num_threads=2 --external_delegate_path=/usr/lib/libvx_delegate.so.2
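
An exported variable stays set for the rest of the shell session, so a later NPU benchmark launched from the same shell would also run with VIV_VX_DISABLE_TP_NN=1 unless the variable is unset. One way to limit the setting to a single run is to pass it only to the child process. The following is a minimal Python sketch; the helper names gpu_only_env and run_gpu_benchmark are hypothetical.

```python
import os
import subprocess


def gpu_only_env(base=None):
    """Return a copy of the environment with VIV_VX_DISABLE_TP_NN set to 1."""
    env = dict(os.environ if base is None else base)
    env["VIV_VX_DISABLE_TP_NN"] = "1"
    return env


def run_gpu_benchmark(argv):
    """Run a benchmark_model command line with the GPU-only setting.

    The variable is visible to the child process only; the calling
    environment is left unchanged.
    """
    return subprocess.run(argv, env=gpu_only_env(), check=True)
```

Here argv is the benchmark command above, split into a list of arguments. The console output that follows corresponds to the shell command above.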

Console output:

INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Num threads: [2]
INFO: Graph: [/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite]
INFO: #threads used for CPU inference: [2]
INFO: External delegate path: [/usr/lib/libvx_delegate.so.2]
INFO: Loaded model /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite
INFO: Vx delegate: allowed_cache_mode set to 0.
INFO: Vx delegate: device num set to 0.
INFO: Vx delegate: allowed_builtin_code set to 0.
INFO: Vx delegate: error_during_init set to 0.
INFO: Vx delegate: error_during_prepare set to 0.
INFO: Vx delegate: error_during_invoke set to 0.
INFO: EXTERNAL delegate created.
INFO: Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
INFO: The input model file size (MB): 3.59541
INFO: Initialized session in 31.554ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
INFO: count=1 curr=2296912

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=50 first=72759 curr=72012 min=71791 max=72759 avg=72064.3 std=152

INFO: Inference timings in us: Init: 31554, First inference: 2296912, Warmup (avg): 2.29691e+06, Inference (avg): 72064.3
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=10.75 overall=143.371
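
The three runs can be compared by taking the CPU run as a baseline. The following is a minimal Python sketch; the function name speedups is hypothetical, and the figures are the "Inference (avg)" and "overall" memory values copied from the three console outputs in this article. They apply to this model and these runs only and are not a general statement about the hardware.

```python
# Average inference time in microseconds, copied from the console outputs.
AVG_US = {"CPU": 119760.0, "GPU": 72064.3, "NPU": 12758.7}

# Overall memory footprint delta in MB, copied from the same outputs.
OVERALL_MB = {"CPU": 19.6641, "GPU": 143.371, "NPU": 115.172}


def speedups(avg_us, baseline="CPU"):
    """Return how many times faster each entry is than the baseline."""
    base = avg_us[baseline]
    return {name: base / value for name, value in avg_us.items()}


factors = speedups(AVG_US)
for name in sorted(AVG_US, key=AVG_US.get, reverse=True):
    print(f"{name}: {AVG_US[name] / 1000:.1f} ms per inference, "
          f"{factors[name]:.2f}x vs CPU, {OVERALL_MB[name]:.1f} MB overall")
```

In these runs, the delegated executions are faster than the CPU run but report a larger overall memory footprint delta.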

3. References

[1] STM32 AI model zoo