This article describes how to measure the performance of an ONNX model using ONNX Runtime on an STM32MP1xxx platform.
1. Installation
1.1. Installing from the OpenSTLinux AI package repository
After configuring the AI OpenSTLinux package repository, install the X-LINUX-AI components for this application. The minimum required package is:
apt-get install onnxruntime-tools
The model used in this example can be installed from the following package:
apt-get install onnx-models-mobilenet
2. How to use the benchmark application
2.1. Executing with the command line
The onnxruntime_perf_test executable is located in the userfs partition:
/usr/local/bin/onnxruntime-x.x.x/tools/onnxruntime_perf_test
It accepts the following input parameters:
usage: ./onnxruntime_perf_test [options...] model_path [result_file]
Options:
        -m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'. Provide 'duration' to run the test for a fixed duration, and 'times' to repeat it a certain number of times.
        -M: Disable memory pattern.
        -A: Disable memory arena.
        -I: Generate tensor input binding (free dimensions are treated as 1).
        -c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default:1.
        -r [repeated_times]: Specifies the number of repetitions if running in 'times' test mode. Default:1000.
        -t [seconds_to_run]: Specifies the seconds to run for 'duration' mode. Default:600.
        -p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
        -s: Show statistics result, like P75, P90. If no result_file is provided, this defaults to on.
        -v: Show verbose information.
        -x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes. A value of 0 means ORT will pick a default. Must be >=0.
        -y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes). A value of 0 means ORT will pick a default. Must be >=0.
        -f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must be > 0.
        -F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must be > 0.
        -P: Use parallel executor instead of sequential executor.
        -o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all). Please see onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of all optimization levels.
        -u [optimized_model_path]: Specify the optimized model path for saving.
        -z: Set denormal as zero. When turning on this option reduces latency dramatically, a model may have denormals.
        -h: help
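The options above can also be assembled programmatically, for example when scripting several benchmark runs from a host. The following is a minimal sketch; the `build_perf_cmd` helper is hypothetical (not part of ONNX Runtime), and the tool path assumes the 1.11.0 layout shown in this article.

```python
# Hypothetical helper: assemble an onnxruntime_perf_test command line.
# The default tool path is an assumption matching the layout on the target.
def build_perf_cmd(model_path, mode="times", runs=8, parallel=False,
                   intra_threads=None, inter_threads=None,
                   tool="/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test"):
    """Return the argv list for one benchmark invocation."""
    cmd = [tool, "-I", "-m", mode]          # -I: generate tensor input binding
    if mode == "times":
        cmd += ["-r", str(runs)]            # repetition count in 'times' mode
    if parallel:
        cmd.append("-P")                    # parallel executor
    if intra_threads is not None:
        cmd += ["-x", str(intra_threads)]   # threads within a node
    if inter_threads is not None:
        cmd += ["-y", str(inter_threads)]   # threads across nodes
    cmd.append(model_path)                  # model path comes last
    return cmd

cmd = build_perf_cmd(
    "/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx",
    runs=8, parallel=True, intra_threads=2, inter_threads=2)
print(" ".join(cmd))
```

The returned list can be passed directly to `subprocess.run()` on the target, avoiding shell-quoting issues.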
2.2. Testing with MobileNet V1
The model used for testing is mobilenet_v1_0.5_128_quant.onnx, installed by the onnx-models-mobilenet package. It is an image classification model.
On the target, the model is located here:
/usr/local/demo-ai/computer-vision/models/mobilenet/
To benchmark an ONNX model with onnxruntime_perf_test, use the following command:
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx
Console output:
Session creation time cost: 0.294321 s
Total inference time cost: 0.867173 s
Total inference requests: 8
Average inference time cost: 108.397 ms
Total inference run time: 0.867324 s
Avg CPU usage: 50 %
Peak working set size: 18923520 bytes
Avg CPU usage:50
Peak working set size:18923520
Runs:8
Min Latency: 0.105907 s
Max Latency: 0.111885 s
P50 Latency: 0.108234 s
P90 Latency: 0.111885 s
P95 Latency: 0.111885 s
P99 Latency: 0.111885 s
P999 Latency: 0.111885 s
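When collecting many such runs, the latency summary can be extracted automatically. Below is a minimal sketch; `parse_latencies` is a hypothetical helper (not part of ONNX Runtime) that matches the `Min/Max/Pxx Latency` lines of the console output shown above.

```python
import re

# Hypothetical helper: extract the latency figures (in seconds) from
# onnxruntime_perf_test console output into a dict.
def parse_latencies(output):
    stats = {}
    for key, value in re.findall(r"(P\d+|Min|Max) Latency:\s*([\d.]+)\s*s", output):
        stats[key] = float(value)
    return stats

sample = """Min Latency: 0.105907 s
Max Latency: 0.111885 s
P50 Latency: 0.108234 s
P90 Latency: 0.111885 s"""
print(parse_latencies(sample))
# {'Min': 0.105907, 'Max': 0.111885, 'P50': 0.108234, 'P90': 0.111885}
```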
To obtain the best performance, it can be useful to add the flags -P -x 2 -y 2 so that the benchmark runs with more than one thread, depending on the hardware used.
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 2 /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx
Console output:
Setting intra_op_num_threads to 2
Setting inter_op_num_threads to 2
Session creation time cost: 0.325633 s
Total inference time cost: 0.516908 s
Total inference requests: 8
Average inference time cost: 64.6135 ms
Total inference run time: 0.517071 s
Avg CPU usage: 95 %
Peak working set size: 17842176 bytes
Avg CPU usage:95
Peak working set size:17842176
Runs:8
Min Latency: 0.0594675 s
Max Latency: 0.0781295 s
P50 Latency: 0.0617747 s
P90 Latency: 0.0781295 s
P95 Latency: 0.0781295 s
P99 Latency: 0.0781295 s
P999 Latency: 0.0781295 s
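Comparing the two runs gives a rough idea of the gain from threading on this dual-core platform: the single-threaded run averaged 108.397 ms per inference, the two-threaded run 64.6135 ms.

```python
# Speedup from the two benchmark runs reported above.
single_ms = 108.397   # average inference time, default (sequential) run
multi_ms = 64.6135    # average inference time, -P -x 2 -y 2 run
speedup = single_ms / multi_ms
print(f"speedup: {speedup:.2f}x")   # ~1.68x with two threads
```

The speedup is below the ideal 2x because parts of the graph do not parallelize and the executor adds some scheduling overhead.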
To display more information, use the flag -v.
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 2 -v /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx
Console output (excerpt):
Setting intra_op_num_threads to 2
Setting inter_op_num_threads to 2
2022-08-22 08:49:47.130287862 [I:onnxruntime:, inference_session.cc:324 operator()] Flush-to-zero and denormal-as-zero are off
2022-08-22 08:49:47.130961905 [I:onnxruntime:, inference_session.cc:331 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2022-08-22 08:49:47.131583198 [I:onnxruntime:, inference_session.cc:351 ConstructorCommon] Dynamic block base set to 0
2022-08-22 08:49:47.183344592 [I:onnxruntime:, inference_session.cc:1327 Initialize] Initializing session.
2022-08-22 08:49:47.183556550 [I:onnxruntime:, inference_session.cc:1364 Initialize] Adding default CPU execution provider.
2022-08-22 08:49:47.227434804 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
2022-08-22 08:49:47.248430720 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
...
2022-08-22 08:49:47.418089054 [V:onnxruntime:, inference_session.cc:150 VerifyEachNodeIsAssignedToAnEp] Node placements
2022-08-22 08:49:47.418275930 [V:onnxruntime:, inference_session.cc:152 VerifyEachNodeIsAssignedToAnEp] All nodes have been placed on [CPUExecutionProvider].
2022-08-22 08:49:47.421319603 [V:onnxruntime:, session_state.cc:68 CreateGraphInfo] SaveMLValueNameIndexMapping
2022-08-22 08:49:47.422994939 [V:onnxruntime:, session_state.cc:114 CreateGraphInfo] Done saving OrtValue mappings.
2022-08-22 08:49:47.425985278 [I:onnxruntime:, session_state_utils.cc:140 SaveInitializedTensors] Saving initialized tensors.
2022-08-22 08:49:47.436059673 [I:onnxruntime:, session_state_utils.cc:266 SaveInitializedTensors] Done saving initialized tensors
2022-08-22 08:49:47.546423807 [I:onnxruntime:, inference_session.cc:1576 Initialize] Session successfully initialized.
2022-08-22 08:49:47.548277894 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-22 08:49:47.613135022 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:1,time_cost:0.0597978
2022-08-22 08:49:47.673166849 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:2,time_cost:0.0580345
2022-08-22 08:49:47.731373797 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:3,time_cost:0.058048
2022-08-22 08:49:47.789770371 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:4,time_cost:0.0784632
2022-08-22 08:49:47.868311400 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:5,time_cost:0.0588315
2022-08-22 08:49:47.927296267 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:6,time_cost:0.0687991
2022-08-22 08:49:47.996246694 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:7,time_cost:0.0661833
2022-08-22 08:49:48.062668361 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:8,time_cost:0.058399
Session creation time cost: 0.418648 s
Total inference time cost: 0.506556 s
Total inference requests: 8
Average inference time cost: 63.3196 ms
Total inference run time: 0.507934 s
Avg CPU usage: 93 %
Peak working set size: 18747392 bytes
Avg CPU usage:93
Peak working set size:18747392
Runs:8
Min Latency: 0.0580345 s
Max Latency: 0.0784632 s
P50 Latency: 0.0597978 s
P90 Latency: 0.0784632 s
P95 Latency: 0.0784632 s
P99 Latency: 0.0784632 s
P999 Latency: 0.0784632 s
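The verbose log also exposes the per-iteration times via the `iteration:N,time_cost:T` lines, which can be useful for spotting outliers among individual runs. Below is a minimal sketch; `iteration_times_ms` is a hypothetical helper, not part of ONNX Runtime.

```python
import re

# Hypothetical helper: pull the per-iteration time_cost values (seconds)
# out of a verbose (-v) log and convert them to milliseconds.
def iteration_times_ms(log):
    return [float(t) * 1000.0
            for t in re.findall(r"time_cost:([\d.]+)", log)]

sample = ("iteration:1,time_cost:0.0597978 "
          "iteration:2,time_cost:0.0580345 "
          "iteration:3,time_cost:0.058048")
times = iteration_times_ms(sample)
print(f"mean: {sum(times) / len(times):.2f} ms")   # mean: 58.63 ms
```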
3. References