This article describes how to measure the performance of an ONNX model using ONNX Runtime on an STM32MP1xxx platform.
1. Installation
1.1. Installing from the OpenSTLinux AI package repository
After configuring the AI OpenSTLinux package repository, install the X-LINUX-AI components for this application. The minimum required package is:
apt-get install onnxruntime-tools
The model used in this example can be installed from the following package:
apt-get install onnx-models-mobilenet
2. How to use the benchmark application
2.1. Executing with the command line
The onnxruntime_perf_test executable is located in the userfs partition:
/usr/local/bin/onnxruntime-x.x.x/tools/onnxruntime_perf_test
It accepts the following input parameters:
usage: ./onnxruntime_perf_test [options...] model_path [result_file]
Options:
        -m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'. Provide 'duration' to run the test for a fixed duration, and 'times' to repeat it a certain number of times.
        -M: Disable memory pattern.
        -A: Disable memory arena.
        -I: Generate tensor input binding (free dimensions are treated as 1).
        -c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default:1.
        -r [repeated_times]: Specifies the number of repetitions if running in 'times' test mode. Default:1000.
        -t [seconds_to_run]: Specifies the seconds to run for 'duration' mode. Default:600.
        -p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
        -s: Show statistics result, like P75, P90. If no result_file is provided, this defaults to on.
        -v: Show verbose information.
        -x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes. A value of 0 means ORT will pick a default. Must be >=0.
        -y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes). A value of 0 means ORT will pick a default. Must be >=0.
        -f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must be > 0.
        -F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must be > 0.
        -P: Use parallel executor instead of sequential executor.
        -o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all). Please see onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of all optimization levels.
        -u [optimized_model_path]: Specify the optimized model path for saving.
        -z: Set denormal as zero. When turning on this option reduces latency dramatically, a model may have denormals.
        -h: help
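The options above can also be assembled programmatically, for example when scripting several benchmark runs from a host. The following is a minimal sketch; the `build_perf_cmd` helper is hypothetical (not part of ONNX Runtime), and the tool path assumes the 1.11.0 layout shown in this article.

```python
# Hypothetical helper: assemble an onnxruntime_perf_test command line.
# The default tool path is an assumption matching the layout on the target.
def build_perf_cmd(model_path, mode="times", runs=8, parallel=False,
                   intra_threads=None, inter_threads=None,
                   tool="/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test"):
    """Return the argv list for one benchmark invocation."""
    cmd = [tool, "-I", "-m", mode]          # -I: generate tensor input binding
    if mode == "times":
        cmd += ["-r", str(runs)]            # repetition count in 'times' mode
    if parallel:
        cmd.append("-P")                    # parallel executor
    if intra_threads is not None:
        cmd += ["-x", str(intra_threads)]   # threads within a node
    if inter_threads is not None:
        cmd += ["-y", str(inter_threads)]   # threads across nodes
    cmd.append(model_path)                  # model path comes last
    return cmd

cmd = build_perf_cmd(
    "/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx",
    runs=8, parallel=True, intra_threads=2, inter_threads=2)
print(" ".join(cmd))
```

The returned list can be passed directly to `subprocess.run()` on the target, avoiding shell-quoting issues.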
2.2. Testing with MobileNet V1
The model used for testing is mobilenet_v1_0.5_128_quant.onnx, installed by the onnx-models-mobilenet package. It is an image classification model.
On the target, the model is located here:
/usr/local/demo-ai/computer-vision/models/mobilenet/
To benchmark an ONNX model with onnxruntime_perf_test, use the following command:
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx
Console output:
Session creation time cost: 0.294321 s
Total inference time cost: 0.867173 s
Total inference requests: 8
Average inference time cost: 108.397 ms
Total inference run time: 0.867324 s
Avg CPU usage: 50 %
Peak working set size: 18923520 bytes
Avg CPU usage:50
Peak working set size:18923520
Runs:8
Min Latency: 0.105907 s
Max Latency: 0.111885 s
P50 Latency: 0.108234 s
P90 Latency: 0.111885 s
P95 Latency: 0.111885 s
P99 Latency: 0.111885 s
P999 Latency: 0.111885 s
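When collecting many such runs, the latency summary can be extracted automatically. Below is a minimal sketch; `parse_latencies` is a hypothetical helper (not part of ONNX Runtime) that matches the `Min/Max/Pxx Latency` lines of the console output shown above.

```python
import re

# Hypothetical helper: extract the latency figures (in seconds) from
# onnxruntime_perf_test console output into a dict.
def parse_latencies(output):
    stats = {}
    for key, value in re.findall(r"(P\d+|Min|Max) Latency:\s*([\d.]+)\s*s", output):
        stats[key] = float(value)
    return stats

sample = """Min Latency: 0.105907 s
Max Latency: 0.111885 s
P50 Latency: 0.108234 s
P90 Latency: 0.111885 s"""
print(parse_latencies(sample))
# {'Min': 0.105907, 'Max': 0.111885, 'P50': 0.108234, 'P90': 0.111885}
```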
To obtain the best performance, it can be useful to add the flags -P -x 2 -y 2 so that the benchmark runs with more than one thread, depending on the hardware used.
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 2 /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx
Console output:
Setting intra_op_num_threads to 2
Setting inter_op_num_threads to 2
Session creation time cost: 0.325633 s
Total inference time cost: 0.516908 s
Total inference requests: 8
Average inference time cost: 64.6135 ms
Total inference run time: 0.517071 s
Avg CPU usage: 95 %
Peak working set size: 17842176 bytes
Avg CPU usage:95
Peak working set size:17842176
Runs:8
Min Latency: 0.0594675 s
Max Latency: 0.0781295 s
P50 Latency: 0.0617747 s
P90 Latency: 0.0781295 s
P95 Latency: 0.0781295 s
P99 Latency: 0.0781295 s
P999 Latency: 0.0781295 s
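Comparing the two runs gives a rough idea of the gain from threading on this dual-core platform: the single-threaded run averaged 108.397 ms per inference, the two-threaded run 64.6135 ms.

```python
# Speedup from the two benchmark runs reported above.
single_ms = 108.397   # average inference time, default (sequential) run
multi_ms = 64.6135    # average inference time, -P -x 2 -y 2 run
speedup = single_ms / multi_ms
print(f"speedup: {speedup:.2f}x")   # ~1.68x with two threads
```

The speedup is below the ideal 2x because parts of the graph do not parallelize and the executor adds some scheduling overhead.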
To display more information, use the flag -v.
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 2 -v /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx
Console output (excerpt):
Setting intra_op_num_threads to 2
Setting inter_op_num_threads to 2
2022-08-22 08:49:47.130287862 [I:onnxruntime:, inference_session.cc:324 operator()] Flush-to-zero and denormal-as-zero are off
2022-08-22 08:49:47.130961905 [I:onnxruntime:, inference_session.cc:331 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2022-08-22 08:49:47.131583198 [I:onnxruntime:, inference_session.cc:351 ConstructorCommon] Dynamic block base set to 0
2022-08-22 08:49:47.183344592 [I:onnxruntime:, inference_session.cc:1327 Initialize] Initializing session.
2022-08-22 08:49:47.183556550 [I:onnxruntime:, inference_session.cc:1364 Initialize] Adding default CPU execution provider.
2022-08-22 08:49:47.227434804 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
2022-08-22 08:49:47.248430720 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
...
2022-08-22 08:49:47.418089054 [V:onnxruntime:, inference_session.cc:150 VerifyEachNodeIsAssignedToAnEp] Node placements
2022-08-22 08:49:47.418275930 [V:onnxruntime:, inference_session.cc:152 VerifyEachNodeIsAssignedToAnEp] All nodes have been placed on [CPUExecutionProvider].
2022-08-22 08:49:47.421319603 [V:onnxruntime:, session_state.cc:68 CreateGraphInfo] SaveMLValueNameIndexMapping
2022-08-22 08:49:47.422994939 [V:onnxruntime:, session_state.cc:114 CreateGraphInfo] Done saving OrtValue mappings.
2022-08-22 08:49:47.425985278 [I:onnxruntime:, session_state_utils.cc:140 SaveInitializedTensors] Saving initialized tensors.
2022-08-22 08:49:47.436059673 [I:onnxruntime:, session_state_utils.cc:266 SaveInitializedTensors] Done saving initialized tensors
2022-08-22 08:49:47.546423807 [I:onnxruntime:, inference_session.cc:1576 Initialize] Session successfully initialized.
2022-08-22 08:49:47.548277894 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-22 08:49:47.613135022 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:1,time_cost:0.0597978
2022-08-22 08:49:47.673166849 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:2,time_cost:0.0580345
2022-08-22 08:49:47.731373797 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:3,time_cost:0.058048
2022-08-22 08:49:47.789770371 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:4,time_cost:0.0784632
2022-08-22 08:49:47.868311400 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:5,time_cost:0.0588315
2022-08-22 08:49:47.927296267 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:6,time_cost:0.0687991
2022-08-22 08:49:47.996246694 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:7,time_cost:0.0661833
2022-08-22 08:49:48.062668361 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:8,time_cost:0.058399
Session creation time cost: 0.418648 s
Total inference time cost: 0.506556 s
Total inference requests: 8
Average inference time cost: 63.3196 ms
Total inference run time: 0.507934 s
Avg CPU usage: 93 %
Peak working set size: 18747392 bytes
Avg CPU usage:93
Peak working set size:18747392
Runs:8
Min Latency: 0.0580345 s
Max Latency: 0.0784632 s
P50 Latency: 0.0597978 s
P90 Latency: 0.0784632 s
P95 Latency: 0.0784632 s
P99 Latency: 0.0784632 s
P999 Latency: 0.0784632 s
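The verbose log also exposes the per-iteration times via the `iteration:N,time_cost:T` lines, which can be useful for spotting outliers among individual runs. Below is a minimal sketch; `iteration_times_ms` is a hypothetical helper, not part of ONNX Runtime.

```python
import re

# Hypothetical helper: pull the per-iteration time_cost values (seconds)
# out of a verbose (-v) log and convert them to milliseconds.
def iteration_times_ms(log):
    return [float(t) * 1000.0
            for t in re.findall(r"time_cost:([\d.]+)", log)]

sample = ("iteration:1,time_cost:0.0597978 "
          "iteration:2,time_cost:0.0580345 "
          "iteration:3,time_cost:0.058048")
times = iteration_times_ms(sample)
print(f"mean: {sum(times) / len(times):.2f} ms")   # mean: 58.63 ms
```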
3. References