This article describes how to measure the performance of an ONNX model using ONNX Runtime on STM32MPU platforms.
1. Installation
1.1. Installing from the OpenSTLinux AI package repository
After configuring the AI OpenSTLinux package repository, install the X-LINUX-AI components needed for this application. The minimum required package is:
x-linux-ai -i onnxruntime-tools
The model used in this example can be installed from the following package:
x-linux-ai -i img-models-mobilenetv1-05-128
2. How to use the benchmark application
2.1. Executing with the command line
The onnxruntime_perf_test executable is located in the userfs partition:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test
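The exact directory name includes the ONNX Runtime version, hence the wildcard. If needed, the installed path can be resolved beforehand with a standard shell command, for example:
ls -d /usr/local/bin/onnxruntime-*/tools/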
It accepts the following input parameters:
usage: ./onnxruntime_perf_test [options...] model_path [result_file]
Options:
  -m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'.
      Provide 'duration' to run the test for a fix duration, and 'times' to repeated for a certain times.
  -M: Disable memory pattern.
  -A: Disable memory arena
  -I: Generate tensor input binding (Free dimensions are treated as 1.)
  -c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default:1.
  -e [cpu|cuda|dnnl|tensorrt|openvino|dml|acl|nnapi|coreml|qnn|snpe|rocm|migraphx|xnnpack|vitisai]: Specifies the provider 'cpu','cuda','dnnl','tensorrt', 'openvino', 'dml', 'acl', 'nnapi', 'coreml', 'qnn', 'snpe', 'rocm', 'migraphx', 'xnnpack' or 'vitisai'. Default:'cpu'.
  -b [tf|ort]: backend to use. Default:ort
  -r [repeated_times]: Specifies the repeated times if running in 'times' test mode.Default:1000.
  -t [seconds_to_run]: Specifies the seconds to run for 'duration' mode. Default:600.
  -p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
  -s: Show statistics result, like P75, P90. If no result_file provided this defaults to on.
  -S: Given random seed, to produce the same input data. This defaults to -1(no initialize).
  -v: Show verbose information.
  -x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes, A value of 0 means ORT will pick a default. Must >=0.
  -y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes), A value of 0 means ORT will pick a default. Must >=0.
  -f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must > 0
  -F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must > 0
  -P: Use parallel executor instead of sequential executor.
  -o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all). Please see onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of all optimization levels.
  -u [optimized_model_path]: Specify the optimized model path for saving.
  -d [CUDA only][cudnn_conv_algorithm]: Specify CUDNN convolution algorithms: 0(benchmark), 1(heuristic), 2(default).
  -q [CUDA only] use separate stream for copy.
  -z: Set denormal as zero. When turning on this option reduces latency dramatically, a model may have denormals.
  -C: Specify session configuration entries as key-value pairs: -C "<key1>|<value1> <key2>|<value2>"
      Refer to onnxruntime_session_options_config_keys.h for valid keys and values.
      [Example] -C "session.disable_cpu_ep_fallback|1 ep.context_enable|1"
  -i: Specify EP specific runtime options as key value pairs. Different runtime options available are:
      [Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>'
      [DML only] [performance_preference]: DML device performance preference, options: 'default', 'minimum_power', 'high_performance',
      [DML only] [device_filter]: DML device filter, options: 'any', 'gpu', 'npu',
      [DML only] [disable_metacommands]: Options: 'true', 'false',
      [DML only] [enable_graph_capture]: Options: 'true', 'false',
      [DML only] [enable_graph_serialization]: Options: 'true', 'false',
      [OpenVINO only] [device_type]: Overrides the accelerator hardware type and precision with these values at runtime.
      [OpenVINO only] [device_id]: Selects a particular hardware device for inference.
      [OpenVINO only] [enable_npu_fast_compile]: Optionally enabled to speeds up the model's compilation on NPU device targets.
      [OpenVINO only] [num_of_threads]: Overrides the accelerator hardware type and precision with these values at runtime.
      [OpenVINO only] [cache_dir]: Explicitly specify the path to dump and load the blobs(Model caching) or cl_cache (Kernel Caching) files feature. If blob files are already present, it will be directly loaded.
      [OpenVINO only] [enable_opencl_throttling]: Enables OpenCL queue throttling for GPU device(Reduces the CPU Utilization while using GPU)
      [Example] [For OpenVINO EP] -e openvino -i "device_type|CPU enable_npu_fast_compile|true num_of_threads|5 enable_opencl_throttling|true cache_dir|"<path>""
      [QNN only] [backend_path]: QNN backend path. e.g '/folderpath/libQnnHtp.so', '/folderpath/libQnnCpu.so'.
      [QNN only] [profiling_level]: QNN profiling level, options: 'basic', 'detailed', default 'off'.
      [profiling_file_path] : QNN profiling file path if ETW not enabled.
      [QNN only] [rpc_control_latency]: QNN rpc control latency. default to 10.
      [QNN only] [vtcm_mb]: QNN VTCM size in MB. default to 0(not set).
      [QNN only] [htp_performance_mode]: QNN performance mode, options: 'burst', 'balanced', 'default', 'high_performance', 'high_power_saver', 'low_balanced', 'extreme_power_saver', 'low_power_saver', 'power_saver', 'sustained_high_performance'. Default to 'default'.
      [QNN only] [qnn_context_priority]: QNN context priority, options: 'low', 'normal', 'normal_high', 'high'. Default to 'normal'.
      [QNN only] [qnn_saver_path]: QNN Saver backend path. e.g '/folderpath/libQnnSaver.so'.
      [QNN only] [htp_graph_finalization_optimization_mode]: QNN graph finalization optimization mode, options: '0', '1', '2', '3', default is '0'.
      [QNN only] [soc_model]: The SoC Model number. Refer to QNN SDK documentation for specific values. Defaults to '0' (unknown).
      [QNN only] [htp_arch]: The minimum HTP architecture. The driver will use ops compatible with this architecture. Options are '0', '68', '69', '73', '75'. Defaults to '0' (none).
      [QNN only] [device_id]: The ID of the device to use when setting 'htp_arch'. Defaults to '0' (for single device).
      [QNN only] [enable_htp_fp16_precision]: Enable the HTP_FP16 precision so that the float32 model will be inferenced with fp16 precision. Otherwise, it will be fp32 precision. Only works for float32 model. Defaults to '0' (with FP32 precision.).
      [Example] [For QNN EP] -e qnn -i "backend_path|/folderpath/libQnnCpu.so"
      [TensorRT only] [trt_max_partition_iterations]: Maximum iterations for TensorRT parser to get capability.
      [TensorRT only] [trt_min_subgraph_size]: Minimum size of TensorRT subgraphs.
      [TensorRT only] [trt_max_workspace_size]: Set TensorRT maximum workspace size in byte.
      [TensorRT only] [trt_fp16_enable]: Enable TensorRT FP16 precision.
      [TensorRT only] [trt_int8_enable]: Enable TensorRT INT8 precision.
      [TensorRT only] [trt_int8_calibration_table_name]: Specify INT8 calibration table name.
      [TensorRT only] [trt_int8_use_native_calibration_table]: Use Native TensorRT calibration table.
      [TensorRT only] [trt_dla_enable]: Enable DLA in Jetson device.
      [TensorRT only] [trt_dla_core]: DLA core number.
      [TensorRT only] [trt_dump_subgraphs]: Dump TRT subgraph to onnx model.
      [TensorRT only] [trt_engine_cache_enable]: Enable engine caching.
      [TensorRT only] [trt_engine_cache_path]: Specify engine cache path.
      [TensorRT only] [trt_engine_cache_prefix]: Customize engine cache prefix when trt_engine_cache_enable is true.
      [TensorRT only] [trt_engine_hw_compatible]: Enable hardware compatibility. Engines ending with '_sm80+' can be re-used across all Ampere+ GPU (a hardware-compatible engine may have lower throughput and/or higher latency than its non-hardware-compatible counterpart).
      [TensorRT only] [trt_weight_stripped_engine_enable]: Enable weight-stripped engine build.
      [TensorRT only] [trt_onnx_model_folder_path]: Folder path for the ONNX model with weights.
      [TensorRT only] [trt_force_sequential_engine_build]: Force TensorRT engines to be built sequentially.
      [TensorRT only] [trt_context_memory_sharing_enable]: Enable TensorRT context memory sharing between subgraphs.
      [TensorRT only] [trt_layer_norm_fp32_fallback]: Force Pow + Reduce ops in layer norm to run in FP32 to avoid overflow.
      [Example] [For TensorRT EP] -e tensorrt -i 'trt_fp16_enable|true trt_int8_enable|true trt_int8_calibration_table_name|calibration.flatbuffers trt_int8_use_native_calibration_table|false trt_force_sequential_engine_build|false'
      [NNAPI only] [NNAPI_FLAG_USE_FP16]: Use fp16 relaxation in NNAPI EP.
      [NNAPI only] [NNAPI_FLAG_USE_NCHW]: Use the NCHW layout in NNAPI EP.
      [NNAPI only] [NNAPI_FLAG_CPU_DISABLED]: Prevent NNAPI from using CPU devices.
      [NNAPI only] [NNAPI_FLAG_CPU_ONLY]: Using CPU only in NNAPI EP.
      [Example] [For NNAPI EP] -e nnapi -i "NNAPI_FLAG_USE_FP16 NNAPI_FLAG_USE_NCHW NNAPI_FLAG_CPU_DISABLED"
      [CoreML only] [COREML_FLAG_CREATE_MLPROGRAM]: Create an ML Program model instead of Neural Network.
      [Example] [For CoreML EP] -e coreml -i "COREML_FLAG_CREATE_MLPROGRAM"
      [SNPE only] [runtime]: SNPE runtime, options: 'CPU', 'GPU', 'GPU_FLOAT16', 'DSP', 'AIP_FIXED_TF'.
      [SNPE only] [priority]: execution priority, options: 'low', 'normal'.
      [SNPE only] [buffer_type]: options: 'TF8', 'TF16', 'UINT8', 'FLOAT', 'ITENSOR'. default: ITENSOR'.
      [SNPE only] [enable_init_cache]: enable SNPE init caching feature, set to 1 to enabled it. Disabled by default.
      [Example] [For SNPE EP] -e snpe -i "runtime|CPU priority|low"
  -T [Set intra op thread affinities]: Specify intra op thread affinity string
      [Example]: -T 1,2;3,4;5,6 or -T 1-2;3-4;5-6
      Use semicolon to separate configuration between threads.
      E.g. 1,2;3,4;5,6 specifies affinities for three threads, the first thread will be attached to the first and second logical processor.
      The number of affinities must be equal to intra_op_num_threads - 1
  -D [Disable thread spinning]: disable spinning entirely for thread owned by onnxruntime intra-op thread pool.
  -Z [Force thread to stop spinning between runs]: disallow thread from spinning during runs to reduce cpu usage.
  -n [Exit after session creation]: allow user to measure session creation time to measure impact of enabling any initialization optimizations.
  -l Provide file as binary in memory by using fopen before session creation.
  -h: help
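As an example of combining these options, the following command is a sketch based only on the flags documented above (the profile and output file names are arbitrary); it runs 100 inferences on the CPU while dumping profiling data with -p and saving the graph-optimized model with -u, using the MobileNet model installed above:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 100 -p onnxruntime_profile -u /tmp/optimized_model.onnx /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx
The resulting profile data can then be analyzed offline.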
2.2. Testing with MobileNet
The model used for testing is mobilenet_v1_0.5_128_quant.onnx, installed by the img-models-mobilenetv1-05-128 package.
It is an image classification model.
On the target, the model is located here:
/usr/local/x-linux-ai/image-classification/models/mobilenet/
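The presence of the model file can be verified directly on the target with a standard listing command, for example:
ls -l /usr/local/x-linux-ai/image-classification/models/mobilenet/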
2.2.1. Benchmark on NPU
To benchmark an ONNX model on the NPU with onnxruntime_perf_test, use the following command:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 1 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx -e vsinpu
Console output:
Session creation time cost: 5.88315 s
First inference time cost: 2 ms
Total inference time cost: 0.0176648 s
Total inference requests: 8
Average inference time cost: 2.2081 ms
Total inference run time: 0.0178289 s
Number of inferences per second: 448.71
Avg CPU usage: 0 %
Peak working set size: 76144640 bytes
Avg CPU usage:0
Peak working set size:76144640
Runs:8
Min Latency: 0.00217539 s
Max Latency: 0.00223231 s
P50 Latency: 0.00222424 s
P90 Latency: 0.00223231 s
P95 Latency: 0.00223231 s
P99 Latency: 0.00223231 s
P999 Latency: 0.00223231 s
To display more information, use the flag -v.
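The 'times' mode used above runs only 8 inferences. For more stable statistics, the benchmark can also be run in 'duration' mode using the -m and -t flags documented in the usage above, for example for 60 seconds on the NPU:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m duration -t 60 -P -x 2 -y 1 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx -e vsinpu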
2.2.2. Benchmark on CPU
To benchmark an ONNX model on the CPU with onnxruntime_perf_test, use the following command:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx
Console output:
Session creation time cost: 0.107338 s
First inference time cost: 18 ms
Total inference time cost: 0.142892 s
Total inference requests: 8
Average inference time cost: 17.8616 ms
Total inference run time: 0.143068 s
Number of inferences per second: 55.9176
Avg CPU usage: 100 %
Peak working set size: 32243712 bytes
Avg CPU usage:100
Peak working set size:32243712
Runs:8
Min Latency: 0.0177605 s
Max Latency: 0.0180254 s
P50 Latency: 0.0178472 s
P90 Latency: 0.0180254 s
P95 Latency: 0.0180254 s
P99 Latency: 0.0180254 s
P999 Latency: 0.0180254 s
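The impact of graph optimizations on this CPU run can also be measured by lowering the optimization level with the -o flag documented in the usage above (a sketch; 0 disables all graph optimizations):
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 -o 0 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx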
To obtain the best performance, add the flags -P -x 2 -y 1 to run the benchmark with more than one thread, depending on the hardware used.
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 1 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx
Console output:
Setting intra_op_num_threads to 2
Setting inter_op_num_threads to 1
Session creation time cost: 0.119392 s
First inference time cost: 18 ms
Total inference time cost: 0.145847 s
Total inference requests: 8
Average inference time cost: 18.2309 ms
Total inference run time: 0.14602 s
Number of inferences per second: 54.7871
Avg CPU usage: 96 %
Peak working set size: 34209792 bytes
Avg CPU usage:96
Peak working set size:34209792
Runs:8
Min Latency: 0.0177527 s
Max Latency: 0.019529 s
P50 Latency: 0.018044 s
P90 Latency: 0.019529 s
P95 Latency: 0.019529 s
P99 Latency: 0.019529 s
P999 Latency: 0.019529 s
To display more information, use the flag -v.
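Since the CPU runs above keep the CPU close to 100 %, the -D flag documented in the usage above can be added to disable thread spinning in the ONNX Runtime intra-op thread pool, which may trade a little latency for lower CPU usage, for example:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 1 -D /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx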
2.2.3. Benchmark on GPU
To benchmark an ONNX model on the GPU with onnxruntime_perf_test, first export an environment variable:
export VIV_VX_DISABLE_TP_NN=1
Then, run the benchmark:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx -e vsinpu
Console output:
Session creation time cost: 0.469887 s
First inference time cost: 12 ms
Total inference time cost: 0.0974822 s
Total inference requests: 8
Average inference time cost: 12.1853 ms
Total inference run time: 0.0976532 s
Number of inferences per second: 81.9226
Avg CPU usage: 5 %
Peak working set size: 105005056 bytes
Avg CPU usage:5
Peak working set size:105005056
Runs:8
Min Latency: 0.0119358 s
Max Latency: 0.0127721 s
P50 Latency: 0.0120457 s
P90 Latency: 0.0127721 s
P95 Latency: 0.0127721 s
P99 Latency: 0.0127721 s
P999 Latency: 0.0127721 s
To display more information, use the flag -v.
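As an alternative to exporting the variable globally, VIV_VX_DISABLE_TP_NN can be set for a single run only by prefixing the command, which is standard shell behavior:
VIV_VX_DISABLE_TP_NN=1 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx -e vsinpu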
3. References