
How to measure the performance of your models using ONNX Runtime

Applicable for STM32MP13x lines, STM32MP15x lines, STM32MP25x lines


This article describes how to measure the performance of an ONNX model using ONNX Runtime on STM32MPU platforms.

1. Installation

1.1. Installing from the OpenSTLinux AI package repository

Warning
The software package is provided AS IS, and by downloading it, you agree to be bound to the terms of the software license agreement (SLA0048). The detailed content licenses can be found here.

After configuring the AI OpenSTLinux package repository, install the X-LINUX-AI components for this application. The minimum required package is:

x-linux-ai -i onnxruntime-tools

The model used in this example can be installed from the following package:

x-linux-ai -i img-models-mobilenetv1-05-128

2. How to use the benchmark application

2.1. Executing with the command line

The onnxruntime_perf_test executable is located in the userfs partition:

/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test
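
To check that the tool is available on the target and to display its full list of options, the following commands can be used (a minimal sketch; the wildcard resolves to the installed ONNX Runtime version directory):

 ls /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test
 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -h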

It accepts the following input parameters:

usage: ./onnxruntime_perf_test [options...] model_path [result_file]
Options:
	-m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'.
		Provide 'duration' to run the test for a fix duration, and 'times' to repeated for a certain times. 
	-M: Disable memory pattern.
	-A: Disable memory arena
	-I: Generate tensor input binding (Free dimensions are treated as 1.)
	-c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default:1.
	-e [cpu|cuda|dnnl|tensorrt|openvino|dml|acl|nnapi|coreml|qnn|snpe|rocm|migraphx|xnnpack|vitisai]: Specifies the provider 'cpu','cuda','dnnl','tensorrt', 'openvino', 'dml', 'acl', 'nnapi', 'coreml', 'qnn', 'snpe', 'rocm', 'migraphx', 'xnnpack' or 'vitisai'. Default:'cpu'.
	-b [tf|ort]: backend to use. Default:ort
	-r [repeated_times]: Specifies the repeated times if running in 'times' test mode.Default:1000.
	-t [seconds_to_run]: Specifies the seconds to run for 'duration' mode. Default:600.
	-p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
	-s: Show statistics result, like P75, P90. If no result_file provided this defaults to on.
	-S: Given random seed, to produce the same input data. This defaults to -1(no initialize).
	-v: Show verbose information.
	-x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes, A value of 0 means ORT will pick a default. Must >=0.
	-y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes), A value of 0 means ORT will pick a default. Must >=0.
	-f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must > 0
	-F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must > 0
	-P: Use parallel executor instead of sequential executor.
	-o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all).
		Please see onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of all optimization levels.
	-u [optimized_model_path]: Specify the optimized model path for saving.
	-d [CUDA only][cudnn_conv_algorithm]: Specify CUDNN convolution algorithms: 0(benchmark), 1(heuristic), 2(default). 
	-q [CUDA only] use separate stream for copy. 
	-z: Set denormal as zero. When turning on this option reduces latency dramatically, a model may have denormals.
	-C: Specify session configuration entries as key-value pairs: -C "<key1>|<value1> <key2>|<value2>" 
	    Refer to onnxruntime_session_options_config_keys.h for valid keys and values. 
	    [Example] -C "session.disable_cpu_ep_fallback|1 ep.context_enable|1" 
	-i: Specify EP specific runtime options as key value pairs. Different runtime options available are: 
	    [Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>'

	    [DML only] [performance_preference]: DML device performance preference, options: 'default', 'minimum_power', 'high_performance', 
	    [DML only] [device_filter]: DML device filter, options: 'any', 'gpu', 'npu', 
	    [DML only] [disable_metacommands]: Options: 'true', 'false', 
	    [DML only] [enable_graph_capture]: Options: 'true', 'false', 
	    [DML only] [enable_graph_serialization]: Options: 'true', 'false', 

	    [OpenVINO only] [device_type]: Overrides the accelerator hardware type and precision with these values at runtime.
	    [OpenVINO only] [device_id]: Selects a particular hardware device for inference.
	    [OpenVINO only] [enable_npu_fast_compile]: Optionally enabled to speeds up the model's compilation on NPU device targets.
	    [OpenVINO only] [num_of_threads]: Overrides the accelerator hardware type and precision with these values at runtime.
	    [OpenVINO only] [cache_dir]: Explicitly specify the path to dump and load the blobs(Model caching) or cl_cache (Kernel Caching) files feature. If blob files are already present, it will be directly loaded.
	    [OpenVINO only] [enable_opencl_throttling]: Enables OpenCL queue throttling for GPU device(Reduces the CPU Utilization while using GPU) 
	    [Example] [For OpenVINO EP] -e openvino -i "device_type|CPU enable_npu_fast_compile|true num_of_threads|5 enable_opencl_throttling|true cache_dir|"<path>""

	    [QNN only] [backend_path]: QNN backend path. e.g '/folderpath/libQnnHtp.so', '/folderpath/libQnnCpu.so'.
	    [QNN only] [profiling_level]: QNN profiling level, options: 'basic', 'detailed', default 'off'.
	    [profiling_file_path] : QNN profiling file path if ETW not enabled.
	    [QNN only] [rpc_control_latency]: QNN rpc control latency. default to 10.
	    [QNN only] [vtcm_mb]: QNN VTCM size in MB. default to 0(not set).
	    [QNN only] [htp_performance_mode]: QNN performance mode, options: 'burst', 'balanced', 'default', 'high_performance', 
	    'high_power_saver', 'low_balanced', 'extreme_power_saver', 'low_power_saver', 'power_saver', 'sustained_high_performance'. Default to 'default'. 
	    [QNN only] [qnn_context_priority]: QNN context priority, options: 'low', 'normal', 'normal_high', 'high'. Default to 'normal'. 
	    [QNN only] [qnn_saver_path]: QNN Saver backend path. e.g '/folderpath/libQnnSaver.so'.
	    [QNN only] [htp_graph_finalization_optimization_mode]: QNN graph finalization optimization mode, options: 
	    '0', '1', '2', '3', default is '0'.
	    [QNN only] [soc_model]: The SoC Model number. Refer to QNN SDK documentation for specific values. Defaults to '0' (unknown). 
	    [QNN only] [htp_arch]: The minimum HTP architecture. The driver will use ops compatible with this architecture. 
	    Options are '0', '68', '69', '73', '75'. Defaults to '0' (none). 
	    [QNN only] [device_id]: The ID of the device to use when setting 'htp_arch'. Defaults to '0' (for single device). 
	    [QNN only] [enable_htp_fp16_precision]: Enable the HTP_FP16 precision so that the float32 model will be inferenced with fp16 precision. 
	    Otherwise, it will be fp32 precision. Only works for float32 model. Defaults to '0' (with FP32 precision.). 
	    [Example] [For QNN EP] -e qnn -i "backend_path|/folderpath/libQnnCpu.so" 

	    [TensorRT only] [trt_max_partition_iterations]: Maximum iterations for TensorRT parser to get capability.
	    [TensorRT only] [trt_min_subgraph_size]: Minimum size of TensorRT subgraphs.
	    [TensorRT only] [trt_max_workspace_size]: Set TensorRT maximum workspace size in byte.
	    [TensorRT only] [trt_fp16_enable]: Enable TensorRT FP16 precision.
	    [TensorRT only] [trt_int8_enable]: Enable TensorRT INT8 precision.
	    [TensorRT only] [trt_int8_calibration_table_name]: Specify INT8 calibration table name.
	    [TensorRT only] [trt_int8_use_native_calibration_table]: Use Native TensorRT calibration table.
	    [TensorRT only] [trt_dla_enable]: Enable DLA in Jetson device.
	    [TensorRT only] [trt_dla_core]: DLA core number.
	    [TensorRT only] [trt_dump_subgraphs]: Dump TRT subgraph to onnx model.
	    [TensorRT only] [trt_engine_cache_enable]: Enable engine caching.
	    [TensorRT only] [trt_engine_cache_path]: Specify engine cache path.
	    [TensorRT only] [trt_engine_cache_prefix]: Customize engine cache prefix when trt_engine_cache_enable is true.
	    [TensorRT only] [trt_engine_hw_compatible]: Enable hardware compatibility. Engines ending with '_sm80+' can be re-used across all Ampere+ GPU (a hardware-compatible engine may have lower throughput and/or higher latency than its non-hardware-compatible counterpart).
	    [TensorRT only] [trt_weight_stripped_engine_enable]: Enable weight-stripped engine build.
	    [TensorRT only] [trt_onnx_model_folder_path]: Folder path for the ONNX model with weights.
	    [TensorRT only] [trt_force_sequential_engine_build]: Force TensorRT engines to be built sequentially.
	    [TensorRT only] [trt_context_memory_sharing_enable]: Enable TensorRT context memory sharing between subgraphs.
	    [TensorRT only] [trt_layer_norm_fp32_fallback]: Force Pow + Reduce ops in layer norm to run in FP32 to avoid overflow.
	    [Example] [For TensorRT EP] -e tensorrt -i 'trt_fp16_enable|true trt_int8_enable|true trt_int8_calibration_table_name|calibration.flatbuffers trt_int8_use_native_calibration_table|false trt_force_sequential_engine_build|false'

	    [NNAPI only] [NNAPI_FLAG_USE_FP16]: Use fp16 relaxation in NNAPI EP..
	    [NNAPI only] [NNAPI_FLAG_USE_NCHW]: Use the NCHW layout in NNAPI EP.
	    [NNAPI only] [NNAPI_FLAG_CPU_DISABLED]: Prevent NNAPI from using CPU devices.
	    [NNAPI only] [NNAPI_FLAG_CPU_ONLY]: Using CPU only in NNAPI EP.
	    [Example] [For NNAPI EP] -e nnapi -i "NNAPI_FLAG_USE_FP16 NNAPI_FLAG_USE_NCHW NNAPI_FLAG_CPU_DISABLED"

	    [CoreML only] [COREML_FLAG_CREATE_MLPROGRAM]: Create an ML Program model instead of Neural Network.
	    [Example] [For CoreML EP] -e coreml -i "COREML_FLAG_CREATE_MLPROGRAM"

	    [SNPE only] [runtime]: SNPE runtime, options: 'CPU', 'GPU', 'GPU_FLOAT16', 'DSP', 'AIP_FIXED_TF'. 
	    [SNPE only] [priority]: execution priority, options: 'low', 'normal'. 
	    [SNPE only] [buffer_type]: options: 'TF8', 'TF16', 'UINT8', 'FLOAT', 'ITENSOR'. default: ITENSOR'. 
	    [SNPE only] [enable_init_cache]: enable SNPE init caching feature, set to 1 to enabled it. Disabled by default. 
	    [Example] [For SNPE EP] -e snpe -i "runtime|CPU priority|low" 


	-T [Set intra op thread affinities]: Specify intra op thread affinity string
	 [Example]: -T 1,2;3,4;5,6 or -T 1-2;3-4;5-6 
		 Use semicolon to separate configuration between threads.
		 E.g. 1,2;3,4;5,6 specifies affinities for three threads, the first thread will be attached to the first and second logical processor.
		 The number of affinities must be equal to intra_op_num_threads - 1

	-D [Disable thread spinning]: disable spinning entirely for thread owned by onnxruntime intra-op thread pool.
	-Z [Force thread to stop spinning between runs]: disallow thread from spinning during runs to reduce cpu usage.
	-n [Exit after session creation]: allow user to measure session creation time to measure impact of enabling any initialization optimizations.
	-l Provide file as binary in memory by using fopen before session creation.
	-h: help
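
As an illustration of these options, the sketch below runs the MobileNet model installed in section 1.1 (its location on the target is detailed in the next section) in 'duration' mode for 10 seconds on the CPU execution provider, prints the latency statistics, and dumps profiling data to a file. The duration value and the profile file name perf_profile.json are arbitrary choices for this example:

 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m duration -t 10 -s -e cpu -p perf_profile.json /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx

The resulting profile is a JSON trace that can typically be opened with a Chromium-based trace viewer for a per-operator breakdown.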

2.2. Testing with MobileNet

The model used for testing is mobilenet_v1_0.5_128_quant.onnx, installed by the img-models-mobilenetv1-05-128 package. It is a model used for image classification.
On the target, the model is located here:

/usr/local/x-linux-ai/image-classification/models/mobilenet/

2.2.1. Benchmark on NPU

To benchmark an ONNX model on the NPU with onnxruntime_perf_test, use the following command:

 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 1 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx  -e vsinpu

Console output:

Session creation time cost: 5.88315 s
First inference time cost: 2 ms
Total inference time cost: 0.0176648 s
Total inference requests: 8
Average inference time cost: 2.2081 ms
Total inference run time: 0.0178289 s
Number of inferences per second: 448.71 
Avg CPU usage: 0 %
Peak working set size: 76144640 bytes
Avg CPU usage:0
Peak working set size:76144640
Runs:8
Min Latency: 0.00217539 s
Max Latency: 0.00223231 s
P50 Latency: 0.00222424 s
P90 Latency: 0.00223231 s
P95 Latency: 0.00223231 s
P99 Latency: 0.00223231 s
P999 Latency: 0.00223231 s

To display more information, use the flag -v.
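
In the output above, the session creation time is much larger than the inference time itself. To measure it in isolation, the -n option listed in the help can be used; a minimal sketch reusing the same model and execution provider:

 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -n /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx -e vsinpu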

2.2.2. Benchmark on CPU

To benchmark an ONNX model on the CPU with onnxruntime_perf_test, use the following command:

 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx 

Console output:

Session creation time cost: 0.107338 s
First inference time cost: 18 ms
Total inference time cost: 0.142892 s
Total inference requests: 8
Average inference time cost: 17.8616 ms
Total inference run time: 0.143068 s
Number of inferences per second: 55.9176 
Avg CPU usage: 100 %
Peak working set size: 32243712 bytes
Avg CPU usage:100
Peak working set size:32243712
Runs:8
Min Latency: 0.0177605 s
Max Latency: 0.0180254 s
P50 Latency: 0.0178472 s
P90 Latency: 0.0180254 s
P95 Latency: 0.0180254 s
P99 Latency: 0.0180254 s
P999 Latency: 0.0180254 s

To obtain the best performance, it can be worth adding the flags -P -x 2 -y 1 so that the benchmark runs with more than one thread, depending on the hardware used:

 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 1 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx 

Console output:

Setting intra_op_num_threads to 2
Setting inter_op_num_threads to 1
Session creation time cost: 0.119392 s
First inference time cost: 18 ms
Total inference time cost: 0.145847 s
Total inference requests: 8
Average inference time cost: 18.2309 ms
Total inference run time: 0.14602 s
Number of inferences per second: 54.7871 
Avg CPU usage: 96 %
Peak working set size: 34209792 bytes
Avg CPU usage:96
Peak working set size:34209792
Runs:8
Min Latency: 0.0177527 s
Max Latency: 0.019529 s
P50 Latency: 0.018044 s
P90 Latency: 0.019529 s
P95 Latency: 0.019529 s
P99 Latency: 0.019529 s
P999 Latency: 0.019529 s

To display more information, use the flag -v.

2.2.3. Benchmark on GPU

To benchmark an ONNX model on the GPU with onnxruntime_perf_test, first export the following environment variable:

 export VIV_VX_DISABLE_TP_NN=1

Then, run the benchmark:

 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx -e vsinpu

Console output:

Session creation time cost: 0.469887 s
First inference time cost: 12 ms
Total inference time cost: 0.0974822 s
Total inference requests: 8
Average inference time cost: 12.1853 ms
Total inference run time: 0.0976532 s
Number of inferences per second: 81.9226 
Avg CPU usage: 5 %
Peak working set size: 105005056 bytes
Avg CPU usage:5
Peak working set size:105005056
Runs:8
Min Latency: 0.0119358 s
Max Latency: 0.0127721 s
P50 Latency: 0.0120457 s
P90 Latency: 0.0127721 s
P95 Latency: 0.0127721 s
P99 Latency: 0.0127721 s
P999 Latency: 0.0127721 s

To display more information, use the flag -v.
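
Note that the variable exported above persists in the current shell; before re-running the NPU benchmark from section 2.2.1 in the same session, it can be unset:

 unset VIV_VX_DISABLE_TP_NN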

3. References