This article describes how to measure the performance of an ONNX model using ONNX Runtime on STM32 MPU platforms.
1. Installation
1.1. Installing from the OpenSTLinux AI package repository
After having configured the AI OpenSTLinux package repository, install the X-LINUX-AI components required for this application. The minimum package required is:
x-linux-ai -i onnxruntime-tools
The model used in this example can be installed from the following package:
x-linux-ai -i img-models-mobilenetv2-10-224
2. How to use the benchmark application
2.1. Executing with the command line
The onnxruntime_perf_test executable is located in the userfs partition:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test
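The option list described below can also be printed directly on the target by running the tool with its -h option:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -h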
It accepts the following input parameters:
usage: ./onnxruntime_perf_test [options...] model_path [result_file]
Options:
-m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'. Provide 'duration' to run the test for a fix duration, and 'times' to repeated for a certain times.
-M: Disable memory pattern.
-A: Disable memory arena
-I: Generate tensor input binding (Free dimensions are treated as 1.)
-c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default:1.
-e [cpu|cuda|dnnl|tensorrt|openvino|dml|acl|nnapi|coreml|qnn|snpe|rocm|migraphx|xnnpack|vitisai]: Specifies the provider 'cpu','cuda','dnnl','tensorrt', 'openvino', 'dml', 'acl', 'nnapi', 'coreml', 'qnn', 'snpe', 'rocm', 'migraphx', 'xnnpack' or 'vitisai'. Default:'cpu'.
-b [tf|ort]: backend to use. Default:ort
-r [repeated_times]: Specifies the repeated times if running in 'times' test mode. Default:1000.
-t [seconds_to_run]: Specifies the seconds to run for 'duration' mode. Default:600.
-p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
-s: Show statistics result, like P75, P90. If no result_file provided this defaults to on.
-S: Given random seed, to produce the same input data. This defaults to -1 (no initialize).
-v: Show verbose information.
-x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes. A value of 0 means ORT will pick a default. Must >=0.
-y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes). A value of 0 means ORT will pick a default. Must >=0.
-f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must > 0
-F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must > 0
-P: Use parallel executor instead of sequential executor.
-o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all). Please see onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of all optimization levels.
-u [optimized_model_path]: Specify the optimized model path for saving.
-d [CUDA only][cudnn_conv_algorithm]: Specify CUDNN convolution algorithms: 0(benchmark), 1(heuristic), 2(default).
-q [CUDA only]: use separate stream for copy.
-z: Set denormal as zero. When turning on this option reduces latency dramatically, a model may have denormals.
-C: Specify session configuration entries as key-value pairs: -C "<key1>|<value1> <key2>|<value2>". Refer to onnxruntime_session_options_config_keys.h for valid keys and values. [Example] -C "session.disable_cpu_ep_fallback|1 ep.context_enable|1"
-i: Specify EP specific runtime options as key value pairs. Different runtime options available are:
  [Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>'
  [DML only] [performance_preference]: DML device performance preference, options: 'default', 'minimum_power', 'high_performance'.
  [DML only] [device_filter]: DML device filter, options: 'any', 'gpu', 'npu'.
  [DML only] [disable_metacommands]: Options: 'true', 'false'.
  [DML only] [enable_graph_capture]: Options: 'true', 'false'.
  [DML only] [enable_graph_serialization]: Options: 'true', 'false'.
  [OpenVINO only] [device_type]: Overrides the accelerator hardware type and precision with these values at runtime.
  [OpenVINO only] [device_id]: Selects a particular hardware device for inference.
  [OpenVINO only] [enable_npu_fast_compile]: Optionally enabled to speeds up the model's compilation on NPU device targets.
  [OpenVINO only] [num_of_threads]: Overrides the accelerator hardware type and precision with these values at runtime.
  [OpenVINO only] [cache_dir]: Explicitly specify the path to dump and load the blobs (Model caching) or cl_cache (Kernel Caching) files feature. If blob files are already present, it will be directly loaded.
  [OpenVINO only] [enable_opencl_throttling]: Enables OpenCL queue throttling for GPU device (reduces the CPU utilization while using GPU).
  [Example] [For OpenVINO EP] -e openvino -i "device_type|CPU enable_npu_fast_compile|true num_of_threads|5 enable_opencl_throttling|true cache_dir|"<path>""
  [QNN only] [backend_path]: QNN backend path. e.g '/folderpath/libQnnHtp.so', '/folderpath/libQnnCpu.so'.
  [QNN only] [profiling_level]: QNN profiling level, options: 'basic', 'detailed', default 'off'.
  [profiling_file_path]: QNN profiling file path if ETW not enabled.
  [QNN only] [rpc_control_latency]: QNN rpc control latency. Default to 10.
  [QNN only] [vtcm_mb]: QNN VTCM size in MB. Default to 0 (not set).
  [QNN only] [htp_performance_mode]: QNN performance mode, options: 'burst', 'balanced', 'default', 'high_performance', 'high_power_saver', 'low_balanced', 'extreme_power_saver', 'low_power_saver', 'power_saver', 'sustained_high_performance'. Default to 'default'.
  [QNN only] [qnn_context_priority]: QNN context priority, options: 'low', 'normal', 'normal_high', 'high'. Default to 'normal'.
  [QNN only] [qnn_saver_path]: QNN Saver backend path. e.g '/folderpath/libQnnSaver.so'.
  [QNN only] [htp_graph_finalization_optimization_mode]: QNN graph finalization optimization mode, options: '0', '1', '2', '3', default is '0'.
  [QNN only] [soc_model]: The SoC Model number. Refer to QNN SDK documentation for specific values. Defaults to '0' (unknown).
  [QNN only] [htp_arch]: The minimum HTP architecture. The driver will use ops compatible with this architecture. Options are '0', '68', '69', '73', '75'. Defaults to '0' (none).
  [QNN only] [device_id]: The ID of the device to use when setting 'htp_arch'. Defaults to '0' (for single device).
  [QNN only] [enable_htp_fp16_precision]: Enable the HTP_FP16 precision so that the float32 model will be inferenced with fp16 precision. Otherwise, it will be fp32 precision. Only works for float32 model. Defaults to '0' (with FP32 precision).
  [Example] [For QNN EP] -e qnn -i "backend_path|/folderpath/libQnnCpu.so"
  [TensorRT only] [trt_max_partition_iterations]: Maximum iterations for TensorRT parser to get capability.
  [TensorRT only] [trt_min_subgraph_size]: Minimum size of TensorRT subgraphs.
  [TensorRT only] [trt_max_workspace_size]: Set TensorRT maximum workspace size in byte.
  [TensorRT only] [trt_fp16_enable]: Enable TensorRT FP16 precision.
  [TensorRT only] [trt_int8_enable]: Enable TensorRT INT8 precision.
  [TensorRT only] [trt_int8_calibration_table_name]: Specify INT8 calibration table name.
  [TensorRT only] [trt_int8_use_native_calibration_table]: Use native TensorRT calibration table.
  [TensorRT only] [trt_dla_enable]: Enable DLA in Jetson device.
  [TensorRT only] [trt_dla_core]: DLA core number.
  [TensorRT only] [trt_dump_subgraphs]: Dump TRT subgraph to onnx model.
  [TensorRT only] [trt_engine_cache_enable]: Enable engine caching.
  [TensorRT only] [trt_engine_cache_path]: Specify engine cache path.
  [TensorRT only] [trt_engine_cache_prefix]: Customize engine cache prefix when trt_engine_cache_enable is true.
  [TensorRT only] [trt_engine_hw_compatible]: Enable hardware compatibility. Engines ending with '_sm80+' can be re-used across all Ampere+ GPU (a hardware-compatible engine may have lower throughput and/or higher latency than its non-hardware-compatible counterpart).
  [TensorRT only] [trt_weight_stripped_engine_enable]: Enable weight-stripped engine build.
  [TensorRT only] [trt_onnx_model_folder_path]: Folder path for the ONNX model with weights.
  [TensorRT only] [trt_force_sequential_engine_build]: Force TensorRT engines to be built sequentially.
  [TensorRT only] [trt_context_memory_sharing_enable]: Enable TensorRT context memory sharing between subgraphs.
  [TensorRT only] [trt_layer_norm_fp32_fallback]: Force Pow + Reduce ops in layer norm to run in FP32 to avoid overflow.
  [Example] [For TensorRT EP] -e tensorrt -i 'trt_fp16_enable|true trt_int8_enable|true trt_int8_calibration_table_name|calibration.flatbuffers trt_int8_use_native_calibration_table|false trt_force_sequential_engine_build|false'
  [NNAPI only] [NNAPI_FLAG_USE_FP16]: Use fp16 relaxation in NNAPI EP.
  [NNAPI only] [NNAPI_FLAG_USE_NCHW]: Use the NCHW layout in NNAPI EP.
  [NNAPI only] [NNAPI_FLAG_CPU_DISABLED]: Prevent NNAPI from using CPU devices.
  [NNAPI only] [NNAPI_FLAG_CPU_ONLY]: Using CPU only in NNAPI EP.
  [Example] [For NNAPI EP] -e nnapi -i "NNAPI_FLAG_USE_FP16 NNAPI_FLAG_USE_NCHW NNAPI_FLAG_CPU_DISABLED"
  [CoreML only] [COREML_FLAG_CREATE_MLPROGRAM]: Create an ML Program model instead of Neural Network.
  [Example] [For CoreML EP] -e coreml -i "COREML_FLAG_CREATE_MLPROGRAM"
  [SNPE only] [runtime]: SNPE runtime, options: 'CPU', 'GPU', 'GPU_FLOAT16', 'DSP', 'AIP_FIXED_TF'.
  [SNPE only] [priority]: Execution priority, options: 'low', 'normal'.
  [SNPE only] [buffer_type]: Options: 'TF8', 'TF16', 'UINT8', 'FLOAT', 'ITENSOR'. Default: 'ITENSOR'.
  [SNPE only] [enable_init_cache]: Enable SNPE init caching feature, set to 1 to enable it. Disabled by default.
  [Example] [For SNPE EP] -e snpe -i "runtime|CPU priority|low"
-T [Set intra op thread affinities]: Specify intra op thread affinity string. [Example]: -T 1,2;3,4;5,6 or -T 1-2;3-4;5-6. Use semicolon to separate configuration between threads. E.g. 1,2;3,4;5,6 specifies affinities for three threads, the first thread will be attached to the first and second logical processor. The number of affinities must be equal to intra_op_num_threads - 1.
-D [Disable thread spinning]: Disable spinning entirely for threads owned by the onnxruntime intra-op thread pool.
-Z [Force thread to stop spinning between runs]: Disallow threads from spinning during runs to reduce CPU usage.
-n [Exit after session creation]: Allow user to measure session creation time to measure impact of enabling any initialization optimizations.
-l: Provide file as binary in memory by using fopen before session creation.
-h: help
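As an illustration of the options above, the command below is a minimal sketch (not taken from the tool documentation) that benchmarks the MobileNet model described in the next section for a fixed duration of 10 seconds, with randomly generated inputs (-I) and statistics enabled (-s); the duration value is arbitrary:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m duration -t 10 -s /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx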
2.2. Testing with MobileNet
The model used for testing is mobilenet_v1_0.5_128_quant.onnx, an image-classification model installed by the img-models-mobilenetv1-05-128 package.
On the target, the model is located here:
/usr/local/x-linux-ai/image-classification/models/mobilenet/
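To verify that the model file is present on the target, simply list the content of this directory:
ls -l /usr/local/x-linux-ai/image-classification/models/mobilenet/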
2.2.1. Benchmark on NPU
To benchmark an ONNX model on the NPU with onnxruntime_perf_test, a typical command line is given below.
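The following command is a minimal sketch only. It assumes that the ONNX Runtime build shipped with X-LINUX-AI exposes an NPU-capable execution provider; the placeholder <npu_provider> is an assumption and must be replaced by the NPU provider actually listed in the -e option of onnxruntime_perf_test on your image (on STM32MP2 series boards this is typically the VeriSilicon NPU provider, when included in the build):
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 10 -e <npu_provider> /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx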
2.2.2. Benchmark on CPU
To benchmark an ONNX model on the CPU with onnxruntime_perf_test, a typical command line is given below.
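The following command is a minimal sketch using only options documented above; the number of runs (-r) and intra-op threads (-x, typically set to the number of Cortex-A cores of the board) are illustrative values:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 10 -e cpu -x 2 -s /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx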
2.2.3. Benchmark on GPU
To benchmark an ONNX model on the GPU with onnxruntime_perf_test, a typical command line is given below.
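As for the NPU case, the execution provider that targets the GPU depends on the ONNX Runtime build installed by X-LINUX-AI; the placeholder <gpu_provider> below is an assumption and must be replaced by a GPU-capable provider reported by the -e option on your image. Note that on STM32MP2 series boards the GPU and NPU are usually driven through the same provider, and restricting execution to the GPU alone may require additional platform-specific driver settings that are not shown in this sketch:
/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 10 -e <gpu_provider> /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.onnx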