
How to use hardware acceleration with TensorFlow Lite and ONNX Runtime frameworks

Applicable for STM32MP25x lines


1. Article purpose[edit | edit source]

The main purpose of this article is to describe the main steps, and to give advice, on how to use GPU/NPU hardware acceleration on STM32MP25x lines More info.png with the TensorFlow LiteTM and ONNX RuntimeTM frameworks.

2. Hardware acceleration with TensorFlow LiteTM[edit | edit source]

2.1. Prerequisites[edit | edit source]

First, it is mandatory to use a model that supports acceleration with the GPU/NPU of the STM32MP25x lines More info.png. To make sure the model can be accelerated, follow this wiki article to deploy the NN model correctly.

For TensorFlow LiteTM model hardware acceleration, in addition to OpenVX (NBG model), an external delegate for the TensorFlow LiteTM runtime named tflite-vx-delegate has been delivered since the X-LINUX-AI v6.0.0 release. It allows .tflite models to be run directly on the GPU/NPU of the STM32MP25x lines More info.png through the TensorFlow LiteTM runtime.

To install the tflite-vx-delegate packages on the board:

 x-linux-ai -i tflite-vx-delegate
 x-linux-ai -i tflite-vx-delegate-example

These commands install the libvx_delegate library in the /usr/lib directory of the board.

Then, check if the libraries are correctly installed:

 ls /usr/lib | grep -e libvx_delegate.so 

The different libraries are listed as follows with X, Y, and Z representing the version of the library:

libvx_delegate.so.X
libvx_delegate.so.X.Y.Z
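
Optionally, the delegate can also be loaded from Python to verify the installation. The following is a minimal sketch, assuming the installed library version is the .so.2 used in the examples below:

 import tflite_runtime.interpreter as tflr
 
 # Try to load the VX delegate library to check that it is correctly installed
 vx_delegate = tflr.load_delegate(library="/usr/lib/libvx_delegate.so.2")
 print("tflite-vx-delegate successfully loaded")
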
Info white.png Information
These packages are automatically downloaded on the STM32MP2x series with packagegroup-x-linux-ai-demo (for both packages), as well as with packagegroup-x-linux-ai-tflite for tflite-vx-delegate.
Warning DB.png Important
For the TFLite vx-delegate, inference source code examples in C++ and Python are available in the meta-layer x-linux-ai in recipes-frameworks/tflite-vx-delegate/files/tflite-vx-delegate-example and on the board in /usr/local/bin/tflite-vx-delegate-example/.

Now that everything is correctly set up, let's see how to use the GPU/NPU hardware acceleration with TensorFlow LiteTM.


2.2. Acceleration with TensorFlow LiteTM[edit | edit source]

There are two different ways to use GPU/NPU hardware acceleration on the STM32MP25x lines More info.png using TensorFlow LiteTM.

2.2.1. C++ TensorFlow LiteTM API[edit | edit source]

Here is a snippet showing how to modify your C++ application to include support for the TensorFlow LiteTM external delegate for the STM32MP25x lines More info.png GPU/NPU:

 std::unique_ptr<tflite::FlatBufferModel> model;
 std::unique_ptr<tflite::Interpreter> interpreter;
 model = tflite::FlatBufferModel::BuildFromBuffer(_model, _len);
 std::string vx_delegate_path = "/usr/lib/libvx_delegate.so.2";
 
 model->error_reporter();
 tflite::ops::builtin::BuiltinOpResolver resolver;
 
 /* Add custom operator from TFLite VX delegate */
 resolver.AddCustom(kNbgCustomOp, tflite::ops::custom::Register_VSI_NPU_PRECOMPILED());
 
 tflite::InterpreterBuilder(*model, resolver)(&interpreter);
 if (!interpreter) {
   std::cout << "FATAL: Failed to construct interpreter" << std::endl;
   exit(-1);
 }
 
 /* Define the path to the TFLite external delegate */
 const char * delegate_path = vx_delegate_path.c_str();
 
 /* Set external delegate options */
 auto ext_delegate_option = TfLiteExternalDelegateOptionsDefault(delegate_path);
 
 /* Add optional features to skip the warmup time if a Network Binary Graph is already generated */
 ext_delegate_option.insert(&ext_delegate_option, "cache_file_path", "/path/to/my_nn_network.nb");
 ext_delegate_option.insert(&ext_delegate_option, "allowed_cache_mode", "true");
 auto ext_delegate_ptr = TfLiteExternalDelegateCreate(&ext_delegate_option);
 
 /* Set the interpreter to use the external delegate */
 interpreter->ModifyGraphWithDelegate(ext_delegate_ptr);
  
 /* Optionally set the number of CPU threads for fallback */
 interpreter->SetNumThreads(number_of_threads);
 
 if (interpreter->AllocateTensors() != kTfLiteOk) {
   std::cout << "FATAL: Failed to allocate tensors!" << std::endl;
 [...]

The TfLiteExternalDelegateOptionsDefault options must be initialized with the path to the delegate library. In this case, the library is libvx_delegate, located in the user file system.

Then, it is possible to add specific options available for the chosen delegate; defining them is not mandatory. In the case of libvx_delegate, two options are of interest:

  • cache_file_path: if an .nb file pointed to by this variable is found, it is used to skip the warmup time, that is, the compilation of the .tflite model on the target. If no .nb file is found, it generates an .nb file in this path.
  • allowed_cache_mode: if this variable is set to "true", the .nb file pointed to by cache_file_path is used or generated.

Finally, apply the delegate to the interpreter graph with the ModifyGraphWithDelegate function.

Info white.png Information
For more information on how to use this delegate or for any issues, refer to the tflite-vx-delegate GitHub[1].

2.2.2. Python TensorFlow LiteTM API[edit | edit source]

The following is a snippet showing how to modify the Python application to include support for the TensorFlow LiteTM external delegate for the STM32MP25x lines More info.png GPU/NPU:

 import tflite_runtime.interpreter as tflr
 
 vx_delegate = tflr.load_delegate(library="/usr/lib/libvx_delegate.so.2",
                                  options={"cache_file_path":"<path/to/.nb/model>", "allowed_cache_mode":"true"})
 
 self._interpreter = tflr.Interpreter(model_path=<path/to/your/model/.tflite>,
                                      num_threads = 2,                           # Number of CPU cores
                                      experimental_delegates=[vx_delegate])
 
 self._interpreter.allocate_tensors()
 self._input_details = self._interpreter.get_input_details()
 self._output_details = self._interpreter.get_output_details()
 [...]

The TensorFlow LiteTM Interpreter must be initialized with the path to the model and with the experimental_delegates parameter pointing to the delegate library. In this case, the library is libvx_delegate, located in the user file system.

It is possible to add specific options available for the chosen delegate; defining them is not mandatory.
In the case of libvx_delegate, two options are of interest:

  • cache_file_path: if an .nb file pointed to by this variable is found, it is used to skip the warmup time, that is, the compilation of the .tflite model on the target. If no .nb file is found, it generates an .nb file in this path.
  • allowed_cache_mode: if this variable is set to "true", the .nb file pointed to by cache_file_path is used or generated.

The TFLite interpreter can be used in the same way with or without the declaration of the delegate. It has no impact on the rest of the code.
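
Because the delegate declaration is optional, the application can also fall back to a pure CPU interpreter when the delegate library is not present on the target. The following is a minimal sketch of such a fallback, assuming the /usr/lib/libvx_delegate.so.2 library path used above; the model and .nb cache paths are placeholders to adapt:

 import os
 import tflite_runtime.interpreter as tflr
 
 VX_DELEGATE_LIB = "/usr/lib/libvx_delegate.so.2"
 
 def create_interpreter(model_path, nb_cache_path=None):
     """Return a TFLite interpreter, accelerated on the GPU/NPU when the VX delegate is available."""
     delegates = []
     if os.path.exists(VX_DELEGATE_LIB):
         options = {"allowed_cache_mode": "true"}
         if nb_cache_path:
             options["cache_file_path"] = nb_cache_path
         delegates.append(tflr.load_delegate(library=VX_DELEGATE_LIB, options=options))
     # With an empty delegate list, the interpreter simply runs on the CPU
     return tflr.Interpreter(model_path=model_path,
                             num_threads=2,
                             experimental_delegates=delegates)
 
 interpreter = create_interpreter("my_model.tflite", "/path/to/my_nn_network.nb")
 interpreter.allocate_tensors()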

Info white.png Information
For more information on how to use this delegate or for any issues, refer to the tflite-vx-delegate GitHub[1].

3. Hardware acceleration with ONNXTM Runtime[edit | edit source]

3.1. Prerequisites[edit | edit source]

For ONNX model hardware acceleration, in addition to OpenVX (NBG model), an execution provider for ONNX Runtime named VSINPU has been delivered since the X-LINUX-AI v6.0.0 release. It allows .onnx models to be run directly on the GPU/NPU of the STM32MP25x lines More info.png through the ONNX Runtime.

To install the ONNX Runtime packages, which include the VSINPU execution provider, and the associated examples on the board:

 x-linux-ai -i onnxruntime 
 x-linux-ai -i python3-onnxruntime 
 x-linux-ai -i ort-vsinpu-ep-example-cpp
 x-linux-ai -i ort-vsinpu-ep-example-python

The VSINPU execution provider is upstreamed directly into ONNX Runtime, thus it is automatically available with the onnxruntime package for the STM32MP2x series.

These commands install the libonnxruntime library in the /usr/lib directory of the board.

Then, check if the libraries are correctly installed:

 ls /usr/lib | grep -e libonnxruntime.so 

The different libraries are listed as follows with X, Y, and Z representing the version of the library:

libonnxruntime.so.X
libonnxruntime.so.X.Y.Z
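
In addition to checking the library files, the availability of the VSINPU execution provider can be verified directly from Python once the python3-onnxruntime package is installed. This is a minimal check, using the provider name that appears later in this article:

 import onnxruntime as ort
 
 # List the execution providers available in this ONNX Runtime build
 providers = ort.get_available_providers()
 print(providers)
 
 if "VSINPUExecutionProvider" in providers:
     print("GPU/NPU acceleration is available through the VSINPU execution provider")
 else:
     print("VSINPU execution provider not found, inference will fall back to the CPU")
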
Info white.png Information
These packages are automatically downloaded on the STM32MP2x series with packagegroup-x-linux-ai-demo (for both packages), as well as with packagegroup-x-linux-ai-onnx for the VSINPU execution provider.
Warning DB.png Important
For the ONNX Runtime VSINPU execution provider, inference source code examples in C++ and Python are available in the meta-layer x-linux-ai in recipes-frameworks/onnxruntime/files/ort-vsinpu-ep-example and on the board in /usr/local/bin/ort-vsinpu-ep-example/.

Now that everything is correctly set up, let's see how to use the GPU/NPU hardware acceleration with ONNXTM Runtime.

3.2. Acceleration with ONNXTM Runtime[edit | edit source]

3.2.1. C++ ONNXTM Runtime API[edit | edit source]

The following is a snippet showing how to modify the C++ application to include support for the ONNXTM Runtime VSINPU execution provider for the STM32MP25x lines More info.png GPU/NPU:

 Ort::Env ort_env(ORT_LOGGING_LEVEL_WARNING, "Onnx_environment");
 Ort::SessionOptions session_options;
 
 /* Set the VSINPU AI execution provider */
 OrtStatus* status = OrtSessionOptionsAppendExecutionProvider_VSINPU(session_options);
 if (status != nullptr) {
     std::cerr << "Failed to set VSINPU AI execution provider: " << Ort::GetApi().GetErrorMessage(status) << std::endl;
     Ort::GetApi().ReleaseStatus(status);
     throw std::runtime_error("[ORT] Failed: VSINPU AI execution provider runtime error");
 }
 
 session_options.DisableCpuMemArena();
 session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
 
 /* Create a session from the ONNX model file */
 Ort::Session session(ort_env, model_path.c_str(), session_options);
 
 [...]
 /* Get the input shape and prepare the input data */
 [...]
 Ort::RunOptions run_options;
 auto output_tensors = session.Run(run_options, input_name_data,
                                   input_tensor_data, num_of_input,
                                   output_data, num_of_output);
 [...]

The ONNXTM Runtime session_options must be modified with the VSINPUExecutionProvider to run the NN model on the GPU/NPU. The OrtSessionOptionsAppendExecutionProvider_VSINPU function is used to execute the model on the GPU/NPU instead of the CPU. If an operation is not supported, the execution of this operation falls back to the CPU.

Info white.png Information
For more information on how to use this execution provider or for any issues, refer to the ONNX Runtime GitHub[2].

3.2.2. Python ONNXTM Runtime API[edit | edit source]

The following is a snippet showing how to modify the Python application to include support for the ONNXTM Runtime VSINPU execution provider for the STM32MP25x lines More info.png GPU/NPU:

 import onnxruntime as ort
 
 # Load the ONNX model and create an inference session with the VSI NPU execution provider
 session_options = ort.SessionOptions()
 session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
 session = ort.InferenceSession(model_path, sess_options=session_options, providers=['VSINPUExecutionProvider'])
 
 # Get input and output details
 input_name = session.get_inputs()[0].name
 input_shape = session.get_inputs()[0].shape
 input_type = session.get_inputs()[0].type
 
 # Prepare input data
 [...]
 # Run the inference
 output_data = session.run(None, {input_name: input_data})
 [...]

The Python API of ONNX Runtime is very simple to use, as the VSINPU execution provider is one of the officially supported ONNX Runtime execution providers. The only required modification is to set the providers option of the inference session to VSINPUExecutionProvider.
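
To check that the session is really dispatched to the VSINPU execution provider (and not silently executed on the CPU), the providers attached to the session can be inspected. The following is a minimal sketch with a CPU fallback; model_path is a placeholder to adapt:

 import onnxruntime as ort
 
 model_path = "my_model.onnx"  # placeholder, replace with your .onnx model
 
 # Prefer the VSINPU execution provider when available, otherwise fall back to the CPU
 available = ort.get_available_providers()
 providers = ["VSINPUExecutionProvider"] if "VSINPUExecutionProvider" in available else ["CPUExecutionProvider"]
 
 session_options = ort.SessionOptions()
 session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
 session = ort.InferenceSession(model_path, sess_options=session_options, providers=providers)
 
 # Execution providers actually used by this session, in priority order
 print(session.get_providers())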

Info white.png Information
For more information on how to use this execution provider or for any issues, refer to the ONNX Runtime GitHub[2].

4. Hardware acceleration with the STAI_MPU API[edit | edit source]

It is also possible to accelerate TFLiteTM or ONNXTM models on the GPU/NPU via the STAI_MPU API by setting an option in the STAI_MPU constructor.
Under the hood, the STAI_MPU API relies on the TensorFlow LiteTM and ONNXTM Runtime mechanisms explained in the previous sections.
To find out more about how to enable hardware acceleration using the STAI_MPU API, refer to this article: STAI_MPU:_AI_unified_API_for_STM32MPUs.
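
As an illustration only, here is a minimal Python sketch of this option. It assumes the stai_mpu Python module and its use_hw_acceleration constructor flag as described in the STAI_MPU article referenced above; the model path is a placeholder, and that article remains the authoritative reference for the API.

 from stai_mpu import stai_mpu_network
 
 # Instantiate the unified network wrapper and request GPU/NPU acceleration
 # (use_hw_acceleration is described in the STAI_MPU article referenced above)
 stai_model = stai_mpu_network(model_path="my_model.tflite",   # .tflite or .onnx model, placeholder path
                               use_hw_acceleration=True)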

5. References[edit | edit source]