   1 .. _using-tidl-api:
   2
   3 *************
   4 Using the API
   5 *************
   6
   7 This section illustrates using TIDL APIs to leverage deep learning in user applications. The overall flow is as follows:
   8
* Create a :term:`Configuration` object to specify the set of parameters required for network execution.
  10 * Create :term:`Executor` objects - one to manage overall execution on the EVEs, the other for C66x DSPs.
  11 * Use the :term:`Execution Objects<EO>` (EO) created by the Executor to process :term:`frames<Frame>`. There are two approaches to processing frames using Execution Objects:
  12
  13   #. Each EO processes a single frame.
  #. Split the processing of a frame across multiple EOs using an :term:`ExecutionObjectPipeline`.
  15
Refer to :ref:`api-documentation` for API documentation.
  17
  18 Use Cases
  19 +++++++++
  20
  21 .. _use-case-1:
  22
  23 Each EO processes a single frame
  24 ================================
  25
  26 In this approach, the :term:`network<Network>` is set up as a single :term:`Layer Group`. An :term:`EO` runs the entire layer group on a single frame. To increase throughput, frame processing can be pipelined across available EOs. For example, on AM5749, frames can be processed by 4 EOs: one each on EVE1, EVE2, DSP1, and DSP2.
  27
  28
  29 .. figure:: images/tidl-one-eo-per-frame.png
  30     :align: center
  31     :scale: 80
  32
  33     Processing a frame with one EO. Not to scale. Fn: Frame n, LG: Layer Group.
  34
#. Determine if there are any TIDL-capable :term:`compute cores<Compute core>` on the AM57x Processor:
  36
  37     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
  38         :language: c++
  39         :lines: 64-65
  40         :linenos:
  41
  42 #. Create a Configuration object by reading it from a file or by initializing it directly. The example below parses a configuration file and initializes the Configuration object. See ``examples/test/testvecs/config/infer`` for examples of configuration files.
  43
  44     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
  45         :language: c++
  46         :lines: 92-94
  47         :linenos:
  48
#. Create Executors on C66x and EVE. In this example, all available C66x and EVE cores are used (lines 1-2 and :ref:`CreateExecutor`).
#. Create a vector of available ExecutionObjects from both Executors (lines 7-8 and :ref:`CollectEOs`).
#. Allocate input and output buffers for each ExecutionObject (:ref:`AllocateMemory`).
#. Run the network on each input frame. The frames are processed with available execution objects in a pipelined manner. The additional ``num_eos`` iterations are required to flush the pipeline (lines 15-26).
  53
  54    * Wait for the EO to finish processing. If the EO is not processing a frame (the first iteration on each EO), the call to ``ProcessFrameWait`` returns false. ``ReportTime`` is used to report host and device execution times.
  55    * Read a frame and start running the network. ``ProcessFrameStartAsync`` is asynchronous and returns before processing is complete. ``ReadFrame`` is application specific and used to read an input frame for processing. For example, with OpenCV, ``ReadFrame`` is implemented using OpenCV APIs to capture a frame from the camera.
  56
  57     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
  58         :language: c++
  59         :lines: 108-127,129,133-139
  60         :linenos:
  61
  62     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
  63         :language: c++
  64         :lines: 154-163
  65         :linenos:
  66         :caption: CreateExecutor
  67         :name: CreateExecutor
  68
  69     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
  70         :language: c++
  71         :lines: 166-172
  72         :linenos:
  73         :caption: CollectEOs
  74         :name: CollectEOs
  75
  76     .. literalinclude:: ../../examples/common/utils.cpp
  77         :language: c++
  78         :lines: 197-212
  79         :linenos:
  80         :caption: AllocateMemory
  81         :name: AllocateMemory
  82
  83 The complete example is available at ``/usr/share/ti/tidl/examples/one_eo_per_frame/main.cpp``.
  84
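The pipelined loop in the last step can be sketched without any TIDL dependency. The block below is a minimal illustration only: ``MockEO`` and ``RunPipelined`` are hypothetical stand-ins, and the simplified signatures (for example, passing a frame index to ``ProcessFrameStartAsync``) are assumptions made to keep the sketch self-contained. They are not the API's declarations.

```cpp
#include <vector>

// Hypothetical stand-in for an ExecutionObject. Only the two calls named in
// the text are modeled; nothing here talks to a device.
class MockEO {
public:
    // Start "processing" a frame and return immediately (assumed signature).
    void ProcessFrameStartAsync(int frame_index) {
        frame_ = frame_index;
        busy_  = true;
    }

    // Returns false if no frame was in flight, true once the frame is done.
    bool ProcessFrameWait() {
        if (!busy_) return false;
        busy_ = false;
        return true;
    }

    // Hypothetical accessor for the frame most recently started on this EO.
    int FrameIndex() const { return frame_; }

private:
    int  frame_ = -1;
    bool busy_  = false;
};

// Runs num_frames frames through the EOs round-robin and returns the frame
// indices in the order they completed.
std::vector<int> RunPipelined(std::vector<MockEO>& eos, int num_frames) {
    std::vector<int> completed;
    const int num_eos = static_cast<int>(eos.size());
    if (num_eos == 0) return completed;

    // The extra num_eos iterations drain frames that are still in flight.
    for (int i = 0; i < num_frames + num_eos; ++i) {
        MockEO& eo = eos[i % num_eos];

        // Collect the result of the frame this EO started earlier, if any.
        if (eo.ProcessFrameWait())
            completed.push_back(eo.FrameIndex());

        // Stop issuing new work once all input frames are consumed.
        if (i < num_frames)
            eo.ProcessFrameStartAsync(i);
    }
    return completed;
}
```

With four mock EOs and ten frames, the first four iterations only start work, and the final four only collect results. This is why the loop bound is ``num_frames + num_eos``.
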
  85 .. note::
  86     The double buffering technique described in :ref:`use-case-3` can be used with a single :term:`ExecutionObject` to overlap reading a frame with the processing of the previous frame.
  87
  88 .. _use-case-2:
  89
  90 Frame split across EOs
  91 ======================
This approach is typically used to reduce the latency of processing a single frame. Certain network layers such as Softmax and Pooling run faster on the C66x than on the EVE. Running these layers on C66x can lower the per-frame latency.
  93
Time to process a single 224x224x3 frame on AM574x IDK EVM (Arm @ 1GHz, C66x @ 0.75GHz, EVE @ 0.65GHz) with JacintoNet11 (tidl_net_imagenet_jacintonet11v2.bin), TIDL API v1.1:
  95
  96 ======      =======     ===================
  97 EVE         C66x        EVE + C66x
  98 ======      =======     ===================
  99 ~112ms      ~120ms      ~64ms :sup:`1`
 100 ======      =======     ===================
 101
:sup:`1` BatchNorm and Convolution layers are placed in a :term:`Layer Group` and run on EVE. Pooling, InnerProduct, and SoftMax layers are placed in a second :term:`Layer Group` and run on C66x. The EVE layer group takes ~57.5ms and the C66x layer group takes ~6.5ms.
 103
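The arithmetic behind the table can be checked with a few lines of code. This is a rough sketch that assumes the two layer groups run back to back and that any hand-off cost between cores is already included in the quoted times. ``SplitLatencyMs`` and ``Speedup`` are hypothetical helper names.

```cpp
// Per-frame latency when the layer groups run one after the other on
// different cores (assumes no additional hand-off overhead).
double SplitLatencyMs(double eve_group_ms, double dsp_group_ms) {
    return eve_group_ms + dsp_group_ms;
}

// Ratio of a single-core latency to the split latency.
double Speedup(double single_core_ms, double split_ms) {
    return single_core_ms / split_ms;
}
```

Using the numbers above, ``SplitLatencyMs(57.5, 6.5)`` gives 64 ms, which is 1.75 times faster than the ~112 ms EVE-only case.
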
 104 .. _frame-across-eos:
 105 .. figure:: images/tidl-frame-across-eos.png
 106     :align: center
 107     :scale: 80
 108
 109     Processing a frame across EOs. Not to scale. Fn: Frame n, LG: Layer Group.
 110
 111 The network consists of 2 :term:`Layer Groups<Layer Group>`. :term:`Execution Objects<EO>` are organized into :term:`Execution Object Pipelines<EOP>` (EOP). Each :term:`EOP` processes a frame. The API manages inter-EO synchronization.
 112
#. Determine if there are any TIDL-capable :term:`compute cores<Compute core>` on the AM57x Processor:
 114
 115     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
 116         :language: c++
 117         :lines: 64-65
 118         :linenos:
 119
 120 #. Create a Configuration object by reading it from a file or by initializing it directly. The example below parses a configuration file and initializes the Configuration object. See ``examples/test/testvecs/config/infer`` for examples of configuration files.
 121
 122     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
 123         :language: c++
 124         :lines: 92-94
 125         :linenos:
 126
#. Update the default layer group index assignment. Pooling (layer 12), InnerProduct (layer 13) and SoftMax (layer 14) are added to a second layer group. Refer to :ref:`layer-group-override` for details.
 128
 129     .. literalinclude:: ../../examples/two_eo_per_frame/main.cpp
 130         :language: c++
 131         :lines: 101-102
 132         :linenos:
 133
 134 #. Create :term:`Executors<Executor>` on C66x and EVE. The EVE Executor runs layer group 1, the C66x executor runs layer group 2.
 135
#. Create two :term:`Execution Object Pipelines<EOP>`. Each EOP contains one EVE and one C66x :term:`Execution Object<EO>`.
#. Allocate input and output buffers for each ExecutionObject in the EOP (:ref:`AllocateMemory2`).
#. Run the network on each input frame. The frames are processed with available EOPs in a pipelined manner. For ease of use, EOP and EO present the same interface to the user.
 139
 140    * Wait for the EOP to finish processing. If the EOP is not processing a frame (the first iteration on each EOP), the call to ``ProcessFrameWait`` returns false. ``ReportTime`` is used to report host and device execution times.
 141    * Read a frame and start running the network. ``ProcessFrameStartAsync`` is asynchronous and returns before processing is complete. ``ReadFrame`` is application specific and used to read an input frame for processing. For example, with OpenCV, ``ReadFrame`` is implemented using OpenCV APIs to capture a frame from the camera.
 142
 143
 144     .. literalinclude:: ../../examples/two_eo_per_frame/main.cpp
 145         :language: c++
 146         :lines: 110-138,140,147-153
 147         :linenos:
 148
 149     .. literalinclude:: ../../examples/common/utils.cpp
 150         :language: c++
 151         :lines: 225-240
 152         :linenos:
 153         :caption: AllocateMemory
 154         :name: AllocateMemory2
 155
 156
 157 The complete example is available at ``/usr/share/ti/tidl/examples/two_eo_per_frame/main.cpp``. Another example of using the EOP is :ref:`ssd-example`.
 158
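The layer group override used in this use case (layers 12, 13 and 14 moved to a second layer group) amounts to a map from layer index to layer group id. The sketch below models that idea with a plain ``std::map``. ``LayerGroupOverrides``, ``MakeSecondGroupOverrides`` and ``LayerGroupFor`` are hypothetical names, and the default group id of 1 is an assumption taken from the statement that the EVE Executor runs layer group 1.

```cpp
#include <map>

// Assumed default: layers without an override stay in layer group 1.
constexpr int kDefaultLayerGroup = 1;

// Hypothetical stand-in for the override table: layer index -> layer group id.
using LayerGroupOverrides = std::map<int, int>;

// Pooling (12), InnerProduct (13) and SoftMax (14) go to layer group 2.
LayerGroupOverrides MakeSecondGroupOverrides() {
    return { {12, 2}, {13, 2}, {14, 2} };
}

// Looks up the layer group for a layer, falling back to the default group.
int LayerGroupFor(const LayerGroupOverrides& overrides, int layer_index) {
    const auto it = overrides.find(layer_index);
    return it == overrides.end() ? kDefaultLayerGroup : it->second;
}
```

Only the layers that move need an entry. Every other layer keeps the default assignment, so the table stays small even for deep networks.
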
 159 .. _use-case-3:
 160
 161 Using EOPs for double buffering
 162 ===============================
 163
The timeline shown in :numref:`frame-across-eos` indicates that EO-EVE1 waits for processing on EO-DSP1 to complete before it starts processing its next frame. It is possible to optimize the example further and overlap processing F\ :sub:`n-2` on EO-DSP1 and F\ :sub:`n` on EO-EVE1. This is illustrated in :numref:`frame-across-eos-opt`.
 165
 166 .. _frame-across-eos-opt:
 167 .. figure:: images/tidl-frame-across-eos-opt.png
 168     :align: center
 169     :scale: 80
 170
 171     Optimizing using double buffered EOPs. Not to scale. Fn: Frame n, LG: Layer Group.
 172
EOP1 and EOP2 use the same :term:`EOs<EO>`: EO-EVE1 and EO-DSP1. Each :term:`EOP` has its own input and output buffer. This enables EOP2 to read an input frame while EOP1 is processing its input frame. This in turn enables EOP2 to start processing on EO-EVE1 as soon as EOP1 completes processing on EO-EVE1.
 174
 175 The only change in the code compared to :ref:`use-case-2` is to create an additional set of EOPs for double buffering:
 176
 177 .. literalinclude:: ../../examples/two_eo_per_frame_opt/main.cpp
 178     :language: c++
 179     :lines: 117-129
 180     :linenos:
 181     :caption: Setting up EOPs for double buffering
 182     :name: test-code
 183
.. note::

    * EOP1 in :numref:`frame-across-eos-opt` -> EOPs[0] in :numref:`test-code`.
    * EOP2 in :numref:`frame-across-eos-opt` -> EOPs[1] in :numref:`test-code`.
    * EOP3 in :numref:`frame-across-eos-opt` -> EOPs[2] in :numref:`test-code`.
    * EOP4 in :numref:`frame-across-eos-opt` -> EOPs[3] in :numref:`test-code`.
 189
 190 The complete example is available at ``/usr/share/ti/tidl/examples/two_eo_per_frame_opt/main.cpp``.
 191
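The structure described above, two pipelines that share one pair of EOs but own separate buffers, can be sketched with plain structs. ``NamedEO``, ``MockEOP`` and ``MakeDoubleBufferedEOPs`` are hypothetical stand-ins for illustration and do not reproduce the example's code. The loop order is an assumption, chosen so that consecutive entries share an EO pair.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an ExecutionObject.
struct NamedEO {
    std::string name;
};

// Hypothetical stand-in for an ExecutionObjectPipeline: it refers to shared
// EOs but owns its input and output buffers.
struct MockEOP {
    const NamedEO*    eve;
    const NamedEO*    dsp;
    std::vector<char> input;
    std::vector<char> output;
};

// Creates two EOPs per (EVE, DSP) pair. The EO vectors must outlive the
// returned pipelines because the pipelines only hold pointers to the EOs.
std::vector<MockEOP> MakeDoubleBufferedEOPs(const std::vector<NamedEO>& eves,
                                            const std::vector<NamedEO>& dsps,
                                            std::size_t in_bytes,
                                            std::size_t out_bytes) {
    const std::size_t buffering_depth = 2;  // double buffering
    const std::size_t num_pairs = std::min(eves.size(), dsps.size());

    std::vector<MockEOP> eops;
    for (std::size_t i = 0; i < num_pairs; ++i) {
        for (std::size_t j = 0; j < buffering_depth; ++j) {
            // Same EOs for both entries, but a fresh buffer pair for each.
            eops.push_back(MockEOP{&eves[i], &dsps[i],
                                   std::vector<char>(in_bytes),
                                   std::vector<char>(out_bytes)});
        }
    }
    return eops;
}
```

With two EVE/DSP pairs this yields four pipelines. Entries 0 and 1 share the first pair and entries 2 and 3 share the second, so one pipeline can load its input buffer while its sibling occupies the EOs.
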
 192 .. _sizing_device_heaps:
 193
 194 Sizing device side heaps
 195 ++++++++++++++++++++++++
 196
TIDL API allocates 2 heaps for device-side allocations during network setup/initialization:
 198
 199 +-----------+-----------------------------------+-----------------------------+
 200 | Heap Name | Configuration parameter           | Default size                |
 201 +-----------+-----------------------------------+-----------------------------+
 202 | Parameter | Configuration::PARAM_HEAP_SIZE    | 9MB,  1 per Executor        |
 203 +-----------+-----------------------------------+-----------------------------+
 204 | Network   | Configuration::NETWORK_HEAP_SIZE  | 64MB, 1 per ExecutionObject |
 205 +-----------+-----------------------------------+-----------------------------+
 206
Depending on the network being deployed, these defaults may be smaller or larger than required. To determine the exact sizes for the heaps, the following approach can be used:

Start with the default heap sizes. The API displays heap usage statistics when ``Configuration::showHeapStats`` is set to true.
 210
 211 .. code-block:: c++
 212
 213     Configuration configuration;
 214     bool status = configuration.ReadFromFile(config_file);
 215     configuration.showHeapStats = true;
 216
 217 If the heap size is larger than required by device side allocations, the API displays usage statistics. When ``Free`` > 0, the heaps are larger than required.
 218
 219 .. code-block:: bash
 220
 221     # ./test_tidl -n 1 -t e -c testvecs/config/infer/tidl_config_j11_v2.txt
 222     API Version: 01.01.00.00.e4e45c8
 223     [eve 0]         TIDL Device Trace: PARAM heap: Size 9437184, Free 6556180, Total requested 2881004
 224     [eve 0]         TIDL Device Trace: NETWORK heap: Size 67108864, Free 47047680, Total requested 20061184
 225
 226
Update the application to set the heap sizes to the ``Total requested`` values displayed:
 228
 229 .. code-block:: c++
 230
 231     configuration.PARAM_HEAP_SIZE   = 2881004;
 232     configuration.NETWORK_HEAP_SIZE = 20061184;
 233
 234 .. code-block:: bash
 235
 236     # ./test_tidl -n 1 -t e -c testvecs/config/infer/tidl_config_j11_v2.txt
 237     API Version: 01.01.00.00.e4e45c8
 238     [eve 0]         TIDL Device Trace: PARAM heap: Size 2881004, Free 0, Total requested 2881004
 239     [eve 0]         TIDL Device Trace: NETWORK heap: Size 20061184, Free 0, Total requested 20061184
 240
Now, the heaps are sized as required by network execution (i.e. ``Free`` is 0),
and the ``configuration.showHeapStats = true`` line can be removed.
 243
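When many configurations need to be sized, the trace lines shown above can be parsed rather than read by eye. The following is a minimal sketch that assumes the trace format is exactly as printed in the sample output. ``HeapStats`` and ``ParseHeapTrace`` are hypothetical helpers and are not part of the TIDL API.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Values extracted from one "TIDL Device Trace" heap line.
struct HeapStats {
    std::string        heap;        // "PARAM" or "NETWORK"
    unsigned long long size;        // configured heap size in bytes
    unsigned long long free_bytes;  // unused bytes
    unsigned long long requested;   // bytes actually requested
};

// Parses a line such as
//   [eve 0]  TIDL Device Trace: PARAM heap: Size 9437184, Free 6556180, Total requested 2881004
// Returns false if the line does not match the assumed format.
bool ParseHeapTrace(const std::string& line, HeapStats& out) {
    const std::string marker = "TIDL Device Trace: ";
    const std::size_t pos = line.find(marker);
    if (pos == std::string::npos) return false;

    char heap[16] = {0};
    unsigned long long size = 0, free_bytes = 0, requested = 0;
    const int matched = std::sscanf(
        line.c_str() + pos + marker.size(),
        "%15s heap: Size %llu, Free %llu, Total requested %llu",
        heap, &size, &free_bytes, &requested);
    if (matched != 4) return false;

    out.heap       = heap;
    out.size       = size;
    out.free_bytes = free_bytes;
    out.requested  = requested;
    return true;
}
```

The ``requested`` field is the value to assign to ``PARAM_HEAP_SIZE`` or ``NETWORK_HEAP_SIZE``. In the sample output, ``Size`` minus ``Free`` equals ``Total requested``, which gives a quick consistency check on a parsed line.
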
.. note::

    If the default heap sizes are smaller than required, the device reports an allocation failure and indicates the required minimum size. For example:

    .. code-block:: bash

        # ./test_tidl -n 1 -t e -c testvecs/config/infer/tidl_config_j11_v2.txt
        API Version: 01.01.00.00.0ba86d4
        [eve 0]         TIDL Device Error:  Allocation failure with NETWORK heap, request size 161472, avail 102512
        [eve 0]         TIDL Device Error: Network heap must be >= 20061184 bytes, 19960944 not sufficient. Update Configuration::NETWORK_HEAP_SIZE
        TIDL Error: [src/execution_object.cpp, Wait, 548]: Allocation failed on device
 254
 255 .. note::
 256
    The memory for parameter and network heaps is itself allocated from OpenCL global memory (CMEM). Refer to :ref:`opencl-global-memory` for details.
 258
 259
 260
 261 .. _network_layer_output:
 262
 263 Accessing outputs of network layers
 264 +++++++++++++++++++++++++++++++++++
 265
TIDL API v1.1 and higher provide the following APIs to access the output buffers associated with network layers:

* :cpp:`ExecutionObject::WriteLayerOutputsToFile` - Write outputs from each layer into individual files. Files are named ``<filename_prefix>_<layer_index>.bin``.
* :cpp:`ExecutionObject::GetOutputsFromAllLayers` - Get output buffers from all layers.
* :cpp:`ExecutionObject::GetOutputFromLayer` - Get a single output buffer from a layer.
 271
 272 See ``examples/layer_output/main.cpp, ProcessTrace()`` for examples of using these tracing APIs.
 273
 274 .. note::
 275     The :cpp:`ExecutionObject::GetOutputsFromAllLayers` method can be memory intensive if the network has a large number of layers. This method allocates sufficient host memory to hold all output buffers from all layers.
 276
 277 .. _Processor SDK Linux Software Developer's Guide: http://software-dl.ti.com/processor-sdk-linux/esd/docs/latest/linux/index.html
 278 .. _Processor SDK Linux Software Developer's Guide (TIDL chapter): http://software-dl.ti.com/processor-sdk-linux/esd/docs/latest/linux/Foundational_Components_TIDL.html