docs/source/using_api.rst

   1 .. _using-tidl-api:
   2
   3 *************
   4 Using the API
   5 *************
   6
   7 This section illustrates using TIDL APIs to leverage deep learning in user applications. The overall flow is as follows:
   8
   9 * Create a :term:`Configuration` object to specify the set of parameters required for network execution.
  10 * Create :term:`Executor` objects - one to manage overall execution on the EVEs, the other for C66x DSPs.
  11 * Use the :term:`Execution Objects<EO>` (EO) created by the Executor to process :term:`frames<Frame>`. There are two approaches to processing frames using Execution Objects:
  12
  13 .. list-table:: TIDL API Use Cases
  14    :header-rows: 1
  15    :widths: 30 50 20
  16
  17    * - Use Case
  18      - Application/Network characteristic
  19      - Examples
  20    * - Each :term:`EO` processes a single frame. The network consists of a single :term:`Layer Group` and the entire Layer Group is processed on a single EO. See :ref:`use-case-1`.
  21      -
  22
  23        * The latency to run the network on a single frame meets application requirements
  24        * If frame processing time is fairly similar across EVE and C66x, multiple EOs can be used to increase throughput
  25        * The latency to read an input frame can be hidden by using 2 :term:`EOPs<EOP>` with the same :term:`EO`.
  26      - one_eo_per_frame, imagenet, segmentation
  27
  28    * - Split processing a single frame across multiple :term:`EOs<EO>` using an :term:`ExecutionObjectPipeline`.  The network consists of 2 or more :term:`Layer Groups<Layer Group>`. See :ref:`use-case-2` and :ref:`use-case-3`.
  29      -
  30
  31        * Network execution must be split across EVE and C66x to meet single frame latency requirements.
  32        * Split processing a single frame across multiple :term:`EOs<EO>` using an :term:`ExecutionObjectPipeline`.
  33        * Splitting can lower the overall latency because memory bound :term:`layers<Layer>` tend to run faster on the C66x.
  34      - two_eo_per_frame, two_eo_per_frame_opt, ssd_multibox
  35
  36
  37 Refer to :ref:`api-documentation` for API documentation.
  38
  39 Use Cases
  40 +++++++++
  41
  42 .. _use-case-1:
  43
  44 Each EO processes a single frame
  45 ================================
  46
  47 In this approach, the :term:`network<Network>` is set up as a single :term:`Layer Group`. An :term:`EO` runs the entire layer group on a single frame. To increase throughput, frame processing can be pipelined across available EOs. For example, on AM5749, frames can be processed by 4 EOs: one each on EVE1, EVE2, DSP1, and DSP2.
  48
  49
  50 .. figure:: images/tidl-one-eo-per-frame.png
  51     :align: center
  52     :scale: 80
  53
  54     Processing a frame with one EO. Not to scale. Fn: Frame n, LG: Layer Group.
  55
  56 #. Determine if there are any TIDL-capable :term:`compute cores<Compute core>` on the AM57x Processor:
  57
  58     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
  59         :language: c++
  60         :lines: 64-65
  61         :linenos:
  62
  63 #. Create a Configuration object by reading it from a file or by initializing it directly. The example below parses a configuration file and initializes the Configuration object. See ``examples/test/testvecs/config/infer`` for examples of configuration files.
  64
  65     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
  66         :language: c++
  67         :lines: 97-99
  68         :linenos:
  69
  70 #. Create Executors on C66x and EVE. In this example, all available C66x and EVE cores are used (lines 2-3 and :ref:`CreateExecutor`).
  71 #. Create a vector of available ExecutionObjects from both Executors (lines 7-8 and :ref:`CollectEOs`).
  72 #. Allocate input and output buffers for each ExecutionObject (:ref:`AllocateMemory`).
  73 #. Run the network on each input frame.  The frames are processed with available execution objects in a pipelined manner. The additional num_eos iterations are required to flush the pipeline (lines 15-26).
  74
  75    * Wait for the EO to finish processing. If the EO is not processing a frame (the first iteration on each EO), the call to ``ProcessFrameWait`` returns false. ``ReportTime`` is used to report host and device execution times.
  76    * Read a frame and start running the network. ``ProcessFrameStartAsync`` is asynchronous and returns before processing is complete. ``ReadFrame`` is application specific and used to read an input frame for processing. For example, with OpenCV, ``ReadFrame`` is implemented using OpenCV APIs to capture a frame from the camera.
  77
  78     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
  79         :language: c++
  80         :lines: 113-132,134,138-142
  81         :linenos:
  82
  83     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
  84         :language: c++
  85         :lines: 159-168
  86         :linenos:
  87         :caption: CreateExecutor
  88         :name: CreateExecutor
  89
  90     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
  91         :language: c++
  92         :lines: 171-177
  93         :linenos:
  94         :caption: CollectEOs
  95         :name: CollectEOs
  96
  97     .. literalinclude:: ../../examples/common/utils.cpp
  98         :language: c++
  99         :lines: 176-191
 100         :linenos:
 101         :caption: AllocateMemory
 102         :name: AllocateMemory
 103
 104 The complete example is available at ``/usr/share/ti/tidl/examples/one_eo_per_frame/main.cpp``.
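
     The scheduling pattern in the final step can be illustrated without TIDL hardware. The sketch below is not the example code: ``MockEO`` and ``RunPipelined`` are hypothetical stand-ins that only track which frame is in flight, and the mock ``ProcessFrameStartAsync`` takes a frame index purely for bookkeeping. It shows why the loop runs ``num_eos`` extra iterations: each EO has to be visited once more to collect the frame it started last.

     .. code-block:: c++

         #include <cassert>
         #include <vector>

         // Hypothetical stand-in for an ExecutionObject (not the TIDL class).
         struct MockEO {
             int frame_in_flight = -1;  // -1 means the EO is idle

             // Returns false if the EO was not processing a frame.
             bool ProcessFrameWait() {
                 if (frame_in_flight < 0) return false;
                 frame_in_flight = -1;
                 return true;
             }

             // Records the frame; a real EO would start device execution here.
             void ProcessFrameStartAsync(int frame_idx) { frame_in_flight = frame_idx; }
         };

         // Pushes num_frames frames through num_eos mock EOs in round-robin
         // order and returns how many frames completed.
         int RunPipelined(int num_frames, int num_eos) {
             if (num_eos <= 0) return 0;
             std::vector<MockEO> eos(num_eos);
             int completed = 0;

             // The additional num_eos iterations flush the pipeline.
             for (int frame_idx = 0; frame_idx < num_frames + num_eos; frame_idx++) {
                 MockEO& eo = eos[frame_idx % num_eos];

                 // Collect the result of the frame this EO started earlier, if any.
                 if (eo.ProcessFrameWait())
                     completed++;

                 // Stand-in for ReadFrame: there is input only for the first num_frames iterations.
                 if (frame_idx < num_frames)
                     eo.ProcessFrameStartAsync(frame_idx);
             }
             return completed;
         }

     With 4 mock EOs and 10 frames, the first 4 iterations only start frames, and the last 4 iterations only wait, so all 10 frames complete.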
 105
 106 .. note::
 107     The double buffering technique described in :ref:`use-case-3` can be used with a single :term:`ExecutionObject` to overlap reading an input frame with the processing of the previous input frame. Refer to ``examples/imagenet/main.cpp``.
 108
 109 .. _use-case-2:
 110
 111 Frame split across EOs
 112 ======================
 113 This approach is typically used to reduce the latency of processing a single frame. Certain network layers such as Softmax and Pooling run faster on the C66x than on EVE. Running these layers on C66x can lower the per-frame latency.
 114
 115 Time to process a single 224x224x3 frame on AM574x IDK EVM (Arm @ 1GHz, C66x @ 0.75GHz, EVE @ 0.65GHz) with JacintoNet11 (tidl_net_imagenet_jacintonet11v2.bin), TIDL API v1.1:
 116
 117 ======      =======     ===================
 118 EVE         C66x        EVE + C66x
 119 ======      =======     ===================
 120 ~112ms      ~120ms      ~64ms :sup:`1`
 121 ======      =======     ===================
 122
 123 :sup:`1` BatchNorm and Convolution layers are placed in a :term:`Layer Group` and run on EVE. Pooling, InnerProduct, SoftMax layers are placed in a second :term:`Layer Group` and run on C66x. The EVE layer group takes ~57.5ms, C66x layer group takes ~6.5ms.
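
     As a quick check of the figures above, the split latency is the sum of the two layer group times. The sketch below assumes the groups run back to back with no additional overhead; the constant and function names are illustrative only.

     .. code-block:: c++

         #include <cassert>

         // Approximate figures quoted above, in milliseconds.
         constexpr double kEveOnlyMs   = 112.0;  // whole network on EVE
         constexpr double kEveGroupMs  = 57.5;   // layer group 1 on EVE
         constexpr double kC66xGroupMs = 6.5;    // layer group 2 on C66x

         // Per-frame latency when the two layer groups run one after the other.
         constexpr double SplitLatencyMs() { return kEveGroupMs + kC66xGroupMs; }

         // Speedup of the split configuration relative to EVE only.
         constexpr double SpeedupVsEveOnly() { return kEveOnlyMs / SplitLatencyMs(); }

         static_assert(SplitLatencyMs() == 64.0, "57.5 ms + 6.5 ms = 64 ms");
         static_assert(SpeedupVsEveOnly() == 1.75, "112 ms / 64 ms = 1.75");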
 124
 125 .. _frame-across-eos:
 126 .. figure:: images/tidl-frame-across-eos.png
 127     :align: center
 128     :scale: 80
 129
 130     Processing a frame across EOs. Not to scale. Fn: Frame n, LG: Layer Group.
 131
 132 The network consists of 2 :term:`Layer Groups<Layer Group>`. :term:`Execution Objects<EO>` are organized into :term:`Execution Object Pipelines<EOP>` (EOP). Each :term:`EOP` processes a frame. The API manages inter-EO synchronization.
 133
 134 #. Determine if there are any TIDL-capable :term:`compute cores<Compute core>` on the AM57x Processor:
 135
 136     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
 137         :language: c++
 138         :lines: 64-65
 139         :linenos:
 140
 141 #. Create a Configuration object by reading it from a file or by initializing it directly. The example below parses a configuration file and initializes the Configuration object. See ``examples/test/testvecs/config/infer`` for examples of configuration files.
 142
 143     .. literalinclude:: ../../examples/one_eo_per_frame/main.cpp
 144         :language: c++
 145         :lines: 97-99
 146         :linenos:
 147
 148 #. Update the default layer group index assignment. Pooling (layer 12), InnerProduct (layer 13) and SoftMax (layer 14) are added to a second layer group. Refer to :ref:`layer-group-override` for details.
 149
 150     .. literalinclude:: ../../examples/two_eo_per_frame/main.cpp
 151         :language: c++
 152         :lines: 101-102
 153         :linenos:
 154
 155 #. Create :term:`Executors<Executor>` on C66x and EVE. The EVE Executor runs layer group 1, the C66x executor runs layer group 2.
 156
 157 #. Create two :term:`Execution Object Pipelines<EOP>`.  Each EOP contains one EVE and one C66x :term:`Execution Object<EO>`.
 158 #. Allocate input and output buffers for each ExecutionObject in the EOP (:ref:`AllocateMemory2`).
 159 #. Run the network on each input frame.  The frames are processed with available EOPs in a pipelined manner. For ease of use, EOP and EO present the same interface to the user.
 160
 161    * Wait for the EOP to finish processing. If the EOP is not processing a frame (the first iteration on each EOP), the call to ``ProcessFrameWait`` returns false. ``ReportTime`` is used to report host and device execution times.
 162    * Read a frame and start running the network. ``ProcessFrameStartAsync`` is asynchronous and returns before processing is complete. ``ReadFrame`` is application specific and used to read an input frame for processing. For example, with OpenCV, ``ReadFrame`` is implemented using OpenCV APIs to capture a frame from the camera.
 163
 164
 165     .. literalinclude:: ../../examples/two_eo_per_frame/main.cpp
 166         :language: c++
 167         :lines: 132-139,144-149
 168         :linenos:
 169
 170     .. literalinclude:: ../../examples/common/utils.cpp
 171         :language: c++
 172         :lines: 204-219
 173         :linenos:
 174         :caption: AllocateMemory
 175         :name: AllocateMemory2
 176
 177
 178 The complete example is available at ``/usr/share/ti/tidl/examples/two_eo_per_frame/main.cpp``. Another example of using the EOP is :ref:`ssd-example`.
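
     The layer group override in the steps above amounts to a map from layer index to layer group. The sketch below models that idea with a plain ``std::map``; ``LayerGroupFor`` and ``SecondGroupOverrides`` are hypothetical helpers, not TIDL API calls, and it assumes that layers without an override remain in layer group 1.

     .. code-block:: c++

         #include <cassert>
         #include <map>

         // Returns the layer group for a layer. Layers without an entry in
         // `overrides` stay in `default_group` (assumed to be 1 here).
         int LayerGroupFor(int layer_index,
                           const std::map<int, int>& overrides,
                           int default_group = 1) {
             auto it = overrides.find(layer_index);
             return it == overrides.end() ? default_group : it->second;
         }

         // Overrides described above: Pooling (12), InnerProduct (13) and
         // SoftMax (14) move to layer group 2.
         inline std::map<int, int> SecondGroupOverrides() {
             return { {12, 2}, {13, 2}, {14, 2} };
         }

     With these overrides, layers 12 to 14 resolve to layer group 2 (run by the C66x Executor) and every other layer resolves to layer group 1 (run by the EVE Executor).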
 179
 180 .. _use-case-3:
 181
 182 Using EOPs for double buffering
 183 ===============================
 184
 185 The timeline shown in :numref:`frame-across-eos` indicates that EO-EVE1 waits for processing on EO-DSP1 to complete before it starts processing its next frame. It is possible to optimize the example further and overlap processing F :sub:`n-2` on EO-DSP1 and F :sub:`n` on EO-EVE1. This is illustrated in :numref:`frame-across-eos-opt`.
 186
 187 .. _frame-across-eos-opt:
 188 .. figure:: images/tidl-frame-across-eos-opt.png
 189     :align: center
 190     :scale: 80
 191
 192     Optimizing using double buffered EOPs. Not to scale. Fn: Frame n, LG: Layer Group.
 193
 194 EOP1 and EOP2 use the same :term:`EOs<EO>`: EO-EVE1 and EO-DSP1. Each :term:`EOP` has its own input and output buffer. This enables EOP2 to read an input frame while EOP1 is processing its input frame. This in turn enables EOP2 to start processing on EO-EVE1 as soon as EOP1 completes processing on EO-EVE1.
 195
 196 The only change in the code compared to :ref:`use-case-2` is to create an additional set of EOPs for double buffering:
 197
 198 .. literalinclude:: ../../examples/two_eo_per_frame_opt/main.cpp
 199     :language: c++
 200     :lines: 122-134
 201     :linenos:
 202     :caption: Setting up EOPs for double buffering
 203     :name: test-code
 204
 205 .. note::
 206     EOP1 in :numref:`frame-across-eos-opt` -> EOPs[0] in :numref:`test-code`.
 207
 208     EOP2 in :numref:`frame-across-eos-opt` -> EOPs[1] in :numref:`test-code`.
 209
 210     EOP3 in :numref:`frame-across-eos-opt` -> EOPs[2] in :numref:`test-code`.
 211
 212     EOP4 in :numref:`frame-across-eos-opt` -> EOPs[3] in :numref:`test-code`.
 213
 214 The complete example is available at ``/usr/share/ti/tidl/examples/two_eo_per_frame_opt/main.cpp``.
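
     The idea of several EOPs sharing one EVE/C66x pair can be sketched with mock types. ``MockEOP`` and ``CreateDoubleBufferedEOPs`` below are hypothetical and only record which EOs each pipeline would chain. The ordering is chosen so that consecutive entries share a pair, matching the description of EOP1 and EOP2; the shipped example may order its EOPs differently.

     .. code-block:: c++

         #include <cassert>
         #include <string>
         #include <vector>

         // Hypothetical stand-in for an ExecutionObjectPipeline: it only
         // records the names of the EOs it would chain together.
         struct MockEOP {
             std::string eve;
             std::string dsp;
         };

         // Creates `pipeline_depth` EOPs per EVE/C66x pair. A depth of 2
         // corresponds to double buffering: two EOPs share the same EOs but
         // would each own separate input and output buffers.
         std::vector<MockEOP> CreateDoubleBufferedEOPs(int num_pairs,
                                                       int pipeline_depth = 2) {
             std::vector<MockEOP> eops;
             for (int pair = 0; pair < num_pairs; pair++) {
                 const std::string eve = "EO-EVE" + std::to_string(pair + 1);
                 const std::string dsp = "EO-DSP" + std::to_string(pair + 1);
                 for (int d = 0; d < pipeline_depth; d++)
                     eops.push_back({eve, dsp});
             }
             return eops;
         }

     For two EVE/C66x pairs this yields four EOPs: the first two share EO-EVE1 and EO-DSP1, the last two share EO-EVE2 and EO-DSP2.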
 215
 216 .. _sizing_device_heaps:
 217
 218 Sizing device side heaps
 219 ++++++++++++++++++++++++
 220
 221 The TIDL API allocates two heaps for device side allocations during network setup/initialization:
 222
 223 +-----------+-----------------------------------+-----------------------------+
 224 | Heap Name | Configuration parameter           | Default size                |
 225 +-----------+-----------------------------------+-----------------------------+
 226 | Parameter | Configuration::PARAM_HEAP_SIZE    | 9MB,  1 per Executor        |
 227 +-----------+-----------------------------------+-----------------------------+
 228 | Network   | Configuration::NETWORK_HEAP_SIZE  | 64MB, 1 per ExecutionObject |
 229 +-----------+-----------------------------------+-----------------------------+
 230
 231 Depending on the network being deployed, these defaults may be smaller or larger than required. In order to determine the exact sizes for the heaps, the following approach can be used:
 232
 233 Start with the default heap sizes. The API displays heap usage statistics when Configuration::showHeapStats is set to true.
 234
 235 .. code-block:: c++
 236
 237     Configuration configuration;
 238     bool status = configuration.ReadFromFile(config_file);
 239     configuration.showHeapStats = true;
 240
 241 If the heap size is larger than required by device side allocations, the API displays usage statistics. When ``Free`` > 0, the heaps are larger than required.
 242
 243 .. code-block:: bash
 244
 245     # ./test_tidl -n 1 -t e -c testvecs/config/infer/tidl_config_j11_v2.txt
 246     API Version: 01.01.00.00.e4e45c8
 247     [eve 0]         TIDL Device Trace: PARAM heap: Size 9437184, Free 6556180, Total requested 2881004
 248     [eve 0]         TIDL Device Trace: NETWORK heap: Size 67108864, Free 47047680, Total requested 20061184
 249
 250
 251 Update the application to set the heap sizes to the ``Total requested`` values displayed:
 252
 253 .. code-block:: c++
 254
 255     configuration.PARAM_HEAP_SIZE   = 2881004;
 256     configuration.NETWORK_HEAP_SIZE = 20061184;
 257
 258 .. code-block:: bash
 259
 260     # ./test_tidl -n 1 -t e -c testvecs/config/infer/tidl_config_j11_v2.txt
 261     API Version: 01.01.00.00.e4e45c8
 262     [eve 0]         TIDL Device Trace: PARAM heap: Size 2881004, Free 0, Total requested 2881004
 263     [eve 0]         TIDL Device Trace: NETWORK heap: Size 20061184, Free 0, Total requested 20061184
 264
 265 Now, the heaps are sized as required by network execution (i.e. ``Free`` is 0)
 266 and the ``configuration.showHeapStats = true`` line can be removed.
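
     The numbers in the traces above are related by ``Total requested = Size - Free``, and the default sizes are 9MB and 64MB expressed in bytes. The following compile-time checks confirm the arithmetic; ``kMiB`` and ``HeapBytesUsed`` are illustrative names, not part of the TIDL API.

     .. code-block:: c++

         #include <cassert>
         #include <cstddef>

         constexpr std::size_t kMiB = 1024 * 1024;

         // Bytes in use, given the Size and Free values reported in the trace.
         constexpr std::size_t HeapBytesUsed(std::size_t size, std::size_t free_bytes) {
             return size - free_bytes;
         }

         // Default heap sizes from the table above.
         static_assert(9 * kMiB == 9437184, "PARAM heap default");
         static_assert(64 * kMiB == 67108864, "NETWORK heap default");

         // Values from the first trace: Size - Free equals Total requested.
         static_assert(HeapBytesUsed(9437184, 6556180) == 2881004, "PARAM heap");
         static_assert(HeapBytesUsed(67108864, 47047680) == 20061184, "NETWORK heap");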
 267
 268 .. note::
 269
 270     If the default heap sizes are smaller than required, the device reports an allocation failure and indicates the required minimum size. For example:

 271     .. code-block:: bash
 272
 273         # ./test_tidl -n 1 -t e -c testvecs/config/infer/tidl_config_j11_v2.txt
 274         API Version: 01.01.00.00.0ba86d4
 275         [eve 0]         TIDL Device Error:  Allocation failure with NETWORK heap, request size 161472, avail 102512
 276         [eve 0]         TIDL Device Error: Network heap must be >= 20061184 bytes, 19960944 not sufficient. Update Configuration::NETWORK_HEAP_SIZE
 277         TIDL Error: [src/execution_object.cpp, Wait, 548]: Allocation failed on device
 278
 279 .. note::
 280
 281     The memory for parameter and network heaps is itself allocated from OpenCL global memory (CMEM). Refer to :ref:`opencl-global-memory` for details.
 282
 283
 284
 285 .. _network_layer_output:
 286
 287 Accessing outputs of network layers
 288 +++++++++++++++++++++++++++++++++++
 289
 290 TIDL API v1.1 and higher provides the following APIs to access the output buffers associated with network layers:
 291
 292 * :cpp:`ExecutionObject::WriteLayerOutputsToFile` - write outputs from each layer into individual files. Files are named ``<filename_prefix>_<layer_index>.bin``.
 293 * :cpp:`ExecutionObject::GetOutputsFromAllLayers` - Get output buffers from all layers.
 294 * :cpp:`ExecutionObject::GetOutputFromLayer` - Get a single output buffer from a layer.
 295
 296 See ``ProcessTrace()`` in ``examples/layer_output/main.cpp`` for examples of using these tracing APIs.
 297
 298 .. note::
 299     The :cpp:`ExecutionObject::GetOutputsFromAllLayers` method can be memory intensive if the network has a large number of layers. This method allocates sufficient host memory to hold all output buffers from all layers.
 300
 301 .. _Processor SDK Linux Software Developer's Guide: http://software-dl.ti.com/processor-sdk-linux/esd/docs/latest/linux/index.html
 302 .. _Processor SDK Linux Software Developer's Guide (TIDL chapter): http://software-dl.ti.com/processor-sdk-linux/esd/docs/latest/linux/Foundational_Components_TIDL.html