docs/source/example.rst

   1 .. _examples:
   2
   3 ********
   4 Examples
   5 ********
   6
   7 .. list-table:: TIDL API Examples
   8    :header-rows: 1
   9    :widths: 12 43 20 25
  10
  11    * - Example
  12      - Description
  13      - Compute cores
  14      - Input image
  15    * - one_eo_per_frame
  16      - Processes a single frame with one :term:`EO` using the j11_v2 network. Throughput is increased by distributing frame processing across EOs. Refer to :ref:`use-case-1`.
  17      - EVE or C66x
  18      - Pre-processed image read from file.
  19    * - two_eo_per_frame
  20      - Processes a single frame with an :term:`EOP` using the j11_v2 network to reduce per-frame processing latency. Also increases throughput by distributing frame processing across EOPs. The EOP consists of two EOs. Refer to :ref:`use-case-2`.
  21      - EVE and C66x (network is split across both EVE and C66x)
  22      - Pre-processed image read from file.
  23    * - two_eo_per_frame_opt
  24      - Builds on ``two_eo_per_frame``. Adds double buffering to improve performance. Refer to :ref:`use-case-3`.
  25      - EVE and C66x (network is split across both EVE and C66x)
  26      - Pre-processed image read from file.
  27
  28    * - imagenet
  29      - Classification example
  30      - EVE or C66x
  31      - OpenCV used to read input image from file or capture from camera.
  32    * - segmentation
  33      - Pixel-level segmentation example
  34      - EVE or C66x
  35      - OpenCV used to read input image from file or capture from camera.
  36    * - ssd_multibox
  37      - Object detection
  38      - EVE and C66x (network is split across both EVE and C66x)
  39      - OpenCV used to read input image from file or capture from camera.
  40    * - mnist
  41      - Handwritten digit recognition (MNIST).  This example illustrates
  42        the low TIDL API overhead (~1.8%) for small networks with low compute
  43        requirements (<5 ms).
  44      - EVE
  45      - Pre-processed white-on-black images read from file, with or without
  46        MNIST database file headers.
  47    * - classification
  48      - Classification example, called from the Matrix GUI.
  49      - EVE or C66x
  50      - OpenCV used to read input image from file or capture from camera.
  51    * - mcbench
  52      - Used to benchmark supported networks. Refer to ``mcbench/scripts`` for command line options.
  53      - EVE or C66x
  54      - Pre-processed image read from file.
  55    * - layer_output
  56      - Illustrates using TIDL APIs to access output buffers of intermediate :term:`layers<Layer>` in the network.
  57      - EVE or C66x
  58      - Pre-processed image read from file.
  59    * - test
  60      - Tests the pre-converted networks included in the TIDL API package (``test/testvecs/config/tidl_models``). When run without any arguments, the program ``test_tidl`` runs all available networks on the C66x DSPs and EVEs available on the SoC. Use the ``-c`` option to specify a single network. Run ``test_tidl -h`` for details.
  61      - C66x and EVEs (if available)
  62      - Pre-processed image read from file.
  63
  64 The included examples demonstrate three categories of deep learning networks: classification, segmentation and object detection.  ``imagenet`` and ``segmentation`` can run on AM57x processors with either EVE or C66x cores.  ``ssd_multibox`` requires AM57x processors with both EVE and C66x. The examples are available at ``/usr/share/ti/tidl/examples`` on the EVM file system and in the Linux devkit.
  65
  66 The performance numbers were obtained using:
  67
  68 * `AM574x IDK EVM`_ with the Sitara `AM5749`_ Processor - 2 Arm Cortex-A15 cores running at 1.0GHz, 2 EVE cores at 650MHz, and 2 C66x cores at 750MHz.
  69 * `Processor SDK Linux`_ v5.1 with TIDL API v1.1
  70
  71 For each example, device processing time, host processing time,
  72 and TIDL API overhead are reported.
  73
  74 * **Device processing time** is measured on the device, from the moment processing starts for a frame until processing finishes.
  75 * **Host processing time** is measured on the host, from the moment ``ProcessFrameStartAsync()`` is called until ``ProcessFrameWait()`` returns in the user application.  It includes the TIDL API overhead, the OpenCL runtime overhead, and the time to copy user input data into padded TIDL internal buffers. ``Host processing time = Device processing time + TIDL API overhead``.
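
The per-frame lines printed by the examples (see the listing under Running
Examples) are consistent with reporting the API overhead as the share of host
time that is not spent on the device.  A minimal sketch of that arithmetic,
assuming POSIX ``sh`` and ``awk``; the function name ``api_overhead`` is
hypothetical, and the formula is inferred from those log lines rather than
taken from the TIDL sources:

.. code-block:: shell

   # api_overhead DEVICE_MS HOST_MS
   # Prints (host - device) / host as a percentage with two decimals.
   api_overhead() {
       awk -v d="$1" -v h="$2" 'BEGIN { printf "%.2f\n", (h - d) / h * 100 }'
   }

   # Segmentation frame 0 from the listing: 251.74 ms on EVE0, 258.02 ms on host.
   api_overhead 251.74 258.02    # prints 2.43

For the ``ssd_multibox`` frame in the same listing, ``api_overhead 169.44 173.56`` prints ``2.37``.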
  76
  77
  78 Imagenet
  79 --------
  80
  81 The imagenet example takes an image as input and outputs 1000 probabilities.
  82 Each probability corresponds to one of the 1000 object classes that the
  83 network was pre-trained on.  The example outputs up to five top predictions
  84 whose probabilities are 5% or higher for a given input image.
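
The selection rule above (at most five predictions, each at 5% or higher) is
consistent with the table below listing only three classes.  A minimal sketch
of such a filter, assuming POSIX ``sh``, ``sort``, ``awk`` and ``head``; the
function name ``top_predictions`` and the "label probability" line format are
hypothetical and are not the example's actual implementation:

.. code-block:: shell

   # Reads "label probability(%)" lines on stdin and prints at most five
   # lines with probability >= 5, highest probability first.
   top_predictions() {
       sort -k2,2nr | awk '$2 >= 5' | head -n 5
   }

   # The three probabilities at or above 5% come from the table in this
   # section; the two low-probability entries are made up for illustration.
   printf '%s\n' 'lynx 3.10' 'tiger_cat 17.65' 'tabby 52.55' \
       'Persian_cat 1.20' 'Egyptian_cat 21.18' | top_predictions

The sketch prints the ``tabby``, ``Egyptian_cat`` and ``tiger_cat`` lines in that order and drops the two made-up entries.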
  85
  86 The following figure and tables show an input image, the top predicted
  87 objects as output, and the processing time on either EVE or C66x.
  88
  89 .. image:: ../../examples/test/testvecs/input/objects/cat-pet-animal-domestic-104827.jpeg
  90    :width: 600
  91
  92
  93 ==== ============== ===========
  94 Rank Object Classes Probability
  95 ==== ============== ===========
  96 1    tabby          52.55%
  97 2    Egyptian_cat   21.18%
  98 3    tiger_cat      17.65%
  99 ==== ============== ===========
 100
 101 =======   ====================== ==================== ============
 102 Device    Device Processing Time Host Processing Time API Overhead
 103 =======   ====================== ==================== ============
 104 EVE       106.5 ms               107.9 ms             1.37 %
 105 C66x      117.9 ms               118.7 ms             0.93 %
 106 =======   ====================== ==================== ============
 107
 108 The :term:`network<Network>` used in the example is jacintonet11v2. It has
 109 14 layers. The input to the network is a 224x224 RGB image. Users can specify whether to run the network on EVE or C66x.
 110
 111 The example code sets ``buffer_factor`` to 2 to create duplicate
 112 ExecutionObjectPipelines with identical ExecutionObjects that
 113 perform double buffering, so that host pre/post-processing overlaps
 114 with device processing (see comments in the code for details).
 115 The following table shows the overall loop time over 10 frames
 116 with single buffering and double buffering, measured with
 117 ``./imagenet -f 10 -d <num> -e <num>``.
 118
 119 .. list-table:: Loop overall time over 10 frames
 120    :header-rows: 1
 121
 122    * - Device(s)
 123      - Single Buffering (buffer_factor=1)
 124      - Double Buffering (buffer_factor=2)
 125    * - 1 EVE
 126      - 1744 ms
 127      - 1167 ms
 128    * - 2 EVEs
 129      - 966 ms
 130      - 795 ms
 131    * - 1 C66x
 132      - 1879 ms
 133      - 1281 ms
 134    * - 2 C66xs
 135      - 1021 ms
 136      - 814 ms
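
The gain from double buffering can be read off the table as the ratio of the
two columns.  A minimal sketch, assuming POSIX ``sh`` and ``awk``; the function
name ``speedup`` is hypothetical:

.. code-block:: shell

   # speedup BASELINE_MS IMPROVED_MS
   # Prints baseline / improved with two decimals.
   speedup() {
       awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
   }

   speedup 1744 1167    # 1 EVE:   prints 1.49
   speedup 1021 814     # 2 C66xs: prints 1.25

In every row of the table, the ratio is larger with one core than with two cores of the same type.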
 137
 138 Segmentation
 139 ------------
 140
 141 The segmentation example takes an image as input and performs pixel-level
 142 classification according to pre-trained categories.  The following figures
 143 show a street scene as input and the scene overlaid with pixel-level
 144 classifications as output: road in green, pedestrians in red, vehicles
 145 in blue and background in gray.
 146
 147 .. image:: ../../examples/test/testvecs/input/roads/pexels-photo-972355.jpeg
 148    :width: 600
 149
 150 .. image:: images/pexels-photo-972355-seg.jpg
 151    :width: 600
 152
 153 The :term:`network<Network>` used in the example is jsegnet21v2. It has
 154 26 layers.  Users can specify whether to run the network on EVE or C66x.
 155 The input to the network is a 1024x512 RGB image.  The output is 1024x512
 156 values, each indicating which pre-trained category the corresponding pixel
 157 belongs to.  The example takes the network output, creates an overlay,
 158 and blends the overlay onto the original input image to create an output image.
 159 The times reported in the following table show that this network
 160 runs about 3.2 times faster on EVE than on C66x.
 161
 162 =======     ====================== ==================== ============
 163 Device      Device Processing Time Host Processing Time API Overhead
 164 =======     ====================== ==================== ============
 165 EVE         251.8 ms               254.2 ms             0.96 %
 166 C66x        812.7 ms               815.0 ms             0.27 %
 167 =======     ====================== ==================== ============
 168
 169 The example code sets ``buffer_factor`` to 2 to create duplicate
 170 ExecutionObjectPipelines with identical ExecutionObjects that
 171 perform double buffering, so that host pre/post-processing overlaps
 172 with device processing (see comments in the code for details).
 173 The following table shows the overall loop time over 10 frames
 174 with single buffering and double buffering, measured with
 175 ``./segmentation -f 10 -d <num> -e <num>``.
 176
 177 .. list-table:: Loop overall time over 10 frames
 178    :header-rows: 1
 179
 180    * - Device(s)
 181      - Single Buffering (buffer_factor=1)
 182      - Double Buffering (buffer_factor=2)
 183    * - 1 EVE
 184      - 5233 ms
 185      - 3017 ms
 186    * - 2 EVEs
 187      - 3032 ms
 188      - 3015 ms
 189    * - 1 C66x
 190      - 10890 ms
 191      - 8416 ms
 192    * - 2 C66xs
 193      - 5742 ms
 194      - 4638 ms
 195
 196 .. _ssd-example:
 197
 198 SSD
 199 ---
 200
 201 SSD stands for Single Shot MultiBox Detector.
 202 The ssd_multibox example takes an image as input and detects multiple
 203 objects with bounding boxes according to pre-trained categories.
 204 The example supports the SSD network with two sets of pre-trained categories:
 205 ``jdetnet_voc`` and ``jdetnet``.
 206
 207 The following figures show an image as input and the image with recognized
 208 objects boxed as output from ``jdetnet_voc``: person in red and horse in green.
 209
 210 .. figure:: images/horse.png
 211    :width: 600
 212
 213 .. figure:: images/horse_multibox.png
 214    :width: 600
 215
 216 The following figures show another street scene as input and the scene
 217 with recognized objects boxed as output from ``jdetnet``: pedestrians in red,
 218 vehicles in blue and road signs in yellow.
 219
 220 .. image:: ../../examples/test/testvecs/input/roads/pexels-photo-378570.jpeg
 221    :width: 600
 222
 223 .. image:: images/pexels-photo-378570-ssd.jpg
 224    :width: 600
 225
 226 Use command line options to switch between these two sets of pre-trained
 227 categories, e.g.
 228
 229 .. code-block:: shell
 230
 231    ./ssd_multibox # default is jdetnet_voc
 232    ./ssd_multibox -c jdetnet -l jdetnet_objects.json -p 16 -i ../test/testvecs/input/preproc_0_768x320.y
 233
 234 The SSD network used for both sets of categories has 43 layers.
 235 The input to the network is a 768x320 RGB image.  The output is a list of
 236 up to 20 boxes; each box carries its coordinates and the
 237 pre-trained category of the object inside it.
 238 The example takes the network output, draws the boxes accordingly,
 239 and creates an output image.
 240 The network can be run entirely on either EVE or C66x.  However, the best
 241 performance comes from running the first 30 layers as a group on EVE
 242 and the next 13 layers as another group on C66x.
 243 The end-to-end example shows how to assign a :term:`Layer Group` id
 244 to an :term:`Executor` and how to construct an :term:`ExecutionObjectPipeline` that connects the output of one *Executor*'s :term:`ExecutionObject`
 245 to the input of another *Executor*'s *ExecutionObject*.
 246
 247 ========      ====================== ==================== ============
 248 Device        Device Processing Time Host Processing Time API Overhead
 249 ========      ====================== ==================== ============
 250 EVE+C66x      169.5 ms               172.0 ms             1.68 %
 251 ========      ====================== ==================== ============
 252
 253 The example code sets ``pipeline_depth`` to 2 to create duplicate
 254 ExecutionObjectPipelines with identical ExecutionObjects that
 255 perform pipelined execution at the ExecutionObject level.
 256 As a side effect, this also overlaps host pre/post-processing
 257 with device processing (see comments in the code for details).
 258 The following table shows the overall loop time over 10 frames
 259 with pipelining at the ExecutionObjectPipeline level (``pipeline_depth=1``)
 260 versus the ExecutionObject level (``pipeline_depth=2``), measured with
 261 ``./ssd_multibox -f 10 -d <num> -e <num>``.
 262
 263 .. list-table:: Loop overall time over 10 frames
 264    :header-rows: 1
 265
 266    * - Device(s)
 267      - pipeline_depth=1
 268      - pipeline_depth=2
 269    * - 1 EVE + 1 C66x
 270      - 2900 ms
 271      - 1735 ms
 272    * - 2 EVEs + 2 C66xs
 273      - 1630 ms
 274      - 1408 ms
 275
 276 .. _mnist-example:
 277
 278 MNIST
 279 -----
 280
 281 The MNIST example takes a pre-processed 28x28 white-on-black frame from
 282 a file as input and predicts the handwritten digit in the frame.
 283 For instance, it predicts 0 for the following frame.
 284
 285 .. code-block:: none
 286
 287     root@am57xx-evm:~/tidl/examples/mnist# hexdump -v -e '28/1 "%2x" "\n"' -n 784 ../test/testvecs/input/digits10_images_28x28.y
 288      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 289      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 290      0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 314 8 0 0 0 0 0 0 0 0 0 0
 291      0 0 0 0 0 0 0 0 0 0 0 0319bdfeec1671b 0 0 0 0 0 0 0 0 0
 292      0 0 0 0 0 0 0 0 0 0 01ed5ffd2a4e4ec89 0 0 0 0 0 0 0 0 0
 293      0 0 0 0 0 0 0 0 0 0 1bcffee2a 031e6e225 0 0 0 0 0 0 0 0
 294      0 0 0 0 0 0 0 0 0 05ff7ffbf 2 0 078ffa1 0 0 0 0 0 0 0 0
 295      0 0 0 0 0 0 0 0 0 0b2f2f34e 0 0 015e0d8 0 0 0 0 0 0 0 0
 296      0 0 0 0 0 0 0 0 0148deab2 0 0 0 0 0bdec 2 0 0 0 0 0 0 0
 297      0 0 0 0 0 0 0 0 0 084f845 0 0 0 0 0a4f222 0 0 0 0 0 0 0
 298      0 0 0 0 0 0 0 0 0 0c4d3 5 0 0 0 0 096f21c 0 0 0 0 0 0 0
 299      0 0 0 0 0 0 0 0 052f695 0 0 0 0 0 0a7ed 8 0 0 0 0 0 0 0
 300      0 0 0 0 0 0 0 0 09af329 0 0 0 0 0 0d1cf 0 0 0 0 0 0 0 0
 301      0 0 0 0 0 0 0 0 2d4c8 0 0 0 0 0 01ae9a2 0 0 0 0 0 0 0 0
 302      0 0 0 0 0 0 0 038fa9a 0 0 0 0 0 062ff76 0 0 0 0 0 0 0 0
 303      0 0 0 0 0 0 0 07afe5d 0 0 0 0 0 0a9e215 0 0 0 0 0 0 0 0
 304      0 0 0 0 0 0 0 0bdec1d 0 0 0 0 017e7aa 0 0 0 0 0 0 0 0 0
 305      0 0 0 0 0 0 0 1e7d6 0 0 0 0 0 096f85a 0 0 0 0 0 0 0 0 0
 306      0 0 0 0 0 0 01df2bf 0 0 0 0 015e1ca 0 0 0 0 0 0 0 0 0 0
 307      0 0 0 0 0 0 061fc95 0 0 0 0 084f767 0 0 0 0 0 0 0 0 0 0
 308      0 0 0 0 0 0 06eff8b 0 0 0 033e8ca 4 0 0 0 0 0 0 0 0 0 0
 309      0 0 0 0 0 0 060fc9e 0 0 0 092d63e 0 0 0 0 0 0 0 0 0 0 0
 310      0 0 0 0 0 0 01bf1da 6 0 019b656 0 0 0 0 0 0 0 0 0 0 0 0
 311      0 0 0 0 0 0 0 0c3fb8e a613e7b 5 0 0 0 0 0 0 0 0 0 0 0 0
 312      0 0 0 0 0 0 0 049f1fcf5f696 9 0 0 0 0 0 0 0 0 0 0 0 0 0
 313      0 0 0 0 0 0 0 0 04ca0b872 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 314      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 315      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 316
 317 The file can contain multiple frames.  If an optional label file is also
 318 given, the example compares each prediction against the corresponding
 319 label to report accuracy.  The input files may or may not have `MNIST dataset
 320 file headers <http://yann.lecun.com/exdb/mnist/>`_.  If using headers,
 321 input filenames must end with ``idx3-ubyte`` or ``idx1-ubyte``.
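
The suffix rule above can be checked before launching the example.  A minimal
sketch, assuming POSIX ``sh``; the function name ``mnist_file_kind`` is
hypothetical, and the mapping of ``idx3-ubyte`` to images and ``idx1-ubyte``
to labels follows the naming used by the MNIST dataset files:

.. code-block:: shell

   # Classifies an MNIST input file by its name suffix.
   mnist_file_kind() {
       case "$1" in
           *idx3-ubyte) echo "images, with header" ;;
           *idx1-ubyte) echo "labels, with header" ;;
           *)           echo "no header expected" ;;
       esac
   }

   mnist_file_kind ../test/testvecs/input/digits10_images_28x28.y
   # prints: no header expected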
 322
 323 The MNIST example also illustrates the low overhead of the TIDL API for small
 324 networks with low compute requirements (<5 ms).  The network takes about 3 ms
 325 on EVE for a single frame.  As shown in the following table, when running
 326 over 1000 frames, the overhead is about 1.8%.
 327
 328 .. list-table:: Loop overall time over 1000 frames
 329    :header-rows: 1
 330
 331    * - Device(s)
 332      - Device Processing Time
 333      - Host Processing Time
 334      - API Overhead
 335    * - 1 EVE
 336      - 3091 ms
 337      - 3146 ms
 338      - 1.78%
 339
 340 Running Examples
 341 ----------------
 342
 343 The examples are located in ``/usr/share/ti/tidl/examples`` on
 344 the EVM file system.  **Each example needs to be run in its own directory** because it uses relative paths to configuration files.
 345 Running an example with ``-h`` shows a help message listing the available options.
 346 The following listing illustrates how to run the examples.
 347
 348 .. code-block:: shell
 349
 350    root@am57xx-evm:~/tidl/examples/imagenet# ./imagenet
 351    Input: ../test/testvecs/input/objects/cat-pet-animal-domestic-104827.jpeg
 352    1: tabby,   prob = 52.55%
 353    2: Egyptian_cat,   prob = 21.18%
 354    3: tiger_cat,   prob = 17.65%
 355    Loop total time (including read/write/opencv/print/etc):  183.3ms
 356    imagenet PASSED
 357
 358    root@am57xx-evm:~/tidl-api/examples/segmentation# ./segmentation
 359    Input: ../test/testvecs/input/000100_1024x512_bgr.y
 360    frame[  0]: Time on EVE0: 251.74 ms, host: 258.02 ms API overhead: 2.43 %
 361    Saving frame 0 to: frame_0.png
 362    Saving frame 0 overlayed with segmentation to: overlay_0.png
 363    frame[  1]: Time on EVE0: 251.76 ms, host: 255.79 ms API overhead: 1.58 %
 364    Saving frame 1 to: frame_1.png
 365    Saving frame 1 overlayed with segmentation to: overlay_1.png
 366    ...
 367    frame[  8]: Time on EVE0: 251.75 ms, host: 254.21 ms API overhead: 0.97 %
 368    Saving frame 8 to: frame_8.png
 369    Saving frame 8 overlayed with segmentation to: overlay_8.png
 370    Loop total time (including read/write/opencv/print/etc):   4809ms
 371    segmentation PASSED
 372
 373    root@am57xx-evm:~/tidl-api/examples/ssd_multibox# ./ssd_multibox
 374    Input: ../test/testvecs/input/preproc_0_768x320.y
 375    frame[  0]: Time on EVE0+DSP0: 169.44 ms, host: 173.56 ms API overhead: 2.37 %
 376    Saving frame 0 to: frame_0.png
 377    Saving frame 0 with SSD multiboxes to: multibox_0.png
 378    Loop total time (including read/write/opencv/print/etc):  320.2ms
 379    ssd_multibox PASSED
 380
 381    root@am57xx-evm:~/tidl/examples/mnist# ./mnist
 382    Input images: ../test/testvecs/input/digits10_images_28x28.y
 383    Input labels: ../test/testvecs/input/digits10_labels_10x1.y
 384    0
 385    1
 386    2
 387    3
 388    4
 389    5
 390    6
 391    7
 392    8
 393    9
 394    Device total time:  31.02ms
 395    Loop total time (including read/write/print/etc):  32.49ms
 396    Accuracy:    100%
 397    mnist PASSED
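
Because each example resolves its configuration files relative to its own
directory, a wrapper can change into that directory before launching the
binary.  A minimal sketch, assuming POSIX ``sh`` and that the binary has the
same name as its directory (true for the four examples in the listing above,
but not for ``test``, whose binary is ``test_tidl``); the function name
``run_example`` is hypothetical:

.. code-block:: shell

   # run_example ROOT NAME [ARGS...]
   # Runs ROOT/NAME/NAME from inside ROOT/NAME; the subshell keeps the
   # caller's working directory unchanged.
   run_example() {
       root=$1
       name=$2
       shift 2
       ( cd "$root/$name" && "./$name" "$@" )
   }

   # On the EVM (not executed here):
   #   run_example /usr/share/ti/tidl/examples imagenet
   #   run_example /usr/share/ti/tidl/examples mnist -h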
 398
 399
 400 Image input
 401 ^^^^^^^^^^^
 402
 403 The image input option, ``-i <image>``, takes an image file as input.
 404 You can supply an image file in any format that OpenCV can read, since
 405 OpenCV is used for image pre/post-processing.  When the ``-f <number>`` option
 406 is used, the same image is processed repeatedly.
 407
 408 Camera (live video) input
 409 ^^^^^^^^^^^^^^^^^^^^^^^^^
 410
 411 The input option, ``-i camera<number>``, enables live frame input
 412 from a camera.  ``<number>`` is the video input port number
 413 of your camera in Linux; it defaults to ``1`` for the TMDSCM572X camera module
 414 used on AM57x EVMs.  You can use ``-f <number>`` to specify the number
 415 of frames you want to process.  Use the following command to check the video
 416 input ports.
 417
 418 .. code-block:: shell
 419
 420   root@am57xx-evm:~# v4l2-ctl --list-devices
 421   omapwb-cap (platform:omapwb-cap):
 422         /dev/video11
 423
 424   omapwb-m2m (platform:omapwb-m2m):
 425         /dev/video10
 426
 427   vip (platform:vip):
 428         /dev/video1
 429
 430   vpe (platform:vpe):
 431         /dev/video0
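
The port number for ``-i camera<number>`` is the numeric suffix of the
``/dev/video`` node listed under the capture driver.  A minimal sketch that
extracts it from the listing, assuming POSIX ``sh`` and ``awk`` and the output
layout shown above; the function name ``video_port`` is hypothetical, and
treating ``vip`` as the camera's driver is an assumption based on the default
of ``1`` mentioned earlier:

.. code-block:: shell

   # video_port DRIVER
   # Reads `v4l2-ctl --list-devices` output on stdin and prints the number
   # of the first /dev/videoN node listed under DRIVER.
   video_port() {
       awk -v name="$1" '
           /^[^ \t]/ { in_section = ($1 == name) }   # unindented header line
           in_section && /\/dev\/video[0-9]+/ {
               sub(/.*\/dev\/video/, "")              # keep only the number
               print
               exit
           }
       '
   }

   # On the EVM (not executed here):
   #   v4l2-ctl --list-devices | video_port vip

Fed the listing above, ``video_port vip`` prints ``1`` and ``video_port vpe`` prints ``0``.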
 432
 433
 434 Pre-recorded video (mp4/mov/avi) input
 435 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 436
 437 The input option, ``-i <name>.{mp4,mov,avi}``, enables frame input from
 438 a pre-recorded video file in mp4, mov or avi format.  If you have a video in
 439 a different OpenCV-supported format/suffix, you can create a symbolic link
 440 with one of the mp4, mov or avi suffixes and feed it into the example.
 441 Again, use ``-f <number>`` to specify the number of frames you want to process.
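
Creating such a link takes one command.  A minimal sketch, assuming POSIX
``sh`` and ``ln``; the function name ``link_as_mp4`` and the file name
``clip.mkv`` are hypothetical, and the example can only read the file if
OpenCV on the target can decode it:

.. code-block:: shell

   # link_as_mp4 FILE
   # Creates a symbolic link next to FILE with the suffix replaced by .mp4
   # and prints the path of the link.
   link_as_mp4() {
       src=$1
       link="${src%.*}.mp4"
       ln -s "$(basename "$src")" "$link" && printf '%s\n' "$link"
   }

   # On the EVM (not executed here):
   #   ./ssd_multibox -i "$(link_as_mp4 clip.mkv)" -f 100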
 442
 443 Displaying video output
 444 ^^^^^^^^^^^^^^^^^^^^^^^
 445
 446 When using video input, live or pre-recorded, the example displays
 447 the output in a window using OpenCV.  If you have an LCD screen attached
 448 to the EVM, you need to stop ``matrix-gui`` first in order to
 449 see the example display window, as shown in the following example.
 450
 451 .. code-block:: shell
 452
 453   root@am57xx-evm:/usr/share/ti/tidl/examples/ssd_multibox# /etc/init.d/matrix-gui-2.0 stop
 454   Stopping Matrix GUI application.
 455   root@am57xx-evm:/usr/share/ti/tidl/examples/ssd_multibox# ./ssd_multibox -i camera -f 100
 456   Input: camera
 457   init done
 458   Using Wayland-EGL
 459   wlpvr: PVR Services Initialised
 460   Using the 'xdg-shell-v5' shell integration
 461   ... ...
 462   root@am57xx-evm:/usr/share/ti/tidl/examples/ssd_multibox# /etc/init.d/matrix-gui-2.0 start
 463   /usr/share/ti/tidl/examples/ssd_multibox
 464   Removing stale PID file /var/run/matrix-gui-2.0.pid.
 465   Starting Matrix GUI application.
 466
 467
 468 .. _AM574x IDK EVM:  http://www.ti.com/tool/tmdsidk574
 469 .. _AM5749: http://www.ti.com/product/AM5749/
 470 .. _Processor SDK Linux: http://software-dl.ti.com/processor-sdk-linux/esd/AM57X/latest/index_FDS.html