index ed6564d2dbb9fb17712509fb414f0e29f5af527d..eed7a43e450bff02f96b23ad60c36c6f1ae9a3e7 100644 (file)
--- a/docs/source/example.rst
+++ b/docs/source/example.rst
We ship three end-to-end examples within the tidl-api package
to demonstrate three categories of deep learning networks. The first
-two examples can run on AM57x SoCs with either DLA or DSP. The last
-example requires AM57x SoCs with both DLA and DSP. The performance
+two examples can run on AM57x SoCs with either EVE or DSP devices. The last
+example requires AM57x SoCs with both EVE and DSP. The performance
numbers that we present here were obtained on an AM5729 EVM, which
-includes 2 ARM A15 cores running at 1.5GHz, 4 DLA cores at 535MHz, and
+includes 2 ARM A15 cores running at 1.5GHz, 4 EVE cores at 535MHz, and
2 DSP cores at 750MHz.
+For each example, we report device processing time, host processing time,
+and TIDL API overhead. **Device processing time** is measured on the device,
+from the moment processing of a frame starts until it finishes.
+**Host processing time** is measured on the host, from the moment
+``ProcessFrameStartAsync()`` is called until ``ProcessFrameWait()`` returns
+in the user application. It includes the TIDL API overhead, the OpenCL
+runtime overhead, and the time to copy user input data into padded TIDL
+internal buffers.
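The overhead figures in the tables below can be reproduced from these two
measurements. A minimal sketch, assuming overhead is the extra host-side time
relative to the host time (the exact formula the example applications use is
an assumption, so the result is close to, but not identical with, the
tabulated values):

```python
def api_overhead_percent(device_ms, host_ms):
    """Assumed definition: extra host-side time as a share of host time.

    The example applications may compute this slightly differently
    (e.g. relative to device time), so treat this as an approximation.
    """
    return (host_ms - device_ms) / host_ms * 100.0

# Imagenet example on EVE: device 123.1 ms, host 124.7 ms (from the table below)
print(f"{api_overhead_percent(123.1, 124.7):.2f} %")  # ~1.28 %
```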
+
Imagenet
--------
as the most likely objects that the input image can be.
The following figure and tables show an input image, top 5 predicted
-objects as output, and the processing time on either DLA or DSP.
+objects as output, and the processing time on either EVE or DSP.
.. image:: ../../examples/test/testvecs/input/objects/cat-pet-animal-domestic-104827.jpeg
:width: 600
.. table::
- ==== ============== ============
- Rank Object Classes Probability
- ==== ============== ============
+ ==== ============== =====================
+ Rank Object Classes Probability (softmax)
+ ==== ============== =====================
1 tabby 0.996
2 Egyptian_cat 0.977
3 tiger_cat 0.973
4 lynx 0.941
5 Persian_cat 0.922
- ==== ============== ============
+ ==== ============== =====================
.. table::
====================== ==================== ============
Device Processing Time Host Processing Time API Overhead
====================== ==================== ============
- DLA: 123.1 ms 124.7 ms 1.34 %
+ EVE: 123.1 ms 124.7 ms 1.34 %
**OR**
DSP: 117.9 ms 119.3 ms 1.14 %
====================== ==================== ============
The particular network that we ran in this category, jacintonet11v2,
-has 14 layers. User can specify whether to run the network on DLA or DSP
-for acceleration. We can see that DLA time is slightly higher than DSP time.
-Host time includes the OpenCL runtime overhead and the time to copy user
-input data into padded TIDL buffers. We can see that the overall overhead
-is less than 1.5%.
+has 14 layers. Users can specify whether to run the network on EVE or DSP
+for acceleration. We can see that the EVE time is slightly higher than the
+DSP time, and that the overall overhead is less than 1.5%.
+
+.. note::
+    The probabilities reported here are the output of the network's softmax
+    layer, and are not calibrated to true class probabilities.
Segmentation
------------
The network we ran in this category is jsegnet21v2, which has 26 layers.
From the times reported in the following table, we can see that this network
-runs significantly faster on DLA than on DSP.
+runs significantly faster on EVE than on DSP.
.. table::
====================== ==================== ============
Device Processing Time Host Processing Time API Overhead
====================== ==================== ============
- DLA: 296.5 ms 303.3 ms 2.26 %
+ EVE: 296.5 ms 303.3 ms 2.26 %
**OR**
DSP: 812.0 ms 818.4 ms 0.79 %
====================== ==================== ============
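The EVE advantage can be quantified directly from the table above; a minimal
sketch (the device times are copied from the table, and the speedup is simply
their ratio):

```python
eve_ms, dsp_ms = 296.5, 812.0  # jsegnet21v2 device times from the table

speedup = dsp_ms / eve_ms      # how much faster EVE is than DSP
print(f"EVE is about {speedup:.1f}x faster than DSP")  # ~2.7x
```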
.. image:: images/pexels-photo-378570-ssd.jpg
:width: 600
-The network can be run entirely on either DLA or DSP. But the best
-performance comes with running the first 30 layers on DLA and the
+The network can be run entirely on either EVE or DSP, but the best
+performance comes from running the first 30 layers on EVE and the
next 13 layers on DSP, for this particular jdetnet_ssd network.
Note the **AND** in the following table for the reported time.
Our end-to-end example shows how easy it is to assign a layers group id
====================== ==================== ============
Device Processing Time Host Processing Time API Overhead
====================== ==================== ============
- DLA: 175.2 ms 179.1 ms 2.14 %
+ EVE: 175.2 ms 179.1 ms 2.14 %
**AND**
DSP: 21.1 ms 22.3 ms 5.62 %
====================== ==================== ============
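Simple arithmetic over the table above shows why the split pays off; a hedged
sketch (the pipelining model, with EVE starting frame N+1 while the DSP
finishes frame N, is an assumption about how the example overlaps the two
devices):

```python
eve_ms, dsp_ms = 175.2, 21.1   # per-frame device times from the table

# One frame passing through both layer groups back to back:
serial_latency_ms = eve_ms + dsp_ms        # ~196.3 ms per frame

# If the two stages overlap across frames, steady-state throughput
# is limited by the slower stage:
pipelined_period_ms = max(eve_ms, dsp_ms)  # 175.2 ms per frame

print(serial_latency_ms, pipelined_period_ms)
```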
Running Examples
----------------
-The examples are located in ``/usr/share/ti/tidl-api/examples`` on
+The examples are located in ``/usr/share/ti/tidl/examples`` on
the EVM file system. Each example needs to be run in its own directory.
Running an example with ``-h`` shows a help message with the available options.
The following code section shows how to run the examples, and
root@am57xx-evm:/usr/share/ti/tidl-api/examples/segmentation# cd ../ssd_multibox/; make -j4
root@am57xx-evm:/usr/share/ti/tidl-api/examples/ssd_multibox# ./ssd_multibox -i ../test/testvecs/input/roads/pexels-photo-378570.jpeg
Input: ../test/testvecs/input/roads/pexels-photo-378570.jpeg
- frame[0]: Time on DLA: 175.2ms, host: 179ms API overhead: 2.1 %
+ frame[0]: Time on EVE: 175.2ms, host: 179ms API overhead: 2.1 %
frame[0]: Time on DSP: 21.06ms, host: 22.43ms API overhead: 6.08 %
Saving frame 0 with SSD multiboxes to: multibox_0.png
Loop total time (including read/write/print/etc): 423.8ms