# Semantic Segmentation

Semantic segmentation assigns a class label to each pixel of the image. It is useful for tasks such as lane detection and road segmentation.

Commonly used training/validation commands are listed in the file [run_segmentation.sh](../run_segmentation.sh). Uncomment one line and run the file to start a run.
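For example, assuming you run from the repository root (the path to the script may differ in your checkout):

```shell
# After uncommenting exactly one training/validation line in the script, launch it
bash ./run_segmentation.sh
```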
## Model Configurations

A set of common example model configurations is defined for all pixel-to-pixel tasks. These models can support multiple inputs (for example, image and optical flow) as well as multiple decoders for multi-task prediction (for example, semantic segmentation + depth estimation + motion segmentation).

Whether to use multiple inputs and how many decoders to use are fully configurable. This framework is also flexible enough to add different model architectures or backbone networks. Some of the model configurations currently available are:<br>
* **deeplabv3lite_mobilenetv2_tv**: (default) This model is mostly similar to the DeepLabV3+ model [[6]] using a MobileNetV2 backbone. The difference from DeepLabV3+ is that we removed the convolutions after the shortcut and kept one set of depthwise separable convolutions to generate the prediction. The ASPP module that we use is a lightweight variant with depthwise separable convolutions (DWASPP). We found that this reduces complexity without sacrificing accuracy. For this reason we call this model DeepLabV3+(Lite) or simply DeepLabV3Lite. (Note: the suffix "_tv" indicates that the backbone model is from torchvision.)<br>
* **fpnlite_pixel2pixel_aspp_mobilenetv2_tv**: This is similar to Feature Pyramid Network [[3]], but adapted for pixel2pixel tasks. We stop the decoder at a stride of 4 and then upsample to the final resolution from there. We also use the DWASPP module to improve the receptive field. We call this model FPNPixel2Pixel.
* **fpnlite_pixel2pixel_aspp_mobilenetv2_tv_fd**: This is also FPN, but with a larger encoder stride (64). This is a low-complexity model (using the Fast Downsampling Strategy [[8]]) that can be used with higher resolutions.
* **fpnlite_pixel2pixel_aspp_resnet50**: Feature Pyramid Network (FPN) based pixel2pixel model using a ResNet50 backbone with DWASPP.
## Datasets: Cityscapes Dataset

* Download the Cityscapes dataset [[1]] from https://www.cityscapes-dataset.com/. You will need to register before the data can be downloaded. Unzip the data into the folder ./data/datasets/cityscapes/data. This folder should contain the leftImg8bit and gtFine folders of Cityscapes.
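As a sketch, assuming the downloaded packages are named leftImg8bit_trainvaltest.zip and gtFine_trainvaltest.zip (check the actual file names on the download page):

```shell
# Assumed zip file names; adjust to match what you actually downloaded
mkdir -p ./data/datasets/cityscapes/data
unzip leftImg8bit_trainvaltest.zip -d ./data/datasets/cityscapes/data
unzip gtFine_trainvaltest.zip -d ./data/datasets/cityscapes/data
```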
## Datasets: VOC Dataset

* The PASCAL VOC dataset [[2]] can be downloaded using the following:<br>
```
mkdir ./data/datasets/voc
cd ./data/datasets/voc
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
```
* Extract the dataset files into ./data/datasets/voc/VOCdevkit
```
tar -xvf VOCtrainval_11-May-2012.tar
tar -xvf VOCtrainval_06-Nov-2007.tar
tar -xvf VOCtest_06-Nov-2007.tar
```
* Download extra annotations: download the augmented annotations as explained here: https://github.com/DrSleep/tensorflow-deeplab-resnet. For this, using a browser, download the zip file SegmentationClassAug.zip from: https://www.dropbox.com/s/oeu149j8qtbs1x0/SegmentationClassAug.zip?dl=0
* Unzip SegmentationClassAug.zip and place the images in the folder ./data/datasets/voc/VOCdevkit/VOC2012/SegmentationClassAug
* Create a list of those images in the ImageSets folder using the following:<br>
```
cd VOCdevkit/VOC2012/ImageSets/Segmentation
ls -1 ../../SegmentationClassAug | sed s/.png// > trainaug.txt
wget http://home.bharathh.info/pubs/codes/SBD/train_noval.txt
mv train_noval.txt trainaug_noval.txt
```
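The manual unzip-and-place step is not shown as a command above; a sketch, assuming SegmentationClassAug.zip sits in the current directory and extracts directly to a SegmentationClassAug folder (verify the archive layout with `unzip -l SegmentationClassAug.zip` first):

```shell
# Assumption: the archive's top-level entry is SegmentationClassAug/
unzip SegmentationClassAug.zip -d ./data/datasets/voc/VOCdevkit/VOC2012
```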
## Training

* These examples use two GPUs because we observed slightly higher accuracy when we restricted the number of GPUs used.
* **Cityscapes Segmentation Training** with MobileNetV2 backbone and DeepLabV3Lite decoder can be done as follows:<br>
```
python ./scripts/train_segmentation_main.py --model_name deeplabv3lite_mobilenetv2_tv --dataset_name cityscapes_segmentation --data_path ./data/datasets/cityscapes/data --img_resize 384 768 --output_size 1024 2048 --pretrained https://download.pytorch.org/models/mobilenet_v2-b0353104.pth --gpus 0 1
```

* Cityscapes Segmentation Training with **RegNet800MF backbone and FPN decoder** can be done as follows:<br>
```
python ./scripts/train_segmentation_main.py --dataset_name cityscapes_segmentation --model_name fpnlite_pixel2pixel_aspp_regnetx800mf --data_path ./data/datasets/cityscapes/data --img_resize 384 768 --output_size 1024 2048 --gpus 0 1 --pretrained https://dl.fbaipublicfiles.com/pycls/dds_baselines/160906036/RegNetX-800MF_dds_8gpu.pyth
```

* It is possible to use a **different image size**. For example, we trained at 1536x768 resolution as follows (we used a smaller crop size than the resize resolution to reduce GPU memory usage):<br>
```
python ./scripts/train_segmentation_main.py --model_name deeplabv3lite_mobilenetv2_tv --dataset_name cityscapes_segmentation --data_path ./data/datasets/cityscapes/data --img_resize 768 1536 --rand_crop 512 1024 --output_size 1024 2048 --pretrained https://download.pytorch.org/models/mobilenet_v2-b0353104.pth --gpus 0 1
```

* Train the **FPNPixel2Pixel model at 1536x768 resolution** (use a 1024x512 crop to reduce memory usage):<br>
```
python ./scripts/train_segmentation_main.py --model_name fpnlite_pixel2pixel_aspp_mobilenetv2_tv --dataset_name cityscapes_segmentation --data_path ./data/datasets/cityscapes/data --img_resize 768 1536 --rand_crop 512 1024 --output_size 1024 2048 --pretrained https://download.pytorch.org/models/mobilenet_v2-b0353104.pth --gpus 0 1
```

* **VOC Segmentation Training** can be done as follows:<br>
```
python ./scripts/train_segmentation_main.py --model_name deeplabv3lite_mobilenetv2_tv --dataset_name voc_segmentation --data_path ./data/datasets/voc --img_resize 512 512 --output_size 512 512 --pretrained https://download.pytorch.org/models/mobilenet_v2-b0353104.pth --gpus 0 1
```
## Validation

* During training, **validation** accuracy is also printed. To explicitly check the accuracy on the **validation** set again, run the following (fill in the path to the pretrained model):<br>
```
python ./scripts/train_segmentation_main.py --phase validation --model_name deeplabv3lite_mobilenetv2_tv --dataset_name cityscapes_segmentation --data_path ./data/datasets/cityscapes/data --img_resize 384 768 --output_size 1024 2048 --gpus 0 1 --pretrained ?????
```
## Inference

Inference can be done as follows (fill in the path to the pretrained model):<br>
```
python ./scripts/infer_segmentation_main.py --phase validation --model_name deeplabv3lite_mobilenetv2_tv --dataset_name cityscapes_segmentation_measure --data_path ./data/datasets/cityscapes/data --img_resize 384 768 --output_size 1024 2048 --gpus 0 1 --pretrained ?????
```
## Results

### Cityscapes Segmentation

|Dataset |Model Architecture |Backbone Model |Backbone Stride|Resolution |Complexity (GigaMACS)|MeanIoU% |Model Configuration Name |
|--------- |---------- |----------- |-------------- |-----------|-------- |----------|---------------------------------------- |
|Cityscapes |FPNLitePixel2Pixel with DWASPP|FD-MobileNetV2 |64 |768x384 |0.99 |62.43 |fpnlite_pixel2pixel_aspp_mobilenetv2_tv_fd |
|Cityscapes |UNetLite with DWASPP |MobileNetV2 |32 |768x384 |**2.20** |**68.94** |**unetlite_pixel2pixel_aspp_mobilenetv2_tv** |
|Cityscapes |DeepLabV3Lite with DWASPP |MobileNetV2 |16 |768x384 |**3.54** |**69.13** |**deeplabv3lite_mobilenetv2_tv** |
|Cityscapes |FPNLitePixel2Pixel |MobileNetV2 |32(\*2\*2) |768x384 |3.66 |70.30 |fpnlite_pixel2pixel_mobilenetv2_tv |
|Cityscapes |FPNLitePixel2Pixel with DWASPP|MobileNetV2 |32 |768x384 |3.84 |70.39 |fpnlite_pixel2pixel_aspp_mobilenetv2_tv |
|Cityscapes |FPNLitePixel2Pixel |FD-MobileNetV2 |64(\*2\*2) |1536x768 |3.85 |69.82 |fpnlite_pixel2pixel_mobilenetv2_tv_fd |
|Cityscapes |FPNLitePixel2Pixel with DWASPP|FD-MobileNetV2 |64 |1536x768 |**3.96** |**71.28** |**fpnlite_pixel2pixel_aspp_mobilenetv2_tv_fd**|
|Cityscapes |FPNLitePixel2Pixel with DWASPP|FD-MobileNetV2 |64 |2048x1024 |7.03 |72.67 |fpnlite_pixel2pixel_aspp_mobilenetv2_tv_fd |
|Cityscapes |DeepLabV3Lite with DWASPP |MobileNetV2 |16 |1536x768 |14.48 |73.59 |deeplabv3lite_mobilenetv2_tv |
|Cityscapes |FPNLitePixel2Pixel with DWASPP|MobileNetV2 |32 |1536x768 |**15.37** |**74.98** |**fpnlite_pixel2pixel_aspp_mobilenetv2_tv** |
|Cityscapes |FPNLitePixel2Pixel with DWASPP|FD-ResNet50 |64 |1536x768 |30.91 |- |fpnlite_pixel2pixel_aspp_resnet50_fd |
|Cityscapes |FPNLitePixel2Pixel with DWASPP|ResNet50 |32 |1536x768 |114.42 |- |fpnlite_pixel2pixel_aspp_resnet50 |
|Cityscapes |DeepLabV3Lite GroupedConvASPP |RegNet800MF [9]|32 |768x384 |**11.19** |**68.44** |**deeplav3lite_pixel2pixel_aspp_regnetx800mf**|
|Cityscapes |FPNLite GroupedConvASPP |RegNet800MF [9]|32 |768x384 |**7.29** |**70.22** |**fpnlite_pixel2pixel_aspp_regnetx800mf** |
|Cityscapes |UNetLite GroupedConvASPP |RegNet800MF [9]|32 |768x384 |**6.09** |**69.93** |**unetlite_pixel2pixel_aspp_regnetx800mf** |
For comparison, here we list a few models from the literature:

|Dataset |Model Architecture |Backbone Model |Backbone Stride|Resolution |Complexity (GigaMACS)|MeanIoU% |Model Configuration Name |
|--------- |---------- |----------- |-------------- |-----------|-------- |----------|------------------------|
|Cityscapes |ERFNet [4] |- |- |1024x512 |27.705 |69.7 |- |
|Cityscapes |SwiftNetMNV2 [5] |MobileNetV2 |- |2048x1024 |41.0 |75.3 |- |
|Cityscapes |DeepLabV3Plus [6,7] |MobileNetV2 |16 | |21.27 |70.71 |- |
|Cityscapes |DeepLabV3Plus [6,7] |Xception65 |16 | |418.64 |78.79 |- |
Notes:
- The suffix **'Lite'** in the model names indicates complexity optimizations in the decoder part of the model - especially the use of depthwise separable convolutions instead of regular convolutions.
- (\*2\*2) in the above table represents two additional depthwise separable convolutions with strides (at the end of the backbone encoder).
- The FD-MobileNetV2 backbone uses a stride of 64 (used in some rows of the above table), achieved by the Fast Downsampling Strategy [8].
- As can be seen from the table, the models included in this repository provide a good accuracy/complexity tradeoff.
- However, the complexity (in GigaMACS) does not always indicate the speed of inference on an embedded device. We also have to consider that regular convolutions and grouped convolutions are typically more efficient in utilizing the available compute resources (as they have more compute per data transfer) compared to depthwise convolutions.
- Hence, although the MobileNetV2 based models may have fewer GigaMACS than the RegNetX models, the RegNetX based models may not be slower in practice. Overall, RegNetX based models are highly recommended as they strike a good balance between complexity (GigaMACS), compute efficiency (more compute per data transfer), and ease of quantization.
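To make the depthwise-vs-regular comparison concrete, here is a back-of-the-envelope MAC count for a single 3x3 convolution layer. The layer shape (256 input and output channels, a 96x192 feature map, roughly the stride-4 features of a 768x384 input) is illustrative only, not taken from any specific model in this repository:

```shell
# Illustrative layer shape: 3x3 conv, 256 -> 256 channels, 96x192 feature map
H=96; W=192; CIN=256; COUT=256; K=3

# Regular convolution: K*K*CIN MACs per output channel per output pixel
REGULAR=$((K * K * CIN * COUT * H * W))

# Depthwise separable: a KxK depthwise conv plus a 1x1 pointwise conv
SEPARABLE=$(( (K * K * CIN + CIN * COUT) * H * W ))

echo "regular:   $REGULAR MACs"
echo "separable: $SEPARABLE MACs"
echo "reduction: $((REGULAR / SEPARABLE))x"   # roughly 8-9x fewer MACs here
```

This is why the 'Lite' decoders cut GigaMACS sharply; the note above explains why fewer MACs do not always translate into faster inference on an embedded device.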
## References

[1] The Cityscapes Dataset for Semantic Urban Scene Understanding, Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele, CVPR 2016, https://www.cityscapes-dataset.com/

[2] The PASCAL Visual Object Classes (VOC) Challenge, Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A., International Journal of Computer Vision, 88(2), 303-338, 2010, http://host.robots.ox.ac.uk/pascal/VOC/

[3] Feature Pyramid Networks for Object Detection, Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie, CVPR 2017

[4] ERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic Segmentation, E. Romera, J. M. Alvarez, L. M. Bergasa and R. Arroyo, Transactions on Intelligent Transportation Systems (T-ITS), 2017

[5] In Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving Images, Marin Orsic, Ivan Kreso, Petra Bevandic, Sinisa Segvic, CVPR 2019

[6] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation, Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, CVPR 2018

[7] Tensorflow/Deeplab Model Zoo, https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md

[8] FD-MobileNet: Improved MobileNet with a Fast Downsampling Strategy, Zheng Qin, Zhaoning Zhang, Xiaotao Chen, Yuxing Peng, https://arxiv.org/abs/1802.03750

[9] Designing Network Design Spaces, Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár, Facebook AI Research (FAIR), https://arxiv.org/pdf/2003.13678.pdf, https://github.com/facebookresearch/pycls