author     Manu Mathew   2020-04-30 07:57:41 -0500
committer  Manu Mathew   2020-04-30 08:00:11 -0500
commit     f0e7f9271e6f49f21e18124115d7bbb322b7cc22 (patch)
tree       a3179d85497b04264439f882b76167e300495a93
parent     f9b3c521edc52a1db97e15cb937ce7fff24a8b0e (diff)
quantization docs updated
-rw-r--r--  docs/Quantization.md  |  17
1 file changed, 7 insertions(+), 10 deletions(-)
diff --git a/docs/Quantization.md b/docs/Quantization.md
index b5db9fa..0e111d1 100644
--- a/docs/Quantization.md
+++ b/docs/Quantization.md
@@ -10,7 +10,6 @@ TI Deep Learning Library (TIDL) is a highly optimized runtime for Deep Learning
 - Quantization Aware Training (QAT): This is needed if the accuracy obtained with Calibration is not satisfactory (e.g. quantization accuracy drop > 2%). QAT operates as a second phase after the initial training in floating point is done. We have provided this PyTorch Jacinto AI DevKit to enable QAT with PyTorch. There is also a plan to make a TensorFlow Jacinto AI DevKit available. Further details are available at: [https://github.com/TexasInstruments/jacinto-ai-devkit](https://github.com/TexasInstruments/jacinto-ai-devkit)<br>
 
 ## Guidelines For Training To Get Best Accuracy With Quantization
-- We are listing these guidelines upfront, because it is important to read and follow these. Closely following these guidelines can save hours & hours of debugging related to accuracy issues with quantization.
 - We recommend that the training uses a sufficient amount of regularization / weight decay. Regularization / weight decay ensures that the weights, biases and other parameters (if any) are small and compact - this is good for quantization. These features are supported in most of the popular training frameworks.<br>
 - We have noticed that some training code bases do not use weight decay for biases, and some others do not use weight decay for the parameters in Depthwise convolution layers. Both are bad strategies for quantization. These choices (probably made to get a 0.1% accuracy lift in floating point) can result in a huge degradation in fixed point - sometimes several percentage points. The weight decay factor should not be too small: we have used a weight decay factor of 1e-4 for training several networks and we highly recommend a similar value. Please do not use small values such as 1e-5.<br>
 - We also highly recommend using Batch Normalization immediately after every Convolution layer. This helps the feature maps to be properly regularized/normalized; if this is not done, there can be accuracy degradation with quantization. This is especially true for Depthwise Convolution layers. However, applying Batch Normalization to the very last Convolution layer (for example, the prediction layer in a segmentation/object detection network) may hurt accuracy and can be avoided. (A short code sketch below illustrates these two guidelines.)<br>
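
Below is a minimal PyTorch sketch of the two guidelines above: a uniform weight decay of 1e-4 that also covers biases and Depthwise convolution parameters, and Batch Normalization immediately after every Convolution except the final prediction layer. The model, layer sizes and helper class are purely illustrative, not taken from the devkit.

```
import torch
import torch.nn as nn

class ConvBNReLU(nn.Sequential):
    """Convolution followed immediately by Batch Normalization and ReLU."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, groups=1):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size, stride, kernel_size // 2, groups=groups, bias=False),
            nn.BatchNorm2d(out_ch),   # BN right after the Conv keeps the feature maps well regularized
            nn.ReLU()
        )

model = nn.Sequential(
    ConvBNReLU(3, 32, stride=2),
    ConvBNReLU(32, 32, groups=32),       # Depthwise Convolution also gets BN (and weight decay below)
    ConvBNReLU(32, 64, kernel_size=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 1000)                  # very last (prediction) layer: no BN here
)

# A single parameter group, so biases and Depthwise parameters are NOT excluded from weight decay,
# with the recommended weight decay factor of 1e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
```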
@@ -19,15 +18,6 @@ To get best accuracy at the quantization stage, it is important that the model i
 - Weight decay is applied to all layers / parameters and the weight decay factor is good.<br>
 - Ensure that the Convolution layers in the network have Batch Normalization layers immediately after them. The only exception allowed to this rule is for the very last Convolution layer in the network (for example the prediction layer in a segmentation or detection network, where adding Batch Normalization might hurt the floating point accuracy).<br>
 
-## Implementation Notes, Limitations & Recommendations
-- Again, we are listing these upfront since it is important to understand and follow these. Please read carefully and follow the recommendations to avoid surprises.
-- **Multi-GPU training/calibration/validation with DataParallel is not yet working with our quantization modules** QuantTrainModule/QuantCalibrateModule/QuantTestModule. We recommend not to wrap the modules in DataParallel if you are training/calibrating/testing with quantization - i.e. if your model is wrapped in QuantTrainModule/QuantCalibrateModule/QuantTestModule.<br>
-- If you get an error during training related to weights and input not being in the same GPU, please check and ensure that you are not using DataParallel with QuantTrainModule/QuantCalibrateModule/QuantTestModule. This may not be such a problem as calibration and quantization may not take as much time as the original floating point training. The original floating point training (without quantization) can still use Multi-GPU as usual and we do not have any restrictions on that. Please take a look at some of our example code to see that we have avoided DataParallel when quantization is enabled, but we use it when it is disabled.<br>
-- If your calibration/training crashes with insufficient GPU memory, reduce the batch size and try again.
-- **The same module should not be re-used multiple times within the module** in order that the activation range estimation is correct. Unfortunately, in the torchvision ResNet models, the ReLU module in the BasicBlock and BottleneckBlock are re-used multiple times. We have corrected this by defining separate ReLU modules. This change is minor and **does not** affect the loading of existing pretrained weights. See the [our modified ResNet model definition here](./modules/pytorch_jacinto_ai/vision/models/resnet.py).<br>
-- **Use Modules instead of functions** (we make use of modules to decide whether to do activation range clipping or not). For example use torch.nn.ReLU instead of torch.nn.functional.relu(), torch.nn.AdaptiveAvgPool2d() instead of torch.nn.functional.adaptive_avg_pool2d(), torch.nn.Flatten() instead of torch.nn.functional.flatten() etc. If you are using functions in your model and the model is giving poor quantized accuracy, then consider replacing those functions by the corresponding modules.<br>
-- After Quantization Aware Training (QAT), the output range of the feature map of a QAT model wrapped in QuantTrainModule is not expected to match with the range of the original floating point model. This is not really a limitation, but a property of QAT, because QAT is proper training and it may change the weights and feature map to make them friendly for quantization.
-
 ## Post Training Calibration For Quantization (PTQ a.k.a. Calibration)
 **Note: this is not our recommended method in PyTorch.**<br>
 Post Training Calibration, or simply Calibration, is a method to reduce the accuracy loss with quantization. It is an approximate method that does not require ground truth or back-propagation - hence it is suitable for implementation in an Import/Calibration tool. We have simulated this in PyTorch and it can be used as a fast method to improve the accuracy with quantization. If you are interested, you can take a look at the [documentation of Calibration here](Calibration.md).<br>
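
If you want to try Calibration from this devkit, the following is a rough usage sketch only: the import path is from this repository, but the exact constructor arguments (for example the dummy_input keyword), the calibration loop details and the way the wrapped model is saved are assumptions - please refer to [Calibration.md](Calibration.md) for the actual API.

```
import torch
from pytorch_jacinto_ai import xnn

model = create_model()                        # hypothetical helper: your trained floating point model
dummy_input = torch.rand(1, 3, 224, 224)

# Wrap the model for calibration; do NOT wrap it in DataParallel (see the notes further below).
model = xnn.quantize.QuantCalibrateModule(model, dummy_input=dummy_input)  # constructor args assumed

model.train()
with torch.no_grad():                         # forward passes only - no ground truth, no back-propagation
    for images, _ in calibration_loader:      # hypothetical loader over a small, representative dataset
        model(images)

# The wrapped floating point model is assumed to be exposed as model.module.
torch.save(model.module.state_dict(), 'calibrated_model.pth')
```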
@@ -94,6 +84,13 @@ As can be seen, it is easy to incorporate QuantTrainModule in your existing trai
 
 Optional: We have provided a utility function called pytorch_jacinto_ai.xnn.utils.load_weights() that prints which parameters are loaded correctly and which are not - you can use this load function if needed to ensure that your parameters are loaded correctly.
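
The note above only gives the function name; a minimal usage sketch follows, in which the keyword argument and the checkpoint handling are assumptions rather than the definitive API.

```
import torch
from pytorch_jacinto_ai import xnn

model = create_model()                                     # hypothetical helper: the model being initialized
pretrained_data = torch.load('checkpoint.pth', map_location='cpu')
# Prints which parameters were loaded correctly and which were not.
xnn.utils.load_weights(model, pretrained=pretrained_data)  # the 'pretrained' keyword is assumed
```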
 
+## Important Notes - read carefully
+- **Multi-GPU training/calibration/validation with DataParallel does not yet work with our quantization modules** QuantTrainModule/QuantCalibrateModule/QuantTestModule. We recommend not wrapping your model in DataParallel if you are training/calibrating/testing with quantization - i.e. if your model is wrapped in QuantTrainModule/QuantCalibrateModule/QuantTestModule.<br>
+- If you get an error during training related to weights and input not being on the same GPU, please check and ensure that you are not using DataParallel with QuantTrainModule/QuantCalibrateModule/QuantTestModule. This is usually not a big problem, since calibration and quantization do not take as much time as the original floating point training. The original floating point training (without quantization) can use Multi-GPU as usual and we do not have any restrictions on that.<br>
+- If your calibration/training crashes with insufficient GPU memory, reduce the batch size and try again.
+- **The same module should not be re-used multiple times within the model**, so that the activation range estimation is correct. Unfortunately, in the torchvision ResNet models, the ReLU module in the BasicBlock and Bottleneck blocks is re-used multiple times. We have corrected this by defining separate ReLU modules. This change is minor and **does not** affect the loading of existing pretrained weights. See [our modified ResNet model definition here](./modules/pytorch_jacinto_ai/vision/models/resnet.py).<br>
+- **Use Modules instead of functions** (we make use of modules to decide whether to do activation range clipping or not). For example, use torch.nn.ReLU instead of torch.nn.functional.relu(), torch.nn.AdaptiveAvgPool2d() instead of torch.nn.functional.adaptive_avg_pool2d(), and torch.nn.Flatten() instead of torch.flatten(). If you are using functions in your model and the model is giving poor quantized accuracy, consider replacing those functions with the corresponding modules (see the sketch after this list).<br>
+
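
A minimal sketch of the last two notes above: each use of an activation gets its own module instance (so activation range estimation stays correct), and Modules replace functional calls. The block below is illustrative; it is not the devkit's modified ResNet.

```
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.relu1 = nn.ReLU()                 # a separate ReLU instance for this use ...
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu2 = nn.ReLU()                 # ... and another one here, never re-use self.relu1
        self.pool = nn.AdaptiveAvgPool2d(1)    # Module instead of torch.nn.functional.adaptive_avg_pool2d()
        self.flatten = nn.Flatten()            # Module instead of a functional flatten call

    def forward(self, x):
        x = self.relu1(self.bn1(self.conv1(x)))
        x = self.relu2(self.bn2(self.conv2(x)))   # avoid torch.nn.functional.relu() here
        return self.flatten(self.pool(x))
```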
 #### Example commands for QAT
 ImageNet Classification: *In this example, only a fraction of the training samples are used in each training epoch to speed up training. Remove the argument --epoch_size to use all the training samples.*
 ```