## Quantization during Training

### Motivation
Low-precision weights and activations can greatly reduce the storage and memory footprint of detection models and improve inference latency. As an example, we provide the TensorRT inference time of the INT8 and FP32 versions of `faster_r50v1c4_c5_512roi_1x` below.

| dtype | time (ms) | minival mAP |
| ----- | --------- | ----------- |
| fp32  | 260       | 35.7        |
| int8  | 100       | 35.8        |

**Detailed configs:**

```shell
batch size = 1
device     = GTX 1080
data shape = (1, 3, 800, 1200)
```

### Implementation Details

#### The Quantization Methods

**For model weights:**
```python
import numpy as np

# w_tensor: the float32 weight tensor to quantize (e.g. a conv kernel)
nbits = 8
QUANT_LEVEL = 2 ** (nbits - 1) - 1       # 127 for symmetric 8-bit quantization
threshold = np.max(np.abs(w_tensor))     # per-tensor clipping threshold
quant_unit = threshold / QUANT_LEVEL     # quantization step size
quantized_w = np.round(w_tensor / quant_unit) * quant_unit  # fake-quantized weights
```
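
As a quick sanity check of this scheme, here is a toy numpy round trip (the random `w_tensor` below is a stand-in for a real conv weight, not part of the original code):

```python
import numpy as np

np.random.seed(0)
w_tensor = np.random.randn(16, 3, 3, 3).astype(np.float32)  # stand-in conv weight
threshold = np.abs(w_tensor).max()
quant_unit = threshold / 127                                # QUANT_LEVEL for nbits=8
quantized_w = np.round(w_tensor / quant_unit) * quant_unit
# fake quantization keeps every value within half a quantization step of the original
assert np.abs(quantized_w - w_tensor).max() <= quant_unit / 2 + 1e-6
```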

**For model activations:** the threshold is maintained as an exponential moving average (EMA) of the maximum absolute activation value. [ref](https://arxiv.org/pdf/1712.05877.pdf)

```python
import numpy as np

# act_tensor: the activation tensor of the current batch
nbits = 8
QUANT_LEVEL = 2 ** (nbits - 1) - 1       # 127 for symmetric 8-bit quantization
# history_threshold is initialized with max(abs(act_tensor)) on the first batch
curr_max = np.max(np.abs(act_tensor))
threshold = 0.99 * history_threshold + 0.01 * curr_max  # EMA update (decay = 0.99)
quant_unit = threshold / QUANT_LEVEL
quantized_act = np.round(act_tensor / quant_unit) * quant_unit  # fake-quantized activations
```
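
As an illustration of how the EMA threshold evolves over training batches, here is a toy numpy loop (illustrative only; the real update happens inside the operator):

```python
import numpy as np

np.random.seed(0)
history_threshold = None
for step in range(100):
    act_tensor = np.random.randn(8, 64, 28, 28).astype(np.float32)  # fake batch
    curr_max = float(np.abs(act_tensor).max())
    if history_threshold is None:
        history_threshold = curr_max        # first-batch initialization
    else:
        history_threshold = 0.99 * history_threshold + 0.01 * curr_max
# history_threshold now holds a smoothed estimate of the activation range
print(history_threshold)
```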

### Quantization Configs
The quantization configs live in the `ModelParam.QuantizeTrainingParam` class, which gives users more flexibility over the quantization process.

**quantize_flag:** whether to quantize the model.

**quantized_op:** the operators to quantize.

`WeightQuantizeParam` and `ActQuantizeParam` are the attribute sets needed by the `Quantization_int8` operator for quantizing `weight` and `activation`, respectively.
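
Below is a minimal sketch of what such a config might look like. It assumes a nested-class config style; the operator tuple and all attribute values here are illustrative placeholders, not verified defaults:

```python
class ModelParam:
    class QuantizeTrainingParam:
        quantize_flag = True                              # enable quantization training
        quantized_op = ("Convolution", "FullyConnected")  # illustrative op list

        class WeightQuantizeParam:
            delay_quant = 0              # quantize from the first iteration
            grad_mode = "ste"            # straight-through estimator
            is_weight = True
            is_weight_perchannel = False # only per-tensor is currently supported
            quant_mode = "minmax"

        class ActQuantizeParam:
            delay_quant = 0
            ema_decay = 0.99             # EMA decay for the activation threshold
            grad_mode = "ste"
            is_weight = False
            quant_mode = "minmax"
```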

### Attributes of the `Quantization_int8` operator

**delay_quant:** quantization only starts to take effect after `delay_quant` iterations.

**ema_decay:** the decay rate for the exponential moving average update of the activation threshold.

**grad_mode:** the gradient backpropagation mode, either `ste` or `clip`. `ste` (straight-through estimator) passes the output gradients to the input unchanged; `clip` only passes gradients whose input values lie in [-threshold, threshold] and sets the rest to 0 (a numpy sketch of both modes follows the attribute list).

**workspace:** the temporary space used when `grad_mode='clip'`.

**is_weight:** whether the tensor to be quantized is a weight.

**is_weight_perchannel:** the quantization granularity for weights: per-tensor or per-channel. Only used when the tensor is a weight. Currently, only the per-tensor mode is supported.

**quant_mode:** the quantization method: `minmax` or `power2`. Currently, only the `minmax` mode is supported.
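
Below is a minimal numpy sketch of the two gradient modes described above. It mirrors the documented behavior rather than the actual operator implementation; `x` and `grad_out` are hypothetical names for the operator input and the incoming gradient:

```python
import numpy as np

def quantize_int8_backward(x, grad_out, threshold, grad_mode="ste"):
    if grad_mode == "ste":
        # straight-through estimator: pass the output gradients through unchanged
        return grad_out
    if grad_mode == "clip":
        # only pass gradients where the input lies in [-threshold, threshold];
        # gradients outside that range are set to 0
        mask = (np.abs(x) <= threshold).astype(grad_out.dtype)
        return grad_out * mask
    raise ValueError("grad_mode must be 'ste' or 'clip'")
```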

### How to reproduce the result
1. Install a custom version of MXNet:
[[CUDA90]](https://1dv.alarge.space/mxnet_cu90-1.6.0b20191018-py2.py3-none-manylinux1_x86_64.whl)
[[CUDA100]](https://1dv.alarge.space/mxnet_cu100-1.6.0b20191018-py2.py3-none-manylinux1_x86_64.whl)
[[CUDA101]](https://1dv.alarge.space/mxnet_cu101-1.6.0b20191018-py2.py3-none-manylinux1_x86_64.whl)
2. Train an FP32 model with the default config.
3. Fine-tune the trained FP32 model with quantization training. Our fine-tuning settings are `begin_epoch=6` and `end_epoch=12`; all other configs remain the same as the FP32 training configs. A sketch of steps 2 and 3 follows the list.
4. We provide an example [model](https://1dv.alarge.space/faster_r50v1bc4_c5_512roi_1x_int8.zip) for `faster_r50v1c4_c5_512roi_1x`.
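
As a rough sketch of steps 2 and 3, assuming the repo's usual `detection_train.py` entry point and illustrative config file names (both are assumptions; adapt them to your local layout):

```shell
# Step 2: train the FP32 baseline (config name is illustrative)
python detection_train.py --config config/faster_r50v1c4_c5_512roi_1x.py

# Step 3: fine-tune with quantization enabled in the config
# (quantize_flag=True, begin_epoch=6, end_epoch=12)
python detection_train.py --config config/faster_r50v1c4_c5_512roi_1x_int8.py
```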

### Drawbacks
TensorRT does not provide an API to set the quantization `scale` to a user-supplied value instead of the `scale` it calculates itself, so the learned `threshold` cannot currently be deployed to TensorRT directly. You may need to tweak the weight file generated by TensorRT.