
Commit 5b0da62

XiaotaoChen authored and RogerChern committed
support int8 training (tusen-ai#243)
* support quantize training
* fix attrs bug for data quantize attrs
* modify code with reviewer's advices
* remove workspace in quantization int8 op and align the shape of mmean and mvar with the source shape in merge_bn
* add reshape name behind roi_align to avoid the diff name between train_symbol and test symbol
* update int8 result
* modify ema update method and fix allocate tempspace bug
* add warning for attach quantized node
* update readme
* Update README.md
* Update README.md
* Update README.md
1 parent f09f507 commit 5b0da62

File tree: 12 files changed (+1080, -15 lines)


README.md

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@
 - Add FitNet based Knowledge Distill (2019.08.27)
 - Add SE and train from scratch (2019.08.30)
 - Add a lot of docs (2019.09.03)
+- Add support for INT8 training (2019.10.24)
 
 ### Setup
 #### All-in-one Script

config/faster_r50v1c4_c5_512roi_1x.py

Lines changed: 0 additions & 1 deletion
@@ -14,7 +14,6 @@ class General:
     batch_image = 2 if is_train else 1
     fp16 = False
 
-
 class KvstoreParam:
     kvstore = "local"
     batch_image = General.batch_image

config/int8/README.md

Lines changed: 80 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,80 @@
1+
## Quantization during Training
2+
3+
#### Motivation
4+
Low precision weight and activation could greatly reduce the storage and memory footprint of detection models and improve the inference latency. We provide the inference time measured on TensorRT of INT8 and FP32 version of `faster_r50v1c4_c5_512roi_1x` as an example below.
5+
6+
| dtype | time (ms) | minival mAP |
| ----- | --------- | ----------- |
| fp32  | 260       | 35.7        |
| int8  | 100       | 35.8        |

**detail configs**

```shell
batch size = 1
device = GTX 1080
data shape = (1, 3, 800, 1200)
```

### Implementation Details

#### The Quantization Methods

**For model weight:**

```shell
nbits = 8
QUANT_LEVEL = 2 ** (nbits - 1) - 1
threshold = max(abs(w_tensor))
quant_unit = threshold / QUANT_LEVEL
quantized_w = round(w_tensor / quant_unit) * quant_unit
```
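
For readers who prefer runnable code, here is a small NumPy sketch of the same per-tensor min-max weight quantization. It only illustrates the formula above; it is not the repository's CUDA kernel.

```python
import numpy as np

def fake_quantize_weight(w_tensor, nbits=8):
    """Fake-quantize a weight tensor with per-tensor min-max scaling."""
    QUANT_LEVEL = 2 ** (nbits - 1) - 1        # 127 for int8
    threshold = np.max(np.abs(w_tensor))      # per-tensor max absolute value
    quant_unit = threshold / QUANT_LEVEL      # size of one quantization step (assumes threshold > 0)
    return np.round(w_tensor / quant_unit) * quant_unit

# Example: quantize a random convolution weight
w = np.random.randn(16, 3, 3, 3).astype(np.float32)
w_q = fake_quantize_weight(w)
```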

**For model activation:** The threshold is maintained as an exponential moving average of the maximum absolute activation. [ref](https://arxiv.org/pdf/1712.05877.pdf)

```shell
nbits = 8
QUANT_LEVEL = 2 ** (nbits - 1) - 1
history_threshold  # initialized with max(abs(act_tensor))
curr_max = max(abs(act_tensor))
threshold = 0.99 * history_threshold + 0.01 * curr_max
quant_unit = threshold / QUANT_LEVEL
quantized_act = round(act_tensor / quant_unit) * quant_unit
```
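
A matching NumPy sketch of the activation path, showing how the EMA threshold would be updated at each training step. Again this is an illustration only, not the operator's actual implementation; the 0.99 decay corresponds to the `ema_decay` attribute described below.

```python
import numpy as np

def fake_quantize_activation(act_tensor, history_threshold, ema_decay=0.99, nbits=8):
    """Fake-quantize an activation tensor and update the EMA threshold.

    Returns (quantized_act, new_history_threshold).
    """
    QUANT_LEVEL = 2 ** (nbits - 1) - 1
    curr_max = np.max(np.abs(act_tensor))
    # Exponential moving average of the max absolute activation
    threshold = ema_decay * history_threshold + (1.0 - ema_decay) * curr_max
    quant_unit = threshold / QUANT_LEVEL
    quantized = np.round(act_tensor / quant_unit) * quant_unit
    return quantized, threshold

# Example: one training step's activation
act = np.random.randn(1, 64, 56, 56).astype(np.float32)
act_q, hist = fake_quantize_activation(act, history_threshold=np.max(np.abs(act)))
```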

### Quantization Configs

The quantization configs live in the `ModelParam.QuantizeTrainingParam` class, which gives users more flexibility during quantization.

**quantize_flag:** whether to quantize the model.

**quantized_op:** the operators to quantize.

`WeightQuantizeParam` and `ActQuantizeParam` hold the attributes needed by the `Quantization_int8` operator for quantizing `weight` and `activation`.
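
For orientation, a minimal sketch of how these fields might sit inside a config file. The layout, default values, and the operator names in `quantized_op` are assumptions, not the repository's exact definition; only `quantize_flag`, `quantized_op`, `WeightQuantizeParam`, `ActQuantizeParam`, and the `Quantization_int8` attributes documented in the next section come from this README.

```python
class ModelParam:
    # ... other model settings ...

    class QuantizeTrainingParam:
        quantize_flag = True                              # quantize the model or not
        quantized_op = ("Convolution", "FullyConnected")  # operators to quantize (assumed names)

        class WeightQuantizeParam:
            delay_quant = 0                # start quantizing after this many iters
            grad_mode = "ste"              # "ste" or "clip"
            is_weight = True
            is_weight_perchannel = False   # only per-tensor is currently supported
            quant_mode = "minmax"          # only "minmax" is currently supported

        class ActQuantizeParam:
            delay_quant = 0
            ema_decay = 0.99               # EMA decay for the activation threshold
            grad_mode = "ste"
            is_weight = False
            quant_mode = "minmax"
```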

### Attributes of the `Quantization_int8` operator

**delay_quant:** quantization only takes effect after `delay_quant` iterations.

**ema_decay:** the decay hyperparameter for the activation threshold update.

**grad_mode:** how gradients are passed back: `ste` or `clip`. `ste` (straight-through estimator) passes the output gradients to the data unchanged; `clip` only passes gradients where the data lies within [-threshold, threshold] and sets the rest to 0. A sketch of the two modes follows this list.

**workspace:** the temporary space used when grad_mode='clip'.

**is_weight:** whether the tensor to be quantized is a weight.

**is_weight_perchannel:** the quantization granularity for weights: per tensor or per channel. Only used when the tensor is a weight. Currently only the per-tensor mode is supported.

**quant_mode:** the quantization method: `minmax` or `power2`. Currently only the `minmax` mode is supported.
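
To make the two gradient modes concrete, here is a minimal NumPy sketch of how the backward pass differs between `ste` and `clip`. It is an illustration of the behavior described above, not the operator's actual CUDA implementation.

```python
import numpy as np

def quant_int8_backward(grad_out, data, threshold, grad_mode="ste"):
    """Illustrative backward pass of a fake-quantization node.

    grad_out : gradient flowing in from the quantized output
    data     : the (unquantized) input tensor of the node
    threshold: the clipping threshold used for quantization
    """
    if grad_mode == "ste":
        # Straight-through estimator: pass the gradient through unchanged.
        return grad_out
    elif grad_mode == "clip":
        # Only propagate gradients where the input lies inside
        # [-threshold, threshold]; zero them elsewhere.
        mask = (np.abs(data) <= threshold).astype(grad_out.dtype)
        return grad_out * mask
    else:
        raise ValueError("grad_mode must be 'ste' or 'clip'")
```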

### How to reproduce the result

1. Install a custom version of MXNet:
   [[CUDA90]](https://1dv.alarge.space/mxnet_cu90-1.6.0b20191018-py2.py3-none-manylinux1_x86_64.whl)
   [[CUDA100]](https://1dv.alarge.space/mxnet_cu100-1.6.0b20191018-py2.py3-none-manylinux1_x86_64.whl)
   [[CUDA101]](https://1dv.alarge.space/mxnet_cu101-1.6.0b20191018-py2.py3-none-manylinux1_x86_64.whl)
2. Train an FP32 model with the default config (see the command sketch after this list).
3. Finetune the trained FP32 model with quantization training. Our finetuning settings are `begin_epoch=6` and `end_epoch=12`; all other configs remain the same as the FP32 training configs.
4. We provide an example [model](https://1dv.alarge.space/faster_r50v1bc4_c5_512roi_1x_int8.zip) for `faster_r50v1c4_c5_512roi_1x`.
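
As a rough command-line sketch of steps 1–3: the training entry point and its flags are assumptions based on the usual SimpleDet workflow, not taken from this README, so adapt them to your checkout.

```shell
# 1. Install the custom MXNet build that matches your CUDA version (links above)
pip install https://1dv.alarge.space/mxnet_cu100-1.6.0b20191018-py2.py3-none-manylinux1_x86_64.whl

# 2. Train the FP32 baseline with the default config (script name assumed)
python detection_train.py --config config/faster_r50v1c4_c5_512roi_1x.py

# 3. Enable quantization in the config (quantize_flag = True, begin_epoch = 6,
#    end_epoch = 12) and finetune from the FP32 checkpoint
python detection_train.py --config config/faster_r50v1c4_c5_512roi_1x.py
```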

### Drawbacks

TensorRT does not provide an API to set the quantization `scale` to a user-supplied value instead of the `scale` it calculates itself, so the learned `threshold` cannot currently be deployed to TensorRT directly. You may need to tweak the weight file generated by TensorRT.
