
- <img src="https://latex.codecogs.com/svg.latex?\;\mathcal{L}_{match}(y_i,\hat{y}_{\sigma(i)})" title="\mathcal{L}_{match}(y_i,\hat{y}_{\sigma(i)})" /> : a pair-wise *matching cost* between ground truth <img src="https://latex.codecogs.com/svg.latex?\;y_i" title="y_i" /> and a prediction with index <img src="https://latex.codecogs.com/svg.latex?\;\sigma(i)" title="\sigma(i)" />
- Hungarian algorithm [[arxiv]](https://arxiv.org/abs/1506.04878)
  - <img src="https://latex.codecogs.com/svg.latex?\;\mathcal{L}_{match}(y_i,\hat{y}_{\sigma(i)})=-\mathbb{1}_{\{c_{i}\neq\emptyset\}}\hat{p}_{\sigma(i)}(c_i)+\mathbb{1}_{\{c_{i}\neq\emptyset\}}\mathcal{L}_{box}(b_i,\hat{b}_{\sigma(i)})" title="\mathcal{L}_{match}(y_i,\hat{y}_{\sigma(i)})=-\mathbb{1}_{\{c_{i}\neq\emptyset\}}\hat{p}_{\sigma(i)}(c_i)+\mathbb{1}_{\{c_{i}\neq\emptyset\}}\mathcal{L}_{box}(b_i,\hat{b}_{\sigma(i)})" />
- *Hungarian loss*
<p align="center"><img width="100%" src="img/eq2.PNG" /></p>

- <img src="https://latex.codecogs.com/svg.latex?\;\hat{\sigma}" title="\hat{\sigma}" /> : the optimal assignment computed in the first step (eq. 1)
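The optimal assignment can be computed with an off-the-shelf Hungarian solver. A minimal sketch using `scipy.optimize.linear_sum_assignment`, where the toy `class_prob` and `box_cost` matrices are random placeholders standing in for real model outputs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy setup: 3 ground-truth objects, N = 5 prediction slots.
rng = np.random.default_rng(0)
class_prob = rng.random((3, 5))  # class_prob[i, j] = p_hat_j(c_i)
box_cost = rng.random((3, 5))    # box_cost[i, j]  = L_box(b_i, b_hat_j)

# Pairwise matching cost: -p_hat + L_box (lower is better).
cost = -class_prob + box_cost

gt_idx, pred_idx = linear_sum_assignment(cost)  # minimizes total cost
print(list(zip(gt_idx.tolist(), pred_idx.tolist())))
```

Each ground truth ends up matched to exactly one prediction slot; the remaining slots are matched to the no-object class.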
### Bounding box loss
- A linear combination of the <img src="https://latex.codecogs.com/svg.latex?\;l_1" title="l_1" /> loss and the generalized IoU loss
- <img src="https://latex.codecogs.com/svg.latex?\;\mathcal{L}_{box}(b_i,\hat{b}_{\sigma(i)})=\lambda_{iou}\mathcal{L}_{iou}(b_i,\hat{b}_{\sigma(i)})+\lambda_{L1}||b_i-\hat{b}_{\sigma(i)}||_1" title="\mathcal{L}_{box}(b_i,\hat{b}_{\sigma(i)})=\lambda_{iou}\mathcal{L}_{iou}(b_i,\hat{b}_{\sigma(i)})+\lambda_{L1}||b_i-\hat{b}_{\sigma(i)}||_1" />

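A minimal, framework-free sketch of this loss for boxes in (x1, y1, x2, y2) format; the weights λ_iou = 2 and λ_L1 = 5 are the values reported in the DETR paper:

```python
def giou(b1, b2):
    """Generalized IoU for axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = a1 + a2 - inter
    # Smallest enclosing box C penalizes non-overlapping predictions.
    cx1, cy1 = min(b1[0], b2[0]), min(b1[1], b2[1])
    cx2, cy2 = max(b1[2], b2[2]), max(b1[3], b2[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (c_area - union) / c_area

def box_loss(b, b_hat, lam_iou=2.0, lam_l1=5.0):
    l_iou = 1.0 - giou(b, b_hat)                    # GIoU loss
    l1 = sum(abs(x - y) for x, y in zip(b, b_hat))  # ||b - b_hat||_1
    return lam_iou * l_iou + lam_l1 * l1

print(box_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # perfect overlap -> 0.0
```

Unlike plain IoU, the GIoU term still gives a gradient signal (a negative value) when the two boxes do not overlap at all.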
## DETR architecture
<p align="center"><img width="100%" src="img/fig2.PNG" /></p>

### Backbone
- <img src="https://latex.codecogs.com/svg.latex?\;x_{img}\in\mathbb{R}^{3\times{H_0}\times{W_0}}" title="x_{img}\in\mathbb{R}^{3\times{H_0}\times{W_0}}" /> : the initial image
- <img src="https://latex.codecogs.com/svg.latex?\;f\in\mathbb{R}^{C\times{H}\times{W}}" title="f\in\mathbb{R}^{C\times{H}\times{W}}" /> : a lower-resolution activation map
  - <img src="https://latex.codecogs.com/svg.latex?\;C=2048" title="C=2048" /> and <img src="https://latex.codecogs.com/svg.latex?\;{H},{W}=\frac{H_0}{32},\frac{W_0}{32}" title="{H},{W}=\frac{H_0}{32},\frac{W_0}{32}" />

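For concreteness, the shapes for a typical ResNet-50 backbone with output stride 32; the input resolution below is a made-up example:

```python
H0, W0 = 800, 1216           # example input resolution (multiple of 32)
C = 2048                     # ResNet-50 final-stage channel count
H, W = H0 // 32, W0 // 32    # the backbone downsamples by a factor of 32
print((C, H, W))             # -> (2048, 25, 38)
```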
### Transformer encoder
- A 1x1 convolution reduces the channel dimension of the high-level activation map <img src="https://latex.codecogs.com/svg.latex?\;f" title="f" /> from <img src="https://latex.codecogs.com/svg.latex?\;C" title="C" /> to a smaller dimension <img src="https://latex.codecogs.com/svg.latex?\;d" title="d" />, creating a new feature map <img src="https://latex.codecogs.com/svg.latex?\;z_0\in\mathbb{R}^{d\times{H}\times{W}}" title="z_0\in\mathbb{R}^{d\times{H}\times{W}}" />.
- The spatial dimensions of <img src="https://latex.codecogs.com/svg.latex?\;z_0" title="z_0" /> are collapsed into one dimension, resulting in a <img src="https://latex.codecogs.com/svg.latex?\;d\times{HW}" title="d\times{HW}" /> feature map.
- Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed-forward network (FFN).
- Each encoder layer is supplemented with fixed positional encodings that are added to the input of each attention layer.

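A numpy sketch of these two preprocessing steps, assuming d = 256 (DETR's default model dimension) and example spatial dims H = 25, W = 38; a 1x1 convolution is just a per-pixel linear map, so `tensordot` stands in for it here:

```python
import numpy as np

d, C, H, W = 256, 2048, 25, 38
f = np.random.rand(C, H, W)          # backbone activation map
W_proj = np.random.rand(d, C)        # 1x1 conv weights: per-pixel map C -> d

z0 = np.tensordot(W_proj, f, axes=1)  # (d, H, W): channel reduction
tokens = z0.reshape(d, H * W)         # collapse spatial dims -> (d, HW)
print(tokens.shape)                   # -> (256, 950)
```

The resulting HW columns are the sequence of tokens fed to the encoder.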
- Because of a fixed size set of <img src="https://latex.codecogs.com/svg.latex?\;N" title="N" /> bounding boxes, an additional special class label <img src="https://latex.codecogs.com/svg.latex?\;\emptyset" title="\emptyset" /> is used to represent that no object is detected within a slot. This class plays a similar role to the "background" class in the standard object detection approaches.

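A toy illustration of the target padding this implies; the class names and the small N are made up for the example:

```python
NO_OBJECT = "no-object"  # the special class label (written ∅ above)
N = 5                    # fixed number of prediction slots (toy value)

gt_classes = ["dog", "cat"]  # ground-truth objects in a made-up image
padded = gt_classes + [NO_OBJECT] * (N - len(gt_classes))
print(padded)  # -> ['dog', 'cat', 'no-object', 'no-object', 'no-object']
```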
### Auxiliary decoding losses
- Prediction FFNs and the Hungarian loss are added after each decoder layer, to help the model output the correct number of objects of each class
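A sketch of what this means for the training objective, assuming a hypothetical `hungarian_loss(outputs, targets)` helper; the lambda below is a stand-in for the demo, not a real loss:

```python
def total_loss(decoder_outputs, targets, hungarian_loss):
    """Sum the Hungarian loss over the predictions of every decoder layer."""
    return sum(hungarian_loss(out, targets) for out in decoder_outputs)

# Toy check: three decoder layers whose stand-in per-layer losses are 1, 2, 3.
print(total_loss([1.0, 2.0, 3.0], None, lambda out, tgt: out))  # -> 6.0
```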