# Fully Motion-Aware Network for Video Object Detection


This implementation is a fork of [FGFA](https://github.com/msracver/Flow-Guided-Feature-Aggregation), extended by [Shiyao Wang](https://github.com/wangshy31) with instance-level aggregation and motion pattern reasoning.


## Introduction

**Fully Motion-Aware Network for Video Object Detection (MANet)** is initially described in an [ECCV 2018 paper](https://wangshy31.github.io/papers/2-MANet.pdf). It proposes an end-to-end model, the fully motion-aware network (MANet), which jointly calibrates object features at both the pixel level and the instance level in a unified framework.

The contributions of this paper include:

* Propose an instance-level feature calibration method by learning instance movements through time. The instance-level calibration is more robust to occlusions and outperforms pixel-level feature calibration.
* Develop a motion pattern reasoning module to dynamically combine pixel-level and instance-level calibration according to the motion.
* Demonstrate MANet on the large-scale [ImageNet VID dataset](http://image-net.org/challenges/LSVRC/) with state-of-the-art performance.


## Installation

1. Clone the repo. We refer to the cloned directory as `${MANet_ROOT}`.
    ```
    git clone https://github.com/wangshy31/MANet_for_Video_Object_Detection.git
    ```
2. Some Python packages may be missing: cython, opencv-python >= 3.2.0, easydict. If `pip` is set up on your system, they can be fetched and installed by running
    ```
    pip install Cython
    pip install opencv-python==3.2.0.6
    pip install easydict==1.6
    ```
3. Run `sh ./init.sh` to build the Cython modules automatically and create some folders.

4. Install MXNet, following [FGFA](https://github.com/msracver/Flow-Guided-Feature-Aggregation):

    4.1 Clone MXNet and check out [MXNet@(v0.10.0)](https://github.com/apache/incubator-mxnet/tree/v0.10.0) by

    ```
    git clone --recursive https://github.com/apache/incubator-mxnet.git
    cd incubator-mxnet
    git checkout v0.10.0
    git submodule update
    ```

    We also provide a [repo](https://github.com/wangshy31/mxnet.git) that contains MXNet already configured as required.

    4.2 Copy operators in `${MANet_ROOT}/manet_rfcn/operator_cxx` to `${YOUR_MXNET_FOLDER}/src/operator/contrib` by

    ```cp -r ${MANet_ROOT}/manet_rfcn/operator_cxx/* ${MXNET_ROOT}/src/operator/contrib/```

    4.3 Compile MXNet

    ```
    cd ${MXNET_ROOT}
    make -j4
    ```
    4.4 Install the MXNet Python binding by

    ```
    cd python
    sudo python setup.py install
    ```

    An optional import check to verify the installation is sketched right after this list.
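
The following is a minimal, optional sanity check (not part of the repo) that the Python dependencies and the freshly built MXNet are importable from the environment you plan to use; exact version strings may differ on your machine.

```python
# Optional sanity check: run with the Python environment you will use for MANet.
# All imports should succeed without errors.
import Cython
import cv2
from easydict import EasyDict

import mxnet as mx

print("Cython :", Cython.__version__)
print("OpenCV :", cv2.__version__)         # expected to be a 3.2.x build
print("easydict works:", EasyDict({"k": 1}).k == 1)
print("MXNet  :", mx.__version__)          # expected to be 0.10.0
```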


## Preparation for Training & Testing

**For data processing**:

1. Please download the ILSVRC2015 DET and ILSVRC2015 VID datasets, and make sure the directory layout looks like this:

    ```
    ./data/ILSVRC2015/
    ./data/ILSVRC2015/Annotations/DET
    ./data/ILSVRC2015/Annotations/VID
    ./data/ILSVRC2015/Data/DET
    ./data/ILSVRC2015/Data/VID
    ./data/ILSVRC2015/ImageSets
    ```

2. Please download the ImageNet pre-trained ResNet-v1-101 model and the Flying-Chairs pre-trained FlowNet model manually from [OneDrive](https://1drv.ms/u/s!Am-5JzdW2XHzhqMOBdCBiNaKbcjPrA), and put them under the folder `./model`. Make sure it looks like this:

    ```
    ./model/pretrained_model/resnet_v1_101-0000.params
    ./model/pretrained_model/flownet-0000.params
    ```
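
Below is a small, optional Python check (not part of the repo) that the dataset folders and pre-trained model files listed above are in place before training; adjust the paths if you keep the data elsewhere.

```python
# Optional check: verify the dataset folders and pre-trained models listed
# above exist relative to the current directory (run from ${MANet_ROOT}).
import os

expected_paths = [
    "./data/ILSVRC2015/Annotations/DET",
    "./data/ILSVRC2015/Annotations/VID",
    "./data/ILSVRC2015/Data/DET",
    "./data/ILSVRC2015/Data/VID",
    "./data/ILSVRC2015/ImageSets",
    "./model/pretrained_model/resnet_v1_101-0000.params",
    "./model/pretrained_model/flownet-0000.params",
]

missing = [p for p in expected_paths if not os.path.exists(p)]
if missing:
    print("Missing paths:")
    for p in missing:
        print("  ", p)
else:
    print("All dataset and pre-trained model paths are in place.")
```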

**For training & testing**:

1. Three-phase training is performed on a mixture of ImageNet DET+VID, which is helpful for the final performance. A minimal sketch of the two feature-combination schemes used across the phases follows this list.

    **Phase 1**: Fix the weights of ResNet and combine the pixel-level and instance-level aggregated features by an average operation. See `script/train/phase-1`;

    **Phase 2**: Similar to phase 1, but jointly train ResNet. See `script/train/phase-2`;

    **Phase 3**: Fix the weights of ResNet, change the average operation to learnable weights, and sample more VID data. See `script/train/phase-3`;

    We use 4 GPUs to train models on ImageNet VID. Any NVIDIA GPU with at least 8GB of memory should be OK.

2. To perform experiments, run the python script with the corresponding config file as input. For example, to train and test MANet with R-FCN, use the following command:

    ```
    ./run.sh
    ```

    A cache folder will be created automatically to save the model and the log under `imagenet_vid/`.

3. Please find more details in the config files and in our code.
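
For illustration only, the snippet below is a tiny NumPy sketch (not the repo's MXNet implementation) of the two ways the pixel-level and instance-level calibrated features can be combined: the plain average used in phases 1 and 2, and a learnable, normalized weighting in the spirit of phase 3. Shapes and names are made up for the example.

```python
import numpy as np

def combine_average(pixel_feat, instance_feat):
    # Phases 1 and 2: plain average of the two calibrated features.
    return 0.5 * (pixel_feat + instance_feat)

def combine_learnable(pixel_feat, instance_feat, w_pixel, w_instance):
    # Phase 3 (sketch): the average is replaced by weights that would be
    # learned during training (scalars here), normalized to sum to one.
    logits = np.array([w_pixel, w_instance], dtype=np.float64)
    weights = np.exp(logits) / np.exp(logits).sum()
    return weights[0] * pixel_feat + weights[1] * instance_feat

# Toy usage with random "feature maps" of shape (channels, height, width).
rng = np.random.default_rng(0)
pixel_feat, instance_feat = rng.normal(size=(2, 256, 38, 38))
print(combine_average(pixel_feat, instance_feat).shape)
print(combine_learnable(pixel_feat, instance_feat, 0.2, 0.8).shape)
```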

## Main Results

1. We conduct an ablation study to validate the effectiveness of the proposed network.



**Table 1**. Accuracy of different methods on the ImageNet VID validation set, using a ResNet-101 feature extraction network. Detection accuracy is reported separately for slow (motion IoU > 0.9), medium (0.7 ≤ motion IoU ≤ 0.9), and fast (motion IoU < 0.7) moving object instances.
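
As a reading aid only, here is a hypothetical helper (not from the repo) that applies the motion IoU split quoted in the Table 1 caption to bucket object instances by motion speed.

```python
def motion_speed(motion_iou):
    # Buckets from the Table 1 caption: slow if motion IoU > 0.9,
    # medium if 0.7 <= motion IoU <= 0.9, fast if motion IoU < 0.7.
    if motion_iou > 0.9:
        return "slow"
    if motion_iou >= 0.7:
        return "medium"
    return "fast"

print(motion_speed(0.95), motion_speed(0.80), motion_speed(0.50))
# -> slow medium fast
```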

2. We take a deeper look at the detection results and show that the two calibrated features have their respective strengths.



**Figure 1**. Visualization of two typical examples: occluded and non-rigid objects. They show the respective strengths of the two calibration methods.



**Table 2**. Statistical analysis on different validation sets. The instance-level calibration is better when objects are occluded or move more regularly, while the pixel-level calibration performs better on non-rigid motion. Combining these two modules achieves the best performance.


## Download Trained Models
You can download the trained MANet from [drive](https://drive.google.com/file/d/1tKFfOKaFUeZanKTCCwVw-xaKu0wAw71t/view?usp=sharing). It can achieve 78.03% mAP without sequence-level post-processing (e.g., SeqNMS).


## Citing MANet

If you find Fully Motion-Aware Network for Video Object Detection useful in your research, please consider citing:
```
@inproceedings{wang2018fully,
  author    = {Wang, Shiyao and Zhou, Yucong and Yan, Junjie and Deng, Zhidong},
  title     = {Fully Motion-Aware Network for Video Object Detection},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  pages     = {542--557},
  year      = {2018}
}
```