
Hindawi

Journal of Advanced Transportation


Volume 2017, Article ID 2823617, 10 pages
https://doi.org/10.1155/2017/2823617

Research Article
Car Detection from Low-Altitude UAV Imagery with
the Faster R-CNN

Yongzheng Xu,1,2 Guizhen Yu,1,2 Yunpeng Wang,1,2 Xinkai Wu,1,2 and Yalong Ma1,2
1 Beijing Key Laboratory for Cooperative Vehicle Infrastructure Systems and Safety Control, School of Transportation Science and Engineering, Beihang University, Beijing 100191, China
2 Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies, SiPaiLou No. 2, Nanjing 210096, China

Correspondence should be addressed to Guizhen Yu; [email protected]

Received 2 December 2016; Revised 12 July 2017; Accepted 25 July 2017; Published 29 August 2017

Academic Editor: Pascal Vasseur

Copyright © 2017 Yongzheng Xu et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

UAV based traffic monitoring holds distinct advantages over traditional traffic sensors, such as loop detectors, as UAVs have higher mobility, a wider field of view, and less impact on the observed traffic. For traffic monitoring from UAV images, the essential but challenging task is vehicle detection. This paper extends the framework of Faster R-CNN to car detection from low-altitude UAV imagery captured over signalized intersections. Experimental results show that Faster R-CNN achieves promising car detection results compared with other methods. Our tests further demonstrate that Faster R-CNN is robust to illumination changes and cars' in-plane rotation. Moreover, the detection speed of Faster R-CNN is insensitive to the detection load, that is, the number of detected cars in a frame; therefore, the detection speed is almost constant across frames. In addition, our tests show that Faster R-CNN holds great potential for parking lot car detection. This paper aims to guide readers in choosing the best vehicle detection framework for their applications. Future research will focus on extending the current framework to detect other transportation modes such as buses, trucks, motorcycles, and bicycles.

1. Introduction

Unmanned aerial vehicles (UAVs) hold promise of great value for transportation research, particularly for traffic data collection (e.g., [1-5]). UAVs have many advantages over ground based traffic sensors [2]: great maneuverability and mobility, wide field of view, and zero impact on ground traffic. Due to the high cost and the challenges of image processing, UAVs have not been extensively exploited for transportation research. However, with the recent price drop of off-the-shelf UAV products and the wide application of surveillance video technologies, UAVs are becoming more prominent in transportation safety, planning, engineering, and operations.

For UAV based applications in traffic monitoring, one essential task is vehicle detection. This task is challenging for the following reasons: varying illumination conditions, background motion due to UAV movement, complicated scenes, and different traffic conditions (congested or noncongested). Many traditional techniques, such as background subtraction [6], frame difference [7], and optical flow [8], can only achieve low accuracy, and some methods, such as frame difference and optical flow, can only detect moving vehicles. In order to improve detection accuracy and efficiency, many object detection schemes have been applied to vehicle detection from UAV images, including the Viola-Jones (V-J) object detection scheme [9], the linear support vector machine (SVM) with histogram of oriented gradient (HOG) features [10] (SVM + HOG), and Discriminatively Trained Part Based Models (DPM) [11]. Generally, these object detection schemes are less sensitive to image noise and complex scenarios and are therefore more robust and efficient for vehicle detection. However, most of these methods are sensitive to objects' in-plane rotation; that is, only objects in one particular orientation can be detected. Furthermore, many methods, like V-J, are sensitive to illumination changes.

In recent years, the convolutional neural network (CNN) has shown impressive performance on object classification and detection. The structure of the CNN was first proposed by LeCun et al. [12]. As a feature learning architecture, a CNN contains convolution and max-pooling layers.
Each convolutional layer of the CNN generates feature maps using several different convolution kernels applied to the local receptive fields of the preceding layer. The output layer of the CNN combines the extracted features for classification. By applying pooling, the sizes of the feature maps are decreased and the extracted features become more complex and global. Many studies [13-15] have shown that CNNs can achieve promising performance in object detection and classification.

However, directly combining a CNN with the sliding window strategy makes it difficult to precisely localize objects [16, 17]. To address these issues, region-based CNNs, that is, R-CNN [18], SPPnet [19], and Fast R-CNN, have been proposed to improve object detection performance. But the region proposal generation step consumes too much computation time. Therefore, Ren et al. further improved Fast R-CNN [20] and developed the Faster R-CNN [21], which achieves state-of-the-art object detection accuracy with real-time detection speed. Inspired by the success of Faster R-CNN [21] in object detection, this research aims to apply Faster R-CNN [21] to vehicle detection from UAV imagery.

The rest of the paper is organized as follows: Section 2 briefly reviews related work on vehicle detection with CNNs from UAV images, followed by the methodological details of the Faster R-CNN [21] in Section 3. Section 4 presents a comprehensive evaluation of the Faster R-CNN for car detection. Section 5 presents a discussion of some key characteristics of Faster R-CNN. Finally, Section 6 concludes this paper with some remarks.

2. Related Work

A large amount of research has been performed on vehicle detection over the years. Here we focus only on vehicle detection with CNNs from UAV images; some of the most related work is reviewed below.

Pérez et al. [22] developed a traditional object detection framework based on the sliding window strategy with a classifier, using a simple CNN instead of traditional classifiers (SVM, Boosted Trees, etc.). As the sliding window strategy is time-consuming when handling multiscale object detection, the framework of [22] is time-consuming for vehicle detection from UAV images.

Ammour et al. [23] proposed a two-stage car detection method consisting of a candidate region extraction stage and a classification stage. In the candidate region extraction stage, the authors employed the mean-shift algorithm [24] to segment images. Then a fine-tuned VGG16 model [25] was used to extract region features. Finally, an SVM was used to classify the features into "car" and "non-car" objects. The framework of [23] is similar to R-CNN [18] and is time-consuming when generating region proposals. Besides, different models must be trained for the separate stages, which increases the complexity of [23].

Chen et al. [15] proposed a hybrid deep convolutional neural network (HDNN) for vehicle detection in satellite images to handle the large scale variance of vehicles. However, when applying HDNN for vehicle detection from satellite images, it takes about 7-8 seconds to detect one image even using a Graphics Processing Unit (GPU).

Inspired by the success of Faster R-CNN in both detection accuracy and detection speed, this work proposes a car detection method based on Faster R-CNN [21] to detect cars from low-altitude UAV imagery. The details of the proposed method are presented in the following section.

3. Car Detection with Faster R-CNN

Faster R-CNN [21] has achieved state-of-the-art performance for multiclass object detection in many fields (e.g., [19]). But so far no direct application of Faster R-CNN to car detection from low-altitude UAV imagery, particularly in urban environments, has been reported. This paper aims to fill this gap by proposing a framework for car detection from UAV images using Faster R-CNN, as shown in Figure 1: images extracted from training videos are used to jointly train the Region Proposal Networks (RPN) and the Fast R-CNN detector into a final Faster R-CNN model, which is then applied to testing videos to produce the vehicle detection results.

Figure 1: Car detection framework with the Faster R-CNN.

3.1. Architecture of Faster R-CNN. The Faster R-CNN consists of two modules: the Region Proposal Network (RPN) and the Fast R-CNN detector (see Figure 2). The RPN is a fully convolutional network for efficiently generating region proposals with a wide range of scales and aspect ratios, which are fed into the second module. Region proposals are rectangular regions which may or may not contain candidate objects. The Fast R-CNN detector, the second module, is used to refine the proposals. The RPN and the Fast R-CNN detector share the same convolutional layers, allowing for joint training. The Faster R-CNN runs the CNN only once over the entire input image and then refines the object proposals. Due to the sharing of convolutional layers, it is possible to use a very deep network (e.g., VGG16 [25]) for generating high-quality object proposals. The entire architecture is a single, unified network for object detection (see Figure 2).

Figure 2: The architecture of Faster R-CNN, from [20, 21].

3.2. Fast R-CNN Detector. The Fast R-CNN detector takes multiple regions of interest (RoIs) as input. For each RoI (see Figure 2), a fixed-length feature vector is extracted by the RoI pooling layer from the convolutional feature maps. Each feature vector is fed into a sequence of fully connected (FC) layers. The final outputs of the detector, produced by the softmax layer and the bounding-box regressor layer, are (1) softmax probabilities estimated over the K object classes plus the "background" class and (2) the related bounding-box (bbox) values. In this research, the value of K is 1; namely, the object classes contain only one class, "passenger car", plus the "background" class.
Figure 3: (a) Region Proposal Network (RPN), from [21]. (b) Car detection using RPN proposals on our UAV image.

3.3. Region Proposal Networks and Joint Training. When using the RPN to predict car proposals from UAV images, the RPN takes a UAV image as input and outputs a set of rectangular car proposals (i.e., bounding boxes), each with an objectness score. In this paper, the VGG-16 model [25], which has 13 shareable convolutional layers, was used as the Faster R-CNN convolutional backend.

The RPN utilizes sliding windows over the convolutional feature map output by the last shared convolutional layer to generate rectangular region proposals at each position (see Figure 3(a)). An n x n spatial window (filter) is convolved with the input convolutional feature map, and each sliding window is projected to a lower-dimensional feature (512-d for VGG-16) that feeds, through two 1 x 1 convolutions, a box-regression layer (reg) and a box-classification layer (cls). For each sliding window location, k possible proposals (i.e., anchors in [21]) are generated: the reg layer produces 4k outputs encoding the coordinates of the k bounding boxes, while the cls layer outputs 2k objectness scores estimating the probability that each proposal contains a car or a non-car object (see Figure 3(b)).
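To make the anchor enumeration concrete, the following NumPy sketch generates the k anchors for every sliding-window position of the shared feature map. It is only illustrative: the 16-pixel stride, the three scales, and the three aspect ratios are the defaults of the released Faster R-CNN code [21, 26], not values reported for the UAV setting in this paper.

    import numpy as np

    def generate_anchors(feat_h, feat_w, stride=16,
                         scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
        """Enumerate k = len(scales) * len(ratios) anchors at every position
        of a feat_h x feat_w feature map. Returns (feat_h * feat_w * k, 4)
        boxes (x1, y1, x2, y2) in input-image coordinates."""
        base = []
        for s in scales:
            for r in ratios:
                w = s * np.sqrt(1.0 / r)   # anchor width for this scale/ratio
                h = s * np.sqrt(r)         # anchor height (area stays s * s)
                base.append([-w / 2, -h / 2, w / 2, h / 2])
        base = np.array(base)              # (k, 4), centred at the origin
        # Shift the k base anchors to every sliding-window position.
        xs = (np.arange(feat_w) + 0.5) * stride
        ys = (np.arange(feat_h) + 0.5) * stride
        cx, cy = np.meshgrid(xs, ys)
        shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
        return (shifts + base).reshape(-1, 4)

    # Example: a 1920 x 1080 input downsampled 16x by VGG-16 gives roughly a
    # 120 x 67 map, so the RPN scores about 120 * 67 * 9 = 72,360 anchors.
    anchors = generate_anchors(67, 120)
    print(anchors.shape)   # (72360, 4)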

As many proposals highly overlap with each other, nonmaximum suppression (NMS) was applied to merge proposals that have a high intersection-over-union (IoU). After NMS, the remaining proposals were ranked based on the object probability score, and only the top N proposals are used for detection.
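The following is a minimal NumPy sketch of this greedy NMS plus top-N step. The IoU threshold and N = 300 mirror common Faster R-CNN settings (300 is also the number of RPN proposals used later in this paper), but both values are assumptions here rather than settings quoted from the authors' configuration.

    import numpy as np

    def nms(boxes, scores, iou_thresh=0.7, top_n=300):
        """Greedy nonmaximum suppression on (x1, y1, x2, y2) boxes.
        Keeps at most top_n surviving proposals, ranked by score."""
        order = np.argsort(scores)[::-1]
        keep = []
        while order.size > 0 and len(keep) < top_n:
            i = order[0]
            keep.append(i)
            rest = order[1:]
            # Intersection of the current best box with the remaining boxes.
            x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
            y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
            x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
            y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
            inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
            iou = inter / (area_i + area_r - inter)
            # Discard proposals overlapping the kept box above the threshold.
            order = rest[iou <= iou_thresh]
        return keep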
For training the RPN, each proposal is assigned a binary class label which indicates whether the proposal is an object (i.e., a car) or just background. A proposal is designated a positive training example if it overlaps a ground-truth box with an IoU above a predefined threshold (0.7 in [21]) or if it has the highest IoU with a ground-truth box. A proposal is assigned as a negative example if its maximum IoU over all ground-truth boxes is lower than a predefined threshold (0.3 in [21]). Following the multitask loss of the Fast R-CNN network [20], the RPN is trained with a multitask loss, which is defined as

    L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*),    (1)
where i is the index of an anchor and p_i is the predicted probability of anchor i being an object. The ground-truth label p_i^* is 1 if the anchor is positive and 0 if the anchor is negative. The multitask loss has two parts, a classification component L_cls and a regression component L_reg. In (1), t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t_i^* is the vector of the ground-truth box associated with a positive anchor. The two terms are normalized by N_cls and N_reg and weighted by a balancing parameter \lambda. In the released code [26], the cls term in (1) is normalized by the minibatch size (i.e., N_cls = 256), the reg term is normalized by the number of anchor locations (i.e., N_reg ≈ 2,400), and \lambda is set to 10.
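A compact NumPy sketch of (1) is given below. It assumes, as in [20, 21], that L_cls is the log loss over the two classes (object versus background) and L_reg is the smooth L1 loss; the default arguments reproduce the N_cls = 256, N_reg ≈ 2,400, and λ = 10 values quoted above, and the inputs are assumed to cover the sampled anchor minibatch.

    import numpy as np

    def smooth_l1(x):
        """Robust regression loss used for L_reg in Fast/Faster R-CNN [20, 21]."""
        ax = np.abs(x)
        return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

    def rpn_multitask_loss(p, p_star, t, t_star, lam=10.0, n_cls=256, n_reg=2400):
        """Equation (1): binary log loss over the sampled anchors plus a
        smooth-L1 box loss counted only for positive anchors (p_star == 1).
        p, p_star : (N,) predicted objectness probabilities and {0, 1} labels
        t, t_star : (N, 4) predicted and target box parameterizations of (2)"""
        eps = 1e-7
        l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
        l_reg = smooth_l1(t - t_star).sum(axis=1)
        return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg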
Bounding-box regression finds the best nearby ground-truth box of an anchor box. The parameterization of the 4 coordinates of an anchor is described as follows:

    t_x = \frac{x - x_a}{w_a},  t_y = \frac{y - y_a}{h_a},  t_w = \log\left(\frac{w}{w_a}\right),  t_h = \log\left(\frac{h}{h_a}\right),
    t_x^* = \frac{x^* - x_a}{w_a},  t_y^* = \frac{y^* - y_a}{h_a},  t_w^* = \log\left(\frac{w^*}{w_a}\right),  t_h^* = \log\left(\frac{h^*}{h_a}\right),    (2)

where x, y, w, and h denote the bounding box's center coordinates, width, and height, respectively; x, x_a, and x^* are for the predicted box, anchor box, and ground-truth box, respectively, and similar definitions apply for y, w, and h. The bounding-box regression is achieved by using features with the same spatial size on the feature maps, and a set of k bounding-box regressors is trained to adapt to the varying sizes.
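As a worked illustration of (2), the sketch below encodes boxes into regression targets and decodes predicted offsets back into boxes. It assumes boxes and anchors are stored as NumPy arrays of (center x, center y, width, height); that storage convention is ours, not the paper's.

    import numpy as np

    def encode_boxes(boxes, anchors):
        """Parameterization (2): boxes and anchors are (N, 4) arrays of
        (cx, cy, w, h); returns the regression targets (t_x, t_y, t_w, t_h)."""
        t = np.empty_like(boxes, dtype=float)
        t[:, 0] = (boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]   # t_x
        t[:, 1] = (boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]   # t_y
        t[:, 2] = np.log(boxes[:, 2] / anchors[:, 2])             # t_w
        t[:, 3] = np.log(boxes[:, 3] / anchors[:, 3])             # t_h
        return t

    def decode_boxes(t, anchors):
        """Inverse mapping: apply predicted offsets t to anchors to recover
        (cx, cy, w, h) boxes, as done when turning RPN outputs into proposals."""
        out = np.empty_like(t, dtype=float)
        out[:, 0] = t[:, 0] * anchors[:, 2] + anchors[:, 0]
        out[:, 1] = t[:, 1] * anchors[:, 3] + anchors[:, 1]
        out[:, 2] = anchors[:, 2] * np.exp(t[:, 2])
        out[:, 3] = anchors[:, 3] * np.exp(t[:, 3])
        return out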
Since the RPN and the Fast R-CNN detector can share the same convolutional layers, the two networks can be trained jointly into a unified network through the following 4-step training algorithm: first, train the RPN as described above; second, train the detector network using proposals generated by the RPN trained in the first step; third, reinitialize RPN training from the detector network but train only the RPN-specific layers; and finally, train the detector network using the new RPN's proposals. Figure 4 shows two screenshots of car detection with the Faster R-CNN.

Figure 4: Car detection. (a) Signalized intersection; (b) arterial road.

4. Experiments

4.1. Data Set Descriptions. The airborne platform used in this research is a DJI Phantom 2 quadcopter integrated with a 3-axis stabilized gimbal (see Figure 5). Videos are collected by a GoPro Hero 3 Black Edition camera mounted on the UAV. The resolution of the videos is 1920 x 1080 and the frame rate is 24 frames per second (f/s). The stabilized gimbal is used to stabilize the videos and eliminate video jitter caused by the UAV, greatly reducing the impact of external factors such as wind. In addition, an On-Screen Display (OSD), an image transmission module, and a video monitor are installed in the system for data transmission and for monitoring and controlling the flight status.

Figure 5: UAV system architecture (airborne platform with camera and gimbal, image transmission module, On-Screen Display, and video monitor).

A UAV image dataset was built for training and testing the proposed car detection framework. For training video collection, we followed two key suggestions: (1) collect videos with cars of different orientations; (2) collect videos with cars of a wide range of scales and aspect ratios. To collect videos with cars of different orientations, UAV videos were recorded at signalized intersections, since cars at intersections take different orientations while turning. To collect videos covering cars of a wide range of scales and aspect ratios, UAV videos were recorded at different flight heights, ranging from 100 m to 150 m. In this work, UAV videos were collected at two different signalized intersections. For each intersection, one hour of video was captured; in total, two hours of video were collected for building the training and testing datasets.

In our experiment, the training and testing datasets include 400 and 100 images, respectively: 400 images with 12,240 samples for training and 100 images with 3,115 samples for testing. Note that the images and samples for training and testing are collected from different UAV videos. Training and testing samples are annotated using the tool LabelImg [27]. During the training and testing stages, in order to avoid the same car in consecutive frames being used too many times, images were extracted every 10 seconds from the UAV videos.
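The paper does not describe the extraction tooling; the snippet below is one possible OpenCV sketch of this sampling step, assuming the stated 24 f/s frame rate and 10-second interval, with a hypothetical output filename pattern.

    import cv2  # OpenCV

    def extract_frames(video_path, out_pattern, interval_s=10, fps=24):
        """Save one frame every interval_s seconds from a UAV video.
        out_pattern is a format string such as 'frames/img_{:04d}.jpg'."""
        cap = cv2.VideoCapture(video_path)
        step = int(interval_s * fps)      # 10 s at 24 f/s -> every 240th frame
        idx, saved = 0, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                cv2.imwrite(out_pattern.format(saved), frame)
                saved += 1
            idx += 1
        cap.release()
        return saved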
4.2. Training Faster R-CNN Model. Faster R-CNN is powerful for multiclass object detection, but in this research we trained the Faster R-CNN model only for passenger cars. In particular, we applied the VGG-16 model [25]. For the RPN of the Faster R-CNN, 300 RPN proposals were used. The source code of Faster R-CNN was taken from [26], and a GPU was used during training. The main configurations of the computer used in this research are

(i) CPU: Intel Core i7 hexa-core 5930K, 32 GB DDR4;
(ii) Graphics card: Nvidia TITAN X, 12 GB GDDR5;
(iii) Operating system: Linux (Ubuntu 14.04).

The training and detection implementations in this paper are all based on the open source code released by the authors of Faster R-CNN [21]. The inputs for training and testing are images at their original size (1920 x 1080) without any preprocessing steps.

4.3. Performance Evaluation

4.3.1. Evaluation Indicator. The performance of car detection by Faster R-CNN is evaluated by four typical indicators: detection speed (frames per second, f/s), Correctness, Completeness, and Quality, as defined in (3):

    Correctness = \frac{TP}{TP + FP},  Completeness = \frac{TP}{TP + FN},  Quality = \frac{TP}{TP + FP + FN},    (3)

where TP is the number of "true" detected cars, FP is the number of "false" detected objects which are non-car objects, and FN is the number of missed cars. In particular, Quality is considered the strictest criterion, since it accounts for both possible detection errors (false positives and false negatives).
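These indicators are straightforward to compute from the per-image counts; the small helper below evaluates (3), shown here with made-up example counts.

    def detection_quality(tp, fp, fn):
        """Correctness, Completeness, and Quality as defined in (3)."""
        correctness = tp / (tp + fp)
        completeness = tp / (tp + fn)
        quality = tp / (tp + fp + fn)
        return correctness, completeness, quality

    # Example with made-up counts: 950 true detections, 15 false alarms, 35 misses.
    print(detection_quality(950, 15, 35))  # -> approximately (0.984, 0.964, 0.950)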
4.3.2. Description of Algorithms for Comparison. To comprehensively evaluate the car detection performance of Faster R-CNN on UAV images, four other algorithms were included for comparison:

(1) ViBe, a universal background subtraction algorithm [6];
(2) frame difference [7];
(3) the AdaBoost method using Haar-like features (V-J) [9];
(4) the linear SVM classifier with HOG features (HOG + SVM) [10].

As ViBe [6] and frame difference [7] are sensitive to background motion, image registration [28] is applied first to compensate for UAV motion and remove UAV video jitter (a sketch of one possible registration step is given below). The time for image registration is included in the detection time for these two methods. The performance indicators are calculated based on the same 100 images as the testing dataset.
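The paper relies on the registration method of [28]; purely as an illustration, the following OpenCV sketch estimates a global similarity transform between consecutive frames from ORB feature matches and warps the current frame onto the previous one, which is one common way to compensate for UAV motion before background subtraction or frame differencing.

    import cv2
    import numpy as np

    def register_to_previous(prev_gray, cur_gray):
        """Warp the current grayscale frame onto the previous one so that
        ViBe / frame differencing sees a (nearly) static background.
        This is a sketch of one possible scheme, not the method of [28]."""
        orb = cv2.ORB_create(1000)
        k1, d1 = orb.detectAndCompute(prev_gray, None)
        k2, d2 = orb.detectAndCompute(cur_gray, None)
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = sorted(matcher.match(d2, d1), key=lambda m: m.distance)[:200]
        src = np.float32([k2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([k1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        warp, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
        h, w = prev_gray.shape
        return cv2.warpAffine(cur_gray, warp, (w, h))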
Table 1: Car detection results.

Metrics                          ViBe     Frame difference   V-J      HOG + SVM   Faster R-CNN
Correctness (%)                  76.64    78.17              84.74    84.33       98.43
Completeness (%)                 38.65    39.78              41.89    43.18       96.40
Quality (%)                      34.58    35.80              38.96    39.97       94.94
Detection speed (f/s), CPU mode  7.42     11.83              3.38     1.45        0.018
Detection speed (f/s), GPU mode  N/A      N/A                20.61    6.82        2.10

Note that, for ViBe and Frame Difference, the postprocessing of the blob segmentation results is very important for the final car detection accuracy, as blob segmentation using ViBe and Frame Difference may yield segmentation errors. In this work, two rules are designed to screen out segmentation errors: (1) the area of a detected blob is too large (more than 2 times that of a normal passenger car) or too small (smaller than 1/2 of a normal passenger car); (2) the aspect ratio of the minimum enclosing rectangle of a detected blob is larger than 2. The reference area of a normal passenger car was obtained manually. If either of the two rules is met, the detected blob is screened out as a segmentation error (see the sketch below).
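A literal transcription of these two screening rules is given below; it assumes the aspect ratio is measured as the longer side of the minimum enclosing rectangle over the shorter side, which is our reading of rule (2).

    def is_segmentation_error(blob_area, blob_w, blob_h, car_area):
        """Screen ViBe / frame-difference blobs with the two rules above:
        rule 1 rejects blobs more than 2x or less than 1/2 the reference
        passenger-car area; rule 2 rejects elongated blobs whose minimum
        enclosing rectangle has an aspect ratio above 2."""
        rule1 = blob_area > 2.0 * car_area or blob_area < 0.5 * car_area
        aspect = max(blob_w, blob_h) / max(min(blob_w, blob_h), 1e-6)
        rule2 = aspect > 2.0
        return rule1 or rule2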
The V-J [9] and HOG + SVM [10] methods are trained on 12,240 positive samples and 30,000 negative samples. These 12,240 samples only contain cars orientated in the horizontal direction. Besides, all positive samples are normalized to a compressed size of 40 x 20 pixels. The performance evaluations of Faster R-CNN, V-J, and HOG + SVM are run on our testing dataset (100 images, 3,115 testing samples).

4.3.3. Experiment Results. The testing results of the five methods are presented in Table 1. The detection speed is an average over the 100 tested images. To comprehensively evaluate the performance of the different algorithms on both CPU and GPU architectures, detection speeds for V-J, HOG + SVM, and Faster R-CNN were tested on the i7 CPU and the high-end GPU, respectively.

The results show that Faster R-CNN achieved the best Quality (94.94%) compared with the other four methods. ViBe and Frame Difference achieved fast detection speeds under CPU mode but with very low Completeness. The reason is that many stopped cars (such as cars waiting at the traffic light) are recognized as background objects, generating many false negatives and leading to a low Completeness; only when those stopped cars move again can they be detected. As many moving non-car objects (such as tricycles and moving pedestrians) lead to false positives, the Correctness of those two methods is low (76.64% and 78.17%, resp.).

Although the two object detection schemes V-J and HOG + SVM are insensitive to image background motion compared with ViBe and Frame Difference, the Completeness of these two methods is also as low as 41.61% and 42.89%, respectively, which is only slightly higher than that of ViBe and Frame Difference. The reason, as mentioned in Section 1, is that both V-J and HOG + SVM are sensitive to objects' in-plane rotation: only cars in the same orientation as the positive training samples can be detected, and in this paper only cars in the horizontal direction can be detected. A sensitivity analysis of the impact of cars' in-plane rotation is provided in the Discussion.

The Faster R-CNN achieved the best performance (Quality, 94.94%) among all five methods. As Faster R-CNN can learn the orientation, aspect ratio, and scale information during training, this method is not sensitive to cars' in-plane rotation and scale variations. Therefore, Faster R-CNN achieves high Correctness (98.43%) and Completeness (96.40%).

Though Faster R-CNN achieved 2.1 f/s under GPU mode, which is slower than the other methods, 2.1 f/s can still satisfy real-time applications.

5. Discussion

5.1. Robustness to Illumination Changing Conditions. For car detection from UAV videos, one of the most challenging issues is illumination change. Our testing dataset (100 images, 3,115 testing samples) does not contain cars in such scenes, for example, cars traveling from an illuminated (or shadowed) area to a shadowed (or illuminated) area. Therefore, we further conducted an experiment using a 10 min long video captured under illumination changing conditions to evaluate the performance of the Faster R-CNN (see Figure 6).

Figure 6: Car detection under illumination changing condition using Faster R-CNN.

The testing results are highlighted in Table 2. The results show that Faster R-CNN achieved a Completeness of 94.16%, which is slightly lower than that in Table 1 (96.40%), due to the oversaturation of the image sensor under strong illumination. The Correctness of Faster R-CNN is 98.26%. The results shown in Table 2 confirm that illumination changing conditions have little impact on the accuracy of vehicle detection using Faster R-CNN.
Table 2: Vehicle detection under illumination changing condition.

Metrics          ViBe     Frame difference   V-J      HOG + SVM   Faster R-CNN
Correctness (%)  81.91    80.15              87.27    88.45       98.26
Completeness (%) 67.90    64.69              81.36    82.38       94.16
Quality (%)      59.05    55.76              72.73    74.38       92.61

Figure 7: Car detection by HOG + SVM using an image dataset which contains cars orientated in different orientations (0°, 10°, 20°, 30°, 40°, 50°, 60°, 70°, 80°, and 90°).

The methods of ViBe and Frame Difference achieved higher Quality than in Table 1. That is because this test scene is an arterial road (see Figure 6), where most cars were moving fast along the road; therefore, these moving cars are easily detected by ViBe and Frame Difference. However, many black cars whose color is similar to the road surface, as well as cars under strong illumination, could not be detected; therefore, the Completeness of ViBe and Frame Difference is still low (67.90% and 64.69%, resp.). The V-J and HOG + SVM methods achieved higher Completeness (81.36% and 82.38%, resp.) than in Table 1 (41.61% and 42.89%, resp.), because most of the cars in this testing scene (see Figure 6) are orientated in the horizontal direction and thus can be successfully detected by V-J and HOG + SVM. However, the Completeness of these two methods is still significantly lower than that of the Faster R-CNN. As argued by some research [29], methods like V-J are sensitive to lighting conditions.

5.2. Sensitivity to Vehicles' In-Plane Rotation. As mentioned in Section 1, methods like V-J and HOG + SVM are sensitive to vehicles' in-plane rotation. As the vehicle orientations are generally unknown in UAV images, the detection rates (Completeness) of the different methods may be affected significantly by the vehicles' in-plane rotation.

To analyze the sensitivity of the different methods to vehicles' in-plane rotation, experiments were conducted on a dataset which contains vehicles orientated in different directions (see Figure 7). The dataset contains 5 groups of images; each group contains 19 images orientated in different orientations of 0°, 5°, 10°, ..., 85°, 90°, that is, at an interval of 5°.

Figure 8: Sensitivity to vehicles' in-plane rotation (Completeness (%) versus vehicle orientation (°) for Viola-Jones, HOG + SVM, and Faster R-CNN).

From Figure 8 we can see that the Completeness of V-J degrades significantly once the vehicles' orientation exceeds 10 degrees. Compared to V-J, HOG + SVM is less sensitive to vehicles' in-plane rotation, but its Completeness still degrades significantly when the vehicles' orientation exceeds about 45 degrees.
Figure 9: Sensitivity of detection speed to different detection load (tested on i7 CPU); detection speed (f/s) versus number of vehicles for Viola-Jones, HOG + SVM, ViBe, Frame difference, and Faster R-CNN.

Figure 10: Sensitivity of detection speed to different detection load (tested on GPU); detection speed (f/s) versus number of vehicles for Viola-Jones, HOG + SVM, and Faster R-CNN.
Compared with V-J and HOG + SVM, Faster R-CNN is insensitive to vehicles' in-plane rotation (the red curve in Figure 8). The reason is that the Faster R-CNN can automatically learn the orientation, aspect ratio, and scale of vehicles from the vehicle training samples during training. Therefore, Faster R-CNN is insensitive to vehicles' in-plane rotation.

5.3. Sensitivity of Detection Speed to Different Detection Load. Detection speed is crucial for real-time applications. Detection speed can be affected by many factors, such as the detection load (i.e., the number of detected vehicles in one image), the hardware configuration, and the video resolution. Among these factors, the most important is the detection load.

To comprehensively explore the speed characteristics of Faster R-CNN, experiments on images which contain different numbers of detected vehicles were conducted (see Figure 9). The other four methods are also included for comparison. To fairly evaluate the detection speed of the different algorithms on different architectures, the speed tests were performed on the i7 CPU and the high-end GPU, respectively: we explored the detection speed on the i7 CPU for all five methods (see Figure 9) and on the GPU for V-J, HOG + SVM, and Faster R-CNN (see Figure 10).

From Figure 9 we can see that the detection speeds of V-J and HOG + SVM decrease monotonically as the number of detected vehicles increases, with V-J showing a higher descending rate than HOG + SVM. The speed curves of ViBe and Frame Difference are unsmooth, but the increase in the number of detected vehicles has little influence on the detection speed of these two methods.

The detection speed of Faster R-CNN was very slow under CPU mode (see Figure 9). Under GPU mode (see Figure 10), the detection speed of Faster R-CNN was about 2 f/s. From Figures 9 and 10, we can find that the Faster R-CNN shows a similar speed characteristic to ViBe and Frame Difference but with a smooth speed curve: the detection load has almost no influence on its detection speed. The reason is that, when detecting vehicles using Faster R-CNN, the method is applied to the entire image; in the proposal generation stage, 2,000 RPN proposals are generated from the original image [21], and the top-300 ranked proposal regions are fed into the Fast R-CNN detector [20] to check whether each proposal region contains a car. The computational cost is therefore almost the same for each frame, and the detection speed of Faster R-CNN is nearly insensitive to the detection load.

5.4. Training Cost Comparison. When applying the Faster R-CNN for vehicle detection, one important issue that should be considered is the computational cost of the training procedure. As the training samples may change, it is necessary to be able to update the Faster R-CNN model efficiently to satisfy the requirements of vehicle detection. The training costs of the three different methods are shown in Table 3. Because the open source code of Faster R-CNN only supports training under GPU mode, only the training time under GPU mode is provided; for V-J and HOG + SVM, as the open source code only supports CPU mode, only the training time under CPU mode is provided.

Table 3: Training cost.

Metrics        V-J       HOG + SVM   Faster R-CNN
Training time  6.8 days  5 minutes   21 hours

As shown in Table 3, the AdaBoost method using Haar-like features (V-J), trained on 12,240 positive samples and 30,000 negative samples, takes about 6.8 days; the training procedure was run only on the CPU without parallel computing or other acceleration schemes. The linear SVM classifier with HOG features (HOG + SVM) has the fastest training speed among the three methods, taking only 5 minutes on the same training set as the V-J method. Although HOG + SVM has the fastest training speed, its detection performance is significantly lower than that of Faster R-CNN (see Table 1).
The training of Faster R-CNN takes about 21 hours to complete. For practical applications, 21 hours is acceptable, as the annotation of the training samples may itself take several days; for example, in this paper, the annotation of the whole dataset (12,240 training samples and 3,115 testing samples, 500 images in total) using the tool LabelImg [27] took two research fellows 4 days.

6. Concluding Remarks

Inspired by the impressive performance achieved by Faster R-CNN on object detection, this research applied the method to passenger car detection from low-altitude UAV imagery. The experimental results demonstrate that Faster R-CNN achieves the highest Completeness (96.40%) and Correctness (98.43%) with real-time detection speed (2.10 f/s), compared with four other popular vehicle detection methods.

Our tests further demonstrate that Faster R-CNN is robust to illumination changes and cars' in-plane rotation; therefore, Faster R-CNN can be applied to vehicle detection from both static and moving UAV platforms. Besides, the detection speed of Faster R-CNN is insensitive to the detection load (i.e., the number of detected vehicles). The training cost of the Faster R-CNN network is about 21 hours, which is acceptable for practical applications.

It should be emphasized that this research provides a rich comparison of different vehicle detection techniques, covering aspects of the object detection challenge that are usually only partially covered in object detection papers: detection rate without in-plane rotation, sensitivity to in-plane rotation, detection speed, and sensitivity to the number of vehicles in the image, as well as the training cost. This paper tries to guide readers in choosing the best framework for their applications.

However, due to the lack of enough training samples, this research only tested the Faster R-CNN network for passenger cars. Future research will expand this method to the detection of other transportation modes such as buses, trucks, motorcycles, and bicycles.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is partially supported by the Fundamental Research Funds for the Central Universities and partially by the National Science Foundation of China under Grant nos. 61371076 and 51278021.

References

[1] A. Angel, M. Hickman, P. Mirchandani, and D. Chandnani, "Methods of analyzing traffic imagery collected from aerial platforms," IEEE Transactions on Intelligent Transportation Systems, vol. 4, no. 2, pp. 99-107, 2003.
[2] M. Hickman and P. Mirchandani, "Airborne traffic flow data and traffic management," in Proceedings of the 75 Years of the Fundamental Diagram for Traffic Flow Theory: Greenshields Symposium, pp. 121-132, 2008.
[3] B. Coifman, M. McCord, R. G. Mishalani, and K. Redmill, Surface Transportation Surveillance from Unmanned Aerial Vehicles, 2004.
[4] J. Leitloff, D. Rosenbaum, F. Kurz, O. Meynberg, and P. Reinartz, "An operational system for estimating road traffic information from aerial images," Remote Sensing, vol. 6, no. 11, pp. 11315-11341, 2014.
[5] B. Coifman, M. McCord, R. Mishalani, M. Iswalt, and Y. Ji, "Roadway traffic monitoring from an unmanned aerial vehicle," IEE Proceedings - Intelligent Transport Systems, vol. 153, no. 1, pp. 11-20, 2006.
[6] O. Barnich and M. van Droogenbroeck, "ViBe: a universal background subtraction algorithm for video sequences," IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1709-1724, 2011.
[7] A. C. Shastry and R. A. Schowengerdt, "Airborne video registration and traffic-flow parameter estimation," IEEE Transactions on Intelligent Transportation Systems, vol. 6, no. 4, pp. 391-405, 2005.
[8] H. Yalcin, M. Hebert, R. Collins, and M. J. Black, "A flow-based approach to vehicle detection and background mosaicking in airborne video," in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), p. 1202, San Diego, CA, USA, June 2005.
[9] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 511-518, December 2001.
[10] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 1, pp. 886-893, June 2005.
[11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2010.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.
[13] Y. Huang, R. Wu, Y. Sun, W. Wang, and X. Ding, "Vehicle logo recognition system based on convolutional neural networks with a pretraining strategy," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 4, pp. 1951-1960, 2015.
[14] J. Tang, C. Deng, G.-B. Huang, and B. Zhao, "Compressed-domain ship detection on spaceborne optical image using deep neural network and extreme learning machine," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 3, pp. 1174-1185, 2015.
[15] X. Chen, S. Xiang, C.-L. Liu, and C.-H. Pan, "Vehicle detection in satellite images by hybrid deep convolutional neural networks," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 10, pp. 1797-1801, 2014.
[16] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian detection with unsupervised multi-stage feature learning," in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '13), pp. 3626-3633, IEEE, June 2013.
[17] R. Vaillant, C. Monrocq, and Y. Le Cun, "Original approach for the localization of objects in images," IEE Proceedings: Vision, Image and Signal Processing, vol. 141, no. 4, pp. 245-250, 1994.
[18] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14), pp. 580-587, Columbus, OH, USA, June 2014.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904-1916, 2015.
[20] R. Girshick, "Fast R-CNN," in Proceedings of the 15th IEEE International Conference on Computer Vision (ICCV '15), pp. 1440-1448, December 2015.
[21] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 2017.
[22] A. Pérez, P. Chamoso, V. Parra, and A. J. Sánchez, "Ground vehicle detection through aerial images taken by a UAV," in Proceedings of the 17th International Conference on Information Fusion (FUSION '14), July 2014.
[23] N. Ammour, H. Alhichri, Y. Bazi, B. Benjdira, N. Alajlan, and M. Zuair, "Deep learning approach for car detection in UAV imagery," Remote Sensing, vol. 9, no. 4, p. 312, 2017.
[24] D. Comaniciu and P. Meer, "Mean shift: a robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603-619, 2002.
[25] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Computer Science, 2014.
[26] Faster R-CNN, 2016, https://github.com/rbgirshick/py-faster-rcnn.
[27] LabelImg, 2016, https://github.com/tzutalin/labelImg.
[28] Y. Ma, X. Wu, G. Yu, Y. Xu, and Y. Wang, "Pedestrian detection and tracking from low-resolution unmanned aerial vehicle thermal imagery," Sensors, vol. 16, no. 4, p. 446, 2016.
[29] R. Padilla, C. F. F. Costa Filho, and M. G. F. Costa, "Evaluation of Haar cascade classifiers designed for face detection," in Proceedings of the International Conference on Digital Image Processing, 2012.