Convolutional neural network based encoder-decoder for efficient real-time object detection
Corresponding Author:
Nagarajan Mohankumar
Symbiosis Institute of Technology, Symbiosis International (Deemed University), Nagpur Campus
Pune, India
Email: [email protected]
1. INTRODUCTION
More than 90% of human perception is visual, and imaging equipment is widely used in fields directly related to human activity and daily life [1]. Owing to the ongoing growth of machine learning algorithms, the processing of images and other visual information has been successfully adopted across various industries. Object detection, a primary research challenge in computer vision, has drawn increasing attention from researchers. Object detection typically involves two stages: first, searching for the object in the image; second, localizing it with bounding boxes. Convolutional neural networks (CNNs) have become highly effective at object detection in recent years [2]–[5]; the region-based convolutional neural network (R-CNN) [6], YOLO [7], the spatial pyramid pooling network (SPP) [8], and Fast R-CNN [9] are representative detection techniques in this field. Constrained by computational hardware and data availability, traditional object detection algorithms have significant drawbacks [10]. Conversely, with the development of artificial intelligence (AI) and processing power in recent years, the entire process can now be automated with little to no human involvement. The primary distinction is that traditional object detection techniques rely on human experience and expert judgement to extract features, whereas AI uses a sophisticated neural network that can be trained to automatically identify powerful and discriminative features.
In particular, encoder-decoder models based on fully convolutional networks (FCNs) have significantly enhanced performance in tasks such as semantic segmentation [11], [12], edge recognition [13], object detection [14], and crowd counting [15]. Essentially, most popular object detection techniques operate within the encoder-decoder framework. For the detection task, some researchers have built structures on the encoder-decoder paradigm and attained state-of-the-art performance [16]. On benchmark datasets, CNN-based encoder-decoder models have been particularly crucial for continuously improving detection performance [17]. The encoder performs convolution, whereas the decoder performs deconvolution, un-pooling, and up-sampling to predict pixel-wise class labels. The key feature is the up-sampling decoder that corresponds to the low-resolution encoder feature maps. This architecture employs the encoder's pooling indices to up-sample to the pixel-wise classification map while also significantly reducing the number of trainable parameters.
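To make the pooling-indices mechanism concrete, here is a minimal sketch in PyTorch (the framework is our assumption for illustration; the paper does not name one). The encoder records argmax positions during max-pooling, and the decoder re-uses them to up-sample without any trainable un-pooling parameters:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 32, 32)   # low-level encoder feature maps
down, indices = pool(x)          # 32x32 -> 16x16; argmax positions kept
up = unpool(down, indices)       # 16x16 -> 32x32; sparse, parameter-free
assert up.shape == x.shape
# 'up' is zero except at the remembered max locations; trainable
# convolutions afterwards densify it into full feature maps.
```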
The goal of object detection, which is typically performed on images or videos, is to find object boundaries and to indicate each object's extent and location. The next step is to classify the object's category and to provide the classification likelihood. This task is more difficult than simple image classification because the positions of many objects must be determined from the image or video. CNNs have been used successfully for the detection and classification of objects [18]. Current models include ways to classify either a full input window per scene or a bounding box for each of several objects. Semantic segmentation has had a breakthrough thanks to FCNs, which provide a potent method for boosting the effectiveness of CNNs by accepting inputs of any size [19]. The encoder-decoder-based concept was presented by [20], which proposed it for unsupervised feature learning; encoder-decoder-backed neural networks have since emerged as a promising alternative for such tasks. An intriguing pedestrian collision alert system for advanced driver assistance systems was suggested in [21]; however, it is only capable of detecting and warning about pedestrians. Facial feature localization [22] extracted information from input strings that could only be one-dimensional using the Viterbi decoding technique. Support vector machine (SVM)-based predictive modeling [23] utilised a similar concept to expand SVM outcomes using two-dimensional maps.
The pixel-wise contextual attention network (PiCANet), an attention-generating module that learns to selectively attend to significant locations for every pixel by employing a bidirectional long short-term memory (Bi-LSTM) module within the feature maps, was proposed in [24]. A coarse multi-class segmentation CNN with an FCN architecture has likewise been applied to infer C. elegans tissues. To predict pixel-level labels and refine the label map with a conditional random field (CRF), such networks achieve denser score maps using the FCN architecture. One of the current major trends in CNN architecture design is the incorporation of an encoder and a decoder to improve performance. Apart from these object detection models, several detection algorithms have been implemented on hardware platforms to improve detection performance.
The pyramid scene parsing network (PSPNet) is another effective, recently released CNN architecture intended for pixel-level prediction tasks. Its global pyramid pooling structure, which combines global and local cues to produce the final result, builds pixel-level features for effective segmentation. Due to the PSPNet architecture's extreme complexity, its training and testing processes demand a sizable amount of processing power and graphics processing unit (GPU) capability. The concept of panoptic segmentation (PS) was recently introduced in a study on pixel-wise segmentation. To complete a broad segmentation task, PS combines instance segmentation and semantic segmentation. It performs well when compared to previous visual geometry group (VGG)-based networks, although size is the design's main flaw.
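For orientation, a pyramid pooling module of the kind PSPNet popularised can be sketched as follows; the bin sizes (1, 2, 3, 6) and the channel reduction are assumptions drawn from common practice, not details given in this paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Sketch of a PSPNet-style pyramid pooling module (bins assumed)."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1, bias=False))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(s(x), size=(h, w), mode="bilinear",
                                align_corners=False) for s in self.stages]
        return torch.cat([x] + pooled, dim=1)  # global + local cues

x = torch.randn(1, 64, 32, 32)
print(PyramidPooling(64)(x).shape)  # torch.Size([1, 128, 32, 32])
```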
The Prophet algorithm, K-means clustering, and seasonal autoregressive integrated moving-average methods play a role in enhancing cloud infrastructures by grouping servers into clusters with similar utilization patterns; K-means clustering enhances resource allocation efficiency [25]. An Internet of things (IoT)-driven image recognition system utilizes CNNs to detect and quantify microplastics [26]. Data collected by sensors is forwarded to a centralized monitoring system that decides whether an alarm should be activated when conditions diverge from their ideal state [27]. K-nearest neighbor (KNN) and SVM algorithms form a precise classification model that exploits the important data to improve prediction accuracy [28]. SVMs combined with recurrent neural networks are powerful classifiers that make it feasible to stratify patients' risks and predict how they will respond to therapy [29]. Cloud computing allows a seizure prediction system to become more accessible and scalable [30], and feature selection has been examined for improving accuracy [31]. Hybrid machine learning techniques, such as SVM with a CNN algorithm, have been used to anticipate Alzheimer's disease [32].
The key benefits of our proposed decoupled architecture are its simple training under various environmental conditions and its ease of customization. For pixel-wise classification, the encoder creates low-resolution feature maps, which the decoder up-samples by convolving them with trainable filters to yield dense feature maps [33]. The fundamental component of the proposed method is the decoding procedure, which provides several useful advantages in terms of improving boundary delineation and reducing model size. Training is also much improved by lowering the number of trainable parameters. The method offers simple training, in which the encoder and the decoder are trained at the same time.
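Because the encoder and decoder are trained at the same time, a single optimizer can update both halves, as the following hedged sketch shows; the stand-in layers and the 21-class output are illustrative assumptions only:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the encoder and decoder described above;
# layer sizes and the 21-class output are assumptions, not the paper's.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64),
    nn.ReLU(inplace=True), nn.MaxPool2d(2))
decoder = nn.Sequential(
    nn.Upsample(scale_factor=2), nn.Conv2d(64, 21, 3, padding=1))
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # one optimizer
criterion = nn.CrossEntropyLoss()                         # pixel-wise labels

images = torch.randn(2, 3, 64, 64)
labels = torch.randint(0, 21, (2, 64, 64))

optimizer.zero_grad()
loss = criterion(model(images), labels)  # gradients flow end to end,
loss.backward()                          # so both halves learn together
optimizer.step()
```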
Given an input image, the network begins training and propagates activations through the network to the top layers. The encoder applies convolution with a predetermined set of filter banks to produce feature maps and then performs batch normalisation. Afterwards, activations are computed by rectified linear units (ReLUs). Max-pooling is then performed with a 2x2 window and a stride of 2, which results in a two-fold subsampling of the image. Multiple pooling layers can increase translation invariance for effective classification tasks, but they unnecessarily reduce the spatial resolution of the feature maps.
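A single encoder stage of this kind can be sketched as follows; the channel counts and input size are illustrative assumptions:

```python
import torch
import torch.nn as nn

# One encoder stage as described: convolution with a filter bank, batch
# normalisation, ReLU, then 2x2 max-pooling (channel counts assumed).
stage = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),  # halves height and width
)

x = torch.randn(1, 3, 224, 224)
print(stage(x).shape)  # torch.Size([1, 64, 112, 112]): two-fold subsampling
```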
Therefore, prior to the sub-sampling step, the boundary information needs to be captured and stored in the encoder feature maps. However, it is not practical to store the entire set of encoder feature maps because of memory limitations. The best option is to keep only the max-pooling indices in storage: for each 2x2 pooling window, two bits are used to memorise the position of the maximum in each feature map. Compared with keeping the full feature maps on hand, this is a very efficient solution. With this approach, the encoder can store data much more efficiently, and fully connected layers can be dropped.
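A quick back-of-the-envelope calculation, assuming 32-bit float activations, shows why storing 2-bit pooling indices is far cheaper than retaining the feature maps themselves:

```python
# Rough memory comparison (assumed 32-bit float activations) between
# storing a full encoder feature map and storing only its 2-bit
# max-pooling indices, one index per 2x2 window per channel.
channels, height, width = 64, 112, 112

full_map_bits = channels * height * width * 32              # activations
index_bits = channels * (height // 2) * (width // 2) * 2    # 2 bits/window

print(full_map_bits / index_bits)  # 64.0 -> indices are ~64x cheaper
```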
The outcomes are contrasted with the specified benchmark results. The dataset is divided into training and testing sets: 90% is allotted for training and 10% for testing.
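The 90/10 split can be realised in a few lines of plain Python; the file names and the fixed seed below are placeholders:

```python
import random

# A plain-Python sketch of the 90/10 split; file names are placeholders.
image_files = [f"img_{i:06d}.jpg" for i in range(1000)]
random.seed(0)           # fixed seed for reproducibility
random.shuffle(image_files)

cut = int(0.9 * len(image_files))
train_files, test_files = image_files[:cut], image_files[cut:]
print(len(train_files), len(test_files))  # 900 100
```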
Many weights are zero because training models frequently use the ReLU activation function. In this work, it was found that after creating the sparsity model, the gradient vanished during training with ReLU6. This is a result of the mask excluding 50% of the weights from the gradient update. As indicated in Table 1, the model is evaluated on the public MS-COCO dataset and contrasted with earlier techniques. In this work, 5 k and 118 k images are used for testing and training the model, respectively. To verify that the suggested method works, the outcomes of each trial were examined. Average precision (AP) is typically computed for all classes, and its mean is known as the mean average precision (mAP). AP50 and AP75 denote the AP computed at overlap thresholds of 50% and 75%, respectively, against the ground truth. Figure 2 shows the multi-object detection results obtained with the model trained on the MS-COCO dataset.
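Since AP50 and AP75 hinge on the intersection-over-union (IoU) overlap criterion, a minimal IoU computation is sketched below; this is the standard definition rather than code from the paper:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct for AP50 when IoU >= 0.5 (0.75 for AP75).
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```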
Figure 2. Screenshot for multi-object detection of complex scenes using the proposed model trained on the MS-COCO dataset
For complex scenes, the proposed CNN-based encoder-decoder model achieved better detection performance. The detection results include various objects such as horse, potted plant, and person, as shown in Figure 3. For this detection, the floating-point operations (FLOPs) count is about 128.46, with a model size of 134.22 MB. Figure 3 illustrates the detection of multiple objects from sample complex images in the MS-COCO dataset using the proposed model.
Figure 3. Results for object detection of complex scenes using proposed model trained on MS-COCO dataset
The proposed model achieved a mAP of 54.1% at 327 FPS, as shown in Table 1. This investigation confirmed the model's real-time performance: on the MS-COCO dataset, the FPS value is 327, the mAP is 54.1%, the AP50 is 77.2%, and the AP75 is 69.3%. Table 2 presents the comparative results of the proposed model against existing approaches. Figure 4 shows the performance analysis of the single-shot detector (SSD), YOLOv3, EfficientDet, YOLOv4 tiny, RetinaNet, and the proposed CNN-based encoder-decoder model for object detection. Compared to all other models, the proposed model provides better results.
Table 2. Comparison of proposed CNN-based object detection model with existing algorithms

Model         Architecture    AP75 (%)   AP50 (%)   mAP (%)   FPS
SSD           VGG             30.3       48.5       28.8      36
YOLOv3        Darknet-53      34.3       58         33        66
EfficientDet  EfficientNet    35.8       52.2       33.8      16
YOLOv4 tiny   CSPNet-15       20         40         22        330
RetinaNet     ResNet101       36.8       53.1       34.4      11
Proposed      VGG-19          69.3       77.2       54.1      327
Figure 4. Performance comparison (AP75, AP50, and mAP, in %) of SSD, YOLOv3, EfficientDet, YOLOv4 tiny, RetinaNet, and the proposed model
The outcomes of every experiment are examined to confirm the efficiency of the proposed method. For assessment, AP is used, which corresponds to the area under the precision-recall curve. AP is usually computed for all classes, and its average is reported as the mAP. In addition, AP50 and AP75 denote the AP at overlap thresholds of 50% and 75%, respectively, in comparison with the ground truth. This study confirmed the model's suitability for real-time applications with good recognition accuracy.
4. CONCLUSION
We have noted that recent efforts on object detection using CNN-based encoder-decoder models have addressed salient object detection (SOD) as a pixel-level classification task. Experimental findings on the open-source MS-COCO 2017 dataset demonstrated that the proposed method is capable of good detection accuracy and fast execution. The objective of future work is to significantly enhance multiple-object detection for high-quality images without sacrificing prediction speed. The model also employs the distinctive technique of pooling indices, which uses fewer processing parameters and speeds up inference. With a mAP of 54.1% and 327 FPS, the suggested network model is highly suited for multiple-object detection. To sum up, the model's ease of training and the proposed method's low computational resource requirements are its key features. As a result, the suggested approach is practical for many real-time applications and offers a more economical alternative. Overall, the suggested method yields a system for cutting-edge autonomous driving that is more affordable and more effective.
FUNDING INFORMATION
Funding information is not available.
Name of Author C M So Va Fo I R D O E Vi Su P Fu
Mothiram Rajasekaran
Chitra Sabapathy Ranganathan
Nagarajan Mohankumar
Rajeshkumar Sampathrajan
Thayalagaran Merlin Inbamalar
Nageshvaran Nandhini
Shanmugam Sujatha
DATA AVAILABILITY
The data that support the findings of this study are available on request from the corresponding
author, [NM]. The data, which contain information that could compromise the privacy of research
participants, are not publicly available due to certain restrictions.
REFERENCES
[1] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: a survey,” Proceedings of the IEEE, vol. 111, no. 3,
pp. 257–276, 2023, doi: 10.1109/JPROC.2023.3238524.
[2] Z. Li et al., “Deep learning-based object detection techniques for remote sensing images: a survey,” Remote Sensing, vol. 14,
no. 10, 2022, doi: 10.3390/rs14102385.
[3] J. Jegan, M. R. Suguna, M. Shobana, H. Azath, S. Murugan, and M. Rajmohan, “IoT-enabled black box for driver behavior
analysis using cloud computing,” in 2024 International Conference on Advances in Data Engineering and Intelligent Computing
Systems (ADICS), 2024, pp. 1–6, doi: 10.1109/ADICS58448.2024.10533471.
[4] K. Muhammad, J. Ahmad, Z. Lv, P. Bellavista, P. Yang, and S. W. Baik, “Efficient deep CNN-based fire detection and
localization in video surveillance applications,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 49, no. 7,
pp. 1419–1434, 2019, doi: 10.1109/TSMC.2018.2830099.
[5] J.-M. Guo, J.-S. Yang, S. Seshathiri, and H.-W. Wu, “A light-weight CNN for object detection with sparse model and knowledge
distillation,” Electronics, vol. 11, no. 4, Feb. 2022, doi: 10.3390/electronics11040575.
[6] S. Srinivasan, R. Raja, C. Jehan, S. Murugan, C. Srinivasan, and M. Muthulekshmi, “IoT-enabled facial recognition for smart
hospitality for contactless guest services and identity verification,” in 2024 11th International Conference on Reliability, Infocom
Technologies and Optimization (Trends and Future Directions) (ICRITO), 2024, pp. 1–6, doi:
10.1109/ICRITO61523.2024.10522363.
[7] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv-Computer Science, pp. 1–6, 2018, doi:
10.48550/arXiv.1804.02767.
[8] Y. H. Wu, Y. Liu, X. Zhan, and M. M. Cheng, “P2T: pyramid pooling transformer for scene understanding,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 12760–12771, 2023, doi: 10.1109/TPAMI.2022.3202765.
[9] H. Jiang and E. Learned-Miller, “Face detection with the faster R-CNN,” in 2017 12th IEEE International Conference on
Automatic Face & Gesture Recognition (FG 2017), May 2017, pp. 650–657, doi: 10.1109/FG.2017.82.
[10] Z. Wang, J. Zhu, S. Fu, S. Mao, and Y. Ye, “RFPNet: Reorganizing feature pyramid networks for medical image segmentation,”
Computers in Biology and Medicine, vol. 163, 2023, doi: 10.1016/j.compbiomed.2023.107108.
[11] A. Tragakis, C. Kaul, R. Murray-Smith, and D. Husmeier, “The fully convolutional transformer for medical image segmentation,”
in 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 3649–3658, doi:
10.1109/WACV56688.2023.00365.
[12] J. Ramasamy, E. Srividhya, V. Vaidehi, S. Vimaladevi, N. Mohankumar, and S. Murugan, “Cloud-enabled isolation forest for
anomaly detection in UAV-based power line inspection,” in 2024 2nd International Conference on Networking and
Communications (ICNWC), 2024, pp. 1–6, doi: 10.1109/ICNWC60771.2024.10537407.
[13] D. Bai, X. Zheng, T. Liu, K. Li, and J. Yang, “Finger disability recognition based on holistically-nested edge detection,” in
Intelligent Robotics and Applications, 2022, pp. 146–154, doi: 10.1007/978-3-031-13844-7_15.
[14] M. R. Sudha et al., “Predictive modeling for healthcare worker well-being with cloud computing and machine learning for stress
management,” International Journal of Electrical and Computer Engineering, vol. 15, no. 1, pp. 1218–1228, 2025, doi:
10.11591/ijece.v15i1.pp1218-1228.
[15] Y. Xie, Y. Lu, and S. Wang, “RSANet: deep recurrent scale-aware network for crowd counting,” Proceedings - International
Conference on Image Processing, ICIP, pp. 1531–1535, 2020, doi: 10.1109/ICIP40778.2020.9191086.
[16] I. Filali, M. S. Allili, and N. Benblidia, “Multi-scale salient object detection using graph ranking and global–local saliency
refinement,” Signal Processing: Image Communication, vol. 47, pp. 380–401, 2016, doi: 10.1016/j.image.2016.07.007.
[17] Z. Wu, G. Allibert, F. Meriaudeau, C. Ma, and C. Demonceaux, “HiDAnet: RGB-D salient object detection via hierarchical depth
awareness,” IEEE Transactions on Image Processing, vol. 32, pp. 2160–2173, 2023, doi: 10.1109/TIP.2023.3263111.
[18] P. Maheswari, S. Gowriswari, S. Balasubramani, A. R. Babu, N. K. Jijith, and S. Murugan, “Intelligent headlights for adapting beam
patterns with raspberry pi and convolutional neural networks,” in 2024 2nd International Conference on Device Intelligence,
Computing and Communication Technologies (DICCT), 2024, pp. 182–187, doi: 10.1109/DICCT61038.2024.10533159.
[19] J. Hai, Y. Hao, F. Zou, F. Lin, and S. Han, “Advanced RetinexNet: A fully convolutional network for low-light image
enhancement,” Signal Processing: Image Communication, vol. 112, 2023, doi: 10.1016/j.image.2022.116916.
[20] D. Stavens and S. Thrun, “Unsupervised learning of invariant features using video,” Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, pp. 1649–1656, 2010, doi: 10.1109/CVPR.2010.5539773.
[21] C. C. Sekhar, K. Vijayalakshmi, A. S. Rao, V. Vedanarayanan, M. B. Sahaai, and S. Murugan, “Cloud-based water tank
management and control system,” in 2023 2nd International Conference on Smart Technologies for Smart Nation, SmartTechCon
2023, 2023, pp. 641–646, doi: 10.1109/SmartTechCon57526.2023.10391730.
[22] S. M. Hanif, L. Prevost, R. Belaroussi, and M. Milgram, “Real-time facial feature localization by combining space displacement
neural networks,” Pattern Recognition Letters, vol. 29, no. 8, pp. 1094–1104, 2008, doi: 10.1016/j.patrec.2007.09.016.
[23] B. J. Ganesh, P. Vijayan, V. Vaidehi, S. Murugan, R. Meenakshi, and M. Rajmohan, “SVM-based predictive modeling of
drowsiness in hospital staff for occupational safety solution via IoT infrastructure,” in 2024 2nd International Conference on
Computer, Communication and Control (IC4), 2024, pp. 1–5, doi: 10.1109/IC457434.2024.10486429.
[24] N. Liu, J. Han, and M. H. Yang, “PiCANet: pixel-wise contextual attention learning for accurate saliency detection,” IEEE
Transactions on Image Processing, vol. 29, pp. 6438–6451, 2020, doi: 10.1109/TIP.2020.2988568.
[25] A. R. Rathinam, B. S. Vathani, A. Komathi, J. Lenin, B. Bharathi, and S. Murugan, “Advances and predictions in predictive
auto-scaling and maintenance algorithms for cloud computing,” 2nd International Conference on Automation, Computing and
Renewable Systems, ICACRS 2023 - Proceedings, pp. 395–400, 2023, doi: 10.1109/ICACRS58579.2023.10404186.
[26] M. D. A. Hasan, K. Balasubadra, G. Vadivel, N. Arunfred, M. V. Ishwarya, and S. Murugan, “IoT-driven image recognition for
microplastic analysis in water systems using convolutional neural networks,” in 2024 2nd International Conference on Computer,
Communication and Control (IC4), 2024, pp. 1–6, doi: 10.1109/IC457434.2024.10486490.
[27] S. Selvarasu, K. Bashkaran, K. Radhika, S. Valarmathy, and S. Murugan, “IoT-enabled medication safety: real-time temperature
and storage monitoring for enhanced medication quality in hospitals,” 2nd International Conference on Automation, Computing
and Renewable Systems, ICACRS 2023 - Proceedings, pp. 256–261, 2023, doi: 10.1109/ICACRS58579.2023.10405212.
[28] K. Padmanaban, A. M. S. Kumar, H. Azath, A. K. Velmurugan, and M. Subbiah, “Hybrid data mining technique based breast
cancer prediction,” AIP Conference Proceedings, vol. 2523, 2023, doi: 10.1063/5.0110216.
[29] N. Mohankumar et al., “Advancing chronic pain relief cloud-based remote management with machine learning in healthcare,”
Indonesian Journal of Electrical Engineering and Computer Science, vol. 37, no. 2, pp. 1042–1052, 2025, doi:
10.11591/ijeecs.v37.i2.pp1042-1052.
[30] M. Vadivel, V. B. Marin, S. Balasubramani, S. Hemalatha, S. Murugan, and S. Velmurugan, “Cloud-based passenger experience
management in bus fare ticketing systems using random forest algorithm,” in 2024 11th International Conference on Reliability,
Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), 2024, pp. 1–6, doi:
10.1109/ICRITO61523.2024.10522226.
[31] M. P. Aarthi, C. M. Reddy, A. Anbarasi, N. Mohankumar, M. V. Ishwarya, and S. Murugan, “Cloud-based road safety for real-
time vehicle rash driving alerts with random forest algorithm,” in 2024 3rd International Conference for Innovation in
Technology (INOCON), 2024, pp. 1–6, doi: 10.1109/INOCON60754.2024.10511316.
[32] M. S. Kumar, H. Azath, A. K. Velmurugan, K. Padmanaban, and M. Subbiah, “Prediction of Alzheimer’s disease using hybrid
machine learning technique,” AIP Conference Proceedings, vol. 2523, 2023, doi: 10.1063/5.0110283.
[33] E. P. Kannan and T. V. Chithra, “Lagrange interpolation for natural colour image demosaicing,” International Journal of
Advances in Signal and Image Sciences, vol. 7, no. 2, pp. 21–30, 2021, doi: 10.29284/ijasis.7.2.2021.21-30.
BIOGRAPHIES OF AUTHORS
Dr. Nagarajan Mohankumar was born in India in 1978. He received his B.E. degree from Bharathiar University, Tamil Nadu, India in 2000 and his M.E. and Ph.D. degrees from Jadavpur University, Kolkata in 2004 and 2010, respectively. He joined the Nano Device Simulation Laboratory in 2007 and worked as a Senior Research Fellow under the CSIR direct scheme until September 2009. Later he joined SKP Engineering College as a Professor to develop research activities in the field of VLSI and nanotechnology. He is currently working as a Research Professor at Symbiosis Institute of Technology, Nagpur Campus, Symbiosis (International) Deemed University, Pune, India. He is a senior member of IEEE. He has about 85 publications in reputed international journals and about 50 international conference proceedings. He received the career award for young teachers (CAYT) from AICTE, New Delhi for 2012-2014. His research interests include modeling and simulation of HEMTs, optimization of devices for RF applications, characterization of advanced HEMT architectures, terahertz electronics, high frequency imaging, sensors, and communication. He can be contacted at email: [email protected].