Generating Super-resolution Images using
Computer Vision Approaches
Abiseban P. (184004X)
Faculty of Information Technology,
University of Moratuwa,
Moratuwa, Sri Lanka.
[email protected]

Abstract— Computer vision is one of the most exciting fields in
artificial intelligence, and image super-resolution (SR) is one of its
most widely studied problems. Image SR has proven to be very important
and has vital real-world applications. A variety of computer vision
approaches are used to generate super-resolution images, ranging from
the earliest Convolutional Neural Network (CNN) based methods to the
more recent Generative Adversarial Network (GAN) based approaches.
This study therefore explores how computer vision approaches are used
to generate super-resolution images and critically reviews the
different approaches.

Keywords— Image super-resolution (SR), Computer vision, deep
learning, neural networks.

I. INTRODUCTION

Image super-resolution (SR) is the process of generating visually
better high-resolution (HR) images from low-resolution (LR) images or
video sequences. The primary goal of an SR algorithm is to increase
the number of pixels per unit area in an LR image produced by a given
imaging device and to recover finer detail than the sampling grid of
that device captures. Many applications, especially in video
surveillance, UHD television, low-resolution face recognition, medical
imaging, astronomical imaging, and remote sensing, need to zoom into a
particular area of an image. In such situations, high resolution
becomes essential. Although high-resolution images are needed, the
imaging hardware required to capture them is often too expensive, so
they are not always attainable. In addition, capturing them may not
always be possible due to the inherent limitations of sensor and
optics manufacturing technology. Super-resolution can overcome these
problems because it is relatively inexpensive and can make use of
existing low-resolution imaging systems.

Benefiting from its broad application prospects, SR has aroused great
interest and is currently one of the most active research topics in
image processing and computer vision. Owing to the rapid development
of deep learning in recent years, SR methods based on deep learning
have shown good performance in some applications. The family of
super-resolution algorithms based on computer vision methods differs
in loss function types, network architectures, learning principles,
and strategies. Still, many demanding open topics remain regarding new
network architectures and objective functions for large-scale images,
deep images, and images with various types of corruption.

II. BACKGROUND

A. Traditional SR algorithms

It is necessary to know the limitations of classical non-AI
algorithms to understand the need for computer vision and artificial
intelligence approaches to increasing image resolution.

When the resolution of an image is increased, most of the new pixels
have no color value of their own in the original image, so they have
to be filled in somehow. The easiest way to increase the resolution is
to copy the color value of adjacent pixels. However, the result
carries no additional detail: it has the same effective resolution,
with each original pixel simply spread over several output pixels.
This algorithm is called Nearest Neighbor. Other methods are based on
interpolation. They use the color values of neighboring pixels to
estimate the values of the intermediate pixels, using more or fewer
samples to do so. The best known of them is bilinear interpolation.
None of these scaling algorithms achieves optimal quality, and they
never produce images as sharp as an image created directly at a higher
resolution.

B. Deep Learning

Machine learning is an essential branch of artificial intelligence,
and deep learning is one of its most important parts. The main
objective of deep learning is to extract high-level abstract features
from a dataset through multi-layer nonlinear transformations. In
addition, it aims to make reasonable predictions on new data by
learning the underlying distribution of the data. In 2006, due to the
development of computer hardware and artificial
intelligence, the concept of deep learning was proposed by Hinton et
al. Owing to its powerful fitting capabilities, deep learning has been
widely adopted in many fields, including image processing and computer
vision, where convolutional neural networks (CNNs) have made a
difference. Hence, more and more researchers have introduced deep
learning into super-resolution reconstruction.

In the field of image super-resolution reconstruction, deep learning
was first applied in 2014 by Dong et al. To learn the mapping between
low-resolution and high-resolution images, they designed a three-layer
CNN called the Super-Resolution Convolutional Neural Network (SRCNN).

SRCNN first upsamples the low-resolution image with an interpolation
method, so the resolution is increased before or at the first layer of
the network, and the convolutional neural network is then applied
directly to the upsampled image. Shi et al. argued that this
pre-upsampling hurts performance and proposed a new algorithm called
ESPCN (Efficient Sub-pixel Convolutional Neural Network). In this
algorithm, the LR image does not need to be upsampled before it is
sent to the network. Instead, a sub-pixel convolution layer performs
the enlargement, so ESPCN increases the resolution only at the last
layer of the network. This approach dramatically reduces the amount of
computation compared with SRCNN and improves reconstruction
efficiency.

It should be noted that both SRCNN and ESPCN use MSE as the objective
function to train the model. In 2017, Christian Ledig et al.
approached super-resolution reconstruction from the perspective of
perceptual quality, using an adversarial network (paper title:
Photo-Realistic Single Image Super-Resolution Using a Generative
Adversarial Network). Most deep learning super-resolution algorithms
at the time used the MSE loss function. The authors argued that this
makes the reconstructed image overly smooth, so that it does not look
photorealistic. Therefore, they switched to Generative Adversarial
Networks (GAN) and defined a new perceptual objective for
reconstruction. The generator is responsible for synthesizing
high-resolution images, while the discriminator is responsible for
identifying whether a given image is a real sample or one
reconstructed by the generator. Through this adversarial game, the
generator learns to reconstruct a high-resolution image from a given
low-resolution image. In the SRGAN paper, the authors also proposed a
comparison algorithm called SRResNet. SRResNet still uses MSE as the
final loss function. Unlike earlier methods, SRResNet uses a
sufficiently deep residual convolutional network. Compared with other
residual learning and reconstruction algorithms, SRResNet on its own
can also achieve better results.

III. SRCNN ALGORITHM

The earliest super-resolution reconstruction algorithm using deep
learning is the SRCNN algorithm. Its principle is straightforward. For
an input low-resolution image, SRCNN first uses bicubic interpolation
to enlarge it to the target size and then uses a three-layer CNN to
restore it. The network fits the nonlinear mapping between the
low-resolution image and the high-resolution image, and its output is
used as the reconstructed high-resolution image. Although the
principle is simple, it relies on a deep learning model trained on
extensive sample data. Its performance surpasses that of traditional
image processing algorithms, opening up deep learning research in the
field of super-resolution.

Fig.1 Sketch of the SRCNN architecture (source: [3])

As an early pioneering work, SRCNN also laid out the basic process for
tackling the super-resolution problem in subsequent work:

(1) Find a large number of image samples from real scenes.

(2) Downsample each image to reduce its resolution. Typical factors
are 2x, 3x, and 4x. The image before downsampling is used as the
high-resolution image H, and the image after downsampling is used as
the low-resolution image L. L and H form an image pair for later model
training.

(3) When training the model, the low-resolution image L is enlarged
and restored to a high-resolution image SR, which is then compared
with the original high-resolution image H. The difference is used to
adjust the model's parameters and is minimized through iterative
training.
(4) The trained model can be used to reconstruct a new low-resolution
image to obtain a high-resolution image.

From a practical point of view, SR reconstruction is divided into two
steps: image enlargement and image restoration. Super-resolution
reconstruction not only enlarges the image but also, in a sense,
restores it, which can reduce noise and blur in the image to a certain
extent.

Fig.2 Simplified SR reconstruction flow

IV. SUB-PIXEL CONVOLUTION

Sub-pixel convolution is an ingenious method for enlarging images and
feature maps, also called pixel shuffle. In deep learning
super-resolution reconstruction, common scaling methods include direct
upsampling, bilinear interpolation, deconvolution, and so on. The
ESPCN algorithm proposed a scaling method designed for
super-resolution, namely sub-pixel convolution, which has also been
applied in the SRResNet and SRGAN algorithms. Therefore, it is
necessary to introduce the principle and implementation of sub-pixel
convolution first.

A CNN generally enlarges a feature map with methods such as
deconvolution. Such methods usually introduce too many hand-crafted
factors, and sub-pixel convolution greatly reduces this risk. Because
the parameters used for sub-pixel convolutional enlargement are
learned, this sample-driven method enlarges more accurately than
manually specified methods.

Suppose the original image needs to be enlarged three times. The
network then needs to generate 3^2 = 9 feature maps of the same size
as the input, that is, the number of channels is expanded nine times
(this can be achieved through ordinary convolution operations). The
nine feature maps are then assembled into one large image magnified
three times, which is the sub-pixel convolution operation.

When implementing this, first expand the number of channels of the
original feature map through convolution. For a fourfold enlargement,
the number of channels must be expanded sixteen times. After the
feature map is convolved and rearranged in a specific pattern, a large
image is obtained; this rearrangement is the so-called pixel shuffle.
Through pixel shuffle, the number of feature channels is restored to
that of the original input, but each feature map becomes larger. Note
that the preceding convolution determines how each pixel is expanded,
and the parameters of this convolution are learned. Therefore,
compared with a manually designed enlargement method, this
learning-based enlargement can better fit the relationship between
pixels.

The SRResNet model also uses sub-pixel convolution to enlarge the
image. Specifically, two sub-pixel convolution modules are added to
the model. Each module magnifies its input two times, so the model as
a whole enlarges the image four times.

V. THE SRRESNET ALGORITHM

SRCNN uses only three convolutional layers to achieve
super-resolution reconstruction. Some literature points out that a
deeper network can reconstruct higher-quality images, because deeper
models can extract more advanced image features and therefore
represent the image better. After SRCNN, many researchers tried to
deepen the network structure in order to achieve better reconstruction
performance, but the deeper the model, the harder it is to make it
converge, and the desired result cannot be obtained. Some researchers
used transfer learning to increase the depth of the model gradually,
but this way of deepening is limited. Therefore, there was an urgent
need for an approach that makes it easy to build a deep network model.
This problem was not effectively solved until Kaiming He's team
proposed the ResNet network in 2015.

The residual neural network (ResNet) was originally designed for image
classification. It is now widely used in image segmentation, object
detection, and other fields. ResNet adds residual learning to the
traditional convolutional neural network, which addresses the problems
of vanishing gradients and degrading training-set accuracy in deep
networks, allowing networks to become deeper and deeper while
maintaining accuracy and keeping training speed under control.

The idea behind ResNet can be understood intuitively. In the past,
each layer of a neural network learned a mapping y = f(x). As the
number of layers grows, the error in the y produced by each function
gradually accumulates and becomes larger and larger, and the gradient
becomes increasingly unstable during backpropagation. At this point,
if the mapping of each layer is changed to y
= f(x) + x, that is, the original input is added at the end of each
layer. At this time, the input is x, and the output is f(x) + x. Then
f(x) naturally tends toward 0, or at least stays relatively small, so
that even as the number of layers continues to increase, the error
f(x) remains controlled at a small value, and the model as a whole is
unlikely to diverge during training.

Fig.3 Schematic diagram of the residual network.

The above figure is a schematic diagram of the residual network. A
line crosses the two-layer network directly (a skip connection),
bringing the original data x into the output. F(x) therefore only has
to predict a difference. With the robust network structure of residual
learning, a deep neural network for super-resolution reconstruction
can be constructed following the idea of SRCNN. The backbone of the
SRResNet algorithm uses this network structure, as shown in the
following figure:

Fig.4 SRResNet algorithm network structure.

The above model uses multiple deep residual modules for image feature
extraction and uses skip connections to connect the input to the
network output multiple times. This structure helps keep the entire
network stable. Because it is a deep model, it can mine image features
more effectively than a shallow model and can surpass shallow models
in performance (SRResNet uses 16 residual modules). Note that each
layer of the above model only changes the number of channels of the
image and does not change its size. In this sense, this network can be
regarded as the restoration model mentioned earlier.

VI. SRRESNET STRUCTURE ANALYSIS

SRResNet uses a deep residual network to construct a super-resolution
reconstruction model, which mainly consists of two parts: a deep
residual model and a sub-pixel convolution model. The deep residual
model is used for efficient feature extraction, which can reduce image
noise to a certain extent. The sub-pixel convolution model is mainly
used to enlarge the image size. The complete SRResNet network
structure is shown in the figure below:

Fig.5 SRResNet algorithm network structure.

In the above figure, k represents the size of the convolution kernel,
n represents the number of output channels, and s represents the
stride. In addition to the deep residual modules and the sub-pixel
convolution modules, a convolution module is added at the input and at
the output of the model for data adjustment and enhancement.

The SRResNet model uses MSE as the objective function, which is the
mean square error between the high-resolution image restored by the
model and the original high-resolution image. The formula is as
follows:

Fig.6 The formula.

MSE is also the objective function used by most current
super-resolution reconstruction algorithms. We will see later that an
image reconstructed with this objective function does not match the
human eye's subjective perception very well, and the SRGAN algorithm
is an improvement based on this observation.

VII. SRGAN ALGORITHM

The main inspiration for GAN comes from game theory. Applied to deep
learning, it constructs two models, a generation network G (Generator)
and a discriminant network D (Discriminator), which continually
compete against each other, so that G learns to generate realistic
images and D acquires a strong ability to judge the authenticity of an
image. The main functions of the two networks are as follows: G is a
generative network, which generates images through a specific network
structure and objective function. D is a discriminant network that
determines whether the input image is real or generated by G.

The role of G is to generate images that are as realistic as possible
in order to confuse D and make D's judgment fail. The
role of D is to explore the flaws of G as much as possible
to judge whether the image generated by G is fake and inferior. The
whole process is like a series of chess games between two novices. As
the number of games increases, one becomes more and more sophisticated
at confusing the opponent, and the other becomes more and more skilled
at discrimination. In the end, both novices become masters. At this
point, if G is allowed to play against other opponents, it is
conceivable that its ability to confuse has surpassed that of ordinary
players.

The above is the principle of the GAN algorithm as used in the image
field, for tasks such as style transfer, super-resolution
reconstruction, image completion, and denoising. Using a GAN can avoid
the difficulty of designing a loss function by hand. Compared with
many other generative models, GANs can produce sharper and more
realistic samples.

VIII. PERCEPTUAL LOSS

In order to prevent the reconstructed image from being excessively
smooth, SRGAN redefined the loss function and named it perceptual
loss. Perceptual loss is composed of two parts:

Perceptual loss = content loss + adversarial loss

Adversarial loss is the loss incurred when the discriminator correctly
identifies the reconstructed image as generated. This part is the same
as in the general GAN definition. One of the significant innovations
of SRGAN is the content loss. SRGAN aims to make the network pay more
attention, during learning, to the difference in semantic features
between the reconstructed image and the original image rather than to
pixel-by-pixel differences in color and brightness. In the past, the
difference between the super-resolution reconstructed image and the
original high-definition image was computed directly on the pixels
using the MSE criterion. The proponents of the SRGAN algorithm believe
that this approach makes the model focus excessively on these pixel
differences while ignoring the inherent characteristics of the
reconstructed image. The difference should instead be calculated on
the inherent characteristics of the image. But how can these inherent
characteristics be expressed? In fact, it is very simple. Many models
have been proposed specifically to extract inherent features of images
and then perform tasks such as classification. We only need to cut out
the feature extraction modules of such a model, use them to compute
the features of the reconstructed image and of the original image
(these are semantic features), and then compute the MSE between the
two images at the feature layer. Among many models, SRGAN selected the
VGG19 model, and the extracted model was named truncated_vgg19. Model
truncation here means extracting only a part of the original model and
then using it as a new, separate model.

The calculation of the content loss can now be summarized as follows:
● Reconstruct the high-definition image SR through the SRResNet
model;
● Pass the original high-definition image H and the reconstructed
high-definition image SR separately through the truncated_vgg19
model to obtain the corresponding feature maps H_fea and SR_fea;
● Calculate the MSE value of H_fea and SR_fea;

As the above steps show, the original method calculates the MSE value
of H and SR directly. After switching to the new content loss, the
only additional step is to run both images through the
truncated_vgg19 model to obtain their feature maps and then calculate
the MSE on those feature maps.

IX. SRGAN STRUCTURE ANALYSIS

SRGAN is divided into the generator model (Generator) and the
discriminator model (Discriminator). The generator model uses exactly
the same structure as SRResNet, except that the truncated VGG19 model
is needed to calculate the loss function. Note that the truncated
VGG19 model is only used to calculate image features and is not added
as a sub-module behind the generator. The VGG19 model here can be
understood as static (its weights are not updated by the gradient). It
is only used to calculate features, much like a fixed image filter
such as the Sobel or Canny operator.

The structure of the discriminator model is as follows:

Fig.7 SRGAN discriminator model.

X. CONCLUSION

This paper explores how computer vision approaches are used to
generate super-resolution images and compares the different
approaches used in the process. The algorithmic principles of the
Super-Resolution Convolutional Neural Network (SRCNN) and the
Super-Resolution Generative Adversarial Network (SRGAN) have been
presented briefly in this study.
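
As a compact illustration of the sub-pixel convolution step reviewed
in Section IV, the rearrangement of r^2 feature maps into one enlarged
map can be sketched in plain Python. This is a minimal sketch, not the
implementation used by ESPCN or SRResNet: the function name
pixel_shuffle and the channel ordering (channel index = row offset * r
+ column offset) are assumptions made for illustration, and the
learned convolution that produces the feature maps is omitted.

```python
def pixel_shuffle(feature_maps, r):
    """Rearrange r*r feature maps of size H x W into one (H*r) x (W*r) map.

    Assumed convention (hypothetical, for illustration): channel index
    c = dy * r + dx supplies the sub-pixel at row offset dy and column
    offset dx inside every r x r output cell.
    """
    if len(feature_maps) != r * r:
        raise ValueError("expected exactly r*r feature maps")
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    output = [[0] * (width * r) for _ in range(height * r)]
    for channel, fmap in enumerate(feature_maps):
        dy, dx = divmod(channel, r)  # sub-pixel offset of this channel
        for y in range(height):
            for x in range(width):
                output[y * r + dy][x * r + dx] = fmap[y][x]
    return output


# Enlarging by 3 requires 3^2 = 9 feature maps; map k is filled with k
# so that the interleaving pattern is easy to see in the result.
maps = [[[k, k], [k, k]] for k in range(9)]
upscaled = pixel_shuffle(maps, 3)  # a 6 x 6 map built from nine 2 x 2 maps
```

Each input position (y, x) becomes a 3 x 3 cell in the output, and the
nine channels fill the nine positions of that cell, which is why the
number of channels must grow by r^2 before the rearrangement.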
XI. ACKNOWLEDGMENT
In this section of my study, I would like to thank the
supervisor of Independent Studies, Ms. Kumarasinghe
K.M.S.J., for her advice and valuable guidance
throughout the study for this accomplishment.
XII. REFERENCES
[1] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken,
R. Bishop, D. Rueckert, and Z. Wang, “Real-time single
image and video superresolution using an efficient
sub-pixel convolutional neural network,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 1874–1883.
[2] M. S. Sajjadi, B. Schölkopf, and M. Hirsch,
“EnhanceNet: Single image super-resolution through
automated texture synthesis,” in Proceedings of the IEEE
International Conference on Computer Vision, 2017, pp.
4501–4510.
[3] W. Yang, X. Zhang, Y. Tian, W. Wang, J.-H. Xue, and
Q. Liao, “Deep Learning for Single Image
Super-resolution: A brief review,” IEEE Transactions on
Multimedia, vol. 21, no. 12, pp. 3106–3121, 2019.
[4] C. T. B. S. H. MIAP, “Deep Learning based super
resolution, without using a gan,” Medium, 25-Mar-2020.
[Online]. Available:
https://towardsdatascience.com/deep-learning-based-super-resolution-without-using-a-gan-11c9bb5b6cd5.
[Accessed: 31-Oct-2021].
[5] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. S.
Kweon, “Learning a deep convolutional network for
light-field image Super-Resolution,” 2015 IEEE
International Conference on Computer Vision Workshop
(ICCVW), 2015.