Module 2
Feature Detection & Extraction
Introduction to Feature Detection
The human brain does a lot of pattern recognition.
1. Feature Detection
2. Feature Description
3. Feature Matching
4. Feature Tracking
FEATURE DETECTORS
How can we find image locations where we
can reliably find correspondences with other
images?
What Points to Choose?
•Textureless patches are nearly impossible to localize.
•Patches with large contrast changes (gradients) are easier to localize.
•But straight line segments at a single orientation suffer from the
aperture problem, i.e., it is only possible to align the patches along the
direction normal to the edge direction.
•Gradients in at least two (significantly) different orientations are the
easiest, e.g., corners.
Finding interest points in an
image
Aperture Problem
Matching Criteria
•Consider shifting the window W by (u, v):
• How do the pixels in W change?
• Compare each pixel before and after by summing up the squared differences (SSD) or by using correlation (see the sketch below).
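As a concrete illustration, here is a minimal NumPy sketch of the SSD comparison (the window size and the function name are illustrative, and the window is assumed to lie away from the image border):

import numpy as np

def ssd_shift(image, x, y, u, v, half=7):
    # Sum of squared differences between the window W centered at (x, y)
    # and the same-size window shifted by (u, v).
    w1 = image[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)
    w2 = image[y + v - half:y + v + half + 1, x + u - half:x + u + half + 1].astype(np.float64)
    return np.sum((w1 - w2) ** 2)

For a textureless patch the SSD stays close to zero for every shift (hard to localize); for a corner it rises sharply for any shift, which is what the detectors below exploit.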
CORRELATION BETWEEN
IMAGES
f1 f2 f3       h1 h2 h3
f4 f5 f6       h4 h5 h6        f*h = f1·h1 + f2·h2 + … + f9·h9
f7 f8 f9       h7 h8 h9
TYPES OF CORRELATION
•CROSS-CORRELATION: between two different images
Point Detection
The response of a 3x3 mask with weights w1…w9 centered on a pixel is R = w1·z1 + w2·z2 + … + w9·z9, where z1…z9 are the image intensities under the mask. For isolated-point detection the mask is:

-1 -1 -1
-1  8 -1
-1 -1 -1

We say a point has been detected at the location on which the mask is centered only if

|R| > T

where T is a non-negative threshold. We take |R| because we want to detect both kinds of points, i.e., white points on a black background as well as black points on a white background.
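A minimal OpenCV sketch of this test (the image file name and the threshold are illustrative; note that cv2.filter2D computes correlation, which matches the slides' framing):

import cv2
import numpy as np

img = cv2.imread('input.png', cv2.IMREAD_GRAYSCALE).astype(np.float64)  # hypothetical file
point_mask = np.array([[-1, -1, -1],
                       [-1,  8, -1],
                       [-1, -1, -1]], dtype=np.float64)
R = cv2.filter2D(img, -1, point_mask)      # mask response at every pixel
T = 0.9 * np.abs(R).max()                  # illustrative threshold
points = np.abs(R) > T                     # |R| > T marks isolated points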
Line Detection
Example
Input image:

0  0  0 10  0  0  0  0
0  0  0 10  0  0  0  0
0  0  0 10  0  0  0  0
0  0  0 10  0  0  0  0
0 10 10 10 10 10 10  0
0  0  0 10  0  0  0  0
0  0  0 10  0  0  0  0
0  0  0 10  0  0  0  0

Result after applying the horizontal line-detection mask (only the strong responses are shown; they all lie on the row of the horizontal line, so thresholding |R| isolates it):

0  0  0  0  0  0  0  0
0  0  0  0  0  0  0  0
0  0  0  0  0  0  0  0
0 40 40 40 40 60 40  0
0  0  0  0  0  0  0  0
0  0  0  0  0  0  0  0
0  0  0  0  0  0  0  0
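The example can be reproduced with a few lines of OpenCV, assuming the horizontal line-detection mask [-1 -1 -1; 2 2 2; -1 -1 -1] and zero padding at the border:

import cv2
import numpy as np

img = np.array([[0, 0, 0, 10, 0, 0, 0, 0],
                [0, 0, 0, 10, 0, 0, 0, 0],
                [0, 0, 0, 10, 0, 0, 0, 0],
                [0, 0, 0, 10, 0, 0, 0, 0],
                [0, 10, 10, 10, 10, 10, 10, 0],
                [0, 0, 0, 10, 0, 0, 0, 0],
                [0, 0, 0, 10, 0, 0, 0, 0],
                [0, 0, 0, 10, 0, 0, 0, 0]], dtype=np.float64)

horizontal_mask = np.array([[-1, -1, -1],
                            [ 2,  2,  2],
                            [-1, -1, -1]], dtype=np.float64)

R = cv2.filter2D(img, -1, horizontal_mask, borderType=cv2.BORDER_CONSTANT)
# In the interior of the row containing the horizontal line, R takes the
# values 40, 40, 40, 40, 60, 40, so thresholding |R| isolates that line.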
EDGE DETECTION
•Edges are defined as “sudden and significant
changes in the intensity” in an image
Why detect edges?
•To understand the shape of an object in an image
•Edges are very useful for techniques like segmentation and object identification
How to Extract the Edges
From An Image?
Steps in Edge Detection
•Smoothing
•Enhancement
•Detection
•Localization
Prewitt Operator
The Prewitt operator is used for edge detection in an image. It detects two types of edges:
•Horizontal edges
•Vertical edges
All derivative masks should have the following properties:
•Opposite signs should be present in the mask.
•The sum of the mask coefficients should be equal to zero.
•More weight gives a stronger edge response.
Prewitt Masks
Fx:               Fy:
-1 -1 -1         -1  0  1
 0  0  0         -1  0  1
 1  1  1         -1  0  1
Example
1. Original Image
2. After applying vertical mask
3. After applying horizontal mask
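The Prewitt masks above can be applied with cv2.filter2D; a small sketch in which the input file name is hypothetical:

import cv2
import numpy as np

img = cv2.imread('input.png', cv2.IMREAD_GRAYSCALE).astype(np.float64)  # hypothetical file

fx = np.array([[-1, -1, -1],
               [ 0,  0,  0],
               [ 1,  1,  1]], dtype=np.float64)   # responds to horizontal edges
fy = np.array([[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]], dtype=np.float64)     # responds to vertical edges

gx = cv2.filter2D(img, -1, fx)
gy = cv2.filter2D(img, -1, fy)
edges = np.sqrt(gx ** 2 + gy ** 2)                # combined edge strength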
Sobel Operator
The Sobel operator is similar to the Prewitt operator, but it gives more weight to the pixels nearest the center of the mask:

Fx:               Fy:
-1 -2 -1         -1  0  1
 0  0  0         -2  0  2
 1  2  1         -1  0  1
Steps in Sobel Filter
•Let the gradient approximation in the x-direction be denoted as Gx.
•Let the gradient approximation in the y-direction be denoted as Gy.
•The gradient magnitude is then |G| = √(Gx² + Gy²) (see the snippet below).
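OpenCV provides the Sobel filter directly; a short sketch (file name and kernel size are illustrative):

import cv2
import numpy as np

img = cv2.imread('input.png', cv2.IMREAD_GRAYSCALE)      # hypothetical file
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)           # gradient approximation Gx
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)           # gradient approximation Gy
magnitude = np.sqrt(gx ** 2 + gy ** 2)                   # |G| = sqrt(Gx^2 + Gy^2)
magnitude = np.uint8(255 * magnitude / magnitude.max())  # scale to 0-255 for display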
Laplacian Operator
The standard 3x3 Laplacian mask is:

 0  1  0
 1 -4  1
 0  1  0
Derivation of the matrix
The 1st-order derivative is approximated by the finite difference
∂f/∂x ≈ [f(x + dx) − f(x)] / dx.
Substituting dx = dy = 1 gives
∂f/∂x ≈ f(x + 1) − f(x)  and  ∂f/∂y ≈ f(y + 1) − f(y).

Derivation of the matrix contd…
The 2nd-order derivative is given as
∂²f/∂x² ≈ f(x + 1) + f(x − 1) − 2f(x)  and  ∂²f/∂y² ≈ f(y + 1) + f(y − 1) − 2f(y).
Adding the two, and using the labels of the generic 3x3 mask below, the Laplacian becomes
∇²f = z2 + z4 + z6 + z8 − 4·z5.

GENERIC 3X3 MASK
       y−1   y    y+1
x−1    z1    z2   z3
x      z4    z5   z6
x+1    z7    z8   z9
Algorithm
•Read the image
•If the image is colored then convert it into
grayscale format.
•Define the Laplacian filter.
•Convolve the image with the filter.
•Display the binary edge-detected image (a sketch of these steps follows below).
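A minimal sketch of these steps in OpenCV (the file name and the threshold of 20 are illustrative):

import cv2
import numpy as np

img = cv2.imread('input.png')                            # hypothetical file
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)             # convert to grayscale
laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=np.float64)     # Laplacian filter
response = cv2.filter2D(gray.astype(np.float64), -1, laplacian)
edges = (np.abs(response) > 20).astype(np.uint8) * 255   # binary edge image
cv2.imwrite('edges.png', edges)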
Laplacian of Gaussian (LoG)
▪ The Laplacian mask produces a very strong response to noise pixels.
▪ Hence, some kind of noise cleaning is done prior to the application of the Laplacian operator.
▪ Usually, Gaussian smoothing is applied prior to the application of the Laplacian.
▪ This combination is called the Laplacian of Gaussian (LoG).
Derivation
The Laplacian of Gaussian, ∇²G(x, y), where G is a Gaussian of standard deviation σ, is approximated in the discrete domain using a 5x5 matrix:

 0  0 -1  0  0
 0 -1 -2 -1  0
-1 -2 16 -2 -1
 0 -1 -2 -1  0
 0  0 -1  0  0
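The 5x5 approximation above can be applied directly with cv2.filter2D; equivalently, a Gaussian blur followed by cv2.Laplacian has the same effect. A small sketch with an illustrative file name and threshold:

import cv2
import numpy as np

gray = cv2.imread('input.png', cv2.IMREAD_GRAYSCALE)     # hypothetical file
log_kernel = np.array([[ 0,  0, -1,  0,  0],
                       [ 0, -1, -2, -1,  0],
                       [-1, -2, 16, -2, -1],
                       [ 0, -1, -2, -1,  0],
                       [ 0,  0, -1,  0,  0]], dtype=np.float64)
response = cv2.filter2D(gray.astype(np.float64), -1, log_kernel)
edges = (np.abs(response) > 10).astype(np.uint8) * 255   # illustrative threshold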
Difference of Gaussian (DoG)
CANNY EDGE DETECTOR
1. Convert the input image to a grayscale image
2. Apply Gaussian Blur: used to reduce the noise in the image
3. Intensity Gradient Calculation: Sobel operator is applied to
obtain the gradient approximation and edge direction
4. Non-maximum Suppression: suppress pixels to zero which
cannot be considered an edge
5. Double Thresholding: classify the remaining pixels as strong or weak edges using a high and a low threshold
6. Edge Tracking by Hysteresis: keep weak edge pixels only if they are connected to strong ones
7. Final Cleansing: suppress the remaining weak edges that are not connected to strong edges (see the OpenCV sketch below)
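In OpenCV, steps 3-7 are bundled inside cv2.Canny; a minimal sketch with illustrative file name, blur size, and hysteresis thresholds:

import cv2

img = cv2.imread('input.png')                    # hypothetical file
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # step 1: grayscale
blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)    # step 2: Gaussian blur
edges = cv2.Canny(blurred, 50, 150)              # steps 3-7: gradients, non-maximum
                                                 # suppression, thresholds, hysteresis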
Hough Transform
Harris Corner Detection
•Corners are regions in the image with large variation in intensity in all directions.
• One early attempt to find these corners was done by Chris
Harris & Mike Stephens in their paper A Combined Corner and
Edge Detector in 1988
•It basically finds the difference in intensity for a displacement of (u, v) in all directions. This is expressed as:

E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²

where w(x, y) is a window function that weights the pixels underneath the window.
Harris Corner Detection contd…
•We have to maximize this function E(u,v) for corner detection. That means we have to maximize the second term.
•Applying a Taylor expansion to the above equation and using some mathematical steps (please refer to any standard textbook for the full derivation), we get the final equation as:

E(u, v) ≈ [u  v] M [u  v]ᵀ
•where

M = Σ_{x,y} w(x, y) [ Ix·Ix   Ix·Iy ]
                    [ Ix·Iy   Iy·Iy ]

Here, Ix and Iy are the image derivatives in the x and y directions respectively (they can be computed, e.g., with the Sobel operator).
Harris Corner Detection contd…
• After this, they created a score, basically an equation, which determines whether a window can contain a corner or not:

R = det(M) − k·(trace(M))²

•where (k is a small, empirically chosen constant, typically 0.04-0.06)
• det(M)=λ1λ2
• trace(M)=λ1+λ2
• λ1 and λ2 are the eigenvalues of M
Harris Corner Detection contd…
• So the magnitudes of these eigenvalues decide whether a
region is a corner, an edge, or flat.
•When |R| is small, which happens when λ1 and λ2 are small,
the region is flat.
•When R < 0, which happens when λ1 >> λ2 or vice versa, the region is an edge.
•When R is large, which happens when λ1 and λ2 are large and λ1 ≈ λ2, the region is a corner.
(Figure: eigenvalue interpretation. When λ1 and λ2 are both small, E is almost constant in all directions, i.e., the region is flat.)
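OpenCV implements this score in cv2.cornerHarris; a small sketch in the style of the OpenCV tutorial (file name and parameter values are illustrative):

import cv2
import numpy as np

img = cv2.imread('input.png')                                  # hypothetical file
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
# blockSize: neighbourhood window, ksize: Sobel aperture, k: Harris constant
R = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)
img[R > 0.01 * R.max()] = [0, 0, 255]                          # mark strong corners in red
cv2.imwrite('corners.png', img)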
Scale-invariant Feature Transform
(SIFT)
•SIFT is an image matching algorithm that identifies the key features from an image and is able to match these features to a new image of the same object.
•It helps locate the local features in an image, commonly known as the 'keypoints' of the image.
•These keypoints are scale and rotation invariant and can be used for various computer vision applications, like image matching, object detection, scene detection, etc.
Scale-invariant Feature Transform
(SIFT) contd…
•The major advantage of SIFT features, over edge features or HOG features, is that they are not affected by the size or orientation of the image.
Scale-invariant Feature Transform
(SIFT) contd…
•How are the keypoints identified? How do we ensure scale and rotation invariance?
• Constructing a Scale Space: To make sure that
features are scale-independent
• Keypoint Localisation: Identifying the suitable
features or keypoints
• Orientation Assignment: Ensure the keypoints are
rotation invariant
• Keypoint Descriptor: Assign a unique fingerprint
to each keypoint
Constructing a Scale Space
•Apply a Gaussian blur to the image to reduce noise: for every pixel, a new value is computed from its neighboring pixels.
• The texture and minor details are removed from the image and only the relevant information, like the shape and edges, remains.
Constructing a Scale Space
contd…
•Scale space is a collection of images having different scales, generated from
a single image.
•These blurred images are created for multiple scales.
•To create a new set of images of different scales, we will take the original image and reduce the scale by half. For each new image, we will create blurred versions as we saw above (see the sketch below).
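A minimal sketch of building such a scale space (the starting sigma of 1.6 and the multiplier sqrt(2) follow Lowe's paper; the counts of four octaves and five blur levels follow the text; all are adjustable):

import cv2

def build_scale_space(image, num_octaves=4, num_scales=5, sigma=1.6, k=2 ** 0.5):
    # Each octave halves the image; within an octave the blur sigma grows by a factor k.
    octaves = []
    current = image.copy()
    for _ in range(num_octaves):
        blurred = [cv2.GaussianBlur(current, (0, 0), sigmaX=sigma * (k ** i))
                   for i in range(num_scales)]
        octaves.append(blurred)
        current = cv2.resize(current, (current.shape[1] // 2, current.shape[0] // 2))
    return octaves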
Difference of Gaussian (DoG)
•Difference of Gaussian is a feature enhancement algorithm that involves subtracting one blurred version of an original image from another, less blurred version of the original.
•DoG creates another set of images, for each octave, by subtracting every image from the previous image at the same scale (see the sketch below).
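Continuing the sketch above, the DoG images can be obtained by subtracting consecutive blurred images within each octave (the images are converted to float so that negative differences are preserved):

import numpy as np

def difference_of_gaussians(octaves):
    # octaves: output of build_scale_space above (a list of lists of blurred images)
    dog_octaves = []
    for blurred in octaves:
        dogs = [blurred[i + 1].astype(np.float32) - blurred[i].astype(np.float32)
                for i in range(len(blurred) - 1)]
        dog_octaves.append(dogs)
    return dog_octaves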
Keypoint Localization
•Once the images have been created, the next step is to find the
important keypoints from the image that can be used for feature
matching.
The idea is to find the local maxima and minima for the
images. This part is divided into two steps:
1.Find the local maxima and minima
2.Remove low contrast keypoints (keypoint selection)
Local Maxima and Local
Minima
•Go through every pixel in the image and compare it with its neighboring pixels.
•Here, 'neighboring' means not only the eight surrounding pixels in the same image (in which the pixel lies) but also the nine pixels each in the previous and next image of the octave, i.e., 26 neighbors in total.
Local Maxima and Local
Minima
The pixel marked x is compared with the neighboring pixels (in green) and is
selected as a keypoint if it is the highest or lowest among the neighbors:
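A small sketch of this 26-neighbour comparison for a single pixel (it assumes (r, c) is not on the image border, and ties count as extrema here):

import numpy as np

def is_extremum(prev_dog, curr_dog, next_dog, r, c):
    # Compare the pixel at (r, c) of the current DoG image with its 8 neighbours
    # in the same image and the 9 pixels in the DoG images above and below.
    value = curr_dog[r, c]
    cube = np.stack([prev_dog[r - 1:r + 2, c - 1:c + 2],
                     curr_dog[r - 1:r + 2, c - 1:c + 2],
                     next_dog[r - 1:r + 2, c - 1:c + 2]])
    return value >= cube.max() or value <= cube.min()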
Overview
• A beginner-friendly introduction to the powerful SIFT (Scale Invariant Feature Transform) technique
• Learn how to perform Feature Matching using SIFT
• We also showcase SIFT in Python through hands-on coding
Introduction
Take a look at the below collection of images and think of the common element between them:
The resplendent Eiffel Tower, of course! The keen-eyed among you will also have noticed that each image
has a different background, is captured from different angles, and also has different objects in the
foreground (in some cases).
I’m sure all of this took you a fraction of a second to figure out. It doesn’t matter if the image is rotated at a
weird angle or zoomed in to show only half of the Tower. This is primarily because you have seen the
images of the Eiffel Tower multiple times and your memory easily recalls its features. We naturally
understand that the scale or angle of the image may change but the object remains the same.
But machines have an almighty struggle with the same idea. It’s a challenge for them to identify the object
in an image if we change certain things (like the angle or the scale). Here’s the good news – machines are
super flexible and we can teach them to identify images at an almost human-level.
So, in this article, we will talk about an image matching algorithm that identifies the key features from the
images and is able to match these features to a new image of the same object. Let’s get rolling!
Table of Contents
1. Introduction to SIFT
2. Constructing a Scale Space
1. Gaussian Blur
2. Difference of Gaussian
3. Keypoint Localization
1. Local Maxima/Minima
2. Keypoint Selection
4. Orientation Assignment
1. Calculate Magnitude & Orientation
2. Create Histogram of Magnitude & Orientation
5. Keypoint Descriptor
6. Feature Matching
Introduction to SIFT
SIFT, or Scale Invariant Feature Transform, is a feature detection algorithm in Computer Vision.
SIFT helps locate the local features in an image, commonly known as the ‘keypoints‘ of the image. These
keypoints are scale & rotation invariant that can be used for various computer vision applications, like
image matching, object detection, scene detection, etc.
We can also use the keypoints generated using SIFT as features for the image during model training. The
major advantage of SIFT features, over edge features or hog features, is that they are not affected by the
size or orientation of the image.
For example, here is another image of the Eiffel Tower along with its smaller version. The keypoints of the
object in the first image are matched with the keypoints found in the second image. The same goes for two
images when the object in the other image is slightly rotated. Amazing, right?
Let’s understand how these keypoints are identified and what techniques are used to ensure scale and rotation invariance. Broadly speaking, the entire process can be divided into 4 parts:
1. Constructing a Scale Space: to make sure that features are scale-independent
2. Keypoint Localisation: identifying the suitable features or keypoints
3. Orientation Assignment: ensuring the keypoints are rotation invariant
4. Keypoint Descriptor: assigning a unique fingerprint to each keypoint
This article is based on the original paper by David G. Lowe. Here is the link: Distinctive Image Features from
Scale-Invariant Keypoints.
Constructing a Scale Space
We need to identify the most distinct features in a given image while ignoring any noise. Additionally, we
need to ensure that the features are not scale-dependent. These are critical concepts so let’s talk about
them one-by-one.
So, for every pixel in an image, the Gaussian Blur calculates a value based on its neighboring pixels. Below
is an example of image before and after applying the Gaussian Blur. As you can see, the texture and minor
details are removed from the image and only the relevant information like the shape and edges remain:
Gaussian Blur successfully removed the noise from the images and we have highlighted the important
features of the image. Now, we need to ensure that these features must not be scale-dependent. This
means we will be searching for these features on multiple scales, by creating a ‘scale space’.
Scale space is a collection of images having different scales, generated from a single image.
Hence, these blur images are created for multiple scales. To create a new set of images of different scales,
we will take the original image and reduce the scale by half. For each new image, we will create blur
versions as we saw above.
Here is an example to understand it in a better manner. We have the original image of size (275, 183) and a
scaled image of dimension (138, 92). For both the images, two blur images are created:
You might be thinking – how many times do we need to scale the image and how many subsequent blur
images need to be created for each scaled image? The ideal number of octaves should be four, and for
each octave, the number of blur images should be five.
Difference of Gaussian
So far we have created images of multiple scales (often represented by σ) and used Gaussian blur for each
of them to reduce the noise in the image. Next, we will try to enhance the features using a technique called
Difference of Gaussians or DoG.
Difference of Gaussian is a feature enhancement algorithm that involves the subtraction of one blurred version of an original image
from another, less blurred version of the original.
DoG creates another set of images, for each octave, by subtracting every image from the previous image in
the same scale. Here is a visual explanation of how DoG is implemented:
Note: The image is taken from the original paper. The octaves are now represented in a vertical form for a
clearer view.
Let us create the DoG for the images in scale space. Take a look at the below diagram. On the left, we have
5 images, all from the first octave (thus having the same scale). Each subsequent image is created by
applying the Gaussian blur over the previous image.
On the right, we have four images generated by subtracting the consecutive Gaussians. The results are
jaw-dropping!
We have enhanced features for each of these images. Note that here I am implementing it only for the first
octave but the same process happens for all the octaves.
Now that we have a new set of images, we are going to use this to find the important keypoints.
Keypoint Localization
Once the images have been created, the next step is to find the important keypoints from the image that
can be used for feature matching. The idea is to find the local maxima and minima for the images. This
part is divided into two steps:
1. Find the local maxima and minima
2. Remove low contrast keypoints (keypoint selection)

Local Maxima/Minima
To find the local maxima and minima, we go through every pixel in the image and compare it with its neighboring pixels. When I say ‘neighboring’, this not only includes the surrounding pixels of that image (in which the pixel lies), but also the nine pixels for the previous and next image in the octave.
This means that every pixel value is compared with 26 other pixel values to find whether it is the local
maxima/minima. For example, in the below diagram, we have three images from the first octave. The pixel
marked x is compared with the neighboring pixels (in green) and is selected as a keypoint if it is the
highest or lowest among the neighbors:
We now have potential keypoints that represent the images and are scale-invariant. We will apply the last
check over the selected keypoints to ensure that these are the most accurate keypoints to represent the
image.
Keypoint Selection
Kudos! So far we have successfully generated scale-invariant keypoints. But some of these keypoints may
not be robust to noise. This is why we need to perform a final check to make sure that we have the most
accurate keypoints to represent the image features.
Hence, we will eliminate the keypoints that have low contrast, or lie very close to the edge.
To deal with the low contrast keypoints, a second-order Taylor expansion is computed for each keypoint. If
the resulting value is less than 0.03 (in magnitude), we reject the keypoint.
So what do we do about the remaining keypoints? Well, we perform a check to identify the poorly located
keypoints. These are the keypoints that are close to the edge and have a high edge response but may not
be robust to a small amount of noise. A second-order Hessian matrix is used to identify such keypoints.
You can go through the math behind this here.
Now that we have performed both the contrast test and the edge test to reject the unstable keypoints, we
will now assign an orientation value for each keypoint to make the rotation invariant.
Orientation Assignment
At this stage, we have a set of stable keypoints for the images. We will now assign an orientation to each
of these keypoints so that they are invariant to rotation. We can again divide this step into two smaller
steps:
1. Calculate the magnitude and orientation
2. Create a histogram for magnitude and orientation
Let’s say we want to find the magnitude and orientation for the pixel value in red. For this, we will calculate
the gradients in x and y directions by taking the difference between 55 & 46 and 56 & 42. This comes out
to be Gx = 9 and Gy = 14 respectively.
Once we have the gradients, we can find the magnitude and orientation using the following formulas:

Magnitude = √(Gx² + Gy²) = √(9² + 14²) ≈ 16.64
Orientation (θ) = arctan(Gy / Gx) = arctan(14 / 9) ≈ 57°

The magnitude represents the intensity of the pixel and the orientation gives the direction for the same.
We can now create a histogram given that we have these magnitude and orientation values for the pixels.
On the x-axis, we will have bins for angle values, like 0-9, 10-19, 20-29, up to 360. Since our angle value is 57, it will fall in the 6th bin. The 6th bin value will be in proportion to the magnitude of the pixel, i.e. 16.64.
We will do this for all the pixels around the keypoint.
This histogram would peak at some bin. The bin at which we see the peak gives the orientation of the keypoint. Additionally, if there is another significant peak (with a value between 80-100% of the highest peak), then another keypoint is generated with the magnitude and scale the same as the keypoint used to generate the histogram, and its angle or orientation will be equal to the bin that has that secondary peak.
Effectively at this point, we can say that there can be a small increase in the number of keypoints.
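A small NumPy sketch of the magnitude/orientation computation and the 36-bin, magnitude-weighted histogram, reusing the worked example (Gx = 9, Gy = 14); the function name is illustrative:

import numpy as np

def orientation_histogram(magnitudes, orientations, num_bins=36):
    # orientations in degrees [0, 360); each pixel votes into its 10-degree bin,
    # weighted by its gradient magnitude.
    hist = np.zeros(num_bins)
    bins = (orientations // (360 / num_bins)).astype(int) % num_bins
    np.add.at(hist, bins, magnitudes)
    return hist

gx, gy = 9.0, 14.0
magnitude = np.sqrt(gx ** 2 + gy ** 2)          # ~16.64
orientation = np.degrees(np.arctan2(gy, gx))    # ~57.3 degrees, i.e. the 6th bin
hist = orientation_histogram(np.array([magnitude]), np.array([orientation]))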
Keypoint Descriptor
This is the final step for SIFT. So far, we have stable keypoints that are scale-invariant and rotation
invariant. In this section, we will use the neighboring pixels, their orientations, and magnitude, to generate
a unique fingerprint for this keypoint called a ‘descriptor’.
Additionally, since we use the surrounding pixels, the descriptors will be partially invariant to illumination
or brightness of the images.
We will first take a 16×16 neighborhood around the keypoint. This 16×16 block is further divided into sixteen 4×4 sub-blocks and, for each of these sub-blocks, we generate a histogram using magnitude and orientation. At this stage, the bin size is increased and we take only 8 bins (not 36). Each of the arrows in the visualization represents one of the 8 bins, and the length of an arrow represents the magnitude. So, we will have a total of 16 × 8 = 128 bin values for every keypoint.
Here is an example:
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# reading the image
img1 = cv2.imread('eiffel_2.jpeg')
gray1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)

# keypoints
sift = cv2.xfeatures2d.SIFT_create()  # in OpenCV >= 4.4 use cv2.SIFT_create()
keypoints_1, descriptors_1 = sift.detectAndCompute(img1, None)

img_1 = cv2.drawKeypoints(gray1, keypoints_1, img1)
plt.imshow(img_1)
Feature Matching
We will now use the SIFT features for feature matching. For this purpose, I have downloaded two images of
the Eiffel Tower, taken from different positions. You can try it with any two images that you want.
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# read images
img1 = cv2.imread('eiffel_2.jpeg')
img2 = cv2.imread('eiffel_1.jpg')

img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

figure, ax = plt.subplots(1, 2, figsize=(16, 8))

ax[0].imshow(img1, cmap='gray')
ax[1].imshow(img2, cmap='gray')
Now, for both these images, we are going to generate the SIFT features. First, we have to construct a SIFT
object and then use the function detectAndCompute to get the keypoints. It will return two values – the
keypoints and the descriptors.
Let’s determine the keypoints and print the total number of keypoints found in each image:
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# read images
img1 = cv2.imread('eiffel_2.jpeg')
img2 = cv2.imread('eiffel_1.jpg')

img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

# sift
sift = cv2.xfeatures2d.SIFT_create()  # in OpenCV >= 4.4 use cv2.SIFT_create()

keypoints_1, descriptors_1 = sift.detectAndCompute(img1, None)
keypoints_2, descriptors_2 = sift.detectAndCompute(img2, None)

len(keypoints_1), len(keypoints_2)

Output: (283, 540)
Next, let’s try and match the features from image 1 with features from image 2. We will be using the
function match() from the BFmatcher (brute force match) module. Also, we will draw lines between the
features that match in both the images. This can be done using the drawMatches function in OpenCV.
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# read images
img1 = cv2.imread('eiffel_2.jpeg')
img2 = cv2.imread('eiffel_1.jpg')

img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

# sift
sift = cv2.xfeatures2d.SIFT_create()  # in OpenCV >= 4.4 use cv2.SIFT_create()

keypoints_1, descriptors_1 = sift.detectAndCompute(img1, None)
keypoints_2, descriptors_2 = sift.detectAndCompute(img2, None)

# feature matching
bf = cv2.BFMatcher(cv2.NORM_L1, crossCheck=True)

matches = bf.match(descriptors_1, descriptors_2)
matches = sorted(matches, key=lambda x: x.distance)

img3 = cv2.drawMatches(img1, keypoints_1, img2, keypoints_2, matches[:50], img2, flags=2)
plt.imshow(img3), plt.show()
I have plotted only 50 matches here for clarity’s sake. You can increase the number according to what you
prefer. To find out how many keypoints are matched, we can print the length of the variable matches. In
this case, the answer would be 190.
End Notes
In this article, we discussed the SIFT feature matching algorithm in detail. Here is a site that provides
excellent visualization for each step of SIFT. You can add your own image and it will create the keypoints
for that image as well. Check it out here.
Another popular feature matching algorithm is SURF (Speeded Up Robust Feature), which is simply a faster
version of SIFT. I would encourage you to go ahead and explore it as well.
Aishwarya Singh
Overview of the RANSAC Algorithm
Konstantinos G. Derpanis
Version 1.2
Algorithm 1 RANSAC
1: Randomly select the minimum number of points required to determine the model parameters.
2: Solve for the parameters of the model.
3: Determine how many points from the set of all points fit with a predefined tolerance ε.
4: If the fraction of the number of inliers over the total number of points in the set exceeds a predefined threshold τ, re-estimate the model parameters using all the identified inliers and terminate.
5: Otherwise, repeat steps 1 through 4 (maximum of N times).
The number of iterations, N, is chosen high enough to ensure that, with probability p (usually set to 0.99), at least one of the sets of random samples does not include an outlier. Let u represent the probability that any selected data point is an inlier and v = 1 − u the probability of observing an outlier. N iterations of the minimum number of points denoted m are required, where

1 − p = (1 − u^m)^N    (1)

so that

N = log(1 − p) / log(1 − (1 − v)^m)    (2)
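Equation (2) is easy to evaluate in code; a small sketch (the values p = 0.99, 50% outliers, and m = 2 points for a line are illustrative):

import math

def ransac_iterations(p=0.99, outlier_ratio=0.5, m=2):
    # N = log(1 - p) / log(1 - (1 - v)^m), rounded up to a whole number of iterations
    v = outlier_ratio
    return math.ceil(math.log(1 - p) / math.log(1 - (1 - v) ** m))

print(ransac_iterations())   # about 17 iterations for a line with 50% outliers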
For more details on the basic RANSAC formulation, see [1, 2]. Extensions of RANSAC include using a Maximum Likelihood framework [4] and importance sampling [3].
References
[1] M.A. Fischler and R.C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[2] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2001.
[4] P. Torr and A. Zisserman. MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78(1):138–156, 2000.