ML_Lec 3- Review of Linear Algebra
Vector and matrix notation
– A d-dimensional (column) vector 𝑥 and its transpose are written as:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \qquad \text{and} \qquad x^T = \begin{bmatrix} x_1 & x_2 & \dots & x_d \end{bmatrix}$$
– An 𝑛 × 𝑑 (rectangular) matrix and its transpose are written as
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \dots & a_{1d} \\ a_{21} & a_{22} & a_{23} & \dots & a_{2d} \\ \vdots & & & \ddots & \\ a_{n1} & a_{n2} & a_{n3} & \dots & a_{nd} \end{bmatrix} \qquad \text{and} \qquad A^T = \begin{bmatrix} a_{11} & a_{21} & \dots & a_{n1} \\ a_{12} & a_{22} & \dots & a_{n2} \\ a_{13} & a_{23} & \dots & a_{n3} \\ \vdots & & \ddots & \\ a_{1d} & a_{2d} & \dots & a_{nd} \end{bmatrix}$$
– The product of two matrices is
$$AB = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \dots & a_{1d} \\ a_{21} & a_{22} & a_{23} & \dots & a_{2d} \\ \vdots & & & \ddots & \\ a_{m1} & a_{m2} & a_{m3} & \dots & a_{md} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} & \dots & b_{1n} \\ b_{21} & b_{22} & \dots & b_{2n} \\ b_{31} & b_{32} & \dots & b_{3n} \\ \vdots & & \ddots & \\ b_{d1} & b_{d2} & \dots & b_{dn} \end{bmatrix} = \begin{bmatrix} c_{11} & c_{12} & c_{13} & \dots & c_{1n} \\ c_{21} & c_{22} & c_{23} & \dots & c_{2n} \\ c_{31} & c_{32} & c_{33} & \dots & c_{3n} \\ \vdots & & & \ddots & \\ c_{m1} & c_{m2} & c_{m3} & \dots & c_{mn} \end{bmatrix}$$
where $c_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}$
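As a quick illustration, a minimal NumPy sketch of the product rule above (the matrix sizes and the checked entry are arbitrary choices):

    import numpy as np

    A = np.random.rand(3, 4)     # an m x d matrix (m = 3, d = 4)
    B = np.random.rand(4, 5)     # a  d x n matrix (n = 5)
    C = A @ B                    # the m x n product

    # c_ij = sum over k of a_ik * b_kj, checked here for a single entry
    i, j = 1, 2
    assert np.isclose(C[i, j], sum(A[i, k] * B[k, j] for k in range(4)))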
Vectors
– The inner product (a.k.a. dot product or scalar product) of two vectors is
defined by
$$\langle x, y \rangle = x^T y = y^T x = \sum_{k=1}^{d} x_k y_k$$
– The magnitude of a vector is
$$\|x\| = \sqrt{x^T x} = \left( \sum_{k=1}^{d} x_k x_k \right)^{1/2}$$
– The orthogonal projection of vector $y$ onto vector $x$ is $(y^T u_x)\, u_x$
• where vector 𝑢𝑥 has unit magnitude and
the same direction as 𝑥
– The angle between vectors $x$ and $y$ satisfies
$$\cos\theta = \frac{\langle x, y \rangle}{\|x\|\,\|y\|}$$
[Figure: projection of y onto x, showing the unit vector u_x and the projection length y^T u_x]
– Two vectors $x$ and $y$ are said to be
• orthogonal if $x^T y = 0$
• orthonormal if $x^T y = 0$ and $\|x\| = \|y\| = 1$
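The definitions above map directly onto NumPy operations; a minimal sketch with illustrative example vectors:

    import numpy as np

    x = np.array([3.0, 0.0, 4.0])
    y = np.array([1.0, 2.0, 2.0])

    inner = x @ y                                  # <x, y> = x^T y = y^T x
    mag_x = np.sqrt(x @ x)                         # ||x||, same as np.linalg.norm(x)
    u_x = x / mag_x                                # unit vector in the direction of x
    proj = (y @ u_x) * u_x                         # orthogonal projection of y onto x
    cos_theta = inner / (np.linalg.norm(x) * np.linalg.norm(y))
    are_orthogonal = np.isclose(inner, 0.0)        # True only if x^T y = 0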
– A set of vectors $x_1, x_2, \dots, x_n$ is said to be linearly dependent if there exists a
set of coefficients $a_1, a_2, \dots, a_n$ (at least one different from zero) such that
$$a_1 x_1 + a_2 x_2 + \dots + a_n x_n = 0$$
– Conversely, the vectors are said to be linearly independent if
$$a_1 x_1 + a_2 x_2 + \dots + a_n x_n = 0 \;\Rightarrow\; a_k = 0 \;\;\forall k$$
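A practical way to test independence is to compare the rank of the matrix whose columns are the vectors with the number of vectors; a small NumPy sketch with illustrative vectors:

    import numpy as np

    # Columns are the vectors x_1, x_2, x_3 (here x_3 = x_1 + x_2, so the set is dependent)
    X = np.column_stack(([1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [1.0, 1.0, 0.0]))

    independent = np.linalg.matrix_rank(X) == X.shape[1]
    print(independent)     # False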
Matrices
– The determinant of a square matrix $A_{d \times d}$ is
$$|A| = \sum_{k=1}^{d} a_{ik}\, A_{ik}\, (-1)^{k+i}$$
• where 𝐴𝑖𝑘 is the minor formed by removing the ith row and the kth column of 𝐴
• NOTE: a square matrix and its transpose have the same determinant: $|A| = |A^T|$
– The trace of a square matrix $A_{d \times d}$ is the sum of its diagonal elements
$$tr(A) = \sum_{k=1}^{d} a_{kk}$$
– The rank of a matrix is the number of linearly independent rows (or columns)
– A square matrix is said to be non-singular if and only if its rank equals the
number of rows (or columns)
• A non-singular matrix has a non-zero determinant
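A brief NumPy sketch of the determinant, trace, and rank for an arbitrary example matrix:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])

    det_A  = np.linalg.det(A)              # determinant |A|
    tr_A   = np.trace(A)                   # trace: sum of the diagonal elements
    rank_A = np.linalg.matrix_rank(A)      # number of linearly independent rows/columns

    assert np.isclose(det_A, np.linalg.det(A.T))                 # |A| = |A^T|
    assert rank_A == A.shape[0] and not np.isclose(det_A, 0.0)   # non-singular: full rank, |A| != 0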
– A square matrix is said to be orthonormal if 𝐴𝐴𝑇 = 𝐴𝑇 𝐴 = 𝐼
– For a square matrix A
• if $x^T A x > 0 \;\;\forall x \neq 0$, then $A$ is said to be positive-definite (e.g., a covariance matrix)
• if $x^T A x \geq 0 \;\;\forall x \neq 0$, then $A$ is said to be positive-semidefinite
– The inverse of a square matrix 𝐴 is denoted by 𝐴−1 and is such that
𝐴𝐴−1 = 𝐴−1 𝐴 = 𝐼
• The inverse 𝐴−1 of a matrix 𝐴 exists if and only if 𝐴 is non-singular
– The pseudo-inverse matrix 𝐴† is typically used whenever 𝐴−1 does not exist
(because 𝐴 is not square or 𝐴 is singular)
$$A^\dagger = (A^T A)^{-1} A^T \quad \text{with} \quad A^\dagger A = I \;\; \text{(assuming $A^T A$ is non-singular)}$$
• Note that A𝐴† ≠ 𝐼 in general
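A sketch comparing the closed-form pseudo-inverse above with NumPy's np.linalg.pinv; the example matrix is random and assumed to have full column rank:

    import numpy as np

    A = np.random.rand(5, 3)                      # rectangular, so A^{-1} does not exist

    A_dag = np.linalg.inv(A.T @ A) @ A.T          # (A^T A)^{-1} A^T
    assert np.allclose(A_dag @ A, np.eye(3))      # A† A = I
    # A @ A_dag is generally NOT the identity (it projects onto the column space of A)

    assert np.allclose(A_dag, np.linalg.pinv(A))  # agrees with NumPy's pseudo-inverse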
Vector spaces
– The n-dimensional space in which all the n-dimensional vectors reside is called
a vector space
– A set of vectors {𝑢1, 𝑢2, … 𝑢𝑛 } is said to form a basis for a vector space if any
arbitrary vector x can be represented by a linear combination of the 𝑢𝑖
$$x = a_1 u_1 + a_2 u_2 + \dots + a_n u_n$$
• The basis vectors $u_i$ must be linearly independent
[Figure: a vector expressed as a linear combination of basis vectors u_1, u_2, u_3]
– A basis $\{u_i\}$ is said to be orthogonal if
$$u_i^T u_j \begin{cases} \neq 0 & i = j \\ = 0 & i \neq j \end{cases}$$
– A basis $\{u_i\}$ is said to be orthonormal if
$$u_i^T u_j \begin{cases} = 1 & i = j \\ = 0 & i \neq j \end{cases}$$
• As an example, the Cartesian coordinate base is an orthonormal base
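For an orthonormal basis the expansion coefficients are simply $a_i = u_i^T x$; a minimal sketch using the Cartesian basis as the example:

    import numpy as np

    U = np.eye(3)                    # columns u_1, u_2, u_3: the Cartesian orthonormal basis
    x = np.array([2.0, -1.0, 0.5])

    a = U.T @ x                      # coefficients a_i = u_i^T x
    x_reconstructed = U @ a          # x = a_1 u_1 + a_2 u_2 + a_3 u_3
    assert np.allclose(x_reconstructed, x)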
– Given n linearly independent vectors {𝑥1, 𝑥2, … 𝑥𝑛}, we can construct an
orthonormal base 𝜙1 , 𝜙2 , … 𝜙𝑛 for the vector space spanned by {𝑥𝑖 } with
the Gram-Schmidt orthonormalization procedure (to be discussed in the RBF
lecture)
– The distance between two points in a vector space is defined as the
magnitude of the vector difference between the points
$$d_E(x, y) = \|x - y\| = \left( \sum_{k=1}^{d} (x_k - y_k)^2 \right)^{1/2}$$
• This is also called the Euclidean distance
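Equivalently, in NumPy the Euclidean distance is the norm of the difference vector (example values are arbitrary):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 6.0, 3.0])
    d_E = np.linalg.norm(x - y)      # sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = 5.0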
The Gram-Schmidt orthogonalization process
Step 1: $v_1 = (1, -1, 1)$
[Worked example: the procedure yields an orthonormal set with normalization factors $1/\sqrt{3}$, $1/\sqrt{6}$, and $1/\sqrt{2}$]
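A minimal NumPy sketch of the classical Gram-Schmidt procedure illustrated above; the second and third input vectors are assumptions chosen so the result reproduces the $1/\sqrt{3}$, $1/\sqrt{6}$, $1/\sqrt{2}$ normalization factors:

    import numpy as np

    def gram_schmidt(vectors):
        """Return an orthonormal basis for the span of `vectors` (classical Gram-Schmidt)."""
        basis = []
        for v in vectors:
            w = v - sum((v @ q) * q for q in basis)    # remove components along previous directions
            if not np.isclose(np.linalg.norm(w), 0.0):
                basis.append(w / np.linalg.norm(w))    # normalize to unit magnitude
        return np.array(basis)

    V = np.array([[1.0, -1.0, 1.0],     # v1 from Step 1; the remaining two vectors are illustrative
                  [1.0,  0.0, 1.0],
                  [1.0,  1.0, 2.0]])
    Q = gram_schmidt(V)                                # rows: (1,-1,1)/sqrt(3), (1,2,1)/sqrt(6), (-1,0,1)/sqrt(2)
    assert np.allclose(Q @ Q.T, np.eye(3))             # the rows form an orthonormal set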
Linear transformations
– A linear transformation is a mapping from a vector space $X^N$ to a vector
space $Y^M$, and is represented by a matrix
• Given a vector $x \in X^N$, the corresponding vector $y \in Y^M$ is computed as
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1N} \\ a_{21} & a_{22} & \dots & a_{2N} \\ \vdots & & \ddots & \\ a_{M1} & a_{M2} & \dots & a_{MN} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$$
• Notice that the dimensionality of the two spaces does not need to be the same
• For pattern recognition we typically have 𝑀 < 𝑁 (project onto a lower-dim space)
– A linear transformation represented by a square matrix A is said to be
orthonormal when 𝐴𝐴𝑇 = 𝐴𝑇 𝐴 = 𝐼
• This implies that 𝐴𝑇 = 𝐴−1
• An orthonormal xform has the property of preserving the magnitude of the vectors
$$\|y\| = \sqrt{y^T y} = \sqrt{(Ax)^T (Ax)} = \sqrt{x^T A^T A x} = \sqrt{x^T x} = \|x\|$$
• An orthonormal matrix can be thought of as a rotation of the reference frame
• The row vectors of an orthonormal xform are a set of orthonormal basis vectors
$$Y_{M \times 1} = \begin{bmatrix} \leftarrow a_1 \rightarrow \\ \leftarrow a_2 \rightarrow \\ \vdots \\ \leftarrow a_N \rightarrow \end{bmatrix} X_{N \times 1} \quad \text{with} \quad a_i^T a_j = \begin{cases} 0 & i \neq j \\ 1 & i = j \end{cases}$$
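A quick numerical check that an orthonormal transformation preserves magnitude; the 2-D rotation matrix is just an illustrative choice:

    import numpy as np

    theta = 0.7
    A = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])    # 2-D rotation: A A^T = A^T A = I

    x = np.array([3.0, 4.0])
    y = A @ x
    assert np.allclose(A.T @ A, np.eye(2))                       # orthonormal, so A^T = A^{-1}
    assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))      # ||y|| = ||x||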
Eigenvectors and eigenvalues
– Given a matrix $A_{N \times N}$, we say that $v$ is an eigenvector if there exists a scalar $\lambda$
(the eigenvalue) such that
𝐴𝑣 = 𝜆𝑣
– Computing the eigenvalues: the eigenvalues are the roots of the characteristic equation
$$|A - \lambda I| = 0$$
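In practice the eigenvalues are computed numerically rather than by expanding the characteristic polynomial by hand; a minimal NumPy sketch with an arbitrary symmetric example:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    eigenvalues, V = np.linalg.eig(A)                 # columns of V are the eigenvectors
    for lam, v in zip(eigenvalues, V.T):
        assert np.allclose(A @ v, lam * v)            # A v = lambda v
        # each eigenvalue is a root of the characteristic equation |A - lambda I| = 0
        assert np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0)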
– The matrix formed by the column eigenvectors is called the modal matrix M
• Matrix Λ is the canonical form of A: a diagonal matrix with eigenvalues on the main
diagonal
$$M = \begin{bmatrix} \uparrow & \uparrow & & \uparrow \\ v_1 & v_2 & \dots & v_N \\ \downarrow & \downarrow & & \downarrow \end{bmatrix} \qquad \Lambda = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_N \end{bmatrix}$$
– Properties
• If A is non-singular, all eigenvalues are non-zero
• If A is real and symmetric, all eigenvalues are real
– The eigenvectors associated with distinct eigenvalues are orthogonal
• If A is positive definite, all eigenvalues are positive
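A minimal sketch verifying the modal-matrix relation $AM = M\Lambda$ (equivalently $M^{-1}AM = \Lambda$) and the properties above, for an illustrative real symmetric positive-definite matrix:

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [1.0, 3.0]])                        # real, symmetric, positive definite

    eigenvalues, M = np.linalg.eig(A)                 # M: modal matrix of column eigenvectors
    Lam = np.diag(eigenvalues)                        # canonical form: diagonal matrix of eigenvalues

    assert np.allclose(A @ M, M @ Lam)                # A M = M Lambda
    assert np.allclose(np.linalg.inv(M) @ A @ M, Lam) # M^{-1} A M = Lambda
    assert np.all(eigenvalues > 0)                    # positive definite => all eigenvalues positive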
Interpretation of eigenvectors and eigenvalues
– If we view matrix 𝐴 as a linear transformation, an eigenvector represents an
invariant direction in vector space
• When transformed by 𝐴, any point lying on the direction defined by 𝑣 will remain
on that direction, and its magnitude will be multiplied by 𝜆
[Figure: a point P lying on the eigenvector direction v is mapped by y = Ax to a point P' on the same direction]
• For example, the transform that rotates 3-d vectors about the 𝑍 axis has vector
[0 0 1] as its only eigenvector and 𝜆 = 1 as its eigenvalue
$$A = \begin{bmatrix} \cos\beta & -\sin\beta & 0 \\ \sin\beta & \cos\beta & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad v = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}^T$$
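A numerical check of this example for an arbitrary angle $\beta$ (NumPy also returns the two complex eigenvalues $e^{\pm i\beta}$, which do not correspond to real invariant directions):

    import numpy as np

    beta = 0.5
    A = np.array([[np.cos(beta), -np.sin(beta), 0.0],
                  [np.sin(beta),  np.cos(beta), 0.0],
                  [0.0,           0.0,          1.0]])   # rotation about the Z axis

    v = np.array([0.0, 0.0, 1.0])
    assert np.allclose(A @ v, 1.0 * v)    # v = [0 0 1]^T is an eigenvector with eigenvalue 1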
– Given the covariance matrix Σ of a Gaussian distribution
• The eigenvectors of Σ are the principal directions of the distribution
• The eigenvalues are the variances of the corresponding principal directions
– The linear transformation defined by the eigenvectors of Σ leads to vectors
that are uncorrelated regardless of the form of the distribution
• If the distribution happens to be Gaussian, then the transformed vectors will be
statistically independent
$$\Sigma M = M \Lambda \quad \text{with} \quad M = \begin{bmatrix} \uparrow & \uparrow & & \uparrow \\ v_1 & v_2 & \dots & v_N \\ \downarrow & \downarrow & & \downarrow \end{bmatrix} \qquad \Lambda = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_N \end{bmatrix}$$
$$f_x(x) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) \qquad f_y(y) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\lambda_i}} \exp\left( -\frac{(y_i - \mu_{y_i})^2}{2\lambda_i} \right)$$
[Figure: the transformation y = M^T x rotates the coordinate axes onto the principal directions v_1, v_2, yielding uncorrelated components y_1, y_2]
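A sketch of the decorrelating transform $y = M^T x$ on synthetic Gaussian data; the covariance values and sample size are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    Sigma = np.array([[3.0, 1.5],
                      [1.5, 2.0]])                                   # example covariance matrix
    X = rng.multivariate_normal([0.0, 0.0], Sigma, size=100_000)     # rows are samples x

    eigenvalues, M = np.linalg.eigh(np.cov(X.T))      # eigenvectors of the sample covariance
    Y = X @ M                                         # applies y = M^T x to every sample (row)

    print(np.round(np.cov(Y.T), 2))                   # approximately diagonal
    print(np.round(eigenvalues, 2))                   # diagonal entries: variances along v_1, v_2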