VMnotes Ox
MT 2017
Contents
1 Vector spaces and vectors
1.1 Vectors in R^n
1.2 Vector spaces
1.3 Linear combinations, linear independence
1.4 Basis and dimension
5 Determinants
5.1 Definition of a determinant
5.2 Properties of the determinant and calculation
5.3 Cramer's Rule
6 Scalar products
6.1 Real and hermitian scalar products
6.2 Orthonormal basis, Gram-Schmidt procedure
6.3 Adjoint linear map
6.4 Orthogonal and unitary maps
6.5 Dual vector space
Foreword: The subject of “Vectors and Matrices”, more politely called Linear Algebra, is one of the basic
disciplines of mathematics. It underlies many branches of more advanced mathematics, such as calculus
of functions in many variables and differential geometry, and it has applications in practically all parts
of physics. There are numerous textbooks on the subject ranging in style from low-level “how-to-do”
guides, mainly teaching the mechanics of manipulating low-dimensional vectors and matrices, to hugely
formalized treatises which barely ever write down a vector or a matrix explicitly. Naturally, a course for
beginning physics students should stay away from either extreme.
In the present text we will follow the inherent logic of the subject, in line with how it is taught in
research universities across the world. This will require some of the language of formal mathematics and
the occasional proof but we will keep this as light as possible. We attempt to illustrate the material
with many examples, both from physics and other areas, and teach the practical methods and algorithms
required in the day-to-day work of a physicist.
Hopefully, a student will finish the course with a good working knowledge of “Vectors and Matrices”
but also with an appreciation of the structure and beauty of the subject of Linear Algebra.
I would like to thank Kira Boehm, Daniel Karandikar and Doyeong Kim for substantial help with the
typesetting of these notes.
Andre Lukas
Oxford, 2013
Notation
R the real numbers
C the complex numbers
F a field, usually either the real or the complex numbers
V, W, U vector spaces
Rn the vector space of n-dimensional column vectors with real entries
Cn the vector space of n-dimensional column vectors with complex entries
v, w, · · · boldface lowercase letters are used for vectors
0 the zero vector
i, j, k, · · · indices to label vector components, usually in the range 1, . . . , n
vi , wi , · · · components of column vectors v, w, · · ·
ei the standard unit vectors in Rn
i, j, k another notation for the standard unit vectors e1 , e2 , e3 in R3
α, β, a, b, · · · lowercase letters are used for scalars
A, B, · · · uppercase letters are used for matrices
Aij entry (i, j) of a matrix A
Ai column vector i of a matrix A
Ai row vector i of a matrix A
AT , A† the transpose and hermitian conjugate of the matrix A
(v1 , . . . , vn ) a matrix with column vectors v1 , . . . , vn
1n the n × n identity matrix
Eij the standard matrices with (i, j) entry 1 and zero otherwise
diag(a1 , . . . , an ) an n × n diagonal matrix with diagonal entries a1 , . . . , an
Span(v1 , . . . , vk ) the span of the vectors v1 , . . . , vk
dimF (V ) the dimension of the vector space V over F
v·w the dot product between two n-dimensional column vectors
|v| the length of a vector
∠(v, w) the angle between two vectors v and w
v×w the cross (vector) product of two three-dimensional column vectors
⟨v, w, u⟩ the triple product of three column vectors in three dimensions
δij the Kronecker delta symbol in n dimensions
ε_{ijk} the Levi-Civita tensor in three dimensions
ε_{i_1···i_n} the Levi-Civita tensor in n dimensions
f a linear map, unless stated otherwise
idV the identity map on V
Im(f ) the image of a linear map f
Ker(f ) the kernel of a linear map f
rk(f ) the rank of a linear map f
[A, B] the commutator of two matrices A, B
(A|b) the augmented matrix for a system of linear equations Ax = b
det(v1 , . . . , vn ) the determinant of the column vectors v1 , . . . , vn
det(A) the determinant of the matrix A
Sn the permutations of 1, . . . , n
sgn(σ) the sign of a permutation σ
⟨·, ·⟩ a real or hermitian scalar product (or a bi-linear form)
f† the adjoint linear map of f
EigA (λ) the eigenspace of A for λ
χA (λ) the characteristic polynomial of A as a function of λ
tr(A) the trace of the matrix A
1 Vector spaces and vectors
Linear algebra is foundational for mathematics and has applications in many parts of physics, including
Classical Mechanics, Electromagnetism, Quantum Mechanics, General Relativity etc.
We would like to develop the subject, explaining both its mathematical structure and some of its physics
applications. In this section, we introduce the “arena” for Linear Algebra: vector spaces. Vector spaces
come in many disguises, sometimes containing objects which do not at all look like ”vectors”. Surprisingly,
many of these “unexpected” vector spaces play a role in physics, particularly in quantum physics. After
a brief review of “traditional” vectors we will, therefore, introduce the main ideas in some generality.
1.1 Vectors in Rn
The set of real numbers is denoted by R and by Rn we mean the set of all column vectors
\[
\mathbf{v} = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}, \tag{1.1}
\]
where v1 , . . . , vn ∈ R are the components of v. We will often use index notation to refer to a vector and
write the components collectively as vi , where the index i takes the values i = 1, . . . , n. There are two
basic operations for vectors, namely vector addition and scalar multiplication and for column vectors they
are defined in the obvious way, that is “component by component”. For the vector addition of two vectors
v and w this means
\[
\mathbf{v} = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}, \qquad
\mathbf{w} = \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix}, \qquad
\mathbf{v} + \mathbf{w} := \begin{pmatrix} v_1 + w_1 \\ \vdots \\ v_n + w_n \end{pmatrix}, \tag{1.2}
\]
where the vector sum v + w has the geometrical interpretation indicated in Fig. 1.
[Figure 1: geometrical interpretation of the vector sum v + w.]
The scalar multiplication of a column vector v with a scalar α ∈ R is defined as
\[
\alpha\mathbf{v} := \begin{pmatrix} \alpha v_1 \\ \vdots \\ \alpha v_n \end{pmatrix}, \tag{1.3}
\]
and the geometrical interpretation is indicated in Fig. 2.
[Figure 2: geometrical interpretation of the scalar multiple αv.]
These two so-defined operations satisfy a number of obvious rules. The vector addition is associative,
(u + v) + w = u + (v + w), it is commutative, u + v = v + u, there is a neutral element, the zero vector
\[
\mathbf{0} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}, \tag{1.7}
\]
which satisfies v + 0 = v and, finally, for each vector v there is an inverse −v, so that v + (−v) = 0.
The scalar multiplication satisfies three further rules, namely the distributive laws α(v + w) = αv + αw,
(α + β)v = αv + βv and the associativity law (αβ)v = α(βv). These rules can be easily verified from the
above definitions of vector addition and scalar multiplication and we will come back to this shortly. It is
useful to introduce the standard unit vectors
\[
\mathbf{e}_i = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \leftarrow i\text{th position}, \qquad \text{for } i = 1, \ldots, n, \tag{1.8}
\]
in Rn which are the n vectors obtained by setting the ith component to one and all other components to
zero. In terms of the standard unit vectors a vector v with components vi can be written as
\[
\mathbf{v} = v_1 \mathbf{e}_1 + \cdots + v_n \mathbf{e}_n = \sum_{i=1}^{n} v_i \mathbf{e}_i. \tag{1.9}
\]
Here we have used some of the above general rules, notably the associativity and commutativity of vector
addition as well as the associativity and distributivity of scalar multiplication.
In a physics context, the case n = 3 is particularly important. In this case, sometimes the notation
\[
\mathbf{i} = \mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \qquad
\mathbf{j} = \mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \qquad
\mathbf{k} = \mathbf{e}_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \tag{1.11}
\]
for the three standard unit vectors is used, so that a vector r with components x, y, z can be expressed
as
\[
\mathbf{r} = \begin{pmatrix} x \\ y \\ z \end{pmatrix} = x\mathbf{i} + y\mathbf{j} + z\mathbf{k}. \tag{1.12}
\]
Example 1.2: Vector addition and scalar multiplication with standard unit vectors
With the standard unit vectors i, j and k in R3 the vectors v and w from Eq. (1.4) can also be written
as
v = i − 2j + 5k , w = −4i + j − 3k . (1.13)
With this notation, the vector addition of v and w can be carried out as
v + w = (i − 2j + 5k) + (−4i + j − 3k) = −3i − j + 2k . (1.14)
For the scalar multiple of v by α = 3 we have
αv = 3(i − 2j + 5k) = 3i − 6j + 15k . (1.15)
While the case n = 3 is important for the description of physical space, other values are just as relevant
in physics. For example, a system of k mass points moving in three-dimensional space can be described
by a vector with n = 3k components. In Special Relativity, space and time are combined into a vector
with four components. And finally, in Quantum Mechanics, the quantum states of physical systems are
described by vectors which, depending on the system, can be basically any size. For this reason, we will
keep n general, whenever possible.
You are probably not yet sufficiently familiar with the above physics examples. Let me discuss an
example in a more familiar context which illustrates the need for vectors with an arbitrary number of
components and also indicates some of the problems linear algebra should address.
Note that these are n equations (one for each page rank xk ) and that the sum on the RHS runs over
all pages j which link to page k. Eqs. (1.16) constitute a system of n linear equations for the variables
x1 , . . . , xn (while the numbers of links, nj , are given constants). Perhaps this is best explained by focusing
on a simple example.
Consider a very simple internet with four sites, so n = 4, and a structure of links as indicated in Fig. 3.
From the figure, it is clear that the number of links on each site (equal to the number of outgoing arrows
Figure 3: Example of a simple “internet” with four sites. An arrow from site j to site k indicates that
site k is linked by site j.
from each site) is given by n1 = 3, n2 = 2, n3 = 1, n4 = 2 while the links themselves are specified by
L1 = {3, 4}, L2 = {1}, L3 = {1, 2, 4}, L4 = {1, 2}. To be clear, L1 = {3, 4} means that site 1 is linked
to by sites 3 and 4. With this data, it is straightforward to specialize the general equations (1.16) to the
example and to obtain the following equations
\begin{align*}
x_1 &= \frac{x_3}{1} + \frac{x_4}{2} \\
x_2 &= \frac{x_1}{3} \\
x_3 &= \frac{x_1}{3} + \frac{x_2}{2} + \frac{x_4}{2} \tag{1.17}\\
x_4 &= \frac{x_1}{3} + \frac{x_2}{2}
\end{align*}
for the ranks of the four pages. Clearly, this is a system of linear equations. Later, we will formulate such
systems using vector/matrix notation. In the present case, we can, for example, introduce
\[
\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}, \qquad
A = \begin{pmatrix} 0 & 0 & 1 & 1/2 \\ 1/3 & 0 & 0 & 0 \\ 1/3 & 1/2 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix} \tag{1.18}
\]
and re-write the Eqs. (1.17) as Ax = x. This equation describes a so-called “eigenvalue problem”, a class of
problems we will discuss in detail towards the end of the course. Of course, we will also properly introduce
matrices shortly. At this stage, the main point to note is that linear systems of equations are relevant
in “everyday” problems and that we need to understand their structure and develop efficient solution
methods. The four equations (1.17) for our explicit example can, of course, be solved by elementary
methods (adding and subtracting equations and their multiples), resulting in
\[
\mathbf{x} = \alpha \begin{pmatrix} 2 \\ 2/3 \\ 3/2 \\ 1 \end{pmatrix}, \tag{1.19}
\]
where α is an arbitrary real number. Hence, site 1 is the highest-ranked one. In reality, the internet has
an enormous number, n, of sites. Real applications of the page rank algorithm therefore involve very large
systems of linear equations and vectors and matrices of corresponding size. Clearly, solving such systems
will require more refined methods and a better understanding of their structure. Much of the course will
be devoted to this task.
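For readers who like to experiment, here is a minimal numerical sketch (not part of the original notes) which checks, with numpy, that the vector in Eq. (1.19) solves the eigenvalue problem Ax = x of Eq. (1.18), and which also finds the ranks by simply applying A repeatedly ("power iteration"); the matrix and solution vector are taken directly from the text.

```python
import numpy as np

# Link matrix A of the four-site example, Eq. (1.18): entry A[k, j] = 1/n_j
# if site j links to site k, and 0 otherwise.
A = np.array([
    [0,   0,   1,   1/2],
    [1/3, 0,   0,   0  ],
    [1/3, 1/2, 0,   1/2],
    [1/3, 1/2, 0,   0  ],
])

# Candidate rank vector from Eq. (1.19) (with alpha = 1).
x = np.array([2, 2/3, 3/2, 1])

# Check that A x = x, i.e. x solves the eigenvalue problem with eigenvalue 1.
print(np.allclose(A @ x, x))   # True

# Alternatively, apply A repeatedly to a starting vector; the direction
# converges to the rank vector for this example.
v = np.ones(4)
for _ in range(200):
    v = A @ v
    v /= np.linalg.norm(v)
print(v / v[3])   # approximately [2, 2/3, 3/2, 1]
```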
There exist “unusual” fields which satisfy all the requirements listed in Appendix A. These include fields
with a finite number of elements and here we introduce the simplest such examples, the finite fields
Fp = {0, 1, . . . , p − 1}, where p is a prime number. Addition and multiplication of two numbers a, b ∈ Fp
are defined by
a + b := (a + b) mod p , a · b = (ab) mod p . (1.20)
Here, the addition and multiplication on the right-hand sides of these definitions are just the usual ones
for integers. The modulus operation, a mod p, denotes the remainder of the division of a by p. In
other words, the definitions (1.20) are just the usual ones for addition and multiplication except that the
modulus operation brings the result back into the required range {0, 1, . . . , p − 1} whenever it exceeds
p − 1. Although these fields might seem abstract, they have important applications, for example in
numerical linear algebra. They allow calculations based on a finite set of integers which avoids numerical
uncertainties (as would arise for real numbers) and overflows (as may arise if all integers are used).
The smallest example of a field in this class is F2 = {0, 1}. Since every field must contain 0 (the
neutral element of addition) and 1 (the neutral element of multiplication), F2 is the smallest non-trivial
field. From the definitions (1.20) its addition and multiplication tables are
\[
\begin{array}{c|cc} + & 0 & 1 \\ \hline 0 & 0 & 1 \\ 1 & 1 & 0 \end{array}
\qquad\qquad
\begin{array}{c|cc} \cdot & 0 & 1 \\ \hline 0 & 0 & 0 \\ 1 & 0 & 1 \end{array} \tag{1.21}
\]
Note that, taking into account the mod 2 operation, in this field we have 1 + 1 = 0. Since the elements of
F2 can be viewed as the two states of a bit, this field has important applications in computer science and
in coding theory, to which we will return later.
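A small illustrative sketch (not from the notes) of the field operations (1.20): the helper functions below implement addition and multiplication mod p, reproduce the F_2 tables of Eq. (1.21), and exhibit multiplicative inverses in F_5. The function names are of course just chosen here for the example.

```python
# Arithmetic in the finite field F_p of Eq. (1.20); p must be prime for F_p to be a field.

def f_add(a, b, p):
    """Addition in F_p: ordinary integer addition followed by mod p."""
    return (a + b) % p

def f_mul(a, b, p):
    """Multiplication in F_p: ordinary integer product followed by mod p."""
    return (a * b) % p

# The F_2 tables of Eq. (1.21):
p = 2
print([[f_add(a, b, p) for b in range(p)] for a in range(p)])  # [[0, 1], [1, 0]]
print([[f_mul(a, b, p) for b in range(p)] for a in range(p)])  # [[0, 0], [0, 1]]
print(f_add(1, 1, 2))  # 0, i.e. 1 + 1 = 0 in F_2

# Every non-zero element of F_p has a multiplicative inverse, e.g. in F_5:
p = 5
for a in range(1, p):
    inv = next(b for b in range(1, p) if f_mul(a, b, p) == 1)
    print(a, inv)   # 1 1, 2 3, 3 2, 4 4
```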
Definition 1.1. A vector space V over a field F (= R, C or any other field) is a set with two operations:
i) vector addition: (v, w) ↦ v + w ∈ V , where v, w ∈ V
ii) scalar multiplication: (α, v) ↦ αv ∈ V , where α ∈ F and v ∈ V .
For all u, v, w ∈ V and all α, β ∈ F , these operations have to satisfy the following rules:
(V1) (u + v) + w = u + (v + w) "associativity"
(V2) There exists a "zero vector" 0 ∈ V so that 0 + v = v "neutral element"
(V3) There exists an inverse −v with v + (−v) = 0 "inverse element"
(V4) v + w = w + v "commutativity"
(V5) α(v + w) = αv + αw
(V6) (α + β)v = αv + βv
(V7) (αβ)v = α(βv)
(V8) 1 · v = v
The elements v ∈ V are called “vectors”, the elements α ∈ F of the field are called “scalars”.
In short, a vector space defines an environment which allows for addition and scalar multiplication of
vectors, subject to a certain number of rules. Note that the above definition does not assume anything
about the nature of vectors. In particular, it is not assumed that they are made up from components.
Let us draw a few simple conclusions from these axioms to illustrate that indeed all the “usual” rules
for calculations with vectors can be deduced.
i) −(−v) = v
This follows from (V3) and (V4) which imply v + (−v) = 0 and −(−v) + (−v) = 0, respectively.
Combining these two equations gives v + (−v) = −(−v) + (−v) and then adding v to both sides,
together with (V1) and (V3), leads to v = −(−v).
ii) 0 · v = 0
Since 0v = (0 + 0)v = 0v + 0v, using (V6), and 0v = 0v + 0, it follows that 0v = 0.
iii) (−1)v = −v
Since 0 = 0v = (1 + (−1))v = v + (−1)v, using (ii), (V6) and (V8), and 0 = v + (−v), it follows that (−1)v = −v.
We now follow the standard path and define the “sub-structure” associated to vector spaces, the sub vector
spaces. These are basically vector spaces in their own right but they are contained in larger vector spaces.
The formal definition is as follows.
Definition 1.2. A sub vector space W ⊂ V is a non-empty subset of a vector space V satisfying:
(S1) w1 + w2 ∈ W for all w1 , w2 ∈ W
(S2) αw ∈ W for all α ∈ F and for all w ∈ W
In other words, a sub vector space is a non-empty subset of a vector space which is closed under vector
addition and scalar multiplication.
This definition implies immediately that a sub vector space is also a vector space over the same field as
V . Indeed, we already know that 0w = 0 and (−1)w = −w, so from property (S2) a sub vector space
W contains the zero vector and an inverse for each vector. Hence, the requirements (V2) and (V3) in
Definition 1.1 are satisfied for W . All other requirements in Definition 1.1 are trivially satisfied for W
simply by virtue of them being satisfied in V . Hence, W is indeed a vector space.
Every vector space V has two trivial sub vector spaces, the vector space {0} consisting of only the
zero vector and the whole space V . We will now illustrate the concepts of vector space and sub vector
space by some examples.
(a) The basic example is V = F^n, the set of all n-dimensional column vectors v = (v_1, . . . , v_n)^T,
where v_i ∈ F. Vector addition and scalar multiplication are of course defined exactly as in Eqs. (1.2),
(1.3), that is, component by component. It is easy to check that all vector space axioms (V1)–(V8) are
indeed satisfied for these definitions. For example consider (V6):
\[
(\alpha+\beta)\mathbf{v} \stackrel{(1.3)}{=} \begin{pmatrix} (\alpha+\beta)v_1 \\ \vdots \\ (\alpha+\beta)v_n \end{pmatrix}
= \begin{pmatrix} \alpha v_1 + \beta v_1 \\ \vdots \\ \alpha v_n + \beta v_n \end{pmatrix}
\stackrel{(1.2)}{=} \begin{pmatrix} \alpha v_1 \\ \vdots \\ \alpha v_n \end{pmatrix} + \begin{pmatrix} \beta v_1 \\ \vdots \\ \beta v_n \end{pmatrix}
\stackrel{(1.3)}{=} \alpha\mathbf{v} + \beta\mathbf{v} \tag{1.23}
\]
It is useful to write the definitions of the two vector space operations in index notation as
\[
(\mathbf{v} + \mathbf{w})_i := v_i + w_i, \qquad (\alpha\mathbf{v})_i := \alpha v_i. \tag{1.24}
\]
The subscript i on the LHS means that component number i from the vector enclosed in brackets is
extracted. Using this notation, the vector space axioms can be verified in a much more concise way. For
example, we can demonstrate (V7) by ((αβ)v)i = (αβ)vi = α(βvi ) = α(βv)i = (α(βv))i .
Now, this example is not a big surprise and on its own would hardly justify the formal effort of our
general definition. So, let us move on to more adventurous examples.
(b) The set of all functions f : S → F from a set S into a field F , with vector addition and scalar
multiplication defined as
\[
(f + g)(x) := f(x) + g(x), \qquad (\alpha f)(x) := \alpha f(x),
\]
that is, by "pointwise" addition and multiplication, forms a vector space over F . The null "vector" is
the function which is zero everywhere and all axioms (V1)–(V8) are clearly satisfied. There are many
interesting specializations and sub vector spaces of this.
(c) All continuous (or differentiable) functions f : [a, b] → F on an interval [a, b] ⊂ R form a vector
space. Indeed, since continuity (or differentiability) is preserved under addition and scalar multiplication,
as defined in (b), this is a (sub) vector space. For example, consider the real-valued functions f (x) =
2x2 + 3x − 1 and g(x) = −2x + 4. Then the vector addition of these two functions and the scalar multiple
of f by α = 4 are given by
\[
(f+g)(x) = (2x^2+3x-1)+(-2x+4) = 2x^2+x+3, \qquad (\alpha f)(x) = 4(2x^2+3x-1) = 8x^2+12x-4. \tag{1.27}
\]
(d) In physics, many problems involve solving 2nd order, linear, homogeneous differential equations of the
form
\[
p(x)\frac{d^2 f}{dx^2} + q(x)\frac{df}{dx} + r(x)f = 0 \tag{1.28}
\]
where p, q and r are fixed functions. The task is to find all functions f which satisfy this equation.
This equation is referred to as a “linear” differential equation since every term is linear in the unknown
function f (rather than, for example, quadratic). This property implies that if f and g are two solutions of
the differential equation, then f + g and αf (for scalars α) are solutions as well. Hence, the space of solutions of
such an equation forms a vector space (indeed a sub vector space of the twice differentiable functions). A
simple example is the differential equation
\[
\frac{d^2 f}{dx^2} + f = 0 \tag{1.29}
\]
which is obviously solved by f (x) = cos(x) and f (x) = sin(x). Since the solution space forms a vector
space, we know that α cos(x) + β sin(x) for arbitrary α, β ∈ R solves the equation. This can also be easily
checked explicitly by inserting f (x) = α cos(x) + β sin(x) into the differential equation.
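As a quick independent check (an illustrative sketch, not part of the notes, and assuming the sympy library is available), one can let a computer algebra system verify that α cos(x) + β sin(x) satisfies Eq. (1.29) for arbitrary constants:

```python
import sympy as sp

# Check that f(x) = alpha*cos(x) + beta*sin(x) solves f'' + f = 0, Eq. (1.29),
# for arbitrary constants alpha and beta.
x, alpha, beta = sp.symbols('x alpha beta')
f = alpha * sp.cos(x) + beta * sp.sin(x)

residual = sp.diff(f, x, 2) + f
print(sp.simplify(residual))   # 0
```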
(e) The matrices of size n × m consist of an array of numbers in F with n rows and m columns. A matrix
is usually denoted by
\[
A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}, \tag{1.30}
\]
with entries aij ∈ F . As for vectors, we use index notation aij , where i = 1, . . . , n labels the rows and
j = 1, . . . , m labels the columns, to collectively refer to all the entries. Addition of two n × m matrices A
and B with components aij and bij and scalar multiplication can then be defined as
\[
A + B = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}
+ \begin{pmatrix} b_{11} & \cdots & b_{1m} \\ \vdots & & \vdots \\ b_{n1} & \cdots & b_{nm} \end{pmatrix} \tag{1.31}
\]
\[
:= \begin{pmatrix} a_{11}+b_{11} & \cdots & a_{1m}+b_{1m} \\ \vdots & & \vdots \\ a_{n1}+b_{n1} & \cdots & a_{nm}+b_{nm} \end{pmatrix} \tag{1.32}
\]
\[
\alpha A = \alpha \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}
:= \begin{pmatrix} \alpha a_{11} & \cdots & \alpha a_{1m} \\ \vdots & & \vdots \\ \alpha a_{n1} & \cdots & \alpha a_{nm} \end{pmatrix} \tag{1.33}
\]
that is, as for vectors, component by component. Clearly, with these operations, the n × m matrices with
entries in F form a vector space over F . The zero “vector” is the matrix with all entries zero. Indeed, as
long as we define vector addition and scalar multiplication component by component it does not matter
whether the numbers are arranged in a column (as for vectors) or a rectangle (as for matrices). By slight
abuse of notation we sometimes denote the entries of a matrix A by Aij = aij . In index notation, the
above operations can then be written more concisely as (A + B)ij := Aij + Bij and (αA)ij := αAij , in
analogy with the definitions (1.24) for column vectors.
For a numerical example, consider the 2 × 2 matrices
\[
A = \begin{pmatrix} 1 & -2 \\ 3 & -4 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & 5 \\ -1 & 8 \end{pmatrix}. \tag{1.34}
\]
Their sum is given by
\[
A + B = \begin{pmatrix} 1 & -2 \\ 3 & -4 \end{pmatrix} + \begin{pmatrix} 0 & 5 \\ -1 & 8 \end{pmatrix} = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}, \tag{1.35}
\]
while the scalar multiplication of A with α = 3 gives
\[
\alpha A = 3\begin{pmatrix} 1 & -2 \\ 3 & -4 \end{pmatrix} = \begin{pmatrix} 3 & -6 \\ 9 & -12 \end{pmatrix}. \tag{1.36}
\]
This list of examples hopefully illustrates the strength of the general approach. Perhaps surprisingly, even
many of the more “exotic” vector spaces do play a role in physics, particularly in quantum physics. Much
of what follows will only be based on the general Definition 1.1 of a vector space and, hence, will apply
to all of the above examples and many more.
is called the span of v1 , . . . , vk . Although the second definition might seem slightly abstract at first, the
span has a rather straightforward geometrical interpretation. For example, the span of a single vector
consists of all scalar multiples of this vector, so it can be viewed as the line through 0 in the direction of
this vector. The span of two vectors (which are not scalar multiples of each other) can be viewed as the
plane through 0 containing these two vectors and so forth.
We note that Span(v_1, . . . , v_k) is a sub vector space of V. This is rather easy to see. Consider two
vectors u, v ∈ Span(v_1, . . . , v_k) in the span, so that u = Σ_i α_i v_i and v = Σ_i β_i v_i. Then the sum
u + v = Σ_i (α_i + β_i) v_i is clearly in the span as well, as is the scalar multiple αu = Σ_{i=1}^n (αα_i) v_i. Hence,
from Def. 1.2, the span is a (sub) vector space.
The above interpretation of the span as lines, planes, etc. through 0 already points to a problem. Consider
the span of three vectors u, v, w but assume that u is a linear combination of the other two. In this
case, u can be omitted without changing the span, so Span(u, v, w) = Span(v, w). In this sense, the
original set of vectors u, v, w was not minimal. What we would like is a criterion for minimality of a set
of vectors, so that none of them can be removed without changing the span. This leads to the concept of
linear independence which is central to the subject. Formally, it is defined as follows.
Definition 1.3. Let V be a vector space over F and α1 , . . . , αk ∈ F scalars. A set of vectors v1 , . . . , vk ∈ V
is called linearly independent if
\[
\sum_{i=1}^{k} \alpha_i \mathbf{v}_i = \mathbf{0} \;\Longrightarrow\; \text{all } \alpha_i = 0. \tag{1.39}
\]
Otherwise, the vectors are called linearly dependent. That is, they are linearly dependent if Σ_{i=1}^k α_i v_i = 0
has a solution with at least one α_i ≠ 0.
To relate this to our previous discussion the following statement should be helpful.
Claim 1.1. The vectors v1 , . . . , vk are linearly dependent ⇐⇒ One vector vi can be written as a linear
combination of the others.
Proof. The proof is rather simple but note that there are two directions to show.
"⇒": Assume that the vectors v_1, . . . , v_k are linearly dependent, so that the equation Σ_{i=1}^k α_i v_i = 0 has
a solution with at least one α_i ≠ 0. Say α_1 ≠ 0, for simplicity. Then we can solve for v_1 to get
\[
\mathbf{v}_1 = -\frac{1}{\alpha_1} \sum_{i>1} \alpha_i \mathbf{v}_i, \tag{1.40}
\]
So for a linearly dependent set of vectors we can eliminate (at least) one vector without changing the
span. A linearly independent set is one which cannot be further reduced in this way, so is “minimal” in
this sense.
which is obviously solved by f (x) = sin x and f (x) = cos x. Are these two solutions linearly independent?
Using Eq. (1.39) we should start with α sin x + β cos x = 0 and, since the zero "vector" is the function which is
identically zero, this equation has to be satisfied for all x. Setting x = 0 we learn that β = 0 and setting
x = π/2 it follows that α = 0. Hence, sin and cos are linearly independent.
(e) For an example which involves linear dependence consider the three vectors
\[
\mathbf{v}_1 = \begin{pmatrix} -2 \\ 0 \\ 1 \end{pmatrix}, \qquad \mathbf{v}_2 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \qquad \mathbf{v}_3 = \begin{pmatrix} 0 \\ 2 \\ 3 \end{pmatrix}. \tag{1.45}
\]
The equation α_1 v_1 + α_2 v_2 + α_3 v_3 = 0 clearly has non-trivial solutions, for example α_1 = 1, α_2 = 2, α_3 = −1, so that the
vectors are linearly dependent. Alternatively, this could have been inferred by noting that v_3 = v_1 + 2v_2.
with coefficients αi and βi . Taking the difference of these two equations implies
\[
\sum_{i=1}^{n} (\alpha_i - \beta_i)\mathbf{v}_i = \mathbf{0}, \tag{1.48}
\]
and, from linear independence of the basis, it follows that all αi − βi = 0, so that indeed αi = βi .
In summary, given a basis every vector can be represented by its coordinates relative to the basis. Let
us illustrate this with a few examples.
Example 1.7: Coordinates relative to a basis
(a) For the standard basis ei of V = F n over F we can write every vector w as
\[
\mathbf{w} = \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix} = \sum_{i=1}^{n} w_i \mathbf{e}_i \tag{1.49}
\]
(b) Consider the basis vectors v_1 = (0, 1, 1)^T, v_2 = (0, 1, 2)^T, v_3 = (1, 1, −1)^T
of R³ and a general vector r with components x, y, z. To write r as a linear combination of the basis
vectors we set
\[
\mathbf{r} = \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \alpha_3 \mathbf{v}_3
= \begin{pmatrix} \alpha_3 \\ \alpha_1 + \alpha_2 + \alpha_3 \\ \alpha_1 + 2\alpha_2 - \alpha_3 \end{pmatrix} \tag{1.51}
\]
which implies x = α3 , y = α1 + α2 + α3 , z = α1 + 2α2 − α3 . Solving for the αi leads to α1 = −3x + 2y − z,
α2 = 2x − y + z, α3 = x, so these are the coordinates of r relative to the given basis.
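Finding coordinates relative to a basis amounts to solving a linear system, which a computer does directly; the sketch below (not from the notes) uses the basis vectors read off from the combination in Eq. (1.51) and compares with the closed-form coordinates quoted in the text.

```python
import numpy as np

# Basis vectors read off from the linear combination in Eq. (1.51):
# alpha1*v1 + alpha2*v2 + alpha3*v3 = (alpha3, alpha1+alpha2+alpha3, alpha1+2*alpha2-alpha3)^T.
v1 = np.array([0, 1, 1])
v2 = np.array([0, 1, 2])
v3 = np.array([1, 1, -1])
V = np.column_stack([v1, v2, v3])

# Coordinates of a vector r relative to this basis: solve V @ alpha = r.
x, y, z = 1.0, 2.0, 3.0
r = np.array([x, y, z])
alpha = np.linalg.solve(V, r)
print(alpha)                               # [-2.  3.  1.]

# Compare with the closed-form coordinates derived in the text.
print([-3*x + 2*y - z, 2*x - y + z, x])    # [-2.0, 3.0, 1.0]
```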
We would like to call the number of vectors in a basis the dimension of the vector space. However, there
are usually many different choices of basis for a given vector space. Do they necessarily all have the same
number of vectors? Intuitively, it seems this has to be the case but the formal proof is more difficult than
expected. It comes down to the following
Lemma 1.1. Let v_1, . . . , v_n be a basis of a vector space V and let w_1, . . . , w_m ∈ V with m > n. Then the vectors w_1, . . . , w_m are linearly dependent.
Proof. If w_1 = 0 then the vectors w_1, . . . , w_m are linearly dependent, so we can assume that w_1 ≠ 0. Since
the vectors v_i form a basis we can write
\[
\mathbf{w}_1 = \sum_{i=1}^{n} \alpha_i \mathbf{v}_i
\]
with at least one αi (say α1 ) non-zero (or else w1 would be zero). We can, therefore, solve this equation
for v1 so that
\[
\mathbf{v}_1 = \frac{1}{\alpha_1}\left(\mathbf{w}_1 - \sum_{i=2}^{n} \alpha_i \mathbf{v}_i\right)
\]
This shows that we can “exchange” the vector v1 for w1 in the basis {vi } such that V = Span(w1 , v2 , . . . , vn ).
This exchange process can be repeated until all vi are replaced by wi and V = Span(w1 , . . . , wn ). Since
m > n there is at least one vector, wn+1 , “left over” which can be written as a linear combination
\[
\mathbf{w}_{n+1} = \sum_{i=1}^{n} \beta_i \mathbf{w}_i.
\]
Hence, the vectors w_1, . . . , w_m are linearly dependent.
Now consider two bases, v1 , . . . , vn and w1 , . . . , wm , of V . Then the Lemma implies that both n > m
and n < m are impossible and n = m follows. Hence, while a vector space usually allows many choices of
basis the number of basis vectors is always the same. So we can define
Definition 1.5. For a basis v1 , . . . , vn of a vector space V over F we call dimF (V ) := n the dimension
of V over F .
From what we have just seen, it does not matter which basis we use to determine the dimension. Every
choice leads to the same result. Let us apply this to compute the dimension for some examples.
take the k th derivative with respect to x and then set x = 0. This immediately implies that αk = 0 and,
hence, that the monomials are linearly independent and form a basis. The dimension of the space is,
therefore, d + 1.
(d) For the n × m matrices with entries in F (as a vector space over F ) define the matrices
\[
E_{(ij)} = \begin{pmatrix}
0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0
\end{pmatrix}, \tag{1.54}
\]
where i = 1, . . . , n and j = 1, . . . , m and the “1” appears in the ith row and j th column with all other
entries zero. Clearly, these matrices form a basis, in complete analogy with the standard unit vectors.
Therefore, the vector space of n × m matrices has dimension nm.
In the following Lemma we collect a few simple conclusions about vector spaces which are spanned by a
finite number of vectors.
Lemma 1.2. For a vector space V spanned by a finite number of vectors we have:
(i) V has a basis
(ii) Every linearly independent set v1 , . . . , vk ∈ V can be completed to a basis.
(iii) If n = dim(V ), any linearly independent set of vectors v1 , . . . , vn forms a basis.
(iv) If dimF (V ) = dimF (W ) and V ⊂ W for two vector spaces V , W then V = W .
where every row, column and diagonal sums up to 15. Magic squares have long held a certain fascination
and an obvious problem is to find all magic squares.
In our context, the important observation is that magic squares form a vector space. Let us agree
that we add and scalar multiply magic squares in the same way as matrices (see Example 1.4 (e)), that is,
component by component. Then, clearly, the sum of two magic squares is again a magic square, as is the
scalar multiple of a magic square. Hence, from Def. 1.2, the 3 × 3 magic squares form a sub vector space
of the space of all 3 × 3 matrices. The problem of finding all magic squares can now be phrased in the
language of vector spaces. What is the dimension of the space of magic squares and can we write down a
basis for this space?
It is relatively easy to find the following three elementary examples of magic squares:
\[
M_1 = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}, \qquad
M_2 = \begin{pmatrix} 0 & 1 & -1 \\ -1 & 0 & 1 \\ 1 & -1 & 0 \end{pmatrix}, \qquad
M_3 = \begin{pmatrix} -1 & 1 & 0 \\ 1 & 0 & -1 \\ 0 & -1 & 1 \end{pmatrix}. \tag{1.56}
\]
It is also easy to show that these three matrices are linearly independent, using Eq. (1.39). Setting a
general linear combination to zero,
\[
\alpha_1 M_1 + \alpha_2 M_2 + \alpha_3 M_3 = \begin{pmatrix}
\alpha_1 - \alpha_3 & \alpha_1 + \alpha_2 + \alpha_3 & \alpha_1 - \alpha_2 \\
\alpha_1 - \alpha_2 + \alpha_3 & \alpha_1 & \alpha_1 + \alpha_2 - \alpha_3 \\
\alpha_1 + \alpha_2 & \alpha_1 - \alpha_2 - \alpha_3 & \alpha_1 + \alpha_3
\end{pmatrix} = 0, \tag{1.57}
\]
immediately leads to α1 = α2 = α3 = 0. Hence, M1 , M2 , M3 are linearly independent and span a three-
dimensional vector space of magic squares. Therefore, the dimension of the magic square space is at least
three. Indeed, our example (1.55) is contained in this three-dimensional space since M = 5M1 +3M2 +M3 .
As we will see later, this is not an accident. We will show that, in fact, the dimension of the magic square
space equals three and, hence, that M1 , M2 , M3 form a basis.
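A short numerical sketch (not part of the notes) which checks these statements with numpy: it verifies that M_1, M_2, M_3 and their linear combinations are magic squares, that they are linearly independent, and it reconstructs the square of Eq. (1.55) from the stated combination M = 5M_1 + 3M_2 + M_3 (the explicit entries of (1.55) are not reproduced in this excerpt, so they are obtained here from that relation).

```python
import numpy as np

# The three elementary magic squares of Eq. (1.56).
M1 = np.ones((3, 3), dtype=int)
M2 = np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]])
M3 = np.array([[-1, 1, 0], [1, 0, -1], [0, -1, 1]])

def is_magic(M):
    """All row sums, column sums and the two diagonal sums are equal."""
    s = M[0].sum()
    sums = list(M.sum(axis=0)) + list(M.sum(axis=1)) + [np.trace(M), np.trace(np.fliplr(M))]
    return all(t == s for t in sums)

print(all(is_magic(M) for M in (M1, M2, M3)))   # True

M = 5 * M1 + 3 * M2 + M3                         # the square of Eq. (1.55), via the stated combination
print(M)
print(is_magic(M), M[0].sum())                   # True 15

# Linear independence: flatten the squares into 9-component vectors and check the rank.
B = np.column_stack([M1.ravel(), M2.ravel(), M3.ravel()])
print(np.linalg.matrix_rank(B))                  # 3
```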
The dot (or scalar) product of two column vectors a, b ∈ R^n with components a_1, . . . , a_n and b_1, . . . , b_n
is defined as
\[
\mathbf{a}\cdot\mathbf{b} := \sum_{i=1}^{n} a_i b_i. \tag{2.2}
\]
In physics it is customary to omit the sum symbol in this definition and simply write a · b = ai bi , adopting
the convention that an index which appears twice in a given term (such as the index i in the present case)
is summed over. This is also referred to as the Einstein summation convention.
The scalar product satisfies a number of obvious properties, namely
(a) a · b = b · a
(b) a · (b + c) = a · b + a · c
(c) a · (βb) = β a · b
(d) a · a > 0 for all a ≠ 0        (2.3)
Property (a) means that the dot product is symmetric. Properties (b), (c) can be expressed by saying
that the scalar product is linear in the second argument (vector addition and scalar multiplication can be
“pulled through”) and, by symmetry, it is therefore also linear in the first argument. It is easy to show
these properties using index notation.
(a) a · b = a_i b_i = b_i a_i = b · a
(b) a · (b + c) = a_i (b_i + c_i) = a_i b_i + a_i c_i = a · b + a · c
(c) a · (βb) = a_i (βb_i) = β a_i b_i = β a · b
(d) a · a = Σ_{i=1}^n a_i² > 0 for a ≠ 0
Scalar products can also be defined “axiomatically” by postulating the four properties (a) – (d) and we
will come back to this approach in Section 6.
The last property, (d), allows us to define the length of a vector as
\[
|\mathbf{a}| := \sqrt{\mathbf{a}\cdot\mathbf{a}} = \left( \sum_{i=1}^{n} a_i^2 \right)^{1/2}. \tag{2.4}
\]
It follows easily that |αa| = |α||a| for any real number α. This relation means that every non-zero vector
a can be “normalised” to length one by defining
\[
\mathbf{n} = |\mathbf{a}|^{-1}\,\mathbf{a}. \tag{2.8}
\]
Lemma 2.1. (Cauchy-Schwarz inequality) For any two vectors a and b in Rn we have
|a · b| ≤ |a| |b| .
Proof. The proof is a bit tricky. We start with the simplifying assumption that |a| = |b| = 1. Then
\[
0 \le |\mathbf{a} \pm \mathbf{b}|^2 = |\mathbf{a}|^2 + |\mathbf{b}|^2 \pm 2\,\mathbf{a}\cdot\mathbf{b} = 2 \pm 2\,\mathbf{a}\cdot\mathbf{b},
\]
which shows that |a · b| ≤ 1. Now consider arbitrary vectors a and b. If one of these vectors is zero then
the inequality is trivially satisfied so we assume that both of them are non-zero. Then the vectors
\[
\mathbf{u} = \frac{\mathbf{a}}{|\mathbf{a}|}, \qquad \mathbf{v} = \frac{\mathbf{b}}{|\mathbf{b}|}
\]
have both length one and, hence, |u · v| ≤ 1. Inserting the definitions of u and v into this inequality and
multiplying by |a| and |b| gives the desired result.
Lemma 2.2. (Triangle inequality) For any two vectors a and b in Rn we have
|a + b| ≤ |a| + |b|
Proof.
|a + b|2 = |a|2 + |b|2 + 2a · b ≤ |a|2 + |b|2 + 2|a| |b| = (|a| + |b|)2 ,
where the Cauchy-Schwarz inequality has been used in the second step.
Figure 4: Geometric meaning of the triangle inequality: the length |a + b| is always less than or equal to the
sum |a| + |b| of the other two sides.
The triangle inequality has an obvious geometrical interpretation which is illustrated in Fig. 4. For
two non-zero vectors a and b, the Cauchy-Schwarz inequality implies that
\[
-1 \le \frac{\mathbf{a}\cdot\mathbf{b}}{|\mathbf{a}|\,|\mathbf{b}|} \le 1 \tag{2.9}
\]
so that there is a unique angle θ ∈ [0, π] such that
\[
\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{|\mathbf{a}|\,|\mathbf{b}|}. \tag{2.10}
\]
This angle θ is called the angle between the two vectors a and b, also denoted ∠(a, b). With this definition
of the angle we can also write the scalar product as
\[
\mathbf{a}\cdot\mathbf{b} = |\mathbf{a}|\,|\mathbf{b}|\cos\angle(\mathbf{a},\mathbf{b}). \tag{2.11}
\]
We call two vectors a and b orthogonal (or perpendicular), in symbols a ⊥ b, iff a · b = 0. For two
non-zero vectors a and b this means
\[
\mathbf{a} \perp \mathbf{b} \iff \angle(\mathbf{a},\mathbf{b}) = \frac{\pi}{2}. \tag{2.12}
\]
Example 2.2: Angle between vectors and orthogonality
(a) Recall that the two vectors a and b in Eq. (2.5) have dot product a · b = 6 and lengths |a| = 2√6
and |b| = √6. Hence, the angle between them is given by
\[
\cos(\angle(\mathbf{a},\mathbf{b})) = \frac{6}{2\sqrt{6}\,\sqrt{6}} = \frac{1}{2} \;\;\Rightarrow\;\; \angle(\mathbf{a},\mathbf{b}) = \frac{\pi}{3}. \tag{2.13}
\]
Note that this rule is relatively easy to remember: For the first entry of the cross product consider the
second and third components of the two vectors and multiply them “cross-wise” with a relative minus
sign between the two terms and similarly for the other two entries.
However, cross product calculations which involve non-numerical, symbolic expressions can become ex-
tremely tedious if done by writing out all three components explicitly. It is therefore useful to introduce
a more economical notation and adopt the Einstein summation convention. To this end, we define the
following two objects: the Kronecker delta symbol δ_{ij}, which equals 1 if i = j and 0 otherwise, and the
Levi-Civita tensor ε_{ijk} (Eq. (2.20)), which equals +1 if (i, j, k) is a cyclic permutation of (1, 2, 3), −1 if
it is an anti-cyclic permutation, and 0 whenever two indices coincide. The Levi-Civita tensor has the
following properties:
(a) it remains unchanged under cyclic index permutations, for example ε_{ijk} = ε_{jki} (2.21)
(b) it changes sign under anti-cyclic index permutations, for example ε_{ijk} = −ε_{ikj} (2.22)
(c) it vanishes if two indices are identical, for example ε_{ijj} = 0 (2.23)
(d) ε_{ijk} ε_{ilm} = δ_{jl} δ_{km} − δ_{jm} δ_{kl} (2.24)
(e) ε_{ijk} ε_{ijm} = 2δ_{km} (2.25)
(f) ε_{ijk} ε_{ijk} = 6 (2.26)
(g) ε_{ijk} a_j a_k = 0 (2.27)
The first three of these properties are obvious from the definition of the Levi-Civita tensor. Property (2.24)
can be reasoned out as follows. If the index pair (j, k) is different from (l, m) (in any order) then clearly
both sides of (2.24) are zero. On the other hand, if the two index pairs equal each other they can do so
in the same or the opposite ordering and these two possibilities correspond precisely to the two terms on
the RHS of (2.24). If we multiply (2.24) by δ_{jl}, using the index replacing property of the Kronecker delta,
we obtain
\[
\epsilon_{ijk}\epsilon_{ijm} = (\delta_{jl}\delta_{km} - \delta_{jm}\delta_{kl})\,\delta_{jl} = 3\delta_{km} - \delta_{km} = 2\delta_{km}
\]
and this is property (2.25). Further, multiplying (2.25) with δ_{km} we have
\[
\epsilon_{ijk}\epsilon_{ijk} = 2\,\delta_{km}\delta_{km} = 6
\]
and, hence, (2.26) follows. Finally, to show (2.27) we write ε_{ijk} a_j a_k = −ε_{ikj} a_k a_j = −ε_{ijk} a_j a_k, where the
summation indices j and k have been swapped in the last step, and, hence, 2ε_{ijk} a_j a_k = 0.
We can think of δ_{ij} and ε_{ijk} as a convenient notation for the 0's, 1's and −1's which appear in the definitions
of the dot and cross product. Indeed, the dot product can be written as
\[
\mathbf{a}\cdot\mathbf{b} = a_i b_i = \delta_{ij}\, a_i b_j, \tag{2.28}
\]
while the index version of the cross product takes the form
\[
(\mathbf{a}\times\mathbf{b})_i = \epsilon_{ijk}\, a_j b_k. \tag{2.29}
\]
To verify this last equation focus, for example, on the first component:
\[
(\mathbf{a}\times\mathbf{b})_1 = \epsilon_{1jk}\, a_j b_k = \epsilon_{123}\, a_2 b_3 + \epsilon_{132}\, a_3 b_2 = a_2 b_3 - a_3 b_2,
\]
which indeed equals the first component of the vector product (2.17). Analogously, it can be verified that
the other two components match. Note that the index expression (2.29) for the vector product is much
more concise than the component version (2.17).
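The index calculus lends itself to direct numerical checks; the sketch below (illustrative only, not part of the notes) builds ε_{ijk} as a numpy array, verifies that Eq. (2.29) reproduces the cross product, and checks the identities (2.24)–(2.26) by brute force.

```python
import numpy as np

# Build the Levi-Civita tensor eps[i,j,k] in three dimensions.
eps = np.zeros((3, 3, 3))
for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[i, j, k] = 1.0    # cyclic permutations of (1,2,3)
    eps[i, k, j] = -1.0   # anti-cyclic permutations

rng = np.random.default_rng(0)
a, b = rng.normal(size=3), rng.normal(size=3)

# Eq. (2.29): (a x b)_i = eps_ijk a_j b_k.
cross_index = np.einsum('ijk,j,k->i', eps, a, b)
print(np.allclose(cross_index, np.cross(a, b)))   # True

# Epsilon-delta identity (2.24).
delta = np.eye(3)
lhs = np.einsum('ijk,ilm->jklm', eps, eps)
rhs = np.einsum('jl,km->jklm', delta, delta) - np.einsum('jm,kl->jklm', delta, delta)
print(np.allclose(lhs, rhs))                      # True

# Contractions (2.25) and (2.26).
print(np.allclose(np.einsum('ijk,ijm->km', eps, eps), 2 * delta))  # True
print(np.einsum('ijk,ijk->', eps, eps))                            # 6.0
```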
(a) a × b = −b × a (2.31)
(b) a × (b + c) = a × b + a × c (2.32)
(c) a × (βb) = β a × b (2.33)
(d) e_1 × e_2 = e_3 , e_2 × e_3 = e_1 , e_3 × e_1 = e_2 (2.34)
(e) a × a = 0 (2.35)
(f) a × (b × c) = (a · c) b − (a · b) c (2.36)
(g) (a × b) · (c × d) = (a · c)(b · d) − (a · d)(b · c) (2.37)
(h) |a × b|² = |a|² |b|² − (a · b)² (2.38)
Property (a) means that the vector product is anti-symmetric. Properties (b) and (c) imply linearity in
the second argument (vector addition and scalar multiplication can be “pulled through”) and, from anti-
symmetry, linearity also holds in the first argument. Property (g) is sometimes referred to as Lagrange
identity. The above relations can be verified by writing out all the vectors explicitly and using the
definitions (2.2) and (2.17) of the dot and cross products. However, for some of the identities this leads to
rather tedious calculations. It is much more economical to use index notation and express dot and cross
product via Eqs. (2.28) and (2.29). The proofs are then as follows:
(d) By explicit computation using the definition (2.17) with the standard unit vectors.
(f) (a × (b × c))_i = ε_{ijk} a_j (b × c)_k = ε_{ijk} ε_{kmn} a_j b_m c_n = ε_{kij} ε_{kmn} a_j b_m c_n = (δ_{im} δ_{jn} − δ_{in} δ_{jm}) a_j b_m c_n
= a_j c_j b_i − a_j b_j c_i = (a · c) b_i − (a · b) c_i = ((a · c)b − (a · b)c)_i , where (2.24) has been used in the fourth step.
(g) (a × b) · (c × d) = ε_{ijk} ε_{imn} a_j b_k c_m d_n = (δ_{jm} δ_{kn} − δ_{jn} δ_{km}) a_j b_k c_m d_n = (a · c)(b · d) − (a · d)(b · c), again using (2.24).
Note how expressions in vector notation are converted into index notation in these proofs by working from
the “outside in”. For example, for property (f) we have first written the outer cross product between a
and b × c in index notation and then, in the second step, we have converted b × c. Once fully in index
notation, the order of all objects can be exchanged at will - after all they are just numbers. In the proofs
of (f) and (g) the Kronecker delta acts as an index replacer, as explained below Eq. (2.19).
It is worth pointing out that the cross product can also be defined axiomatically by postulating a
product × : R3 → R3 with the properties (a) – (d) in (2.31)–(2.34). In other words, the cross product
can be defined as an anti-symmetric, bi-linear operation, mapping two three-dimensional vectors into a
three-dimensional vector, which acts in a simple, cyclic way (that is, as in property (d)) on the standard
unit vectors. It is easy to see that the vector product is indeed completely determined by these properties.
Write two vectors a, b ∈ R³ as linear combinations of the standard unit vectors, that is, a = Σ_i a_i e_i and
b = Σ_j b_j e_j, and work out their cross product using only the rules (a) – (d) (as well as the rule (e) which
follows directly from (a)). This leads to
\[
\mathbf{a}\times\mathbf{b} = \left(\sum_i a_i \mathbf{e}_i\right) \times \left(\sum_j b_j \mathbf{e}_j\right)
\stackrel{(2.32),(2.33)}{=} \sum_{i,j} a_i b_j\, \mathbf{e}_i \times \mathbf{e}_j
\stackrel{(2.35)}{=} \sum_{i\neq j} a_i b_j\, \mathbf{e}_i \times \mathbf{e}_j \tag{2.39}
\]
\[
\stackrel{(2.31),(2.34)}{=} (a_2 b_3 - a_3 b_2)\mathbf{e}_1 + (a_3 b_1 - a_1 b_3)\mathbf{e}_2 + (a_1 b_2 - a_2 b_1)\mathbf{e}_3, \tag{2.40}
\]
The object in the square bracket, denoted by I_ij, is called the moment of inertia tensor of the rigid
body. It is obviously symmetric, I_ij = I_ji, so we can think of it as forming a symmetric matrix, and
it is a characteristic quantity of the rigid body. We can think of it as playing a role in rotational motion
Figure 6: A rotating rigid body
analogous to that of regular mass in linear motion. Correspondingly, the total kinetic energy of the rigid
body can be written as
\[
E_{\rm kin} = \frac{1}{2}\sum_{i,j} I_{ij}\,\omega_i \omega_j \tag{2.41}
\]
This relation is of fundamental importance for the mechanics of rigid bodies, in particular the motion of
tops, and we will return to it later.
Dot and cross product can be combined into a third product with three vector arguments, the triple product,
which is defined as
\[
\langle \mathbf{a}, \mathbf{b}, \mathbf{c} \rangle := \mathbf{a}\cdot(\mathbf{b}\times\mathbf{c}). \tag{2.42}
\]
It has the following properties
(a) ⟨a, b, c⟩ = ε_{ijk} a_i b_j c_k = a_1 b_2 c_3 + a_2 b_3 c_1 + a_3 b_1 c_2 − a_1 b_3 c_2 − a_2 b_1 c_3 − a_3 b_2 c_1 (2.43)
(b) It is linear in each argument, e.g. ⟨αa + βb, c, d⟩ = α⟨a, c, d⟩ + β⟨b, c, d⟩ (2.44)
(c) It is unchanged under cyclic permutations, e.g. ⟨a, b, c⟩ = ⟨b, c, a⟩ (2.45)
(d) It changes sign for anti-cyclic permutations, e.g. ⟨a, b, c⟩ = −⟨a, c, b⟩ (2.46)
(e) It vanishes if any two arguments are the same, e.g. ⟨a, b, b⟩ = 0 (2.47)
(f) The triple product of the three standard unit vectors is one, that is ⟨e_1, e_2, e_3⟩ = 1 (2.48)
Property (a) follows easily from the definitions of dot and cross products, Eqs. (2.28) and (2.29), in index
notation and the definition (2.20) of the Levi-Civita tensor. Properties (b)–(e) are a direct consequence of
(a) and the properties of the Levi-Civita tensor. Specifically, (c) and (d) follow from (2.21),(2.22) and (e)
follows from (2.27). Property (f) follows from direct calculation, using the cross product relations (2.34)
for the standard unit vectors.
Another notation for the triple product is
\[
\det(\mathbf{a}, \mathbf{b}, \mathbf{c}) := \langle \mathbf{a}, \mathbf{b}, \mathbf{c} \rangle, \tag{2.49}
\]
where “det” is short for determinant. Later we will introduce the determinant in general and for arbitrary
dimensions and we will see that, in three dimensions, this general definition indeed coincides with the
triple product.
Note that the six terms which appear in the explicit expression (2.43) for the triple product correspond
to the six permutations of {1, 2, 3}, where the three terms for the cyclic permutations come with a positive
sign and the other, anti-cyclic ones with a negative sign. There is a simple way to memorise these six
terms. If we arrange the three vectors a, b and c into the columns of a 3 × 3 matrix for convenience, the
six terms in the triple product correspond to the products of terms along the diagonals.
\[
\det\begin{pmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{pmatrix}
= a_1 b_2 c_3 + a_2 b_3 c_1 + a_3 b_1 c_2 - a_1 b_3 c_2 - a_2 b_1 c_3 - a_3 b_2 c_1 \tag{2.50}
\]
Here north-west to south-east lines connect the entries forming the three cyclic terms which appear with
a positive sign while north-east to south-west lines connect the entries forming the anti-cyclic terms which
appear with a negative sign. Corresponding lines on the left and right edge should be identified in order
to collect all the factors.
One way to proceed is to first work out the cross product between b and c, that is
\[
\mathbf{b}\times\mathbf{c} = \begin{pmatrix} -2 \\ 5 \\ 1 \end{pmatrix} \times \begin{pmatrix} 4 \\ -6 \\ 3 \end{pmatrix} = \begin{pmatrix} 21 \\ 10 \\ -8 \end{pmatrix}. \tag{2.52}
\]
Taking the dot product with a then gives the triple product ⟨a, b, c⟩ = a · (b × c) = (−1) · 21 + 2 · 10 + (−3) · (−8) = 23.
Alternatively and equivalently, we can use the rule (2.50) which gives
\[
\det(\mathbf{a},\mathbf{b},\mathbf{c}) = \det\begin{pmatrix} -1 & -2 & 4 \\ 2 & 5 & -6 \\ -3 & 1 & 3 \end{pmatrix}
= (-1)\cdot 5\cdot 3 + (-2)\cdot(-6)\cdot(-3) + 4\cdot 2\cdot 1 - (-1)\cdot(-6)\cdot 1 - (-2)\cdot 2\cdot 3 - 4\cdot 5\cdot(-3) \tag{2.54}
\]
\[
= -15 - 36 + 8 - 6 + 12 + 60 = 23.
\]
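The same triple product can be evaluated numerically; a brief sketch (not part of the notes) using the vectors appearing in Eqs. (2.52)–(2.54):

```python
import numpy as np

# The vectors of Eqs. (2.52)-(2.54).
a = np.array([-1, 2, -3])
b = np.array([-2, 5, 1])
c = np.array([4, -6, 3])

# Triple product via Eq. (2.42) ...
print(np.cross(b, c))        # [21 10 -8], as in Eq. (2.52)
print(a @ np.cross(b, c))    # 23

# ... and via the determinant rule (2.50), with a, b, c as columns.
print(np.linalg.det(np.column_stack([a, b, c])))   # 23.0 (up to rounding)
```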
Having introduced all the general definitions and properties we should now discuss the geometrical inter-
pretations of the cross and triple products.
Geometrical interpretation of cross product:
Property (2.47) of the triple product implies that the cross product a × b is perpendicular to both vectors
a and b. For the length of a cross product it follows
\[
|\mathbf{a}\times\mathbf{b}| \stackrel{(2.38)}{=} \left(|\mathbf{a}|^2 |\mathbf{b}|^2 - (\mathbf{a}\cdot\mathbf{b})^2\right)^{1/2}
= |\mathbf{a}|\,|\mathbf{b}|\Big(\underbrace{1 - \tfrac{(\mathbf{a}\cdot\mathbf{b})^2}{|\mathbf{a}|^2|\mathbf{b}|^2}}_{=\,1-\cos^2\angle(\mathbf{a},\mathbf{b})}\Big)^{1/2}
= |\mathbf{a}|\,|\mathbf{b}|\sin\angle(\mathbf{a},\mathbf{b}) \tag{2.55}
\]
From this result and Fig. 7 the length, |a × b|, of the cross product is equal to the area of the rectangle
indicated and, as this area is left invariant by a shear, it equals the area of the parallelogram defined
by the vectors a and b. In summary, we therefore see that the vector product a × b defines a vector
perpendicular to a and b whose length equals the area of the parallelogram defined by a and b.
This geometrical interpretation suggests a number of applications for the cross product. In particular,
it can be used to find a vector which is orthogonal to two given vectors and to calculate the area of the
parallelogram (and the triangle) defined by two vectors.
The triple product of the standard unit vectors is ⟨e_1, e_2, e_3⟩ = e_1 · (e_2 × e_3) = e_1 · e_1 = 1, which equals
the volume of the unit cube. For three vectors αe_1, βe_2, γe_3 in the directions of the coordinate axes we find,
from linearity (2.44) of the triple product, that ⟨αe_1, βe_2, γe_3⟩ = αβγ⟨e_1, e_2, e_3⟩ = αβγ, which equals the
volume of the cuboid with side lengths α, β, γ. Suppose this cuboid is sheared to a parallelepiped. As an
example let us consider a shear in the direction of e_3, leading to a parallelepiped defined by the vectors
a = αe_1 + δe_3, b = βe_2 and c = γe_3. Then, by linearity of the triple product and property (2.47), we have
⟨a, b, c⟩ = αβγ⟨e_1, e_2, e_3⟩ + δβγ⟨e_3, e_2, e_3⟩ = αβγ, so the triple product is the same for the cuboid and the
parallelepiped related by a shear. It is clear that this remains true for general shears. Since shears are known
to leave the volume unchanged we conclude that the (absolute) value of the triple product, |⟨a, b, c⟩|, for three
arbitrary vectors a, b, c equals the volume of the parallelepiped defined by these vectors.
This geometrical interpretation suggests that the triple product of three linearly dependent vectors
should be zero. Indeed, three linearly dependent vectors all lie in one plane and form a degenerate
parallelepiped with volume zero. To be sure, let us properly formulate and prove this assertion.
Claim 2.1. ⟨a, b, c⟩ ≠ 0 =⇒ a, b, c are linearly independent.
Proof. If a, b, c are linearly dependent then one of the vectors can be written as a linear combination of
the others, for example, a = βb + γc. From linearity of the triple product and (2.47) it then follows that
\[
\langle \mathbf{a}, \mathbf{b}, \mathbf{c} \rangle = \beta\langle \mathbf{b}, \mathbf{b}, \mathbf{c} \rangle + \gamma\langle \mathbf{c}, \mathbf{b}, \mathbf{c} \rangle = 0.
\]
Hence, a non-vanishing triple product implies that the vectors are linearly independent.
We will later generalize this statement to arbitrary dimensions, using the determinant, and also show
that its converse holds. For the time being we note that we have obtained a useful practical way of
checking if three vectors in three dimensions are linearly independent and, hence, form a basis. In short,
if ⟨a, b, c⟩ ≠ 0 then a, b, c form a basis of R³.
form a basis of R3 . We can now check this independently by computing the triple product of these vectors
and using Claim 2.1. For the triple product we find
\[
\det(\mathbf{v}_1,\mathbf{v}_2,\mathbf{v}_3) = \mathbf{v}_1\cdot(\mathbf{v}_2\times\mathbf{v}_3)
= \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}\cdot\begin{pmatrix} -3 \\ 2 \\ -1 \end{pmatrix} = 1. \tag{2.60}
\]
Since this result is non-zero we conclude from Claim 2.1 that the three vectors are indeed linearly inde-
pendent and, hence, form a basis of R3 . Moreover, we have learned that the volume of the parallelepiped
defined by v1 , v2 and v3 equals 1.
algebra and are indeed part of another, related mathematical discipline called affine geometry. However,
given their importance in physics we will briefly discuss some of these applications.
The affine space A^n consists of all points P = (p_1, . . . , p_n)^T with p_i ∈ R.
As a set this is the same as Rn , however, An is simply considered as a space of points without a vector
space structure. Vectors v ∈ Rn can act on points P in the affine space An by a translation defined as
\[
P \mapsto P + \mathbf{v} = \begin{pmatrix} p_1 + v_1 \\ \vdots \\ p_n + v_n \end{pmatrix}.
\]
The unique vector¹ translating P = (p_1, . . . , p_n)^T ∈ A^n to Q = (q_1, . . . , q_n)^T ∈ A^n is denoted by
\overrightarrow{PQ} := (q_1 − p_1, . . . , q_n − p_n)^T ∈ R^n. It is easy to verify that
\[
\overrightarrow{PQ} + \overrightarrow{QR} = \overrightarrow{PR},
\]
a property which is also intuitively apparent from Fig. 8. The distance d(P, Q) between points P and Q
¹ For ease of notation we will sometimes write a column v with components v_1, . . . , v_n as v = (v_1, . . . , v_n)^T. The T superscript, short for "transpose", indicates conversion of a row into a column. We will discuss transposition more systematically shortly.
2.3.2 Lines in R2
We begin by discussing lines in R2 . In Cartesian form they can be described as all points (x, y) ∈ R2
which satisfy the linear equation
y = ax + b (2.61)
where a, b are fixed real numbers. Alternatively, a line can be described in parametric vector form as all
vectors r(t) given by
\[
\mathbf{r}(t) = \begin{pmatrix} x(t) \\ y(t) \end{pmatrix} = \mathbf{p} + t\mathbf{q}, \qquad t \in \mathbb{R}. \tag{2.62}
\]
Here, p = (px , py )T and q = (qx , qy )T are fixed vectors and t is a parameter. The geometric interpretation
of these various objects is apparent from Fig. 9.
It is sometimes required to convert between those two descriptions of a line. To get from the Cartesian
to the vector form simply use x as the parameter so that x(t) = t and y(t) = at + b. Combining these two
equations into a vector equation gives
\[
\mathbf{r}(t) = \begin{pmatrix} x(t) \\ y(t) \end{pmatrix}
= \underbrace{\begin{pmatrix} 0 \\ b \end{pmatrix}}_{\mathbf{p}} + t \underbrace{\begin{pmatrix} 1 \\ a \end{pmatrix}}_{\mathbf{q}} \tag{2.63}
\]
where the vectors p and q are identified as indicated. For the opposite direction, to get from the vector
to the Cartesian form, simply solve the two components of (2.62) for t so that
\[
t = \frac{x - p_x}{q_x} = \frac{y - p_y}{q_y} \;\Longrightarrow\;
y = \underbrace{\frac{q_y}{q_x}}_{a}\, x + \underbrace{p_y - \frac{q_y}{q_x}\,p_x}_{b} \tag{2.64}
\]
Example 2.8: Conversion between Cartesian and vector form for two-dimensional lines
(a) Start with a line y = 2x − 3 in Cartesian form. Setting x(t) = t and y(t) = 2t − 3 the vector form of
this line is given by
\[
\mathbf{r}(t) = \begin{pmatrix} x(t) \\ y(t) \end{pmatrix} = \begin{pmatrix} t \\ 2t-3 \end{pmatrix}
= \begin{pmatrix} 0 \\ -3 \end{pmatrix} + t\begin{pmatrix} 1 \\ 2 \end{pmatrix}. \tag{2.65}
\]
(b) Conversely, the line in vector form given by
\[
\mathbf{r}(t) = \begin{pmatrix} 2 \\ -1 \end{pmatrix} + t\begin{pmatrix} 1 \\ 2 \end{pmatrix} \tag{2.66}
\]
can be split up into the two components x = 2 + t and y = −1 + 2t. Hence, t = x − 2 and inserting this
into the equation for y gives the Cartesian form y = 2x − 5 of the line.
Finally, a common problem is to find the intersection of two lines given by r1 (t1 ) = p1 + t1 q1 and
r2 (t2 ) = p2 + t2 q2 . Setting r1 (t1 ) = r2 (t2 ) leads to
t1 q1 − t2 q2 = p2 − p1 . (2.67)
If q1 , q2 are linearly independent then they form a basis of R2 and, in this case, we know from Claim 1.2
that there is a unique solution t1 , t2 for this equation. The intersection point is obtained by computing
r1 (t1 ) or r2 (t2 ) for these values. If q1 , q2 are linearly dependent then the lines are parallel and either there
is no intersection or the two lines are identical.
The unique solution is t1 = −2 and t2 = 0, which are the parameter values of the intersection point. To
obtain the intersection point risec itself insert these parameter values into the equations for the lines which
gives
\[
\mathbf{r}_{\rm isec} = \mathbf{r}_1(-2) = \mathbf{r}_2(0) = \begin{pmatrix} 3 \\ 0 \end{pmatrix}. \tag{2.70}
\]
2.3.3 Lines in R3
The vector form for 2-dimensional lines (2.62) can be easily generalized to three dimensions:
\[
\mathbf{r}(t) = \begin{pmatrix} x(t) \\ y(t) \\ z(t) \end{pmatrix} = \mathbf{p} + t\mathbf{q}. \tag{2.71}
\]
Here p = (px , py , pz )T and q = (qx , qy , qz )T are fixed vectors. As before, we can get to the Cartesian form
by solving the components of Eq. (2.71) for t resulting in
\[
t = \frac{x - p_x}{q_x} = \frac{y - p_y}{q_y} = \frac{z - p_z}{q_z}. \tag{2.72}
\]
Note that this amounts to two equations between the three coordinates x, y, z as should be expected for
the definition of a one-dimensional object (the line) in three dimensions. The geometrical interpretation
of the various vectors is indicated in Fig. 10.
Example 2.10: Conversion between Cartesian and vector form for lines in three dimensions
(a) We would like to convert the line in vector form given by
\[
\mathbf{r}(t) = \begin{pmatrix} x(t) \\ y(t) \\ z(t) \end{pmatrix}
= \begin{pmatrix} 2 \\ -1 \\ 4 \end{pmatrix} + t\begin{pmatrix} 3 \\ -5 \\ 2 \end{pmatrix} \tag{2.73}
\]
into Cartesian form. Solving the three components for t immediately leads to the Cartesian form
\[
t = \frac{x-2}{3} = -\frac{y+1}{5} = \frac{z-4}{2}. \tag{2.74}
\]
For the minimum distance of a line from a given point we have the following statement.
Claim 2.2. The minimum distance of a line r(t) = p + tq from a point p_0 arises at t_min = −(d · q)/|q|², where
d := p − p_0, and is given by d_min = |d × q|/|q|.
Proof. We simply work out the squared distance d²(t) := |r(t) − p_0|² of an arbitrary point r(t) on the line
from the point p_0, which leads to
\[
d^2(t) = |\mathbf{d} + t\mathbf{q}|^2 = |\mathbf{q}|^2 t^2 + 2(\mathbf{d}\cdot\mathbf{q})\,t + |\mathbf{d}|^2
= \left[ |\mathbf{q}|\,t + \frac{\mathbf{d}\cdot\mathbf{q}}{|\mathbf{q}|} \right]^2 + |\mathbf{d}|^2 - \frac{(\mathbf{d}\cdot\mathbf{q})^2}{|\mathbf{q}|^2}. \tag{2.77}
\]
This is minimal when the expression inside the square bracket vanishes, which happens for t = t_min = −(d · q)/|q|².
This proves the first part of the claim. For the second part we simply compute
\[
d_{\min}^2 := d^2(t_{\min}) = \frac{1}{|\mathbf{q}|^2}\left( |\mathbf{d}|^2 |\mathbf{q}|^2 - (\mathbf{d}\cdot\mathbf{q})^2 \right)
\stackrel{(2.38)}{=} \frac{|\mathbf{d}\times\mathbf{q}|^2}{|\mathbf{q}|^2}. \tag{2.78}
\]
\[
d_{\min} = \frac{|\mathbf{d}\times\mathbf{q}|}{|\mathbf{q}|} = \frac{3}{\sqrt{2}}. \tag{2.81}
\]
2.3.4 Planes in R3
To obtain the vector form of a plane in three dimensions we can generalize Eq. (2.71), the vector form of
a 3-dimensional line, by introducing two parameters, t1 and t2 and define the plane as all points r(t1 , t2 )
given by
\[
\mathbf{r}(t_1,t_2) = \begin{pmatrix} x(t_1,t_2) \\ y(t_1,t_2) \\ z(t_1,t_2) \end{pmatrix}
= \mathbf{p} + t_1\mathbf{q} + t_2\mathbf{s}, \qquad t_1, t_2 \in \mathbb{R}, \tag{2.82}
\]
where p, q and s are fixed vectors in R3 . Of course, for this to really define a plane (rather than a line)
the vectors q and s must be linearly independent. A unit normal vector to this plane is given by
\[
\mathbf{n} = \frac{\mathbf{q}\times\mathbf{s}}{|\mathbf{q}\times\mathbf{s}|}. \tag{2.83}
\]
Multiplying the vector form (2.82) by n = (nx , ny , nz )T (and remembering that n · q = n · s = 0) we get
to the Cartesian form of a three-dimensional plane
n·r=d or nx x + ny y + nz z = d , (2.84)
where d = n · p is a constant. From Eq. (2.11) we can re-write the Cartesian form as cos(θ)|r| = d,
where θ = ^(n, r) is the angle between n and r. The distance |r| of the plane from the origin is minimal
if cos(θ) = ±1 which shows that the constant d (or rather its absolute value |d|) should be interpreted
as the minimal distance of the plane from the origin. The geometrical meaning of the various objects is
indicated in Fig. 11. Finally, to convert a plane in Cartesian form (2.84) into vector form we must first
find a vector p with p · n = d (a vector "to the plane") and then two linearly independent vectors q, s
satisfying q · n = s · n = 0 (two vectors "in the plane"). These are then three suitable vectors to write
down the vector form (2.82).
Example 2.12: Conversion between vector form and Cartesian form for a plane in three dimensions
(a) Start with a plane r(t1 , t2 ) = p + t1 q + t2 s in vector form, where
\[
\mathbf{p} = \begin{pmatrix} 3 \\ 2 \\ 0 \end{pmatrix}, \qquad \mathbf{q} = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}, \qquad \mathbf{s} = \begin{pmatrix} 2 \\ 1 \\ -3 \end{pmatrix}. \tag{2.85}
\]
To convert to Cartesian form, we first work out a normal vector, N, to the plane, given by
\[
\mathbf{N} = \mathbf{q}\times\mathbf{s} = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix} \times \begin{pmatrix} 2 \\ 1 \\ -3 \end{pmatrix}
= -\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \tag{2.86}
\]
"to the plane", that is, a vector p satisfying N · p = 4, and we can choose, for example, p = (2, 0, 0)^T.
Combining these results the vector form of the plane reads
\[
\mathbf{r}(t_1,t_2) = \mathbf{p} + t_1\mathbf{q} + t_2\mathbf{s}
= \begin{pmatrix} 2 \\ 0 \\ 0 \end{pmatrix} + t_1\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + t_2\begin{pmatrix} 1 \\ 0 \\ -2 \end{pmatrix}. \tag{2.87}
\]
t1 q + t2 s − tb = a − p . (2.89)
Let us assume that the triple product ⟨b, q, s⟩ is non-zero so that, from Claim 2.1, the vectors b, q, s
form a basis. In this case, Eq. (2.89) has a unique solution for t_1, t_2, t which corresponds to the parameter
values of the intersection point. This solution can be found by splitting (2.89) up into its three component
equations and explicitly solving for t_1, t_2, t. Perhaps a more elegant way to proceed is to multiply (2.89)
by (q × s), so that the terms proportional to t_1 and t_2 drop out. The resulting equation can easily be
solved for t which leads to
\[
t_{\rm isec} = \frac{\langle \mathbf{p} - \mathbf{a}, \mathbf{q}, \mathbf{s} \rangle}{\langle \mathbf{b}, \mathbf{q}, \mathbf{s} \rangle}. \tag{2.90}
\]
This is the value of the line parameter t at the intersection point and we obtain the intersection point
itself by evaluating rL (tisec ).
By straightforward calculation we have ⟨p − a, q, s⟩ = −4 and ⟨b, q, s⟩ = −1 so that, from Eq. (2.90), the
value of the line parameter t at the intersection point is t_isec = 4. The intersection point, r_isec, is then
obtained by evaluating the equation for the line at t = t_isec = 4, so
\[
\mathbf{r}_{\rm isec} = \mathbf{r}_L(4) = \begin{pmatrix} 5 \\ -4 \\ -3 \end{pmatrix}. \tag{2.92}
\]
Figure 12: Sphere in three dimensions
From Eq. (2.96) this gives the minimal distance dmin = |(p1 − p2 ) · n| = 11/3 between the two lines.
2.3.7 Spheres in R3
A sphere in R3 with radius ρ and center p = (a, b, c)T consists of all points r = (x, y, z)T with |r − p| = ρ.
Written out in coordinates this reads explicitly
\[
(x-a)^2 + (y-b)^2 + (z-c)^2 = \rho^2.
\]
[Figure: schematic of a perceptron, with input values x_1, . . . , x_n, weights w_1, . . . , w_n, a summation node, an activation function f and output y.]
A perceptron processes n real input values
x_1, . . . , x_n which can be combined into an n-dimensional input vector x = (x_1, . . . , x_n)^T. The internal state
of the perceptron is determined by three pieces of data: the real values w1 , . . . , wn , called the “weights”,
which can be arranged into the n-dimensional weight vector w = (w1 , . . . , wn )T , a real number θ, called
the “threshold” of the perceptron and a real function f , referred to as the “activation function”. In terms
of this data, the perceptron computes the output value y from the input values x as
\[
y = f(\mathbf{w}\cdot\mathbf{x} - \theta). \tag{2.100}
\]
The activation function can, for example, be chosen as the sign function
\[
f(z) = \begin{cases} +1 & \text{for } z \ge 0 \\ -1 & \text{for } z < 0 \end{cases}. \tag{2.101}
\]
Given this set-up, the functioning of the perceptron can be phrased in geometrical terms. Consider the
equation in Rn with coordinates x given by
w·x=θ . (2.102)
Note that this is simply the equation of a plane (or a hyper-plane in dimensions n > 3) in Cartesian form.
If a point x ∈ Rn is “above” (or on) this plane, so that w · x − θ ≥ 0, then, from Eqs. (2.100) and (2.101),
the output of the perceptron is +1. On the other hand, for a point x ∈ Rn below this plane, so that
w · x − θ < 0, the perceptron’s output is −1. In other words, the purpose of the perceptron is to “decide”
whether a certain point x is above or below a given plane.
So far this does not seem to hold much interest - all we have done is to re-formulate a sequence of
simple mathematical operations, related to the Cartesian form of a plane, in a different language. The point
is that the internal state of the perceptron, that is the choice of a plane specified by the weight vector w
and the threshold θ, is not inserted “by hand” but rather determined by a learning process. This works
as follows. Imagine a certain quantity, y, rapidly changes from −1 to +1 across a certain (hyper-) plane
in Rn whose location is not a priori known. Let us perform m measurements of y, giving measured values
y (1) , . . . , y (m) ∈ {−1, +1} at locations x(1) , . . . , x(m) ∈ Rn . These measurements can then be used to train
the perceptron. Starting from random values w(1) and θ(1) of the weight vector and the threshold we can
iteratively improve those values by carrying out the operations
w(a+1) = w(a) + λ(y (a) − y) x(a) , θ(a+1) = θ(a) − λ(y (a) − y) . (2.103)
Here, y is the output value produced by the perceptron given the input vector x(a) and λ is a real
value, typically chosen in the interval [0, 1], called the learning rate of the perceptron. Evidently, if the
value y produced by the perceptron differs from the true, measured value y (a) , the weight vector and the
threshold of the perceptron are adjusted according to Eqs. (2.103). This training process continues until
all measurements are used up and the final values w = w(m+1) and θ = θ(m+1) have been obtained. In
this state the perceptron can then be used to “predict” the value of y for new input vectors x. Essentially,
the perceptron has “learned” about the location of the plane via the training process and is now able to
decide whether a given point is located above or below.
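The training rule just described is easy to implement. The following Python sketch (a minimal illustration of Eqs. (2.100)–(2.103), not part of the original notes; all function and variable names are our own) trains a perceptron on a set of measurements and can then be used to classify new points.

    import numpy as np

    def sign(z):
        # Activation function of Eq. (2.101): +1 for z >= 0, -1 otherwise
        return 1.0 if z >= 0 else -1.0

    def train_perceptron(X, y, learning_rate=0.5, seed=0):
        """Train on m measurements: X is an (m, n) array of inputs x^(a),
        y an (m,) array of measured values in {-1, +1}."""
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(X.shape[1])      # random initial weight vector w^(1)
        theta = rng.standard_normal()            # random initial threshold theta^(1)
        for x_a, y_a in zip(X, y):
            y_pred = sign(np.dot(w, x_a) - theta)            # Eq. (2.100)
            w = w + learning_rate * (y_a - y_pred) * x_a     # update rule, Eq. (2.103)
            theta = theta - learning_rate * (y_a - y_pred)
        return w, theta

    def predict(w, theta, x):
        # Decide whether x lies "above" (+1) or "below" (-1) the plane w . x = theta
        return sign(np.dot(w, x) - theta)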
In the context of artificial neural networks the perceptron corresponds to a single neuron. Proper arti-
ficial neural networks can be constructed by combining a number of perceptrons into a network, using the
output of certain perceptrons within the network as input for others. Such networks of perceptrons corre-
spond to collections of (hyper-) planes and are, for example, applied in the context of pattern recognition.
The details of this are beyond the present scope but are not too difficult to understand by generalising
the above discussion for a single perceptron.
3 Linear maps and matrices
We now return to the general story and consider arbitrary vector spaces. The next logical step is to study
maps between vector spaces which are “consistent” with the vector space structure - they are called linear
maps. As we will see, linear maps are closely related to matrices.
Definition 3.1. A map between two sets X and Y assigns to each x ∈ X a y ∈ Y which is written as
y = f (x) and referred to as the image of x under f . In symbols,
f : X → Y , x ↦ f (x) .
The set X is called the domain of the map f , Y is called the co-domain of f . The set
Im(f ) = {f (x)|x ∈ X} ⊆ Y
is called the image of f and consists of all elements of the co-domain which can be obtained as images
under f .
Figure 14: Visual representation of a map, with domain, co-domain, and image.
So a map assigns to each element of the domain an element of the co-domain. However, note that not all
elements of the co-domain necessarily need to be obtained in this way. This is precisely what is encoded
by the image, Im(f ), the set of elements in the co-domain which can be “reached” by f . If Im(f ) = Y
then all elements of the co-domain are obtained as images, otherwise some are not. Also note that, while
each element of the domain is assigned to a unique element of the co-domain, two different elements of
the domain may well have the same image. It is useful to formalize these observations by introducing the
following definitions.
Definition 3.2. Let f : X → Y be a map between two sets X and Y . The map f is called
(i) one-to-one (or injective) if every element of the co-domain is the image of at most one element of the
domain. Equivalently, f is one-to-one iff f (x) = f (x̃) =⇒ x = x̃ for all x, x̃ ∈ X.
(ii) onto (or surjective) if every element of the co-domain is the image of at least one element of the
domain. Equivalently, f is onto iff Im(f ) = Y .
(iii) bijective, if it is injective and surjective, that is, if every element of the co-domain is the image of
precisely one element of the domain.
Figure 15: The above map is not one-to-one (injective), since f (x) = f (x̃) but x ≠ x̃.
Figure 16: The above map is not onto (surjective), since Im(f ) ≠ Y .
Two maps can be applied one after the other provided the co-domain of the first map is the same as the
domain of the second. This process is called composition of maps and is formally defined as follows.
Definition 3.3. For two maps f : X → Y and g : Y → Z the composite map g ◦ f : X → Z is defined
by
(g ◦ f )(x) := g(f (x)) .
From this definition it is easy to show that map composition is associative, that is, h ◦ (g ◦ f ) = (h ◦ g) ◦ f ,
since
(h ◦ (g ◦ f ))(x) = h((g ◦ f )(x)) = h(g(f (x))) = (h ◦ g)(f (x)) = ((h ◦ g) ◦ f )(x) . (3.1)
A trivial but useful map is the identity map idX : X → X which maps every element in X onto itself,
that is, idX (x) = x for all x ∈ X. It is required to define the important concept of inverse map.
Definition 3.4. Given a map f : X → Y , a map g : Y → X is called an inverse of f if f ◦ g = idY and g ◦ f = idX .
The inverse map “undoes” the effect of the original map and in order to construct such a map we need
to assign to each y ∈ Y in the co-domain an x ∈ X in the domain such that y = f (x). If the map is not
surjective this is impossible for some y since they are not in the image of f . On the other hand, if the
map is not injective some y ∈ Y are images of more than one element in the domain so that the required
assignment is not unique. Finally, if the map is bijective every y ∈ Y is the image of precisely one x ∈ X,
so we can attempt to define the inverse by setting g(y) = x for this unique x. Altogether this suggests the
following
Theorem 3.1. The map f : X → Y has an inverse if and only if f is bijective. If the inverse exists it is
unique and denoted by f −1 : Y → X.
Proof. “⇒” We assume that f : X → Y has an inverse g : Y → X with f ◦ g = idY and g ◦ f = idX .
To show that f is injective start with f (x) = f (x̃) and apply g from the left. It follows immediately that
x = x̃. To show surjectivity we need to find, for a given y ∈ Y , an x ∈ X such that f (x) = y. We can
choose x = g(y) since f (x) = f ◦ g(y) = idY (y) = y. In conclusion f is bijective.
“⇐” We assume that f is bijective. Hence, for every y ∈ Y there is precisely one x ∈ X with f (x) = y.
We define the prospective inverse map by g(y) = x. Then f ◦ g(y) = f (x) = y and g ◦ f (x) = g(y) = x.
To show uniqueness we consider two maps g : Y → X and g̃ : Y → X with g ◦ f (x) = x = g̃ ◦ f (x). Setting
y = f (x) it follows that g(y) = g̃(y) and, since f is surjective this holds for all y ∈ Y . Hence, g = g̃.
If the maps f and g are both bijective, then it is easy to see that the composite map f ◦ g is also
bijective and, hence, has an inverse. This inverse of the composite map can be computed from the formula
(f ◦ g)−1 = g −1 ◦ f −1 . (3.2)
Note the change of order on the RHS of this equation. This relation follows from (f ◦ g)−1 ◦ (f ◦ g) = id
and (g −1 ◦ f −1 ) ◦ (f ◦ g) = id which implies that both (f ◦ g)−1 and g −1 ◦ f −1 provide an inverse for
f ◦ g. Uniqueness of the inverse function then leads to Eq. (3.2). Further we have
(f −1 )−1 = f (3.3)
from the uniqueness of the inverse and the fact that both f and (f −1 )−1 provide an inverse for f −1 .
Lemma 3.1. (Properties of linear maps) A linear map f : V → W between two vector spaces V and W
over F has the following properties:
(i) The zero vectors are mapped into each other, so f (0) = 0. Hence 0 ∈ Kerf .
(ii) The kernel of f is a sub vector space of V .
(iii) The image of f is a sub vector space of W .
(iv) f surjective ⇔ Im(f ) = W ⇔ dim Im(f ) = dim W
(v) f injective ⇔ Ker(f ) = {0} ⇔ dim Ker(f ) = 0
(vi) The scalar multiple αf , where α ∈ F , is linear.
(vii) For another linear map g : V → W , the sum f + g is linear.
(viii) For another linear map g : W → U , the composition g ◦ f is linear.
Proof. (i) f (0) = f (0 · 0) = 0 · f (0) = 0, using the linearity condition (L2).
(ii) We need to check the two conditions in Def. 1.2. If v1 , v2 ∈ Ker(f ) then, by definition of the kernel,
f (v1 ) = f (v2 ) = 0. It follows that f (v1 + v2 ) = f (v1 ) + f (v2 ) = 0 so that v1 + v2 ∈ Ker(f ). Similarly,
if v ∈ Ker(f ), so that f (v) = 0, it follows that f (αv) = αf (v) = 0, hence, αv ∈ Ker(f ).
(iii) This is very similar to the proof in (ii) and we leave it as an exercise.
(iv) The first part, f surjective ⇔ Im(f ) = W , is clear by the definition of surjective and the image.
Clearly, if Im(f ) = W , then both spaces have the same dimension. Conversely, from Lemma 1.2, two
vector spaces with the same dimension and one contained in the other (here Im(f ) ⊂ W ) must be identical.
(v) Suppose f is injective and consider a vector v ∈ Kerf . Then f (v) = f (0) = 0, which implies that v = 0
and, hence, Ker(f ) = {0}. Conversely, assume that Ker(f ) = {0}. Then, from linearity, f (v1 ) = f (v2 )
implies that f (v1 − v2 ) = 0 so that v1 − v2 ∈ Ker(f ) = {0}. Hence, v1 − v2 = 0 and f is injective.
(vi) Simply check the linearity conditions for αf , for example (αf )(v1 + v2 ) = αf (v1 + v2 ) = α(f (v1 ) +
f (v2 )) = αf (v1 ) + αf (v2 ) = (αf )(v1 ) + (αf )(v2 ).
(vii) Check the linearity conditions for f + g, similar to what has been done in (vi).
(viii) Simply check the linearity conditions for g ◦f , given that they are satisfied for f and g. g ◦f (v+w) =
g(f (v + w)) = g(f (v) + f (w)) = g(f (v)) + g(f (w)) = g ◦ f (v) + g ◦ f (w) and g ◦ f (αv) = g(f (αv)) =
g(αf (v)) = αg(f (v)) = αg ◦ f (v).
The above Lemma contains a number of extremely useful statements. First of all, both the image and
the kernel of a linear map are sub vector spaces - as one would hope for maps designed to be consistent
with the vector space structure. This means we can assign dimensions to both of them. In fact, the
dimension of the image is of particular importance and is given a special name.
Definition 3.6. The dimension of the image of a linear map f is called the rank of f , in symbols
rk(f ) := dim Im(f ).
It might be difficult to check if a map is surjective or injective, using the original definitions of these prop-
erties. For linear maps, Lemma (3.1) gives simple criteria for both properties in terms of the dimensions
of image and kernel.
We have seen earlier that spaces of functions, if appropriately restricted, form vector spaces. The
same is in fact true for linear maps, a fact which will become important later when we discuss dual vector
spaces and which we formulate in the following
Lemma 3.2. The set of all linear maps f : V → W forms a vector space, also denoted Hom(V, W )
(“homomorphisms from V to W ”). Vector addition and scalar multiplication are defined by (f + g)(v) =
f (v) + g(v) and (αf )(v) = αf (v).
Proof. From Lemma 3.1 (vi), (vii), the scalar multiple of a linear map and the sum of two linear maps are
again linear maps. From Def. 1.2 this shows that the set of linear maps does indeed form a (sub) vector
space.
To get a better intuitive feel for the action of linear maps we recall our interpretation of sub vector spaces
as lines, planes and their higher-dimensional analogues through 0. We should think of both the kernel
and the image of a linear map in this way, the former residing in the domain vector space, the latter in
the co-domain.
To be concrete, let us consider a linear map f : R3 → R2 and let us assume that dim Ker(f ) = 2, that
is, the kernel of f is a plane through the origin in R3 . This situation is schematically shown in Fig. 18.
Recall that all vectors in the kernel of f , that is all vectors in the corresponding plane (the blue plane in
Fig. 18) are mapped to the zero vector. What is more, consider two vectors v1 , v2 ∈ Ker(f ) + k which
both lie in a plane parallel to Ker(f ), shifted by a vector k (the pink plane in Fig. 18). Then we have
v1 − v2 ∈ Ker(f ) so that f (v1 − v2 ) = 0 and, hence, by linearity f (v1 ) = f (v2 ). Therefore, not only do
all vectors in the kernel get mapped to the zero vector, but all vectors in a plane parallel to the kernel get
mapped to the same (although non-zero) vector. Effectively, the action of the linear map “removes” the
two dimensions parallel to the kernel plane and only keeps the remaining dimension perpendicular to it.
Hence, the image of this linear map is one-dimensional, that is a line through the origin, as indicated in
Fig. 18. This structure is indeed general as expressed by the following theorem.
Figure 18: Geometric representation of a linear map f : R3 → R2 . If dim(V ) = 3 and dim Ker(f ) = 2 it
follows that dim Im(f ) = 1.
Theorem 3.2. For a linear map f : V → W between finite-dimensional vector spaces we have
dim Ker(f ) + dim Im(f ) = dim(V ) . (3.4)
Proof. For simplicity of notation, set k = dim Ker(f ) and n = dim(V ). Let v1 , · · · , vk be a basis of Ker(f )
which we complete to a basis v1 , . . . , vk , vk+1 , . . . , vn of V . We will show that f (vk+1 ), . . . , f (vn ) forms
a basis of Im(f ). To do this we need to check the two conditions in Definition 1.4.
(B1) First we need to show that Im(f ) is spanned by f (vk+1 ), . . . , f (vn ). We begin with an arbitrary
vector w ∈ Im(f ). This vector must be the image of a v ∈ V , so that w = f (v). We can expand v as a
linear combination
v = Σ_{i=1}^n αi vi
of the basis in V . Acting on this equation with f and using linearity we find
w = f (v) = f ( Σ_{i=1}^n αi vi ) = Σ_{i=1}^n αi f (vi ) = Σ_{i=k+1}^n αi f (vi ) ,
where the last step uses f (vi ) = 0 for i = 1, . . . , k, since these basis vectors lie in the kernel.
Hence, we have written w as a linear combination of the vectors f (vk+1 ), . . . , f (vn ) which, therefore, span
the image of f .
(B2) For the second step, we have to show that the vectors f (vk+1 ), . . . , f (vn ) are linearly independent.
As usual, we start with the equation
Σ_{i=k+1}^n αi f (vi ) = 0 ⇒ f ( Σ_{i=k+1}^n αi vi ) = 0 .
The second of these equations means that the vector Σ_{i=k+1}^n αi vi is in the kernel of f and, given that
v1 , . . . , vk form a basis of the kernel, there are coefficients α1 , . . . , αk such that
Σ_{i=k+1}^n αi vi = − Σ_{i=1}^k αi vi ⇒ Σ_{i=1}^n αi vi = 0 .
Since v1 , . . . , vn form a basis of V it follows that all αi = 0 and, hence, f (vk+1 ), . . . , f (vn ) are linearly
independent.
Altogether, it follows that f (vk+1 ), · · · , f (vn ) form a basis of Im(f ). Hence, by counting the number of
basis elements dim Im(f ) = n − k = dim(V ) − dim Ker(f ).
The dimensional formula (3.4) is a profound statement about linear maps and it will be immensely
helpful to understand the solution structure of systems of linear equations. For now we draw a few easy
conclusions:
Claim 3.1. For a linear map f : V → W we have:
(i) f bijective (has an inverse) implies that dim(V ) = dim(W ).
(ii) For dim(V ) = dim(W ) = n the following conditions are equivalent:
f is bijective (has an inverse) ⇐⇒ dim Ker(f ) = 0 ⇐⇒ rk(f ) = n
(iii) If f is invertible then the inverse f −1 : W → V is also a linear map.
Proof. (i) If f is bijective, it is injective and surjective, so from Lemma 3.1 (iv), (v), dim Ker(f ) = 0 and
dim Im(f ) = dim(W ). Then, from Eq. (3.4), dim(V ) = dim Ker(f ) + dim Im(f ) = 0 + dim(W ) = dim(W ).
(ii) This is an easy consequence of Theorem 3.2 and Lemma 3.1 (iv), (v) and we leave the proof as an
exercise.
(iii) We set w1 = f (v1 ), w2 = f (v2 ) and w = f (v) and check the linearity conditions in Def. 3.5 for f −1 .
Part (i) of the above claim says we can have invertible linear maps only between vector spaces of the
same dimension. If the dimensions are indeed the same, part (ii) tells us the map is invertible precisely if
its rank is maximal, that is, if the rank is equal to the dimension of the space.
with entries aij ∈ F . We denote the column vector which consists of the entries in the ith row of A by ai .
To map n-dimensional into m-dimensional column vectors, we need to provide m functions each of which
can, in general, depend on all n coordinates vi . To obtain a linear map it seems intuitive that we should
choose these functions linear in the coordinates vi and the most general such possibility can be written
down using the coefficients of the above matrix A. It is given by
f (v) = (a11 v1 + · · · + a1n vn , . . . , am1 v1 + · · · + amn vn )T = (a1 · v, . . . , am · v)T =: Av . (3.6)
By the last equality, we have defined the multiplication Av of an m × n matrix A with an n-dimensional
vector v. By definition, this multiplication is carried out by forming the dot product between the vector
and the rows of the matrix, as indicated above 2 . Evidently, this only makes sense if “sizes fit”, that is,
if the number of components of the vector equals the number of columns of the matrix. The outcome
is a column vector whose dimension equals the number of rows of the matrix. Using index notation,
multiplication of vectors by matrices and the above linear map can more concisely be written as
(f (v))i = Σ_{j=1}^n aij vj = aij vj , (3.7)
where a sum over j is implied by the Einstein summation convention in the last expression. Using this
notation it is quite straightforward to check that f satisfies the conditions for a linear map in Definition 3.5.
We conclude that Eq. (3.6) indeed defines a linear map and that, via the multiplication of matrices with
vectors, we can define such a map for each matrix A. In short, multiplication of n-dimensional column
vectors by a m × n matrix A corresponds to a linear map f : F n → F m . Conversely, we can ask if every
linear map between column vectors can be obtained from a matrix in this way. We will return to this
question shortly and see that the answer is “yes”.
Since this is the first time we encounter the multiplication of matrices and vectors an explicit numerical
2 Strictly, we have called this expression a dot product only for the real case, F = R. For the present purpose we mean by
dot product any expression a · v = Σ_i ai vi , where ai , vi ∈ F and F is an arbitrary field.
example might be helpful. Consider the 4 × 3 matrix and the three-dimensional vector
1 0 −1
2 1 1
3
A= −2 1 , v = −2 . (3.10)
1
3
0 0 4
resulting in the four-dimensional vector w. Stated another way, we can view this matrix as a linear map
R3 → R4 and we have just explicitly computed the image w of the vector v under this linear map.
(b) Coordinate maps
We have seen earlier that a vector can be uniquely described by its coordinates relative to a basis, see
Claim 1.2. We can now use the notion of linear maps to make this more precise. Consider an n-dimensional
vector space V over F with basis v1 , . . . , vn . By α, β ∈ F n we denote n-dimensional column vectors with
components αi and βi , respectively, and we define the coordinate map ϕ : F n → V by
ϕ(α) = Σ_{i=1}^n αi vi . (3.12)
This map assigns to a coordinate vector the corresponding vector, relative to the given basis. It is easy
to verify that it is linear.
ϕ(α + β) = Σ_{i=1}^n (αi + βi ) vi = Σ_{i=1}^n αi vi + Σ_{i=1}^n βi vi = ϕ(α) + ϕ(β) , (3.13)
ϕ(aα) = Σ_{i=1}^n (a αi ) vi = a Σ_{i=1}^n αi vi = a ϕ(α) . (3.14)
Since v1 , . . . , vn forms a basis it is clear that Im(ϕ) = V and, hence, from Claim 3.1 (ii), ϕ is bijective
and has an inverse ϕ−1 : V → F n . Clearly, the inverse map assigns to a vector v = Σ_{i=1}^n αi vi ∈ V its
coordinate vector α = (α1 , . . . , αn )T .
A linear and bijective map between two vector spaces is also referred to as a (vector space) isomorphism
and two vector spaces related by such a map are called isomorphic. What the above discussion shows is
that every n-dimensional vector space V over F is isomorphic to F n by means of a coordinate map ϕ.
However, it should be noted that this isomorphism is not unique as it depends on the choice of basis.
For an explicit example of a coordinate map consider Example 1.7, where we have considered R3 with
basis
v1 = (0, 1, 1)T , v2 = (0, 1, 2)T , v3 = (1, 1, −1)T . (3.16)
The coordinate map for this basis is given by
ϕ(α) = α1 v1 + α2 v2 + α3 v3 = (α3 , α1 + α2 + α3 , α1 + 2α2 − α3 )T . (3.17)
where pk (x) are fixed real-valued functions of x ∈ R. If we denote by V the vector space of infinitely many
times differentiable functions g : R → R then we can view this differential operator as a map L : V → V .
Since single differentiation and multiplication with fixed functions are each linear operations and L is a
composition of such operations it is clear that L is a linear map. We can also verify this explicitly:
L(g1 + g2 ) = Σ_{k=0}^n pk (x) d^k/dx^k (g1 + g2 ) = Σ_{k=0}^n pk (x) d^k g1 /dx^k + Σ_{k=0}^n pk (x) d^k g2 /dx^k = L(g1 ) + L(g2 ) , (3.18)
L(αg) = Σ_{k=0}^n pk (x) d^k/dx^k (αg) = α Σ_{k=0}^n pk (x) d^k g/dx^k = α L(g) . (3.19)
In particular, the solutions of the homogeneous linear differential equation
L(g) = 0 (3.20)
are given by the kernel, Ker(L). For any linear map the kernel is a (sub) vector space and, for the present
example, this means that any linear combination of solutions of the differential equation is also a solution.
As an explicit example consider the second order linear differential operator
L = d^2/dx^2 + 4 d/dx − 5 . (3.21)
The associated homogeneous differential equation L(g) = 0 has the two solutions, g1 (x) = exp(x) and
g2 (x) = exp(−5x), but, from linearity, any linear combination g(x) = αg1 (x) + βg2 (x) = α exp(x) +
β exp(−5x) is also a solution.
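This can be verified symbolically; the short sketch below (our own check using the sympy library, not part of the notes) confirms that both solutions, and any linear combination of them, are annihilated by L.

    import sympy as sp

    x, alpha, beta = sp.symbols('x alpha beta')

    def L(g):
        # The differential operator of Eq. (3.21): L = d^2/dx^2 + 4 d/dx - 5
        return sp.diff(g, x, 2) + 4*sp.diff(g, x) - 5*g

    g1 = sp.exp(x)
    g2 = sp.exp(-5*x)

    print(sp.simplify(L(g1)))                   # 0
    print(sp.simplify(L(g2)))                   # 0
    print(sp.simplify(L(alpha*g1 + beta*g2)))   # 0: any linear combination is a solution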
3.2.1 Basic matrix properties
We consider matrices of arbitrary size n × m (that is, with n rows and m columns) given by
A = ( a11 · · · a1m ; . . . ; an1 · · · anm ) . (3.22)
Here aij ∈ F , with i = 1, . . . , n and j = 1, . . . , m, are the (real or complex) entries. The matrix A is called
quadratic if n = m, that is if it has as many rows as columns. It is often useful to be able to refer to the
entries of a matrix by the same symbol and, by slight abuse of notation, we will therefore also denote the
entries by Aij = aij . We already know that the matrices of a given size form a vector space with vector
addition and scalar multiplication defined component by component, that is, (A + B)ij = Aij + Bij and
(αA)ij = αAij . We will frequently need to refer to the row and column vectors of a matrix for which we
introduce the following notation:
Ai = (Ai1 , . . . , Aim )T , Aj = (A1j , . . . , Anj )T . (3.23)
Hence, Ai is an m-dimensional column vector which contains the entries in the ith row of A and Aj is
an n-dimensional column vector which contains the entries in the j th column of A. We also recall that A
defines a linear map A : F m → F n via multiplication of matrices and vectors which can be written as
A : v ↦ Av = (A1 · v, . . . , An · v)T or (Av)i = Σ_{j=1}^m Aij vj . (3.24)
Its row and column vectors are given by the standard unit vectors, so Ai = ei and Aj = ej , and its
components (1n )ij = δij equal the Kronecker delta symbol introduced earlier. For its action on a vector
we have
(1v)i = δij vj = vi (3.26)
so, seen as a linear map, the unit matrix corresponds to the identity map.
More generally, a diagonal matrix is a matrix D with non-zero entries only along the diagonal, so
Dij = 0 for all i 6= j. It can be written as
D = ( d1 · · · 0 ; . . . ; 0 · · · dn ) =: diag(d1 , . . . , dn ) . (3.27)
The complex conjugate A∗ : F m → F n of a matrix A : F m → F n is simply the matrix whose entries are
the complex conjugates of the entries in A, so in component form, (A∗ )ij = (Aij )∗ . Of course, for matrices
with only real entries complex conjugation is a trivial operation which leaves the matrix unchanged.
The transpose of an n × m matrix A : F m → F n is a m × n matrix AT : F n → F m obtained by exchanging
the rows and columns of A. In component form, this means (AT )ij := Aji . A quadratic matrix A is said
to be symmetric if A = AT or, in index notation, if all entries satisfy Aij = Aji . It is called anti-symmetric
if A = −AT or, Aij = −Aji for all entries. Note that all diagonal entries Aii of an anti-symmetric matrix
vanish. We have
(A + B)T = AT + B T , (αA)T = αAT (3.28)
for n×m matrices A, B and scalars α. In particular, the sum of two symmetric matrices is again symmetric
as is the scalar multiple of a symmetric matrix (and similarly for anti-symmetric matrices). This means
that symmetric and anti-symmetric n × n matrices each form a sub vector space within the vector space
of all n × n matrices.
Note that, for non-quadratic matrices, the transpose changes the “shape” of the matrix. While A above
is a 3 × 2 matrix defining a linear map A : R2 → R3 , its transpose is a 2 × 3 matrix which defines a linear
map AT : R3 → R2 .
(b) The general form of 2 × 2 symmetric and anti-symmetric matrices is
Asymm = ( a b ; b d ) , Aanti−symm = ( 0 b ; −b 0 ) , (3.30)
with arbitrary numbers a, b, d ∈ F . The dimension of the vector space of symmetric 2 × 2 matrices over
F is three (as they depend on three independent numbers) while the dimension of the vector space of
anti-symmetric 2 × 2 matrices over F is one (as they depend on one parameter). In particular, note that
the diagonal of an anti-symmetric matrix vanishes. It is easy to write down a basis for these vector spaces
and also to generalize these statements to matrices of arbitrary size.
Finally, a combination of the previous two operations is the hermitian conjugate of an n × m matrix
A : F m → F n which is defined as a m × n matrix A† : F n → F m obtained by taking the complex
conjugate of the transpose of A, that is, A† := (AT )∗ . For matrices with only real entries, hermitian
conjugation is of course the same as transposition. A quadratic matrix A is said to be hermitian if the
matrix is invariant under hermitian conjugation, that is, if A = A† , and anti-hermitian if A = −A† . In
analogy with the properties of transposition we have
(A + B)† = A† + B † , (αA)† = α∗ A† . (3.31)
The first property means that the sum of two hermitian (anti-hermitian) matrices is again hermitian (anti-
hermitian). More care is required for scalar multiples. The scalar multiple of a hermitian (anti-hermitian)
matrix with a real scalar remains hermitian (anti-hermitian). However, for a complex scalar this is no
longer generally the case due to the complex conjugation of the scalar in the second equation (3.31). This
means the n × n hermitian (anti-hermitian) matrices form a sub vector space of the vector space of all
n × n matrices with complex entries, but only if the underlying field is taken to be the real numbers.
Example 3.6: Hermitian conjugate
(a) An explicit example for a 3 × 3 matrix with complex entries and its hermitian conjugate is
A = ( i 1 2−i ; 2 3 −3i ; 1−i 4 2+i ) , A† = ( −i 2 1+i ; 1 3 4 ; 2+i 3i 2−i ) . (3.32)
Note that, in addition to the transposition carried out by exchanging rows and columns, all entries are
complex conjugated.
(b) The hermitian conjugate of an arbitrary 2 × 2 matrix with complex entries is
A = ( a b ; c d ) , A† = ( a∗ c∗ ; b∗ d∗ ) . (3.33)
For A to be hermitian we need that a = a∗ , d = d∗ , so that the diagonal is real, and c = b∗ . Hence, the
most general hermitian 2 × 2 matrix has the form
Aherm = ( a b ; b∗ d ) , a, d ∈ R , b ∈ C . (3.34)
These matrices form a four-dimensional vector space (over R) since they depend on four real parameters.
For an anti-hermitian matrix, the corresponding conditions are a = −a∗ , d = −d∗ , so that the diagonal
must be purely imaginary, and c = −b∗ . The most general such matrices are
Aanti−herm = ( a b ; −b∗ d ) , a, d ∈ iR , b ∈ C , (3.35)
and they form a four-dimensional vector space over R.
and is hence given by a linear combination of the column vectors Aj with the coefficients equal to the
components of v. This observation tells us that
Im(A) = Span(A1 , · · · , Am ) , (3.37)
so the image of the matrix is spanned by its column vectors. For the rank of the matrix this implies that
rk(A) = dim Span(A1 , · · · , Am ) = maximal number of lin. indep. column vectors of A . (3.38)
For obvious reasons this is also sometimes called the column rank of the matrix A. This terminology
suggests we can also define the row rank of the matrix A as the maximal number of linearly independent
row vectors of A. Having two types of ranks available for a matrix seems awkward but fortunately we
have the following
Theorem 3.3. Row and column rank are equal for any matrix.
Proof. Suppose one row, say A1 , of a matrix A can be written as a linear combination of the others.
Then, by dropping A1 from A we arrive at a matrix with one less row, but its row rank unchanged from
that of A. The key observation is that the column rank also remains unchanged under this operation.
This can be seen as follows. Write
A1 = Σ_{j=2}^n αj Aj , α = (α2 , . . . , αn )T ,
with some coefficients α2 , . . . , αn which we have arranged into the vector α. Further, let us write the
column vectors of A as
Ai = ( ai ; bi ) ,
that is, we split off the entry in the first row, denoted by ai , from the entries in the remaining n − 1 rows,
which are contained in the vectors bi . It follows that ai = A1i = Σ_{j=2}^n αj Aji = α · bi , so that the
column vectors can also be written as
Ai = ( α · bi ; bi ) .
Hence, the entries in the first row are not relevant for the linear independence of the column vectors Ai -
merely using the vectors bi will lead to the same conclusions for linear independence. As a result we can
drop a linearly dependent row without changing the row and the column rank of the matrix. Clearly, an
argument similar to the above can be made if we drop a linearly dependent column vector - again both
the row and column rank remain unchanged.
In this way, we can continue dropping linearly dependent row and column vectors from A until we
arrive at a (generally smaller) matrix A0 which has linearly independent row and column vectors and the
same row and column ranks as A. On purely dimensional grounds, a matrix with all row vectors and all
column vectors linearly independent must be quadratic (For example, consider a 3 × 2 matrix. Its three
2-dimensional row vectors cannot be linearly independent.). Therefore, row and column rank are the same
for A0 and, hence, for A.
The above interpretation of the rank of a matrix as the maximal number of linearly independent row
or column vectors gives us a practical way to determine the rank of the matrix, using the methods to
check linear independence we have introduced in Section 1. Efficient, algorithmic methods for this will be
introduced in the next sub-section but for smaller matrices the rank can often be found “by inspection”,
as in the following example.
(b) Consider the 3 × 3 matrix
A = ( −1 4 3 ; 2 −3 −1 ; 3 2 5 ) , (3.40)
which defines a map A : R3 → R3 . It is clear that the first two columns of this matrix are linearly
independent (they are not multiples of each other) and that the third column is the sum of the first two.
Hence, the rank of this matrix is two. This means that the dimension of its image is two while, from
Eq. (3.4), the dimension of its kernel is one. To find the kernel of A explicitly we have to solve Av = 0.
With v = (x, y, z)T this leads to
−x + 4y + 3z = 0 , 2x − 3y − z = 0 , 3x + 2y + 5z = 0 , (3.41)
and these equations are solved precisely if x = y = −z. The image of A is, in general, spanned by the
column vectors, but since A3 = A1 + A2 it is already spanned by the first two columns. In conclusion,
we have
Ker(A) = Span( (1, 1, −1)T ) , Im(A) = Span( (−1, 2, 3)T , (4, −3, 2)T ) . (3.42)
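These statements are easy to verify numerically; the sketch below (our own illustration using NumPy) checks the rank of the matrix (3.40), that (1, 1, −1)T spans its kernel, and that the third column is the sum of the first two.

    import numpy as np

    A = np.array([[-1.0,  4.0,  3.0],
                  [ 2.0, -3.0, -1.0],
                  [ 3.0,  2.0,  5.0]])

    print(np.linalg.matrix_rank(A))                  # 2, so dim Im(A) = 2, dim Ker(A) = 1
    print(A @ np.array([1.0, 1.0, -1.0]))            # [0. 0. 0.], the kernel vector of (3.42)
    print(np.allclose(A[:, 2], A[:, 0] + A[:, 1]))   # True: third column = first + second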
where A is the matrix with entries aij , these being the coefficients which appear in the parametrization
of the images (3.43). We have therefore succeeded in expressing the action of our arbitrary linear map f
in terms of a matrix and we conclude that all linear maps between column vectors are given by matrices.
We summarize this in the following
Lemma 3.3. Every linear map f : F m → F n can be written in terms of an n × m matrix A, such that
f (v) = Av for all v ∈ F m . If f (ej ) = Σ_{i=1}^n aij ẽi for the standard unit vectors ei of F m and ẽi of F n ,
then the coefficients aij are precisely the entries of the matrix A.
Hence, we know from Lemma 3.3 that f can be described by a 3 × 3 matrix A which can be worked out
by studying the action of f on the standard unit vectors ei . Writing n = (n1 , n2 , n3 )T we find by explicit
computation
f (e1 ) = n × e1 = n3 e2 − n2 e3
f (e2 ) = n × e2 = −n3 e1 + n1 e3 (3.48)
f (e3 ) = n × e3 = n2 e1 − n1 e2 .
Lemma 3.3 states that the coefficients in front of the standard unit vectors on the right-hand sides of
these equations are the entries of the desired matrix A. More precisely, the coefficients which appear in
the expression for f (ej ) form the j th column of the matrix A. Hence, the desired matrix is
A = ( 0 −n3 n2 ; n3 0 −n1 ; −n2 n1 0 ) , (3.49)
and we have f (v) = n × v = Av for all vectors v ∈ R3 . The interesting conclusion is that vector products
with a fixed vector n can also be represented by multiplication with the anti-symmetric matrix (3.49).
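As a quick numerical check (our own illustration, not part of the notes), one can build the matrix (3.49) for a given n and compare its action with the cross product.

    import numpy as np

    def cross_matrix(n):
        # The anti-symmetric matrix A of Eq. (3.49), satisfying A v = n x v
        n1, n2, n3 = n
        return np.array([[0.0, -n3,  n2],
                         [ n3, 0.0, -n1],
                         [-n2,  n1, 0.0]])

    n = np.array([1.0, 2.0, 3.0])      # an arbitrary test vector n
    v = np.array([4.0, -1.0, 2.0])     # an arbitrary test vector v

    print(cross_matrix(n) @ v)         # [ 7. 10. -9.]
    print(np.cross(n, v))              # the same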
Everything is much more elegant in index notation where
so that the entries Cik of C are given by
Cik = Σ_{j=1}^n Bij Ajk = Bi · Ak . (3.52)
This equation represents the component version of what we refer to as matrix multiplication. We obtain
the entries of the new matrix C - which corresponds to the composition of B and A - by performing the
summation over the entries of B and A as indicated in the middle of (3.52) or, equivalently, by dotting
the columns of A into the rows of B, as indicated on the RHS of (3.52). In matrix notation this can also
be written as
C = BA := ( B1 · A1 · · · B1 · Am ; . . . ; Br · A1 · · · Br · Am ) . (3.53)
Note that the product of the r × n matrix B with the n × m matrix A results in the r × m matrix C = BA.
The important conclusion is that composition of matrices - in their role as linear maps - is accomplished
by matrix multiplication.
We should discuss some properties of matrix multiplication. First note that the matrix product BA
only makes sense if “sizes fit”, that is, if A has as many rows as B columns - otherwise the dot products
in (3.53) do not make sense. This consistency condition is of course a direct consequence of the role of
matrices as linear maps. The maps B and A can only be composed to BA if the co-domain of A has the
same dimension as the domain of B. Let us illustrate this with the following
of sizes 2 × 3 and 3 × 3, respectively. Dotting the column vectors of A into the rows of B we can compute
their product
BA = ( 1 0 −1 ; 2 3 −2 ) ( 0 1 1 ; 2 0 1 ; 1 −1 1 ) = ( −1 2 0 ; 4 4 3 ) , (3.55)
a 2 × 3 matrix. Note, however, that the product AB is ill-defined since B has two rows but A has 3
columns.
This is a direct consequence of the associativity of map composition (see the discussion around Eq. (3.1))
but can also be verified directly from the definition of matrix multiplication. This is most easily done in
index notation (using Eq. (3.52)) which gives
(A(BC))ij = Aik (BC)kj = Aik Bkl Clj = (AB)il Clj = ((AB)C)ij . (3.57)
However, matrix multiplication is not commutative, that is, typically, AB 6= BA. The “degree of non-
commutativity” of two matrices is often measured by the commutator defined as
[A, B] := AB − BA (3.58)
What is the relation between multiplication and transposition of matrices? The answer is
(AB)T = B T AT . (3.62)
Note the change of order on the RHS! A proof of this relation is most easily accomplished in index notation:
((AB)T )ij = (AB)ji = Ajk Bki = Bki Ajk = (B T )ik (AT )kj = (B T AT )ij . (3.63)
For the complex conjugation of a matrix product we have of course (AB)∗ = A∗ B ∗ , so together with
Eq. (3.62) this means for the hermitian conjugate that
(AB)† = B † A† . (3.64)
Finally, using matrix terminology, we can think of vectors in a slightly different way. A column vector v
with components v1 , . . . , vm can also be seen as an m × 1 matrix and the action Av of an n × m matrix A
on v as a matrix multiplication. The transpose, vT = (v1 , . . . , vm ), is an m-dimensional row vector and,
hence, the dot product of two m-dimensional (column) vectors v and w can also be written as
v · w = vT w , (3.65)
that is, as a matrix product between the 1 × m matrix vT and the m × 1 matrix w.
3.2.5 The inverse of a matrix
Recall from Claim 3.1, that a linear map f : V → W can only have an inverse if dim(V ) = dim(W ).
Hence, for a matrix A : F m → F n to have an inverse it must be quadratic, so that n = m. Focusing
on quadratic n × n matrices A we further know from Claim 3.1 that we have an inverse precisely when
rk(A) = n, that is, when the rank of A is maximal. In this case, the inverse of A, denoted A−1 , is the
unique linear map (and, therefore, also a matrix) satisfying
AA−1 = A−1 A = 1n . (3.66)
Note that this is just the general Definition (3.4) of an inverse map applied to a matrix, using the fact
that matrices correspond to linear maps and map composition corresponds to matrix multiplication. We
summarize the properties of the matrix inverse in the following
Lemma 3.4. (Properties of matrix inverse) A quadratic n × n matrix A : F n → F n is invertible if and
only if its rank is maximal, that is, iff rk(A) = n. If A, B are two invertible n × n matrices we have
(a) The inverse matrix, denoted A−1 , is the unique matrix satisfying AA−1 = A−1 A = 1n .
(b) (AB)−1 = B −1 A−1
(c) A−1 is invertible and (A−1 )−1 = A
(d) AT is invertible and (AT )−1 = (A−1 )T
Proof. (a) This has already been shown above.
(b), (c) These are direct consequences of the corresponding properties (3.2), (3.3) for general maps.
(d) Recall from Claim 3.1 that a matrix is invertible iff its rank is maximal. Since, from Theorem 3.3,
rk(A) = rk(AT ), we conclude that AT is indeed invertible which proves the first part of the claim. For the
second part, we transpose A−1 A = AA−1 = 1, using Eq. (3.62), to arrive at AT (A−1 )T = (A−1 )T AT = 1.
On the other hand, from the definition of the inverse for AT , we have AT (AT )−1 = (AT )−1 AT = 1.
Comparing the two equations shows that both (A−1 )T and (AT )−1 provide an inverse for AT and, hence,
from the uniqueness of the inverse, they must be equal.
Figure 19: An undirected graph with five vertices V1 , . . . , V5 .
for which the links have no direction, but our considerations can easily be generalized to directed graphs.
Graphs can be related to linear algebra via the adjacency matrix which is defined by
Mij = 1 if Vi and Vj are linked and Mij = 0 otherwise . (3.67)
For example, for the graph in Fig. 19 the adjacency matrix is given by
0 1 0 1 0
1 0 1 1 0
M = 0 1 0 0 1 . (3.68)
1 1 0 0 1
0 0 1 1 0
This matrix is symmetric due to the underlying graph being undirected. The following fact (which we
will not try to prove here) makes the adjacency matrix a useful object.
Fact: The number of possible walks from vertex Vi to vertex Vj over precisely n links in a graph is given
by (M n )ij , where M is the adjacency matrix of the graph.
To illustrate this, we compute low powers of the adjacency matrix M in Eq. (3.68):
M 2 = ( 2 1 1 1 1 ; 1 3 0 1 2 ; 1 0 2 2 0 ; 1 1 2 3 0 ; 1 2 0 0 2 ) , M 3 = ( 2 4 2 4 2 ; 4 2 5 6 1 ; 2 5 0 1 4 ; 4 6 1 2 5 ; 2 1 4 5 0 ) . (3.69)
For example, the number of possible walks from V1 to V3 over three links is given by (M 3 )13 = 2.
By inspecting Fig. 19 it can be seen that these two walks correspond to V1 → V4 → V5 → V3 and
V1 → V4 → V2 → V3 .
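This is easily reproduced on a computer; the sketch below (our own illustration using NumPy) builds the adjacency matrix (3.68) and counts walks by taking matrix powers (note the 0-based indexing).

    import numpy as np

    # Adjacency matrix (3.68) of the graph in Fig. 19
    M = np.array([[0, 1, 0, 1, 0],
                  [1, 0, 1, 1, 0],
                  [0, 1, 0, 0, 1],
                  [1, 1, 0, 0, 1],
                  [0, 0, 1, 1, 0]])

    M3 = np.linalg.matrix_power(M, 3)
    print(M3[0, 2])    # 2: the number of walks of length three from V1 to V3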
Note that in Tenc the same letters are now represented by different numbers. For example, the letter “a”, which
appears three times and corresponds to the three 1’s in T , is represented by three different numbers in
Tenc . Without knowledge of the encoding matrix A it is quite difficult to decipher Tenc , particularly for
large block sizes. The legitimate receiver of the text should be provided with the inverse A−1 of the
encoding matrix, for our example given by
A−1 = ( 1 2 1 ; 0 1 1 ; 2 3 2 ) ,
and can then recover the message by the simple matrix multiplication
T = A−1 Tenc .
Definition 3.7. The following manipulations of a matrix are called elementary row operations.
(R1) Exchange two rows.
(R2) Add a multiple of one row to another.
(R3) Multiply a row with a non-zero scalar.
Analogous definitions hold for elementary column operations.
For definiteness, we will focus on elementary row operations but most of our statements have analogues for
elementary column operations. As we will see, elementary row operations will allow us to devise methods
to compute the rank and the inverse of matrices and, later on, to formulate a general algorithm to solve
linear systems of equations.
A basic but important observation about elementary row operations (which is indeed the main motivation
for defining them) is that they do not change the span of the row vectors. Recall that the rank of a matrix
is given by the maximal number of linearly independent row (or column) vectors. Hence, the rank of a
matrix is also unchanged under elementary row operations. This suggests a possible strategy to compute
the rank of a matrix: By a succession of elementary row operations, we should bring the matrix into a
(simpler) form where the rank can easily be read off. Suppose a matrix has the form
A = ( a1j1 ∗ ; a2j2 ∗ ; . . . ; arjr ∗ ; 0 )
where the entries aiji are non-zero for all i = 1, . . . , r and sit at column positions j1 < j2 < · · · < jr , all
entries to the right of and above this “staircase” of leading entries are arbitrary (indicated by the ∗) and
all entries below and to the left of it are zero. This form of a matrix is called (upper)
echelon form. Clearly, the first r row vectors in this matrix are linearly independent and, hence, the rank
of a matrix in upper echelon form can be easily read off: rk(A) = r, the number of “steps”.
The important fact is that every matrix can be brought into upper echelon form by a sequence of elementary
row operations. This works as follows.
(1) Starting with row i = 1, find the leftmost column j which has at least one non-zero entry in rows i, . . . , n.
(2) If the (i, j) entry is zero exchange row i with one of the rows i + 1, . . . , n (the one which contains
the non-zero entry identified in step 1) so that the new (i, j) entry is non-zero.
(3) Subtract suitable multiples of row i from all rows i + 1, . . . , n such that all entries (i + 1, j), . . . , (n, j)
in column j and below row i vanish.
Continue with the next row until no more non-zero entries can be found in step 1.
This procedure of bringing a matrix into its upper echelon form using elementary row operations is
called Gaussian elimination (sometimes also referred to as row reduction). In summary, our procedure to
compute the rank of a matrix involves, first, to bring the matrix into upper echelon form using Gaussian
elimination and then to read off the rank from the number of steps in the upper echelon form. This is
probably best explained with an example.
We have indicated the row operation from one step to the next above the arrow, referring to the ith row
by Ri . The final matrix is in upper echelon form. There are two steps so that rk(A) = 2.
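The row reduction procedure is straightforward to implement. The sketch below (our own minimal version, which picks the largest available entry as pivot but otherwise follows the steps above) brings a matrix into upper echelon form and returns the number of steps, i.e. the rank.

    import numpy as np

    def rank_by_row_reduction(A, tol=1e-12):
        """Bring a copy of A into upper echelon form and count the steps."""
        A = A.astype(float).copy()
        n_rows, n_cols = A.shape
        i = 0                                       # current row
        for j in range(n_cols):                     # scan columns from the left (step 1)
            if i >= n_rows:
                break
            pivot = i + int(np.argmax(np.abs(A[i:, j])))
            if abs(A[pivot, j]) < tol:              # no non-zero entry below row i in this column
                continue
            A[[i, pivot]] = A[[pivot, i]]           # row exchange, operation (R1) (step 2)
            for k in range(i + 1, n_rows):          # clear the column below the pivot, (R2) (step 3)
                A[k] -= (A[k, j] / A[i, j]) * A[i]
            i += 1                                  # continue with the next row
        return i                                    # number of steps = rank

    A = np.array([[-1, 4, 3], [2, -3, -1], [3, 2, 5]])
    print(rank_by_row_reduction(A))                 # 2
    print(np.linalg.matrix_rank(A))                 # 2, for comparison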
A neat and very useful fact about elementary row operations is that they can be generated by multiplying
with certain, specific matrices from the left. In other words, to perform a row operation on a matrix A, we
can find a suitable matrix P such that the row operation is generated by A → P A. For example, consider
a simple 2 × 2 case where
A = ( a b ; c d ) , P = ( 1 λ ; 0 1 ) . (3.71)
Then
P A = ( 1 λ ; 0 1 ) ( a b ; c d ) = ( a + λc b + λd ; c d ) . (3.72)
Evidently, multiplication with the matrix P from the left has generated the elementary row operation
R1 → R1 + λR2 on the arbitrary 2 × 2 matrix A. This works in general and the appropriate matrices,
generating the three types of elementary row operations in Def. 3.7, are given by
1
..
1
.
..
1 .
th
0 1 i row 1
(I) .
(III) th
PRi ↔Rj = ..
P Ri →λRi =
λ i row
1 0 th
j row 1
.
1 ..
.. th
i col 1
.
1
1
..
.
th
1 ··· λ i row
(II) .. ..
PRi →Ri +λRj = . (3.73)
. .
1
..
.
j th col 1
This means we can bring a matrix A into upper echelon form by matrix multiplications P1 · · · Pk A where
the matrices P1 , . . . , Pk are suitably chosen from the above list. Note that all the above matrices are
invertible. This is clear, since we can always “undo” an elementary row operation by the inverse row
operation or, alternatively, it can be seen directly from the above matrices. The matrices P (II) and
P (III) are already in upper echelon form and clearly have maximal rank. The matrices P (I) can easily be
brought into upper echelon form by exchanging rows i and j. Then they turn into the unit matrix which
has maximal rank.
Recall that, for S to be a magic square, its rows, columns and both diagonals have to sum up to the same
total. These conditions can be cast into the seven linear equations
d+e+f =a+b+c −a − b − c + d + e + f = 0
g+h+i=a+b+c −a − b − c + g + h + i = 0
a+d+g =a+b+c −b − c + d + g = 0
b+e+h=a+b+c or −a − c + e + h = 0
c+f +i=a+b+c −a − b + f + i = 0
a+e+i=a+b+c −b − c + e + i = 0
c+e+g =a+b+c −a − b + e + g = 0
or, in short, Ax = 0. The magic squares are precisely the solutions to this equation which shows that
the magic square vector space is the kernel, Ker(A), of the matrix A. By Gaussian elimination and with
a bit of calculation, the matrix A can be brought into upper echelon form and the rank can be read off
as rk(A) = 6. Then, the dimension formula (3.4) leads to dim Ker(A) = 9 − rk(A) = 3 and, hence, the
dimension of the magic square vector space is indeed three. In summary, the three matrices M1 , M2 , M3
in Eq. (1.56) form a basis of the magic square vector space and every magic square is given as a (unique)
linear combination of these three matrices.
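The rank computation can be checked numerically. In the sketch below (our own illustration) the seven conditions above are written as the rows of a 7 × 9 matrix acting on x = (a, b, c, d, e, f, g, h, i)T .

    import numpy as np

    # Coefficient matrix of the seven magic-square conditions, columns ordered as (a, ..., i)
    A = np.array([[-1, -1, -1,  1,  1,  1,  0,  0,  0],
                  [-1, -1, -1,  0,  0,  0,  1,  1,  1],
                  [ 0, -1, -1,  1,  0,  0,  1,  0,  0],
                  [-1,  0, -1,  0,  1,  0,  0,  1,  0],
                  [-1, -1,  0,  0,  0,  1,  0,  0,  1],
                  [ 0, -1, -1,  0,  1,  0,  0,  0,  1],
                  [-1, -1,  0,  0,  1,  0,  1,  0,  0]])

    print(np.linalg.matrix_rank(A))    # 6, so dim Ker(A) = 9 - 6 = 3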
whose columns consist of all non-zero three-dimensional column vectors with entries in Z2 . Clearly, rk(H) = 3 (since H1 , H2 , H4 are linearly
independent) and, therefore, its kernel has dimension dim Ker(H) = 7 − 3 = 4. It is easy to see that this four-
dimensional kernel has a basis K1 = (1, 0, 0, 0, 0, 1, 1)T , K2 = (0, 1, 0, 0, 1, 0, 1)T , K3 = (0, 0, 1, 0, 1, 1, 0)T ,
K4 = (0, 0, 0, 1, 1, 1, 1)T which we can arrange into the rows of a 4 × 7 matrix
1 0 0 0 0 1 1
0 1 0 0 1 0 1
K= 0 0 1 0 1 1 0
(3.75)
0 0 0 1 1 1 1
The key idea is now to encode the information stored in β1 , . . . , β4 by forming the linear combination
of these numbers with the basis vectors of Ker(H) which we have just determined. In other words, we
encode the information in the seven-dimensional vector
v = Σ_{i=1}^4 βi Ki = β T K . (3.76)
Note that, given the structure of the matrix K, the first four bits in v coincide with the actual information
β1 , . . . , β4 . By construction, the vector v is an element of Ker(H).
Now suppose that the transmission of v has resulted in a vector w which can have an error in at most
one bit. How do we detect whether such an error has occurred? We note that the seven-dimensional
standard unit vectors e1 , . . . , e7 are not in the kernel of H. Further, if v is in the kernel then none of the
vectors w = v + ei is. This means the transmitted information w is free of (one-bit) errors if and only if
w ∈ Ker(H), a condition which can be easily tested.
Assuming w ∉ Ker(H) so that the information is faulty, how can the error be corrected? Assume that
bit number i has changed in w so that the correct original vector is v = w − ei . Since v ∈ Ker(H) it
follows that Hw = Hei = Hi , so that the product Hw coincides with one of the columns Hi of H. This
means, if Hw equals column i of H then we should flip bit number i in w to correct for the error.
Let us carry all this out for an explicit example. Suppose that the transmitted message is w =
(1, 1, 0, 0, 0, 1, 1)T and that it contains at most one error. Then we work out
Hw = (0, 1, 0)T = H2 . (3.77)
First, w is not in the kernel of H so an error has indeed occurred. Secondly, the vector Hw corresponds
to the second column vector of H so we should flip the second bit to correct for the error. This means,
v = (1, 0, 0, 0, 0, 1, 1)T and the original information (which is contained in the first four entries of v) is
β = (1, 0, 0, 0)T .
By paying the price of enhancing the transmitted information from four bits (in β) to seven bits (in v)
both a detection and correction of one-bit errors can be carried out with this method. Compare this with
the naive method of simply transmitting the information in β twice which corresponds to an enhancement
from four to eight bits. In this case, one-bit errors can of course be detected. However, without further
information they cannot be corrected since it is impossible to decide which of the two transmissions is the
correct one.
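The whole scheme is easily automated. In the sketch below (our own illustration) we take for H the parity check matrix whose i-th column is the binary representation of i — one ordering of the non-zero vectors over Z2 that is consistent with the kernel basis K above and reproduces Hw = H2 for the example just discussed.

    import numpy as np

    # Parity check matrix H: column i is the binary representation of i (i = 1, ..., 7)
    H = np.array([[0, 0, 0, 1, 1, 1, 1],
                  [0, 1, 1, 0, 0, 1, 1],
                  [1, 0, 1, 0, 1, 0, 1]])

    # Basis of Ker(H), arranged into the rows of K as in Eq. (3.75)
    K = np.array([[1, 0, 0, 0, 0, 1, 1],
                  [0, 1, 0, 0, 1, 0, 1],
                  [0, 0, 1, 0, 1, 1, 0],
                  [0, 0, 0, 1, 1, 1, 1]])

    beta = np.array([1, 0, 0, 0])     # the four information bits
    v = beta @ K % 2                  # encoded word, Eq. (3.76); its first four bits equal beta

    w = v.copy()
    w[1] ^= 1                         # simulate a one-bit transmission error in the second bit

    syndrome = H @ w % 2              # zero if and only if w is in Ker(H), i.e. error free
    if syndrome.any():
        # the syndrome equals the column of H belonging to the flipped bit; find and correct it
        error_bit = int(np.argmax((H.T == syndrome).all(axis=1)))
        w[error_bit] ^= 1

    print(w)        # the corrected code word, equal to v
    print(w[:4])    # the recovered information beta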
Our next task is to devise an algorithm to compute the inverse of a matrix, using elementary row opera-
tions. The basic observation is that every quadratic, invertible n × n matrix A can be converted into the
unit matrix 1n by a sequence of row operations. Schematically, this works as follows:
A −→ (echelon form) ( a′11 ∗ ; 0 a′22 ∗ ; . . . ; 0 · · · 0 a′nn ) −→ ((R1), (R2)) diag(a′11 , . . . , a′nn ) −→ ((R3)) 1n
In the first step, we bring A into upper echelon form, by the algorithm already discussed. At this point
we can read off the rank of the matrix. If rk(A) < n the inverse does not exist and we can stop. On the
other hand, if rk(A) = n then all diagonal entries a′ii in the upper echelon form must be non-zero (or else
we would not have n steps). This means, in a second step, we can make all entries above the diagonal
zero. We start with the last column and subtract suitable multiples of the last row from the others until
all entries in the last column except a′nn are zero. We proceed in a similar way, column by column from
the right to the left, using row operations of type (R1) and (R2). In this way we arrive at a diagonal
matrix, with diagonal entries a′ii ≠ 0 which, in the final step, can be converted into the unit matrix by
row operations of type (R3).
This means we can find a set of matrices P1 , . . . , Pk of the type (3.73), generating the appropriate elemen-
tary row operations, such that
1n = (P1 · · · Pk ) A , so that P1 · · · Pk = A−1 and, hence, A−1 = P1 · · · Pk 1n . (3.78)
These equations imply an explicit algorithm to compute the inverse of a square matrix. We convert A
into the unit matrix 1n using elementary row operations as described above, and then simply carry out
the same operations on 1n in parallel. When we are done the unit matrix will have been converted into
A−1 . Again, we illustrate this procedure by means of an example.
A = ( 1 0 −2 ; 0 3 −2 ; 1 −4 0 )                             13 = ( 1 0 0 ; 0 1 0 ; 0 0 1 )
R3 → R3 − R1 :        ( 1 0 −2 ; 0 3 −2 ; 0 −4 2 )           ( 1 0 0 ; 0 1 0 ; −1 0 1 )
R3 → R3 + (4/3) R2 :  ( 1 0 −2 ; 0 3 −2 ; 0 0 −2/3 ) ← rk(A) = 3   ( 1 0 0 ; 0 1 0 ; −1 4/3 1 )
R2 → R2 − 3 R3 :      ( 1 0 −2 ; 0 3 0 ; 0 0 −2/3 )          ( 1 0 0 ; 3 −3 −3 ; −1 4/3 1 )
R1 → R1 − 3 R3 :      ( 1 0 0 ; 0 3 0 ; 0 0 −2/3 )           ( 4 −4 −3 ; 3 −3 −3 ; −1 4/3 1 )
R2 → R2 /3 :          ( 1 0 0 ; 0 1 0 ; 0 0 −2/3 )           ( 4 −4 −3 ; 1 −1 −1 ; −1 4/3 1 )
R3 → −(3/2) R3 :      ( 1 0 0 ; 0 1 0 ; 0 0 1 ) = 13         ( 4 −4 −3 ; 1 −1 −1 ; 3/2 −2 −3/2 ) = A−1
As a final check we show that
AA−1 = ( 1 0 −2 ; 0 3 −2 ; 1 −4 0 ) ( 4 −4 −3 ; 1 −1 −1 ; 3/2 −2 −3/2 ) = ( 1 0 0 ; 0 1 0 ; 0 0 1 ) = 13 ✓
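In day-to-day work one would of course let the computer do this; the following sketch (our own check using NumPy) confirms the result of the row reduction above.

    import numpy as np

    A = np.array([[1.0,  0.0, -2.0],
                  [0.0,  3.0, -2.0],
                  [1.0, -4.0,  0.0]])

    A_inv = np.linalg.inv(A)
    print(A_inv)                                # rows (4, -4, -3), (1, -1, -1), (3/2, -2, -3/2)
    print(np.allclose(A @ A_inv, np.eye(3)))    # True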
Start with a linear map f : V → W between two vector spaces V and W over F with dimensions n
and m, respectively. We introduce a basis v1 , . . . , vn of V and a basis w1 , . . . , wm of W . Then, all
vectors v ∈ V and w ∈ W can be written as linear combinations v = Σ_{i=1}^n αi vi with coordinate
vector α = (α1 , . . . , αn )T and w = Σ_{j=1}^m βj wj with coordinate vector β = (β1 , . . . , βm )T , respectively.
Following Example 3.4 (b), we can introduce coordinate maps ϕ : F n → V and ψ : F m → W relative to
each basis which act as
ϕ(α) = Σ_{i=1}^n αi vi , ψ(β) = Σ_{j=1}^m βj wj . (3.79)
The images f (vi ) of the V basis vectors can always be written as a linear combination of the basis vectors
for W so we have
f (vj ) = Σ_{i=1}^m aij wi (3.80)
for some coefficients aij ∈ F . The situation so far can be summarized by the following diagram
f
V −→ W
ϕ↑ ↑ψ (3.81)
A=?
Fn −→ F m
Essentially, we are describing vectors by their coordinate vectors (relative to the chosen basis) and we
would like to find a matrix A which acts on these coordinate vectors “in the same way” as the original
linear map f on the associated vectors. In this way, we can describe the action of the linear map by a
matrix. How do we find this matrix A? Abstractly, it is given by
A = ψ −1 ◦ f ◦ ϕ , (3.82)
as can be seen by going from F n to F m in the diagram (3.81) using the “upper path”, that is, via V and
W . From Lemma 3.3 we know that we can work out the components of a matrix by letting it act on the
standard unit vectors.
Aej = ψ −1 ◦ f ◦ ϕ(ej ) = ψ −1 ◦ f (vj ) = ψ −1 ( Σ_{i=1}^m aij wi ) = Σ_{i=1}^m aij ψ −1 (wi ) = Σ_{i=1}^m aij ẽi , (3.83)
where we have used, in turn, Eq. (3.82), Eq. (3.79), Eq. (3.80), linearity of ψ −1 and again Eq. (3.79).
Comparing with Lemma 3.3 it follows that aij are the entries of the desired matrix A. Also, Eq. (3.82)
implies that Im(A) = ψ −1 (Im(f )). If we denote by χ := ψ −1 |Im(f ) the restriction of ψ −1 to Im(f ) we
have dim Ker(χ) = 0 since ψ −1 is an isomorphism and hence dim Ker(ψ −1 ) = 0. We conclude, using
Theorem 3.2, that
rk(A) = dim Im(A) = rk(χ) = dim Im(f ) − dim Ker(χ) = rk(f ) , (3.84)
which means that the linear map f and the matrix A representing f have the same rank, that is, rk(A) =
rk(f ).
While this discussion might have been somewhat abstract it has a simple and practically useful conclusion.
To find the matrix A which represents a linear map relative to a basis, compute the images f (vj ) of the
(domain) basis vectors and write them as linear combinations of the (co-domain) basis vectors wi , as
in Eq. (3.80). The coefficients in these linear combinations form the matrix A. More precisely, by careful
inspection of the indices in Eq. (3.80), it follows that the coefficients which appear in the image of the j th
basis vector form the j th column of the matrix A. We summarize these conclusions in
Lemma 3.5. Let f : V → W be a linear map, v1 , . . . , vn a basis of V and w1 , . . . , wm a basis of W . The
entries aij of the m × n matrix A which describes this linear map relative to this choice of basis can be
read off from the images of the basis vectors as
f (vj ) = Σ_{i=1}^m aij wi . (3.85)
For simplicity, we choose the same basis for the domain and the co-domain, namely v1 = w1 = (1, 2)T
and v2 = w2 = (−1, 1)T . Then, the images of the basis vectors under B, written as linear combinations
of the same basis, are
Bv1 = (1, −4)T = −1 v1 − 2 v2 , Bv2 = (−1, −2)T = −1 v1 + 0 v2 . (3.87)
Arranging the coefficients from Bv1 into the first column of a matrix and the coefficients from Bv2 into
the second column we find
B ′ = ( −1 −1 ; −2 0 ) . (3.88)
This is the matrix representing the linear map B relative to the basis {v1 , v2 }. It might be useful to be
explicit about what exactly this means. Write an arbitrary 2-dimensional vector as
(x, y)T = x′ v1 + y ′ v2 = (x′ − y ′ , 2x′ + y ′ )T (3.89)
so that a vector (x, y)T is described, relative to the basis {v1 , v2 }, by the coordinate vector (x′ , y ′ )T .
Consider the example (x, y) = (1, 8) with associated coordinate vector (x′ , y ′ ) = (3, 2). Then
B (1, 8)T = (1, −16)T
↕                         ↕          (3.90)
B ′ (3, 2)T = (−5, −6)T
The vectors connected by arrows relate exactly as in Eq. (3.89), that is the vectors in the lower equation
are the coordinate vectors of their counterparts in the upper equation. Eqs. (3.90) are basically a specific
instance of the general diagram (3.81). This illustrates exactly how the representing matrix acts “in the
same way” as the linear map: If the linear map relates two vectors then the representing matrix relates
their two associated coordinate vectors.
(b) It might be useful to discuss a linear map which, originally, is not defined by a matrix. To this end, we
consider the vector space V = {a2 x2 + a1 x + a0 |ai ∈ R} of all quadratic polynomials with real coefficients
and the linear map f = d/dx : V → V , that is, the linear map obtained by taking the first derivative.
As before we choose the same basis for domain and co-domain, namely the standard basis 1, x, x2 of
monomials. We would like to find the matrix A representing the first derivative, relative to this basis.
As before, we work out the images of the basis vectors and write them as linear combinations of the
same basis:
d/dx 1 = 0 · 1 + 0 · x + 0 · x2
d/dx x = 1 · 1 + 0 · x + 0 · x2
d/dx x2 = 0 · 1 + 2 · x + 0 · x2
Arranging the coefficients in each row into the columns of a matrix we arrive at
A = ( 0 1 0 ; 0 0 2 ; 0 0 0 ) . (3.91)
This matrix generates the first derivative of quadratic polynomials relative to the standard monomial basis.
As before, let us be very explicit about what this means. Consider the polynomial p(x) = 5x2 + 3x + 7
with coordinate vector (7, 3, 5)T and its first derivative p′ (x) = 10x + 3 with coordinate vector (3, 10, 0)T .
Then we have
A (7, 3, 5)T = (3, 10, 0)T , (3.92)
that is, A indeed maps the coordinate vector for p into the coordinate vector for p′ .
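This correspondence is easily checked on a computer; the sketch below (our own illustration) applies the matrix (3.91) to the coordinate vector of p(x) = 5x^2 + 3x + 7 and recovers the coefficients of its derivative.

    import numpy as np

    # Matrix (3.91) of d/dx on quadratic polynomials, relative to the monomial basis 1, x, x^2
    A = np.array([[0, 1, 0],
                  [0, 0, 2],
                  [0, 0, 0]])

    p = np.array([7, 3, 5])     # coordinates of p(x) = 5x^2 + 3x + 7
    print(A @ p)                # [ 3 10  0], the coordinates of p'(x) = 10x + 3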
The correspondence between operators acting on functions and matrices acting on vectors illustrated by
this example is at the heart of quantum mechanics. Historically, Schrödinger’s formulation of quantum
mechanics is in terms of (wave) functions and operators, while Heisenberg’s formulation is in terms of
matrices. The relation between those two formulations is precisely as in the above example.
3.5 Change of basis
We have seen that a linear map can be described, relative to a basis in the domain and co-domain, by a
matrix. It is clear from the previous discussion that, for a fixed linear map, this matrix depends on the
specific choice of basis. In other words, if we choose another basis the matrix describing the same linear
map will change. We would now like to work out how precisely the representing matrix transforms under
a change of basis.
To simplify the situation, we consider a linear map f : V → V from a vector space to itself and choose
the same basis on domain and co-domain. (The general situation of a linear map between two different
vector spaces is a straightforward generalization.) The two sets of basis vectors, coordinate maps and
representing matrices are then denoted by
basis of V coordinate map coordinate vector representing matrix
n
α = (α1 , . . . , αn )T A = ϕ−1 ◦ f ◦ ϕ
P
v1 , ..., vn ϕ(α) = αi vi
i=1 (3.93)
n
v10 , ..., vn0 ϕ0 (α0 ) = αi0 vi0 α0 = (α10 , . . . , αn0 )T A0 = ϕ 0 −1 ◦ f ◦ ϕ0
P
i=1
We would like to find the relationship between A and A0 , that is, between the representing matrices for
f relative to the unprimed and the primed basis. From Eq. (3.82) we know that the two matrices can be
written as A = ϕ−1 ◦ f ◦ ϕ and A0 = ϕ0−1 ◦ f ◦ ϕ0 , so that
A0 = ϕ0−1 ◦ f ◦ ϕ0 = ϕ0−1 ◦ ϕ ◦ ϕ−1 ◦ f ◦ ϕ ◦ ϕ−1 ◦ ϕ0
= ϕ0−1 ◦ ϕ ◦ ϕ−1 ◦ f ◦ ϕ ◦ ϕ−1 ◦ ϕ0 = P AP −1 . (3.94)
| {z } | {z } | {z }
=: P =A = P −1
Note that all we have done is to insert two identity maps, ϕ ◦ ϕ−1 , in the second step and then combined
maps differently in the third step. What is the interpretation of P = ϕ0−1 ◦ ϕ? For a given vector v ∈ V
and its coordinate vectors α = ϕ−1 (v) and α0 = ϕ0 −1 (v) relative to the unprimed and primed basis we
have α0 = ϕ0 −1 (v) = ϕ0 −1 ◦ ϕ(α) = P α, so in summary
α0 = P α . (3.95)
Hence, P converts unprimed coordinate vectors α into the corresponding primed coordinate vector α0
and, as a linear map between column vector, it is a matrix. In short, P describes the change of basis
under consideration. The corresponding transformation of the representing matrix under this basis change
is then
A0 = P AP −1 . (3.96)
This is one of the key equations of linear algebra. For example, we can ask if we can choose a basis
for which the representing matrix is particularly simple. Eq. (3.96) is the starting point for answering
this question to which we will return later. Note that Eq. (3.96) makes intuitive sense. Acting with
the equation on a primed coordinate vector α0 , the first we obtain on the RHS is P −1 α0 . This is the
corresponding unprimed coordinate vector on which the matrix A can sensibly act, thereby converting it
into another unprimed coordinate vector. The final action of P converts this back into a primed coordinate
vector. Altogether, this is the action of the matrix A0 on α0 as required by the equation.
Another way to think about the matrix P is byPrelating the primed and the unprimed basis vectors.
In general, from Lemma 3.3, we can write P ej = i Pij ei . Multiplying this equation with ϕ0 from the
left and using vj = ϕ(ej ), vi0 = ϕ0 (ei ) we find
X X
vj = Pij vi0 ⇐⇒ vj0 = (P −1 )ij vi . (3.97)
i i
71
Hence, the entries of the matrix P can be calculated by expanding the unprimed basis vectors in terms of
the primed basis.
and arranging the coefficients on the right-hand sides into the column of a matrix gives
0 0 1
A = . (3.98)
1 0
Alternatively, we should be able to determine A0 from Eq. (3.96). To work out the relation between the
primed and un-primed coordinate vectors α0 = (α10 , α20 )T and α = (α1 , α2 )T we write
α10 + α20
α1 ! 0 0 0 0 1
α1 v1 + α2 v2 = = α1 v1 + α2 v2 = √ 0 0 .
α2 2 −α1 + α2
Comparing this with the general relation (3.95) between the coordinate vectors we can read off the
coordinate transformation P −1 as
−1 1 1 1 1 1 −1
P =√ ⇒ P =√ .
2 −1 1 2 1 1
72
4 Systems of linear equations
We will now apply our general results and methods to the problem of solving linear equations. This will
lead to an understanding of the structure of the solutions and to explicit solution methods.
f (x) = b (4.1)
where b ∈ W is a fixed vector. For b 6= 0 this is called an inhomogenous linear equation and
f (x) = 0 (4.2)
is the associated homogenous equation. Its general solution is Ker(f ). The solutions of the inhomogeneous
and associated homogeneous equations are related in an interesting way.
Lemma 4.1. If x0 ∈ V solves the inhomogenous equation, that is f (x0 ) = b, then the affine space
x0 + Ker(f )
In short, the Lemma says that the general solution of the inhomogeneous equation is obtained by the
sum of a special solution to the inhomogeneous equation and all solutions to the homogeneous equation.
Recall that Ker(f ) is a sub vector space, so a line, a plane, etc. through 0 with dimension dim(Ker(f )) =
dim(V ) − rf(f ) (see Eq. (3.4)). This shows that the geometry of the solution is schematically as indicated
in Fig. 20. Lemma (4.1) is helpful in order to find the general solution to inhomogenous, linear differential
equations as in the following
d2 y dy
p(x) + q(x) + r(x)y = s(x) ,
dx2 dx
where p, q, r and s are fixed functions. The relevant vector space is the space of (infinitely many times)
d2 d
differentiable functions, the linear map f corresponds to the linear differential operator p(x) dx 2 + q(x) dx +
r(x) and the inhomogeneity b given by s(x). From Lemma 4.1, the general solution to this equation can be
obtained by finding a special solution, y0 , and then adding to it all solutions of the associated homogeneous
equation
d2 y dy
p(x) 2 + q(x) + r(x)y = 0 .
dx dx
To be specific consider the differential equation
d2 y
+y =x.
dx2
73
x0+Ker(f)=solution to
inhomogeneous system
x0
Ker(f)=solution to
homogeneous system
An obvious special solution is the function y0 (x) = x. The general solution of the associated homogeneous
equation
d2 y
+y =0
dx2
is a sin(x) + b cos(x) for arbitrary real constants a, b. Hence, the general solution to the inhomogenous
equation is
y(x) = x + a sin(x) + b cos(x) .
Our main interest is of course in systems of linear equations, that is, the case where the linear map
is an m × n matrix A : F n → F m with entries aij . For x = (x1 , . . . , xn )T ∈ F n and a fixed vector
b = (b1 , . . . , bm )T ∈ F m the system of linear equations can be written as
The solution space of the homogenous system is Ker(A), a (sub) vector space whose dimensions is given
by the dimension formula dim Ker(A) = n − rk(A) (see Eq. (3.4)). If the inhomogenous system has a
solution, x0 , then its general solution is x0 + Ker(A) and such a “special” solution x0 exists if and only if
74
b ∈ Im(A). If rk(A) = m then Im(A) = F m and a solution exists for any choice of b. On the other hand,
if rk(A) < m, there is no solution for “generic” choices of b. For example, if m = 3 and rk(A) = 2 then
the image of A is a plane in a three-dimensional space and we need to choose b to lie in this plane for a
solution to exist. Clearly this corresponds to a very special choice of b and generic vectors b will not lie
in this plane. To summarize the general structure of the solution to Ax = b, where A is an m × n matrix,
we should, therefore distinguish two cases.
(1) rk(A) = m
In this case there exists a solution, x0 , for any choice of b and the general solution is given by
x0 + Ker(A) (4.5)
The number of free parameters in this solution equals dim Ker(A) = n − rk(A) = n − m.
For a quadratic n × n matrix A we can be slightly more specific and the above cases are as follows.
(1) rk(A) = n
A solution exists for any choice of b and there are no free parameters since dim Ker(A) = n − n = 0.
Hence, the solution is unique. Indeed, in this case, the matrix A is invertible (see Lemma 3.4) and
the unique solution is given by x = A−1 b.
The main message of this discussion is that, given the size of the matrix A and its rank, we are able to draw
a number of conclusions about the qualitative structure of the solution, without any explicit calculation.
We will see below how this can be applied to explicit examples.
We can also think about the solutions to a system of linear equations in a geometrical way. With the row
vectors Ai of the matrix A, the linear system (4.3) can be re-written as m equations for (hyper) planes
(that is n − 1-dimensional planes) in n dimensions:
Ai · x = bi , i = 1, . . . , m . (4.6)
Geometrically, we should then think of the solutions to the linear system as the common intersection
of these m (hyper) planes. For example, if we consider a 3 × 3 matrix we should consider the common
intersection of three planes in three dimensions. Clearly, depending on the case, these planes can intersect
in a point, a line, a plane or not intersect at all. In other words, we may have no solution or the solution
may have 0, 1 or 2 free parameters. This corresponds precisely to the cases discussed above.
75
variables. To be specific, we consider the following system with three variables x = (x, y, z)T and three
equations
E1 : 2x + 3y − z = −1
E2 : −x − 2y + z = 3 (4.7)
E3 : ax + y − 2z = b .
To make matters more interesting, we have introduced two parameters a, b ∈ R. We would like to find the
solution to this system for arbitrary real values of these parameters. We can also write the above system
in matrix form, Ax = b, with
2 3 −1 −1
A = −1 −2 1 b= 3 . (4.8)
a 1 −2 b
Before we embark on the explicit calculation, let us apply the results of our previous general discussion
and predict the qualitative structure of the solution. The crucial piece of information required for this
discussion is the rank of the matrix A. Of course, this can be determined from the general methods based
on row reduction which we have introduced in Section 3.3. But, as explained before, for small matrices
the rank can often be inferred “by inspection”. For the matrix A in (4.8) it is clear that the second and
third column vectors, A2 and A3 , are linearly independent. Hence, its ranks is at least two. The first
column vector, A1 , depends on the parameter a so we have to be more careful. For generic a values A1
does not lie in the plane spanned by A2 , A3 , so the generic rank of A is three. In this case, from our
general results, there is a unique solution to the linear system for any value of the other parameter b. For
a specific a value A1 will be in the plane spanned by A2 , A3 and the rank is reduced to two. Then, the
image of A is two-dimensional, that is a plane. For generic values of b the vector b will not lie in this
plane so there is no solution. However, for a specific b value, when b does lie in this plane, there is a
solution with dim Ker(A) = 3 − rk(A) = 1 parameter, that is, a solution line. So, in summary we expect
the following qualitative structure for the solution to the system (4.7).
1) For generic values of a the rank of A is three and there is a unique solution for all values of b.
2a) For a specific value of a (when rk(A) = 2) and for a specific value of b there is a line of solutions.
2b) For the above specific value of a and generic b there is no solution.
Let us now confirm this expectation by an explicit calculation. We begin by adding appropriate multiples
of Eqs. (4.7), namely
E1 + E2 : x+y =2 (4.9)
E3 + 2E2 : (a − 2)x − 3y = b + 6 . (4.10)
(a + 1)x = b + 12 . (4.11)
76
2a) a = −1 and b = −12: In this case, Eq. (4.11) becomes trivial and we are left with only two inde-
pendent equations. Solving Eq. (4.9) and the first Eq. (4.7) for x and z in terms of y we find
So let us start with an arbitrary linear system with m equations for n variables, so a system of the form
Ax = b with an m×n matrix A, inhomogeneity b ∈ F m and variables x ∈ F n . We can multiply the linear
system with one of the m × m matrices P from Eq. (3.73), generating the elementary row operations, to
get the linear system P Ax = P b. This new system has the same solutions as the original one since P is
invertible. This means we do not change the solutions to the linear system if we carry out elementary row
operations simultaneously on the matrix A and the inhomogeneity b. This suggests we should encode the
linear system by the augmented matrix defined by
A0 = (A|b) , (4.14)
an m × (n + 1) matrix which consists of A plus one additional column formed by the vector b. We can now
reformulate our previous observation by stating that elementary row operations applied to the augmented
matrix do not change the solutions of the associated linear system. So our solution strategy will be to
simplify the augmented matrix by successive elementary row operations until the solution can be easily
“read off”. Before we formulate this explicitly, we note a useful criterion which helps us to decide whether
or not b ∈ Im(A), that is, whether or not the linear system has solutions.
Lemma 4.2. b ∈ Im(A) ⇐⇒ rk(A) = rk(A0 )
Proof. “ ⇒ ”: If b ∈ Im(A) it is a linear combination of the column vectors of A and adding it to the
matrix does not increase the rank.
“ ⇐ ”: If rk(A) = rk(A0 ) the rank does not increase when b is added to the matrix. Therefore, b ∈
Span(A1 , . . . , An ) = Im(A).
Let us now describe the general algorithm.
1. Apply row operations, as described in Section (3.3), to the augmented matrix A0 until the matrix A
within A0 is in upper echelon form. Then, the resulting matrix has the form
0
··· a1j1 ∗ b1
. ..
a2j2 .. .
.. ..
. .
0
A → ..
. arjr ··· b0r
0
0 br+1
..
.
b0n
77
where aiji 6= 0 for i = 1, . . . r so that A has rank r. In this form is it easy to apply the criterion,
Lemma 4.2. If b0i 6= 0 for any i > r then rk(A0 ) > rk(A) and the linear system has no solutions. In
this case we can stop. On the other hand, if b0i = 0 for all i > r which we assume from hereon, then
rk(A0 ) = rk(A) and the system has a solution.
2. As explained we assume that b0i = 0 for all i > r. For ease of notation we also permute the columns
of A (this corresponds to a permutation of the variables that we will have to keep track of) so that
the columns with the non-zero entries aiji become the first r of the matrix. The result is
0
a1j1 b1
..
∗ ∗
a2j2 .
.. ..
. .
A0 → 0
arjr b0r
0
..
0 .
0
3. By further row operations we can convert the r × r matrix in the upper left corner of the previous
matrix into a unit matrix 1r . Schematically, the result is
1r B c
0
Afin = (4.15)
0 0 0
4. Recall that r = rk(A) is the rank and n − r = dim Ker(A) is the number of free parameters of the
solution. For this reason it makes sense to split our variables as
ξ
x= (4.16)
t
into an r-dimensional vector ξ and an (n − r)-dimensional vector t. Note that this split is adapted
to the form of the matrix A0fin so that the associated linear system takes the simple form
ξ + Bt = c . (4.17)
The point is that this system can be easily solved for ξ in terms of t. This leads to the general
solution
c − Bt
x= , (4.18)
t
which depends on n − r free parameters t, as expected.
Let us see how this works for an explicit example.
Example 4.2: Solving linear systems with row reduction of the augmented matrix
Consider the following system of linear equations and its augmented matrix
x + y − 2z = 1 1 1 −2 1
2x − y + 3z = 0 A0 = 2 −1 3 0 , (4.19)
−x − 4y + 9z = b −1 −4 9 b
where b ∈ R is an arbitrary real parameter. We proceed in the four steps outlined above.
78
1. First we bring A within A0 into upper echelon form which results in
1 1 −2 1
A0 → 0 −3 7 −2
0 0 0 b+3
For b 6= −3 we have rk(A0 ) = 3 > 2 = rk(A) so there are no solutions. So we assume from hereon
that b = −3.
2. Setting b = −3 we have
1 1 −2 1
0 −3 7 −2 .
0 0 0 0
In this case, we do not have to permute columns since the (two) steps of the upper echelon form
already arise in the first two columns.
3. By further elementary row operations we convert the 2 × 2 matrix in the upper left corner into a
unit matrix.
1 1
1 0 3 3
A0fin = 0 1 − 73 23
0 0 0 0
4. We have r = rk(A) = 2 and dim Ker(A) = 3 − rk(A) = 1 so we expect a solution with one free
variable t (a line). Accordingly, we split the variables as
x
x= y , (4.20)
t
where ξ = (x, y)T in our general notation. Writing the linear system for A0fin in those variables
results in
1 1
x+ t = (4.21)
3 3
7 2
y− t = (4.22)
3 3
This can be easily solved for x, y in terms of t which was really the point of the exercise. The result
is x = 31 − 13 t and y = 23 + 37 t and, inserting into Eq. (4.20), this results in the vector form
1
− 13
3
2 7
x= 3
+ t
3
0 1
79
from linear algebra. To do this, first assume that the circuit contains n loops and assign (“mesh”) currents
Ii , where i = 1, . . . , n, to each loop. Then, applying Ohm’s law and Kirchhoff’s voltage low (“The voltages
along a closed loop must sum to zero.”) to each loop leads to the linear system
R11 I1 + · · · + R1n In = V1
.. .. .. (4.23)
. . .
Rn1 Ii + · · · + Rnn In = Vn ,
where Rij describe the various resistors and Vi correspond to the voltages of the batteries. If we introduce
the n × n matrix R with entries Rij , the current vector I = (I1 , . . . , In )T and the vector V = (V1 , . . . , Vn )T
for the battery voltages this system can, of course, also be written as
RI = V . (4.24)
This is an n × n linear system, where we think of the resistors and battery voltages as given, while the
currents I1 , . . . , In are a priori unknown and can be determined by solving the system. Of course any of
the methods previously discussed can be used to solve this linear system and determine the currents Ii .
For example, consider the circuit in Fig. 21. To its three loops we assign the currents I1 , I2 , I3 as indicated
R1
R2 I2 R4
V I1
R6
R3
I3 R5
in the figure. Kirchhoff’s voltage law applied to the three loops then leads to
With the current and voltage vectors I = (I1 , I2 , I3 )T and V = (V, 0, 0)T the matrix R in Eq. (4.24) is
then given by
R1 + R2 + R3 −R2 −R3
R= −R2 R2 + R4 + R6 −R6 . (4.26)
−R3 −R6 R3 + R5 + R6
80
For example, for resistances (R1 , . . . , R6 ) = (3, 10, 4, 2, 5, 1) (in units of Ohm) we have the resistance
matrix
17 −10 −4
R = −10 13 −1 . (4.27)
−4 −1 10
For a battery voltage V = 12 (in units of volt) we can write down the augmented matrix
17 −10 −4 12
R0 = −10 13 −1 0 , (4.28)
−4 −1 10 0
and solve the linear system by row reduction. This leads to the solution
1548
1
I= 1248 (4.29)
905
744
81
5 Determinants
Determinants are multi-linear objects and are a useful tool in linear algebra. In Section 2 we have
introduced the three-dimensional determinant as the triple product of three vectors. Here we will study
the generalization to arbitrary dimensions and verify that the three-dimensional case coincides with our
previous definition. As with the other general concepts, we first define the determinant by its properties
before we derive its explicit form and study a few applications. In our discussion in Section 2 we have
observed that the three-dimensional determinant is linear in each of its vector arguments (see Eq. (2.44)),
it changes sign when two vector arguments are swapped (see Eq. (2.46)) and the determinant of the three
standard units vector is one (see Eq. (2.48)). We will now use these properties to define the determinant
in arbitrary dimensions.
An easy but important conclusion from these properties is that a determinant with two same arguments
must vanish. Indeed, from the anti-symmetry property (D2) it follows that det(· · · , a, · · · , a, · · · ) =
− det(· · · , a, · · · , a, · · · ), which means that
det(· · · , a, · · · , a, · · · ) = 0 . (5.1)
We know that an object with these properties exists for n = 3 but not yet in other dimensions. To
address this problem we first need to understand a few basic facts about permutations. Here, we will
just present a brief account of the relevant facts. For the formal-minded, Appendix B contains a more
complete treatment which includes the relevant proofs.
Permutations
You probably have an intuitive understanding of a permutation as an operation which changes the order
of a certain set of n objects. Here, we take this set to be the numbers {1, . . . , n}. Mathematically, a
permutation is defined as a bijective map from this set to itself. So the set of all permutations of n objects
is given by
Sn := {σ : {1, · · · , n} → {1, · · · , n} | σ is bijective} , (5.2)
and this set has n! elements. The basic idea is that, under a permutation σ ∈ Sn , a number i ∈ {1, . . . , n}
is permuted to its image σ(i). A useful notation for a permutation mapping 1 → σ(1), . . . , n → σ(n) is 3
1 ··· n
σ= . (5.3)
σ(1) . . . σ(n)
3
Note, despite the similar notation, this is not a matrix in the sense introduced earlier.
82
For example, for n = 3, a permutation which swaps 2 and 3 is written as
1 2 3
τ1 = . (5.4)
1 3 2
Carrying out two permutations, one after the other, simply corresponds to composition of maps in this
formalism. For example, consider a second permutation
1 2 3
τ2 = (5.5)
2 1 3
of three objects which swaps the numbers 1 and 2. Permuting first with τ2 and then with τ1 corresponds
to the permutation σ := τ1 ◦ τ2 which is given by
1 2 3 1 2 3 1 2 3
σ = τ1 ◦ τ2 = ◦ = , (5.6)
1 3 2 2 1 3 3 1 2
a cyclic permutation of the three numbers. A further advantage of describing permutations as bijective
maps is that the inverse of a permutation σ, that is, the permutation which “undoes” the effect of the
original permutation, is simple described by the inverse map σ −1 .
The specific permutations which only swap two numbers and leave all other numbers unchanged are
called transpositions. For example, the permutations (5.4) and (5.5) are transpositions. A basic and
important fact about permutations, proved in Appendix B, is that every permutation can be written as a
composition of transpositions, so any σ ∈ Sn can be written as σ = τ1 ◦ · · · ◦ τk , where τ1 , . . . , τk ∈ Sn are
transpositions. Eq. (5.6) is an illustration of this general fact.
Writing permutations as a composition of transpositions is not unique, that is, two different such
compositions can lead to the same permutation. Not even the number of transpositions required to
generate a given permutation is fixed. For example, the permutation σ in Eq. (5.6) can also be written
as σ = τ1 ◦ τ2 ◦ τ1 ◦ τ1 , that is, as a composition of four transpositions. However, it can be shown
(see Appendix B) that the number of transpositions required is always either even or odd for a given
permutation. For a permutation σ = τ1 ◦ · · · ◦ τk , written as a composition of k transpositions, it,
therefore, makes sense to define the sign of the permutation as
(
+1 : “even” permutation
sgn(σ) := (−1)k = . (5.7)
−1 : “odd” permutation
From this definition, transpositions τ are odd permutations, so sgn(τ ) = −1. For the permutation σ in
Eq. (5.6) we have sgn(σ) = 1 since it can be built from two transpositions. It is, therefore, even as we
would expect from a cyclic permutation of three objects. In essence, the definition (5.7) provides the
correct mathematical way to distinguish even and odd permutations.
When two permutations, each written in terms of transpositions, are composed with each other the
number of transpositions simply adds up. From the definition (5.7) this means that
In other words, a permutation and its inverse have the same sign.
83
We are now ready to return to determinants and derive an explicit formula. We start with an n × n matrix
A with entries aij whose column vectors we write as linear combinations of the standard unit vectors:
a1i
Ai = ... =
X
aji ej . (5.10)
ani j
By using the properties of the determinant from Def. (5.1) we can then attempt to work out the determi-
nant of A. We find
n n
(5.10) X X (D1) X
det(A) = det(A1 , · · · , An ) = det aj1 1 ej1 , · · · , ajn n ejn = aj1 1 · · · ajn n det(ej1 , · · · , ejn )
j1 =1 jn =1 j1 ,··· ,jn
(5.1),ja =σ(a) X (D2) X
= aσ(1)1 · · · aσ(n)n det(eσ(1) , · · · , eσ(n) ) = sgn(σ)aσ(1)1 · · · aσ(n)n det(e1 , · · · , en )
σ∈Sn σ∈Sn
(D3) X
= sgn(σ)aσ(1)1 · · · aσ(n)n
σ∈Sn
Hence, having just used the general properties of determinants and some facts about permutations, we
have arrived at a unique expression for the determinant. Conversely, it is straightforward to show that
this expression satisfies all the requirements of Def. 5.1. In summary, we conclude that the determinant,
as defined in Def. 5.1, is unique and explicitly given by
X
det(A) = det(A1 , · · · , An ) = sgn(σ)aσ(1)1 · · · aσ(n)n , (5.11)
σ∈Sn
where aij are the entries of the n × n matrix A. Note that the sum on the RHS runs over all permutations
in Sn and, therefore, has n! terms. A useful way to think about this sum is as follows. From each column of
the matrix A, choose one entry such that no two entries lie in the same row. A term in Eq. (5.11) consists
of the product of these n entries (times the sign of the permutation involved) and the sum amounts to all
possible ways of making this choice.
Another useful way to write the determinant which is often employed in physics involves the n-
dimensional generalization of the Levi-Civita tensor, defined by
+1 if i1 , . . . , in is an even permutation of 1, . . . , n
i1 ···in = −1 if i1 , . . . , in is an odd permutation of 1, . . . , n . (5.12)
0 otherwise
Essentially, the Levi-Civita tensor plays the same role as the sign of the permutation (plus it vanishes if it
has an index appearing twice when i1 , . . . , in is not actually a permutation of 1, . . . , n) so that Eq. (5.11)
can alternatively be written as
det(A) = i1 ···in ai1 1 · · · ain n , (5.13)
with a sum over the n indices i1 , . . . , in implied.
Low dimensions and some special cases
To get a better feel for the determinant it is useful to look at low dimensions first. For n = 2 we have
a1 b1
det = ij ai bj = 12 a1 b2 + 21 a2 b1 = a1 b2 − a2 b1 . (5.14)
a2 b2
84
The two terms on the right-hand side correspond to the two permutations of {1, 2}. In three dimensions
we find
a1 b1 c1
det a2 b2 c2 = ijk ai bj ck = a1 b2 c3 + a2 b3 c1 + a3 b1 c2 − a2 b1 c3 − a3 b2 c1 − a1 b3 c2 (5.15)
a3 b3 c3
= ha, b, ci = a · (b × c) (5.16)
The last line follows by comparison with Eq. (2.43). Hence, the three-dimensional determinant as from our
general definition is indeed the triple product and coincides with our earlier definition of the determinant.
The six terms in the right-hand side of Eq. (5.15) correspond to the six permutations of {1, 2, 3} and we
recall from Eq. (2.50) that they can be explicitly computed by multiplying the terms along the diagonals
of the matrix.
we need to write down the six terms in Eq. (5.15) which can be obtained by multiplying the terms along
the diagonals of A, following the rule (2.50). Explicitly
1 −2 0
det(A) =det 3 2 −1
4 2 5
= 1 · 2 · 5 + (−2) · (−1) · 4 + 0 · 3 · 2 − 0 · 2 · 4 − (−2) · 3 · 5 − 1 · (−1) · 2
= 10 + 8 + 30 + 2 = 50 . (5.20)
Recall that each of the six terms are obtained by multiplying three entries along a diagonal as indicated
by the lines (where corresponding lines at the left and right edge of the matrix should be identified to
collect all factors). The three diagonals correspond to the three cyclic permutations which appear with
a positive sign while the three diagonals lead to the anti-cyclic terms which come with a negative sign.
The determinant of a 4 × 4 matrix has 4! = 24 terms and it is n! terms for an n × n matrix, so this becomes
complicated quickly. An interesting class of matrices for which the determinant is simple consists of upper
85
triangular matrices, that is, matrices with all entries below the diagonal vanishing. In this case
a1 ∗
det
.. = a1 · · · an ,
(5.21)
.
0 an
so the determinant is simply the product of the diagonal elements 4 . This can be seen from Eq. (5.11). We
should consider all ways of choosing one entry per column such that no two entries appear in the same row.
For an upper triangular matrix, the only non-zero choice in the first column is the first entry, so that the
first row is “occupied”. In the second column the only available non-trivial choice is, therefore, the entry
in the second row etc. In conclusion, from the n! terms in Eq. (5.11) only the term which corresponds to
the product of the diagonal elements is non-zero. An easy conclusion from Eq. (5.21) is that
det(1n ) = 1 , (5.22)
Another obvious question is about the relation between the determinant and matrix multiplication.
Fortunately, there is a simply and beautiful answer.
Theorem 5.1. det(AB) = det(A) det(B), for any two n × n matrices A, B.
Proof. Recall from Eq. (3.52) the index form of matrix multiplication
X
(AB)ij = Aik Bkj .
k
86
where Ak are the columns of A. Hence,
(5.23) X X
det(AB) = det((AB)1 , · · · , (AB)n ) = det Bk1 1 Ak1 , · · · , Bkn n Akn
k1 kn
(D1) X ka =σ(a) X
= Bk1 1 · · · Bkn n det(Ak1 , · · · , Akn ) = Bσ(1)1 · · · Bσ(n)n det(Aσ(1) , · · · , Aσ(n) )
k1 ,··· ,kn σ∈Sn
(D2) X
= sgn(σ)Bσ(1)1 · · · Bσ(n)n det(A1 , · · · , An ) = det(A) det(B)
| {z }
σ∈Sn
| {z } det(A)
det(B)
This simple multiplication rule for determinants of matrix products has a number of profound conse-
quences. First, we can prove a criterion for invertibility of a matrix, based on the determinant, essentially
a more complete version of Claim 2.1.
Proof. “⇒”: If A is bijective it has an inverse A−1 and 1 = det(1n ) = det(AA−1 ) = det(A) det(A−1 ).
This implies that det(A) 6= 0 and that det(A−1 ) = (det(A))−1 which is the second part of our assertion.
“⇐”: We prove this indirectly, so we start by assuming that A is not bijective. From Lemma 3.4 this
means that rk(A) < n, so the rank of A is less than maximal. Hence, at least one of the column vectors
of A, say A1 for definiteness, can be expressed as a linear combination of the others, so that
n
X
A1 = αi Ai
i=2
Note that, for invertible matrices A, this provides us with a useful way to calculate the determinant
of the inverse matrix by
det(A−1 ) = (det(A))−1 . (5.25)
Combining this rule and Theorem 5.1 implies that det(P AP −1 ) = det(P ) det(A)(det(P ))−1 = det(A), so,
in short
det(P AP −1 ) = det(A) . (5.26)
This equation says that the determinant remains unchanged under basis transformations (3.96) and, as a
result, the determinant is the same for every matrix representing a given linear map. The determinant is,
therefore, a genuine property of the linear map and we can talk about the determinant of a linear map,
defined as the determinant of any of its representing matrices.
87
Example 5.2: Using the determinant to check if a matrix is invertibe
The above Corollary is useful to check if (small) matrices are invertible. Consider, for example, the family
of 3 × 3 matrices
1 −1 a
A= 0 a −3 (5.27)
−2 0 1
where a ∈ R is a real parameter. We can ask for which values of a the matrix A is invertible. Computing
the determinant is straightforward and leads to
This vanishes precisely when a = −2 or a = 3/2 and, hence, for these values of a the matrix A is not
invertible. For all other values it is invertible.
Our next goal is to find a recursive method to calculate the determinant, essentially by writing the
determinant of a matrix in terms of determinants of sub-matrices. To this end, for an n × n matrix A, we
define the associated n × n matrices
th
0 ←j col
..
“A” . “A”
0
th
Ã(i,j) = 0 ··· 0 1 0 ··· 0
← i row (5.29)
0
..
“A” . “A”
0
They are obtained from A by setting the (i, j) entry to 1, the other entries in row i and column j to zero
and keeping the rest of the matrix unchanged. Note that the subscripts (i, j) indicate the row and column
which have been changed rather than specific entries of the matrix (hence the bracket notation). With
the so-defined matrices we define the co-factor matrix, an n × n matrix C with entries
To find a more elegant expression for the co-factor matrix, we also introduce the (n − 1) × (n − 1) matrices
A(i,j) which are obtained from A by simply removing the ith row and the j th column. It takes i − 1 swaps
of neighbouring rows in (5.29) to move row i to the first row (without changing the order of any other
rows) and a further j − 1 swaps to move column j to the first column. After these swaps the matrix Ã(i,j)
becomes
1 0 ··· 0
0
B(i,j) = . , (5.31)
.. A(i,j)
0
From Def. (5.1) (D2) and Lemma 5.1 it is clear that det(Ã(i,j) ) = (−1)i+j det(B(i,j) ), since we need a total
of i + j − 2 swaps of rows and columns to convert one matrix into the other. Further, the explicit form of
the determinant (5.11) implies that det(B(i,j) ) = det(A(i,j) ) (as the only non-trivial choice of entry in the
first column of B(i,j) is the 1 in the first row). Combining these observations means the co-factor matrix
is given by
Cij = det(Ã(i,j) ) = (−1)i+j det(A(i,j) ) . (5.32)
88
Hence, the co-factor matrix contains, up to signs, the determinants of the (n − 1) × (n − 1) sub-matrices
of A, obtained by deleting one row and one column from A. As we will see, for explicit calculations, it is
useful to note that the signs in Eq. (5.32) follow a “chess board pattern”, that is, the matrix with entries
(−1)i+j has the form
+ − + ···
− + − ···
(5.33)
+ − + ···
.. .. .. ..
. . . .
Our goal is to relate the determinant of A to the determinants of sub-matrices, that is to the entries of
the co-factor matrix. This is accomplished by
Lemma 5.2. For an n × n matrix A with associated co-factor matrix C, defined by Eq. (5.32), we have
C T A = det(A)1n (5.34)
Proof. This follows from the definition of the co-factor matrix, more or less by direct calculation.
(3.52) X X (5.32) X
(C T A)ij = (C T )ik Akj = Akj Cki = Akj det(Ã(k,i) )
k k k
(5.29) X
1 i−1 i+1
= Akj det(A , · · · , A , ek , A , · · · , An )
k
!
(D1) X
= det A1 , · · · , Ai−1 , Akj ek , Ai+1 , · · · , An
k
(5.1)
= det(A1 , · · · , Ai−1 , Aj , Ai+1 , · · · , An ) = δij det(A) = (det(A)1n )ij
89
by expanding along its 1st column. From Eq. (5.35), taking into account the signs as indicated in (5.33),
we find
det(A) = A11 det(A(1,1) ) − A21 det(A(2,1) ) + A31 det(A(3,1) )
2 −2 −1 0 −1 0
= 2 · det − 1 · det + 0 · det = 2 · 14 − 1 · (−4) + 0 · 2 = 32
3 4 3 4 2 −2
Note that the efficiency of the calculation can be improved by choosing the row or column with the most
zeros.
A by-product of Lemma 5.2 is a new method to compute the inverse of a matrix. If A is invertible then,
from Cor. 5.1, det(A) 6= 0 and we can divide by det(A) to get
1
C T A = 1n .
det(A)
Hence, the inverse of A is given by
1
A−1 = CT . (5.37)
det(A)
Again, it is worth applying this to an example.
90
We note that, for larger matrices, the row reduction method discussed in Section 3.3 is a more efficient
way of computing the inverse than the co-factor method. Indeed, for an n × n matrix the number of
operations required for a row reduction grows roughly as n3 while computing a determinant requires ∼ n!
operations.
Despite our improved methods, the calculation of determinants of large matrices remains a problem,
essentially because the aforementioned n! growth of the number of terms in Eq. (5.11). Using a Laplace
expansion will improve matters only if the matrix in question has many zeros. However, by using elemen-
tary row operations, we can get to an efficient way of computing large determinants. The key observation
is that, from the general properties of the determinant in Def. 5.1, row operations of type (R1) (see
Def. 3.7) only change the sign of the determinant and row operations of type (R2) leave the determinant
unchanged. A given matrix A can be brought into upper echelon form, A0 , by a succession of these row
operations and, hence, det(A) = (−1)p det(A0 ), where p is the number of row swaps used in the process.
The matrix A0 is in fact in upper triangular form
a1 ∗
A0 =
.. .
.
0 an
and, as discussed earlier, the determinant of such a matrix is simply the product of its diagonal entries.
It follows that det(A) = (−1)p a1 · · · an .
which are obtained from A by replacing the ith column with b and keeping all other columns unchanged.
We also note that, in terms of the column vectors Aj of A the linear system Ax = b can be written as
(see, for example, Eq. (3.36)) X
xj A j = b , (5.42)
j
= xi det(A) .
91
for the solution x = (x1 , . . . , xn )T of the linear system Ax = b, where A is an invertible n × n matrix.
To solve linear systems explicitly, Cramer’s rule is only useful for relatively small systems, due to the n!
growth of the determinant. For larger linear systems the row reduction method introduced in Section (4.3)
should be used.
From Eq. (5.41), that is by replacing one column of A with the vector b, we find the three matrices
1 −1 0 2 1 0 2 −1 1
B(1) = 2 2 −2 , B(2) = 1 2 −2 , B(3) = 1 2 2 . (5.45)
0 3 4 0 0 4 0 3 0
By straightforward computation, for example using a Laplace expansion, it follows that det(A) = 32,
det(B(1) ) = 22, det(B(2) ) = 12 and det(B(3) ) = −9. From Eq. (5.43) this leads to the solution
22
1
x= 12 .
32
−9
92
6 Scalar products
In Section 2 we have introduced the standard scalar product on Rn (the dot product) and we have seen
its usefulness, particularly for geometrical applications. Here, we study its generalizations to arbitrary
real and complex vector spaces.
Let us discuss this definition, beginning with the case of a real scalar product. The condition (S2) says
that a scalar product is linear in the second argument, in precisely the same sense that a linear map
is linear (see Def. 3.5). For the real case, the scalar product is symmetric in the two arguments from
condition (S1) and, together with (S2), this implies linearity in the first argument, so
So, in the real case, the scalar product is bi-linear. In this sense, we should think of the above definition
as natural, extending our notion of linearity to maps with two vectorial arguments.
The situation is somewhat more complicated in the hermitian case. Here, the complex conjugation in
(S1) together with (S2) leads to
Hence, sums in the first argument of a hermitian scalar product can still be pulled apart, but scalars are
pulled out with a complex conjugation. This property, together with the linearity in the second argument 5
is also called sesqui-linearity.
The property (S3) ensures that we can sensibly define the norm (or length) of a vector as
p
|v| := hv, vi . (6.3)
Note that in the hermitian case, (S1) implies that hv, vi = hv, vi∗ so that hv, vi is real. For this reason,
the condition (S3) actually makes sense in the hermitian case (if hv, vi was complex there would be no
well-defined sense in which we could demand it to be positive) and this explains the need for including
the complex conjugation in (S1).
The Cauchy-Schwarz inequality can be shown as in Lemma 2.1 (taking care to include complex con-
jugation in the hermitian case), so we have in general
93
The proof of the triangle inequality in Lemma 2.2 also goes through in general, so for the norm (6.3) of a
general scalar product we have
|v + w| ≤ |v| + |w| . (6.5)
For a real scalar product, in analogy with Eq. (2.10), the Cauchy-Schwarz inequality allows the definition
of the angle ^(v, w) ∈ [0, π] between two non-zero vectors v, w by
hv, wi
cos(^(v, w)) := . (6.6)
|v||w|
For any scalar product, two vectors v and w are called orthogonal iff hv, wi = 0. Hence, for a real scalar
product, the non-zero vectors v and w are orthogonal precisely when they form an angle ^(v, w) = π/2.
We should now discuss some examples of scalar products.
We already know from Eq. (2.3) that it satisfies all the requirements in Def. 6.1 for a real scalar product.
(b) Standard scalar product in Cn
For two vectors v = (v1 , . . . , vn )T and w = (w1 , . . . , wn )T in Cn the standard scalar product in Cn is
defined as
X n
†
hv, wi := v w = vi∗ wi . (6.8)
i=1
It is easy to check that it satisfies the requirements in Def. 6.1 for a hermitian scalar product. In particular,
the associated norm is given by
X n
2
|v| = hv, vi = |vi |2 , (6.9)
i=1
where |vi | denotes the modulus of the complex number vi . This is indeed real and positive, as it must,
but note that the inclusion of the complex conjugate in Eq. (6.8) is crucial.
(c) Minkowski product in R4
For two four-vectors v = (v0 , v1 , v2 , v3 )T and w = (w0 , w1 , w2 , w3 )T in R4 , the Minkowski product is
defined as
It is easy to show that it satisfies conditions (S1) and (S2) but not condition (S3). For example, for
v = (1, 0, 0, 0)T we have
hv, vi = −1 , (6.11)
which contradicts (S3). Therefore, the Minkowski product is not a scalar product but merely a bi-linear
form. Nevertheless, it plays an important role in physics, specifically in the context of special (and general)
relativity.
(d) Scalar product for function vector spaces
94
Def. 6.1 applies to arbitrary vector spaces so we should discuss at least one example of a more abstract
vector space. Consider the vector space of continuous (real- or complex-valued) functions f : [a, b] → R
or C on an interval [a, b] ⊂ R. A scalar product for such functions can be defined by the integral
Z b
hf, gi := dxf (x)∗ g(x) . (6.12)
a
It is easily checked that the conditions (S1)–(S3) are satisfied. Scalar products of this kind are of great
importance in physics, particularly in quantum mechanics.
(e) Scalar product for real matrices
The real n × n matrices form a vector space V with vector addition and scalar multiplication defined
component-wise as in Example 1.4 (e). The dimension of this space is n2 with a basis given by the
matrices E(ij) , defined in Eq. (1.54). On this space, we can introduce a scalar product by
X
hA, Bi := tr(AT B) = Aij Bij . (6.13)
i,j
where thePsymbol tr denotes the trace of a matrix A, defined as the sum over its diagonal entries, so
tr(A) := i Aii . The sum on the RHS of Eq. (6.13) shows that this definition is in complete analogy
with the dot product for real vectors, but with the summation running over two indiced instead of just
one. It is, therefore, clear that all requirements for a scalar product are satisfied. For complex matrices a
hermitian scalar product can be defined analogously simply by replacing the transpose in Eq. (6.13) with
a hermitian conjugate.
We conclude our introduction to scalar products with a simple but important observation about orthogonal
vectors.
Lemma 6.1. Pairwise orthogonal and non-zero vectors v1 , . . . , vk are linearly independent.
and take the scalar product of this equation with one of the vectors, vj . Since hvi , vj i = 0 for i 6= j it
follows that αj |vj |2 = 0. Since vj 6= 0 its norm is positive, |vj | > 0, so αj = 0.
Definition 6.2. A basis 1 , . . . , n of a vector space V with a scalar product is called ortho-normal iff
that is, if the basis vectors are pairwise orthogonal and have length one.
95
standard scalar product on Rn (Cn ), as defined in Example 6.1.
(b) The vectors
1 1 1 1
1 = √ , 2 = √ (6.15)
2 1 2 −1
form an orthonormal basis on R2 with respect to the standard scalar product, that is, Ti j = δij .
(c) The vectors
1 2 1 1
1 = √ , 2 = √ (6.16)
5 i 5 −2i
form an orthonormal basis on C2 with respect to the standard scalar product, that is, †i j = δij . Note,
it is crucial to use the proper standard scalar product 6.1 (b) for the complex case which involves the
hermitian conjugate rather than the transpose.
(d) For the vector space of real n × n matrices, the matrices E(ij) , defined in Eq. (1.54), form an ortho-
normal basis with respect to the scalar product (6.13).
An ortho-normal basis has many advantages compared to an arbitrary basis of a vector space. For example,
consider the coordinates of a vector v ∈PV relative to an ortho-normal basis {1 , . . . , n }. Of course, we
can write v as a linear combination v = ni=1 αi i with some coordinates αi but, in the general case, these
coefficients need to be determined by solving a system of linear equations. For an ortho-normal basis, we
can just take the scalar product of this equation with j , leading to
n
X n
X
hj , vi = hj αi i i = αi hj , i i = αj
| {z }
i=1 i=1
=δij
So in summary, the coordinates of a vector v relative to an ortho-normal basis {1 , . . . , n } can be computed
as
Xn
v= αi i ⇐⇒ αi = hi , vi . (6.17)
i=1
(b) For C2 we use the ortho-normal basis {1 , 2 } from (6.16) and the same vector v = (2, −3)T which
we would like to write as a linear combination v = β1 1 + β2 2 . Then,
† †
1 2 2 4 + 3i 1 1 2 2 − 6i
β1 = †1 v =√ = √ , β2 = †2 v =√ = √ .
5 i −3 5 5 −2i −3 5
96
Note it is crucial to use the hermitian conjugate, rather than the transpose in this calculation.
Does every (finite-dimensional) vector space have an ortho-normal basis and, if so, how can it be deter-
mined? The Gram-Schmidt procedure answers both of these questions.
Theorem 6.1. (Gram-Schmidt procedure) If {v1 , . . . , vn } is a basis of the vector space V , then there exists
an ortho-normal basis {1 , . . . , n } of V such that Span(1 , . . . , k ) = Span(v1 , . . . , vk ) for all k = 1, . . . , n.
Proof. The proof is constructive. The first vector of our prospective ortho-normal basis is obtained by
simply normalizing v1 , that is,
v1
1 = . (6.18)
|v1 |
Clearly, |1 | = 1 and Span(1 ) = Span(v1 ). Suppose we have already constructed the first k − 1 vectors
1 , . . . , k−1 , mutually orthogonal, normalized and such that Span(1 , . . . , j ) = Span(v1 , . . . , vj ) for all
j = 1, . . . , k − 1. The next vector, k , is then constructed by first subtracting from vk its projections onto
1 , . . . , k−1 and then normalizing, so
k−1
X vk0
vk0 = vk − hi , vk ii , k = . (6.19)
|vk0 |
i=1
Hence, k is orthogonal to all vectors 1 , . . . , k−1 . Moreover, since Span(1 , . . . , k−1 ) = Span(v1 , . . . , vk−1 )
and vk and k only differ by a re-scaling and terms proportional to 1 , . . . , k−1 is follows that Span(1 , . . . , k ) =
Span(v1 , . . . , vk ).
We have seen that every finitely spanned vector space has a basis. The above theorem, therefore, shows
that every finitely spanned vector space with a scalar product also has an ortho-normal basis. Note that
the proof provides a practical method, summarized by Eqs. (6.18), (6.19), to compute an ortho-normal
basis from a given basis. Let us apply this method to some explicit examples.
97
2) To find 2 use Eq. (6.19) for k = 2:
2 1 1 1
v20 1
v20 = v2 − h1 , v2 i1 = 0 − 1 = −1 , 2 = 0 = √ −1 .
|v2 | 3
1 0 1 1
3) To find 3 use Eq. (6.19) for k = 3:
1 1 1 1 1
1 1 7 v30 1
v30 = v3 −h1 , v3 i1 −h2 , v3 i2 = −2 + 1 − −1 = −1 , 3 = = √ −1 .
2 3 6 |v30 | 6
−2 0 1 −2 −2
So, in summary, the ortho-normal basis is
1 1 1
1 1 1
1 = √ 1 , 2 = √ −1 , 3 = √ −1 .
2 0 3 1 6 −2
It is easy (and always advisable) to check that indeed hi , j i = δij .
(b) For a somewhat more adventurous application of the Gram-Schmidt procedure consider the vector
space of quadratic polynomials in one variable x ∈ [1, −1] with real coefficients and a scalar product
defined by Z 1
hf, gi = dxf (x)g(x) .
−1
We would like to find the ortho-normal basis associated to the standard monomial basis v1 = 1, v2 = x,
v3 = x2 of this space.
1) To find 1 :
Z 1
v1 1
hv1 , v1 i = dx = 2 , 1 = =√ .
−1 |v 1 | 2
2) To find 2 first compute v2 0
Z 1
x
h1 , v2 i = dx √ = 0 , v20 = v2 − h1 , v2 i1 = x ,
−1 2
and then normalize r
1
2 v0 3
Z
hv20 , v20 i = 2
dx x = 2 = 20 = x.
−1 3 |v2 | 2
3) To find 3 first compute v30
Z 1 √ r Z 1
1 2 3 1
h1 , v3 i = √ 2
dx x = , h2 , v3 i = dx x3 = 0 , v30 = v3 −h1 , v3 i1 −h2 , v3 i2 = x2 − ,
2 −1 3 2 −1 3
and normalize
1 2
r
1
v0
8 5
Z
hv30 , v30 i = dx x2 − = , 3 = 30 = (3x2 − 1) .
−1 3 45 |v3 | 8
So, in summary, the ortho-normal polynomial basis is
r r
1 3 5
1 = √ , 2 = x, 3 = (3x2 − 1) .
2 2 8
These are the first three of an infinite family or ortho-normal polynomials, referred to as Legendre poly-
nomials, which play an important role in mathematical physics.
98
We have already seen in Eq. (6.17) that the coordinates of a vector relative to an ortho-normal basis are
easily computed from the scalar product. There are a few more helpful simplifications which arise for an
ortho-normal basis. For their derivation, we start with two vectors
X
v= αi i , αi = hi , vi (6.20)
i
X
w= βi i , βi = hi , wi (6.21)
i
This shows that, relative to an ortho-normal basis, a scalar product can be expressed in terms of the
standard scalar product on Rn or Cn . Suppose we would like to compute the representing matrix A
of a linear map f : V → V relative to an ortho-normal basis {1 , . . . , n } of V . In general, following
Lemma 3.3, the entries Aij of the matrix A can be obtained from
X
f (j ) = Aij i . (6.23)
i
Taking the scalar product of this equation with k results in the simple formula
Aij = hi , f (j )i . (6.24)
In physics, the RHS of this expression is often referred to as a matrix element of the map f . It is worth
noting that a linear map is uniquely determined by its matrix elements.
Lemma 6.2. If two linear maps f : V → V and g : V → V satisfy hv, f (w)i = hv, g(w)i (or hf (v), wi =
hg(v), wi) for all v, w ∈ V then f = g.
Proof. By linearity of the scalar product in the second argument the assumption implies that hv, f (w) −
g(w)i = 0 for all v, w ∈ V . In particular, if we choose v = f (w) − g(w), it follows from Def. 6.1 (S3)
that f (w) − g(w) = 0. Since this holds for all w it follows that f = g. The alternative statement follows
simply by applying Def. 6.1 (S1).
Example 6.5: Calculating the matrix representing a linear map relative to an ortho-normal basis
For a fixed vector n ∈ R3 , we consider the linear map f : R3 → R3 defined by
f (v) = (n · v)n . (6.25)
Evidently, this map projects vectors into the direction of n. We would like to compute the matrix A
representing this linear map relative to the ortho-normal basis given by the three standard unit vectors
ei , using Eq. (6.24). We find
Aij = ei · f (ej ) = (n · ei )(n · ej ) = ni nj . (6.26)
and, hence, Aij = ni nj or, in matrix notation
n21 n1 n2 n1 n3
A = n1 n2 n22 n2 n3 (6.27)
n1 n3 n2 n3 n23
99
We end this discussion of orthogonality with a result on perpendicular spaces. For a sub vector space
W ⊂ V the perpendicular space W ⊥ is defined as
In other words, W ⊥ consists of all vectors which are orthogonal to all vector in W . For example, if W ⊂ R3
is a plane through the origin then W ⊥ is the line through the origin perpendicular to this plane. The
following statements are intuitive and will be helpful for our treatment of eigenvectors and eigenvalues in
the next section.
Lemma 6.3. For a sub vector space W ⊂ V of a finite dimensional vector space V with a scalar product
the following holds:
(i) W ⊥ is a sub vector space of V .
(ii) W ∩ W ⊥ = {0}
(iii) dim(W ) + dim(W ⊥ ) = dim(V )
Proof. (i) If v1 , v2 ∈ W ⊥ then clearly αv1 + βv2 ∈ W ⊥ so from Def. 1.2 W ⊥ is a sub vector space.
(ii) If v ∈ W ∩ W ⊥ then hv, vi = 0, but from Def. 6.1 (S3) this implies that v = 0.
(iii) Choose an ortho-normal basis {1 , . . . , k } of W and define the linear map f : V → V by f (v) =
P k
i=1 hi , vii (a projection onto W ). Clearly Im(f ) ⊂ W . For w ∈ W it follows from Eq. (6.17) that
f (w) = w so that Im(f ) = W . Moreover, Ker(f ) = W ⊥ and the claim follows from the dimension
formula (3.4) applied to the map f .
Definition 6.3. For a linear map f : V → V on a vector space V with scalar product, an adjoint linear
map, f † : V → V is a map satisfying
hv, f wi = hf † v, wi (6.29)
for all v, w ∈ V .
In other words, a linear map can be “moved” into the other argument of the scalar product by taking its
adjoint. The following properties of the adjoint map are relatively easy to show.
100
Proof. (i) For two adjoints f1 , f2 for f we have hf1 (v), wi = hv, f (w)i = hf2 (v), wi for all v, w ∈ V .
Then Lemma 6.2 implies that f1 = f2 .
(ii) h(f † )† (v), wi = hv, f † (w)i = hf (v, wi. Comparing the LHS and RHS together with Lemma 6.2 shows
that (f † )† = f .
(iii) h(f +g)† (v), wi = hv, (f +g)(w)i = hv, f (w)i+hv, g(w)i = hf † (v), wi+hg † (v), wi = h(f † +g † )(v), wi
and the claim follows from Lemma 6.2.
(iv) h(αf )† (v), wi = hv, (αf )(w)i = αhv, f (w)i = αhf † (v), wi = h(α∗ f † )(v), wi and Lemma 6.2 leads to
the stated result.
(v) h(f ◦ g)† (v), wi = hv, (f ◦ g)(w)i = hf † (v), g(w)i = hg † ◦ f † (v), wi.
(vi) From (v) we have idV = (f ◦ f −1 )† = f † ◦ (f −1 )† . This means (f −1 )† is the inverse of f † and, hence,
(f † )−1 = (f −1 )† .
Let us now proceed in a more practical way and understand the adjoint map relative to an ortho-
normal basis 1 , . . . , n of V . From Eq. (6.24) the matrices A, B describing f and f † relative to this basis
are given by
Aij = hi , f (j )i , Bij = hi , f † (j )i . (6.30)
Using the scalar product property (S1) in Def. (6.1) these matrices are related by
that is, if A represents f then the hermitian conjugate A† represents f † . This also shows that, by reversing
the above argument and defining f † as the linear map associated to A† , that the adjoint always exists -
this was not immediately clear from the definition.
Previously, we have introduced hermitian conjugation merely as a “mechanical” operation to be carried
out for matrices. Now we understand its proper mathematical context - it leads to the matrix which
describes the adjoint linear map. In the case of a real scalar product we can of course drop the complex
conjugation in Eq. (6.31) and the matrix describing the adjoint becomes AT , the transpose of A. Hence,
we have also found a mathematical interpretation for the transposition of matrices.
We have seen in Eq. (6.22) that, with respect to on ortho-normal basis, a scalar product is described
by the standard (real or complex) scalar product on Rn or Cn . It is, therefore, clear that the adjoint of
a matrix A with respect to the standard scalar product is given by its hermitian conjugate, A† (or AT
in the real case). This is easy to verify explicitly from the definition of the standard scalar product in
Example 6.1.
hv, Awi = v† Aw = (A† v)† w = hA† v, wi . (6.32)
A particularly important class of linear maps are those which coincide with their adjoint.
Definition 6.4. A linear map f : V → V on a vector space V with scalar product is called self-adjoint
(or hermitian) iff f = f † .
In other words, self-adjoint maps can be moved from one argument of the scalar product into another, so
Clearly, relative to an ortho-normal basis, a self-adjoint linear map is described by a hermitian matrix (or
a symmetric matrix in the real case). Further, the self-adjoint linear maps on Rn (Cn ) with respect to the
standard scalar product are the symmetric (hermitian) matrices.
101
For a more abstract example of a self-adjoint linear map, consider the vector space of (infinitely many
times) differentiable functions ϕ : [a, b] → C, satisfying ϕ(a) = ϕ(b), with scalar product
Z b
hϕ, ψi = dx ϕ(x)∗ ψ(x) .
a
The derivative operator
d
D = −i
dx
defines a linear map on this space and we would like to check that it is self-adjoint. Performing an
integration by parts we find
Z b Z b Z b
∗ dψ ∗ b dϕ ∗
hϕ, Dψi = −i dx ϕ(x) (x) = −i [ϕ(x) ψ(x)]a + i dx (x) ψ(x) = dx (Dϕ)(x)∗ ψ(x)
a dx a dx a
= hDϕ, ψi .
Hence, D is indeed hermitian. Note that the boundary term vanishes due to the boundary condition
on our functions and that including the factor of i in the definition of D is crucial for the sign to work
out correctly. In quantum mechanics physical quantities are represented by hermitian operators. In this
context, the present operator D plays an important role as it corresponds to linear momentum.
102
As before, it is useful to work out what this means relative to an ortho-normal basis. If f is described by
a matrix A relative to this basis then we already know that f † is described by the hermitian conjugate
A† in the complex case or by the transpose AT in the real case.
We begin with the real case where the condition (6.35) turns into
Matrices A satisfying this condition are called orthogonal matrices and they can be characterized, equiv-
alently, by either one of the three conditions above. The simplest way to check if a given matrix is
orthogonal is usually to verify the condition on the LHS. The condition in the middle tells us it is easy to
compute the inverse of an orthogonal matrix - it is simply the transpose. And, finally, the condition on
the RHS says that the column vectors of an orthogonal matrix form an ortho-normal basis with respect
to the standard scalar product (the dot product). In fact, since a real scalar product, written in terms of
an ortho-normal basis, corresponds to the dot product, see Eq. (6.22), we expect that orthogonal matrices
are precisely those matrices which leave the dot product invariant. Indeed, we have
The set of all n × n orthogonal matrices is also denoted by O(n). Taking the determinant of the LHS con-
dition in (6.36) and using Lemma 5.1 and Theorem 5.1 gives 1 = det(1) = det(AAT ) = det(A) det(AT ) =
det(A)2 so that
det(A) = ±1 (6.38)
for any orthogonal matrix. The subset of n × n orthogonal matrices A with determinant det(A) = +1
is called special orthogonal matrices or rotations and denoted by SO(n). Note that the term “rotation”
is indeed appropriate for those matrices. Since they leave the dot product invariant they do not change
lengths of vectors and angles between vectors and the det(A) = +1 conditions excludes orthogonal ma-
trices which contain reflections. The relation between orthogonal matrices with positive and negative
determinants is easy to understand. Consider an orthogonal matrix A with det(A) = −1 and the specific
orthogonal matrix F = diag(−1, 1, . . . , 1) with det(F ) = −1 which corresponds to a reflection in the first
coordinate direction. Then the matrix R = F A is a rotation since det(R) = det(F ) det(A) = (−1)2 = 1.
This means every orthogonal matrix A can be written as a product
A = FR (6.39)
of a rotation R and a reflection F . To get a better feeling for rotations we should look at some low-
dimensional examples.
where a, b, c, d are real numbers and impose the conditions RT R = 12 and det(R) = 1. This gives
2
a + c2 ab + cd
T ! 1 0 !
R R= 2 2 = , det(R) = ad − bc = 1 ,
ab + cd b + d 0 1
103
and, hence, the equations a2 + c2 = b2 + d2 = 1, ab + cd = 0 and ad − bc = 1. It is easy to show that a
solution to these equations can always be written as a = d = cos(θ), c = −b = sin(θ), for some angle θ so
that two-dimensional rotation matrices can be written in the form
cos θ − sin θ
R(θ) = . (6.40)
sin θ cos θ
For the rotation of an arbitrary vector x = (x, y)T we get
0 x cos θ − y sin θ
x = Rx = . (6.41)
x sin θ + y cos θ
It is easy to verify explicitly that |x0 | = |x|, as must be the case, and that the cosine of the angle between
x and x0 is given by
x0 · x (x cos θ − y sin θ)x + (x sin θ + y cos θ)y
cos(^(x0 , x)) = 0
= = cos θ . (6.42)
|x ||x| |x|2
This result means we should interpret R(θ) as a rotation by an angle θ. From the addition theorems of
sin and cos it also follows easily that
R(θ1 )R(θ2 ) = R(θ1 + θ2 ) , (6.43)
that is, the rotation angle adds up for subsequent rotations, as one would expect. Note, Eq. (6.43)
also implies that two-dimensional rotations commute, since R(θ1 )R(θ2 ) = R(θ1 + θ2 ) = R(θ2 + θ1 ) =
R(θ2 )R(θ1 ), again a property intuitively expected.
(b) Three dimensions
To find the explicit form for three-dimensional rotations we could, in principle, use the same approach as
in two dimensions and impose all relevant constraints on an arbitrary 3 × 3 matrix. However, this leads
to a set of equations in 9 variables and is much more complicated. However, it is easy to obtain special
three-dimensional rotations from two-dimensional ones. For example, the matrices
1 0 0
R1 (θ1 ) = 0 cos θ1 − sin θ1 (6.44)
0 sin θ1 cos θ1
clearly satisfy R1 (θ1 )T R1 (θ1 ) = 13 and det(R1 (θ1 )) = 1 and are, hence, rotation matrices. They describe
a rotation by an angle θ1 around the first coordinate axis. Analogously, rotation matrices around the
other two coordinate axis can be written as
cos θ2 0 − sin θ2 cos θ3 − sin θ3 0
R2 (θ2 ) = 0 1 0 , R3 (θ3 ) = sin θ3 cos θ3 0 . (6.45)
sin θ2 0 cos θ2 0 0 1
It turns out that general three-dimensional rotation matrices can be obtained as products of the above
three special types. For example, we can write a three-dimensional rotation matrix as R(θ1 , θ2 , θ3 ) =
R1 (θ1 )R2 (θ2 )R3 (θ3 ), that is, as subsequent rotations around the three coordinate axis. Of course, there are
different ways of doing this, another choice frequently used in physics being R(ψ, θ, φ) = R3 (ψ)R1 (θ)R3 (φ).
The angles ψ, θ, φ in this parametrization are also called the Euler angles and in this case, the rotation
is combined from a rotation by φ around the z-axis, then a rotation by θ around the x-axis and finally
another rotation by ψ around the (new) z-axis. The Euler angles are particularly useful to describe the
motion of tops in classical mechanics.
Finally, we note that, unlike their two-dimensional counterparts, three-dimensional rotations do not,
in general, commute. For example, apart from special choices for the angles R1 (θ1 )R2 (θ2 ) 6= R2 (θ2 )R1 (θ1 ).
104
Application: Rotating physical systems
Suppose we have a stationary coordinate system with coordinates x ∈ R3 and another coordinate system
with coordinates y ∈ R3 , which is rotating relative to the first one. Such a set-up can be used to describe
the mechanics of objects in rotating systems and has many applications, for example to the physics of
tops or the laws of motion in rotating systems such as the earth (see below). Mathematically, the relation
between these two coordinate system can be described by the equation
x = R(t)y , (6.46)
where R(t) are time-dependent rotation matrices. This means the matrices R(t) satisfy
(as well as det(R(t)) = 1)) for all times t. In practice, we can write rotation matrices in terms of rotation
angles, as we have done in Example 6.7. The time-dependence of R(t) then means that the rotation angles
are functions of time. For example, a rotation around the z-axis with constant angular speed ω can be
written as
cos(ωt) − sin(ωt) 0
R(t) = sin(ωt) cos(ωt) 0 . (6.48)
0 0 1
In physics, a rotation is often described by the angular velocity ω, a vector whose direction indicates the
axis of rotation and whose length gives the angular speed. It is very useful to understand the relation
between R(t) and ω. To do this, define the matrix
W = RT Ṙ , (6.49)
where the dot denotes the time derivative and observe, by differentiating Eq. (6.47) with respect to time,
that
T T
| {zṘ} + Ṙ
R | {zR} = 0 . (6.50)
=W =W T
Hence, W is an anti-symmetric matrix and can be written in the form
0 −ω3 ω2
W = ω3 0 −ω1 or Wij = ikj ωk . (6.51)
−ω2 ω1 0
The three independent entries ωi of this matrix define the angular velocity ω = (ω1 , ω2 , ω3 )T . To see that
this makes sense consider the example (6.48) and work out the matrix W .
cos(ωt) sin(ωt) 0 − sin(ωt) − cos(ωt) 0 0 −ω 0
W = ω − sin(ωt) cos(ωt) 0 cos(ωt) − sin(ωt) 0 = ω 0 0 . (6.52)
0 0 1 0 0 0 0 0 0
Comparison with the general form (6.51) of W then shows that the angular velocity for this case is given
by ω = (0, 0, ω), indicating a rotation with angular speed ω around the z-axis, as expected.
In Example 3.8 we have seen that the multiplication of an anti-symmetric 3 × 3 matrix with a vector
can be written as a cross-product, so that
Wb = ω × b (6.53)
105
for any vector b = (b1 , b2 , b3 )T . This can also be directly verified using the matrix form of W together with
the definition (2.17) of the cross product or, more elegantly, by the index calculation Wij bj = ikj ωk bj =
(ω ×b)i , using the index form (2.29) of the cross product. This relation can be used to re-write expressions
involving W in terms of the angular velocity ω.
For a simple application of this formalism, consider an object moving with velocity ẏ relative to
the rotating system. What is its velocity relative to the stationary coordinate system? Differentiating
Eq. (6.46) gives
ẋ = Rẏ + Ṙy = R (ẏ + W y) = R (ẏ + ω × y) . (6.54)
The velocity ẋ in the stationary system has, therefore, two contribution, namely the velocity ẏ relative to
the rotating system and the velocity ω × y due to the rotation itself.
We now turn to the complex case. In this case, from Eq. (6.35), (complex) matrices A describing unitary
maps relative to an ortho-normal basis are characterized by the three equivalent conditions
Matrices satisfying these conditions are called unitary. As for orthogonal matrices, checking whether a
given matrix is unitary is usually easiest accomplished using the condition on the LHS. The condition in
the middle states that the inverse of a unitary matrix is simply its hermitian conjugate and the condition
on the RHS says that the column vectors of a unitary matrix form an ortho-normal basis under the
standard hermitian scalar product on Cn . Unitary matrices are precisely those matrices which leave the
standard hermitian scalar product invariant, explicitly
The set of all n × n unitary matrices is denoted by U (n). Orthogonal matrices (being real) also satisfy
the condition for unitary matrices so O(n) ⊂ U (n). For the determinant of unitary matrices we conclude
that 1 = det(1) = det(A† A) = det(A)∗ det(A) = | det(A)|2 . Hence, the determinant of unitary matrices
has complex modulus 1, so
| det(A)| = 1 . (6.57)
The unitary matrices U with det(U ) = 1 are called special unitary matrices, and the set of these matrices
is denoted by SU (n). Clearly, rotations are also special unitary so SO(n) ⊂ SU (n). For an arbitrary
unitary n × n matrix A we can always find a complex phase ζ such that ζ n = det(A). Then, the matrix
U = ζ −1 A is special unitary since det(U ) = det(ζ −1 A) = ζ −n det(A) = 1. This means every unitary
matrix A can be written as a product
A = ζU (6.58)
of a special unitary matrix U and a complex phase ζ.
where α, β, γ, δ are complex numbers and impose the conditions U † U = 12 and det(U ) = 1. After a
short calculation we find that every two-dimensional special unitary matrix can be written in terms of
106
two complex numbers α, β as
α β
U= where |α|2 + |β|2 = 1 . (6.59)
−β α∗
∗
This shows that two-dimensional special unitary matrices depend on two complex parameters α, β subject
to the (real) constraint |α|2 + |β|2 = 1 and, hence, on three real parameters. Inserting the special
choice α = cos θ, β = − sin θ into (6.59) we recover the two-dimensional rotation matrices (6.7), so that
SO(2) ⊂ SU (2), as expected from our general discussion.
The general study of orthogonal and unitary matrices is part of the theory of Lie groups, a more advanced
mathematical discipline which is beyond the scope of this introductory text.
Orthogonal and unitary matrices have numerous applications in physics which we would like to illus-
trate with an example from classical mechanics.
mẍ = F , (6.60)
where the dot denotes the derivative with respect to time t. We would like to work out the form this law
takes if we transform it to rotating coordinates y, related to the original, non-rotating coordinates x by
x = R(t)y . (6.61)
for all times t. For example, such a version of Newton’s law is relevant to describing mechanics on earth.
To re-write Eq. (6.60) in terms of y we first multiply both sides with RT = R−1 so that
mRT ẍ = FR , (6.63)
with FR := RT F the force in the rotating coordinate system. If the rotation matrix is time-independent
it can be pulled through the time derivatives on the LHS of Eq. (6.63) and we get mÿ = FR . This simply
says that Newton’s law keeps the same form in any rotated (but not rotating!) coordinate system.
If R is time-dependent so that the system with coordinates y is indeed rotating relative to the coor-
dinate system x we have to be more careful. Taking two time derivatives of Eq. (6.61) gives
Compared to Newton’s equation in the standard form (6.60) we have acquired the two additional terms on
the RHS which we should work out further. From Eq. (6.49), recall the definition W = RT Ṙ and further
note that Ẇ = RT R̈ + ṘT Ṙ = RT R̈ + (ṘT R)(RT Ṙ) = RT R̈ − W 2 , so that
RT R̈ = Ẇ + W 2 . (6.66)
107
With these results we can re-write Newton’s equation (6.65) as
Also, recall that the matrix W is anti-symmetric, encodes the angular velocity ω, as in Eq. (6.51) and its
action on vectors can be re-written as a cross product with the angular velocity ω (see Eq. (6.53)). Then,
Newton’s equation (6.67) in a rotating system can be written in its final form
The three terms on the RHS represent the additional forces a mass point experiences in a rotating system.
The centrifugal force is well-known. The Coriolis force is proportional to the velocity, ẏ, and, hence,
vanishes for mass points which rest in the rotating frame. It is, for example, responsible for the rotation
of a Faucault pendulum. Finally, the Euler force is proportional to the angular acceleration, ω̇. For the
earth’s rotation, ω is approximately constant so the Euler force is quite small in this case.
ϕw (v) = wT v ∈ R . (6.69)
It is clear that ϕw is a linear functional. Indeed all linear functionals in (Rn )∗ are of this form. To see
this, start with an arbitrary ϕ ∈ (Rn )∗ and define the vector w with components wi = ϕ(ei ). Then
!
X X X
ϕ(v) = ϕ vi ei = vi ϕ(ei ) = wi vi = wT v = ϕw (v) . (6.70)
i i i
Hence, ϕ = ϕw and we have written an arbitrary functional in the form (6.69). This result means we can
think of the functionals on Rn as n-dimensional row vectors.
(b) For the vector space of continuous functions h : [a, b] → R the integral
Z b
I(h) = dx h(x) (6.71)
a
where x0 ∈ [a, b] is a fixed point. In the physics literature this functional is also called Dirac delta function.
108
We know that a linear map f : V → W , with n = dim(V ) and m = dim W , is described by an m × n
matrix relative to a choice of basis on V and W . For W = F , we have m = dim(W ) = dim(F ) = 1, so
relative to a basis on V , linear functionals are described by 1 × n matrices, that is, by row vectors. So,
for a choice of basis, we can think of the vector space V as consisting of column vectors and its dual V ∗
as consisting of row vectors. To make this more precise we prove the following
Theorem 6.2. For a basis 1 , ..., n of V there is a basis 1∗ , ..., n∗ of V*, called the dual basis, such that
Proof. Recall from Example 3.4, that we can define a coordinate map ψ(α) = i αi i which assigns to a
P
coordinate vector α = (α1 , . . . , αn )T the corresponding vector, relative to the chosen basis i . We define
and claim that this provides the correct dual basis. First we check
To verify that the i∗ form a basis we first check linear independence. Applying i βi i∗ = 0 to j and
P
using Eq. (6.75) shows immediately that βj = 0, so that the i∗ are indeed linearly
P i independent. To see
∗ ∗
that they span V start with an arbitrary functional ϕ ∈ V and a vector v = i v i . Then
!
X X X X
ϕ(v) = ϕ v i i = v i ϕ(i ) = ϕi v i = ϕi i∗ (v) . (6.76)
| {z }
i i i i
:=ϕi ∈F
i
P
This means ϕ = i ϕi ∗ so that we have written an arbitrary functional ϕ as a linear combination of the
i∗ .
To summarize the discussion, for a basis {i } and its dual basis {i∗ } we can write vectors and dual
vectors as in the following table.
You have probably noticed that we have quietly refined our index convention. Vector space basis elements
have lower indices and their coordinates have upper indices while the situation is reversed for dual vectors.
For one, this allows us to decide the origin of coordinate vectors simply by the position of their index -
for an upper index, v i , we refer to vectors and for a lower index, ϕj to dual vectors. From Eq. (6.73), the
action of dual vectors on vectors can be written as
X
ϕ(v) = ϕi v j i∗ (j ) = ϕi v i , (6.78)
| {z }
i,j
=δji
so, as a simple summation over their indices, also referred to as contraction. Note that this corresponds to
a refined Einstein summation convention where the same lower and upper index are being summed over.
109
From here it is only a few steps to defining tensors. For the curious, a basic introduction into tensors can
be found in Appendix C.
Finally, we would like to have a look at the relation of dual vector spaces and scalar products. In fact,
for reasons which will become clear, we keep the discussion slightly more general and consider symmetric
bi-linear forms with an additional property:
Definition 6.7. A symmetric bi-linear form h · , · i on a (real) vector space V is called non-degenerate if
hv, wi = 0 for all w ∈ V implies that v = 0.
Note that a real scalar product is non-degenerate since already hv, vi = 0 implies that v = 0. Intuitively,
non-degeneracy demands that there is no vector which is orthogonal to all other vectors. It turns out that
a non-degenerate symmetric bi-linear form allows for a “natural” identification of a vector space and its
dual. This is the content of the following
Lemma 6.6. Let V be a real vector space with a symmetric bi-linear form h · , · i and define the map
ı : V → V ∗ by ı(v)(w) = hv, wi. Then we have
It is useful to work this out in a basis {i } of V with dual basis {i∗ } of V ∗ . To do this we first introduce
the symmetric matrix g, also called the metric tensor or metric in short, with entries
We would like to work out the matrix which represents the map ı relative to our basis choice. This means
we should look at the images of the basis vectors, so ı(i )(j ) = hi , j i = gij = gki k∗ (j ). Stripping off
the basis vector j we have
ı(i ) = gji j∗ , (6.81)
and, by comparison with Eq. (3.80), we learn that ı is represented by the metric g. If the bi-linear form
is non-degenerate, so that ı is bijective, then g is invertible. The components of g −1 are usually denoted
by g ij , so that
g ij gjk = δki . (6.82)
In the physics literature it is common
P to use the same symbol for
Pthe components of a vector and the dual
i i
vector, related under ı. So if v = i v i then we write ı(v) = i vi ∗ . Since g represents ı this means
vi = gij v j , v i = g ij vj . (6.83)
Physicists refer to these equations by saying that we can “lower and raise indices” with the metric gij and
its inverse g ij . Mathematically, they are simply a component version of the isomorphism ı between V and
V ∗ which is induced from the non-degenerate bi-linear form. With this notation, the bi-linear form on
two vector v = i v i i and w = j wj j can be written as
P P
110
The Minkowski product has already been introduced in Example 6.1 (c). For two four-vectors v, w ∈ R4
and η = diag(−1, 1, 1, 1) the symmetric bi-linear form is defined by
hv, wi = vT ηw . (6.85)
so is simply given by η. Since η is invertible this also shows, from Lemma 6.6, that the Minkowski product
is non-degenerate. From Eq. (6.83) lowering and raising of indices then takes the form
vµ = ηµν v ν , v µ = η µν vν , (6.87)
vT ηw = ηµν v µ wν = v µ wµ = vν wν . (6.88)
All these equations are part of the standard covariant formulation of special relativity.
We can go one step further and ask about the linear transformations Λ : R4 → R4 which leave the
Minkowski product invariant, that is, which satisfy
Note that these linear transformations, which are referred to as Lorentz transformations, relate to the
Minkowski product in the same way that orthogonal linear maps relate to the standard scalar product on
Rn (see Section 6.4). In Special Relativity the linear transformation
µ
x → x0 = Λx ⇐⇒ xµ → x0 = Λµ ν xν (6.90)
generated by Λ is interpreted as a transformation from one inertial system with space-time coordinates
x = (t, x, y, z)T to another one with space-time coordinates x0 = (t0 , x0 , y 0 , z 0 )T .
Lorentz transformations have a number of interesting properties which follow immediately from their
definition (6.89). Taking the determinant of the middle equation (6.89) and using standard properties of
the determinant implies that (det(Λ))2 = 1 so that
det(Λ) = ±1 (6.91)
P3
Further, the ρ = σ = 0 component of the last Eq. (6.89) reads −(Λ0 0 )2 + i 2
i=1 (Λ 0 ) = −1 so that
Λ0 0 ≥ 1 or Λ0 0 ≤ −1 . (6.92)
Combining the two sign ambiguities in Eqs. (6.91) and (6.92) we see that there are four types of Lorentz
transformations. The sign ambiguity in the determinant is analogous to what we have seen for orthogonal
matrices and its interpretation is similar to the orthogonal case. Lorentz transformation with determinant
1 are called “proper” Lorentz transformations while Lorentz transformations with determinant −1 can be
seen as a combination of a proper Lorentz transformation and a reflection. More specifically, consider the
special Lorentz transformation P = diag(1, −1, −1, −1) (note that this matrix indeed satisfies Eq. (6.89))
which is also referred to as “parity”. Then every Lorentz transformation Λ can be written as
Λ = P Λ+ , (6.93)
111
where Λ+ is a proper Lorentz transformation. The sign ambiguity (6.92) in Λ0 0 is new but has an obvious
physical interpretation. Under a Lorentz transformations Λ with Λ0 0 ≥ 1 the sign of the time component
x0 = t of a vector x remains unchanged, so that the direction of time is unchanged. Correspondingly,
such Lorentz transformation with positive Λ0 0 are called “ortho-chronous”. On the other hand, Lorentz
transformations Λ with Λ0 0 ≤ −1 change the direction of time. If we introduce the special Lorentz
transformation T = diag(−1, 1, 1, 1), also referred to as “time reversal”, then every Lorentz transformation
Λ can be written as
Λ = T Λ↑ , (6.94)
where Λ↑ is an ortho-chronous Lorentz transformation. Combining the above discussion, we see that every
Lorentz transformation Λ can be written in one of four ways, namely
↑
Λ+ for det(Λ) = 1 and Λ0 0 ≥ 1
P Λ↑+ for det(Λ) = −1 and Λ0 0 ≥ 1
Λ= ↑ , (6.95)
P T Λ+ for det(Λ) = 1 and Λ0 0 ≤ −1
T Λ↑+
for det(Λ) = −1 and Λ0 0 ≤ −1
where Λ↑+ is a proper, ortho-chronous Lorentz transformation. The Lorentz transformations normally
used in Special Relativity are the proper, ortho-chronous Lorentz transformations. However, the other
Lorentz transformations are relevant as well and it is an important question as to whether they constitute
symmetries of nature in the same way that proper, ortho-chronous Lorentz transformations do. More to
the point, the question is whether nature respects parity P and time-reversal T .
What do proper, ortho-chronous Lorentz transformations look like explicitly? To answer this question we
basically have to solve Eq. (6.89) which is clearly difficult to do in full generality. However, some special
Lorentz transformations are more easily obtained. First, we note that matrices of the type
1 0
Λ= (6.96)
0 R
where R is a three-dimensional rotation matrix are proper, ortho-chronous Lorentz transformations. In-
deed, such matrices satisfy Eq. (6.89) by virtue of RT R = 13 and we have det(Λ) = det(R) = 1 and
Λ0 0 = 1. In other words, regular three-dimensional rotations in the spatial directions are proper, ortho-
chronous Lorentz transformations.
of a two-dimensional Lorentz transformation which affects time and the x-coordinate, but leaves y and z
unchanged. Inserting this Ansatz into Eq. (6.89) and, in addition, requiring that det(Λ2 ) = 1 for proper
Lorentz transformations, leads to
a2 − c2 = 1 , d2 − b2 = 1 , ab − cd = 0 , ad − cb = 1 . (6.98)
112
Note the close analogy of this form with two-dimensional rotations in Example 6.7 (a). The quantity ξ is
also called “rapidity”. It follows from the addition theorems for hyperbolic functions that Λ(ξ1 )Λ(ξ2 ) =
Λ(ξ1 + ξ2 ), so rapidities add up in the same way that two-dimensional rotation angles do. For a more
common parametrisation introduce the parameter β = tanh(ξ) ∈ [−1, 1] so that
1
cosh(ξ) = p =: γ , sinh(ξ) = βγ . (6.100)
1 − β2
In terms of β and γ the two-dimensional Lorentz transformations can then be written in the more familiar
form
γ βγ
Λ2 = . (6.101)
βγ γ
Here, β is interpreted as the relative speed of the two inertial systems (in units of the speed of light).
Definition 7.1. For a linear map f : V → V on a vector space V over F the number λ ∈ F is called an
eigenvalue of f if there is a non-zero vector v such that
f (v) = λv . (7.1)
In short, an eigenvector is a vector which is just “scaled” by the action of a linear map.
How can we find eigenvalues and eigenvectors of a linear map? To discuss this we first introduce the
idea of eigenspaces. The eigenspace for λ ∈ F is defined by
and, hence, from Eq. (7.1) it “collects” all eigenvectors for λ. Being the kernel of a linear map, an
eigenspace is of course a sub vector space of V . Evidently, λ is an eigenvalue of f precisely when
dim Eigf (λ) > 0. If dim Eigf (λ) = 1 the eigenvalue λ is called non-degenerate (up to re-scaling there
113
is only one eigenvector for λ) and if dim Eigf (λ) > 1 the eigenvalue λ is called degenerate (there are at
least two linearly independent eigenvectors for λ).
We see that λ is an eigenvalue of f precisely when Ker(f − λ idV ) is non-trivial. From Lemma 3.1 this
is the same as saying that f − λ idV is not invertible which is equivalent to det(f − λ idV ) = 0, using
Lemma 5.1. So in summary
This leads to an explicit method to calculate eigenvalues and eigenvectors which we develop in the next
sub-section.
For an n-dimensional vector space V the characteristic polynomials χf (λ) is a polynomial of order n in λ
whose coefficients depend on f . Clearly, from Eq. (7.3), the eigenvalues of f are precisely the zeros of its
characteristic polynomial. So schematically, eigenvalues and eigenvectors of f can be computed as follows.
2. Find the zeros, λ, of the characteristic polynomial. They are the eigenvalues of f .
3. For each eigenvalue λ compute the eigenspace Eigf (λ) = Ker(f − λ idV ) by finding all vectors v
which solve the equation
(f − λ idV )(v) = 0 . (7.5)
114
Hence, up to scaling, there is only one eigenvector so the eigenvalue is non-degenerate. Normalizing the
eigenvector with respect to the dot product gives
1
1
v1 = √ 1 .
3 1
λ2 = 1:
0 −1 0 x −y
(A − 11)v = −1
!
1 −1 y = −x + y − z = 0 ⇐⇒ y = 0 , x = −z
0 −1 0 z −y
λ3 = 3:
−2 −1 0 x −2x − y
(A − 31)v = −1 −1 −1 y = −x − y − z = 0
!
⇐⇒ y = −2x , z = x
0 −1 −2 z −y − 2z
Some general properties of the characteristic polynomial are given in the following
Proof. (i) χP AP −1 (λ) = det(P AP −1 − λ1) = det(P (A − λ1)P −1 ) = det(A − λ1) = χA (λ).
(ii) This is a direct consequence of (i).
(iii) First, it is clear that c0 = χA (0) = det(A). The expressions for the other two coefficients follow by
carefully thinking about the order in λ of the terms in det(A − λ1), by using the general expression (5.11)
for the determinant. Terms of order λn and λn−1 only receive contributions from the product of the
diagonal elements, so that
n n
!
Y X
n−2 n n n−1
χA (λ) = (Aii − λ) + O(λ ) = (−1) λ + (−1) Aii λn−1 + O(λn−2 ) .
i=1 i=1
115
The above Lemma shows that the constant term in the characteristic polynomial equals det(A) and that
this is basis-independent. Of course, we have already shown the basis-independence of the determinant in
Section (5.2). However, we do gain some new insight from the basis-independence of the coefficient cn−1
in the characteristic polynomial. We define the trace of a matrix A by
n
X
tr(A) := Aii , (7.8)
i=1
that is, by the sum of its diagonal entries. Since cn−1 = (−1)n−1 tr(A) it follows that the trace is basis-
independent. This can also be seen more directly. First, note that
X X
tr(AB) = Aij Bji = Bji Aij = tr(BA) , (7.9)
i,j i,j
so matrices inside a trace can be commuted without changing the value of the trace. Hence,
P = (v1 , . . . , vn ) (7.11)
P −1 AP = diag(λ1 , . . . , λn ) , (7.12)
116
Proof. “⇐”: We assume that we have a basis v1 , . . . , vn of eigenvectors with eigenvalues λi so that
Avi = λi vi . Define the matrix P = (v1 , . . . , vn ) whose columns are the eigenvectors of A. Since the
eigenvectors form a basis of F n the matrix P is invertible. Then
Âei = λi ei =⇒ P −1 A P ei = λi ei =⇒ Avi = λi vi ,
|{z}
=vi
The requirement which is easily overlooked in the previous lemmas is that we are asking for a basis of
eigenvectors. Once we have found all the eigenvectors of a linear map they might or might not form a
basis of the underlying vector space. Only when they do can the linear map be diagonalized.
so, in this case, the determinant is the product of the eigenvalues and the trace is their sum.
contains the three eigenvectors v1 , v2 , v3 from Example 7.1 as its columns. Note that these three columns
form an ortho-normal system with respect to the dot product so the above matrix P is orthogonal. This
means that its inverse is easily computed from P −1 = P T . With the matrix A from Eq. (7.6) it can then
be checked explicitly that
P T AP = diag(0, 1, 3) .
Note that the eigenvalues of A appear on the diagonal. It is not an accident that the eigenvectors of A
are pairwise orthogonal and, as we will see shortly, this is related to A being a symmetric matrix.
(b) Consider the 2 × 2 matrix
0 1
A=
0 0
117
whose characteristic polynomial is
−λ 1
χA (λ) = det = λ2 .
0 −λ
Hence, there is only one eigenvalue, λ = 0. The associated eigenvectors are found by solving
0 1 x y !
= = 0 ⇐⇒ y = 0
0 0 y 0
so the eigenvalue is non-degenerate with eigenvectors proportional to (1, 0)T . This amounts to only one
eigenvector (up to re-scaling) so this matrix does not have a basis of eigenvectors (which requires two
linearly independent vectors in R2 ) and cannot be diagonalized.
(c) Our next example is for the matrix
0 1
A=
−1 0
with characteristic polynomial
−λ 1
χA (λ) = det = λ2 + 1 .
−1 −λ
At this point we have to be a bit more specific about the underlying vector space. If the vector space is
R2 , we have to work with real numbers and there are no eigenvalues since the characteristic polynomial
has no real zeros. Hence, in this case, the matrix cannot be diagonalized. On the other hand, for C2 and
complex scalars, there are two eigenvalues, λ± = ±i. The corresponding eigenvectors v = (x, y)T are:
λ+ = i
−i 1 x −ix + y
(A − i12 )v = A =
!
= = 0 ⇐⇒ y = ix .
−1 −i y −x − iy
The eigenvalue is non-degenerate and, as in Example (7.1) it is useful to normalize the eigenvector.
However, since we are working over the complex numbers, we should be using the standard hermitian
scalar product and demand that v† v = 1. Then
1 1
v+ = √ .
2 i
λ− = −i
i 1 x ix + y
(A + i12 )v = A =
!
= =0 ⇐⇒ y = −ix .
−1 i y −x + iy
Again, this eigenvalue is non-degenerate with corresponding normalized eigenvector
1 1
v+ = √ .
2 −i
The diagonalizing basis transformation is
1 1 1
P = (v+ , v− ) = √ ,
2 i −i
and its column vectors form an ortho-normal system (under the standard hermitian scalar product on
C2 ). Therefore, P is a unitary matrix and P −1 = P † . Again, the orthogonality of the eigenvectors is not
118
an accident and is related to the matrix A being anti-symmetric. To check these results we verify that
indeed
P † AP = diag(i, −i) .
so that it is properly normalized with respect to the C2 standard scalar product, v1† v1 = 1.
λ2 = −1
2 2i x
(A + 13 )v =
!
= 0 ⇐⇒ x = −iy .
−2i 2 y
This eigenvalue is also non-degenerate and the normalized eigenvector, satisfying v2† v2 = 1, can be chosen
as
1 −i
v2 = √
2 1
Note also that the two eigenvectors are orthogonal, v1† v2 = 0. Consequently, the diagonalizing matrix
1 i −i
U = (v1 , v2 ) = √
2 1 1
While Lemma 7.3 provides a general criterion for a matrix to be diagonalizable it requires calculation
of all the eigenvectors and checking whether they form a basis. It would be helpful to have a simpler
condition, at least for some classes of matrices, which can simply be “read off” from the matrix. To this
end we prove
Theorem 7.1. Let V be a vector space over R (C) with real (hermitian) scalar product h·, ·i. If f : V → V
is self-adjoint then
(i) All eigenvalues of f are real.
(ii) Eigenvectors for different eigenvalues are orthogonal.
119
Proof. (i) For the real case, the first part of the statement is of course trivial. For the complex case, we
start with an eigenvector v 6= 0 of f with corresponding eigenvalue λ, so that f (v) = λv. Then
In the third step we have used the fact that f is self-adjoint and can, hence, be moved from one argument
of the scalar product into the other. Since v 6= 0 and, hence, hv, vi = 6 0 it follows that λ = λ∗ , so the
eigenvalue is real.
(ii) Consider two eigenvectors v1 , v2 , so that f (v1 ) = λ1 v1 and f (v2 ) = λ2 v2 , with different eigenvalues,
λ1 6= λ2 . Then
(λ1 − λ2 )hv1 , v2 i = hλ1 v1 , v2 i − hv1 , λ2 v2 i = hf (v1 ), v2 i − hv1 , f (v2 )i = hv1 , f (v2 )i − hv1 , f (v2 )i = 0 .
Since λ1 − λ2 6= 0 this means hv1 , v2 i = 0 and the two eigenvectors are orthogonal.
Theorem 7.2. Let V be a vector space over C with hermitian scalar product h·, ·i. If f : V → V is
self-adjoint it has an ortho-normal basis, 1 , . . . , n of eigenvectors.
Proof. The proof is by induction in n, the dimension of the vector space V . For n = 1 the assertion is
trivial. Assume that it is true for all dimensions k < n. We would like to show that it is true for dimension
n. The characteristic polynomial χf of f has at least one zero, λ, over the complex numbers. Since f is
self-adjoint, λ is real from the previous theorem. Consider the eigenspace W = Eigf (λ). Since λ is an
eigenvalue, dim(W ) > 0. Vectors v ∈ W ⊥ and w ∈ W are perpendicular, hw, vi = 0, so
This means that f (v) is perpendicular to w so that, whenever v ∈ W ⊥ , then also f (v) ∈ W ⊥ . As a result,
W ⊥ is invariant under f and we can restrict f to W ⊥ , that is, consider g = f |W ⊥ . Since dim W ⊥ < n, there
is an ortho-normal basis 1 , . . . , k of W ⊥ consisting of eigenvectors of g (which are also eigenvectors of f )
by the induction assumption. Add to this ortho-normal basis of W ⊥ an ortho-normal basis of W (which,
by definition of W , consists of eigenvectors of f with eigenvalue λ). Since dim(W ) + dim(W ⊥ ) = n
(see Lemma 6.3) and pairwise orthogonal vector are linearly independent this list of vectors forms an
ortho-normal basis of V , consisting of eigenvectors of f .
In summary, these results mean that every real symmetric (hermitian) matrix can be diagonalized, has
an ortho-normal basis 1 , . . . , n of eigenvectors with corresponding real eigenvalues λ1 , . . . , λn and the
diagonalizing matrix P = (1 , . . . , n ) is orthogonal (unitary), such that
P † AP = diag(λ1 , . . . , λn ) . (7.15)
How can this ortho-normal basis of eigenvectors be found? First, the eigenvalues and eigenvectors have to
be computed in the usual way, as outlined above. From Theorem 7.1 eigenvectors for different eigenvalues
are orthogonal so if all eigenvalues are non-degenerate then the eigenvectors will be automatically pairwise
orthogonal. What remains to be done in order to obtain an ortho-normal system is simply to normalize
the eigenvectors. This is what has happened in Example 7.1 (and its continuation, Example 7.2 (a)) where
all eigenvalues were indeed non-degenerate.
The situation is slightly more involved in the presence of degenerate eigenvalues. Of course eigenvectors
for different eigenvalues are still automatically orthogonal. However, for a degenerate eigenvalue we have
two or more linearly independent eigenvectors which are not guaranteed to be orthogonal. The point is
that we can choose such eigenvectors to be orthogonal. To see how this works it is useful to think about
120
the eigenspaces, EigA (λ), of the hermitian matrix A. Eigenspaces for different eigenvalues are of course
orthogonal to one another (meaning that all vectors of one eigenspace are orthogonal to all vectors of
the other), as a consequence of Theorem 7.1. For each eigenspace, we can find a basis of eigenvectors
and, applying the Gram-Schmidt procedure to this basis, we can convert this into an ortho-normal basis.
If the eigenvector is non-degenerate, so that dim EigA (λ) = 1, this simply means normalizing the single
basis vectors. For degenerate eigenvalues, when dim EigA (λ) > 1, we have to follow the full Gram-Schmidt
procedure as explained in Section 6.2. Combining the ortho-normal sets of basis vectors for each eigenspace
into one list then gives the full basis of ortho-normal eigenvectors. To see how this works explicitly let us
discuss a more complicated example with a degenerate eigenvalue.
Hence, there are two eigenvalues, λ1 = 2 and λ2 = −1. For the eigenvectors v = (x, y, z)T we find:
λ1 = 2:
√ √ √ √
−2
√ 2 2 x −2x
√ + 2y + 2z
3 3 x
(A − 213 )v = √2 −3 1 y = √2x − 3y + z
=! 0 =⇒ y=z= √ .
4 4 2
2 1 −3 z 2x + y − 3z
λ1 = −1:
√ √ √ √
3 √2 2 2 x
3
2x√+ 2y + 2z √
(A + 13 )v = √2
!
1 1 y = √2x + y + z = 0 =⇒ z = − 2x − y .
4 4
2 1 1 z 2x + y + z
Since we have found only one condition on x, y, z there are two linearly independent eigenvectors, so
this eigenvalue has degeneracy 2. Obvious choices for the two eigenvectors are obtained by setting x = 1,
y = 0 and x = 0, y = 1, so
1 0
v2 = √ 0 , v3 = 1 .
− 2 −1
Both of these vectors are orthogonal to 1 above, as they must be, but they are not orthogonal to one
another. However, they do form a basis of the two-dimensional eigenspace EigA (−1) = Span(v2 , v3 ) so
121
that every linear combination of these vectors is also an eigenvector for the same eigenvalue −1. With
this in mind we apply the Gram-Schmidt procedure to v2 and v3 . First normalizing v2 leads to
1
v2 1
2 = =√ √ 0 .
|v2 | 3 − 2
Definition 7.4. Let V be a vector space over C with hermitian scalar product h·, ·, i. A linear map
f : V → V is called normal if f ◦ f † = f † ◦ f (or, equivalently, iff the commutator of f and f † vanishes,
that is, [f, f † ] := f ◦ f † − f † ◦ f = 0).
Recall from Section 6.3 that the adjoint map, f † , for a linear map f : V → V is defined relative to a
scalar product on V . Clearly hermitian and unitary linear maps are normal (since f = f † if f is hermitian
and f † ◦ f = f ◦ f † = id if f is unitary), as are anti-hermitian maps, that is, maps satisfying f = −f † .
If we consider the vector space V = Cn over C with the standard hermitian scalar product then (anti-)
hermitian and unitary maps simply correspond to (anti-) hermitian and unitary matrices and we learn
that these classes of matrices are normal.
A useful statement for normal linear maps is
Lemma 7.4. Let V be a vector space over C with hermitian scalar product h·, ·, i and f : V → V be a
normal linear map. If λ is an eigenvalue of f with eigenvector v then λ∗ is an eigenvalue of f † for the
same eigenvector v.
Proof. First, we show that the map g = f − λ id is also normal. This follows from the straightforward
calculation
122
Note that some of the properties of the adjoint map in Lemma 6.4 have been used in this calculation.
Now consider an eigenvalue λ of f with eigenvector v, so that f (v) = λv or, equivalently, g(v) = 0. Then
we have
0 = hgv, gvi = hv, g † ◦ gvi = hv, g ◦ g † vi = hg † v, g † vi ,
and, from the positivity property of the scalar product, (S3) in Def. 6.1, it follows that g † (v) = 0. Since
g † = f − λ∗ id this, in turn means that f † (v) = λ∗ v. Hence, λ∗ is indeed an eigenvalue of f † with
eigenvector v.
Theorem 7.3. Let V be a vector space over C with hermitian scalar product h·, ·i and f : V → V a linear
map. Then we have: f is normal ⇐⇒ f has an ortho-normal basis of eigenvectors
Proof. “⇐”: Start with an ortho-normal basis 1 , . . . , n of eigenvector of f , so that hi , j i = δij and
f (i ) = λi i . Then
hj , f † (i )i = hf (j ), i i = λ∗i δij
which holds for all j and, hence, implies that f † (i ) = λ∗i i . From this result we have
where the above Lemma has been used in the first line. This means that f (W ) ⊂ W and f † (W ) ⊂ W , so
that sub vector space W is invariant under both f and f † . This implies immediately that the restriction,
f |W of f to W is normal as well. Since dim(W ) = n − 1, the induction assumption can be applied and
we conclude that f |W has an ortho-normal basis of eigenvectors. Combining this basis with v/|v| gives
an ortho-normal basis of eigenvectors for f .
As we have seen, unitary maps are normal so the theorem implies that they have an ortho-normal
basis of eigenvectors. Focusing on V = Cn with the standard hermitian scalar product this means that
unitary matrices can be diagonalised. The eigenvalues of unitary maps are constrained by the following
Lemma 7.5. Let V be a vector space over C with hermitian scalar product h·, ·i and U : V → V a unitary
map. If λ is an eigenvalue of U then |λ| = 1.
Proof. Let λ be an eigenvalue of U with eigenvector v, so that U v = λv. From unitarity of U it follows
that
|λ|2 hv, vi = hλv, λvi = hU v, U vi = hv, vi ,
and, dividing by hv, vi (which must be non-zero since v is an eigenvector), gives |λ|2 = 1.
123
Combining these statements we learn that every unitary matrix U can be diagonalised, by means of a
unitary coordinate transformation P , such that P † U P = diag(eiφ1 , . . . , eiφn ). Since orthogonal matrices
are also unitary they can be diagonalised in the same way, provided we are working over the complex
numbers . In fact, we have already seen this explicitly in Example 7.2 (c) where the matrix A is a specific
two-dimensional rotation matrix.
1 0T
R̃ = , (7.18)
0 R2
where R2 is a 2 × 2 matrix. However, R̃ is also a rotation and, hence, needs to satisfy R̃T R̃ = 13
and det(R̃) = 1. This immediately implies that R2T R2 = 12 and det(R2 ) = 1, so that R2 must be a
two-dimensional rotation, R2 = R(θ), of the form given in Eq. (6.40).
In summary, we learn that for every three-dimensional rotation R we can find an orthonormal basis
(where the first basis vector is the axis of rotation) where it takes the form
T
1 0 0
1 0
R̃ = = 0 cos(θ) − sin(θ) . (7.19)
0 R(θ)
0 sin(θ) cos(θ)
The angle θ which appears in this parametrisation is called the angle of rotation. Basis-independence of
the trace means that tr(R) = tr(R̃) = 1 + 2 cos(θ) which leads to the interesting and useful formula
1
cos(θ) = (tr(R) − 1) (7.20)
2
124
for the angle of rotation of a rotation matrix R. This formula allows for an easy computation of the angle
of rotation, even if the rotation matrix is not in the simple form (7.19). The axis of rotation n, on the
other hand, can be found as the eigenvector for eigenvalue one, that is, by solving Eq. (7.16).
For example, consider the matrix
√
2 √−1 −1 √
1
R = √0 2 − 2 . (7.21)
2
2 1 1
It is easy to verify that RT R = 13 and det(R) = 1 so this is indeed a rotation. By solving Eq. (7.16) for
this matrix (and normalising the eigenvector) we find for the axis of rotation
1 √ T
n= p √ (1, −1, 2 − 1) . (7.22)
5−2 2
√
Also, we have tr(R) = 2 + 1/2, so from Eq. (7.20) the angle of rotation satisfies
1 √
cos(θ) = (2 2 − 1) . (7.23)
4
125
Example 7.5: Simultaneous diagonalization
Can the three matrices
2 −1 3 2 1 1
A= , B= , C=
−1 2 2 3 1 2
7.6 Applications
Eigenvectors and eigenvalues have a wide range of applications, both in mathematics and in physics. Here,
we discuss a small selection of those applications.
d2 q
= −M q , (7.25)
dt2
where M is a real symmetric n × n matrix. In a physical context, this differential equation might describe
a system of mass points connected by springs. The practical problem in solving this equation is that, for
a non-diagonal matrix M , its various components are coupled. However, this coupling can be removed by
diagonalizing the matrix M . To this end, we consider an orthogonal matrix P such that P T M P = M̂ =
diag(m1 , . . . , mn ) and introduce new coordinates Q by setting q = P Q. By multiplying Eq. (7.25) with
P T this leads to
d2 Q T d2 Qi
= −P M P} Q or = −mi Qi for i = 1, . . . , n . (7.26)
dt2 | {z dt2
=M̂
In terms of Q the system decouples and the solutions can easily be written down as
ai sin(wi t) + bi cos(wi t) for mi > 0 p
Qi (t) = ai ewi t + bi e−wi t for mi < 0 where wi = |mi | , (7.27)
ai t + bi for mi = 0
and ai , bi are arbitrary constants. In terms of the original coordinates, the solution is then obtained by
inserting Eq. (7.27) into q = P Q. One interesting observation is that the nature of the solution depends
on the signs of the eigenvalues mi of the matrix M . For a positive eigenvalue, the solution is oscillatory, for
a negative one exponential and for a vanishing one linear. Physically, a negative or vanishing eigenvalue
mi indicates an instability. In this case, the corresponding Qi (t) becomes large at late times (except for
special choices of the constants ai , bi ). The lesson is that stability of the system can be analyzed simply
by looking at the eigenvalues of M . If they are all positive, the system is fully oscillatory and stable, if
there are vanishing or negative eigenvalues the system generically “runs away” in some directions.
126
Example 7.6: As an explicit example, consider the differential equations
d2 q1
= −q1 + q2
dt2
d2 q2
= q1 − q2 + q3
dt2
d2 q3
= q2 − q3 .
dt2
This system is indeed of the general form (7.25) with
1 −1 0
M = −1 2 −1 .
0 −1 1
This is the same matrix we have studied in Example 7.1 and it has eigenvalues m1 = 0, m2 = 1 and
m3 = 3. Due to the zero eigenvalue this system has a linear instability in one direction. Inserting into
Eq. (7.27), the explicit solution reads
a1 t + b1
Q(t) = a2 sin(t)
√ + b2 cos(t)√
(7.28)
a3 sin( 3t) + b3 cos( 3t)
In terms of the original coordinates q, the solution is obtained by inserting (7.28) into q = P Q using the
diagonalizing matrix P given in Eq. (7.14).
g(x) = a0 + a1 x + a2 x2 + · · · . (7.29)
g(A) = a0 1n + a1 A + a2 A2 + · · · , (7.30)
that is, by simply “replacing” x with A in the power series expansion. Note that, convergence assumed,
the RHS of Eq. (7.30) is well-defined via addition and multiplication of matrices and the function “value”
g(A) is a matrix of the same size as A.
Since the exponential series converges for all real (and complex) x it can be shown that the matrix
exponential converges for all matrices.
127
Computing the function of a non-diagonal matrix can be complicated as it involves computing higher
and higher powers of the matrix A. However, it is easily accomplished for a diagonal matrix  =
diag(a1 , . . . , an ) since Âk = diag(ak1 , . . . , akn ) so that
for a function g. This suggest that we might be able to compute the function of a more general matrix
by diagonalizing and then applying Eq. (7.32). To do this, we first observe that computing the function
of a matrix “commutes” with a change of basis. Indeed from
(P −1 AP )k = P −1 A |P P −1 −1 −1 k
{z } AP · · · P AP = P A P
=1
it follows that
g(P −1 AP ) = P −1 g(A)P . (7.33)
Now suppose that A can be diagonalized and P −1 AP = Â = diag(λ1 , . . . , λn ). Then
That is, we can compute the function of the matrix A by first forming the diagonal matrix which contains
the function values of the eigenvalues and then transforming this matrix back to the original basis. Let
us see how this works explicitly.
which we have already diagonalized in Example 7.2 (d). Recall that the eigenvalues of this matrix are 3,
−1 and the diagonalizing basis transformation is given by
1 i −i
U=√
2 1 1
so that U † AU = diag(3, −1). We would like to calculate g(A) for the function g(x) = xn , where n is an
arbitrary integer. Then, from Eq. (7.34), we have
(−1)n + 3n −i((−1)n − 3n )
n n † 1
g(A) = U diag(3 , (−1) )U = .
2 i((−1)n − 3n ) (−1)n + 3n
where θ is an arbitrary real number. Apart from the overall θ factor (which does not affect the eigenvectors
and multiplies the eigenvalues) this is the matrix we have studied in Example 7.2 (c). Hence, we know
that the eigenvalues are ±iθ and the diagonalizing basis transformation is
1 1 1
P =√ ,
2 i −i
128
so that P † AP = diag(iθ, −iθ). From Eq. (7.34) we, therefore, find for the matrix exponential of A
iθ
A iθ −iθ † 1 1 1 e 0 1 −i cos θ sin θ
e = P diag(e , e )P = = .
2 i −i 0 e−iθ 1 i − sin θ cos θ
It is not an accident that this comes out as a two-dimensional rotation. The theory of Lie groups states
that rotations (and special unitary matrices) in all dimensions can be obtained as matrix exponentials
of certain, relative simple matrices, such as A in the present example. This fact is particularly useful in
higher dimensions when the rotation matrices are not so easily written down explicitly. To explain this in
detail is well beyond the scope of this lecture.
Sometimes functions of matrices can be computed more straightforwardly without resorting to diago-
nalizing the matrix. This is usually possible when the matrix in question is relatively simple so that its
powers can be computed explicitly. Indeed, this works for the present example and leads to an alternative
calculation of the matrix exponential. To carry this out we first observe that A2 = −θ2 12 and, hence,
A2n = (−1)n θ2n 12 , A2n+1 = (−1)n θ2n+1 T .
With these results it is straightforward to work out the matrix exponential explicitly.
∞ ∞ ∞
X 1 n X 1 X 1
eA = A = A2n + A2n+1
n! (2n)! (2n + 1)!
n=0 n=0 n=0
∞ ∞
(−1)n θ2n X (−1)n θ2n+1
12 + T = cos(θ)12 + sin(θ)T .
X
=
(2n)! (2n + 1)!
n=0 n=0
129
Remembering that σi† = σi , it is easy to verify that
and, hence, the matrix exponentials U are unitary. Writing U out explicitly, using the Pauli matrices (7.36)
and Eq. (7.40), gives
cos θ + in3 sin θ (n2 + in1 ) sin θ
U= (7.42)
−(n2 − in1 ) sin θ cos θ − in3 sin θ
This shows that det(U ) = |n|2 = 1 so that U is, in fact, special unitary. It turns out that all 2 × 2 special
unitary matrices can be obtained as matrix exponentials of Pauli matrices in this way, another example of
the general statement from the theory of Lie groups mentioned earlier. Indeed, the matrix (7.42) can be
converted into our earlier general form for SU (2) matrices in Eq. (6.59) by setting α = cos θ + in3 sin θ and
β = (n2 + in1 ) sin θ. In mathematical parlance, the vector space L of 2 × 2 hermitian, traceless matrices
is referred to as the Lie algebra of the 2 × 2 special unitary matrices SU (2).
where c is an arbitrary n-dimensional vector. Note that, given our definition of the matrix exponential,
Eq. (7.45) makes perfect sense. But does it really solve the differential equation (7.44)? We verify this
by simply inserting Eq. (7.45) into the differential equation (7.44), using the definition of the matrix
exponential.
∞ ∞ ∞
dx d d X 1 n n X 1 X 1
= eAt c = A t c= An tn−1 c = A An tn c = Ax (7.46)
dt dt dt n! (n − 1)! n!
n=0 n=1 n=0
130
7.6.3 Quadratic forms
A quadratic form in the (real) variables x = (x1 , . . . , xn )T is an expression of the form
n
X
q(x) := Qij xi xj = xT Qx . (7.47)
i,j=1
where Q is a real symmetric n × n matrix with entries Qij . We have already encountered examples of such
quadratic forms in Eq. (6.84) and the comparison shows that they can be viewed as symmetric bi-linear
forms on Rn . Our present task it to simplify the quadratic form by diagonalizing the matrix Q. With the
diagonalizing basis transformation P and P T QP = Q̂ = diag(λ1 , . . . , λn ) and new coordinates defined by
x = P y we have
Xn
T T T
q(x) = x P Q̂P x = y Q̂y = λi yi2 . (7.48)
i=1
Hence, in the new coordinates y the cross terms in the quadratic form have been removed and only the
pure square terms, yi2 , are present. Note that they are multiplied by the eigenvalues of the matrix Q.
where ω = (ω1 , ω2 , ω3 )T is the angular velocity and I is the moment of inertia tensor of the rigid body.
Clearly, this is a quadratic form and by diagonalizing the moment of inertia tensor, P IP T = diag(I1 , I2 , I3 )
and introducing Ω = P ω we can write
3
1X
Ekin = Ii Ω2i . (7.50)
2
i=1
This simplification of the kinetic energy is an important step in understanding the dynamics of rigid
bodies.
Quadratic forms can be used to define quadratic curves (in two dimensions) or quadratic surfaces (in three
dimensions) by the set of all points x satisfying
xT Qx = c , (7.51)
with a real constant c. By diagonalizing the quadratic form, as in Eq. (7.48), the nature of the quadratic
curve or surface can be immediately read off from the eigenvalues λi of Q as indicated in the table below.
131
In terms of the coordinates y = P x which diagonalize the quadratic form as in Eq. (7.48), the curve or
surface defined by Eq. (7.51) can be written as
X
λi yi2 = c . (7.52)
i
Focus on the case of an ellipse or ellipsoid. The standard form of the equation defining an ellipse or
ellipsoid is given by
X y2
i
l 2 =1, (7.53)
i i
where li can be interpreted as the lengths of the semi-axes. By comparison with Eq. (7.52) we see that
these lengths can be computed from the eigenvalues of the matrix Q by
r
c
li = . (7.54)
λi
In the basis with coordinates y the semi-axes are in the directions of the standard unit vectors ei . Hence,
in the original basis with coordinates x the semi-axis are in the directions vi = P ei , that is, in the
directions of the eigenvectors vi of Q.
132
Literature
A large number of textbooks on the subject can be found, varying in style from “Vectors and Matrices for
Dummies” to hugely abstract treaties. I suggest a trip to the library in order to pick one or two books in
the middle ground that you feel comfortable with. Below is a small selection which have proved useful in
preparing the course.
• Mathematical Methods for Physics and Engineering, K. F. Riley, M. P. Hobson and S. J. Bence,
CUP 2002.
This is the recommended book for the first year physics course which covers vectors and matrices
and much of the other basic mathematics required. As the title suggests it is a “hands-on” book,
strong on explaining methods and concrete applications, rather weaker on presenting a coherent
mathematical exposition.
133
A Definition of groups and fields
Definition A.1. (Definition of a group) A group G is a set with an operation
·:G×G→G, (g, h) → g · h “group multiplication”
satisfying:
(G1) g · (h · k) = (g · h) · k for all g, h, k ∈ G. “associativity”
(G2) There exists a 1 ∈ G such that 1 · g = g for all g ∈ G. “neutral element”
(G3) For all g ∈ G, there exists a g −1 ∈ G, such that g −1 · g = 1. “inverse”
The group is called Abelian if in addition
(G4) g · h = h · g for all g, h ∈ G. “commutativity”
Standard examples of fields are the rational numbers, Q, the real numbers, R and the complex numbers,
C. Somewhat more exotic examples are the finite fields Fp = {0, 1, . . . , p − 1}, where p is a prime number
and addition and multiplication are defined by regular addition and multiplication of integers modulo p,
that is, by the remainder of a division by p. Hence, whenever the result of an addition or multiplication
exceeds p − 1 it is “brought back” into the range {0, 1, . . . , p − 1} by subtracting a suitable multiple of p.
The smallest field is F2 = {0, 1} containing just the neutral elements of addition and multiplication which
must exist in every field.
Clearly, this set has n! elements and it forms a group in the sense of Def. A.1, with composition of maps
(which is associative) as the group operation, the identity map as the neutral element and the inverse map
σ −1 ∈ Sn as the group inverse for σ ∈ Sn . Permutations are also sometimes written as
1 2 ··· n
σ= , (B.2)
σ(1) σ(2) · · · σ(n)
so as a 2 × n array of numbers (not a matrix in the sense of linear algebra), indicating that a number in
the first row is permuted into the number in the second row underneath. The permutation group S2 has
only two elements
1 2 1 2
S2 = , , (B.3)
1 2 2 1
134
the identity element and the permutation which swaps 1 and 2. Clearly, S2 is Abelian (that is, map
composition commutes) but this is no longer true for n > 2. For example, in S3 , the two permutations
1 2 3 1 2 3
σ1 = , σ2 = (B.4)
1 3 2 2 1 3
do not commute since
1 2 3 1 2 3
σ1 ◦ σ2 = but σ2 ◦ σ1 = . (B.5)
3 1 2 2 3 1
The special permutations which swap two numbers and leave all other numbers unchanged are called
transpositions.
Lemma B.1. Every permutation σ ∈ Sn (for n > 1) can be written as σ = τ1 ◦ · · · ◦ τk , where τi are
transpositions.
Proof. Suppose that σ maps the first k1 −1 ≥ 0 numbers into themselves, so σ(i) = i for all i = 1, . . . , k1 −1
and this is the maximal such number, so that σ(k1 ) 6= σ(k1 ) and, indeed, σ(k1 ) > k1 . Then define τ1 as
the transposition which swaps k1 and σ(k1 ). The permutation σ1 = τ1 ◦ σ then leaves the first k2 − 1
numbers unchanged and, crucially, k2 > k1 . We can continue this process until, after at most n steps,
id = τk ◦ · · · τ1 ◦ σ. Since transpositions are their own inverse it follows that σ = τ1 ◦ · · · ◦ τk .
We would like to distinguish between even and odd permutations. Formally, this is achieved by the
following
Definition B.1. The sign of a permutation σ ∈ Sn is defined as
Y σ(i) − σ(j)
sgn(σ) = . (B.6)
i−j
i<j
135
C Tensors for the curious
Tensors are part of more advanced linear algebra and for this reason they are often not considered in an
introductory text. However, in many physics courses there is no room to return to the subject at a later
time and, as a result, tensors are often not taught at all. To many physicists, they remain mysterious,
despite their numerous applications in physics. If you do not want to remain perplexed, this appendix is
for you. It provides a short, no-nonsense introduction into tensors, starting where the main text has left
off.
We start with a vector space V over a field F with basis 1 , . . . , n and its dual V ∗ . Recall, that V ∗ is the
vector space of linear functionals V → F . From Section 6.5, we know that V ∗ has a dual basis 1∗ , . . . , n∗
satisfying
i∗ (j ) = δji . (C.1)
In particular, V and V ∗ have the same dimension. An obvious, but somewhat abstract problem which we
need to clarify first has to do with the “double-dual” of a vector space. In other words, what is the dual,
V ∗ ∗ of the dual vectors space V ∗ ? Our chosen terminology suggests the double-dual V ∗ ∗ should be the
original vector space V . This is indeed the case, in the sense of the following
Lemma C.1. The linear map : V → V ∗ ∗ defined by (v)(ϕ) := ϕ(v) is bijective, that is, it is an
isomorphism between V and V ∗ ∗ .
Proof. From Lemma 3.1 and since dim(V ) = dim(V ∗ ) = dim(V ∗ ∗ ) all we need to show is that Ker() =
{0}. Start with a vector v = v i i ∈ Ker(). Then, for all ϕ ∈ V ∗ , we have 0 = (v)(ϕ) = ϕ(v). Choose
ϕ = j∗ and it follows that 0 = j∗ (v) = v j . Hence, all component v j vanish and v = 0.
Note that the definition of the above map does not depend on a choice of basis. For this reason it is
also referred to as a canonical isomorphism between V or V ∗ ∗ . We should think of V and V ∗ ∗ as the
same space by identifying vectors v ∈ V with their images (v) ∈ V ∗ ∗ under . Since V ∗ ∗ ∼
= V consists
of linear functionals on V ∗ this means that the relation between V and V ∗ is “symmetric”. Not only can
elements ϕ ∈ V ∗ act on vectors v ∈ V but also the converse works. This is the essence of the relation
(v)(ϕ) := ϕ(v), defining the map , which, by abuse of notation, is often written as
Having put the vector space and its dual on equal footing we can now proceed to define tensors. We
consider two vector spaces V and W over F and define the tensor space
V ∗ ⊗ W ∗ := {τ : V × W → F | τ bi-linear} . (C.3)
In other words, the tensor space V ∗ ⊗W ∗ consists of all maps τ which assign to their two vector arguments
v ∈ V and w ∈ W a number τ (v, w) and are linear in each argument. Note that we can think of this
as a generalization of a linear functional. While a linear functional assign a number to a single vector
argument, the tensor τ does the same for two vector arguments. This suggests that tensors might be
“built up” from functionals. To this end, we introduce the tensor product ϕ ⊗ ψ between two functionals
ϕ ∈ V ∗ and ψ ∈ W ∗ by
(ϕ ⊗ ψ)(v, w) := ϕ(v)ψ(w) . (C.4)
Clearly, the so-defined map ϕ ⊗ ψ is an element of the tensor space V ∗ ⊗ W ∗ since it takes two vector
arguments and is linear in each of them (since ϕ and ψ are linear in their respective arguments).
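In terms of components with respect to chosen bases, the tensor product is nothing but the outer product of the component arrays. A minimal NumPy sketch, with arbitrary illustrative numbers, verifies the defining relation (C.4):

import numpy as np

# Illustrative components of functionals phi in V* (dim V = 3) and psi in W* (dim W = 2)
phi = np.array([1.0, 2.0, 0.0])
psi = np.array([3.0, -1.0])

# Components of the tensor product: (phi ⊗ psi)_{ia} = phi_i psi_a, the outer product
T = np.outer(phi, psi)

# Evaluating on a pair of vectors reproduces phi(v) * psi(w), as in (C.4)
v = np.array([1.0, 1.0, 2.0])
w = np.array([0.5, 4.0])
assert np.isclose(np.einsum('ia,i,a->', T, v, w), (phi @ v) * (psi @ w))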
Can we get all tensors from tensor products? In a certain sense, the answer is “yes” as explained in the
following
Lemma C.2. For a basis {ε^{i*}}, where i = 1, . . . , n, of V^* and a basis {ε̃^{a*}}, where a = 1, . . . , m, of W^*,
the tensor products {ε^{i*} ⊗ ε̃^{a*}} form a basis of V^* ⊗ W^*. In particular, dim(V^* ⊗ W^*) = dim(V^*) dim(W^*).
Proof. We introduce the dual basis {ε_j} on V and {ε̃_b} on W so that ε^{i*}(ε_j) = δ^i_j and ε̃^{a*}(ε̃_b) = δ^a_b.
As usual, we need to prove that the tensors ε^{i*} ⊗ ε̃^{a*} are linearly independent and span the tensor space
V^* ⊗ W^*. We begin with linear independence.
Assume a vanishing linear combination
\[
\sum_{i,a} \tau_{ia}\, \epsilon^{i*} \otimes \tilde{\epsilon}^{a*} = 0 \; .
\]
Acting with this equation on the vector pair (ε_j, ε̃_b) gives
\[
0 = \sum_{i,a} \tau_{ia}\, \bigl(\epsilon^{i*} \otimes \tilde{\epsilon}^{a*}\bigr)(\epsilon_j, \tilde{\epsilon}_b)
  = \sum_{i,a} \tau_{ia}\, \epsilon^{i*}(\epsilon_j)\, \tilde{\epsilon}^{a*}(\tilde{\epsilon}_b)
  = \sum_{i,a} \tau_{ia}\, \delta^i_j \delta^a_b = \tau_{jb} \; ,
\]
so all coefficients vanish and the tensors ε^{i*} ⊗ ε̃^{a*} are linearly independent. For the span property,
start with an arbitrary tensor τ ∈ V^* ⊗ W^*, set τ_{ia} := τ(ε_i, ε̃_a) and define μ := Σ_{i,a} τ_{ia} ε^{i*} ⊗ ε̃^{a*}.
Since μ(ε_j, ε̃_b) = τ_{jb} = τ(ε_j, ε̃_b), the tensors μ and τ coincide on a basis and, hence, μ = τ. We have,
therefore, written the arbitrary tensor τ as a linear combination of the ε^{j*} ⊗ ε̃^{b*}.
The above Lemma provides us with a simple way to think about the tensors in V^* ⊗ W^*. Given a
basis {ε_j} on V and {ε̃_b} on W, the tensors in V^* ⊗ W^* are given by
\[
\tau = \sum_{i,a} \tau_{ia}\, \epsilon^{i*} \otimes \tilde{\epsilon}^{a*} \; , \qquad (C.5)
\]
where τ_{ia} ∈ F are arbitrary coefficients. Often, the basis elements are omitted and the set of components
τ_{ia}, labelled by two indices, is referred to as the tensor. This can be viewed as a generalization of vectors,
whose components are labelled by one index.
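The proof of Lemma C.2 also tells us how to read off the components in practice: evaluate the tensor on pairs of basis vectors. A small NumPy sketch, with randomly chosen illustrative components, makes this concrete:

import numpy as np

rng = np.random.default_rng(0)
tau = rng.normal(size=(3, 2))      # components tau_{ia} of a tensor in V* ⊗ W*

def tau_eval(v, w):
    # the bilinear map represented by the component array tau
    return np.einsum('ia,i,a->', tau, v, w)

# Evaluating on pairs of basis vectors recovers the components, exactly as in
# the proof of Lemma C.2 where the coefficients are extracted this way.
e, f = np.eye(3), np.eye(2)
recovered = np.array([[tau_eval(e[i], f[a]) for a in range(2)] for i in range(3)])
assert np.allclose(recovered, tau)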
Tensoring can of course be repeated with multiple vector spaces. A particularly important tensor
space, on which we will focus from hereon, is
\[
\underbrace{V \otimes \cdots \otimes V}_{p} \otimes \underbrace{V^* \otimes \cdots \otimes V^*}_{q}
= \mathrm{Span}\bigl\{ \epsilon_{i_1} \otimes \cdots \otimes \epsilon_{i_p} \otimes \epsilon^{j_1 *} \otimes \cdots \otimes \epsilon^{j_q *} \bigr\} \; , \qquad (C.6)
\]
formed from p factors of the vector space V and q factors of its dual V^*. Its dimension is dim(V)^{p+q}. A
general element of this space can be written as a linear combination
\[
\tau = \sum_{i_1,\dots,i_p,\,j_1,\dots,j_q} \tau^{i_1 \cdots i_p}{}_{j_1 \cdots j_q}\,
\epsilon_{i_1} \otimes \cdots \otimes \epsilon_{i_p} \otimes \epsilon^{j_1 *} \otimes \cdots \otimes \epsilon^{j_q *} \; , \qquad (C.7)
\]
and is also referred to as a (p, q) tensor. It can also be represented by the components τ^{i_1···i_p}_{j_1···j_q}, which carry
p upper and q lower indices, and practical applications are often phrased in those terms. From a (p, q)
tensor τ^{i_1···i_p}_{j_1···j_q} and an (r, s) tensor μ^{k_1···k_r}_{l_1···l_s} we can create a new tensor by multiplication and contraction
(that is, summation) over some (or all) of the upper indices of τ and the lower indices of μ, and vice versa.
Such a summation over matching upper and lower indices is in line with the Einstein summation convention
and corresponds to the action of dual vectors on vectors, as discussed in Section 6.5.
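Numerically, such contractions are exactly what Einstein-summation routines perform. The NumPy sketch below (the component arrays are arbitrary and purely illustrative) contracts a (1, 1) tensor with a (1, 0) tensor, two (1, 1) tensors with each other, and a (1, 1) tensor with itself:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))        # components A^i_j of a (1,1) tensor
B = rng.normal(size=(3, 3))        # components B^j_k of another (1,1) tensor
v = rng.normal(size=3)             # components v^j of a (1,0) tensor

Av = np.einsum('ij,j->i', A, v)    # contraction A^i_j v^j : a (1,0) tensor
AB = np.einsum('ij,jk->ik', A, B)  # contraction A^i_j B^j_k : a (1,1) tensor
trA = np.einsum('ii->', A)         # full self-contraction A^i_i : a scalar

# These are, of course, the familiar matrix-vector product, matrix product and trace:
assert np.allclose(Av, A @ v) and np.allclose(AB, A @ B) and np.isclose(trA, np.trace(A))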
It is probably best to discuss this explicitly for a number of examples. It turns out that many of the
objects we have introduced in the main part of the text can be phrased in the language of tensors.
Example C.1: Examples of tensors
(a) Vectors as tensors
A vector v = v^i ε_i in a vector space V with basis {ε_i} is a (1, 0) tensor and, accordingly, its component
form is v^i, an object with one upper index.
(b) Dual vectors as tensors
Linear functionals ϕ = ϕ_i ε^{i*} in the dual vector space V^* with dual basis {ε^{i*}} are (0, 1) tensors and are,
hence, represented by components ϕ_i with one lower index. As already discussed in Section 6.5, the action
of a linear functional on a vector is given by
\[
\varphi(v) = \varphi_i v^j\, \epsilon^{i*}(\epsilon_j) = \varphi_i v^j\, \delta^i_j = \varphi_i v^i \; , \qquad (C.8)
\]
which corresponds to the contraction of a (1, 0) and a (0, 1) tensor over their single index to produce a
(0, 0) tensor, that is, a scalar.
(c) Linear maps and matrices as tensors
Consider a linear map f : V → V which is represented by a matrix A relative to the basis {ε_i} of V.
Then, from Lemma 3.3, the components A^i_j of A are obtained from the images of the basis vectors via
f(ε_j) = Σ_i A^i_j ε_i. We can re-write this relation as f(ε_j) = A^i_k ε_i ε^{k*}(ε_j) and, stripping off the basis vector
ε_j, we see that f corresponds to the (1, 1) tensor A^i_k ε_i ⊗ ε^{k*}, whose components A^i_k carry one upper and
one lower index.
As we have seen in Lemma 6.6, a non-degenerate bi-linear form leads to an isomorphism ı : V → V^*
between the vector space and its dual and we can use this to define a scalar product ⟨·, ·⟩_* on the dual
vector space V^* by
\[
\langle \varphi, \psi \rangle_* := \langle \imath^{-1}(\varphi), \imath^{-1}(\psi) \rangle \; . \qquad (C.15)
\]
Since the representing matrix for ı is g_ij, its inverse ı^{-1} is represented by g^{ij}. Hence, we can write the
scalar product on the dual vector space as
\[
\langle \varphi, \psi \rangle_* = g^{ij} \varphi_i \psi_j \; . \qquad (C.16)
\]
This shows that the scalar product ⟨·, ·⟩_* on V^* can be viewed as a (2, 0) tensor with components g^{ij}.
The identification of vector space and dual vector space by a non-degenerate bi-linear form (via the
map ı and its representing matrix g_ij) can be extended to tensors and used to change their degree. We
can use the metric g_ij to lower one of the (upper) indices of a (p, q) tensor, thereby converting it into
a (p − 1, q + 1) tensor, and the inverse metric g^{ij} to raise a (lower) index of a (p, q) tensor to produce a
(p + 1, q − 1) tensor.
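As an illustration of raising and lowering indices (with a metric chosen arbitrarily, only required to be symmetric and non-degenerate), the following NumPy sketch lowers an index with g_ij, raises it back with g^{ij}, and evaluates the dual scalar product of (C.15) and (C.16):

import numpy as np

# An illustrative symmetric, non-degenerate metric g_ij on a 3-dimensional space
g = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
g_inv = np.linalg.inv(g)                          # the inverse metric g^{ij}

v = np.array([1.0, -1.0, 2.0])                    # a (1,0) tensor v^i
v_low = np.einsum('ij,j->i', g, v)                # lowering: v_i = g_ij v^j, a (0,1) tensor
assert np.allclose(np.einsum('ij,j->i', g_inv, v_low), v)   # raising recovers v^i

# Scalar product on the dual space, <phi, psi>_* = g^{ij} phi_i psi_j, as in (C.16)
phi = np.array([1.0, 0.0, 1.0])
psi = np.array([0.0, 2.0, -1.0])
dual_sp = np.einsum('ij,i,j->', g_inv, phi, psi)

# Consistency with (C.15): map phi, psi back to vectors with g^{ij} and use <.,.>
u, w = g_inv @ phi, g_inv @ psi
assert np.isclose(dual_sp, u @ g @ w)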
If the basis {ε_i} is an orthonormal basis with respect to the scalar product ⟨·, ·⟩, then the metric is g_ij = δ_ij
and its inverse is g^{ij} = δ^{ij}; from this point of view, the Kronecker delta δ_ij is a (0, 2) tensor and its
upper-index version, δ^{ij}, a (2, 0) tensor.
(f) Determinant as a tensor
We consider V = R^n (or V = C^n) with the basis of standard unit vectors {e_i} and the associated dual
basis {e^{i*}}. The determinant is, by definition, linear in each of its n vectorial arguments and is, therefore,
a tensor. To make this more explicit we start with n vectors v_i = v_i^j e_j and write
\[
\det(v_1, \dots, v_n) = \epsilon_{i_1 \cdots i_n}\, v_1^{i_1} \cdots v_n^{i_n}
= \epsilon_{i_1 \cdots i_n}\, e^{i_1 *} \otimes \cdots \otimes e^{i_n *}(v_1, \dots, v_n) \; , \qquad (C.17)
\]
which shows that the determinant can be viewed as a (0, n) tensor whose components are given by the
Levi-Civita tensor ε_{i_1···i_n}. Computing the determinant amounts to contracting the Levi-Civita (0, n) tensor
into n (1, 0) tensors (that is, into n vectors), resulting in a scalar.
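To make this contraction explicit in components, the short NumPy sketch below (the helper levi_civita is ours and purely illustrative) builds the array ε_{i_1···i_n} for n = 3 and checks (C.17) against the library determinant:

import numpy as np
from itertools import permutations
from math import prod

def levi_civita(n):
    # components eps_{i1...in} of the Levi-Civita (0,n) tensor
    eps = np.zeros((n,) * n)
    for sigma in permutations(range(n)):
        sign = prod(np.sign(sigma[j] - sigma[i]) for i in range(n) for j in range(i + 1, n))
        eps[sigma] = sign              # non-permutation index combinations stay zero
    return eps

vecs = np.array([[1.0, 2.0, 0.0],      # rows are the vectors v_1, v_2, v_3
                 [0.0, 1.0, 1.0],
                 [3.0, 0.0, 1.0]])
eps = levi_civita(3)
det = np.einsum('ijk,i,j,k->', eps, vecs[0], vecs[1], vecs[2])   # eps_{ijk} v_1^i v_2^j v_3^k
assert np.isclose(det, np.linalg.det(vecs))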