MT3042-guide_2024 (1)
MT3042-guide_2024 (1)
Optimisation
theory
B. von Stengel
MT3042
2024
Optimisation theory
B. von Stengel
MT3042
2024
Undergraduate study in
Economics, Management,
Finance and the Social Sciences
This subject guide is for a 300 course offered as part of the University of London’s
undergraduate study in Economics, Management, Finance and the Social
Sciences. This is equivalent to Level 6 within the Framework for Higher Education
Qualifications in England, Wales and Northern Ireland (FHEQ).
For more information see: london.ac.uk
This guide was prepared for the University of London by:
Bernhard von Stengel, Department of Mathematics, London School of Economics
and Political Science.
This is one of a series of subject guides published by the University. We regret that
due to pressure of work the author is unable to enter into any correspondence
relating to, or arising from, the guide. If you have any comments on this subject
guide, please communicate these through the discussion forum on the virtual
learning environment.
University of London
Publications Office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
london.ac.uk
The University of London asserts copyright over all material in this subject guide
except where otherwise indicated. All rights reserved. No part of this work may
be reproduced in any form, or by any means, without permission in writing from
the publisher. We make every effort to respect copyright. If you think we have
inadvertently used your copyright material, please let us know.
Contents
2 Combinatorial Optimisation 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 Learning Outcomes . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Essential Reading . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.4 Synopsis of Chapter Content . . . . . . . . . . . . . . . . . 15
2.2 Introductory Example: The Marriage Problem . . . . . . . . . . . . 16
2.3 Graphs, Digraphs, Networks . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Walks, Paths, Tours, and Cycles . . . . . . . . . . . . . . . . . . . . 19
2.5 Shortest Walks in Networks . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Introduction to Algorithms . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Single-Source Shortest Paths: Bellman–Ford . . . . . . . . . . . . . 27
2.8 O-Notation and Running-Time Analysis . . . . . . . . . . . . . . . 35
2.9 Single-Source Shortest Paths: Dijkstra’s Algorithm . . . . . . . . . 38
2.10 Reminder of Learning Outcomes . . . . . . . . . . . . . . . . . . . 43
2.11 Exercises for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . 43
2
Contents 3
3 Continuous Optimisation 46
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.1 Learning Outcomes . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.2 Essential Reading . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1.3 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1.4 Synopsis of Chapter Content . . . . . . . . . . . . . . . . . 47
3.2 The Real Numbers and Their Order . . . . . . . . . . . . . . . . . . 48
3.3 Infimum and Supremum . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Constructing the Real Numbers * . . . . . . . . . . . . . . . . . . . 52
3.5 Maximisation and Minimisation . . . . . . . . . . . . . . . . . . . . 55
3.6 Sequences, Convergence, and Limits . . . . . . . . . . . . . . . . . 57
3.7 Euclidean Norm and Maximum Norm . . . . . . . . . . . . . . . . 60
3.8 Sequences and Convergence in R𝑛 . . . . . . . . . . . . . . . . . . . 63
3.9 Open and Closed Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.10 Bounded and Compact Sets . . . . . . . . . . . . . . . . . . . . . . 67
3.11 Continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.12 Proving Continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.13 The Theorem of Weierstrass . . . . . . . . . . . . . . . . . . . . . . 77
3.14 Using the Theorem of Weierstrass . . . . . . . . . . . . . . . . . . . 78
3.15 Reminder of Learning Outcomes . . . . . . . . . . . . . . . . . . . 82
3.16 Exercises for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . 83
4 First-Order Conditions 85
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.1.1 Learning Outcomes . . . . . . . . . . . . . . . . . . . . . . . 85
4.1.2 Essential Reading . . . . . . . . . . . . . . . . . . . . . . . . 86
4.1.3 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.1.4 Synopsis of Chapter Content . . . . . . . . . . . . . . . . . 87
4.2 Introductory Example . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3 Matrix Multiplication for Vectors and Scalars . . . . . . . . . . . . 91
4.4 Differentiability in R𝑛 . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5 Partial Derivatives and 𝐶 1 Functions . . . . . . . . . . . . . . . . . 96
4.6 Taylor’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.7 Unconstrained Optimisation . . . . . . . . . . . . . . . . . . . . . . 100
4.8 Equality Constraints and the Theorem of Lagrange . . . . . . . . . 104
4.9 Inequality Constraints and the KKT Conditions . . . . . . . . . . . 110
4.10 Reminder of Learning Outcomes . . . . . . . . . . . . . . . . . . . 121
4.11 Exercises for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . 121
The subject guide is your main resource for studying the subject on your own.
This particular guide on Optimisation Theory is designed to contain all necessary
materials for self-study, with activities along the way and exercises at the end of
each chapter. Additional textbooks or other sources are only listed to allow you
to consider topics in more depth or from a different angle. The material can be
understood fully from the subject guide alone.
Given that this subject guide is largely self-contained, it is also similar to a
textbook on mathematical optimisation. At the same time, it aims to support you
in engaging actively with the subject.
This is a mathematics text, which assumes a mathematical mindset that aims
to be precise and is abstract. Mathematical thinking is based on concepts, such as
numbers, functions, or sets. In the mind of a mathematician, they have a clear
meaning, which are made precise with definitions and using certain commonly used
notations. The interesting parts are the useful and often surprising relationships
between these concepts stated as theorems (also called lemmas or propositions if of
lesser importance). Every theorem has a proof that argues that the theorem is
true, with a sequence of convincing logical steps that can be followed sentence by
sentence. The proof (which can be quite involved) is stated separately from the
theorem. In this text, every proof ends with the symbol. This has the explicit
purpose that one can skip the proof at first reading. It is important to understand
and remember the theorem in order to use it, but much less so its proof.
Precise meanings of words for mathematical concepts, mathematical notation,
and the formal statements of definitions, theorems, and proofs take some time to get
used to, but they are only employed to achieve clarity and precision. Understanding
mathematics requires seeing how these abstract concepts apply to specific examples
and problems. In fact, it is best to start with examples first, which is the approach
5
6 Chapter 1. Introduction to the Subject Guide
taken in this guide. The example should explain the concept, the theorem, and in
some cases even why the theorem is true and thereby the idea behind the proof.
This half course brings together several parts of the wide area of mathematical
optimisation, as encountered in many applied fields. The emphasis is on the
mathematical ideas and theory used in different types of optimisation called
combinatorial, linear, and continuous optimisation. Each of these types represents a
large topic of its own, and this course only gives an introduction to the basic ideas.
Combinatorial optimisation is about problems with finitely many combinations
of choices, of which the best one has to be found. For example, in a public
transportation network the problem may be to get from one location to another
by buses and underground trains in order to minimise travel time. The question
is then to find the best combination of bus and underground trips. A suitable
abstraction of this problem is a network with nodes that represent bus stops and
underground stations, and connections between them with associated travel times.
The optimisation problem is then to find the sequence of nodes with the shortest
overall travel time between two nodes. Combinatorial optimisation methods are
generally algorithms that take an input, such as the data of the transportation
network and the start node and end node of a trip, and compute an output, such
as the best route. All this will be covered in detail in Chapter 2.
At the other end of the spectrum of optimisation methods are choices of
continuous quantities such that a function that depends continuously on these
quantities is minimised. Chapter 3 introduces the theory for this set-up. A classic
problem is that of minimising the material needed for a cylindrical container such
as a beer can that contains a prescribed volume. In essence, all that is needed is the
height of the cylinder, which then determines its diameter in order to obtain the
given volume. The resulting surface area, as a function of the height, determines
the amount of material (of a fixed thickness) that is to be minimised. A suitable
method for this is known from calculus: Take the derivative of the surface area
function that depends on the single variable height and find the zeroes of this
derivative, one of which should be the minimum of the surface area function. It
turns out that a more general and elegant way is to look at the surface area as
a function of the two variables height and diameter, where these variables have
to satisfy the constraint of giving the prescribed volume. An optimal solution
then has to fulfill the property that the derivative vector of the function that is
optimised (called its gradient) is a linear multiple of the gradient of the constraint
function. This is explained in Chapter 4, which is about continuous optimisation
of differentiable functions.
1.3. Syllabus 7
1.3 Syllabus
This course aims to bring together several parts of the wide area of mathematical
optimisation. The course starts with an introduction to combinatorial optimisation
with the discrete problem of finding shortest paths in networks.
Subsequent parts concentrate on continuous optimisation, and in this sense
extend the theory studied in standard calculus courses. In contrast to the Mathemat-
ics 1 and Mathematics 2 half courses, the emphasis in this part of the Optimisation
Theory course will be on the mathematical ideas and theory used in continuous
optimisation.
The final part on linear programming and its duality theorem relates to both
combinatorial and continuous optimisation.
This course covers the following topics:
• Introduction to combinatorial optimisation. Shortest paths in directed graphs.
Algorithms and their running time.
• Introduction and review of relevant parts from real analysis, with emphasis
on higher dimensions.
• Classical results on continuous optimisation: Weierstrass’s Theorem con-
cerning continuous functions on compact sets. Review with added rigour
of unconstrained optimisation of differentiable functions on open sets. La-
grange’s Theorem on equality-constrained optimisation. Karush, Kuhn, and
Tucker’s Theorem on inequality-constrained optimisation.
• Linear programming and duality.
At the end of this half course and having completed the essential reading and
activities, students should:
• have knowledge and understanding of important definitions, concepts and
results in the subject, and of how to apply these in different situations
• have knowledge of basic techniques and methodologies in the topics covered
• have basic understanding of the theoretical aspects of the concepts and
methodologies covered
• be able to understand new situations and definitions, including combinations
with elements from different areas covered in the course, investigate their
properties, and relate them to existing knowledge
• be able to think critically and with sufficient mathematical rigour.
Below are the four most relevant skill outcomes for students undertaking this
course which can be conveyed to future prospective employers:
• complex problem-solving
• decision making
• adaptability and resilience
• creativity and innovation.
Each chapter concludes with a set of exercises; many activities in the text
point to these exercises. They test methods (such as how to solve specific optimi-
sation problems), ask to prove some simple properties, or in some cases test the
understanding of mathematical concepts used in proofs.
While the subject guide is meant to provide all material, it is good study practice
to consult other resources as well. They give a different perspective, and allow
you to compare what you read in this guide with other descriptions. This is useful
in several respects. For example, it may help you to communicate with a future
colleague who has learned the material using slightly different terminology. It
may also help you to check and improve your own understanding of the topic.
Reading more than one author helps you acquire the generally useful skill of
quickly understanding technical texts.
The following books provide additional reading. The relevant reading material
will be repeated in each chapter, with explanations about its relevance.
Bryant, V. (1990). Yet Another Introduction to Analysis. Cambridge University Press,
Cambridge, UK. ISBN 978-0521388351.
Chvátal, V. (1983). Linear Programming. W. H. Freeman, New York. ISBN 978-
0716715870.
Conforti, M., G. Cornuéjols, and G. Zambelli (2014). Integer Programming. Springer,
Cham. ISBN 978-3319110073.
Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein (2022). Introduction to
Algorithms, 4th ed. MIT Press, Cambridge, MA. ISBN 978-0262046305.
Dantzig, G. B. (1963). Linear Programming and Extensions. Princeton University
Press, Princeton, NJ. ISBN 978-0691059136.
Gale, D. (1960). The Theory of Linear Economic Models. McGraw-Hill, New York.
ISBN 978-0070227286.
Kuhn, H. W. (1991). Nonlinear programming: A historical note. In: History of
Mathematical Programming: A Collection of Personal Reminiscences, edited by J. K.
Lenstra, A. H. G. Rinnoy Kan, and A. Schrijver, 82–96. CWI and North-Holland,
Amsterdam. ISBN 978-0444888181.
Matoušek, J. and B. Gärtner (2007). Understanding and Using Linear Programming.
Springer, Berlin. ISBN 978-3540306979.
10 Chapter 1. Introduction to the Subject Guide
In addition to the subject guide and the Essential reading, it is crucial that you
take advantage of the study resources that are available online for this course,
including the VLE and the Online Library.
You can access the VLE, the Online Library and your University of London
email account via the Student Portal at: https://my.london.ac.uk
You should have received your login details for the Student Portal with your
official offer, which was emailed to the address that you gave on your application
form. You have probably already logged in to the Student Portal in order to register!
As soon as you registered, you will automatically have been granted access to the
VLE, Online Library and your fully functional University of London email account.
If you have forgotten these login details, please click on the ‘Forgot password’
link on the login page.
The VLE, which complements this subject guide, has been designed to enhance
your learning experience, providing additional support and a sense of community.
It forms an important part of your study experience with the University of London
and you should access it regularly.
The VLE provides a range of resources for EMFSS courses:
Course materials: Subject guides and other course materials available for
download. In some courses, the content of the subject guide is transferred into the
VLE and additional resources and activities are integrated with the text.
• Readings: Direct links, wherever possible, to essential readings in the Online
Library, including journal articles and ebooks.
• Video content: Including introductions to courses and topics within courses,
interviews, lessons and debates.
1.7. Overview of Learning Resources 11
A sample examination paper is available on the VLE. The general advice for exam
preparation is to identify the central idea behind each concept. Start with the basic
concepts, and test them with examples, as in the text or in the exercises. This will
help you understand what is going on, which is absolutely essential for coping with
a relatively abstract mathematical topic as in this guide. More involved concepts
can then be studied with a solid foundation of the basics.
Some methods that apply these concepts should also be practiced. However,
do not overdo this: The exam will most likely consist of unseen questions. Do not
rush into using a method that you have practiced without checking carefully if it
applies to the current question. A few minutes of thinking what is actually asked
can help you save a lot of precious time that you would lose with a wasted effort.
Write down your reasoning concisely in words rather than just producing a
sequence of equations. This will also help the examiner judge what approach you
are using in order to give you partial credit if the answer is not fully correct.
In general, allocate your time well, and proceed to the next part of the question,
or next question altogether, when you are stuck.
1.9 Conventions
Activities in the text are shown inside a box and start with an arrow ⇒, as in
References to statements inside this guide are in upper case, like “Theorem 4.5”,
and to other works in lower case, like “theorem 9.21 of Rudin (1976)”.
Sections with a star * after their title are optional reading and are included
for the mathematically inclined student who would like to know more about
more general ideas. The material in these “starred” sections is still designed to be
accessible, but is kept optional in order to limit the overall amount of important
things needed to learn in this course.
The following common mathematical notations are assumed to be known: N is
the set {1, 2, 3, . . .} of positive integers, Q is the set of rational numbers (fractions of
integers), R is the set of real numbers (points on the real line, with a mathematical
construction discussed in Section 3.4), and, as a special notation used in this guide,
R≥ is the set of nonnegative reals, R≥ = { 𝑥 ∈ R | 𝑥 ≥ 0}. This definition of a set
reads “R≥ is the set of real numbers 𝑥 such that 𝑥 is greater than or equal to zero”,
where the vertical bar | means “such that”.
1.9. Conventions 13
If 𝐴 and 𝐵 are sets, then 𝐴 ⊆ 𝐵 means 𝐴 is a subset of 𝐵 (that is, every element
of 𝐴 is an element of 𝐵). This includes the case 𝐴 = 𝐵. If this is not allowed, that is,
𝐴 is a proper subset of 𝐵, then we write 𝐴 ⊂ 𝐵.
2
Combinatorial Optimisation
2.1 Introduction
This chapter is about combinatorial optimisation. That is, the set of possibilities
that can be optimised over is usually finite, and these possibilities consist of
combinations of choices. These combinations will be explored by algorithms that
are executed by a computer.
The concept of an algorithm will be explained in this chapter. Section 2.6 may
be the first time you ever see an algorithm. No prior computer programming
experience is necessary.
The contents of this chapter are independent of the remaining chapters of the
guide and are not needed for studying them.
14
2.1. Introduction 15
Algorithms are a central topic in computer science. They are closely related to data
structures that represent the elements of a network, say, in computer memory. The
“bible” of algorithms is the following book:
Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein (2022). Introduction to
Algorithms, 4th ed. MIT Press, Cambridge, MA. ISBN 978-0262046305.
At nearly 1,400 pages, it describes important algorithms and their analysis in great
detail. In that book, you find further discussion of
• graphs and their representations (Section 2.3 in this guide) in chapter 20,
• single-source shortest path algorithms (Sections 2.5, 2.7, 2.9) in chapter 22,
• O-notation (Section 2.8) in chapter 3.
A good description of an algorithm for bipartite matching (see Section 2.2) is
given in section 10.2 of
Papadimitriou, C. H. and K. Steiglitz (1998). Combinatorial Optimization:
Algorithms and Complexity. Dover, Mineola, NY. ISBN 978-0486402581.
• Section 2.8 defines the O-notation to analyse and describe the running time of
algorithms in a concise manner.
Section 2.1.3 gives some pointers to further optional reading on the large subject of
algorithms and combinatorial optimisation.
𝑎1 𝑏1
𝑎2 𝑏2 (2.1)
𝑎3 𝑏3
⇒ Try Exercise 2.1, which considers these possibilities when one more possible
couple is added to (2.1).
Instead of women and men who marry, the matching problem similarly applies
to assigning workers to jobs (where the graph describes which worker can do
which job), or other applications. In an abstract setting, the women and men define
the nodes of a graph, with the possible couples called edges, here drawn as lines
between the nodes. The graph in the marriage problem has the special property of
being bipartite (meaning that edges always connect two nodes 𝑎 and 𝑏 that come
from two disjoint sets). We will soon, in Definition 2.1, define the concept of a
graph.
2.3. Graphs, Digraphs, Networks 17
Definition 2.1. A graph is given by (𝑉 , 𝐸) with a finite set 𝑉 of vertices or nodes and
a set 𝐸 of edges which are unordered pairs {𝑢, 𝑣} of nodes 𝑢, 𝑣.
not how these edges are drawn. To emphasise the nodes (which may ambiguously
appear when edges cross in the drawing), they are typically drawn as small disks,
as in the middle picture. It may also be necessary to draw crossing edges, as in the
right picture which has an additional edge from 𝑢 to 𝑣.
u
The following is an example with 𝑉 = {𝑎, 𝑏, 𝑐, 𝑑}, and edges in 𝐸 given by the
unordered pairs {𝑎, 𝑏}, {𝑏, 𝑐}, {𝑏, 𝑑}, and {𝑐, 𝑑}:
a b
(2.2)
c d
We will normally not assume that connections between two nodes 𝑢 and 𝑣 are
symmetric (even though this may apply in many cases). The concept of a directed
graph allows us to distinguish between getting from 𝑢 to 𝑣 and getting from 𝑣 to 𝑢.
The following is an example with 𝑉 = {𝑎, 𝑏, 𝑐, 𝑑}, and arcs in 𝐴 given by the
ordered pairs (𝑎, 𝑏), (𝑏, 𝑐), (𝑐, 𝑑), (𝑏, 𝑑), and (𝑑, 𝑏), where an arc (𝑢, 𝑣) is drawn as
an arrow that points from 𝑢 to 𝑣 :
a b
(2.3)
c d
An arc (𝑢, 𝑣) is also called an arc from 𝑢 to 𝑣. In a digraph, we do not allow arcs
(𝑢, 𝑢) from a node 𝑢 to itself (such arcs are called “loops”). We do allow arcs in
reverse directions such as (𝑢, 𝑣) and (𝑣, 𝑢), as in the example (2.3) for 𝑢, 𝑣 = 𝑏, 𝑑.
Because 𝐴 is a set, it cannot record multiple or “parallel” arcs between the same
nodes, that is, two or more arcs of the form (𝑢, 𝑣), so these are automatically
excluded.
2.4. Walks, Paths, Tours, and Cycles 19
⇒ Given this definition of digraphs which does not allow loops (𝑢, 𝑢) or multiple
parallel arcs, answer Exercise 2.2.
The following definition defines a network as a digraph where every arc has a
weight. This weight is in principle a real number (an element of R) but needs to
be stored in a computer in finite form, and is therefore assumed to be a fraction
(that is, a rational number, an element of Q). Sometimes we also allow weights
that are larger than any rational number, denoted by the symbol ∞ for infinity; in
a computer, this has a separate encoding as a special “number”.
Definition 2.3. Let 𝐷 = (𝑉 , 𝐴) be a digraph. A network is given by (𝐷, 𝑤) with a
weight function 𝑤 : 𝐴 → Q ∪ {∞} that assigns a weight 𝑤(𝑢, 𝑣) to every arc (𝑢, 𝑣)
in 𝐴.
The following is an example of a network with the digraph from (2.3). Next to
each arc is written its weight. In many examples we use integers as weights, but
this need not be the case, like here where 𝑤(𝑎, 𝑏) = 1.2; this is a fraction, namely 12
10 .
1.2
a b
−9 3 −7
c d
2
The underlying structure in our study will always be a digraph, typically with a
weight function so that this becomes a network. The arcs in the digraph represent
connections between nodes that can be followed. A sequence of such connections
defines a walk in the network. The following terminology also defines certain
special cases of walks.
Definition 2.4. Let 𝐷 = (𝑉 , 𝐴) be a digraph. A walk in 𝐷 is a finite sequence of
nodes 𝑢0 , 𝑢1 , . . . , 𝑢 𝑘 for some 𝑘 ≥ 0 such that (𝑢𝑖 , 𝑢𝑖+1 ) ∈ 𝐴 for 0 ≤ 𝑖 < 𝑘, which
are the 𝑘 arcs of the walk, and 𝑘 is called the length of the walk. This is also called a
𝑢0 , 𝑢 𝑘 -walk, or a walk from 𝑢0 to 𝑢 𝑘 , or a walk with startpoint 𝑢0 and endpoint 𝑢 𝑘 .
The walk is called a path if the nodes 𝑢0 , 𝑢1 , . . . , 𝑢 𝑘 are all distinct, a tour if 𝑢0 = 𝑢 𝑘 ,
and a cycle if it is a tour and the nodes 𝑢1 , . . . , 𝑢 𝑘 are all distinct.
The visited nodes on a walk are just the nodes of the walk. In a walk, but not in
a path, a node may be revisited. A tour starts and ends at the same node. A cycle
has also the same startpoint and endpoint but otherwise does not allow revisiting
a node.
20 Chapter 2. Combinatorial Optimisation
Proposition 2.5. Consider a digraph and two nodes 𝑢, 𝑣. If there is a walk from 𝑢 to 𝑣,
then there is a path from 𝑢 to 𝑣.
In a similar way, one can show the following (the qualifier “positive length” is
added to exclude the trivial cycle of length zero).
Proposition 2.6. Consider a digraph and a node 𝑢. If there is a tour of positive length
that starts and ends at 𝑢, then there is a cycle of positive length that starts and ends at 𝑢.
In a network, the weights typically represent costs of some sort associated with
the respective arcs. Weights for walks (and similarly of paths, tours, cycles) are
defined by summing the weights of their arcs.
2.5. Shortest Walks in Networks 21
We allow for walks 𝑊 of length zero (𝑘 = 0), in which case 𝑤(𝑊) = 0. If in (2.4)
𝑤(𝑢𝑖 , 𝑢𝑖+1 ) = ∞ for some 𝑖, then 𝑤(𝑊) = ∞.
Note the difference between length and weight of a walk: length counts the
number of arcs in the walk, whereas weight is the sum of the weights of these arcs
(length is the same as weight only if every arc has weight one).
Given a network and two nodes 𝑢 and 𝑣, we are interested in the 𝑢, 𝑣-walk
of minimum weight, often called shortest walk from 𝑢 to 𝑣 (remember throughout
that “shortest” means “least weighty”). Because there may be infinitely many
walks between two nodes if there is a possibility to revisit some nodes on the
way, “minimum weight” may not be a number, but may be equal to plus or minus
infinity.
and 𝑦(𝑢, 𝑣) = {𝑤(𝑊) | 𝑊 ∈ 𝑌(𝑢, 𝑣)}. The distance from 𝑢 to 𝑣, denoted by dist(𝑢, 𝑣),
is defined as follows:
+∞
if 𝑤(𝑊) = ∞ for all 𝑤 ∈ 𝑌(𝑢, 𝑣),
dist(𝑢, 𝑣) = −∞ if 𝑦(𝑢, 𝑣) is nonempty and has no smallest number,
min 𝑦(𝑢, 𝑣)
otherwise.
Theorem 2.9. Let 𝑢, 𝑣 be two nodes in a network (𝐷, 𝑤). Then dist(𝑢, 𝑣) = −∞ if and
only if there is a cycle 𝐶 with 𝑤(𝐶) < 0 that starts and ends at some node on a walk from
𝑢 to 𝑣. If dist(𝑢, 𝑣) ≠ ±∞, then
Proof. We make repeated use of the proof of Proposition 2.5. Suppose there is a
walk 𝑃 = 𝑢0 , 𝑢1 , . . . , 𝑢 𝑘 with 𝑢0 = 𝑢 and 𝑢 𝑘 = 𝑣 and a cycle 𝐶 = 𝑢𝑖 , 𝑣1 , . . . , 𝑣ℓ −1 , 𝑢𝑖
22 Chapter 2. Combinatorial Optimisation
that starts and ends at some node 𝑢𝑖 on that walk, 0 ≤ 𝑖 ≤ 𝑘, with 𝑤(𝐶) < 0. Let
𝑛 ∈ N. We insert 𝑛 repetitions of 𝐶 into 𝑃 to obtain a walk 𝑊 that we write (in an
obvious notation) as
𝑊 = 𝑢0 , 𝑢1 , . . . , 𝑢𝑖 , [𝑣 1 , . . . , 𝑣ℓ −1 , 𝑢𝑖 , ]𝑛 𝑢𝑖+1 , . . . , 𝑢 𝑘 .
The first 𝑖 arcs together with the last 𝑘 − 𝑖 arcs of 𝑊 are those of 𝑃, with 𝑛
copies of 𝐶 in the middle, so 𝑊 has weight 𝑤(𝑃) + 𝑛 · 𝑤(𝐶). For larger 𝑛 this is
arbitrarily negative because 𝑤(𝐶) < 0, and 𝑊 belongs to the set 𝑌(𝑢, 𝑣). Hence
dist(𝑢, 𝑣) = −∞.
Conversely, let dist(𝑢, 𝑣) = −∞. Consider a path 𝑃 from 𝑢 to 𝑣 of minimum
weight 𝑤(𝑃) as given by the minimum in (2.5). Suppose there is a 𝑢, 𝑣-walk 𝑊
with 𝑤(𝑊) < 𝑤(𝑃), which exists because dist(𝑢, 𝑣) = −∞ (otherwise 𝑤(𝑃) would
be a lower bound of 𝑦(𝑢, 𝑣)). Because 𝑊 is clearly not a path, it contains a tour 𝑇
as in the proof of Proposition 2.5. If 𝑤(𝑇) ≥ 0 then we could remove 𝑇 from 𝑊 and
obtain a path 𝑊 ′ with weight 𝑤(𝑊 ′) = 𝑤(𝑊) − 𝑤(𝑇) ≤ 𝑤(𝑊) and thus eventually
a path of weight less than 𝑤(𝑃), in contrast to the definition of 𝑃. So 𝑊 contains a
tour 𝑇 with 𝑤(𝑇) < 0, which starts and ends at some node 𝑥, say. We now claim
that 𝑇 contains a cycle 𝐶 with 𝑤(𝐶) < 0. If 𝑇 is itself a cycle, that is clearly the case.
Otherwise, 𝑇 either contains a “subtour” 𝑇 ′ with 𝑤(𝑇 ′) < 0 (and in general some
other startpoint 𝑦) which we can consider instead of 𝑇, or else every subtour 𝑇 ′
of 𝑇 fulfills 𝑤(𝑇 ′) ≥ 0 in which case we can remove 𝑇 ′ from 𝑇 without increasing
𝑤(𝑇); an example of these two possibilities is shown in the following picture with
𝑇 = 𝑥, 𝑦, 𝑧, 𝑦, 𝑥 and 𝑇 ′ = 𝑦, 𝑧, 𝑦.
1 1 1 1
u x v u x v
1 1 −1 −2 (2.6)
−1 1
z y z y
−2 1
In either case, 𝑇 is eventually reduced to a cycle 𝐶 with 𝑤(𝐶) < 0 which is part of
𝑊 (where 𝑊 is modified alongside 𝑇 when removing subtours 𝑇 ′ of nonnegative
weight). This shows the first claim of the theorem.
This implies that if dist(𝑢, 𝑣) ≠ ±∞, then there is a walk and hence a path from
𝑢, 𝑣, and no 𝑢, 𝑣-walk contains a cycle or tour of negative weight, and hence (2.5)
holds according to the preceding reasoning.
Note that the left picture in (2.6) shows that we can have dist(𝑢, 𝑣) = −∞ even
though no cycle of negative weight can be inserted into a path from 𝑢 to 𝑣. In this
example it is only possible to insert a tour of negative weight into a path from 𝑢
to 𝑣, or to insert a cycle of negative weight into a walk from 𝑢 to 𝑣.
Theorem 2.9 would be simpler to prove if it just stated the existence of a
negative-weight tour that can be inserted into a walk from 𝑢 to 𝑣 as an equivalent
2.6. Introduction to Algorithms 23
1
−1
z y
−2
In that case dist(𝑢, 𝑣) = 2 because the only walk from 𝑢 to 𝑣 is the path 𝑢, 𝑥, 𝑣, and
the negative (-weight) cycle 𝑦, 𝑧, 𝑦 can be reached from 𝑢 but cannot be extended
to a walk to 𝑣. Nevertheless, we will in the following consider all negative cycles
that can be reached from a given node 𝑢 as “bad” for the computation of distances
𝑑(𝑢, 𝑣) for nodes 𝑣.
1. 𝑚 ← some element of 𝑆
2. remove 𝑚 from 𝑆
3. while 𝑆 ≠ ∅ :
4. 𝑥 ← some element of 𝑆
5. 𝑚 ← min{𝑚, 𝑥}
6. remove 𝑥 from 𝑆
In this algorithm, we first specify its behaviour in terms of its input and output.
Here the input is a nonempty finite set 𝑆 of real numbers, and the output is the
24 Chapter 2. Combinatorial Optimisation
5. if 𝑥 < 𝑚 : 𝑚 ← 𝑥
where the assignment 𝑚 ← 𝑥 will not happen if 𝑥 < 𝑚 is false, that is, if 𝑥 ≥ 𝑚,
in which case 𝑚 is unchanged. We have chosen the description in Algorithm 2.10
because it is more readable.
A further observation is that the algorithm can be made more “elegant” by
avoiding the repetition of the similar instructions in lines 2 and 6. Namely, we
omit line 2 altogether and replace line 1 with the assignment
1. 𝑚 ← ∞
under the assumption that an element ∞ that is larger than all real numbers exists
and can be stored in the computer. In that case, the first element that is found in
the set 𝑆 is 𝑥 in line 4, which when compared in line 5 with 𝑚 (which currently has
value ∞) will certainly fulfill 𝑥 < 𝑚 and thus 𝑚 takes in the first iteration the value
of 𝑥, which is then removed from 𝑆 in line 6. So then the first iteration of the “loop”
in lines 3–6 performs what happened in lines 1–2 in the original Algorithm 2.10.
This variant of the algorithm is not only shorter but also more general because it
can also be applied to an empty set 𝑆. It is reasonable to define min ∅ = ∞ because
∞ is the neutral element of min in the sense that min{𝑥, ∞} = 𝑥 for all reals 𝑥, just
as an empty sum is 0 (the neutral element of addition) or an empty product is 1
(the neutral element of multiplication). For example, this would apply to the case
that dist(𝑢, 𝑣) = ∞ in (2.5) when there is no path from 𝑢 to 𝑣.
When Algorithm 2.10 terminates, the set 𝑆 will be empty and therefore no
longer be the original set. If this is undesired, one may instead create a copy of 𝑆
on which the algorithm operates that can be “destroyed” in this way while the
original 𝑆 is preserved.
This raises the question of how a set 𝑆 is represented in a computer. The best
way to think of this is as a table of a fixed length 𝑛, say, that stores the elements of 𝑆
which will be denoted as 𝑆[1], 𝑆[2], . . . , 𝑆[𝑛]. Each table element 𝑆[𝑖] for 1 ≤ 𝑖 ≤ 𝑛
is a “real” number in a given limited precision just as the variables 𝑚 and 𝑥. In
programming terminology, 𝑆 is then also called an array of numbers, with a given
array index 𝑖 in a specified range (here 1 ≤ 𝑖 ≤ 𝑛) to access the array element 𝑆[𝑖].
In the computer, the array corresponds to a consecutive sequence of memory cells,
each of which stores an array element. The only difference to a set 𝑆 is that in
that way, repetitions of elements may occur if the numbers 𝑆[1], 𝑆[2], . . . , 𝑆[𝑛]
are not all distinct. Computing the minimum of these 𝑛 (not necessarily distinct)
numbers is possible just as before. Algorithm 2.11, shown below, is close to an
actual implementation in a programming language such as Python. We just say
“numbers” which are real (in fact, rational) numbers as they can be represented in
a computer.
In Algorithm 2.11, the indentation (white space at the left) in lines 6–7 means
that these two statements are executed if the condition 𝑆[𝑘] < 𝑚 of the if statement
26 Chapter 2. Combinatorial Optimisation
1. 𝑚 ← 𝑆[1]
2. 𝑖 ← 1
3. 𝑘 ← 2
4. while 𝑘 ≤ 𝑛 :
5. if 𝑆[𝑘] < 𝑚 :
6. 𝑚 ← 𝑆[𝑘]
7. 𝑖 ← 𝑘
8. 𝑘 ← 𝑘+1
in line 5 is true. Line 8 has the same indentation as line 5 so the statement 𝑘 ← 𝑘 +1
is always executed inside the “loop” in lines 4–8. This is important because if line 8
was indented like lines 6 and 7, then if 𝑆[𝑘] ≥ 𝑚 the value of 𝑘 would stay the
same instead of being incremented by 1, so that the loop in lines 4–8 would from
then on be executed forever. The result would be a faulty algorithm that normally
never terminates (unless the elements in the array are in strictly decreasing order).
Line 2 and 7 together with their respective preceding line make sure that the
index 𝑖 will be such that always 𝑚 = 𝑆[𝑖] holds. If one is not interested in this
index 𝑖 of the minimum in the array, then lines 2 and 7 can be omitted.
⇒ Try Exercise 2.4, which is about a subtle change of the return value 𝑖 in
Algorithm 2.11 in case of repeated elements in the array 𝑆.
Algorithm 2.11 is very detailed and shows how to iterate with an index
variable 𝑘 through array elements 𝑆[𝑘] that represent the elements of a set.
Moreover, the array itself is not modified by this operation, unlike the description
in Algorithm 2.10. We normally aim for the most concise description. The following
is a short version that uses the loop description for all 𝑥 ∈ 𝑆 which means a suitable
iteration through the elements 𝑥 of 𝑆, where 𝑆 has some representation such as an
array.
Algorithm 2.12 (Finding the minimum in a set of numbers using “for all”).
1. 𝑚 ← ∞
2. for all 𝑥 ∈ 𝑆 :
2.7. Single-Source Shortest Paths: Bellman–Ford 27
3. 𝑚 ← min{𝑚, 𝑥}
𝑎 𝑏 𝑐 𝑑
𝑏 𝑐 𝑑 𝑏 (2.7)
𝑑
The columns in this table are also called adjacency lists (which in general have
different lengths). Such a table is easily stored in a computer. Often the vertices are
represented by the integers 1, . . . , |𝑉 | (which makes it easy to find the adjacency
list for each vertex).
We resume the discussion from Section 2.5. That is, in the following we will study
algorithms for single-source shortest paths, where some node 𝑠 (for “source”, or
“start node”) is specified and the task is to compute dist(𝑠, 𝑣) for all nodes 𝑣, or to
find out that some node 𝑣 can be reached from 𝑠 so that 𝑑(𝑠, 𝑣) = −∞, in which
case the algorithm will stop. In short, meaningful distances will only be computed
under the assumption that there is no negative cycle that can be reached from 𝑠.
The reasoning behind computing dist(𝑠, 𝑣) for all nodes 𝑣 is that even if one is only
interested in computing dist(𝑠, 𝑡) for a specific pair 𝑠, 𝑡, there is essentially no other
way than to compute dist(𝑠, 𝑣) for all other nodes 𝑣 because 𝑣 could be the last
node before 𝑡 on a shortest path from 𝑠 to 𝑡.
Algorithm 2.13 below, generally known as the Bellman–Ford Algorithm, finds
shortest paths from a single source node 𝑠 to all other nodes 𝑣, or terminates with
a warning message that a negative (-weight) cycle can be reached from 𝑠 so that all
nodes 𝑣 in that cycle (and possibly others) fulfill dist(𝑠, 𝑣) = −∞.
The algorithm uses an internal table 𝑑[𝑣, 𝑖 ] where 𝑣 is a vertex and 𝑖 takes
values between 0 and |𝑉 | − 1. Possible values for 𝑑[𝑣, 𝑖 ] are real numbers as well as
∞, where it is assumed that ∞ + 𝑥 = ∞ and min{∞, 𝑥} = 𝑥 for any real number 𝑥 or
if 𝑥 = ∞. The algorithm is presented in a first version that is easier to understand
than a second version (Algorithm 2.18 below) where the two-dimensional table
with entries 𝑑[𝑣, 𝑖 ] will be replaced by a one-dimensional array with entries 𝑑[𝑣].
We explain Algorithm 2.13 alongside the following example of a network with
four nodes 𝑠, 𝑥, 𝑦, 𝑧.
28 Chapter 2. Combinatorial Optimisation
1. 𝑑[𝑠, 0] ← 0
2. for all 𝑣 ∈ 𝑉 − {𝑠} : 𝑑[𝑣, 0] ← ∞
3. 𝑖 ← 0
4. while 𝑖 < |𝑉 | − 1 :
5. for all 𝑣 ∈ 𝑉 : 𝑑[𝑣, 𝑖 + 1] ← 𝑑[𝑣, 𝑖 ]
6. for all (𝑢, 𝑣) ∈ 𝐴 :
7. 𝑑[𝑣, 𝑖 + 1] ← min{ 𝑑[𝑣, 𝑖 + 1], 𝑑[𝑢, 𝑖 ] + 𝑤(𝑢, 𝑣) }
8. 𝑖 ← 𝑖+1
9. for all (𝑢, 𝑣) ∈ 𝐴 :
10. if 𝑑[𝑢, |𝑉 | − 1] + 𝑤(𝑢, 𝑣) < 𝑑[𝑣, |𝑉 | − 1] :
11. print “Negative cycle!” and stop immediately
12. for all 𝑣 ∈ 𝑉 : dist(𝑠, 𝑣) ← 𝑑[𝑣, |𝑉 | − 1]
𝑣 𝑠 𝑥 𝑦 𝑧
2
s
1
x 𝑑[𝑣, 0] 0 ∞ ∞ ∞
−1 𝑑[𝑣, 1] 0 1 ∞ ∞ (2.8)
2
𝑑[𝑣, 2] 0 1 3 0
y z
1 𝑑[𝑣, 3] 0 1 1 0
The right side of (2.8) shows 𝑑[𝑣, 𝑖 ] as rows of a table for 𝑖 = 0, 1, 2, 3, with the
vertices 𝑣 as columns. In lines 1–2 of Algorithm 2.13, these values are initialised
(initially set) to 𝑑[𝑠, 0] = 0 and 𝑑[𝑠, 𝑣] = ∞ for 𝑣 ≠ 𝑆. Lines 4–8 represent the main
loop of the algorithm, where 𝑖 takes successively the values 0, 1, . . . |𝑉 | − 2, and the
entries in row 𝑑[𝑣, 𝑖 + 1] are computed from those in row 𝑑[𝑣, 𝑖 ]. The important
property of these numbers, which we prove shortly, is the following.
Theorem 2.14. In Algorithm 2.13, at the beginning of each iteration of the main loop
(lines 4–8), 𝑑[𝑣, 𝑖 ] is the smallest weight of any walk from 𝑠 to 𝑣 that has at most 𝑖 arcs.
The main loop begins with 𝑖 = 0 after line 3. In line 5, the entries 𝑑[𝑣, 𝑖 + 1] are
copied from 𝑑[𝑣, 𝑖 ], and will subsequently be updated. In the example (2.8), 𝑑[𝑣, 1]
first contains 0, ∞, ∞, ∞. Lines 6–7 describe a second “inner” loop that considers
all arcs (𝑢, 𝑣). Whenever 𝑑[𝑢, 𝑖 ] + 𝑤(𝑢, 𝑣) is smaller than 𝑑[𝑣, 𝑖 + 1], the assignment
𝑑[𝑣, 𝑖 + 1] ← 𝑑[𝑢, 𝑖 ] + 𝑤(𝑢, 𝑣) takes place. This will not happen if 𝑑[𝑢, 𝑖 ] = ∞
2.7. Single-Source Shortest Paths: Bellman–Ford 29
because then also 𝑑[𝑢, 𝑖 ] + 𝑤(𝑢, 𝑣) = ∞. For 𝑖 = 0, the only arc (𝑢, 𝑣) where this is
not the case is (𝑢, 𝑣) = (𝑠, 𝑥), in which case 𝑑[𝑢, 𝑖 ] + 𝑤(𝑢, 𝑣) = 𝑑[𝑠, 0] + 1 = 1, which
is less than ∞, resulting in the assignment 𝑑[𝑥, 1] ← 1. In (2.8), this assignment
is shown by the new entry 1 for 𝑑[𝑥, 1] surrounded by a box. This is the only
assignment of this sort. After all arcs have been considered, it can be verified that
the entries 0, 1, ∞, ∞ in row 𝑑[𝑣, 1] represent indeed the shortest weights of walks
from 𝑠 to 𝑣 that use at most one arc, as asserted by Theorem 2.14.
After 𝑖 is increased from 0 to 1 in line 8, the second iteration of the main loop
starts with 𝑖 = 1. Then arcs (𝑢, 𝑣) where 𝑑[𝑢, 𝑖 ] < ∞ are those where 𝑢 = 𝑠 or 𝑢 = 𝑥,
which are the arcs (𝑠, 𝑥), (𝑥, 𝑠), (𝑥, 𝑦), and (𝑥, 𝑧). The last two produce the updates
𝑑[𝑦, 2] ← 𝑑[𝑥, 1] + 𝑤(𝑥, 𝑦) = 1 + 2 = 3 and 𝑑[𝑧, 2] ← 𝑑[𝑥, 1] + 𝑤(𝑥, 𝑧) = 1 − 1 = 0,
shown by the boxed entries in row 𝑑[𝑣, 2] of the table. Again, it can be verified
that these are the weights of shortest walks from 𝑠 to 𝑣 with at most two arcs.
The last iteration of the main loop is for 𝑖 = 2, which produces only a single
update, namely when the arc (𝑧, 𝑦) is considered in line 7, where 𝑑[𝑦, 3] is updated
from its current value 3 to 𝑑[𝑦, 3] ← 𝑑[𝑧, 2] + 𝑤(𝑧, 𝑦) = 0 + 1 = 1. Row 𝑑[𝑣, 3]
then has the weights of shortest walks from 𝑠 to 𝑣 that use at most three arcs. The
main loop terminates when 𝑖 = 3 (in general, when 𝑖 = |𝑉 | − 1).
Because the network in (2.8) has only four nodes, any walk with more than
three arcs (in general, more than |𝑉 | − 1 arcs) cannot be a path. In fact, if there
is a walk with |𝑉 | arcs that is shorter than found so far, it must contain a tour of
negative weight, as will be proved in Theorem 2.16 below. In Algorithm 2.13, lines
9–11 test for a possible improvement of the current values in 𝑑[𝑣, |𝑉 | − 1] (the last
row in the table), much in the same way as in the previous updates in lines 6–7,
by considering all arcs (𝑢, 𝑣). However, unlike the assignment in line 7, such a
possible improvement is now taken to terminate the algorithm immediately with
the notification that there must be a negative cycle that can be reached from 𝑠.
The normal case is that no such improvement is possible. In that case, line 12
produces the desired output of the distances 𝑑(𝑠, 𝑣). In the example (2.8), these
are the entries 0, 1, 1, 0 in the last row 𝑑[𝑣, 3].
Before we prove Theorem 2.14, we note that any prefix of a shortest walk is a
shortest walk to its last node.
Proof. Suppose there was a shorter walk 𝑊 ′ from 𝑢 to 𝑢𝑖 than the prefix 𝑢0 , 𝑢1 , . . . , 𝑢𝑖
of 𝑊. Then 𝑊 ′ followed by 𝑢𝑖+1 , . . . , 𝑢 𝑘 is a shorter walk from 𝑠 to 𝑢 than 𝑊, which
contradicts the definition of 𝑊.
30 Chapter 2. Combinatorial Optimisation
Theorem 2.14 presents one part of the correctness of Algorithm 2.13. A second
part is the correct detection of negative (-weight) cycles. We first consider an
example, which is the same network as in (2.8) with an additional arc (𝑦, 𝑠) of
weight −5, which creates two negative cycles, namely 𝑠, 𝑥, 𝑦, 𝑠 and 𝑠, 𝑥, 𝑦, 𝑧, 𝑠.
𝑣 𝑠 𝑥 𝑦 𝑧
2 𝑑[𝑣, 0] 0 ∞ ∞ ∞
s x
1
𝑑[𝑣, 1] 0 1 ∞ ∞
−5 −1 (2.9)
2 𝑑[𝑣, 2] 0 1 3 0
y z 𝑑[𝑣, 3] −2 1 1 0
1
neg. cycle? −4 −1
In this network, any walk from 𝑠 to 𝑦 has two or more arcs, so by Theorem 2.14 the
rows 𝑑[𝑣, 0], 𝑑[𝑣, 1], 𝑑[𝑣, 2] are the same as in (2.8) and only 𝑑[𝑣, 3] has the different
entries −2, 1, 1, 0, when the main loop terminates. In the additional row in the
table in (2.9), there are two possible improvements of 𝑑[𝑣, 3], namely of 𝑑[𝑠, 3],
indicated by −4 , when the arc (𝑦, 𝑠) is considered as (𝑢, 𝑣) in line 10, or of 𝑑[𝑥, 3],
indicated by −1 , when the arc (𝑠, 𝑥) is considered. Whichever improvement is
discovered first (depending on the order of arcs (𝑢, 𝑣) in line 9), it leads to the
immediate stop of the algorithm in line 11. Both improvements reveal the existence
of a walk with four arcs that is shorter than the current shortest walk with at
most three arcs. For the first improvement, this four-arc walk is 𝑠, 𝑥, 𝑧, 𝑦, 𝑠, for the
second it is 𝑠, 𝑥, 𝑦, 𝑠, 𝑥.
2.7. Single-Source Shortest Paths: Bellman–Ford 31
Theorem 2.16. Consider a network (𝑉 , 𝐴, 𝑤) with a source node 𝑠. Then there is a negative
cycle that starts at some node which can be reached from 𝑠 if and only if Algorithm 2.13
stops in line 11.
root to 𝑣 (unless 𝑣 = 𝑠, in which case pred[𝑠] is not given, often written NIL). The
following is an example of such a shortest path tree in a network with six nodes,
with the corresponding pred array and the distances from 𝑠. The arcs of the tree
(which are also part of the original network) are indicated as dashed arrows.
2 2
s x a 𝑣 𝑠 𝑥 𝑦 𝑧 𝑎 𝑏
1
−3
−1 2 dist(𝑠, 𝑣) 0 1 1 0 3 1
2
pred[𝑣] NIL 𝑠 𝑧 𝑥 𝑥 𝑧
y z b
1 1
(2.12)
The following algorithm is an extension of Algorithm 2.13 that also computes
the shortest-path predecessors.
Algorithm 2.17 (Bellman–Ford, first version, with shortest-path tree).
Input : network (𝑉 , 𝐴, 𝑤) and source (start node) 𝑠.
Output : dist(𝑠, 𝑣) for all nodes 𝑣 if no such distance is −∞, and predecessor pred[𝑣]
of 𝑣 on shortest path from 𝑠.
Line 1 of this algorithm not only initialises 𝑑[𝑣, 0] to ∞ but also pred[𝑣] to NIL,
for all nodes 𝑣. Line 2 then sets 𝑑[𝑠, 0] to 0.
Lines 7 and 7a represent the update of 𝑑[𝑣, 𝑖 + 1] in line 7 of Algorithm 2.13,
and line 7b the new assignment of the predecessor pred[𝑣] on the new shortest
2.7. Single-Source Shortest Paths: Bellman–Ford 33
𝑣 𝑠 𝑥 𝑦 𝑧
2 𝑑[𝑣, 0] 0 ∞ ∞ ∞
s x
1
𝑑[𝑣, 1] 0 1 𝑠 ∞ ∞
−1 (2.13)
2 𝑑[𝑣, 2] 0 1 3 𝑥 0 𝑥
y z 𝑑[𝑣, 3] 0 1 1 𝑧 0
1
pred[𝑣] NIL 𝑠 𝑧 𝑥
Our notation with updated values of 𝑑[𝑣, 𝑖 +1] shown by boxes with a subscript
𝑢 for the corresponding arc (𝑢, 𝑣) is completely ad-hoc. We chose it to document
the progress of the algorithm in a compact and unambiguous way.
A case that has not yet occurred is that 𝑑[𝑣, 𝑖 + 1] is updated more than once
for the same value of 𝑖, in case there are several arcs (𝑢, 𝑣) where this occurs,
depending on the order in which these arcs are traversed in line 6. An example is
(2.12) above for 𝑖 = 2 and 𝑣 = 2, where the update of 𝑑[𝑏, 3] occurs from ∞ to 5
via the arc (𝑎, 𝑏), and then to 1 via the arc (𝑧, 𝑏). One may record these updates
of 𝑑[𝑏, 3] in the table as 5 𝑎 1 𝑧 , or by only listing the last update 1 𝑧 (if (𝑧, 𝑏) is
considered before (𝑎, 𝑏), this is the only update).
Algorithm 2.13 is the first version of the Bellman–Ford algorithm. The progress
of the algorithm is nicely described by Theorem 2.14. Algorithm 2.18 is a second,
simpler version of the algorithm. Instead of storing the current distances from 𝑠
for walks that use at most 𝑖 arcs in a separate table row 𝑑[𝑣, 𝑖 ], the second version
of the algorithm uses just a single array with entries 𝑑[𝑣]. The new algorithm has
fewer instructions, which we have numbered with some line numbers omitted for
easier comparison with the first version in Algorithm 2.13.
The main difference between this algorithm and the first version is the update
rule in line 7. The first version compared 𝑑[𝑣, 𝑖 + 1] with 𝑑[𝑢, 𝑖 ] + 𝑤(𝑢, 𝑣) where
34 Chapter 2. Combinatorial Optimisation
1. 𝑑[𝑠] ← 0
2. for all 𝑣 ∈ 𝑉 − {𝑠} : 𝑑[𝑣] ← ∞
4. repeat |𝑉 | − 1 times :
6. for all (𝑢, 𝑣) ∈ 𝐴 :
7. 𝑑[𝑣] ← min{ 𝑑[𝑣], 𝑑[𝑢] + 𝑤(𝑢, 𝑣) }
9. for all (𝑢, 𝑣) ∈ 𝐴 :
10. if 𝑑[𝑢] + 𝑤(𝑢, 𝑣) < 𝑑[𝑣] :
11. print “Negative cycle!” and stop immediately
12. for all 𝑣 ∈ 𝑉 : dist(𝑠, 𝑣) ← 𝑑[𝑣]
𝑑[𝑢, 𝑖 ] was always the value of the previous iteration, whereas the second version
compares 𝑑[𝑣] with 𝑑[𝑢] + 𝑤(𝑢, 𝑣) where 𝑑[𝑢] may already have improved in the
current iteration. The following simple example illustrates the difference.
𝑣 𝑠 𝑥 𝑦 𝑧
𝑑[𝑣, 0] 0 ∞ ∞ ∞
s x y z 𝑑[𝑣, 1] 0 1 ∞ ∞ (2.14)
1 1 1
𝑑[𝑣, 2] 0 1 2 ∞
𝑑[𝑣, 3] 0 1 2 3
The table on the right in (2.14) shows the progress of the first version of the
algorithm. Suppose that in the second version in line 6, the arcs are considered in
the order (𝑠, 𝑥), (𝑥, 𝑦), and (𝑦, 𝑧). Then the assignments in the inner loop in line 7
are 𝑑[𝑥] ← 1, 𝑑[𝑦] ← 2, 𝑑[𝑧] ← 3, so the complete array is already found in the
first iteration of the main loop in lines 4–7, without any further improvements in
the second and third iteration of the main loop. However, if the order of arcs in
line 6 is (𝑦, 𝑧), (𝑥, 𝑦), and (𝑠, 𝑥), then the only update in the main loop in line 7 in the
first iteration is 𝑑[𝑥] ← 1, with 𝑑[𝑦] ← 2 in the second iteration, and 𝑑[𝑧] ← 3
in the last iteration. In general, the main loop does need |𝑉 | − 1 iterations for the
algorithm to work correctly, as asserted by the following theorem.
Theorem 2.19. In Algorithm 2.18, at the beginning of the 𝑖th iteration of the main loop
(lines 4–7), 1 ≤ 𝑖 ≤ |𝑉 | − 1, we have 𝑑[𝑣] ≤ 𝑤(𝑊) for any node 𝑣 and any 𝑠, 𝑣-walk 𝑊
that has at most 𝑖 − 1 arcs. Moreover, if 𝑑[𝑣] < ∞, then there is some 𝑠, 𝑣-walk of weight
2.8. O-Notation and Running-Time Analysis 35
𝑑[𝑣]. If the algorithm terminates without stopping in line 11, then 𝑑[𝑣] = dist(𝑠, 𝑣) as
claimed.
Proof. The algorithm performs at least the updates of Algorithm 2.13 (possibly
more quickly), which shows that 𝑑[𝑣] ≤ 𝑤(𝑊) for any 𝑠, 𝑣-walk 𝑊 with at most
𝑖 − 1 arcs, as in Theorem 2.14. If 𝑑[𝑣] < ∞, then 𝑑[𝑣] = 𝑤(𝑊 ′) for some 𝑠, 𝑣-walk 𝑊 ′
because of the way 𝑑[𝑣] is computed in line 7. Furthermore, dist(𝑠, 𝑣) = 𝑑[𝑣] ≠ −∞
if and only if 𝑑[𝑢] + 𝑤(𝑢, 𝑣) ≥ 𝑑[𝑣] for all arcs (𝑢, 𝑣), as proved for Theorem 2.16.
In this section, we define the 𝑂-notation used to describe the running time of
algorithms, and apply it to the analysis of the Bellman–Ford algorithm.
The time needed to execute an algorithm depends on the size of its input, and
on the machine that performs the instructions of the algorithm. The size of the
input can be very accurately measured in terms of the number of bits (binary digits)
to represent the input. If the input is a network, then the input size is normally
measured more coarsely by the number of nodes and arcs, assuming that each
piece of associated information (such as the endpoints of an arc, and the weight of
an arc) can be stored in some fixed number of bits (which is realistic in practice).
The execution time of an instruction depends on the computer, and on the
way that the instruction is represented in terms of more primitive instructions,
for example how an assigment translates to the evaluation of the right-hand side
of the assignment and to storing the computed value of the assigned variable in
memory. Because computing technology is constantly improving, it is normally
assumed that a basic instruction, such as an assignment or a test of a condition
like 𝑥 < 𝑚, takes a certain constant amount of time, without specifying what that
constant is.
The functions to measure running times take nonnegative values. Let
R≥ = { 𝑥 ∈ R | 𝑥 ≥ 0}. (2.15)
Suppose 𝑓 : N → R≥ is a function where 𝑓 (𝑛) measures, say, the number of
microseconds needed to run the Bellman–Ford Algorithm 2.13 for a network with
𝑛 nodes on a specific computer. Changing the computer, or changing microseconds
to nanoseconds, would result in changing 𝑓 (𝑛) by a constant factor. It makes sense
to specify running times “up to a constant factor” as a function of the input size.
The 𝑂-notation, or “order of” notation, is designed to capture this, as well as the
asymptotic behaviour of a function (that is, of 𝑓 (𝑛) for sufficiently large 𝑛).
Definition 2.20. Consider two functions 𝑓 , 𝑔 : N → R≥ . Then we say 𝑓 (𝑛) ∈
𝑂(𝑔(𝑛)), or 𝑓 (𝑛) is of order 𝑔(𝑛), if there is some 𝐶 > 0 and 𝐾 ∈ N so that
𝑓 (𝑛) ≤ 𝐶 · 𝑔(𝑛) for all 𝑛 ≥ 𝐾.
36 Chapter 2. Combinatorial Optimisation
because if there are 𝐶, 𝐷 > 0 with 𝑓 (𝑛) ≤ 𝐶 · 𝑔(𝑛) for 𝑛 ≥ 𝐾 and 𝑔(𝑛) ≤ 𝐷 · ℎ(𝑛) for
𝑛 ≥ 𝐿, then 𝑓 (𝑛) ≤ 𝐶 · 𝐷 · ℎ(𝑛) for 𝑛 ≥ max{𝐾, 𝐿}. Note that (2.17) is equivalent to
the statement
𝑔(𝑛) ∈ 𝑂(ℎ(𝑛)) ⇔ 𝑂(𝑔(𝑛)) ⊆ 𝑂(ℎ(𝑛)) (2.18)
for the following reason: Suppose 𝑔(𝑛) ∈ 𝑂(ℎ(𝑛)). Then (2.17) says that any
function 𝑓 (𝑛) in 𝑂(𝑔(𝑛)) is also in 𝑂(ℎ(𝑛)), which shows 𝑂(𝑔(𝑛)) ⊆ 𝑂(ℎ(𝑛))
and thus “⇒” in (2.18). Conversely, if 𝑂(𝑔(𝑛)) ⊆ 𝑂(ℎ(𝑛)), then we have clearly
𝑔(𝑛) ∈ 𝑂(𝑔(𝑛)) and thus 𝑔(𝑛) ∈ 𝑂(ℎ(𝑛)), which shows “⇐” in (2.18).
What is 𝑂(1)? This is the set of functions 𝑓 (𝑛) that fulfill 𝑓 (𝑛) ≤ 𝐶 for all
𝑛 ≥ 𝐾, for some constants 𝐶 and 𝐾. Because the finitely many numbers 𝑛 with
𝑛 < 𝐾 are bounded, we can if necessary increase 𝐶 to obtain that 𝑓 (𝑛) ≤ 𝐶 for all
𝑛 ∈ N. In other words, 𝑂(1) is the set of functions that are bounded by a constant.
In addition to (2.17), the following rules are useful and easy to prove:
which shows that a sum of two functions can be “absorbed” into the function with
higher growth rate. With the definition
In addition,
𝑓 (𝑛) ∈ 𝑂(𝑔(𝑛)) ⇒ 𝑛 · 𝑓 (𝑛) ∈ 𝑂(𝑛 · 𝑔(𝑛)) . (2.22)
We now apply this notation to analyse the running time of the Bellman–Ford
algorithm, where we consider Algorithm 2.17 because it is slightly more detailed
than Algorithm 2.13. Suppose the input to the algorithm is a network (𝑉 , 𝐴, 𝑤)
with 𝑛 = |𝑉 | and 𝑚 = |𝐴|. Line 1 takes running time 𝑂(𝑛) because in that line all
nodes are considered, each with two assignments that take constant time. Lines 2
and 3 take constant time 𝑂(1). The main loop in lines 4–8 is executed 𝑛 − 1 times.
Testing the condition 𝑖 < |𝑉 | − 1 in line 4 takes time 𝑂(1). Line 5 takes time 𝑂(𝑛).
The “inner loop” in lines 6–7b takes time 𝑂(𝑚) because the evaluation of the if
condition in line 7 and the assigments in lines 7a–7b take constant time (which is
shorter when they are not executed because the if condition is false, but bounded
in either case). Line 8 takes time 𝑂(1). So the time to perform one iteration of
the main loop in lines 4–8 is 𝑂(1) + 𝑂(𝑛) + 𝑂(𝑚) + 𝑂(1), which by (2.19) we can
shorten to 𝑂(𝑛 + 𝑚) because we can assume 𝑛 > 0. The main loop is performed
𝑛 − 1 times, where in view of the constants this can be simplified to multiplication
with 𝑛, that is, it takes together time 𝑛 · 𝑂(𝑛 + 𝑚) = 𝑂(𝑛 2 + 𝑛𝑚). The test for
negative cycles in lines 9–11 takes time 𝑂(𝑚), and the final assigment of distance
in line 12 time 𝑂(𝑛). So the overall running time from lines 1–3, 4–8, 9–11, 12
is 𝑂(𝑛) + 𝑂(𝑛 2 + 𝑛𝑚) + 𝑂(𝑚) + 𝑂(𝑛) where the second term absorbs the others
according to (2.19). So the overall running time is 𝑂(𝑛 2 + 𝑛𝑚).
The number 𝑚 of arcs of a digraph with 𝑛 nodes is at most 𝑛 · (𝑛 − 1), that is,
𝑚 ∈ 𝑂(𝑛 2 ), so that 𝑂(𝑛 2 + 𝑛𝑚) ⊆ 𝑂(𝑛 2 + 𝑛 3 ) = 𝑂(𝑛 3 ). That is, for a network with
𝑛 nodes, the running time of the Bellman–Ford algorithm is 𝑂(𝑛 3 ). (It is therefore also
called a cubic algorithm.)
The above analysis shows a more accurate running time of 𝑂(𝑛 2 + 𝑛𝑚) that
depends on the number 𝑚 of arcs in the network. The algorithm works for any
number of arcs (even if 𝑚 = 0). Normally the number of arcs is at least 𝑛 − 1
because otherwise some nodes cannot be reached from the source node 𝑠 (this
can be seen by induction on 𝑛 by adding nodes one at a time to the network,
starting with 𝑠: every new node 𝑣 requires at least one new arc (𝑢, 𝑣) in order to
be reachable from the nodes 𝑢 that are currently reachable from 𝑠). In that case
𝑛 ∈ 𝑂(𝑚) and thus 𝑂(𝑛 2 + 𝑛𝑚) = 𝑂(𝑛𝑚), so that the running time is 𝑂(𝑛𝑚). (We
have to be careful here: when we say the digraph has at least 𝑛 − 1 arcs, we cannot
write this as 𝑚 ∈ 𝑂(𝑛), because this would mean an upper bound on 𝑚; the correct
way to say this is 𝑛 ∈ 𝑂(𝑚), which translates to 𝑛 ≤ 𝐶𝑚 and thus to 𝑚 ≥ 𝑛/𝐶,
meaning that the number of arcs is at least proportional to the number of nodes.
38 Chapter 2. Combinatorial Optimisation
An upper bound for 𝑚 is given by 𝑚 ∈ 𝑂(𝑛 2 ).) In short, for a network with 𝑛 nodes
and 𝑚 arcs, where 𝑛 ∈ 𝑂(𝑚), the running time of the Bellman–Ford algorithm is
𝑂(𝑛𝑚).
The second version of the Bellman–Ford algorithm has the same running time
𝑂(𝑛 3 ). Algorithm 2.18 is faster than the first version but, in the worst case, only
by a constant factor, because the main loop in lines 4–7 is still performed 𝑛 − 1
times, and the algorithm would in general be incorrect with fewer iterations, as
the example in (2.14) shows (which can be generalised to an arbitrary path).
𝑢 𝑠 𝑎 𝑏 𝑐 𝑥 𝑦
0𝐺 ∞𝐵 ∞𝐵 ∞𝐵 ∞𝐵 ∞𝐵
y
3 1 𝑠 0𝑊 1 𝐺
𝑠 4 𝐺
𝑠 ∞ ∞ ∞
0
c x 𝑎 0 1𝑊 3 𝐺
𝑎 6 𝐺
𝑎 ∞ ∞
2
𝐺 𝐺
5 2 1 𝑏 0 1 3𝑊 5 𝑏
4 𝑏
∞
2 𝐺
a b 𝑥 0 1 3 4 𝑥 4𝑊 5 𝐺
𝑥
3
1 4 𝑐 0 1 3 4𝑊 4 5𝐺
s
𝑦 0 1 3 4 4 5𝑊
dist(𝑠, 𝑢) 0 1 3 4 4 5
pred[𝑣] NIL 𝑠 𝑎 𝑥 𝑏 𝑥
with the exception of 𝑣 consists exclusively of white nodes, that is, colour[𝑢𝑖 ] = white for
0 ≤ 𝑖 < 𝑘. When 𝑢 is made white in line 5, we have 𝑑[𝑢] = dist(𝑠, 𝑢).
In line 1, all nodes 𝑣 are initially black with 𝑑[𝑣] ← ∞. In line 2, the source
𝑠 becomes grey and 𝑑[𝑠] ← 0. This is also shown as the first row in the table in
Figure 2.1, where we use superscripts 𝐵, 𝐺, 𝑊 for a newly assigned colour black,
grey, or white. Because grey nodes are of special interest, we will indicate their
colour all the time, even if it has not been updated in that particular iteration.
The main loop in lines 3–8 operates as long as the set of grey nodes is not empty,
in which case it selects in line 4 a particular grey node 𝑢 with smallest value 𝑑[𝑢].
Because of the initialisation in line 2, the only grey node is 𝑠, which is therefore
chosen in the first iteration. Each row in the table in Figure 2.1 represents one
iteration of the main loop, where the node 𝑢 that is chosen in line 4 is displayed
on the left of that row. The row entries are the values 𝑑[𝑣] for all nodes 𝑣, as they
become updated or stay unchanged in that iteration. The chosen node 𝑢 changes
its colour to white in line 5, indicated by the superscript 𝑊 in the table, where that
node is also underlined.
Lines 6–8 are a second inner loop of the algorithm. It traverses all non-white
neighbours 𝑣 of 𝑢, that is, all nodes 𝑣 so that colour[𝑣] is not white and so that
(𝑢, 𝑣) is an arc. For all these non-white neighbours 𝑣 of 𝑢 are set to grey in line 7
(indicated by a superscript 𝐺), and their distance 𝑑[𝑣] is updated to 𝑑[𝑢] + 𝑤(𝑢, 𝑣)
in case this is smaller than the previous value of 𝑑[𝑣] (which happens always if
40 Chapter 2. Combinatorial Optimisation
𝑣 is black and therefore 𝑑[𝑣] = ∞). If such an update happens, this means that
there is an all-white path from 𝑠 to 𝑢 followed by an arc (𝑢, 𝑣) that connects 𝑢
to the grey node 𝑣, and in that case we can also set pred[𝑣] ← 𝑢 to indicate that
𝑢 is the predecessor of 𝑣 on the current path from 𝑠 to 𝑣 (we have omitted that
update of pred[𝑣] to keep Algorithm 2.21 short; it is the same as in lines 7–7b of
Algorithm 2.17). As in example (2.13) for the Bellman–Ford algorithm, the update
of pred[𝑣] with 𝑢 is shown with the subscript 𝑢, and the update of 𝑑[𝑣] is shown
by surrounding the new value with a box.
In the first iteration in Figure 2.1 where 𝑢 = 𝑠, the updated neighbours 𝑣 of
𝑢 are 𝑎 and 𝑏. These are also the grey nodes in the next iteration of the main
loop, where node 𝑎 is selected because 𝑑[𝑎] < 𝑑[𝑏], and 𝑎 is made white. The two
neighbours of 𝑎 are 𝑏 and 𝑐. Both are non-white and become grey (𝑏 is already
grey). The value of 𝑑[𝑏] is updated from 4 to 𝑑[𝑎] + 𝑤(𝑎, 𝑏) = 3, with pred[𝑏] ← 𝑎.
The value of 𝑑[𝑐] is updated from ∞ to 𝑑[𝑎] + 𝑤(𝑎, 𝑐) = 6, with pred[𝑐] ← 𝑎. The
current row shows that the grey nodes are 𝑏 and 𝑐, where 𝑑[𝑏] < 𝑑[𝑐].
In the next iteration therefore 𝑢 is chosen to be 𝑏, which gives the next row
of the table. The neighbours of 𝑏 are 𝑎, 𝑐, 𝑥. Here 𝑎 is white and is ignored, 𝑐 is
non-white and gets the update 𝑑[𝑐] ← 𝑑[𝑏] + 𝑤(𝑏, 𝑐) = 5 because this is smaller
than the current value 6, and pred[𝑐] ← 𝑏. Node 𝑥 changes colour from black to
grey and 𝑑[𝑥] ← 𝑑[𝑏] + 𝑤(𝑏, 𝑥) = 4. In the next iteration, 𝑥 is the grey node 𝑢 with
smallest 𝑑[𝑢], creating updates for 𝑐 and 𝑦. The next and penultimate iteration
chooses 𝑐 among two remaining grey nodes, where the neighbour 𝑦 of 𝑐 creates no
update (other than being set to grey, which is already its colour). The final iteration
chooses 𝑦. Because all nodes are now white, the algorithm terminates with the
output of distances in line 9, as shown in the table in Figure 2.1 in addition to the
predecessors in the shortest-path tree with root 𝑠.
Proof of Theorem 2.22. First, we note that because all weights are nonnegative, the
shortest walk from 𝑠 to any node 𝑣 can always be chosen as a path by Proposition 2.5
because any tour that the walk contains can be removed and is of nonnegative
weight, which will not increase the weight of the walk.
We prove the theorem by induction. Before the main loop in lines 3–8 is
executed for the first time, there are no white nodes. Hence, the only path from
𝑠 where all but the last node are white is a path with no arcs that consists of the
single node 𝑠, and its weight is zero, where 𝑑[𝑠] = 0 as claimed. Furthermore, this
is the only (and shortest) path from 𝑠 to 𝑠, so dist(𝑠, 𝑠) = 𝑑[𝑠] = 0.
Suppose now that at the beginning of the main loop the condition is true for
any set of white nodes. If there are no grey nodes, then the main loop will no
longer be performed and the algorithm proceeds to line 9. If there are grey nodes,
then the main loop will be executed, and we will show that the condition holds
again afterwards. Let 𝑢 be a node that is chosen in line 4, which is made white
2.9. Single-Source Shortest Paths: Dijkstra’s Algorithm 41
in line 5. We prove, as claimed in the theorem, that just before this assignment
we have 𝑑[𝑢] = dist(𝑠, 𝑢). This has already been shown when 𝑢 = 𝑠. There is a
path from 𝑠 to 𝑢 with weight 𝑑[𝑢], namely the assumed path (by the induction
hypothesis) where all nodes except 𝑢 are white. Consider any shortest path 𝑃 from
𝑠 to 𝑢; we will show 𝑑[𝑢] ≤ 𝑤(𝑃) which implies 𝑑[𝑢] = 𝑤(𝑃) = dist(𝑠, 𝑢). Let 𝑦 be
the first node on the path 𝑃 which is not white. Let 𝑃 ′ be the prefix of 𝑃 given by
the path from 𝑠 to 𝑦, which is a shortest path from 𝑠 to 𝑦 by Lemma 2.15. Moreover,
𝑦 is grey because there are no arcs (𝑥, 𝑦) where 𝑥 is white (such as the previous
node 𝑥 on 𝑃 before 𝑦) and 𝑦 is black because after 𝑥 has been made white in line 5,
all its black neighbours 𝑦 are made grey in line 7. So is 𝑃 ′ is a shortest path from
𝑠 to 𝑦 and certainly a shortest path among those where all but the last node are
white, so by the induction hypothesis, 𝑑[𝑦] = 𝑤(𝑃 ′). By the choice of 𝑢 in line 4 we
have 𝑑[𝑢] ≤ 𝑑[𝑦] = 𝑤(𝑃 ′) ≤ 𝑤(𝑃), where the latter inequality holds because all
weights are nonnegative. That is, 𝑑[𝑢] ≤ 𝑤(𝑃) as claimed.
We now show that updating the non-white neighbours 𝑣 of 𝑢 in lines 7–8 will
complete the induction step, that is, any shortest path 𝑃 from 𝑠 to 𝑣 where all nodes
but 𝑣 are white has weight 𝑑[𝑣]. If the last arc of such a shortest path is not (𝑢, 𝑣),
then this is true by the induction hypothesis. If the last arc of 𝑃 is (𝑢, 𝑣), then 𝑃
without its last node 𝑣 defines a shortest path from 𝑠 to 𝑢 (where all nodes are white),
were we just proved 𝑑[𝑢] = dist(𝑠, 𝑢), and hence 𝑤(𝑃) = 𝑑[𝑢] + 𝑤(𝑢, 𝑣) = 𝑑[𝑣]
because that is how 𝑑[𝑣] has been updated in line 8. This completes the induction.
Proof. When the algorithm terminates, every node is either white or black. As
shown in the preceding proof, at the end of each iteration of the main loop there
are no arcs (𝑥, 𝑦) where 𝑥 is white and 𝑦 is black. Hence, the white nodes are exactly
the nodes 𝑢 that can be reached from 𝑠 by a path, with dist(𝑠, 𝑢) = 𝑑[𝑢] < ∞ by
Theorem 2.22. The black nodes 𝑣 are those that cannot be reached from 𝑠, where
dist(𝑠, 𝑣) = ∞ = 𝑑[𝑣] as set at initialisation in line 1.
In Dijkstra’s algorithm, a grey node 𝑢 with minimal 𝑑[𝑢] has already its final
distance 𝑑(𝑠, 𝑣) given by 𝑑[𝑢], so that 𝑢 can be made white. There can be no shorter
“detour” to reach 𝑢 via nodes that at that time are grey or black, because the first
grey node 𝑦 on such a path from 𝑠 to 𝑢 would fulfill 𝑑[𝑢] ≤ 𝑑[𝑦] (see the proof of
Theorem 2.22), and the remaining part of that path from 𝑦 to 𝑢 has nonnegative
weight by assumption. This argument fails for negative weights. In the following
network,
y u
−5
4 1
s
42 Chapter 2. Combinatorial Optimisation
the next node made white after 𝑠 is 𝑢 and is recorded with distance 1, and after
that node 𝑦 with distance 4. However, the path 𝑠, 𝑦, 𝑢 has weight −1 which is less
than the computed weight 1 of the path 𝑠, 𝑢. So the output of Dijkstra’s algorithm
is incorrect, here because of the negative weight of the arc (𝑦, 𝑢). It may happen
that the output of Dijkstra’s algorithm is correct (as in the preceding example if
𝑤(𝑠, 𝑢) = 5), but in general this is not guaranteed.
We now analyse the running time of Dijkstra’s algorithm. Let 𝑛 = |𝑉 | and
𝑚 = |𝐴|. The initialisation in lines 1–2 takes time 𝑂(𝑛), and so does the final
output (if 𝑑[𝑣] is not taken directly as the output) in line 9. In each iteration of the
main loop in lines 3–8, exactly one node becomes (and stays) white. Hence, the
loop is performed 𝑛 times, assuming (which in general is the case) that all nodes
are eventually white, that is, are reachable by a path from the source node 𝑠. We
assume that the colour of a node 𝑣 is represented by the array entry colour[𝑣], and
that nodes themselves are just represented by the numbers 1, . . . , 𝑛. By iterating
through the colour array, identifying the grey nodes in line 3 and finding the node 𝑢
with minimal 𝑑[𝑢] in line 4 takes time 𝑂(𝑛). (Even if the number of grey nodes were
somehow represented in an array of shrinking size, it is possible that they are at
least a constant fraction, if not all, of the nodes that are not white, and their number
is initially 𝑛, then 𝑛 − 1, and so on, so that the number of nodes checked in line 4
counted over the 𝑛 iterations of the main loop is 𝑛 + (𝑛 − 1) + · · · + 2 + 1 = 𝑂(𝑛 2 ),
which is the same as in our current analysis.)
The inner loop in lines 6–8 of Algorithm 2.21 iterates through the nodes 𝑣,
and checks if they are not white and if they are neighbours of 𝑢, that is, (𝑢, 𝑣) ∈ 𝐴.
The time this takes depends on the representation of the digraph (𝑉 , 𝐴). If the
neighbours of every node 𝑢 are stored in an adjacency list as in (2.7), then this is
as efficient as possible, that is, over the entire iterations of the main loop each arc
(𝑢, 𝑣) is considered exactly once, namely after 𝑢 has just become white. So over
all iterations of the main loop the steps in lines 6–8 take time 𝑂(𝑚). Alternatively,
the digraph may be represented by an adjacency table which has Boolean entries
𝑎[𝑢, 𝑣] which have value true if and only if (𝑢, 𝑣) ∈ 𝐴, otherwise false. In that case,
all nodes 𝑣 have to be checked in line 6, which takes time 𝑂(𝑛) for each iteration of
the main loop.
Taken together, lines 4, 5, and 6–8 take time 𝑂(𝑛) + 𝑂(1) + 𝑂(𝑛), and because
they are performed up to 𝑛 times in total time 𝑂(𝑛 2 ), which dominates the overall
running time compared to lines 1–2 and 9. If the digraph is represented by
adjacency lists, the running time is 𝑂(𝑛 2 ) for lines 3–4 plus 𝑂(𝑚) for lines 6–8 over
all iterations of the main loop, which is 𝑂(𝑛 2 ) + 𝑂(𝑚) = 𝑂(𝑛 2 ) because 𝑚 ∈ 𝑂(𝑛 2 ),
by (2.19). In summary, for a network with 𝑛 nodes, the running time of Dijkstra’s
algorithm is 𝑂(𝑛 2 ). This is better by a factor of 𝑛 than the Bellman–Ford algorithm,
but requires the assumption of nonnegative arc weights.
2.10. Reminder of Learning Outcomes 43
Exercise 2.1. In the example (2.1) of the marriage problem with three women and
men on each side, it was shown in the text that it is not possible to find a perfect
matching of three couples. Now assume you can add exactly one more possible
couple as an edge to the graph in (2.1), for example the pair (𝑎 1 , 𝑏2 ). Show for
which added edge it will be possible to create a perfect matching, and for which
added edge it will not work and why (arguing similarly to the text).
Exercise 2.4. Recall Algorithm 2.11 that computes the minimum of 𝑛 array elements:
Input : 𝑛 numbers 𝑆[1], 𝑆[2], . . . , 𝑆[𝑛], 𝑛 ≥ 1.
Output : their minimum 𝑚 and its index 𝑖, so that 𝑚 = 𝑆[𝑖 ] and 𝑚 ≤ 𝑆[𝑘] for all 𝑘.
1. 𝑚 ← 𝑆[1]
2. 𝑖 ← 1
3. 𝑘 ← 2
4. while 𝑘 ≤ 𝑛 :
5. if 𝑆[𝑘] < 𝑚 :
6. 𝑚 ← 𝑆[𝑘]
7. 𝑖 ← 𝑘
8. 𝑘 ← 𝑘+1
The array elements 𝑆[1], 𝑆[2], . . . , 𝑆[𝑛], 𝑛 ≥ 1 need not be distinct so that the
returned index 𝑖 with 𝑚 = 𝑆[𝑖 ] may not be unique according to the output speci-
fication. With this algorithm, what is 𝑖 when 𝑆[1], 𝑆[2], . . . , 𝑆[𝑛] = 5, 3, 3, 4, 3, 8?
Which value of 𝑖 is returned in general? How should one modify the algorithm
so that the index 𝑖 is as large as possible (i.e., 𝑆[𝑖 ] is the last among the minimal
elements in the array)? Justify your answers.
Exercise 2.5. In the adjacency list for a digraph we place 𝑦 in column 𝑥 whenever
(𝑥, 𝑦) is an arc. Sketch the digraph whose adjacency list is the following:
𝑎 𝑏 𝑐 𝑑 𝑒 𝑓
𝑑 𝑎 𝑏 𝑏 𝑓 𝑎
𝑒 𝑐
𝑒
Find a directed path from 𝑐 to 𝑓 , and all directed cycles that start and end at 𝑑.
Exercise 2.6.
(a) Apply the first version of the Bellman-Ford algorithm (which records separate
distances 𝑑[𝑣, 𝑖 ] from the source 𝑠 for each vertex 𝑣 and iteration 𝑖 ) to the
following network with source 𝑠. Do so by listing for each vertex the interme-
diate distances for each iteration and showing which ones are newly updated.
What is the output of the algorithm, with the found distances from 𝑠 to each
vertex?
Also, for each node 𝑣, give the predecessor node pred[𝑣] on the shortest path
from the source 𝑠 to 𝑣 (record any update of pred[𝑣] in the table that documents
the progress of the algorithm).
2.11. Exercises for Chapter 2 45
−1
c d
2 3 −7
s −2 6 2 e
3 1
a 1 b
(b) Do the same as in (a) with the same network except that the weight 6 on the
arc (𝑑, 𝑎) is replaced with weight 4.
Exercise 2.7. Apply the first version of the Bellman-Ford algorithm to the following
network.
−2
s x
1
2 −1
y z
1
Exercise 2.8. Using the definition of 𝑂-notation, prove the following: Let 𝑓 : N → R
be a polynomial of degree 𝑑 ≥ 0 with positive leading coefficient, that is,
where 𝑎 𝑑 , 𝑎 𝑑−1 , . . . , 𝑎 1 , 𝑎0 are real numbers with 𝑎 𝑑 > 0. Prove that 𝑓 (𝑛) ∈ 𝑂(𝑛 𝑑 ).
Hint: consider a polynomial that uses only the positive coefficients of 𝑓 .
Exercise 2.9. What is the running time of the algorithm in Exercise 2.4 that finds a
minimum in an array of 𝑛 numbers? Use 𝑂-notation. Justify your answer.
Exercise 2.10.
(a) Demonstrate Dijkstra’s shortest-path algorithm for the following network with
source node 𝑎, in the style of the example in (2.18). Draw the computed
shortest-path tree.
b 7
4 6
e f
3
a d 3
1 1
2
1
c g h
7 2
(b) Would the algorithm still give the correct answer for this network if the weight
7 of the arc (𝑏, 𝑒) was replaced by −7 ? Explain precisely why or why not.
3
Continuous Optimisation
3.1 Introduction
46
3.1. Introduction 47
You should know the important concepts of real analysis, which we review in this
chapter. A good introductory book is
Bryant, V. (1990). Yet Another Introduction to Analysis. Cambridge University
Press, Cambridge, UK. ISBN 978-0521388351.
The image of “seaview hotels” to prove the Bolzano–Weierstrass theorem using
Proposition 3.15 is taken from page 32 of that book. That book is also a good
introduction to the material in Section 3.6.
The following introductory textbook on optimisation provides also some
foundational material.
Sundaram, R. K. (1996). A First Course in Optimization Theory. Cambridge
University Press, Cambridge, UK. ISBN 978-0521497190.
In particular, appendix B and section 1.2.4 of that book complement our (optional)
Section 3.4 on constructions of the real numbers. Further useful explanations of
this topic are found in these wikipedia articles on construction of the real numbers
and the Dedekind cut:
https://en.wikipedia.org/wiki/Construction_of_the_real_numbers
https://en.wikipedia.org/wiki/Dedekind_cut
For proofs about real numbers as Dedekind cuts see pages 17–21 of
Rudin, W. (1976). Principles of Mathematical Analysis, 3rd ed., volume 3. McGraw-
Hill, New York. ISBN 978-0070542358.
A lot of this chapter is review material, in a more rigorous version that what you
may know already, but also largely self-contained:
48 Chapter 3. Continuous Optimisation
• Sections 3.2–3.4 deal with the real numbers, namely why we can maximise real
values (which uses their order, see Section 3.2) and their main property of the
existence of infimum and supremum for nonempty bounded sets (Section 3.3).
Section 3.4 (which is optional, as indicated by a star *) explains mathematical
“constructions” of real numbers. We mostly rely on the intuition of the real
numbers as points on a line.
• Section 3.5 recalls concepts about functions such as domain, range, and image.
• Sequences and their limits are important tools to prove continuity. They make
their appearance twice, in Section 3.6 for real numbers and in Section 3.8 for
vectors of real numbers (elements of R𝑛 ).
• Section 3.7 is about measuring distance in R𝑛 .
• Open and closed sets (Section 3.9) and compact sets (Section 3.10) are important
concepts when studying continuity, the topic of Sections 3.11 and 3.12.
• All this leads up to the central theorem of this chapter, the Theorem of
Weierstrass (Section 3.13) and its use (Section 3.14).
and a second point as 1 (normally to the right of 0 if the line is drawn horizontally,
or above 0 if the line is drawn vertically). Then the distance between 0 and 1 is the
“unit length”. Any other point 𝑥 is then a point on this line where 𝑥 is to the right
of 0 if 𝑥 > 0, to the left of 0 if 𝑥 < 0, with a distance 𝑥 from 0 assuming that the
distance between 0 and 1 is 1. The set of reals R is thought to be exactly the set of
these points on the line.
The fact that real numbers are ordered makes them one of the most useful
mathematical objects in practical applications. For example, the complex numbers
cannot be ordered in such a useful way. The complex numbers allow us to solve
arbitrary polynomial equations such as 𝑥 2 = −1, for which no real solution 𝑥 exists,
because 𝑥 2 ≥ 0 holds for every real number 𝑥. This, as we show shortly, is a
property of the order relation ≥, and so we cannot have a system of numbers that
we can order and find a minimum or maximum, and that at the same time allows
solving arbitrary polynomial equations.
We state a few properties of the order relation ≥ that imply the inequality
𝑥2 ≥ 0 for all 𝑥. Most importantly, the order relation 𝑥 ≥ 𝑦 should be compatible
with addition in the sense that we can add any number 𝑧 to both sides and preserve
the property (which is obvious from our picture of the real line). That is, for any
reals 𝑥, 𝑦, 𝑧
𝑥 ≤ 𝑦 ⇒ 𝑥+𝑧 ≤ 𝑦+𝑧 (3.1)
which for 𝑧 = −𝑥 − 𝑦 implies
𝑥≤𝑦 ⇒ − 𝑦 ≤ −𝑥 . (3.2)
Because −𝑥 = (−1) · 𝑥, the implication (3.2) states the well-known rule that
multiplication with −1 reverses an inequality. If you forget why this is the case, simply
subtract the terms on both sides from the inequality to get this result, as we have
done. Another condition concerning multiplication of real numbers is
𝑥, 𝑦 ≥ 0 ⇒ 𝑥 · 𝑦 ≥ 0 . (3.3)
In terms of the real number line, this means that 𝑦 is “stretched” (if 𝑥 > 1) or
“shrunk” (if 0 ≤ 𝑥 < 1) or stays the same (if 𝑥 = 1) when multiplied with the
nonnegative number 𝑥, but stays on the same side as 0 (this holds for any real
number 𝑦; here 𝑦 is also assumed to be nonnegative). Condition (3.3) holds for a
positive integer 𝑥 as a consequence of (3.1), because 𝑦 ≥ 0 implies 𝑦 + 𝑦 ≥ 𝑦 and
hence 2𝑦 = 𝑦 + 𝑦 ≥ 𝑦 ≥ 0, and similarly for any repeated addition of 𝑦. Extending
this from positive integers 𝑥 to real numbers 𝑥 gives (3.3).
We now show that 𝑥 · 𝑥 ≥ 0 for any real number 𝑥. This holds if 𝑥 ≥ 0 by
(3.3). If 𝑥 ≤ 0 then −𝑥 ≥ 0 by (3.2), and so 𝑥 · 𝑥 = (−1) · (−1) · 𝑥 · 𝑥 = (−𝑥)(−𝑥) ≥ 0
again by (3.2), where we have used that (−1) · (−1) = 1. This, in turn, follows
from something we have already used, namely that (−1) · 𝑦 = −𝑦 for any 𝑦,
50 Chapter 3. Continuous Optimisation
𝑥 ≤ 𝑦, 𝑦 ≤ 𝑧 ⇒ 𝑥 ≤ 𝑧. (3.4)
𝑥 ≤ 𝑦, 𝑦 ≤ 𝑥 ⇒ 𝑥 = 𝑦. (3.5)
𝑥 ≤ 𝑥. (3.6)
(and correspondingly the relations ≥ and >). Transitivity (3.4), antisymmetry (3.5)
and reflexivity (3.6) define what is called a partial order. The term “partial” means
that there can be elements 𝑥 and 𝑦 that are incomparable in the sense that neither
𝑥 ≤ 𝑦 not 𝑦 ≤ 𝑥 holds. One of the most important partial orders is the inclusion
relation ⊆ between sets, where 𝐴 ⊆ 𝐵 means that 𝐴 is a subset of 𝐵.
For the order ≤ on the set R of reals, incomparability does not occur. This
order is therefore called total in the sense that for all 𝑥, 𝑦 ∈ R
𝑥 ≤ 𝑦 or 𝑦 ≤ 𝑥 . (3.8)
Definition 3.1. An ordered set is a set 𝑆 together with a binary relation ≤ that is
transitive, antisymmetric, and reflexive, that is, (3.4) (3.5), (3.6) hold for all 𝑥, 𝑦, 𝑧
in 𝑆. The order is called total and the set totally ordered if (3.8) holds for all 𝑥, 𝑦 in 𝑆.
3.3. Infimum and Supremum 51
The real numbers R allow addition and multiplication, and comparison with the
order ≤. The same applies to the rational numbers Q. However, the real numbers
have an additional property of completeness that we will state using the order
relation ≤. The rational numbers lack this√property. Because there is no rational
number 𝑥 such that 𝑥 2 = 2 (in other words, 2 is irrational), the set {𝑥 ∈ Q | 𝑥 2 < 2}
has no least upper bound in Q (defined shortly), but it does in R. Intuitively, the
parabola that consists of all pairs (𝑥, 𝑦) such that 𝑦 = 𝑥 2 − 2 is a “continuous curve”
in R2 √that should intersect the “𝑥-axis” where 𝑦 = 0 at two points (𝑥, 𝑦) where
𝑥 = ± 2 and 𝑦 = 0, in agreement with the intuition that 𝑥 can take all values on
the real number line.
The following definition of an upper or lower bound of a set applies to any
ordered set; the order need not be total. The applications we have in mind occur
when the ordered set 𝑆 is R or Q.
Definition 3.3. Let 𝑆 be an ordered set and let 𝐴 ⊆ 𝑆. We say 𝐴 is bounded from
above if 𝐴 has an upper bound, and bounded from below if 𝐴 has a lower bound,
and just bounded if 𝐴 is bounded from above and below. The least upper bound or
supremum of a set 𝐴, denoted sup 𝐴, is the least element of the set of upper bounds
of 𝐴 (if it exists). The greatest lower bound or infimum of a set 𝐴, denoted inf 𝐴, is
the greatest element of the set of lower bounds of 𝐴 (if it exists).
We have stated condition 3.5 as an axiom about the real numbers rather than a
theorem. That is, we assume this condition for all sets of real numbers, according
to our intuition about the real number line.
It is also possible to prove Axiom 3.5 as a theorem when one has “constructed” the
real numbers. There are several ways to do this, which we outline in this section,
as an excursion and general background.
The standard approach is to consider the set of R as a “system” of numbers
in a sequence of increasingly powerful systems N, Z, Q, R. We first consider the
set N of natural numbers (positive integers) as used for counting, then introduce
zero, and negative integers, which gives us the set of all integers Z. We also have a
way of writing down these integers, namely as finite sequences of decimal digits
(elements of {0, 1, . . . , 9}), preceded by a minus sign for negative integers. This
representation of an integer is unique if the first digit is not 0; all integers can be
written in this way except for the integer zero itself which is written as 0.
3.4. Constructing the Real Numbers * 53
That is, for any positive 𝜀, which can be as small as one likes, there is a subscript 𝐾
so that for all 𝑖 and 𝑗 that are at least 𝐾, the sequence elements 𝑥 𝑖 and 𝑥 𝑗 differ by
less than 𝜀. Note that, in particular, we could choose 𝑖 = 𝐾 and 𝑗 arbitrarily larger
than 𝑖, and yet 𝑥 𝑖 and 𝑥 𝑗 would differ by less than 𝜀.
In the Cauchy condition (3.9), all elements 𝑥 1 , 𝑥2 , . . . and 𝜀 can be rational
numbers. An example of such a Cauchy sequence is the sequence of finite decimal
fractions 𝑥 𝑘 obtained from an infinite decimal fraction up to the 𝑘th place past
the decimal point. For example, if the infinite decimal fraction is 3.1415926 . . .,
then this sequence of rational numbers is given by 𝑥1 = 3.1, 𝑥2 = 3.14, 𝑥 3 = 3.141,
𝑥4 = 3.1415, 𝑥 5 = 3.14159, 𝑥6 = 3.141592, 𝑥7 = 3.1415926, and so on, which is easily
seen to be a Cauchy sequence.
More generally, we can define a real number to be a Cauchy sequence of rational
numbers. Two sequences {𝑥 𝑘 } and {𝑦 𝑘 } are equivalent if |𝑥 𝑘 − 𝑦 𝑘 | is arbitrarily small
for sufficiently large 𝑘, and if one of these two sequences is a Cauchy sequence
then so is the other (which is easily seen). Any two equivalent Cauchy sequences
define the same real number. Note that a real number is an infinite object (in fact
an entire equivalence class of Cauchy sequences of rational numbers), similar to
an infinite decimal fraction.
With real numbers defined as Cauchy sequences of rational numbers, it is
possible to prove Axiom 3.5 as a theorem. This requires to show the existence
of limits of sequences of real numbers, and the construction of a supremum as a
suitable limit; see appendix B and section 1.2.4 of Sundaram (1996).
We mention a second possible construction of the real numbers where Axiom
3.5 is much easier to prove. A Dedekind cut is a partition of Q into a lower set 𝐿 and
an upper set 𝑈 such that 𝑎 < 𝑏 for every 𝑎 ∈ 𝐿 and 𝑏 ∈ 𝑈, and such that 𝐿 has no
maximal element. The idea is that each real number 𝑥 defines uniquely such a cut
of the rational numbers into 𝐿 and 𝑈 given by
If 𝑥 is itself a rational number, then 𝑥 belongs to the upper set 𝑈 for the Dedekind
cut 𝐿, 𝑈 for 𝑥 (which is why we require that 𝐿 has no maximal element, to make
this a unique choice). If 𝑥 is irrational, then 𝑥 belongs to neither 𝐿 nor 𝑈 and is
“between” 𝐿 and 𝑈. Hence, we can see this cut as a definition of 𝑥. The Dedekind
cut 𝐿, 𝑈 in (3.10) that represents 𝑥 is unique. This holds because any two different
3.5. Maximisation and Minimisation 55
real numbers 𝑥 and 𝑦 define different cuts (because a suitable rational number 𝑐
with 𝑥 < 𝑐 < 𝑦 belongs to the upper set of the cut for 𝑥 but to the lower set of the
cut for 𝑦).
In constructing R as Dedekind cuts, the described partitions 𝐿, 𝑈 of Q, each
real number has a unique description as such a cut. (The lower set 𝐿 suffices,
because 𝑈 is just the set of upper bounds of 𝐿 in Q.) Similar to the representation as
a Cauchy sequence, a real number 𝑥 has an infinite description as a set of rational
numbers. If 𝑥 is described by the cut 𝐿, 𝑈 and 𝑥 ′ is described by the cut 𝐿′ , 𝑈 ′,
then we can define 𝑥 ≤ 𝑥 ′ by the inclusion relation 𝐿 ⊆ 𝐿′ (as seen from (3.10)).
Now Axiom 3.5, the order completeness of R, is very easy to prove: Given a
nonempty set 𝐴, bounded above, of real numbers 𝑥 represented by their lower
cut sets 𝐿 in (3.10), the supremum of 𝐴 is represented by the union of these sets 𝐿.
This union is a set of rational numbers, which can be easily shown to fulfill the
properties of a lower cut set of a Dedekind cut, and thus defines a real number,
which can be shown to be the supremum sup 𝐴.
Dedekind cuts are an elegant construction of the real numbers from the rational
numbers. It is slightly more complicated to define arithmetic operations of addition
and, in particular, multiplication of real numbers via the rational numbers in the
respective cut sets than using Cauchy sequences; see Rudin (1976, pages 17–21).
Dedekind cuts are an abstraction that “defines” a point 𝑥 on the real line via
all the rational numbers 𝑎 to the left of 𝑥, which defines the lower cut set 𝐿 in (3.10).
This infinite set 𝐿 is mathematically “simpler” than 𝑥 because it contains only
rational numbers 𝑎. We “know” these rational numbers via their finite descriptions
as fractions, but as points on the line they do not provide a good intuition about
the reals. In our reasoning about the real numbers, we therefore refer usually to
our intuition of the real number line.
Real numbers are the values that a function takes which we want to optimise
(maximise or minimise). The domain of the considered function can be rather
general, and will often by denoted by 𝑋, which is always a nonempty set.
Consider a function 𝑓 : 𝑋 → R, where 𝑋 is a nonempty set. The function 𝑓 is
called the objective function. The domain 𝑋 of 𝑓 is sometimes called the constraint
set (typically when 𝑋 is described by certain “constraints”, for example 𝑥 ≥ 0 and
𝑥 ≤ 1 if 𝑋 is the interval [0, 1]).
Our basic optimisation problems are:
(a) maximise 𝑓 (𝑥) subject to 𝑥 ∈ 𝑋,
(b) minimise 𝑓 (𝑥) subject to 𝑥 ∈ 𝑋.
56 Chapter 3. Continuous Optimisation
The following is an easy but useful observation, proved with the help of (3.2).
We use it in order to focus on maximisation problems and to avoid repeating very
similar considerations for minimisation problems.
Theorem 3.10. Suppose 𝑋 = 𝑋1 ∪ 𝑋2 (the two sets 𝑋1 and 𝑋2 need not be disjoint) such
that there exists an element 𝑦 in 𝑋1 with 𝑓 (𝑦) ≥ 𝑓 (𝑥) for all 𝑥 ∈ 𝑋2 and that 𝑓 attains a
maximum on 𝑋1 . Then 𝑓 attains a maximum on 𝑋.
Analysis and the study of continuity require the use of sequences and limits. For
the moment we consider only sequences of real numbers. Recall that a sequence
𝑥1 , 𝑥2 , . . . is denoted by {𝑥 𝑘 } 𝑘∈N or just {𝑥 𝑘 }. The limit of such a sequence, if it
exists, is a real number 𝐿 so that the elements 𝑥 𝑘 of the sequence are eventually
arbitrary close to 𝐿. This closeness is described by a maximum distance from 𝐿,
often called 𝜀, that is chosen as an arbitrarily small positive real number.
In words, (3.11) says that for every (arbitrarily small) positive 𝜀 there is some
index 𝐾 such that from 𝐾 onwards (𝑘 ≥ 𝐾) all sequence elements 𝑥 𝑘 differ in
absolute value by less than 𝜀 from 𝐿.
The next proposition asserts that if a sequence has a limit, that limit is unique.
You should remember this from a course on real analysis.
⇒ Try proving Proposition 3.12 on your own before you study its proof.
Proof. Suppose there are two limits 𝐿 and 𝐿′ of the sequence {𝑥 𝑘 } 𝑘∈N with 𝐿 ≠ 𝐿′.
We arrive at a contradiction as follows. Let 𝜀 = |𝐿 − 𝐿′ |/2 and consider 𝐾 and 𝐾 ′
such that 𝑘 ≥ 𝐾 implies |𝑥 𝑘 − 𝐿| = |𝐿 − 𝑥 𝑘 | < 𝜀 and 𝑘 ≥ 𝐾 ′ implies |𝑥 𝑘 − 𝐿′ | < 𝜀.
Consider some 𝑘 such that 𝑘 ≥ 𝐾 and 𝑘 ≥ 𝐾 ′. Then
2𝜀 = |𝐿 − 𝐿′ | = |𝐿 − 𝑥 𝑘 + 𝑥 𝑘 − 𝐿′ | ≤ |𝐿 − 𝑥 𝑘 | + |𝑥 𝑘 − 𝐿′ | < 𝜀 + 𝜀 = 2𝜀, (3.12)
which is a contradiction.
The symbol ∞ for (positive) infinity can be thought of as an extra element
that is larger than any real number. Similarly, −∞ is an additional element that is
smaller than any real number. In terms of the order ≤, it is unproblematic to extend
the set R with the elements ∞ and −∞. However, when used with arithmetic
operations these infinite elements are in general not useful and should not be
treated as “numbers”; for example, ∞ − ∞ cannot be meaningfully defined.
We say that a sequence is bounded (from above or below, or just bounded
if both) if this holds for the set of its elements. A sequence that converges is
necessarily bounded: Fix some 𝜀 (for example 𝜀 = 1). Choose 𝐾 in (3.11) such
that 𝐿 − 𝜀 < 𝑥 𝑘 < 𝐿 + 𝜀 for all 𝑘 ≥ 𝐾. If 𝐾 = 1 then the sequence is bounded by
𝐿 − 𝜀 from below and 𝐿 + 𝜀 from above. If 𝐾 > 1, then the set {𝑥 𝑖 | 1 ≤ 𝑖 < 𝐾} is
nonempty and finite, and thus has a maximum 𝑎 and minimum 𝑏. Then the larger
number of 𝐿 + 𝜀 and 𝑎 is an upper bound and the smaller number of 𝐿 − 𝜀 and 𝑏 is
a lower bound for the sequence.
An unbounded sequence may nevertheless show the “limiting behaviour” that
eventually its elements become arbitrarily large. We then say that the sequence
tends to infinity.
Definition 3.13. A sequence {𝑥 𝑘 } 𝑘∈N of real numbers tends to infinity, written
𝑥 𝑘 → ∞ as 𝑘 → ∞, or lim 𝑘→∞ 𝑥 𝑘 = ∞, if
∀𝑀 ∈ R ∃𝐾 ∈ N ∀𝑘 ∈ N : 𝑘 ≥ 𝐾 ⇒ 𝑥 𝑘 > 𝑀 . (3.13)
For example, these may be the even numbers given by 𝑘 𝑛 = 2𝑛, but they do not
need to be defined explicitly. The subsequence just considers some (infinite) subset
of the elements of the original sequence in increasing order.
Proposition 3.14. A weakly increasing sequence that is bounded from above converges to
the supremum of the set of its elements. A weakly decreasing sequence that is bounded
from below converges to the infimum of the set of its elements.
Proof. Let the sequence be {𝑥 𝑘 } and let 𝐴 = {𝑥 𝑘 | 𝑘 ∈ N} be the set of the elements of
the sequence. Assume 𝐴 is bounded from above, so 𝐴 has a supremum 𝐿 = sup 𝐴
by Axiom 3.5. Let 𝜀 > 0. We want to show that for some 𝐾 ∈ N we have |𝑥 𝑘 − 𝐿| < 𝜀
for all 𝑘 ≥ 𝐾. Because 𝐿 is an upper bound of 𝐴, we have 𝐿 ≥ 𝑥 𝑘 and thus
|𝑥 𝑘 − 𝐿| = 𝐿 − 𝑥 𝑘 for all 𝑘, so we want to show 𝐿 − 𝑥 𝑘 < 𝜀 or equivalently 𝐿 − 𝜀 < 𝑥 𝑘 .
It suffices to find 𝐾 with 𝐿 − 𝜀 < 𝑥 𝐾 because 𝑥 𝐾 ≤ 𝑥 𝐾+1 ≤ 𝑥 𝐾+2 ≤ · · · ≤ 𝑥 𝑘 for all
𝑘 ≥ 𝐾 because the sequence is weakly increasing. Now, if for all 𝐾 ∈ N we had
𝐿 − 𝜀 ≥ 𝑥 𝐾 then 𝐿 − 𝜀 would be an upper bound of 𝐴 which is less than 𝐿, but 𝐿 is
the least upper bound of 𝐴. So the desired 𝐾 with 𝐿 − 𝜀 < 𝑥 𝐾 exists.
The claim for the infimum is proved similarly, or obtained by considering the
sequence {−𝑥 𝑘 } and its supremum instead of the original sequence.
Proof. The following argument has a nice visualisation in terms of “hotels that have
a view of the sea”. Suppose the real numbers 𝑥1 , 𝑥2 , . . . are the heights of hotels.
From the top of each hotel with height 𝑥 𝑘 you can look beyond the subsequent
hotels with heights 𝑥 𝑘+1 , 𝑥 𝑘+2 , . . . if they have lower height, and see the sea at
infinity if these are all lower. In other words, a hotel has “seaview” if it belongs to
the set 𝑆 given by
𝑆 = {𝑘 ∈ N | 𝑥 𝑘 > 𝑥 𝑗 for all 𝑗 > 𝑘}
(presumably, these are very expensive hotels). If 𝑆 is infinite, then we take the
elements of 𝑆, in ascending order as the subscripts 𝑘 1 , 𝑘2 , 𝑘3 , . . . that give our
subsequence 𝑥 𝑘1 , 𝑥 𝑘2 , 𝑥 𝑘3 , . . ., which is clearly decreasing. If, however, 𝑆 is finite
with maximal element 𝐾 (take 𝐾 = 0 if 𝑆 is empty), then for each 𝑘 > 𝐾 we have
𝑘 ∉ 𝑆 and hence for 𝑥 𝑘 there exists some 𝑗 > 𝑘 with 𝑥 𝑘 ≤ 𝑥 𝑗 . Starting with 𝑥 𝑘1
for 𝑘 1 = 𝐾 + 1 we let 𝑘2 = 𝑗 > 𝑘 1 with 𝑥 𝑘1 ≤ 𝑥 𝑘2 . Then find another 𝑘 3 > 𝑘2 such
that 𝑥 𝑘2 ≤ 𝑥 𝑘3 , and so on, which gives a weakly increasing subsequence {𝑥 𝑘 𝑛 } 𝑛∈N
60 Chapter 3. Continuous Optimisation
Proof. The sequence has a monotonic subsequence by Proposition 3.15. Because the
sequence is bounded, so is the subsequence, whose set of elements therefore has
supremum and infinimum. If the subsequence is weakly increasing, it converges
to the supremum of its elements by Proposition 3.14; if the subsequence is weakly
decreasing, it converges to the infimum.
Recall that if 𝑋 and 𝑌 are sets and 𝑓 : 𝑋 → 𝑌 is a function, then 𝑋 is called the
domain and 𝑌 the range of 𝑓 . The range should be distinguished from the image
of 𝑓 , denoted by
𝑓 (𝑋) = { 𝑓 (𝑥) | 𝑥 ∈ 𝑋 } , (3.14)
which is the set of possible values of 𝑓 . The image 𝑓 (𝑋) is always a subset of the
range 𝑌. When 𝑓 (𝑋) = 𝑌 then 𝑓 is called a surjective function. Because we want
to maximise or minimise the function, the range will always be R, that is, 𝑓 is
real-valued.
The domain 𝑋 of 𝑓 will in the following be assumed to be a subset of R𝑛 ,
which is the set of 𝑛-tuples of real numbers,
generalise this when 𝑥 and 𝑦 are points in R𝑛 . The standard “Euclidean” distance
is well known from measuring geometric distances in R2 , for example. In order to
deal with continuity, a distance function defined in terms of the “maximum norm”
will often be simpler to use.
𝑑(𝑥, 𝑦) = ∥𝑥 − 𝑦∥ , (3.18)
symmetry
𝑑(𝑥, 𝑦) = 𝑑(𝑦, 𝑥) , (3.21)
and the triangle inequality
It can be shown that the maximum-norm distance fulfills these axioms (see
Exercise 3.3), and so does the Euclidean norm.
The triangle inequality is then often stated as
∥𝑥 + 𝑦∥ ≤ ∥𝑥∥ + ∥ 𝑦∥ (3.23)
which implies (3.22) using 𝑥 − 𝑦 and 𝑦 − 𝑧 instead of 𝑥 and 𝑦. For an arbitrary set,
a trivially defined distance function that also fulfills axioms (3.20)–(3.22) is given
by 𝑑(𝑥, 𝑥) = 0 and 𝑑(𝑥, 𝑦) = 1 for 𝑥 ≠ 𝑦.
62 Chapter 3. Continuous Optimisation
Let 𝜀 > 0 and 𝑥 ∈ R𝑛 . The set of all points 𝑦 that have distance less than 𝜀 is
called the 𝜀-ball around 𝑥, defined as
𝐵(𝑥, 𝜀) = {𝑦 ∈ R𝑛 | ∥𝑦 − 𝑥∥ < 𝜀 } . (3.24)
It is also called the open ball because the inequality in (3.24) is strict. That is, 𝐵(𝑥, 𝜀)
does not include its “boundary”, called a sphere, which consists of all points 𝑦
whose distance to 𝑥 is equal to 𝜀.
Similary, the maximum-norm 𝜀-ball around a point 𝑥 in R𝑛 is defined as the set
𝐵max (𝑥, 𝜀) = {𝑦 ∈ R𝑛 | ∥𝑦 − 𝑥∥ max < 𝜀 } . (3.25)
The following is elementary but extremely useful:
∀𝑦 ∈ R𝑛 : ( 𝑦 ∈ 𝐵max (𝑥, 𝜀) ⇔ ∀𝑖 ∈ {1, . . . , 𝑛} : |𝑦 𝑖 − 𝑥 𝑖 | < 𝜀 ) . (3.26)
In other words, 𝑦 is in the maximum-norm 𝜀-ball around 𝑥 if and only if 𝑦 differs
from 𝑥 in each component by less than 𝜀. This follows immediately from (3.17).
The following picture shows the 𝜀-ball and the maximum-norm 𝜀-ball for
𝜀 = 1 around the origin 0 in R2 . The latter, 𝐵max (0, 1), is the set of all points (𝑥 1 , 𝑥2 )
so that −1 < 𝑥1 < 1 and −1 < 𝑥 2 < 1, which is the open square shown on the right.
x2 x2
x1 x1
0 0
(1,0) (1,0)
x1 x1
0 0
(1,0) (1,0)
3.8. Sequences and Convergence in R𝑛 63
as claimed.
In Section 3.6, we considered sequences of real numbers and their limits, if they
exist. In this section we give analogous definitions and results for sequences of
points in R𝑛 . Because 𝑥 𝑖 denotes a component of a point 𝑥 = (𝑥1 , . . . , 𝑥 𝑛 ), we will
write sequence elements, which are now elements of R𝑛 , in the form 𝑥 (𝑘) for 𝑘 ∈ N.
Analogously to (3.11), the sequence {𝑥 (𝑘) } 𝑘∈N has limit 𝑥 ∈ R𝑛 , or 𝑥 (𝑘) → 𝑥 as
𝑘 → ∞, if
∀𝜀 > 0 ∃𝐾 ∈ N ∀𝑘 ∈ N : 𝑘 ≥ 𝐾 ⇒ ∥𝑥 (𝑘) − 𝑥 ∥ < 𝜀 . (3.29)
The sequence is bounded if there is some 𝑀 ∈ R so that
∀𝑘 ∈ N : ∥𝑥 (𝑘) ∥ ≤ 𝑀 . (3.30)
Analogously to Proposition 3.12, a sequence can have at most one limit. This is
proved in the same way, where the contradiction (3.12) is proved with the help of
the triangle inequality (3.23), using the norm instead of the absolute value.
In the definitions (3.29) and (3.30), we have used the Euclidean norm, but we
could have used in the same way the maximum norm as defined in (3.17) instead,
as asserted by the following lemma.
Lemma 3.18. The sequence {𝑥 (𝑘) } 𝑘∈N in R𝑛 has limit 𝑥 ∈ R𝑛 if and only if
∀𝜀 > 0 ∃𝐾 ∈ N ∀𝑘 ∈ N : 𝑘 ≥ 𝐾 ⇒ ∥𝑥 (𝑘) − 𝑥 ∥ max < 𝜀 , (3.31)
and it is bounded if and only if for some 𝑀 ∈ R
∀𝑘 ∈ N : ∥𝑥 (𝑘) ∥ max ≤ 𝑀 . (3.32)
64 Chapter 3. Continuous Optimisation
Proof. Suppose {𝑥 (𝑘) } converges to 𝑥 in the Euclidean norm. Let 𝜀 > 0, and choose
𝐾 in (3.29) so that 𝑘 ≥ 𝐾 implies ∥𝑥 (𝑘) − 𝑥∥ < 𝜀, that is, 𝑥 (𝑘) ∈ 𝐵(𝑥, 𝜀). Because
𝐵(𝑥, 𝜀) ⊆ 𝐵max (𝑥, 𝜀) by (3.27), this also means 𝑥 (𝑘) ∈ 𝐵max (𝑥, 𝜀), which shows (3.31).
Conversely, assume (3.31) holds and let 𝜀 > 0. Choose 𝐾 so that 𝑘 ≥ 𝐾 implies
√ √
𝑥 (𝑘)∈ 𝐵max (𝑥, 𝜀/ 𝑛). Then 𝐵max (𝑥, 𝜀/ 𝑛) ⊆ 𝐵max (𝑥, 𝜀) by (3.28), which shows
(3.29).
The equivalence of (3.30) and (3.32) is shown similarly.
According to (3.29), a sequence {𝑥 (𝑘) } converges to 𝑥 when for every 𝜀 > 0
eventually (that is, for sufficiently large 𝑘) all elements of the sequence are in
the open ball 𝐵(𝑥, 𝜀) of radius 𝜀 around 𝑥. The same applies to (3.31), using the
(square- or cubical-looking) ball 𝐵max (𝑥, 𝜀). Another useful view of (3.31) is that
the sequence {𝑥 (𝑘) } converges to 𝑥 if for each component 𝑖 = 1, . . . , 𝑛 of these
(𝑘)
𝑛-tuples, we have 𝑥 𝑖 → 𝑥 𝑖 as 𝑘 → ∞, because the condition
(𝑘)
|𝑥 𝑖 − 𝑥 𝑖 | < 𝜀 (3.33)
We are concerned with the behaviour of a function 𝑓 “near a point 𝑎”, that is, how
the function value 𝑓 (𝑥) behaves when 𝑥 is near 𝑎, where 𝑥 and 𝑎 are points in some
subset 𝑆 of R𝑛 . For that purpose, it is of interest whether 𝑎 can be approached
with suitable choices of 𝑥 from “all sides”, which is the case if there is an 𝜀-ball
around 𝑎 that is fully contained in 𝑆. If that is the case, then the set 𝑆 will be called
open according to the following definition.
By (3.27) and (3.28), we could use the maximum-norm ball instead of the
Euclidean-norm ball in (3.34), that is, 𝑆 is open if and only if
⇒ Prove that the open balls 𝐵(𝑎, 𝜀) and 𝐵max (𝑎, 𝜀) are themselves open subsets
of R𝑛 .
Definition 3.21. Let 𝑆 ⊆ R𝑛 . Then 𝑆 is called closed if for all 𝑎 ∈ R𝑛 and all
sequences {𝑥 (𝑘) } in 𝑆 (that is, 𝑥 (𝑘) ∈ 𝑆 for all 𝑘 ∈ N) with limit 𝑎 we have 𝑎 ∈ 𝑆.
Another common term for limit point is accumulation point. Clearly, a set is
closed if and only if it contains all its limit points. Trivially, every element 𝑎 of 𝑆 is a
limit point of 𝑆, by taking the constant sequence given by 𝑥 (𝑘) = 𝑎 in Definition 3.22.
The next lemma is important to show the connection between open and closed
sets.
Proof. Suppose 𝑆 is closed, so it contains all its limit points. We want to show that
𝑇 is open, so let 𝑎 ∈ 𝑇. We want to show that 𝐵(𝑎, 𝜀) ⊆ 𝑇 for some 𝜀 > 0. If that
was not the case, then for all 𝜀 > 0 there would be some element 𝑎 in 𝐵(𝑎, 𝜀) that
does not belong to 𝑇 and hence to 𝑆, so that 𝐵(𝑎, 𝜀) ∩ 𝑆 ≠ ∅. But then 𝑎 is a limit
point of 𝑆 according to Lemma 3.23, hence 𝑎 ∈ 𝑆 because 𝑆 is closed, contrary to
the assumption that 𝑎 ∈ 𝑇.
Conversely, assume 𝑇 is open, so for all 𝑎 ∈ 𝑇 we have 𝐵(𝑎, 𝜀) ⊆ 𝑇 for some
𝜀 > 0. But then 𝐵(𝑎, 𝜀) ∩ 𝑆 = ∅, and thus 𝑎 is not a limit point of 𝑆. Hence 𝑆
contains all its limits points (if not, such a point would belong to 𝑇), so 𝑆 is closed.
It is possible that a set is both open and closed, which applies to the full set R𝑛
and to the empty set ∅. (For any “connected space” such as R𝑛 , these are the only
possibilities.) A set may also be neither open nor closed, such as the half-open
interval [0, 1) as a subset of R1 . This set does not contain its limit point 1 and is
therefore not closed. It is also not open, because its element 0 does not have a ball
𝐵(0, 𝜀) = (−𝜀, 𝜀) around it that is fully contained in [0, 1). Another example of a
set which is neither open nor closed is the set {1/𝑛 | 𝑛 ∈ N} which is missing its
limit point 0.
The following theorem states that the intersection of any two open sets 𝑆 and
𝑆′ is open, and the arbitrary union 𝑖∈𝐼 𝑆 𝑖 of any open sets 𝑆 𝑖 is open. Similarly, the
Ð
intersection of any two closed sets 𝑆 and 𝑆′ is closed, and the arbitrary intersection
𝑖∈𝐼 𝑆 𝑖 of any closed sets 𝑆 𝑖 is closed. Here 𝐼 is any (possibly infinite) nonempty
Ð
set of subscripts 𝑖 for the sets 𝑆 𝑖 , and
𝑆 𝑖 = {𝑥 | ∃𝑖 ∈ 𝐼 : 𝑥 ∈ 𝑆 𝑖 } 𝑆 𝑖 = {𝑥 | ∀𝑖 ∈ 𝐼 : 𝑥 ∈ 𝑆 𝑖 } .
Ð Ñ
𝑖∈𝐼 and 𝑖∈𝐼 (3.37)
Theorem 3.25. Let 𝑆, 𝑆′ ⊆ R𝑛 , and let 𝑆 𝑖 ⊆ R𝑛 for 𝑖 ∈ 𝐼 for some arbitrary nonempty
set 𝐼. Then
(a) If 𝑆 and 𝑆′ are both open, then 𝑆 ∩ 𝑆′ is open.
(b) If 𝑆 and 𝑆′ are both closed, then 𝑆 ∪ 𝑆′ is closed.
(c) If 𝑆 𝑖 is open for 𝑖 ∈ 𝐼, then 𝑆 𝑖 is open.
Ð
𝑖∈𝐼
Proof. Assume both 𝑆 and 𝑆′ are open, and let 𝑎 ∈ 𝑆 ∩ 𝑆′. Then 𝐵(𝑎, 𝜀) ⊆ 𝑆 and
𝐵(𝑎, 𝜀′) ⊆ 𝑆′ for suitable positive 𝜀 and 𝜀′. The smaller of the two balls 𝐵(𝑎, 𝜀) and
𝐵(𝑎, 𝜀′) is therefore a subset of both sets 𝑆 and 𝑆′ and therefore of their intersection.
So 𝑆 ∩ 𝑆′ is open, which shows (a).
Condition (b) holds because if 𝑆 and 𝑆′ are closed, then 𝑇 = R𝑛 \ 𝑆 and
𝑇 ′ = R𝑛 \ 𝑆′ are open, and so is 𝑇 ∩ 𝑇 ′ by (a), and hence 𝑆 ∪ 𝑆′ = R𝑛 \ (𝑇 ∩ 𝑇 ′) is
open by Theorem 3.24.
3.10. Bounded and Compact Sets 67
To see (c), let 𝑆 𝑖 be open for all 𝑖 ∈ 𝐼, and let 𝑎 ∈ ∪𝑖∈𝐼 𝑆 𝑖 , that is, 𝑎 ∈ 𝑆 𝑗 for some
𝑗 ∈ 𝐼. Then there is some 𝜀 > 0 so that 𝐵(𝑎, 𝜀) is a subset of 𝑆 𝑗 , which is a subset of
the set ∪𝑖∈𝐼 𝑆 𝑖 which is therefore open.
We obtain (d) from (c) because the intersection of complements of sets is the
complement of their union, that is, 𝑖∈𝐼 (R𝑛 \ 𝑆 𝑖 ) = R𝑛 \ 𝑖∈𝐼 𝑆 𝑖 , which we consider
Ñ Ð
here for closed sets 𝑆 𝑖 and hence open sets R𝑛 \ 𝑆 𝑖 .
Note that by induction, Theorem 3.25(a) extends to the statement that the
intersection of any finite number of open sets is open. However, this is no longer
true for arbitrary intersections. For example, each of the intervals 𝑆𝑛 = (− 𝑛1 , 𝑛1 ) for
𝑛 ∈ N is open, but their intersection 𝑛∈N 𝑆𝑛 is the singleton {0} which is not an
Ñ
open set. Similarly, arbitrary unions of closed sets are not necessarily closed, for
example the closed intervals [ 𝑛1 , 1] for 𝑛 ∈ N, whose union is the half-open interval
(0, 1] which is not closed. However, (c) and (d) do allow arbitrary unions of open
sets and arbitrary intersections of closed sets.
Condition states (3.30) what it means for a sequence {𝑥 𝑘 } to be bounded. The same
definition applies to a set.
That is, 𝑆 is bounded if and only if the components 𝑥 𝑖 of the points 𝑥 in 𝑆 are
bounded.
Theorem 3.16 states that a bounded sequence in R has convergent subsequence.
The same holds for R𝑛 instead of R.
(𝑘) (𝑘)
Proof. Consider a bounded sequence {𝑥 (𝑘) } in R𝑛 , where 𝑥 (𝑘) = (𝑥 1 , . . . , 𝑥 𝑛 ) for
(𝑘)
each 𝑘. Because the sequence is bounded, the sequence of 𝑖th components {𝑥 𝑖 }
(𝑘)
is bounded in R, for each 𝑖 = 1, . . . , 𝑛. In particular, the sequence {𝑥1 } given
by the first component is bounded, and because it is a bounded sequence of real
68 Chapter 3. Continuous Optimisation
(𝑘 𝑗 )
numbers, it has a convergent subsequence by Theorem 3.16, call it {𝑥 1 } where
𝑘 𝑗 for 𝑗 = 1, 2, . . . indicates the subsequence. That is, the sequence {𝑥 (𝑘 𝑗 ) } 𝑗∈N of
points in R𝑛 converges in its first component. We now consider the sequence
(𝑘 𝑗 )
of real numbers {𝑥2 } 𝑗∈N given the second components of the elements of that
subsequence. Again, by Theorem 3.16, this sequence has a convergent subsequence
for suitable values 𝑗ℓ for ℓ = 1, 2, . . ., so that the resulting sequence of points
{𝑥 (𝑘 𝑗ℓ ) }ℓ ∈N of points in R𝑛 convergences in its second component. Because the
subscripts 𝑘 𝑗ℓ for ℓ ∈ N define a subsequence of 𝑘 𝑗 for 𝑗 ∈ N, the first components
(𝑘 𝑗 ) (𝑘 𝑗 )
𝑥1 ℓ of these vectors are a subsequence of the convergent sequence 𝑥1 which is
therefore also convergent with the same limit.
So the sequence of vectors {𝑥 (𝑘 𝑗ℓ ) }ℓ ∈N convergences in their first and second
component. We now proceed in the same manner by considering the sequence
(𝑘 𝑗 )
of third components {𝑥3 ℓ }ℓ ∈N of these vectors, which again has a convergent
subsequence since these are bounded real numbers, and that subsequence now
defines a sequence of vectors in R𝑛 that convergence in their first, second, and
third components. By continuing in this manner, we obtain eventually, after 𝑛
applications of Theorem 3.16, a subsequence of the original sequence {𝑥 (𝑘) } 𝑘∈N that
converges in each component. As mentioned at the end of Section 3.8, convergence
in each component means overall convergence. So we have found the required
subsequence.
By the previous theorem, a sequence of points in a bounded subset 𝑆 of R𝑛
has a convergent subsequence. If that sequence has also its limit in 𝑆, then 𝑆 is
called compact, which is a very important concept.
Definition 3.28. Let 𝑆 ⊆ R𝑛 . Then 𝑆 is called compact if and only if every sequence
of points in 𝑆 has a convergent subsequence whose limit belongs to 𝑆.
Theorem 3.29. Let 𝑆 ⊆ R𝑛 . Then 𝑆 is compact if and only if 𝑆 is closed and bounded.
Proof. Assume first that 𝑆 is closed and bounded, and consider an arbitrary
sequence of points in 𝑆. Then by Theorem 3.27, this sequence has a convergent
subsequence with limit 𝑥, say, which belongs to 𝑆 because 𝑆 is closed. So 𝑆 is
compact according to Definition 3.28.
Conversely, assume 𝑆 is compact. Consider any convergent sequence of
points in 𝑆 with limit 𝑥. Because 𝑆 is compact, that sequence has a convergent
subsequence, whose limit is also 𝑥, and which belongs to 𝑆. So every limit point of
𝑆 belongs to 𝑆, which means that 𝑆 is closed.
In order to show that 𝑆 is bounded, assume this is not the case. Then
for every 𝑘 ∈ N there is a point that we call 𝑥 (𝑘) in 𝑆 with ∥𝑥 (𝑘) ∥ ≥ 𝑘. This
3.11. Continuity 69
3.11 Continuity
lim 𝑓 (𝑥 𝑘 ) = 𝑓 ( lim 𝑥 𝑘 )
𝑘→∞ 𝑘→∞
The condition ∥ 𝑓 (𝑥) − 𝑓 (𝑥)∥ < 𝜀 in (3.40) states that 𝑓 (𝑥) belongs to the 𝜀-ball
around 𝑓 (𝑥) (in R𝑚 ), which says that the function values 𝑓 (𝑥) are “close” to 𝑓 (𝑥).
This is required to hold for all points 𝑥 provided these belong to 𝑆 (so that 𝑓 (𝑥)
is defined) and are within an 𝛿-ball around 𝑥. Here 𝛿 can be chosen as small as
required but must be positive. This captures the intuition that 𝑓 maps points near
𝑥 to points near 𝑓 (𝑥).
A simple example of a function that is not continuous is the function 𝑓 : R → R
defined by (
0 if 𝑥 ≠ 0,
𝑓 (𝑥) = (3.41)
1 if 𝑥 = 0,
which is not continuous at 𝑥 = 0. Namely, if we choose 𝜀 = 12 , for example, then
for any 𝛿 > 0 we will always have a point 𝑥 near 𝑥 so that we have ∥𝑥 − 𝑥∥ < 𝛿
but ∥ 𝑓 (𝑥) − 𝑓 (𝑥)∥ ≥ 𝜀, in this case for example 𝑥 = 𝛿/2 where ∥ 𝑓 (𝑥) − 𝑓 (𝑥)∥ = 1,
which contradicts the requirement (3.40).
𝑔(𝑥, 𝑦)
0
if (𝑥, 𝑦) = (0, 0)
𝑥𝑦
𝑔(𝑥, 𝑦) = (3.42)
𝑥2 + 𝑦2
otherwise.
What is interesting about the function 𝑔 is that it is separately continuous in each
variable, which means the following: Consider 𝑔(𝑥, 𝑦) as a function of 𝑥 for fixed 𝑦,
3.12. Proving Continuity 71
that is, consider the function 𝑔 𝑦 : R → R given by 𝑔 𝑦 (𝑥) = 𝑔(𝑥, 𝑦). If 𝑦 = 0, then we
clearly have 𝑔 𝑦 (𝑥) = 0 for all 𝑥, which is certainly a continuous function. If 𝑦 ≠ 0,
𝑥𝑦
then 𝑦 2 > 0, and 𝑔 𝑦 given by 𝑔 𝑦 (𝑥) = 𝑥 2 +𝑦 2 is also continuous. Because 𝑔(𝑥, 𝑦) is
symmetric in 𝑥 and 𝑦, the function 𝑔 is also separately continuous in 𝑦. However,
𝑔 is not a continuous function when its arguments are allowed to vary jointly.
Namely, for 𝑥 = 𝑦 ≠ 0 we have 𝑔(𝑥, 𝑦) = 𝑔(𝑥, 𝑥) = 𝑥 2𝑥+𝑥 2 = 12 , so this function is
2
constant but with a different constant value 12 than 𝑔(0, 0), which is zero. That is, 𝑔
is not continuous at (0, 0).
A useful criterion to prove that a function is continuous relates to our initial
consideration of this section in terms of sequences.
Proposition 3.32. Let 𝑆 ⊆ R𝑛 , let 𝑓 : 𝑆 → R𝑚 be a function defined on 𝑆, and let 𝑥 ∈ 𝑆.
Then 𝑓 is continuous at 𝑥 if and only if for all sequences {𝑥 (𝑘) } in 𝑆 that converge to 𝑥
𝑓 (𝑥) + 𝜀
𝑓 (𝑥)
𝑓 (𝑥) − 𝜀
0 𝑥
0 𝑥 1
The first function we consider is 𝑓 (𝑥) = 1/𝑥. The function is not defined for
𝑥 = 0. If we extend 𝑓 with some value for 𝑓 (0) in order to have 𝑓 (𝑥) defined
for all 𝑥 ∈ R, the resulting function is surely not continuous at 0 because 𝑓 (𝑥)
is arbitrarily negative for negative 𝑥 that are close to 0, and arbitrarily positive
for small positive 𝑥, and these function values will not approach 𝑓 (0) no matter
how we defined 𝑓 (0). But the domain of 𝑓 is not R but R \ {0}, which is an open
set, and on this set we will show that 𝑓 is continuous. In short, “continuity at 0”
is not an issue because 𝑓 (0) is not defined. In order to simplify things, we just
consider 𝑓 (𝑥) = 1/𝑥 for 𝑥 ∈ (0, ∞), that is, 𝑥 > 0, where we want to show that 𝑓 is
continuous at 𝑥; a similar consideration applies to 𝑥 ∈ (−∞, 0).
Let 𝑥 > 0. We want to show continuity of 𝑓 at 𝑥. Our proof will involve
algebraic manipulations, but it is also very helpful to draw a graph of the function,
as in Figure 3.2, to understand what needs to be done. Given some 𝜀 > 0, we want
to ensure that | 𝑓 (𝑥) − 𝑓 (𝑥)| < 𝜀 for all 𝑥 such that |𝑥 − 𝑥| < 𝛿. The choice of 𝛿 will
3.12. Proving Continuity 73
depend on 𝜀, but also on 𝑥 because the graph of 𝑓 becomes very steep when 𝑥 is
close to zero.
We now work backwards from the condition | 𝑓 (𝑥) − 𝑓 (𝑥)| < 𝜀 in order to
obtain a suitable constraint on |𝑥 − 𝑥|:
The last inequality in (3.44) is a condition on |𝑥 − 𝑥|, but we cannot choose 𝛿 = 𝜀𝑥𝑥
because this expression does not depend solely on 𝜀 and 𝑥 but also on 𝑥. However,
all we want is that |𝑥 − 𝑥| < 𝛿 implies |𝑥 − 𝑥| < 𝜀𝑥𝑥. It is not necessary, as in
the dashed lines in Figure 3.2, that 𝑓 (𝑥 − 𝛿) meets exactly one of the endpoints
𝑓 (𝑥) ± 𝜀 of the “target interval” of the function values. Any positive 𝛿 that is at
most 𝜀𝑥𝑥 will serve the purpose. If 𝛿 is 𝑥/2 or less, then |𝑥 − 𝑥| < 𝛿 clearly implies
𝑥 ∈ ( 12 𝑥, 32 𝑥) and thus in particular 𝑥 > 12 𝑥. With that consideration, we let
𝛿 = min{ 12 𝑥, 21 𝜀𝑥 2 }. (3.45)
Then |𝑥 − 𝑥| < 𝛿 implies 12 𝑥 < 𝑥 and thus |𝑥 − 𝑥| < 𝛿 ≤ 12 𝜀𝑥 2 < 𝜀𝑥𝑥 and therefore
| 𝑓 (𝑥) − 𝑓 (𝑥)| < 𝜀 according to (3.44), as intended.
In the preceding proof, the function 𝑓 : 𝑥 ↦→ 1/𝑥 was shown to be continuous
at 𝑥 by choosing 𝛿 = 12 𝜀𝑥 2 (which for small 𝜀 also implies 𝛿 ≤ 12 𝑥 as required in
(3.45)). We see here that we have to choose 𝛿 as a function not only of 𝜀 but also of
the point 𝑥 at which we want to prove continuity.
As an aside, the concept of uniform continuity means that 𝛿 can be chosen as a
function of 𝜀 only. That is, a function 𝑓 : 𝑆 → R is called uniformly continuous if
In contrast, the function 𝑓 is just continuous if (3.40) holds prefixed with the
quantification ∀𝑥 ∈ 𝑆, so that 𝛿 can be chosen depending on 𝜀 and 𝑥, as in (3.45).
It can be shown that a continuous function on a compact domain is uniformly
continuous.
⇒ Show that the function 𝑓 : (0, ∞) → R, 𝑓 (𝑥) = 1/𝑥 is not uniformly continuous.
Note that its domain is not compact.
√
Our second example of a continuous function is 𝑓 : [0, ∞) → R, 𝑓 (𝑥) = 𝑥.
For 𝑥 > 0, the functon 𝑓 has the derivative 𝑓 ′(𝑥) = 12 𝑥 −1/2 . At 𝑥 = 0, the function 𝑓
has no derivative because that derivative would have to be arbitrarily steep. The
74 Chapter 3. Continuous Optimisation
√
|𝑥 − 𝑥| = 𝑥 − 𝑥 < 𝜀2 + 2( 𝑥𝑥 − 𝑥), (3.48)
√ √
where this inequality is implied by |𝑥 − 𝑥| < 𝜀2 because
√ 𝑥𝑥 − 𝑥 ≥ 𝑥 𝑥 − 𝑥 = 0.
Similarly, if 𝑥 < 𝑥, then we rewrite 𝑥 + 𝑥 < 𝜀 + 2 𝑥𝑥 equivalently as
2
√
|𝑥 − 𝑥| = 𝑥 − 𝑥 < 𝜀2 + 2( 𝑥𝑥 − 𝑥), (3.49)
√ √
where this inequality is again implied by |𝑥 −𝑥| < 𝜀2 because 𝑥𝑥 −𝑥 ≥ 𝑥 𝑥−𝑥 = 0.
In both cases, if we choose 𝛿 = 𝜀2 , then (3.40) holds, which proves that 𝑓 is
continuous, and in fact uniformly continuous on 𝑆 = [0, ∞) according to (3.46).
(This is an example of a function that is uniformly continuous even though its
domain 𝑆 is not compact.) Note that we do not need to worry that |𝑥 − 𝑥| < 𝛿
implies 𝑥 ∈ 𝑆 because that condition is also imposed in (3.40) and (3.46). For the
function 𝑥 ↦→ 1/𝑥 defined for all (positive or negative) 𝑥 ≠ 0, we also have to make
sure that 𝑥 and 𝑥 have the same sign because otherwise 1/𝑥 and 1/𝑥 would be
very far apart, but this follows from (3.45).
We now prove the continuity of some functions on R𝑛 . The following lemma
states that we can replace the Euclidean norm in (3.40) with the maximum norm,
which in many situations is more convenient to use.
Proof. Suppose first 𝑓 is continuous at 𝑥. Let 𝜀 > 0 and choose 𝛿 > 0 so that (3.40)
√
holds, and let 𝛿′ = 𝛿/ 𝑛. Then 𝑥 ∈ 𝐵max (𝑥, 𝛿′) ⊆ 𝐵(𝑥, 𝛿) by (3.28), which implies
𝑓 (𝑥) ∈ 𝐵( 𝑓 (𝑥), 𝜀) by choice of 𝛿 and thus 𝑓 (𝑥) ∈ 𝐵max ( 𝑓 (𝑥), 𝜀) by (3.27), which
implies (3.50) (with 𝛿′ instead of 𝛿) as claimed.
Conversely, given (3.50) and 𝜀 > 0, we choose 𝛿 > 0 so that 𝑥 ∈ 𝐵max (𝑥, 𝛿)
√
implies 𝑓 (𝑥) ∈ 𝐵max ( 𝑓 (𝑥), 𝜀/ 𝑚). Then 𝑥 ∈ 𝐵(𝑥, 𝛿) implies 𝑥 ∈ 𝐵max (𝑥, 𝛿) by (3.27)
√
and thus 𝑓 (𝑥) ∈ 𝐵max ( 𝑓 (𝑥), 𝜀/ 𝑚) ⊆ 𝐵( 𝑓 (𝑥), 𝜀) by (3.28), which proves (3.40).
3.12. Proving Continuity 75
|𝑥 𝑦−𝑥 𝑦| = |𝑥𝑦−𝑥 𝑦+𝑥 𝑦−𝑥 𝑦| ≤ |𝑥𝑦−𝑥 𝑦|+|𝑥 𝑦−𝑥 𝑦| = |𝑥−𝑥| |𝑦|+|𝑥| |𝑦−𝑦| (3.52)
Define
𝜀
𝛿𝑥 = . (3.56)
2(|𝑦| + 1)
Then |𝑥 − 𝑥| < 𝛿 𝑥 and |𝑦 − 𝑦| < 𝛿 𝑦 imply |𝑥 − 𝑥| |𝑦| < 𝜀/2, that is, the first inequality
in (3.53). Now let 𝛿 = min{𝛿 𝑥 , 𝛿 𝑦 }. Then |(𝑥, 𝑦) − (𝑥, 𝑦)| max < 𝛿 implies |𝑥 − 𝑥| < 𝛿 𝑥
and |𝑦 − 𝑦| < 𝛿 𝑦 , which in turn imply (3.53) and therefore (3.51). With Lemma 3.33,
this shows the continuity of the function (𝑥, 𝑦) ↦→ 𝑥𝑦.
This is an important observation: the arithmetic operation of multiplication
is continuous, and it is also easy to prove that addition, that is, the function
(𝑥, 𝑦) ↦→ 𝑥 + 𝑦, is continuous. Similarly, the function 𝑥 ↦→ −𝑥 is continuous, which
is nearly trivial compared to proving that 𝑥 ↦→ 1/𝑥 (for 𝑥 ≠ 0) is continuous.
The following lemma exploits that we have defined continuity for functions
that take values in R𝑚 and not just in R1 . It states that the composition of continuous
functions is continuous. Recall that 𝑓 (𝑆) is the image of 𝑓 as defined in (3.14).
Proof. Assume that 𝑓 and 𝑔 are continuous. Let 𝑥 ∈ 𝑆 and 𝜀 > 0. We want to show
that there is some 𝛿 > 0 so that ∥𝑥 − 𝑥∥ < 𝛿 and 𝑥 ∈ 𝑆 imply ∥ 𝑔( 𝑓 (𝑥))− 𝑔( 𝑓 (𝑥))∥ < 𝜀.
Because 𝑔 is continuous at 𝑓 (𝑥), there is some 𝛾 > 0 such that for any 𝑦 ∈ 𝑇 with
∥ 𝑦 − 𝑓 (𝑥)∥ < 𝛾 we have ∥ 𝑔(𝑦) − 𝑔( 𝑓 (𝑥))∥ < 𝜀. Now choose 𝛿 > 0 such that, by
continuity of 𝑓 at 𝑥, we have for any 𝑥 ∈ 𝑆 that ∥𝑥 −𝑥∥ < 𝛿 implies ∥ 𝑓 (𝑥)− 𝑓 (𝑥)∥ < 𝛾.
Then (for 𝑦 = 𝑓 (𝑥)) this implies ∥ 𝑔( 𝑓 (𝑥)) − 𝑔( 𝑓 (𝑥))∥ < 𝜀 as required.
In principle, this lemma should allow us to prove that the function (𝑥, 𝑦) ↦→ 𝑥/𝑦
is continuous on R ×(R \{0}) (that is, for 𝑦 ≠ 0), by considering it as the composition
of the functions (𝑥, 𝑦) ↦→ (𝑥, 1/𝑦) and (𝑥, 𝑧) ↦→ 𝑥𝑧 . We have just proved that
(𝑥, 𝑧) ↦→ 𝑥𝑧 is continuous, and earlier that 𝑦 ↦→ 1/𝑦 is continuous, but what about
(𝑥, 𝑦) ↦→ (𝑥, 1/𝑦) where the function 𝑦 ↦→ 1/𝑦 affects only the second component
of its input? Clearly, the function (𝑥, 𝑦) ↦→ (𝑥, 1/𝑦) should also be continuous, but
we need one further simple observation to prove this.
Whereas Lemma 3.34 considers the sequential composition of two functions,
the following lemma refers to a “parallel” composition of functions.
Proof. Let 𝑥 ∈ 𝑆 and 𝜀 > 0. Suppose 𝑓 and 𝑔 are continuous at 𝑥. Then according
to Lemma 3.33 there is some 𝛿 > 0 so that ∥𝑥 − 𝑥∥ max < 𝛿 and 𝑥 ∈ 𝑆 imply
∥ 𝑓 (𝑥) − 𝑓 (𝑥)∥ max < 𝜀 and ∥ 𝑔(𝑥) − 𝑔(𝑥)∥ max < 𝜀. But then also ∥ ℎ(𝑥) − ℎ(𝑥)∥ max < 𝜀
because each of the 𝑚 + ℓ components of ℎ(𝑥) − ℎ(𝑥) is either the corresponding
component of 𝑓 (𝑥) − 𝑓 (𝑥) or of 𝑔(𝑥) − 𝑔(𝑥).
Conversely, if ℎ is continuous at 𝑥, there is some 𝛿 > 0 so that ∥𝑥 − 𝑥∥ max < 𝛿
and 𝑥 ∈ 𝑆 imply ∥ ℎ(𝑥)− ℎ(𝑥)∥ max < 𝜀. Because ∥ 𝑓 (𝑥)− 𝑓 (𝑥)∥ max ≤ ∥ ℎ(𝑥)− ℎ(𝑥)∥ max
and ∥ 𝑔(𝑥)− 𝑔(𝑥)∥ max ≤ ∥ ℎ(𝑥)− ℎ(𝑥)∥ max by the definition of ℎ and of the maximum
norm (3.17), this implies ∥ 𝑓 (𝑥) − 𝑓 (𝑥)∥ max < 𝜀 and ∥ 𝑔(𝑥) − 𝑔(𝑥)∥ max < 𝜀. This
proves the claim.
|𝑥| |𝑦|
q
|ℎ(𝑥, 𝑦) − ℎ(0, 0)| = |ℎ(𝑥, 𝑦)| = p ≤ 𝑥 2 + 𝑦 2 = ∥(𝑥, 𝑦)∥ . (3.59)
𝑥2 + 𝑦2
So if we choose 𝛿 = 𝜀 then ∥(𝑥, 𝑦)∥ < 𝛿 implies |ℎ(𝑥, 𝑦)| ≤ ∥(𝑥, 𝑦)∥ < 𝜀 and thus
(3.58) as required.
In this section, we have seen how continuity can be proved for functions that
are defined on R𝑛 . The maximum norm is particularly useful for these proofs.
We first recall the notions used in Theorem 3.37, which is about a function
𝑓 : 𝑋 → R. The function 𝑓 is assumed to be continuous and the domain 𝑋 to be
compact (and nonempty). The theorem says that under these conditions, there are
𝑥 ∗ and 𝑥 in 𝑋 such that 𝑓 (𝑥 ∗ ) is the maximum and 𝑓 (𝑥) the minimum of 𝑓 (𝑋) (see
Section 3.5).
The proof of the Theorem of Weierstrass is based on the following two lemmas.
The first lemma refers to a subset of R.
Lemma 3.38. Let 𝐴 be a nonempty compact subset of R. Then 𝐴 has a maximum and a
minimum.
Proof. We only show that 𝐴 has a maximum. By Theorem 3.29, 𝐴 is closed and
bounded, and sup 𝐴 exists. We show that sup 𝐴 is a limit point of 𝐴. Otherwise,
𝐵(sup 𝐴, 𝜀) ∩ 𝐴 = ∅ for some 𝜀 > 0 by Lemma 3.23. But then there is no 𝑡 ∈ 𝐴 with
𝑡 > sup 𝐴 − 𝜀, so sup 𝐴 − 𝜀 is an upper bound of 𝐴, but sup 𝐴 is the least upper
bound, a contradiction. So sup 𝐴 is a limit point of 𝐴 and therefore belongs to 𝐴
because 𝐴 is closed, and hence sup 𝐴 is also the maximum of 𝐴.
The second lemma says that for a continuous function the image of a compact
set is compact.
Lemma 3.39. Let ∅ ≠ 𝑋 ⊆ R𝑛 and 𝑓 : 𝑋 → R. Then if 𝑓 is continuous and 𝑋 is
compact, then 𝑓 (𝑋) is compact.
Proof. Let {𝑦 𝑘 } 𝑘∈N be any sequence in 𝑓 (𝑋). We show that there exists a subse-
quence {𝑦 𝑘 𝑛 } 𝑛∈N and a point 𝑦 ∈ 𝑓 (𝑋) such that lim𝑛→∞ 𝑦 𝑘 𝑛 = 𝑦, which will show
that 𝑓 (𝑋) is compact. For that purpose, for each 𝑘 choose 𝑥 (𝑘) ∈ 𝑋 with 𝑓 (𝑥 (𝑘) ) = 𝑦 𝑘 ,
which exists by the definition of 𝑓 (𝑋). Then {𝑥 (𝑘) } is a sequence in 𝑋, which has a
convergent subsequence {𝑥 (𝑘 𝑛 ) } 𝑛∈N with limit 𝑥 in 𝑋 because 𝑋 is compact. Then,
because 𝑓 is continuous,
𝑓(𝑥) = 𝑓(lim_{𝑛→∞} 𝑥^{(𝑘_𝑛)}) = lim_{𝑛→∞} 𝑓(𝑥^{(𝑘_𝑛)}) = lim_{𝑛→∞} 𝑦_{𝑘_𝑛} ,

so with 𝑦 = 𝑓(𝑥) ∈ 𝑓(𝑋) we have lim_{𝑛→∞} 𝑦_{𝑘_𝑛} = 𝑦, which shows that 𝑓(𝑋) is compact.

The preceding sections gave a number of examples that show how to prove that a function is continuous. In order
to prove that a subset 𝑋 of R𝑛 is compact, we normally use the characterisation
in Theorem 3.29 that compact means “closed and bounded”. Boundedness is
typically most easily proved using the maximum norm, that is, boundedness in
each component, as for example in (3.38).
For closedness, the following observation is most helpful: sets that are pre-
images of closed sets under continuous functions are closed. We state and prove
this via the equivalent statement that pre-images of open sets under continuous
functions are open.
Lemma 3.40. Let 𝑓 : Rⁿ → Rᵏ be continuous, let 𝑇 ⊆ Rᵏ, and consider the pre-image

𝑆 = 𝑓⁻¹(𝑇) = {𝑥 ∈ Rⁿ | 𝑓(𝑥) ∈ 𝑇} . (3.60)

If 𝑇 is open, then 𝑆 is open, and if 𝑇 is closed, then 𝑆 is closed.

Proof. We first prove that if 𝑇 is open, then 𝑆 is also open. Let x̄ ∈ 𝑆 and thus 𝑓(x̄) ∈ 𝑇 by (3.60). Because 𝑇 is open, there is some 𝜀 > 0 such that 𝐵(𝑓(x̄), 𝜀) ⊆ 𝑇. Because 𝑓 is continuous, there is some 𝛿 > 0 such that for all 𝑥 ∈ 𝐵(x̄, 𝛿) we have 𝑓(𝑥) ∈ 𝐵(𝑓(x̄), 𝜀) and therefore 𝑓(𝑥) ∈ 𝑇, that is, 𝑥 ∈ 𝑆. This shows 𝐵(x̄, 𝛿) ⊆ 𝑆, which proves that 𝑆 is open, as required.
In order to observe the same property for closed sets note that a set is closed
if and only if its set-theoretic complement is open, and that the pre-image of
the complement is the complement of the pre-image. Namely, let 𝑇 ′ ⊆ R 𝑘 and
suppose that 𝑇 ′ is closed, that is, 𝑇 given by 𝑇 = R 𝑘 \ 𝑇 ′ is open. Let 𝑆 = 𝑓 −1 (𝑇),
where 𝑆 is open as just shown, and let 𝑆′ = R𝑛 \ 𝑆, so that 𝑆′ is closed. But then
𝑆′ = {𝑥 ∈ R𝑛 | 𝑓 (𝑥) ∉ 𝑇} = 𝑓 −1 (𝑇 ′) , and 𝑆′ is closed, as claimed.
Note that Lemma 3.40 concerns the pre-image of a continuous function. In
contrast, Lemma 3.39 concerns the image of a continuous function 𝑓 . The statement
in Lemma 3.40 is not true for images, that is, if 𝑆 is closed, then 𝑓 (𝑆) is not
necessarily closed; for a counterexample, 𝑆 has to be unbounded since otherwise 𝑆
would be compact. An example is the function 𝑓 : R → R given by 𝑓 (𝑥) = 1/(1+ 𝑥 2 )
and 𝑆 = R, where 𝑓 (𝑆) = (0, 1]. That is, 𝑓 (𝑆) is neither closed nor open even though
𝑆 is both closed and open. A simpler example of a continuous function 𝑓 where
𝑓(𝑆) is not open for open sets 𝑆 is a constant function 𝑓, where 𝑓(𝑆) is a singleton if 𝑆 is nonempty.
In a more abstract setting, Lemma 3.40 can also be used to define that a function
𝑓 : 𝑋 → 𝑌 is continuous. In that case, 𝑋 and 𝑌 are so-called “topological spaces”.
A topological space is a set 𝑋 together with a set (called a “topology”) of subsets
of 𝑋 which are called open sets, which have to fulfill the following conditions: The
empty set ∅ and the entire set 𝑋 are open; the intersection of any two open sets is
80 Chapter 3. Continuous Optimisation
open; and arbitrary unions of open sets are open. These conditions hold for the
open sets as defined in Definition 3.19 (which define the standard topology on R𝑛 )
according to Theorem 3.25. Given that 𝑋 and 𝑌 are topological spaces, a function
𝑓 : 𝑋 → 𝑌 is called continuous if and only if the pre-image of any open set (in 𝑌)
is open (in 𝑋). This characterisation of continuous functions is important enough
to state it as a theorem.

Theorem 3.41. A function 𝑓 : Rⁿ → Rᵏ is continuous if and only if for every open set 𝑇 ⊆ Rᵏ the pre-image 𝑓⁻¹(𝑇) is open.

For our purposes, we only need the part of Theorem 3.41 that is stated in
Lemma 3.40. It implies the following observation, which is most useful to identify
certain subsets of R𝑛 as closed or open.
𝑋 = {(𝑥, 𝑦) ∈ R2 | 𝑥 ≥ 0, 𝑦 ≥ 0, 𝑥𝑦 ≥ 1} . (3.61)
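For example (a brief check in the spirit of this observation): the set 𝑋 in (3.61) is closed, because 𝑋 = 𝑓₁⁻¹([0, ∞)) ∩ 𝑓₂⁻¹([0, ∞)) ∩ 𝑓₃⁻¹([1, ∞)) for the continuous functions 𝑓₁(𝑥, 𝑦) = 𝑥, 𝑓₂(𝑥, 𝑦) = 𝑦, and 𝑓₃(𝑥, 𝑦) = 𝑥𝑦, each interval [0, ∞) and [1, ∞) is closed, pre-images of closed sets under continuous functions are closed by Lemma 3.40, and intersections of closed sets are closed.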
[Figure: the set 𝑋 of (3.61), bounded by the hyperbola 𝑥𝑦 = 1, divided into a compact part 𝑋₁ and an unbounded part 𝑋₂.]
The set 𝑋₁ is compact, so the maximum of 𝑓 on 𝑋₁ exists by the Theorem of Weierstrass. We also need that 𝑋₁ is not empty: it contains for example the point (2, 1). Now, the maximum of 𝑓 on 𝑋₁ is also the maximum of 𝑓 on 𝑋. Namely,
for (𝑥, 𝑦) ∈ 𝑋₂ we have 𝑥 + 𝑦 ≥ 3 and therefore 𝑓(𝑥, 𝑦) = 1/(𝑥 + 𝑦) ≤ 1/3 = 𝑓(2, 1), where (2, 1) ∈ 𝑋₁, so Theorem 3.10 applies.
In this example of maximising the function 𝑓 in (3.62) on the domain 𝑋 in
(3.61), 𝑋 is closed but not compact. However, we have applied Theorem 3.10 with
𝑋 = 𝑋1 ∪ 𝑋2 as in (3.63) in order to obtain a compact domain 𝑋1 where we know
the maximisation of 𝑓 has a solution, which then applies to all of 𝑋. This is an
important example which we will consider further in some exercises.
The Theorem of Weierstrass only gives us the existence of a maximum of 𝑓 on
𝑋1 (and thereby on 𝑋), but it does not show how to find it. It seems rather clear
that the maximum of 𝑓 (𝑥, 𝑦) on 𝑋 is obtained for (𝑥, 𝑦) = (1, 1), but proving (and
finding) this maximum is shown in the next chapter.
Exercise 3.1.
(a) Use the formal definition of the limit of a sequence to prove that the sequence {𝑥ₖ} given by 𝑥ₖ = (𝑘 − 1)/𝑘 for 𝑘 ∈ N converges to 1.
(b) Use the formal definition of the limit of a sequence to prove that the sequence
{𝑦 𝑘 } given by 𝑦 𝑘 = (−1) 𝑘 does not converge to any limit.
Exercise 3.2. Let 𝐴 ⊆ R, 𝐴 ≠ ∅. Use the definitions of sup and inf to prove the
following: If inf 𝐴 = sup 𝐴, then 𝐴 has only one element.
Exercise 3.3. Prove the triangle inequality ∥𝑥 + 𝑦∥ max ≤ ∥𝑥∥ max + ∥𝑦∥ max for the
maximum norm. Hint: for a set 𝐴 of reals that has a maximum and 𝑏 ∈ R, we have
max 𝐴 ≤ 𝑏 if and only if 𝑎 ≤ 𝑏 for all 𝑎 ∈ 𝐴.
Exercise 3.4. Which of the following sets 𝐴, 𝐵, 𝐶, 𝐷 are open, closed, compact?
Justify your answers.
𝑔(𝑥, 𝑦) = 0 if (𝑥, 𝑦) = (0, 0), and 𝑔(𝑥, 𝑦) = 𝑥𝑦/(𝑥² + 𝑦²) otherwise.
Exercise 3.9. Which of the following sets are open, closed, compact? Justify your
answers. You can refer to any theorems in the guide.
4
First-Order Conditions

4.1 Introduction
• find the zeroes of such gradients for examples of functions that are uncon-
strained, and then examine the points where the gradient is zero to identify
minima and maxima of the objective function
• for equality constraints, apply Lagrange’s Theorem to specific examples
• understand that so-called “critical points” (where gradients of the constraint
functions are not linearly independent) also have to be examined as possible
minima or maxima, and apply this to given examples
• use the insight that the KKT conditions distinguish between tight and non-tight
inequalities, and that non-tight inequalities are in effect treated as if they were
absent, in order to identify candidate points for local minima and maxima.
Apply this to specific examples.
The structure of this chapter is similar to chapters 5 and 6 of the following book:
Sundaram, R. K. (1996). A First Course in Optimization Theory. Cambridge
University Press, Cambridge, UK. ISBN 978-0521497190.
A classic book on differentiation of functions of several variables is this book:
Rudin, W. (1976). Principles of Mathematical Analysis, 3rd ed. McGraw-Hill, New York. ISBN 978-0070542358.
In that book, you can find on page 219, theorem 9.21, a proof of Theorem 4.5.
Theorem 4.11 is also known as the Kuhn–Tucker Theorem. The original
publication of that theorem is
Kuhn, H. W. and A. W. Tucker (1951). Nonlinear programming. In: Proceedings
of the Second Berkeley Symposium on Mathematical Statistics and Probability, edited
by J. Neyman, 481–492. University of California Press, Berkeley, CA.
An accessible history of that material is given in
Kuhn, H. W. (1991). Nonlinear programming: A historical note. In: History of
Mathematical Programming: A Collection of Personal Reminiscences, edited by J. K.
Lenstra, A. H. G. Rinnoy Kan, and A. Schrijver, 82–96. CWI and North-Holland,
Amsterdam. ISBN 978-0444888181.
It describes how Kuhn found out that the Kuhn–Tucker theorem had already been shown in the following unpublished Master's thesis:

Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side Constraints. M.Sc. thesis, Department of Mathematics, University of Chicago.
4.1.4 Synopsis of Chapter Content
• Section 4.2 gives an introductory example to illustrate the main idea that the
gradients of objective function and constraint function have to be co-linear in
an optimum.
• Section 4.3 recalls matrix multiplication, which we will use throughout also
for scalar products of vectors and for products of vectors with scalars.
• Section 4.4 explains that differentiability means that a function can be lo-
cally approximated (intuitively, by looking at it “with a sufficiently strong
magnifying glass”) with a linear function, called the gradient of the function.
• Section 4.5 shows that the gradient of a function on R𝑛 is given by the 𝑛-tuple
of its partial derivatives.
• Taylor’s Theorem, which may be familiar for functions of a single variable, is
discussed in Section 4.6 for differentiable functions of 𝑛 variables.
• Section 4.7 is about unconstrained optimisation where a local minimum or
maximum necessarily has a zero gradient (intuitively, because the function
could be improved in a nonzero direction). Some examples are given.
• Section 4.8 considers equality constraints. The central Theorem of Lagrange
states that the gradient of the objective function in an optimum is a linear
combination of the gradients of the constraints, provided these are linearly
independent.
• Adding inequality constraints gives rise to the “KKT” conditions by Karush,
Kuhn, and Tucker, which are treated in Section 4.9.
4.2 Introductory Example

As an introductory example, consider the problem

minimise 𝑥 + 4𝑦 subject to 𝑥𝑦 = 1, 𝑥 ≥ 0, 𝑦 ≥ 0 (4.1)

with the objective function

𝑓(𝑥, 𝑦) = 𝑥 + 4𝑦 . (4.2)
Figure 4.1 Plot of the function 𝑓 (𝑥, 𝑦) = 𝑥 + 4𝑦 for 0 ≤ 𝑥 ≤ 5 and 0 ≤ 𝑦 ≤ 3,
with the blue curve showing the restriction 𝑥 𝑦 = 1.
Figure 4.2 Contour lines and gradient (1, 4) of the function 𝑓 (𝑥, 𝑦) = 𝑥 + 4𝑦
for 𝑥 ≥ 0, 𝑦 ≥ 0.
A more instructive picture that can be drawn in two dimensions uses contour
lines of the function 𝑓 in (4.2), shown as the dashed lines in Figure 4.2. Such
a contour line for 𝑓 (𝑥, 𝑦) is the set of points (𝑥, 𝑦) where 𝑓 (𝑥, 𝑦) = 𝑐 for some
constant 𝑐, that is, where 𝑓 (𝑥, 𝑦) takes a fixed value. One could also say that
a contour line is the pre-image 𝑓 −1 ({𝑐}) under 𝑓 of one of its possible values 𝑐.
Clearly, for different values of 𝑐 any two such contour lines are disjoint. Here,
because 𝑓 is linear, these contour lines are parallel lines. For (𝑥, 𝑦) ∈ R2 , such a
contour line corresponds to the equation 𝑥 + 4𝑦 = 𝑐 or equivalently 𝑦 = 𝑐/4 − 𝑥/4
(we also only consider nonnegative values for 𝑥 and 𝑦). Contour lines are known
from topographical maps of, say, mountain regions, where each line corresponds
to a particular height above sea level; the two-dimensional picture of these lines
conveys information about the three-dimensional terrain. Here, they indicate how
the function should be minimised, by choosing the smallest function value 𝑐 that
is possible.
Figure 4.2 also shows the gradient of the function 𝑓 . We will define this gradient,
called 𝐷𝑓, formally later. It is given by the derivatives of 𝑓 with respect to 𝑥 and to 𝑦, that is, the pair (d𝑓(𝑥, 𝑦)/d𝑥, d𝑓(𝑥, 𝑦)/d𝑦), which is here (1, 4) for every (𝑥, 𝑦)
because 𝑓 is the linear function (4.2). This vector (1, 4) is drawn in Figure 4.2. The
gradient (1, 4) shows in which direction the function increases (which is discussed
in further detail in the introductory Section 5.2 of the next chapter), and can be
[Figure 4.3: contour lines of 𝑓 and the hyperbola 𝑥𝑦 = 1, which the contour line through (2, 1/2) just touches.]
Moving along any direction which is not orthogonal to the gradient means
either moving partly in the same direction as the gradient (increasing the function
value), or away from it (decreasing the function value). Consider now Figure 4.3
where we have drawn the hyperbola which represents the constraint 𝑥𝑦 = 1 in (4.1).
A point on this hyperbola is, for example, (1, 1). At that point, the contour lines
show that the function value can still be lowered by moving towards (1 + 𝜀, 1/(1 + 𝜀)). But at the point (𝑥, 𝑦) = (2, 1/2) the contour line just touches the hyperbola, and so the gradients of 𝑓 and 𝑔, that is, 𝐷𝑓 and 𝐷𝑔, touch as required, so moving along the contour line of 𝑔 (that is, maintaining the constraint) also neither increases nor decreases the value of 𝑓.
Here 𝐷𝑓(𝑥, 𝑦) = 𝜆𝐷𝑔(𝑥, 𝑦) is the equation (1, 4) = 𝜆 · (𝑦, 𝑥) = (𝜆𝑦, 𝜆𝑥) and thus 𝜆 = 1/𝑦 = 4/𝑥 and thus 𝑦 = 𝑥/4, which together with the constraint 𝑥𝑦 = 1 means 𝑥²/4 = 1 or 𝑥 = 2 (because 𝑥 = −2 violates 𝑥 ≥ 0) and 𝑦 = 1/2, which is indeed the optimum (𝑥, 𝑦) = (2, 1/2).
Of course, this simple example (4.1) has a solution that can be found directly
using one-variable calculus. Namely, the constraint 𝑥𝑦 = 1 translates to 𝑦 = 1/𝑥 so
that we can consider the problem of minimising 𝑥 + 4/𝑥 (for 𝑥 ≥ 0, in fact 𝑥 > 0
because 𝑥 = 0 is excluded by the condition 𝑥𝑦 = 1). We differentiate and set the derivative to zero. That is, d(𝑥 + 4/𝑥)/d𝑥 = 1 − 4/𝑥² = 0 gives the same solution 𝑥 = 2, 𝑦 = 1/2.
The point of this introductory section was to illustrate the geometric un-
derstanding of contour lines and co-linear gradients of objective and constraint
functions in an optimum.
We recall here a useful way to treat all vectors and scalars as matrices, which will
be particularly important in the final Chapter 5. This gives a unified description
of multiplying matrices with vectors, of the scalar product of two vectors, and of
multiplying a vector with a scalar, as special cases of matrix multiplication.
If 𝐴 is an 𝑚 × 𝑘 matrix and 𝐵 is a 𝑘 × 𝑛 matrix, then their matrix product 𝐴 · 𝐵
(or 𝐴𝐵 for short) is the 𝑚 × 𝑛 matrix 𝐶 with entries
𝑐ᵢⱼ = Σ_{𝑠=1}^{𝑘} 𝑎ᵢₛ 𝑏ₛⱼ    (1 ≤ 𝑖 ≤ 𝑚, 1 ≤ 𝑗 ≤ 𝑛). (4.4)
In particular, for 𝑥, 𝑦 ∈ Rᵏ the scalar product 𝑥⊤𝑦 is the matrix product of the 1 × 𝑘 row vector 𝑥⊤ with the 𝑘 × 1 column vector 𝑦, pictured schematically in (4.5).
We use matrix multiplication where possible. In particular, scalars are treated like
1 × 1 matrices. A column vector 𝑥 is multiplied with a scalar 𝜆 from the right, and
a row vector 𝑥⊤ is multiplied with a scalar 𝜆 from the left,
as pictured schematically in (4.6). The schematic picture (4.7), not reproduced here, shows further such products as special cases of matrix multiplication.
Let 𝐵 ∈ R 𝑘×𝑛 and 𝐵 = [𝐵1 · · · 𝐵𝑛 ]. Then the 𝑗th column of 𝐴𝐵 is the linear
combination 𝐴𝐵 𝑗 of the columns of 𝐴 with the components of 𝐵 𝑗 as coefficients.
We can visualise the columns of 𝐴𝐵 accordingly (for 𝑛 = 2) as 𝐴𝐵 = 𝐴 · [𝐵₁ 𝐵₂] = [𝐴𝐵₁ 𝐴𝐵₂].
The second view of 𝐴𝐵 is the same but using rows. Let 𝑦 ∈ Rᵏ, so that 𝑦⊤𝐵 is a row vector in R^{1×𝑛}, given as the linear combination 𝑦₁𝑏₁⊤ + · · · + 𝑦ₖ𝑏ₖ⊤ of the rows 𝑏₁⊤, . . . , 𝑏ₖ⊤ of 𝐵 (which we can write as 𝐵⊤ = [𝑏₁ · · · 𝑏ₖ]), shown schematically in (4.8).
Each row of 𝐴𝐵 is correspondingly the matching row of 𝐴 multiplied with 𝐵.
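The following short numerical check (an illustration that is not part of the guide; it assumes the numpy library is available) verifies the entry-wise definition (4.4) of the matrix product as well as its column and row views:

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])   # m x k with m = 2, k = 3
B = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])        # k x n with k = 3, n = 2

# Entry-wise definition (4.4): c_ij = sum over s of a_is * b_sj
C = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        C[i, j] = sum(A[i, s] * B[s, j] for s in range(3))
assert np.allclose(C, A @ B)

# Column view: the j-th column of A @ B is A times the j-th column of B
for j in range(2):
    assert np.allclose((A @ B)[:, j], A @ B[:, j])

# Row view: the i-th row of A @ B is the i-th row of A times B
for i in range(2):
    assert np.allclose((A @ B)[i, :], A[i, :] @ B)
```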
4.4 Differentiability in R𝑛
𝑓 (𝑥1 , . . . , 𝑥 𝑛 ) = 𝑐 0 + 𝑐 1 𝑥1 + · · · + 𝑐 𝑛 𝑥 𝑛 . (4.9)
Figure 4.4 Approximating a function 𝑓(𝑥) for 𝑥 near x̄ with an affine function with slope or "gradient" 𝐺 that represents the tangent line at 𝑓(x̄).
Figure 4.5 Two points (x̄, 𝑓(x̄)) and (𝑥, 𝑓(𝑥)) on the graph of 𝑓 define a "secant" line, given by an affine function 𝑡 ↦ 𝑐₀ + 𝑐₁𝑡. If 𝑐₁ has a limit as 𝑥 → x̄, then this limit defines 𝐺 in Figure 4.4.
𝑐₁ = (𝑓(𝑥) − 𝑓(x̄)) / (𝑥 − x̄) (4.12)

so that (by definition, with 𝑐₁ defined as a function of 𝑥)
Consider now the case 𝑛 = 1, where ∥𝑧∥ = |𝑧| for any 𝑧 ∈ R¹, so that 𝑧 · 1/∥𝑧∥ is either 1 (if 𝑧 > 0) or −1 (if 𝑧 < 0). Then the right-hand side in (4.15) is 𝐺 if 𝑥 > x̄ and −𝐺 if 𝑥 < x̄. Similarly, the left-hand side of (4.15) is (𝑓(𝑥) − 𝑓(x̄))/(𝑥 − x̄) if 𝑥 > x̄ and −(𝑓(𝑥) − 𝑓(x̄))/(𝑥 − x̄) if 𝑥 < x̄. Hence for 𝑥 ≠ x̄ these two conditions state

lim_{𝑥→x̄} (𝑓(𝑥) − 𝑓(x̄)) / (𝑥 − x̄) = 𝐺 , (4.16)
which is exactly the familiar notion of differentiability of a function defined on R¹. The case distinction 𝑥 > x̄ and 𝑥 < x̄ that we just made emphasises that the limit of the quotient in (4.16) has to exist for any possible approach of 𝑥 to x̄, which is also stated in Definition 4.1. For example, consider the function 𝑓(𝑥) = |𝑥|, which is well known not to be differentiable at 0. Namely, if we restrict 𝑥 to be positive (that is, 𝑥 > x̄ = 0), then (𝑓(𝑥) − 𝑓(x̄))/(𝑥 − x̄) = |𝑥|/𝑥 = 1, whereas for 𝑥 < x̄ = 0 we have (𝑓(𝑥) − 𝑓(x̄))/(𝑥 − x̄) = |𝑥|/𝑥 = −1. Therefore, there is no common limit of these fractions as 𝑥 → x̄, for example if we approach x̄ with the sequence {𝑥ₖ} defined by 𝑥ₖ = (−1/2)ᵏ, which converges to 0 but with alternating signs of 𝑥ₖ. In Definition 4.1, the limit has to exist for any possible approach of 𝑥 to x̄ (for example, by letting 𝑥 be the points of a sequence that converges to x̄).
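The failure of this limit can also be seen numerically. The following small sketch (an illustration, not part of the guide) prints the difference quotients of 𝑓(𝑥) = |𝑥| at x̄ = 0 along the alternating sequence 𝑥ₖ = (−1/2)ᵏ:

```python
# Difference quotients of f(x) = |x| at 0 along x_k = (-1/2)^k:
# they alternate between -1 and +1, so they have no limit.
def quotient(x):
    return (abs(x) - abs(0.0)) / (x - 0.0)

for k in range(1, 8):
    x_k = (-0.5) ** k
    print(k, x_k, quotient(x_k))   # prints -1.0, 1.0, -1.0, ...
```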
Next, we show that the gradient 𝐺 of a differentiable function is unique
and nicely described by the row vector of “partial derivatives” of the function.
Subsequently, we describe in “Taylor’s theorem”, equation (4.20), how the derivative
represents a local linear approximation of the function.
lim_{𝑡→0} (𝑓(x̄ + 𝑒ⱼ𝑡) − 𝑓(x̄)) / 𝑡 = ∂𝑓(x̄)/∂𝑥ⱼ . (4.17)
We have earlier (in our introductory Section 4.2) used the notation d𝑓(𝑥)/d𝑥ⱼ rather than ∂𝑓(𝑥)/∂𝑥ⱼ, which means the same, namely differentiating 𝑓(𝑥₁, . . . , 𝑥ₙ) as a function of 𝑥ⱼ only, while keeping the values of all other variables 𝑥₁, . . . , 𝑥ⱼ₋₁, 𝑥ⱼ₊₁, . . . , 𝑥ₙ fixed. For example, if 𝑓(𝑥₁, 𝑥₂) = 𝑥₁𝑥₂ + 𝑥₁, then ∂𝑓(𝑥₁, 𝑥₂)/∂𝑥₁ = 𝑥₂ + 1 and ∂𝑓(𝑥₁, 𝑥₂)/∂𝑥₂ = 𝑥₁.
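Such partial derivatives are easy to check symbolically. A minimal sketch (assuming the sympy library is available):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1 * x2 + x1
print(sp.diff(f, x1))   # x2 + 1
print(sp.diff(f, x2))   # x1
```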
Next we show that the gradient of a differentiable function is the vector of
partial derivatives.
lim_{𝑡→0} ( (𝑓(x̄ + 𝑒ⱼ𝑡) − 𝑓(x̄)) / |𝑡| − (𝐺 · 𝑒ⱼ · 𝑡) / |𝑡| ) = 0

or equivalently, as in the consideration for (4.16) for the two cases 𝑡 > 0 and 𝑡 < 0,

∂𝑓(x̄)/∂𝑥ⱼ = lim_{𝑡→0} (𝑓(x̄ + 𝑒ⱼ𝑡) − 𝑓(x̄)) / 𝑡 = 𝐺 · 𝑒ⱼ

and therefore 𝐺 · 𝑒ⱼ, which is the 𝑗th component of 𝐺, is ∂𝑓(x̄)/∂𝑥ⱼ as claimed.
For (𝑥, 𝑦) ≠ (0, 0),

∂𝑔(𝑥, 𝑦)/∂𝑥 = (𝑦(𝑥² + 𝑦²) − 2𝑥(𝑥𝑦)) / (𝑥² + 𝑦²)² = (𝑦³ − 𝑦𝑥²) / (𝑥² + 𝑦²)² (4.19)

and ∂𝑔(𝑥, 𝑦)/∂𝑦 = (𝑥³ − 𝑥𝑦²)/(𝑥² + 𝑦²)² because 𝑔(𝑥, 𝑦) is symmetric in 𝑥 and 𝑦.
So the partial derivatives of 𝑔 exist everywhere. However, 𝑔(𝑥, 𝑦) is not even
continuous, let alone differentiable.
It can be shown that the continuous function ℎ(𝑥, 𝑦) defined in (3.57) is not
differentiable at (0, 0).
Nevertheless, the partial derivatives of a function are very useful if they are
continuous, which is often the case.
The following theorem expresses, once more, that differentiability means local
approximation by a linear function. It will also be used to prove (sometimes only
with a heuristic argument) first-order conditions for optimality that we consider
later.
𝑅(𝑥) = (𝑓(𝑥) − 𝑓(x̄) − 𝐺 · (𝑥 − x̄)) / ∥𝑥 − x̄∥ (4.21)
𝑓(𝑥) = 𝑓(x̄) + 𝑓′(x̄)(𝑥 − x̄) + (𝑓″(x̄)/2)(𝑥 − x̄)² + R̂(𝑥) · |𝑥 − x̄|² (4.22)

where lim_{𝑥→x̄} R̂(𝑥) = 0. By iterating this process for functions that are differentiable
sufficiently many times, one obtains a “Taylor expansion” that approximates the
function not just linearly but by a higher-degree polynomial.
Furthermore, the expression (4.22) for a function that is twice differentiable is more informative than the expression (4.20), with the following additional observation: one can show that the original remainder term 𝑅(𝑥) can be written in the form 𝑓″(𝑧)/2 · (𝑥 − x̄) for some "intermediate value" 𝑧 that lies between x̄ and 𝑥; hence bounds on 𝑓″(𝑧) translate to bounds on 𝑅(𝑥). These variations of Taylor's theorem are often stated in the literature. We do not consider them here, only the simple version of Theorem 4.6.
We illustrate (4.20) with a specific remainder term for some differentiable function 𝑓 : R² → R with gradient 𝐷𝑓(𝑥, 𝑦). Fix (x̄, ȳ) and let (𝑥, 𝑦) = (x̄, ȳ) + (Δ𝑥, Δ𝑦). Then (4.20) becomes

𝑓(x̄ + Δ𝑥, ȳ + Δ𝑦) = 𝑓(x̄, ȳ) + 𝐷𝑓(x̄, ȳ) · (Δ𝑥, Δ𝑦)⊤ + 𝑅(Δ𝑥, Δ𝑦) · ∥(Δ𝑥, Δ𝑦)∥. (4.23)
Consider now the function 𝑓(𝑥, 𝑦) = 𝑥 · 𝑦, which has gradient 𝐷𝑓(𝑥, 𝑦) = (𝑦, 𝑥), which is a continuous function of (𝑥, 𝑦). By Theorem 4.5, 𝑓 is continuously differentiable. Then

𝑓(𝑥, 𝑦) = 𝑓(x̄ + Δ𝑥, ȳ + Δ𝑦) = (x̄ + Δ𝑥) · (ȳ + Δ𝑦)
        = x̄ȳ + ȳΔ𝑥 + x̄Δ𝑦 + Δ𝑥Δ𝑦
        = 𝑓(x̄, ȳ) + 𝐷𝑓(x̄, ȳ) · (Δ𝑥, Δ𝑦)⊤ + Δ𝑥Δ𝑦 (4.24)
which is of the form (4.23) if we can find a remainder term 𝑅(Δ𝑥, Δ𝑦) such that 𝑅(Δ𝑥, Δ𝑦) · ∥(Δ𝑥, Δ𝑦)∥ = Δ𝑥Δ𝑦. This holds if 𝑅(Δ𝑥, Δ𝑦) = Δ𝑥Δ𝑦/∥(Δ𝑥, Δ𝑦)∥, and then

|𝑅(Δ𝑥, Δ𝑦)| = |Δ𝑥Δ𝑦| / √(Δ𝑥² + Δ𝑦²) = √( Δ𝑥²Δ𝑦² / (Δ𝑥² + Δ𝑦²) ) = 1 / √( 1/Δ𝑦² + 1/Δ𝑥² ) (4.25)
which indeed goes to zero as (Δ𝑥 , Δ 𝑦 ) → (0, 0) because then the denominator in
(4.25) becomes arbitrarily large.
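This behaviour of the remainder term is easy to confirm numerically. A minimal sketch (illustration only, not part of the guide):

```python
import math

# R(dx, dy) = dx*dy / ||(dx, dy)|| from (4.25), approaching (0, 0)
# along a fixed direction; R shrinks proportionally to the step size.
for t in [1.0, 0.1, 0.01, 0.001]:
    dx, dy = t, 2.0 * t
    R = dx * dy / math.hypot(dx, dy)
    print(t, R)
```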
Figure 4.6 Illustration of Definition 4.7 for a function defined on the interval
[𝑎, 𝑒].
In Figure 4.6, 𝑎, 𝑐, and 𝑒 are local maximisers and 𝑏 and 𝑑 are local minimisers
of the function shown, where 𝑏, 𝑐, 𝑑 are unconstrained. The function attains its
global minimum at 𝑏 and its global maximum at 𝑒.
Proof. The direction "⇒" is immediate from Definition 4.7. To see the converse direction "⇐", if x̄ is a local maximiser of 𝑓 on 𝑋 then there is some 𝜀₁ > 0 so that 𝑓(𝑥) ≤ 𝑓(x̄) for all 𝑥 ∈ 𝑋 ∩ 𝐵(x̄, 𝜀₁), and x̄ is an interior point of 𝑋 if 𝐵(x̄, 𝜀₂) ⊆ 𝑋 for some 𝜀₂ > 0. With 𝜀 = min{𝜀₁, 𝜀₂} we obtain 𝐵(x̄, 𝜀) ⊆ 𝑋 and 𝑓(𝑥) ≤ 𝑓(x̄) for all 𝑥 ∈ 𝐵(x̄, 𝜀), that is, x̄ is an unconstrained local maximiser of 𝑓 on 𝑋.
Because ∥𝐺∥ > 0, 𝑡 > 0, and 𝑅(𝐺⊤𝑡) → 0 as 𝑡 → 0, the term ∥𝐺∥ + 𝑅(𝐺⊤𝑡) is positive for sufficiently small positive 𝑡, and therefore 𝑓(x̄ + Δ𝑥) > 𝑓(x̄), which contradicts the local maximality of 𝑓(x̄). So 𝐷𝑓(x̄) = 0 as claimed.

If x̄ is an unconstrained local minimiser of 𝑓, then x̄ is an unconstrained local maximiser of −𝑓, so that −𝐷𝑓(x̄) = 0, which is equivalent to 𝐷𝑓(x̄) = 0.
For the function 𝑓(𝑥, 𝑦) = (𝑥 − 𝑦)/(2 + 𝑥² + 𝑦²), the first-order condition 𝐷𝑓(𝑥, 𝑦) = (0, 0) states

(2 + 𝑥² + 𝑦² − (𝑥 − 𝑦)2𝑥) / (2 + 𝑥² + 𝑦²)² = 0 ,    (−2 − 𝑥² − 𝑦² − (𝑥 − 𝑦)2𝑦) / (2 + 𝑥² + 𝑦²)² = 0

or equivalently

2 − 𝑥² + 𝑦² + 2𝑥𝑦 = 0 ,
−2 − 𝑥² + 𝑦² − 2𝑥𝑦 = 0 . (4.28)
Adding the two equations in (4.28) gives 𝑦² = 𝑥². For 𝑦 = 𝑥 the first equation in (4.28) states 2 + 2𝑥² = 0, which is impossible, and for 𝑦 = −𝑥 it states 2 − 2𝑥² = 0, so the solutions of (4.28) are (𝑥, 𝑦) = (1, −1) and (−1, 1). We verify that 𝑓(1, −1) is the global maximum of 𝑓 via the following chain of equivalent inequalities:

𝑓(𝑥, 𝑦) ≤ 𝑓(1, −1) = 2/(2 + 1 + 1) = 1/2

that is,

(𝑥 − 𝑦)/(2 + 𝑥² + 𝑦²) ≤ 1/2
2𝑥 − 2𝑦 ≤ 2 + 𝑥² + 𝑦² (4.29)
0 ≤ 1 − 2𝑥 + 𝑥² + 1 + 2𝑦 + 𝑦²
0 ≤ (1 − 𝑥)² + (1 + 𝑦)²

which is true (with equality for (𝑥, 𝑦) = (1, −1), a useful check). The inequality 𝑓(𝑥, 𝑦) ≥ 𝑓(−1, 1) = −1/2 is shown very similarly, which shows that 𝑓(−1, 1) is the global minimum of 𝑓.
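A rough numerical check of this result (an illustration that is not part of the guide, assuming numpy is available):

```python
import numpy as np

# Grid search for the maximum of f(x, y) = (x - y) / (2 + x^2 + y^2).
xs = np.linspace(-3, 3, 601)
X, Y = np.meshgrid(xs, xs)
F = (X - Y) / (2 + X**2 + Y**2)
i, j = np.unravel_index(np.argmax(F), F.shape)
print(F.max(), X[i, j], Y[i, j])   # approximately 0.5 at (1.0, -1.0)
```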
In the following example, the first-order condition of a zero derivative gives
also useful information, although of a different kind. Consider the function
𝑔 : R2 → R,
𝑔(𝑥, 𝑦) = 𝑥𝑦 / (1 + 𝑥² + 𝑦²) , (4.30)
Setting 𝐷𝑔(𝑥, 𝑦) = (0, 0) means that the numerators of both partial derivatives of 𝑔 are zero, or equivalently
𝑦 − 𝑦𝑥 2 + 𝑦 3 = 0 ,
(4.31)
𝑥 + 𝑥 3 − 𝑥𝑦 2 = 0 .
An obvious solution to (4.31) is (𝑥, 𝑦) = (0, 0), but this is only a stationary point of
𝑔 and neither maximum nor minimum (not even locally), because 𝑔(0, 0) = 0 but
𝑔(𝑥, 𝑦) takes positive as well as negative values (also near (0, 0)). Similarly, when
𝑥 = 0 or 𝑦 = 0, then 𝑔(𝑥, 𝑦) = 0 but this is not a maximum or minimum, so that we
can assume 𝑥 ≠ 0 and 𝑦 ≠ 0. Then the equations (4.31) are equivalent to
1 − 𝑥2 + 𝑦2 = 0
(4.32)
1 + 𝑥2 − 𝑦2 = 0
which when added give 2 = 0 which is a contradiction. This shows that there is no
solution to (4.31) where 𝑥 ≠ 0 and 𝑦 ≠ 0 and thus 𝑔(𝑥, 𝑦) has no local and therefore
also no global maximum or minimum. This is possible because the domain R2 of
𝑔 is not compact. For 𝑥 = 𝑦 and large 𝑥, for example, we have
𝑔(𝑥, 𝑥) = 𝑥² / (1 + 2𝑥²) = 1 / (1/𝑥² + 2)
which tends to 1/2 as 𝑥 → ∞. It seems that 1/2 is an upper bound for 𝑔(𝑥, 𝑦). We can prove this for all (𝑥, 𝑦) via the following equivalences:

𝑔(𝑥, 𝑦) = 𝑥𝑦/(1 + 𝑥² + 𝑦²) < 1/2
2𝑥𝑦 < 1 + 𝑥² + 𝑦²
0 < 1 + 𝑥² − 2𝑥𝑦 + 𝑦²
0 < 1 + (𝑥 − 𝑦)²

which is true. We can show similarly that 𝑔(𝑥, 𝑦) > −1/2 and that 𝑔(𝑥, −𝑥) gets arbitrarily close to −1/2. This shows that the image of 𝑔 is the interval (−1/2, 1/2).
4.8 Equality Constraints and the Theorem of Lagrange

The following central Theorem of Lagrange gives conditions for a constrained local
maximum or minimum of a continuously differentiable function 𝑓 . The con-
straints are given as 𝑘 equations 𝑔1 (𝑥) = 0, . . . , 𝑔 𝑘 (𝑥) = 0 with continuously
differentiable functions 𝑔1 , . . . , 𝑔 𝑘 . The theorem states that at a local maximum
or minimum, the optimised function 𝑓 (𝑥) has a gradient that is a linear combina-
tion of the gradients of these constraint functions, provided these gradients are
linearly independent. The latter condition is called the constraint qualification. The
corresponding coefficients 𝜆1 , . . . , 𝜆 𝑘 are known as Lagrange multipliers.
To understand this theorem, consider first the case 𝑘 = 1, that is, a single
constraint 𝑔(𝑥) = 0. Then (4.33) states 𝐷 𝑓 (𝑥) = 𝜆 𝐷 𝑔(𝑥), which means that the
gradient of 𝑓 (a row vector) is a scalar multiple of the gradient of 𝑔. The two
gradients have the 𝑛 partial derivatives of 𝑓 and 𝑔 as components, and each partial
derivative of 𝑔 is multiplied with the same 𝜆 to equal the respective partial derivative
of 𝑓 . These are 𝑛 equations for the 𝑛 components of 𝑥 and 𝜆 as unknowns. An
additional equation is 𝑔(𝑥) = 0. Hence, these are 𝑛 + 1 equations for 𝑛 + 1 unknowns in total. If there are 𝑘 constraints 𝑔ᵢ(𝑥) = 0 for 1 ≤ 𝑖 ≤ 𝑘, then (4.33) and these constraints are 𝑛 + 𝑘 equations for the 𝑛 + 𝑘 unknowns given by the components of 𝑥 and 𝜆₁, . . . , 𝜆ₖ.
As an example, let 𝑓(𝑥, 𝑦) = 𝑥𝑦 on the circle 𝑋 = {(𝑥, 𝑦) ∈ R² | 𝑔(𝑥, 𝑦) = 0} for 𝑔(𝑥, 𝑦) = 𝑥² + 𝑦² − 2. (4.34)

For (4.34), 𝐷𝑓(𝑥, 𝑦) = (𝑦, 𝑥) and 𝐷𝑔(𝑥, 𝑦) = (2𝑥, 2𝑦). Here 𝐷𝑔(𝑥, 𝑦) is linearly
dependent only if (𝑥, 𝑦) = (0, 0), which however does not fulfill 𝑔(𝑥, 𝑦) = 0,
so the constraint qualification holds always. The Lagrange multiplier 𝜆 has
to fulfill 𝐷 𝑓 (𝑥, 𝑦) = 𝜆 𝐷 𝑔(𝑥, 𝑦), that is, (𝑦, 𝑥) = 𝜆(2𝑥, 2𝑦). Here 𝑥 = 0 would
imply 𝑦 = 0 and vice versa, so we have 𝑥 ≠ 0 and 𝑦 ≠ 0, and the first equation
𝑦 = 𝜆2𝑥 implies 𝜆 = 𝑦/2𝑥, which when substituted into the second equation
gives 𝑥 = 𝜆2𝑦 = 2𝑦²/2𝑥, and thus 𝑥² = 𝑦² or |𝑥| = |𝑦|. The constraint 𝑔(𝑥, 𝑦) = 0 then implies 𝑥² + 𝑦² − 2 = 2𝑥² − 2 = 0 and therefore |𝑥| = 1, which gives the four
solutions (1, 1), (−1, −1), (−1, 1), and (1, −1). For the first two solutions, 𝑓 takes the
value 1, and for the last two the value −1, so these are the local and in fact global
maxima and minima of 𝑓 on the circle 𝑋.
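The same four candidate points can be computed symbolically. A minimal sketch (assuming the sympy library is available):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x * y
g = x**2 + y**2 - 2
# Lagrange conditions Df = lam * Dg together with the constraint g = 0
eqs = [sp.diff(f, x) - lam * sp.diff(g, x),
       sp.diff(f, y) - lam * sp.diff(g, y),
       g]
print(sp.solve(eqs, [x, y, lam]))
# the four candidates (x, y) = (1, 1), (-1, -1) with lam = 1/2
# and (1, -1), (-1, 1) with lam = -1/2
```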
The following functions illustrate why the constraint qualification is needed in
Theorem 4.10. Let
𝑓(𝑥, 𝑦) = −𝑦,    𝑔(𝑥, 𝑦) = 𝑥² − 𝑦³ , (4.35)
where "≈" means that we neglect the remainder term because we assume Δ𝑥 to be sufficiently small. By (4.36), 𝐷𝑔(x̄) · Δ𝑥 = 0, and the set of these Δ𝑥's is a subspace of Rⁿ of dimension 𝑛 − 1 provided 𝐷𝑔(x̄) ≠ 0, which holds by the constraint qualification (this just says that the gradient of 𝑔 at the point x̄ is orthogonal to the "contour set" {𝑥 ∈ Rⁿ | 𝑔(𝑥) = 0}). Similarly, a local maximum 𝑓(x̄) requires 𝐷𝑓(x̄) · Δ𝑥 ≤ 0 for all these Δ𝑥, and because with Δ𝑥 also −Δ𝑥 fulfills 𝐷𝑔(x̄) · Δ𝑥 = 0, this means 𝐷𝑓(x̄) · Δ𝑥 = 0, and therefore

𝐷𝑔(x̄) · Δ𝑥 = 0  ⇒  𝐷𝑓(x̄) · Δ𝑥 = 0 (4.38)
With Lagrange multipliers 𝜆 = (𝜆₁, . . . , 𝜆ₖ), define the Lagrangian 𝐹 : Rⁿ × Rᵏ → R by

𝐹(𝑥, 𝜆) = 𝑓(𝑥) − Σ_{𝑖=1}^{𝑘} 𝜆ᵢ 𝑔ᵢ(𝑥) . (4.39)

Then the stationary points of 𝐹 are by definition the points (𝑥, 𝜆) ∈ Rⁿ × Rᵏ with
zero derivative, that is,
𝐷𝐹(𝑥, 𝜆) = 0 . (4.40)
These are 𝑛 + 𝑘 equations for the partial derivatives of 𝐹 with 𝑛 + 𝑘 unknowns,
the components of (𝑥, 𝜆). These equations define exactly the problem of finding
the Lagrangian multipliers in (4.33) and of solving the given equality constraints.
Namely, the first 𝑛 equations in (4.40) are for the 𝑛 partial derivatives with respect
to 𝑥 𝑗 of 𝐹, that is, by (4.39),
∂𝐹(𝑥, 𝜆)/∂𝑥ⱼ = ∂𝑓(𝑥)/∂𝑥ⱼ − Σ_{𝑖=1}^{𝑘} 𝜆ᵢ ∂𝑔ᵢ(𝑥)/∂𝑥ⱼ = 0    (1 ≤ 𝑗 ≤ 𝑛), (4.41)

or in vector form

𝐷𝑓(𝑥) − Σ_{𝑖=1}^{𝑘} 𝜆ᵢ 𝐷𝑔ᵢ(𝑥) = 0 , (4.42)
which is equivalent to (4.33). The last 𝑘 equations in (4.40) are for the 𝑘 partial
derivatives with respect to 𝜆 𝑖 of 𝐹, that is,
∂𝐹(𝑥, 𝜆)/∂𝜆ᵢ = −𝑔ᵢ(𝑥) = 0    (1 ≤ 𝑖 ≤ 𝑘) (4.43)
which is equivalent to 𝑔𝑖 (𝑥) = 0 for 1 ≤ 𝑖 ≤ 𝑘, which are the given equality
constraints. Note that it does not make sense to maximise the Lagrangian 𝐹(𝑥, 𝜆) without these constraints, because it is unbounded when we take any 𝑥 where 𝑔ᵢ(𝑥) ≠ 0 and let 𝜆ᵢ → −∞ (if 𝑔ᵢ(𝑥) > 0, or 𝜆ᵢ → ∞ if 𝑔ᵢ(𝑥) < 0), which makes the term −𝜆ᵢ𝑔ᵢ(𝑥) in (4.39) arbitrarily large.
The Lagrangian is often defined as 𝐹(𝑥, 𝜆) = 𝑓(𝑥) + Σ_{𝑖=1}^{𝑘} 𝜆ᵢ 𝑔ᵢ(𝑥), which is (4.39) but with a plus sign instead of a minus sign, which accordingly gives (4.42) with a plus sign instead of a minus sign. This is also equivalent to (4.33) except for the sign change of each 𝜆ᵢ. We prefer (4.33), which states directly that 𝐷𝑓(x̄) is a linear combination of the gradients 𝐷𝑔ᵢ(x̄).
Lagrangian multipliers can be interpreted as shadow prices in certain economic
settings. In such a setting, 𝑥 may represent an allocation of the variables 𝑥1 , . . . , 𝑥 𝑛
according to some production schedule which results in profit 𝑓 (𝑥) for the firm,
subject to the constraints 𝑔ᵢ(𝑥) = 0 for 1 ≤ 𝑖 ≤ 𝑘. The profit 𝑓(𝑥) is maximised at x̄, with Lagrange multipliers 𝜆₁, . . . , 𝜆ₖ as in (4.33). Suppose that ĝⱼ(𝑥) is the amount of some resource 𝑗 needed for production schedule 𝑥, for example manpower, of which amount 𝑎ⱼ is available, so that 𝑔ⱼ(𝑥) = ĝⱼ(𝑥) − 𝑎ⱼ = 0 (assuming all manpower is used; we could more generally assume 𝑔ⱼ(𝑥) ≤ 0, but here just assume that for 𝑥 = x̄ this inequality is tight, 𝑔ⱼ(x̄) = 0).
Now suppose the amount of manpower can be increased from 𝑎ⱼ to 𝑎ⱼ + 𝜀 for some small amount 𝜀 > 0, which results in the new constraint 𝑔ⱼ(𝑥) = 𝜀, where all other constraints are kept fixed. Assume that the new constraint results in a new optimal solution x̄(𝜀), that is, 𝑔ⱼ(x̄(𝜀)) = 𝜀 and 𝑔ᵢ(x̄(𝜀)) = 0 for 𝑖 ≠ 𝑗. We claim that then

𝑓(x̄(𝜀)) ≈ 𝑓(x̄) + 𝜆ⱼ𝜀 . (4.44)

Namely, with x̄(𝜀) = x̄ + Δ𝑥 we have 𝐷𝑔ᵢ(x̄) · Δ𝑥 = 0 for 𝑖 ≠ 𝑗 in order to keep the condition 𝑔ᵢ(x̄ + Δ𝑥) = 0 (see (4.36) above), but

𝐷𝑔ⱼ(x̄) · Δ𝑥 = 𝜀 (4.45)

in order to achieve 𝑔ⱼ(x̄ + Δ𝑥) = 𝜀, and thus
𝑓(x̄(𝜀)) = 𝑓(x̄ + Δ𝑥) ≈ 𝑓(x̄) + 𝐷𝑓(x̄) · Δ𝑥
         = 𝑓(x̄) + Σ_{𝑖=1}^{𝑘} 𝜆ᵢ 𝐷𝑔ᵢ(x̄) · Δ𝑥
         = 𝑓(x̄) + 𝜆ⱼ 𝐷𝑔ⱼ(x̄) · Δ𝑥 (4.46)
         = 𝑓(x̄) + 𝜆ⱼ𝜀
which shows (4.44). The interpretation of (4.44) is that adding 𝜀 more manpower (amount of resource 𝑗) so that the constraint 𝑔ⱼ(𝑥) = 0 is changed to 𝑔ⱼ(𝑥) = 𝜀 increases the firm's profit by 𝜆ⱼ𝜀. Hence, 𝜆ⱼ is the price per extra unit of manpower that the firm should be willing to pay, given the current maximiser x̄ and associated Lagrange multipliers 𝜆₁, . . . , 𝜆ₖ in (4.33).
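A tiny numerical illustration of (4.44) (this example is not from the guide): maximise 𝑓(𝑥, 𝑦) = 𝑥𝑦 subject to 𝑥 + 𝑦 − 1 = 0. The maximiser is 𝑥 = 𝑦 = 1/2 with value 1/4 and multiplier 𝜆 = 1/2 (from (𝑦, 𝑥) = 𝜆(1, 1)). Relaxing the constraint to 𝑥 + 𝑦 = 1 + 𝜀 changes the optimal value by approximately 𝜆𝜀:

```python
lam = 0.5
for eps in [0.1, 0.01, 0.001]:
    new_value = ((1 + eps) / 2) ** 2   # exact optimum of the relaxed problem
    predicted = 0.25 + lam * eps       # first-order prediction as in (4.44)
    print(eps, new_value, predicted)   # the two values agree up to O(eps^2)
```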
The following is a typical problem that can be solved with the help of Lagrange’s
Theorem 4.10. A manufacturer of rectangular milk cartons wants to minimise the
material used to obtain a carton of a given volume. A carton is 𝑥 cm high, 𝑦 cm
wide and 𝑧 cm deep, and is folded according to the layout shown on the right in
Figure 4.9 (which is used twice, for front and back). Each of the four squares in a
corner of the layout with length 𝑧/2 (together with its counterpart on the back) is
folded into a triangle as shown on the left (the triangles at the bottom are folded
underneath the carton). We ignore any overlapping material used for glueing.
What are the optimal dimensions 𝑥, 𝑦, 𝑧 for a carton with volume 500 cm3 ?
[Figure 4.9: the folded carton with dimensions 𝑥, 𝑦, 𝑧 (left) and the layout of the material with corner squares of side 𝑧/2 (right).]
The layout on the right shows that the area 𝑓 (𝑥, 𝑦, 𝑧) of the material used is
(𝑥 + 𝑧)(𝑦 + 𝑧) times two (for front and back, but the factor 2 can be ignored in the
minimisation), subject to 𝑔(𝑥, 𝑦, 𝑧) = 𝑥𝑦𝑧 − 500 = 0. We have
𝐷 𝑓 (𝑥, 𝑦, 𝑧) = (𝑦 + 𝑧, 𝑥 + 𝑧, 𝑥 + 𝑦 + 2𝑧) , 𝐷 𝑔(𝑥, 𝑦, 𝑧) = (𝑦𝑧, 𝑥𝑧, 𝑥𝑦) .
Because clearly 𝑥, 𝑦, 𝑧 > 0, the derivative 𝐷 𝑔(𝑥, 𝑦, 𝑧) is never zero and therefore
linearly independent. By Lagrange’s theorem, there is some 𝜆 so that
𝑦 + 𝑧 = 𝜆𝑦𝑧 ,
𝑥 + 𝑧 = 𝜆𝑥𝑧 , (4.47)
𝑥 + 𝑦 + 2𝑧 = 𝜆𝑥𝑦 .
These equations are nonlinear, but simpler equations can be found by exploiting
their symmetry. Multiplying the first, second, and third equation in (4.47) by
𝑥, 𝑦, 𝑧, respectively (all of which are nonzero), these equations are equivalent to
𝑥(𝑦 + 𝑧) = 𝜆𝑥𝑦𝑧 ,
𝑦(𝑥 + 𝑧) = 𝜆𝑥𝑦𝑧 , (4.48)
𝑧(𝑥 + 𝑦 + 2𝑧) = 𝜆𝑥𝑦𝑧 ,
that is, they all have the same right-hand side. The first two equations in (4.48)
imply 𝑥𝑧 = 𝑦𝑧 and thus 𝑥 = 𝑦. With 𝑥 = 𝑦, the second and third equation give
𝑥(𝑥 + 𝑧) = 𝑧(2𝑥 + 2𝑧) = 2𝑧(𝑥 + 𝑧)
and thus 𝑥 = 2𝑧. That is, the only optimal solution is of the form (2𝑧, 2𝑧, 𝑧).
Applied to the volume equation this gives 4𝑧³ = 500 or 𝑧³ = 125, that is, 𝑥 = 𝑦 = 10 cm and 𝑧 = 5 cm.

The area of material used is 2(𝑥 + 𝑧)(𝑦 + 𝑧) = 2 × 15² = 450 cm². In comparison, the surface area of the carton without the extra folded triangles is 2(𝑥𝑦 + 𝑥𝑧 + 𝑦𝑧) = 2(100 + 50 + 50) = 400 cm². The extra material is from the eight squares of size 2.5 × 2.5 for the folded triangles which do not contribute to the surface of the carton, which have area 8 × 2.5² = 2 × 5² = 50 cm².
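The Lagrange system (4.47) together with the volume constraint can also be solved symbolically. A minimal sketch (assuming the sympy library is available):

```python
import sympy as sp

x, y, z, lam = sp.symbols('x y z lam', positive=True)
f = (x + z) * (y + z)      # material used (the factor 2 is omitted)
g = x * y * z - 500        # volume constraint
eqs = [sp.diff(f, v) - lam * sp.diff(g, v) for v in (x, y, z)] + [g]
print(sp.solve(eqs, [x, y, z, lam]))   # expected: [(10, 10, 5, 3/10)]
```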
[Figures 4.10 and 4.11: the region ℎ(𝑥) < 0 (shaded grey) with boundary ℎ(𝑥) = 0, a contour line 𝑓(𝑥) = 𝑐, and the gradients 𝐷𝑓 and 𝐷ℎ at a boundary point x̄.]
If ℎ(𝑥) denotes the "height of a terrain" at location 𝑥, then this grey region can be seen as a "lake" with the surface of the water at height 0. The gradient 𝐷ℎ(𝑥) points outwards, orthogonally to the boundary, for getting out of the lake. The function 𝑓(𝑥) may denote the height of a different terrain, and maximising 𝑓(𝑥) over the lake is achieved at a point x̄ where the contour line of 𝑓 touches the contour line of ℎ. This is exactly the same situation as in the Lagrange multiplier problem, meaning 𝐷𝑓(x̄) = 𝜇𝐷ℎ(x̄) for some Lagrange multiplier 𝜇, with the additional constraint that the gradients of 𝑓 and ℎ have to be not only co-linear, but point in the same direction, that is, 𝜇 ≥ 0. (We use a different Greek letter 𝜇 instead of the usual 𝜆 to emphasise this.) The reason is that at the point x̄ both ℎ(𝑥) and 𝑓(𝑥) are maximised, by "getting out of the lake", and by maximising 𝑓, in the direction of the gradients.
The following is the central theorem of this chapter. For its naming see
Section 4.1.3.
𝑋 = 𝑈 ∩ {𝑥 ∈ R𝑛 | ℎ 𝑖 (𝑥) ≤ 0 , 1 ≤ 𝑖 ≤ ℓ } . (4.50)
Let the set of vectors {𝐷ℎᵢ(x̄) | ℎᵢ(x̄) = 0} be linearly independent ("constraint qualification" for the tight constraints). Then there exist 𝜇₁, . . . , 𝜇ℓ ∈ R so that for 1 ≤ 𝑖 ≤ ℓ

𝜇ᵢ ≥ 0 ,    𝜇ᵢ ℎᵢ(x̄) = 0 ,    𝐷𝑓(x̄) = Σ_{𝑖=1}^{ℓ} 𝜇ᵢ 𝐷ℎᵢ(x̄) . (4.51)
𝐸 = {𝑖 | 1 ≤ 𝑖 ≤ ℓ , ℎᵢ(x̄) = 0} . (4.53)
Proof of Theorem 4.11. We prove the KKT Theorem with the help of the Theorem 4.10 of Lagrange. Let x̄ be a local maximiser of 𝑓 on 𝑋, and let 𝐸 be the set of effective constraints as in (4.53). Because the functions ℎᵢ for 𝑖 ∉ 𝐸 are continuous, the set 𝑉 defined by 𝑉 = 𝑈 ∩ {𝑥 ∈ Rⁿ | ℎᵢ(𝑥) < 0 for all 𝑖 ∉ 𝐸} is open and contains x̄. Hence, x̄ is a local maximiser of 𝑓, subject only to the equality constraints for 𝑖 ∈ 𝐸, on the set

𝑉 ∩ {𝑥 ∈ Rⁿ | ℎᵢ(𝑥) = 0, 𝑖 ∈ 𝐸} . (4.56)

Because the constraint qualification holds for the gradients 𝐷ℎᵢ(x̄) for 𝑖 ∈ 𝐸, there are Lagrange multipliers 𝜇ᵢ for 𝑖 ∈ 𝐸 so that (4.54) holds. It remains to show that they are nonnegative.
Suppose 𝜇ⱼ < 0 for some 𝑗 ∈ 𝐸. Because x̄ is in the interior of 𝑉, for sufficiently small 𝜀 > 0 we can find Δ𝑥 ∈ Rⁿ so that x̄ + Δ𝑥 ∈ 𝑉 and, as in (4.45),

ℎⱼ(x̄ + Δ𝑥) = −𝜀 (4.57)
[Figure 4.12: graphs of 𝑓(𝑥) and ℎ(𝑥) for 𝑛 = 1, with ℎ(𝑥) ≤ 0 on the two intervals [𝑎, 𝑏] and [𝑐, 𝑑].]
The sign conditions in the KKT Theorem are most easily remembered (or
reconstructed) for a single constraint in dimension 𝑛 = 1, as shown in Figure 4.12.
There the condition ℎ(𝑥) ≤ 0 holds on the two intervals [𝑎, 𝑏] and [𝑐, 𝑑] and is tight
at any end of either interval. For 𝑥 = 𝑎 both 𝑓 and ℎ have a negative derivative,
and hence 𝐷 𝑓 (𝑥) = 𝜇𝐷 ℎ(𝑥) for some 𝜇 ≥ 0, and indeed 𝑥 is a local maximiser
of 𝑓 . For 𝑥 ∈ {𝑏, 𝑐} the derivatives of 𝑓 and ℎ have opposite sign, and in each case
𝐷 𝑓 (𝑥) = 𝜆𝐷 ℎ(𝑥) for some 𝜆 but 𝜆 < 0, so these are not maximisers of 𝑓 . However,
in that case −𝐷 𝑓 (𝑥) = −𝜆𝐷 ℎ(𝑥) and hence both 𝑏 and 𝑐 are local maximisers of
− 𝑓 and hence local minimisers of 𝑓 , in agreement with the picture. For 𝑥 = 𝑑 we
have a local maximum 𝑓 (𝑥) with 𝐷 𝑓 (𝑥) > 0 but 𝐷 ℎ(𝑥) = 0 and hence no 𝜇 with
𝐷 𝑓 (𝑥) = 𝜇𝐷 ℎ(𝑥) because the constraint qualification fails. Moreover, there are two
points 𝑥 in the interior of [𝑐, 𝑑] where 𝑓 has zero derivative, which is a necessary
condition for a local maximum of 𝑓 because ℎ(𝑥) ≤ 0 is not tight. One of these
points is indeed a local maximum.
Method 4.12. The following is a “cookbook procedure” to use the KKT Theorem
4.11 in order to find the optimum of a function 𝑓 : R𝑛 → R.
1. Write all inequality constraints in the form ℎ 𝑖 (𝑥) ≤ 0 for 1 ≤ 𝑖 ≤ ℓ . In particular,
write a constraint such as 𝑔(𝑥) ≥ 0 in the form −𝑔(𝑥) ≤ 0 .
2. Assert that the functions 𝑓 , ℎ1 , . . . , ℎℓ are 𝐶 1 functions on R𝑛 . If the function 𝑓
is to be minimised, replace it by − 𝑓 to obtain a maximisation problem.
3. Check if the set

𝑆 = {𝑥 ∈ Rⁿ | ℎᵢ(𝑥) ≤ 0, 1 ≤ 𝑖 ≤ ℓ} (4.59)

is compact, so that the maximum of 𝑓 exists by the Theorem of Weierstrass. If 𝑆 is not compact, it may suffice to find some 𝑐 ∈ R so that

𝑇 = 𝑆 ∩ {𝑥 ∈ Rⁿ | 𝑓(𝑥) ≥ 𝑐} (4.60)

is compact and nonempty, because then the maximum of 𝑓 on 𝑇 is also its maximum on 𝑆 (as in the example at the end of Chapter 3).

4a. For each subset 𝐸 of tight constraints, identify the "critical points" 𝑥 with ℎᵢ(𝑥) = 0 for 𝑖 ∈ 𝐸 where the gradients 𝐷ℎᵢ(𝑥) for 𝑖 ∈ 𝐸 are linearly dependent, so that the constraint qualification fails; these points have to be examined separately as possible maximisers.
4b. Find solutions 𝑥 and 𝜇𝑖 for 𝑖 ∈ 𝐸 to (4.54) and to the equations ℎ 𝑖 (𝑥) = 0 for
𝑖 ∈ 𝐸. If 𝜇𝑖 ≥ 0 for all 𝑖 ∈ 𝐸, then this is a local maximum of 𝑓 , otherwise not.
5. Compare the function values of 𝑓 (𝑥) found in 4b, and of 𝑓 (𝑥) for the critical
points 𝑥 in 4a, to determine the global maximum (which may occur for more
than one maximiser).
The main step in this method is Step 4. As an example, we apply Method 4.12
to the problem
maximise 𝑥 + 𝑦 subject to (1/2)𝑦 ≤ (1/2)𝑥 , 𝑦 ≤ 5/4 − (1/4)𝑥² , 𝑦 ≥ 0 . (4.61)

In the form required by Method 4.12, this problem has the functions

𝑓(𝑥, 𝑦) = 𝑥 + 𝑦
ℎ₁(𝑥, 𝑦) = −(1/2)𝑥 + (1/2)𝑦
ℎ₂(𝑥, 𝑦) = (1/4)𝑥² + 𝑦 − 5/4 (4.62)
ℎ₃(𝑥, 𝑦) = −𝑦

with gradients

𝐷𝑓(𝑥, 𝑦) = (1, 1)
𝐷ℎ₁(𝑥, 𝑦) = (−1/2, 1/2)
𝐷ℎ₂(𝑥, 𝑦) = ((1/2)𝑥, 1) (4.63)
𝐷ℎ₃(𝑥, 𝑦) = (0, −1) .
There are eight possible subsets 𝐸 of {1, 2, 3}. If 𝐸 = ∅, then (4.54) holds if
𝐷 𝑓 (𝑥, 𝑦) = (0, 0), which is never the case. Next, consider the three “corners” of the
set 𝑆 which are defined when two inequalities are tight, where 𝐸 has two elements.
If 𝐸 = {1, 2} then ℎ₁(𝑥, 𝑦) = 0 and ℎ₂(𝑥, 𝑦) = 0 hold if 𝑥 = 𝑦 and (1/4)𝑥² + 𝑥 − 5/4 = 0 or 𝑥² + 4𝑥 − 5 = 0, that is, (𝑥 − 1)(𝑥 + 5) = 0 or 𝑥 ∈ {1, −5}, where only 𝑥 = 𝑦 = 1 fulfills 𝑦 ≥ 0, so this is the point 𝑎 = (1, 1) shown in Figure 4.13. In this case ℎ₃(1, 1) = −1 < 0, so the third inequality is indeed not tight (if it was then this would
Figure 4.13 The set 𝑆 defined by the constraints in (4.62). The triple short
lines next to each line defined by ℎ 𝑖 (𝑥, 𝑦) = 0 (for 𝑖 = 1, 2, 3) show the side
where ℎ 𝑖 (𝑥, 𝑦) ≤ 0 holds, abbreviated as ℎ 𝑖 ≤ 0. The (infinite) set 𝐶 is the
cone of all nonnegative linear combinations of the gradients 𝐷 ℎ1 (𝑥, 𝑦) and
𝐷 ℎ2 (𝑥, 𝑦) for the point 𝑎 = (𝑥, 𝑦) = (1, 1), which does not contain 𝐷 𝑓 (𝑥, 𝑦), so
𝑓(𝑎) is not a local maximum. At the point 𝑏 = (2, 1/4) we have 𝐷𝑓(𝑏) = 𝜇₂𝐷ℎ₂(𝑏) for the (only) tight constraint ℎ₂(𝑏) = 0, and 𝜇₂ ≥ 0, so 𝑓(𝑏) is a local maximum.
correspond to the case 𝐸 = {1, 2, 3}). Then the two gradients are 𝐷ℎ₁(1, 1) = (−1/2, 1/2) and 𝐷ℎ₂(1, 1) = (1/2, 1), which are not scalar multiples of each other and therefore linearly independent, so the constraint qualification holds. Because these are two linearly independent vectors in R², any vector, in particular 𝐷𝑓(1, 1), can be represented as a linear combination of them. That is, there are 𝜇₁ and 𝜇₂ with (1, 1) = 𝜇₁(−1/2, 1/2) + 𝜇₂(1/2, 1), which has the solution 𝜇₁ = −2/3 and 𝜇₂ = 4/3. Because 𝜇₁ < 0, the KKT conditions fail at 𝑎 = (1, 1), in agreement with Figure 4.13, where 𝐷𝑓 does not belong to the cone 𝐶.
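The case analysis of Method 4.12 for this example can be automated. The following sketch (an illustration, not part of the guide; it assumes the sympy library) enumerates the candidate sets 𝐸 of tight constraints and keeps only feasible solutions with nonnegative multipliers:

```python
import itertools
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x + y
h = [-x/2 + y/2, x**2/4 + y - sp.Rational(5, 4), -y]   # constraints (4.62)

for r in range(4):
    for E in itertools.combinations(range(3), r):
        mus = sp.symbols(f'mu0:{len(E)}', real=True) if E else ()
        # Df = sum of mu_i * Dh_i over i in E, plus tightness h_i = 0
        eqs = [sp.diff(f, v) - sum(m * sp.diff(h[i], v)
                                   for m, i in zip(mus, E))
               for v in (x, y)]
        eqs += [h[i] for i in E]
        for s in sp.solve(eqs, [x, y, *mus], dict=True):
            feasible = all(hi.subs(s) <= 0 for hi in h)
            mu_ok = all(s[m] >= 0 for m in mus)
            if feasible and mu_ok:
                print(E, s)   # only E = (1,): x = 2, y = 1/4, mu = 1
```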
Figure 4.14 The maximisation problem (4.65). The parabolas are contour lines of 𝑓. The points 𝑎 = (0, −1) and 𝑏 = (√3/2, −1/2) are local maximisers of 𝑓, and 𝑓(𝑏) is the global maximum. The gradients are displayed shorter to save space.
The corresponding set 𝑆 is shown in Figure 4.14 and is compact, so 𝑓 has a maximum. We have
𝐷 𝑓 (𝑥, 𝑦) = (2𝑥, −1), 𝐷 ℎ1 (𝑥, 𝑦) = (−1, 0), 𝐷 ℎ2 (𝑥, 𝑦) = (2𝑥, 2𝑦). (4.67)
Figure 4.16 The feasible set and the objective function for the example
(4.68), and the optimal point (1, 0).
For (𝑥, 𝑦) = (0, 1), the tight constraints are given by ℎ 1 (0, 1) = ℎ3 (0, 1) = 0
whereas ℎ 2 (0, 1) < 0, so 𝐸 = {1, 3} in (4.53). By (4.69), 𝐷 ℎ1 (0, 1) = (−1, 0) and
𝐷 ℎ3 (0, 1) = (1, 1), which are linearly independent vectors. We want to find 𝜇1 and
𝜇₃ that are nonnegative so that 𝐷𝑓(0, 1) = 𝜇₁𝐷ℎ₁(0, 1) + 𝜇₃𝐷ℎ₃(0, 1), that is,

(2, 1) = 𝜇₁(−1, 0) + 𝜇₃(1, 1)
which has the unique solution 𝜇₃ = 1 and 𝜇₁ = −1. Because 𝜇₁ < 0, the KKT conditions (4.51) fail and (𝑥, 𝑦) = (0, 1) cannot be a local maximum of 𝑓 (nor, for that matter, a minimum, because 𝜇₃ > 0: for a minimum of 𝑓, that is, a maximum of −𝑓, we would need 𝜇₁ ≤ 0 and 𝜇₃ ≤ 0).
For completeness, we consider the cases of fewer tight constraints. 𝐸 = ∅
would require 𝐷𝑓(𝑥, 𝑦) = (0, 0), which is never the case. If 𝐸 = {1} then 𝐷𝑓(𝑥, 𝑦) would have to be a scalar multiple of 𝐷ℎ₁(𝑥, 𝑦), but it is not, and neither is it a scalar multiple of 𝐷ℎ₂(𝑥, 𝑦) when 𝐸 = {2}. Consider 𝐸 = {3}, so ℎ₃(𝑥, 𝑦) = 0 is the
only tight constraint, that is, 𝑥 > 0 and 𝑦 > 0. Then ℎ 3 (𝑥, 𝑦) = 0 is equivalent to
𝑥 + 𝑦 − 1 = 0. Then we need 𝜇3 ≥ 0 so that 𝐷 𝑓 (𝑥, 𝑦) = 𝜇3 𝐷 ℎ 3 (𝑥, 𝑦), that is,
(2, 1) = 𝜇3 (𝑦, 𝑥 − 2𝑦 − 1)
[Figure: a cylinder of height ℎ whose sidewall has thickness 1.]
How are ℎ and 𝑟 related when 𝐴 = 1? What are 𝑟 and ℎ if 𝑉 = 324 cm³ and 𝐴 = 6/𝜋 ≈ 1.91?
Exercise 4.4. A firm has to produce a good with two inputs 𝑥 and 𝑦, where 𝑦 ≥ 0 and, due to contractual obligations, 𝑥 ≥ 1. The output produced is given by √(𝑥𝑦), and the firm has to produce at least a certain fixed amount 𝑢 of output, 𝑢 > 0. The firm's costs are 𝑎𝑥 + 𝑏𝑦 with 𝑎, 𝑏 > 0, which the firm tries to minimise. We want to find the optimal choice of 𝑥 and 𝑦 for the firm under these conditions.
(a) Show that as described, the set of possible choices 𝑥, 𝑦 is closed but not
compact. Show that the search for an optimum can nevertheless be restricted
𝑓(𝑥, 𝑦) = 𝑥 · 𝑦 ,    𝑔(𝑥, 𝑦) = 𝑥² − 𝑦 − 3/2 ,    ℎ(𝑥, 𝑦) = −𝑥/2 + 𝑦 − 3/2 .
State why 𝑓 (𝑥, 𝑦) has or does not have a maximum or minimum on the set 𝑆,
and find all points of 𝑆 (if any) where 𝑓 attains its maximum and minimum.
5
Linear Optimisation
5.1 Introduction
• Section 5.9 (optional) treats "complementary slackness", which characterises when a feasible primal vector 𝑥 and a feasible dual vector 𝑦 are both optimal. Simply put, if there is an inequality involved, one of
these inequalities on the primal or dual side must be tight, for each primal and
dual variable. If the constraint is an equality, this holds automatically.
• The purpose of Section 5.10 (also optional) is to connect LP duality with the
KKT conditions from Section 4.9 in the previous chapter. Essentially, if the
optimisation problem is linear, then LP duality means the same as the KKT
conditions.
• Section 5.11 introduces the most important algorithm for solving an LP, called
the simplex algorithm, by means of an example.
• The final, also optional Section 5.12 is a description of the simplex algorithm
for an LP subject to equality constraints given in abstract form. The algorithm
traverses a sequence of basic solutions to these equality constraints. This de-
scription is meant to describe the mathematical concepts behind the algorithm,
in algebraic form, for the mathematically minded student. The main ideas
have already been covered in the preceding Section 5.11.
In this form, the problem is in the standard inequality form of a linear optimisation
problem, also called linear programming problem, or just linear program. (The term
“programming” was popular in the middle of the 20th century when optimisation
problems started to be solved with computer programs, with electronic computers
also being developed around the same time.)
5.2 Linear Functions, Hyperplanes, and Halfspaces

A linear function 𝑓 : Rⁿ → R is of the form

𝑓(𝑥₁, . . . , 𝑥ₙ) = 𝑐₁𝑥₁ + · · · + 𝑐ₙ𝑥ₙ = 𝑐⊤𝑥 (5.2)
which means that both 𝑐 and 𝑥 are considered as column vectors in R𝑛 , which we
consider as 𝑛 × 1 matrices. Then 𝑐⊤ is the corresponding row vector (a 1 × 𝑛 matrix)
and 𝑐⊤𝑥 is just a matrix product, which in this case produces a 1 × 1 matrix which
is a real number that represents the scalar product of the vectors 𝑐 and 𝑥 in (5.2).
For that reason, we write the multiplication of a vector 𝑥 with a scalar 𝜆 in the form 𝑥𝜆 (rather than as 𝜆𝑥) because it is the product of an 𝑛 × 1 with a 1 × 1 matrix.
This consistency is very helpful in re-grouping products of several matrices and
vectors.
Recall that we write the derivative 𝐷 ℎ(𝑥) of a function ℎ as a row vector, so
that we multiply it with a scalar like 𝜆 from the left as in 𝜆𝐷 ℎ(𝑥). Also, when we
write 𝑥 = (𝑥1 , . . . , 𝑥 𝑛 ), say, then this is just meant to define 𝑥 as an 𝑛-tuple of real
numbers and not as a row vector, because otherwise we would always have to
introduce 𝑥 tediously as 𝑥 = (𝑥1 , . . . , 𝑥 𝑛 )⊤. The thing to remember is that when we
use matrix multiplication, then a vector like 𝑥 is always a column vector and 𝑥⊤ is
a row vector.
Let 𝑐 ≠ 0 (where 0 is the vector with all components zero, in any dimension),
and let 𝑓 be the linear function defined by 𝑓 (𝑥) = 𝑐⊤𝑥 as in (5.2). The set
{𝑥 | 𝑓 (𝑥) = 0} where 𝑓 takes value 0 is a linear subspace of R𝑛 . By definition, it
consists of all vectors 𝑥 that are orthogonal to 𝑐, that is, have scalar product 0 with 𝑐.
If 𝑛 = 2, then this “nullspace” of 𝑓 is a line, but in general it will be a “hyperplane”
in R𝑛 , a space of dimension 𝑛 − 1.
More generally, let 𝑢 ∈ R and consider the set

𝐻 = {𝑥 ∈ Rⁿ | 𝑐⊤𝑥 = 𝑢} (5.3)

where 𝑓 takes value 𝑢, which we have earlier called a contour set or level set for 𝑓.
Then for any two 𝑥 and x̂ on this level set 𝐻, that is, so that 𝑓(𝑥) = 𝑓(x̂) = 𝑢, we have 𝑐⊤(𝑥 − x̂) = 0, so that the vector 𝑥 − x̂ is orthogonal to 𝑐. Then 𝐻 is also called a hyperplane through the point 𝑥 (which does not contain the origin 0 unless 𝑢 = 0) with normal vector 𝑐. To repeat, such a hyperplane 𝐻 is of the form (5.3) for some
with normal vector 𝑐. To repeat, such a hyperplane 𝐻 is of the form (5.3) for some
𝑐 ∈ R𝑛 , 𝑐 ≠ 0, and 𝑢 ∈ R. The different contour sets for 𝑓 are therefore parallel
hyperplanes, all with the same normal vector 𝑐.
Figure 5.1 shows an example of such level sets, where these “hyperplanes”
are contour lines because 𝑛 = 2. The vector 𝑐, here 𝑐 = (2, −1), is orthogonal to
any such level set. Moreover, 𝑐 points in the direction in which the function value
of 𝑓 (𝑥) increases, because if we replace 𝑥 by 𝑥 + 𝑐 then 𝑓 (𝑥) changes from 𝑐⊤𝑥 to
𝑓(𝑥 + 𝑐) = 𝑐⊤𝑥 + 𝑐⊤𝑐, which is larger than 𝑐⊤𝑥 because 𝑐⊤𝑐 = Σ_{𝑖=1}^{𝑛} 𝑐ᵢ² > 0 because 𝑐 ≠ 0. Note that 𝑐 may have negative components (as in the figure). Only the
direction of 𝑐 matters to find out where 𝑓 (𝑥) gets larger.
Similar to a hyperplane 𝐻 defined by 𝑐 and 𝑢 in (5.3), a halfspace 𝑆 is defined
by an inequality according to
Figure 5.1 Left: Contour lines (level sets) of the function 𝑓 : R2 → R defined
by 𝑓 (𝑥) = 𝑐⊤𝑥 for 𝑐 = (2, −1). Right: Halfspace 𝑆 in (5.4) given by 𝑐⊤𝑥 ≤ 5. As
before, the strokes next to 𝐻 indicate the side of the line where this inequality
is valid, which defines 𝑆.
𝑆 = {𝑥 ∈ R𝑛 | 𝑐⊤𝑥 ≤ 𝑢} (5.4)
which consists of all points 𝑥 that are on the hyperplane 𝐻 or “below” it, that
is, with smaller values of 𝑐⊤𝑥 than the points on 𝐻. Figure 5.1 shows such a
halfspace 𝑆 for 𝑐 = (2, −1) and 𝑢 = 5, which contains, for example, the point
𝑥 = (2.5, 0). It is customary to “shade” the side of the hyperplane 𝐻 that defines 𝑆
with a few small parallel strokes as shown in the picture, and then it is not needed
to indicate 𝑐 which is the orthogonal vector to 𝐻 that points away from 𝑆.
Figure 5.2 The feasible set and the objective function for the example (5.1),
and the optimal point (1, 0).
With these conventions, Figure 5.2 gives a graphical description of the problem in (5.1) where the feasible set where all inequalities hold is the intersection of the three halfspaces defined by 𝑥₁ ≥ 0 (which could be written as −𝑥₁ ≤ 0), 𝑥₂ ≥ 0 (which could be written as −𝑥₂ ≤ 0), and 𝑥₁ + 𝑥₂ ≤ 1. This is the shaded triangle. In this graphical way, the optimal solution is nearly obvious.
5.3 Linear Programming in Two Dimensions

Consider the following example of an LP with two variables:
maximise 𝑥1 + 𝑥2
subject to 𝑥1 ≥ 0
𝑥2 ≥ 0
(5.5)
−𝑥1 + 𝑥 2 ≤ 1
𝑥1 + 6𝑥 2 ≤ 15
4𝑥1 − 𝑥 2 ≤ 10 .
The set of points (𝑥1 , 𝑥2 ) in R2 that fulfill these inequalities is called the feasible set
and shown in Figure 5.3.
Figure 5.3 Feasible set and objective function vector (1, 1) for the LP (5.5),
with optimum at (3, 2) and objective function value 5.
The contour lines of the objective function 𝑓 (𝑥1 , 𝑥2 ) = 𝑥 1 + 𝑥 2 are parallel lines
where the maximum is clearly at the top-right corner (3, 2) of the feasible set, where
the constraints ℎ₄(𝑥₁, 𝑥₂) = 𝑥₁ + 6𝑥₂ ≤ 15 and ℎ₅(𝑥₁, 𝑥₂) = 4𝑥₁ − 𝑥₂ ≤ 10 are tight. The fact that this is a local maximum can be seen with the KKT Theorem 4.11 because there are nonnegative 𝜇₄ and 𝜇₅ so that 𝐷𝑓 = (1, 1) = 𝜇₄𝐷ℎ₄ + 𝜇₅𝐷ℎ₅ = 𝜇₄(1, 6) + 𝜇₅(4, −1), namely 𝜇₄ = 1/5, 𝜇₅ = 1/5. We can write 𝐷𝑓 instead of 𝐷𝑓(𝑥₁, 𝑥₂)
because the gradient of a linear function is constant. The picture shows that (3, 2)
is in fact also the global maximum of 𝑓 . We will see that the KKT Theorem has a
simpler version for linear programming, which is called the duality theorem, where
the multipliers 𝜇𝑖 are called dual variables. Moreover, there will be better ways of
finding a maximum than testing all combinations of possible tight constraints.
Figure 5.4 Feasible set and objective function vector (1/6, 1) with non-unique maximum along the side where the constraint 𝑥₁ + 6𝑥₂ ≤ 15 is tight.
It can be shown that if the feasible set is bounded, then an optimum of a linear
program can always be found at a corner of the feasible set. However, there may be
more than one corner where the optimum is obtained, and then any point on the
line segment that connects these optimal corners is also optimal. Figure 5.4 shows
this with the same constraints as in (5.5), but a different objective function to be
maximised, namely 𝑓 (𝑥 1 , 𝑥2 ) = 61 𝑥 1 + 𝑥 2 . The corner (3, 2) is also optimal here, but
so is the entire line where 𝑓 (𝑥1 , 𝑥2 ) = 16 𝑥1 + 𝑥2 = 52 (intersected with the feasible
set), which coincides with the tight constraint 𝑥1 + 6𝑥 2 = 15.
Figure 5.5 shows an example where the feasible set is empty, which is called
an infeasible linear program. This occurs, for example, by reversing two of the
inequalities in (5.5) to obtain the following constraints:
𝑥1 ≥ 0
𝑥2 ≥ 0
−𝑥1 + 𝑥 2 ≥ 1 (5.6)
𝑥1 + 6𝑥 2 ≤ 15
4𝑥1 − 𝑥2 ≥ 10 .
Finally, an optimal solution need not exist even when there are feasible
solutions. This happens when the objective function can attain arbitrarily large
Figure 5.5 Example of an infeasible set, for the constraints (5.6). Recall
that the little strokes indicate the side where the inequality is valid, and here
there is no point (𝑥1 , 𝑥2 ) where all inequalities are valid. This would be the
case even without the constraints 𝑥 1 ≥ 0 and 𝑥2 ≥ 0.
values; such a linear program is called unbounded. This is the case when we remove
the constraints 4𝑥 1 − 𝑥2 ≤ 10 and 𝑥1 + 6𝑥 2 ≤ 15 from the initial example (5.5), as
shown in Figure 5.6.
[Figure 5.6: the unbounded feasible set given by 𝑥₁ ≥ 0, 𝑥₂ ≥ 0, and −𝑥₁ + 𝑥₂ ≤ 1, with objective function vector 𝐷𝑓 = (1, 1).]
The pictures shown in this section provide a good intuition of how linear
programs look in principle. However, this graphical method hardly extends
beyond R2 or R3 . Our development of the theory of linear programming will
proceed largely algebraically, with some geometric intuition for the important
Theorem 5.4 of Farkas.
5.4 Linear Programs and Duality

We recall the notation introduced in Section 4.3. For positive integers 𝑚, 𝑛, the set
of 𝑚 × 𝑛 matrices is denoted by R𝑚×𝑛 . An 𝑛-vector is an element of R𝑛 . Unless
stated otherwise, all vectors are column vectors, so a vector 𝑥 in R𝑛 is considered
as an 𝑛 × 1 matrix. Its transpose 𝑥⊤ is the corresponding row vector in R1×𝑛 .
The components of an 𝑛-vector 𝑥 are 𝑥1 , . . . , 𝑥 𝑛 . The vectors 0 and 1 have all
components equal to zero and one, respectively, and have suitable dimension,
which may vary with each use of 0 or 1. An inequality between vectors like 𝑥 ≥ 0
holds for all components. The identity matrix, of any dimension, is denoted by 𝐼.
A linear optimisation problem or linear program (LP) says: optimise (maximise
or minimise) a linear objective function subject to linear constraints (inequalities or
equalities).
The standard inequality form of an LP is given by an 𝑚 × 𝑛 matrix 𝐴, an 𝑚-vector
𝑏 and an 𝑛-vector 𝑐 and says:
maximise 𝑐⊤𝑥
subject to 𝐴𝑥 ≤ 𝑏 , (5.7)
𝑥 ≥ 0.
The horizontal line is often written to separate the objective function from the
constraints.
or
9𝑥1 + 10𝑥2 + 8𝑥3 ≤ 19.
In this inequality, which holds for any feasible solution, all coefficients of the
nonnegative variables 𝑥 𝑗 are at least as large as in the primal objective function, so
the right-hand side 19 is certainly an upper bound for this objective function. In
fact, we can obtain an even better bound by multiplying the two primal inequalities
by 𝑦1 = 2 and 𝑦2 = 2, getting
or
8𝑥1 + 10𝑥2 + 6𝑥3 ≤ 18.
Again, all coefficients are at least as large as in the primal objective function. Thus,
it cannot be larger than 18, which was achieved by the above solution 𝑥 1 = 1, 𝑥2 = 1,
𝑥3 = 0, which is therefore optimal.
In general, the dual LP for the primal LP (5.7) is obtained as follows:
• Multiply each primal inequality by some nonnegative number 𝑦 𝑖 (so as to not
reverse the inequality).
• Sum the resulting entries of each of the 𝑛 columns and require that the resulting
coefficient of 𝑥 𝑗 for 𝑗 = 1, . . . , 𝑛 is at least as large as the coefficient 𝑐 𝑗 of the
objective function. (Because 𝑥 𝑗 ≥ 0, this will at most increase the objective
function.)
• Minimise the resulting right-hand side 𝑦1 𝑏 1 + · · · + 𝑦𝑚 𝑏 𝑚 (because it is an upper
bound for the primal objective function).
So the dual of (5.7) says:
minimise 𝑦⊤𝑏
(5.9)
subject to 𝑦⊤𝐴 ≥ 𝑐⊤, 𝑦 ≥0.
Clearly, (5.9) is also an LP in standard inequality form because it can be written as:
maximise −𝑏⊤𝑦 subject to −𝐴⊤𝑦 ≤ −𝑐, 𝑦 ≥ 0 . In that way, it is easy to see that the
dual LP of the dual LP (5.9) is again the primal LP (5.7).
A good way to simultaneously picture the primal and dual LP (which are
defined by the same data 𝐴, 𝑏, 𝑐) is the following “Tucker diagram”:
             𝑥 ≥ 0
  𝑦 ≥ 0        𝐴       ≤  𝑏   → min
               ∨                          (5.10)
              𝑐⊤  → max
The diagram (5.10) shows the 𝑚 × 𝑛 matrix 𝐴 with the 𝑚-vector 𝑏 on the right
and the row vector 𝑐⊤ at the bottom. The top shows the primal variables 𝑥 with
their constraints 𝑥 ≥ 0. The left-hand side shows the dual variables 𝑦 with their
constraints 𝑦 ≥ 0. The primal LP is to be read horizontally, with constraints 𝐴𝑥 ≤ 𝑏,
and the objective function 𝑐⊤𝑥 that is to be maximised. The dual LP is to be read
vertically, with constraints 𝑦⊤𝐴 ≥ 𝑐⊤ (where in the diagram (5.10) ≥ is written
vertically as ∨ ), and the objective function 𝑦⊤𝑏 that is to be minimised. A way
to remember the direction of the inequalities is to see that one inequality 𝐴𝑥 ≤ 𝑏
points “towards” 𝐴 and the other, 𝑦⊤𝐴 ≥ 𝑐⊤, “away from” 𝐴, where maximisation
is subject to upper bounds and minimisation subject to lower bounds, apart from
the nonnegativity constraints for 𝑥 and 𝑦.
The fact that the primal and dual objective functions are mutual bounds is
known as the “weak duality” theorem, which is very easy to prove – essentially in
the way we have motivated the dual LP above.
Theorem 5.2 (Weak LP duality). For a pair 𝑥, 𝑦 of feasible solutions to the primal LP
(5.7) and its dual LP (5.9), the objective functions are mutual bounds:
𝑐⊤𝑥 ≤ 𝑦⊤𝑏 .
If thereby 𝑐⊤𝑥 = 𝑦⊤𝑏 (equality holds), then these two solutions are optimal for both LPs.

Proof. Because 𝑥 ≥ 0 and 𝑦⊤𝐴 ≥ 𝑐⊤, and 𝑦 ≥ 0 and 𝐴𝑥 ≤ 𝑏, we have 𝑐⊤𝑥 ≤ (𝑦⊤𝐴)𝑥 = 𝑦⊤(𝐴𝑥) ≤ 𝑦⊤𝑏, as claimed.
If 𝑐⊤𝑥 ∗ = (𝑦 ∗ )⊤𝑏 for some primal feasible 𝑥 ∗ and dual feasible 𝑦 ∗ , then 𝑐⊤𝑥 ≤
(𝑦 ∗ )⊤𝑏 = 𝑐⊤𝑥 ∗ for any primal feasible 𝑥, and 𝑦⊤𝑏 ≥ 𝑐⊤𝑥 ∗ = (𝑦 ∗ )⊤𝑏 for any dual
feasible 𝑦, so equality of the objective functions implies optimality.
The following “strong duality” theorem is the central theorem of linear
programming.
Theorem 5.3 (Strong LP duality). Whenever both the primal LP (5.7) and its dual LP
(5.9) are feasible, they have optimal solutions with equal value of their objective functions.
We will prove this theorem in Section 5.5. Its proof is not trivial. In fact,
many theorems in economics have a hidden LP duality so that they can be proved
by writing down a suitable LP and interpreting its dual LP. For that reason,
Theorem 5.3 is extremely useful.
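Strong duality can be observed numerically on the earlier example (5.5). A minimal sketch (not part of the guide; it assumes scipy is available, and linprog minimises, so the primal objective is negated):

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[-1.0, 1.0], [1.0, 6.0], [4.0, -1.0]])
b = np.array([1.0, 15.0, 10.0])
c = np.array([1.0, 1.0])

# primal: max c'x subject to Ax <= b, x >= 0
primal = linprog(-c, A_ub=A, b_ub=b, bounds=(0, None))
# dual: min y'b subject to y'A >= c', y >= 0, written as -A'y <= -c
dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=(0, None))

print(primal.x, -primal.fun)   # [3. 2.] with value 5.0
print(dual.x, dual.fun)        # [0.  0.2 0.2] with value 5.0
```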
5.5 The Lemma of Farkas and Strong LP Duality

This section is about the Lemma of Farkas, also known as the theorem of the separating hyperplane. It is used to prove the strong LP duality Theorem 5.3.

The Lemma of Farkas gives a condition for when the system 𝐴𝑥 = 𝑏, 𝑥 ≥ 0 has no solution. It has a strong geometric intuition. Moreover, solutions to the system
𝐴𝑥 = 𝑏, 𝑥 ≥ 0 are used in the important simplex algorithm for solving an LP, which
we treat later (from Section 5.11 onwards).
In this section we first state and explain the Lemma of Farkas. We then use it to
prove strong LP duality. The Lemma of Farkas itself is then shown in a number of
steps. Each of these steps is quite accessible, and explained in a separate subsection
in order to have a better overview of the argument. Some of these proof steps,
such as the use of linearly independent solutions to 𝐴𝑥 = 𝑏 (see Subsection 5.5.6),
will also be of help for understanding the simplex algorithm.
Figure 5.7 Left: Vectors 𝐴1 , 𝐴2 , 𝐴3 , 𝐴4 , the cone 𝐶 generated by them (which
extends to infinity between the two “rays” that extend 𝐴3 and 𝐴2 ), and a vector
𝑏 not in 𝐶. Right: A separating hyperplane 𝐻 for 𝑏 with normal vector 𝑦 = 𝑐−𝑏.
The condition in (a) of the Lemma of Farkas below, that 𝐴𝑥 = 𝑏 has a solution 𝑥 ≥ 0, states exactly that 𝑏 belongs to the cone

𝐶 = { 𝐴𝑥 | 𝑥 ∈ R𝑛 , 𝑥 ≥ 0 } (5.12)

of all nonnegative linear combinations of the columns 𝐴1 , . . . , 𝐴𝑛 of 𝐴, shown in the left diagram in Figure 5.7.

The right diagram in Figure 5.7 shows a vector 𝑦 so that 𝑦⊤𝐴𝑗 ≥ 0 for all 𝑗 with 1 ≤ 𝑗 ≤ 𝑛, and 𝑦⊤𝑏 < 0. The set 𝐻 = {𝑧 ∈ R𝑚 | 𝑦⊤𝑧 = 0} is called a separating
hyperplane with normal vector 𝑦 because all vectors 𝐴 𝑗 are on one side of 𝐻 (they
fulfill 𝑦⊤𝐴 𝑗 ≥ 0, which includes the case 𝑦⊤𝐴 𝑗 = 0 where 𝐴 𝑗 belongs to 𝐻, like 𝐴2
in Figure 5.7), whereas 𝑏 is strictly on the other side of 𝐻 because 𝑦⊤𝑏 < 0. The
Lemma of Farkas asserts that such a separating hyperplane exists for any 𝑏 that
does not belong to 𝐶.
Theorem 5.4 (Lemma of Farkas). Let 𝐴 ∈ R𝑚×𝑛 and 𝑏 ∈ R𝑚 . Then exactly one of the
following statements holds:
(a) ∃𝑥 ∈ R𝑛 : 𝑥 ≥ 0, 𝐴𝑥 = 𝑏.
(b) ∃𝑦 ∈ R𝑚 : 𝑦⊤𝐴 ≥ 0⊤, 𝑦⊤𝑏 < 0.
In Theorem 5.4, it is clear that (a) and (b) cannot both hold because if (a) holds,
then 𝑦⊤𝐴 ≥ 0⊤ implies 𝑦⊤𝑏 = 𝑦⊤(𝐴𝑥) = (𝑦⊤𝐴)𝑥 ≥ 0.
If (a) is false, that is, 𝑏 does not belong to the cone 𝐶 in (5.12), then 𝑦 can be
constructed by the following intuitive geometric argument: Take a vector 𝑐 in 𝐶
that is closest to 𝑏 (see Figure 5.7), and let 𝑦 = 𝑐 − 𝑏. We will show later that 𝑦
fulfills the conditions in (b) and that the point 𝑐 exists (which is nontrivial and
shown in Section 5.5.7).
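Although the proof is still to come, the dichotomy of Theorem 5.4 can already be explored computationally. The following sketch uses scipy (an assumption of this illustration, not part of the guide) and decides which case holds by solving two auxiliary feasibility LPs; in case (b), any 𝑦 with 𝑦⊤𝑏 < 0 can be rescaled, so requiring 𝑦⊤𝑏 ≤ −1 loses no generality.

```python
import numpy as np
from scipy.optimize import linprog

def farkas_alternative(A, b):
    """Return ('a', x) with x >= 0 and A x = b, or ('b', y) with
    y^T A >= 0^T and y^T b < 0 (Theorem 5.4)."""
    m, n = A.shape
    # Case (a): feasibility of A x = b, x >= 0 (zero objective function).
    res = linprog(np.zeros(n), A_eq=A, b_eq=b,
                  bounds=[(0, None)] * n, method="highs")
    if res.status == 0:
        return "a", res.x
    # Case (b): find y with -A^T y <= 0 and b^T y <= -1 (y unrestricted).
    A_ub = np.vstack([-A.T, b.reshape(1, -1)])
    b_ub = np.concatenate([np.zeros(n), [-1.0]])
    res = linprog(np.zeros(m), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * m, method="highs")
    return "b", res.x

A = np.array([[1., 0.], [0., 1.]])
print(farkas_alternative(A, np.array([1., 1.])))    # case 'a': b is in the cone C
print(farkas_alternative(A, np.array([-1., 0.])))   # case 'b': a separating y
```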
However, we postpone the proof of the Lemma of Farkas in order to show first
how it can be used to prove the strong LP duality theorem. For that we use some
elementary algebraic manipulations and no longer appeal to geometry.
Theorem 5.5 (Lemma of Farkas for inequalities). Let 𝐴 ∈ R𝑚×𝑛 and 𝑏 ∈ R𝑚 . Then exactly one of the following statements holds:

(a) ∃𝑥 ∈ R𝑛 : 𝑥 ≥ 0, 𝐴𝑥 ≤ 𝑏.

(b) ∃𝑦 ∈ R𝑚 : 𝑦 ≥ 0, 𝑦⊤𝐴 ≥ 0⊤, 𝑦⊤𝑏 < 0.

Note the subtle difference between the conditions (a) and (b) in Theorems 5.4
and 5.5, respectively. Theorem 5.4(a) is about equations 𝐴𝑥 = 𝑏 and Theorem 5.5(a)
is about inequalities 𝐴𝑥 ≤ 𝑏. Theorem 5.4(b) states the existence of an arbitrary
vector 𝑦 and Theorem 5.5(b) the existence of a nonnegative vector 𝑦. This is a
recurring theme: equations are preserved when multiplying them with arbitrary
coefficients, namely 𝐴𝑥 = 𝑏 implies 𝑦⊤𝐴𝑥 = 𝑦⊤𝑏 for any 𝑦, whereas inequalities
𝐴𝑥 ≤ 𝑏 are only preserved if we have 𝑦 ≥ 0, which then implies 𝑦⊤𝐴𝑥 ≤ 𝑦⊤𝑏. The
following proof uses a trick (the introduction of “slack variables” 𝑠) to convert
inequalities into equations. This trick will also be used again, see (5.37).
Proof of Theorem 5.5. Clearly, there is a vector 𝑥 so that 𝐴𝑥 ≤ 𝑏 and 𝑥 ≥ 0 if and
only if there are 𝑥 ∈ R𝑛 and 𝑠 ∈ R𝑚 with
𝐴𝑥 + 𝑠 = 𝑏, 𝑥 ≥ 0, 𝑠 ≥ 0. (5.13)
The system (5.13) is a system of equations as in Theorem 5.4 with the matrix [𝐴 𝐼]
instead of 𝐴, where 𝐼 is the 𝑚 × 𝑚 identity matrix, and the vector (𝑥, 𝑠) instead of 𝑥.
The condition 𝑦⊤[𝐴 𝐼] ≥ 0⊤ in Theorem 5.4(b) is then simply 𝑦⊤𝐴 ≥ 0⊤, 𝑦 ≥ 0 as
stated here in (b).
Proof of Theorem 5.3. Suppose both the primal LP (5.7) and its dual LP (5.9) are feasible, that is (see (5.14)), there are 𝑥0 ≥ 0 and 𝑦0 ≥ 0 with 𝐴𝑥0 ≤ 𝑏 and 𝑦0⊤𝐴 ≥ 𝑐⊤. By Theorem 5.2, it suffices to show that the following system of 𝑛 + 𝑚 + 1 inequalities has a nonnegative solution 𝑥, 𝑦:

− 𝐴⊤𝑦 ≤ − 𝑐
𝐴𝑥 ≤ 𝑏 (5.15)
−𝑐⊤𝑥 + 𝑏⊤𝑦 ≤ 0

because the last inequality in (5.15) states 𝑐⊤𝑥 ≥ 𝑦⊤𝑏, which with weak duality implies 𝑐⊤𝑥 = 𝑦⊤𝑏, so that 𝑥 and 𝑦 are then optimal. Suppose, to the contrary, that (5.15) has no nonnegative solution. Then by Theorem 5.5 there are 𝑢 ∈ R𝑛 , 𝑣 ∈ R𝑚 , and 𝑡 ∈ R with

𝑢, 𝑣, 𝑡 ≥ 0 , 𝑣⊤𝐴 ≥ 𝑡 𝑐⊤ , 𝐴𝑢 ≤ 𝑏 𝑡 (5.16)

but

−𝑢⊤𝑐 + 𝑣⊤𝑏 < 0 . (5.17)
We derive a contradiction as follows: If 𝑡 = 0, this means that already the first
𝑛 + 𝑚 inequalities in (5.15) have no nonnegative solution 𝑥, 𝑦, contrary to our
assumption, for the following reason. Namely, by (5.16) 𝑡 = 0 means that 𝑣⊤𝐴 ≥ 0⊤,
𝑢⊤𝐴⊤ ≤ 0⊤, which with (5.14) implies

𝑣⊤𝑏 ≥ 𝑣⊤𝐴𝑥0 ≥ 0 and 𝑢⊤𝑐 ≤ 𝑢⊤𝐴⊤𝑦0 ≤ 0 , that is, −𝑢⊤𝑐 + 𝑣⊤𝑏 ≥ 0 ,

in contradiction to (5.17).
If 𝑡 > 0, then 𝑢 and 𝑣 are essentially primal and dual feasible solutions that
violate weak LP duality, because then by (5.16) 𝑏𝑡 ≥ 𝐴𝑢 and 𝑣⊤𝐴 ≥ 𝑡𝑐⊤ and
therefore
𝑣⊤𝑏 𝑡 ≥ 𝑣⊤𝐴𝑢 ≥ 𝑡𝑐⊤𝑢 ,
which is an inequality between two scalars. After division by 𝑡 it gives 𝑣⊤𝑏 ≥ 𝑐⊤𝑢,
again contradicting (5.17).
In summary, if the first 𝑛 + 𝑚 inequalities in (5.15) have a solution 𝑥, 𝑦 ≥ 0,
then there is also such a solution that fulfills the last inequality, as claimed by the
strong LP duality Theorem 5.3.
The following subsections are about completing the proof of Theorem 5.4, the
Lemma of Farkas. We will use geometric arguments about points in R𝑚 . In most
cases, this will involve at most three such points. These three points define a
triangle, so that the arguments are very accessibly visualised in a two-dimensional
plane.
Our first geometric concept, discussed in this subsection, is about convex
combinations and convexity.
Figure 5.8 The line through the points 𝑥 and 𝑦 consists of points written as
𝑥 + (𝑦 − 𝑥)𝑝 where 𝑝 ∈ R. Examples are point 𝑧 for 𝑝 = 0.6, point 𝑤 for 𝑝 = 1.5,
and point 𝑤 ′ when 𝑝 = −0.4. The line segment [𝑥, 𝑦] that connects 𝑥 and 𝑦
(drawn as a solid line) results when 𝑝 is restricted to 0 ≤ 𝑝 ≤ 1.
Let 𝑥 and 𝑦 be two vectors in R𝑚 . Figure 5.8 shows two points 𝑥 and 𝑦 in the
plane, but the picture may also be regarded as a suitable view of the situation
in a higher-dimensional space. The line that goes through the points 𝑥 and 𝑦 is
obtained by adding to the point 𝑥, regarded as a vector, any scalar multiple of the
difference 𝑦 − 𝑥. The resulting vector 𝑥 + (𝑦 − 𝑥)𝑝, for 𝑝 ∈ R, gives 𝑥 when 𝑝 = 0
and 𝑦 when 𝑝 = 1. Figure 5.8 gives some examples 𝑧, 𝑤, 𝑤 ′ of other points. When
0 ≤ 𝑝 ≤ 1, as for point 𝑧, the resulting points define the line segment that joins 𝑥
and 𝑦, which we denote by [𝑥, 𝑦] (note that 𝑥 and 𝑦 belong to R𝑚 here):

[𝑥, 𝑦] = { 𝑥(1 − 𝑝) + 𝑦𝑝 | 𝑝 ∈ [0, 1] } . (5.18)
If 𝑝 > 1, then one obtains points on the line through 𝑥 and 𝑦 on the other side of 𝑦
relative to 𝑥, like the point 𝑤 in Figure 5.8. For 𝑝 < 0, the corresponding point,
like 𝑤 ′ in Figure 5.8, is on that line but on the other side of 𝑥 relative to 𝑦.
As already done in (5.18), the expression 𝑥 + (𝑦 − 𝑥)𝑝 can be re-written as
𝑥(1 − 𝑝) + 𝑦𝑝, where the given points 𝑥 and 𝑦 appear only once. This special linear
combination of 𝑥 and 𝑦 with nonnegative coefficients that sum to one is called a
convex combination of 𝑥 and 𝑦. It is useful to remember the expression 𝑥(1 − 𝑝) + 𝑦𝑝
in this order with 1 − 𝑝 as the coefficient of the first vector and 𝑝 of the second
vector, because then the line segment [𝑥, 𝑦] that joins 𝑥 to 𝑦 corresponds to the
real interval [0, 1] for the possible values of 𝑝, with the endpoints 0 and 1 of the
interval corresponding to the respective endpoints 𝑥 and 𝑦 of the line segment.
Figure 5.9 Examples of sets that are convex (left) and not convex (right).
A set 𝑆 ⊆ R𝑚 is called convex if for any two points 𝑥 and 𝑦 of 𝑆 the line segment [𝑥, 𝑦] is a subset of 𝑆, as illustrated in Figure 5.9.

The topic of this section is stated in Theorem 5.7 and shown in Figure 5.10: Given
a closed convex set 𝐶 and a point 𝑏 not in 𝐶, there is a hyperplane 𝐻 that separates
𝐶 from 𝑏, which means that 𝐶 is on one side of the hyperplane and 𝑏 is strictly on
the other side.
Figure 5.10 The separating hyperplane theorem for a closed set 𝐶, here a
compact set 𝐶 (but 𝐶 can be unbounded) and a point 𝑏 not in 𝐶.
A classic joke asks: Why did the chicken cross the road? The answer is: To get to the other side. (The fact that this is not
really a joke at all is what is meant to be funny about this.) Our variant of this
question is shown in Figure 5.11 and asks: How does the chicken cross the triangle?
The chicken is in one corner 𝑏 of the triangle and wants to get to the other side of
the triangle. As appropriate for appearing in a course on optimisation, it wants to
do so as fast as possible. However, the chicken only crosses the triangle if the angle
at the adjacent corner 𝑐 is acute (less than 90 degrees). Otherwise, it will go along
the side from 𝑏 to 𝑐.
[Figure 5.11: three triangles with corners 𝑏, 𝑐, 𝑎, illustrating the three cases of how the chicken crosses, or does not cross, the triangle]
Lemma 5.6. Consider three distinct points 𝑎, 𝑏, 𝑐 in R𝑚 . Then the closest point to 𝑏 on
the line segment [𝑐, 𝑎] is 𝑐 if and only if
(𝑏 − 𝑐)⊤(𝑎 − 𝑐) ≤ 0 . (5.19)
Otherwise, that closest point is a convex combination 𝑐(1 − 𝑝) + 𝑎𝑝 for some 𝑝 ∈ (0, 1].
Proof. By (3.16), the (Euclidean) length ∥𝑥∥ of a vector 𝑥 is equal to √(𝑥⊤𝑥), and
minimising that length is equivalent to minimising its square 𝑥⊤𝑥. Consider a
point 𝑧 = 𝑐 + (𝑎 − 𝑐)𝑝 for 𝑝 ∈ R on the line through 𝑐 and 𝑎 (see Figure 5.8), which
for 𝑝 ∈ [0, 1] is on the side [𝑐, 𝑎] of the triangle. We minimise ∥𝑏 − 𝑧 ∥ 2 , that is,
(𝑏 − 𝑧)⊤(𝑏 − 𝑧), where

(𝑏 − 𝑧)⊤(𝑏 − 𝑧) = ∥𝑏 − 𝑐∥² − 2(𝑏 − 𝑐)⊤(𝑎 − 𝑐) 𝑝 + ∥𝑎 − 𝑐∥² 𝑝² ,

which as a function of 𝑝 is a parabola that tends to infinity for large |𝑝| and thus has
its minimum when its derivative is zero, that is, −2(𝑏 − 𝑐)⊤(𝑎 − 𝑐) + 2∥𝑎 − 𝑐 ∥ 2 𝑝 = 0
or

𝑝 = (𝑏 − 𝑐)⊤(𝑎 − 𝑐) / ∥𝑎 − 𝑐∥² .
Hence, 𝑝 has the same sign as (𝑏 − 𝑐)⊤(𝑎 − 𝑐). If 𝑝 = 0 then, by the definition of 𝑧,
the closest point to 𝑏 on the line through 𝑐 and 𝑎 is 𝑐, as in the right picture in
Figure 5.11. If 𝑝 < 0, then that closest point on the line is to the left of 𝑐 but not on
the line segment [𝑐, 𝑎] (the side of the triangle), so the closest point to 𝑏 on [𝑐, 𝑎] is
also 𝑐. These are the cases claimed in (5.19). If 𝑝 > 0 then the closest point to 𝑏 on
the line through 𝑐 and 𝑎 is to the right of 𝑐, which belongs to [𝑐, 𝑎] if 𝑝 ≤ 1, and is
to the right of 𝑎 if 𝑝 > 1 in which case the closest point to 𝑏 on [𝑐, 𝑎] is 𝑎 (so then
the chicken does not cross the triangle either but walks along the side [𝑏, 𝑎]); at
any rate, the closest point to 𝑏 on [𝑐, 𝑎] is not 𝑐 and 𝑝 > 0 as claimed.
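The projection formula from this proof is easy to implement. A short sketch in Python (a hypothetical helper for illustration, not from the guide): the unconstrained minimiser 𝑝 is clipped to the interval [0, 1].

```python
import numpy as np

def closest_point_on_segment(b, c, a):
    """Closest point to b on the segment [c, a] (Lemma 5.6)."""
    d = a - c
    p = (b - c) @ d / (d @ d)      # sign of p equals sign of (b-c)^T(a-c)
    return c + d * min(max(p, 0.0), 1.0)

# Here (b - c)^T (a - c) = 0, so by (5.19) the closest point is c itself:
print(closest_point_on_segment(np.array([0., 1.]),
                               np.array([0., 0.]),
                               np.array([2., 0.])))   # [0. 0.]
```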
Our aim is to apply Theorem 5.7 to the cone 𝐶 in (5.12). For that we need to show
that 𝐶 is non-empty, convex, and closed. Clearly, 𝐶 is non-empty because 0 ∈ 𝐶.
Convexity is similarly easy, and a good exercise.
The proof that 𝐶 is closed is not difficult but needs further steps that we
postpone to the next two subsections. For now we assume that 𝐶 is closed, and
focus on the definition of the hyperplane 𝐻.
[Figure 5.13: the cone 𝐶 generated by 𝐴1, 𝐴2, 𝐴3, a point 𝑏 ∉ 𝐶 with closest point 𝑐 in 𝐶, the vector 𝑣 = 𝑏 − 𝑐, and the hyperplane 𝐻 through 0]
Figure 5.13 shows a picture similar to Figure 5.7 where the point 𝑏 is not in 𝐶 and 𝑐 is the closest point to 𝑏 in 𝐶. The hyperplane 𝐻 that separates 𝑏 from 𝐶 is
defined in (5.20). We want to show that 𝐻 is defined by those points 𝑧 that fulfill
𝑣⊤𝑧 = 0, where 𝑣 = 𝑏 − 𝑐, because this is what we need for the Lemma of Farkas. In
other words, we want to show that 0 belongs to 𝐻, that is, 𝑣⊤𝑐 = 0. This is obvious
when we already have 𝑐 = 0.
⇒ Draw a picture of a cone 𝐶 as in (5.12) and a point 𝑏 not in 𝐶 such that the
closest point to 𝑏 in 𝐶 is 0.
⇒ What is the set 𝐽 in the proof of Theorem 5.8 in Figure 5.7 and in Figure 5.13?
Theorem 5.8 is very nearly the statement of the Lemma of Farkas (Theorem 5.4),
except that we assumed that 𝐶 is closed. We next prove that this assumption always holds.
This subsection provides one more step in completing the proof of the Lemma
of Farkas. When looking for a nonnegative solution 𝑥 to the system 𝐴𝑥 = 𝑏,
the solution 𝑥 is in general not unique, as in the typical case where 𝐴 has more
columns than rows. However, when the columns 𝐴 𝑗 where 𝑥 𝑗 is positive are linearly
independent, then these components 𝑥 𝑗 of 𝑥 are unique.
The next lemma states that this can always be achieved even with the additional
requirement 𝑥 ≥ 0, if such solutions exist at all.
Lemma 5.9. Let 𝐴 = [𝐴1 · · · 𝐴𝑛 ] ∈ R𝑚×𝑛 and 𝑏 ∈ R𝑚 . If 𝐴𝑥 = 𝑏, 𝑥 ≥ 0 has a solution 𝑥,
then there is a set 𝐽 ⊆ {1, . . . , 𝑛} such that the vectors 𝐴 𝑗 for 𝑗 ∈ 𝐽 are linearly independent,
and there are unique positive reals 𝑥 𝑗 for 𝑗 ∈ 𝐽 with
∑𝑗∈𝐽 𝐴𝑗 𝑥𝑗 = 𝑏 . (5.23)
Proof. Let 𝐴𝑥 = 𝑏 for some 𝑥 ≥ 0 and 𝐽 = {𝑗 | 𝑥 𝑗 > 0} such that (5.23) holds. The
goal is now to remove elements from 𝐽 until the vectors 𝐴 𝑗 for 𝑗 ∈ 𝐽 are linearly
independent (in which case we simply call 𝐽 independent). Suppose this is not the
case. We change the coefficients 𝑥 𝑗 by keeping them nonnegative but such that at
least one of them becomes zero, which gives a smaller set 𝐽, as follows.
If 𝐽 is not independent then there are scalars 𝑧𝑗 for 𝑗 ∈ 𝐽, not all zero, so that

∑𝑗∈𝐽 𝐴𝑗 𝑧𝑗 = 0 ,

where we can assume that the set 𝑆 = {𝑗 ∈ 𝐽 | 𝑧𝑗 > 0} is not empty (otherwise replace 𝑧 by −𝑧). Then

∑𝑗∈𝐽 𝐴𝑗 (𝑥𝑗 − 𝑧𝑗 𝛼) = 𝑏

for any 𝛼 ∈ R. In order to keep all coefficients 𝑥𝑗 − 𝑧𝑗 𝛼 nonnegative, let

𝛼 = min { 𝑥𝑗 / 𝑧𝑗 | 𝑗 ∈ 𝑆 } =: 𝑥𝑖 / 𝑧𝑖 . (5.24)

Then 𝑥𝑗 − 𝑧𝑗 𝛼 ≥ 0 for all 𝑗 ∈ 𝐽, and 𝑥𝑖 − 𝑧𝑖 𝛼 = 0, so that replacing the coefficients 𝑥𝑗 by 𝑥𝑗 − 𝑧𝑗 𝛼 gives a representation of 𝑏 with the smaller set 𝐽 − {𝑖}. This is repeated until 𝐽 is independent. The positive coefficients 𝑥𝑗 for 𝑗 ∈ 𝐽 in (5.23) are then unique because the vectors 𝐴𝑗 for 𝑗 ∈ 𝐽 are linearly independent.
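The proof is constructive and can be turned into a short numerical sketch (using numpy and its SVD to find a dependency among the columns; an assumption of this illustration, not part of the guide):

```python
import numpy as np

def independent_support(A, x, tol=1e-9):
    """Shrink the support of a nonnegative solution x of A @ x = b until
    the columns A_j with x_j > 0 are linearly independent (Lemma 5.9)."""
    x = x.astype(float).copy()
    while True:
        J = np.flatnonzero(x > tol)
        AJ = A[:, J]
        if J.size == 0 or np.linalg.matrix_rank(AJ) == J.size:
            return x                    # the set J is now independent
        z = np.linalg.svd(AJ)[2][-1]    # AJ @ z = 0 with z != 0
        if not np.any(z > tol):
            z = -z                      # make S = {j : z_j > 0} non-empty
        S = z > tol
        alpha = np.min(x[J][S] / z[S])  # the minimum ratio (5.24)
        x[J] -= alpha * z               # stays >= 0; one entry becomes 0
        x[x < tol] = 0.0
```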
Figure 5.14 Illustration of Theorem 5.10 for 𝑚 = 2. Any point in the pen-
tagon belongs to one of the three shown triangles (which are not unique
because there are other ways to “triangulate” the pentagon). A triangle is
the set of convex combinations of its corners.
We now go back to the initial consideration for the Lemma of Farkas illustrated
in Figure 5.7. Consider the cone 𝐶 = {𝐴𝑥 | 𝑥 ≥ 0} as in (5.12) generated by the
columns 𝐴1 , . . . , 𝐴𝑛 of 𝐴. If 𝑏 ∉ 𝐶, then the hyperplane 𝐻 that separates 𝑏 from 𝐶
has the normal vector 𝑦 = 𝑐 − 𝑏 where 𝑐 is the closest point to 𝑏 in 𝐶.
In order for 𝑐 to exist, 𝐶 needs to be closed, that is, it contains any point nearby.
Otherwise, 𝑏 could be a point near 𝐶 but not in 𝐶 which would mean that the
distance ∥𝑐 − 𝑏∥ for 𝑐 in 𝐶 can become arbitrarily small. In that case, one could not
define 𝑦 as described.
Figure 5.15 Illustration of the proof of Lemma 5.11 where 𝐽 = {2} since 𝑐 (𝑘)
for large 𝑘 is a positive linear combination of 𝐴2 only.
Lemma 5.11. For an 𝑚 × 𝑛 matrix 𝐴 = [𝐴1 · · · 𝐴𝑛 ], the cone 𝐶 in (5.12) is a closed set.
Proof. Let 𝑏 be a point in R𝑚 near 𝐶, that is, for all 𝜀 > 0 there is a 𝑐 in 𝐶 so
that ∥𝑐 − 𝑏∥ < 𝜀. Consider a sequence 𝑐 (𝑘) (for 𝑘 = 1, 2, . . .) of elements of 𝐶 that
converges to 𝑏. By Lemma 5.9, there exists for each 𝑘 a subset 𝐽 (𝑘) of {1, . . . , 𝑛} and
unique positive real numbers 𝑥𝑗(𝑘) for 𝑗 ∈ 𝐽(𝑘) so that the columns 𝐴𝑗 for 𝑗 ∈ 𝐽(𝑘) are linearly independent and

𝑐(𝑘) = ∑𝑗∈𝐽(𝑘) 𝐴𝑗 𝑥𝑗(𝑘) .
There are only finitely many different sets 𝐽 (𝑘) , so there is a set 𝐽 that appears
infinitely often among them (see Figure 5.15 for an example). We consider the
subsequence of the vectors 𝑐 (𝑘) that use this set, that is,
𝑐(𝑘) = ∑𝑗∈𝐽 𝐴𝑗 𝑥𝑗(𝑘) =: 𝐴𝐽 𝑥𝐽(𝑘) (5.26)

where 𝐴𝐽 is the matrix with columns 𝐴𝑗 for 𝑗 ∈ 𝐽 and 𝑥𝐽(𝑘) is the vector with components 𝑥𝑗(𝑘) for 𝑗 ∈ 𝐽. Now, 𝑥𝐽(𝑘) in (5.26) is a continuous function of 𝑐(𝑘): In order to see this, consider a set 𝐼 of |𝐽| linearly independent rows of 𝐴𝐽, let 𝐴𝐼𝐽 be the square submatrix of 𝐴𝐽 with these rows and let 𝑐𝐼(𝑘) be the subvector of 𝑐(𝑘) with these rows, so that 𝑥𝐽(𝑘) = 𝐴𝐼𝐽⁻¹ 𝑐𝐼(𝑘) in (5.26). Hence, as 𝑐(𝑘) converges to 𝑏, the |𝐽|-vector 𝑥𝐽(𝑘) converges to some 𝑥𝐽∗ with 𝑏 = 𝐴𝐽 𝑥𝐽∗, where 𝑥𝐽(𝑘) > 0 implies 𝑥𝐽∗ ≥ 0, which shows that 𝑏 ∈ 𝐶. So 𝐶 is closed.
Remark 5.12. In Lemma 5.11, it is important that 𝐶 is the cone generated by finitely
many vectors 𝐴1 , . . . , 𝐴𝑛 . The cone generated by infinitely many vectors may not
be closed. For example, let 𝐶 be the set of nonnegative linear combinations of the
vectors (𝑛, 1) in R2 , for 𝑛 = 0, 1, 2, . . . Then (1, 0) is a vector near 𝐶 that does not
belong to 𝐶.
⇒ Exercise 5.3 asks you to prove Remark 5.12, by giving an exact description
of 𝐶.
Lemma 5.11 completes the proof of the Lemma of Farkas: It shows that the
assumption in Theorem 5.8 that 𝐶 is closed always holds. The conclusions (5.22)
in that theorem then imply Theorem 5.4.
5.6 Boundedness and Dual Feasibility

So far, the strong duality Theorem 5.3 makes only a statement when both primal
and dual LP are feasible. In principle, it could be the case that the primal LP has
an optimal solution while its dual is not feasible. The following theorem excludes
this possibility. Its proof is a typical application of Theorem 5.3 itself.
Theorem 5.13 (Boundedness implies dual feasibility). Suppose the primal LP (5.7) is
feasible. Then its objective function is bounded if and only if the dual LP (5.9) is feasible.
Proof. By weak duality (Theorem 5.2), if the dual LP has a feasible solution 𝑦,
then its objective function 𝑦⊤𝑏 provides an upper bound for the primal objective
function 𝑐⊤𝑥. Conversely, suppose that the dual LP (5.9) is infeasible, and consider
the following LP which uses an additional real variable 𝑡 and the vector 1 which
has all components equal to 1:
minimise 𝑡
subject to 𝑦⊤𝐴 + 𝑡1⊤ ≥ 𝑐⊤ , (5.27)
           𝑦 ≥ 0 , 𝑡 ≥ 0 .
This LP is clearly feasible by setting 𝑦 = 0 and 𝑡 = max{0, 𝑐1 , . . . , 𝑐 𝑛 }. Also, the
constraints 𝑦⊤𝐴 ≥ 𝑐⊤, 𝑦 ≥ 0 of (5.9) have no solution if and only if the optimum
value of (5.27) has 𝑡 > 0, which we assume to be the case. The LP (5.27) is the dual
LP to the following LP, which we write with variables 𝑧 ∈ R𝑛 :
maximise 𝑐⊤𝑧
subject to 𝐴𝑧 ≤ 0 ,
(5.28)
1⊤𝑧 ≤ 1 ,
𝑧≥0.
This LP is also feasible with 𝑧 = 0. By strong duality, it has the same value as its
dual LP (5.27), which is positive, given by 𝑐⊤𝑧 = 𝑡 > 0 for some 𝑧 that fulfills the
constraints in (5.28). Consider now a feasible solution 𝑥 to the original primal LP,
that is, 𝐴𝑥 ≤ 𝑏, 𝑥 ≥ 0, and let 𝛼 ∈ R, 𝛼 ≥ 0. Then 𝐴(𝑥 +𝑧𝛼) = 𝐴𝑥 +𝐴𝑧𝛼 ≤ 𝑏 +0𝛼 = 𝑏
and 𝑥 + 𝑧𝛼 ≥ 0, so 𝑥 + 𝑧𝛼 is also a feasible solution to (5.7) with objective function
value 𝑐⊤(𝑥 + 𝑧𝛼) = 𝑐⊤𝑥 + (𝑐⊤𝑧)𝛼 which gets arbitrarily large with growing 𝛼. So
the original LP is unbounded. This proves the theorem.
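The vector 𝑧 constructed in this proof can be computed directly. A sketch using scipy (an assumption of this illustration, not part of the guide) that solves the LP (5.28) and returns 𝑧 as a direction of unboundedness when the dual LP (5.9) is infeasible:

```python
import numpy as np
from scipy.optimize import linprog

def unbounded_direction(A, c):
    """Solve (5.28): maximise c^T z s.t. A z <= 0, 1^T z <= 1, z >= 0.
    A positive optimal value yields z with A z <= 0 and c^T z > 0, so
    x + z*alpha improves any feasible x of (5.7) without bound."""
    m, n = A.shape
    A_ub = np.vstack([A, np.ones((1, n))])
    b_ub = np.concatenate([np.zeros(m), [1.0]])
    res = linprog(-c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * n, method="highs")
    z = res.x
    return z if c @ z > 1e-9 else None   # None: the dual (5.9) is feasible

# Dual of "maximise x s.t. -x <= 1, x >= 0" is "-y >= 1, y >= 0": infeasible.
print(unbounded_direction(np.array([[-1.]]), np.array([1.])))   # [1.]
```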
An alternative way of stating the preceding theorem, for the dual LP, is as
follows.
Corollary 5.14. Suppose the dual LP (5.9) is feasible. Then the primal LP (5.7) is
infeasible if and only if the objective function of the dual LP (5.9) is unbounded.
Proof. This is just an application of Theorem 5.13 with dual and primal exchanged:
Rewrite (5.9) as a primal LP in the form: maximise −𝑏⊤𝑦 subject to −𝐴⊤𝑦 ≤ −𝑐,
𝑦 ≥ 0, so that its dual is: minimise −𝑥⊤𝑐 subject to −𝑥⊤𝐴⊤ ≥ −𝑏⊤, 𝑥 ≥ 0, which is
the same as (5.7), and apply Theorem 5.13.
On the other hand, the fact that one LP is infeasible does not imply that its
dual LP is unbounded, because both could be infeasible.
Remark 5.15. It is possible that both the primal LP (5.7) and its dual LP (5.9) are
infeasible.
An example is the LP: maximise 𝑥2 subject to 𝑥1 ≤ −1, 𝑥1 − 𝑥2 ≤ 1, 𝑥1 , 𝑥2 ≥ 0, which is infeasible because of the constraints 𝑥1 ≤ −1 and 𝑥1 ≥ 0. Its dual LP

minimise − 𝑦1 + 𝑦2
subject to 𝑦1 + 𝑦2 ≥ 0
           − 𝑦2 ≥ 1
           𝑦1 , 𝑦2 ≥ 0

is also infeasible because of the constraints −𝑦2 ≥ 1 and 𝑦2 ≥ 0.
                          primal
             optimal    unbounded    infeasible
dual
optimal        yes          no           no
unbounded      no           no           yes
infeasible     no           yes          yes
Table 5.1 The possibilities for primal and dual LP, where “optimal” means
the LP is feasible and bounded and then has an optimal solution, and “un-
bounded” means the LP is feasible but its objective function is unbounded.
Table 5.1 shows the four possibilities that can occur for the primal LP and its
dual: both have optimal solutions, one is infeasible and the other unbounded, or
both are infeasible. If one LP is feasible, its dual cannot be unbounded by weak
duality (Theorem 5.2), and if it has an optimal solution then its dual cannot be
infeasible by Theorem 5.13.
Table 5.1 does not state the equality of primal and dual objective functions
when both have optimal solutions, but it does state Corollary 5.14. We show that
this implies the Lemma of Farkas for inequalities (Theorem 5.5, which we have
used to prove the strong duality Theorem 5.3). Consider the LP
maximise 0
subject to 𝐴𝑥 ≤ 𝑏 , (5.29)
𝑥 ≥ 0.
             𝑥 ≥ 0
 𝑦 ≥ 0   [    𝐴    ]   ≤ 𝑏            (5.30)
             ∨                        ↩ min
             0⊤  → max
Its dual LP: minimise 𝑦⊤𝑏 subject to 𝑦⊤𝐴 ≥ 0⊤, 𝑦 ≥ 0, is feasible with 𝑦 = 0. The LP
(5.29) is feasible if and only if there is a solution 𝑥 to the inequalities 𝐴𝑥 ≤ 𝑏, 𝑥 ≥ 0.
By Corollary 5.14, there is no such solution if and only if the dual is unbounded,
that is, assumes an arbitrarily negative value of its objective function 𝑦⊤𝑏. This
is equivalent to the existence of some 𝑦 ≥ 0 with 𝑦⊤𝐴 ≥ 0⊤ and 𝑦⊤𝑏 < 0 which
can then be made arbitrarily negative by replacing 𝑦 with 𝑦𝛼 for any 𝛼 > 0. This
proves Theorem 5.5. This inequality version of the Lemma of Farkas can therefore
be remembered with the Tucker diagram (5.30) and Corollary 5.14. In that way, the
possibilities described in Table 5.1 capture the important theorems of LP duality.
5.7 Equality LP Constraints and Unrestricted Variables

We have stated the strong duality Theorem 5.3 for the standard inequality form
of an LP where both primal and dual LP have inequalities with separately stated
nonnegativity constraints for the primal and dual variables. In this section, we
consider different constraints for an LP, which offer greater flexibility in applying
the duality theorem in various contexts. Namely, we allow not only inequalities
but also equalities, as well as variables without a sign restriction.
These cases are closely related with respect to the duality property. As we will
see, a primal equality constraint corresponds to a dual variable that is unrestricted
in sign, and a primal variable that is unrestricted in sign gives rise to a dual
constraint that is an equality. The other case, which we have already seen, is
a primal inequality that corresponds to a dual variable that is nonnegative, or
a primal nonnegative variable where the corresponding dual constraint is an
inequality.
In the following, the matrix 𝐴 and vectors 𝑏 and 𝑐 will always have dimensions
𝐴 ∈ R𝑚×𝑛 , 𝑏 ∈ R𝑚 , and 𝑐 ∈ R𝑛 . These data 𝐴, 𝑏, 𝑐 will simultaneously define a
primal LP with variables 𝑥 in R𝑛 , and a dual LP with variables 𝑦 in R𝑚 . In the
primal LP, we write 𝐴𝑥 ≤ 𝑏 (which gives rise to nonnegative dual variables 𝑦
such that 𝑦 ≥ 0) or 𝐴𝑥 = 𝑏 (giving dual variables 𝑦 without sign constraints),
and maximise the objective function 𝑐⊤𝑥. In the dual LP, we state inequalities
𝑦⊤𝐴 ≥ 𝑐⊤ or equations 𝑦⊤𝐴 = 𝑐⊤ (depending on whether the corresponding primal
variables 𝑥 are nonnegative or unconstrained, respectively), and minimise the
objective function 𝑦⊤𝑏.
We first consider a primal LP with nonnegative variables and equality con-
straints, which is often called an LP in equality form:
maximise 𝑐⊤𝑥
subject to 𝐴𝑥 = 𝑏 , (5.31)
𝑥 ≥ 0.
Its dual LP is obtained as before by multiplying the constraints with multipliers 𝑦𝑖, which may now have arbitrary sign because equations are preserved under arbitrary multipliers; requiring 𝑦⊤𝐴 ≥ 𝑐⊤ gives the bound 𝑐⊤𝑥 ≤ (𝑦⊤𝐴)𝑥 = 𝑦⊤(𝐴𝑥) = 𝑦⊤𝑏 because 𝑥 ≥ 0. That is, the dual LP is

minimise 𝑦⊤𝑏
subject to 𝑦⊤𝐴 ≥ 𝑐⊤ , 𝑦 ∈ R𝑚 . (5.32)

The second new case is an LP where all variables are unrestricted in sign:
maximise 𝑐⊤𝑥
subject to 𝐴𝑥 ≤ 𝑏 . (5.34)
To find the dual LP to the primal LP (5.34), we can again multiply each inequality in
𝐴𝑥 ≤ 𝑏 with a separate variable 𝑦 𝑖 , with the aim of finding an upper bound to the
primal objective function 𝑐⊤𝑥. The inequality is preserved when 𝑦 𝑖 is nonnegative,
but in order to obtain an upper bound on the primal objective function 𝑐⊤𝑥 we
have to require that 𝑦⊤𝐴 = 𝑐⊤ because the sign of any variable 𝑥 𝑗 is not known.
That is, the dual to (5.34) is
minimise 𝑦⊤𝑏
(5.35)
subject to 𝑦⊤𝐴 = 𝑐⊤, 𝑦 ≥ 0.
Observe that compared to (5.7) the LP (5.34) is missing the nonnegativity constraints
𝑥 ≥ 0, and that compared to (5.9) the dual LP (5.35) states 𝑛 equations 𝑦⊤𝐴 = 𝑐⊤
rather than inequalities.
Again, the choice of primal and dual LP is motivated by weak duality, which
states that for feasible solutions 𝑥 to (5.34) and 𝑦 to (5.35) the corresponding
objective functions are mutual bounds. Including proof, it says

𝑐⊤𝑥 = (𝑦⊤𝐴) 𝑥 = 𝑦⊤(𝐴𝑥) ≤ 𝑦⊤𝑏

using 𝑦⊤𝐴 = 𝑐⊤, 𝑦 ≥ 0, and 𝐴𝑥 ≤ 𝑏.
Hence, we have the following types of pairs of a primal LP and its dual LP,
including the original more symmetric situation of LPs in inequality form:
• a primal LP (5.31) with nonnegative variables and equality constraints, and its
dual LP (5.32) with unrestricted variables and inequality constraints;
Note that converting the inequality form (5.7) to an LP in equality form (5.37)
defines a new dual LP with unrestricted variables 𝑦1 , . . . , 𝑦𝑚 , but the former
inequalities 𝑦 𝑖 ≥ 0 reappear now explicitly via the identity matrix and objective
function zeros introduced with the slack variables, as shown in (5.38). So the
resulting dual LP is exactly the same as in (5.9).
Even simpler, an LP in inequality form (5.7) can also be seen as the special
case of an LP with unrestricted variables 𝑥 𝑗 as in (5.34) since the condition 𝑥 ≥ 0
can be written in the form 𝐴𝑥 ≤ 𝑏 by explicitly listing the 𝑛 inequalities −𝑥 𝑗 ≤ 0.
That is, 𝐴𝑥 ≤ 𝑏 and 𝑥 ≥ 0 become, with unrestricted 𝑥 ∈ R𝑛 , the 𝑚 + 𝑛 inequalities

[ 𝐴 ]        [ 𝑏 ]
[−𝐼 ]  𝑥  ≤  [ 0 ]

with an 𝑛 × 𝑛 identity matrix 𝐼. As is easily seen with a suitable
Tucker diagram, the corresponding dual LP according to (5.35) has an additional
𝑛-vector of slack variables 𝑟, say, with the dual constraints 𝑦⊤𝐴 − 𝑟⊤ = 𝑐⊤, 𝑦 ≥ 0, 𝑟 ≥ 0, which is equivalent to 𝑦⊤𝐴 ≥ 𝑐⊤, 𝑦 ≥ 0, that is, again to the dual LP (5.9).
5.8 General LP Duality *

In the general form of an LP, inequality and equality constraints can be mixed, and so can nonnegative and unrestricted variables. Let 𝐾 ⊆ {1, . . . , 𝑚} be the set of rows with inequality constraints and 𝐽 ⊆ {1, . . . , 𝑛} the set of columns with nonnegative variables, with complements 𝐾̄ = {1, . . . , 𝑚} − 𝐾 and 𝐽̄ = {1, . . . , 𝑛} − 𝐽.
To define the LP in general form, we first draw the Tucker diagram, shown in
Figure 5.16. The diagram assumes that columns and rows are arranged so that
those in 𝐽 and 𝐾 come first. The big boxes contain the respective parts of the
constraint matrix 𝐴, the vertical boxes on the right the parts of the right-hand side 𝑏,
and the horizontal box at the bottom the parts of the primal objective function 𝑐⊤.
                   𝑥𝑗 ≥ 0 (𝑗 ∈ 𝐽)    𝑥𝑗 ∈ R (𝑗 ∈ 𝐽̄)
 𝑦𝑖 ≥ 0 (𝑖 ∈ 𝐾)   [                𝐴                ]   ≤ 𝑏
 𝑦𝑖 ∈ R (𝑖 ∈ 𝐾̄)   [                                 ]   = 𝑏
                          ∨                =
                                 𝑐⊤
Figure 5.16 Tucker diagram for an LP in general form.
In order to state the duality theorem concisely, we define the feasible sets 𝑋
and 𝑌 for the primal and dual LP. The entries of the 𝑚 × 𝑛 matrix 𝐴 are 𝑎 𝑖𝑗 in row 𝑖
and column 𝑗. Let
𝑋 = { 𝑥 ∈ R𝑛 | ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 ≤ 𝑏𝑖 , 𝑖 ∈ 𝐾 ,
               ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 = 𝑏𝑖 , 𝑖 ∈ 𝐾̄ , (5.40)
               𝑥𝑗 ≥ 0 , 𝑗 ∈ 𝐽 } .
Any 𝑥 belonging to 𝑋 is called primal feasible, and the primal LP is called feasible
if 𝑋 is not the empty set ∅. The primal LP is the problem

maximise 𝑐⊤𝑥 subject to 𝑥 ∈ 𝑋 . (5.41)

(This results when reading the Tucker diagram in Figure 5.16 horizontally.) The
corresponding dual LP has the feasible set
𝑌 = { 𝑦 ∈ R𝑚 | ∑ᵢ₌₁ᵐ 𝑦𝑖 𝑎𝑖𝑗 ≥ 𝑐𝑗 , 𝑗 ∈ 𝐽 ,
               ∑ᵢ₌₁ᵐ 𝑦𝑖 𝑎𝑖𝑗 = 𝑐𝑗 , 𝑗 ∈ 𝐽̄ , (5.42)
               𝑦𝑖 ≥ 0 , 𝑖 ∈ 𝐾 }
and is the problem
minimise 𝑦⊤𝑏 subject to 𝑦 ∈ 𝑌 . (5.43)
(This results when reading the Tucker diagram in Figure 5.16 vertically.) By
reversing signs, one can verify that the dual of the dual LP is again the primal.
Table 5.2 shows the roles of the sets 𝐾, 𝐾̄, 𝐽, 𝐽̄.
For an LP in general form, the strong duality theorem states that (a) for any
primal and dual feasible solutions, the corresponding objective functions are
mutual bounds, (b) if the primal and the dual LP both have feasible solutions,
then they have optimal solutions with the same value of their objective functions,
(c) if the primal or dual LP is bounded, the other LP is feasible. This implies the
possibilities shown in Table 5.1.
Theorem 5.16 (General LP duality). For the primal LP (5.41) and its dual LP (5.43),
(a) (Weak duality) 𝑐⊤𝑥 ≤ 𝑦⊤𝑏 for all 𝑥 ∈ 𝑋 and 𝑦 ∈ 𝑌.
(b) (Strong duality) If 𝑋 ≠ ∅ and 𝑌 ≠ ∅ then 𝑐⊤𝑥 = 𝑦⊤𝑏 for some 𝑥 ∈ 𝑋 and 𝑦 ∈ 𝑌, so
that both 𝑥 and 𝑦 are optimal.
(c) (Boundedness implies dual feasibility) If 𝑋 ≠ ∅ and 𝑐⊤𝑥 for 𝑥 ∈ 𝑋 is bounded
above, then 𝑌 ≠ ∅. If 𝑌 ≠ ∅ and 𝑦⊤𝑏 for 𝑦 ∈ 𝑌 is bounded below, then 𝑋 ≠ ∅.
primal LP                                          dual LP

constraint                                         variable
row 𝑖 ∈ 𝐾 :   inequality  ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 ≤ 𝑏𝑖       nonnegative    𝑦𝑖 ≥ 0
row 𝑖 ∈ 𝐾̄ :   equation    ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 = 𝑏𝑖       unconstrained  𝑦𝑖 ∈ R

variable                                           constraint
column 𝑗 ∈ 𝐽 :   nonnegative    𝑥𝑗 ≥ 0             inequality  ∑ᵢ₌₁ᵐ 𝑦𝑖 𝑎𝑖𝑗 ≥ 𝑐𝑗
column 𝑗 ∈ 𝐽̄ :   unconstrained  𝑥𝑗 ∈ R             equation    ∑ᵢ₌₁ᵐ 𝑦𝑖 𝑎𝑖𝑗 = 𝑐𝑗

objective function: maximise ∑ⱼ₌₁ⁿ 𝑐𝑗 𝑥𝑗           objective function: minimise ∑ᵢ₌₁ᵐ 𝑦𝑖 𝑏𝑖

Table 5.2 The correspondence between primal constraints and variables and dual variables and constraints for an LP in general form.
Theorem 5.16 is proved by reducing the LP in general form to an LP in inequality form and applying Theorems 5.2, 5.3, and 5.13. In order to keep the notation simple, we demonstrate
this first for the special case of an LP (5.31) in equality form and its dual (5.32), that is, with 𝐽 = {1, . . . , 𝑛} and 𝐾̄ = {1, . . . , 𝑚}. This LP with constraints 𝐴𝑥 = 𝑏,
𝑥 ≥ 0 is equivalent to
maximise 𝑐⊤𝑥
subject to 𝐴𝑥 ≤ 𝑏 ,
          −𝐴𝑥 ≤ −𝑏 , (5.44)
            𝑥 ≥ 0 .
The corresponding dual LP uses two 𝑚-vectors 𝑦̂ and 𝑦̌ and says

minimise 𝑦̂⊤𝑏 − 𝑦̌⊤𝑏
subject to 𝑦̂⊤𝐴 − 𝑦̌⊤𝐴 ≥ 𝑐⊤ , (5.45)
           𝑦̂ , 𝑦̌ ≥ 0 ,

or equivalently
minimise ( 𝑦̂ − 𝑦̌ )⊤𝑏
subject to ( 𝑦̂ − 𝑦̌ )⊤𝐴 ≥ 𝑐⊤ , (5.46)
           𝑦̂ , 𝑦̌ ≥ 0 .
Any solution 𝑦 to the dual LP (5.32) with unconstrained dual variables 𝑦 can be written in the form (5.46) where 𝑦̂ represents the "positive" part of 𝑦 and 𝑦̌ the negated "negative" part of 𝑦 according to

𝑦̂𝑖 = max{ 𝑦𝑖 , 0 } , 𝑦̌𝑖 = max{ −𝑦𝑖 , 0 } , 1 ≤ 𝑖 ≤ 𝑚 , so that 𝑦 = 𝑦̂ − 𝑦̌ .
A pair of dual inequalities that corresponds to an unrestricted primal variable is then equivalent to a dual equation for 𝑗 ∈ 𝐽̄, as stated in (5.42) and Table 5.2. The
claim then follows as before for the known statements for an LP in inequality form.
5.9 Complementary Slackness

The optimality condition 𝑐⊤𝑥 = 𝑦⊤𝑏, already stated in the weak duality Theorem 5.2,
is equivalent to a combinatorial condition known as “complementary slackness”.
It states that in each column 𝑗 and row 𝑖 at least one of the associated inequalities
in the dual or primal LP is tight, that is, holds as an equality. In a general LP, this
is only relevant for the inequality constraints, that is, for 𝑗 ∈ 𝐽 and 𝑖 ∈ 𝐾 (see the
Tucker diagram in Figure 5.16).
Theorem 5.17 (Complementary slackness). A pair 𝑥, 𝑦 of feasible solutions to the primal LP (5.7) and its dual LP (5.9) is optimal if and only if

(𝑦⊤𝐴 − 𝑐⊤) 𝑥 = 0 , 𝑦⊤(𝑏 − 𝐴𝑥) = 0 , (5.48)

that is, for 1 ≤ 𝑗 ≤ 𝑛 and 1 ≤ 𝑖 ≤ 𝑚,

( ∑ᵢ₌₁ᵐ 𝑦𝑖 𝑎𝑖𝑗 − 𝑐𝑗 ) 𝑥𝑗 = 0 , 𝑦𝑖 ( 𝑏𝑖 − ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 ) = 0 , (5.49)
that is,
𝑥𝑗 > 0 ⇒ ∑ᵢ₌₁ᵐ 𝑦𝑖 𝑎𝑖𝑗 = 𝑐𝑗 , 𝑦𝑖 > 0 ⇒ ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 = 𝑏𝑖 . (5.50)
For an LP in general form (5.41) and its dual (5.43), a feasible pair 𝑥 ∈ 𝑋, 𝑦 ∈ 𝑌 is also
optimal if and only if (5.48) holds, or equivalently (5.49) or (5.50).
Proof. Suppose 𝑥 and 𝑦 are feasible for (5.7) and (5.9), so 𝐴𝑥 ≤ 𝑏, 𝑥 ≥ 0, 𝑦⊤𝐴 ≥ 𝑐⊤,
𝑦 ≥ 0. They are both optimal if and only if their objective functions are equal,
𝑐⊤𝑥 = 𝑦⊤𝑏. This means that the two inequalities 𝑐⊤𝑥 ≤ 𝑦⊤𝐴 𝑥 ≤ 𝑦⊤𝑏 used to prove
weak duality hold as equalities 𝑐⊤𝑥 = 𝑦⊤𝐴 𝑥 and 𝑦⊤𝐴 𝑥 = 𝑦⊤𝑏, which are equivalent
to (5.48).
The left equation in (5.48) says
0 = (𝑦⊤𝐴 − 𝑐⊤) 𝑥 = ∑ⱼ₌₁ⁿ ( ∑ᵢ₌₁ᵐ 𝑦𝑖 𝑎𝑖𝑗 − 𝑐𝑗 ) 𝑥𝑗 . (5.51)
Then 𝑦⊤𝐴 ≥ 𝑐⊤ and 𝑥 ≥ 0 imply that the sum over 𝑗 on the right-hand side of (5.51)
is a sum of nonnegative terms, which is zero only if each of them is zero, as stated
on the left in (5.49). Similarly, the second equation 𝑦⊤(𝑏 − 𝐴𝑥) = 0 in (5.48) holds only if the equations on the right of (5.49) hold for all 𝑖. Clearly, (5.50) is equivalent to (5.49).
For an LP in general form, the feasibility conditions 𝑦 ∈ 𝑌 and 𝑥 ∈ 𝑋 with
(5.42) and (5.40) imply
∑ᵢ₌₁ᵐ 𝑦𝑖 𝑎𝑖𝑗 = 𝑐𝑗 for 𝑗 ∈ 𝐽̄ , ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 = 𝑏𝑖 for 𝑖 ∈ 𝐾̄ , (5.52)
so that (5.49) holds for 𝑗 ∈ 𝐽̄ and 𝑖 ∈ 𝐾̄. Hence, the respective terms in (5.49) are zero in the scalar products (𝑦⊤𝐴 − 𝑐⊤) 𝑥 and 𝑦⊤(𝑏 − 𝐴𝑥). These scalar products are nonnegative because ∑ᵢ₌₁ᵐ 𝑦𝑖 𝑎𝑖𝑗 ≥ 𝑐𝑗 and 𝑥𝑗 ≥ 0 for 𝑗 ∈ 𝐽, and 𝑦𝑖 ≥ 0 and 𝑏𝑖 ≥ ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 for 𝑖 ∈ 𝐾. So the weak duality proof 𝑐⊤𝑥 ≤ 𝑦⊤𝐴 𝑥 ≤ 𝑦⊤𝑏 applies as well. As before, optimality 𝑐⊤𝑥 = 𝑦⊤𝑏 is equivalent to (5.48) and thus to (5.49) and (5.50), which for 𝑗 ∈ 𝐽̄ and 𝑖 ∈ 𝐾̄ hold trivially by (5.52) irrespective of the sign of 𝑥𝑗 or 𝑦𝑖 .
Consider the standard LP in inequality form (5.7) and its dual LP (5.9). The
dual feasibility constraints imply nonnegativity of 𝑦⊤𝐴 − 𝑐⊤, which is the 𝑛-vector
of “slacks”, that is, of differences in the inequalities 𝑦⊤𝐴 ≥ 𝑐⊤; such a slack is
zero in some column if the inequality is tight. The condition (𝑦⊤𝐴 − 𝑐⊤) 𝑥 = 0
in (5.48) says that this nonnegative slack vector is orthogonal to the nonnegative
vector 𝑥, because the scalar product of these two vectors is zero. The conditions
(5.49) and (5.50) state that this orthogonality can hold only if the two vectors are
complementary in the sense that in each component at least one of them is zero.
Similarly, the nonnegative 𝑚-vector 𝑦 and the 𝑚-vector of primal slacks 𝑏 − 𝐴𝑥 are
orthogonal in the second equation 𝑦⊤(𝑏 − 𝐴𝑥) = 0 in (5.48). In a compact way, we
can write
𝑦⊤𝐴 ≥ 𝑐⊤  ⊥  𝑥 ≥ 0
𝑦 ≥ 0      ⊥  𝐴𝑥 ≤ 𝑏 (5.53)
As an example, consider again the LP (5.8). The inequality constraints of its dual LP (5.9) are

3𝑦1 + 𝑦2 ≥ 8
4𝑦1 + 𝑦2 ≥ 10
(5.54)
2𝑦1 + 𝑦2 ≥ 5
One feasible primal solution is 𝑥 = (𝑥1 , 𝑥2 , 𝑥3 ) = (0, 1, 1). Then the first inequality
in (5.8) is not tight so by (5.50) we need 𝑦1 = 0 in an optimal solution. Because
𝑥2 > 0 and 𝑥3 > 0 the second and third inequality in (5.54) have to be tight, which
implies 𝑦2 = 10 and 𝑦2 = 5 which is impossible. So this 𝑥 is not optimal.
Another feasible primal solution is (𝑥1 , 𝑥2 , 𝑥3 ) = (0, 1.75, 0), where 𝑦2 = 0
because the second primal inequality is not tight. Only the second inequality in
(5.54) has to be tight, that is, 4𝑦1 = 10 or 𝑦1 = 2.5. However, this violates the first
dual inequality in (5.54).
Finally, for the primal solution 𝑥 = (𝑥1 , 𝑥2 , 𝑥3 ) = (1, 1, 0) both primal inequal-
ities are tight, which allows for 𝑦1 > 0 and 𝑦2 > 0. Then the first two dual
inequalities in (5.54) have to be tight, which determines 𝑦 as (𝑦1 , 𝑦2 ) = (2, 2), which
also fulfills the third dual inequality (which is allowed to have positive slack
because 𝑥3 = 0). So here 𝑥 and 𝑦 are optimal.
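This case analysis can be automated. Here is a sketch in Python (with the LP data of the example, as in (5.54) and the dictionary (5.58)) that tests feasibility together with the orthogonality conditions (5.53):

```python
import numpy as np

def optimal_pair(A, b, c, x, y, tol=1e-9):
    """True iff x, y are feasible for (5.7), (5.9) and complementary
    slackness (5.48)/(5.53) holds, i.e. optimality by Theorem 5.17."""
    primal = np.all(A @ x <= b + tol) and np.all(x >= -tol)
    dual = np.all(y @ A >= c - tol) and np.all(y >= -tol)
    slack = np.all((y @ A - c) * x <= tol) and np.all(y * (b - A @ x) <= tol)
    return primal and dual and slack

A = np.array([[3., 4., 2.], [1., 1., 1.]])
b = np.array([7., 2.])
c = np.array([8., 10., 5.])
print(optimal_pair(A, b, c, np.array([1., 1., 0.]), np.array([2., 2.])))  # True
print(optimal_pair(A, b, c, np.array([0., 1., 1.]), np.array([2., 2.])))  # False
```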
The complementary slackness condition is a good way to verify that a con-
jectured primal solution is optimal, because the resulting equations for the dual
variables typically determine the values of the dual variables which can then be
checked for dual feasibility (or for equality of primal and dual objective function).
As stated in Theorem 5.17, the complementary slackness conditions charac-
terise optimality of a primal-dual pair 𝑥, 𝑦 also for an LP in general form. However,
in such an LP they only impose constraints for the primal or dual inequalities,
that is, for the columns 𝑗 ∈ 𝐽 and 𝑖 ∈ 𝐾 in the Tucker diagram in Figure 5.16.
The other columns and rows already define dual or primal equations which by
definition have zero slack. This is also the case if such an equality is converted to a
pair of inequalities. For example, for 𝑖 ∈ 𝐾̄, the primal equation ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 = 𝑏𝑖 with unrestricted dual variable 𝑦𝑖 can be rewritten as a pair of two inequalities ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 ≤ 𝑏𝑖 and −∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 ≤ −𝑏𝑖 with associated nonnegative dual variables 𝑦̂𝑖 and 𝑦̌𝑖 so that 𝑦𝑖 = 𝑦̂𝑖 − 𝑦̌𝑖 . As noted in the proof of Theorem 5.16, we can add a constant 𝑧𝑖 to the two variables 𝑦̂𝑖 and 𝑦̌𝑖 in any dual feasible solution, so that
they are both positive when 𝑧 𝑖 > 0. By complementary slackness, the two primal
inequalities then have to be tight, but they can anyhow only be fulfilled if they both
hold as an equation. This confirms that for a general LP, complementary slackness
is informative only for the inequality constraints.
5.10 LP Duality and the KKT Theorem *

In this section we connect linear programming duality with the first-order condi-
tions studied in the previous chapter. These concern a local optimum, but for an
LP that is the same as a global optimum:
Theorem 5.18. Any local optimum (maximum or minimum) of an LP (in general form) is
a global optimum.
Proof. The feasible set 𝑋 of the LP (as in (5.40)) is convex. Let 𝑥̄ ∈ 𝑋 be a local maximum of 𝑐⊤𝑥, that is, for some 𝜀 > 0,

𝑐⊤𝑥̄ ≥ 𝑐⊤𝑧 for all 𝑧 ∈ 𝑋 with ∥𝑧 − 𝑥̄∥ < 𝜀 . (5.55)

Suppose that 𝑥̄ is not a global maximum, that is, 𝑐⊤𝑥 > 𝑐⊤𝑥̄ for some 𝑥 ∈ 𝑋. Let 0 < 𝛿 ≤ 1 and let 𝑧 = 𝑥̄(1 − 𝛿) + 𝑥𝛿 = 𝑥̄ + (𝑥 − 𝑥̄)𝛿, where 𝑧 ∈ 𝑋 because 𝑋 is convex. Then

𝑐⊤𝑧 = 𝑐⊤𝑥̄ + (𝑐⊤𝑥 − 𝑐⊤𝑥̄)𝛿 > 𝑐⊤𝑥̄ . (5.56)

However, ∥𝑧 − 𝑥̄∥ = ∥𝑥 − 𝑥̄∥𝛿 < 𝜀 for sufficiently small positive 𝛿, and then (5.56) contradicts (5.55). Hence, there is no 𝑥 ∈ 𝑋 with 𝑐⊤𝑥 > 𝑐⊤𝑥̄, which shows that 𝑥̄ is indeed a global maximum.
We show that the KKT Theorem 4.11 applied to a linear program is essentially
the strong LP duality theorem, applied to an LP (5.34) in inequality form where
any inequalities such as 𝑥 ≥ 0 would have to be written as part of 𝐴𝑥 ≤ 𝑏, so the
variables 𝑥 ∈ R𝑛 are unrestricted. In order to match the notation in Theorem 4.11, let
the number of rows of 𝐴 be ℓ. That is, (5.34) states: maximise 𝑓(𝑥) = 𝑐⊤𝑥 subject to ℎ𝑖(𝑥) = ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥𝑗 − 𝑏𝑖 ≤ 0 for 1 ≤ 𝑖 ≤ ℓ. The functions 𝑓 and ℎ𝑖 are affine functions that have constant derivatives, with 𝐷𝑓(𝑥) = 𝑐⊤ and 𝐷ℎ𝑖(𝑥) = (𝑎𝑖1 , . . . , 𝑎𝑖𝑛) for 1 ≤ 𝑖 ≤ ℓ. The open set 𝑈 in Theorem 4.11 is R𝑛 .
Suppose that this LP is feasible and that 𝑐⊤𝑥 has a local maximum at 𝑥 = 𝑥̄, which by Theorem 5.18 is also a global maximum. By the duality Theorem 5.16 for an LP in general form, there exists an optimal dual vector 𝑦 ∈ Rℓ with 𝑦 ≥ 0 and 𝑦⊤𝐴 = 𝑐⊤ (see also (5.35)), which is equivalent to 𝐷𝑓(𝑥̄) = 𝑐⊤ = 𝑦⊤𝐴 = 𝑦1 𝐷ℎ1(𝑥̄) + · · · + 𝑦ℓ 𝐷ℎℓ(𝑥̄), which is the last equation in (4.51) with 𝜇𝑖 = 𝑦𝑖 for 1 ≤ 𝑖 ≤ ℓ. Moreover, the optimality condition 𝑐⊤𝑥̄ = 𝑦⊤𝑏 is equivalent to the complementary slackness conditions (5.49). In (5.49), the first set of equations holds automatically because 𝑦⊤𝐴 = 𝑐⊤, and the second equations 𝑦𝑖 (𝑏𝑖 − ∑ⱼ₌₁ⁿ 𝑎𝑖𝑗 𝑥̄𝑗) = 0 are equivalent to 𝑦𝑖 (−ℎ𝑖(𝑥̄)) = 0 and therefore to 𝜇𝑖 ℎ𝑖(𝑥̄) = 0 as stated in (4.51). So Theorem 4.11
is a consequence of the strong duality theorem, in fact in a stronger form because
it does not require the constraint qualification that the gradients in (4.51) for the
tight constraints are linearly independent.
Conversely, the strong duality theorem for an LP with unrestricted variables
(5.34) can also be seen as a special case of the KKT Theorem 4.11, where one can
argue separately that the constraint qualification is not needed.
basic variables are nonnegative. In the dictionary, the values for the basic variables
in this basic feasible solution are just the constants that follow the equality signs,
with the corresponding value for 𝑧 beneath the horizontal line. In (5.58) these
values are 𝑠 1 = 7, 𝑠2 = 2, 𝑧 = 0.
Starting with an initial basic feasible solution such as (5.58), the simplex
algorithm proceeds in steps that rewrite the dictionary. In our example, we record
the changes of the dictionary, and keep in mind that 𝑧 should be maximised and
all variables should stay nonnegative. In each step, one nonbasic variable becomes
basic (it is said to enter the basis) and a basic variable becomes nonbasic (this variable
is said to leave the basis).
For a given dictionary, the entering variable is chosen so as to improve the value
of the objective function when that variable is increased from zero in the current basic
feasible solution. In (5.58), this will happen by increasing any of 𝑥1 , 𝑥2 , 𝑥3 because
they all have a positive coefficient in the linear equation for 𝑧. Suppose 𝑥 2 is
chosen as the entering variable (for example, because it has the largest coefficient).
Suppose the other nonbasic variables 𝑥 1 and 𝑥3 stay at zero and 𝑥2 increases. Then
𝑧 = 0 + 10𝑥2 (the desired increase), 𝑠 1 = 7 − 4𝑥 2 , and 𝑠2 = 2 − 𝑥2 . In order to
maintain feasibility, we need 𝑠1 = 7 − 4𝑥2 ≥ 0 and 𝑠 2 = 2 − 𝑥2 ≥ 0, where these
two constraints are equivalent to 74 = 1.75 ≥ 𝑥2 and 2 ≥ 𝑥2 . The first of these
is the stronger constraint: when 𝑥2 is increased from 0 to 1.75, then 𝑠1 = 0 and
𝑠2 = 0.25 > 0. For that reason, 𝑠 1 is chosen as the leaving variable, and we rewrite
the first equation in (5.58) so that 𝑥 2 is on the left and 𝑠1 is on the right, giving
4𝑥 2 = 7 − 3𝑥1 − 𝑠1 − 2𝑥3
𝑠2 = 2 − 𝑥1 − 𝑥2 − 𝑥3 (5.59)
𝑧 = 0 + 8𝑥1 + 10𝑥 2 + 5𝑥3
However, this is not a dictionary because 𝑥2 is still on the right-hand side of the
second and third equation, but should appear only on the left. To remedy this, we
first rewrite the first equation so that 𝑥2 has coefficient 1, and then substitute this
equation into the other two equations:

𝑥2 = 1.75 − 0.75𝑥1 − 0.25𝑠1 − 0.5𝑥3
𝑠2 = 2 − 𝑥1 − 𝑥3 − (1.75 − 0.75𝑥1 − 0.25𝑠1 − 0.5𝑥3) (5.60)
𝑧 = 0 + 8𝑥1 + 5𝑥3 + 10 (1.75 − 0.75𝑥1 − 0.25𝑠1 − 0.5𝑥3)

which gives the new dictionary with basic variables 𝑥2 and 𝑠2 and nonbasic variables 𝑥1 , 𝑠1 , 𝑥3 :

𝑥2 = 1.75 − 0.75𝑥1 − 0.25𝑠1 − 0.5𝑥3
𝑠2 = 0.25 − 0.25𝑥1 + 0.25𝑠1 − 0.5𝑥3 (5.61)
𝑧 = 17.5 + 0.5𝑥1 − 2.5𝑠1 + 0𝑥3
The basic feasible solution corresponding to (5.61) is 𝑥 2 = 1.75, 𝑠2 = 0.25 and has
objective function value 𝑧 = 17.5. The latter can still be improved by increasing
𝑥1 , which is now the unique choice for entering variable because neither 𝑠 1 nor
𝑥3 have a positive coefficient in this representation of 𝑧. Increasing 𝑥1 from zero
imposes the constraints 𝑥2 = 1.75 − 0.75𝑥1 ≥ 0 and 𝑠2 = 0.25 − 0.25𝑥1 ≥ 0, where
the second is stronger, since 𝑠 2 becomes zero when 𝑥1 = 1 while 𝑥 2 is still positive.
So 𝑥1 enters and 𝑠2 leaves the basis. Similar to the step from (5.58) to (5.59), we
bring 𝑥1 to the left and 𝑠2 to the right side of the equation,

0.25𝑥1 = 0.25 − 𝑠2 + 0.25𝑠1 − 0.5𝑥3 (5.62)

and substitute the resulting equation 𝑥1 = 1 − 4𝑠2 + 𝑠1 − 2𝑥3 for 𝑥1 into the other
two equations:
𝑥2 = 1.75 − 0.25𝑠1 − 0.5𝑥3
− 0.75(1 − 4𝑠 2 + 𝑠1 − 2𝑥3 )
𝑥1 = 1 − 4𝑠 2 + 𝑠1 − 2𝑥3 (5.63)
𝑧= 17.5 − 2.5𝑠1 + 0𝑥3
+ 0.5(1 − 4𝑠2 + 𝑠1 − 2𝑥 3 )
which gives the next dictionary with 𝑥 2 and 𝑥1 as basic variables and 𝑠2 , 𝑠1 , 𝑥3 as
nonbasic variables:
𝑥2 = 1 + 3𝑠2 − 𝑠1 + 𝑥3
𝑥1 = 1 − 4𝑠2 + 𝑠1 − 2𝑥3 (5.64)
𝑧 = 18 − 2𝑠2 − 2𝑠 1 − 𝑥3
As always, this dictionary is equivalent to the original system of equations (5.57),
with basic feasible solution 𝑥1 = 1, 𝑥2 = 1, and corresponding objective function
value 𝑧 = 18. In the last line in (5.64), no nonbasic variable has a positive coefficient.
This means that no increase from zero of a nonbasic variable can improve the
objective function. Hence this basic feasible solution is optimal, and the algorithm
terminates.
Converting one dictionary to another by exchanging a nonbasic (entering)
variable with a basic (leaving) variable is commonly referred to as pivoting. The
column of the entering variable and the row of the leaving variable define a
nonzero coefficient of the entering variable known as a pivot element. Pivoting
can also be performed directly on this system of equations, written with all variables on the left as the "tableau"

3𝑥1 + 4𝑥2 + 2𝑥3 + 𝑠1 = 7
 𝑥1 + 𝑥2 + 𝑥3 + 𝑠2 = 2 (5.65)
𝑧 − 8𝑥1 − 10𝑥2 − 5𝑥3 = 0

where we now have to remember that in the expression for 𝑧 a potential entering
variable is identified by a negative coefficient. In (5.65) the basic variables are 𝑠1
and 𝑠2 which have a unit vector as a column of coefficients, which has entry 1 in the
row of the basic variable and entry 0 elsewhere.
With 𝑥2 as the entering and 𝑠1 as the leaving variable in (5.65), pivoting
amounts to creating a unit vector in the column for 𝑥2 . This means to divide the
first (pivot) row by 4 so that 𝑥2 has coefficient 1 in that row. The new first row is
then subtracted from the second row, and 10 times the new first row is added to
the third row, so that the coefficient of 𝑥 2 in those rows becomes zero:
0.75𝑥1 + 𝑥2 + 0.5𝑥3 + 0.25𝑠1 = 1.75
0.25𝑥1 + 0.5𝑥3 − 0.25𝑠1 + 𝑠2 = 0.25 (5.66)
𝑧 − 0.5𝑥1 + 0𝑥3 + 2.5𝑠1 = 17.5
These row operations have the same effect as the substitutions in (5.60). The system
(5.66) is equivalent to the second dictionary (5.61). The basic variables 𝑥2 and 𝑠 2
are identified by their unit-vector columns. Note that 𝑧 is expressed only in terms
of the nonbasic variables.
The entering variable in (5.66) is 𝑥1 and the leaving variable is 𝑠2 , so that the
unit vector for that second row should now appear in the column for 𝑥1 rather
than 𝑠2 . The second row is divided by the pivot element 0.25 (i.e., multiplied by 4)
to give 𝑥1 coefficient 1, and the coefficients of 𝑥 1 in the other rows give the suitable
multiplier to subtract the new second row from the other rows, namely 0.75 for
the first and −0.5 for the third row. This gives
𝑥 2 − 𝑥3 + 𝑠 1 − 3𝑠2 = 1
𝑥1 + 2𝑥3 − 𝑠 1 + 4𝑠2 = 1 (5.67)
𝑧 + 𝑥3 + 2𝑠 1 + 2𝑠2 = 18
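These row operations are mechanical and fit in a few lines. Here is a numpy sketch (an illustration, not part of the guide); the tableau stores the columns 𝑥1, 𝑥2, 𝑥3, 𝑠1, 𝑠2 and the right-hand side, with the 𝑧-row last, and reproduces (5.66) and (5.67):

```python
import numpy as np

def pivot(T, r, c):
    """Make column c a unit vector with 1 in row r by row operations
    (the pivot element T[r, c] must be nonzero)."""
    T[r] /= T[r, c]
    for i in range(T.shape[0]):
        if i != r:
            T[i] -= T[i, c] * T[r]

# System (5.65); in the z-row, a negative coefficient marks a
# potential entering variable.
T = np.array([[ 3.,   4.,  2., 1., 0., 7.],
              [ 1.,   1.,  1., 0., 1., 2.],
              [-8., -10., -5., 0., 0., 0.]])

pivot(T, 0, 1)   # x2 enters, s1 leaves: gives (5.66)
pivot(T, 1, 0)   # x1 enters, s2 leaves: gives (5.67) with z = 18
print(T)
```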
5.12 The Simplex Algorithm: General Description *

This section describes the simplex algorithm in general. We also define in generality
the relevant terms, many of which have already been introduced in the previous
section.
The simplex algorithm applies to an LP (5.31) in equality form: maximise 𝑐⊤𝑥
subject to 𝐴𝑥 = 𝑏, 𝑥 ≥ 0, for given 𝐴 ∈ R𝑚×𝑛 , 𝑏 ∈ R𝑚 , 𝑐 ∈ R𝑛 . We assume that
the 𝑚 rows of the matrix 𝐴 are linearly independent. This is automatically the
case if 𝐴 has been obtained from an LP in inequality form by adding an identity
matrix for the slack variables as in (5.38). In general, if the row vectors of 𝐴 are
linearly dependent, then some row of 𝐴 is a linear combination of the other rows.
The respective equation in 𝐴𝑥 = 𝑏 is then either also the linear combination of the
other equations, that is, it can be omitted, or it contradicts the linear combination
and 𝐴𝑥 = 𝑏 has no solution. Therefore, it is no restriction to assume that 𝐴 has full
row rank 𝑚.
Let 𝐴1 , . . . , 𝐴𝑛 be the 𝑛 columns of 𝐴. A basis (of 𝐴) is an 𝑚-element subset
𝐵 of the column indices {1, . . . , 𝑛} so that the vectors 𝐴 𝑗 for 𝑗 ∈ 𝐵 are linearly
independent (sometimes “basis” also refers to such a set of vectors, that is, to a
basis of the column space of 𝐴). For a basis 𝐵 of 𝐴, a feasible solution 𝑥 to (5.31)
(that is, 𝐴𝑥 = 𝑏 and 𝑥 ≥ 0) where 𝑥 𝑗 > 0 implies 𝑗 ∈ 𝐵 is called a basic feasible
solution. The components 𝑥 𝑗 of 𝑥 for 𝑗 ∈ 𝐵 are called basic variables.
For a basis 𝐵, let 𝑁 = {1, . . . , 𝑛} −𝐵, where 𝑗 ∈ 𝑁 means 𝑥 𝑗 is a nonbasic variable.
Let 𝐴𝐵 denote the 𝑚 × 𝑚 submatrix of 𝐴 that consists of the basic columns 𝐴 𝑗 for
𝑗 ∈ 𝐵, and let 𝐴 𝑁 be the submatrix that consists of the nonbasic columns. Similarly,
let 𝑥 𝐵 and 𝑥 𝑁 be the subvectors of 𝑥 with components 𝑥 𝑗 for 𝑗 ∈ 𝐵 and 𝑗 ∈ 𝑁,
respectively. We write the equations 𝐴𝑥 = 𝑏 in the form 𝐴𝑥 = 𝐴𝐵 𝑥 𝐵 + 𝐴 𝑁 𝑥 𝑁 = 𝑏,
assuming a suitable arrangement of the columns 𝐴 𝑗 into 𝐴𝐵 and 𝐴 𝑁 .
Because the column vectors of 𝐴𝐵 are linearly independent, 𝐴𝐵 is invertible,
and the basic solution 𝑥 to 𝐴𝑥 = 𝐴𝐵 𝑥 𝐵 + 𝐴 𝑁 𝑥 𝑁 = 𝑏 that corresponds to the basis 𝐵
is uniquely given by 𝑥𝑁 = 0 and 𝑥𝐵 = 𝐴𝐵⁻¹𝑏. By definition, this is a basic feasible
solution if it is nonnegative, that is, if 𝑥 𝐵 ≥ 0.
A basic feasible solution is uniquely specified by a basis 𝐵. The converse does
not hold because 𝑥 𝐵 may have zero components 𝑥 𝑖 for some 𝑖 ∈ 𝐵. Such a basis 𝐵
and its corresponding basic feasible solution is called degenerate. In that case, 𝐵 can
equally well be replaced by a basis 𝐵 − {𝑖} ∪ {𝑗} (for suitable 𝑗 ∈ 𝑁) that has the
same basic feasible solution 𝑥. This lack of uniqueness requires certain precautions
when defining the simplex algorithm; for the moment, we assume that no feasible
basis is degenerate.
By Lemma 5.9, if the LP (5.31) has a feasible solution, then it also has a basic
feasible solution, because any solution to 𝐴𝑥 = 𝑏 can be iteratively modified until 𝑏
is only a positive linear combination of linearly independent columns of 𝐴. If these
are fewer than 𝑚 columns, they can be extended with suitable further columns 𝐴 𝑗
to form a basis of the column space of 𝐴, with corresponding coefficients 𝑥 𝑗 = 0;
the corresponding basic feasible solution is then degenerate.
The simplex algorithm works exclusively with basic feasible solutions, which
are iteratively changed to improve the objective function. Thereby, it suffices to
change only one basic variable at a time, which is called pivoting.
Assume that a basic feasible solution to the LP (5.31) has been found. In
general, this requires an initialisation phase of the simplex algorithm, which fails
if the LP is infeasible, a case that is discovered at that point. We will describe this
initialising “first phase” later.
Consider a basic feasible solution with basis 𝐵, and let 𝑁 denote the index set
of the nonbasic columns as above. The following equations are equivalent for any
𝑥 ∈ R𝑛 :
𝐴𝑥 = 𝑏
𝐴𝐵 𝑥 𝐵 + 𝐴𝑁 𝑥 𝑁 = 𝑏
𝐴𝐵⁻¹𝐴𝐵 𝑥𝐵 + 𝐴𝐵⁻¹𝐴𝑁 𝑥𝑁 = 𝐴𝐵⁻¹𝑏
𝑥𝐵 = 𝐴𝐵⁻¹𝑏 − 𝐴𝐵⁻¹𝐴𝑁 𝑥𝑁
𝑥𝐵 = 𝐴𝐵⁻¹𝑏 − ∑𝑗∈𝑁 𝐴𝐵⁻¹𝐴𝑗 𝑥𝑗 (5.68)
𝑥𝐵 = 𝑏̄ − ∑𝑗∈𝑁 𝐴̄𝑗 𝑥𝑗

where 𝑏̄ = 𝐴𝐵⁻¹𝑏 and 𝐴̄𝑗 = 𝐴𝐵⁻¹𝐴𝑗 for 𝑗 ∈ 𝑁. The objective function then fulfills
𝑐⊤𝑥 = 𝑐𝐵⊤ 𝑥𝐵 + 𝑐𝑁⊤ 𝑥𝑁
    = 𝑐𝐵⊤ (𝐴𝐵⁻¹𝑏 − 𝐴𝐵⁻¹𝐴𝑁 𝑥𝑁) + 𝑐𝑁⊤ 𝑥𝑁
    = 𝑐𝐵⊤ 𝐴𝐵⁻¹𝑏 + (𝑐𝑁⊤ − 𝑐𝐵⊤ 𝐴𝐵⁻¹𝐴𝑁) 𝑥𝑁 (5.69)
    = 𝑐𝐵⊤ 𝐴𝐵⁻¹𝑏 + ∑𝑗∈𝑁 (𝑐𝑗 − 𝑐𝐵⊤ 𝐴𝐵⁻¹𝐴𝑗) 𝑥𝑗
which expresses the objective function 𝑐⊤𝑥 in terms of the nonbasic variables, as
in the equation below the horizontal line in the examples (5.58), (5.61), (5.64). In
(5.69), 𝑐𝐵⊤𝐴𝐵⁻¹𝑏 is the value of the objective function for the basic feasible solution where 𝑥𝑁 = 0. This is an optimal solution if
𝑐𝑗 − 𝑐𝐵⊤ 𝐴𝐵⁻¹𝐴𝑗 ≤ 0 for all 𝑗 ∈ 𝑁 , (5.70)

because then (5.69) shows that 𝑐⊤𝑥 ≤ 𝑐𝐵⊤𝐴𝐵⁻¹𝑏 for every feasible solution 𝑥 (which fulfills 𝑥𝑁 ≥ 0). Suppose now that (5.70) does not hold, that is,

𝑐𝑗 − 𝑐𝐵⊤ 𝐴𝐵⁻¹𝐴𝑗 > 0 for some 𝑗 ∈ 𝑁 . (5.72)
In that case, the value of the objective function will be increased if 𝑥 𝑗 can assume a
positive value. The simplex algorithm therefore looks for such a 𝑗 in (5.72) and
makes 𝑥 𝑗 a new basic variable, called the entering variable. The index 𝑗 is said to
enter the basis. This has to be done while preserving feasibility, and so that there
are again 𝑚 basic variables. Thereby, some element 𝑖 of 𝐵 leaves the basis, where 𝑥 𝑖
is called the leaving variable.
To demonstrate this change of basis, consider the last equation in (5.68) that
expresses the variables 𝑥 𝐵 in terms of the nonbasic variables 𝑥 𝑁 . Assume that all
components of 𝑥 𝑁 are kept zero except 𝑥 𝑗 . Then (5.68) has the form
𝑥𝐵 = 𝑏̄ − 𝐴̄𝑗 𝑥𝑗 , (5.73)

and the objective function (5.69) equals

𝑐⊤𝑥 = 𝑐𝐵⊤ 𝑏̄ + (𝑐𝑗 − 𝑐𝐵⊤ 𝐴̄𝑗) 𝑥𝑗 with 𝑐𝑗 − 𝑐𝐵⊤ 𝐴̄𝑗 > 0 . (5.74)

If 𝐴̄𝑗 has no positive components, then 𝑥𝑗 can be made arbitrarily large, where because of (5.74) 𝑐⊤𝑥 increases arbitrarily and the LP is unbounded.
Hence, suppose that some components of 𝐴̄𝑗 are positive. It is useful to consider the 𝑚 rows of the equation (5.73) as numbered with the elements of 𝐵, because the left-hand side of that equation is 𝑥𝐵 (in practice, one would record for each row 1, . . . , 𝑚 of the dictionary represented by the last equation in (5.68) the respective element of 𝐵, as for example in (5.64)). That is, let the components of 𝐴̄𝑗 be 𝑎̄𝑖𝑗 for 𝑖 ∈ 𝐵. At least one of them is positive, and any of these positive elements imposes an upper bound on the choice of 𝑥𝑗 in (5.73) so that 𝑥𝐵 stays nonnegative, by the condition 𝑏̄𝑖 − 𝑎̄𝑖𝑗 𝑥𝑗 ≥ 0 or equivalently 𝑏̄𝑖 /𝑎̄𝑖𝑗 ≥ 𝑥𝑗 (because 𝑎̄𝑖𝑗 > 0). This defines the maximum choice of 𝑥𝑗 by the following so-called minimum ratio test (which we have encountered in similar form before in (5.24)):

𝑥𝑗 = min { 𝑏̄𝑖 / 𝑎̄𝑖𝑗 | 𝑖 ∈ 𝐵 , 𝑎̄𝑖𝑗 > 0 } . (5.75)
For at least one 𝑖 ∈ 𝐵, the minimum ratio is achieved as stated in (5.75). The
corresponding variable 𝑥 𝑖 is made the leaving variable and becomes nonbasic.
This defines the pivoting step: The entering variable 𝑥 𝑗 is made basic and the
leaving variable 𝑥 𝑖 is made nonbasic, and the basis 𝐵 is replaced by 𝐵 − {𝑖} ∪ {𝑗}.
We show that the column vectors 𝐴 𝑘 for 𝑘 ∈ 𝐵 − {𝑖} ∪ {𝑗} are linearly independent.
Consider a linear combination of these vectors that represents the zero vector,
∑𝑘 𝐴𝑘 𝑡𝑘 = 0, which implies ∑𝑘 𝐴𝐵⁻¹𝐴𝑘 𝑡𝑘 = 0. The vectors 𝐴𝐵⁻¹𝐴𝑘 for 𝑘 ∈ 𝐵 − {𝑖} are unit vectors with zeros in all rows except row 𝑘, so their linear combination has a zero in row 𝑖. On the other hand, 𝐴𝐵⁻¹𝐴𝑘 for 𝑘 = 𝑗 is the vector 𝐴̄𝑗 which in row 𝑖 has entry 𝑎̄𝑖𝑗 > 0. This implies 𝑡𝑗 = 0. For all 𝑘 ∈ 𝐵 − {𝑖}, the vectors 𝐴𝑘 are linearly
independent, so that 𝑡 𝑘 = 0 for all 𝑘. Thus, 𝐵 − {𝑖} ∪ {𝑗} is indeed a new basis.
We have described an iteration of the simplex algorithm. In summary, it
consists of the following steps.
1. Given a basic feasible solution as in (5.68) with basis 𝐵, choose some entering
variable 𝑥 𝑗 according to (5.72). If no such variable exists, stop: the current
solution is optimal.
2. With 𝑏̄ = 𝐴𝐵⁻¹𝑏 and 𝐴̄𝑗 = 𝐴𝐵⁻¹𝐴𝑗 , determine the maximum value of 𝑥𝑗 so that 𝑏̄ − 𝐴̄𝑗 𝑥𝑗 ≥ 0. If there is no such maximum because 𝐴̄𝑗 ≤ 0, then stop: the LP is unbounded. Otherwise, set 𝑥𝑗 to the minimum ratio in (5.75).

3. Replace the current basic feasible solution 𝑥𝐵 = 𝑏̄ by 𝑥𝐵 = 𝑏̄ − 𝐴̄𝑗 𝑥𝑗 . At least one component 𝑥𝑖 of this vector is zero, which is made the leaving variable and is replaced by the entering variable 𝑥𝑗 . Replace the basis 𝐵 by 𝐵 − {𝑖} ∪ {𝑗}.
Go back to Step 1.
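A compact numpy sketch of these steps (recomputing 𝐴𝐵⁻¹ from scratch in every round rather than updating it, and assuming nondegeneracy; an illustration, not an efficient implementation):

```python
import numpy as np

def simplex(A, b, c, B, tol=1e-9):
    """Maximise c^T x subject to A x = b, x >= 0 (the LP (5.31)),
    starting from a basis B (list of m column indices) whose basic
    solution A_B^{-1} b is nonnegative. Returns x, or None if unbounded."""
    m, n = A.shape
    while True:
        AB_inv = np.linalg.inv(A[:, B])
        b_bar = AB_inv @ b                   # current values of x_B
        reduced = c - (c[B] @ AB_inv) @ A    # c_j - c_B^T A_B^{-1} A_j
        reduced[B] = 0.0
        if np.all(reduced <= tol):           # condition (5.70): optimal
            x = np.zeros(n); x[B] = b_bar
            return x
        j = int(np.argmax(reduced))          # entering variable, (5.72)
        Aj_bar = AB_inv @ A[:, j]
        pos = Aj_bar > tol
        if not np.any(pos):
            return None                      # LP is unbounded
        ratios = np.full(m, np.inf)
        ratios[pos] = b_bar[pos] / Aj_bar[pos]
        B[int(np.argmin(ratios))] = j        # minimum ratio test (5.75)

# Equality form of the example from Section 5.11, with slack columns s1, s2:
A = np.array([[3., 4., 2., 1., 0.], [1., 1., 1., 0., 1.]])
print(simplex(A, np.array([7., 2.]), np.array([8., 10., 5., 0., 0.]),
              B=[3, 4]))   # [1. 1. 0. 0. 0.], objective value 18
```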
A given basis 𝐵 determines a unique basic feasible solution 𝑥 𝐵 . By increasing
the value of the entering variable 𝑥 𝑗 , the feasible solution changes according to
(5.73). During this continuous change, this feasible solution is not a basic solution:
it has 𝑚 + 1 positive variables, namely 𝑥ℓ for ℓ ∈ 𝐵 and 𝑥 𝑗 . Unless the LP is
unbounded, 𝐴̄𝑗 has positive components 𝑎̄ℓ𝑗 , so the respective variables 𝑥ℓ decrease
while 𝑥 𝑗 increases. The smallest value of 𝑥 𝑗 where at least one component 𝑥 𝑖 of 𝑥 𝐵
becomes zero is given by the minimum ratio in (5.75). At this point, again only 𝑚
(or fewer) variables are nonzero, and the leaving variable 𝑥 𝑖 is replaced by 𝑥 𝑗 so
that indeed a new basic feasible solution and corresponding basis 𝐵 − {𝑖} ∪ {𝑗} is
obtained.
However, the simplex algorithm does not require a continuous change of
the values of 𝑚 + 1 variables. Instead, the value of the entering variable 𝑥𝑗 can
directly “jump” to the minimum ratio in (5.75). What is important is the next
basis. The change of the basis requires an update of the inverse 𝐴𝐵⁻¹ of the basis
matrix in order to obtain the new dictionary in (5.68). (As demonstrated in the
previous section, this update can be implemented by suitable row operations on
the “tableau” representation of the dictionary, which also determines the new
basic feasible solution.) In this view, the simplex algorithm is a combinatorial
method that computes a sequence of bases, which are certain finite subsets of the
set {1, . . . , 𝑛} that represents the columns of the original system 𝐴𝑥 = 𝑏.
We have made an important assumption, namely that no feasible basis is
degenerate, that is, all basic variables in a basic feasible solution have positive
values. This implies 𝑏̄𝑖 > 0 in (5.75), so that the entering variable 𝑥𝑗 takes on a
positive value and the objective function for the basic feasible solution increases
with each iteration by (5.74). Hence, no basis is revisited, and the simplex algorithm
terminates because there are only finitely many bases. Furthermore, Step 3 of the
above summary shows that in the absence of degeneracy the leaving variable 𝑥 𝑖 ,
and thus the minimum in the minimum-ratio test (5.75), is unique, because if two
variables could leave the basis because they become zero at the same time, then
only one of them leaves and the other remains basic but has value zero in the new
basic feasible solution.
If there are degenerate basic feasible solutions, then the minimum in (5.75)
may be zero because 𝑏̄𝑖 = 0 for some 𝑖 where 𝑎̄𝑖𝑗 > 0. Then the entering variable
𝑥 𝑗 , which was zero as a nonbasic variable, enters the basis but stays at zero in
the new basic feasible solution. In that case, only the basis has changed but not
the feasible solution and also not the value of the objective function. In fact, it is
possible that this results in a cycle of the simplex algorithm (when the same basis
is revisited) and thus a failure to terminate. This behaviour is rare, and degeneracy itself is an "accident" that only occurs when there are special relationships between the entries of the constraint matrix 𝐴 and the right-hand side 𝑏. Nevertheless, degeneracy can be dealt with in a
systematic manner, which we do not treat in this guide. For a detailed treatment
see chapter 3 of Chvátal (1983).
We also need to find an initial feasible solution to start the simplex algorithm.
For that purpose, we use a “first phase” with a different objective function that
establishes whether the LP (5.31) is feasible, similar to the approach in (5.27). First,
choose an arbitrary basis 𝐵 and let 𝑏̄ = 𝐴𝐵⁻¹𝑏. If 𝑏̄ ≥ 0, then 𝑥𝐵 = 𝑏̄ is already a basic feasible solution and nothing needs to be done. Otherwise, 𝑏̄ has at least one
negative component. Define the 𝑚-vector ℎ = 𝐴𝐵 1 where 1 is the all-one vector.
That is, ℎ is just the sum of the columns of 𝐴𝐵 . We add −ℎ as an extra column to
the system 𝐴𝑥 = 𝑏 with a new variable 𝑡 and consider the following LP:
maximise −𝑡
subject to 𝐴𝑥 − ℎ𝑡 = 𝑏 (5.76)
𝑥, 𝑡≥0
We find a basic feasible solution to this LP with a single pivoting step from the
(infeasible) basis 𝐵. Namely, the following are equivalent, similar to (5.68):
𝐴𝑥 − ℎ𝑡 = 𝑏
𝐴𝐵 𝑥𝐵 + 𝐴𝑁 𝑥𝑁 − ℎ𝑡 = 𝑏
𝑥𝐵 = 𝐴𝐵⁻¹𝑏 − 𝐴𝐵⁻¹𝐴𝑁 𝑥𝑁 + 𝐴𝐵⁻¹ℎ 𝑡 (5.77)
𝑥𝐵 = 𝑏̄ − 𝐴𝐵⁻¹𝐴𝑁 𝑥𝑁 + 1 𝑡
where we now let 𝑡 enter the basis and increase 𝑡 such that 𝑏̄ + 1 𝑡 ≥ 0. For the
smallest such value of 𝑡, at least one component 𝑥 𝑖 of 𝑥 𝐵 is zero and becomes
the leaving variable. After the pivot with 𝑥 𝑖 leaving and 𝑡 entering the basis one
obtains a basic feasible solution to (5.76).
The LP (5.76) is therefore feasible, and its objective function bounded from
above by zero. The original system 𝐴𝑥 = 𝑏, 𝑥 ≥ 0 is feasible if and only if the
optimum in (5.76) is zero. Suppose this is the case, which will be found out
by solving the LP (5.76) with the simplex algorithm. Then this “first phase”
terminates with a basic feasible solution to (5.76) where 𝑡 = 0 which is then also a
feasible solution to 𝐴𝑥 = 𝑏, 𝑥 ≥ 0. The simplex algorithm can then proceed with
maximising the original objective function 𝑐⊤𝑥 as described earlier.
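The construction of the first-phase LP (5.76) is equally mechanical. A sketch in numpy (an illustration, not part of the guide) that builds the extended data and the feasible starting basis obtained from the single pivot described above:

```python
import numpy as np

def phase_one(A, b, B):
    """Set up the first-phase LP (5.76) for a basis B of A whose basic
    solution is not feasible: maximise -t s.t. A x - h t = b, x, t >= 0,
    with h = A_B @ 1. Returns the extended data and a feasible basis."""
    m, n = A.shape
    h = A[:, B] @ np.ones(m)                  # h = A_B 1, sum of basic columns
    A_ext = np.hstack([A, -h.reshape(m, 1)])  # extra column -h for t
    c_ext = np.concatenate([np.zeros(n), [-1.0]])
    b_bar = np.linalg.solve(A[:, B], b)
    r = int(np.argmin(b_bar))                 # t = -min b_bar makes x_B >= 0
    B0 = list(B)
    B0[r] = n                                 # t (column index n) enters
    return A_ext, c_ext, B0
```

The phase-one LP can then be solved with any simplex routine (such as the sketch in the previous section); if its optimum has 𝑡 = 0, dropping the extra column yields a basic feasible solution of the original LP.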
5.13 Reminder of Learning Outcomes

Having worked through this chapter, you should be able to:

• state the dual LP of an LP in inequality form, and also later (see Section 5.7)
for an LP in equality form or with unconstrained variables
• state the Lemma of Farkas, and apply it to examples (as in Exercise 5.2)
• understand the differences between feasible, infeasible, and unbounded LPs
and how this relates to the dual LP
• state the complementary slackness condition and apply it to finding optimal
solutions in small examples
• describe the role of dictionaries for the simplex algorithm
• apply the simplex algorithm to small examples.
5.14 Exercises for Chapter 5

Exercise 5.1.
(a) Draw a picture of the set of points (𝑥1 , 𝑥2 ) in R2 that fulfill the following
inequalities:
𝑥1 ≥ 0
𝑥2 ≥ 0
−𝑥 1 − 𝑥 2 ≤ − 2
−𝑥 1 + 𝑥 2 ≤ 0
𝑥2 ≤ 3.
Exercise 5.3 (see Remark 5.12). Let 𝐶 be the cone generated by the vectors 𝐴𝑗 = (𝑗, 1) in R2 for 𝑗 = 0, 1, 2, . . .

Draw a picture of the vectors 𝐴𝑗 for the first few values of 𝑗. What is the set 𝐶?
Does 𝑏 = (1, 0) belong to 𝐶? Is there a vector 𝑦 ∈ R2 such that 𝑦⊤𝐴 𝑗 ≥ 0 for all 𝑗
and 𝑦⊤𝑏 < 0? Discuss the Lemma of Farkas for this example.
Exercise 5.4. Consider the following LP in inequality form with variables 𝑥1 , 𝑥2 , 𝑥3 ≥ 0:

𝑥1 + 𝑥2 + 2𝑥3 ≤ 4
2𝑥 1 + 𝑥2 + 4𝑥 3 ≤ 7
2𝑥 1 + 4𝑥 3 ≤ 5
maximise 3𝑥 1 + 2𝑥2 + 4𝑥 3 .
(a) Write down the Tucker diagram for this LP. Explain why this LP is feasible
and bounded.
(b) Write down this LP in equality form with slack variables 𝑠1 , 𝑠2 , 𝑠3 .
(c) Apply the simplex algorithm to the LP in (b). Always choose the entering
variable with the largest coefficient in the current objective function.
(d) Verify the optimal solution found in (c) with an optimal dual solution, and
explain why both primal and dual optimal solution are unique, with the help
of the complementary slackness conditions.
(e) Find a set of three columns of the system in (b) that does not form a basis.