UNDERSTANDING ARTIFICIAL INTELLIGENCE
Nicolas Sabouret
Illustrations by Lizete De Assis
Comprendre l’intelligence artificielle – published by Ellipses – Copyright
2019, Édition Marketing S.A.
Reasonable efforts have been made to publish reliable data and information,
but the author and publisher cannot assume responsibility for the validity of
all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in
this publication and apologize to copyright holders if permission to publish
in this form has not been obtained. If any copyright material has not been
acknowledged please write and let us know so we may rectify in any future
reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be
reprinted, reproduced, transmitted, or utilized in any form by any electronic,
mechanical, or other means, now known or hereafter invented, including
photocopying, microfilming, and recording, or in any information storage
or retrieval system, without written permission from the publishers.
Introduction
To understand what a computer is and isn’t capable of, one must first
understand what computer science is. Let’s start there.
Computer science is the science of processing information.1 It’s about
building, creating, and inventing machines that automatically process all
kinds of information, from numbers to text, images, or video.
This started with the calculating machine. Here, the information consists
of numbers and arithmetic operations. For example:
346 + 78 = ?
Then, as it was with prehistoric tools, there were advancements over time,
and the information processed became more and more complex. First it was
numbers, then words, then images, then sound. Today, we know how to
make machines that listen to what we say to them (this is “the information”)
and turn it into a concrete action. For example, when you ask your iPhone:
“Siri, tell me what time my doctor’s appointment is,” the computer is the
machine that processes this information.
COMPUTERS AND ALGORITHMS
  1 1
  3 4 6
+   7 8
= 4 2 4
Symbol 1 → Write 0,
Move tape one cell to the left,
Resume instruction 1267.
The Turing machine analyzes the symbol in the current cell and carries
out the instruction.
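To make this concrete, here is a minimal sketch of a Turing machine in Python. The instruction table is invented for illustration; a real machine's table would contain thousands of entries like these.

    # A toy Turing machine. Each table entry maps (instruction number, symbol
    # read) to: the symbol to write, the head movement, the next instruction.
    table = {
        (1267, "1"): ("0", -1, 1267),  # symbol 1: write 0, move left, resume 1267
        (1267, "0"): ("0", +1, 1268),  # symbol 0: move right, go to instruction 1268
    }

    def run(tape, head, instruction, steps):
        for _ in range(steps):
            symbol = tape.get(head, "0")                      # read the current cell
            write, move, instruction = table[(instruction, symbol)]
            tape[head] = write                                # write the new symbol
            head = head + move                                # move along the tape
        return tape

    print(run({0: "1", 1: "1"}, head=0, instruction=1267, steps=2))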
In a way, this principle resembles choose-your-own-adventure books:
Make a note that you picked up a sword and go to page 37. The comparison
ends here. In contrast to the reader of a choose-your-own-adventure book,
the machine does not choose to open the chest or go into the dragon’s lair: it
only does what the book’s author has written on the page, and it does not
make any decision on its own.
It follows exactly what is written in the algorithm.
Alan Turing showed that his “machines” could reproduce any
algorithm, no matter how complicated. And, indeed, a computer works
exactly like a Turing machine: it has a memory (equivalent to the Turing
machine’s “tape”), it reads symbols contained in memory cells, and it
carries out specific instructions with the help of electronic wires. Thus, a
computer, in theory, is capable of performing any algorithm.
PROGRAMS THAT MAKE PROGRAMS
One must accurately describe, step by step, what the machine must do,
using only the operations allowed by the little electronic wires. Writing
algorithms in this manner is very limiting.
That’s why computer engineers have invented languages and programs
to interpret these languages. For example, we can ask the machine to
transform the + symbol into the series of operations described above.
This makes programming much easier, as one can reuse already-written
programs to write other, more complex ones – just like with prehistoric
tools! Once you have the wheel, you can make wheelbarrows, and with
enough time and energy you can even make a machine to make wheels.
AND WHERE DOES ARTIFICIAL
INTELLIGENCE FIT IN ALL THIS?
We are a long way from seeing a computer write a program on its own to
resolve a problem it has not been designed for. It is too difficult. There are
too many differences in the data sets, and too many possible rules to write
for such a program to run correctly. An AI that could resolve any problem
and make you a cup of coffee as a bonus simply does not exist because each
problem requires a specific program and the data to match it.
To write AlphaGo, the champion Go program, the engineers at Google
had to write an initial program to observe portions of Go games. Using
these data, they wrote a second program capable of providing the best move
in each situation. This required years of work! And they had to give the
program good information so that it could learn from the data.
The result obtained is specific to Go: you can’t just reuse it on another
game. You could certainly adapt the analytical program and have it use
different data to build another program capable of providing the best move
in chess or checkers. It has been done by Google. But this requires the
programmers to analyze the generation program, to modify it, to describe
the rules of chess, and to explain how to analyze a chess board, which is, all
in all, much different from a Go board. Your program cannot modify itself
to produce something intelligent.
Would we expect a race car to be capable of beating eggs and washing
laundry?
SO, WHAT IS AI?
_____________
1 Indeed, in some languages, such as French and German, computer science is called “informatics,”
which has the same root as “information.”
The Turing Test 2
Understanding That It Is Difficult to Measure the Intelligence of Computers
The first idea that comes to mind is to define intelligence as the opposite of
ignorance. Unfortunately, it’s not that simple. If I ask you what year the city of
Moscow was founded, you most likely won’t know the answer. Yet no one
could accuse you of not being intelligent just for that, even though the answer
can be found in any encyclopedia. And you wouldn’t say the encyclopedia is
intelligent simply because it can tell you the date Moscow was founded.
Intelligence doesn’t consist of knowledge alone. You must also be able to use
what you know.
A second idea would be to associate intelligence with the ability to answer
difficult questions. For example, let’s consider Georg Cantor, the
mathematician who founded set theory in the 19th century. To deal with such
in-depth mathematical problems, he was clearly very intelligent. However,
one’s ability in math is not enough to characterize intelligence. If I ask you
what 26,534 × 347 is, it might take you some time to get the answer. However,
a calculator will give you the result in an instant. You wouldn’t say that the
calculator is intelligent, not any more intelligent than the encyclopedia.
These two examples demonstrate that while computers are particularly
well equipped to handle anything requiring memory and calculation (they were
designed for this purpose), this doesn’t make them intelligent, because human
intelligence consists of many other aspects. For example, we are able to reason
by relying on our past experiences. This is what a doctor does when making a
diagnosis, but this is also what each of us does when we drive a car, when we
plan a vacation, or when we work. We make decisions in situations that would
be impossible to describe accurately enough for a computer. When a baker
tells us he’s out of flour, we understand that he can’t make any more bread.
There’s no need for him to explain.
We also know how to learn new skills at school or at work. We are able to
use previous examples to form new concepts, create new ideas, and imagine
new tools, using our ability to reason. Above all, we are able to communicate
by using words, constructing sentences, and putting symbols together to
exchange complex, often abstract, notions. When you tell your friends about
your favorite novel, they understand the story without needing to read the
exact words in the book. Thanks to our intelligence, we see the world in a new
way every instant of our lives, and we are capable of conveying it to those
around us.
All of this goes to show that if we want to compare a machine’s abilities to
human intelligence, we’re going to have to try something more than just
memory and math.
A TEST, BUT WHICH ONE?
In 1950, Alan Turing proposed a different kind of test to study the ability of
machines. Instead of measuring intelligence, his idea was to simply
differentiate between man and machine. We call this the Turing test, and today
it is still considered a reference when it comes to AI programs.
The principle is quite simple. It is inspired by a game that was played at
social events, though it has long since gone out of style. Here you have it
anyway, so you can spice up your birthdays and surprise parties. In this game,
a man and a woman each go into a different room. The other guests, who
remain in the living room, can “talk” with the man or the woman by sending
notes to them through the servants. The guests cannot speak directly with the
isolated persons, and they do not know who is in which room. They have to
guess who is the man and who is the woman, and they know that both of them
are trying to convince the guests that they are the woman.
In the Turing test, you have a human in a room and a computer with an
artificial intelligence program in the other. You can communicate using a
keyboard (to type your messages) and a screen (to read their responses). Thus,
you have a keyboard and a screen to speak with the human, and another
keyboard and screen to speak with the program, but you do not know which
one is connected to the AI program imitating a human and which one is
connected to the real human typing the responses. To complicate things a little
more, all the responses are given within the same time interval. The speed of
the answers isn’t taken into consideration: you can only consider the content of
the answers to distinguish between the human and the AI.
IT CHATS…
The Turing test is a challenge that still keeps many computer scientists busy
today. It has given rise to the famous “chatbots,” the artificial intelligence
programs capable of speaking and answering just about any question one
might ask them, from “What do you call cheese that is not yours?” to “Did you
know you have beautiful eyes?” An international competition has even been
created to recognize the best chatbots each year: the Loebner Prize. In 1990,
Hugh Loebner promised $100,000 and a solid gold medal to the first
programmer to develop a chatbot capable of passing the Turing test. As no one
succeeded, an annual competition has been held since 1991: chatbots are presented
to judges who try, as quickly as possible, to make them give themselves away. It
is very amusing. If you want to give it a try, search for any chatbot program.
There are plenty of them on the internet and in smartphone applications. You’ll
very quickly get the chatbot to say something ridiculous. But you might also
be pleasantly surprised by the quality of some responses. That being said,
know that the official Loebner Prize judges, who are experts in the subject, are
capable of identifying a chatbot in fewer than five questions.
A CHATBOT NAMED ELIZA
We must accept a sad truth: machines are not intelligent. At least, not in the
same sense as when we say a person is intelligent. In a way, this is comforting
for our ego: the thing that allows a machine to accomplish difficult tasks is the
intelligence humans put into their algorithms. Well, then, what do AI
algorithms look like?
One might think they are really special algorithms with forms or structures
different from other programs, thus creating an illusion that machines are
intelligent. Nothing of the sort. Like all algorithms, they look like a cooking
recipe. And like all programs, they are applied, step by step, by a machine that
basically operates like the reader of a choose-your-own-adventure book.
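ELIZA, written by Joseph Weizenbaum in the 1960s, was exactly such a recipe: spot a keyword in the user's sentence and echo part of it back inside a canned response. Here is a minimal sketch in that spirit; the three rules are invented for illustration.

    import re

    # A toy chatbot in the spirit of ELIZA: find a keyword pattern, then echo
    # part of the user's sentence back inside a canned response.
    rules = [
        (re.compile(r"\bI am (.*)", re.IGNORECASE), "Why do you say you are {}?"),
        (re.compile(r"\bI feel (.*)", re.IGNORECASE), "How long have you felt {}?"),
        (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Tell me more about your {}."),
    ]

    def reply(sentence):
        for pattern, answer in rules:
            match = pattern.search(sentence)
            if match:
                return answer.format(match.group(1).rstrip(".!?"))
        return "Please go on."                # default when no keyword matches

    print(reply("I am worried about my exam."))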
However, it takes years of research, by scientists who themselves have
completed years of study, to develop each AI method. Why is it so difficult? It
is important to keep in mind that there is no general technique for making AI,
but rather many different methods and, therefore, many different algorithms.
They don’t look alike at all, but they often have something in common. They
are all designed to get around the two primary limitations computers have:
memory and processing capacity.
LIMITATIONS OF COMPUTERS
Computers are machines equipped with incredible memory and are able to
make a calculation in less time than it takes to say it. However, like all
machines, a computer can reach its limits.
Take the photocopier, for example. It’s a machine that allows you to copy
documents very quickly. That’s what it’s made for. To measure how well it
performs, ask your grandparents what they had to do with carbon paper, or
think about the monk scribes of the Middle Ages. Yet each photocopier can
only produce a limited number of pages per minute, which is determined by its
physical constraints.
For computers, it’s the same. They compute very quickly; that’s why they
were invented. To give you an idea, a personal computer developed in the
2010s can do billions of additions per second. That’s a lot (remember the
multiplication problem in chapter 2 that you never finished!). And this
number hasn’t stopped increasing since the 1950s! But there comes a time
when it isn’t enough.
If you need to make a calculation that requires 10 billion operations, you’ll
have to wait a few seconds for the result. But if your program needs a trillion
operations to solve the problem you’ve given to your computer, you’ll have to
wait 15 minutes. And if it needs 100 trillion operations, you’ll have to wait an
entire day!
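This waiting-time arithmetic fits in a few lines of Python, assuming the round figure of one billion operations per second used above:

    SPEED = 1_000_000_000                     # operations per second
    for operations in (10**10, 10**12, 10**14):
        seconds = operations / SPEED
        print(f"{operations:.0e} operations: {seconds:,.0f} seconds "
              f"(about {seconds / 3600:.1f} hours)")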
A REAL HEADACHE
You’re going to say to me: 100 trillion operations, that’s impossible, it’s too
much! Don’t be fooled: when we write algorithms, the number of operations
becomes impressive really fast. Let’s use a silly example: imagine we wanted
to write a program that makes schedules for a school. It’s a problem vice-
principals spend a portion of their summer on every year. They would surely
be very happy if a machine did it all for them. First, we’ll start with a program
that calculates all the possible schedule combinations for the students. Next,
we’ll see how to assign the teachers.
OK, we might as well admit it right off the bat, this isn’t going to work.
But we’ll try anyway.
We have a school with 10 rooms for 15 classes (it’s a small school with
only five classes per grade, let’s imagine). For each hour of class, there are
around 3,000 possibilities for choosing the ten classes that can meet then. And
for each eight-hour day of classes, that makes 6 billion billion billion
possibilities (a 6 with 27 zeros). Yikes! If we have to consider them one by
one, we’ll never finish, even with a computer that can do a billion operations
per second.
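You can check this frightening arithmetic in a few lines of Python:

    from math import comb

    per_hour = comb(15, 10)   # ways to choose the 10 classes that meet: 3,003
    per_day = per_hour ** 8   # eight independent choices in a day of classes
    print(per_hour, per_day)  # 3003, then roughly 6.6e27: a 6 with 27 zeros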
In computer science, it didn’t take long to come up against this limitation
on computing power. Actually, this question is at the core of computer
program development. The difficulty for a computer scientist isn’t just about
finding an automated process to resolve a given problem. The process must
also produce a result in a reasonable amount of time.
Alan Turing, whom we have already spoken of, was faced with this
difficulty when he worked on deciphering German secret codes during World
War II. At the time, it was already possible to build a machine that could try,
one after another, all the possible codes the Germans might use to encrypt their
messages using the Enigma machine. Combination after combination, the
machine would compare the encoded word with a word from the message until
it found the right one. In theory, this should work. But in practice, these
incredible machines weren’t fast enough to decipher the messages. They were
capable of doing a job that required thousands and thousands of men, but it
wasn’t enough. All the Germans had to do was change the Enigma
combination every 24 hours, and the code breakers had to start all over again.
To “crack” the Enigma machine’s codes, Alan Turing’s team had to find a
different way that didn’t require so many operations.
COUNTING OPERATIONS
For each guest, this algorithm performs two operations (take an egg and
break it open). Additionally, for every six guests, it performs an additional
operation (open a carton). Finally, the eggs have to be scrambled, poured into
the frying pan, and cooked, which takes 122 operations. If N is the number of
guests, then the algorithm performs a total of 122 + (N / 6) + (2 × N)
operations. Voilà, this is its complexity.
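In Python, this counting looks like the following sketch (the recipe steps are summarized in the comments, and the division is taken to mean whole cartons opened):

    def operations(guests):
        # Two operations per guest (take an egg, break it open), one per
        # carton of six (open it), plus 122 fixed operations for scrambling,
        # pouring, and cooking.
        return 122 + (guests // 6) + (2 * guests)

    for guests in (1, 6, 60, 600):
        print(guests, "guests:", operations(guests), "operations")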
As you can see, the complexity of an algorithm isn’t just a number: it’s a
formula that depends on the data of the problem in question. In our example,
the complexity depends on the number of guests: the more guests we have, the
more operations we need. It’s the same for a computer. For a given algorithm,
the more data we give it, the more operations it will have to perform to resolve
the problem.
Computer scientists calculate an algorithm’s complexity based on the
amount of data provided to the algorithm. Thus, the algorithm’s complexity
depends on the “size” of the problem – in other words, the number of zeros
and ones we have to write in the computer’s memory to describe the problem
in question.
THIS IS GETTING COMPLEX
But that isn’t all. Computer scientists deal with another complexity: the
complexity of the problem.
This is defined as the minimum number of operations needed to resolve the
problem with a computer. In other words, this corresponds to the complexity
of the fastest algorithm capable of resolving the problem.
The algorithm isn’t necessarily known: this is a theoretical complexity. If
we knew how to write a super algorithm, the best of all possible algorithms to
resolve this problem, this is what its complexity would be like. This doesn’t
mean, however, that we know how to write such an algorithm. Quite often,
computer scientists have found other algorithms, with even higher
complexities, and they rack their brains trying to find better ones.
In summary, when we give a problem to a group of computer scientists,
they will first use mathematical proofs to determine the minimum number of
operations needed to resolve the problem using a computer (the problem’s
theoretical complexity). Next, they’re going to develop algorithms that will
attempt to come near this complexity limit, if possible.
But where this gets really fun is when there are a whole bunch of problems
for which the theoretical complexity is already way too high to hope to make
an effective algorithm. In other words, even the best possible algorithm,
assuming we know how to write it, couldn’t solve the problem in a reasonable
number of operations – not even with the computers of the future, which will
be millions of times faster!
SQUARING THE CIRCLE?
There are a whole slew of problems that a computer will never be able to
perfectly solve, no matter what happens. Computer scientists already know
this. Yet almost all of the problems AI attempts to solve fall in this category:
playing chess, recognizing a face, preparing a schedule, translating a statement
into a foreign language, driving a car, and so on. The number of operations
needed to find an exact solution to the problem is so large that it goes beyond
what we can imagine in terms of computing power.
Indeed, artificial intelligence algorithms attempt to find a solution to these
“impossible” problems. Obviously, the proposed solution will not be perfect
(as we just saw, this is impossible!). The algorithm will incorrectly recognize a
face from time to time, or it will make an error when playing chess or
translating a sentence. But this algorithm will work in a “reasonable” time – in
other words, with a lot fewer operations than what the theoretical limit requires
for an exact solution.
Very often, making AI consists of developing programs that give us a not-
so-bad solution in a reasonable amount of time, when we know the right
solution is simply not obtainable.
Throughout the rest of this book, we are going to see a whole bunch of
artificial intelligence methods that are not only not intelligent, they also don’t
necessarily provide a good solution to the problem in question. Admit it, AI’s
“intelligence” is starting to look doubtful!
That being said, however, you’re going to see that it still works pretty well.
Lost in the Woods 4
Understanding a First AI Method – Exploration
Let us begin our journey into the land of AI with a simple problem humans are
quite good at: “finding their way.”
A SMALL WALK IN PARIS
Imagine that you come out of the Bréguet-Sabin subway stop right in the
middle of Paris, and you need to go to Place des Vosges to visit the Victor
Hugo House. As shown on the following map, you need to take Boulevard
Richard-Lenoir toward Place de la Bastille for a hundred yards and then turn
right onto Rue du Pasteur Wagner. Next, cross Boulevard Beaumarchais and
take Rue du Pas-de-la-Mule until you reach Place des Vosges, the second street
on the left. The Victor Hugo House is on this street.
Building this path is a task that humans know how to do by using their
intelligence. Unfortunately, not everyone has the same ability for such a
perilous mission. If you don’t pay attention, you might very well end up in
Place de la République.
To get around this difficulty, nowadays we use a GPS, a device that allows
us to determine our location on the earth’s surface at any given moment. The
GPS receives signals from some thirty satellites orbiting about 12,500
miles over our heads. By measuring the time each satellite’s signal takes to
reach it, it calculates its distance from these satellites and determines its
position on the earth’s surface.
HOW A GPS WORKS
To help you find your way, the GPS in your car or your telephone also has a
map. This is an important tool that shows you all the places you can go and
how they’re connected. It is important to imagine what all this means in terms
of data: each address, each portion of road, even if just a few yards long, each
intersection … everything must be represented on there, with GPS
coordinates! Each possible position has a corresponding point on the map, and
they are all connected to each other in a giant network.
Thus, on the small portion of the map we used to go from the subway to
the Victor Hugo House, there are already hundreds of points: the subway exit,
the streets (cut up into as many little sections as necessary), all the
intersections between these small portions of street, all the addresses of all the
buildings on these portions of street … and all these points are connected to
each other.
Thanks to this map and the connections between the different points, the
GPS can show how to go from the subway exit to Boulevard Richard-Lenoir,
from the boulevard to the intersection with Rue du Pasteur Wagner, and so on.
FINDING THE PATH
It’s quite unsettling, isn’t it? The computer is like Theseus in the labyrinth: it
can leave a trail behind it to know what path it has taken, it can remember each
intersection it has come to, but it never knows where the corridor it has chosen
will lead to!
This pathfinding problem on a graph is a classic artificial intelligence
problem: computer scientists call it state space exploration.
Contrary to what the name suggests, it has nothing to do with sending a
rocket into space with a flag of the United States. Not at all. In computer
science, state space is the graph that defines all of the machine’s possible
states. Or, for a GPS, all the possible positions. State space exploration
consists of walking through this space of possibilities, from one point to the
next in search of a given point, all the while memorizing the path up to this
point. It’s Theseus in the labyrinth.
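Here is a minimal sketch of such an exploration in Python. The tiny graph is invented, and each path is carried along like Theseus’s thread:

    from collections import deque

    # Which points are connected to which: an invented toy state space.
    graph = {"subway": ["boulevard"], "boulevard": ["subway", "corner"],
             "corner": ["boulevard", "museum"], "museum": ["corner"]}

    def explore(start, goal):
        queue = deque([[start]])            # paths that remain to be extended
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path                 # the thread that reaches the goal
            for nxt in graph[path[-1]]:
                if nxt not in seen:         # never re-enter a visited corridor
                    seen.add(nxt)
                    queue.append(path + [nxt])

    print(explore("subway", "museum"))      # ['subway', 'boulevard', 'corner', 'museum']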
SO EASY!
If we look more closely, however, there is a big difference between these two
algorithms. The second algorithm, moving in a spiral from the starting point, is
going to study the points on the graph in order based on their distance from
the starting point. That’s why, when it achieves the objective, it will be certain
to have calculated the shortest path (even though, in doing so, it had to go
around in circles). This is clearly what we’re looking for with a GPS. The first
algorithm does not guarantee that the path it finds will be a “good” one. If it
finds one, it may very well send you to Place de la Bastille, Place d’Italie, and
Place de la République, with three changeovers in the subway, before you ever
arrive at the Victor Hugo House.
Thus, the problem with the GPS isn’t just exploring the state space, it’s
finding the best path possible on this darned graph.
COMPUTER SCIENCE IN THREE QUESTIONS
We owe the first effective algorithm for finding the shortest path on the graph
to Edsger Dijkstra. This algorithm uses the spiral idea while considering the
distances between the points on the graph (for example, the length of a section
of street or of an alley).
On the three computer science questions, Dijkstra’s algorithm scores 10
out of 10. First, it finds a path if there is one: we can prove it mathematically.
Second, it does it in a reasonable amount of time: the algorithm’s complexity
is close to the problem’s complexity. Third, it is proven that the path obtained
is always the shortest of all the possible paths. If there is another path, it is
at least as long.
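For the curious, here is a compact sketch of Dijkstra’s algorithm in Python. The toy graph and its distances are invented; the essential idea is to always extend the closest point reached so far:

    import heapq

    graph = {"A": {"B": 2, "C": 5}, "B": {"C": 1, "D": 4},
             "C": {"D": 1}, "D": {}}

    def dijkstra(start, goal):
        heap = [(0, start, [start])]        # (distance so far, point, path taken)
        done = set()
        while heap:
            dist, point, path = heapq.heappop(heap)
            if point == goal:
                return dist, path           # first arrival is the shortest path
            if point in done:
                continue                    # already reached this point faster
            done.add(point)
            for nxt, length in graph[point].items():
                heapq.heappush(heap, (dist + length, nxt, path + [nxt]))

    print(dijkstra("A", "D"))               # (4, ['A', 'B', 'C', 'D'])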
Yet this algorithm is not an AI algorithm. First and foremost, it’s an
effective algorithm for exploring any kind of state space.
AI IS HERE!
To begin making AI, one must consider the data shown on the graph and the
computing time.
With a GPS, the data are the points on the graph – that is, the different
places where you can go. To show the Paris neighborhood we used in our
example, we need a graph containing about a hundred points, perhaps more.
For a small city, you’ll quickly need about ten thousand points.
Dijkstra’s algorithm, like most algorithms, requires more and more
computation as the graph gets larger: its complexity depends on the size of the
inputs, as we saw in the previous chapter.
Specifically, the complexity of Dijkstra’s algorithm is the number of points
on the graph squared. If a graph contains 10 points, we’ll need around 100
operations (10 × 10) to calculate the best path. For 100 points, we’ll need
10,000 operations (100 × 100), and so on. The greater the number of points,
the greater the number of operations.
At this rate, we quickly reach the limit a computer can compute in a
reasonable amount of time. And that’s where artificial intelligence comes in.
We need to find an “intelligent” state space exploration algorithm so we won’t
have to wait an eternity for the GPS to tell us what route to take.
A HEAD IN THE STARS
The artificial intelligence algorithm that can resolve this problem is called “A
star” (or “A*” for those in the know). It was invented by Peter Hart, Nils
Nilsson, and Bertram Raphael in 1968.
The idea behind this algorithm is simple: What takes time in Dijkstra’s
algorithm is ensuring that, at all times, the path produced will be the shortest
path possible. What we need is an algorithm that, without exactly wandering
around randomly, doesn’t necessarily have to examine everything like
Dijkstra’s algorithm does.
This algorithm will not always give the best solutions. But we can hope
that, if it doesn’t choose the points completely at random, it will provide a,
“good” solution. Naturally, we will have taken a bit of a stroll, but it will be in
the right direction.
This is the principle of artificial intelligence.
THE BEST IS THE ENEMY OF THE GOOD
The first thing to understand about AI algorithms is that they do not aim to
obtain an exact solution to the problem posed.
When a cooking recipe says “add 1.5 ounces of butter,” you don’t
necessarily take out your scale and weigh your butter to the ounce. You take an
eight-ounce stick of butter, cut off what looks like about a fifth, and that’s it!
It’s a lot faster! By doing it this way, you are consciously forgoing an exact
solution, but you know that your cake won’t necessarily be that bad. It just
won’t be “perfect.”
This is the idea behind artificial intelligence. When it is not possible to
compute the best solution in a low enough number of operations, we must
content ourselves with a faster, “approximate” solution. We have to give up on
exactness to gain computing time for the machine.
ROUGHLY SPEAKING, IT’S THAT WAY!
Computer scientists call the distance as the crow flies (straight line distance) a
heuristic function. It’s a rule of computation that gives a “not so bad” estimate
of the distance to arrive at each point.
This function must be easy to calculate: it is going to be used each time the
algorithm encounters a new point as it explores the state space. If it takes three
minutes to calculate the heuristic function’s output, we’ll never finish! With
the GPS, this works out well: the straight line distance is very easy to evaluate
because you know each point’s coordinates on the graph.
But the heuristic function must also give the best possible estimate of the
actual distance to the destination point – that is, the distance you’ll obtain
when you take the path on foot, as opposed to flying over buildings. You might
have some bad luck: there’s construction work on Rue du Pas-de-la-Mule and
you have to take a huge detour on Rue Saint-Gilles. But, in general, this seems
like a good path to follow.
Actually, with the GPS, the straight line distance is an excellent heuristic.
It is possible to mathematically prove that the path provided by the A*
algorithm is indeed the shortest path. Computer scientists say that this
heuristic is admissible.
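Here is a sketch of A* in Python, on an invented graph whose points have coordinates. The only change from Dijkstra’s algorithm is the order in which points are examined: distance traveled so far plus the straight-line estimate of what remains.

    import heapq
    from math import dist as straight_line

    coords = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1)}
    graph = {"A": {"B": 1.0, "C": 2.0}, "B": {"C": 1.0, "D": 1.5},
             "C": {"D": 1.0}, "D": {}}

    def a_star(start, goal):
        heap = [(straight_line(coords[start], coords[goal]), 0.0, start, [start])]
        done = set()
        while heap:
            _, dist, point, path = heapq.heappop(heap)
            if point == goal:
                return dist, path
            if point in done:
                continue
            done.add(point)
            for nxt, length in graph[point].items():
                estimate = straight_line(coords[nxt], coords[goal])  # the heuristic
                heapq.heappush(heap, (dist + length + estimate, dist + length,
                                      nxt, path + [nxt]))

    print(a_star("A", "D"))                 # (2.5, ['A', 'B', 'D'])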
This is not always the case. There are many computer science problems for
which one must find a path on a graph. The A* algorithm can be used with
different heuristics, each one corresponding to a specific problem. The special
thing about this algorithm is precisely how generic it is. Even if the heuristic
function provided to the machine to resolve the problem is not admissible (in
other words, it doesn’t guarantee it will find the shortest path on the graph),
the A* algorithm will find a solution rather quickly. It won’t necessarily be the
best, but with a bit of luck it won’t be too bad.
THERE’S A TRICK!
The GPS example allows us to understand that inside each AI algorithm there
is inevitably a trick provided by a human: the heuristic.
Resorting to heuristics stems from the fact that it is impossible for a
computer to guarantee an exact solution to the problem posed in a reasonable
amount of time. AI algorithms always calculate an approximation, and it’s the
heuristic function (and, of course, a well-chosen algorithm) that allows us to
obtain solutions that are “not too bad.” The quality of the solution will give
people the sense that the machine is more or less intelligent.
With the A* algorithm, the programmer writes the heuristic as a function –
for example, the straight line distance for a GPS. In other AI programs, as
we’ll see throughout this book, other heuristics will be considered. In this way,
computer scientists have created an illusion of intelligence by giving the
algorithm instructions that allow it to find a good solution. For its part, the
computer simply applies the heuristic. Undoubtedly, this is very rewarding for
the computer scientists who make AI.
Winning at Chess 5
Understanding How to Build Heuristics
We now have in our suitcase all the equipment we need to make our journey
to the land of AI. A Turing machine, some algorithms, a bit of complexity,
some graphs, and two or three heuristics: voilà, the computer scientist’s
complete survival kit.
To understand artificial intelligence, let’s take a stroll through computer
science history.
A SHORT HISTORY OF AI
Everything began in 1843, when Ada Lovelace wrote the first algorithm for
Charles Babbage’s analytical engine, a precursor to the computer. In a way,
this is the prehistory of computer science. The first programmable machines
didn’t quite exist yet: Jacquard’s loom, invented in 1801, was already using
perforated cards to guide needles, but it was purely a matter of controlling
mechanical commands. The first machines to use computation and
programming only appeared at the end of the 19th century, during the US
census. The first entirely electronic computer, the ENIAC, was invented in
1945, about a hundred years after Ada Lovelace’s algorithm. This is where we
entered year zero in computer science… and everything would go much faster
after this.
In 1950, just five years after the arrival of the ENIAC, Alan Turing raised
the question of machine intelligence for the first time. (The term “artificial
intelligence” itself was coined a few years later, for the 1956 Dartmouth
workshop.) This new concept was a massive hit, and in just a few years
numerous AI algorithms were proposed.
In the 1960s, computer science became widespread. The computer
quickly became an essential tool in many industries, given its ability to store
and process data. The A* algorithm we discussed in the previous chapter
dates back to this time period. In the 1970s, computers took over offices:
employees could directly input and consult company data. Everyone was
talking and thinking about AI!
And yet it would take almost 20 more years for computers to beat
humans at chess. What happened?
During this time, AI hit a rough patch. After the first successes in the
1960s, society was full of enthusiasm. People imagined machines would
soon be capable of surpassing humans in every activity. They thought we
would be able to automatically translate Shakespeare to Russian and replace
general practitioners with computers that could make a diagnosis based on a
patient’s information.
The reality check was brutal. Investors who had had a lot of hope for
this promising field realized that AI only provided good results in a few
limited domains. And just as before, a computer scientist had to
systematically give the machine very specific algorithms and data. Without
them, it was no more intelligent than a jar of pickles.
It was very disappointing. At a time when we knew how to calculate a
rocket’s trajectory to go to the moon, we couldn’t even program a robot
capable of crossing the street without getting run over. The fallout with AI
was quite simply on par with the hopes it had generated.
Unfortunately, it was all just a big misunderstanding between computer
scientists on one hand, who were all excited by the prospect of having the
first intelligent tasks completed by machines, and the general public on the
other hand, who got carried away by what they saw in science fiction.
Artificial intelligence paid a heavy price. From 1980 to 2010,
approximately, only a few businesses agreed to invest in these technologies.
Instead, most preferred to stick with exact algorithms they could manage
and monitor easily.
Feng-Hsiung Hsu and Murray Campbell, two young IBM engineers,
began their career at this dire time. They set a goal that was both ambitious
for AI and exciting for the general public: to build a computer program
capable of defeating chess’s grand champions.
A NEW CHALLENGE
Quite surprisingly, the algorithm Hsu and Campbell used to win at chess
wasn’t new. It existed before computers were even invented! This algorithm
was proposed by the mathematician John von Neumann, another big name
in computer science. Von Neumann is known, among other things, for
describing the internal organization of processor and memory (the “von
Neumann architecture”) that nearly all computers still follow today.
Long before this, in 1928, von Neumann took an interest in
“combinatorial games.” These are two-player games with perfect
information and without randomness, of which chess is but one example:
there are two players who see the same thing (perfect information) and
there is no randomness, contrary to snakes and ladders, for example. Chess
and checkers are probably the two best-known combinatorial games in
Western culture. But have you ever played Reversi, where players flip over
discs? And the African game known as Mancala, where each player puts
seeds in small pits to “capture” opponents’ seeds? John von Neumann
demonstrated that, in all of these games, there is an algorithm for
determining the best move. This is known as the minimax theorem.
Note that we cannot simply apply this algorithm. This is what
makes it such a subtle thing. Consider as an example the board game Hex, a
Parker game that came out in the 1950s. In this game, each player tries to
link two opposing sides with hexagonal pieces, while blocking the
opponent. The minimax theorem proves there is an algorithm that will allow
us to calculate the best move. John Nash, the inventor of Hex and future
Nobel Prize winner in economics in 1994, even succeeded in proving there
is a strategy that allows the first player to win every time! Automatically
computing this strategy (in other words, with the help of an algorithm),
however, is such a complex problem that it is completely impossible for a
computer to do it in a day, even if it were billions of times more powerful
than today’s most powerful computers.
In other words, it has been shown that the first player in Hex will
inevitably win if he chooses the right moves (his opponent cannot stop him
from winning)… but it is impossible to calculate what the moves are!
FROM THEORY TO PRACTICE
The algorithm that allows computers to win at chess against the best players
in the world takes its inspiration from the method defined by the minimax
theorem. Naturally, it’s called the minimax algorithm. The idea is relatively
simple: attempt all the moves possible, then all the opponent’s possible
moves (for each possible move), then all the possible responses, and so on
until one of the two players wins.
This builds a kind of genealogical tree of possible game situations
starting from the initial situation. In this tree, the root is the initial state of
the game (no player has made a move), and each move corresponds to a
branch leading to the state of the game once the move has been made, as
shown in the following drawing.
The idea behind the minimax algorithm is to use this genealogical tree to
choose the best move. Starting at the “final” situations at the top of the tree
(in other words, the endings of the games), we move downward toward the
root.
To begin, we transform all of this into numbers because computers are
machines that manipulate numbers. Then, we assign a score to each final
situation. For example, in Mancala, the score could be the number of seeds
obtained minus the opponent’s seeds. In chess, the score could be 0 if you
lose, 1 if you draw, and 2 if you win.
Imagine now that you are at the last move of the game. If it’s not your
turn, the logical choice for your opponent is to make the winning move. Put
another way, your opponent is going to make a move that achieves the least
optimal final situation for you (unless you’re playing someone totally
stupid). Thus, the other player’s goal is to achieve the minimum score (from
your point of view). Conversely, if it’s your move, you want to choose the
move that leads you to the maximum score. Therefore, the score of a situation
when there is only one move left is the final situation’s minimum (or
maximum) score.
And for the move right before it? It’s the same, except the roles are
reversed! You must choose the move that leads to the minimum (if it’s your
opponent’s turn) or the maximum (if it’s your turn) of what we just
calculated for the final move. The score of each situation two moves from
the end is, thus, the maximum of the minimum (or the minimum of the
maximum: this depends on the player).
Thus, von Neumann’s algorithm starts at the game’s final moves at the
top of the tree and moves “downward” toward the root by alternating the
minimums and the maximums. Hence the algorithm’s name: the move you
must choose at the end is the maximum of the minimums of the maximums
of the minimums of the maximums of the minimums. In sum, it’s minimax!
This is somewhat hard to imagine because we aren’t machines. For a
computer, calculating these maximums and minimums is very easy, just like
exploring the state space we saw in the previous chapter. The program
consists of less than ten lines of code!
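Here is such a program in Python, on a toy game invented to keep the sketch self-contained: two players take one or two matches from a pile, and whoever takes the last match wins. The scores follow the text: 0 for a loss, 2 for a win.

    def minimax(pile, my_turn):
        if pile == 0:                   # no matches left: the previous player won
            return 0 if my_turn else 2
        scores = [minimax(pile - take, not my_turn)
                  for take in (1, 2) if take <= pile]
        return max(scores) if my_turn else min(scores)

    # With 4 matches and our turn, taking one match leaves the opponent a
    # losing pile of 3, so the situation scores the maximum: 2.
    print(minimax(4, my_turn=True))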
In all of this there isn’t necessarily any artificial intelligence. The reason
for this is quite simple: when the minimax algorithm was invented in the
late 1920s, a machine’s computing power wasn’t relevant. Computers didn’t
even exist yet!
ANOTHER LIMIT TO COMPUTERS?
To apply the von Neumann method directly, one must make a genealogical
tree of all the possible games. This task varies in difficulty depending on the
game in question.
Let’s start with tic-tac-toe. Von Neumann’s method can be applied
directly. If we consider the symmetries and the fact that the players stop
playing as soon as one of them has won, there are 26,830 possible games in
tic-tac-toe. For a machine, examining these 27,000 configurations and
calculating the maximums and minimums is actually a piece of cake! With
several billion operations per second, your brain would take longer to read
the result than the machine would to calculate the best move! There is no
complexity problem here, so there’s no need to resort to artificial
intelligence. The von Neumann method works.
CHECKMATE, VON NEUMANN!
Now, let’s take a look at chess, which is, you’ll agree, slightly more
interesting than tic-tac-toe. Not all the pieces can move over all the squares,
and each kind of piece is subject to different rules. For this reason,
calculating the number of configurations is more complicated than tic-tac-
toe. According to the mathematician Claude Shannon, there are around
10^120 possible chess games – in other words, 10^120 different games to
examine to choose the best move. To write this, we use the digit “one”
followed by 120 zeros:
1,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000
With this algorithm, you can write a program that plays chess pretty well –
except for a few key details.
For starters, the minimax algorithm can be improved by “cutting” the
tree branches we already know won’t provide a better result. This method is
known as the alpha-beta pruning method. John McCarthy, one of the most
famous researchers in artificial intelligence, proposed this pruning idea at a
Dartmouth workshop in 1956. This is not a new heuristic, because the cut
branches are “clearly” bad according to the minimax theorem. Thus, in
practice we don’t lose anything. On the other hand, this pruning allows us to
work with bigger trees and look a few moves further ahead in the game.
Instead of only considering the next 10 moves, the computer can go up to 15
or 16 moves!
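On the same toy matches game as in the earlier sketch, alpha-beta pruning looks like this; branches that cannot change the final choice are abandoned without being explored:

    def alphabeta(pile, my_turn, alpha=-1, beta=3):
        if pile == 0:
            return 0 if my_turn else 2
        for take in (1, 2):
            if take > pile:
                continue
            score = alphabeta(pile - take, not my_turn, alpha, beta)
            if my_turn:
                alpha = max(alpha, score)
            else:
                beta = min(beta, score)
            if alpha >= beta:           # the other player would never allow this
                break                   # cut: skip the remaining branches
        return alpha if my_turn else beta

    print(alphabeta(4, my_turn=True))   # same answer as plain minimax: 2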
After this, to win against the world chess champion, you need to write a
good heuristic function. This means you need to determine the value of the
threatened pieces, the captured pieces, the different strategic positions, and
so on. Nevertheless, all this requires a certain expertise in chess and quite a
few attempts before you find the right values.
Finally, you need to have a supercomputer like Deep Blue and a whole
team of engineers to work on it for a decade! A small consideration, of
course.
A DISAPPOINTING VICTORY
The Deep Blue victory in the late 1990s did not give AI the renewed
momentum we might have expected. The general public and specialists
alike were not convinced by the quality of Deep Blue’s playing. And for
good reason! Some of the computer’s moves were terrible. The algorithm
does not plan everything, and, unlike humans, it has no strategic knowledge.
It wasn’t exactly a resounding victory either: some matches ended in
draws, and the decisive match appears to have been more the result of a
mistake by Kasparov, disconcerted by an unlikely move by Deep Blue, than
of strategic and tactical qualities worthy of chess’s best masters!
Without knowing everything about the minimax algorithm, people easily
understood that the IBM computer simply applied a crude method and won
by brute force. This computing power, though not actual physical force,
gives pretty much the same impression as if Hulk Hogan (a 275-pound
heavyweight wrestling champion) won a judo competition against an
Olympic champion who competes in the under-105-pound weight class.
Something doesn’t seem right.
But that’s exactly how AI works. And not just in games and GPS.
Remember: this is about allowing machines to perform tasks that are, for
the moment, performed in a more satisfactory way by humans. The method
isn’t important, it’s the result that matters.
Nothing says the techniques used have to resemble what a human would
do. Quite the contrary, the power of computers is their ability to process
large amounts of data very fast. Whereas a human will rely on reasoning
and experience, a machine will test a billion solutions, from the most absurd
to the most relevant, or search through giant databases in the same amount
of time. It doesn’t think, it computes!
So, this is somewhat disappointing. However, it would be absurd to
expect machines to do things “like humans.” They are not physically
capable of it.
_____________
1 The name Deep Thought comes from the supercomputer in Douglas Adams’s humorous novel The
Hitchhiker’s Guide to the Galaxy. In the book, the computer computes the “Answer to the Ultimate
Question of Life, the Universe and Everything,” no less! According to the author, the answer is…
42!
2 In 1975, the engineer and computer scientist Gordon Moore made a hypothesis that the number of
transistors on a microprocessor doubles every two years. This assertion gave life to Moore’s law,
which in plain language is “computer processing power doubles every two years.” This very
imprecise assumption has become a target to aim for in the computer science industry, and it has
pretty much been the norm since the 1970s.
The Journey Continues 6
Understanding That One Graph Can Conceal Another
Unlike humans, machines do not “think.” That’s why it’s difficult to make true
artificial intelligence.
ON THE INTELLIGENCE OF MACHINES
Over the course of time, several AI algorithms have been proposed. The first
of them has a very original name: the greedy algorithm. This AI method uses a
simple heuristic: at each stage, the traveler chooses the city that is closest to
where he is (and which he hasn’t visited yet, of course).
In computing terms, we just choose, from among the remaining unvisited
points, the one that is the shortest distance away. We can write the algorithm as
follows:
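(Here is a sketch in Python; the mileages between the cities are invented for the example.)

    distances = {
        ("Limoges", "Bordeaux"): 140, ("Limoges", "Toulouse"): 180,
        ("Limoges", "Nantes"): 200, ("Limoges", "Marseille"): 320,
        ("Bordeaux", "Toulouse"): 150, ("Bordeaux", "Nantes"): 210,
        ("Bordeaux", "Marseille"): 400, ("Toulouse", "Nantes"): 350,
        ("Toulouse", "Marseille"): 250, ("Nantes", "Marseille"): 600,
    }

    def d(a, b):                        # distances are symmetric
        return distances[(a, b)] if (a, b) in distances else distances[(b, a)]

    def greedy(start, cities):
        route, unvisited = [start], set(cities)
        while unvisited:
            # Always go to the closest city not yet visited.
            nearest = min(unvisited, key=lambda city: d(route[-1], city))
            route.append(nearest)
            unvisited.remove(nearest)
        return route

    print(greedy("Limoges", ["Bordeaux", "Toulouse", "Nantes", "Marseille"]))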
The problem with the greedy algorithm is that it doesn’t attempt to do any
better than the first path it determines. It’s a real shame because this first
solution isn’t necessarily very good, and it might be easy to fix a few mistakes,
like the round trip to Nantes, a city that isn’t well positioned on the route.
Fortunately, artificial intelligence researchers have better methods. To
understand them, we have to explore a new space.
But first let’s go back to what we’ve already seen: state space exploration.
Remember: state space is the set of possible states. In a GPS problem, we walk
about this space in search of a solution. In the traveling salesman problem, it’s
exactly the same: our greedy sales representative is going to go from point to
point by visiting the nearest neighbor each time.
However, there is also another space: the solution space. This notion is
somewhat hard to understand. Imagine you’re no longer concerned about the
cities you pass through (the states), but instead you’re concerned about the list
of cities that determines your entire path (the solution). With the traveling
salesman problem, any solution is one in which you’ve visited all the cities, no
matter the order. Bordeaux, Toulouse, Marseille, Nantes is one solution.
Marseille, Toulouse, Bordeaux, Nantes is another (better). All of these
solutions, together, are known as the solution space.
The greedy algorithm simply builds, rather hastily, a single solution: it
picks one single point in the solution space.
THE SOLUTION SPACE
The difficulty that arises in the traveling salesman problem comes from the
size of the solution space. If we have N cities to visit in addition to our
departure city, there are N choices for the first city, N − 1 for the next, and so
on. In total, we’ll have
N × (N − 1) × (N − 2) × ⋯ × 3 × 2 × 1 possible routes: the factorial of N, written N!.
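A few lines of Python show how quickly this factorial explodes:

    from math import factorial

    for n in (5, 10, 20, 100):
        print(n, "cities:", factorial(n), "possible routes")
    # 5 gives 120; 10 gives about 3.6 million; 20 gives about 2.4 billion
    # billion; 100 gives a number with 158 digits.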
On the other hand, with all these solutions, even if there are a lot of them, we
can build a graph: all we have to do is “connect” the solutions. Computer
scientists love these graphs because they can write lots of algorithms with
them.
To build this solution graph, we’re going to connect two solutions when
they are identical except for two cities. For example:
Limoges → Marseille → Toulouse → Bordeaux → Nantes
Limoges → Bordeaux → Toulouse → Marseille → Nantes
Each step presents its own difficulties. The first among them is to build a
population of lots of different solutions. How is this possible? It took us an
entire chapter to understand how to make one single solution!
Honestly, this depends on the problem in question. With the traveling
salesman problem, it is very easy to build lots of solutions: all we have to do
is to randomly choose one city after another. Limoges, Marseille, Bordeaux,
Toulouse, Nantes, and there you have it! Done! No need to think. An
algorithm that produces a solution at random can be written as follows:
As long as cities remain to be visited, pick one at random and add it to the
route; repeat until every city appears exactly once.
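In Python, a sketch of this random construction, reusing the chapter’s cities with Limoges as the departure point:

    import random

    cities = ["Marseille", "Bordeaux", "Toulouse", "Nantes"]

    def random_route(start="Limoges"):
        order = cities[:]              # copy, so the original list is untouched
        random.shuffle(order)          # one city after another, at random
        return [start] + order

    population = [random_route() for _ in range(1000)]   # a thousand solutions
    print(population[0])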
It’s very fast: for a graph with a hundred cities (N = 100), we can produce a
thousand solutions in under a millisecond on one of today’s computers!
Of course, there is no guarantee the randomly chosen solutions will be
of good quality. Just the opposite, it’s very likely they’ll be totally worthless.
But that doesn’t matter: the algorithm will see to it that they evolve into
something better.
TWO BEAUTIFUL CHILDREN!
Remember the principle: as with the giraffes, the individuals (solutions) are
going to reproduce and make babies, and only the best of them will survive
under the hard law of natural selection in the traveling salesmen jungle.
You’re probably wondering how “solutions” that are basically just
numbers in a computer can reproduce and have children. Why not have an
eHarmony for solutions, while we’re at it? You are right. This is just an
illustration. We’re going to choose solutions two by two at random and
“mix” them, just like when two living beings reproduce and combine their
DNA.
Let’s look at an example. Imagine that we have to visit 15 cities, which
we will name A, B, C, D, E, F, G, H, I, J, K, L, M, N, and O. Now, let’s
choose two solutions at random: Dad-solution, whose route is
ABEDCFHGJMNLKIO, and Mom-solution, whose route is
DFKNMJOACBEIGLH.
To build a “child” solution, we take the first cities in the order indicated
by the first solution (the one we named “Dad”) and the following cities in
the one named “Mom.”
We choose, always at random, the initial number of cities. For example,
we decide to keep Dad’s first six cities: ABEDCF. Next, we take Mom’s
cities in order (except for any we’ve already visited): KNMJOIGLH. This
gives us the following solution:
ABEDCF KNMJOIGLH
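In Python, this crossover takes only a few lines. The cut after six cities is fixed here to match the example; the real algorithm draws it at random.

    def crossover(dad, mom, cut):
        head = dad[:cut]                             # Dad's first `cut` cities
        tail = [c for c in mom if c not in head]     # Mom's order, minus visited
        return head + tail

    dad = list("ABEDCFHGJMNLKIO")
    mom = list("DFKNMJOACBEIGLH")
    print("".join(crossover(dad, mom, cut=6)))       # ABEDCFKNMJOIGLH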
The evolutionary algorithm can also use genetic mutations, just like in real
life. From time to time (at random again!), one of the children “mutates,”
inverting two cities. This little operation allows the algorithm to find good
solutions faster, as long as these mutations remain rare. A mutation generates
solutions that inherit something neither parent had: if the result is catastrophic,
the mutation will not be kept. But if it is of good quality, the mutation will
be passed on to future generations!
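A mutation is just as short a sketch: pick two positions at random and exchange their cities. The route below is the child from the example.

    import random

    def mutate(route):
        i, j = random.sample(range(len(route)), 2)   # two distinct positions
        route[i], route[j] = route[j], route[i]      # invert the two cities
        return route

    print("".join(mutate(list("ABEDCFKNMJOIGLH"))))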
An evolutionary algorithm is, thus, simply a method for exploring a
solution space in a shrewd way. There is no real natural selection: our
“individuals” are just numbers in a machine. When you “mix” two solutions
to make a new one, you’re just choosing a new point in the solution space
that has traits in common with the initial solutions.
If your two initial solutions are good solutions – in other words,
relatively short routes for the traveling salesman problem – you have two
possibilities. One possibility is that the route proposed by the “child”
solution isn’t as good, in which case it will be eliminated at the end of the
trip. The other possibility is that the child’s route is better because one of the
parent’s routes helped shorten the other one, in which case you keep it. This
is how your population progresses.
The mutation, for its part, behaves like the Tabu search we saw at the
end of the previous chapter. It allows us to explore the neighborhood in the
solution space and look for new routes for our traveling salesman, without
losing everything that has been calculated previously.
Researchers have mathematically shown that this type of algorithm
works well. With enough trips, you get a very good solution – provided, of
course, you programmed it right!
SO, WHAT IS THIS GOOD FOR?
Both evolutionary algorithms and the Tabu search method appear easy to
use: You give the program some parameters and that’s it! It figures out the
solutions all by itself.
However, a lot of intelligence actually goes into choosing these
parameters, for example, how neighbors are defined in the search space
for the Tabu search, how the genome is coded for individuals in
evolutionary algorithms, what the right crossover method is for the
solutions, and what the selection function is…
In every instance, there is always a human who tells the computer how
to handle the problem. Even though the computing method is generic, the
intelligence that allows the problem to be solved is entirely human.
Small but Strong 8
Understanding How Ants Find Their Way
Multi-agent systems appeared in the early 1990s. They aren’t just used to
make artificial intelligence algorithms. At first, they were a general technique
used in developing computer programs.
Imagine you want to build a city with houses, stores, transportation, and so
on. You certainly aren’t going to make everything all at once. You start by
designing neighborhoods, then you arrange them. In computer science, we
work in the same way. No one writes a program with thousands of lines of
code all in one go. We develop small pieces and assemble them with each
other to build a program gradually. Computer scientists work little bit by little
bit.
Over time, these little bits of program have become more and more
sophisticated, to the point that in the 1990s, they were no longer little bits of
program but instead entire programs assembled with others. Each program in
this assembly is capable of resolving a given problem but cannot function
alone: it receives data from other programs and conveys the result of its
calculations to another program. This small world of programs exchanges data
from one end of our little computerized city to another. We call these little bits
of program agents because they act on data to resolve the problem together.
A computer program written in this manner, with basic programs coupled
together, is called a multi-agent system. This type of system is somewhat
similar to ant colonies: each ant does a very specific task (find a twig, find
food, move the eggs, etc.) that it repeats tirelessly for the colony to function as
a whole. Each ant’s work appears to be very simple and, yet, when combined
together, they produce a structure that functions in a very complex manner.
THE ALGORITHM AND THE ANT
Actually, the agents don’t speak to each other. They just leave traces on the
graph to inform the other agents of their calculations.
JUST LIKE IN THE ANT COLONY
The idea of leaving a trace on the graph is directly inspired by what ants do in
nature. As ants move about, they leave behind their pheromones. These are
small odorous molecules that help guide other ants. For an ant, pheromones
work like a message: “Hey guys, look over here, I went this way.” Ants like to
follow other ants’ pheromones. In nature, it is possible to observe columns of
ants following each other: each ant is guided by the pheromones of its fellow
ants.
When you place a mound of sugar near an anthill, the ants don’t notice it
right away. They move about at random, and one of them eventually comes
across this treasure trove. It then retraces its steps back to the anthill and leaves
behind a trail of pheromones to guide its fellow ants.
However, over time the pheromones evaporate. Thus, ants that take longer
to return to the anthill from the mound of sugar will leave weaker pheromone
trails: the odor will have enough time to evaporate before the ant gets back,
and the other ants will not be so inclined to follow the same path. On the other
hand, on the shortest paths, the pheromones won’t have enough time to
evaporate: the odor will still be strong, and it will attract even more ants. If
there are several paths from the anthill to the sugar, the ants will eventually
take the shortest path.
CALCULATING A PATH WITH ANTS
In computer science, the pheromones are replaced by the numbers that each bit
of computer program (each agent) writes on the graph, directly associated with
the lines connecting the points. In the computer’s memory, we write a list of
all the connections, and for each connection we include a cell to store a
number: a cell for the direct path from Toulouse to Marseille, a cell for the
direct path from Toulouse to Bordeaux, a cell for the direct path from
Bordeaux to Nantes, and so on. And in each cell, the agents that pass through
this connection put the total distance they traveled.
Let’s come back to the dialog between the two agents: the one who
traveled 394 miles and passed through Toulouse and then Marseille, and his
colleague who passed through Bordeaux. The first one is going to record 394
in the Limoges-Toulouse cell, the Toulouse-Marseille cell, and all the other
cells corresponding to this path. The second one is going to record 468 in the
Limoges-Bordeaux cell and in all the other cells corresponding to the path that
he took.
Taking a path at random like our agents doesn't take much time for a
computer. Thus, we can ask a hundred agents to do this work. Each agent is
going to record its total distance traveled on every connection it used along
its path. Once all of the agents have finished, for each connection the
algorithm calculates the average distance in miles traveled by the agents who
passed through it.
Let's use a concrete example. Let's assume three agents – Anne, Bernard,
and Celine – traveled from Marseille to Toulouse. In total, Anne traveled 260
miles; Bernard, 275; and Celine, 210. We will, thus, put 248 (the average:
(260 + 275 + 210)/3 ≈ 248) in the Marseille-Toulouse cell.
The algorithm doesn’t stop there. On the next trip, the agents go out on the
road again, but this time, instead of choosing their route completely at random
like the first time, they’re going to take notice of the pheromones, like the
ants! When they have to choose the next point in their route, they look at the
values shown on the connections between the current city and the other cities.
They choose the next destination at random by giving a higher probability to
the cities that have connections with the lowest values. For example, an agent
who is in Marseille and still has to visit Toulouse and Nantes is going to look
at the weights for the Marseille-Toulouse connection and Marseille-Nantes. If
the weight for Marseille-Toulouse is smaller than Marseille-Nantes, it’s more
likely the agent will choose Toulouse.
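To make this choice rule concrete, here is a minimal sketch in Python (my own illustration, not the book's code; the city names and recorded values are invented). Each connection's average distance becomes a probability: the shorter the recorded distance, the more attractive the city.

import random

# Invented pheromone table: average distances recorded on the
# connections leaving Marseille (lower value = shorter tours used it).
weights = {"Toulouse": 248, "Nantes": 391}

def choose_next_city(weights):
    # Shorter recorded distances get proportionally higher probability.
    scores = {city: 1.0 / value for city, value in weights.items()}
    cities = list(scores)
    return random.choices(cities, weights=[scores[c] for c in cities])[0]

print(choose_next_city(weights))  # "Toulouse" comes up more often than "Nantes"

Run it a few times: Toulouse wins roughly 60 percent of the draws, yet Nantes still gets its chance, which is exactly what keeps the agents exploring.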
THE MORE THE MERRIER!
In a multi-agent system, the solution gets built even though the algorithm does
not explicitly describe how it’s supposed to be built: no one has told the agents
they had to find the shortest path. We’re simply asking them to flip a coin
when they move about and to write some numbers down.
For computer scientists, this phenomenon is known as the emergence of
the solution. The programmer has given the agents some rather basic behaviors
(here, choose a solution more or less at random by looking at what your
buddies did on the previous trip), and a solution gradually emerges based on
the calculations and interactions among the system’s agents.
Among the skills that characterize human intelligence, there is one that
machines and data processing cannot do without: the ability to classify
objects – that is, to group or separate elements based on their features.
Human beings do not have much difficulty distinguishing between birds
and mammals. We learn early on how to group things that look alike: ducks,
geese, and other waterfowl all look like each other more than like a cat or a lion.
This skill allows us to describe the personality of our friends (shy, exuberant,
etc.) without needing to resort to complex reasoning.
However, classification is not as easy as it might seem.
Let's consider the five animals below: a dog, a cat, a duck, a cow, and a
lion.
If you had to divide them into two clusters of similar animals, how would
you group them? Naturally the cat and the lion… perhaps with the dog? And
the cow and the duck together? Or would you leave the duck out, since it’s not
a mammal? But you could also think of the lion as the odd one out: it’s the
only wild animal. Or perhaps the cow: it’s the only animal humans get milk
from.
As you can see, grouping objects based on their similarities isn’t always
easy.
FROM ANIMALS TO GENES
There are as many unsupervised cluster analysis algorithms as there are ways
of clustering objects. For example, imagine you must divide Christmas
presents among several children. You might want each child to open the same
number of presents. In this case, your clustering algorithm must ensure that
each cluster contains the exact same number of presents. On the other hand, if
you want each child to receive presents of a similar value, regardless of how
many, you’ll choose a different algorithm to ensure the presents are distributed
fairly. Or, if you want each child to receive a big present, you’ll apply yet
another clustering algorithm.
These clustering algorithms do not sort the presents themselves: they
compare the data describing the objects to find what they have in common.
The algorithms have strange names (k-means, DBSCAN, expectation
maximization, SLINK, and others), and each one has its own way of forming
clusters.
All these algorithms compare the objects’ features to form clusters of
objects that are similar to each other but different from other clusters.
Naturally, these characteristics (for example, the animal’s size, what it eats) are
represented by numbers in the computer. These are the features.
To describe these features, we need a computer scientist whose role is to
define, for each feature, how to represent it numerically. While it's pretty
easy to do this for size, choosing a number for features such as diet, habitat, or
coat raises more questions. Indeed, the computer scientist must determine the
distance between the different values to allow the algorithm to compare the
objects. For example, is an animal with feathers “closer” to an animal with hair
or an animal with scales?
Using a number to represent a measurement of distance between feature
values is the basis for a great number of clustering algorithms.
In this kind of representation, each feature corresponds to a dimension in a
multidimensional space, and the objects to be clustered constitute just as many
points in this space. It’s as if you placed your animals in the space based on
their features so you could measure the distance separating them. But you have
to imagine that this space doesn’t just have two or three dimensions like the
image above, but probably hundreds. There are as many dimensions as there
are features to study!
This mathematical modeling is difficult to grasp. In any case, bear in mind
that the points’ relative position in the space determines the distances between
the objects to be clustered. This is what will guide the clustering algorithm and
determine how it regroups them. By representing the clustering problem as a
geometry one, computer scientists have been able to build the majority of
today’s modern clustering algorithms.
STARTING OVER AGAIN AND AGAIN
Cluster analysis methods rely on the idea that the computer uses the data to
build a single decision rule. The result of this calculation is not limited to
reorganizing a list of previously clustered objects: it subsequently allows you
to assign a class to any object.
For instance, let's use the k-means algorithm, which determines k "mean
points" in the feature space. If you give your system a new object to classify, the
computer will immediately be able to put it in a cluster by comparing the
distance to the “mean points.” Thus, the algorithm hasn’t just built two classes,
it has also created a calculation function for classifying any object of the same
family.
This gives the impression that, using the data provided, the computer has
learned to classify the objects on its own. That’s why computer scientists often
speak of unsupervised learning.
Naturally, the computer hasn’t learned anything: it has simply applied a
classification rule built using an algorithm. This classification rule even
behaves like an algorithm: we can use it to classify new objects.
JOY AND GOOD HUMOR!
Admit it, the idea of machine learning is kind of scary. Our natural tendency to
personify machines has led us to imagine that they could be capable, like a
small child, of learning from the world around them and acquiring true
autonomy from their inventors, us, humans. Nothing of the sort. A machine
always works within the bounds of the algorithm and the data we provide.
ASK THE PROGRAM!
In computer science, it’s a little more subtle than this. We don’t exactly build
new machines, but we do produce new data or new programs that allow us to
process data in new ways.
Let’s come back to the k-means algorithm and unsupervised classification.
The computer has iterated on the data to calculate k mean points. In doing so,
it has classified the data around these k points. Now if you give it a new object,
a piece of data that it has never seen, it will still be able to classify it according
to these k points.
Everything happens as if the algorithm has “built” a data-based
classification program. This new program is capable of processing any new
data of the same kind.
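As an illustration, here is a minimal k-means sketch in Python (a toy version of my own, not the book's code; the animal data are invented). It computes the k mean points, then uses them as a classification rule for an object it has never seen:

import math, random

def kmeans(points, k, steps=20):
    # Alternate between assigning each point to its nearest mean
    # and recomputing each mean as the average of its cluster.
    means = random.sample(points, k)
    for _ in range(steps):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        means = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else m
                 for cl, m in zip(clusters, means)]
    return means

def classify(point, means):
    # The "learned" rule: a new object joins the nearest mean point.
    return min(range(len(means)), key=lambda i: math.dist(point, means[i]))

# Two invented features per animal: size, and a number coding its diet.
data = [(0.3, 1.0), (0.4, 1.1), (2.1, 0.2), (1.9, 0.3)]
means = kmeans(data, k=2)
print(classify((0.35, 0.9), means))  # the never-seen object still gets a cluster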
However, the machine does not learn on its own. For this to work, a
computer scientist must first write the learning program – in other words, all of
the instructions that will allow the computer to use the data to produce a new
AI system. The computer cannot invent a new artificial intelligence program
on its own: it just follows the procedure it receives from a human.
AN EXAMPLE IS BETTER THAN A LONG
EXPLANATION
The learning algorithm must build a system capable of associating the right
answer with each possible value provided as input. In other words, we want it
to “learn” to reproduce the results provided in the examples.
This technique is referred to as supervised learning.
THE ADVENTURE BEGINS
The idea of training a machine to perform a task, rather than writing how to
perform the task, is not new. At first, however, it was limited to only a few
applications. The machines at the time did not have enough data to use
supervised learning algorithms on anything and everything.
Today, computers are capable of reacting to commands in natural
language, transcribing handwriting, detecting errors in physical systems,
recommending a book for you to read, and distinguishing between a face and a
cat in a picture. All of this is possible thanks to the incredible amounts of
labeled data that have been gathered in databases to tell the machine: “You see,
in this picture, there is a cat!”
You're going to tell me that an algorithm that recognizes cats isn't very
useful. You might be surprised to learn that kittens fooling around or looking
cute for the camera are among some of the most watched videos on the
internet. So, you never know how useful a computer program might be!
FROM IMAGE TO DATA
The type of variables the learning algorithm must rely on is sometimes subject
to intense debate. As the 1970s came to a close, the Polish researcher Ryszard
Michalski wrote a learning algorithm that worked with so-called symbolic
variables – in other words, well-formed logical formulas. The data model he
used, named VL21, is a description logic. These computer languages were very
popular in the 1980s and 1990s. They let us describe variables or decision
rules such as the following:
The animals with triangle-shaped ears are cats
Michalski’s trains are defined by their features: number of cars, car size,
whether it is open (no roof) or closed, and so on. The researcher gives his
program examples of trains traveling east or west and describes their features.
For example:
There is an eastbound train with three cars. The first car is small and has
an open roof. The second car is a locomotive. The third car is small and
has a closed roof.
Obviously, all this is written in the logic language VL21 with ∃, [], and other
abstruse symbols.
Michalski’s program automatically learns, from examples, rules of the
form “If there are such and such features, then the train is traveling east” (or
west). In other words, if we apply the algorithm to a slightly different domain,
it automatically learns, from examples, the decision-making rules that allow it
to recognize objects, just like with the cat ears:
∃ ear[shape(ear) = triangle] → cat
To obtain this result, the algorithm takes the examples one after the other.
At each step, it tries to build the broadest rule possible based on the preceding
examples. This is possible thanks to the description logic used by Michalski.
This language not only allows us to describe the variables’ values (there are
triangle-shaped ears) and the decision formulas (if there are triangles, then it is
a cat), but also the reasoning rules for the formulas themselves. For example,
when several cars have similar features, Michalski asks the machine to apply a
rule that groups the examples together.
By using these logic rules, the algorithm gradually builds a decision rule
that gives a correct answer for all the examples supplied to it.
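Michalski's actual algorithm manipulates VL21 formulas, which is far beyond a few lines of code. Still, the generalization step can be illustrated with a deliberately naive Python sketch of my own (the train attributes are invented): keep only the conditions that all the examples of a category share.

def broadest_rule(examples):
    # Start from the first example and drop every condition
    # that a later example contradicts.
    rule = dict(examples[0])
    for example in examples[1:]:
        rule = {k: v for k, v in rule.items() if example.get(k) == v}
    return rule

eastbound = [
    {"cars": 3, "first_car": "small", "first_roof": "open"},
    {"cars": 3, "first_car": "small", "first_roof": "closed"},
]
print(broadest_rule(eastbound))
# {'cars': 3, 'first_car': 'small'}: "if 3 cars and a small first car, go east"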
TELL ME WHAT YOU READ, AND I’LL TELL
YOU WHO YOU ARE
The logic rules are the essence of Michalski’s algorithm. They allow us to
build these final decision rules, and they have all been defined by the
researcher “by hand.” As always with AI, there is a certain amount of human
expertise supplied to the machine. Here, the expertise has a direct impact on
the learning process: if you want to learn this example, then do this or that to
build your decision formula.
Still, the fact remains that the computer automatically builds decision rules
in a computer language (here, the description logic VL21). As with the “plan”
example we used at the beginning of the chapter, the computer has built a
program using examples.
All of this algorithm’s power stems from the fact that the result is a set of
intelligible decision rules, as long as you’re speaking VL21. When built in this
way, the system (the program resulting from the learning) can then explain
each of its decisions!
This type of approach can be particularly interesting in a recommendation
system. Let’s use book shopping as an example. The k-means algorithm might
recommend for you a book that is “close” to other books you have chosen,
thanks to a distance measurement. However, it won’t be able to explain why
this book is a good choice. On the other hand, a system capable of learning all
the features of the books you read and grouping them together in well-formed
formulas will be able to tell you: “Actually, you like 1980s science fiction
books.” Impressive, isn’t it?
FROM SYMBOLIC TO NUMERIC
You are probably better off consulting a physician instead of relying on this
prediction, because this artificial intelligence program only works with data
and variables we give it. With only two variables, we can’t expect a very
accurate diagnosis!
The ID3 algorithm and its successors, in particular the C4.5 algorithm,
were among the most widely used learning algorithms in the 2000s. Not only
are they capable of building compact trees, they can also deal with exceptions
when an example from a database needs to be handled in a particular manner.
To achieve this feat, at each stage of the tree, the Quinlan algorithm
calculates which variable allows it to best separate all the examples it has been
given. For Akinator, this means choosing the most relevant question to quickly
arrive at the right answer. With the doctor, this means finding the variable and
the value range that gathers the most examples. To do this, the algorithm
measures each variable's information gain using a mathematical function. Naturally,
we must calculate this value for each variable and for all the examples.
However, the calculation is relatively quick, even with thousands of examples
and lots of variables.
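In ID3, that mathematical function is Shannon's entropy, and the quantity computed for each variable is its information gain. Here is a minimal sketch in Python (my own illustration; the medical data are invented):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of answers ("sick", "not sick", ...).
    counts = Counter(labels)
    return -sum(c / len(labels) * math.log2(c / len(labels))
                for c in counts.values())

def information_gain(examples, variable):
    # How much knowing `variable` reduces uncertainty about the answer.
    labels = [label for _, label in examples]
    by_value = {}
    for features, label in examples:
        by_value.setdefault(features[variable], []).append(label)
    weighted = sum(len(ls) / len(examples) * entropy(ls)
                   for ls in by_value.values())
    return entropy(labels) - weighted

examples = [
    ({"fever": "yes", "throat": "red"}, "sick"),
    ({"fever": "yes", "throat": "normal"}, "sick"),
    ({"fever": "no", "throat": "red"}, "not sick"),
    ({"fever": "no", "throat": "normal"}, "not sick"),
]
print(information_gain(examples, "fever"))   # 1.0: separates perfectly
print(information_gain(examples, "throat"))  # 0.0: tells us nothing

The tree-building algorithm simply asks the question with the highest gain first, which is exactly how Akinator homes in on the right answer so quickly.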
In its calculation, the algorithm also considers a few particular
eventualities. If an example’s attributes cannot be generalized because its
values are too specific, it is excluded from the decision tree. The algorithm
then calculates how reliable each result produced by the final decision
program is. Thus, it is able to tell you "with 99.1°F and a normal throat, there is
an 82% chance you are not sick.” Impressive, isn’t it?
TOO MUCH, IT’S TOO MUCH!
The weakness of these algorithms is the amount of data you need to build a
“good” classification. The more variables you have, the more examples you
need.
You’re right to suspect we normally don’t just use two or three variables to
write an AI program. For each piece of data, there are usually several dozens
of features, and sometimes many more! To interpret commands in a natural
language like Siri does, for instance, we need thousands of dimensions.
Thus, the number of examples required to build a good learning algorithm
can be mind-boggling. Even with the enormous databases we’ve built since the
1990s, the results are rarely up to the task: this kind of algorithm is simply
not capable of producing general rules.
What we need, then, is another method.
THE KITTIES RETURN
The other method we need has been around since 1963. It’s the support vector
machine, or SVM, a model proposed by Vladimir Vapnik and Alexey
Chervonenkis.
For nearly 30 years, these two Russian researchers studied how classes are
built from a statistical standpoint. The two mathematicians were interested in
data distribution. Thus, they proposed an algorithm capable of classifying
data, like ID3 does, but with many fewer examples. The idea is to “separate”
the data so as to have the largest margin possible on each side of a dividing
“line.”
To understand this principle, imagine we have a system capable of
recognizing cat pictures. We’ll only consider two dimensions: the number of
triangles in the picture (a particular feature of cat ears) and the number of
rounded shapes.
We give our algorithm the four following examples: a cat photo, a cat
drawing, a mountain landscape with a hill, and a view of the pyramids.
Each picture is placed higher the more rounded shapes it has. Likewise, the
more triangles it has, the further to the right it is placed.
Naturally, it is impossible to build a system capable of recognizing the
pictures based on these two dimensions alone. But this simple example allows
us to understand how an SVM works.
THE MATHEMATICS OF ARTIFICIAL
INTELLIGENCE
The SVM’s first task is to analyze the examples supplied to the system during
the learning phase and draw a straight line to separate the cat pictures from the
rest. In theory, this is easy. But in practice, there are many ways to separate
points in a space with only one straight line, as demonstrated by our two
robots in the picture.
In the first case, proposed by the robot on the left, the separating bar is
very close to the photo of the cat and the photo of the pyramids, while the
other two pictures are further away. By contrast, in the second case, the bar is
really close to the picture of the mountain and the drawing of the cat. And
between the two, there are many other possible positions. How do we choose?
The first thing we can see is that these two “extreme” bar positions each
pose a problem. Indeed, our objective isn’t simply to separate the pictures
hanging on the board but to build a program capable of sorting through
new cat and noncat pictures, without knowing which category they belong
to ahead of time. If we choose the first dividing line, the one that comes near
the photos, we have a high risk of error. Indeed, a very similar cat photo
having slightly fewer curves or triangles will go on the other side of the line
and end up being classified in the noncat category. By contrast, a picture of
pyramids with slightly more curves would be considered a cat.
Similarly, the solution proposed by the other robot, which comes very
close to the cat drawing, is going to produce an algorithm that also makes
mistakes easily. In the end, Vapnik proposed choosing the solution that
provided the greatest margin possible from one side of the dividing line to the
other, such that the two groups would be as far as possible from the line. Using
mathematical calculations, the algorithm thus finds the straight line with the
greatest margin. This increases its chances of correctly classifying future
pictures.
This algorithm is called the optimal margin separator. In the drawing with
the third robot holding the bar, this means having the largest possible gray strip
from one side of the separator to the other.
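In practice, you rarely code the margin calculation yourself. Here is a minimal sketch using the scikit-learn library (an assumption on my part: the book names no library, and the picture coordinates are invented for illustration):

from sklearn.svm import SVC

# Invented features: (number of triangles, number of rounded shapes).
pictures = [[2, 9], [3, 8], [7, 2], [8, 1]]  # cat photo, cat drawing,
is_cat = [1, 1, 0, 0]                        # mountain, pyramids

model = SVC(kernel="linear")  # ask for the straight line with the widest margin
model.fit(pictures, is_cat)

print(model.predict([[4, 7]]))  # a never-seen picture near the cats -> [1]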
TOO MUCH CLASS!
Using calculations, it is thus possible to separate the dimension space into two
parts: the “cats” side and the “noncats” side, in our example.
And the benefit is, once the algorithm has used the examples to separate
the two sides, it can define a decision system. The system is then able to
classify any described object based on its attributes into the “cat” category or
the “noncat” category, even if it is not included among the learning examples.
This calculation even works with very little data: the algorithm
systematically builds the optimal margin, regardless of how many dimensions
and examples are provided. Naturally, the more examples there are, the more
reliable the separator will be.
This learning algorithm is, therefore, particularly interesting. Even if it
requires complex calculations to find the separator, it allows us to handle
problems with many dimensions and relatively little data. By contrast, this is
not the case with algorithms that use decision trees.
KEEP STRAIGHT
However, even in the 1990s, no one really gave much thought to this method.
The SVM algorithm does, indeed, have a serious limitation: the separation
that it establishes between the two clusters is inevitably linear, in the
mathematical sense of the term. In plain language, it’s a straight line.
Let us consider the picture below:
It has a lot of curves and triangles, which means it would be placed in the
top right of our dimension space – that is, next to the cats. And there, it’s no
good: to separate this picture from the cat group, we would need a curved line
– in other words, a nonlinear separator. However, the SVM algorithm is
unable to produce this kind of separator. It’s rather frustrating!
THE SVM STRIKES BACK
The SVM has had a lot of success in computing since the dawn of the 21st
century. It is very effective at recognizing cat pictures, even when there are
dirty tricks like a wolf with its ears up or the Pyramid of Giza (don’t laugh, the
triangular shape fools the computer very easily). But let's be serious: the SVM is
very useful for a whole ton of other applications, starting with
understanding natural language (human language).
Processing natural language is a difficult problem for AI and, yet, one of
the most ambitious problems to solve. Indeed, language is a very complex
phenomenon that is impossible to describe using symbolic rules alone. As
soon as you leave a very specific framework, as soon as the vocabulary
becomes the slightest bit realistic, the rules become too abundant.
However, just like every AI technique, the SVM has a weakness. To work
well, an SVM not only needs the right data set, it also needs a kernel function
that is well suited to the problem to be solved. This means that a computer
scientist must not only define the problem’s dimensions (the features) but also
what happens inside the kernel. This is a bit like the heuristic in the minimax
method for chess: if you don’t tell the machine how to evaluate the board, it
won’t play correctly.
There are some classic functions for the kernel that are well suited for each
kind of problem. They are easy to control and generally produce good results.
But as soon as you try to increase the AI’s performance, it’s not enough to just
give it more examples: you have to look under the algorithm’s hood. In other
words, you have to hot-rod the kernel.
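To give an idea of what lives under that hood, here are two classic kernel functions sketched in Python (illustrative only; real libraries implement these, and many others, for you):

import math

def linear_kernel(x, y):
    # Plain dot product: this one gives back the straight-line separator.
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    # Gaussian kernel: lets the SVM bend its frontier into curves.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

print(linear_kernel([2, 9], [4, 7]))  # 71
print(rbf_kernel([2, 9], [4, 7]))     # about 0.018

Swapping one function for the other changes what "close" means to the SVM, and therefore the shape of the frontier it is able to draw.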
Many years of higher learning in mathematics and computer science are
needed to understand how an SVM works and to be able to manipulate these
algorithms. Researchers spend several years building one kernel for a specific
problem. However, advancements in AI image processing and natural
language understanding, two major artificial intelligence tasks, have been
lightning fast since the advent of the SVM.
In the early 2010s, however, there was a new revolution in machine
learning: artificial neural networks.
Learning to Read 12
Understanding What a Neural
Network Is
The idea of building artificial neurons is not new. It dates back to the
middle of the 20th century and to early efforts to understand how the human brain works.
In 1949, Donald Hebb, a Canadian psychologist, hypothesized that brain
cells called neurons are the basis of the brain’s learning mechanism. He
made a fundamental assertion: if two neurons are active at the same time,
the synapses between them are strengthened.
Today, neurobiologists know that it isn’t that simple, but the “Hebb
rule” has paved the way for important neuroscience research and served as
a foundation for artificial neural network development.
DRAW ME A NEURON
An artificial neuron is nothing like a human neuron. It’s simply a term used
to refer to a number-based computing operation. The computed result
depends on several neuron input parameters.
Let’s take a closer look.
To begin with, the “neuron’s” input consists of a list of numbers, which
are usually binary (0 or 1). Each number is multiplied by a coefficient,
called a weight, and the first step of the calculation is to sum everything up.
Mathematicians refer to this as the weighted sum.
The second step is to compare this sum with a value, called a bias. If the
sum is greater, the neuron’s output is a value of 1. If this is the case, we say
that the neuron fires up. If not, its output is 0. Note, however, that nothing
actually lights up: this is simply a calculation that produces a certain result.
The weights and the bias are the neuron's parameters. They allow us to
control how it works.
For example, let’s take the “neuron” below. We’ve indicated the inputs,
the weights, and the bias.
1 × 7 + 1 × (−2) + 0 × 3 + 1 × 5 = 10
The result (10) is smaller than the bias (13). Thus, the neuron’s output is 0
(as a result, we say that it remains “off ”). This is the result of the
calculation.
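The whole "neuron" fits in a few lines of Python (a direct transcription of the calculation above, nothing more):

def neuron(inputs, weights, bias):
    # Step 1: the weighted sum. Step 2: compare it with the bias.
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return 1 if weighted_sum > bias else 0  # 1 means the neuron "fires up"

print(neuron([1, 1, 0, 1], [7, -2, 3, 5], bias=13))  # 0: the neuron stays off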
By way of comparison, here’s a diagram of a human neuron from an
encyclopedia. You have to admit, this is totally different!
MORE TRIANGLES
To the far left of the diagram, sensors transform the picture into
numbers: each pixel1 is transformed into a 0 if it is off or a 1 if it is on (in
gray in our drawing). The machine used by Rosenblatt for his experiments
had a resolution of 400 pixels (20 × 20), which is ridiculously small
compared to today’s photographs. The perceptron’s input is thus 400
numbers (0 or 1), which correspond to the picture’s pixels.
The perceptron’s first layer is composed of several hundred neurons that
each calculate a different result using these numbers. Each neuron has its
own weight and its own bias, which makes all of the results different. From
the first “layer” of calculations, we obtain several hundred numbers, which
are either 0 or 1. For Rosenblatt this operation is association, and this
first list of calculations is the "association layer."
All of these results are then fed into a second layer of neurons. Again,
hundreds of different calculations produce hundreds of results, which are
either 0 or 1. We could continue on like this for quite some time, but
Rosenblatt, in an article published in Psychological Review in 1958,
proposed stopping after two association layers.
The third and final layer contains only one neuron. It takes the numbers
calculated by the preceding layer and runs a “neural” calculation to obtain a
0 or a 1. The goal of this little game is for the final output to be a 1 when
the picture contains a triangle and a 0 in all other cases.
JUST A LITTLE FINE TUNING
Unfortunately, it’s not that simple: the real difficulty consists of finding
good values for each parameter and bias. It’s like a gigantic mixing console
with millions of mixing knobs that need to be turned to just the right
position.
For each neuron, you have as many knobs as inputs. Imagine that the
first layer of our perceptron contains 1,000 neurons; that makes 400,000
parameters to adjust, just for the first layer, and each one can be any value.
There is no way we’re going to just happen upon the right configuration by
chance.
This is where supervised learning comes in.
A LONG LEARNING PROCESS
No matter what the value of the third weight is, the sum will always be
10. When the input is 0, the weight isn’t used in the calculation.
Thus, we decrease the neural network’s weights, but we also increase by
1 the bias of each neuron that fired so that they will be a little harder to fire
up with the values provided. The idea is to make the last neuron not fire
anymore if we feed it the same picture again.
Conversely, if the network answers “no” even when the operator fed it a
triangle, we have to increase the weights of the nonzero inputs and decrease
the bias for the neurons that didn’t fire (to reduce how much they filter).
Now, we make the same adjustments (move the knobs by 1 in one
direction or the other) for each example in the learning database. Obviously,
we don’t do this by hand. The weights are just numbers in a computer, and
the algorithm does the calculations all on its own by considering thousands
of previously memorized examples one after another. After a good number
of examples, with a bit of luck, the knobs are in a position that allows them
to recognize triangles, cats, or something else, based on what we decided to
teach it.
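For a single neuron, the nudging described above can be sketched in a few lines of Python (a toy version of the classic perceptron rule, not the exact procedure Rosenblatt's machine used):

def train_step(inputs, weights, bias, expected):
    fired = 1 if sum(i * w for i, w in zip(inputs, weights)) > bias else 0
    if fired == expected:
        return weights, bias               # right answer: touch nothing
    step = 1 if expected == 1 else -1      # missed a "yes": push the knobs up
    weights = [w + step * i for w, i in zip(weights, inputs)]  # zero inputs stay put
    return weights, bias - step            # make the neuron easier (or harder) to fire

weights, bias = [7, -2, 3, 5], 13
for _ in range(5):  # show the same "triangle" until the neuron says yes
    weights, bias = train_step([1, 1, 0, 1], weights, bias, expected=1)
print(weights, bias)  # [8, -1, 3, 6] and 12: the sum (13) now beats the bias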
WHERE DID MY BRAIN GO?
Neural networks would eventually come back in vogue in the early 1980s.
As so often happens in science when a theory starts to fade into memory, it
was ultimately reinvented. The physicist John Joseph Hopfield rediscovered
neural networks in 1982. He proposed a new model that, in the end, was not
so different from Rosenblatt’s perceptron and breathed new life into this
scientific topic.
In 1986, the psychologist David Rumelhart and the computer scientists
Geoffrey Hinton and Ronald Williams proposed the notion of back-
propagation, a new learning algorithm inspired by 1960s algorithms for
adjusting the parameters of mechanical controllers. Using some
calculations, the algorithm determines which parameters have contributed
the most to the network’s right, or wrong, answer. It then adjusts the
weights and biases of these parameters more aggressively, and adjusts less
aggressively for the parameters that contributed “little” to the result.
In this way, the algorithm finds the right parameter values much more
quickly.
NETWORK SUCCESS
_____________
1 In a computer, the picture is cut up into little dots called pixels. To understand what a pixel is, stick
your nose up against a television screen or a poster. You’ll see small colored dots that, from afar,
make up the picture. Telephones, televisions, computer screens, and internet images are all
described by their resolution, which is expressed as a number of pixels. This indicates the picture
quality.
Learning as You Draw 13
Understanding What Deep
Learning Is
Deep neural networks are a win-win. Not only are they no longer limited by
the computer’s processing capacity (thanks to the GPUs, which allow it to
run operations in parallel), but their layered arrangement allows us to obtain
better learning results. What happiness!
Of course, we have to determine how to configure the network – in
other words, the way in which the calculations are performed. For this,
researchers build several configurations with two, three, four, ten, or sixteen
different-sized layers. Then they compare the performance of these different
networks on the learning database so they can choose a configuration that
produces the best results. This very empirical step is done automatically.
The computer compares the outputs of the different configurations, based
on the data provided by the human, by using the exploration algorithms
written by the human.
So, what’s the problem? Why isn’t everything resolved by these
machine-built neurons?
The machine learning system’s performance isn’t so much a result of
the chosen configuration as it is a result of the way the data are built and
analyzed by your algorithm. Let’s look at an example.
To allow a machine to “read” a word written by hand, you can use a
neural network to decode each character one by one. However, your system
will certainly be more efficient if it can also consider neighboring
characters, since the symbols aren’t completely independent. Thus, you
want to build a neural network whose inputs are a combination of letters, as
opposed to a single letter.
This combination is done using mathematical calculations that also use
parameters. Modern algorithms use machine learning to adjust these
parameters. In this way, you have neurons that not only learn the parameters
to answer the question correctly but also learn the hyperparameters to
combine data or to determine the network’s configuration.
All this work performed by the machine must first be thought out by the
computer scientists: the humans determine how to combine the data to
obtain better results based on whether the task to be resolved needs to
understand words, recognize a picture, avoid an obstacle, and so on. Each
problem requires us to build the right solution, and good results are hard to
reproduce on another problem, even if it is similar.
THE ACHIEVEMENTS OF DEEP LEARNING
The support vector machines and decision trees we saw in the preceding
chapters are not susceptible to these kinds of tricks. On the other hand, they
require sophisticated input transformation algorithms to process the inputs’
features. This work is difficult and can only be done by a human who has
spent years studying AI.
A Go board has 361 intersections where you can place a white or a black
stone. That makes 3^361 possible game situations. Written out, that number
is:
174089650659031927907188238070564367946602724950
263541194828118706801051676184649841162792889887
149386120969888163207806137549871813550931295148
03369660572893075468180597603
To give you an idea of the magnitude, let’s consider the number of atoms in
the universe (10^80), which is pretty much the gold standard for big numbers
in artificial intelligence. Now, imagine that our universe is but one atom in a
larger universe. You make a universe containing as many universes as our
universe contains atoms. And you count the number of atoms in this
universe of universes. That adds up to lots and lots and lots of atoms. It’s
enough to make you dizzy.
To have as many atoms as there are situations in Go, you need to make a
trillion universes like this! In short, we are very far from any modern
computer’s processing capacity. As with chess, we have to use an AI
algorithm.
Unfortunately, the minimax algorithm isn’t going to do it for us.
A BIT OF CALCULATION
Remember, this algorithm relies on the idea of trying all the game
possibilities on a few moves, and choosing the one that leads to the best
result. With 361 possible moves at the start of the game, the number of
situations to examine increases very quickly. For instance, to evaluate all
four-move games alone, the minimax algorithm must consider 361 × 360 ×
359 × 358 possibilities – more than 16 billion configurations. It will take
several seconds to process this with a modern computer.
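Python handles whole numbers of any size, so both magnitudes above can be checked in two lines (a back-of-the-envelope verification, nothing more):

print(len(str(3 ** 361)))     # 173 digits: about 1.7 x 10^172 situations
print(361 * 360 * 359 * 358)  # 16702719120: more than 16 billion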
With only four moves examined, the minimax algorithm will play like a
novice. A game of Go consists of several hundred moves. Given the billions
of possible situations after only four moves, it is not possible to see your
opponent’s strategy at all! We need to come up with a better method.
In Go, players take turns placing black and white stones on the board to
form territories. The white player just made his second move. It is hard to
say exactly how the game will turn out at this point.
WANT TO PLAY AGAIN?
The algorithm used by computers to win at Go is known as "Monte
Carlo Tree Search," a reference to the casino located in Monaco; the Monte
Carlo method it builds on has been around since the late 1940s.
What an odd name for an algorithm! It was given this name because this
algorithm relies on chance to resolve problems.
For each of the 361 possible moves, we’re going to play a thousand
games at random. Completely at random. The opponent’s moves and the
computer’s responses are drawn at random. We make note of the game’s
outcome, and we start over one thousand times. Each time, we start over
with the same initial configuration.
To evaluate each of the 361 moves, the algorithm makes a statistical
calculation on the scores obtained for the thousand games played at
random. For example, you can calculate the mean of the scores (in practice,
computer scientists use slightly more complex calculations). The Monte
Carlo algorithm makes the hypothesis that, with some luck, the best move
to make is the one whose random games produce the best statistics. This is
the move that will be chosen.
By assuming that each game lasts under 300 moves (which is a
reasonable limit if we base this on games played by humans), we will have
made a maximum of 300,000 calculations for each of the 361 “first moves”
that we have to evaluate, which is around 110 million operations in total.
This is more than reasonable for a modern computer.
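Here is the skeleton of that procedure in Python. The playout itself is a stand-in (simulating real Go rules would take a chapter of its own), so the scores are invented; only the structure of the algorithm is the point:

import random

def random_playout(board, move):
    # Stand-in for "play the rest of the game completely at random":
    # a real program would apply Go's rules and return the final score.
    return random.gauss(0, 1)

def best_first_move(board, moves, games_per_move=1000):
    averages = {}
    for move in moves:
        scores = [random_playout(board, move) for _ in range(games_per_move)]
        averages[move] = sum(scores) / games_per_move
    return max(averages, key=averages.get)  # keep the best statistics

print(best_first_move(board=None, moves=range(361)))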
RED, ODD, AND LOW!
Monte Carlo algorithms have been used in this way since the 1990s to win
at Go. In 2008, the program MoGo, written by researchers at Inria, the
French Institute for Research in Digital Science and Technology, achieved a
new feat by repeatedly beating professional players. By 2015, programs had
already made a lot of progress in official competitions, although they had
failed to beat the champions.
However, in March 2016, the program AlphaGo defeated the world
champion, Lee Sedol, four matches to one. For the first time, a computer
achieved the ninth Dan, the highest level in Go, and was ranked among the
best players in the world!
To achieve this feat, the engineers at Google used a Monte Carlo
algorithm combined with deep neural networks.
As with any Monte Carlo algorithm, lots of matches are played at
random to determine the best move. However, the computer scientists do
not make the computer play completely at random. Not only is the
computer unable to truly choose at random (there is always some
calculation), but it would also be unfortunate to play completely at random,
since even the greenest Go players know how to spot the worst moves.
This is where AlphaGo’s first neural network comes in. The Google
team refers to this as the “policy network.” This is about teaching a neural
network how to detect the worst moves by giving it game examples as
input. This prevents the Monte Carlo algorithm from producing too many
bad matches and improves the accuracy of its statistical calculations.
A deep neural network doesn’t work much differently than Rosenblatt’s
perceptron: the input is a 19 × 19 three-color grid (black, white, and
“empty”), and the output, instead of a simple “yes” or “no,” is a value for
each possible move. With a deep neural network and some graphics cards to
do some GPU processing, this is feasible.
NO NEED TO BE STUBBORN
Not all problems can be resolved with this method. For example, let’s
consider self-flying drones (NASA has been working on a prototype since
the early 2000s). It is not reasonable to build a million drones, send them up
in the air, and watch them crash to the ground over and over until we
eventually figure out the right parameters. Only the Shadoks1 would use
this method. Computer scientists use simulation instead. However, the
algorithm only learns how to behave in simulation. There’s no guarantee it
will work correctly in real life!
Thus, reinforcement learning can only be used for applications whose
operations can be simulated without risk, as is the case with Go. Using this
reinforcement learning technique, AlphaGo's engineers succeeded at gradually
training the two neural networks: all you have to do is have the computer play
against itself. This produces billions of matches that allow the neural networks
to gradually reinforce their confidence in their decisions, whether it be to reject
a move (the "policy network") or to evaluate the board and stop the game (the
"value network").
GIVE ME THE DATA!
_____________
1 The Shadoks are silly animated cartoon characters, very popular in France in the 1960s, that keep
failing in building all sorts of machines.
Strong AI 15
Understanding AI’s Limitations
Our journey in the land of AI will soon be at an end. What we’ve seen so far is
a general overview of some problems and some methods used to solve them.
There are still many other techniques we haven’t covered. Most approaches
used in AI require many years of study before they can be fully mastered.
Despite this, you get the general idea: humans build machines that perform
calculations, transform data, and resolve problems. However, these machines
are not truly intelligent. They apply a procedure created by a human being.
So, is it possible to build machines that are truly intelligent? Machines
capable of resolving all kinds of problems without human intervention?
Machines capable of learning and developing as children do? Machines aware
of the world around them? Machines that can feel emotion, form a society, and
build a future together? Some people, such as French researcher Jacques
Pitrat, firmly believe so.
We might as well say it now: this is a tricky topic that has more in common
with science fiction than with actual science. It goes without saying, the dream
of a truly intelligent machine made in humans’ image has been an
extraordinary driving force behind all of the developments in artificial
intelligence since the 1950s. Currently, however, there is nothing to suggest it
is possible to build such a machine.
STRONG AI
The artificial intelligence techniques we have seen throughout this book were
systematically developed for a very specific goal: to effectively resolve a given
problem. They do not attempt to imitate humans in every domain.
In the 1970s, the philosopher John Searle examined the hypothesis that the
human brain behaves like a computer, a machine that processes information.
He used the term "strong AI" to refer to an artificial intelligence capable of
perfectly imitating the human brain. The term stuck.
The second problem that research into strong AI looks at is how to build an
artificial consciousness – a machine that would be conscious of physical or
immaterial elements in addition to the data it manipulates, a machine that
would be aware that it’s a machine.
Researchers in this domain raise questions that are entirely different from
the ones we’ve studied so far. First and foremost, scientists attempt to define
what consciousness is and what a machine would have to produce to allow us
to say that it is conscious. This is an open question and it relates to subjects
that draw on philosophy and neuroscience just as much as artificial
intelligence.
What would the neural activity of such a consciousness look like? Could
this phenomenon be reproduced in a machine? Under what conditions? An
underlying idea in artificial consciousness research is that a machine capable
of simulating a human brain would be able to create a consciousness, but
philosophers are skeptical of this purely mechanical vision of consciousness.
Numerous researchers believe the body also plays a fundamental role in
consciousness, just as it does with emotion.
AN UNCERTAIN FUTURE
Not with what we know at present. “Weak” AIs have no ability to create or
imagine. For this reason, they are unable to go beyond the framework
programmed for them. Even when we talk of “autonomous vehicles,” or self-
driving cars, the autonomy in question is limited to what the machine has been
designed for. The car will not “turn” against its driver unless it has been
programmed to do so.
Isaac Asimov illustrated this situation very well in his Robot series. The
characters are machines that obey the stories’ three laws of robotics, the first
being to never injure a human being. Throughout the stories, we discover that
the robots can adopt very strange behaviors, and even cause accidents.
However, never do the robots violate the laws of robotics. No robot ever
injures a human or allows a human to come to harm.
Such “autonomous” systems are becoming more and more common, and
we can see they are relatively harmless indeed… as long as we use them
correctly! This is exactly like other tools we use on a daily basis. Lots of
people get hurt each year with a screwdriver, a drill, or a saw, occasionally
because of a defect, but more often because of misuse. However, these objects
are not conspiring against us. This only happens in horror movies.
As for strong AI that would be self-aware and determined to break free of
our control, there is nothing preventing anyone from claiming it may exist one
day. Artificial intelligence, the generalization of “big data,” and recent
advancements in deep learning are undeniable leaps in technology. Yet they
don’t bring us any closer to artificial consciousness. They are still just tools at
the service of humans.
MISUSE OF AI
The main danger with AI today is how these technologies can be used. Let’s
consider search engines like Google and Baidu, for example. They allow
anyone to access any information contained in the giant library that is the
internet. But these same algorithms can be used to automatically filter the
content you access. This is also what Facebook does when it calculates the
information that might interest you based on your profile.
This wouldn’t be so bad if social networks hadn’t become the main source
of information for persons aged 18–24. What’s to prevent an information
dictatorship controlled by tech giants or totalitarian states from deciding
overnight what we can or cannot know?
It’s the same thing for almost every AI technique. What’s to prevent the
image recognition or automated decision-making technology currently used to
develop self-driving cars from one day being used to develop killer robots
capable of identifying their targets and eliminating them without human
intervention? Many researchers, including Stuart Russell, have already raised
alarms about the malicious use of AI technology.
And they are right: it is urgent that we understand what AI algorithms are
capable of so that we, humans, can decide what we choose to consider
acceptable. We need to understand which AI uses we wish to prevent and how
to guard against them. It is up to us to develop algorithms that do not violate
these ethical rules – not even by accident.
AI SERVING PEOPLE
Computer programmers, users, and politicians are all responsible for how AI
will be used: above all, this is a legislation and ethics problem. It is important
that we guard against potential malicious acts guided by political, religious, or
economic ideologies. More than anything else, we mustn’t forget that artificial
intelligence is there to help us, not to imprison us!
Every day we use dozens of AI algorithms for our greater good, from
braking systems on cranes and subway trains to information search tools that
allow us to find an address, prepare a school presentation, or find our favorite
song. These uses are often so insignificant that we don’t even realize we’re
using AI.
Without a doubt, AI programs will one day be used to commit crime. In
fact, this has most likely already happened, even if only in the domain of
cybersecurity: hackers use programs to accomplish tasks that, to them, are
“intelligent.” This is AI. As a result, cybersecurity specialists develop AI
algorithms to guard against this. They detect server intrusions with the help of
machine learning and isolate infected network computers using multi-agent
systems.
We mustn’t aim at the wrong target: AI can be misused, but it doesn’t act
on its own. It has been programmed by a programmer for a specific purpose. It
has been used by a human with an objective in mind. In every instance, it’s a
human who is ultimately responsible for how it is used.
One mustn’t confuse a tool and its use. An AI can be misused. An AI
cannot, however, become malicious spontaneously, at least not based on
today’s scientific knowledge.
EXPLAINING, ALWAYS EXPLAINING
James Allen is a researcher who isn't content to make only one major
contribution to artificial intelligence. He is famous for his temporal
reasoning model, which allows computers to reason about notions such as
"before," "after," "at the same time," and "during." He has also helped
develop several natural language dialogue systems that combine
knowledge representation techniques from symbolic artificial intelligence
with statistical learning.1
JOHN MCCARTHY AND PATRICK HAYES
In the 1960s, John McCarthy and Patrick Hayes laid the foundations for an
entire field of symbolic artificial intelligence. They proposed a model for
reasoning about actions and changes: things are no longer always true or
always false. In so doing, they paved the way for cognitive robotics
research – in other words, building robots capable of reasoning about how
to behave in the world. This isn’t quite artificial consciousness, however,
because the machines simply apply rules of mathematical logic within a
framework whose bounds have been set by humans. However, the systems
developed using this technology are able to adapt to different situations by
using logical reasoning.
LOTFI ZADEH
Lotfi Zadeh is the inventor of fuzzy logic. Behind the odd name lies a very
particular computing model that allows computers to reason about facts that
are more or less true, such as “it’s nice outside,” “it rained a lot,” or “the car
is going fast.” Artificial intelligence programs using fuzzy logic, with
reasoning rules written by humans, have been used in all kinds of intelligent
control systems since the 1980s: cranes, elevators, meteorology, air traffic
control, and many others.
DEBORAH MCGUINNESS
In the early 2000s, symbolic AI specialists invented the semantic web. This
can be defined as a network of web pages written especially for computers
using well-formed formulas so they can read internet content just as humans
can. The semantic web uses computer models based on description logics
and the OWL language invented by Deborah McGuinness and her
colleagues.
Nowadays, computers exchange “semantic” information on the internet
to respond to our needs. For example, when you search for a celebrity using
a search engine, you obtain a text box showing the person’s age, height, and
general information. This information is taken from data in the semantic
web.
_____________
1 These terms are explained in chapters 10 and 11.
Acknowledgments
N. Sabouret
In the beginning, it was just a trio of computer scientists who posted about
big things on an internet blog. The blog continued on its merry way in the
hallways and break rooms at the LIP6 for 11 more years until Nicolas got
back in touch with me about this incredible collaboration. It is with deep
emotion that I would like to thank my former blog sidekicks: Nicolas
Stefanovitch, a.k.a. Zak Galou, and Thibault Opsomer, a.k.a. Glou. I would
also like to acknowledge my art companions: all the members of the CMIJ,
of the AllfanartsV2 forum, and of BD&Paillettes. Thanks to François for
his help with logistics (dishes, cooking, and babysitting Renzo), which
allowed me to put on my artist apron every night in December 2018.
Finally, thank you, Nicolas, for your enthusiasm and availability for this
beautiful collaboration. I am happy I was able to fulfill my lifelong dream
of helping others learn about science through my artwork.
L. De Assis
They made AI
but they were not alone…
Babbage, Ch.
Deep Blue
Dijkstra, E.
Dorigo, M.
Greedy algorithm
Glover, F.
Hart, P.
Hinton, G.
Hopfield, J.
Hsu, F.-H.
K-means
Loebner, H.
Lovelace, A.
Raphael, B.
Rechenberg, I.
Rosenblatt, F.
Rumelhart, D.
Russell, S.
Samuel, A.
Searle, J.
Sutton, R.
Wallace, R.
Watson (IBM)
Weizenbaum, J.
Williams, R.
Winograd, T.