Spark in Action, Second Edition: Covers Apache Spark 3 with Examples in Java, Python, and Scala, by Jean-Georges Perrin. ISBN 9781617295522, 1617295523
SECOND EDITION Covers Apache Spark 3
Jean-Georges Perrin
Foreword by Rob Thomas
MANNING
Lexicon
Summary of the Spark terms involved in the deployment process
Term: Definition
Application: Your program that is built on and for Spark. Consists of a driver program and executors on the cluster.
Application JAR: A Java archive (JAR) file containing your Spark application. It can be an uber JAR including all the dependencies.
Cluster manager: An external service for acquiring resources on the cluster. It can be the Spark built-in cluster manager. More details in chapter 6.
Deploy mode: Distinguishes where the driver process runs. In cluster mode, the framework launches the driver inside the cluster. In client mode, the submitter launches the driver outside the cluster. You can find out which mode you are in by calling the deployMode() method. This method returns a read-only property.
Driver program: The process running the main() function of the application and creating the SparkContext. Everything starts here.
Executor: A process launched for an application on a worker node. The executor runs tasks and keeps data in memory or in disk storage across them. Each application has its own executors.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (for example, save() or collect(); check out appendix I).
Stage: Each job gets divided into smaller sets of tasks, called stages, that depend on each other (similar to the map and reduce stages in MapReduce).
Task: A unit of work that will be sent to one executor.
Worker node: Any node that can run application code in the cluster.
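The Job entry says that work is spawned only in response to an action. As a loose analogy, and not Spark code, the sketch below uses plain java.util.stream as a stand-in: the map() step plays the role of a transformation and collect() plays the role of an action. The class name LazyActionSketch and the logged messages are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyActionSketch {

  /** Runs a lazy pipeline and returns the events in the order they happened. */
  static List<String> run() {
    List<String> log = new ArrayList<>();

    // "Transformation": the mapping is only described here, nothing runs yet.
    Stream<Integer> pipeline = Stream.of(1, 2, 3)
        .map(x -> {
          log.add("map " + x);
          return x * 10;
        });
    log.add("pipeline defined");

    // "Action": the terminal operation is what triggers the actual work.
    List<Integer> result = pipeline.collect(Collectors.toList());
    log.add("result " + result);
    return log;
  }

  public static void main(String[] args) {
    run().forEach(System.out::println);
  }
}
```

Running main() should list "pipeline defined" before any "map" line, which is the point of the analogy: describing the computation and executing it are separate steps. Spark's own scheduling into jobs, stages, and tasks is not modeled here.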
[Figure: Application processes and resources elements. Labels: driver program, SparkSession (SparkContext), application JAR, cluster manager, worker nodes, executors, tasks, cache. Notes: "Job: parallel tasks triggered after an action is called"; "Jobs are split into stages."]
JEAN-GEORGES PERRIN
FOREWORD BY ROB THOMAS
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: [email protected]
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books are
printed on paper that is at least 15 percent recycled and processed without the use of elemental
chlorine.
ISBN 9781617295522
Printed in the United States of America
Liz,
Thank you for your patience, support, and love during this endeavor.
CONTENTS
… dataframe
4 Fundamentally lazy
4.1 A real-life example of efficient laziness
4.2 A Spark example of efficient laziness
  Looking at the results of transformations and actions ■ The …
… further
6.2 Building a cluster
  Building a cluster that works for you ■ Setting up the environment
6.3 Building your application to run on the cluster
  Building your application’s uber JAR ■ Building your application by using Git and Maven
6.4 Running your application on the cluster
  Submitting the uber JAR ■ Running the application ■ Analyzing the Spark user interface
… Parquet
… Elasticsearch
9.5 Behind the scenes: Building the data source itself
9.6 Using the register file and the advertiser class
9.7 Understanding the relationship between the data and schema
  The data source builds the relation ■ Inside the relation
9.8 Building the schema from a JavaBean
9.9 Building the dataframe is magic with the utilities
9.10 The other classes
… transformation
… Lake
17.3 Accessing cloud storage services from Spark
  Other options for accessing files in Spark ■ Hybrid solution for …
foreword
workflows, enabling builders to focus on machine learning and the ecosystem around
Spark. As we have seen time and again, an open source project is igniting innovation,
with speed and scale.
This book takes you deeper into the world of Spark. It covers the power of the
technology and the vibrancy of the ecosystem, and covers practical applications for
putting Spark to work in your company today. Whether you are working as a data
engineer, data scientist, or application developer, or running IT operations, this book
reveals the tools and secrets that you need to know, to drive innovation in your
company or community.
Our strategy at IBM is about building on top of and around a successful open
platform, and adding something of our own that’s substantial and differentiated. Spark is
that platform. We have countless examples in IBM, and you will have the same in your
company as you embark on this journey.
Spark is about innovation—an analytics operating system on which new solutions
will thrive, unlocking the big data scale effect. And Spark is about a community of
Spark-savvy data scientists and data analysts who can quickly transform today’s
problems into tomorrow’s solutions. Spark is one of the fastest-growing open source
projects in history. Welcome to the movement.
—ROB THOMAS
SENIOR VICE PRESIDENT,
CLOUD AND DATA PLATFORM, IBM
preface
I don’t think Apache Spark needs an introduction. If you’re reading these lines, you
probably have some idea of what this book is about: data engineering and data science
at scale, using distributed processing. However, Spark is more than that, which you
will soon discover, starting with Rob Thomas’s foreword and chapter 1.
Just as Obelix fell into the magic potion,1 I fell into Spark in 2015. At that time, I
was working for a French computer hardware company, where I helped design highly
performing systems for data analytics. As one should be, I was skeptical about Spark at
first. Then I started working with it, and you now have the result in your hands. From
this initial skepticism came a real passion for a wonderful tool that allows us to process
data in—this is my sincere belief—a very easy way.
I started a few projects with Spark, which allowed me to give talks at Spark Summit,
IBM Think, and closer to home at All Things Open, Open Source 101, and through
the local Spark user group I co-animate in the Raleigh-Durham area of North
Carolina. This allowed me to meet great people and see plenty of Spark-related projects. As
a consequence, my passion continued to grow.
This book is about sharing that passion.
Examples (or labs) in the book are based on Java, but the online repository contains
Scala and Python as well. As Spark 3.0 was coming out, the team at Manning and I
decided to make sure that the book reflects the latest versions, and not as an
afterthought.
1 Obelix is a comics and cartoon character. He is the inseparable companion of Asterix. When Asterix, a Gaul,
drinks a magic potion, he gains superpowers that allow him to regularly beat the Romans (and pirates). As a
baby, Obelix fell into the cauldron where the potion was made, and the potion has an everlasting effect on him.
Asterix is a popular comic in Europe. Find out more at www.asterix.com/en/.
As you may have guessed, I love comic books. I grew up with them. I love this way
of communicating, which you’ll see in this book. It’s not a comic book, but its nearly
200 images should help you understand this fantastic tool that is Apache Spark.
Just as Asterix has Obelix for a companion, Spark in Action, Second Edition has a
reference companion supplement that you can download for free from the resource
section on the Manning website; a short link is http://jgp.net/sia. This supplement
contains reference information on Spark static functions and will eventually grow to
more useful reference resources.
Whether you like this book or not, drop me a tweet at @jgperrin. If you like it,
write an Amazon review. If you don’t, as they say at weddings, forever hold your peace.
Nevertheless, I sincerely hope you’ll enjoy it.
Alea iacta est.2
2 The die is cast. This sentence was attributed to Julius Caesar (Asterix’s arch frenemy) as Caesar led his army
over the Rubicon: things have happened and can’t be changed back, like this book being printed, for you.
acknowledgments
This is the section where I express my gratitude to the people who helped me in this
journey. It’s also the section where you have a tendency to forget people, so if you feel
left out, I am sorry. Really sorry. This book has been a tremendous effort, and doing it
alone probably would have resulted in a two- or three-star book on Amazon, instead of
the five-star rating you will give it soon (this is a call to action, thanks!).
I’d like to start by thanking the teams at work who trusted me on this project,
starting with Zaloni (Anupam Rakshit and Tufail Khan), Lumeris (Jon Farn, Surya
Koduru, Noel Foster, Divya Penmetsa, Srini Gaddam, and Bryce Tutt; all of whom
almost blindly followed me on the Spark bandwagon), the people at Veracity
Solutions, and my new team at Advance Auto Parts.
Thanks to Mary Parker of the Department of Statistics at the University of Texas at
Austin and Cristiana Straccialana Parada. Their contributions helped clarify some
sections.
I’d like to thank the community at large, including Jim Hughes, Michael
Ben-David, Marcel-Jan Krijgsman, Jean-Francois Morin, and all the anonymous
contributors posting pull requests on GitHub. I would like to express my sincere
gratitude to the folks at Databricks, IBM, Netflix, Uber, Intel, Apple, Alluxio, Oracle,
Microsoft, Cloudera, NVIDIA, Facebook, Google, Alibaba, numerous universities, and
many more who contribute to making Spark what it is. More specifically, for their work,
inspiration, and support, thanks to Holden Karau, Jacek Laskowski, Sean Owen,
Matei Zaharia, and Jules Damji.
about this book
■ Extending Spark with new data sources.
■ Linearizing data from JSON so you can use SQL.
■ Performing aggregations and unions on dataframes.
■ Extending aggregation with user-defined aggregate functions (UDAFs).
■ Understanding the difference between caching and checkpointing, and
increasing performance of your Spark applications.
■ Exporting data to files and databases.
■ Understanding deployment on AWS, Azure, IBM Cloud, GCP, and on-premises
clusters.
■ Ingesting data from files in CSV, XML, JSON, text, Parquet, ORC, and Avro.
■ Extending data sources, with an example on how to ingest photo metadata
using EXIF, focusing on the Data Source API v1.
■ Using Delta Lake with Spark while you build pipelines.
In part 2, you will start diving into practical and pragmatic examples around
ingestion. Ingestion is the process of bringing data into Spark. It is not complex, but there
are a lot of possibilities and combinations.
■ Chapter 7 describes data ingestion from files: CSV, text, JSON, XML, Avro,
ORC, and Parquet. Each file format has its own example.
■ Chapter 8 covers ingestion from databases: data will be coming from relational
databases and other data stores.
■ Chapter 9 is about ingesting anything from custom data sources.
■ Chapter 10 focuses on streaming data.
Part 3 is about transforming data: this is what I would call heavy data lifting. You’ll
learn about data quality, transformation, and publishing of your processed data. This
largest part of the book talks about using the dataframe with SQL and with its API,
aggregates, caching, and extending Spark with UDFs.
■ Chapter 11 is about the well-known query language SQL.
■ Chapter 12 teaches you how to perform transformation.
■ Chapter 13 extends transformation to the level of entire documents. This
chapter also explains static functions, which are one of the many great aspects of
Spark.
■ Chapter 14 is all about extending Spark using user-defined functions.
■ Aggregations are also a well-known database concept and may be the key to
analytics. Chapter 15 covers aggregations, both those included in Spark and
custom aggregations.
Finally, part 4 is about going closer to production and focusing on more advanced
topics. You’ll learn about partitioning and exporting data, deployment constraints
(including to the cloud), and optimization.
■ Chapter 16 focuses on optimization techniques: caching and checkpointing.
■ Chapter 17 is about exporting data to databases and files. This chapter also
explains how to use Delta Lake, a database that sits next to Spark’s kernel.
■ Chapter 18 details reference architectures and security needed for deployment.
It’s definitely less hands-on, but so full of critical information.
The appendixes, although not essential, also bring a wealth of information: installing,
troubleshooting, and contextualizing. A lot of them are curated references for
Apache Spark in a Java context.
All the code is freely available on GitHub under an Apache 2.0 license. The data
may have a different license. Each chapter has its own repository: chapter 1 will be in
https://github.com/jgperrin/net.jgp.books.spark.ch01, while chapter 15 is in
https://github.com/jgperrin/net.jgp.books.spark.ch15, and so on. Two exceptions:
■ Chapter 6 uses the code of chapter 5.
■ Chapter 18, which talks about deployment in detail, does not have code.
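The repository naming rule and its two exceptions can be sketched as a small helper. This is an illustration only: the class ChapterRepo and the method urlFor are hypothetical names, and the two-digit suffix and the chapter range of 1 to 18 are assumptions inferred from the ch01 and ch15 examples and the chapter list above.

```java
import java.util.Optional;

public class ChapterRepo {

  // Common prefix taken from the two example URLs in the text.
  static final String BASE = "https://github.com/jgperrin/net.jgp.books.spark.ch";

  /** Returns the repository URL for a chapter, or empty if the chapter has no code. */
  static Optional<String> urlFor(int chapter) {
    // Assumption: the book has chapters 1 through 18.
    if (chapter < 1 || chapter > 18) {
      throw new IllegalArgumentException("Unknown chapter: " + chapter);
    }
    // Exception 1: chapter 18 does not have code.
    if (chapter == 18) {
      return Optional.empty();
    }
    // Exception 2: chapter 6 uses the code of chapter 5.
    int source = (chapter == 6) ? 5 : chapter;
    // Assumption: the suffix is the chapter number zero-padded to two digits.
    return Optional.of(BASE + String.format("%02d", source));
  }

  public static void main(String[] args) {
    System.out.println(urlFor(1).orElse("no code"));
    System.out.println(urlFor(6).orElse("no code"));
    System.out.println(urlFor(18).orElse("no code"));
  }
}
```

Returning an Optional rather than null makes the "no code" case for chapter 18 explicit to the caller.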
As source control tools allow branches, the master branch contains the code against
the latest production version, while each repository contains branches dedicated to
specific versions, when applicable.
Labs are numbered in three digits, starting at 100. There are two kinds of labs: the
labs that are described in the book and the extra labs available online:
■ Labs described in the book are numbered per section of the chapter.
Therefore, lab #200 of chapter 12 is covered in chapter 12, section 2. Likewise, lab
#100 of chapter 17 is detailed in the first section of chapter 17.
■ Labs that are not described in the book start with a 9, as in 900, 910, and so on.
Labs in the 900 series are growing: I keep adding more. Lab numbers are not
contiguous, just like the line numbers in your BASIC code.
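The lab numbering convention can be expressed in a few lines. This is a minimal sketch under one assumption that the text does not state outright: that the hundreds digit alone gives the section for every lab described in the book (so a hypothetical lab #210 would also belong to section 2). The names LabNumbers, isExtraLab, and sectionOf are invented for illustration.

```java
public class LabNumbers {

  /** Labs use three digits, starting at 100. */
  private static void check(int lab) {
    if (lab < 100 || lab > 999) {
      throw new IllegalArgumentException("Lab numbers have three digits, starting at 100: " + lab);
    }
  }

  /** True for the extra online labs, which start with a 9 (900, 910, and so on). */
  static boolean isExtraLab(int lab) {
    check(lab);
    return lab >= 900;
  }

  /** Section of the chapter that covers the lab, for example lab #200 maps to section 2. */
  static int sectionOf(int lab) {
    if (isExtraLab(lab)) {
      throw new IllegalArgumentException("Lab #" + lab + " is an online extra, not described in the book");
    }
    // Assumption: the hundreds digit is the section number.
    return lab / 100;
  }

  public static void main(String[] args) {
    System.out.println("Lab #200 is in section " + sectionOf(200));
    System.out.println("Lab #910 is an extra lab: " + isExtraLab(910));
  }
}
```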
In GitHub, you will find the code in Python, Scala, and Java (unless it is not applica-
ble). However, to maintain clarity in the book, only Java is used.
In many cases, the original source code has been reformatted; we’ve added line
breaks and reworked indentation to accommodate the available page space in the
book. In rare cases, even this was not enough, and listings include line-continuation
markers (➥). Additionally, comments in the source code have often been removed
from the listings when the code is described in the text. Code annotations accompany
many of the listings, highlighting important concepts.
about the cover illustration
The figure on the cover of Spark in Action is captioned “Homme et Femme de
Housberg, près Strasbourg” (Man and Woman from Housberg, near Strasbourg). Housberg
has become Hausbergen, a natural region and historic territory in Alsace now divided
between three villages: Niederhausbergen (lower Hausbergen), Mittelhausbergen
(middle Hausbergen), and Oberhausbergen (upper Hausbergen). The illustration is
from a collection of dress costumes from various countries by Jacques Grasset de
Saint-Sauveur (1757–1810), titled Costumes de Différents Pays, published in France in 1797.
Each illustration is finely drawn and colored by hand.
This particular illustration has special meaning to me. I am really happy it could be
used for this book. I was born in Strasbourg, Alsace, currently in France. I immensely
value my Alsatian heritage. When I decided to immigrate to the United States, I knew
I was leaving behind a bit of this culture and my family, particularly my parents and
sisters. My parents live in a small town called Souffelweyersheim, directly neighboring
Niederhausbergen. This illustration reminds me of them every time I see the cover
(although my dad has a lot less hair).
The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how
culturally separate the world’s towns and regions were just 200 years ago. Isolated
from each other, people spoke different dialects (here, Alsatian) and languages. In
the streets or in the countryside, it was easy to identify where someone lived and what
their trade or station in life was just by their dress.
The way we dress has changed since then, and the diversity by region, once so rich,
has faded away. It’s now hard to distinguish the inhabitants of different continents, let
alone different towns, regions, or countries. Perhaps we have traded cultural diversity
for a more varied personal life, and certainly for a more varied and fast-paced
technological life.
At a time when it’s hard to tell one computer book from another, Manning
celebrates the inventiveness and initiative of the computer business with book covers
based on the rich diversity of regional life of two centuries ago, brought back to life by
Grasset de Saint-Sauveur’s pictures.
Exploring the Variety of Random
Documents with Different Content
Fig. 78.
“On revient toujours à ses premiers amours,” says the French song.
But who would have thought that, after rising step by step above the
fish, and tracing the history of the backboned animals through their
development in the air and over the land, till we brought them to a
stage of intelligence second only to man, we should have to follow
them back again to the water and find the highly gifted milk-givers
taking on the form and appearance of fishes? Nevertheless it is so, for
seals and whales are as truly flesh-eating milk-givers as bears and
wolves; nor are they much behind them in intelligence, for we all know
how teachable and affectionate seals and sea-lions are, while what
little is known of the life of whales shows that they are devoted
mothers, and their well convoluted though small brains are a proof that
they are by no means wanting in intelligence.
Yet the whales and dolphins, at any rate, have not only adopted a
sea life, but have limbs so like a fish’s fins that we can scarcely call
them by any other name, and they are so completely water animals
that they cannot even return to the land.
Now we should be quite puzzled to account for such curious forms
as these warm-blooded animals, half transformed into fish, if it were
not that we know of several land animals belonging to different groups
which have gone part of the way towards a fish life. Thus among the
reptiles we have the oceanic turtles and the sea snakes; among birds
the penguins, whose wings have almost become fins. Then among the
milk-givers we have the web-footed Duck-billed Platypus, the Yapock
or web-footed opossum of South America, the Desman and the
Beaver, the Polar Bear, and last but not least the Otters, web-footed
animals nearly allied to the weasels, which seek their food entirely in
the water.
The common Otter of Europe and America, though he moves
quickly and actively on land, has webbed toes with only short claws
standing out beyond the swimming foot, and he spends the greater
part of his life in the river, making his home in a hollow of the bank
beneath the overhanging roots of trees. There he may still be seen in
many of our English rivers, his soft brown fur shining as he swims
along, diving under water for a fish, which he brings out on to the bank
to eat, holding it in his fore paws.
But there is an otter which has deserted the old land life much
more completely than this, for the great Sea-Otters of the North
Pacific, about four or five feet long (see Fig. 79), never care even to
come on shore, but, when they have dived for their prey, turn on their
backs and float while they eat it, holding the sea-urchins, crabs, or fish,
in their front paws. They even nurse their young ones in the same
fashion, dandling them in their arms as they lie face upwards on the
sea; and they rear them entirely on the thick beds of kelp off the coasts
of the North Pacific Ocean, never bringing them on land.
These sea-otters may be seen in hundreds off the coasts of Alaska
and California, basking on the wet rocks, playing, leaping, and
plunging in the water, till some alarm makes each mother seize her
little one in her teeth and dive under in an instant.
They are twice the size of the River Otter, and in many points more
like seals, for though their front paws are short and cat-like, their hind
feet are flat flippers, with a long outer toe; their face too is broad and
short, and their teeth are neither cutting like the weasels nor flattened
like the bears, but covered with rounded knobs, well fitted for crushing
crab-shells and the bones of the fish on which they feed.
Fig. 79.
Sea-Otter. (From Wolf.)
Showing the front paws, and the hind webbed feet.
These creatures, although they have “flippers,” and are truly fin-footed,
are much more like land animals than the smaller seals, for
they plant their whole foot on the ground as a bear does, and walk, or,
more properly, “flop along” on all fours. A mere glance at the skeleton
of the sea lion, which is one of these higher kind of seals with a slight
outer ear, shows that it is a four-footed animal, with five toes to
each foot, the great toes and the thumbs being the largest. We can
see distinctly the short thighs and the long shanks, which give the hind
flippers their lanky appearance, and we see, too, the broad stumpy
arms, which give such strength to the front flippers in swimming. For
the eared seals and walruses use their fore flippers very much in the
water, while the true seals swim almost entirely with the hind flippers,
and use the front ones chiefly for guiding themselves.
And now if we turn to the living fur seal we find that the reasons
are twofold which make us forget that his limbs are legs. In the first
place, the skin of his body comes down very low over his arms (see
Fig. 81), while the hand is encased in skin, with only mere traces of
nails upon it. Then as regards his hind legs, not only are the feet made
into flippers, in which the toes are joined by a loose flexible skin, so
that they can move them freely when swimming, but the legs
themselves are strapped back by a skin passing right across his tail,
so that his thighs are kept flat against his side, and only the lower part
of the legs has power to move. We lose sight, then, of the limbs, and
see very little more than the feet, which are disguised by being turned
into flippers.
Now if we once think what is the object of a seal’s life, this curious
change in its body is at once explained. For seals are the hunters of
the sea; fish-food is to them what flesh-food is to lions, wolves, and
bears, only that they have a much wider field to hunt in, for they have
the whole ocean for their feeding ground, and no one to dispute it with
them but the sea-otter in places near the land, and the porpoises and
other fish-feeding whales out at sea. In consequence of this we find
seals of some kind in almost all parts of the world, except the Indian
Ocean, though they evidently prefer the cooler regions. Even the large
sea lions live in the North Pacific, as far up as the Aleutian Isles, and in
the South Pacific down to the Falkland Islands and Kerguelen’s Land,
and play about the shores of the Cape, New Zealand, and Australia.
Fig. 81.
Showing how the Sea Lion walks on the flat hind feet,
while the seal’s flippers lie back in a line with the body; note
also the absence of an external ear in the seal.
Fig. 82.
* * * * *
Thus the seals are bold ocean lovers, feeding entirely on animal
food, and finding plenty of it in the wide sea as they roam. But there is
another family of warm-blooded animals, pure vegetable-feeders,
which also must have found their way in distant ages into the water; for
they too are milk-givers, and though they have lost their hind legs,
have still the front legs with all their proper bones, with the hands
turned into flippers.
These animals are the curious sea-cows or Manatees, which
wander under water along the east coast of Africa and west coast of
South America, feeding in the bays and often up the rivers, on the
seaweeds and water-plants of all kinds; while another kind with tusks,
called the Dugong, feeds all along the shores of the Indian Ocean and
Australia.
It is strange that while every child knows something about seals,
very few people have heard of these gentle grazing manatees and
dugongs, the only large vegetable-feeders of the sea. Yet they are
curious, interesting animals, and seem to be the forms which have
given rise to the popular stories of mermaids, for they suckle their
young ones at the breast, clasping them with their flippers, and when
they raise their heads in the water have something the appearance of
an uncouth mother nursing her child.
Fig. 83.
* * * * *
And now that we have firmly grasped the fact that the seals and
manatees, however altered in shape, belong to the four-footed and
milk-giving group, perhaps we shall be prepared to understand how it
is that the whales are not fish, though this popular delusion is one
of the most difficult to overcome. “Do you really mean then,” exclaim
nearly all people who are not naturalists, “that a whale is not a huge
fish?” Certainly I do! A whale is no more a fish than crocodiles,
penguins, or seals, are fish although they too live chiefly in the water.
A whale is a warm-blooded, air-breathing, milk-giving animal. Its
fins are hands with finger-bones, having a large number of joints (see
Fig. 84); its tail is a piece of cartilage or gristle, and not a fish’s fin with
bones and rays; it has teeth in its gums even if it never cuts them; and
it gives suck to its little one just as much as a cow does to her calf (see
Fig. 85). Nay! the whalebone whales have even the traces of hind legs
entirely buried under the skin (see Fig. 84), and in the Greenland
whale the hip-joint and knee-joint can be distinguished with some of
their muscles, though the bones are quite hidden and useless.
Fig. 84.
b, blowhole; a, upper arm; fa, fore arm; h, hand; p, th, l, small remains of
pelvis or hip-bone, thigh, and leg; r, roof of the palate; w, w, plates of
whalebone; f, whalebone fringe.
We see then that the whale undoubtedly belongs to the same type
as the four-footed land animals, although it branched off into the water
so long ago that it may have come from some very early milk-giver. But