Spark in Action, Second Edition: Covers Apache Spark 3 with Examples in Java, Python, and Scala, by Jean-Georges Perrin (ISBN 9781617295522, 1617295523)


SECOND EDITION
Covers Apache Spark 3
With examples in Java, Python, and Scala

Jean-Georges Perrin
Foreword by Rob Thomas

MANNING
Lexicon
Summary of the Spark terms involved in the deployment process

Term | Definition
Application | Your program that is built on and for Spark. Consists of a driver program and executors on the cluster.
Application JAR | A Java archive (JAR) file containing your Spark application. It can be an uber JAR including all the dependencies.
Cluster manager | An external service for acquiring resources on the cluster. It can be the Spark built-in cluster manager. More details in chapter 6.
Deploy mode | Distinguishes where the driver process runs. In cluster mode, the framework launches the driver inside the cluster. In client mode, the submitter launches the driver outside the cluster. You can find out which mode you are in by calling the deployMode() method. This method returns a read-only property.
Driver program | The process running the main() function of the application and creating the SparkContext. Everything starts here.
Executor | A process launched for an application on a worker node. The executor runs tasks and keeps data in memory or in disk storage across them. Each application has its own executors.
Job | A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (for example, save() or collect(); check out appendix I).
Stage | Each job gets divided into smaller sets of tasks, called stages, that depend on each other (similar to the map and reduce stages in MapReduce).
Task | A unit of work that will be sent to one executor.
Worker node | Any node that can run application code in the cluster.
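
How these terms relate can be made concrete with a small model that does not use Spark at all. The sketch below is only an analogy in plain Java: the class LexiconSketch and the method runJob are hypothetical names, not part of the Spark API, and a thread pool stands in for the executors that Spark would launch on worker nodes. It assumes executorCount and partitions are both positive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Conceptual model only: hypothetical names, not Spark classes.
final class LexiconSketch {

    // Plays the role of the driver program: it defines the job and collects the result.
    // The job here is summing the integers in [0, n).
    static long runJob(int n, int executorCount, int partitions) {
        // The thread pool stands in for the executors of one application.
        ExecutorService executors = Executors.newFixedThreadPool(executorCount);
        try {
            // First stage (map-like): one task per partition, each sent to one executor.
            List<Future<Long>> partials = new ArrayList<>();
            int chunk = (n + partitions - 1) / partitions; // ceiling division
            for (int start = 0; start < n; start += chunk) {
                final int from = start;
                final int to = Math.min(n, start + chunk);
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += i;
                    }
                    return sum;
                };
                partials.add(executors.submit(task));
            }
            // Second stage (reduce-like): depends on the first, so it waits for every partial sum.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("job failed", e);
        } finally {
            executors.shutdown();
        }
    }

    public static void main(String[] args) {
        // Calling runJob plays the role of the action that spawns the job.
        System.out.println(runJob(1000, 2, 4)); // prints 499500
    }
}
```

In this model, the tasks of the first stage run in parallel on the pool, while the second stage cannot start until they finish, which mirrors the dependency between stages described in the table. A real Spark application would leave the splitting and scheduling to the framework and the cluster manager.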

[Figure: Apache Spark components. The driver program (SparkSession/SparkContext), running your code packaged as an application JAR, connects to the cluster manager and to worker nodes; each worker node hosts an executor with a cache and tasks. Annotations: a job is a set of parallel tasks triggered after an action is called; jobs are split into stages; the driver can access its deployment mode.]
Spark in Action
SECOND EDITION

JEAN-GEORGES PERRIN
FOREWORD BY ROB THOMAS

MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: [email protected]

©2020 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books are
printed on paper that is at least 15 percent recycled and processed without the use of elemental
chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Marina Michaels
Technical development editor: Al Scherer
Review editor: Aleks Dragosavljević
Production editor: Lori Weidert
Copy editor: Sharon Wilkey
Proofreader: Melody Dolab
Technical proofreader: Rambabu Dosa and Thomas Lockney
Typesetter: Gordan Salinovic
Cover designer: Marija Tudor

ISBN 9781617295522
Printed in the United States of America
Liz,
Thank you for your patience, support, and love during this endeavor.

Ruby, Nathaniel, Jack, and Pierre-Nicolas,


Thank you for being so understanding about my lack of availability during this venture.

I love you all.


contents
foreword
preface
acknowledgments
about this book
about the author
about the cover illustration

PART 1 THE THEORY CRIPPLED BY AWESOME EXAMPLES

1 So, what is Spark, anyway?
1.1 The big picture: What Spark is and what it does
What is Spark? ■ The four pillars of mana
1.2 How can you use Spark?
Spark in a data processing/engineering scenario ■ Spark in a data science scenario
1.3 What can you do with Spark?
Spark predicts restaurant quality at NC eateries ■ Spark allows fast data transfer for Lumeris ■ Spark analyzes equipment logs for CERN ■ Other use cases
1.4 Why you will love the dataframe
The dataframe from a Java perspective ■ The dataframe from an RDBMS perspective ■ A graphical representation of the dataframe
1.5 Your first example
Recommended software ■ Downloading the code ■ Running your first application ■ Your first code

2 Architecture and flow
2.1 Building your mental model
2.2 Using Java code to build your mental model
2.3 Walking through your application
Connecting to a master ■ Loading, or ingesting, the CSV file ■ Transforming your data ■ Saving the work done in your dataframe to a database

3 The majestic role of the dataframe
3.1 The essential role of the dataframe in Spark
Organization of a dataframe ■ Immutability is not a swear word
3.2 Using dataframes through examples
A dataframe after a simple CSV ingestion ■ Data is stored in partitions ■ Digging in the schema ■ A dataframe after a JSON ingestion ■ Combining two dataframes
3.3 The dataframe is a Dataset<Row>
Reusing your POJOs ■ Creating a dataset of strings ■ Converting back and forth
3.4 Dataframe’s ancestor: the RDD

4 Fundamentally lazy
4.1 A real-life example of efficient laziness
4.2 A Spark example of efficient laziness
Looking at the results of transformations and actions ■ The transformation process, step by step ■ The code behind the transformation/action process ■ The mystery behind the creation of 7 million datapoints in 182 ms ■ The mystery behind the timing of actions
4.3 Comparing to RDBMS and traditional applications
Working with the teen birth rates dataset ■ Analyzing differences between a traditional app and a Spark app
4.4 Spark is amazing for data-focused applications
4.5 Catalyst is your app catalyzer

5 Building a simple app for deployment
5.1 An ingestionless example
Calculating π ■ The code to approximate π ■ What are lambda functions in Java? ■ Approximating π by using lambda functions
5.2 Interacting with Spark
Local mode ■ Cluster mode ■ Interactive mode in Scala and Python

6 Deploying your simple app
6.1 Beyond the example: The role of the components
Quick overview of the components and their interactions ■ Troubleshooting tips for the Spark architecture ■ Going further
6.2 Building a cluster
Building a cluster that works for you ■ Setting up the environment
6.3 Building your application to run on the cluster
Building your application’s uber JAR ■ Building your application by using Git and Maven
6.4 Running your application on the cluster
Submitting the uber JAR ■ Running the application ■ Analyzing the Spark user interface

PART 2 INGESTION

7 Ingestion from files
7.1 Common behaviors of parsers
7.2 Complex ingestion from CSV
Desired output ■ Code
7.3 Ingesting a CSV with a known schema
Desired output ■ Code
7.4 Ingesting a JSON file
Desired output ■ Code
7.5 Ingesting a multiline JSON file
Desired output ■ Code
7.6 Ingesting an XML file
Desired output ■ Code
7.7 Ingesting a text file
Desired output ■ Code
7.8 File formats for big data
The problem with traditional file formats ■ Avro is a schema-based serialization format ■ ORC is a columnar storage format ■ Parquet is also a columnar storage format ■ Comparing Avro, ORC, and Parquet
7.9 Ingesting Avro, ORC, and Parquet files
Ingesting Avro ■ Ingesting ORC ■ Ingesting Parquet ■ Reference table for ingesting Avro, ORC, or Parquet

8 Ingestion from databases
8.1 Ingestion from relational databases
Database connection checklist ■ Understanding the data used in the examples ■ Desired output ■ Code ■ Alternative code
8.2 The role of the dialect
What is a dialect, anyway? ■ JDBC dialects provided with Spark ■ Building your own dialect
8.3 Advanced queries and ingestion
Filtering by using a WHERE clause ■ Joining data in the database ■ Performing ingestion and partitioning ■ Summary of advanced features
8.4 Ingestion from Elasticsearch
Data flow ■ The New York restaurants dataset digested by Spark ■ Code to ingest the restaurant dataset from Elasticsearch

9 Advanced ingestion: finding data sources and building your own
9.1 What is a data source?
9.2 Benefits of a direct connection to a data source
Temporary files ■ Data quality scripts ■ Data on demand
9.3 Finding data sources at Spark Packages
9.4 Building your own data source
Scope of the example project ■ Your data source API and options
9.5 Behind the scenes: Building the data source itself
9.6 Using the register file and the advertiser class
9.7 Understanding the relationship between the data and schema
The data source builds the relation ■ Inside the relation
9.8 Building the schema from a JavaBean
9.9 Building the dataframe is magic with the utilities
9.10 The other classes

10 Ingestion through structured streaming
10.1 What’s streaming?
10.2 Creating your first stream
Generating a file stream ■ Consuming the records ■ Getting records, not lines
10.3 Ingesting data from network streams
10.4 Dealing with multiple streams
10.5 Differentiating discretized and structured streaming

PART 3 TRANSFORMING YOUR DATA

11 Working with SQL
11.1 Working with Spark SQL
11.2 The difference between local and global views
11.3 Mixing the dataframe API and Spark SQL
11.4 Don’t DELETE it!
11.5 Going further with SQL

12 Transforming your data
12.1 What is data transformation?
12.2 Process and example of record-level transformation
Data discovery to understand the complexity ■ Data mapping to draw the process ■ Writing the transformation code ■ Reviewing your data transformation to ensure a quality process ■ What about sorting? ■ Wrapping up your first Spark transformation
12.3 Joining datasets
A closer look at the datasets to join ■ Building the list of higher education institutions per county ■ Performing the joins
12.4 Performing more transformations

13 Transforming entire documents
13.1 Transforming entire documents and their structure
Flattening your JSON document ■ Building nested documents for transfer and storage
13.2 The magic behind static functions
13.3 Performing more transformations
13.4 Summary

14 Extending transformations with user-defined functions
14.1 Extending Apache Spark
14.2 Registering and calling a UDF
Registering the UDF with Spark ■ Using the UDF with the dataframe API ■ Manipulating UDFs with SQL ■ Implementing the UDF ■ Writing the service itself
14.3 Using UDFs to ensure a high level of data quality
14.4 Considering UDFs’ constraints

15 Aggregating your data
15.1 Aggregating data with Spark
A quick reminder on aggregations ■ Performing basic aggregations with Spark
15.2 Performing aggregations with live data
Preparing your dataset ■ Aggregating data to better understand the schools
15.3 Building custom aggregations with UDAFs

PART 4 GOING FURTHER

16 Cache and checkpoint: Enhancing Spark’s performances
16.1 Caching and checkpointing can increase performance
The usefulness of Spark caching ■ The subtle effectiveness of Spark checkpointing ■ Using caching and checkpointing
16.2 Caching in action
16.3 Going further in performance optimization

17 Exporting data and building full data pipelines
17.1 Exporting data
Building a pipeline with NASA datasets ■ Transforming columns to datetime ■ Transforming the confidence percentage to confidence level ■ Exporting the data ■ Exporting the data: What really happened?
17.2 Delta Lake: Enjoying a database close to your system
Understanding why a database is needed ■ Using Delta Lake in your data pipeline ■ Consuming data from Delta Lake
17.3 Accessing cloud storage services from Spark

18 Exploring deployment constraints: Understanding the ecosystem
18.1 Managing resources with YARN, Mesos, and Kubernetes
The built-in standalone mode manages resources ■ YARN manages resources in a Hadoop environment ■ Mesos is a standalone resource manager ■ Kubernetes orchestrates containers ■ Choosing the right resource manager
18.2 Sharing files with Spark
Accessing the data contained in files ■ Sharing files through distributed filesystems ■ Accessing files on shared drives or file server ■ Using file-sharing services to distribute files ■ Other options for accessing files in Spark ■ Hybrid solution for sharing files with Spark
18.3 Making sure your Spark application is secure
Securing the network components of your infrastructure ■ Securing Spark’s disk usage

appendix A Installing Eclipse
appendix B Installing Maven
appendix C Installing Git
appendix D Downloading the code and getting started with Eclipse
appendix E A history of enterprise data
appendix F Getting help with relational databases
appendix G Static functions ease your transformations
appendix H Maven quick cheat sheet
appendix I Reference for transformations and actions
appendix J Enough Scala
appendix K Installing Spark in production and a few tips
appendix L Reference for ingestion
appendix M Reference for joins
appendix N Installing Elasticsearch and sample data
appendix O Generating streaming data
appendix P Reference for streaming
appendix Q Reference for exporting data
appendix R Finding help when you’re stuck
index

foreword
The analytics operating system
In the twentieth century, scale effects in business were largely driven by breadth and
distribution. A company with manufacturing operations around the world had an inherent
cost and distribution advantage, leading to more-competitive products. A retailer
with a global base of stores had a distribution advantage that could not be matched by
a smaller company. These scale effects drove competitive advantage for decades.
The internet changed all of that. Today, three predominant scale effects exist:

Network—Lock-in that is driven by a loyal network (Facebook, Twitter, Etsy, and
so forth)

Economies of scale—Lower unit cost, driven by volume (Apple, TSMC, and so
forth)

Data—Superior machine learning and insight, driven from a dynamic corpus of
data
In Big Data Revolution (Wiley, 2015), I profiled a few companies that are capitalizing
on data as a scale effect. But, here in 2019, big data is still largely an unexploited asset
in institutions around the world. Spark, the analytics operating system, is a catalyst to
change that.
Spark has been a catalyst in changing the face of innovation at IBM. Spark is the
analytics operating system, unifying data sources and data access. The unified programming
model of Spark makes it the best choice for developers building data-rich
analytic applications. Spark reduces the time and complexity of building analytic
workflows, enabling builders to focus on machine learning and the ecosystem around
Spark. As we have seen time and again, an open source project is igniting innovation,
with speed and scale.
This book takes you deeper into the world of Spark. It covers the power of the
technology and the vibrancy of the ecosystem, and covers practical applications for
putting Spark to work in your company today. Whether you are working as a data engineer,
data scientist, or application developer, or running IT operations, this book
reveals the tools and secrets that you need to know, to drive innovation in your company
or community.
Our strategy at IBM is about building on top of and around a successful open platform,
and adding something of our own that’s substantial and differentiated. Spark is
that platform. We have countless examples in IBM, and you will have the same in your
company as you embark on this journey.
Spark is about innovation—an analytics operating system on which new solutions
will thrive, unlocking the big data scale effect. And Spark is about a community of
Spark-savvy data scientists and data analysts who can quickly transform today’s problems
into tomorrow’s solutions. Spark is one of the fastest-growing open source projects
in history. Welcome to the movement.
—ROB THOMAS
SENIOR VICE PRESIDENT,
CLOUD AND DATA PLATFORM, IBM
preface
I don’t think Apache Spark needs an introduction. If you’re reading these lines, you
probably have some idea of what this book is about: data engineering and data science
at scale, using distributed processing. However, Spark is more than that, which you
will soon discover, starting with Rob Thomas’s foreword and chapter 1.
Just as Obelix fell into the magic potion,1 I fell into Spark in 2015. At that time, I
was working for a French computer hardware company, where I helped design highly
performing systems for data analytics. As one should be, I was skeptical about Spark at
first. Then I started working with it, and you now have the result in your hands. From
this initial skepticism came a real passion for a wonderful tool that allows us to process
data in—this is my sincere belief—a very easy way.
I started a few projects with Spark, which allowed me to give talks at Spark Summit,
IBM Think, and closer to home at All Things Open, Open Source 101, and through
the local Spark user group I co-animate in the Raleigh-Durham area of North Carolina.
This allowed me to meet great people and see plenty of Spark-related projects. As
a consequence, my passion continued to grow.
This book is about sharing that passion.
Examples (or labs) in the book are based on Java, but the online repository contains
Scala and Python as well. As Spark 3.0 was coming out, the team at Manning and I
decided to make sure that the book reflects the latest versions, and not as an afterthought.

1 Obelix is a comics and cartoon character. He is the inseparable companion of Asterix. When Asterix, a Gaul,
drinks a magic potion, he gains superpowers that allow him to regularly beat the Romans (and pirates). As a
baby, Obelix fell into the cauldron where the potion was made, and the potion has an everlasting effect on him.
Asterix is a popular comic in Europe. Find out more at www.asterix.com/en/.

As you may have guessed, I love comic books. I grew up with them. I love this way
of communicating, which you’ll see in this book. It’s not a comic book, but its nearly
200 images should help you understand this fantastic tool that is Apache Spark.
Just as Asterix has Obelix for a companion, Spark in Action, Second Edition has a
reference companion supplement that you can download for free from the resource
section on the Manning website; a short link is http://jgp.net/sia. This supplement
contains reference information on Spark static functions and will eventually grow to
more useful reference resources.
Whether you like this book or not, drop me a tweet at @jgperrin. If you like it,
write an Amazon review. If you don’t, as they say at weddings, forever hold your peace.
Nevertheless, I sincerely hope you’ll enjoy it.
Alea iacta est.2

2 The die is cast. This sentence was attributed to Julius Caesar (Asterix’s arch frenemy) as Caesar led his army
over the Rubicon: things have happened and can’t be changed back, like this book being printed, for you.
acknowledgments
This is the section where I express my gratitude to the people who helped me in this
journey. It’s also the section where you have a tendency to forget people, so if you feel
left out, I am sorry. Really sorry. This book has been a tremendous effort, and doing it
alone probably would have resulted in a two- or three-star book on Amazon, instead of
the five-star rating you will give it soon (this is a call to action, thanks!).
I’d like to start by thanking the teams at work who trusted me on this project, starting
with Zaloni (Anupam Rakshit and Tufail Khan), Lumeris (Jon Farn, Surya
Koduru, Noel Foster, Divya Penmetsa, Srini Gaddam, and Bryce Tutt; all of whom
almost blindly followed me on the Spark bandwagon), the people at Veracity Solutions,
and my new team at Advance Auto Parts.
Thanks to Mary Parker of the Department of Statistics at the University of Texas at
Austin and Cristiana Straccialana Parada. Their contributions helped clarify some
sections.
I’d like to thank the community at large, including Jim Hughes, Michael Ben-David,
Marcel-Jan Krijgsman, Jean-Francois Morin, and all the anonymous contributors posting
pull requests on GitHub. I would like to express my sincere gratitude to the folks at
Databricks, IBM, Netflix, Uber, Intel, Apple, Alluxio, Oracle, Microsoft, Cloudera,
NVIDIA, Facebook, Google, Alibaba, numerous universities, and many more who contribute
to making Spark what it is. More specifically, for their work, inspiration, and
support, thanks to Holden Karau, Jacek Laskowski, Sean Owen, Matei Zaharia, and
Jules Damji.

During this project, I participated in several podcasts. My thanks to Tobias Macey
for Data Engineering Podcast (http://mng.bz/WPjX), IBM’s Al Martin for “Making
Data Simple” (http://mng.bz/8p7g), and the Roaring Elephant by Jhon Masschelein
and Dave Russell (http://mng.bz/EdRr).
As an IBM Champion, it has been a pleasure to work with so many IBMers during
this adventure. They either helped directly, indirectly, or were inspirational: Rob
Thomas (we need to work together more), Marius Ciortea, Albert Martin (who,
among other things, runs the great podcast called Making Data Simple), Steve Moore,
Sourav Mazumder, Stacey Ronaghan, Mei-Mei Fu, Vijay Bommireddipalli (keep this
thing you have in San Francisco rolling!), Sunitha Kambhampati, Sahdev Zala, and,
my brother, Stuart Litel.
I want to thank the people at Manning who adopted this crazy project. As in all
good movies, in order of appearance: my acquisition editor, Michael Stephens; our
publisher, Marjan Bace; my development editors, Marina Michaels and Toni Arritola;
and production staff, Erin Twohey, Rebecca Rinehart, Bert Bates, Candace Gillhoolley,
Radmila Ercegovac, Aleks Dragosavljevic, Matko Hrvatin, Christopher Kaufmann,
Ana Romac, Cheryl Weisman, Lori Weidert, Sharon Wilkey, and Melody Dolab.
I would also like to acknowledge and thank all of the Manning reviewers: Anupam
Sengupta, Arun Lakkakulam, Christian Kreutzer-Beck, Christopher Kardell, Conor
Redmond, Ezra Schroeder, Gábor László Hajba, Gary A. Stafford, George Thomas,
Giuliano Araujo Bertoti, Igor Franca, Igor Karp, Jeroen Benckhuijsen, Juan Rufes, Kelvin
Johnson, Kelvin Rawls, Mario-Leander Reimer, Markus Breuer, Massimo Dalla Rovere,
Pavan Madhira, Sambaran Hazra, Shobha Iyer, Ubaldo Pescatore, Victor Durán,
and William E. Wheeler. It does take a village to write a (hopefully) good book. I also
want to thank Petar Zečević and Marko Bonaći, who wrote the first edition of this
book. Thanks to Thomas Lockney for his detailed technical review, and also to Rambabu
Posa for porting the code in this book. I’d like to thank Jon Rioux (merci, Jonathan!)
for starting the PySpark in Action adventure. He coined the idea of “team Spark
at Manning.”
I’d like to thank again Marina. Marina was my development editor during most of
the book. She was here when I had issues, she was here with advice, she was tough on
me (yeah, you cannot really slack off), but instrumental in this project. I will remember
our long discussions about the book (which may or may not have been a pretext
for talking about anything else). I will miss you, big sister (almost to the point of starting
another book right away).
Finally, I want to thank my parents, who supported me more than they should have
and to whom I dedicate the cover; my wife, Liz, who helped me on so many levels,
including understanding editors; and our kids, Pierre-Nicolas, Jack, Nathaniel, and
Ruby, from whom I stole too much time writing this book.
about this book
When I started this project, which became the book you are reading, Spark in Action,
Second Edition, my goals were to

Help the Java community use Apache Spark, demonstrating that you do not
need to learn Scala or Python

Explain the key concepts behind Apache Spark, (big) data engineering, and
data science, without you having to know anything other than a relational database
and some SQL

Evangelize that Spark is an operating system designed for distributed computing
and analytics
I believe in teaching anything computer science with a high dose of examples. The
examples in this book are an essential part of the learning process. I designed them to
be as close as possible to real-life professional situations. My datasets come from real-
life situations with their quality flaws; they are not the ideal textbook datasets that
“always work.” That’s why, when combining both those examples and datasets, you will
work and learn in a more pragmatic way than a sterilized way. I call those examples
labs, with the hope that you will find them inspirational and that you will want to
experiment with them.
Illustrations are everywhere. Based on the well-known saying, A picture is worth a
thousand words, I saved you from reading an extra 183,000 words.

Who should read this book


It is a difficult task to associate a job title to a book, so if your title is data engineer,
data scientist, software engineer, or data/software architect, you’ll certainly be happy.
If you are an enterprise architect, meh, you probably know all that, as enterprise architects
know everything about everything, no? More seriously, this book will be helpful if
you look to gather more knowledge on any of these topics:

Using Apache Spark to build analytics and data pipelines: ingestion, transformation,
and exporting/publishing.

Using Spark without having to learn Scala or Hadoop: learning Spark with Java.

Understanding the difference between a relational database and Spark.

The basic concepts about big data, including the key Hadoop components you
may encounter in a Spark environment.

Positioning Spark in an enterprise architecture.

Using your existing Java and RDBMS skills in a big data environment.

Understanding the dataframe API.

Integrating relational databases by ingesting data in Spark.

Gathering data via streams.

Understanding the evolution of the industry and why Spark is a good fit.

Understanding and using the central role of the dataframe.

Knowing what resilient distributed datasets (RDDs) are and why they should
not be used (anymore).

Understanding how to interact with Spark.

Understanding the various components of Spark: driver, executors, master and
workers, Catalyst, Tungsten.

Learning the role of key Hadoop-derived technologies such as YARN or HDFS.

Understanding the role of a resource manager such as YARN, Mesos, and the
built-in manager.

Ingesting data from various files in batch mode and via streams.

Using SQL with Spark.

Manipulating the static functions provided with Spark.

Understanding what immutability is and why it matters.

Extending Spark with Java user-defined functions (UDFs).


Extending Spark with new data sources.

Linearizing data from JSON so you can use SQL.

Performing aggregations and unions on dataframes.

Extending aggregation with user-defined aggregate functions (UDAFs).

Understanding the difference between caching and checkpointing, and
increasing performance of your Spark applications.

Exporting data to files and databases.

Understanding deployment on AWS, Azure, IBM Cloud, GCP, and on-premises
clusters.

Ingesting data from files in CSV, XML, JSON, text, Parquet, ORC, and Avro.

Extending data sources, with an example on how to ingest photo metadata
using EXIF, focusing on the Data Source API v1.

Using Delta Lake with Spark while you build pipelines.

What will you learn in this book?


The goal of this book is to teach you how to use Spark within your applications or
build specific applications for Spark.
I designed this book for data engineers and Java software engineers. When I started
learning Spark, everything was in Scala, nearly all documentation was on the official
website, and Stack Overflow displayed a Spark question every other blue moon. Sure,
the documentation claimed Spark had a Java API, but advanced examples were scarce.
At that time, my teammates were confused, between learning Spark and learning
Scala, and our management wanted results. My team members were my motivation for
writing this book.
I assume that you have basic Java and RDBMS knowledge. I use Java 8 in all exam-
ples, even though Java 11 is out there.
You do not need to have Hadoop knowledge to read this book, but because you
will need some Hadoop components (very few), I will cover them. If you already know
Hadoop, you will certainly find this book refreshing. You do not need any Scala knowl-
edge, as this is a book about Spark and Java.
When I was a kid (and I must admit, still now), I read a lot of bandes dessinées, a
cross between a comic book and a graphic novel. As a result, I love illustrations, and I
have a lot of them in this book. Figure 1 shows a typical diagram with several compo-
nents, icons, and legends.

[Figure 1 diagram: sample icons labeled Component, File, Spark component,
Database/datastore, and Stream, annotated with the following legends.]

Components are simple squares, sometimes rectangles. They drop the “UML-like
plug” in the top-right corner that you can find in some classic architecture
diagrams.

A legend is always in bold, with a curved line to the term it describes.

Files have an odd top-right corner, like a paperclip.

Arrows indicate flows, not dependencies as in a classic architecture diagram.

I dropped the cylinder for databases. They have evolved from those meaningless
(nowadays) cylinders to a box with a nice wedge.

Sets of components are grouped within a dotted box. If there is an ambiguity,
Spark components are usually associated with the Spark logo.

A streaming component uses a flow of data symbolized by three circles.

Figure 1 Iconography used in a typical illustration in this book

How this book is organized


This book is divided into four parts and 18 appendixes.
Part 1 gives you the keys to Spark. You will learn the theory and general concepts,
but do not despair (yet); I present a lot of examples and diagrams. It almost reads like
a comic book.

Chapter 1 is an overall introduction with a simple example. You will learn why
Spark is a distributed analytics operating system.

Chapter 2 walks you through a simple Spark process.

Chapter 3 teaches about the magnificence of the dataframe, which combines
both the API and storage capabilities of Spark.

Chapter 4 celebrates laziness, compares Spark and RDBMS, and introduces the
directed acyclic graph (DAG).

Chapters 5 and 6 are linked: you’ll build a small application, build a cluster, and
deploy your application. Chapter 5 is about building a small application, while
chapter 6 is deploying the application.

In part 2, you will start diving into practical and pragmatic examples around inges-
tion. Ingestion is the process of bringing data into Spark. It is not complex, but there
are a lot of possibilities and combinations.

Chapter 7 describes data ingestion from files: CSV, text, JSON, XML, Avro,
ORC, and Parquet. Each file format has its own example.

Chapter 8 covers ingestion from databases: data will be coming from relational
databases and other data stores.

Chapter 9 is about ingesting anything from custom data sources.

Chapter 10 focuses on streaming data.
Part 3 is about transforming data: this is what I would call heavy data lifting. You’ll
learn about data quality, transformation, and publishing of your processed data. This
largest part of the book talks about using the dataframe with SQL and with its API,
aggregates, caching, and extending Spark with UDFs.

Chapter 11 is about the well-known query language SQL.

Chapter 12 teaches you how to perform transformation.

Chapter 13 extends transformation to the level of entire documents. This chap-
ter also explains static functions, which are one of the many great aspects of
Spark.

Chapter 14 is all about extending Spark using user-defined functions.

Aggregations are also a well-known database concept and may be the key to ana-
lytics. Chapter 15 covers aggregations, both those included in Spark and cus-
tom aggregations.
Finally, part 4 is about going closer to production and focusing on more advanced
topics. You’ll learn about partitioning and exporting data, deployment constraints
(including to the cloud), and optimization.

Chapter 16 focuses on optimization techniques: caching and checkpointing.

Chapter 17 is about exporting data to databases and files. This chapter also
explains how to use Delta Lake, a database that sits next to Spark’s kernel.

Chapter 18 details reference architectures and security needed for deployment.
It’s definitely less hands-on, but so full of critical information.
The appendixes, although not essential, also bring a wealth of information: installing,
troubleshooting, and contextualizing. A lot of them are curated references for
Apache Spark in a Java context.

About the code


As I’ve said, each chapter (except 6 and 18) has labs that combine code and data. Source
code is in numbered listings and in line with normal text. In both cases, source code is
formatted in a fixed-width font like this to separate it from ordinary text. Some-
times code is also in bold to highlight code that is more important in a block of code.

All the code is freely available on GitHub under an Apache 2.0 license. The data
may have a different license. Each chapter has its own repository: chapter 1 will be in
https://github.com/jgperrin/net.jgp.books.spark.ch01, while chapter 15 is in
https://github.com/jgperrin/net.jgp.books.spark.ch15, and so on. Two exceptions:

Chapter 6 uses the code of chapter 5.

Chapter 18, which talks about deployment in detail, does not have code.
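
The repository naming convention and its two exceptions can be sketched as a small Java helper. This is only an illustration, not code from the book: the class name ChapterRepositories and the method urlFor are hypothetical, and the two-digit, zero-padded suffix is an assumption inferred from the two URLs quoted above (ch01 and ch15).

```java
import java.util.Optional;

// Hypothetical helper (not part of the book's repositories): maps a chapter
// number to the GitHub repository that holds its labs.
final class ChapterRepositories {
  // Prefix shared by the two repository URLs quoted in the text.
  private static final String PREFIX =
      "https://github.com/jgperrin/net.jgp.books.spark.ch";

  private ChapterRepositories() {}

  /** Returns the repository URL for a chapter, or empty if it has no code. */
  static Optional<String> urlFor(int chapter) {
    if (chapter < 1 || chapter > 18) {
      throw new IllegalArgumentException("No such chapter: " + chapter);
    }
    if (chapter == 18) {
      return Optional.empty(); // chapter 18 does not have code
    }
    // Chapter 6 uses the code of chapter 5.
    int repoChapter = (chapter == 6) ? 5 : chapter;
    // Assumption: every repository uses a two-digit, zero-padded suffix.
    return Optional.of(String.format("%s%02d", PREFIX, repoChapter));
  }

  public static void main(String[] args) {
    for (int chapter : new int[] {1, 6, 15, 18}) {
      System.out.println(
          "Chapter " + chapter + ": " + urlFor(chapter).orElse("(no code)"));
    }
  }
}
```

Returning an Optional rather than null makes the "no code" case of chapter 18 explicit to the caller.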
As source control tools allow branches, the master branch contains the code against
the latest production version, while each repository contains branches dedicated to
specific versions, when applicable.
Labs are numbered in three digits, starting at 100. There are two kinds of labs: the
labs that are described in the book and the extra labs available online:

Labs described in the book are numbered per section of the chapter. There-
fore, lab #200 of chapter 12 is covered in chapter 12, section 2. Likewise, lab
#100 of chapter 17 is detailed in the first section of chapter 17.

Labs that are not described in the book start with a 9, as in 900, 910, and so on.
Labs in the 900 series are growing: I keep adding more. Lab numbers are not
contiguous, just like the line numbers in your BASIC code.
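
The lab numbering scheme can be expressed in a few lines of Java. This is a minimal sketch under stated assumptions: the class LabNumbers and its methods are hypothetical names, and it assumes that the hundreds digit identifies the section for every book lab (the text only gives #100 and #200 as examples).

```java
import java.util.OptionalInt;

// Hypothetical helper (not from the book): interprets three-digit lab numbers.
final class LabNumbers {
  private LabNumbers() {}

  /** True for labs in the 900 series, which are only available online. */
  static boolean isOnlineExtra(int lab) {
    requireThreeDigits(lab);
    return lab / 100 == 9;
  }

  /**
   * Returns the chapter section that describes the lab, or empty for
   * online-only labs. Assumption: the hundreds digit is the section number.
   */
  static OptionalInt sectionOf(int lab) {
    return isOnlineExtra(lab)
        ? OptionalInt.empty()
        : OptionalInt.of(lab / 100);
  }

  private static void requireThreeDigits(int lab) {
    if (lab < 100 || lab > 999) {
      throw new IllegalArgumentException("Labs are numbered 100 to 999: " + lab);
    }
  }

  public static void main(String[] args) {
    System.out.println("Lab #200 is in section " + sectionOf(200).getAsInt());
    System.out.println("Lab #910 is online only: " + isOnlineExtra(910));
  }
}
```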
In GitHub, you will find the code in Python, Scala, and Java (unless it is not applica-
ble). However, to maintain clarity in the book, only Java is used.
In many cases, the original source code has been reformatted; we’ve added line
breaks and reworked indentation to accommodate the available page space in the
book. In rare cases, even this was not enough, and listings include line-continuation
markers (➥). Additionally, comments in the source code have often been removed
from the listings when the code is described in the text. Code annotations accompany
many of the listings, highlighting important concepts.

liveBook discussion forum


Purchase of Spark in Action includes free access to a private web forum run by Manning
Publications where you can make comments about the book, ask technical questions, and
receive help from the author and from other users. To access the forum, go to
https://livebook.manning.com/#!/book/spark-in-action-second-edition/discussion. You can
also learn more about Manning’s forums and the rules of conduct at
https://livebook.manning.com/#!/discussion.
Manning’s commitment to our readers is to provide a venue where a meaningful
dialogue between individual readers and between readers and the author can take
place. It is not a commitment to any specific amount of participation on the part of
the author, whose contribution to the forum remains voluntary (and unpaid). We sug-
gest you try asking the author some challenging questions lest his interest stray! The
forum and the archives of previous discussions will be accessible from the publisher’s
website as long as the book is in print.
about the author
Jean-Georges Perrin is passionate about software engineering and all things data. His
latest projects have driven him toward more distributed data engineering, where he
extensively uses Apache Spark, Java, and other tools in hybrid cloud settings. He is
proud to have been the first in France to be recognized as an IBM Champion, and to
have been awarded the honor for his twelfth consecutive year. As an awarded data and
software engineering expert, he now operates worldwide with a focus in the United
States, where he resides. Jean-Georges shares his more than 25 years of experience in
the IT industry as a presenter and participant at conferences and through publishing
articles in print and online media. You can visit his blog at http://jgp.net.

about the cover illustration
The figure on the cover of Spark in Action is captioned “Homme et Femme de Hous-
berg, près Strasbourg” (Man and Woman from Housberg, near Strasbourg). Housberg
has become Hausbergen, a natural region and historic territory in Alsace now divided
between three villages: Niederhausbergen (lower Hausbergen), Mittelhausbergen
(middle Hausbergen), and Oberhausbergen (upper Hausbergen). The illustration is
from a collection of dress costumes from various countries by Jacques Grasset de Saint-
Sauveur (1757–1810), titled Costumes de Différents Pays, published in France in 1797.
Each illustration is finely drawn and colored by hand.
This particular illustration has special meaning to me. I am really happy it could be
used for this book. I was born in Strasbourg, Alsace, currently in France. I immensely
value my Alsatian heritage. When I decided to immigrate to the United States, I knew
I was leaving behind a bit of this culture and my family, particularly my parents and sis-
ters. My parents live in a small town called Souffelweyersheim, directly neighboring
Niederhausbergen. This illustration reminds me of them every time I see the cover
(although my dad has a lot less hair).
The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how
culturally separate the world’s towns and regions were just 200 years ago. Isolated
from each other, people spoke different dialects (here, Alsatian) and languages. In
the streets or in the countryside, it was easy to identify where someone lived and what
their trade or station in life was just by their dress.
The way we dress has changed since then, and the diversity by region, once so rich,
has faded away. It’s now hard to distinguish the inhabitants of different continents, let
alone different towns, regions, or countries. Perhaps we have traded cultural diversity
for a more varied personal life—certainly for a more varied and fast-paced technolog-
ical life.
At a time when it’s hard to tell one computer book from another, Manning cele-
brates the inventiveness and initiative of the computer business with book covers
based on the rich diversity of regional life of two centuries ago, brought back to life by
Grasset de Saint-Sauveur’s pictures.
Fig. 78.

Polar Bear and Walrus.

Showing how the Bear walks with the heel flat on the ground, and the
Walrus also.

A glance at a bear’s mouth will tell at once that he is partly a
vegetarian, for his hind teeth are smoothed down, and as he eats he
can move his lower jaw slightly from side to side, so as to chew
vegetable food. Even the Polar Bear, which eats little else but fish and
seals, has these same grinding teeth, and he can be fed for a long
time upon bread; while it is found that he keeps in better health when
in zoological gardens if he has some grass occasionally. Still it is only
the Sun Bears and Sloth Bears in India and Malacca which never eat
flesh, for the Bruin of our northern countries often varies his food with
deer or sheep, and grows more ferocious and flesh-feeding as he
grows in years. It would almost seem as if his very laziness and
awkward gait may have led him to take to vegetarianism as a
convenient change, when animal food was not handy. For though a
bear can trot along at a good pace, yet his heavy lumbering body and
long foot with the whole heel touching the ground (see Fig. 78),
make him decidedly not well fitted for a hunting animal.
How different he looks from the slim wolf running on the tips of his
toes, and the graceful tiger bending his long hind legs for a leap! Yet
he is a formidable animal too, for his muscles are tremendously strong,
and his firmly-planted foot enables him to rise upon his hind legs and
give that deadly embrace which drives the breath out of the body of his
victim.
The wolf attacks with his teeth, the lion strikes with his paw, but the
bear hugs his enemy to death; and here his long stiff claws serve him
well, for though he cannot draw them in to keep them sharp, yet they
are rough and jagged, and inflict dreadful wounds. The great Grizzly
Bear of America, which is sometimes nine feet long, and strong
enough to drag along the carcase of a bison, sticks his front claws into
his prey while he tears the flesh with the hind feet; he is the only one,
except the polar bear, which lives principally upon animal food.
In fact, the bears take much the same place in the animal world
that heavy phlegmatic men do among ourselves; easy-going, but
dangerous if roused, they seem to have succeeded in life more by
accommodating themselves to things as they have found them, than
by conquering and taking by force like the wolves and tigers. Thus a
bear roams leisurely through the thick forest, for few animals care to
meddle with him and he feeds wherever food comes easy, especially
in the autumn when fruits abound and he can grow fat; and then he
lies down to sleep in a cave or hollow tree, or in a nest of moss and
leaves, till spring comes round again. Why should he trouble himself to
struggle with difficulties? Unless, indeed, food is scarce, and then he
sometimes has an uneasy winter, or attacks animals he would
otherwise leave alone.
But if once he is roused, or if a she-bear is afraid that her cubs
may be attacked, then you see that under the lazy good-nature there is
plenty of pluck and ferocity. He would rather be let alone, for he looks
upon life as a thing to enjoy and take leisurely, but if you will have a
struggle then he will see who is master. And this kind of philosophy,
somewhat easy for strong powerful creatures, has stood Bruin in good
stead, for he has spread over all countries where there are thick
forests, except Africa and Australia; and with his great strength and
shaggy coat must have been very safe from attack till man came to
annoy and worry him.
Even the polar bear, living amidst perpetual snow and ice on the
shores of Spitzbergen, Nova Zembla, and Greenland, has not, on the
whole, a bad life of it, for he is master of the situation, and conquers
and devours even the tusked walrus. The polar bear is a most
interesting animal, because he shows us the bear tribe becoming
adapted to a watery life. His body is much longer and more flexible
than that of most bears, giving him the power to twist and turn in the
water, as he swims with strong broad feet; and his long neck, narrow
head, and small ears, are all fitted for a watery fishing life, while he
fights entirely with his teeth and does not hug his prey. Again, the soles
of his feet, instead of being bare, are covered with long stout hairs,
giving him foothold upon the slippery ice, over which he travels very
quickly, climbing up from time to time on the icy hummocks to see
where seals are to be found, or to scent a dead whale from afar. He is
an inveterate seal-hunter, chasing them in the water or out of it with
equal ease and great cunning, though they are quick too, and often
escape him just when he thinks he has caught them. It is when they
are asleep with their noses upon the ice or the land, that he has his
best chance, for then he will swim warily behind them, coming up
close, till, even if they wake, they have no choice but to be killed where
they are, or to leap out on the solid ice where he will soon overtake
them.
The polar bear, unlike his brown cousins, fishes and hunts all the
winter through, and it is only the mothers which take refuge in caves
hollowed out of the snow, where their little ones are born in early
spring, and nestle down by her side in their icy home. And when the
cubs can run, both father and mother care for them with true devotion,
defending them against all attacks, and pushing them before them
when pursued, even going so far as to take them in their teeth and
swim away with them when they cannot otherwise save them.
So we see that the polar bear has become more than half a water-
animal, and gives us the first hint that some milk-givers may take to a
thoroughly sea life. Neither among the wolves nor the felines do we
find any animals taking entirely to the water; but in the weasel family,
which comes near to the bears, we have the otters, and among the
bears themselves their polar cousin, which reminds us that there is
another great division of flesh-feeders which we must study in the next
chapter—the walruses, seals, and sea-bears, the porpoises, dolphins,
and whales, which with finned paddles have struck out quite a new line
of life, and imitated the fish so well that they are often wrongly classed
among them.
EUROPE IN THE AGE OF ICE
CHAPTER XI.
HOW THE BACKBONED ANIMALS HAVE RETURNED TO THE WATER,
AND LARGE MILK-GIVERS IMITATE THE FISH.

“On revient toujours à ses premiers amours,” says the French song.
But who would have thought that, after rising step by step above the
fish, and tracing the history of the backboned animals through their
development in the air and over the land, till we brought them to a
stage of intelligence second only to man, we should have to follow
them back again to the water and find the highly gifted milk-givers
taking on the form and appearance of fishes? Nevertheless it is so, for
seals and whales are as truly flesh-eating milk-givers as bears and
wolves; nor are they much behind them in intelligence, for we all know
how teachable and affectionate seals and sea-lions are, while what
little is known of the life of whales shows that they are devoted
mothers, and their well convoluted though small brains are a proof that
they are by no means wanting in intelligence.
Yet the whales and dolphins, at any rate, have not only adopted a
sea life, but have limbs so like a fish’s fins that we can scarcely call
them by any other name, and they are so completely water animals
that they cannot even return to the land.
Now we should be quite puzzled to account for such curious forms
as these warm-blooded animals, half transformed into fish, if it were
not that we know of several land animals belonging to different groups
which have gone part of the way towards a fish life. Thus among the
reptiles we have the oceanic turtles and the sea snakes; among birds
the penguins, whose wings have almost become fins. Then among the
milk-givers we have the web-footed Duck-billed Platypus, the Yapock
or web-footed opossum of South America, the Desman and the
Beaver, the Polar Bear, and last but not least the Otters, web-footed
animals nearly allied to the weasels, which seek their food entirely in
the water.
The common Otter of Europe and America though he moves
quickly and actively on land, has webbed toes with only short claws
standing out beyond the swimming foot, and he spends the greater
part of his life in the river, making his home in a hollow of the bank
beneath the overhanging roots of trees. There he may still be seen in
many of our English rivers, his soft brown fur shining as he swims
along, diving under water for a fish, which he brings out on to the bank
to eat, holding it in his fore paws.
But there is an otter which has deserted the old land life much
more completely than this, for the great Sea-Otters of the North
Pacific, about four or five feet long (see Fig. 79), never care even to
come on shore, but, when they have dived for their prey, turn on their
backs and float while they eat it, holding the sea-urchins, crabs, or fish,
in their front paws. They even nurse their young ones in the same
fashion, dandling them in their arms as they lie face upwards on the
sea; and they rear them entirely on the thick beds of kelp off the coasts
of the North Pacific Ocean, never bringing them on land.
These sea-otters may be seen in hundreds off the coasts of Alaska
and California, basking on the wet rocks, playing, leaping, and
plunging in the water, till some alarm makes each mother seize her
little one in her teeth and dive under in an instant.
They are twice the size of the River Otter, and in many points more
like seals, for though their front paws are short and cat-like, their hind
feet are flat flippers, with a long outer toe; their face too is broad and
short, and their teeth are neither cutting like the weasels nor flattened
like the bears, but covered with rounded knobs, well fitted for crushing
crab-shells and the bones of the fish on which they feed.
Fig. 79.

Sea-Otter. (From Wolf.)

Showing the front paws, and the hind webbed feet.

We see, then, that it is quite possible for land-animals to have near
relations specially adapted for a sea life. But the otter is still distinctly a
four-footed creature, with free arms and legs, and we can trace his
connection with the weasel tribe. It is quite different with the three
groups of real fin-footed animals—the Seals and Walruses, the
Manatees, and the Whales. Though we can trace their likeness bone
by bone to the land animals, yet they have become so different as to
show that they must have branched off long long ago; so long indeed
that we cannot even guess at the relations of the whales, while the
seals have only a distant resemblance to the bear family, and the sea-
cows or manatees to the ancestors of the hoofed animals and
elephants. Nor shall we wonder to find the whales so much the most
fitted for the sea, when we learn that they were already living in the
water when we first meet with the great army of milk-givers (see p.
211) just after the Chalk Period, so that they have probably had a
much longer spell of watery life than the seals and sea-cows, whose
remains we only find later.
Yet even the seals are so much altered from anything we see on
land, that few people would believe at first sight that they have the
same skeleton as a bear. We need not leave the British shores to
study these pretty creatures, for they still come to the coasts of Wales,
Cornwall, and Ireland; while in the Hebrides they may be seen lying
fast asleep on the rocks at low tide out at sea, one, placed higher than
the rest, keeping awake as sentinel to give warning at the least
approach of danger.
But if we begin our study with the common seal we shall be much
puzzled, for he is very unlike a land animal. His round neckless body
tapering away to the tail, where the hind flippers stretch out behind like
fish’s fins, reminds us far more of a tunny fish than of a four-footed
milk-giver; while the front flippers, coming out so finlike from his side,
give us very little idea of legs (see Fig. 81). No! in order to compare
these fin-footed creatures with land animals we shall do far better to
travel up to the Aleutian Islands at the entrance of Behring’s Straits,
and visit the Fur Seals and Sea Lions, from which we get our seal-
skins, and the Walruses which sometimes lie there sleeping on the
rocks, though their real home is farther north within the Arctic Circle,
round the coasts of Nova Zembla, Spitzbergen, and Greenland.
Fig. 80.

Skeleton of a Sea Lion.

Showing how the whole foot rests on the ground, as in the Bear Family.

th, thigh; l, leg; h, heel; f, foot; a, upper-arm; fa, fore arm; ha, hand.

These creatures, although they have “flippers,” and are truly fin-
footed, are much more like land animals than the smaller seals, for
they plant their whole foot on the ground as a bear does, and walk, or,
more properly, “flop along” on all fours. A mere glance at the skeleton
of the sea lion, which is one of these higher kind of seals with a slight
outer ear, shows that it is a four-footed animal, with five toes to
each foot, the great toes and the thumbs being the largest. We can
see distinctly the short thighs and the long shanks, which give the hind
flippers their lanky appearance, and we see, too, the broad stumpy
arms, which give such strength to the front flippers in swimming. For
the eared seals and walruses use their fore flippers very much in the
water, while the true seals swim almost entirely with the hind flippers,
and use the front ones chiefly for guiding themselves.
And now if we turn to the living fur seal we find that the reasons
are twofold which make us forget that his limbs are legs. In the first
place, the skin of his body comes down very low over his arms (see
Fig. 81), while the hand is encased in skin, with only mere traces of
nails upon it. Then as regards his hind legs, not only are the feet made
into flippers, in which the toes are joined by a loose flexible skin, so
that they can move them freely when swimming, but the legs
themselves are strapped back by a skin passing right across his tail,
so that his thighs are kept flat against his side, and only the lower part
of the legs has power to move. We lose sight, then, of the limbs, and
see very little more than the feet, which are disguised by being turned
into flippers.
Now if we once think what is the object of a seal’s life, this curious
change in its body is at once explained. For seals are the hunters of
the sea; fish-food is to them what flesh-food is to lions, wolves, and
bears, only that they have a much wider field to hunt in, for they have
the whole ocean for their feeding ground, and no one to dispute it with
them but the sea-otter in places near the land, and the porpoises and
other fish-feeding whales out at sea. In consequence of this we find
seals of some kind in almost all parts of the world, except the Indian
Ocean, though they evidently prefer the cooler regions. Even the large
sea lions live in the North Pacific, as far up as the Aleutian Isles, and in
the South Pacific down to the Falkland Islands and Kerguelen’s Land,
and play about the shores of the Cape, New Zealand, and Australia.
Fig. 81.

A Fur Seal, one of the Sea Lions; and a common Seal.

Showing how the Sea Lion walks on the flat hind feet,
while the seal’s flippers lie back in a line with the body; note
also the absence of an external ear in the seal.

They have evidently been very successful in exchanging flesh-feeding
for fish-feeding, and if we consider for a moment what changes
a four-footed land animal would wish to make in its body in order to
swim and dive in the water, we shall see that these changes have
taken place in the seals.
First, a flexible body is required to wind and twist rapidly in the
water, and this the seal arrives at by having the cushions of gristle
between its joints very large and thick, while even its ribs are joined to
its back by gristly rods, making its whole body very lissom. Next, a
small head, offering little resistance to the water is an advantage, and
this we find in all seals, while the short neck and extremely sloping
narrow shoulders well encased in fat, make the body slope away
gently with no jutting angles, but a round smooth surface from head to
tail where it narrows like the tail of a fish. The next step is to do away
with long angular arms and legs, which would impede it in diving and
swimming, and here the seal meets the difficulty, not by losing its leg
and arm bones, but by having them so shortened and encased in the
skin that only the useful broad flippers are free, while the hind legs are
set upon a very narrow hip joint (see Fig. 80), so that they bend
backwards and work close to the body. Lastly, such a warm-blooded
animal would want clothing to prevent it from being chilled in icy cold
water, and here we find two protections. First, under the skin is a layer
of oily fat, which, while it reminds us of the fat accumulated by bears
before they settle down to their winter’s sleep, has become in the seals
a dense oily mass, acting like a thick blanket in keeping up the warmth
of the body; and secondly, the seal, like its distant relations the bears,
has a dense furry covering, and over this a number of coarse long
hairs, which give it that shining oily look we notice in all seals. No
doubt every one has wondered, when watching seals in zoological
gardens, where the fur can be which makes our sealskin muffs and
jackets. The fact is, that this under fur is quite out of sight in the living
seal, being covered by the coarse hairs; but if we could turn these
aside, even in common seals, we should see the soft undergrowth
beneath, and in the fur seals it is much thicker. Now the roots of these
coarse hairs are deeper in the flesh than the roots of the soft
undergrowth, and when the uppermost layer of the skin on which the
fur grows is sliced off, the coarse hairs are cut away from their deep
roots below, and can then be pulled out, leaving only the fur behind.
The seals then, while they are in all main points constructed like
land animals, have gained many advantages, not by having new parts,
but by the old ones becoming so modified as to make them admirably
fitted for a watery life; and when we add that they have large eyes well
adapted for seeing under water, keen ears with little or no outer ear,
which would be useless, but a very acute hearing apparatus within,
and nostrils which will close firmly and keep the air in and the water out
when they dive, we must acknowledge that they make good use of all
parts of their body. Indeed, their breathing apparatus is the most
curious of all, for they can remain under water sometimes for twenty
minutes, and meanwhile the circulation of their blood is probably
controlled by large reservoirs in the veins, which prevent it going back
to the heart and lungs till it can be purified by fresh breath.
Now, if all these changes from a land to a water-frequenting animal
have been made gradually, we shall expect to find some forms less
altered than others, and so it is. The Walrus, which is not a seal, but a
creature with a thick hide having no fur and only a few scattered hairs
upon it, and long tusks in his mouth, is much more of a land-animal
than the seals. He passes a great part of his life sauntering along on
the low shores of the Arctic seas, digging up mussels, cockles, and
clams with his long canine teeth or tusks; and in accordance with this
we find that his hind legs are much freer than even those of the sea-
lions, for the skin binding them to his body is broader and his hips are
stronger, so that, as he throws his front flippers forwards, he can also
throw out his feet and walk on all fours in a strange straddling manner.
He is remarkably fierce and strong, and Captain Scoresby caught one
once in the act of killing and eating a large narwhal, so that they are
evidently not afraid of attacking even large animals. The walrus is even
said to stand at bay on shore and fight his great destroyer the polar
bear, throwing up his head so as to strike forcibly with his sharp tusks,
but in this battle he is generally defeated. His tusks alone would
suggest that he lives a good deal on land exposed to dangers, for his
more aquatic relations the seals are without tusks, and though their
teeth are sharp enough, and they fight among themselves, yet their
way of escaping the great tyrant of the ice-fields is to slip into the
water.
Beyond his tusks, and the fact that by sleeping many weeks on the
ice in autumn he reminds us of the bears, the walrus’s life is not very
interesting. They live in large shoals in the Arctic sea, climbing the
rocks and ice with the help of their tusks, which they drive into the
crevices and so haul themselves up. During the colder times just
before our own, they came down into much lower latitudes than now,
and we find their bones as far south as England in Europe and Virginia
in America, and even in our day one has been seen off the west coast
of Skye; but we know very little of their daily life or how they bring up
their young ones.
Of Fur Seals and Sea Lions, however, we know a good deal, and a
singular history it is. They spend the greater part of the year in huge
shoals in the sea, rising and falling, gambolling and diving in the water,
feeding on the fish, and probably migrating from colder to warmer seas
in the winter from either pole. But the interesting time of their life is in
the spring, when the northern eared seals have often been watched as
they come to the shores of the Aleutian Isles to bring up their families.
For then begins the fight over which seals shall get the most wives.
Early in May the fathers begin to arrive—strong old seals, which have
gone through the battle many years before and know the rules. They
are huge fellows six or seven feet long, with enormous eye-teeth and
cutting teeth next to them, which together grip like a vice. They come
up at first singly and then in greater numbers, swimming powerfully
and laying hold of the rocks with their flippers so as to haul themselves
up on land, taking the best positions they can find on the edge of the
water to watch for the arrival of the mothers. Yet still more and more
fathers arrive as time goes on, and these are obliged to go farther
inland, for all the shore stations are soon occupied, and each sea lion
defends his own plot of ground with tooth and flipper.

Fig. 82.

Sea Lions gathered on one of the Pribylov Islands, watching for wives.

Thus, in about a month’s time, from the shore right inland, the
whole island is covered with male seals. And now the mothers arrive,
coming to the islands that their little ones may be born. They are very
much smaller, not much more than four feet long, lighter in colour than
the fathers, gentle and inoffensive; and as they swim up to the island
each father seal tries, by coaxing, pulling, and tugging, to persuade a
mate to come on to his rock. If he succeeds he has then to keep her,
for the sea lions behind, which cannot reach the sea, are on the watch
to steal her.
Now he might make quite sure of his prize if he would be content
with one, but he wants several; and the next young mother swimming
up calls off his attention, and while he is courting her his neighbour
behind tries to carry his first wife away, lifting her by the back of the
neck as a cat does a kitten. Then often a terrible battle begins, and the
poor mother is pulled hither and thither till one male seal secures
her, and then the whole thing begins again. This constant fighting and
lovemaking go on for several days till all the sea lions have wives—
those on the shore many, those behind perhaps very few. Then all
settle down quietly, the little sea lions are born, bleating like young
lambs, and family life begins. But the peace does not last long, for no
sooner are mothers able to leave their little ones than the old contest
begins again, and happy the father who can keep his wives together
through a whole season!
And now comes the most remarkable point. As a rule, seals are
immense eaters, and they become very fat. But from the time that the
fathers land upon the rocks till they go back to the water after about
two months, they have never been known to leave their position to
take food, so busy are they defending their wives. And when the two
months are over, during which the little ones have been trying their
strength in the waves and learning to swim, the fathers, which have
grown thin and meagre, having used up all their fat, swim away and do
not come back. The mothers, however, with the children, and those
young bachelors, which have not yet taken wives, remain on the
islands sporting and enjoying themselves till autumn, when they, too,
start off for the open sea till spring comes round again.
Such is the history of the eared seals. And now that we have
studied their form, and seen that their skeleton is like that of other
animals, though their arms and legs are disguised as flippers, we shall
understand our own home seals better; for the chief difference
between them and the higher seals is merely that their front legs are
much shorter, and that their hind legs are turned back so as to lie in a
line with the body (see Fig. 81), while they are closely bound to the tail
right down as far as the heel, so that they cannot throw their hind
flippers forward nor use them in walking. Thus they have become still
more completely aquatic animals, using their hind legs entirely in
swimming, when they serve as great oars, working something like the
screw of a steamer. The consequence is that they are terribly awkward
on land, though they get along very fast by jerking their body forward,
or sometimes by dragging themselves by their front flippers.
This, however, matters very little to them, for their home is the sea.
True, they may often be seen lying asleep on sandbanks or on rocks
jutting out of the water, but they rarely venture far up the land, always
remaining where they can slip back into their true home at the least
alarm. So they live in the seas almost all over the world. They may be
known from the higher seals chiefly by their want of outer ears, their
backward-turned legs, and their feet with both the great and little toes
larger than the inner ones; but their life is much the same. Some live
near our own shores, especially in Scotland; some are peculiar to
Australia and New Zealand; others crowd the icy seas of Greenland,
sleeping in large herds on the ice-fields, where the polar bear makes
them his prey; while others again live on the pack ice round the South
Pole, the huge Elephant seal, with its long tapir-like nose, basking on
the shores of Kerguelen’s Land and the islands of the southern seas—
a monster twelve feet or more long, with his smaller wives beside him.

* * * * *
Thus the seals are bold ocean lovers, feeding entirely on animal
food, and finding plenty of it in the wide sea as they roam. But there is
another family of warm-blooded animals, pure vegetable-feeders,
which also must have found their way in distant ages into the water; for
they too are milk-givers, and though they have lost their hind legs,
have still the front legs with all their proper bones, with the hands
turned into flippers.
These animals are the curious sea-cows or Manatees, which
wander under water along the west coast of Africa and east coast of
South America, feeding in the bays and often up the rivers, on the
seaweeds and water-plants of all kinds; while another kind with tusks,
called the Dugong, feeds all along the shores of the Indian Ocean and
Australia.
It is strange that while every child knows something about seals,
very few people have heard of these gentle grazing manatees and
dugongs, the only large vegetable-feeders of the sea. Yet they are
curious, interesting animals, and seem to be the forms which have
given rise to the popular stories of mermaids, for they suckle their
young ones at the breast, clasping them with their flippers, and when
they raise their heads in the water have something the appearance of
an uncouth mother nursing her child.

Fig. 83.

The Manatee or Sea Cow grazing.

But very uncouth indeed! for they are long barrel-shaped
creatures, with a thick skin like the elephant’s, with short stiff hairs
upon it. Their head is small, with no outer ears, and very insignificant
eyes surrounded with wrinkles; their lips are thick, heavy, and covered
with short bristles, and above them two narrow nostrils open and close
according as they are above or under water. Their front flippers, which
are all they have, are long and broad, with faintly-marked flat nails
upon them, and behind these their body tapers away gradually into a
thin, wide, shovel-shaped tail, not set edgewise as in a fish, but across
the body, so as to lie like a broad leaf in the water.
Who would think that a creature like this had anything in common
with land animals? Yet so it is, for not only do we know that his
ancestors had traces of hind legs, but his front limbs are quite as true
arms and hands as those of any of the seals. Moreover, he has large
broad grinding back teeth like the elephant, and in front he has small
cutting teeth as a baby, though these are covered up by the gum as he
grows older. In the Australian dugong, however, these teeth continue
to grow and form good-sized tusks in the fathers.
What, then, is this curious animal? Simply a vegetable-feeder
which has become fitted for a watery life—a gentle, peaceable animal,
which keeps near the shore and grasps the seaweed with the sides of
its upper lip, and then nips it off by a set of horny plates, which grow
down from the roof of its mouth, and answer to the rough wrinkles on a
cow’s palate. They may often be seen together, father, mother, and
child, wandering up the river Congo in Africa, or the Amazons in South
America, feeding entirely under water, and only raising their heads
from time to time with a snort to take in fresh air. In olden times they
probably thronged all the coasts on the sea-margin, for a hundred and
fifty years ago there was another group of them, the Rhytinas, right up
in the cold seas of Behring’s Straits, where the vast submarine forests
of seaweed afforded them plenty of food. But the sailors found them
such good eating, and the fatty blubber on their bodies was so
valuable, that they were all killed twenty-five years after Behring first
discovered them, and unless some care is taken, the more southern
sea-cows may some day be exterminated in the same way.

* * * * *
And now that we have firmly grasped the fact that the seals and
manatees, however altered in shape, belong to the four-footed and
milk-giving group, perhaps we shall be prepared to understand how it
is that the whales are not fish, though this popular delusion is one
of the most difficult to overcome. “Do you really mean then,” exclaim
nearly all people who are not naturalists, “that a whale is not a huge
fish?” Certainly I do! A whale is no more a fish than crocodiles,
penguins, or seals are fish, although they too live chiefly in the water.
A whale is a warm-blooded, air-breathing, milk-giving animal. Its
fins are hands with finger-bones, having a large number of joints (see
Fig. 84); its tail is a piece of cartilage or gristle, and not a fish’s fin with
bones and rays; it has teeth in its gums even if it never cuts them; and
it gives suck to its little one just as much as a cow does to her calf (see
Fig. 85). Nay! the whalebone whales have even the traces of hind legs
entirely buried under the skin (see Fig. 84), and in the Greenland
whale the hip-joint and knee-joint can be distinguished with some of
their muscles, though the bones are quite hidden and useless.

Fig. 84.

Skeleton of a Whalebone Whale (Mysticete), and Section of the Mouth with Whalebone.

b, blowhole; a, upper arm; fa, fore arm; h, hand; p, th, l, small remains of
pelvis or hip-bone, thigh, and leg; r, roof of the palate; w, w, plates of
whalebone; f, whalebone fringe.

We see then that the whale undoubtedly belongs to the same type
as the four-footed land animals, although it branched off into the water
so long ago that it may have come from some very early milk-giver. But