Fast Data Architectures for Streaming Applications
Getting Answers Now from Data Sets That Never End
2nd Edition
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Fast Data Archi‐
tectures for Streaming Applications, the cover image, and related trade dress are
trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author, and do not represent the
publisher’s views. While the publisher and the author have used good faith efforts to
ensure that the information and instructions contained in this work are accurate, the
publisher and the author disclaim all responsibility for errors or omissions, includ‐
ing without limitation responsibility for damages resulting from the use of or reli‐
ance on this work. Use of the information and instructions contained in this work is
at your own risk. If any code samples or other technology this work contains or
describes is subject to open source licenses or the intellectual property rights of oth‐
ers, it is your responsibility to ensure that your use thereof complies with such licen‐
ses and/or rights.
This work is part of a collaboration between O’Reilly and Lightbend. See our state‐
ment of editorial independence.
978-1-492-04679-0
[LSI]
Table of Contents
1. Introduction
     A Brief History of Big Data
     Batch-Mode Architecture
5. Real-World Systems
     Some Specific Recommendations
6. Example Application
     Other Machine Learning Considerations
7. Recap and Where to Go from Here
     Additional References
CHAPTER 1
Introduction
Until recently, big data systems have been batch oriented, where data
is captured in distributed filesystems or databases and then pro‐
cessed in batches or studied interactively, as in data warehousing
scenarios. Now, it is a competitive disadvantage to rely exclusively
on batch-mode processing, where data arrives without immediate
extraction of valuable information.
Hence, big data systems are evolving to be more stream oriented,
where data is processed as it arrives, leading to so-called fast data
systems that ingest and process continuous, potentially infinite data
streams.
Ideally, such systems still support batch-mode and interactive pro‐
cessing, because traditional uses, such as data warehousing, haven’t
gone away. In many cases, we can rework batch-mode analytics to
use the same streaming infrastructure, where we treat our batch data
sets as finite streams.
This is an example of another general trend, the desire to reduce
operational overhead and maximize resource utilization across the
organization by replacing lots of small, special-purpose clusters with
a few large, general-purpose clusters, managed using systems like
Kubernetes or Mesos. While isolation of some systems and work‐
loads is still desirable for performance or security reasons, most
applications and development teams benefit from the ecosystems
around larger clusters, such as centralized logging and monitoring,
universal CI/CD (continuous integration/continuous delivery)
pipelines, and the option to scale the applications up and down on
demand.
In this report, I’ll make the following core points:
I’ll finish this chapter with a review of the history of big data and
batch processing, especially the classic Hadoop architecture for big
data. In subsequent chapters, I’ll discuss how the changing land‐
scape has fueled the emergence of stream-oriented, fast data archi‐
tectures and explore a representative example architecture. I’ll
describe the requirements these architectures must support and the
characteristics of specific tools available today. I’ll finish the report
with a look at an example IoT (Internet of Things) application that
leverages machine learning.
At its core, a big data architecture requires three components:
Storage
A scalable and available storage mechanism, such as a dis‐
tributed filesystem or database
Compute
A distributed compute engine for processing and querying the
data at scale
Control plane
Tools for managing system resources and services
Other components layer on top of this core. Big data systems come
in two general forms: databases, especially the NoSQL variety, that
integrate and encapsulate these components into a database system,
and more general environments like Hadoop, where these compo‐
nents are more exposed, providing greater flexibility, with the trade-
off of requiring more effort to use and administer.
In 2007, the now-famous Dynamo paper accelerated interest in
NoSQL databases, leading to a “Cambrian explosion” of databases
that offered a wide variety of persistence models, such as document
storage (XML or JSON), key/value storage, and others. The CAP
theorem emerged as a way of understanding the trade-offs between
data consistency and availability guarantees in distributed systems
when a network partition occurs. For the always-on internet, it often
made sense to accept eventual consistency in exchange for greater
availability. As in the original Cambrian explosion of life, many of
these NoSQL databases have fallen by the wayside, leaving behind a
small number of databases now in widespread use.
In recent years, SQL as a query language has made a comeback as
people have reacquainted themselves with its benefits, including
conciseness, widespread familiarity, and the performance of mature
query optimization techniques.
But SQL can’t do everything. For many tasks, such as data cleansing
during ETL (extract, transform, and load) processes and complex
event processing, a more flexible model was needed. Also, not all
data fits a well-defined schema. Hadoop emerged as the most popu‐
lar open-source suite of tools for general-purpose data processing at
scale.
Batch-Mode Architecture
Figure 1-1 illustrates the “classic” Hadoop architecture for batch-
mode analytics and data warehousing, focusing on the aspects that
are important for our discussion.
and Spark SQL, the same job submission process is used when the
actual queries are executed as jobs.
Table 1-1 gives an idea of the capabilities of such batch-mode
systems.
So, the newly arrived data waits in the persistence tier until the next
batch job starts to process it.
In a way, Hadoop is a database deconstructed, where we have explicit
separation between storage, compute, and management of resources
and compute processes. In a regular database, these subsystems are
hidden inside the “black box.” The separation gives us more flexibil‐
ity and reduces cost, but requires us to do more work for adminis‐
tration.
CHAPTER 2
The Emergence of Streaming
Streaming also introduces new semantics for analytics. A big sur‐
prise for me is how SQL, the quintessential tool for batch-mode
analysis and interactive exploration, has emerged as a popular language
for streaming applications, too, because it is concise and easier to
use for nonprogrammers. Streaming SQL systems rely on window‐
ing, usually over ranges of time, to enable operations like JOIN and
GROUP BY to be usable when the data set is never-ending.
For example, suppose I’m analyzing customer activity as a function
of location, using zip codes. I might write a classic GROUP BY query
to count the number of purchases, like the following:
SELECT zip_code, COUNT(*) FROM purchases GROUP BY zip_code;
This query assumes I have all the data, but in an infinite stream, I
never will, so I can never stop waiting for all the records to arrive.
Of course, I could always add a WHERE clause that looks at yesterday’s
data, for example, but when can I be sure that I’ve received all of the
data for yesterday, or for any time window I care about? What about
a network outage that delays reception of data for hours?
Hence, one of the challenges of streaming is knowing when we can
reasonably assume we have all the data for a given context. We have
to balance this desire for correctness against the need to extract
insights as quickly as possible. One possibility is to do the calcula‐
tion when I need it, but have a policy for handling late arrival of
data. For some applications, I might be able to ignore the late arriv‐
als, while for other applications, I’ll need a way to update previously
computed results.
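To make this concrete, here is a minimal sketch of a streaming version
of that query, written with Spark Structured Streaming in Scala. The
topic name, record schema, and one-hour watermark are assumptions for
illustration only; the watermark is one possible late-data policy,
where stragglers up to an hour late still update their window and
anything later is dropped.

// Sketch only: a streaming version of the zip-code query, assuming a
// "purchases" Kafka topic with JSON records. The names, schema, and
// one-hour lateness threshold are illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object StreamingPurchaseCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PurchaseCounts").getOrCreate()
    import spark.implicits._

    val schema = new StructType()
      .add("zip_code", StringType)
      .add("amount", DoubleType)
      .add("timestamp", TimestampType)   // event time, assigned at the source

    val purchases = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "purchases")
      .load()
      .select(from_json($"value".cast("string"), schema).as("p"))
      .select("p.*")

    // Count purchases per zip code in one-hour event-time windows. The
    // watermark is the late-data policy: records up to one hour late still
    // update their window; anything later is dropped.
    val counts = purchases
      .withWatermark("timestamp", "1 hour")
      .groupBy(window($"timestamp", "1 hour"), $"zip_code")
      .count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}

Streaming SQL dialects express the same idea with windowing clauses in
the query itself; either way, the window plus an explicit lateness
policy is what makes a GROUP BY meaningful over a never-ending stream.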
Streaming Architecture
Because there are so many streaming systems and ways of doing
streaming, and because everything is evolving quickly, we have to
narrow our focus to a representative sample of current systems and
a reference architecture that covers the essential features.
Figure 2-1 shows this fast data architecture.
There are more parts in Figure 2-1 than in Figure 1-1, so I’ve num‐
bered elements of the figure to aid in the discussion that follows.
Mini-clusters for Kafka, ZooKeeper, and HDFS are indicated by
dashed rectangles. General functional areas, such as persistence and
low-latency streaming engines, are indicated by the dotted, rounded
rectangles.
Let’s walk through the architecture. Subsequent sections will explore
the details:
On the other hand, strategic colocation of some other services
can eliminate network overhead. In fact, this is how Kafka
Streams works, as a library on top of Kafka (see also number 6).
2. REST (Representational State Transfer) requests are often syn‐
chronous, meaning a completed response is expected “now,” but
they can also be asynchronous, where a minimal acknowledg‐
ment is returned now and the completed response is returned
later, using WebSockets or another mechanism. Normally REST
is used for sending events to trigger state changes during ses‐
sions between clients and servers, in contrast to records of data.
The overhead of REST means it is less ideal as a data ingestion
channel for high-bandwidth data flows. Still, REST for data
ingestion into Kafka is possible using custom microservices or
through Kafka Connect's REST interface.
3. A real environment will need a family of microservices for man‐
agement and monitoring tasks, where REST is often used. They
can be implemented with a wide variety of tools. Shown here
are the Lightbend Reactive Platform (RP), which includes Akka,
Play, Lagom, and other tools, and the Go and Node.js ecosys‐
tems, as examples of popular, modern tools for implementing
custom microservices. They might stream state updates to and
from Kafka, which is also a good way to integrate our time-
sensitive analytics with the rest of our microservices. Hence, our
architecture needs to handle a wide range of application types
and characteristics.
4. Kafka is a distributed system and it uses ZooKeeper (ZK) for
tasks requiring consensus, such as leader election and storage of
some state information. Other components in the environment
might also use ZooKeeper for similar purposes. ZooKeeper is
deployed as a cluster, often with its own dedicated hardware, for
the same reasons that Kafka is often deployed this way.
5. With Kafka Connect, raw data can be persisted from Kafka to
longer-term, persistent storage. The arrow is two-way because
data from long-term storage can also be ingested into Kafka to
provide a uniform way to feed downstream analytics with data.
When choosing between a database and a filesystem, keep in
mind that a database is best when row-level access (e.g., CRUD
operations) is required. NoSQL provides more flexible storage
and query options, consistency versus availability (CAP) trade-
offs, generally better scalability, and often lower operating costs,
1 For a comprehensive list of Apache-based streaming projects, see Ian Hellström’s arti‐
cle, “An Overview of Apache Streaming Technologies”. Since this post and the first edi‐
tion of my report were published, some of these projects have faded away and new ones
have been created!
other sources and these tools, the durability and reliability of
Kafka ingestion and the benefits of having one access method
make it an excellent default choice despite the modest extra
overhead of going through Kafka. For example, if a process fails,
the data can be reread from Kafka by a restarted process. It is
often not an option to requery an incoming data source directly.
7. Stream processing results can also be written to persistent stor‐
age and data can be ingested from storage. This is useful when
O(1) access for particular records is desirable, rather than O(N)
to scan a Kafka topic. It’s also more suitable for longer-term
storage than storing in a Kafka topic. Reading from storage ena‐
bles analytics that combine long-term historical data and
streaming data.
8. The mini-batch model of Spark, called Spark Streaming, is the
original way that Spark supported streaming, where data is cap‐
tured in fixed time intervals, then processed as a “mini batch.”
The drawback is longer latencies are required (100 milliseconds
or longer for the intervals), but when low latency isn’t required,
the extra window of time is valuable for more expensive calcula‐
tions, such as training machine learning models using Spark’s
MLlib or other libraries. As before, data can be moved to and
from Kafka. However, Spark Streaming is becoming obsolete
now that Structured Streaming is mature, so consider using the
latter instead (see the sketch after this list).
9. Since you have Spark and a persistent store, like HDFS or a
database, you can still do batch-mode processing and interactive
analytics. Hence, the architecture is flexible enough to support
traditional analysis scenarios too. Batch jobs are less likely to
use Kafka as a source or sink for data, so this pathway is not
shown.
10. All of the above can be deployed in cloud environments like
AWS, Google Cloud Platform, and Microsoft Azure, as well as
on-premises. Cluster resources and job management can be
managed by Kubernetes, Mesos, and Hadoop/YARN. YARN is
most mature, but Kubernetes and Mesos offer much greater
flexibility for the heterogeneous nature of this architecture.
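As promised in number 8, here is a minimal sketch of moving data to
and from Kafka with Structured Streaming in Scala. The topic names,
broker address, and checkpoint path are placeholders; the checkpoint
is what allows a restarted job to resume from its last committed
offsets, the failure-recovery behavior described in number 6.

// Sketch only: read records from one Kafka topic, transform them, and write
// the results to another. Topic names, the broker address, and the
// checkpoint path are placeholders.
import org.apache.spark.sql.SparkSession

object KafkaToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KafkaToKafka").getOrCreate()
    import spark.implicits._

    val in = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "raw-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // A trivial per-record transformation; real logic goes here.
    val out = in.map(row => row.getString(0).toUpperCase).toDF("value")

    out.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "transformed-events")
      .option("checkpointLocation", "/tmp/checkpoints/kafka-to-kafka")
      .start()
      .awaitTermination()
  }
}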
• Storage: Kafka
• Compute: Spark, Flink, Akka Streams, and Kafka Streams
• Control plane: Kubernetes and Mesos, or YARN with limitations
Let’s see where the sweet spots are for streaming jobs compared to
batch jobs (Table 2-1).
While the fast data architecture can store the same petabyte data
sets, a streaming job will typically operate on megabytes to terabytes
of data at any one time. A terabyte per minute, for example, would be a huge
volume of data! The low-latency engines in Figure 2-1 operate at
subsecond latencies, in some cases down to microseconds.
However, you’ll notice that the essential components of a big data
architecture like Hadoop are also present, such as Spark and HDFS.
In large clusters, you can run your new streaming workloads and
microservices, along with the batch and interactive workloads for
which Hadoop is well suited. They are still supported in the fast data
architecture, although the wealth of third-party add-ons in the
Hadoop ecosystem isn’t yet matched in the newer Kubernetes and
Mesos communities.
cates event sequencing and state transitions. While we often
associate logs with files, this is just one possible storage mechanism.
The metaphor of a log generalizes to a wide class of data streams,
such as these examples:
Service logs
These are the logs that services write to capture implementation
details as processing unfolds, especially when problems arise.
These details may be invisible to users and not directly associ‐
ated with the application’s logical state.
Write-ahead logs for database CRUD transactions
Each insert, update, and delete that changes state is an event.
Many databases use a WAL (write-ahead log) internally to
append such events durably and quickly to a filesystem before
acknowledging the change to clients, after which time in-
memory data structures and other, more permanent files are
updated with the current state of the records. That way, if the
database crashes after the WAL write completes, the WAL can
be used to reconstruct and complete any in-flight transactions,
once the database is running again. (A minimal sketch of this
append-then-apply pattern follows these examples.)
Other state transitions
User web sessions and automated processes, such as manufac‐
turing and chemical processing, are examples of systems that
routinely transition from one state to another. Logs are a popu‐
lar way to capture and propagate these state transitions so that
downstream consumers can process them as they see fit.
Telemetry from IoT devices
Many widely deployed devices, including cars, phones, network
routers, computers, airplane engines, medical devices, home
automation devices, and kitchen appliances, are now capable of
sending telemetry back to the manufacturer for analysis. Some
of these devices also use remote services to implement their
functionality, like location-aware and voice-recognition applica‐
tions. Manufacturers use the telemetry to better understand
how their products are used; to ensure compliance with licen‐
ses, laws, and regulations (e.g., obeying road speed limits); and
for predictive maintenance, where anomalous behavior is mod‐
eled and detected that may indicate pending failures, so that
proactive action can prevent service disruption.
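To illustrate the append-then-apply pattern behind write-ahead logs
mentioned above, here is a deliberately tiny sketch in Scala. It is
not how any real database implements its WAL; the file format, path,
and helper names are invented. The essential ordering is: append the
event durably, then update in-memory state, and replay the log on
restart.

// Toy write-ahead log: append each change durably before applying it in
// memory; on restart, replay the log to rebuild state. Everything here
// (format, path, names) is invented for illustration.
import java.io.{File, FileOutputStream, PrintWriter}
import scala.collection.mutable
import scala.io.Source

object TinyWal extends App {
  val walFile = new File("/tmp/tiny.wal")
  val state = mutable.Map.empty[String, String]

  // Recovery: replay any events already in the log.
  if (walFile.exists)
    Source.fromFile(walFile).getLines().foreach { line =>
      val Array(key, value) = line.split("=", 2)
      state(key) = value
    }

  def put(key: String, value: String): Unit = {
    val out = new FileOutputStream(walFile, true)   // append mode
    val writer = new PrintWriter(out)
    writer.println(s"$key=$value")
    writer.flush()
    out.getFD.sync()        // make the event durable before acknowledging
    writer.close()
    state(key) = value      // only now update the in-memory state
  }

  put("sensor-42", "ok")
  println(state)
}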
1 Jay Kreps doesn’t use the term CQRS, but he discusses the advantages and disadvan‐
tages in practical terms in his Radar post, “Why Local State Is a Fundamental Primitive
in Stream Processing”.
2 See Tyler Treat’s blog post, “You Cannot Have Exactly-Once Delivery”.
3 You can always concoct a failure scenario where some data loss will occur.
4 In 2015, LinkedIn’s Kafka infrastructure surpassed 1.1 trillion messages per day, and it’s
been growing since then.
Alternatives to Kafka
You might have noticed in Chapter 2 that we showed five options for
streaming engines and three for microservice frameworks, but only
one log-oriented data backplane option, Kafka. In 1979, the only
relational database in the world was Oracle, but of course many
alternatives have come and gone since then. Similarly, Kafka is by
far the most widely used system of its kind today, with a vibrant
community and a bright future. Still, there are a few emerging alter‐
natives you might consider, depending on your needs: Apache Pul‐
sar, which originated at Yahoo! and is now developed by Streaml.io,
and Pravega, developed by Dell EMC.
I don’t have the space here to compare these systems in detail, but to
provide motivation for your own investigation, I’ll just mention two
advantages of Pulsar compared to Kafka, at least as they exist today.
First, if you prefer a message queue system, one designed for big
data loads, Pulsar is actually implemented as a queue system that
also supports the log model.
Second, in Kafka, each partition is explicitly tied to files on one
physical disk, which means that the maximum possible partition
size is bounded by the hard drive that stores it. This explicit map‐
ping also complicates scaling a Kafka topic by splitting it into more
partitions, because of the data movement to new files and possibly
new disks that is required. It also makes scaling down, by consolidat‐
supports. There are even semantics defined that Beam and Google
Cloud Dataflow themselves don’t yet support!
Usually, when a runner supports a Beam construct, the runner also
provides access to the feature in its “native” API. So, if you don’t
need runner portability, you might use that API instead.
For space reasons, I can only provide a sketch of these advanced
semantics here, but I believe that every engineer working on stream‐
ing pipelines should take the time to understand them in depth.
Tyler Akidau, the leader of the Beam/Dataflow team, has written
two O’Reilly Radar blog posts and co-wrote an O’Reilly book
explaining these details:
If you follow no other links in this report, at least read those two
blog posts!
Streaming Semantics
Suppose we set out to build our own streaming engine. We might
start by implementing two “modes” of processing, to cover a large
fraction of possible scenarios: single-record processing and what I’ll
call “aggregation processing” over many records, including sum‐
mary statistics and operations like GROUP BY and JOIN queries.
Single-record processing is the simplest case to support, where we
process each record individually. We might trigger an alarm on an
error record, or filter out records that aren’t interesting, or trans‐
form each record into a more useful format. Lots of ETL (extract,
transform, and load) scenarios fall into this category. All of these
actions can be performed one record at a time, although for effi‐
ciency reasons we might do small batches of them at a time. This
simple processing model is supported by all the low-latency tools
introduced in Chapter 2 and discussed in more depth shortly.
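As a sketch of what single-record processing looks like in code, here
is a small Akka Streams (Scala) pipeline. The Record type and the
alarm logic are invented for illustration, and a real pipeline would
read from Kafka rather than a fixed list; the point is that each
element is handled independently, so no windows or shared state are
needed.

// Sketch of single-record processing with Akka Streams. The Record type and
// the alarm logic are hypothetical; each element flows through the stages
// independently.
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

object SingleRecordPipeline extends App {
  // Akka 2.6+: the implicit ActorSystem also provides the stream materializer.
  implicit val system: ActorSystem = ActorSystem("single-record")

  final case class Record(level: String, message: String)

  // Normally the source would be a Kafka topic (for example, via the Alpakka
  // Kafka connector); a fixed list keeps the sketch self-contained.
  val records = Source(List(
    Record("INFO", "service started"),
    Record("ERROR", "disk failure on /dev/sdb"),
    Record("DEBUG", "heartbeat")))

  val raiseAlarms = Flow[Record]
    .filter(_.level == "ERROR")                     // drop uninteresting records
    .map(r => s"ALERT: ${r.message.toUpperCase}")   // transform to a useful format

  records.via(raiseAlarms)
    .runWith(Sink.foreach(println))
    .onComplete(_ => system.terminate())(system.dispatcher)
}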
1 Sometimes the word window is used in a slightly different context in streaming engines
and the word interval is used instead for the idea we’re discussing here. For example, in
Spark Streaming’s mini-batch system, I define a fixed mini-batch interval to capture
data, but I process a window of one or more intervals together at a time.
Another complication is the fact that times across servers will never
be completely synchronized. The widely used Network Time Proto‐
col for clock synchronization is accurate to a few milliseconds, at
best. Precision Time Protocol is submicrosecond, but not as widely
deployed. If I’m trying to reconstruct a sequence of events based on
event times for activity that spans servers, I have to account for the
inevitable clock skew between the servers. Still, it may be sufficient if
we accept the timestamp assigned to an event by the server where it
was recorded.
Another implication of the difference between event and processing
time is the fact that I can’t really know that my processing system
has received all of the records for a particular window of event time.
Arrival delays could be significant, due to network delays or parti‐
tions, servers that crash and then reboot, mobile devices that leave
and then rejoin the network, and so on. Even under normal
operations, data does not travel instantaneously between systems.
Because of inevitable latencies, however small, the actual records in
the mini-batch will include stragglers from the previous window of
event time. After a network partition is resolved, events from the
other side of the partition could be significantly late, perhaps many
windows of time late.
Figure 4-1 illustrates event versus processing times.
accepts that an improved result will be forthcoming, if and when
additional late records arrive. A dashboard might show a result that
is approximate at first and grows increasingly correct over time. It’s
probably better to show something early on, as long as it’s clear that
the results aren’t final.
However, suppose instead that these point-in-time analytics are
written to yet another stream of data that feeds a billing system.
Simply overwriting a previous value isn’t an option. Accountants
never modify an entry in bookkeeping. Instead, they add a second
entry that corrects the earlier mistake. Similarly, our streaming tools
need to support a way of correcting results, such as issuing retrac‐
tions, followed by new values.
Recall that we had a second task to do: segregate transactions by
geographic region. As long as our persistent storage supports incre‐
mental updates, this second task is less problematic; we just add late-
arriving records as they arrive. Hence, scenarios that don’t have an
explicit time component are easier to handle.
We’ve just scratched the surface of the important considerations
required for processing infinite streams in a timely fashion, while
preserving correctness, robustness, and durability.
Now that you understand some of the important concepts required
for effective stream processing, let’s examine several of the currently
available tools and how they support these concepts, as well as other
potential requirements.
If you run separate data analytics (for example, Spark jobs), and sep‐
arate microservices that use the results, you’ll need to exchange the
results between the services (for example, via a Kafka topic). This is
a great pattern, of course, as I argued in Chapter 3, but also I pointed
out that sometimes you don’t want the extra overhead of going
through Kafka. Nothing is faster with lower overhead than making a
function call in the same process, which you can do if your data
transformation is done by a library rather than a separate service!
Also, a drawback of microservices is the increase in the number of
different services you have to manage. You may want to minimize
that number.
Because Akka Streams and Kafka Streams are libraries, the draw‐
back of using them is that you have to handle some details yourself
that Spark or Flink would do for you, like automatic task manage‐
ment and data partitioning. Now you really need to write all the
logic for creating a runnable process and actually running it.
Despite these drawbacks, because Kafka Streams runs as a library on
top of the Kafka cluster, it requires no additional cluster setup and
modest additional management overhead to use.
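To illustrate the library model, here is a minimal Kafka Streams
topology in Scala, of the kind that could be embedded directly in an
ordinary microservice. The topic names and application ID are
placeholders, and the Serdes import path assumes the Kafka 2.x Scala
DSL; note that there is no cluster to submit the job to, so starting
and supervising the process is your responsibility.

// Sketch of a Kafka Streams topology embedded as a library in a JVM process.
// Topic names and the application ID are placeholders.
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._   // Kafka 3.x: ...scala.serialization.Serdes._

object ErrorFilterService extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "error-filter")    // placeholder
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")  // placeholder

  val builder = new StreamsBuilder()

  // Read events, keep only the ones flagged as errors (the flag format is
  // invented), and write them to another topic.
  builder.stream[String, String]("events")
    .filter((_, value) => value.contains("ERROR"))
    .to("errors")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()                       // no cluster manager: we run the process
  sys.addShutdownHook(streams.close())
}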
Fast data architectures raise the bar for the “ilities” of distributed
data processing. Whereas batch jobs seldom last more than a few
hours, a streaming pipeline is designed to run for weeks, months,
even years. If you wait long enough, even the most obscure problem
is likely to happen.
The umbrella term reactive systems embodies the qualities that real-
world systems must meet. These systems must be:
Responsive
The system can always respond in a timely manner, even when
that response is simply to report that full service isn't available
due to some failure.
Resilient
The system is resilient against failure of any one component,
such as server crashes, hard drive failures, or network parti‐
tions. Replication prevents data loss and enables a service to
keep going using the remaining instances. Isolation prevents
cascading failures.
Elastic
You can expect the load to vary considerably over the lifetime of
a service. Dynamic, automatic scalability, both up and down,
allows you to handle heavy loads while avoiding underutilized
resources in less busy times.
Message driven
While fast data architectures are obviously focused on data, here
we mean that all services respond to directed commands and
queries. Furthermore, they use messages to send commands and
queries to other services as well.
Classic big data systems, focused on batch and offline interactive
workloads, have had less need to meet these qualities. Fast data
architectures are just like other online systems where these qualities
are necessary to avoid costly downtime and data loss. If you come
from a big data engineering background, you are suddenly forced to
learn new skills for distributed systems programming and opera‐
tions.
• Ingest all inbound data into Kafka first, then consume it with
the stream processors and microservices. You get durable, scala‐
ble, resilient storage. You get support for multiple, decoupled
consumers, replay capabilities, and the simplicity and power of
event log semantics and topic organization as the backplane of
your architecture.
• For the same reasons, write data back to Kafka for consumption
by downstream services. Avoid direct connections between
services, which are less resilient, unless latency concerns require
them.
• When using direct connections between microservices, use
libraries that implement the Reactive Streams standard, for the
resiliency provided by back pressure as a flow-control mecha‐
nism.
• Deploy to Kubernetes, Mesos, YARN, or a similar resource
management infrastructure with proven scalability, resiliency,
and flexibility. I don’t recommend Spark’s standalone-mode
deployments, except for relatively simple deployments that
Figure 6-1. IoT anomaly detection example
There are three main segments of this diagram. After the telemetry
is ingested (label 1), the first segment is for model training with
periodic updates (labels 2 and 3), with access to persistent stores for
saving models and reading historical data (labels 4 and 5). The sec‐
ond segment is for model serving—that is, scoring the telemetry
with the latest model to detect potential anomalies (labels 6 and 7)—
and the last segment is for handling detected anomalies (labels 8 and
9).
Let’s look at the details of this figure:
1. Telemetry data from the devices in the field are streamed into
Kafka, typically over asynchronous socket connections. The
telemetry may include low-level machine and operating system
metrics, such as component temperatures, CPU and memory
utilization, and network and disk I/O performance statistics.
Application-specific metrics may also be included, such as met‐
rics for service requests, user interactions, state transitions, and
so on. Various logs may be ingested, too. This data is captured
into one or more Kafka topics.
2 A real system might train and use several models for different purposes, but we’ll just
assume one model here for simplicity.
quent. So, we don’t really need the scalability of Kafka to hold
this output, but we’ll still write these records to a Kafka topic to
gain the benefits of decoupling from downstream consumers
and the uniformity of doing all communications using one tech‐
nique.
8. The IoT systems we're describing may already have general
microservices that manage sessions with the devices, used for
handling feature requests, downloading and installing upgrades,
and so on. We leverage these microservices to handle anomalies,
too: they monitor the Kafka topic with anomaly records (a sketch
of such a monitor follows this list).
9. When a potential anomaly is reported, the microservice sup‐
porting the corresponding device will begin the recovery pro‐
cess. Suppose a hard drive appears to be failing. It can move
data off the hard drive (if it’s not already replicated), turn off the
drive, and notify the customer’s administrator to replace the
hard drive when convenient.
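To make steps 8 and 9 a little more concrete, here is a sketch of how
such a device-management microservice might watch the anomaly topic,
using the plain Kafka consumer API from Scala. The topic name, group
ID, and the beginRecovery handler are placeholders for whatever
session-management logic already exists.

// Sketch: poll the anomaly topic and kick off recovery for each record.
// Topic, group ID, and beginRecovery are placeholders.
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object AnomalyMonitor extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker:9092")
  props.put("group.id", "device-management")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("anomalies"))

  // Stand-in for the real device-management logic.
  def beginRecovery(deviceId: String, anomaly: String): Unit =
    println(s"Starting recovery for $deviceId: $anomaly")

  while (true) {
    val records = consumer.poll(Duration.ofSeconds(1))
    records.forEach { record =>
      beginRecovery(record.key, record.value)   // key assumed to be the device ID
    }
  }
}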
Additional References
The following references, some of which were mentioned already in
the report, are very good for further exploration:
About the Author
Dean Wampler, PhD, is Vice President, Fast Data Engineering, at
Lightbend. With over 25 years of experience, Dean has worked
across the industry, most recently focused on the exciting big data/
fast data ecosystem. Dean is the author of Programming Scala, Sec‐
ond Edition, and Functional Programming for Java Developers, and
the coauthor of Programming Hive, all from O’Reilly. Dean is a con‐
tributor to several open source projects and a frequent speaker at
several industry conferences, some of which he co-organizes, along
with several Chicago-based user groups. For more about Dean, visit
deanwampler.com or find him on Twitter @deanwampler.
Dean would like to thank Stavros Kontopoulos, Luc Bourlier, Debas‐
ish Ghosh, Viktor Klang, Jonas Bonér, Markus Eisele, and Marie
Beaugureau for helpful feedback on drafts of the two editions of this
report.