Barry Walsh
Productionizing AI: How to Deliver AI B2B Solutions with Cloud and Python
Barry Walsh
Ely, UK
Prologue
Agile
    Agile Teams and Collaboration
    Development/Product Sprints
    Benefits of Agile
    Adaptability
    react.js: Hands-on Practice
    VueJS: Hands-on Practice
Code Repositories
    Git and GitHub
    Version Control
    Branching and Merging
    Git Workflows
    GitHub and Git: Hands-on Practice
    Deploying an App to GitHub Pages: Hands-on Practice
Continuous Integration and Continuous Delivery (CI/CD)
    CI/CD in DataOps
    Introduction to Jenkins
    Maven
    Containerization
    Play With Docker: Hands-on Practice
Testing, Performance Evaluation, and Monitoring
    Selenium
    TestNG
    Issue Management
    Monitoring and Alerts
    Jenkins CI/CD and Selenium Test Scripts: Hands-on Practice
Wrap-up
Postscript
Index
About the Author
Barry Walsh is a software delivery consultant and AI trainer at Pairview with a background in exploiting complex business data to optimize and de-risk energy assets at ABB/Ventyx, Infosys, E.ON, Centrica, and his own start-up ce.tech. Certified as an Azure AI Engineer and Data Scientist as well as an AWS Cloud Practitioner, he has a proven track record of providing consultancy services in Data Science, BI, and Business Analysis to businesses in Energy, IT, FinTech, Telco, Retail, and Healthcare. Barry has been at the apex of analytics and AI solutions delivery for over 20 years. Besides being passionate about Enterprise AI, Barry spends his spare time with his wife and 8-year-old son, playing the piano, going on long bike rides (and running a marathon on a broken toe this year), eating out whenever possible, or getting his daily coffee fix.
About the Technical Reviewer
Pramod Singh works at Bain & Company as a Senior Manager, Data Science, in their
Advanced Analytics group. He has over 13 years of industry experience in Machine
Learning (ML) and AI at scale, data analytics, and application development. He has
authored four other books including a book on Machine Learning operations. He is also
a regular speaker at major international conferences such as Databricks AI, O’Reilly’s
Strata, and other similar conferences. He holds an MBA from Symbiosis International University and is a certified data analytics professional from IIM Calcutta. He lives in
Gurgaon with his wife and five-year-old son. In his spare time, he enjoys playing the
guitar, coding, reading, and watching football.
Preface
When Barry approached me to write a preface for this book, it filled me with delight,
not just because I was writing a preface for his new book, although it is always nice to be
asked, but primarily because I was being asked by someone who I have watched flourish
under my leadership at Pairview for the last few years.
The work we do at Pairview has helped shape thousands of fledgling careers across the UK and around the world for more than thirteen years at the time of writing. Fresh graduates struggling to find high-value jobs, returning-to-work mums who want flexible work with high pay, and those who have reached the end of a career cycle and have no real prospect for further growth have all turned to Pairview to kickstart their careers.
The company was created to help close the talent gap that was growing in the data
space. This gap has since spread to most other aspects of technology as new technology
innovation comes around and companies continue to grapple with the ultimatum to
adopt technology-induced change or sink.
Barry’s passion for Artificial Intelligence combined with his vast knowledge of
the subject matter and the work he has done for our numerous clients over the years
makes him an authority on the subject. Barry has acquired first-hand exposure to the
opportunities, challenges and the risks associated with AI ecosystems and development
processes. He knows what it takes to get buy-ins and drive organisation wide-adoption of
AI capabilities and the business impact. Barry has over the years become an evangelist of
the value of AI to business while being fully aware of the enormous investment of time,
resources and scrutiny required to get AI right first time and drive it through the digital
transformation value chain.
Although Artificial Intelligence has been with us for many decades, in recent times new capabilities have emerged, enabling more complexity in sensing, comprehending, acting, and learning with human-like levels of intelligence. With mathematical technologies like machine learning and natural language processing, the landscape of AI continues to expand and increasingly co-exist with humans, enabling businesses to dare to digitalise with never-before-seen levels of accuracy,
consistency, and availability. Combining data analytics, machine learning, and deep learning with ever-more powerful computing such as quantum and edge, AI is being used to deliver next-generation capabilities across all aspects of human reach.
While we have seen many organisations invest quite heavily in machine learning
and analytics capabilities, bringing the insight and models from machine learning
developments into production continues to pose a significant challenge for many
business leaders, particularly CDOs, CEOs, and Data and AI leads responsible for the embedding and delivery of AI-enabled products to markets at the speed and scale required for success. This is why this book is a must-read for leaders of these organisations, programmes, or products. It provides a strong framework for planning,
developing and deploying enterprise AI to production at scale.
Frank Abu
Director
Pairview Limited
Prologue
Out of a wasteland of failed Data Science projects and mounting technical debt,1
organizations today are attempting to redress the AI landscape and enforce a broader,
more considered Design/System Thinking approach which cuts through the hype. The
imperative is to ensure that Data/AI solutions are built with multiuser engagement at the
outset (both technical and nontechnical) and have system-wide ecosystem, enterprise
data center, infrastructure/integration, and end-to-end process in mind.
Applications of Artificial Intelligence leveraging the latest technological advances are
firmly at the top of this hype cycle.2 With the productivity benefits of AI difficult to ignore,
Covid and galloping digitalization have given rise to a vicious culture of disruption, the
accelerated uptake forcing fragile companies to build or buy solutions cheaper, smaller,
and more open source. This convergence of forces means that demand for rapid
prototyping and accelerated AI solution delivery is high.
At the same time, not every company is cognizant of what AI can do, or what it
means. Often these are corporates burdened by legacy tools and poor innovation
practices. Some are fearful of the impact on jobs; others have ethical concerns. But what
has become clear for most C-Levels is that AI implementation must fit an “Enterprise AI”
vision – a $341b market3 – and the prevalent trend is toward a platform of disparate but
highly integrated “best-of-breed” AI solutions.
For many of us where data and digital are inextricably tied up in our professional
life and associated opportunities in the job market, these meaningful, high-value AI
solutions are what employers are targeting. And the constant focus on delivering a Return on Investment (ROI) – understanding and, more importantly, delivering tangible
outcomes using AI, typically machine and deep learning – is what keeps us employed.
1. The rush to digital means 60% of businesses (in Europe and globally) will find themselves with more technical debt post-pandemic, not less (Source: Forrester).
2. https://www.gartner.com/en/articles/what-s-new-in-artificial-intelligence-from-the-2022-gartner-hype-cycle. See also https://tinyurl.com/3c7pcpfm
3. www.idc.com/getdoc.jsp?containerId=prUS48127321
This can be exhausting, and part of the challenge for business leaders is that highly technical skillsets have never led to particularly visual, communicable (and understandable) results for the rest of the organization – most Data Scientists are not good at BI, nor are they recruited for their soft skills. Employers in the job market today are
increasingly looking for more rounded/broader “end-to-end” skills which translate
to better visualization, front-end features, and integration. The opportunity for
Data Professionals is to give themselves an “edge” by addressing technical debt and
possessing the ability to deliver full-stack data solutions.
A great deal of this “opportunity” depends on cloud computing; AI or Data Science
is not just a Python notebook and a flip chart – it demands identifying and ingesting
suitable datasets and leveraging services on cloud to scale from sandbox to proof-of-
concept to prototype to minimum viable product. Ultimately today it is Enterprise AI that
most companies and organizations are targeting. But for many individuals employed
in these roles and for many noncorporate companies, leveraging cloud is a minefield of
sometimes unclear, poorly documented artifacts and hidden costs. Enterprise AI is far
from affordable.
Successful (and affordable) Enterprise AI project delivery requires a healthy amount
of Emotional Intelligence,4 an agile mindset, robust data-fed pipelines and a whole lot
of workarounds to design, scope, and achieve integration across people, processes, and
tools – with all of us dependent on the three main hyperscalers/cloud service providers
(CSPs): Amazon Web Services, Microsoft Azure, and Google Cloud Platform,5 bound to
their data centers, scalable storage, and compute instances.
Agile is important, but so are hybrid/agnostic solutions, multiskill, T-shaped
capabilities, and results-based delivery. Above all, senior stakeholders/managers should
not be taking credit for being spiderman or spiderwoman implementing “agile” project
methodologies if they are just orchestrating chaos or spinning webs that attribute project
failures/blame elsewhere. Agile only works if it follows project design through to a
standard of delivery which meets the overall project vision, not if under the surface it’s
just a pile of shit.6
4. Amit Ray: "as more and more artificial intelligence is entering into the world, more and more emotional intelligence must enter into leadership."
5. All firmly in the world's top 10 companies by market capitalization: https://companiesmarketcap.com/
6. Forrester: "while investments in automation will continue to rise rapidly across Europe, many companies have historically lacked a coherent data/AI strategy, with a patchwork of siloed digital services that have left their IT function in a mess and customers frustrated."
CHAPTER 1
Introduction to AI and the AI Ecosystem
In this chapter, we will scratch the surface first, before embarking on the more involved hands-on practice and applications of later chapters. The intention is to equip readers with the tools to go forward, providing concise context and definitions around the AI Ecosystem, the main applications of AI, data ingestion and data pipelines, and machine and deep learning with neural networks, before wrapping up on productionizing AI.
The AI Ecosystem
Our first section sets the scene for AI today – starting with a look at the hype cycle before
a retrospective on how AI has evolved to this stage. We also introduce some definitions, cloud computing as the enabler for scalable AI, and the ecosystem for "full-stack" AI, and discuss growing ethical concerns.
Historical Context
Depending on your perspective, Artificial Intelligence has its roots in the age of
computing in the 1950s or in the “automata” of ancient philosophy.
Modern AI probably originated in classical philosophy, in references to human thought as a mechanical process, so before we start it's useful to establish some of this context for AI today.1 Table 1-1 provides an overview of Artificial Intelligence's evolution.
5th century BC – First records of mechanical robots: Chinese Daoist philosopher Lao Tzu accounts of a life-sized, human-shaped mechanical automaton
c. 428–347 BC – Greek scientists create "automata" – specifically, Archytas creates a mechanical bird
9th century – First recorded programmable complex mechanical machine
1833 – Charles Babbage conceives an Analytical Engine – a programmable calculating machine
1872 – Samuel Butler's novel Erewhon includes the notion of machines with human-like intelligence
1st half of 20th century – Science fiction raises awareness of AI (the Tin Man in The Wizard of Oz, the robot double of Maria in Metropolis)
1950 – Alan Turing publishes Computing Machinery and Intelligence, asking "Can machines think?" – or "Can machines successfully imitate thought?"
1956 – The term "Artificial Intelligence" is coined in the proposal for the Dartmouth workshop, authored by John McCarthy with MIT cognitive scientist Marvin Minsky and others
1974–1993 – Long AI "Winter" – lack of tangible commercial success and poor performance of neural networks led to reduced funding from governments
1997 – IBM's Deep Blue defeats Garry Kasparov at chess
2011 – IBM Watson wins the quiz show "Jeopardy!"
2012 – ImageNet competition – AlexNet deep neural networks produce a significant reduction in error in visual object recognition
1. www.forbes.com/sites/gilpress/2016/12/30/a-very-short-history-of-artificial-intelligence-ai/?sh=1cfaac1f6fba
Simple machine – A device that does work (transfer of energy from one object to another). Examples: wheels, levers, pulleys, inclined planes, wedges, and screws
Complex machines – Combinations of simple machines. Examples: wheelbarrow, bicycle, mechanical robot (Lao Tzu, mechanical bird)
Programmable machines – Receive input, store and manipulate data, and provide output in a useful format. Examples: punched cards, encoded music rolls
Calculating machines – Mechanical devices used to perform automatically the basic operations of arithmetic. Examples: abacus, slide rule, difference engine, calculator
Digital machines – Systems that generate and process binary data. Examples: computers
AI Today
While the question of "what is AI" may be less clear (we will address this in the next section), without Digital Machines or Computers we certainly wouldn't be talking about AI – and as we will see shortly, ultimately it's the growth of cloud computing, and the high-performance computing that comes with it, that has enabled AI – or, more accurately, democratized AI.
When we look at the mechanics and, in particular, the applied use cases of AI today, Machine and Deep Learning are the underpinning techniques of real Artificial Intelligence, rather than any misconceived ideas about a "rise of the machines" – part ignorance, part science fiction. AI in the job world more often stands for Augmented Intelligence, and no one really wants to see Artificial General Intelligence, in the same way as no one (hopefully) wants a third world war.
Machine Learning
As shown in Figure 1-2, Machine Learning can be thought of as a subset of AI giving
computers the ability to learn without being explicitly programmed. Operationally,
machine learning is much like how humans learn from experience, that is, if we touch
something hot and get burnt, the negative experience is stored in memory and we
quickly learn not to touch it again.
We feed a computer data, which represents past experience, and then with the use
of different statistical methods, we “learn” from the data and apply that knowledge to
future events – these are our model “predictions.”
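To make this "learn from past data, then predict future events" loop concrete, here is a minimal sketch using scikit-learn (an assumption on tooling, for illustration only – it is not one of this book's labs): a model is fitted to historical, labeled records and then scores unseen data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# "Past experience": labeled historical records
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# "Learning": fit a statistical model to the historical data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# "Prediction": apply what was learned to unseen (future) events
print(model.predict(X_test[:5]))
print("accuracy on held-out data:", model.score(X_test, y_test))
```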
Deep Learning
Deep Learning is often considered a subset of Machine Learning and is distinguished by
the fact that deep neural network layers are used to solve predictive problems.
Because of their reliance on big data and modeling, AI, Machine Learning, and Deep Learning are all core techniques for Data Science, which combines modeling, statistics, programming, and some domain expertise to extract insights and value from data.
2. www.abrisconsult.com/artificial-intelligence-and-data-science/
Cloud Computing
So we are concerned with Narrow AI, at least for now, and by the time AGI or ASI comes
around, if we are to believe Elon Musk, we should all hope we are gone.
As mentioned above, the enabler for this form of AI is cloud, and successful Narrow AI implementation requires an end-to-end Cloud Infrastructure. It is difficult to overstate the degree to which cloud computing is a fundamental requirement for any business in 2022. Growth has been explosive, with a 33% increase in cloud spend in 2020 driven by intense demand to support remote working and learning, ecommerce, content streaming, online gaming, and collaboration.
Storage and Compute Power are the main Cloud components used for Big Data
handling central to AI. While Enterprise Machine Learning projects can run with low
overheads on both, Deep Learning projects cannot. Amazon Web Services, Azure, and
Google Cloud Platform are the major (“Big Three”) cloud service providers (or CSPs),
with IBM Cloud and Heroku also used.3 We will cover hands-on examples of all of these
in this book. All CSPs provide a catalog of AI services and tools which greatly simplify the
process of building applications.
While cloud is the key enabler of AI, cloud computing only really works for
Enterprise, or production-grade AI if the company’s Data Strategy is underpinned by
rich, BIG data sources and/or training data.
3. Alibaba is the fourth biggest cloud service provider but for now is largely confined to the Chinese market
Summarized in Figure 1-4 are some of the additional USPs of each of the main cloud
platforms.
Figure 1-5. Supporting the wider AI Ecosystem – CSPs, SIs, and OEMs
Full-Stack AI
It's not all about cloud of course. While cloud provides the platform, a whole host of proprietary and open source tools are used to implement AI: from data engineering tooling such as Apache Kafka or AWS Kinesis, to NoSQL databases such as MongoDB and AWS DynamoDB, through back-end programming languages like Python and Scala4 coupled with model engines like Apache Spark, and front-end BI layers/dashboards like Dash, Power BI, and Google Data Studio.
We will be using many of the above in our hands-on examples in this book.
4. Other languages such as R and Go/Golang are of course also used for Data Science, but are some way off the popularity and spread of Python at the moment. We will, however, touch on Scala further in Chapters 3 and 9.
All good AI solutions require a visualisation/business intelligence (BI) layer to ensure solutions are delivered with a compelling dashboard or interface. There are many tools used for this purpose, including AWS QuickSight, Google Data Studio, Cognos, Tableau, and Looker.6 Here we take a look at Microsoft Power BI – one of the leading BI tools at the moment.
5. See, for example, www.standard.co.uk/tech/apple-card-sexist-algorithms-goldman-sachs-credit-limit-a4283746.html
1. Accept cookies and sign up and download PowerBI from the link below:
https://powerbi.microsoft.com
2. Google "Johns Hopkins Covid data GitHub" to find the latest GitHub data from Johns Hopkins University. For reference, the live csv files for confirmed, recovered, and mortality cases are at the link below:
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
3. Open Power BI and go to Get Data > Web. Enter the URL for each of the three files separately, selecting "Load Data" to import the data into the Power BI data model.
NB: as the link is to the "live" files, the data shown in the visuals below will automatically update each day
4. Exercise: In the Explorer view of Power BI recreate the visuals shown in the example dashboard at the link below:
https://app.powerbi.com/view?r=eyJrIjoiN2FkNzZlMWQtMmE2OC00NzRiLWI0ZGItNDMzNzZhYTIwYTViIiwidCI6IjhlYTkwMTE5LWUxYzQtNDgyNC05Njk2LTY0NzBjYmZiMjRlNiJ9
a. Selecting the correct visual (e.g. card, table, bar chart, area chart, tree chart)
d. Finally add slicers on date and country to allow a user to quickly drill-down
into Covid cases by date (window) and country
6. Exercise (Stretch) – push the finished Power BI report to Power BI Service,7 then proceed to host your dashboard on a public URL
6. And D3.js and Apache Superset. We will look into more Python-specific visualisation tools in later chapters.
Applications of AI
Taking the context from the previous section, and in keeping with the light introductory nature of this first chapter, we will now address the main AI applications:
• Machine Learning
Machine Learning
Machine Learning is a technique enabling computers to make inferences from complex data and remains the biggest area of AI research today. There are three main types: Supervised, Unsupervised, and Reinforcement Learning,8 the development and deployment of each of which we will look at in subsequent chapters. For now, we have provided some basic definitions focused on the inherent differences between these machine learning approaches:
Supervised – training on datapoints where the desired “target” output is known
Unsupervised – no outputs available but machine learning is used to identify
patterns in data
Reinforcement learning – training a machine learning model by maximizing a
reward/score
A key application of machine learning today is in Fraud Detection, which is often run
as both an unsupervised ML and supervised ML problem. The goal is to try and predict
patterns in transactional (and customer) data that indicate fraud is taking place.
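As a hedged illustration of the unsupervised side of fraud detection – a toy sketch using scikit-learn's IsolationForest on synthetic transaction amounts, not a prescribed approach from this chapter – anomalous transactions can be flagged without any fraud labels at all:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic transaction amounts: mostly normal spend plus a few extreme outliers
rng = np.random.default_rng(0)
normal = rng.normal(loc=50, scale=15, size=(500, 1))
fraud = rng.normal(loc=900, scale=50, size=(5, 1))
transactions = np.vstack([normal, fraud])

# Unsupervised ML: no fraud labels are supplied - the model learns what "normal" looks like
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(transactions)   # -1 = anomaly, 1 = normal

print("flagged transactions:", transactions[labels == -1].ravel())
```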
7. Power BI Service requires an organisational email. For assistance, see https://dash-bi.medium.com/how-to-use-power-bi-service-for-free-without-a-professional-email-in-4-steps-f97dbaf4c51e or https://learn.microsoft.com/en-us/power-bi/fundamentals/service-self-service-signup-for-power-bi
8. Four if a semi-supervised approach is used – see Chapter 4
It is assumed that most readers are familiar in some way with basic Machine Learning techniques; we would direct readers to other books to augment their understanding in case some of the applications discussed in this book go beyond that assumed knowledge.
Deep Learning
In many ways Deep Learning is a subset of Machine Learning; essentially extending
Machine Learning to hard, “Big Data” problems typically solved using neural networks.
Neural networks themselves are inspired by neuron connections in the human brain –
when put together we get something like the image on the right (a). This is in fact a
multilayer neural network taking four inputs and providing one output after various data
transformations carried out within the two “hidden” layers of four nodes (or neurons) each.
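As a hedged sketch of the architecture just described – four inputs, two hidden layers of four neurons each, and a single output – the network could be expressed in Keras as follows (Keras/TensorFlow is assumed here purely for illustration, and the dummy data exists only to show the fit/predict cycle):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Four inputs -> two hidden layers of four neurons each -> one output
model = keras.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(4, activation="relu"),    # first hidden layer
    layers.Dense(4, activation="relu"),    # second hidden layer
    layers.Dense(1, activation="sigmoid")  # single output (e.g. a probability)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Train on dummy data purely to demonstrate the fit/predict cycle
X = np.random.rand(100, 4)
y = (X.sum(axis=1) > 2).astype(int)
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[:3]))
```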
neural networks (RNNs), and specialized long short-term memory models (LSTMs)
are increasingly able to outperform traditional methods. We will look at some practical
hands-on examples of these in Chapter 5 on Deep Learning.
Chatbots
Probably the most well-known application of NLP is chatbots and personal assistant
applications such as Siri, Alexa, and Watson Assistant.
Chatbots are essentially software applications to conduct interactive dialogue, with
growing text-to-speech and speech-to-text capability.
The technology has vastly improved since the earliest days of Cortana on Windows. It has evolved to the extent that today Intelligent Virtual Agents (IVAs or Chatbots 2.0) or Interactive Voice Response (IVR) applications are widely used in call centers to respond to user requests. IVAs have in-built self-learning capability and adapt to context, in contrast to earlier rule-based dialog chatbots.
The business value in 2022 is in improving the customer journey and customer experience – with the ability to resolve issues quickly, although not yet in more complex cases.
9. Statista
The main techniques in Natural Language Processing are syntactic analysis and
semantic analysis. Syntactic analysis focuses on grammar while semantic analysis is
concerned with the underlying meaning of text. Both involve a number of underlying
subprocesses (such as lemmatization and word disambiguation) important for
categorization and, ultimately, insight extraction. We will look at these in more
detail in a later chapter.
In its simplest sense, Natural Language Processing in chatbots is invoked to detect,
then best-fit user “intents” and “entities” to a preconfigured dialogue “corpus.” As user
interaction grows, increasingly more data can be used in the training process to improve
the matching process, and enhance the dialog.
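A rough, hedged illustration of this intent matching – a toy scikit-learn sketch, not how any particular chatbot platform implements it – is shown below: user utterances are vectorized and matched to the closest preconfigured intent, and adding more labeled utterances improves the match.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny preconfigured "corpus" of utterances labeled with intents
utterances = [
    "I want to reset my password", "forgot my login details",
    "what time do you open", "what are your opening hours",
    "I want to cancel my order", "please cancel my purchase",
]
intents = ["reset_password", "reset_password",
           "opening_hours", "opening_hours",
           "cancel_order", "cancel_order"]

# Vectorize the text and train a simple intent classifier
intent_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
intent_model.fit(utterances, intents)

# Match a new user message to its best-fit intent
print(intent_model.predict(["how do I change my password?"]))
```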
• Auto-completion of forms
• Document Archiving
Other AI Applications
The above introduction to the main uses of Artificial Intelligence in the workplace is
intended to give a flavor of what to expect later on in this book. Specifically, Machine and
Deep Learning, Natural Language Processing, and CRPA underpin the industry-specific
use cases adopted by the private and public sectors today.
We will describe and walk through many of these later on in our hands-on practice,
but for now we have captured in Figure 1-9 a broad segmentation of use cases by sector.
The goal of our first exercise is to start getting familiar with cloud and cloud services (in this case Azure and the Text Analytics API) and with a key AI application, Natural Language Processing.
1. We use a "sandbox" environment on Microsoft Learn for this exercise:
https://docs.microsoft.com/en-us/learn/modules/classify-user-feedback-with-the-text-analytics-api/3-exercise-call-the-text-analytics-api-using-the-api-testing-console
2. Activating the sandbox requires a Microsoft account and an Azure account10
3. Follow the steps in the tutorial to input text and a) detect language, b) extract
key phrases, c) analyze sentiment, and d) extract entities
5. Try the other methods, Detect Language, Entities, and Key Phrases, using the
same subscription key
6. Try to make a call from a different region with your subscription and observe
what happens
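The same Text Analytics calls made in the testing console can also be scripted. The hedged Python sketch below calls the v3.1 sentiment endpoint over REST; the endpoint and key values are placeholders you would replace with your own resource details.

```python
import requests

# Placeholders - replace with your own Azure resource endpoint and key
endpoint = "https://<your-resource-name>.cognitiveservices.azure.com"
key = "<your-subscription-key>"

url = f"{endpoint}/text/analytics/v3.1/sentiment"
headers = {"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"}
documents = {"documents": [
    {"id": "1", "language": "en", "text": "The new dashboard is fantastic and easy to use."},
    {"id": "2", "language": "en", "text": "Support was slow and the app kept crashing."},
]}

# Call the sentiment method; language detection, key phrases, and entities work similarly
response = requests.post(url, headers=headers, json=documents)
for doc in response.json()["documents"]:
    print(doc["id"], doc["sentiment"], doc["confidenceScores"])
```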
10. Free for 30 days, then requires a credit card to maintain as pay-as-you-go (as long as resources are deleted immediately after creation, the cost is likely less than £2 per month, or free). You can check costs incurred at https://portal.azure.com/#blade/Microsoft_Azure_CostManagement/Menu/costanalysis
AI Engineering
Industry research shows that very few AI projects are successful, partly because technically minded, and often junior, resources on an AI project tend to forget that AI involves people, processes, and tools. The reality is that writing machine or deep learning code is just one small part – and invariably notebooks or scripts fail when the complex surrounding infrastructure is not considered.
Data pipelines lie behind every successful AI application – from data ingestion through several stages of data classification, transformation, and analytics, to training machine learning and deep learning models, through to inference and retraining/data-drift processes, the goal is to yield increasingly accurate decisions or insights.
Ultimately no AI project can be a success without a Data Strategy in place, with a robust, serviced delivery pipeline for ingesting data into downstream modeling and analytics processes. With that in mind, we discuss in this section how data ingestion works and how to set up the data pipelines necessary for building a successful AI application.
11. Enterprise AI – more on this in Chapter 7
Extract
Data is extracted from a variety of internal and external sources such as text files, csv, Excel, JSON, HTML, relational and nonrelational databases, websites, or APIs. More modern formats such as Parquet and Avro are also increasingly utilized for their efficient compression of datasets.
Transform
Data is transformed in order to make it suitable for, and compatible with, the schema of a target end-user system. Transformation involves cleansing data to remove duplicates or out-of-date entries; converting data from one format to another; joining and aggregating data; and sorting and ordering data, among others.
Load
In the final ETL step, data is loaded into a target system such as a data warehouse. Once
inside the data warehouse, data can be efficiently queried and used for analytics and
business intelligence.
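A minimal, hedged pandas sketch of this Extract–Transform–Load flow is shown below; the file name, column names, and SQLite target are illustrative assumptions rather than artifacts from this book's labs.

```python
import sqlite3
import pandas as pd

# Extract: read raw data from a csv source (illustrative file and column names)
raw = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])

# Transform: cleanse duplicates and missing entries, convert formats, aggregate and sort
clean = (raw.drop_duplicates()
            .dropna(subset=["customer_id", "amount"])
            .assign(amount=lambda df: df["amount"].astype(float)))
daily_totals = (clean.groupby([clean["transaction_date"].dt.date, "customer_id"])["amount"]
                     .sum()
                     .reset_index()
                     .sort_values("transaction_date"))

# Load: write the transformed data into a target store (here a local SQLite "warehouse")
with sqlite3.connect("warehouse.db") as conn:
    daily_totals.to_sql("daily_customer_spend", conn, if_exists="replace", index=False)
```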
Data Wrangling
The above concepts are very much central to the role of the Data Engineer, but there is considerable overlap, particularly in the Transform step of ETL, in serving downstream Data Science and Data Analysis processes.
Data Wrangling (or Data Munging) is the main process in Data Science and AI which ensures data is in a fit-for-purpose state to carry out analytics or BI. Many people are familiar with the statistic that cleaning data is 80% of the job of a Data Scientist. In truth, Data Wrangling can take up to 80% of a Data Scientist's job, and it involves more than just cleaning data: there are many more subprocesses, including formatting, filtering, encoding, scaling and normalizing, and shuffling or splitting. And these are not only restricted to structured data; unstructured data (such as text or images) is in scope too, for both machine and deep learning.
Data Wrangling sits in a process after data acquisition and before modeling/machine
or deep learning. It is highly iterative and often coupled with exploratory data analysis
(EDA) to understand better the structure of the individual fields (typically columns)
within a dataset. While EDA focuses on passively “looking” at the data, Data Wrangling
actually actively “changes” the data in some way.
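To make those subprocesses concrete, the short, hedged sketch below (with an illustrative, made-up dataset) runs through filtering, encoding, scaling, and shuffling/splitting with pandas and scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative dataset - the column names are assumptions for the sketch only
df = pd.DataFrame({
    "age": [23, 45, 31, 52, 38, 29],
    "country": ["UK", "US", "UK", "DE", "US", "UK"],
    "income": [21000, 68000, 43000, 81000, 55000, 30000],
    "churned": [0, 1, 0, 1, 0, 0],
})

df = df[df["age"] >= 18]                                   # filtering
df = pd.get_dummies(df, columns=["country"])               # encoding categorical fields
features = df.drop(columns="churned")
features[["age", "income"]] = StandardScaler().fit_transform(features[["age", "income"]])  # scaling
X_train, X_test, y_train, y_test = train_test_split(       # shuffling and splitting
    features, df["churned"], test_size=0.33, random_state=42)
print(X_train.head())
```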
We will discuss Data Wrangling (and ETL processes) in more detail in our chapter
on Machine Learning and in particular look to establish best practice techniques
for productionizing Machine Learning by looking at key Case Studies such as Fraud
Detection.
Performance Benchmarking
While ETL and Data Wrangling deal with preprocessing of data, we also need a means to
“score” (ideally continuously) the data flowing into an AI solution.
Building an AI app requires constant training and testing. Knowing how to
benchmark performance and which measures to use is a key overhead in both machine
and deep learning and needs to be rigorous and adaptive to the evolving (input) data
pipeline.
Measures such as accuracy, recall, precision, and a confusion matrix are used for Supervised Classification problems to better understand the proportion of actual cases we correctly predicted (whether negative or positive). For Supervised Regression problems, root mean squared error and R-squared are used to compare forecasted output with the actual target data. In Deep Learning we may use similar measures to the above, plus additional Deep Learning-specific measures such as loss and cross-entropy.
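As a hedged illustration (scikit-learn, with dummy predictions standing in for a real model's output), these benchmarking measures can be computed as follows:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, mean_squared_error, r2_score)

# Classification benchmarking: actual vs predicted labels (dummy values)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))

# Regression benchmarking: forecasted output vs actual target data (dummy values)
y_actual = [3.1, 4.8, 2.2, 6.0]
y_forecast = [2.9, 5.1, 2.5, 5.7]
print("RMSE     :", mean_squared_error(y_actual, y_forecast) ** 0.5)
print("R-squared:", r2_score(y_actual, y_forecast))
```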
NO-CODE CLASSIFICATION
The goal of this exercise is to understand how the outcomes of any AI application depend on the flow of data into it. The exercise walks through how this is done with a "No-code" binary (supervised) classification model to predict income levels in Microsoft Azure ML Studio.
3. Follow the steps in the tutorial below to import data, perform basic EDA and
Data Wrangling, modeling, and performance benchmarking.
4. http://gallery.cortanaintelligence.com/Details/3fe213e3ae6244c5ac84a73e1b451dc4
Before embarking on a high-level look at Deep Learning, let's take a quick refresher
on Machine Learning. As mentioned in the opening section, it is expected that readers have some grounding in Machine Learning already, so we will only address important
concepts here that are relevant to implementing an AI solution.
Machine Learning
At a basic level, there are two types of Machine Learning: Supervised and Unsupervised Machine Learning. Reinforcement Learning is sometimes considered a third type, although it can equally be considered a type of Unsupervised Learning.
12. Or three if Time Series Forecasting is considered as distinct from Regression
Reinforcement Learning
Reinforcement learning involves real-time machine (or deep) learning with an agent/
environment mechanism which either penalizes or rewards iterations of a model based
on real-time feedback from the surrounding environment (how accurate is the model).
While the scope of this book is mainly focused on mainstream business and
organizational applications, advances in reinforcement learning are in general where
there is considerable hype in the media – essentially this is the underlying technique
that drives “industrial-scale” applications such as Google’s Search Engine, autonomous
vehicles, and robotics.
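For a feel of the reward/penalty mechanism described above, here is a deliberately tiny, hedged Q-learning sketch on a made-up five-state corridor – far from an industrial-scale application, but it shows an agent improving its value estimates from environment feedback:

```python
import numpy as np

# Toy environment: a five-state corridor where the agent is rewarded (+1) only
# when it reaches the right-most state. Actions: 0 = move left, 1 = move right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))          # the agent's value estimates
alpha, gamma, epsilon = 0.1, 0.9, 0.2        # learning rate, discount, exploration

rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy: mostly exploit current knowledge, sometimes explore
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Feedback from the environment (reward or none) updates the value table
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.round(Q, 3))  # higher values for "move right" emerge in every state
```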
Deep Learning
We saw above that Deep Learning refers to applying artificial neural networks with
several hidden layers, that is, a deep neural network.
Deep Learning, though, can use a number of different types of Artificial Neural Network, each of which has a number of key features which contribute to their adoption in solving specific predictive analytics challenges in the workplace. The main types covered in our hands-on labs in this book are listed below.13
13. Due to their now rather limited application to e.g. Dimensionality Reduction of datasets,
14. We will cover Gradient Descent and Backpropagation in Chapter 5
15. Although there are some labs in this book where we will run Python as a script (.py format), most labs run Python via either a Jupyter Notebook or Google Colab Integrated Development Environment (IDE) (i.e., .ipynb format). The same labs can be run instead using Visual Studio Code or PyCharm. For Jupyter environments, the use of the RISE (Reveal.js) extension is recommended to enable code samples presented as slideshows: https://rise.readthedocs.io/en/stable/
TENSORFLOW PLAYGROUND
The goal of this exercise is to visualize the training process of a simplified deep
learning model and to try and improve its performance by playing with some of the
many tuning “levers” available to us.
1. Go to http://playground.tensorflow.org/
2. Have a look at the four datasets by clicking on the thumbnail icons on the RHS of the screen (under DATA).
3. Note in all cases the (supervised) datasets have labeled data: either blue or
orange and the output shows these datapoints plotted on a 2D (x1, x2) grid.
The inputs for our model are shown under FEATURES and are initially set to
just x1 and x2 coordinates while we have two hidden layers of 4 and 2 neurons
respectively.
4. Choose one dataset and press run. This will start the training process (and
immediately after the evaluation process). Notice how the neural network
weights get updated through each forward pass and each epoch.
5. Observe the Training and Test loss shown on the right-hand side. A good model
will have both the training and test loss close to zero.
6. Stop the model training process for now. The first three datasets are relatively
easy to train on. Choose the last (spiral) dataset and restart the training
process.
8. Choose the last (spiral) dataset and restart the training process. For now, notice
that this time the modelling process struggles to converge – the loss oscillates
a lot. We will have another look at TensorFlow Playground later on in this book
in Chapter 5, and try to achieve a better result training on this dataset.
Productionizing AI
Theory is one thing and delivery is another. We discussed briefly in Section 3 the proliferation of failed AI projects – the reality is that ever since Data Science became a "glamorous" job role backed by over-hyped job board marketing, poorly designed and over-engineered R and Python scripts with broken integration links have left a trail of waste across the Enterprise AI landscape.16
Out of this “wasteland” most organizations and businesses are attempting to redress
the landscape and recognize a need to introduce a broader Design/System Thinking
approach at the outset to ensure that AI solutions are built with multiuser engagement
(both technical and nontechnical), end-to-end process and system-wide ecosystem,
infrastructure and integration in mind.
Compute and Storage
Very few AI solutions today can be considered as “on-prem” solutions, and any true
Enterprise AI solution goes way beyond the underlying machine or deep learning model.
AI solutions today are inextricably tied to Cloud Computing and specific resources and
services offered by Big Tech – the need for cloud stems from two key “grouped service”
offerings: Compute and Storage.
Compute is essentially computer processing power – tied to computing memory, it is
the ability to perform software computation and (often complex and highly parallelized)
calculations. Compute is typically delivered via a Virtual Machine on Cloud.
Storage in the context of an organization’s operational and strategic needs is the
means by which all their data requirements are supplied, replenished, and maintained.
Most Cloud providers offer File Storage and SQL/NoSQL-based options for storing both
structured and unstructured data.
While original transactional DB systems required storage and compute to be as
close as possible to reduce latency, faster networks and the increasing availability and
scalability of database systems coupled with the need to reduce hosting costs have
driven the separation of compute power and storage.
16. "Technical Debt" is discussed further in later chapters
All businesses need two types of data: transactional and processed (batched or
aggregated) data. Much of this still resides in Data Warehouses today, but because
sophistication in Predictive Analytics demands complex analytical queries, businesses
and organizations often decide to migrate their data to cloud as the relative increase in
latency is a lower priority than the overall cost savings achieved through cloud.
Table 1-3. Gartner Worldwide IaaS Public Cloud Services Market Share,
2019-2020 (Millions of US Dollars)17
Columns: Company, 2020 Revenue, 2020 Market Share (%), 2019 Revenue, 2019 Market Share (%), 2019–2020 Growth (%)
Vendor lock-in to one particular Big Tech provider has become an issue, with many companies now trying to diversify and utilize services from multiple cloud service providers.
17. www.gartner.com/en/newsroom/press-releases/2021-06-28-gartner-says-worldwide-iaas-public-cloud-services-market-grew-40-7-percent-in-2020
While each CSP comes with its own “Marketplace” of cloud tools, we discuss below
only those that fall under the key Compute and Storage grouped services. Other grouped
services such as Governance, Security, Auto-scaling, and Containerization are discussed
in more detail later.
Compute Services
Azure Virtual Machines and AWS EC2 instances – the latter typically provisioned via a Virtual Machine on a Virtual Private Cloud – are the main options for Cloud Compute. Google Compute Engine is GCP's main offering.
Storage Services
Amazon Simple Storage Service (S3) is probably AWS's most well-known storage service and is used to securely store data and files in "buckets." Comparable services from other vendors include Azure Blob Storage and Google Cloud Storage. We will make use of these compute and storage services in some of our hands-on labs later.
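As a small, hedged illustration of working with S3 from Python using boto3 (the bucket and file names are placeholders, and AWS credentials are assumed to be configured already, e.g., via the AWS CLI):

```python
import boto3

# Placeholder bucket name - S3 bucket names are globally unique, so use your own
BUCKET = "my-productionizing-ai-demo-bucket"

s3 = boto3.client("s3")

# Store a local training file in the bucket
s3.upload_file("train.csv", BUCKET, "data/train.csv")

# List the objects under the prefix to confirm the upload
response = s3.list_objects_v2(Bucket=BUCKET, Prefix="data/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download it back, e.g. onto an EC2 compute instance
s3.download_file(BUCKET, "data/train.csv", "train_copy.csv")
```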
Important Note: Use of Cloud doesn't come free, despite the pretense of offering "Free Tiers." We will be using cloud frequently in this book, but be prepared for costs that could run up to $500 to follow all the labs.
Free Tiers are limited to certain resources, and these resources themselves have rather mediocre capacity limits. There are limits on usage of services, so if a customer faces costs on their cloud account, it's likely this is a result of usage on a cloud resource/service exceeding free tier limits, or of the subscription having rolled over to pay-as-you-go (after one year).
It is the reader's responsibility to bear all costs related to provisioning of cloud services. We strongly advise always stopping and deleting resources when finished with them. It is a source of immense frustration to this author that Big Tech will not do it automatically for you. One wonders how this practice is justifiable from the richest companies in the world.
Virtual Machines in particular can incur costs even when not running: they are someone else's server, and whether they are on or not, there are energy costs involved (particularly high right now with a war going on between Russia and Ukraine).
Help to manage your costs by use of sandbox environments (where available) and
by deleting resources when done. The author is also happy to provide complaint letters
to a CSP for onward referrals or a small donation to a charity, but ultimately it is your
responsibility to bear all costs related to provisioning of cloud services.
Containerization
While the relatively low cost and ease of deployment of compute and storage solutions on cloud have underpinned adoption of cloud for AI, the use of containers has become the go-to means of productionizing AI applications.
All the main cloud platforms contain containerization services – a lightweight alternative to full machine virtualization that involves encapsulating an application in a container with its own operating environment. Containerization comes with a number of benefits highly suited to building robust production-grade AI solutions, including the ability to simplify and speed up the development, deployment, and application configuration process; increased portability, server integration, and scalability; as well as increased productivity and federated security.
Docker and Kubernetes
Docker is the main container runtime we will utilize in this book. Docker's USP lies in its handling of dependencies, multiple (programming) languages, and compilation issues when creating isolated environments to launch and deploy applications. In much the same way a "physical" container can be transported by ship, truck, or train, standardization within a Docker container means it is effectively able to run on any platform.
Although there are many similarities with Virtual Machines, as shown in Figure 1-12, Docker better supports multiple applications sharing the same underlying operating system. Docker is also fast, starting and stopping apps in a few seconds. PostgreSQL, Java, Apache, Elastic, and MongoDB all run on Docker.
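As a minimal, hedged illustration of that portability from Python (using the docker SDK for Python and assuming a local Docker daemon is running), an application can be launched in its own isolated environment in a few lines:

```python
import docker  # pip install docker; assumes a local Docker daemon is running

client = docker.from_env()

# Run a throwaway containerized Python app in its own isolated environment.
# With detach=False (the default), run() returns the container's log output,
# and remove=True cleans the container up afterwards.
output = client.containers.run(
    "python:3.10-slim",
    ["python", "-c", "print('hello from inside a container')"],
    remove=True,
)
print(output.decode())
```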
Although we will not use it in this book, Kubernetes (k8s), a container management tool, is often used to orchestrate Docker "instances." An open source platform, Kubernetes was originally designed by Google and automates deployment, management, and scaling of applications in containers.
The goal of this exercise is to get familiar with Compute, Storage, and Container
solutions on cloud and to explore Machine Learning automation with Python – often the
end goal of a production-grade AI solution.
1. Sign up for AWS Free Tier and open AWS Console: https://console.aws.amazon.com
https://aws.amazon.com/products/compute/
https://aws.amazon.com/products/storage/
https://aws.amazon.com/containers/services/
3. Now go to Google Colab and run the Python script below to see an end-to-end
example of machine learning automation.
4. This script uses PyCaret and an inbuilt insurance dataset to run through a
number of data pipeline/feature transformations, train linear regression models
and select the best one based on performance. This same hands-on lab will be
the starting point for an extensive final lab in Chapter 9 on Productionizing AI.
5. Note the script uses a static data file built into the PyCaret library but often
this kind of automation with a Big Data dataset leverages cloud storage, and
distributed computing to improve overall performance and data security.
Python code:
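A minimal sketch of the kind of PyCaret automation described above is shown below; it assumes PyCaret is installed in the Colab runtime and is indicative rather than the exact lab script.

```python
from pycaret.datasets import get_data
from pycaret.regression import setup, compare_models, predict_model

# Inbuilt insurance dataset (a static file shipped with the PyCaret library)
data = get_data("insurance")

# Data pipeline / feature transformations are configured in setup()
setup(data=data, target="charges", normalize=True, session_id=42)

# Train several linear-style regression models and select the best one
# based on cross-validated performance
best = compare_models(include=["lr", "ridge", "lasso", "en"])

# Score the winning model on the hold-out set
predict_model(best)
```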
Wrap-up
This glimpse at productionizing AI using AWS Storage and Compute instances and,
separately, PyCaret hopefully gives readers a taste of things to come. While basic, the
emphasis is on exposure at this stage to key tools and techniques, all of which will be
elaborated in subsequent chapters.
In our next chapter, we take a high-level look at AI solution implementation and best practice AI project delivery with DataOps.
CHAPTER 2
AI Best Practice and DataOps
We ran through in the first chapter the key themes for productionizing AI today. Before
we proceed into an exhaustive look at data ingestion and techniques and tools for
building an AI application, it's important to establish a framework for success.
That framework starts with a step-back and a “top-down” understanding of the
wider context for AI, key stakeholders, business/organizational process methodologies,
the importance of collaboration and stakeholder consensus, adaptability and reuse as
well as best practice in delivering high-performance AI solutions. There are many best practice frameworks which can perform this function, but in this book we focus on the one we consider best placed to achieve a culture of continuous improvement in the workplace today – DataOps.
The approach taken in this chapter is focused on awareness of DataOps concepts rather than a "deep dive," but along the way we will touch on the cornerstones of DataOps,1 including agile and how to orchestrate agile development and delivery, team and design sprint methods, and collaboration.
We will also take a brief look at creating a high-performance culture, reusing
materials and artifacts, version control and code automation including continuous
integration (CI) and continuous deployment (CD) with Jenkins and containerization
with Docker as well as test automation with Selenium and monitoring with Nagios.
In later chapters, we will “join-the-dots” when we take a look at practical
implementation of a data/analytics/AI project, and adapting projects from other
industries while tying implementation of best practice to DataOps techniques.
1. See also https://dataopsmanifesto.org/en/
DataOps
Let's start with the basics – DataOps is not DevOps. While DevOps is concerned with
software development, Data Analytics (and therefore AI) requires this PLUS control over
how data is evolving.
As data is the underlying "currency" of Data Analyst, Data Scientist, or AI Engineer roles, governance and data quality are critical if we are to generate tangible, meaningful outcomes and insights – in addition to cloning a production environment to develop an "application" or "solution," the underlying infrastructure must accommodate the "continual orchestration" of changing data.
As shown in Figure 2-1 from DataKitchen, a DataOps implementation stretches
across the entire data pipeline – from multiple data sources, through integration,
cleaning, and transformation before being consumed by (multiple) end users. The
context for DataOps is that analytics and AI fail without an effective data strategy, and
89% of businesses struggle with managing data.2
2. Source: Experian
The Data "Factory"
DataOps is essentially the confluence of three key areas: DevOps, Agile, and Lean,3 with
the goal of streamlining data pipelines (more on this in Chapter 3) and improving data
quality (and reliability) by decreasing the innovation and change cycle, lowering error
rates in production, improving collaboration and productivity through, for example,
self-service enablement. At a more granular level, data monitoring and measurement,
metadata, scalable platforms, and version control are also key areas to ensure a data
pipeline solution is driven by organizational goals.
Besides being a framework or methodology, DataOps is also a culture typically
driven by a CIO/CDO or the organizational IT function. Metrics are key, both at an
individual contributor level and to measure improvements in productivity and quality
across projects.
3. Specifically Statistical Process Control (SPC)
4. Of course some may be sandbox, proof of concept (PoC), prototype, or minimum viable product (MVP), with the low-level aims being to demonstrate/test an idea and/or demonstrate core features/user journey
5. DLOps
Besides high entropy data, the real challenge with Machine Learning isn't building the model itself. As Figure 2-2, the MLOps process map/landscape from NVIDIA and GCP, shows, it's integrating a complete machine learning system – with countless examples of poorly implemented projects resulting from a failure to adopt a "system-wide" approach:
• Lack of reproducibility
Enterprise AI
As we will see in later chapters, MLOps (and DataOps) is closely coupled with
“Enterprise AI” – effectively embedding AI into an organization’s company-wide
strategy. MLOps and Enterprise AI are about designing a Target/To-Be architecture and
implementing a robust AI infrastructure and enterprise data center while ensuring the
entire workforce is aligned and trained on a company’s tangible AI assets.
Enterprise AI is viewed by C-Level/Board as best practice for businesses to run AI
successfully – MLOps fits that vision with many CSPs today offering MLOps as an inbuilt
solution and others (such as DataRobot) adopting MLOps as their entire business model
or product offering.
Using Python in Jupyter Notebook, the goal of this exercise is to take a first look
at Google Cloud Platform and BigQuery, to help bridge the knowledge gap between
standalone Data Science and a cloud-managed DataOps or MLOps solution.
https://console.cloud.google.com
2. Activate your free trial – this requires putting through credit card details but
contains $300 of free credits6
6. Clone the Jupyter notebook from GitHub below and run through the notebook to
see how the Python–BigQuery interface works
https://github.com/bw-cetech/apress-2.1.git
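For orientation, a snippet along the lines below shows how the Python–BigQuery interface works; this is an indicative example against a public dataset, not the notebook's actual code, and it assumes google-cloud-bigquery, pandas, and authenticated GCP credentials.

```python
from google.cloud import bigquery

# Placeholder project id - replace with your own GCP project
client = bigquery.Client(project="your-gcp-project-id")

# Query a BigQuery public dataset and pull the results into a pandas DataFrame
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
df = client.query(query).to_dataframe()
print(df)
```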
https://portal.lenses.io/register/
2. After email verification, select the Lenses Demo and create a workspace
6. Google say they won't charge unless you manually upgrade to a paid account, but keep track of usage, and check API calls and free trial status (remaining credit) on the Google Console (link above) by going to Billing ➤ Overview. Remaining free trial credit is shown in the lower RHS of the screen.
4. Exercise – run a different streaming query on one of the other datasets, for
example, financial_tweets
d. Consumer Monitor
Agile
We saw in the last section that Agile is one of the three core practices intrinsic to
DataOps. This next section takes a closer look at Agile in the context of AI solutionizing.
As one of the three cornerstones of DataOps, agile development and delivery is all
about collaboration and adaptability to balance these seemingly competing aims.
The main part is clearly anchored on the "people" perspective. DataOps brings a diverse mix of stakeholders together on a data project, with roles spanning the business/client (who define the business requirements), traditional roles (data/solution architects and data engineers), newer roles (including Data Scientists and ML Engineers), and IT operations (those who build and maintain the data infrastructure).
Development/Product Sprints
Prioritization of business requirements on projects makes use of techniques such as
MoSCoW (i.e., must have this requirement, should have, could add if doesn’t affect
anything else, would like to have this feature/wishlist). Successful data projects prioritize bug fixes and features through "sprints" which walk through the pipeline process from data integration, to cleaning, transforming, and publishing, and typically last 2–4 weeks.
While well-defined project work packages should lead to slick, agile handoffs across delivery teams, at the start teams often rebuild organizational Data Warehousing and Analytics landscapes into an initial scaled-down sandbox (prototyping) environment. From these modest beginnings, sandboxes are scaled up to lab environments for process orchestration and agile testing phases with continuous integration (CI7) of code changes. Ultimately the end goal is to dynamically deploy (CD) via a "Production Belt" and monitor the results.
7 CI/CD will be discussed in the section below.
Benefits of Agile
The DataOps test and release cycle from sandbox and scale-up through deployment is designed around agile software delivery frameworks. Data and Analytics specialists use this DataOps best practice to rapidly fix errors and implement feature requests for redeployment into Production. The end result is a "managed delivery" which addresses the key areas below:
• Changing requirements – an incremental approach helps evolve a centralized data repository and supports "dynamic" analytic requirements
• Slipped schedules – an iterative, stepped approach shortens overall delivery times
• Improved flexibility – a pipeline of feature requests is turned around with continual process improvement while limiting the delivery of irrelevant features
• Disappointed users – a systemized, accelerated issue resolution process means less scope for deliverables not meeting user requirements
Ultimately the continual (cyclical) improvement process should lead to higher-
quality product delivery and ROI.
Adaptability
While the people perspective rightfully gets prioritized, Agile (and DataOps) extends
beyond optimizing teams and collaboration to applications and systems. Adaptability
is important, particularly with cloud computing underpinning many AI applications
today; essentially adaptability means scalability and reuse of business logic, APIs, and
microservices.
Microservices in particular have become popular across the application ecosystem
as companies move away from siloed code on a single server to applications as a
collection of smaller, independently run components.
The underlying architectures that emerge from microservice integration promote consistent and secure reuse of business logic, API sharing, and event handling. They enable greater agility through decentralized team ownership, elastic scalability with usage, and resilience by isolating changes to one microservice from the others at runtime.
React.js, along with Vue.js and Angular, is one of the most popular open source front-end JavaScript libraries for building an aesthetic user interface. The goal of this lab is to implement a simple React app which can later be extended to front-end AI applications:8
8 We will use the same boilerplate in Chapter 7 to build a full-stack deep learning app.
1. Install node and npm from the link below – both are installed from the single Windows installer:
https://nodejs.org/en/download/
3. Opening terminal from within your new folder (type "cmd" in the Windows Explorer path and press Enter), install the React boilerplate app by running:
npm install -g create-react-app
As an alternative for building web applications, VueJS can be more suitable for front-ending smaller, less complex AI applications. While React is a library, VueJS is a Progressive JavaScript Framework. Here we walk through how to get a VueJS app up and running as a template for building an AI user interface.
1. If not already installed, go ahead and install node.js (and npm) from the URL in step 1 of the React.js lab above:
https://nodejs.org/en/download/
6. Select the Babel default source-to-source compiler for browser-readable .js, .html, .css (using the arrow keys)
9. Finally, navigate in your browser to the URL shown in the terminal, i.e., http://localhost:8080, to see the app running
Exercise – try to update the source code to remove all the text underneath the message "Welcome to Your Vue.js App" and replace it with a screenshot hyperlinked to Replika (https://replika.com/)
Code Repositories
Any cohesive team working on developing an AI solution needs to be "singing from the same hymn sheet." Code repositories ("repos") are one of the key collaborative enablers for ensuring developers and data specialists work in sync.
Git and GitHub
Version control, or source control, is the practice of tracking and managing changes to source code. In recent years version control systems (VCS) and, specifically, Distributed Version Control Systems (DVCS) have become valuable to DataOps teams. Besides intrinsic "DevOps" benefits, including reduced development time and an increased number of successful deployments, evolving datasets, such as the widely used Covid data from Johns Hopkins University, are increasingly maintained on distributed version control systems.
Git9 is by far the most popular DVCS, though there are a number of other systems, such as Beanstalk, Apache Subversion, AWS CodeCommit, and Bitbucket, used for projects with specific integrations to other (usually single) CSPs.
Besides traceability and file change history (keeping track of every code modification
and every dataset change), Git eases the process of rolling back to earlier code/data
states during development and has enhanced branching and merging features vital for
DataOps teams working on specific application components or user stories.
The GitHub "ecosystem" comes complete with Git (effectively the command-line back end), GitHub itself – a cloud-based hosting service that lets you manage Git repos from a central location – and GitHub Desktop – a desktop client for interacting with GitHub using a GUI.
GitHub is now utilized by millions of software developers and companies worldwide, with public repos in existence prior to February 2020 archived in the GitHub Arctic Code Vault – a long-term archive 250 m deep in the permafrost of an Arctic mountain. Forking a repo on GitHub or cloning a public GitHub repo into a local directory are the main ways to access and develop preexisting source code (or update datasets) on GitHub. Both can be carried out using git commands in terminal or (more intuitively) using GitHub Desktop.
Version Control
The below image is an example of how Git performs version control on three different
versions of the same (.py) file. Multiple users can select which version of the file they
want to use and make changes to it independently before merging back to the single
“Master” repo.
9 Git was originally developed in 2005 by the creator of the Linux operating system kernel.
Branching and Merging
Simplified “branching” is one of the main reasons why Git is by far the most widely used
version control system today.
The diagram below shows branching (to a development “dev” branch) from a master
repo in order to develop code further – with two other developers on the project adding
feature requests. The first developer makes some minor changes and merges their code
changes to the development (dev) branch, while the other developer continues working
on their feature, postponing merging until later.
Once development (bug fixes and feature requests) has been completed, testing would be done on the dev branch before the changes are finally committed (back) to the master repo.
Git Workflows
Git workflows are the means by which changes to code and data are placed into repos.
There are four fundamental “layers” through which code and data can pass: a working
directory (typically a local user machine), staging area, local repository, and remote
repository (typically on GitHub itself ).
A file in your working directory can be in three possible states: modified, staged, or committed. Because Git is a distributed version control system as opposed to a centralized system, certain commands (such as commits) do not require communication with a remote server each time they are actioned. This is shown in the diagram below, with corresponding workflows and git commands shown underneath.
No development is done today without recourse to GitHub, so in this lab let's take a look at setting up a GitHub account, installing Git, and creating a new repo:
4. Introduce yourself to Git (by right-clicking the folder and selecting Git Bash or
via terminal opened in your test folder):
https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/checking-for-existing-ssh-keys
https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent
8. Make sure the SSH button is selected and run the git add, commit, and push commands shown on the screen by clicking the copy button and pasting them into Git Bash (still opened at your local folder) or into terminal
9. Upon refreshing your repo you should now see a README.md file in your
GitHub repo
10. Exercise – try to add the react.js app source code from section 2 (Agile) to your
local repo and push to GitHub
11. Exercise – clone a public repo into a different local repo (make sure you have
created a new folder and are running the clone command in that folder)
NB this lab can alternatively be completed by installing GitHub Desktop from https://desktop.github.com/ and using that instead of Git.
As an additional exercise with GitHub Desktop, try to fork a public repo (into your GitHub
account) in addition to cloning to a local repo.
GitHub isn’t just about version control. This lab shows you how to deploy and host the
react app from section 2 (Agile) to GitHub Pages.
NB the approach shown is via Git but a deployment using GitHub Desktop can also be
carried out by following the instructions here: https://pages.github.com/
a. Commit
b. Push to remote
CI/CD in DataOps
The intention of Continuous Integration and Continuous Delivery (or Continuous
Deployment10) in software is to enforce automation through build, test, and deploy
processes. Essentially this is to enable teams to release a constant flow of software
updates into production to quicken release cycles, lower costs, and reduce the risks
associated with development.
In the context of DataOps (as opposed to DevOps) and AI, the scope of automation extends to data pipeline orchestration (including handling data drift) and modeling automation (including the retraining process). In theory, this should mean each time a change is made to underlying code or infrastructure and a data change occurs (or, more realistically, each time data distributions have deviated significantly), automation kicks in and the application is rebuilt, tested, and pushed to production.
10 Continuous Delivery and Continuous Deployment are similar but have slightly different goals – Continuous Deployment focuses on the end result, that is, the actual (end-point) deployment, while Continuous Delivery focuses on the process, that is, the release (steps) and release strategy.
Introduction to Jenkins
Jenkins is an open source automation server with hundreds of plug-ins and one of
the leading CI/CD tools for DataOps. It is used by Expedia, Autodesk, UnitedHealth
Group, and Boeing as a continuous delivery pipeline.
Originally built to automate testing for Java developers, but now with support for multilanguage, multirepo projects, Jenkins simplifies the setup of a continuous
integration or continuous delivery (CI/CD) environment. It does this via a Jenkinsfile,
effectively a “pipeline” script where a declarative programming (macro-managed) model
defines executable steps via a hierarchy: pipeline block > agent > stages.
A Jenkins pipeline is really a collection of jobs triggering each other in a specified
sequence. For a small app, this would be, for example, three jobs: job1 build, job2 test,
and job3 deploy. Jobs can also be run concurrently and for more complex pipelines, a
Jenkins Pipeline Project is used where jobs are written as one whole script and the entire
deployment flow is managed through Pipeline as Code. Continuous integration with
Jenkins also supports GitHub automation.11
Blue Ocean provides a better UX for setting up Jenkins pipelines by exposing a low-
code interface and low-click functional development processes, negating the need for
programming a Jenkinsfile.
11 That is, normally manual processes on GitHub, such as updating code releases with notes and binary files, adding git tags to a workflow, and compiling a project, are all automated within Jenkins CI/CD.
Maven
Closely coupled with Jenkins (Jenkins uses Maven as its build tool), Apache Maven is
a Project Management tool designed to work across the software lifecycle (performing
compile, test, package, install, and deploy tasks), centrally managing project build
including dependencies, reporting, and documentation.
When a Jenkins build is triggered, Maven downloads the latest code changes and
updates, packages them, and performs the build. Like Jenkins, Maven works with
multiple plugins, allowing users to add other bespoke tasks, but only does Continuous
Delivery, not Continuous Integration (CI).12
12 So no integration to merge developer code on GitHub.
Containerization
We finish this section with a look at containerization.
Because containers standardize deployments across multiple machines and
platforms, they can naturally accelerate DataOps processes, and in particular CI/
CD, where testing and debugging processes are “ringfenced” from external file
dependencies.
Containerization comes with a number of additional benefits highly suited to building robust, production-grade AI solutions, including the ability to simplify and speed up development, deployment, and application configuration, increased portability, server integration and scalability, as well as increased productivity and federated security.
Docker and Kubernetes
Docker is the main container runtime we will utilize in this book. Docker's USP lies in its handling of dependency, (programming) language, and compilation issues when creating isolated environments to launch and deploy applications as containers.
Although there are many similarities with Virtual Machines, as shown in the image
below, Docker better supports multiple applications sharing the same underlying
operating system. Docker is also fast, starting and stopping apps in a few seconds. PostgreSQL, Java, Apache, Elastic, and MongoDB all run on Docker.
Although we will not use it in this book, Kubernetes (k8s), a container management
tool, is often used to orchestrate Docker “instances.” As an open source platform,
Kubernetes was originally designed by Google and automates deployment,
management, and scaling of applications in the container.
Now that we have introduced containers and Docker in a CI/CD context, let's take a
look at the process of containerizing a simple application with Play With Docker, which
supports 4 hours of free usage:
2. After signing up and verifying your email, log on to Play with Docker at https://labs.play-with-docker.com/
NB use "ctrl + shift + V" to paste commands into the Docker command line
interface (CLI)
6. The container is the Docker "getting started" tutorial.13 Complete the tutorial to the end of the first part, "Our Application," which walks through how to build a "To-do list" app
NB when Building the App's Container Image (creating Dockerfile), use the
command below:
touch Dockerfile
13 The public URL is here: https://docs.docker.com/get-started
7. Exercise – complete the next part of the tutorial "Updating our App" to see how
to amend the message and behavior of the Dockerized application
8. Exercise – complete "Sharing our App" to see how to share Docker images,
using a Docker registry on Docker Hub
Selenium
Created in 2004 to automate testing actions on web apps, Selenium is an automated
framework for testing across different browsers and platforms and has multilanguage
support (Java, C#, Python). Like much of the DataOps ecosystem, it's free and
open source.
Not just a single tool but a software suite, Selenium is customized for specific
organizational quality assurance (QA) requirements and supports important DataOps
unit testing processes14 involving data pipelines and analytics. Once unit testing is
complete, continuous integration commences with QA testers able to create test cases
and test suites for logically grouped test cases including data-in-transit and data-at-rest.
Running Selenium tests in Jenkins allows users to (a) run tests every time the software changes and (b) deploy the software to a new environment when the tests pass. Jenkins can also schedule Selenium tests to run at specific times, saving execution history and test reports.
14 Intended to minimize the number of defects in the quality assurance testing phase.
TestNG
Selenium Test Scripts work with TestNG (Test Next Generation) – a testing framework which addresses a reporting gap in Selenium WebDriver, generating default HTML reports after execution. These reports (shown below) identify information about test cases (such as pass, skip, or fail) and the overall status of a project.
Issue Management
Issue Management and Issue Tracking allow project managers, users, or developers to
record and follow up the progress of issues on a Data project, capturing bugs, errors,
feature requests, and customer complaints. Issue tracking criteria typically include
• Level of importance
• A progress metric
Jira
Jira is one of the most popular issue (ticket) tracking and PM tools, although Trello,
GitHub Boards, and Monday are also widely used.
Developed by Atlassian, Jira has been around a while (2002) and is an agile project
management and issue/bug tracking tool. It comes with an easy-to-use dashboard and
“stress-free” PM including agile delivery features such as team/user story and sprint
oversight, scrum and kanban boards, roadmaps, and team performance reporting.
ServiceNow
Jira integrates with ServiceNow – a workflow automation solution to connect people, functions, and systems across an organization. ServiceNow's product USPs are ramping up customer service and digitally transforming the workforce.
The below image shows how ServiceNow automates a change request workflow
across a CI/CD pipeline.
Besides clear automation productivity benefits, ServiceNow also helps scale IT with "AIOps," interpreting telemetry data across organizations and using machine learning for tasks such as anomaly detection.
Monitoring and Alerts
Once an AI application is fully tested, and a process is implemented for managing issues,
the focus turns to application monitoring – essential in AI due to data changes (data
drift) quickly rendering model results invalid or suboptimal.
Nagios
Nagios monitors an entire IT infrastructure to ensure systems, applications, services, and business processes are functioning properly, with technical staff alerted in the event of failure. Even older than Jira, Nagios was first launched in 1999.
The Nagios suite of tools consists of an Enterprise Server and Network Monitoring
Software (Nagios XI), centralized log management, monitoring and analysis (Nagios
Log Server), Infrastructure monitoring (Nagios Fusion), and Netflow Analysis with
Bandwidth Utilization (Nagios Network Analyzer).
And so to our final lab in this chapter. Focused on CI/CD and testing, this extensive lab introduces readers to Jenkins on Azure and running Selenium test scripts. Maven integration is also included as an exercise:
1. Create a Virtual Machine by going to the link below. This will also create a storage resource, which may incur a small monthly cost15
https://portal.azure.com/#cloudshell/
15 See Chapter 1, Important note on cloud resource usage and cost management.
2. Toggle CLI settings to Bash and follow steps 3-9 in the link below
https://docs.microsoft.com/en-us/azure/developer/jenkins/configure-on-linux-vm
NB you may want to replace the Azure region (referenced in the az group create command) with a closer data centre
3. Configure Jenkins by following the steps in the link above under section 4.
Configure Jenkins
Note it may take some time at the end (10-15 mins) for the project home page
to appear
a. Download the Eclipse IDE from www.eclipse.org/. Eclipse is a popular IDE for Java development.
https://q-automations.com/2019/06/01/how-run-selenium-tests-in-jenkins-using-maven/
c. For step 8 above, add all dependencies WITHIN the project tag
d. For step 9, the TestNG add-in is required, which can be downloaded from https://marketplace.eclipse.org/content/testng-eclipse and dragged and dropped into the Eclipse workspace16
7. Exercise – try to push your Selenium script to GitHub. Push the source code to GitHub first and then connect from Jenkins to the repo. NB to integrate GitHub with Jenkins, follow the Configuring GitHub and Configuring Jenkins steps here: www.blazemeter.com/blog/how-to-integrate-your-github-repository-to-your-jenkins-project
16 Confirm and accept the license terms. If prompted with a warning on authenticity, select OK. Restart Eclipse.
8. Exercise – for Maven integration, try to follow the steps under "Integrating Your Test Into Jenkins" here: https://qautomation.blog/2019/06/01/how-run-selenium-tests-in-jenkins-using-maven/17
9. Exercise – to integrate all three tools: Jenkins, Maven, and Selenium, follow steps 1-12 under "Integrating Your Test Into Jenkins": https://q-automations.com/2019/06/01/how-run-selenium-tests-in-jenkins-using-maven/ and note the following:
d. For step 7 you need the HTML Publisher plug-in. Go to Manage Jenkins ➤ Manage Plugins, select the Available tab, type HTML Publisher, and tick the checkbox
The option to Publish HTML reports should now be available under Build Settings tab (and Post
Build Actions) of the Job created in step 2 of the q-automations link above.
Wrap-up
Having completed our last lab on tools and interfaces for continuous testing, we bring
an end to this chapter on DataOps. From Agile development, through code repos, CI/
CD, testing, and monitoring, the emphasis here has been on understanding stakeholder
relationships, the end-to-end process, ecosystem of tools, and integration landscape in
order to “frame” our AI solutions later in this book.
Our next chapter takes the learning from DataOps and applies best practice to one of the first (and perhaps most critical) phases of an AI project implementation: data ingestion.
17 NB in step 2, if Maven Project is not shown, go to Manage Jenkins ➤ Manage Plugins ➤ Available tab. In the filter box enter "Maven plugin" and you will get a search result "Unleash Maven Plugin"; enable the checkbox and click "Download now and install after restart." On the next screen, check the box to restart Jenkins, otherwise the Maven project won't show up. You can also check this link if the Maven project is still not showing: https://stackoverflow.com/questions/45205024/maven-project-option-is-not-showing-in-jenkins-under-new-item-section-latest
CHAPTER 3
Data Ingestion for AI
More formally, Data Ingestion is the process by which data is moved from source
to destination where it can be stored and further analyzed. Through the course of this
chapter more nuanced definitions will become apparent, but we start this first section
with a look at the world’s current global data needs.
• Costs
The AI Ladder
Going back to our earlier failed AI projects, one of the biggest misses is in properly
defining a data strategy at the outset. There is no point in undertaking an AI project
without an architected solution for Data Ingestion, or put another way there is no AI
(Artificial Intelligence) without IA (Information Architecture).
One of the better tools in the wider ecosystem for addressing Big Data gaps in a
project is IBM’s AI Ladder methodology (shown in Figure 3-1), which focuses on data
ingestion best practice as a means for businesses to accelerate their AI journey. The
methodology seeks to reinforce messaging around why you can't have AI without IA.
1 Source: Eland Cables
We will address Data Lakes further in the following sections, but before that we
address some key concepts in AI Data Ingestion.
• Data collector layer – transports data to the rest of the data pipeline
• Data query layer – collects the data from the storage layer, this time
for active analytic processing.
These layers are important as ultimately they define the primary concern of an end
user consuming the data.
Ultimately organizations today have to carve up their data needs and architectural
landscape to decide which data is handled as batch data and which as streaming data.
The simplified representation in Figure 3-2 shows how most organizations deal with
streaming data that needs to be stored by pushing it into an Enterprise Data Warehouse.
APIs
Besides remote databases and raw logs, one of the most common ways to connect
to OLTP and OLAP data today is via an API (Application Programming Interface).
Essentially an API is a software intermediary that allows two applications to talk to each
other. The example below shows the use of Amazon API Gateway, a key component of
Amazon Web Services and the means by which streaming data can be ingested from the
web via REST, HTTP, or WebSocket API and AWS Kinesis into an AI application.
Figure 3-3. Data Ingestion via AWS API Gateway REST API
File Types
We wrap up our introduction to data ingestion with a closer look at file formats. How
you store Big Data in modern data stores is critical, and any solution architect needs to
consider the underlying format of the data, compression and how to leverage distributed
computing/how to partition data in the fastest, most optimal way.
Traditional file formats such as .txt, .csv, and .json have been around for decades, and so we assume the reader is familiar with these. Newer formats such as Avro, Parquet, Apache ORC, tar.gz, and pickle have sprung up to mesh with rapid advances in the use of clusters (groups of remote computers) in big data processing.
Apache Parquet is an open source column-oriented data storage format of the
Apache Hadoop ecosystem. A binary format that is more compressed by virtue of storing
repeat data structures as columns, Parquet contains metadata about contents of the data
such as column names, compression/encoding, data types, and basic stats. Compressed
columnar files like Parquet, ORC, and Hadoop RCFile have lower storage requirements
and are ideal for optimal performance during query execution as they read quickly
(although write slowly). We will take a further look at parquet formats in a lab at the end
of this section.
Avro files are row-based binary files with a schema stored in a dictionary (specifically JSON) format. Avro files support strong schema evolution by managing added, missing, and changed fields. Because they are row-based, Avro or JSON is ideally suited to ETL (extract, transform, and load) staging layers.
A summary of these file formats compared with csv and json is shown in Figure 3-5.
(Figure 3-5 compares the formats on splittability, readability, support for complex data structures, and schema evolution. Source: luminousmen.com)
Two other important formats, Pickle and HDF52 (Hierarchical Data Format version 5), are often used in Python, generally to store trained models. Both are fast – pickle in particular – but come with the drawback of increased storage space. HDF5 has better support for Big (heterogeneous) Data, storing data in a hierarchical (directory-like/folder) structure, with repeat data compressed in a similar way to parquet files.
2 The file extensions .h5 and .hdf5 are synonymous.
Using Python in Jupyter Notebook, the goal of this exercise is to automate the ingestion of (live) semistructured weather data, then transform and extract temperature data3 for forecasting.
1. Clone the GitHub repo below
https://github.com/bw-cetech/apress-3.1.git
2. Run through the code, carrying out the steps below:
c. Carry out "recursive wrangling" to extract temperature forecasts for D to D+7 (daytime and nighttime)
We described earlier in this section how parquet files use a columnar format for data
compression. We take a look in this lab at how to work with these files.
https://github.com/bw-cetech/apress-3.1b.git
2. Download the US CDC (Centers for Disease Control and Prevention) dataset
from the link below:
https://catalog.data.gov/dataset/social-vulnerability-
index-2018-united-states-tract
3 Fundamental data is data that influences or drives a target variable, such as consumer demand on an underlying energy price.
a. Import libraries to work with Parquet (here the Apache Arrow pyarrow library)
b. Upload the CDC data downloaded above (this may take up to 30 minutes due to the
file size)
c. Perform basic EDA (data dimensions / no. of rows/columns, data types and first five
rows etc.)
4 Make sure to download the file first from Google Drive, then paste (without quotes) the local path to the downloaded file.
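A minimal sketch of steps a–c with pyarrow might look as follows (the filename is an assumption – use the local path to your downloaded CDC file):

import pyarrow.parquet as pq

# read the Parquet file into an Arrow table, then convert to pandas
table = pq.read_table("svi_2018_us_tract.parquet")  # assumed local filename
df = table.to_pandas()

# basic EDA: dimensions, data types, and first five rows
print(df.shape)
print(df.dtypes)
print(df.head())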
The above are obviously “database” style Data Stores but note that any file systems/
hard drives used in an organization are also examples of a Data Store containing
additional rich datasets for AI (such as images, pdfs, and columnar formats/flat files).
Lakehouses
Many companies see the benefits of maintaining both Data Lakes and Data Warehouses, each serving specific organizational needs. It is no surprise then that Lakehouses have been introduced as a new, open architecture that combines the best features of both.
Examples include the Databricks Lakehouse Platform and GCP BigLake, and Snowflake is at least partly considered a Data Lakehouse. Perhaps even more interesting is Delta Lake5 – an open source project that enables building a Lakehouse architecture on top of data lakes.
5 https://delta.io/
Given that the “enterprise goal” is to have a Data Store system that is as flexible
and performant as possible, the idea of multiple, separate data lakes, warehouses and
databases might seem the most productive solution, but the cost overhead and multiple
integration points create complexity and latency.
Lakehouses attempt to avoid these issues through the use of data structures and data
management features akin to that of a data warehouse while at the same time possessing
low-cost storage options similar to that of data lakes.
Besides ensuring the business is able to connect directly to raw data sources,
a DataOps/Agile approach should be used to scope out the underlying business/
organizational data requirements. One way of doing this is capturing requirements in a
“data dictionary” – a great tool for capturing scope and later delivering both low (MVP)
and high fidelity AI solutions.
A well-structured Data Dictionary helps with:
• Stakeholder alignment, buy-in, and project sign-off
• Terminology and defining data types
• Collecting and classifying data critical to the success of a Data project
• Categorizing/mapping attributes from structured and unstructured
data as well as online, offline, and mobile sources
• Quick checks for primary keys for joining data
• Isolating anomalies and data flow conflicts
• Documenting experiences with new data sources
• Ongoing data maintenance
Elasticity – refers to the ability to provide necessary resources dynamically for the
current workload
Naturally, this brings about a more sophisticated capability measure for those companies at higher (data) maturity levels – elastic scaling, or the ability to automatically add or remove compute or networking infrastructure based on changing application traffic patterns.
Most cloud services today are both elastic and scalable but cost models vary greatly
and tying the inherent CSP capability to scale up or scale out with forecasted (or even
actual) AI application usage is still mired in opacity.
The goal of this exercise is to create a DynamoDB NoSQL table on AWS, an S3 (File
Store) output bucket, and an AWS Data Pipeline to transfer data from DynamoDB to S3:
1. Follow the steps below to see how data can be transferred across data
stores on AWS:
a. Add DynamoDB NoSQL table
b. Create S3 bucket
c. Configure AWS Data Pipeline
d. Launch EMR cluster with multiple EC2 instances
e. Activate pipeline
f. Export S3 data
AWS Simple Storage Service (S3): on Amazon Web Services, S3 is the storage
service of choice to build a data lake on AWS. Secure, highly scalable and durable, able
to both ingest structured and unstructured data and catalog and index the data for
downstream analysis, S3 is used as an underlying Data Store in many analytics projects
and machine learning applications today. When block storage is required, Amazon Elastic Block Store (EBS) is used (comparable with Azure Disk Storage).
Google BigLake: BigLake has recently been launched on GCP as a new cross-
platform data storage engine.
Hadoop
Apache Hadoop is the de facto framework for the distributed processing of large datasets across clusters – essentially a collection of software for solving Big Data problems using a network of computers.
MapReduce, at the heart of Apache Hadoop, is its fundamental programming model for processing huge amounts of data using a Master/Slave architecture (Figure 3-9). As shown in Figure 3-10, it works by performing (1) a mapping task, which splits the data then maps it to a key/value pair format, and (2) a reduction task, where the mapped output is shuffled and combined into a smaller set of key/value pairs.
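The single-machine Python sketch below illustrates the same map–shuffle–reduce idea on a toy word count; Hadoop's value is in distributing these phases across cluster nodes:

from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# (1) map: split each document and emit (word, 1) key/value pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# shuffle: group emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# (2) reduce: combine each group into a smaller set of (word, total) pairs
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)  # e.g. {'the': 3, 'quick': 2, 'brown': 1, ...}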
Many tools which handle stream processing today come with extended analytics
capabilities – Stream Analytics. The Apache tooling suite (Storm, Sqoop, Spark, Flink, and Kafka), along with Talend, Algorithmia, and Upsolver, all offer streaming frameworks with these inbuilt features, as does AWS Kinesis. We will take a look at three of these in our
hands-on labs in this book – Apache Spark for Big Data Machine Learning, AWS Kinesis
for streaming stock price data and Apache Kafka, which makes use of message brokers
for handling streaming data into an AI application.
Whether it’s Kafka or Kinesis, there are a huge number of AI use cases today
which rely on real-time streaming technologies including algo trading, supply chain
optimization, fraud detection, and sports analytics.
The goal of this exercise is to leverage Microsoft Data Streamer as one of the most basic ways to set up a streaming PoC for an AI application, after scraping the latest (tech) stock prices with Python:
1. Get the add-in for Microsoft Data Streamer by opening a blank excel file then
selecting:
a. File ➤ Options
b. Add-ins
c. COM Add-ins
d. Go
e. In the COM Add-ins dialogue check the box for Microsoft Data Streamer for Excel and
click OK
https://github.com/bw-cetech/apress-3.3a
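One way to pull the latest tech stock prices in Python is sketched below, assuming the yfinance library (the lab notebook may source prices from Yahoo Finance differently); the exported CSV could then be played back through Data Streamer:

import yfinance as yf

# last trading day of one-minute bars for a few tech tickers
prices = yf.download(["AAPL", "MSFT", "NVDA"], period="1d", interval="1m")["Close"]

# write the most recent prices to a file that Excel/Data Streamer can consume
prices.tail(20).to_csv("latest_prices.csv")
print(prices.tail())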
The goal of this exercise is to build a Python script; create an access key, an IAM Policy, and an IAM User in AWS; then connect to Kinesis and stream stock price data.
1. Create a new (blank) python script (e.g. in Jupyter notebook, Google Colab) and
add the Python libraries for AWS (boto3) shown in the sample notebook here:
https://github.com/bw-cetech/apress-3.3
6. Add the remaining config steps shown in the above sample Python notebook to
view the streaming data.
c) modify your code to API into live stock price data on yahoo / quandl
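For reference, the core boto3 call that pushes a record onto a Kinesis stream looks roughly like this (the stream name, region, and payload below are placeholders – the sample notebook defines the real configuration):

import json
import boto3

# the client picks up the access key/secret created for the IAM user (e.g., via aws configure)
kinesis = boto3.client("kinesis", region_name="eu-west-1")

record = {"ticker": "AAPL", "price": 185.20}

kinesis.put_record(
    StreamName="stock-prices",             # placeholder stream name
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["ticker"],
)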
Storage Considerations
At the commencement of an AI project, storage should be on the list of priority considerations. Can we get away with a simple database or data mart, or is integration (as in most enterprise projects) required with a Data Warehouse or Data Lake?
• Data size – what is the storage size of each respective data source?
A rigorous take on the above can lead to a more performant data ingestion process
and accessible data lake similar to that used at Just Eat (using Apache Airflow) through
the Ingestion, Transformation, Learn, Egress, and Orchestration cycle.
Serverless Computing
One of the most popular serverless components on cloud is AWS Lambda. Lambda’s
event-driven, serverless compute architecture allows end users to focus more time
on rapidly building data and analytics pipelines by virtue of its independence from
infrastructure management and a pay-per-use pricing model.
While a source and sink data store are obviously not isolated from this process,
serverless computing can simplify the often complex alert-driven events configured for
data ingestion.
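As an illustration, a minimal Python Lambda handler for an ingestion trigger might look like the sketch below – here assuming the function is wired to S3 put events on a (hypothetical) ingestion bucket:

import json

def lambda_handler(event, context):
    # iterate over the S3 records that triggered this invocation
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # downstream pipeline steps (validation, cataloguing, etc.) would go here
        print(f"New object landed: s3://{bucket}/{key}")

    return {"statusCode": 200, "body": json.dumps("ingestion event processed")}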
End-of-day Processes
Many companies undertake end-of-day (EOD) processes. Particularly in the banking and retail sectors – for example, outbound billing and reconciling tills after stores have closed – these are crucial operational workflows.
Typically an EOD process involves (a) updating, verifying, and posting daily sales
information and (b) aggregation of raw transactions into meaningful business data. The
automation requirements here coupled with job scheduling and workflow automation
are ideally suited for batch processing and data pipeline orchestration. These automated
workflows can involve other (non-daily scheduled) batch processing, such as quarterly
or annual reporting.
• Collect data
However, the above process doesn’t fully capture requirements for ML/DL
applications, because features and predictions are time-sensitive. For example, Netflix's recommendation engines, Uber's arrival time estimation, LinkedIn's connection suggestions, and Airbnb's search engines all require training, or at least inference (predictions), in real time.
Data Ingestion for ML/DL needs to consider both online model analytics (real
time, operational decision making) and offline data discovery (learning and analysis on
historical aggregated data) as shown in Figure 3-11.
Look for a “frictionless” solution – for example, a serverless data lake architecture
enables agile and self-service data onboarding and analytics for all data consumer roles
across a company on shared infrastructure
Write a detailed solution – it’s helpful to consider data lake–centric analytics
architectures as a stack of six logical layers, each of which has multiple components:
• Ingestion
• Raw zone
• Cleaned
• Curated
• Processing layer
• Consumption layer
As there is no “one size fits all,” we show below four examples of robust data
ingestion architectures and pipelines built to better service, speed-up, and scale AI and
analytics solutions.
Example: XenonStack
XenonStack’s Big Data Ingestion Architecture consists of six key layers: Ingestion,
Collector, Processing, Storage, Query, and Visualization.
The goal of this exercise is to leverage Apache Spark's6 Big Data processing capability to run regional and product-level forecasts with fbprophet – a common use case for a global multinational, and one currently run by Starbucks.
1. Databricks sign up – sign up to Databricks Community Edition (DCE), login, and
spin up a cluster
5. Import Data to Databricks Filing System (DBFS) – enable admin rights in the
Databricks console
6. Run the baseline forecast with fbprophet in the notebook as described
7. Using a Data Pipeline, leverage Apache Spark to scale the forecast to every
store and item combination
6 Apache Spark is written in (statically typed) Scala for faster, compiled performance, but we will be using Python in this hands-on lab.
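The heart of step 7 is a grouped pandas UDF; a rough sketch is below, assuming Kaggle-style store/item/date/sales columns as used in the Databricks demo – paths and column names may differ in the actual notebook:

import pandas as pd
from pyspark.sql import SparkSession
from prophet import Prophet  # imported as fbprophet in older versions

spark = SparkSession.builder.getOrCreate()
sales = spark.read.csv("/dbfs/FileStore/train.csv", header=True, inferSchema=True)

def forecast_store_item(pdf: pd.DataFrame) -> pd.DataFrame:
    # fit one Prophet model per store/item group and forecast 90 days ahead
    m = Prophet()
    m.fit(pdf.rename(columns={"date": "ds", "sales": "y"}))
    out = m.predict(m.make_future_dataframe(periods=90))[["ds", "yhat"]]
    out["store"] = pdf["store"].iloc[0]
    out["item"] = pdf["item"].iloc[0]
    return out

forecasts = (
    sales.groupBy("store", "item")
         .applyInPandas(forecast_store_item,
                        schema="ds timestamp, yhat double, store int, item int")
)
forecasts.show(5)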
Wrap-up
Our last lab brings to an end this chapter where we have introduced Data Ingestion in
the context of AIaaS, looking at the challenges today in dealing with the deluge of data
available to companies, how to store this data, how to process the data using cloud
services, and how to orchestrate the various different data sources, ideally through a
seamless, automated data pipeline.
Our next chapter takes a whistle-stop tour through Machine Learning. As most of the
core concepts should already be known to the reader, our main interest is on machine
learning as a bridge to a more in-depth look at Deep Learning (in the subsequent
chapter). However, we will also lay out a best practice roadmap for successfully
implementing Machine Learning, covering critical data wrangling, training, testing,
benchmarking, implementing, and deployment phases.
CHAPTER 4
Machine Learning on Cloud
In terms of (Gartner) Hype Cycle transitions, Machine Learning has long since passed
its “peak of inflated expectations,” but it remains the core AI technique in use in most
businesses and organizations today.
Before embarking on an extended look at Deep Learning, we carry out in this
chapter a quick refresher on Machine Learning with reference to applications on cloud.
As mentioned in Chapter 1, it is expected that readers have some grounding in Machine Learning already, so we will assume a basic understanding of supervised and unsupervised machine learning.
As such this fourth chapter is an accelerated run through of the mechanics of
Machine Learning, covering critical processes from Data Import through EDA and Data
Wrangling (cleaning, encoding, normalizing, and scaling) as well as the model training
process. We will look at both unsupervised (clustering) techniques and supervised
classification and regression as well as time series approaches before interpreting results
and comparing performance across multiple algorithms.
We wrap up with the inference process and deploying a model to cloud. Following
a more expansive look at Neural Networks and Deep Learning in AI in the next chapter,
we will revisit machine learning again in Chapter 6, specifically usage of increasingly
important NoLo code UIs and AutoML tools for AI: Azure Machine Learning and IBM
Cloud Pak for Data.
ML Fundamentals
As mentioned in Chapter 1, Machine Learning is a technique enabling computers to
make inferences from complex data. High-level definitions for the main types are given
below, focused on the inherent difference between these machine learning approaches:
Supervised – training on data points where the desired “target” or labeled output
is known
Unsupervised – no labeled outputs available but machine learning is used to
identify patterns in data
Semisupervised – an initial unsupervised ML approach applied to a large amount of
unlabeled data followed by supervised machine learning on labeled data
Reinforcement learning – training a machine learning model by maximizing a
reward/score
Classification and Regression
What distinguishes Supervised Machine Learning from Unsupervised Machine Learning
is the prevalence of “labeled” or “ground truth” data, that is, a specific target field or
variable that we wish to train a model to predict.
There are two main types of Supervised Machine Learning:1 classification and
regression. Whereas the label or target variable in classification is discrete (usually
binary, but sometimes multiclass), the label in a Supervised Regression problem is
continuous. The objective of supervised classification is to find a decision boundary
which splits the (training/test) dataset into separate “classes” while for supervised
regression the aim is to find a “best-fit” line through the data – a straight line for linear
regression or a curve for nonlinear regression as shown in Figure 4-1.
1 Or three if Time Series Forecasting is considered as distinct from Regression. We will look at this separately below.
2 AutoAI in IBM Cloud Pak for Data, for instance, automates time series forecasting.
3 LinkedIn's open source Python forecasting library Greykite was only released in May 2021 but may have speed and accuracy advantages: https://engineering.linkedin.com/blog/2021/greykite--a-flexible--intuitive--and-fast-forecasting-library
variations. Rather than relying on any one approach however, and in line with common
practice in machine learning, multiple algorithmic techniques (such as ARIMA,
fbprophet, and RNNs) are generally employed and compared before the final model
selection for the specific forecasting use case.
Using Python in Jupyter Notebook, the goal of this exercise is to introduce fbprophet as
an accelerator for forecasting. In this specific case we forecast UK quarterly housing
demand, but adaptability of the code sample supports easy swapping out of the data for
other sectoral/industry forecasts.
1. Clone the GitHub repo https://github.com/bw-cetech/
apress-4.2.git
2. Running the notebook in Colab,4 walk through the code sample
c. Wrangle the data into the required format (showing quarterly time series for UK Public
Housing Output)
5. Exercise (stretch): automate the data import to read in the latest data (i.e., pick
up the latest quarter)
4 No install of fbprophet is needed in Colab.
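In outline, the fbprophet part of the notebook boils down to something like the sketch below (the placeholder series here just shows the expected two-column ds/y format – the real quarterly values come from the cloned repo):

import pandas as pd
from prophet import Prophet  # "fbprophet" in older installs

# placeholder quarterly series in Prophet's required ds/y format
df = pd.DataFrame({
    "ds": pd.date_range("2015-03-31", periods=28, freq="Q"),
    "y": range(28),
})

m = Prophet()
m.fit(df)

# forecast the next four quarters
future = m.make_future_dataframe(periods=4, freq="Q")
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(4))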
Clustering
With unsupervised machine learning, we have no labeled data or “ground truth,” so the
goal of predictive modeling in this context is instead to identify an unseen pattern on the
underlying data. These patterns are identified as “clusters” often over multiple dimensions,
where intracluster distances (the distance between datapoints within a cluster) are
minimized and intercluster distances (distance between separate clusters) are maximized.
Because of its inherent ability to find hidden patterns in BIG datasets running into
hundreds, thousands, or even more parameters/features, unsupervised clustering is
ideally suited to mine customer data buried in a CRM platform (or multiple systems)
and for anomaly detection. Given a specific number of groupings (clusters), an
unsupervised machine learning approach can throw up customer segments which
have some degree of commonality (high spend, low to mid income, located in a specific
region, etc.). Equally after training an unsupervised machine learning model on the data,
anomalies in the data stand out as isolated clusters, as shown in Figure 4-2.
Dimensionality Reduction
Dimensionality Reduction is also considered an unsupervised technique. Here the idea
is that a machine learning algorithm is used to simplify (reduce) the data, often from
thousands of underlying features into tens of features.
The process of reducing the number of dimensions in your data can of course be
carried out in both a manual way (by dropping unwanted features) or automated via an
algorithm. The main unsupervised technique used is Principal Component Analysis,
where the data is “collapsed” into a form that statistically resembles the original data.
While this approach can greatly reduce datasets and improve runtime/performance,
it is not strictly a Machine Learning modeling technique in the normal sense as the
outcome in this case is another dataset, albeit compressed, rather than a trained model.
Using Python in Jupyter Notebook, the goal of this exercise is to apply unsupervised
machine learning technique (using the K-Means algorithm5) to a LIDAR dataset – here
an unlabeled “point cloud”6 representing an airport terminal:
https://github.com/bw-cetech/apress-4.3.git
5 Instead of K-Means, K-Modes is useful for feature-rich Telco datasets which tend to have an abundance of categorical variables: dependents, Internet service type, security protocol, streaming, payment type, etc. See, for example, https://medium.com/geekculture/the-k-modes-as-clustering-algorithm-for-categorical-data-type-bcde8f95efd7
6 A set of data points in space.
4. Exercise (stretch) – try to apply the same technique to the car dataset provided
in the GitHub link above
7 So therefore "semi-supervised" in the sense of partial labeling, rather than an unsupervised–supervised two-step process.
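The clustering step itself is only a few lines of scikit-learn; a sketch is below, with the point-cloud filename and cluster count as assumptions (the cloned notebook defines the actual data load):

import numpy as np
from sklearn.cluster import KMeans

# point cloud as an (n_points, 3) array of x, y, z coordinates - assumed filename
points = np.load("terminal_point_cloud.npy")

kmeans = KMeans(n_clusters=8, random_state=42).fit(points)

labels = kmeans.labels_            # cluster assignment for every point
centres = kmeans.cluster_centers_  # centroid of each cluster
print(np.bincount(labels), centres.shape)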
Design Thinking and Data Mining and Data Import have been covered in Chapters 2
and 3. We will cover each of the subprocesses from EDA to Performance Benchmarking
in the sections that follow with Deployment covered in Chapter 7 on Application
Development.
8 Python .head and .tail methods
9 Correlation values are measured between -1 and 1, with the absolute value determining the intensity of the relationship – negative values indicating an inverse relationship. Spearman rank correlation is just Pearson's correlation on the ranked values of the data.
Data Wrangling
While EDA involves passively observing the data,10 Data Wrangling is focused on actively changing the data in some way. There are a number of subprocesses here, described below as a desired sequence of steps, although many are iterative and, depending on the dataset, there can be many loopbacks to an earlier subprocess.
Data cleaning: primarily dealing with incomplete data, that is, missing values, but
formatting or dealing with invalid data such as negative ages or salaries, inaccurate
data (incorrect application of a formula at source, e.g., profit), removing duplicates, and
treating outliers are additional tasks that may be required. Dealing with missing values
involves taking a decision as to whether the data can be (a) dropped outright (e.g., drop
rows when < 1% values missing in a column or drop column when sparsity > 97%), (b)
replaced with a proxy value: mean for normal distributions, median for skewed or mode
for categorical variables, or (c) interpolated when more sophistication/performance
tuning is required.
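In pandas, the decision rules above translate into something like the following sketch (the dataset and column names are purely illustrative):

import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical dataset

# (a) drop columns that are almost entirely empty (sparsity > 97%)
sparsity = df.isna().mean()
df = df.drop(columns=sparsity[sparsity > 0.97].index)

# (a) drop rows where a near-complete column (< 1% missing) has gaps
df = df.dropna(subset=["age"])

# (b) impute proxy values: median for a skewed numeric, mode for a categorical
df["income"] = df["income"].fillna(df["income"].median())
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])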
10 Although "passive," good Data Science typically includes documenting findings along the way.
Data cleaning also includes dealing with outliers. Outliers are problematic in
machine learning as they can lead to overfitting, but equally outliers can be important to
building sophisticated models. Ultimately a decision is taken about removing outliers,
usually in conjunction with normalization and scaling techniques covered below:11
Encoding – involves transforming text (string) data to a numerical format. Most machine learning algorithms require all data to be numeric12 – there are two approaches: ordinal encoding when the field values have an underlying order (e.g., movie ratings – good 2, average 1, bad 0) and nominal encoding when the field values have no intrinsic order (e.g., gender). Nominal encoding is often referred to as one hot encoding due to the transformed appearance of the data resembling binary machine language (in the case of gender we would typically create two columns, one for females (taking 0 or 1 values) and one for males).13
Normalization – normalization is the process of transforming a skewed field/
variable distribution into a normal distribution. Log transforms are normally used for
right-skewed distributions while power transforms are used for left-skewed.
Standardization (Scaling) – refers to the process where data is brought under the
same scale, with treatment different for data that is normally distributed (StandardScaler
in Python) or skewed (MinMaxScaler or RobustScaler).
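The sketch below shows these wrangling steps on a toy DataFrame – ordinal encoding, one hot (nominal) encoding, and scaling of a skewed column (column names and values are illustrative only):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "rating": ["bad", "average", "good", "good"],
    "gender": ["F", "M", "F", "M"],
    "salary": [28_000, 35_000, 52_000, 300_000],
})

# ordinal encoding - the values have an underlying order
df["rating"] = df["rating"].map({"bad": 0, "average": 1, "good": 2})

# nominal (one hot) encoding - no intrinsic order; one column, the other is its complement
df = pd.get_dummies(df, columns=["gender"], drop_first=True)

# scaling - MinMaxScaler for the skewed salary column
df["salary"] = MinMaxScaler().fit_transform(df[["salary"]])
print(df)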
Feature Engineering
Data Wrangling is considered by some to include Feature Engineering – the process
of selecting which features to use in a predictive model (i.e., in the machine learning
training process). This process is highly iterative, occurring before and after each
iteration of the machine learning training process.
11 A good "line in the sand" for an outlier is any value outside [Q1 - 1.5 x IQR, Q3 + 1.5 x IQR]; however, if this is too broad, a capping method (outside the [5th, 95th] percentile range) may be used instead, or, if too narrow, a threshold of three or more standard deviations away from the mean. Clustering as a preprocess (see the earlier section on Unsupervised Machine Learning) is also an option.
12 CART (classification and regression tree) methods are an exception – decision tree and random forest.
13 Normally implemented with just one column/field, for example, for Females, with Males taken as the complement of the Females column.
Sampling
In the next two sections, we will take a look at the algorithmic process and performance
benchmarking. As it is highly iterative, data wrangling also encompasses a number
of tasks not typically performed until several model runs have been undertaken.
Normalization and scaling are two such techniques mentioned above, which are not
always necessary to get a model “up and running” but rather to fine tune once there is a
stable model process. Sampling – specifically over- or undersampling applied to imbalanced datasets – is another.
Most datasets have inherent imbalance14 – take fraud detection where the number of
fraudulent transactions is typically minuscule in comparison to the number of nonfraud
transactions. The same applies in cybersecurity – a DDoS attack can be like finding a
needle in a haystack of benign network activity.
When we have imbalanced data, random undersampling refers to the process of only sampling enough of the majority class so that we have approximately the same volume of both majority and minority class samples. With random oversampling, we instead synthesize data (duplicating records from the minority class) to achieve the same effect.
Under or oversampling can often make the difference between a good and badly
performing model although undersampling runs the risk of losing information in the
data that may be valuable to the model while oversampling can lead to overfitting.
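With the imbalanced-learn library, both resampling approaches are one-liners; the sketch below uses a synthetic 99:1 dataset standing in for, say, fraud transactions:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# synthetic dataset with a 99:1 class imbalance
X, y = make_classification(n_samples=10_000, weights=[0.99], flip_y=0, random_state=42)
print("original:", Counter(y))

X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("undersampled:", Counter(y_under))
print("oversampled:", Counter(y_over))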
14 Or worse, feature "bias," where multiple features have statistical bias in the dataset or modelling process.
Using Python in Google Colab, the goal of this exercise is to apply EDA and Data
Wrangling techniques to a common banking challenge – detecting and predicting
customer credit risk:
2. Referencing the TOC on the LHS in Colab, walk through the processes below:
a. Library import
b. Config
d. EDA
e. Data Wrangling
f. Feature Selection
3. Proceed to the modeling part of the notebook, carrying out a baseline run then
a scenario where specific features are scaled
4. Exercise – try to improve model performance by carrying out more runs and
changing the data/enhancing the feature engineering process
5. Exercise – do the same but this time carry out a parallel run across multiple
algorithms
Algorithmic Modelling
After splitting our data into training and test sets (or training, validation, and test sets), we are ready to "fit" a model. The algorithmic technique fits a model to the training data.15 Once the model has been trained, using the .predict() function in sklearn, we put the test data through the trained model in order to benchmark model predictions/forecasts (see the section below).
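A minimal sklearn sketch of the split–fit–predict sequence (on a toy built-in dataset) looks like this:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)          # training step
y_pred = model.predict(X_test)       # inference on the held-out test set
print("accuracy:", accuracy_score(y_test, y_pred))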
This book is a practical look at building AI applications, not a theoretical (and often
irrelevant16) discussion on the mechanics of machine or deep learning algorithms.
That said, there are plenty of labs in this chapter and others showing how underlying
algorithms, both machine and deep learning, are applied and plenty of relevant best
practice on how to fine-tune and get the most out of the modeling process in later
chapters.
For reference going forward in this book, we summarize below the main machine
learning algorithms used and their strengths and weaknesses. This is followed by a
hands-on lab in the next section where we train multiple machine learning (and one
deep learning) algorithm on three different datasets and compare performance.
15. k times in the case of k-fold cross-validation, with each iteration also tested against the validation set.
16. See, e.g., the Chapter 1 discussion on "technical debt."
17. CatBoost is an algorithm for gradient boosting on decision trees (tree ensembles). Its simple installation, superior model exploration, and performance tracking (globally and locally supported SHAP value analysis) mean it has become one of the more performant ML algorithms. See also e.g. https://towardsdatascience.com/why-you-should-learn-catboost-now-390fb3895f76
For unsupervised machine learning, K-Means is the main algorithm, but there are several alternatives, including hierarchical clustering. Neural networks (specifically autoencoders) can also be used for unsupervised machine/deep learning,18 and Principal Component Analysis (discussed earlier) is essentially also an unsupervised machine learning algorithm.
18. Autoencoders are discussed in Chapter 5.
Performance Benchmarking
Building an AI app requires constant training and testing – knowing how to benchmark performance and which measures to use is essential. It's always a good idea to compare model output with a baseline.
For supervised classification models, this can be a model which randomly assigns
classes to each datapoint in the test set, or a model which uses all available features
(unaltered except for data cleansing associated with runtime errors, e.g., missing values
removed, text data encoded, etc.) – essentially a model with no “feature engineering.” For
supervised regression models, a persistence forecast is often used as a baseline, where
the forecast just “persists” (i.e., repeats) a pattern of data from the test set, for example,
day minus 1, day minus 7, or year minus 1, etc. As with classification models, it’s a good
idea to run a baseline with all features to compare later runs where more sophisticated
feature engineering has been carried out.
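By way of illustration, the two kinds of baseline might be set up as follows – a sketch that assumes X_train, X_test, y_train, and y_test already exist from an earlier split, and uses an invented toy series for the persistence case:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error

# (a) classification baseline - assign classes at random, ignoring the features
baseline_clf = DummyClassifier(strategy="uniform", random_state=42)
baseline_clf.fit(X_train, y_train)                     # split assumed to exist already
print("baseline accuracy:", accuracy_score(y_test, baseline_clf.predict(X_test)))

# (b) regression baseline - persistence forecast ("day minus 1") on a toy daily series
series = np.array([100.0, 102.0, 101.0, 105.0, 107.0, 110.0])
persistence = series[:-1]                              # forecast for day t is the value at day t-1
print("persistence MAE:", mean_absolute_error(series[1:], persistence))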
The above baselines are useful for getting supervised machine learning "up and running," but after 5-10 model iterations more sophistication is required and, as shown in the table below, a number of measures are used to determine how well a model is performing.
Figure 4-7. Confusion Matrix with recall = 100 / (100 + 5) = 0.95 and precision =
100 / (100 + 10) = 0.91
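The same figures can be checked in a couple of lines (the counts TP = 100, FN = 5, and FP = 10 are taken from the confusion matrix in Figure 4-7):

tp, fn, fp = 100, 5, 10
recall = tp / (tp + fn)      # 100 / 105 ≈ 0.95
precision = tp / (tp + fp)   # 100 / 110 ≈ 0.91
print(f"recall={recall:.2f}, precision={precision:.2f}")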
Continual Improvement
Continual improvement is of course part of performance benchmarking – rather than relying on static metrics and blind trial-and-error approaches to improve results, best practice should be adopted to ensure that over time we can meet or exceed our tolerances on model performance.
Revisiting the data should almost always be the first port of call, critiquing the approach used for key data cleansing tasks such as imputing proxy values for missing values: has too much valuable information from other features been lost when we removed rows with missing data, is it appropriate to simply replace missing values with the mean, etc.? Some of these same concerns around loss of information and/or model bias extend to the sampling approaches already described for imbalanced datasets.
The question of data extends further to the data acquisition process – while more
data doesn’t always mean better results, a larger sample is statistically more likely to
yield better performance. More data also doesn’t necessarily mean more records or rows
of data – new features, often derived from existing ones (such as “no. of days since last
purchase,” “freq of purchase,” etc.) can have a huge impact.
Whatever the granular approach taken, only after an exhaustive reexamining of
the data should we look to further algorithmic testing (parallel or otherwise) and
hyperparameter tuning.
And while the above focus is on continually improving model training and test
results, model resiliency is an additional consideration. A great model today isn’t always
fit for purpose in one month’s time – in our last chapter we will look at data drift and
automated retraining for problem mitigation.
Scikit-learn (or sklearn) is the “go-to” for simple predictive analysis and modeling – in
this lab we compare machine learning algorithms in scikit-learn against three different
datasets.
Not just “entry-level” ML, sklearn is widely used in production by well-known brands
including J.P.Morgan, Spotify, and booking.com. Built on NumPy, SciPy, and matplotlib,
and besides the main algorithmic processes in machine learning, sklearn comes
with some built-in datasets, preprocessing, feature engineering, and model selection
functions/methods:
b. Scale
c. Carry out a parallel run and fit the data to the ten different algorithms
3. Compare visually how the multiple algorithms fit to the training data
a. Extract the model scores (raw values in a list displayed on the screen)
b. Isolate the multilayer perceptron (MLP) for the "make_circles" dataset only. Plot this
with the input data and make it bigger on the screen
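A stripped-down version of the parallel run looks like the sketch below – three algorithms rather than the notebook's ten, and only the make_circles dataset, purely for illustration:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_circles(noise=0.2, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "mlp": MLPClassifier(max_iter=1000, random_state=42),
}
# fit each algorithm on the same training data and collect the raw test scores
scores = {name: clf.fit(X_train, y_train).score(X_test, y_test) for name, clf in models.items()}
print(scores)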
1. In Azure Machine Learning Studio, follow the steps below to train and
evaluate a machine learning model on predicting whether income is above or
below $50k:
https://gallery.azure.ai/Experiment/3fe213e3ae6244c5ac84a73e1b451dc4
2. Now follow the steps below to set up a web service and perform a simple API
inference test:
c. Run
f. Show that with some default values (age = 12, income = 45) the customer is predicted to have LOW INCOME (shown in small text at the bottom of the screen)
g. Exercise – after deploying the Web Service and using the config settings shown under "Consume," add the Azure ML add-in to MS Excel and, entering sample data, call the API from Excel
Reinforcement Learning
Reinforcement learning involves real-time machine (or deep) learning with an agent/environment mechanism which either penalizes or rewards iterations of a model based on real-time feedback from the surrounding environment (i.e., how well the model is performing).
There are three primary components to the algorithm, with the aim being to discover
through trial and error which actions maximize the expected reward over a given
amount of time:
While the scope of this book is mainly focused on mainstream business and
organizational applications, advances in reinforcement learning are in general where
there is considerable hype in the media – essentially this is the underlying technique
that drives “industrial-scale” applications such as Google’s Search Engine, autonomous
vehicles, robotics, and gaming.
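To make the reward mechanism concrete, the sketch below shows a single Q-learning update on a made-up three-state, two-action problem – this is illustrative only and not one of the book's labs:

import numpy as np

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))      # expected-reward estimates ("Q-values")
alpha, gamma = 0.1, 0.9                  # learning rate and discount factor

# one interaction with the environment: take action 1 in state 0, receive reward 1.0,
# land in state 2 - then nudge Q towards the reward plus the best discounted future estimate
state, action, reward, next_state = 0, 1, 1.0, 2
Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
print(Q)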
Wrap-up
Whether Reinforcement Learning is Machine or Deep Learning may be a moot point, but yawning skill gaps mean the AI ambitions of most companies today lie in establishing in-house capabilities in more prosaic Machine Learning solutions.
Deep Learning represents a step up in organizational AI maturity and, having reached the end of this chapter's accelerated run-through of Machine Learning, we move on to how cloud data-fed neural networks have become mainstream and their specific application in Deep Learning solutions in our next chapter.
CHAPTER 5
Neural Networks and Deep Learning
Most of the “hyped-up” AI applications discussed in the media today are Deep
Learning solutions: from Driverless Cars (Autonomous Vehicles) to Search Engines,
Computer Vision/Image Recognition, Chatbots, and Portfolio Optimization and
Forecasting.
These events have been widely publicized by the media and when taken together
with the increasing scalability of industry deep learning solutions, a snowball effect has
essentially led to a broader market commissioning of Deep Learning implementation
projects.
This hype among businesses is part ignorance, part science fiction. AI in the world of work means "Augmented" Intelligence rather than Artificial Intelligence. Artificial Intelligence is too often confused with the "Artificial General Intelligence" of the movies and all its scary apocalyptic visions.1
Real “Augmented” intelligence is today generally perceived as delivering on its
potential and the tangible benefits for businesses and organizations are a reality, driven
by an existential need to accelerate digitalization during Covid such as:
• Provision of chatbot support during the pandemic
In 2022 and beyond, the focus has turned to the “democratization” of AI and Deep
Learning post-Covid, shifting projects from expert/niche knowledge and a mountain of
“technical debt” to achieving buy-in (and understanding) across a wider ecosystem of key
stakeholders (all employees, customers). Similarly, the “industrialization” of AI has moved
front and center, necessitating a push for reusability, scalability, and safety of AI solutions.2
High-Level Architectures
As described above, Deep Learning extends Machine Learning to the use of neural networks to solve hard and very large Big Data business problems. The computational load is often handled with parallel processing over a cluster or on Graphical Processing Units (GPUs) to accelerate the internal calculations.
As we will see later on in this chapter, the heavy computation takes place inside an artificial neural network or ANN. These ANNs underpin Deep Learning and are inspired by the structure and function of the brain. A simple diagrammatic representation is shown as follows:
• Two hidden layers of four nodes (or neurons)
• Four inputs
• One output
1. No one really wants to see that happening, although plenty is being done in research labs around the world.
2. These two trends are central to Chapter 7 on AI Application Development.
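The same architecture can be written down in a few lines of Keras – a sketch for illustration only, where the layer sizes follow the diagram and the activation choices are assumptions:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(4,)),               # four inputs
    layers.Dense(4, activation="relu"),     # first hidden layer of four neurons
    layers.Dense(4, activation="relu"),     # second hidden layer of four neurons
    layers.Dense(1, activation="sigmoid"),  # one output
])
model.summary()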
Most ANNs and therefore Deep Learning models can trace their architecture back to
this fundamental structure. To get a flavor of how these ANNs actually operate and how
we can use them to perform predictions we will now proceed to our first hands-on lab in
this chapter.
The goal of this exercise is to get familiar with an Artificial Neural Network, how data is
imported, and how it can be trained to predict an output:
1. Go to https://playground.tensorflow.org
2. Notice the data configuration on the LHS of the screen – there are four sample
datasets to choose from, each of which is defined by a set of x1 and x2
coordinates
3. The neural network architecture is shown in the middle from left to right,
that is, Features (transformations of the x1 and x2 coordinates in the dataset)
through the network’s Hidden Layers and Output
5. Notice that the Problem type is set to “classification” – we are trying to predict
whether data belongs to the blue or orange group
6. Taking the first (concentric circles) dataset try to play around with (a) number of
features, (b) number of hidden layers, and (c) number of neurons in each layer
and observe how quickly the model converges (loss close to zero). Observe that
the more hidden layers and neurons we have, the quicker the convergence
7. Exercise: try to train the model on the most complex / non-linear dataset (the
spiral pattern) and see how long it takes (how many epochs) to achieve a
loss < 0.05
8. Stretch Exercise: find a suitable configuration which achieves a loss < 0.01 in
under 500 epochs
Stochastic Processes
Before we proceed further and examine the different types of neural networks leveraged for actual industry use cases today, we first need to take a look at some fundamental probability theory to understand the intricate processes happening in our ANNs.
Random Walks
Another important concept for understanding neural networks is a random walk.
A random walk is a stochastic or random process describing a path that consists
of a succession of random steps on a variable. A two-dimensional random walk can
be visualized if we were to consider a person tossing a coin twice every time they are
standing at a crossroads in order to determine where they can go ahead, turn back go left
or right:
• heads, heads -> forward
We will create this 2D random walk shortly in the hands-on lab for this section.
The important aspect here is that each of the outcomes from each step of a random walk has equal probability. Random walks are important for, for example, image segmentation, where they are used to determine the labels (i.e., "object" or "background") associated with each pixel. They are also used to sample massive online graphs such as online social networks.
The stopping criterion is (a) if they reach a fortune of N dollars or (b) if their purse is ever empty.
Systems modeled as a Markov process give rise to hidden Markov models (HMMs) –
statistical models similar to the hidden layers/states in neural networks/deep learning.
To see how they work and how they can be applied to real-world phenomena, take a
look at a Markov chain in action: https://setosa.io/ev/markov-chains/
https://github.com/bw-cetech/apress-5.2.git
b. Define number of steps in your random walk and lists to capture the outcomes
c. Try the exercise to create a 2D random walk using a For loop and four If statements
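One possible solution sketch for step (c) is shown below; the mapping of coin-toss outcomes to directions is an assumption that simply follows the "heads, heads -> forward" convention above:

import random

steps = 1000                      # number of steps in the walk
x, y = [0], [0]                   # lists capturing the walk's coordinates

for _ in range(steps):
    toss = (random.randint(0, 1), random.randint(0, 1))   # two coin tosses (1 = heads)
    if toss == (1, 1):            # heads, heads -> forward
        x.append(x[-1]); y.append(y[-1] + 1)
    elif toss == (1, 0):          # heads, tails -> back
        x.append(x[-1]); y.append(y[-1] - 1)
    elif toss == (0, 1):          # tails, heads -> left
        x.append(x[-1] - 1); y.append(y[-1])
    else:                         # tails, tails -> right
        x.append(x[-1] + 1); y.append(y[-1])

print(list(zip(x, y))[:5])        # first few positions of the walk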
and thereby extract hierarchical patterns from input data (such as identifying a person
or building in a composite image).
Across the entire network, CNNs rely on several convolutional layers, repeatedly
applying the same filter to the input data, resulting in a feature map of activations
indicating the locations and strength of a detected feature in an image.
3. Gated Recurrent Units (GRUs) also have this feature but are less complex (two gates: reset and update) as opposed to three for an LSTM (input, output, forget). GRUs generally use less memory and can be faster than LSTMs but are less accurate with longer sequences.
In general, RBMs are less frequently used today as the time needed to calculate
probabilities is significantly slower than the backpropagation algorithm. Generative
Adversarial Networks (GANs) or Variational Autoencoders are preferred, both of which
are discussed below.
Autoencoders
An autoencoder neural network is a type of feedforward neural network. Specifically, it's
an unsupervised learning algorithm that applies backpropagation and sets the target
values to be equal to the inputs. Autoencoders are one component of OpenAI's exciting
DALL-E Generative AI model which we will take a look at in a hands-on lab in Chapter 8.
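A minimal sketch of the idea in Keras is shown below – a single-layer encoder and decoder on flattened 28×28 images, purely for illustration (the MNIST Fashion lab later in this chapter uses its own architecture):

from tensorflow.keras import layers, models

encoding_dim = 32                                   # size of the compressed "code"
inputs = layers.Input(shape=(784,))                 # e.g., a flattened 28x28 image
encoded = layers.Dense(encoding_dim, activation="relu")(inputs)   # encoder
decoded = layers.Dense(784, activation="sigmoid")(encoded)        # decoder

autoencoder = models.Model(inputs, decoded)
# targets are set equal to the inputs, so training is unsupervised
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.summary()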
The MNIST (Modified National Institute of Standards and Technology) handwritten digits
dataset is a standard dataset used in computer vision and deep learning.
Consisting of 60,000 small square 28×28 pixel grayscale images of handwritten single
digits between 0 and 9, it's great as a basis for learning and practicing how to develop,
evaluate, and use convolutional deep learning neural networks for image classification.
https://github.com/bw-cetech/apress-5.3.git
2. Import the notebook to Google Colab. Colab has better support for TensorFlow
(no install!) so it’s easier to run the notebook
3. Take a look at the Table of Contents on the RHS panel, showing each part of the
modeling process from Data Import through EDA, Data Wrangling, Building and
Running the Neural Network, and Performance Benchmarking
4. Try to complete the exercises first before looking at the solutions as you
progress through the notebook
Our follow-up lab in this section explores Unsupervised Deep Learning using the MNIST Fashion dataset. The idea is to leverage the encoder, code, and decoder architecture of an autoencoder to effectively "reconstruct" specific images for a given class (here articles of clothing, but the code is easily adapted to other image sets):
https://github.com/bw-cetech/apress-5.3b.git
a. Perform EDA
e. Reducing image noise and reconstruct images using the trained autoencoder
4. Compare runtime using Colab GPU runtime vs. a normal (CPU) runtime
TensorFlow
The favorite tool of many industry professionals and researchers, TensorFlow is an
end-to-end deep learning framework developed by Google and released as open source
in 2015.
Though well-documented with training support, scalable production and deployment options, multiple abstraction levels, and support for different platforms (such as Ubuntu, Android, and iOS), TensorFlow is a complex, low-level framework, and coding in it on its own is not massively intuitive for Data Scientists.
A focus in the last 5-10 years on TensorFlow development means that accelerated
improvements in research, flexibility, and speed have led to uptake by many of the
major brands and corporations recognized today, including Airbnb, Coca Cola, GE, and
Twitter.
No longer just a low-level library, TensorFlow gained improved integration with the Python runtime in its last major (TensorFlow 2.0) release in 2019. Today's TensorFlow ecosystem encompasses Python, JavaScript (TensorFlow.js), and mobile (TensorFlow Lite).
Keras
Released under the MIT license and open-sourced in 2015, Keras addresses the challenges around ease of use when coding deep learning model training and inference in less user-friendly "back-end" languages, such as the syntax used in TensorFlow, Theano, and CNTK,4 the latter two of which are now deprecated.
A high-level, modular and extensible API written in Python focused on fast
experimentation, Keras is the high-level API of TensorFlow 2 and the most used Deep
Learning framework among top-5 winning teams on Kaggle.
Its "drag-and-drop" style of programming maintains back-end linkage to TensorFlow and accelerates the design, fit, evaluation, and use of deep learning models to make predictions in a few lines of code.
The actual implementation of the Keras API on TensorFlow is codified as tf.keras, and setup in Python is simple:
import numpy as np
import tensorflow as tf
from tensorflow import keras
The core data structures of Keras are layers and models with the simplest type of
model a sequential model, a linear stack of layers. We will discuss these layers and the
implementation of a deep learning model in Keras more in the following section of this
chapter.
4. aka Microsoft Cognitive Toolkit.
PyTorch
PyTorch is the last of the main Deep Learning “tools” we will cover in this book and is
predominantly used for Deep Learning with Natural Language Processing.5
Developed by Facebook’s AI research group and open-sourced on GitHub in 2017,
PyTorch claims to be simple and easy to use and flexible, as well as possessing efficient
memory usage with dynamic computational graphs. PyTorch is faster and has excellent debugging support; however, claims that it is easier to use should be taken with a pinch of salt – it is a low-level language. PyTorch integration with a simpler DL framework like Keras does not yet exist, although PyTorch Lightning, a lightweight wrapper providing a high-level interface to PyTorch, may come into its own as an alternative in the coming years.
In this author’s opinion, implementation of PyTorch is easier using Jupyter Notebook
on Windows. We will install PyTorch using the one-off shell command (directly in a cell
in a Jupyter notebook):
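The exact command depends on your platform (the selector at pytorch.org generates it); a typical CPU-only install looks something like the following assumption:

# one-off shell command in a Jupyter cell (assumed CPU-only install - check
# pytorch.org for the command matching your OS and CUDA setup)
!pip install torch torchvision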
After commenting out the above install command, the installation can then be verified by running the sample PyTorch code:
import torch
x = torch.rand(5, 3)
print(x)
5. We will cover more on NLP in Chapter 10.
Apache Spark
While MXNet is more of a niche Deep Learning tool, Apache Spark is a popular unified
analytics engine for big data processing, with built-in modules for streaming, SQL,
machine and deep learning, and graph processing.
Because of its inherent “scalability” around Big Data processing and its built-in
interface for programming entire clusters with implicit data parallelism (so a great fit for
distributed training on large datasets) and fault tolerance, Data Scientists have jumped
on Apache Spark as an accelerator for productionizing Deep Learning.
Many APIs exist to simplify the implementation and usage of Apache Spark in Python, including PySpark (the de facto Python API for Apache Spark) and SparkTorch (for running PyTorch models on Apache Spark).
Probably the most important of the “other” tools important for productionizing Deep
Learning, Apache Spark is the focus of a hands-on lab in the next chapter of this book.
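As a flavor of the PySpark API (a local, illustrative sketch only – the hands-on lab in the next chapter has its own setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-quickstart").getOrCreate()

df = spark.createDataFrame(
    [("txn1", 120.0), ("txn2", 35.5), ("txn3", 990.0)],
    ["transaction_id", "amount"],
)
df.filter(df.amount > 100).show()   # the filter runs in parallel across partitions
spark.stop()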
Azure Machine Learning and Azure Cognitive Services are now Microsoft's main AI tools.6
6. We will discuss these more in Chapter 6: The Employer's Dream – AutoML, AutoAI, and the Rise of NoLo UIs.
Tensors
Both TensorFlow and PyTorch utilize tensors, so it's helpful to understand this
important data structure before proceeding to how they are constructed and used in a
deep learning model.
In effect, a tensor is a multidimensional array with a uniform data type. They
are similar to numpy arrays but are immutable, much like a standard Python “tuple.”
Tensors can also be stored in GPU memory as opposed to standard CPU memory,
optimizing usage for deep learning.
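A couple of lines in TensorFlow illustrate these properties (the values are arbitrary):

import tensorflow as tf

t = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # a rank-2 tensor with a uniform dtype
print(t.shape, t.dtype)                      # (2, 2) float32
print(tf.matmul(t, t))                       # ops run on GPU memory if one is available
# t[0, 0] = 9.0  # would raise an error - tf.Tensor objects are immutable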
• Model evaluation
• Predictions/Inference
Implementing a CNN
The above sections cover the generic deep learning implementation process but how
does the process go when we are specifically building, for example, a Convolutional
Neural Network for Image Classification? Machine and Deep Learning modeling is
2. Build network
3. Compile network
The remaining steps are for the most part in line with the generic process shown in
the Deep Learning lifecycle above, from training the network, through evaluation on the
test set (checking such metrics as loss, accuracy, confusion matrix, and classification
report). Often training and evaluation is repeated until we hit our “target” performance,
with dropout often added to address overfitting.
7. More on pooling in the next section.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))  # raw logits for ten classes
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
Implementing an RNN
There are actually three types of Recurrent Neural Network “layers” we can leverage
in Keras:
keras.layers.SimpleRNN
This is the "vanilla" version – essentially a fully connected RNN where the output from the previous timestep is fed to the next timestep.
keras.layers.GRU
Gated recurrent units (GRUs) are gating mechanisms in recurrent neural networks
intended to solve the vanishing gradient problem in standard RNNs. Via an update gate
and a reset gate, a GRU "vets" information before passing it to an output, thus avoiding diminishing gradients and the inflexible (unchanging) weights of a SimpleRNN. We will discuss vanishing gradients in more detail in the "Tuning a DL Model" section later in this chapter.
Example Keras RNN implementation of GRU and SimpleRNN layers:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(layers.Embedding(input_dim=1000, output_dim=64))
model.add(layers.GRU(256, return_sequences=True))  # GRU layer returning the full sequence
model.add(layers.SimpleRNN(128))                   # "vanilla" SimpleRNN layer
model.add(layers.Dense(10))
model.summary()
keras.layers.LSTM
As we know from Section 3 in this chapter, LSTMs generally help achieve better forecasts
with “longer-term effects.” This “memory” in the network is achieved through the
persistence of hidden state through three gates:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))  # single hidden layer of LSTM units
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
• Epoch — a pass over the entire dataset. The number of epochs is the
number of times you go through your training set
• Loss function (or a cost function) – by how much does the prediction
given by the output layer deviate from actual (ground truth)? The
objective of a neural network is to minimize this
• we can take the dot product of the input and the set of weights then
add bias:
Figure 5-22. Multilayer Neural Network – how do we get our output from the
input data?
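With some hypothetical numbers, the calculation for a single neuron looks like this:

import numpy as np

x = np.array([0.5, 1.0, -0.2, 0.3])   # four inputs
w = np.array([0.4, -0.6, 0.1, 0.8])   # one weight per input
b = 0.05                               # bias

z = np.dot(x, w) + b                   # dot product of inputs and weights, plus bias
output = 1 / (1 + np.exp(-z))          # passed through a sigmoid activation
print(z, output)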
In this lab, we will build and train a Convolutional Neural Network in Keras and
TensorFlow to solve a predictive (multiclass classification of complex images) problem:
The goal of this lab is to build and train an LSTM for stock price (univariate) forecasting,
that is, where the forecasted stock price depends only on previous time steps:8
https://github.com/bw-cetech/apress-5.5b.git
3. Run through the notebook – uploading the data into a (temporary) Colab folder
as described and importing, performing EDA and Data Wrangling in order to
transform the stock price data into the format needed to forecast with RNNs
5. Exercise - retrain the model with a different batch size and compare the (root
mean square) error
6. Exercise – retrain the model with a different activation function (tanh) and compare the test result (tanh generally regulates the output of a recurrent neural network better than ReLU)
7. Stretch Exercise – adapt the TATA univariate forecast to carry out a live stock price forecast by completing the exercise commented out under "Live Stock Price Scenario". Run through the steps to import the latest stock price for a leading tech stock (e.g., Apple or Tesla) from Jan-21 to D-1 and perform a forecast 30 time steps into the future
8. Multivariate would involve the target variable (stock price) being dependent on multiple other variables, such as macroeconomic factors, weather, and previous time steps.
Activation Functions
As discussed in previous sections, neurons produce an output signal from weighted
input signals using an activation function. So in effect, an activation function is a simple
mapping to the output of the neuron where weighted inputs in the neural network are
summed and passed through the activation (or transfer) function.
The term "activation" is related to the threshold at which the neuron is activated and the corresponding strength of the output signal. The Heaviside step function used in the Simple Perceptron is a simple step activation function: if the summed input exceeds the threshold, the neuron outputs a value of 1.
The four main nonlinear activation functions which we actually use in practice
provide much richer functionality:
Softmax
The softmax activation function (or normalized exponential function) is a special case
for multiclass classification problems where there is a discrete (non-binary) output,
such as in the case of most image classification problems (e.g., is the image a person, a
building, a vehicle, a road, etc.).
Softmax turns logits (the raw outputs of the final layer in the neural network) into probabilities that sum to 1.
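A worked example on some made-up logits (the softmax lab at the end of this chapter builds the same idea into a reusable function):

import numpy as np

logits = np.array([2.0, 1.0, 0.1])      # raw outputs of the final layer
exp = np.exp(logits - logits.max())     # subtract the max for numerical stability
probs = exp / exp.sum()
print(probs, probs.sum())               # approx [0.66 0.24 0.10], sums to 1.0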
9. Predetermined before training, as opposed to learnt during training.
10. See Loss Functions.
A forward pass11 is then carried out where the input data gets transformed through
the network, activating neurons through hidden layers to ultimately produce an output
value. This output is then compared to the expected (actual) output and an error is
calculated.
Backpropagation
Backpropagation takes place when the above error is propagated back through the
network, one layer at a time, and the weights are updated according to the amount that
they contributed to the error.
The process repeatedly adjusts weights so as to minimize the difference between
actual output and desired output and is repeated for all rows/examples in the training
data (an epoch). A neural network is typically trained over multiple epochs, with 100 often used as a practical upper limit.
11. A forward pass is also used in the inference process, after the network is trained, in order to generate predictions.
SGD with Momentum
As we saw in the previous section, Stochastic Gradient Descent addresses a large memory disadvantage of Gradient Descent (which loads the entire dataset to compute the loss derivative). SGD with Momentum is an enhancement of the vanilla SGD algorithm, helping accelerate gradient vectors in the right directions in order to speed up convergence.
Adam
Adam (derived from adaptive moment estimation) is generally the choice of
optimization algorithm in Deep Learning. To converge faster (use of this algorithm can
be the difference in getting quality results in minutes, days, or months), Adam uses both
Momentum to accelerate the stochastic gradient descent process and Adaptive Learning
Rates to reduce the learning rate in the training phase.
Adam combines the best properties of the AdaGrad and RMSProp algorithms to
handle sparse gradients on noisy problems.
Loss Functions
Before we move on to look at best practice in improving Deep Learning performance,
let’s take a closer look at the concept of “loss” or the cost function in a neural network.
Minimizing a loss function in Deep Learning equates to minimizing the training
error or lowering the cost of the neural network/weight calibration process.
The three most common loss functions are binary cross-entropy for binary classification, sparse categorical cross-entropy for multiclass classification, and MSE (mean squared error) for regression, but as with all things in deep learning, there are several variants which under certain conditions can provide better results:
Squared Error (L2) Loss / Mean Squared Error (MSE) – Regression. The square of the difference between the actual and the predicted values; penalizes large errors by squaring them. Not robust to outliers.
Absolute Error Loss / Mean Absolute Error (MAE) – Regression. The distance between the predicted and the actual values; more robust to outliers as compared to MSE. Penalization of large errors may be insufficient.
Huber Loss – Regression. Combined MSE and MAE: quadratic for smaller errors, linear otherwise; more robust to outliers than MSE. Slower convergence.
Binary Cross-Entropy – Binary classification. Uses log-loss and the sigmoid function; ideally suited for binary classification models. Sigmoid can saturate and kill gradients.
Hinge Loss – Binary classification. Primarily used with Support Vector Machine (SVM) models; penalizes wrong predictions as well as low-confidence right predictions. Limited to SVM models.
Multi-class Cross-Entropy Loss / Categorical Cross-Entropy – Multiclass classification. A generalization of the binary cross-entropy loss; works well with one-hot encoded target variables. Each sample should have multiple classes or be labelled with soft probabilities.
Improving DL Performance
The above internal workings of a deep learning model help to understand what is
happening “under the hood” and the options described to improve results are somewhat
experimental, tied to the specific dataset and business or organizational problem with
which we are presented.
Where do we start though if we are embarking on improving neural network
performance? We should take a best practice approach, almost always commencing
(and ending) with a review of the underlying data:12
12. Source: machinelearningmastery.com
1. Review Data
–– Feature Selection
3. Tune hyperparameters
Besides these core principles for achieving higher model performance, periodic refactoring should be built into model maintenance to ensure code is as efficient as possible and (Python) libraries and functions are not deprecated.
Network Tuning
Deeper Network/More Layers/More Neurons
Adding more hidden layers/more neurons per layer means we add more parameters to the model and therefore allow the model to fit more complex functions.
Activation Function
As described above, generally with CNNs, ReLU is used to address vanishing gradients
with Sigmoid (binary) or Softmax (multiclass classification) in the outer layer. For RNNs,
tanh is used to overcome the vanishing gradient problem, as its second derivative can
sustain for a long range before going to zero.
Batch Normalization
The training process of deep neural networks is also sensitive to the initial random weights and the configuration of the learning algorithm. Normalization of the layers' inputs (batch normalization) can help make artificial neural networks faster and more stable by recentering and rescaling the data.
Pooling
Multiple convolutional layers in a CNN can be very effective, learning both low-level
features (e.g., lines) and higher-order features, like shapes or specific objects in the
outer layers.
However, these “feature maps” are tied to the EXACT position of features in the
input. This “inflexibility” can result in different feature maps for minor image changes
such as cropping, rotation, shifting, and a resultant loss in model generalization to
new data.
Pooling (implemented using pooling layers) is used to down sample convolutional
layers in a CNN and avoid overfitting to exact/precise image features. This lower-
resolution version of an input signal which retains only important structural elements is
similar to the effect of “pruning a decision tree” in machine learning.
Image Augmentation
Image data augmentation is used to artificially expand the size of a training dataset for
Convolutional Neural Networks. The idea is that training on more data means a more
skillful model. Using the ImageDataGenerator class in Keras to generate batches of
augmented images, we can achieve this in a number of ways:
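The specific options from the book's list are not reproduced in this extract; a typical (illustrative) configuration might combine rotations, shifts, zooms, and flips:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,        # random rotations
    width_shift_range=0.1,    # horizontal shifts
    height_shift_range=0.1,   # vertical shifts
    zoom_range=0.1,           # random zoom
    horizontal_flip=True,     # random horizontal flips
)
# datagen.flow(X_train, y_train, batch_size=32) then yields batches of augmented images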
Process Tuning
Number of Epochs and Batch Size
Increasing the number of epochs (a complete sampling of the entire training set) will
generally produce better performance, but only up to a point. Evidence of validation
(test) accuracy decreasing even when training accuracy is increasing should put an
upper limit on the number of epochs, otherwise we are essentially overfitting our model.
Rather than relying on trial and error, we can guard against setting the number of epochs too high by implementing early stopping criteria to avoid overfitting on the training set.
Our batch size – the size of the input data sample used in a forward pass – is another lever we have at our disposal. Using too large a batch size can have a negative effect on
generalization, that is, the test accuracy of the network is worse as we have reduced the
stochasticity (randomness) of gradient descent during the training process. A rule of
thumb for largish datasets (e.g., > 10,000 images) is to use the default value of 32 first,
then increase to 64, 128, and 256 if the model is underfit or the training time is onerous.
Learning Rate
Learning rate is a hyperparameter used to impact the speed of the gradient descent
process. Setting this too high can mean the algorithm will bounce back and forth without
reaching a local minimum, while too low and convergence can take some time.
Regularization
Regularization is a general technique to modify the learning algorithm such that the
model generalizes better, that is, avoids overfitting. In machine learning regularization
penalizes feature coefficients while in deep learning regularization penalizes the weight
matrices of the nodes/neurons.
A regularization coefficient (a hyperparameter) controls regularization – we get
underfitting if this regularization coefficient is so high that some of the weight matrices
are nearly equal to zero. L1 and L2 are the most basic types of regularization, updating
the general cost function by adding another term known as the regularization term:
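The equations themselves are not reproduced in this extract; the standard forms (a reconstruction, with lambda the regularization coefficient and m the number of training samples) are:

Cost = Loss + \frac{\lambda}{2m} \sum \lVert w \rVert_2^2 \quad (L2 regularization)
Cost = Loss + \frac{\lambda}{2m} \sum \lVert w \rVert_1 \quad (L1 regularization)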
Dropout
Deep learning neural networks are prone to overfitting when datasets are relatively
small. Dropout is a computationally cheap and effective regularization method to reduce
overfitting and improve generalization error. In effect, dropout simulates in a single
model a large number of different network architectures by randomly dropping out
nodes (neurons) during training.14
A good starting point for dropout is to set it at 20% and increase to, for example, 50% if the impact on the model is minimal. Too high a rate will result in underfitting.
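In Keras this is a one-line addition between layers (a sketch with illustrative layer sizes and rates):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation="relu", input_shape=(20,)),
    Dropout(0.2),                  # randomly drop 20% of activations during training
    Dense(64, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),
])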
Early Stopping
We should stop training at the point when performance on a validation dataset starts to
degrade, for example, by:
• No change in metric over a given number of epochs
14. Nodes may be input variables in the data sample or activations from a previous layer.
This is implemented in Keras using the EarlyStopping callback. The below example monitors and seeks to minimize validation loss across epochs:
EarlyStopping(monitor='val_loss', mode='min')
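Wired into training, it might look like the following (the patience value and the commented-out fit() call are illustrative):

from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(monitor='val_loss', mode='min', patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, callbacks=[es])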
Transfer Learning
Transfer learning is actually an "accelerator" for deep learning models – essentially we are taking a pretrained model as a starting point for another (different, but related) model. Popular in Deep Learning to reduce the computation required to develop neural network models from scratch, there are a number of research models often used for transfer learning:
The VGG model is used in the Transfer Learning exercise in our Hands-On lab above
(Convolutional Neural Networks with Keras & TensorFlow).
Wrap-up
The exhaustive neural network and process tuning that we have just walked through
is clearly complex. The best practice techniques described are often rather vague
and the range of options available/settings to configure can be difficult to pin down,
especially if results are required relatively quickly. But quick wins are always possible
and perseverance in achieving better results is usually rewarded provided a structured
approach is adopted and performance iteratively monitored.
While we may have reached the end of this chapter, our next chapter will take a
look at how we can couple best practice in deep learning with training (and testing)
automation to help accelerate the process of fine-tuning both machine and deep
learning models. Before we go there though, we complete this chapter with two deep
learning model tuning labs.
Simple hands-on lab to see how the Softmax activation function takes the last neural
network layer and turns output into probabilities that sum to one:
https://github.com/bw-cetech/apress-5.6.git
2. Run through the Python code in Jupyter Notebook to see how softmax is
computed.
3. As a stretch exercise try to create a Python function which stores the softmax
activation formula and call the function
In this lab, we will continue with our German traffic light image classification dataset
and look at the impact of early stopping criteria on model performance:
1. Continue with the notebook from earlier (Convolutional Neural Networks with
Keras and TensorFlow: Hands-on Practice)
2. Run through the remainder of the notebook from “2nd Run –Early Stopping
Criteria”
CHAPTER 6
AutoML, AutoAI, and the Rise of NoLo UIs
In what is still so far a relatively short space of time, growth in machine and deep
learning implementations in organizations across the world has been extraordinary.
However, this hasn’t always translated into commercial success, with disappointingly
low adoption rates in the retail sector (11.5% in the UK1) and just over 50% of prototypes
ending up in production across all industries.2 Historically many solutions have
been operationally siloed with PhD-level statisticians left to explain code-heavy
technical models.
Coupled with a reliance on dummy or synthetic datasets for training and an absence of (or worse, broken) interfaces, with training and testing done in a Python notebook (Jupyter, Colab, etc.), the "technical debt" associated with these poorly designed applications starts to become a burden to the organization.
Roll forward to today and we are starting to see a rise in the rollout of “AutoML”3
and “AutoAI” tools far better equipped for an enterprise-wide deployment – from fully
automated data import, through interface orchestration, machine/deep learning, and
deployment. Moreover, these tools are increasingly more usable, and, importantly,
understandable to multiple stakeholders across departments by virtue of built-in user-
friendly “NoLo” GUIs and embedded data traceability and auditability.
1. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1045381/AI_Activity_in_UK_Businesses_Report__Capital_Economics_and_DCMS__January_2022__Web_accessible_.pdf. For the Gartner stat, see www.gartner.co.uk/en/newsroom/press-releases/2020-10-19-gartner-identifies-the-top-strategic-technology-trends-for-2021
2. www.gartner.co.uk/en/newsroom/press-releases/2020-10-19-gartner-identifies-the-top-strategic-technology-trends-for-2021
3. The global AutoML market is projected to grow @ 43% CAGR and exceed $5b by 2027 (Source: businesswire).
The step change in use of, and collaboration across, AutoML/AI tools and so-called NoLo, or no/low-code, applications4 is coming at a time when companies are shifting from "rule-based" robotic process automation (RPA) to enhanced cognitive robotic process automation (CRPA) with AI-infused "context."
Perhaps this is nowhere more apparent than in the evolution of chatbots where
intelligent virtual agents (IVAs) or “conversational” chatbots have supplanted legacy,
rule-based chatbots, but the trend is also evident in the proliferation of augmenting
tools such as Microsoft PowerAutomate on top of PowerBI, NLP/text analytics on top of
optical character recognition (OCR) and specific sectoral enhancements such as (X-Ray)
diagnostics (AI) on top of patient screening (RPA) in the healthcare industry.
4. Or "LCNC" Low-Code/No-Code platforms. Although growth is expected to be at a slower c. 23% CAGR than AutoML, the market size for low-code development platforms is bigger and expected to reach $35b by 2030 (Source: Grand View Research, Inc).
5. See also Chapter 9 AI Project Lifecycle – Data Drift.
As we shall see later,6 typically the above automations are boiled down to four core
processes7 for building and evaluating candidate model pipelines:
• Data preprocessing
• Hyperparameter optimization
6. See the AutoAI in IBM Cloud Pak for Data section below.
7. See e.g. https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/autoai-overview.html
8. Or variable search space. See also https://machinelearningmastery.com/what-is-bayesian-optimization
The goal of this lab is to compare machine learning model performance when using (a)
random sampling, (b) grid search, and (c) Bayesian optimization
1. Clone the GitHub repo below:
https://github.com/bw-cetech/apress-6.2.git
9. Out of scope of this book, but also gaining in application, is the computationally intensive use of Genetic Algorithms.
10. Objective functions are nonconvex, nonlinear, noisy, and computationally expensive, hence the need to approximate with a surrogate function.
3. Exercise – try to plot the mean test score (AUC) for each of the three techniques
4. Exercise (stretch) – import a larger IoT or retail dataset, update the wrangling
pipeline and modeling hypothesis and view the mean score to see how
Bayesian inference outperforms the other techniques
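For orientation, grid search and random sampling are both one-liners in scikit-learn – the sketch below covers those two techniques only (with an illustrative search space); Bayesian optimization requires a dedicated library and is left to the lab notebook:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

params = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(), params, scoring="roc_auc", cv=3)
rand = RandomizedSearchCV(RandomForestClassifier(), params, n_iter=5,
                          scoring="roc_auc", cv=3, random_state=42)
# after grid.fit(X_train, y_train), grid.cv_results_["mean_test_score"] holds the
# mean AUC per candidate - the quantity plotted in step 3 above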
PyCaret
While experience with Python is required, PyCaret is marketed as low-code machine
learning due to its accelerated approach to machine learning model training. The USP
is in democratization of machine learning and as the example of training a dataset for
anomaly detection in Figure 6-3 shows, a minimal amount of end-to-end coding is
required.
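To give a feel for the coding overhead, the sketch below uses one of PyCaret's bundled datasets (an illustration only, not the anomaly detection example in Figure 6-3):

from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models

data = get_data("juice")                       # small built-in classification dataset
exp = setup(data, target="Purchase", session_id=42)
best_model = compare_models()                  # trains and ranks many algorithms in one call
print(best_model)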
auto-sklearn
Auto-sklearn automates the Data Science libraries from scikit-learn in order to determine effective machine learning pipelines for supervised classification and regression datasets.
Aimed at Machine Learning for the "Enterprise," prioritizing Data Team efficiency and productivity, it is more easily accessible to non-Data Scientists but, like PyCaret, still involves Python coding. It comes with built-in preprocessing and data cleaning, feature selection/engineering, algo selection, hyperparameter optimization, benchmarking/performance metrics, and postprocessing.
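A minimal sketch is shown below (the time budgets are arbitrary and the fit call assumes an existing train split):

import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,   # total search budget in seconds
    per_run_time_limit=30,         # budget per candidate pipeline
)
# automl.fit(X_train, y_train)
# automl.leaderboard() then ranks the candidate pipelines it found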
Auto-WEKA
Auto-WEKA is in fact a Java application for algorithm selection and hyperparameter optimization, built on WEKA,12 the University of Waikato (New Zealand)'s Machine Learning workbench. pyautoweka is the Python wrapper.
In contrast to auto-sklearn, Auto-WEKA simultaneously selects a learning
algorithm and configures hyperparameters with the goal of helping nonexpert users
more effectively identify ML algorithms and hyperparameter settings appropriate to
applications, as well as improve performance.
TPOT
TPOT (Tree-based Pipeline Optimization Tool) uses a tree-based structure/genetic
programming to optimize machine learning pipelines and is designed to train on large
datasets over several hours. The TPOT API supports both supervised classification and
regression, carrying out, as shown in Figure 6-4, data wrangling and PCA before iterative
training, testing, and recursive feature elimination to arrive at a pipeline with the
highest score.
11. Distributed Asynchronous Hyper-parameter Optimization – an open-source Python library for Bayesian optimization. See https://hyperopt.github.io/hyperopt/
12. Waikato Environment for Knowledge Analysis.
The goal of this lab is to use TPOT optimization to find the best-performing pipeline and
algorithm on a synthetic classification problem. The code sample also shows how to
automate in Python a direct connection, unzip, and read of a web dataset:
https://github.com/bw-cetech/apress-6.3.git
2. Walk-through the notebook step by step to read in the dataset directly from the
UCI website, unzip the files, and import the smaller csv dataset
3. Carry out basic EDA, data wrangling, partition the data and configure KFolds
cross-validation (STEPS 2, 3, and 4)
4. Run the TPOT optimization step and observe the pipeline/model scores as they
start to come through after a few minutes. The whole process should take no
more than 30 minutes
6. Exercise – perform more sophisticated data wrangling, for example, one hot
encoding for nominal categorical variables, improved feature selection, and/
or scaling
7. Open the exported pipeline file “tpot_best_model.py” and observe the best-
performing algorithm and associated hyperparameters
8. Exercise (Stretch): swap out the data with the larger banking dataset. Speed up
runtime by executing instead on Colab using a GPU accelerator and compare
pipeline performance over 10 generations
13. We cover five main AutoAI/AutoML "platforms." Out of scope of this book but certainly worth a look are tools such as c3, DataRobot, Peltarion, Ludwig, and KNIME in what is fast becoming an increasingly fragmented ecosystem.
The product’s “Data Fabric” support means the product supports multiple APIs
into structured and unstructured data sources spread across multicloud environments,
whether IBM Cloud, AWS, Azure, or GCP. This underpins of the USP of the product – IBM
claim that Data Fabric architecture means data access is 8* faster, while reduced ETL
requests amount to a 25-65% productivity boost. There are data governance benefits as
well – $27m saving in costs by virtue of the products inbuilt smart data negating the need
for manual cataloging.
AutoAI is the graphical tool, previously built into Watson Studio, which automates
the AI process – analyzing data, discovering data transformations, algorithms, and
parameter settings that work best for a specific predictive modeling problem.14 As shown
below, these automations essentially fit the four core AI automation processes described
earlier:
14. As well as meshing with IBM "ModelOps" best practice for monitoring model/data drift and re-training.
The entire ecosystem of MLOps tools within the Vertex AI platform is too broad to cover here,16 but of note is Vertex AI Pipelines – a serverless service that runs both TensorFlow Extended (which we will cover shortly) and Kubeflow pipelines. In keeping with the theme of unified data and machine learning, Vertex AI also comes with multiple integrations to BigQuery.17
16. See https://cloud.google.com/vertex-ai
17. See https://cloud.google.com/blog/products/ai-machine-learning/five-integrations-
Figure 6-8. Using Google Cloud Dataprep API to trigger automated wrangling
jobs with Cloud Composer18
18. See https://medium.com/google-cloud/automation-of-data-wrangling-and-machine-learning-on-google-cloud-7de6a80fde91
19. https://aws.amazon.com/machine-learning/automl/
Our final AutoML tool mentioned in this section is TFX. Although developed by Google, TensorFlow Extended (TFX) is an open source tool and considered here in its own right. Built for scalable, high-performance ML production pipelines/deployment, TFX extends TensorFlow execution pipelines and the tf.data API to end-to-end MLOps.
Several big brands use TFX, including Spotify (for personalized recommendations) and Twitter (for ranking tweets), and although TFX is not strictly a no-code tool for AI automation, we will take a look at how it works in one of our hands-on labs below.
Wrap-up
The above coverage of the main AutoAI tools being used today completes this
chapter. While many strive to incorporate NoLo user interfaces to ensure stakeholder
engagement extends beyond data team silos, the need for many of these applications to
support model customization with Python remains. In the next chapter, we will turn our
attention to developing AI applications – specifically taking back-end models beyond
simple scripts to front-ended “full-stack” solutions.
We now proceed to have a look at our AutoAI tools – first up is IBM Cloud Pak where we
will run an AutoAI experiment for predicting customers likely to default on loans:
https://dataplatform.cloud.ibm.com
8. Exercise – create and test also a batch endpoint where several customer
records are entered as a batch and predictions returned for all of them
9. Exercise (stretch) – try to integrate your deployed model with a sample (Flask20)
application
a. copy .env file
e. test app
10. NB make sure to stop your environment runtime after use by following the instructions "Stop the Environment" here: https://ibm.github.io/ddc-2021-development-to-production/ml-model-deployment/batch-model-deployment/
20. Flask is covered in more detail in Chapter 7.
This lab takes a look at how Google Teachable Machines operates by loading in X-ray images from Kaggle and training a predictive model to detect health issues from the scans (in this case pneumonia).
www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
Note this is a large 2GB dataset so may take several minutes to download.
2. Unzip the images in your local drive – this should complete in < 5 minutes.
6. Test the model on completion using images from the test set in the unzipped
image folder
7. Export the model, choosing TensorFlow and Keras. We will need this in a
later lab21
21. The lab is expanded on in Chapter 9 where we will build with Streamlit and deploy on Heroku a full stack application integrated with our model trained here on Google Teachable Machines.
Make note of your bucket name and region for step 6b below
4. Enable Vertex AI and Cloud Storage APIs by confirming your project, then
enabling at the link below:
https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,storage-component.googleapis.com
5. Download the notebook from the GitHub repo below and run it in Colab:
https://github.com/bw-cetech/apress-6.4-tfx-vertex-ai.git
22. Go to Google Cloud Console https://console.cloud.google.com/ and select Billing from the Navigation (hamburger) Menu in the top left corner. Remaining credits are shown in the lower right corner of the billing dashboard.
6. Run through the notebook steps making sure to restart the runtime after
installing the dependencies at the start
c. Prepare the example data from the Palmer Penguins sample dataset
7. The TFX pipeline at the end is orchestrated using Vertex Pipelines and the Kubeflow V2 dag runner. Make sure to click on the link shown in the last cell's output to see the pipeline job progress in Vertex AI on GCP, or visit the Google Cloud Console: https://console.cloud.google.com/ to see the API requests.
NB make sure to delete resources on GCP after finishing the lab, namely your pipeline
run, Colab notebook, Cloud Storage bucket, and project
Although not strictly part of Azure Machine Learning, Azure Video Analyzer is part
of Azure Cognitive Services and exhibits many of the automated features central to
AutoML/AutoDL/AutoAI. This lab takes a look at automating the cataloging process of
video metadata and video sections/snippets.
https://docs.microsoft.com/en-us/learn/modules/analyze-video/5-exercise-video-indexer
5. NB the video upload may take some minutes (5-10) as the video is indexed
6. Review Video Insights, selecting Transcript on the RHS of the screen and
observing a moving transcript with speakers, topics discussed, named entities,
and keywords as the video plays
8. Exercise – run the REST API again, this time to get finer-grained insights. Check
your solution against the PowerShell script at the GitHub location below:
https://github.com/bw-cetech/apress-6.4.git
NB make sure to close Visual Studio and logout of the Virtual Machine on completion of
the lab to avoid costs incurred on Azure
CHAPTER 7
AI Full Stack: Application Development
1. McKinsey famously predicted in 1980 that the cellphone market would reach 900,000 subscribers by 2000, less than 1% of the actual figure of 109m. www.equities.com/news/a-look-at-mckinsey-company-s-biggest-mistakes
2
www.itjobswatch.co.uk/jobs/uk/agile.do. AI, Cyber and Cloud and Software Engineering/
Development are also Indeed’s top 5 “In-Demand Tech Skills for Technology Careers.” See www.
indeed.com/career-advice/career-development/in-demand-tech-skills
In 2022, most AI apps fit into one of three categories: Machine Learning, Computer
Vision, or NLP. But due to their technical complexity, machine and deep learning
implementations have historically been operationally siloed, with PhD-level statisticians
left to explain code-heavy technical models.
Quick-fix or mock-up dummy/synthetic datasets have been relied on rather than
implementing a Data Fabric approach to connect to (or create) sophisticated prelabeled
datasets. Training and testing have often been limited to a Python notebook (Jupyter,
Colab, etc.), and inference (often without an API) done as an afterthought. The end result
is significant technical debt – certainly not an “app” as we know it.
The evolution is toward an “Enterprise AIaaS” solution, perhaps coupled with
TinyAI: new algorithms to shrink existing deep learning models without losing their
capabilities.3
3
Not mutually exclusive in our opinion as Enterprise AI is hard and takes time, so easier,
incremental developments are still vital for commercial buy-in
4
See recommended reference list
APIs and Endpoints
Application Programming Interfaces (or APIs) provide a standardized way of
communication between two software applications.
An API opens up certain user-defined URL endpoints, which are then used to send
or receive data requests. REST (representational state transfer) APIs are among the most
popular specialized web service APIs, making use of URIs, the HTTP protocol, and
JSON data formats, but there are others, with Google’s gRPC high-performance Remote
Procedure Call (RPC) framework and Facebook’s GraphQL suitable alternatives.5 Both
are now open source.
There are clear benefits in exposing AI/ML models as an API: from UI/UX (providing
a user-friendly analytics/model interface and workflow) to endpoint stability, the ability to
include data validation6 checks and security handling, separation of Data Science and IT
functions, usability across a wider organization, and multi-app reusability.
Besides internal productivity, there is additional external value creation from API-
supported AI solutions, which also expose Data Science models to a wider customer base. API
endpoints/responses can be tested using cURL and/or Postman – both simplify the
steps in building an API and troubleshooting/debugging connection issues. Figure 7-2
illustrates this.
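As a quick illustration, the same test that cURL performs can be scripted in Python with the
requests library – a minimal sketch, assuming a hypothetical local /predict endpoint that
accepts and returns JSON:

# Minimal sketch of testing an API endpoint from Python - the equivalent of a cURL POST.
# The URL and payload below are hypothetical placeholders.
import requests

payload = {"features": [5.1, 3.5, 1.4, 0.2]}            # example request body
resp = requests.post("http://127.0.0.1:5000/predict",   # hypothetical model endpoint
                     json=payload, timeout=10)
resp.raise_for_status()                                  # surface 4xx/5xx errors early
print(resp.status_code, resp.json())                     # e.g. 200 {"prediction": 0}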
5
There are multiple libraries for implementing GraphQL in Python including ariadne, graphene,
and strawberry
6
Data querying via GraphQL restricts the API response to providing only that information that is
required, cf. the analogy of ordering from a menu rather than eating from a buffet. See also e.g.
https://medium.com/@kittypawar.8/alternatives-for-rest-api-b7a6911aa0cc
Clusters
Clusters are a group of nodes (computers) in a high-performance computing (HPC)
system (Figure 7-3) which process distributed workloads using parallel processing.
The most famous example is Apache Hadoop – a framework that allows for
the distributed processing of large datasets across clusters. See Chapter 3 for more
on Hadoop.
Sharding
Sharding is the process of breaking up large tables into smaller chunks called shards
that are spread across multiple servers.7 A shard is essentially a horizontal data partition
that contains a subset of the total dataset, and hence is responsible for serving a portion
of the overall workload.
Sharding in Deep Learning can save over 60% of memory and allows training
models twice as large in PyTorch (see the sketch below).
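As a rough illustration (a minimal sketch, assuming a multi-GPU host launched with torchrun,
and not taken from this book's labs), parameter sharding in PyTorch is enabled by wrapping a
model in FullyShardedDataParallel:

# Minimal FSDP sketch: shard a (stand-in) model's parameters across GPUs.
# Assumes launch via e.g. torchrun --nproc_per_node=2 fsdp_demo.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")                 # one process per GPU
torch.cuda.set_device(dist.get_rank())          # single-node assumption: rank == local GPU id

model = torch.nn.Sequential(                    # stand-in for a much larger network
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 10),
).cuda()

sharded_model = FSDP(model)                     # parameters are sharded across ranks
optimizer = torch.optim.Adam(sharded_model.parameters(), lr=1e-3)

x = torch.randn(8, 4096, device="cuda")
loss = sharded_model(x).sum()                   # toy forward pass and "loss"
loss.backward()
optimizer.step()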
7
Similar to the partitioning process served through Resilient Distributed Datasets (RDDs) in
Apache Spark
Before proceeding to our next section and a look at the main software vendors and
their ecosystem of tools for AI application development, we finish this section with a
look at creating isolated (virtual) environments for application development and some
hands-on practice around scripting and working with APIs in Python before wrapping
up with the use of GPUs in Colab.
Virtual Environments
Virtual Environments offer isolated Python package installations for specific apps
allowing apps to coexist independently on the same system. Because these are self-
contained Python (and Python library) installations where an application does not share
dependencies with any other application, they are particularly useful for developing AI
solutions.
A virtual environment in Python is created using virtualenv (or venv from Python
3.3+), which includes a Python binary and essential tools for package management,
including pip for installing Python libraries. The following hands-on lab describes how
to set up a virtual environment from the terminal.
Many AI applications utilize raw Python scripts (.py files) instead of Jupyter notebooks.
Using standalone Python, the goal of this exercise is to get familiar with these scripts,
create a virtual environment and run the script:
2. Add the path of the installed location on your laptop as a (system) environment
variable – this will enable both Python and pip (library installer) commands to
work on terminal
3. Create a “test” folder in a suitable location on your local directory. Clone the
Python script to your local drive from GitHub below:
https://github.com/bw-cetech/apress-7.1.git
4. On File/Windows Explorer go to the cloned local folder and type “cmd” in the
pathname to open terminal. Create a virtual environment using the commands
below (one after the other):
python -m venv env
env\Scripts\Activate
5. Your virtual environment should now be enabled. Run the simple game with
the command
python python-guessing-game.py
The goal of this exercise is to deploy, then test/perform inference via an endpoint on
a machine learning model for detecting cybersecurity (DDoS – Distributed Denial of
Service) threats:
1. Starting from the experiment below trained on labeled network traffic events,
we deploy a machine learning model from Azure Machine Learning Studio by
first creating a predictive experiment
https://gallery.cortanaintelligence.com/Experiment/Cyber-
DDoS-trained-model
3. Finally, consume the Web Service by invoking the API from Excel. Test using the
sample data at the GitHub link below. NB the first record is a “teardrop” denial
of service (DoS) attack; the second record is benign network traffic
https://github.com/bw-cetech/apress-7.1b.git
5. Stretch: see if you can beat the model performance (AUC = 0.85) by changing
the data (or the algorithms) used
The final lab in this section compares runtime for a Big Data pipeline – specifically the
time to download a zip file directly from Kaggle then unzip the file.
1. Obtain an API key from Kaggle (kaggle.json file) or use the same key used in
labs from Chapters 4 and 5
2. Now download the Python notebook “GPU_test.ipynb” at the GitHub link below and
open it in Colab
https://github.com/bw-cetech/apress-7.1c.git
3. Drag and drop your kaggle.json file to the Colab default (content) folder
4. Run the Colab notebook to copy the json file to a root > .kaggle folder
5. Connect directly to a BIG dataset on Kaggle – the example given will download
a 350 MB dataset containing 50,000 images. Do this initially with the standard
(no hardware accelerator / CPU) runtime and time how long it takes
6. Unzip the images to a "review" folder – still using the standard runtime, timing
how long it takes
7. After completion (or after interrupting the code if taking too long), repeat steps
5 and 6 above with Runtime type set to GPU. Compare the time taken for both
steps with the standard runtime.
NB Although the lab is focused on the data import process, the same parallel
processing efficiencies extend to machine and deep learning, providing substantial
runtime savings when training and deploying models. A minimal sketch of the timed
download and unzip steps follows.
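A minimal sketch of steps 4–6 is shown below, assuming kaggle.json has already been dragged
into /content; the dataset slug is a placeholder to be replaced with the one used in the
notebook:

# Minimal Colab sketch: copy the Kaggle key, then time the download and unzip steps.
# Run once on the standard (CPU) runtime and once with the GPU runtime enabled.
import os, shutil, time, zipfile

os.makedirs("/root/.kaggle", exist_ok=True)
shutil.copy("/content/kaggle.json", "/root/.kaggle/kaggle.json")            # step 4
os.chmod("/root/.kaggle/kaggle.json", 0o600)

start = time.time()
os.system("kaggle datasets download -d owner/dataset-slug -p /content")     # step 5 (placeholder slug)
print(f"Download: {time.time() - start:.1f}s")

start = time.time()
with zipfile.ZipFile("/content/dataset-slug.zip") as z:                     # step 6
    z.extractall("/content/review")
print(f"Unzip: {time.time() - start:.1f}s")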
8
Despite the hidden costs, 2021 saw a 33% increase in global cloud spend driven by intense
demand to support remote working and learning, ecommerce, content streaming, online gaming
and collaboration
Cloud Platforms
There is general consensus on the leading cloud platforms, but we will now drill down
further into the main services and resources offered by each, starting with Amazon Web
Services.
AWS
AWS claims to offer the broadest and deepest set of machine learning services and
supporting cloud infrastructure. SageMaker is the main tool for Data Science due to its
scalability in machine learning, while for NLP AWS also offers Amazon Polly (text-to-
speech) and Lex (chatbots).
Evident from the list of customers using AWS, such as Siemens, FICO, Formula 1,
PwC, and Netflix, is a strong set of industry use cases for AI including Document
Processing, Fraud Detection, and Forecasting.
As the world’s biggest cloud service provider, many AI applications rely on AWS’s
breadth of supporting infrastructure: Simple Storage Service (S3) cloud storage,
Elastic Compute Cloud (EC2) instances, Elastic MapReduce (EMR) – managed clusters for
running Apache Spark, Redshift – AWS’s cloud data warehouse, Lambda – serverless
compute for processing events, and Kinesis for real-time streaming.
However, despite appearances, AWS’s “free tier” isn’t completely free. AWS products
can be “explored” for free, some products are free, and some are free for a limited period.
Almost all have capacity/usage limits.
Azure
The strength of Microsoft’s Azure AI platform lies in API access to key Azure cloud
services and an impressive list of clients including Airbus, the NHS, Nestlé, and the BBC.
Using Azure SDKs, simple API calls from Jupyter Notebook and Visual Studio Code
enable integration with underlying Python code and sklearn machine learning and
TensorFlow/PyTorch deep learning models.
Azure SDKs also enable access to Azure Machine Learning, with scaling via Azure
Kubernetes Service (AKS), Azure Databricks (with Apache Spark support), and high-
quality vision, speech, language, and decision-making models in Azure Cognitive
Services, including Anomaly Detector, Content Moderator, LUIS and QnA Maker
(knowledge-based chatbots/conversational AI), speech-to-text/text-to-speech, and Computer
Vision (prebuilt models) and Custom Vision (build your own models).
GCP
Although it is now open source, TensorFlow is Google’s child: it is still developed by Google
researchers and is probably still the USP for developing AI projects on Google Cloud
Platform (GCP).
When coupled with BigQuery (Google’s serverless data warehouse), Vertex AI
(managed machine learning), and Colab for running Jupyter notebooks, Google has a
wealth of strong products for Data Scientists and AI Engineers.
Previously under the Google AI Platform umbrella, Vertex AI is the preferred
“managed service” for taking sandbox machine learning models into production and
includes Cloud AutoML – high-quality, low-code models with state-of-the-art transfer
learning. It also supports Deep Learning containers and sharing of code from the
centralized repos and ML pipelines on Google AI Hub.
The same ML pipelines on AI Hub can be deployed to highly scalable and portable
Kubeflow pipelines based on Docker.
The Google Natural Language API provides app integration with Google NLP models
while Dialogflow is used to integrate chatbots and IVAs into mobile and web apps.
IBM Cloud
While IBM may no longer be in the top tier of Big Tech companies, and IBM Cloud lies
outside the “Big 3” cloud offerings, there is no doubt that the company continues to
lead in terms of innovation, often rolling out new developments and services before the
mainstream catches on.
IBM’s Watson platform has pioneered Commercial AI since winning the "Jeopardy!”
quiz show in 2011 and continues to drive IBM’s current suite of AI services provisioned
on IBM Cloud. The main tools are shown below:
More recently, IBM Cloud Pak® for Data (CPDaaS) has become IBM’s go-to business
data and AI platform-in-a-box, “infusing” applications with AI while automating
(AutoAI) and governing data and the AI lifecycle.
Cloud Pak is a one-stop shop for collecting, organizing, and analyzing data assets
for undertaking machine and deep learning. Composed of integrated microservices
benefitting from running on a multinode Red Hat® OpenShift® cluster, Cloud Pak provides
open and extensible REST APIs, support for hybrid cloud and on-prem resources and
elastic resource management with “minimal” downtime.
Heroku
One of the first cloud platforms, Heroku is now owned by Salesforce. Lesser known, but
one of our favorites in terms of simplicity and elegance: apps can be deployed, managed,
and scaled on cloud much like on the other leading platforms.
In our opinion, it is the most cost-effective cloud for quick application deployment.
The Free Plan doesn’t force users/developers to spend credits on enabling tools for
machine/deep learning, provided you limit monthly usage. Azure VMs and AWS
SageMaker, for example, are not free, but you can deploy an ML/DL app on Heroku
for free.
Scaling models is also intuitive – with a simple “pay as you go” option for app
monthly uptime. Hosting is done through Heroku dynos – the building blocks
that underpin/power Heroku apps. Essentially these are containers, but each dyno type
comes with a specific number of cluster workers (free/hobby dyno 1 cluster, standard 2
clusters, medium-performance 5 clusters, high-performance 28 clusters).
Heroku brings to a close the main cloud platforms available to build an AI solution.
The remainder of this section looks at front-ending these solutions with Python-based
user interfaces.9
Python-Based UIs
We address three main Python frameworks for creating web apps below. We will take
a look at another, Streamlit, in Chapter 9 which uses a simple API, supports interactive
widgets defined as Python variables, and deploys pretty quickly.10
9
For Big Data Engineering/parallel processing, specifically Apache Spark, that supports AI
applications, see Chapter 5. We will also cover another parallel computing platform, Dask in
Chapter 9. For front-ending solutions, see also use of react.js and VueJS in Chapter 2 – we will look
at an end-to-end application deployment built with a React UI at the end of this chapter.
10
Streamlit is used by a growing list of Fortune 50 companies including Tesla, IBM, and Uber
Flask
Flask is a micro web framework written in Python.
Flask code is simple, self-contained but scalable, and mostly suited to single-page
applications. SQLAlchemy can be used for database connectivity, and Flask comes with a
wider range of database support (such as NoSQL) than Django (discussed below).
As we will see in the hands-on lab that follows this section, after installing Python
for standalone scripting together with Visual Studio Code, the recommended workflow
is to clone a Flask app from GitHub, cd into the local copy of the app, and create a virtual
environment with python -m venv env.
Flask is then installed within the virtual environment in the normal way, that is, with
pip install flask.
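For orientation, a minimal single-file Flask app looks something like the sketch below (a
generic example, not the app cloned in the hands-on lab):

# Minimal Flask sketch (app.py): run with python app.py inside the activated
# virtual environment and browse to http://127.0.0.1:5000/
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/")
def home():
    return "Hello from Flask"

@app.route("/api/health")
def health():
    return jsonify(status="ok")   # a simple JSON endpoint

if __name__ == "__main__":
    app.run(debug=True)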
Dash
Dash converts Python scripts to production-grade business apps. Filling a perceived
Predictive Analytics gap in Traditional BI/Tableau/PowerBI/Looker dashboards, Dash
supports complex Python analytics /BI and is written on top of Flask, Plotly.js, and
React.js.
Dash also provides a “point-and-click interface” to Machine Learning and Deep
Learning models, greatly simplifying the process of front-ending AI apps with, for
example, object detection and NLP user interfaces (e.g., chatbots).
The pip install dash command will also install a number of other tools:11
• dash_html_components
• dash_core_components
• dash_table
The attraction of Dash is the potential to quickly wrap a user interface around Python
code; it does this via a file structure of four files with HTML-like elements:
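For orientation, a minimal single-file Dash app (a generic sketch rather than the four-file
structure referred to above) looks like this:

# Minimal Dash sketch (app.py): run with python app.py and open http://127.0.0.1:8050/
from dash import Dash, dcc, html
import plotly.express as px

df = px.data.iris()                                    # sample dataset bundled with plotly
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Iris explorer"),                          # html-like layout elements
    dcc.Graph(figure=fig),                             # interactive Plotly chart
])

if __name__ == "__main__":
    app.run_server(debug=False)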
11
Dash should be upgraded frequently with pip install [package] --upgrade
Django
Used by Facebook, Instagram, and Netflix, Django is designed to make it easier to build
better Web apps more quickly and with less code. Like Flask, Django is a high-level
Python Web framework for rapid development of web apps, but in contrast these apps
tend to be full-scale and more powerful than simpler Flask limited-page apps.
Django is free and open source and has a “batteries-included” framework with
most of the features preinstalled. It comes with automated tools to avoid repetitive
tasks and a clean and “pragmatic” design which takes care of much of the hassle of Web
development, with a simplified three-step process for making model changes: change your
models (in models.py), run python manage.py makemigrations to create migrations for those
changes, then run python manage.py migrate to apply them to the database.
C3
C3 is an AI and IoT software provider for building enterprise-scale AI applications.
It offers out-the-box/prebuilt, industry-specific AI applications to optimize critical
processes – C3 claims to run 4.8 million AI models and 1.5 billion predictions per day.
DataRobot
DataRobot’s entire business model (and USP) is MLOps automation and “accelerating
data to value.” The product targets less technical users, for example, Business Analysts
looking to build predictive analytics with no knowledge of Machine Learning.
We now move on to look at three of the tools covered above in our hands-on
labs for this section.
Using sample Dash app boilerplates on GitHub, the goal of this exercise is to clone an
IoT app’s source code and run it locally. This lab includes a number of stretch exercises
to edit source code, work with Dash callbacks and deploy a Dash app to Heroku to gain
further experience with Dash:
https://dash-gallery.plotly.host/dash-manufacture-spc-
dashboard/
2. Go to https://github.com/dkrizman/dash-manufacture-spc-dashboard and click on the
green Code button to download a zip file containing the source code
4. Copy the code from the app.py file to a notepad and save as app.ipynb
5. Open the new Jupyter notebook, change Debug = True in the last line to Debug
= False and try to run it. NB you may need to install dash_daq first by running
(then commenting out) %pip install dash_daq in the first code cell.
In this lab, starting from a boilerplate on GitHub, we create a virtual environment, install
some Python dependencies, and create a Flask dashboard.
1. Clone the source code at the link below to your local drive
https://github.com/app-generator/flask-black-dashboard.git
6. Stretch: change the font and font size on LHS menu and replace the “Daily
Sales” chart with a different dataset
Following on from the previous labs on Dash and Flask, the goal of this exercise is to
familiarize ourselves with another Python front-end (web) framework: Django
• Admin oversight
2. Install Django into the virtual environment (NB first check if you have this
installed already with py -m django --version)
3. Start the app by cd’ing into the mysite folder then running
python manage.py runserver
4. Create a polls app by following the steps at the tutorial link below, and for the
remainder of this lab
https://docs.djangoproject.com/en/3.2/intro/tutorial01/
Exercise: try to complete “Writing your first Django app part 2” by following the instructions on
the next page of the tutorial. Make sure to go through the steps below to understand how the
application interfaces to the underlying datastore for the dashboard and how to manage the
application development process
ML Apps
We now turn our attention to the specific Machine and Deep Learning applications
developed and deployed in organizations and businesses today.
As we have already covered the high-level overview of AI applications in our first
chapter, we focus this section on how these applications are being built in organizations
and the tools used. The specific “industry perspective” will be covered in the next
chapter on AI Case Studies.
By no means exhaustive, the table below describes what is driving businesses, what
applications they are using, and the data enablers and sources that tend to be pulled
on to get the job done. What is clear is that framing an organizational problem and
documenting the delivery plan is key to successfully implementing Machine Learning –
best practice is to embrace continual improvement over many iterative processes: from
problem framing, through data collection and cleansing, EDA and data prep, feature
engineering, model training, evaluation and benchmarking, inference and data drift.
Customer Experience
Most businesses today are striving to achieve greater brand authority by tailoring
and improving Customer Experience (CX) via predictive modeling and by using
customer feedback directly in the model training process.
The classic application of machine learning to customer experience is a
Recommendation Engine providing personalized offers to customers based on their
transaction history and preferences.
Transactional and demographic data are long-established value levers, but in 2022,
more sophisticated mining of customer journey and behavioral data, often unstructured,
is what differentiates high-value retail organizations.
12
Content-based filtering uses only the existing interests of the user, as opposed to
a collaborative filtering model which extends modelling to the entire user base and
seeks similarities between users and items (movies here) simultaneously to provide
recommendations. See also https://developers.google.com/machine-learning/recommendation/collaborative/basics
5. Add a JSON file (test.json) to the local app folder with a movie title/TV series of interest
{
"title" : "Narcos"
}
6. Test the app with cURL (installed by default with Windows systems):
curl -H "Content-Type: application/json" --data @test.json
http://127.0.0.1:5000/api/
You should now see a list of recommended movies/TV series in a test.json file in your
app folder based on the movie/TV series entered in step 5
8. Stretch: Postman is another utility to explore and test APIs – set up a Postman
account and send a request to the app above
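Under the hood, a recommender like this typically relies on collaborative filtering; a
minimal item-based sketch, using a tiny made-up ratings matrix rather than the app's actual
data, is shown below:

# Minimal item-based collaborative filtering sketch on a hypothetical user x title matrix.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.DataFrame(
    {"Narcos": [5, 4, 0, 1], "Ozark": [4, 5, 1, 0], "The Crown": [0, 1, 5, 4]},
    index=["user1", "user2", "user3", "user4"],
)

item_sim = pd.DataFrame(
    cosine_similarity(ratings.T), index=ratings.columns, columns=ratings.columns
)
print(item_sim["Narcos"].drop("Narcos").sort_values(ascending=False))  # most similar titles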
Although not strictly involving machine learning, prescriptive analytics problems requiring
the use of complex optimisation techniques and solvers are often run in parallel (or as a post
process) to ML/DL processes to maximise profit or minimise risk.
towardsai.net/recommendation-system-in-depth-tutorial-with-python-for-netflix-using-collaborative-filtering-533ff8a0e444
This lab takes a look at implementing Convex Programming with Python – in this case
simplifying via a Linear Programming (LP) special case where the model constraints and
objective function are linear.
https://github.com/bw-cetech/apress-7.3.git
b. Import the stock data from the monthly csv data file downloaded from GitHub
d. Compute expected risk and returns for each of the three stocks
e. Optimise a $1000 spend across the three stocks to achieve a balanced portfolio
3. Exercise (stretch) – adapt the code to connect to live stock prices with Pandas
DataReader and Yahoo Finance Python libraries, optimising / balancing £100k
capital spend across three tech stocks
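A minimal sketch of the LP special case is shown below, using cvxpy with illustrative
expected returns (the lab's actual figures come from the monthly csv on GitHub):

# Minimal LP sketch: allocate a $1000 budget across three stocks, capping any single
# stock at 50% of the budget for balance. Returns below are hypothetical.
import cvxpy as cp
import numpy as np

expected_returns = np.array([0.08, 0.12, 0.10])   # hypothetical annual returns
budget = 1000                                      # $1000 to allocate

alloc = cp.Variable(3, nonneg=True)                # dollars per stock
constraints = [
    cp.sum(alloc) == budget,                       # spend the full budget
    alloc <= 0.5 * budget,                         # balance: cap any one stock at 50%
]
problem = cp.Problem(cp.Maximize(expected_returns @ alloc), constraints)
problem.solve()

print("Allocation:", np.round(alloc.value, 2))
print("Expected return: $", round(problem.value, 2))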
DL Apps
While certain machine learning applications are firmly established in many
organizations, implementing Deep Learning applications has been far more an
aspirational goal. That is starting to change, partly through the use of leading accelerator
AutoAI tools we discussed in Chapter 6 and partly via experimentation and prototyping
around the main use cases we discuss in this final section.
Some of the drivers related to machine learning apply here too, but there is increased
sophistication as well, such as in understanding customer journeys rather than just
buying patterns. Some of these drivers are shown in the table in Figure 7-13, together
with key business applications and benefits.
Forward planning on cloud storage and compute is vital in delivering a successful
Deep Learning solution, and deep learning projects require a lot of iteration, a lot of
time, and a lot of effort. But being disciplined, leveraging your resources to the max, and
monitoring progress along the way can help bring about success.
Like ML, it starts by understanding the problem context and project lifecycle, and a
kaizen (continual improvement) approach is key. Before proceeding to our final section
in this chapter, we outline below a DL-specific high-level framework for success (Source:
neptune.ai):
Computer Vision
No Computer Vision project today is developed without recourse to TensorFlow or
OpenCV (open source computer vision tool). While TensorFlow has already been
extensively discussed, OpenCV was originally developed by Intel and offers support for
multiple programming languages and operating systems. OpenCV-Python is the Python
library interface to the tool.
Most of the CSPs have APIs for Computer Vision, with IBM Watson Visual
Recognition, the Google Cloud Vision API, and Microsoft Computer Vision perhaps three of
the best. Critical to these offerings is the huge storage requirement for Computer Vision
coupled with low-latency compute during critical training and inference processes, with
ease of deployment (as containerized solutions) also increasingly important.
Forecasting
As described in Chapter 5, LSTMs have become highly accurate in forecasting short- and
mid-term horizons. These algorithms, combined (or compared) with fbprophet for
longer-term forecasting with multiplicative seasonality (where the seasonal effect scales
with the trend rather than simply being added to it), provide a powerful multihorizon
forecasting “arsenal.”
The enhanced forecasting capability of neural networks means AI augmentation of
traditional forecasting approaches is increasingly common – often LSTMs are compared
with regressional techniques and exponential smoothing, ARMA/ARIMA/SARIMA/
SARIMAX techniques, Monte Carlo, and VaR approaches.
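A minimal forecasting sketch with Prophet is shown below (the package is installed as
prophet; older releases used the name fbprophet), assuming a hypothetical demand.csv file
with the library's required ds/y columns:

# Minimal Prophet sketch with multiplicative seasonality (seasonal effect scales with trend).
import pandas as pd
from prophet import Prophet

df = pd.read_csv("demand.csv")                    # hypothetical file with columns: ds, y
m = Prophet(seasonality_mode="multiplicative")
m.fit(df)

future = m.make_future_dataframe(periods=90)      # forecast 90 days ahead
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())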
IoT
More than 9 billion IoT devices currently exist online with some 50 billion to nearly 1
trillion devices expected online in the next decade.
This incredibly rich source of data is fueling the current trend toward the artificial
Internet of things (AIoT) – the intersection of AI and IoT.
The goal of this exercise is to deploy a deep learning application using a “hand-shake”
between react.js and Flask:
1. Using a simple Early Stopping Criteria, train and export the VGG model from
Chapter 5
2. Put the exported model above into a new local folder for the app called
“dl-traffic-app”
3. Create a react.js front-end boilerplate by cd’ing into this folder from terminal
and running the commands below (this will create a folder called react-frontend
for the front-end source code we need):
numpy
flask #==1.1.2
dash
dash_bootstrap_components
matplotlib
tensorflow
opencv-python
7. To have the Flask backend serve the react.js front-end, edit the blank main.py
file with the contents of the file sample_flask.py here (a hypothetical sketch of
such a file follows the lab steps):
9. Fire up the Flask backend by cd’ing into the folder created in step 3
and running
python main.py
10. Taking the files in the “additional-files” folder at the GitHub repo above (https://
github.com/bw-cetech/apress-7.4.git), replace those in the react-
frontend and flask-backend folders. Add drag-and-drop capability to the front end
and the model function call from Flask. Additionally:
a. Create a "python" folder under flask-backend and add the new file dlmodel.py
(renamed from sample_dlmodel.py)
b. Also update the path to the trained model in the Python script (dlmodel.py) to:
model = load_model("python/best-model-traffic-ESC.h5") # load model
12. Test the app with some sample images from our GitHub repo
https://github.com/bw-cetech/apress-7.4.git
13. Stretch: complete the steps in this lab using a VueJS front-end instead of
react.js14
14. Stretch (HARD): separate out React and Flask in a Docker container
15. Stretch (HARD): Add a database to serve the app with training images
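As referenced in step 7, a hypothetical sketch of what such a main.py might look like is
shown below – the repo's sample_flask.py is the authoritative version, and the route names,
form field, and input size here are assumptions:

# Hypothetical main.py sketch: serve the built React front end and expose a /predict route.
# Assumes npm run build has been executed in react-frontend and the model path from step 10b.
import numpy as np
import cv2
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model

app = Flask(__name__, static_folder="react-frontend/build", static_url_path="/")
model = load_model("python/best-model-traffic-ESC.h5")   # trained VGG model from step 1

@app.route("/")
def index():
    return app.send_static_file("index.html")            # serve the React front end

@app.route("/predict", methods=["POST"])
def predict():
    file_bytes = np.frombuffer(request.files["image"].read(), np.uint8)
    img = cv2.imdecode(file_bytes, cv2.IMREAD_COLOR)
    img = cv2.resize(img, (224, 224)) / 255.0             # assumed VGG input size
    preds = model.predict(img[np.newaxis, ...])
    return jsonify(prediction=int(np.argmax(preds)))

if __name__ == "__main__":
    app.run(port=5000)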
Wrap-up
This extensive lab bringing together back-end development (Flask and TensorFlow) with
a front-end UI (Dash and React) hopefully gives a flavor of what to expect in building out
a “full-stack” AI application. We will walk through several more of these labs in the next
chapter when we cover solutions for specific industry case studies, before a final look at
end-to-end AI deployment in our penultimate chapter.
14
For a simple VueJS front-end boilerplate, see the hands-on lab in Chapter 2
CHAPTER 8
AI Case Studies
AI technologies are used today to improve customer service and offer personalized
promotions, to avert cyberattacks, to detect and prevent fraud, to automate management
reports, to perform visual recognition, to mitigate insurance risks, and to sort and
categorize documents and images.
Many of the technologies already have a major impact on everyday life, particularly
the day-to-day “apps” and bespoke content services we engage with on our handheld
devices.
This chapter takes a comprehensive (multisector, multifunctional) look at the main
AI use cases of the last few years, in order to highlight what drives the needs and business
requirements for AI in the workplace.
We will take a look at the key enablers for AI before embarking on a granular deep
dive into specific vertical industry challenges in Telco, Retail, Banking and Financial
Services, Oil and Gas, Energy and Utilities, Supply Chain, HR, and Healthcare. Ultimately
the goal is to bring together the tools perspective in the last chapter with the business or
organizational problem in this chapter in order to gain a deeper understanding of the
Machine and Deep Learning applications implemented to address them.
The chapter includes some more advanced use cases involving multitool integration
such as Social Media (Twitter API) Sentiment Analysis, Fraud Detection, and Supply
Chain Optimization. The aim of these cases is to map the most suitable AI technologies
and platforms for handling the underlying data and AI components: specifically data
ingestion, storage, compute and modeling, and analytics.
As part of these enhanced use cases, we will look at how to set up a Twitter
Developer account and implement the Twitter API to undertake Social Media Sentiment
Analysis, how to leverage key cloud solution components such as AWS SageMaker,
Lambda, S3 and QuickSight to train and deploy a Fraud Detection model and IBM Cloud
Pak for Data, Watson Studio, Watson Assistant, AutoAI, and Decision Optimizer with
CPLEX for Supply Chain Optimization.
AI Enablers
The main enablers of commercial AI solutions and use cases today are four-fold.
First is the emergence of cloud-based services: simplified and increased access to
storage with limitless scaling, and computing power (compute) to support complex
calculations.
The large-scale use of sensors means IoT (and mobile) devices now generate
feature-rich, peta-scale data as a matter of course. This data is often integrated with
smart APIs, meaning machine and deep learning is not always needed “from the ground
up” – users can instead call an API to a pretrained endpoint solution.
And of course Digital Transformation itself and the driving force of “disrupt or be
disrupted” enables Artificial Intelligence as a value-added profit center focused on
the constant need to manage costs, improve productivity, and open up new revenue
streams.
From defining the strategy and objectives through data strategy, security model,
testing, implementation, and change management processes, it's critical to ask the
right questions. Figure 8-2 shows some of the key ones we think are necessary in order to
weigh the landscape and work toward a successful delivery.
Solution Architectures
Due diligence on the existing IT landscape is a necessary prerequisite to understand
gaps in data, processes, technologies, and infrastructure. These gaps inform the
implementation and solution approach – at some stage on this learning journey a “To
Be” Solution Architecture will be required to articulate how the AI solution will interface
to the supporting (hybrid) infrastructure, both on-prem and cloud-based.
We have already taken a look at some example architectures in our chapter on Data
Ingestion (see Building a delivery pipeline), so we will wrap up this section with a quick
look at one more – this time an Azure Cloud BI architecture.
Example: Azure
Often AI projects are grown out of simpler descriptive analytics projects.
The simple Azure architecture in Figure 8-3 shows (in training phase I) an extract,
load, and transform (ELT) pipeline executed by Azure Data Factory (ADF). ADF
automates the workflow to move data from the on-prem SQL database to a cloud-based
one – in this case Azure Synapse (SQL Data Warehouse).
The data in Azure Synapse is then transformed for analysis using Azure Analysis
Services, effectively creating a semantic model for Power BI to consume and for an end
user to generate insights.
Telco Solutions
We now turn our attention to the leading AI use cases by specific industry, starting with
the Telco sector.
Specific Challenges
Implementing AI solutions comes with specific challenges in the Telco sector, stretching
across people, processes, and tools. As shown in Figure 8-4 from Boston Consulting,
commissioned AI projects and applications need to demonstrate proof of value, data
governance, agile digital delivery, in-house AI capabilities, and managed business
transformation. We address some of these challenges as we pick out the main Telco AI
and Analytics solutions in the sections below.
Solution Categories
The Telco sector is unique in its reliance on famously “feature-rich” datasets –
information on customers often runs to hundreds of variables covering demographics
(age, gender, etc.), transactional parameters (e.g., total spend per month), and attitudinal
parameters (such as visited upgrade plans).
But AI solutions are not always focused on the end-customer. Broadly we see three
main categories of AI solution in the Telco sector. These are grouped, as shown in
Figure 8-5, into Data Governance and Oversight, Dashboard-driven Insights, and
Predictive Analytics.
Real-time Dashboards
Churn is one of the main metrics by which a telco company is measured, with industry
analysis1 showing customer churn drivers for mobile services varying by:
• Price 45%
Ideally, businesses want a single place to track progress across multiple client
metrics and internal KPIs and prevent churn. Making data-driven decisions requires a
360° perspective on key customer metrics including customer complaints tracking and
key SLA metrics.
Real-time, compelling dashboards can support the supervisory process with key
metrics, monitoring, and automated reporting on customer complaints:
• Examine customer profiles
• Monitor indicators
1
See e.g. www.analysysmason.com/
Figure 8-6. Example Telco dashboard view – churners vs. non-churners (Source:
Microsoft PowerBI / Starschema)
Sentiment Analysis
Sentiment Analysis using Natural Language Processing has become a valuable tool for
companies in the Telco industry to better understand the voice and opinions of the
customer. But typically customer “sentiment” is just one of the key tracking criteria
(KTCs) that a company is interested in gleaning from its customer engagement channels.
Other KTCs include customer themes or concepts, emotions, entities, and keyword
analysis.
We will look below at how to implement a Sentiment Analysis solution for Telco
using the Python Twitter API, but in terms of a project approach these KTCs
are a good starting point from which to create a solution design for displaying social
media metrics relevant to the business and upon which marketing teams can segment
customers and act.
Typically the best-of-breed sentiment analysis solutions are full-stack, combining
analysis of feature-rich Telco datasets and under-the-hood KTC analysis of the latest
Twitter feeds (or Facebook Insights) with a front-end dashboard displaying KPIs in
real time.
Predictive Analytics
Besides dashboards and sentiment analysis, more generic “Predictive Analytics”
solutions leveraging Machine Learning are sought after by Telcos to achieve productivity
improvements.
The use cases are varied, from optimal planning and scheduling (e.g., of broadband
repairs) through push-button forecasting of customer demand (by segment) and
competitor pricing, support for what-if scenario analysis (e.g., modeling rollout of a new
service), automation of customer service delivery including installations and service
upgrades to document search capability/probabilistic document retrieval (e.g., based on
contract terms).
A natural fit for these challenges is an “in-house” Machine Learning as-a-service
capability that can tackle these myriad issues, transforming them into solvable machine or
deep learning problems – often unsupervised at first before moving to a scheduled (e.g.,
monthly) supervised learning approach (until a sufficient volume of data is captured
and auto-updated through a pipeline). The key to delivering production-grade
solutions in Telco, as in all sectors, is performance benchmarking against legacy
processes, with either
• A cost reduction or
Figure 8-8. Use of PCA – a valuable tool for data reduction on "feature-rich" Telco
datasets
–– https://twitter.com/i/flow/signup
• Set up a Twitter Developer Account
–– https://developer.twitter.com/en/portal/petition/use-case
–– There is a 5-10 minute application form to fill out – make sure to clarify use
cases are for upskilling on NLP and to understand social media sentiment
Opening a Python notebook (e.g., with Jupyter or Google Colab), we can now import a
Twitter client library (the lab below uses tweepy) and start to query live (and historic) tweets.
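A minimal tweepy-based sketch is shown below – the keys/tokens are placeholders for your
own Developer credentials, and the search term is illustrative:

# Minimal sketch: authenticate with the Twitter API via tweepy and fetch recent tweets.
import tweepy

auth = tweepy.OAuth1UserHandler(
    "API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"
)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Latest English-language tweets matching a search term, excluding retweets
for tweet in api.search_tweets(q="broadband -filter:retweets", lang="en", count=10):
    print(tweet.user.location, "|", tweet.text[:80])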
Accessing the Python Twitter API, the goal of this lab is two-fold:
b) Perform basic sentiment analysis on the latest tweets with tweepy and nltk
1. Make sure you have followed the setup of the Twitter API as described under
“Connecting to the Twitter API from Python” above
https://github.com/bw-cetech/apress-8.2.git
a) Get tweets – Import libraries and credentials, define Twitter search criteria, and
fetch tweets
b) Process the tweets – Collect tweets in a Python list and filter out retweets
c) Analyze tweets – Who is tweeting? Isolate locational metrics and convert the list to a
pandas dataframe
2. Make sure to complete the exercises in the notebook for creating analytics on
the processed tweets
Retail Solutions
Many of the same challenges exist in the retail sector as in the Telco industry, but there
are also certain unique challenges, related to products and customer experiences that
distinguish digital disruption from more service-driven industries such as Telco. This
section takes a look at how AI solutions are helping retailers overcome these challenges.
2
For example, Wiser’s survey of consumer shopping preferences suggests that 29% of respondents
are much more likely and 33% of them are more likely to shop in-store than online if a unique
experience is offered.
to prevent profit erosion or negative marketing of the brand/store. More than that, an
active approach to predict and prevent customer churn is a necessity to retain existing
customers and drive dependable revenue streams.
• Ranked customers
• Revenue at risk
Given the end customer of a churn model is often a marketing team looking to
support campaign ideas with data-driven analytics and model outcomes, marketing (and
media) “hook-ups” can help automate engagement with the most at-risk customers.
Using Google Cloud Platform BigQuery ML, the goal of this lab is to train a machine
learning model to forecast the average duration of bike rental trips.
https://console.cloud.google.com
2. Login to GCP sandbox below (free even after three-month expiry of GCP Free
Tier) and create a project
https://console.cloud.google.com/bigquery?ref=https:%2F%2Faccounts.google.com%2F&project=gcp-bigquery-26apr21&ws=!1m0
4. Exercise: perform a couple of quick SQL “EDA” queries to get (a) the number of
records and (b) the busiest bike station
5. Compose another SQL query to build an ML model using the features below (a
minimal sketch follows step 6):
6. Repeat the above for a second model but add “bike share subscriber type” and
remove the two features:
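As referenced in step 5, a minimal sketch of such a CREATE MODEL query run from Python is
shown below – the public Austin bikeshare table and the feature columns are illustrative
assumptions, so substitute the dataset and features used in the lab:

# Hedged sketch: train a BigQuery ML regression model on bike trip durations from Python.
# Dataset, table, and feature names are illustrative placeholders.
from google.cloud import bigquery

client = bigquery.Client()   # uses your GCP project credentials

create_model_sql = """
CREATE OR REPLACE MODEL `your_dataset.bike_duration_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['duration_minutes']) AS
SELECT
  start_station_name,
  EXTRACT(HOUR FROM start_time) AS start_hour,
  duration_minutes
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE duration_minutes IS NOT NULL
"""
client.query(create_model_sql).result()   # blocks until the model has trained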
Using Python in Jupyter Notebook, the goal of this exercise is to predict the likelihood of an
active customer leaving an organization and to identify key indicators of churn.
1. Clone the GitHub repo below containing a churn model Python notebook and
“customer_churn_data” dataset
https://github.com/bw-cetech/apress-8.3.git
2. Run through the notebook in Google Colab, importing the data then carrying out:
a. Pre-processing
b. Visual EDA
c. Feature Engineering
d. Modelling
g. Exercise (Stretch)
i. rather than using a dataset with a predefined target (churn) variable, swap it out
with a retail dataset containing customer order information (such as the dataset
“AW-SOH.xlsx” also uploaded to the GitHub repo above) and define “churn” as a
custom variable based on a target number of months (N) elapsed since a customer
made a purchase.
ii. Aggregate order data by customer, engineering any new transactional features
which could be indicative of churn (e.g., average purchase amount, length of time as a
customer, etc.) – a minimal sketch of this step follows the lab steps.
iii. Finally, split the data into a training and test set, train a predictive model in the
same way, then predict whether new customers are likely to churn or not in the
next N months
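A minimal sketch of this churn definition and aggregation is shown below – the column names
are hypothetical and should be adjusted to the AW-SOH.xlsx schema:

# Minimal sketch: derive a custom churn target from order data (hypothetical columns).
import pandas as pd

N = 6   # months of inactivity after which we label a customer as churned
orders = pd.read_excel("AW-SOH.xlsx", parse_dates=["OrderDate"])

snapshot = orders["OrderDate"].max()
features = orders.groupby("CustomerID").agg(
    last_purchase=("OrderDate", "max"),
    avg_purchase=("OrderAmount", "mean"),
    n_orders=("OrderDate", "count"),
    tenure_days=("OrderDate", lambda d: (snapshot - d.min()).days),
)
months_inactive = (snapshot - features["last_purchase"]).dt.days / 30.44
features["churn"] = (months_inactive > N).astype(int)   # custom target variable
print(features.head())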
Our final lab in this section takes a look at how to implement social network analysis on
tweets. Here we take a look at tweets on four UK supermarkets and perform network analysis
on the location of those tweets focussed around major cities
https://github.com/bw-cetech/apress-8.3b.git
2. Run through the notebook with Jupyter Notebook or Google Colab, following
the steps below:
3. Finally create the network analysis showing tweets by supermarket and (major
city) location
Industry Challenges
Across multiple industries including Banking and Financial Services, AI is rapidly
reshaping risk management practices and transforming client and internal services. Key
areas of disruption are shown below:
• Governance, Risk Management and Compliance (GRC)
• Personalized banking
• Trading
• Process automation
• Customer complaints
3
The closely coupled Insurance sector is covered separately at the end of this chapter
But it's not just in risk control and cost prevention where AI-as-a-service can add
value in Banking and FinTech. AI innovation is also helping to:
• Decrease customer support costs through the use of chatbot and IVA
technologies and automated systems including document search,
retrieval, and archiving
• Build trust and loyalty through robust fraud detection measures and
periodic personalized offers (loans, cashback incentives, etc.)
Figure 8-10. Deloitte – key focus B&FS use cases from C-Level Executives
While there are clearly many areas where Predictive Analytics is adding value, our
main focus in this section is on the “flagship” AI use case for Banking and FinTech –
Fraud Detection.
Fraud Detection
Business and organizational fraud risk goes beyond a simple identification of large
transactions – a robust solution must address key operational risks too. With fraud an
ongoing problem costing businesses billions of dollars annually, applying learning
capability, as opposed to simpler, rule-based methods, is perceived as an existential
need in the highly disruptive B&FS sector.
The main Fraud Detection applications of Machine and Deep Learning today
achieve better sophistication in model outcomes and, by virtue of cloud
connectivity, have embedded auditability to prevent fraud out of the box. These
applications can detect a wider range of fraud incidents by combining machine learning
with an advanced rule engine, identifying along the way:
• Duplicate payments
• Duplicate invoices
• One-time payments
• Entities of interest
Besides the above, Fraud Detection bound to a company’s internal Data and AI
Strategy and fully integrated with internal and external data sources (claims systems,
watch lists, third-party systems) is now a matter of course, as are model features including
dynamic approaches to authentication flows in an age of multifactor authentication
(2FA, 3FA), handling of the specific fraud challenges associated with mobile channels, and
encoding of unstructured text in the feature engineering process.
In the lab that follows we will make use of an AWS CloudFormation template to
provision the multiple resource requirements. Note these resources do incur costs:
• $1.50 – one-time cost for Amazon SageMaker ml.c4.large instance
4
In the hands-on lab that follows the first S3 bucket contains example (PCA’d) credit card
transactions, while the second bucket is the post-modelling data for downstream analytics in
QuickSight (AWS’s BI platform)
Our goal in this lab is to build an end-to-end implementation of the AWS Fraud Detection
solution, from resource setup via a CloudFormation template to creation of dashboard
analytics via QuickSight.
The operational process for this lab, from Data Import to QuickSight Analytics, is extensively
described in the section above and summarized in the process map below:
https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?&templateURL=https:%2F%2Fs3.amazonaws.com%2Fsolutions-reference%2Ffraud-detection-using-machine-learning%2Flatest%2Ffraud-detection-using-machine-learning.template
a. Data Import
b. Data Exploration
c. Verify the Lambda Function Is Processing Transactions then proceed through the
modeling steps (SageMaker through Lambda to Kinesis):
i. Anomaly detection
b. CloudWatch logs
[Process map: Import & Data Exploration → Anomaly Detection → Fraud Event Detection
(Model Training) → Model Evaluation → Model Deployment (API). Recoverable detail from
the diagram:]
• Import & Data Exploration – connect to S3 (zip/csv), download and read the credit card
fraud data into a pandas dataframe, explore the data and verify the class imbalance
(upsampling the minority class is crucial due to the relatively low % of fraud cases),
separate features/target; little pre-processing of the data is needed.
• Anomaly Detection – cluster/anomaly testing with an unsupervised algorithm first, using a
10% holdout set, as labelled fraud data is difficult to come by; the idea is that the model
can eventually find anomalies based solely on the data features, using a highly scalable,
state-of-the-art anomaly detection algorithm.
• Fraud Event Detection (Model Training) – run as a post process after Anomaly Detection
once a sufficient amount of labelled data has been built up; train a model using XGBoost,
with Gradient Boosted Trees chosen for their proven track record, high accuracy and
scalability, and ability to deal with missing data; Synthetic Minority Over-sampling
(SMOTE) is compared against the base case.
• Model Evaluation – balanced accuracy score and Cohen’s Kappa are used to measure
performance on the unbalanced classes; Cohen’s Kappa increases as the fraud/no-fraud
threshold increases from 0.1 to 0.9; turn the threshold down if we don’t want to miss fraud
cases, or turn it up if we want to reduce the number of false positives.
• Model Deployment (API) – upload data to S3 and batch predict using the trained model;
deploy both RCF and GBT models to production; create a background HTTP requests thread to
an AWS Lambda function, which invokes both the unsupervised and supervised SageMaker
endpoints; output is logged to Kinesis and is also viewable via CloudWatch logs.
Organizations have often struggled with knowledge gaps around use of high-
performance, sophisticated planning, scheduling, and optimization software but AI
Engineers and Data Scientists today can employ advances in cloud computing, “visual”
machine learning5 and solver capability coupled with compelling front-end design
to engage a wider audience within the enterprise. Moreover, today’s AI solutions for
SCM can help operational staff understand and better manage the company’s delivery
logistics through finding answers to questions such as
• How to determine the location and capacity of warehouses
5
Drag and drop style Data Science or Machine Learning, or the use of No/Low code UIs
In this lab, we will deploy an end-to-end boilerplate Supply Chain Optimization project in IBM
CloudPak for Data:
a. https://dataplatform.cloud.ibm.com/exchange/public/entry/view/427846c7e99026edd5fa0022830bc002?context=cpdaass
2. Create a project
i. Suppliers
ii. Manufacturers
iii. Warehouses
ii. Optimize cost to ship parts (from four suppliers to three manufacturers) and deliver
products (from three manufacturers to four warehouses)
7. Stretch Exercise – schedule a once per month retrain and model run with
new data
Smaller-scale, or “Tiny AI” solutions are faster to deliver while single-cloud “Big
Tech” solutions have become in some sense “too big” and too costly. The productivity
benefits of AI have become too difficult to ignore and might be expected to give way to
a vicious culture of disruption especially if performance benchmarking against legacy
processes suggests a compelling case.
Risk/asset management tools, such as customer risk engines with embedded
machine learning through to “push-button” customer/load/price forecasting with fbprophet or
recurrent neural networks, are starting to become commonplace, if not as production-
grade solutions, then as prototype Python/Dash scripts. Back-office CRPA and document
retrieval solutions are in evidence, for example, for gas/electricity policy search,
settlements, and billing as well as contract management. And trade support solutions
including algorithmic trading and automated market analysis and intelligence with
support for “what-if” scenario analysis are being used to boost front-office profitability.
Many of the same Predictive AI tools mentioned above in the retail sector are also
proving valuable in the energy sector such as “live” customer decision engines with
social media sentiment analysis, chatbots and IVAs, automated customer service
delivery and dashboards and reporting solutions reflecting wholesale market dynamics
and energy news. Similarly, the same supply chain optimization solutions with robust
solver capability mentioned in the previous section are often applicable to delivery
planning, optimized scheduling, and routing from upstream energy sources through
transmission and distribution channels to suppliers and down to customers.
Edge AI is another exciting application area which can in time capitalize on
increasingly decentralized, distributed asset generation and consumption (e.g., wind
turbines, solar panels, heat pumps, and small-scale battery storage).
HR Solutions
For our final focused AI use case, our attention now turns to a key business function
served by AI in 2022 – Human Resources.
6
For example, Chatbot corpus creation via consultant annotation of medical publications
and reports
7
Frequent patient trials coupled with recurring model retraining can also lead to continual
improvement in IVA-supported patient outcomes
8
In certain cases delivered as services by AI startups disrupting the healthcare
sector, such as Atomwise and Deepcell. See e.g. https://venturebeat.com/
ai/6-ai-companies-disrupting-healthcare-in-2022/
HR in 2022
The HR Analytics market is forecast to grow into a $6.3b market at 14.2% CAGR as
employers grapple with an uncertain postpandemic future and awakening employee
sentiment and attitudes to work. But only 21% of HR leaders believe their organization is
effective at using data to inform HR decisions.9
In an era of shifting demographics and uncertainty wrought by digital disruption,
Covid, and now spiraling living costs, HR executives are looking for accessible data and
HR analytics solutions for better talent management and support, for reducing bias in
recruitment and in the workplace, and to help improve employee performance and
attrition rates.
If Enterprise AI is to be successful, a company could do worse than starting with
challenges in Human Resources and where Predictive Analytics can help. As shown in
Figure 8-17, we see these current challenges (and associated HR analytics solutions)
falling into one of three areas mirroring the employee journey: Recruitment, Talent
management/Human Capital Management (HCM), and Employee Experience.
9
Gartner
As we shall see below, managing the data deluge and ensuring performance metrics
are nonintrusive is the main challenge to creating a 360-degree employee analytics
solution today. AI solutions for Human Resources can help employees manage a
balanced workload while informing HR functions who is proactive, who wants to do
more, who is close to burnout, and who needs more help or support. If used in the right
way, these metrics can also lead to employee task-based efficiency and improved focus,
through defining achievable goals and helping employees achieve them.
Sample HR Solutions
Dashboards are everyone’s favorite analytics tool, and especially for HR Execs, vital to
cover various aspects of the employee journey, including hiring KPIs, employee on-the-
job metrics including workforce productivity and remote working,10 attendance11 and
disciplinary oversight as well as employee churn and company feedback.12
One of the best HR Dashboards on the market is from Agile Analytics – with 20+
descriptive and predictive PowerBI built-in reporting views.
As shown in Figure 8-18 there are multiple machine learning applications including
OCR, NLP, and text analytics applied to candidate information (LinkedIn, CVs). Deep
Learning can help with job description (JD) scanning (or NLG-support JD creation) and
drawing up employee contracts while NLP-supported bias stripping can also help ensure
equitable recruitment policies.
10
Integrated with, for example, Teams and able to display, for example, sign in/off times, email,
calls and meetings KPIs, application usage periods, employee location/dual-location, multi/
cross-platform productivity
11
Including holiday/sickness calendar integration
12
For example, Glass Door
No/Lo code UIs are growing in popularity among HR managers, giving nontechnical,
intuitive access to predictive analytics and employee modeling. Chatbots and IVAs
can help with real-time “Employee Experience,” feedback on organizational changes
and company culture, take employee mood and sentiment “temperature-checks,”
and identify potential employee attrition. Today’s IVAs can also come with search
automation13 and are integrated with back-end systems for, for example, internal job
vacancies or company policy retrieval.
For onboarding, internal content personalization for employee professional
development can be supported with recommendation engines based on demographics,
skillsets, and psychometric scores. Anomaly detection can help identify amber or red
flags in employee attitudes and behavior.
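As a purely illustrative sketch (the metrics, values, and contamination rate below are hypothetical, not taken from this chapter's labs), an unsupervised detector such as scikit-learn's IsolationForest could be used to flag unusual activity patterns for HR review:

import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical aggregated (and anonymized) employee activity metrics
df = pd.DataFrame({
    "weekly_hours":      [38, 41, 39, 62, 40, 37, 58],
    "meetings_per_week": [9, 11, 10, 25, 8, 9, 22],
    "leave_days_taken":  [12, 10, 14, 2, 11, 13, 1],
})

# Fit an Isolation Forest and flag outliers (-1 = anomalous, 1 = normal)
model = IsolationForest(contamination=0.2, random_state=42)
df["flag"] = model.fit_predict(df)

print(df[df["flag"] == -1])  # candidate "amber/red flag" records for review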
13 For example, Watson Search as a value-added service on top of Watson Assistant
Using Python in Jupyter Notebook, the goal of this exercise is to run AutoAI to perform feature
engineering and algorithmic benchmarking on an employee attrition dataset:
1. Log in to IBM Cloud Pak at the link below and create a project
https://eu-gb.dataplatform.cloud.ibm.com/home2?context=cpdaas
7. Exercise – test whether the predictions make sense, for example, by entering data for an employee on a relatively low monthly income, with limited years of service
Figure 8-19. Performing AutoAI on Employee Attrition data with IBM Cloud Pak
14 researchandmarkets.com
15 marketsandmarkets.com
Manufacturing
Much of the use of AI across Supply Chains described earlier also covers manufacturing
use cases with Manufacturing 4.0 intended to bring together IoT, cloud, and analytics to
revolutionize the production and distribution process.
Safety stock levels, delivery schedules, supply chain logistics, and underlying expenses are all optimized parameters under an aspiring, if not yet functioning, Smart Manufacturing Solution. Warehouse use of blockchains is also increasing, with the intention of improving auditability and making supply chains more efficient and reliable.
AI simulation in theory should help reduce product recalls by reducing manufacturing process rigidity, dynamically correcting product flaws, and encouraging product improvisation. Similarly, knock-on impacts of equipment/machinery failures on delivery schedules, budgets, and reputation are better managed today using AI applications. Predictive Maintenance in particular, coupled with statistical analysis of big (often IoT) data, leads to better scheduled maintenance (without impacting delivery) and reduces future failures.
Given this digital reliance and big data dependency, no Manufacturing 4.0 solution
today is complete without addressing cyber security concerns – the subject of our next
subsection.
Cybersecurity
There is no greater challenge to business and organizations in 2022 than Cybersecurity.
The Internet has well and truly penetrated every industrial system today, both at a local (production) level and across global supply chains. With it has come a huge increase in the risk of cyberattack from hackers trying to gain access to information or, worse, take control of sometimes highly strategic facilities tied to national security. Besides critical Defense in Depth layers of security and provision of robust firewalls, machine and deep learning are being leveraged to protect against these highly disruptive, real-time, dynamic cybersecurity attacks.
We have already taken a look in Chapter 7 at a cybersecurity use case in our hands-on lab, deploying an API endpoint for a machine learning model trained to detect DDoS attacks. Readers are referred back to that lab and in particular the model training process (step 4), which shows how various network parameters are used as features to identify patterns in network activity that might constitute, for example, Raindrop malware from a network intrusion event.16
Insurance/Telematics
Insurers' business models are changing with AI rapidly reshaping the sector and
transforming multiple client and internal services:
• Risk assessment and risk management to risk-based pricing
• Personalized services
Legal
The legal sector is experiencing something of a transformation from the use of AI
technologies, particularly in relation to research activity and contracting.17
16 Raindrop Malware was discovered in the Dec-20 SolarWinds supply chain attack
An exhaustive analysis is beyond the scope of this document; however, we lay out in Figure 8-21 one key integrated application which is being used to help digitalize legal documents and support the search, retrieval, and archive process – Document Classification and Text Extraction with OCR (optical character recognition).
Figure 8-21 breaks the pipeline down into five modules, their key technologies/integration points, and their main features:
• Import Documents (GCP / Azure / AWS) – identify archived documents in the ECM/EDMS system, establish a data pipeline to the document source, import zipped images from source (on-prem or cloud), and unzip tar files in preparation for scanning.
• Scan Documents (OpenCV / cv2) – create a document scanner using the OpenCV library of Computer Vision programming functions, detect image edges and contours, and obtain a top-down view of the document by applying a perspective transform.
• Prepare Image Data (Keras ImageDataGenerator, Google Colab) – establish document type classes, partition data into training/validation/test images, use the ImageDataGenerator class of Keras to pre-process and load training data "on-the-fly" rather than in-memory (avoiding overloading of all data), and convert image files into a TensorFlow dataframe format.
• Deep Learning (VGG16, InceptionResNetV2, DenseNet) – a Convolutional Neural Network approach used to learn image data following cross-validation on multiple CNN architectures, with a Cyclic Learning Rate (CLR) technique employed to address non-convex loss functions; +90% accuracy is achievable using VGG16 on document classification.
• Text Extraction (PIL, pytesseract, pyimagesearch) – deep OCR-type classification to isolate subtext concepts and textual themes; after resizing images, text is extracted using the PIL and pytesseract libraries and fed to a user-defined function to identify document name, etc.
We have one more lab in this chapter – walking through implementation of one of the most
exciting developments in AI – OpenAI’s DALL-E.
Register an account with OpenAI https://openai.com/join/ and access the API Key
provided under “Personal” in the top right corner of your dashboard.
https://github.com/bw-cetech/apress-10.9.git
Copy and paste the OpenAI API Key inside the double quotes of the string defined in the
openai_credentials.py file
Drag and drop the OpenAI credentials file to Colab temporary storage
Run the cell to call the DALL-E model function using the example given of a cow holding a
frying pan
NB calling the API function requires a paid account18 – make sure you have followed the
instructions in the notebook to create this first
Exercise (stretch) – modify the code to create an image of an astronaut on a bicycle
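For reference, a minimal sketch of the kind of call the notebook makes is shown below – this assumes the 2022-era openai Python package and its Image endpoint, and that the credentials file exposes a string variable (named OPENAI_API_KEY here purely for illustration); follow the actual notebook for the exact code:

import openai
from openai_credentials import OPENAI_API_KEY  # assumed variable name in the credentials file

openai.api_key = OPENAI_API_KEY

# Request a single 256x256 image from DALL-E via the Images endpoint
response = openai.Image.create(prompt="a cow holding a frying pan", n=1, size="256x256")
print(response["data"][0]["url"])  # URL of the generated image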
Wrap-up
The legal perspective above brings this chapter to a close. From a look at the digital (and cybersecurity) challenges faced in the Telco and Retail sectors, Financial Services, Insurance and Government, across manufacturing, supply chain and energy industries, in healthcare organizations, and across HR functions, our trawl through AI use cases here sets the scene for our final chapter19 on productionizing AI.
Despite the very different nature of these vertical industries, AI solutions are
often surprisingly similar, “horizontal” in their application, and recyclable. Adopting
economies of scale makes sense. As such, this is the approach we will take in the next
chapter – not just to describe the end-to-end processes and implementation actions to
develop, build and deploy robust plug-and-play AI solutions, but additionally in our final
two end-to-end deployment labs in this book.
18 Costs are modest and lower for smaller scale images (256x256 pixels)
19 Chapter 10 on NLP is our last chapter, but Chapter 9 will complete the main theme of this book on Productionizing AI
CHAPTER 9
Deploying an AI Solution (Productionizing and Containerization)
Our second last chapter attempts to ask these questions and provide a practical look
at “joining the dots,” addressing the barriers and simplifying the challenges to full-stack
deployment and productionization of Enterprise AI on Cloud.
Journeying from beta application to production and ultimately hosting apps
on cloud in our hands-on labs, we start by revisiting the project lifecycle and agile
techniques in development, delivery and testing phases of an AI project. We look at
mapping the user journey with best practice and define frameworks for success, as well
as process optimization and integration of the leading AI tools.
After another look at distributed storage, parallelization, and optimizing compute
(and storage) in the context of AI application scaling and elasticity, we end the chapter
with a couple of hands-on labs around two of the best ways to productionize an AI
solution – containerizing an AI app on Azure and hosting on Heroku.
1 Data buried away in one department
Cloud/CSP Roulette
Whether it’s a start-up, SME, or corporate, implementable Enterprise AI solutions
generally have to overcome five key constraints:
• Innovation
Most AI solutions today have significant “solution concentration” around the three
main CSPs of AWS, Azure, and GCP (Figure 9-2). Whether it's storage or compute,
there is likely to be one or two services/resources provisioned on Amazon, Microsoft,
or Google.
With this in mind, planning around and scoping the project in four distinct phases
can help steer an AI project toward an agreed stakeholder solution:
• Adoption
a) Data preprocessing
b) Algo tuning
d) Model selection
e) Deployment
Integrating two important components of an AI solution – Data Storage and Python, the goal of
this lab is to use Python to write to and then read back from a SQL database:
2. Create a folder on your local drive C:\sqlite and unzip the two files above
sqlite3
sqlite3 test.db
https://github.com/bw-cetech/apress-9.1.git
a. Create database
b. Create table
c. Insert data (single record)
e. Import the “HRE-short.csv” file downloaded from the GitHub link above to Python and
then bulk export to SQLite
f. View the data in DB Browser and
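As a minimal sketch of steps a–e (the table name, column names, and CSV path below are illustrative assumptions rather than the exact ones used in the cloned repo):

import sqlite3
import pandas as pd

# a/b. Create (or open) the database and create a table
conn = sqlite3.connect(r"C:\sqlite\test.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# c. Insert a single record
cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Jane Doe", 52000))
conn.commit()

# e. Bulk export the downloaded CSV to SQLite via pandas
df = pd.read_csv("HRE-short.csv")
df.to_sql("hr_data", conn, if_exists="replace", index=False)

# Read back to confirm the write worked
print(pd.read_sql("SELECT * FROM hr_data LIMIT 5", conn))
conn.close()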
In this lab we look at deploying a simple machine learning model on Google Cloud Platform
4. Clone (in Cloud Shell) the sample model at the GitHub link below:
https://github.com/opeyemibami/deployment-of-titanic-on-
google-cloud
gcloud init.
https://heartbeat.comet.ml/deploying-machine-learning-
models-on-google-cloud-platform-gcp-7b1ff8140144
This short lab takes the previously completed “Python Data Ingestion – Met Office weather
data” lab from Chapter 3 and creates a PowerBI front-end from the Python-scraped output
1. If not already installed, download PowerBI desktop from the link below:
https://powerbi.microsoft.com/en-us/downloads/
and follow the steps here to set the Python path in PowerBI:
https://docs.microsoft.com/en-us/power-bi/connect-data/desktop-
python-scripts
2. Paste the completed script from the Met Office Data Ingestion lab in Chapter 3,
under Data > More > Other > Python script in PowerBI
3. Create a line chart showing the minimum and maximum temperatures for the
next 5 days. Note you will need to:
Figure 9-4. Design Thinking, Lean and Agile (Source: Jonny Schneider)
2 Source: DataKitchen
• Error rates – over time these should decrease as testing matures and
the number of tests increases
3 Chief Data Officer
Data drift
Any continuous improvement cycle for delivering and deploying an AI solution also needs resiliency in the face of post-deployment interface/data changes, including robust data drift mitigation and model re-training.
Degradation in model performance over time is usually attributed to one of (a) data
(or covariate) drift or (b) concept drift, that is, when data distributions have deviated
significantly from those of the original training set:
• Data drift refers to feature drift and the possibility that relationships
between features change over time
Figure 9-7. Model degradation over time as a result of data drift (Source: www.
kdnuggets.com)
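As a minimal, illustrative sketch of how feature-level drift might be flagged (the synthetic data and the 0.05 significance threshold are assumptions, not prescriptions from this chapter), a two-sample Kolmogorov–Smirnov test can compare the training distribution of a feature with recent production data:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)   # distribution seen at training time
live_feature = rng.normal(loc=0.4, scale=1.2, size=5000)    # recent production data (shifted)

# Two-sample KS test: a small p-value suggests the distributions differ, i.e. drift
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.05:
    print(f"Drift detected (KS statistic={stat:.3f}, p={p_value:.4f}) - consider retraining")
else:
    print("No significant drift detected")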
Automated Retraining
Once we have started tracking data drift, we need to implement a process to ensure our
model remains valid/performant. Ideally, this should be automated – to implement an
automated retraining process based on data/model drift, there are two main options:
• Scheduled retraining
• Performance-based/Dynamic retraining
Our next lab takes a boilerplate app and pushes it to cloud as a Heroku-hosted application:
2. Create a new app in Heroku in the Europe region, for example, my-heroku-app
4 MLFlow (https://mlflow.org) is an open source platform developed by Databricks to help productionize models by managing the complete machine learning lifecycle with enterprise reliability, security and scale.
https://github.com/bw-cetech/apress-9.2.git
7. After cd’ing into the cloned app on your local drive, login to Heroku from your
terminal with “heroku login” then push to Heroku by entering the following
commands in sequence:
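The exact commands are listed in the lab notes; a typical sequence (assuming the cloned folder is already a git repository and the app name matches the one created in step 2) looks like this:

heroku git:remote -a my-heroku-app
git add .
git commit -m "initial deploy"
git push heroku main
(use "git push heroku master" if the repo's default branch is still master)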
Whether our solution is served by a thick client application5 or a thin client,6 key storage (or data), such as a Data Lake or NoSQL database, and compute, such as a Virtual Machine or Apache Spark, will serve the underlying AI services and tools we use in developing and ultimately deploying our project. Figure 9-8 describes this ecosystem of supporting infrastructure for AI.
5 For example, a local install, Jupyter notebook, GitHub Desktop, etc.
6 Web browser, Colab-based notebooks, GitHub, etc.
• Employ a Data Lake-Centric Design – view the data lake as the single source of truth
• Separate Compute from Data – achieve cost savings
• Minimize Data Copies – reducing governance overhead
• Determine a High-Level Data Lake Design Pattern – support user hierarchies and compliance
• Stay Open, Flexible, and Portable – keep architecture change/future-proof, support multicloud
7 In production, whereas development and staging environments may include connectors to legacy data sources
8 "To-Be" production architecture, whereas development, staging, and initial production environments may include stopgap connectors to legacy data sources
Dask
Dask has fewer features than Spark and is smaller and therefore more lightweight. It also lacks Spark's high-level optimization of uniformly applied computations, but Dask does have its advantages:10
• Moves computation to the data, rather than the other way around
Dask works with two key concepts: delayed/background execution and lazy
execution for stacking transformations/compute for parallel processing. We will look at
both of these in a hands-on lab below.
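As a minimal sketch of delayed/lazy execution (the functions here are toy examples rather than the lab notebook's), dask.delayed stacks up a task graph which is only executed when compute() is called:

import dask
from dask import delayed

@delayed
def load(x):
    return x * 2          # stand-in for an expensive load step

@delayed
def combine(a, b):
    return a + b          # stand-in for an aggregation step

# Nothing runs yet - Dask just builds a task graph of the stacked transformations
graph = combine(load(10), load(20))

# compute() triggers (potentially parallel) execution of the whole graph
print(graph.compute())    # 60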
9 See https://www.databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html for a neat comparison of Pandas, Koalas and PySpark runtime
10 There are some potential implementation challenges, for example, Dask DataFrames are used to partition data and split across multiple nodes in a cluster but calculation with compute() may run slowly and out of memory errors can occur when the data size exceeds the memory of a single machine
Using Python, the goal of this exercise is to interface to one of the most important storage resources on cloud: Amazon Simple Storage Service (S3):
1. Clone files from the following GitHub link, and unzip into a local directory:
https://github.com/bw-cetech/apress-9.3.git
https://console.aws.amazon.com/iam/home?#/security_credentials
5. Push (upload) your local files to your S3 bucket on AWS by following the
steps below:
f. cd into folder where the local (unzipped) files have been extracted
g. sync the data to your S3 bucket with:
The downloaded GitHub data has now been pushed to AWS and is stored in your S3 bucket
6. Exercise - try to connect to S3 from Python to download the images. Check your answer with the notebook "AWS-S3-Download.ipynb" provided in the GitHub link above
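A minimal sketch of the exercise using boto3 is shown below (the bucket name and key prefix are placeholders for whatever you created in the earlier steps):

import boto3

s3 = boto3.client("s3")  # credentials picked up from your AWS CLI configuration

bucket = "my-apress-bucket"          # placeholder - use your own bucket name
prefix = "images/"                   # placeholder - the folder synced in step 5

# List the objects under the prefix and download each one locally
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    key = obj["Key"]
    local_name = key.split("/")[-1]
    if local_name:                   # skip the "folder" placeholder key itself
        s3.download_file(bucket, key, local_name)
        print(f"Downloaded {key} -> {local_name}")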
In this lab we revisit Databricks and Apache Spark to compare runtimes on an IoT dataset
NB due to the 61 MB file size, you may need to copy and paste the json file
contents into a .txt file and save locally as a json first
DASK PARALLELIZATION
Following the steps shown in the process diagram below, the goal of this exercise is to
familiarize ourselves with (Python-based) big data processing using Dask:
4. Walk through the code sample “dask.delayed.ipynb” to get a feel for how
Delayed Execution works in Dask
NB to see how complex extract, transform and load (ETL) over a cluster works see the
gif in the follow-on notebook:
https://github.com/dask/dask-tutorial/blob/main/01x_lazy.ipynb
5. Exercise – do the same with Lazy Execution, that is, run the 01x_lazy.ipynb
notebook
6. Compare runtime reading the README.md file word by word vs. line by line
11 See also Chapter 2 on DataOps
Unfortunately building an AI app in this way can be rather cumbersome due to the
high dependency on local file and library configuration. This is where containers can
help – providing a means to ringfence dependencies and ensure there are no conflicts
when deploying across different end-user systems.
Each pipeline in itself is a set of iterative stages typically defined by a set of files
and implemented using a range of tools, from scripts, source code, algorithms, html
configuration files, parameter files, and containers.
And code of course controls the entire data-analytics pipeline from end to end,
effectively from ideation, design, and requirements capture through training, testing,
deployment to operations, and postproduction maintenance.
Wrap-up
With that look at continuous AI delivery, we have reached the end of our journey through how to deploy an AI solution. The two labs that follow are intended to give a more "immersive" experience of productionizing AI, with the end goal being a "full-stack" AI application.
Although this chapter brings us to the end of the main theme of this book, we do have one further chapter to go. Largely due to step-changes in enabling technology (particularly transformers) and heightened focus on unstructured data, Natural Language Processing has carved out its own field of interest and is the subject of our final chapter.
Our last two labs are end-to-end application deployments. This one trains a model first with Google Teachable Machine, uses Streamlit to create a front-end for inference, then deploys to Heroku as a hosted app:
1. Go to https://teachablemachine.withgoogle.com/train/image and create a trained model as described in steps 1-5 under "Steps to create the model and app" at the link below:
https://towardsdatascience.com/build-a-machine-learning-app-in-less-than-an-hour-300d97f0b620
2. Download files from "C:\Users\Barry Walsh\Testing\xray-automl\Streamlit-Heroku-setup.zip" and unzip to a test folder on your local drive
5. Exercise: finally repeat the steps in the lab above “Hosting on Heroku – End-to-
End: Hands-on Practice” to deploy as an endpoint solution on cloud
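A minimal sketch of what the Streamlit front-end might look like is shown below – it assumes the Teachable Machine export produced keras_model.h5 and labels.txt files (names may differ in your export) and is illustrative rather than the exact app in the setup zip:

import numpy as np
import streamlit as st
from PIL import Image
from tensorflow.keras.models import load_model

model = load_model("keras_model.h5")                      # exported from Teachable Machine
labels = [line.strip() for line in open("labels.txt")]    # class names exported alongside the model

st.title("Image Classifier")
uploaded = st.file_uploader("Upload an image", type=["jpg", "png"])
if uploaded:
    img = Image.open(uploaded).convert("RGB").resize((224, 224))
    st.image(img, caption="Uploaded image")
    x = np.expand_dims(np.asarray(img) / 255.0, axis=0)   # scale pixels and add a batch dimension
    preds = model.predict(x)[0]
    st.write(f"Prediction: {labels[int(np.argmax(preds))]}")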
Using PyCaret in Colab, our final “marathon” lab trains an AutoML model to predict insurance
premiums, then exports the resultant pickle file. We then attach the trained model to a
boilerplate Flask app, run locally before Dockerizing as a container app and pushing to Azure
Container Registry for running as an (Azure) web app.
The process is described further in the section above “Deploying on Cloud with a Docker
Container” and is summarized below:
a. Install pycaret
https://github.com/bw-cetech/apress-9.4.git
and
d. Install dependencies
3. Install Docker
a. Create Dockerfile
c. Build image
12 Note the first part of this lab is our PyCaret AutoML Introduction to AI run through from the last lab in Chapter 1
a. Docker run
6. Authenticate with Azure Container Registry (ACR) and push/deploy our container
solution to cloud
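For orientation, a minimal sketch of steps 3–6 is shown below – the Dockerfile contents, image name, and registry name are illustrative assumptions; follow the exact files and names in the cloned repo:

# Dockerfile (a minimal sketch for a Flask app listening on port 5000)
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]

# Build and run locally
docker build -t insurance-app .
docker run -p 5000:5000 insurance-app

# Authenticate with Azure Container Registry and push (registry name is a placeholder)
az acr login --name myregistry
docker tag insurance-app myregistry.azurecr.io/insurance-app:v1
docker push myregistry.azurecr.io/insurance-app:v1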
NB don’t forget to clean up (stop and delete) resources (Azure Web App, ACR instance,
and any VMs), to preserve cloud credits/prevent running up cloud costs!
CHAPTER 10
Natural Language
Processing
Any book on operationalizing Artificial Intelligence today cannot ignore growth in
Natural Language Processing (NLP), with market size projected to be $43b in 2025.1
Ostensibly the branch of AI concerned with programming computers to understand
both written and spoken text, NLP applications in 2022 are being pushed further to do
much more. Qualitative analysis, contextual/domain-specific reasoning, and thought
leadership creation2 are all in scope and performance improvements are dramatic, even
potentially seismic if you believe Google’s Engineers.3
Perceived by many as a critical skill in the race to digitalize, much of the attention
on NLP is taking techniques and best practice learned from structured data wrangling
and predictive modeling processes and expanding the scope to unstructured data. The
implicit goal is to transform unstructured data into machine readable formats where we
can then carry out similar, if not identical processes carried out in standard ML/DL.
While the overwhelming focus in this book is on applications of machine and deep
learning in their own right, we cover in this last chapter the main themes for Natural
Language Processing, basic NLP theory, and implementation before addressing all the
important tools and libraries for deploying NLP solutions.
1 www.statista.com/statistics/607891/worldwide-natural-language-processing-market-revenues/
2 See e.g. https://hbr.org/2022/04/the-power-of-natural-language-processing
3 Advances in Google's chatbot LaMDA have led to a former senior engineer claiming the technology is "sentient." Sign up to test it here: https://aitestkitchen.withgoogle.com/
Many of these tools rely on machine and deep learning techniques described elsewhere in this book already, but we will go through in this final chapter the use of the main Python NLP library, NLTK, alongside other libraries such as PyTorch and spaCy used for solving NLP problems. We will also cover some of the well-known APIs which leverage NLP (such as the Twitter API) which are being applied to highly relevant customer journey and public perception use cases in 2022.
We will also take our usual hands-on, lab-based approach in this last chapter to automating the process of understanding complex language by identifying and splitting (parsing) words and extracting topics, entities, and "intents" – core subprocesses for many NL applications today, including sentiment analysis and chatbots/conversation agents.
Introduction to NLP
We start with a brief historical context and a look at a basic definition of Natural
Language Processing, its place in the wider “AI Ecosystem” and interaction with machine
and deep learning processes.
Our first section then proceeds to address how and why NLP is being applied at scale
in businesses and organizations worldwide before wrapping up on the NLP lifecycle and
in particular, a best practice roadmap for the sequence of tasks central to delivering a
successful NLP implementation.
NLP Fundamentals
Natural Language Processing is the branch of AI that deals with the interaction between
computers and humans using natural language. The objective is to read, decipher,
understand, and make sense of language in a manner that is valuable in some way to
end users and organizations as a whole. To do this, two key linguistic techniques are
adopted: syntactic analysis and semantic analysis.
Importantly, and linking back to the main theme of this book, NLP then applies Machine and Deep Learning algorithms to unstructured data, converting it into a form that computers can understand. This "overlap" is shown in Figure 10-1.
Figure 10-1. The overlap between Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP)
• Text Classification
• Topic Identification/Detection
• Translation
• Text Summarization
• Sentiment Analysis
The above are often confused with actual NLP applications – however in reality, business or organizational applications tend to incorporate one or more of the above goals, while the marketing or product names given to these NLP solutions often blur the line between goal and technique.
As an example, the business value from using Text Classification is typically driven
by the downstream customer – there is both “macro” value in, for example, sales emails
falling into a classification system (e.g., based on customer segment, product line, or
proposal stage) or “micro” value from the email being re-routed to specific departments
based on keywords identified in the email. Equally the same text classification may
trigger automated sending of a customer KBA or a chatbot response.
4 Also a data transformation step – see section below
Preprocessing/initial cleaning
• Regular expressions: remove symbols (e.g., “#” and “RT” from tweets)
5 See https://github.com/graykode/nlp-roadmap
Lexical analysis:
• Tokenization
Syntactic analysis:
• Handling contractions6
• Stemming
Semantic analysis:
• Lemmatization
6 Such as ensuring slang expressions are converted to their full equivalent: I'll = I will, You'd = You would, etc.
• Word2Vec (Google)
• GloVe (Stanford)
• FastText (Facebook)
One of the simplest applications of NLP, widely used by marketing teams across the
globe, is a Word Cloud. Although basic, the techniques used here are replicated in many
of the high-profile industrial applications currently in existence – this lab provides an
introduction to those techniques:
1. Clone the GitHub repo below to your local drive:
https://github.com/bw-cetech/apress-10.1.git
2. Go to your local folder where you have cloned the GitHub files and set up a
virtual environment using the commands below (one after the other):7
python -m venv env
env\Scripts\Activate
NB the Chinese characters in the name will not be rendered in terminal but the above
command will still run
4. If prompted, install the dependencies one by one with
pip install wordcloud
pip install jieba
5. A word cloud will generate with the Chinese characters from the China Daily
news extract provided in the GitHub folder
6. Exercise: run the code using instead the English data example (pasta recipe)
7. Exercise: replace the pig image template with a different Chinese zodiac image
(use images from, for example, www.astrosage.com/chinese-zodiac/)
8. Exercise (stretch): update the code to generate the word cloud based on the
current Chinese Year (tiger, rabbit, dragon, etc.)
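For orientation, a minimal word cloud sketch (independent of the repo's Chinese-language example and its image mask) looks like the following:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "ai cloud python data data model model model deployment production cloud"

# Generate the word cloud from raw text - word frequency drives word size
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()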
7 See e.g. Chapter 7 lab "Running Python from Terminal: Hands-on Practice" for further help with this
Preprocessing and Linguistics
Because we are dealing for the most part with unstructured text data, natural language processing is inextricably linked to linguistic structure.
Whether it's taking source data and performing initial preprocessing and cleaning tasks through the use of regular expressions, applying syntactic or semantic analysis, or implementing word embeddings to convert data into word vectors, having an understanding of linguistics helps in both framing the journey from a raw to model-ready data format and ultimately extracting insights.
This section covers the key NLP concepts, referenced to linguistics and broken down as described in the NLP lifecycle.
Preprocessing/Initial Cleaning
Regular Expressions
Regular expressions or “regex” are the bread and butter of unstructured data “string”
searches. Often employed as a first step after scraping/data import, the idea is to quickly
clean the data by searching for specific patterns in the data that we want to remove.
As tokenization and vectorization naturally follow this step, the aim is to remove punctuation, symbols (such as smiley faces or emoticons from text messages or hashtags from a tweet), and special characters (such as currency symbols or brackets) that are unhelpful for encoding text as word vectors.
The example below shows a simple pattern match implemented in Python – "re" is the library used8
import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")
8 See https://docs.python.org/3/library/re.html for a list of regular expressions
The "^" matches the start of the string, the three "." characters each match any single character, and the "$" matches the end of the string, so in this case "Search successful." is returned.
Lexical Analysis
The first of these steps, lexical analysis refers to the process of conversion of the text
data into its constituent building blocks (words, characters, or symbols according to the
underlying language).
Tokenization
We already mentioned one way to parse text when stripping text during the
preprocessing step. More commonly tokenization is performed using one of the tokenize
methods from the NLTK library, typically .word_tokenize to split sentences into words,
but depending on the goal, phrase tokenization may be required such as in the German
example below:
import nltk
nltk.download('punkt')  # the language-specific Punkt tokenizer models need downloading once

german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
german_tokens = german_tokenizer.tokenize('Wie geht es Ihnen? Gut, danke.')
print(german_tokens)
Syntactic Analysis
After obtaining our language building blocks from lexical analysis, the next step is to
transform the data using syntactic analysis; extracting logical meaning from the text
whilst considering rules of grammar.
Switch to Lowercase
The first step in syntactic analysis is to switch our data to lowercase using the built-in
Python function lower(). Because the same word in proper case (“Hello”) and lowercase
(“hello”) would be represented as two different words in vector space model, applying
the lower() function addresses sparsity, reduces the dimensional problem we are solving
and speeds up runtime.
9 https://huggingface.co/docs/tokenizers/index
10 While building from scratch is clearly computationally expensive, multiple GPUs can achieve savings via distributed training. See e.g. www.determined.ai/blog/faster-nlp-with-deep-learning-distributed-training
where:11
CC is a coordinating conjunction
RB is an adverb, like "occasionally" and "swiftly"
IN is a preposition/subordinating conjunction
NN is a singular noun
JJ is an adjective, like "large"
Essentially Markov Chains are being employed here, indicating (as shown in
Figure 10-4) the probability of a specific grammatical term following another word.
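A minimal POS tagging sketch with NLTK (the sentence is chosen purely for illustration) looks like this:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The quick brown fox occasionally jumps over the lazy dog")
print(nltk.pos_tag(tokens))  # list of (word, tag) pairs, e.g. ('fox', 'NN'), ('occasionally', 'RB')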
11 A more extensive list of POS Tags is here: www.guru99.com/pos-tagging-chunking-nltk.html
NLP applications, like all AI apps, are highly domain-specific, and while POS tagging and NER tasks can do the heavy lifting of annotating texts, there are always gaps with certain lexical terms. As such, both POS and NER tagging are often supported by a (manual) sector-specific entity curation and manual annotation process.
The below example shows how this process works with IBM Watson Knowledge
Catalogue, where an approved group of subject matter experts are able to highlight
words/terms and define entity types.
12 The NLTK application of NER involves a more complicated use of regular expressions to be
Handling Contractions
Contractions in NLP refer to ensuring slang expressions are expanded to their full
equivalent: I'll = I will, You’d = You would, etc. The purpose in this case is similar to
lower casing in that removing contractions before vectorizing helps with dimensionality
reduction.
Contractions can be implemented using the contractions library in Python.
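A one-line sketch with the contractions package (assuming it has been installed via pip install contractions):

import contractions

# Expand contracted forms before tokenizing/vectorizing
print(contractions.fix("I'll be there and you'd better come"))  # expands to something like: I will be there and you would better come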
Stemming
Stemming and lemmatization are closely coupled text normalization subtasks in natural language processing. Many languages contain words with the same underlying root or "stem" and stemming refers to the process of shortening these words to their root form, regardless of meaning – essentially here we are just stripping down the end characters to a common prefix, even if the prefix does not stand alone as a grammatical term.
Stemming is useful for sentiment analysis as the stem word can convey negative or positive sentiment.
Stemming is typically implemented using nltk.stem.
Semantic Analysis
Our final NLP transformation processes are essentially "semantic analysis" steps – improving on the logical and grammatical tasks from syntactic analysis, semantic analysis allows us to draw meaning from the underlying text – interpreting whole texts and analyzing grammatical structure in order to identify (context-specific) relationships between lexical terms. It is the final stage in natural language processing prior to vectorization.
Lemmatization
In contrast to stemming, lemmatization is "context-specific" and converts words to a meaningful root form. The inflected form of a word is considered, so a word such as "better" will be lemmatized as "good," while "caring" would be lemmatized as "care" (while stemming would result in "car"). While the nltk.stem PorterStemmer() function is used for stemming, lemmatization uses the WordNetLemmatizer() function.
Essentially an enhancement on stemming which considers semantics, lemmatization is more commonly used in more sophisticated applications of sentiment analysis, such as chatbots.
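A minimal sketch contrasting the two (note the WordNetLemmatizer needs a part-of-speech hint to map "better" to "good"):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                  # "studi" - crude suffix stripping
print(lemmatizer.lemmatize("studies"))          # "study" - a real dictionary form
print(lemmatizer.lemmatize("better", pos="a"))  # "good" - adjective lemma via WordNet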
Disambiguation
Disambiguation is the process of determining the most probable meaning of a specific
phrase when there is underlying ambiguity in its definition. On its own a word such
as “bank” has several meanings,13 the NLTK WordNet module allows us to identify
probabilistically the actual meaning from its usage in a broader text, although Python
also has a Word Sense Disambiguation wrapper (pywsd) which works with NLTK.
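Rather than the pywsd wrapper, a minimal sketch using NLTK's built-in Lesk implementation looks like this:

import nltk
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

nltk.download('punkt')
nltk.download('wordnet')

sentence = word_tokenize("I went to the bank to deposit my money")
sense = lesk(sentence, "bank")  # picks the most probable WordNet synset given the context
print(sense, "-", sense.definition())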
N-grams
The NLTK library also comes with an ngrams module. N-grams are a string of connected lexical terms – essentially continuous sequences of words or phrases. "N" here refers to the number of connected terms or words we are referring to: a bigram (2-gram) would be "United States," a trigram "gross domestic product." The order is important, as in matching "red apple" in a corpus, as opposed to "apple red."
13 See this link for a number of different definitions (synsets) for the word bank: https://notebook.community/dcavar/python-tutorial-for-ipython/notebooks/Python%20Word%20Sense%20Disambiguation
N-grams are heavily used in natural language processing as typically we are creating features from series of words rather than individual words themselves.
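A quick sketch with NLTK's ngrams helper:

from nltk import ngrams

tokens = "the gross domestic product of the united states".split()
print(list(ngrams(tokens, 2)))  # bigrams, e.g. ('gross', 'domestic'), ('domestic', 'product')
print(list(ngrams(tokens, 3)))  # trigrams, e.g. ('gross', 'domestic', 'product')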
Bringing together some of the techniques described in this section, the goal of this lab
is to scrape typical unstructured web data (here the SpaceX Wikipedia web page) and
walk-through preprocessing and lexical analysis steps in order to produce insights on
word count:
1. Download the Jupyter notebook from this GitHub repo:
https://github.com/bw-cetech/apress-10.2.git
c. Tokenize text
d. Count word frequency
3. Exercise – plot word frequency as a bar plot rather than a line plot
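A minimal sketch of the exercise (the token list is a placeholder for the cleaned SpaceX tokens produced earlier in the notebook):

import matplotlib.pyplot as plt
from nltk import FreqDist

tokens = ["falcon", "launch", "rocket", "falcon", "starship", "launch", "falcon"]  # placeholder tokens

freq = FreqDist(tokens)
words, counts = zip(*freq.most_common(10))

plt.bar(words, counts)       # bar plot instead of the default line plot
plt.xticks(rotation=45)
plt.ylabel("Frequency")
plt.show()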
15 This "phase" of NLP modelling can be seen as the equivalent process to encoding a structured dataset
Rule-Based/Frequency-Based Embedding
We start with simpler “frequency”- or “rule”-based approaches to vectorizing/encoding
our data.
16 See https://github.com/graykode/nlp-roadmap
17 Such as "gender"
One hot encoding may be simple to implement and works well for binary or discrete
variables but when considering the almost infinite permutations of lexical terms in a
large corpus of text, as well as the possibility of storing N-grams, not just single words,
the number of features involved quickly becomes unmanageable. Each word is encoded
as a one hot “vector,” so that a sentence becomes an array of vectors, and a passage of
text an array of matrices (so a tensor).
Count vectorization takes one hot encoded terms as columns in a document term
matrix and counts up the number of occurrences before storing nonnull values for
onward text summarization.
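A minimal count vectorization sketch with scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the red apple", "the apple is red", "gross domestic product"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse document-term matrix of token counts

print(vectorizer.get_feature_names_out())   # the learned vocabulary (columns)
print(X.toarray())                          # one row of counts per document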
18 Counted in sklearn with CountVectorizer or "hashed" with HashingVectorizer where the text tokens are mapped to fixed-size values. See e.g. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
TF-IDF
Otherwise known as Term frequency–Inverse document frequency, TF-IDF attempts to
address the inherent weaknesses of the bag of words method by employing a numerical
statistic to reflect how important a word is to a document in a collection/corpus.
TF here refers to how frequently a term appears in the document, but it is the IDF
element, which measures how “important” that term is which distinguishes the method
from the Bag-of-words model. For any specific word, its IDF is the log of the number of
documents divided by the number of documents in which that word appears, so a word
like “the” will have a low score as it will appear in all documents, and log(1) is zero.21
TF-IDF is the product of the Term Frequency and IDF score and tends to weight higher both (a) words which are frequent in a single document and (b) less frequent/rare words over the entire collection.
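A minimal TF-IDF sketch, again with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log", "economic growth slowed this quarter"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # rows are documents, columns are TF-IDF weighted terms

# Common words such as "the" receive relatively low weights; rarer, document-specific terms score higher
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[2])))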
For all its advantages over the BoW model, these deterministic, frequency-based
approaches are unable to scale to interpreting “context” – for these we need word
embeddings.
19 Instead of SVD, a variation of LSA, Probabilistic Latent Semantic Analysis (pLSA), builds a probabilistic model to generate data observed in the document-term matrix. Another alternative, Latent Dirichlet Allocation (LDA), is a Bayesian version of pLSA which approximates, and therefore better generalizes, document-topic and word-topic distributions
20 Or hidden; "latent" here can be thought of as the underlying or implicit "theme" of a document
21 The Luhn Summarization algorithm is based on TF-IDF, filtering out further very low frequency words as well as high frequency stop words which are not statistically significant
Word2Vec (Google)
Word2vec is probably the most famous word embedding model – it is the process used by the Google search engine to find similar text, phrases, sentences, or queries.22 A two-layer neural net that processes text by "vectorizing" words, Word2vec is based on two architectures: continuous bag of words (CBOW) and the skip-gram model.23
The CBOW architecture allows the underlying model to predict words based on
similar-context words, but ignores the order of words, much like in the rule-based BoW
model. The skip-gram model takes into account the order of the words and applies a
greater weight to words closer in context in vector space. The value in CBOW is primarily
the speed of execution, while skip-gram is more performant on infrequent words.
Document "similarity" for both Word2Vec and Latent Semantic Analysis is calculated using cosine similarity. Cosine similarity is the cosine of the angle between two vectors, computed as the dot product of the vectors divided by the product of their magnitudes – vectors pointing in near-identical directions score close to 1, while orthogonal (unrelated) vectors score 0.
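As a minimal sketch (toy vectors rather than real word embeddings):

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.82, 0.15])
pizza = np.array([0.1, 0.05, 0.9])

print(cosine_similarity(king, queen))  # close to 1 - similar "meaning"
print(cosine_similarity(king, pizza))  # much lower - unrelated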
Other Models
Word2Vec is by far the most common model for Word Embeddings, but there are many variants. Additionally, the open source projects GloVe, developed by Stanford, and fastText, developed by Facebook, also have their advantages.
Short for Global Vectors, GloVe model training uses word co-occurrence stats and
combines global matrix factorization and local context window24 methods. Its USP is in
finding relationships between words such as company-product couples, but its reliance
on co-occurrence matrix slows down runtime.
fastText is actually an extension of the Word2Vec model where words are effectively
modeled as an n-gram of characters.25 This approach means fastText performs better on
rare words as the underlying character n-grams are shared with other words but it is also
slower than Word2Vec.
22 As opposed to the original keyword searches performed in Google's earlier days
23 The two architectures/models can be viewed as variants to Word2Vec. Lda2vec is another variant which, as the name suggests, uses Latent Dirichlet Allocation where a document vector is combined with the key "pivot word" vector used to predict context in Word2Vec
24 Similar to CBOW
25 So for the word "apple," tri-gram vectors would be app, ppl, and ple word vectors and the word embedding for apple would be the sum of these
NLP Modeling
Although many of the processes described above implicitly leverage machine or deep
learning in order to achieve vector representation of text, word embeddings are typically
evolved to then perform natural language “predictions.” In this last subsection, we take
a look at the main predictive modeling techniques in NLP prior to moving onto our last
chapter on Python implementation and the main NLP use cases today.
Text Summarization
Many Natural Language Processing applications involve Text Summarization, if not as a
direct goal, then as an intermediate process. Text Summarization automates the process
using either extraction-based summarization where keyphrases are pulled from the
source document, or abstraction-based summarization involving paraphrasing and
shortening the source document. Abstraction-based summarization is more performant,
but more complex.
The algorithmic implementation for the extraction-based approach first extracts
keywords using linguistics analysis (e.g., PoS), then collects documents with
keyphrases27 before employing a supervised machine learning technique to take the
document samples with key phrases and build a model, with features such as the length,
number of characters, most recurring word, and frequency28 of keyphrases determined.
LexRank, Luhn, and LSA are all text summarization techniques previously
mentioned and accessible from the Python sumy library, as is KL-Sum which uses word
distribution similarity to match sentences with original texts.
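A minimal extraction-based sketch with sumy is shown below – the interface assumed here is sumy's standard LexRank summarizer; swap the summarizer class to try Luhn, LSA, or KL-Sum:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

text = "Long source document text goes here. It can span many sentences. Only the most central ones are kept."

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()

for sentence in summarizer(parser.document, 2):  # extract the 2 most representative sentences
    print(sentence)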
26 Where sentences are scored based on eigenvector centrality in a graph representation of sentences
27 In practice documents with keyphrase (positive) samples and documents without keyphrases (negative) are included to fit a binary classification model
28 Using, for example, TF-IDF for frequency-based summarization. Features can also be derived using Word2Vec for distance-based (vectorization) summarization
Topic Modeling
Closely coupled to text summarization is the concept of topic modeling. Many of the same algorithms mentioned above, specifically LSA, pLSA, LDA, and lda2Vec, are used for topic modeling – the underlying goal is to recognize words from the topics (or themes) present in a document or corpus of data,29 rather than the more laborious method of extracting words from an entire document.
Sequence Models
Sequence models are machine learning models used to interpret word sequences in
texts. Applications include sequence modeling of text streams, audio clips, and video
clips, where, like time-series data, Recurrent Neural Networks (specifically LSTMs or
GRUs) are used.
Sequence to sequence or seq2seq is probably the most well-known technique with
applications in machine translation, text summarization, and chatbots.
seq2seq is in fact a special class of RNN which uses an Encoder-Decoder
architecture. The encoder feeds the input data, sequence by sequence into an LSTM/
GRU network, producing context vectors (hidden state vectors) as well as outputs.30
The decoder (also an LSTM/GRU) initializes with the final states (context vector) of the
encoder network and generates an output on one forward pass. The decoder trains by
repeatedly feeding back into the decoder previous outputs to generate future outputs.
29 A corpus is the collection of unstructured data in the underlying data/document set
30 Outputs are discarded
31 Transformers are also the other component of OpenAI's DALL-E Generative AI model shown in Chapter 8
32 See www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3
33 https://betterprogramming.pub/creating-philosopher-conversations-using-a-generative-lstm-network-fd22a14e3318
34 And thought to be "sentient" by at least one Google Engineer
35 See "Large Example" here: https://pypi.org/project/bert-extractive-summarizer
We now take a look at some of the techniques discussed in this section, starting with
a look at implementing Word2Vec to vectorize unstructured data and plotting the
resultant Word Embeddings:
https://github.com/bw-cetech/apress-10.3.git
b. Clean the data using a combination of lexical, syntactic, and semantic techniques
3. Exercise – improve the visibility of word labels in the word embeddings plot
4. Exercise (stretch) – swap out the json intents data with a larger 10+ page pdf
and perform topic modeling before plotting word embeddings
Using Keras and LSTM deep learning, the goal of this lab is to implement a character-level sequence-to-sequence model to translate English to French36
https://github.com/prasoons075/Deep-Learning-Codes/tree/
master/Encoder%20Decoder%20Model
2. Open the Encoder_decoder_model with Jupyter notebook
NB you will need to change the path link to the training data from “fra.txt” to
“fra-eng/fra.txt”
b. Vectorize the data and display the number of unique input and output tokens and
maximum sequence lengths
4. Train the sequence model/LSTM on the training language samples – note this
may take up to 5-10 minutes to run through 100 epochs
5. The model is saved in the previous step. Reimport this and set up the decoder
model for sampling/testing the model
sequence-to-sequence-models/#:~:text=Sequence%20to%20Sequence%20(often%20
abbreviated,Chatbots%2C%20Text%20Summarization%2C%20etc for further context
6. Finally, test the model with a few sequences from the training set (up to 10000)
to see how the model translates English terms
7. Exercise – adapt the notebook to support a basic (in notebook) user interface
where the user’s phrase is translated to French
The goal of this lab is to predict which language a surname is from by reading in
text examples as a series of characters, then building and training a recurrent neural
network (RNN) with PyTorch
https://github.com/bw-cetech/apress-10.3b.git
5. Run through the notebook, importing the PyTorch RNN functions (in PyTorch_
Functions.py) to the notebook
b. Download the training data dictionaries for this problem (surnames for 18 different
languages), unzip and rename in your Jupyter working directory
37 Email servers block python scripts when sending, hence the file has been uploaded to GitHub
c. Display the last five names in the Portuguese dictionary to test the PyTorch custom
functions
d. Observe the model performance on the test set (as shown in the graphic below) and
export both the predictions and loss likelihood output (probability of the test sample
names belonging to a certain language) to csv files
Python Libraries
What are the main libraries for implementing Natural Language Processing? We outline
in Table 10-1 the main ones for general purpose NLP as well as a few of the specialist
libraries used for targeted industry applications.
• NLTK – Leading platform for NLP: sentence detection, tokenization, lemmatization, stemming, parsing, chunking, and POS tagging; UI to 50 corpora and lexical resources. Strength: versatility as the most commonly used Python library for NLP. Weaknesses: slow runtime, no neural networks.
• TextBlob – Access to common text-processing ops through TextBlob objects treated as Python strings. Strengths: data prep for NLP/DL, easy UI. Weaknesses: slow runtime, not for large scale production.
• CoreNLP – Stanford-developed assortment of human language technology tools for linguistic analysis. Strengths: speed, written in Java. Weakness: requires Java install as underlying language38
• spaCy – Designed explicitly for production usage, for developers to create NLP apps that can process/understand large texts. Strengths: big data handling and multi-foreign-language support. Weakness: lack of flexibility in comparison to NLTK.
38 sumy can be used for most text summarization techniques including LSA, Luhn, LexRank, and KL-Sum
Others with specific strengths, such as Pattern and PyNLPl (pronounced Pineapple), are also used for their web data mining and file format handling capabilities respectively, while sumy, pysummarization, and BERT summarizer are great for text summarization.42
Although not exclusively tools for natural language processing, two other important
Python libraries are the Twitter API43 and the facebookinsights wrapper for the Facebook
Insights API. The use cases for those – social media sentiment (and metric) analysis will
be discussed below while a hands-on lab is provided for the Twitter API in Chapter 8.
NLP Applications
Email/spam filters, word clouds,44 auto-correct in word processors, and (code) auto-
complete in programming are some of the earliest, and now established, applications
of natural language processing. But it is only relatively recently that the above Python
libraries, coupled with cloud, have unlocked higher-value natural language processing of
VAST volumes of unstructured data.
39 But provides an interface to python
40 TensorFlow also does NLP although perhaps not as widely for this purpose as PyTorch
41 See also Chapter 5
42 Gensim also supports training your own word embeddings, see e.g. www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec
43 Import tweepy or import snscrape. The latter doesn't require a Twitter Developer Account.
44 And as we have seen in the first lab in this chapter, relatively easy to implement
Whether it's extracting business value from unstructured data, deep document information search and retrieval, acceleration of internal research or due diligence processes, reporting productivity increases and content creation, or seeking synergies with overriding cognitive robotic process automation (CRPA) goals, companies are scrambling to establish in-house NLP capabilities to deliver value.
While many of the trending, "peta-scale" NLP accelerators may only just be starting to be employed, we take a look now at current levels of sophistication around mainstream NLP business and organizational applications.
Text Analytics
Text analytics or text mining involves the extraction of high-quality information from
data. Essentially an enabler for more complex unstructured data analysis including
text-to-speech and sentiment analysis, a key value-add is the ability to augment domain-
specific data/corpora in the training process45 for enhanced classification tasks.
Microsoft is probably one of the leaders here, specifically with Azure Cognitive
Services which encompasses a wealth of text analytics including Content Moderator
and Language Understanding (LUIS). IBM Watson Knowledge Catalog is also a leading
product.
Text-to-Speech-to-Text
Text to Speech (and Speech to Text) is now a well-established and somewhat crowded
market. Market leaders include Amazon Polly and the API Cloud Service IBM Watson
Text to Speech.46
45 See POS Tagging and NER in above section
46 See e.g. demo @ https://speech-to-text-demo.ng.bluemix.net/
Sentiment Analysis today goes beyond just simple positive and negative sentiment
metrics – listening in or monitoring (real-time) online conversation can trigger related
key insight analysis on discussion categories, concepts and themes, and emotion
detection.
All global retailers, FMCG industries, and the TELCO sector are reliant on
sophisticated capture of sentiment analysis outcomes – with most apps reliant on
accessing the Twitter API (via the Python tweepy or snscrape library) and/or Facebook
Insights (via the facebookinsights wrapper). A hands-on lab for social media sentiment
analysis is provided for the Twitter API in Chapter 8.
NLP 2.0
The limitless potential of natural language transformers has already been discussed in
the last section but we take a look below at other recent advances in state-of-the-art
(SOTA) NLP technologies expected to become core organizational apps in the
near future.
Debating
Sophistication in computational discussion, argumentation, and debating technologies
has reached a point where machines are able to credibly debate humans.48
IBM’s Project Debater is pitched as the “first AI system that can debate humans on
complex topics” and is composed of four core modules: argument mining, an argument
knowledge base (AKB), argument rebuttal and debate construction, where the first two
modules provide the content for debates.49 The tool uses similar sequence to sequence
47 Another tool, Automizy https://automizy.com/ is free and uses NLG for email marketing content
48 See e.g. www.technologyreview.com/2020/01/21/276156/ibms-debating-ai-just-got-a-lot-closer-to-being-a-useful-tool/
49 Debater datasets can be found at the link: https://research.ibm.com/haifa/dept/vst/debating_data.shtml
and attention mechanisms intrinsic to transformers and has favorable ratings for, for
example, an opening speech when compared with human (nonexpert) speeches and
other NLP transformers such as GPT-2.
Auto-NLP
Naturally, given the trends toward full automation in machine and deep learning, there
is much focus on automating the myriad of steps involved in the NLP lifecycle.
Hugging Face is a leader in this area; its AutoNLP tool integrates with the Hugging Face Hub's substantial collection of datasets and pretrained50 SOTA transformer models. NeuralSpace is another, with multilingual support to train models with AutoNLP in 87 languages.51
50 Including 215 sentiment analysis models. See https://huggingface.co/blog/sentiment-analysis-python
51 https://docs.neuralspace.ai/. NeuralSpace specializes in local languages spoken across Asia, the Middle East, and Africa.
The auto-nlp Python library also provides an abstraction layer and low-code
functionality over existing Python NLP packages (such as spaCy), in much the same way
as auto-sklearn provides low-code automation for sklearn machine learning. AutoVIML
(Automatic Variant Interpretable ML) is another Python library for Auto-NLP, automating
the preprocessing, linguistic analysis (stemming and lemmatization, etc.),
and vectorization steps.
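As a minimal illustration of the low-code end of this spectrum, the snippet below uses the open source Hugging Face transformers pipeline API (the library, not the hosted AutoNLP service) with one of the Hub's pretrained sentiment models; the model name and example sentences are given purely as an illustration.

# Minimal sketch: low-code sentiment classification with a pretrained Hub model
# via the Hugging Face transformers pipeline API.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier([
    "The cloud bill this month was a nasty surprise.",
    "Deployment went smoothly and the client is delighted.",
]))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}, {'label': 'POSITIVE', 'score': 0.99...}]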
Wrap-up
Moving on from these NLP trends, we complete this final chapter on natural language
processing with a couple of hands-on labs focused on some of the dominant
applications and tools in 2022. These labs bring us to the end of this journey and a
practical conclusion to Productionizing AI solutions with Cloud and Python. We do,
though, have some final words to close out on in the concluding few pages.
A recurring theme of this book has been the level of experimentation, and at times
(cloud) cost-based workarounds, employed to implement AI. Not every company
has the budget to meet the costs of high-performance compute instances or high-
throughput, secure storage. The current climate for AI solution implementation and the
concentration of market power may mean these challenges remain for several years to
come, but we look in these last pages at the possibility of game-changing innovations that
may allow the wider ecosystem to escape the gravitational pull of the CSPs.
As a Gartner Magic Quadrant leader for Enterprise Conversational AI, and with IBM
Watson users achieving a 337% ROI over three years,52 Watson Assistant is one of the
industry-leading tools for IVA automation.
The goal of this exercise is to create an HR IVA on Watson Assistant which uses growing
user interaction to re-train and improve responses to user questions on job applications
and internal company policy:
52 IBM Watson users claim a 337% ROI over three years.
2. Select the Assistant option on the top LHS of the screen > Create Assistant
3. Add dialog skill > Upload Skill and upload the json dialogue from the link below
https://github.com/bw-cetech/apress-10.4.git
4. Open the newly created Assistant and observe the user intents (these are the
themes of the user’s questions such as org structure, payroll, complaint, admin,
humor, etc.) and entities (the subject of the user’s questions such as team,
people, services, pay, extras: stocks/shares/pension, etc.)
5. The dialogue is preconfigured – test it with the dialogue below54
I may be dead soon, how is my life insurance, salary and pension paid to
my family?
no
goodbye
53 Change "eu-de" in the url to uk or us-south region depending on nearest data center.
54 The "Preview" option for the Assistant can be used for this, but the equivalent Skill UI is more elegant – repeat the json file upload directly to a new Dialog Skill created from the Skill option on the top LHS of the screen, then after creation, select "Try it".
6. Exercise – swap the dialog for a typical customer support IVA for an online
retailer
9. Exercise – set up an app routine which automatically re-trains the NLP model
each month based on the latest user conversations (a minimal sketch of pulling
these conversations from the Assistant API is shown below)
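For the retraining exercise above, the following is a minimal sketch (not part of the lab repository) of pulling recent user conversations from Watson Assistant with the ibm-watson Python SDK so they can be reviewed and folded back into training; the API key, service URL, and assistant ID are placeholders, and the logs endpoint and its schema vary by Watson Assistant plan and API version.

# Minimal sketch: fetching recent user conversations from Watson Assistant (v2 API)
# with the ibm-watson SDK. Credentials, region URL, and assistant ID are placeholders.
from ibm_watson import AssistantV2
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("<YOUR_WATSON_API_KEY>")
assistant = AssistantV2(version="2021-06-14", authenticator=authenticator)
assistant.set_service_url("https://api.eu-de.assistant.watson.cloud.ibm.com")

# Pull the most recent conversation logs (availability depends on your plan)
logs = assistant.list_logs(
    assistant_id="<YOUR_ASSISTANT_ID>",
    page_limit=100,
).get_result()

# Collect the raw user utterances for offline review and re-training
utterances = [
    entry.get("request", {}).get("input", {}).get("text", "")
    for entry in logs.get("logs", [])
]
print(utterances[:10])

Scheduled monthly – for example, via a cron job or an IBM Cloud Functions trigger – this kind of routine provides the raw material for the re-training loop.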
In our final lab, we use OpenAI’s pretrained GPT-3 (Generative Pretrained Transformer) models
to (a) generate a cooking recipe based on ingredients entered by a user (using “zero-shot”
training where no examples are provided to the model) and (b) implement a “sarcastic”
chatbot (using “few-shot” training where a limited number of examples are provided to train
the model):
https://github.com/bw-cetech/apress-10.4b.git
3. Copy and paste the OpenAI API Key inside the double quotes of the string
defined in the openai_credentials.py file
b. Import libraries
c. Drag and drop the OpenAI credentials file to Colab temporary storage
d. Define the GPT-3 transformer function – this will interface to the high-performance GPT-3
"text-davinci-002" engine
e. Call the function to create a recipe from a basic list of ingredients: apple, flour, chicken,
and salt
5. Exercise – change the ingredients to e.g. fresh basil, garlic, pine nuts, extra-
virgin olive oil, parmesan cheese, fusilli, lemon, salt, pepper, red pepper flakes,
and toasted pine nuts55 and call the function again to create a pasta recipe
6. Double-click the “receipe.txt” file that gets written to Colab temporary storage
to validate the recipe
7. Exercise (stretch) – modify the code to ensure the recipe generated from the
GPT-3 model isn’t truncated
8. Using the same GPT-3 transformer model, proceed to run the last two cells,
which provide contextual examples56 of "sarcasm" and then call the (same) GPT-3
transformer function. The function returns an (NLG-)generated "sarcastic"
response to the (empty57) last question "What time is it?" in the contextual
examples/chatbot text
55 Just update the recipe variable to: recipe = 'fresh basil, garlic, pine nuts, extra-virgin olive oil, parmesan cheese, fusilli, lemon, salt, pepper, red pepper flakes, toasted pine nuts'
56 That is, "few-shot" training – prompting a machine learning model to make predictions with a limited number of training examples.
57 Note that in the contextual examples the last response from "MARV" is deliberately blank – MARV's response is what we are trying to predict/generate.
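For reference, the kind of call the lab's helper function wraps looks roughly like the sketch below, which uses the legacy openai Completion API against the "text-davinci-002" engine; the API key is a placeholder and the prompts are illustrative rather than the exact ones in the repository.

# Minimal sketch of the GPT-3 calls behind this lab, using the legacy openai
# Completion API. The API key is a placeholder.
import openai

openai.api_key = "<YOUR_OPENAI_API_KEY>"

def gpt3_complete(prompt, max_tokens=256):
    """Send a prompt to the text-davinci-002 engine and return the generated text."""
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        temperature=0.7,
        max_tokens=max_tokens,
    )
    return response["choices"][0]["text"].strip()

# (a) zero-shot: no examples, just an instruction
recipe = gpt3_complete("Write a recipe based on these ingredients: "
                       "apple, flour, chicken, salt.\n\nRecipe:")
print(recipe)

# (b) few-shot: a handful of sarcastic exchanges prime the model, and the final
# (blank) MARV response is what we ask the model to generate
sarcasm_prompt = (
    "MARV is a chatbot that reluctantly answers questions with sarcasm.\n\n"
    "You: How many pounds are in a kilogram?\n"
    "MARV: This again? There are 2.2 pounds in a kilogram. Write it down.\n"
    "You: What time is it?\n"
    "MARV:"
)
print(gpt3_complete(sarcasm_prompt, max_tokens=60))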
Postscript
Wrap-up
There are many trends waiting to happen in AI. The speed of change in the last ten
years, since ImageNet,1 has been phenomenal. Creating a summary of what is likely
to commercialize in the next five years, never mind the next decades, is fraught with
uncertainty. Rather than attempting to predict the future, and in line with the practical
nature of this book, we will outline some of the more credible innovations having
“disruptive” potential.
Next-generation AI, or AI 2.0, is already here, bringing advancements in the
technology that address portability, accuracy, and security challenges.2 Hyperscalers,
or Data Centre operators that offer scalable cloud computing services, are expected to
leverage more transfer learning and reinforcement learning, and transformer networks
are expected to make AI smarter and more mobile. The problem is that hyperscalers tend
to be the same group of Big Tech companies currently providing storage and compute
services, together perhaps with Alibaba AliCloud, IBM, and Oracle.
1 See Chapter 1.
2 See https://www.forrester.com/report/AI-20-Upgrade-Your-Enterprise-With-Five-NextGeneration-AI-Advances/RES163520?objectid=RES163520. The NLP 2.0 section of our last chapter showcases additional AI 2.0 advances, especially in relation to natural language processing.
3 Both training and testing data. Appen see this transformation as having a flywheel effect for businesses struggling to overcome data challenges for AI: https://appen.com/solutions/training-data/
Epilogue
A great deal of frustration has gone into writing this book – frustration about the
dominance of Big Tech, but mainly frustration about the job market's sloping playing
field. When you are out of a job and an industry and there's no money coming in, you're
not good at self-promotion, and you're not young either… well, let's put it this way: the
world in 2022 doesn't care to notice, and I'm not sure it ever did. At times that frustration
has turned to despair – like trying to clamber back into a boat in a turbulent ocean, at
night-time.
Much of this job market pain has been channeled into writing this book. I hope as
well that, in some way, the suggestions around avoiding obscured cloud costs and feeding
the Big Tech machine further provide some practical help to readers. None of us should
be charged for testing their services (it's my job – I don't have a choice!), while learning
from poorly documented solutions (ok, not all are produced by Big Tech), and having no
successful use case at the end.
If there is any advice I can give to challenge cloud costs, it's to always request
itemized resource usage. Question that "EBS" instance you never knew you provisioned;
ask what the hell "D13 HDInsight," "D11 v2/DS11 v2 VM," "Basic Registry Unit," or
"Premium All Purpose Compute" is. And why DO we have to empty S3 buckets before
deleting them? I'm also certain the information that only the first 10 billing alarms on
CloudWatch are free is not as visible as it could be.4
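One practical way to get that itemized view without waiting for support is the AWS Cost Explorer API; the sketch below breaks last month's spend down by service via boto3. The dates are placeholders, and note that Cost Explorer API requests are themselves charged per call.

# Minimal sketch: per-service cost breakdown for one month via the AWS Cost
# Explorer API (boto3). Dates are placeholders; API requests incur a small charge.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-11-01", "End": "2022-12-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Print any service that actually cost something, largest surprises included
for group in response["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if amount > 0:
        print(f"{service:<45} ${amount:,.2f}")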
Certain tools and labs in this book are more problematic than others. Getting Kafka
to consume events and getting Databricks to talk to AWS MSK both come with an
abundance of errors and are a complete pain in the butt. The arbitrary rent charged on
the cloud often feels no different from costs imposed by a cartel, and support is generally
designed to prevent raising a complaint. Raise a ticket if you can – but watch for the
hard-coded default response to a question you didn't ask. And if you get through to the
final form for submitting to a human, well, expect to spend half an hour manually
reducing the length of the complaint to fit the word limit and finding and replacing any
characters that are not alphanumeric.5
At the end of the day, we all need them, but the new hyperscalers do get rich on this
behavior, charging individuals for using services they need to survive in the data
market. When the purpose is training, development, or testing,6 shouldn't the service
be free, especially when upskilling ends up requiring cloud-native accreditation or drives
further B2B cloud revenues, from which Big Tech already make a huge amount of money?
4 So if you want billing alerts to ensure you are keeping an eye on cloud usage, you will be charged for those as well.
5 If required, raise a billing dispute in AWS by going to https://console.aws.amazon.com/support/home#/case/create and selecting Account & Billing > Service: Billing > Category: Dispute a Charge. For Azure, choose new support request > billing > refund request at the following link: https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/overview
6 And not for production – which should be obvious/auditable to the CSPs from resources being deleted shortly afterward.
Index
A
Abstraction-based summarization, 342
A/B testing, 261
Activation functions, 28
    definition, 172
    hyperbolic tangent function (tanh), 173
    ReLU, 173
    sigmoid function, 173
    softmax, 173
Adadelta, 176
AdaGrad/Adaptive Gradient Algorithm, 176
Adaptability, 50
Agile
    adaptability, 50, 51
    benefits, 50
    development/product sprints, 48, 49
    react.js, 52
    teams/collaboration, 47, 48
AI application development
    AI accelerators, 221
    AI solutions, 212
    APIs/endpoints, 214, 215
    API web services/endpoints, 220
    apps, running, 213
    clusters, 215
    GPUs, 216
    IDC growth forecast, 212
    running Python, 219
    sharding, 217
    software tools, 222
    TPUs, 216
    virtual environments, 218
AI ecosystem
    agile delivery models, 9
    applications, 2
    automata, 3
    CSPs, 7, 8
    definitions, 4
    evolution, 3
    full-stack AI, 10
    hype cycle, 2
AI Ladder methodology, 76
AlexNet Deep Neural Networks, 135
Amazon API Gateway, 79
Amazon Elastic Block Store (EBS), 93
Amazon SageMaker Autopilot, 201
Amazon Simple Storage Service (S3), 35
Amazon Web Services, 7
Apache Cassandra, 89
Apache Hadoop, 81, 93, 94, 215
Apache HBase, 89
Apache Kafka, 10, 46, 105
Apache Maven, 62
Apache Parquet, 81
Apache Spark, 10, 95, 105, 159, 309
Apache tooling suite, 95
Application Programming Interface (API), 79, 214
Argument knowledge base (AKB), 353
Artificial General Intelligence (AGI), 6
G
Gartner Hype Cycles, 136
Gated recurrent units (GRUs), 166
Generative Adversarial Networks (GANs), 30, 152
GitHub, 54
Global search algorithms
    Bayesian optimization and inference, 191
    python libraries, 192
GloVe model, 341
Google BigLake, 93
Google Brain, 134
Google Cloud Platform (GCP), 7, 227
Google Cloud Storage, 35
Google Colab, 157
Google teachable Machine, 205, 314
Governance, Risk Management and Compliance (GRC), 265
Gradient descent, 174
Granular approach, 127

I
Image data augmentation, 181
Industry case studies
    AI enablers, 248
    business/organizational demand, AI, 248
    cybersecurity, 285
    insurance/telematics, 286
    legal sector, 286, 287
    manufacturing, 285
    public sector/government, 284, 285
    solution framework, 249–251
    use cases, 249
Information Retrieval and Extraction (IR/IE), 322
Insurers' business models, 286
Intelligent virtual agents (IVAs), 16, 188, 323, 352
Interactive Voice Response (IVR) applications, 16