
InGage Technologies Pvt Ltd

KG 360 Degree, 7th Floor, Plot 231/1 MGR Salai, Perungudi, Chennai 600096
www.myingage.com

Advanced Data Analytics and Visualization

Naan Mudhalvan Course for Arts & Science

Table of Contents

Chapter 1: FOUNDATION OF DATA ANALYTICS


1.1 Introduction to Data Analytics
1.2 Exploratory Data Analysis (EDA)
1.3 Applications of Big Data Analytics in Different Industries
Chapter 2: DATA EXTRACTION USING SQL
2.1 Introduction to SQL
2.2 SQL Query Lifecycle
2.3 Data Relationships & Keys
2.4 Data Handling and Preprocessing
2.5 An Introduction to APIs
Chapter 3: DATA ANALYSIS WITH PYTHON & PANDAS
3.1 Introduction to Python
3.2 Classes & Objects
3.3 OOPs Concepts
3.4 Pandas
3.5 Data Cleaning with Pandas
Chapter 4: DATA VISUALIZATION WITH LOOKER STUDIO
4.1 Introduction to Data Visualization
4.2 Creating Charts and Visual Elements
Chapter 5: PREDICTIVE ANALYTICS AND REPORTING
5.1 Introduction to Predictive Analytics
5.2 Tools and Technologies in Predictive Analytics

CHAPTER 1 - FOUNDATION OF DATA ANALYTICS

1.1 INTRODUCTION TO DATA ANALYTICS


Data analytics is the process of analyzing raw data in order to draw out
meaningful, actionable insights, which are then used to inform and drive smart
business decisions. A data analyst will extract raw data, organize it, and then analyze
it, transforming it from incomprehensible numbers into coherent, intelligible
information.

Why Data Analytics Using Python?


●​ Python is easy to learn and understand and has a simple syntax.
●​ The programming language is scalable and flexible.
●​ It has a vast collection of libraries for numerical computation and data
manipulation.
●​ Python provides libraries for graphics and data visualization to build plots.

What Is the Data Analysis Process?


Answering the question "what is data analysis" is only the first step. Now we
will look at how it is performed. The process of data analysis, or alternatively, the data
analysis steps, involves gathering all the information, processing it, exploring the
data, and using it to find patterns and other insights. The process of data analysis
consists of the following steps.

Data Requirement Gathering


Ask yourself why you’re doing this analysis, what type of data you want to
use, and what data you plan to analyze.

Data Collection
Guided by your identified requirements, it’s time to collect the data from your
sources. Sources include case studies, surveys, interviews, questionnaires, direct
observation, and focus groups. Make sure to organize the collected data for analysis.

Data Cleaning:
Not all of the data you collect will be useful, so it’s time to clean it up. This
process is where you remove white spaces, duplicate records, and basic errors. Data
cleaning is mandatory before sending the information on for analysis.

Data Analysis
Here is where you use data analysis software and other tools to help you
interpret and understand the data and arrive at conclusions. Data analysis tools
include Excel, Python, R, Looker, Rapid Miner, Chartio, Metabase, Redash, and
Microsoft Power BI.

Data Interpretation:
Now that you have your results, you need to interpret them and come up
with the best courses of action based on your findings.

Data Visualization:
Data visualization is a fancy way of saying, “graphically show your information
in a way that people can read and understand it.” You can use charts, graphs, maps,
bullet points, or a host of other methods. Visualization helps you derive valuable
insights by helping you compare datasets and observe relationships.

Data Analysis Methods:


Some professionals use the terms "data analysis methods" and "data analysis
techniques" interchangeably. To further complicate matters, people sometimes throw
the previously discussed "data analysis types" into the mix as well. Our hope here is
to establish a distinction between what kinds of data analysis exist and the various
ways they are used.

Descriptive Analysis:
Descriptive analysis involves summarizing and describing the main features of
a dataset. It focuses on organizing and presenting the data in a meaningful way, often
using measures such as mean, median, mode, and standard deviation. It provides an
overview of the data and helps identify patterns or trends.
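
As a small illustration (not part of the original material), the following Python sketch uses pandas with made-up sales figures to compute the summary measures mentioned above:

import pandas as pd

# Hypothetical monthly sales figures (illustrative values only)
sales = pd.Series([120, 135, 150, 150, 160, 210, 95])

print(sales.mean())      # mean (average)
print(sales.median())    # median (middle value)
print(sales.mode())      # mode (most frequent value)
print(sales.std())       # standard deviation
print(sales.describe())  # several summary statistics in one call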

Inferential Analysis:
Inferential analysis aims to make inferences or predictions about a larger
population based on sample data. It involves applying statistical techniques such as
hypothesis testing, confidence intervals, and regression analysis. It helps generalize
findings from a sample to a larger population.
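
For instance, a simple hypothesis test can be run in Python with SciPy. The sketch below compares the means of two made-up groups with a two-sample t-test; the samples and the library choice are illustrative assumptions, not part of the course dataset:

import numpy as np
from scipy import stats

# Hypothetical conversion rates observed under two marketing campaigns
campaign_a = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.16])
campaign_b = np.array([0.10, 0.11, 0.09, 0.12, 0.10, 0.11])

# Two-sample t-test: is the difference between the sample means significant?
t_stat, p_value = stats.ttest_ind(campaign_a, campaign_b)
print(t_stat, p_value)  # a small p-value suggests the populations differ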

1.2 Exploratory Data Analysis (EDA):
EDA focuses on exploring and understanding the data without preconceived
hypotheses. It involves visualizations, summary statistics, and data profiling
techniques to uncover patterns, relationships, and interesting features. It helps
generate hypotheses for further analysis.

Diagnostic Analysis:

Diagnostic analysis aims to understand the cause-and-effect relationships
within the data. It investigates the factors or variables that contribute to specific
outcomes or behaviors. Techniques such as regression analysis, ANOVA (Analysis of
Variance), or correlation analysis are commonly used in diagnostic analysis.

Predictive Analysis:

Predictive analysis involves using historical data to make predictions or
forecasts about future outcomes. It utilizes statistical modeling techniques, machine
learning algorithms, and time series analysis to identify patterns and build predictive
models. It is often used for forecasting sales, predicting customer behavior, or
estimating risk.

Prescriptive Analysis:
Prescriptive analysis goes beyond predictive analysis by recommending
actions or decisions based on the predictions. It combines historical data,
optimization algorithms, and business rules to provide actionable insights and
optimize outcomes. It helps in decision-making and resource allocation.

1.3 Applications of Big Data Analytics in Different Industries:


The following section will provide you with an in-depth understanding of Big
Data Applications in different industries with big data examples.

1. Banking and Securities:


In the banking and securities sector, Big Data plays a pivotal role in fraud detection
and prevention. Through advanced analytics, anomalies in transaction patterns are
swiftly identified, enabling real-time intervention and safeguarding against financial
fraud.

2. Communications, Media, and Entertainment:
The realm of communications, media, and entertainment thrives on Big Data’s
capabilities. Audience analytics empower media companies to decipher viewer
habits, preferences, and engagement patterns, optimizing content creation and
targeted advertising. Content recommendation algorithms utilize Big Data
applications to suggest personalized content, enhancing user experiences.

3. Healthcare Providers:
Big Data revolutionizes healthcare by enabling personalized medicine. Analyzing
extensive genetic and patient data allows providers to tailor treatment plans for
individual needs, enhancing medical outcomes. Healthcare analytics drive
operational efficiency by optimizing patient flow, resource allocation, and
administrative processes.

4. Education:
In education, Big Data empowers institutions with learning analytics, offering insights
into student performance and learning patterns. Predictive analytics aids in
identifying students at risk of dropping out, allowing timely intervention and support.

5. Manufacturing and Natural Resources:


For manufacturing and natural resource industries, Big Data’s value lies in supply
chain optimization. Analyzing data enhances logistical efficiency, cost-effectiveness,
and resource allocation. Predictive maintenance based on data analysis ensures
equipment reliability and minimizes downtime.

6. Government:
Big Data applications to the government are vast. Urban planning benefits from
data-driven insights into traffic patterns, resource allocation, and infrastructure
needs. Public safety gains from predictive crime analysis, enabling law enforcement
agencies to allocate resources effectively.

7. Insurance:
In the insurance sector, Big Data is pivotal for underwriting and risk
assessment. By analyzing vast datasets, insurers can accurately assess risks and
determine pricing. Claims processing benefits from data analytics, expediting
verification processes and detecting fraudulent claims.

8. Retail and Wholesale Trade:


Big Data’s impact on retail is substantial, driving customer behavior analysis to
optimize marketing strategies and inventory management. Retailers analyze purchase
patterns and preferences to tailor offerings, enhancing customer satisfaction.

9. Transportation:
In the transportation industry, Big Data transforms traffic management by analyzing
real-time data from vehicles and sensors, leading to optimized traffic flow. Fleet
management benefits from data insights, optimizing routes, and improving fuel
efficiency.

10. Energy and Utilities:


Big Data is a driving force in the energy and utilities sector, enabling smart grid
management for efficient energy distribution and demand response. Energy
consumption analysis provides customers with insights to conserve energy.

CHAPTER 2 - DATA EXTRACTION USING SQL

2.1 INTRODUCTION TO SQL


Structured Query Language (SQL) is the cornerstone of modern data
management. It provides a standardized way to interact with relational databases,
enabling users to store, retrieve, manipulate, and analyze data efficiently. In this unit,
we will dive deep into SQL fundamentals, syntax, data relationships, and practical
applications across industries, using real-world datasets. This chapter provides
comprehensive theoretical foundations, extensive code examples, diagrams, best
practices, and context-based problem-solving.

Introduction to Databases and SQL

A database is a structured collection of data that can be easily accessed,
managed, and updated. SQL, or Structured Query Language, is used to interact with
relational databases—systems that store data in tabular format with rows and
columns. SQL makes it possible to insert, update, delete, and query data across
multiple tables, maintaining data integrity and efficiency.

Key Components:
●​ Tables: Consist of rows (records) and columns (fields).
●​ Schemas: Define the structure and organization of tables.
●​ Keys: Primary keys ensure uniqueness; foreign keys maintain relationships.

Types of Databases
1.​ Relational Databases (RDBMS): Use structured schemas. Examples: MySQL,
PostgreSQL.
2.​ NoSQL Databases: Store unstructured data. Examples: MongoDB, Cassandra.
We focus on RDBMS because SQL is the primary language for these systems.

2.2 SQL Query Lifecycle:

Every SQL query goes through a process:


1.​ Parsing – Breaks the query into elements.
2.​ Optimization – Chooses the most efficient execution plan.
3.​ Execution – Retrieves the result from the storage engine.

Understanding this helps us write efficient queries that run fast even on big datasets.

SELECT Queries in Depth


The SELECT statement is the foundation of data retrieval. Its syntax is:
SELECT column1, column2
FROM table_name
WHERE condition
GROUP BY column
HAVING condition
ORDER BY column ASC|DESC;

Examples:
-- Basic retrieval
SELECT name, salary FROM employees;
-- Using arithmetic
SELECT name, salary, salary*1.10 AS incremented_salary FROM employees;
-- Renaming columns
SELECT name AS employee_name FROM employees;
Filtering Data with WHERE
The WHERE clause helps restrict which rows appear in your result.
SELECT * FROM customers WHERE city = 'Chennai';
You can combine multiple conditions using logical operators:
●​ AND, OR, NOT
●​ BETWEEN, IN, LIKE, IS NULL

Advanced Example:
SELECT * FROM sales WHERE (amount > 1000 AND region IN ('South', 'East')) OR
customer_id IS NULL;

Sorting and Limiting Results


Sorting is handled by ORDER BY, and we can control output rows using LIMIT (in
MySQL) or TOP (in SQL Server).
SELECT * FROM products ORDER BY price DESC LIMIT 10;
This returns the 10 most expensive products.
Understanding NULL Values
NULL means unknown or missing. Important functions:
SELECT * FROM employees WHERE manager_id IS NULL;
SELECT COALESCE(manager_id, 'N/A') FROM employees;
COALESCE() replaces NULLs with a default.
Aggregations and Grouping
Aggregates give summary data:
SELECT department, AVG(salary), MAX(salary), COUNT(*) FROM employees GROUP
BY department;
Use HAVING to filter aggregate results:
SELECT department, COUNT(*) FROM employees GROUP BY department HAVING
COUNT(*) > 5;

2.3 Data Relationships & Keys


●​ Primary Key – Uniquely identifies each row.
●​ Foreign Key – Links one table to another.
Example:
●​ employees table with dept_id linked to departments(id).

Relationships:
●​ One-to-One
●​ One-to-Many
●​ Many-to-Many (via bridge table)

Mastering SQL Joins


Join Type: Description
INNER JOIN: Only matched records
LEFT JOIN: All records from the left table + matched records from the right
RIGHT JOIN: All records from the right table + matched records from the left
FULL OUTER JOIN: All records, with NULLs where there is no match

SELECT e.name, d.name FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;

Subqueries and Nested Queries


Subqueries can exist in WHERE, FROM, or SELECT clauses.
WHERE clause
SELECT name FROM employees WHERE salary > (SELECT AVG(salary) FROM
employees);
FROM clause
SELECT avg_salary FROM (SELECT department, AVG(salary) AS avg_salary FROM
employees GROUP BY department) AS dept_salaries;

Common Table Expressions (CTEs)


CTEs are readable, reusable query blocks.
WITH dept_avg AS (
SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY
department
)
SELECT * FROM dept_avg WHERE avg_salary > 50000;

Indexing for Performance


Indexes accelerate query performance by avoiding full table scans.
CREATE INDEX idx_salary ON employees(salary);
Views for Reusability
Views are saved queries treated as tables.

CREATE VIEW high_earners AS
SELECT name, salary FROM employees WHERE salary > 100000;
Transactions and ACID Principles
Transactions allow for grouped operations.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

ACID:
●​ Atomicity – All or none
●​ Consistency – Maintain integrity
●​ Isolation – One transaction doesn’t affect another
●​ Durability – Survives failures​

Sample Dataset Structure


customers: id, name, city, email
orders: id, customer_id, product_id, date, quantity
products: id, name, category, price
Example Query:
SELECT c.name, p.name, o.quantity FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id;

2.4 DATA HANDLING AND PREPROCESSING

What is Data Preprocessing?
Data preprocessing is the process of transforming raw data into an
understandable format. It is also an important step in data mining as we cannot work
with raw data. The quality of the data should be checked before applying machine
learning or data mining algorithms.

Why is Data Preprocessing Important?


Data preprocessing is mainly about ensuring data quality. The quality can be
checked by the following:
Accuracy: To check whether the data entered is correct.
Completeness: To check whether all the required data is available or values are missing.
Consistency: To check whether the same data is stored consistently in all the places where it appears.
Timeliness: The data should be updated correctly and be current.
Believability: The data should be trustworthy.
Interpretability: The data should be understandable.

Major Tasks in Data Preprocessing


There are 4 major tasks in data preprocessing – Data cleaning, Data
integration, Data reduction, and Data transformation.

Data Cleaning
Data cleaning is the process of removing incorrect data, incomplete data, and
inaccurate data from the datasets, and it also replaces the missing values.

Here are some techniques for data cleaning:


Handling Missing Values
●​ Standard values like “Not Available” or “NA” can be used to replace the
missing values.
●​ Missing values can also be filled manually, but it is not recommended when
that dataset is big.
●​ The attribute's mean value can be used to replace the missing value when the
data is normally distributed, whereas in the case of a non-normal distribution
the median value of the attribute can be used.
●​ While using regression or decision tree algorithms, the missing value can be
replaced by the most probable value.
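
As a brief sketch of how these replacement strategies look in pandas (the column names and values below are made up for illustration):

import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    'age': [25, 30, np.nan, 40, np.nan],
    'city': ['Chennai', None, 'Madurai', 'Salem', None]
})

# Replace missing categorical values with a standard placeholder
df['city'] = df['city'].fillna('Not Available')

# Replace missing numeric values with the mean (roughly normal data)
df['age_mean_filled'] = df['age'].fillna(df['age'].mean())

# ...or with the median (skewed, non-normal data)
df['age_median_filled'] = df['age'].fillna(df['age'].median())

print(df)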

Handling Noisy Data
Noisy data generally contains random errors or unnecessary data points.
Handling noisy data is one of the most important steps, as it leads to the optimization
of the model we are using. Here are some of the methods to handle noisy data.
Binning: This method smooths noisy data. First, the data is sorted, and then the
sorted values are separated and stored in the form of bins. There are three methods
for smoothing the data in a bin.
Smoothing by bin mean: The values in the bin are replaced by the mean value of the
bin. Smoothing by bin median: The values in the bin are replaced by the median value
of the bin. Smoothing by bin boundary: The minimum and maximum values of the bin
are taken as boundaries, and each value is replaced by the closest boundary value.
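
A rough Python sketch of binning (using pandas; the values are made up, and equal-frequency bins are only one possible choice) might look like this:

import pandas as pd

# Hypothetical noisy values, already sorted
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 28])

# Split the sorted values into 3 equal-frequency bins
bins = pd.qcut(values, q=3).astype(str)

# Smoothing by bin mean and by bin median
smoothed_mean = values.groupby(bins).transform('mean')
smoothed_median = values.groupby(bins).transform('median')

print(pd.DataFrame({'original': values,
                    'bin': bins,
                    'bin_mean': smoothed_mean,
                    'bin_median': smoothed_median}))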

Regression:

This is used to smooth the data and helps handle data when unnecessary
data points are present. For analysis purposes, regression helps decide which
variable is suitable for our analysis.

Clustering:

This is used for finding the outliers and also in grouping the data. Clustering is
generally used in unsupervised learning.

Data Integration
Data integration is the process of combining data from multiple sources into a single
dataset. The data integration process is one of the main components of data
management. There are some problems to be considered during data integration.
Schema integration: Integrates metadata (a set of data that describes other
data) from different sources.
Entity identification problem: Identifying entities from multiple databases.
For example, the system or the user should know the student id of one database and
student name of another database belonging to the same entity.
Detecting and resolving data value conflicts: The data taken from different databases
may differ when merged. The attribute values from one database may differ from
another database; for example, the date format may differ, like "MM/DD/YYYY" versus
"DD/MM/YYYY".

Data Reduction

This process helps in the reduction of the volume of the data, which makes
the analysis easier yet produces the same or almost the same result. This reduction
also helps to reduce storage space. Some of the data reduction techniques are
dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction: This process is necessary for real-world
applications as the data size is big. In this process, the reduction of random variables
or attributes is done so that the dimensionality of the data set can be reduced.
Attributes are combined and merged without losing the original characteristics of the
data, which also reduces storage space and computation time. When the data is
highly dimensional, a problem called the "curse of dimensionality" occurs.
Numerosity Reduction: In this method, the representation of the data is
made smaller by reducing the volume. There will not be any loss of data in this
reduction.
Data compression: Reducing data to a compressed representation is called data
compression. This compression can be lossless or lossy. When there is no loss of
information during compression, it is called lossless compression, whereas lossy
compression removes information, but only information that is unnecessary.

Data Transformation

The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the requirements.
There are some methods for data transformation.
Smoothing: With the help of algorithms, we can remove noise from the
dataset, which helps in knowing the important features of the dataset. By smoothing,
we can find even a simple change that helps in prediction.
Aggregation: In this method, the data is stored and presented in the form of a
summary. The data set, which is from multiple sources, is integrated into a data
analysis description. This is an important step since the accuracy of the data depends
on the quantity and quality of the data. When the quality and the quantity of the
data are good, the results are more relevant.
Discretization: The continuous data here is split into intervals. Discretization
reduces the data size. For example, rather than specifying the class time, we can set
an interval like (3 pm-5 pm, or 6 pm-8 pm).
Normalization: It is the method of scaling the data so that it can be
represented in a smaller range, for example from -1.0 to 1.0.
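
A minimal sketch of min-max normalization with pandas (the column name and values are hypothetical) rescales a numeric column first to 0-1 and then to the -1.0 to 1.0 range:

import pandas as pd

# Hypothetical salary column to be normalized
df = pd.DataFrame({'salary': [20000, 35000, 50000, 80000, 120000]})

# Min-max normalization to the range 0 to 1
value_range = df['salary'].max() - df['salary'].min()
df['salary_0_1'] = (df['salary'] - df['salary'].min()) / value_range

# Rescale to the range -1.0 to 1.0
df['salary_scaled'] = df['salary_0_1'] * 2 - 1

print(df)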

What is web scraping


Web scraping is the process of using bots to extract content and data from a
website.
Unlike screen scraping, which only copies pixels displayed onscreen, web
scraping extracts underlying HTML code and, with it, data stored in a database. The
scraper can then replicate entire website content elsewhere.

Scraper tools and bots
Web scraping tools are software (i.e., bots) programmed to sift through
databases and extract information. A variety of bot types are used, many being fully
customizable to:
●​ Recognize unique HTML site structures
●​ Extract and transform content
●​ Store scraped data
●​ Extract data from APIs
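
A minimal scraping sketch in Python, assuming the widely used requests and BeautifulSoup libraries (the URL is a placeholder; always check a site's terms of service and robots.txt before scraping):

import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace with a page you are permitted to scrape
url = 'https://example.com/products'

# Download the page's HTML
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the text of every <h2> heading on the page
for heading in soup.find_all('h2'):
    print(heading.get_text(strip=True))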

2.5 An Introduction to APIs

In simple words, an API is a (hypothetical) contract between two pieces of software:
if the user software provides input in a pre-defined format, the other software will
extend its functionality and provide the outcome to the user software. Think of it like
this: a graphical user interface (GUI) or command line interface (CLI) allows humans to
interact with code, whereas an application programming interface (API) allows one
piece of code to interact with other code.
Categories of API

Web-based system
A web API is an interface to either a web server or a web browser. These APIs
are used extensively for the development of web applications. These APIs work at
either the server end or the client end. Companies like Google, Amazon, and eBay all
provide web-based APIs.
Some popular examples of web-based APIs are the Twitter REST API, Facebook
Graph API, Amazon S3 REST API, etc.
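
In Python, a web API is typically called over HTTP with the requests library. The sketch below uses a hypothetical endpoint and parameters; real providers define their own URLs, fields, and authentication (usually an API key):

import requests

# Hypothetical REST endpoint returning JSON (not a real API)
url = 'https://api.example.com/v1/search'
params = {'query': 'data analytics', 'count': 10}

response = requests.get(url, params=params, timeout=10)

if response.status_code == 200:
    data = response.json()  # parse the JSON body into Python objects
    print(data)
else:
    print('Request failed with status', response.status_code)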

Operating system
There are multiple OS-based APIs that offer the functionality of various OS features,
which can be incorporated when creating Windows or Mac applications.
Some of the examples of OS based API are Cocoa, Carbon, WinAPI, etc.

Database system
Interaction with most databases is done using API calls to the
database. These APIs are defined so as to pass the requested data in a
predefined format that is understandable by the requesting client.
Some popular examples are Drupal 7 Database API, Drupal 8 Database API,
Django API.

Hardware System
These APIs allow access to the various hardware components of a system.
They are extremely crucial for establishing communication with the hardware, which
makes possible a range of functions, from collecting sensor data to displaying output
on your screens.
Some examples of hardware APIs are QUANT Electronic, WareNet
CheckWare, OpenVX Hardware Acceleration, CubeSensore, etc.

5 APIs every Data Scientists should know


Facebook API
The Facebook API provides an interface to the large amount of data generated
every day. The innumerable posts, comments, and shares in various groups and pages
produce massive data, and this massive public data provides a large number of
opportunities for analyzing the crowd.
Google Map API
The Google Maps API is one of the most commonly used APIs. Its applications vary
from integration in a cab service application to the popular Pokemon Go.

Twitter API
Just like Facebook Graph API, Twitter data can be accessed using the Twitter
API as well. You can access all the data like tweets made by any user, the tweets
containing a particular term or even a combination of terms, tweets done on the
topic in a particular date range, etc.

IBM Watson API


IBM Watson offers a set of APIs for performing a host of complex tasks, such
as tone analysis, document conversion, personality insights, visual recognition, text
to speech, and speech to text, using just a few lines of code.

Quandl API
Quandl lets you retrieve the time series information of a large number of
stocks for a specified date range. Setting up the Quandl API is very easy, and it
provides a great resource for projects like stock price prediction and stock profiling.

CHAPTER 3 - DATA ANALYSIS WITH PYTHON & PANDAS

3.1 Introduction To Python


Python is a general-purpose programming language. It is very easy to learn;
its simple syntax and readability are among the reasons why developers are switching
to Python from other programming languages.
It was created in 1991 by Guido van Rossum. Its name was inspired by the
comedy series 'Monty Python'.

Variables & Data Types


Variables are like a memory location where you can store a value. This value,
you may or may not change in the future.​
x = 10
y = 20
name = 'ingage'

To declare a variable in python, you only have to assign a value to it. There are
no additional commands needed to declare a variable in python.

Data Types in Python:


●​ Numbers
●​ String
●​ List
●​ Dictionary
●​ Set
●​ Tuple

Operators
Operators in python are used to do operations between two values or
variables. Following are the different types of operators that we have in python:
●​ Arithmetic Operators
●​ Logical Operators
●​ Assignment Operators
●​ Comparison Operators
●​ Membership Operators
●​ Identity Operators
●​ Bitwise Operators

Loops In Python
A loop allows us to execute a group of statements several times. To
understand why we use loops, let's take an example.

Suppose you want to print the sum of all even numbers until 1000. If you
write the logic for this task without using loops, it is going to be a long and tiresome
task.

But if we use a loop, we can write the logic to find the even number, give a
condition to iterate until the number reaches 1000 and print the sum of all the
numbers. This will reduce the complexity of the code and also make it readable as
well.

There are following types of loops in python:


●​ for loop
●​ while loop
●​ nested loops

For Loop
A ‘for loop’ is used to execute statements once every iteration. We already
know the number of iterations that are going to execute.
A for loop has two blocks, one is where we specify the conditions and then
we have the body where the statements are specified which gets executed on each
iteration.

for x in range(10):
    print(x)

While Loop
The while loop executes the statements as long as the condition is true. We
specify the condition in the beginning of the loop and as soon as the condition is
false, the execution stops.

i = 1
while i < 6:
    print(i)
    i += 1
# the output will be the numbers from 1 to 5

Nested Loops
Nested loops are combinations of loops, formed when we incorporate a while loop
inside a for loop or vice versa.

Following are a few examples of nested loops:

for i in range(1, 6):
    for j in range(i):
        print(i, end="")
    print()
# the output will be
1
22
333
4444
55555

Conditional and Control Statements


Conditional statements in Python evaluate logical conditions and control which
block of code is executed.

Following are the conditional statements that we have in python:


●​ if
●​ elif
●​ else

if statement

x = 10
if x > 5:
    print('greater')

The if statement tests the condition, when the condition is true, it executes
the statements in the if block.

elif statement

x = 10
if x > 5:
    print('greater')
elif x == 5:
    print('equal')

# else statement
x = 10
if x > 5:
    print('greater')
elif x == 5:
    print('equal')
else:
    print('smaller')

When both the if and elif conditions are false, execution moves to the else
block.

Control statements
Control statements are used to control the flow of execution in the program.

Following are the control statements that we have in python:


●​ break
●​ continue
●​ pass

Break

name = 'ingage'
for val in name:
    if val == 'a':
        break
    print(val)
# the output will be
i
n
g

The execution will stop as soon as the loop encounters break.

Continue

name = 'ingage'
for val in name:
    if val == 'a':
        continue
    print(val)
# the output will be
i
n
g
g
e

When the loop encounters continue, the current iteration is skipped and the rest
of the iterations are executed.

Functions
A function in python is a block of code which will execute whenever it is
called. We can pass parameters in the functions as well. To understand the concept
of functions, let's take an example.
Suppose you want to calculate the factorial of a number. You can do this by
simply executing the logic to calculate a factorial. But what if you have to do it ten
times a day? Writing the same logic again and again is going to be a long task.
Instead, what you can do is, write the logic in a function. Call that function
every time you need to calculate the factorial. This will reduce the complexity of your
code and save your time as well.

How to Create a Function?


# we use the def keyword to declare a function

def function_name():
    # expression
    print('abc')

How to Call a Function?

def my_func():
    print('function created')

# this is a function call
my_func()

Function Parameters
We can pass values into a function using parameters. We can also give
default values for a parameter in a function.

def my_func(name='edureka'):
    print(name)

# default parameter
my_func()
# user-defined parameter
my_func('python')

Now that we have understood function calls, parameters and why we use
them, let's take a look at classes and objects in python.

3.2 Classes & Objects


What are Classes?
Classes are like a blueprint for creating objects. We can store various
methods/functions in a class.

class classname:
    def functionname(self):
        print('expression')

What are Objects?


We create objects to call the methods in a class, or to access the properties of
a class.

class myclass:
    def func(self):
        print('my function')

ob1 = myclass()
ob1.func()

__init__ function

It is an inbuilt function which is called when an object of the class is being created.
All classes have an __init__ function. We use the __init__ function to assign values to
objects or to perform other operations that are required when an object is being created.

class myclass:
    def __init__(self, name):
        self.name = name

ob1 = myclass('ingage')
print(ob1.name)
# the output will be: ingage

Now that we have understood the concept of classes and objects, let's take a
look at a few OOP concepts that we have in Python.

3.3 OOPs Concepts


Python can be used as an object oriented programming language. Hence, we
can use the following concepts in python:
●​ Abstraction
●​ Encapsulation
●​ Inheritance
●​ Polymorphism

Abstraction
Data abstraction refers to displaying only the necessary details and hiding the
background tasks. Abstraction in python is similar to any other programming
language.
Like when we print a statement, we don’t know what is happening in the
background.

Encapsulation
Encapsulation is the process of wrapping up data. In python, classes can be an
example of encapsulation where the member functions and variables etc are
wrapped into a class.

Inheritance
Inheritance is an object oriented concept where a child class inherits all the
properties from a parent class. Following are the types of inheritance we have in
python:
●​ Single Inheritance
●​ Multiple Inheritance
●​ Multilevel Inheritance

Polymorphism
Polymorphism is the process in which an object can be used in many forms.
The most common example would be when a parent class reference is used to refer
to a child class object.

NumPy
Overview:
NumPy, which stands for Numerical Python, is a fundamental package for
scientific computing in Python.
It provides support for large, multi-dimensional arrays and matrices, along
with mathematical functions to operate on these arrays.
NumPy is the foundation for various data science and machine learning
libraries in Python.

Key Features:
●​ Arrays: The primary data structure in NumPy is the numpy.ndarray, which is
an n-dimensional array.
●​ Mathematical Operations: NumPy provides a vast array of mathematical
functions for performing operations on arrays, such as linear algebra,
statistical, and Fourier analysis functions.
●​ Broadcasting: NumPy allows for operations between arrays of different
shapes and sizes, making it convenient for mathematical computations.
●​ Random Module: NumPy includes a submodule for generating random
numbers, which is crucial in simulations and statistical analyses.
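
A short sketch tying these features together (array creation, broadcasting, and the random module); the values here are arbitrary:

import numpy as np

# Create a 2-D array (matrix) and a 1-D array
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])

# Broadcasting: the 1-D array is added to every row of the matrix
print(matrix + row)

# Vectorized mathematical operations and summary statistics
print(matrix.mean(), matrix.std())

# Random module: reproducible random numbers for simulations
rng = np.random.default_rng(seed=42)
print(rng.normal(loc=0, scale=1, size=3))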

What is NumPy Used for?


NumPy is an important library generally used for:
●​ Machine Learning
●​ Data Science
●​ Image and Signal Processing
●​ Scientific Computing
●​ Quantum Computing

3.4 Pandas
Overview:
Pandas is a powerful data manipulation library built on top of NumPy. It
provides data structures for efficiently storing and manipulating large datasets.
The primary data structures in Pandas are Series and DataFrame.
Pandas is widely used for data cleaning, preparation, and analysis in data
science and machine learning workflows.

Key Features:
●​ DataFrame: A two-dimensional table with labeled axes (rows and columns),
similar to a spreadsheet or SQL table.
●​ Series: A one-dimensional labeled array that can hold any data type.

●​ Data Alignment: Pandas automatically aligns data based on label names,
making it easy to work with datasets from different sources.
●​ Data Cleaning and Handling Missing Data: Pandas provides methods for
handling missing data, such as filling missing values or dropping
rows/columns.
●​ GroupBy: Pandas supports the splitting of data into groups based on some
criteria and applying a function to each group independently.
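
To make these structures concrete, here is a brief sketch (with made-up employee data) showing a Series, a DataFrame, and a GroupBy operation:

import pandas as pd

# A Series: one-dimensional labeled data
salaries = pd.Series([50000, 60000, 55000], index=['Asha', 'Ravi', 'Kumar'])
print(salaries)

# A DataFrame: a two-dimensional labeled table
df = pd.DataFrame({
    'Name': ['Asha', 'Ravi', 'Kumar', 'Meena'],
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Salary': [50000, 60000, 55000, 48000]
})

# GroupBy: average salary per department
print(df.groupby('Department')['Salary'].mean())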

Matplotlib
Overview:
●​ Purpose: Matplotlib is a 2D plotting library for creating static, animated, and
interactive visualizations in Python.
●​ Versatility: It is highly customizable and supports a wide variety of plot types,
from simple line charts to complex contour plots.
●​ Publication-Quality Graphics: Matplotlib is widely used for creating
publication-quality graphics in scientific and engineering publications.

Key Features:
●​ Plots and Charts: Matplotlib supports various types of plots, including line
plots, bar plots, scatter plots, histograms, and more.
●​ Customization: Users have full control over the appearance of plots. They can
customize colors, markers, labels, and other visual elements.
●​ Subplots: Multiple plots can be organized in a single figure, allowing for the
comparison of different datasets.
●​ Animations: Matplotlib supports creating animations, which is useful for
visualizing dynamic data.
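
A minimal Matplotlib sketch (with made-up monthly sales) showing a figure with two subplots, a line plot and a bar plot:

import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr']
sales = [120, 150, 90, 180]

# Two plots arranged side by side in one figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot(months, sales, marker='o', color='green')  # line plot of the trend
ax1.set_title('Monthly Sales Trend')

ax2.bar(months, sales, color='steelblue')  # bar plot of the same data
ax2.set_title('Monthly Sales by Month')

plt.tight_layout()
plt.show()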

Seaborn
Overview:
●​ Built on Matplotlib: Seaborn is a statistical data visualization library built on
top of Matplotlib.
●​ Focus on Aesthetics: Seaborn is designed to make attractive and informative
statistical graphics with fewer lines of code than Matplotlib.
●​ Statistical Plots: It includes several high-level functions for creating
informative statistical graphics, such as violin plots, box plots, and pair plots.

Key Features:
●​ High-Level Interface: Seaborn provides functions for creating complex
visualizations with minimal code, simplifying the process of creating statistical
plots.

●​ Color Palettes: Seaborn includes several color palettes, making it easy to
create visually appealing plots with consistent color schemes.
●​ Themes: Seaborn has built-in themes that can be applied to enhance the
aesthetics of plots.
●​ Regression Plots: Seaborn provides functions for visualizing linear regression
models, such as lmplot and regplot.
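
The sketch below illustrates Seaborn's high-level interface using its built-in 'tips' example dataset (loading it fetches data over the internet):

import seaborn as sns
import matplotlib.pyplot as plt

# Seaborn ships with small example datasets such as 'tips'
tips = sns.load_dataset('tips')

sns.set_theme(style='whitegrid')  # apply a built-in theme

# Box plot of total bill by day of the week
sns.boxplot(data=tips, x='day', y='total_bill')
plt.show()

# Regression plot of tip amount versus total bill
sns.regplot(data=tips, x='total_bill', y='tip')
plt.show()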

Matplotlib vs. Seaborn:


Seaborn simplifies the process of creating statistical plots by providing a
high-level interface, while Matplotlib is a more low-level and customizable library.
Matplotlib is a general-purpose plotting library, while Seaborn is specialized
for statistical data visualization.
Users often use Seaborn for quick exploratory data analysis and Matplotlib for
more customized and complex visualizations.

Setting Up Google Colab
Google Colab is a cloud-based Jupyter notebook environment that requires
no installation. It provides a ready-to-use Python environment with free GPU
support.

Steps to Start with Colab:


1.​ Go to https://colab.research.google.com
2.​ Sign in with a Google account.
3.​ Click on "New Notebook."
4.​ Begin coding with Python and use Pandas for data analysis.
Reading and Inspecting Data
Pandas supports reading data from multiple sources like CSV, Excel, SQL, and JSON.

# Read CSV
df = pd.read_csv("data.csv")

# Inspect data
print(df.head())
print(df.info())
print(df.describe())

3.5 Data Cleaning with Pandas


Handling Missing Data:
# Fill missing values
df.fillna(0, inplace=True)

# Drop rows with missing values
df.dropna(inplace=True)

Renaming Columns:
df.rename(columns={'old_name': 'new_name'}, inplace=True)

Data Type Conversion:


df['date'] = pd.to_datetime(df['date'])

Filtering and Selecting Data


# Filter rows
df[df['Age'] > 25]

# Select columns
df[['Name', 'Age']]

Data Aggregation and Grouping


# Group by column and compute mean
print(df.groupby('Department')['Salary'].mean())

Applying Functions to Columns


def categorize(age):
    return 'Adult' if age >= 18 else 'Minor'

df['Category'] = df['Age'].apply(categorize)

Merging and Joining DataFrames

# Merging two DataFrames on a common column
merged_df = pd.merge(df1, df2, on='ID', how='inner')

Saving Processed Data

# Save to CSV
df.to_csv('cleaned_data.csv', index=False)

CHAPTER 4 - DATA VISUALIZATION WITH LOOKER STUDIO

4.1 INTRODUCTION TO DATA VISUALIZATION


Data visualization is an essential part of data analytics because it transforms
raw data into meaningful insights through graphical representation. When data is
presented visually, complex information becomes more accessible, easier to
understand, and more actionable. This is especially crucial in business, education,
and any field where decision-making depends on data.

Looker Studio (formerly Google Data Studio) is a powerful, free, web-based
platform designed specifically for creating dynamic, interactive data visualizations
and dashboards. It enables users — regardless of coding skills — to connect multiple
data sources, design customizable reports, and share insights instantly. Looker Studio
bridges the gap between data complexity and user comprehension by offering
intuitive drag-and-drop tools and seamless integration with popular Google products
like Google Sheets, BigQuery, and Google Analytics.

The Role of Data Visualization


Before diving into Looker Studio itself, it’s important to understand why
visualization matters:
●​ Enhances Understanding: Humans process visual information faster than text
or tables. Visualizations like charts, graphs, and maps reveal patterns, trends,
and outliers that might otherwise go unnoticed.
●​ Improves Communication: Visual reports make it easier to share insights with
stakeholders who may not have technical expertise.

●​ Supports Decision-Making: Well-designed visuals help identify opportunities,
risks, and areas needing attention.
●​ Encourages Exploration: Interactive dashboards allow users to drill down into
data and customize views based on their questions.​

In the world of data analytics, raw numbers alone are rarely sufficient.
Visualization translates data into a narrative that supports strategic actions.
Connecting Data Sources to Looker Studio
The journey begins with linking your data to Looker Studio. Looker Studio supports
dozens of data connectors, including:
●​ Google Sheets (ideal for beginners and small datasets)
●​ BigQuery (for large-scale analytics)
●​ Google Analytics (web traffic insights)
●​ SQL databases
●​ CSV files and more​

For most learning scenarios, Google Sheets is the simplest and most
accessible option, allowing you to store your data in a familiar spreadsheet format
and visualize it instantly.
To connect:
1.​ Navigate to lookerstudio.google.com and sign in with your Google account.
2.​ Click on “Create” > “Report.”
3.​ Click “Add Data” and select “Google Sheets.”
4.​ Browse and select your spreadsheet file.
5.​ Choose the worksheet tab containing your data.
6.​ Click “Add” to link the data source to your report.​

Once connected, Looker Studio automatically detects column headers, which
it treats as fields — these are the building blocks for your charts.
Dimensions and Metrics
Understanding dimensions and metrics is fundamental:
●​ Dimensions are categorical variables or attributes that describe data, such as
“Region,” “Product,” or “Date.”
●​ Metrics are numerical measurements or quantities, like “Sales,” “Revenue,” or
“Clicks.”​

When creating charts, you assign dimensions to define the categories or
grouping and metrics to define the values or measurements.

4.2 Creating Charts and Visual Elements
Looker Studio offers a rich variety of chart types to suit different analytical
needs:

●​ Bar and Column Charts: Ideal for comparing quantities across categories. For
example, sales by region.
●​ Line Charts: Show trends over time, such as monthly website visits.
●​ Pie Charts: Display proportions of a whole, such as market share by
product.
●​ Tables: Present detailed raw data for in-depth inspection.
●​ Geo Maps: Visualize data geographically, for example, sales by country
or city.
●​ Scorecards: Highlight key single-value metrics like total revenue or
number of users.​

To insert a chart, simply select “Add a chart,” choose the desired type, then
drag and drop the appropriate fields into “Dimension” and “Metric” slots.
Example: Imagine you have sales data with columns: Region, Product
Category, Date, and Sales Amount. You can create:
●​ A bar chart showing total sales per region (Dimension: Region, Metric:
Sales Amount)
●​ A line chart showing sales trends over time (Dimension: Date, Metric:
Sales Amount)

●​ A pie chart showing sales distribution across product categories​

Adding Interactivity with Filters and Controls


Static reports limit exploration. Looker Studio allows adding interactive filters and
controls such as:
●​ Drop-down selectors: Users can filter charts by category (e.g., selecting one
product).
●​ Date range controls: Enables viewers to specify the time period for the
displayed data.
●​ Search boxes: To find specific data points in tables.​

These controls help create customizable dashboards that empower users to
explore data on their terms without needing to edit or recreate reports.
For instance, a dashboard showing sales data can include a date filter to let users
switch from yearly to monthly or weekly views dynamically.

Blending Data for Deeper Insights


Looker Studio’s data blending feature lets you combine two or more data
sources on a common key, akin to joining tables in SQL.
Imagine you have:
●​ A sales dataset with “Product ID” and sales figures.
●​ A customer feedback dataset with “Product ID” and ratings.​

By blending these, you can create visuals that correlate customer satisfaction
with sales performance, providing actionable insights for product managers.

Calculated Fields for Custom Metrics


Not all required metrics exist in your raw data. Looker Studio allows you to
create calculated fields using arithmetic operations and functions.
For example, if you want to calculate Profit Margin, but only have “Revenue”
and “Cost” columns, you can create:
Profit Margin = (Revenue - Cost) / Revenue

This new metric can then be used like any other field in charts and filters.
Looker Studio supports complex formulas and conditional statements (using
CASE) to tailor metrics precisely.

Best Practices for Designing Dashboards
●​ Keep it Simple: Avoid clutter; use only the necessary charts and controls per
dashboard page.
●​ Use Consistent Colors and Fonts: This improves readability and
professionalism.
●​ Align Visuals: Proper alignment and spacing make reports easier to scan.
●​ Add Clear Titles and Labels: Each chart should have a meaningful title and axis
labels.
●​ Use Legends Effectively: When multiple categories exist, legends help viewers
interpret colors and symbols.
●​ Test Responsiveness: Preview dashboards on various devices to ensure
usability.​

Sharing and Collaboration


One of Looker Studio’s biggest strengths is easy sharing:
●​ Share reports via email or link.
●​ Set permission levels (Viewer or Editor).
●​ Embed dashboards into websites, learning management systems, or
internal portals using HTML embed code.​

This enables seamless collaboration and dissemination of insights across
teams and stakeholders.

CHAPTER 5 - PREDICTIVE ANALYTICS AND REPORTING

5.1 Introduction to Predictive Analytics


Predictive analytics is a powerful discipline within data analytics that focuses
on forecasting future events or behaviors by analyzing historical data and applying
statistical algorithms, machine learning, and data mining techniques. The objective is
to identify patterns and trends in existing data to make informed predictions about
future outcomes. This contrasts with descriptive analytics, which focuses on
summarizing what has already happened, and diagnostic analytics, which explores
why events occurred.
At its core, predictive analytics uses mathematical models and computational
algorithms to answer questions such as: What is likely to happen? Which factors are
most influential? How can future risks or opportunities be anticipated? These
insights enable organizations and individuals to optimize decision-making, reduce
uncertainty, and improve performance across numerous domains.

The Importance of Predictive Analytics


Predictive analytics is transforming how decisions are made in industries such
as healthcare, finance, marketing, education, manufacturing, and public policy. By
leveraging predictive models, stakeholders can:
●​ Identify potential risks before they materialize.
●​ Optimize resource allocation and operational efficiency.
●​ Personalize services and products to meet individual needs.
●​ Anticipate market trends and consumer behavior.

●​ Enhance planning and forecasting accuracy.​

For example, a company might predict customer churn and proactively retain
clients, or a healthcare provider may forecast disease outbreaks to allocate medical
resources efficiently. The versatility and actionable insights offered by predictive
analytics have made it a cornerstone of data-driven strategy.

Key Concepts in Predictive Analytics


Data Collection and Preparation
The foundation of effective predictive analytics lies in quality data. Raw data
collected from various sources—databases, sensors, surveys, transaction
records—must undergo several critical steps before it is suitable for modeling:

●​ Data Cleaning: Remove errors, inconsistencies, and duplicates to improve
accuracy.
●​ Data Integration: Combine data from multiple sources for a comprehensive
dataset.
●​ Data Transformation: Normalize, scale, or encode data to fit model
requirements.
●​ Feature Selection: Identify relevant variables that influence the prediction
target.​

These processes ensure that models are trained on reliable and meaningful
data, increasing the likelihood of accurate predictions.
Features and Target Variables

●​ Features (Independent Variables): These are the input variables used
to predict the outcome. Features can be numeric (e.g., age,
temperature) or categorical (e.g., gender, region).
●​ Target Variable (Dependent Variable): This is the outcome or result the
model aims to predict, such as a customer’s purchase decision or the
price of a stock.
●​ Correctly defining and preparing these variables is crucial for building
effective models.
Types of Predictive Models
Predictive analytics utilizes various modeling approaches depending on the
problem type:
●​ Regression Models: Used for predicting continuous numerical values.
For example, forecasting sales revenue or temperature.
●​ Classification Models: Predict categorical outcomes such as fraud
detection (fraud/not fraud) or email filtering (spam/not spam).
●​ Time Series Forecasting: Analyzes data collected over time to forecast
future values, such as stock prices or demand forecasting.
●​ Clustering: Groups similar data points to uncover hidden patterns
without predefined categories.​

Selecting the appropriate model depends on the nature of the data and the
specific prediction goal.

The Predictive Analytics Workflow

The process of building and deploying predictive models typically follows
these steps:

1.​ Problem Definition: Clearly articulate the question or problem to be
solved.
2.​ Data Collection: Gather historical and relevant data.
3.​ Data Preparation: Clean, transform, and select features.
4.​ Model Selection: Choose a suitable algorithm or approach.
5.​ Model Training: Use the prepared data to teach the model how to
make predictions.
6.​ Model Evaluation: Test the model on new data to assess accuracy and
robustness.​

7.​ Deployment: Implement the model to generate predictions in real-world
scenarios.
8.​ Monitoring and Maintenance: Continuously evaluate and update the
model to ensure consistent performance.​

Following this structured approach ensures that predictions are reliable,
interpretable, and actionable.

5.2 Tools and Technologies in Predictive Analytics


A variety of tools support predictive analytics, catering to different levels of
expertise:
●​ Programming Languages: Python and R are widely used for their rich
ecosystems of data science libraries such as scikit-learn, TensorFlow,
and caret. These tools allow customization, automation, and complex
modeling.
●​ Cloud Platforms: Google Colab provides an accessible, cloud-based
environment for Python coding and running predictive models
without local setup.
●​ Visualization Tools: Reporting and dashboard platforms like Looker
Studio help communicate predictive insights through visualizations,
making results understandable to stakeholders.
●​ AutoML Platforms: Services like Google AutoML, Microsoft Azure ML,
and IBM Watson simplify model creation with automated workflows.​

For many learners and professionals, combining Python programming with
visualization tools offers a powerful and flexible predictive analytics environment.
Example: Predicting Outcomes with Regression
Consider the task of predicting sales revenue based on advertising spend,
seasonality, and market conditions.
●​ Data Collection: Historical sales data with columns for monthly sales,
advertising budget, and season.

●​ Feature Engineering: Creating variables such as “Ad Spend,” “Month,”
and “Competitor Price.”
●​ Model Training: Using linear regression to find relationships between
features and sales.
●​ Evaluation: Assessing model accuracy using metrics like Mean Squared
Error (MSE) and R-squared.
●​ Prediction: Forecasting next quarter’s sales given planned advertising
budgets.​

This example illustrates how numeric prediction tasks can guide strategic
decisions.
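
As a simplified sketch of this workflow, the code below fits a linear regression with scikit-learn on made-up sales data; the column names, values, and library choice are illustrative assumptions rather than the course's actual dataset:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical historical data: advertising spend, month number, and sales
df = pd.DataFrame({
    'ad_spend': [10, 15, 12, 20, 25, 22, 30, 28, 35, 40],
    'month':    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'sales':    [110, 135, 120, 170, 200, 185, 240, 225, 270, 300]
})

X = df[['ad_spend', 'month']]  # features
y = df['sales']                # target variable

# Hold out part of the data to evaluate the model on unseen examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate with Mean Squared Error and R-squared
predictions = model.predict(X_test)
print('MSE:', mean_squared_error(y_test, predictions))
print('R-squared:', r2_score(y_test, predictions))

# Forecast sales for a planned budget in a future month
print(model.predict(pd.DataFrame({'ad_spend': [45], 'month': [11]})))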

Reporting Predictive Analytics Results


Reporting is essential to translate predictive model outputs into meaningful
business or operational insights. Effective reports typically include:

●​ Summary Statistics: Model accuracy, error rates, and confidence levels.
●​ Visualizations: Graphs and charts such as line plots for trends, scatter
plots for relationships, or heatmaps for correlation.
●​ Interpretation: Explanation of which features influenced predictions
most significantly.
●​ Actionable Recommendations: Suggested steps based on predictions,
such as targeting marketing efforts or allocating resources.
●​ Limitations and Assumptions: Transparent discussion of model
constraints and potential biases.​

Interactive dashboards built with tools like Looker Studio allow users to filter and
explore predictions dynamically, improving decision-making.

Ethical Considerations
Predictive analytics raises important ethical questions:
●​ Bias and Fairness: Models trained on biased data can perpetuate inequalities
or unfair treatment.
●​ Privacy: Protecting personal and sensitive data during collection, modeling,
and reporting.
●​ Transparency: Ensuring stakeholders understand how predictions are made
and their limitations.
●​ Accountability: Using predictions responsibly, avoiding misuse or
over-reliance on automated decisions.​

Addressing these concerns is critical to maintaining trust and integrity in predictive
analytics applications.
