Advanced Data Analytics and Visualization Course Material
CHAPTER 1 - FOUNDATION OF DATA ANALYTICS
Data Collection
Guided by your identified requirements, it’s time to collect the data from your
sources. Sources include case studies, surveys, interviews, questionnaires, direct
observation, and focus groups. Make sure to organize the collected data for analysis.
Data Cleaning:
Not all of the data you collect will be useful, so it’s time to clean it up. This
process is where you remove white spaces, duplicate records, and basic errors. Data
cleaning is mandatory before sending the information on for analysis.
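As a minimal sketch of these cleanup steps in pandas (the file and column names are illustrative assumptions):
import pandas as pd

# Load the collected data (file name is illustrative)
df = pd.read_csv("survey_responses.csv")

# Remove stray white space from a text column
df["name"] = df["name"].str.strip()

# Remove duplicate records
df = df.drop_duplicates()

# Drop rows with an obviously invalid value (a basic error check)
df = df[df["age"].between(0, 120)]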
Data Analysis
Here is where you use data analysis software and other tools to help you
interpret and understand the data and arrive at conclusions. Data analysis tools
include Excel, Python, R, Looker, Rapid Miner, Chartio, Metabase, Redash, and
Microsoft Power BI.
Data Interpretation:
Now that you have your results, you need to interpret them and come up
with the best courses of action based on your findings.
Data Visualization:
Data visualization is a fancy way of saying, “graphically show your information
in a way that people can read and understand it.” You can use charts, graphs, maps,
bullet points, or a host of other methods. Visualization helps you derive valuable insights from your data.
Descriptive Analysis:
Descriptive analysis involves summarizing and describing the main features of
a dataset. It focuses on organizing and presenting the data in a meaningful way, often
using measures such as mean, median, mode, and standard deviation. It provides an
overview of the data and helps identify patterns or trends.
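For instance, these summary measures can be computed directly with pandas (the numbers below are made up for illustration):
import pandas as pd

scores = pd.Series([72, 85, 90, 85, 78, 95, 88])

print(scores.mean())    # mean (average)
print(scores.median())  # median (middle value)
print(scores.mode())    # mode (most frequent value or values)
print(scores.std())     # standard deviation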
Inferential Analysis:
Inferential analysis aims to make inferences or predictions about a larger
population based on sample data. It involves applying statistical techniques such as
hypothesis testing, confidence intervals, and regression analysis. It helps generalize
findings from a sample to a larger population.
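As a hedged sketch, a two-sample t-test with SciPy illustrates hypothesis testing on sample data (the samples below are synthetic):
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample_a = rng.normal(loc=50, scale=5, size=30)  # e.g., a control group
sample_b = rng.normal(loc=53, scale=5, size=30)  # e.g., a treatment group

t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(t_stat, p_value)  # a small p-value suggests the group means differ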
1.2 Exploratory Data Analysis (EDA):
EDA focuses on exploring and understanding the data without preconceived
hypotheses. It involves visualizations, summary statistics, and data profiling
techniques to uncover patterns, relationships, and interesting features. It helps
generate hypotheses for further analysis.
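A minimal EDA sketch in pandas (the file name and columns are assumptions):
import pandas as pd

df = pd.read_csv("sales.csv")

print(df.shape)                     # rows and columns
print(df.dtypes)                    # data type of each column
print(df.describe())                # summary statistics
print(df.isna().sum())              # missing values per column
print(df.corr(numeric_only=True))   # correlations between numeric columns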
Diagnostic Analysis:
Diagnostic analysis explores why events occurred. It drills down into the data, using techniques such as correlation analysis and root-cause investigation, to identify the factors behind an observed outcome.
Predictive Analysis:
Predictive analysis uses historical data and statistical or machine learning models to estimate what is likely to happen in the future, such as forecasting sales or identifying customers who are likely to churn.
Prescriptive Analysis:
Prescriptive analysis goes beyond predictive analysis by recommending
actions or decisions based on the predictions. It combines historical data,
optimization algorithms, and business rules to provide actionable insights and
optimize outcomes. It helps in decision-making and resource allocation.
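As a toy illustration of the optimization side of prescriptive analysis (the campaigns, returns, and budget below are invented for the example):
from scipy.optimize import linprog

# Allocate a 100-unit budget across two campaigns to maximize expected return.
# linprog minimizes, so the expected returns (3 and 5 per unit) are negated.
c = [-3, -5]
A_ub = [[1, 1]]              # total spend across both campaigns
b_ub = [100]                 # budget limit
bounds = [(0, 60), (0, 70)]  # per-campaign spending caps

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(result.x)              # recommended allocation across the two campaigns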
2. Communications, Media, and Entertainment:
The realm of communications, media, and entertainment thrives on Big Data’s
capabilities. Audience analytics empower media companies to decipher viewer
habits, preferences, and engagement patterns, optimizing content creation and
targeted advertising. Content recommendation algorithms utilize Big Data
applications to suggest personalized content, enhancing user experiences.
3. Healthcare Providers:
Big Data revolutionizes healthcare by enabling personalized medicine. Analyzing
extensive genetic and patient data allows providers to tailor treatment plans for
individual needs, enhancing medical outcomes. Healthcare analytics drive
operational efficiency by optimizing patient flow, resource allocation, and
administrative processes.
4. Education:
In education, Big Data empowers institutions with learning analytics, offering insights
into student performance and learning patterns. Predictive analytics aids in
identifying students at risk of dropping out, allowing timely intervention and support.
6. Government:
Big Data applications in government are vast. Urban planning benefits from
data-driven insights into traffic patterns, resource allocation, and infrastructure
needs. Public safety gains from predictive crime analysis, enabling law enforcement
agencies to allocate resources effectively.
7. Insurance:
In the insurance sector, Big Data is pivotal for underwriting and risk
assessment. By analyzing vast datasets, insurers can accurately assess risks and
determine pricing. Claims processing benefits from data analytics, expediting
verification processes and detecting fraudulent claims.
9. Transportation:
In the transportation industry, Big Data transforms traffic management by analyzing
real-time data from vehicles and sensors, leading to optimized traffic flow. Fleet
management benefits from data insights, optimizing routes, and improving fuel
efficiency.
CHAPTER 2 - DATA EXTRACTION USING SQL
Key Components:
● Tables: Consist of rows (records) and columns (fields).
● Schemas: Define the structure and organization of tables.
● Keys: Primary keys ensure uniqueness; foreign keys maintain relationships.
Types of Databases
1. Relational Databases (RDBMS): Use structured schemas. Examples: MySQL,
PostgreSQL.
2. NoSQL Databases: Store unstructured data. Examples: MongoDB, Cassandra.
We focus on RDBMS because SQL is the primary language for these systems.
Understanding this helps us write efficient queries that run fast even on big datasets.
Examples:
-- Basic retrieval
SELECT name, salary FROM employees;
-- Using arithmetic
SELECT name, salary, salary*1.10 AS incremented_salary FROM employees;
-- Renaming columns
SELECT name AS employee_name FROM employees;
Filtering Data with WHERE
The WHERE clause helps restrict which rows appear in your result.
SELECT * FROM customers WHERE city = 'Chennai';
You can combine multiple conditions using logical operators:
● AND, OR, NOT
● BETWEEN, IN, LIKE, IS NULL
Advanced Example:
SELECT * FROM sales WHERE (amount > 1000 AND region IN ('South', 'East')) OR
customer_id IS NULL;
Relationships:
● One-to-One
● One-to-Many
● Many-to-Many (via bridge table)
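A minimal sketch of a many-to-many relationship through a bridge table, using Python's built-in sqlite3 module (table and column names are illustrative):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT);
-- Bridge table: each row links one student to one course
CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students(id),
    course_id  INTEGER REFERENCES courses(id),
    PRIMARY KEY (student_id, course_id)
);
""")
conn.execute("INSERT INTO students VALUES (1, 'Asha')")
conn.execute("INSERT INTO courses VALUES (10, 'SQL Basics')")
conn.execute("INSERT INTO enrollments VALUES (1, 10)")

rows = conn.execute("""
SELECT s.name, c.title
FROM students s
JOIN enrollments e ON e.student_id = s.id
JOIN courses c ON c.id = e.course_id
""").fetchall()
print(rows)  # [('Asha', 'SQL Basics')]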
CREATE VIEW high_earners AS
SELECT name, salary FROM employees WHERE salary > 100000;
Transactions and ACID Principles
Transactions allow for grouped operations.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
ACID:
● Atomicity – All or none
● Consistency – Maintain integrity
● Isolation – One transaction doesn’t affect another
● Durability – Survives failures
What is Data Preprocessing?
Data preprocessing is the process of transforming raw data into an
understandable format. It is also an important step in data mining as we cannot work
with raw data. The quality of the data should be checked before applying machine
learning or data mining algorithms.
Data Cleaning
Data cleaning is the process of removing incorrect data, incomplete data, and
inaccurate data from the datasets, and it also replaces the missing values.
Handling Noisy Data
Noisy data generally contains random errors or unnecessary data points.
Handling noisy data is one of the most important steps, as it helps optimize the model we are using. Here are some of the methods to handle noisy data.
Binning: This method is used to smooth noisy data. First, the data is sorted, and then the sorted values are separated and stored in the form of bins. There are three methods for smoothing the data in a bin.
Smoothing by bin mean: the values in the bin are replaced by the mean value of the bin. Smoothing by bin median: the values in the bin are replaced by the median value of the bin. Smoothing by bin boundary: the minimum and maximum values of the bin are taken as the bin boundaries, and each value is replaced by the closest boundary value.
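A small hedged sketch of bin-mean smoothing with pandas (the values and the number of bins are arbitrary):
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]).sort_values()

# Split the sorted values into 3 equal-frequency bins
bins = pd.qcut(values, q=3, labels=False)

# Smoothing by bin mean: replace each value with the mean of its bin
smoothed = values.groupby(bins).transform("mean")
print(smoothed.tolist())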
Regression:
Regression is used to smooth the data and helps handle noisy data when unnecessary variation is present. For analysis purposes, regression also helps decide which variables are suitable for our analysis.
Clustering:
This is used for finding the outliers and also in grouping the data. Clustering is
generally used in unsupervised learning.
Data Integration
The process of combining multiple sources into a single dataset. The Data
integration process is one of the main components of data management. There are
some problems to be considered during data integration.
Schema integration: Integrates metadata (a set of data that describes other data) from different sources.
Entity identification problem: Identifying entities from multiple databases.
For example, the system or the user should know the student id of one database and
student name of another database belonging to the same entity.
Detecting and resolving data value conflicts: The data taken from different databases may differ when merging. An attribute value in one database may differ from the corresponding value in another database; for example, the date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY”.
Data Reduction
This process helps in the reduction of the volume of the data, which makes
the analysis easier yet produces the same or almost the same result. This reduction
also helps to reduce storage space. Some of the data reduction techniques are
dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction: This process is necessary for real-world applications, as data sizes are large. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set shrinks, combining and merging attributes without losing their original characteristics. This also reduces storage space and computation time. When the data is highly dimensional, a problem called the “Curse of Dimensionality” occurs.
Numerosity Reduction: In this method, the representation of the data is
made smaller by reducing the volume. There will not be any loss of data in this
reduction.
Data compression: Encoding data in a compressed form is called data compression. This compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression. Lossy compression, by contrast, discards some information, but only information that is considered unnecessary.
Data Transformation
The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the requirements.
There are some methods for data transformation.
Smoothing: With the help of algorithms, we can remove noise from the
dataset, which helps in knowing the important features of the dataset. By smoothing,
we can find even a simple change that helps in prediction.
Aggregation: In this method, the data is stored and presented in the form of a
summary. The data set, which is from multiple sources, is integrated into a data
analysis description. This is an important step since the accuracy of the data depends
on the quantity and quality of the data. When the quality and the quantity of the
data are good, the results are more relevant.
Discretization: The continuous data here is split into intervals. Discretization
reduces the data size. For example, rather than specifying the class time, we can set
an interval like (3 pm-5 pm, or 6 pm-8 pm).
Normalization: It is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0.
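A brief sketch of min-max scaling to the range -1.0 to 1.0 (the numbers are illustrative):
import numpy as np

values = np.array([10.0, 20.0, 35.0, 50.0, 90.0])

# Scale to [0, 1], then shift and stretch to [-1, 1]
scaled01 = (values - values.min()) / (values.max() - values.min())
scaled = scaled01 * 2 - 1
print(scaled)  # the smallest value maps to -1.0, the largest to 1.0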
Scraper tools and bots
Web scraping tools are software (i.e., bots) programmed to sift through
databases and extract information. A variety of bot types are used, many being fully
customizable to:
● Recognize unique HTML site structures
● Extract and transform content
● Store scraped data
● Extract data from APIs
A user interface allows a person to interact with code, whereas an Application Programming Interface (API) allows one piece of code to interact with other code.
Categories of API
Web-based system
A web API is an interface to either a web server or a web browser. These APIs
are used extensively for the development of web applications. These APIs work at
either the server end or the client end. Companies like Google, Amazon, and eBay all provide web-based APIs.
Some popular examples of web-based APIs are the Twitter REST API, Facebook Graph API, Amazon S3 REST API, etc.
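As a minimal sketch of calling a web API from Python with the requests library (GitHub's public REST API is used here only as a convenient example):
import requests

# Fetch public metadata about a repository from GitHub's REST API
response = requests.get("https://api.github.com/repos/pandas-dev/pandas", timeout=10)
response.raise_for_status()   # raise an error for a non-2xx status code
repo = response.json()        # web APIs typically return JSON
print(repo["full_name"], repo["stargazers_count"])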
Operating system
There are multiple OS-based APIs that offer the functionality of various OS features and can be incorporated when creating Windows or Mac applications.
Some examples of OS-based APIs are Cocoa, Carbon, WinAPI, etc.
Database system
Interaction with most databases is done using API calls to the database. These APIs are defined in a manner that passes the requested data back in a predefined format that is understandable by the requesting client.
Some popular examples are Drupal 7 Database API, Drupal 8 Database API,
Django API.
Hardware System
These APIs allow access to the various hardware components of a system. They are crucial for establishing communication with the hardware, which makes possible a range of functions, from collecting sensor data to rendering output on your screens.
Some examples of hardware APIs are: QUANT Electronic, WareNet CheckWare, OpenVX Hardware Acceleration, CubeSensore, etc.
Twitter API
Just like Facebook Graph API, Twitter data can be accessed using the Twitter
API as well. You can access data such as tweets made by any user, tweets containing a particular term or combination of terms, tweets on a topic in a particular date range, and more.
Quandl API
Quandl lets you invoke the time series information of a large number of
stocks for a specified date range. Setting up the Quandl API is very easy, and it provides a great resource for projects like stock price prediction, stock profiling, etc.
List of 5 cool data science projects using APIs
Here is the list of projects for you. I’ll leave the execution of these ideas to you.
CHAPTER 3 - DATA ANALYSIS WITH PYTHON & PANDAS
To declare a variable in python, you only have to assign a value to it. There are
no additional commands needed to declare a variable in python.
Operators
Operators in python are used to do operations between two values or
variables. Following are the different types of operators that we have in python:
● Arithmetic Operators
● Logical Operators
● Assignment Operators
● Comparison Operators
● Membership Operators
● Identity Operators
● Bitwise Operators
Loops In Python
A loop allows us to execute a group of statements several times. To
understand why we use loops, let's take an example.
Suppose you want to print the sum of all even numbers until 1000. If you
write the logic for this task without using loops, it is going to be a long and tiresome
task.
But if we use a loop, we can write the logic to find the even number, give a
condition to iterate until the number reaches 1000 and print the sum of all the
numbers. This will reduce the complexity of the code and also make it readable as
well.
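A short sketch of that idea, summing the even numbers up to 1000 with a loop:
total = 0
for number in range(2, 1001, 2):  # even numbers from 2 to 1000
    total += number
print(total)  # 250500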
For Loop
A ‘for loop’ is used to execute statements once every iteration. We already
know the number of iterations that are going to execute.
A for loop has two blocks, one is where we specify the conditions and then
we have the body where the statements are specified which gets executed on each
iteration.
for x in range(10):
    print(x)
While Loop
The while loop executes the statements as long as the condition is true. We
specify the condition in the beginning of the loop and as soon as the condition is
false, the execution stops.
i = 1
while i < 6:
    print(i)
    i += 1
# the output will be numbers from 1-5
Nested Loops
Nested loops are a combination of loops, such as a while loop inside a for loop or vice versa.
Following are a few examples of nested loops:
for i in range(1, 6):
    for j in range(i):
        print(i, end="")
    print()
# the output will be
1
22
333
4444
55555
if statement
x = 10
if x > 5:
    print('greater')
The if statement tests the condition, when the condition is true, it executes
the statements in the if block.
elif statement
x = 10
if x > 5:
    print('greater')
elif x == 5:
    print('equal')
# else statement
x = 10
if x > 5:
    print('greater')
elif x == 5:
    print('equal')
else:
    print('smaller')
When both the if and elif conditions are false, execution moves to the else statement.
Control statements
Control statements are used to control the flow of execution in the program.
Break
name = 'ingage'
for val in name:
    if val == 'a':
        break
    print(val)
# the output will be
i
n
g
Continue
name = 'ingage'
for val in name:
    if val == 'a':
        continue
    print(val)
# the output will be
i
n
g
g
e
When the loop encounters continue, the current iteration is skipped and rest
of the iterations get executed.
Functions
A function in python is a block of code which will execute whenever it is
called. We can pass parameters in the functions as well. To understand the concept
of functions, let's take an example.
Suppose you want to calculate the factorial of a number. You can do this by
simply executing the logic to calculate a factorial. But what if you have to do it ten
times in a day, writing the same logic again and again is going to be a long task.
Instead, what you can do is, write the logic in a function. Call that function
every time you need to calculate the factorial. This will reduce the complexity of your
code and save your time as well.
def function_name():
    # expression
    print('abc')

def my_func():
    print('function created')
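Returning to the factorial example above, a small sketch of wrapping that logic in a reusable function:
def factorial(n):
    # multiply 1 * 2 * ... * n
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

print(factorial(5))  # 120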
Function Parameters
We can pass values into a function using parameters. We can also give default values for a parameter in a function.
def my_func(name='default'):  # the default value here is illustrative
    print(name)

# default parameter
my_func()          # prints: default
# user-defined parameter
my_func('python')  # prints: python
Now that we have understood function calls, parameters and why we use
them, let's take a look at classes and objects in python.
class classname:
    def functionname(self):
        print(expression)

class myclass:
    def func(self):
        print('my function')

ob1 = myclass()
ob1.func()
__init__ function
class myclass:
    def __init__(self, name):
        self.name = name

ob1 = myclass('ingage')
print(ob1.name)
# the output will be: ingage
Now that we have understood the concept of classes and objects, let's take a look at a few OOP concepts that we have in python.
Abstraction
Data abstraction refers to displaying only the necessary details and hiding the
background tasks. Abstraction in python is similar to any other programming
language.
Like when we print a statement, we don’t know what is happening in the
background.
Encapsulation
Encapsulation is the process of wrapping up data. In python, classes can be an
example of encapsulation where the member functions and variables etc are
wrapped into a class.
Inheritance
Inheritance is an object oriented concept where a child class inherits all the
properties from a parent class. Following are the types of inheritance we have in
python:
● Single Inheritance
● Multiple Inheritance
● Multilevel Inheritance
Polymorphism
Polymorphism is the process in which an object can be used in many forms.
The most common example would be when a parent class reference is used to refer
to a child class object.
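A compact sketch of inheritance and polymorphism (the class names are invented for illustration):
class Animal:                       # parent class
    def speak(self):
        return "some sound"

class Dog(Animal):                  # single inheritance: Dog inherits from Animal
    def speak(self):                # overriding enables polymorphism
        return "woof"

animals = [Animal(), Dog()]
for a in animals:                   # the same call behaves differently per object
    print(a.speak())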
NumPy
Overview:
NumPy, which stands for Numerical Python, is a fundamental package for
scientific computing in Python.
It provides support for large, multi-dimensional arrays and matrices, along
with mathematical functions to operate on these arrays.
NumPy is the foundation for various data science and machine learning
libraries in Python.
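A minimal sketch of arrays, elementwise math, broadcasting, and random number generation:
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])               # a 2-D ndarray

print(a.mean())                         # a mathematical operation over the array
print(a * 10)                           # elementwise arithmetic
print(a + np.array([10, 20, 30]))       # broadcasting a 1-D array across rows

rng = np.random.default_rng(42)
print(rng.normal(size=3))               # random numbers from the random submodule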
Key Features:
● Arrays: The primary data structure in NumPy is the numpy.ndarray, which is
an n-dimensional array.
● Mathematical Operations: NumPy provides a vast array of mathematical
functions for performing operations on arrays, such as linear algebra,
statistical, and Fourier analysis functions.
● Broadcasting: NumPy allows for operations between arrays of different
shapes and sizes, making it convenient for mathematical computations.
● Random Module: NumPy includes a submodule for generating random
numbers, which is crucial in simulations and statistical analyses.
3.4 Pandas
Overview:
Pandas is a powerful data manipulation library built on top of NumPy. It
provides data structures for efficiently storing and manipulating large datasets.
The primary data structures in Pandas are Series and DataFrame.
Pandas is widely used for data cleaning, preparation, and analysis in data
science and machine learning workflows.
Key Features:
● DataFrame: A two-dimensional table with labeled axes (rows and columns),
similar to a spreadsheet or SQL table.
● Series: A one-dimensional labeled array that can hold any data type.
● Data Alignment: Pandas automatically aligns data based on label names,
making it easy to work with datasets from different sources.
● Data Cleaning and Handling Missing Data: Pandas provides methods for
handling missing data, such as filling missing values or dropping
rows/columns.
● GroupBy: Pandas supports the splitting of data into groups based on some
criteria and applying a function to each group independently.
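A brief sketch of a DataFrame, a Series, and a GroupBy (the data is made up):
import pandas as pd

df = pd.DataFrame({
    "region": ["South", "South", "East", "East"],
    "sales":  [100, 150, 90, 120],
})
s = df["sales"]                              # a Series (one labeled column)
print(type(s))

print(df.groupby("region")["sales"].sum())   # split-apply-combine per region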
Matplotlib
Overview:
● Purpose: Matplotlib is a 2D plotting library for creating static, animated, and
interactive visualizations in Python.
● Versatility: It is highly customizable and supports a wide variety of plot types,
from simple line charts to complex contour plots.
● Publication-Quality Graphics: Matplotlib is widely used for creating
publication-quality graphics in scientific and engineering publications.
Key Features:
● Plots and Charts: Matplotlib supports various types of plots, including line
plots, bar plots, scatter plots, histograms, and more.
● Customization: Users have full control over the appearance of plots. They can
customize colors, markers, labels, and other visual elements.
● Subplots: Multiple plots can be organized in a single figure, allowing for the
comparison of different datasets.
● Animations: Matplotlib supports creating animations, which is useful for
visualizing dynamic data.
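A minimal Matplotlib sketch showing a line plot with basic customization (the data is invented):
import matplotlib.pyplot as plt

months = [1, 2, 3, 4, 5, 6]
sales = [10, 14, 9, 17, 20, 18]

plt.plot(months, sales, marker="o", color="teal", label="Sales")
plt.title("Monthly Sales (illustrative data)")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.legend()
plt.show()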
Seaborn
Overview:
● Built on Matplotlib: Seaborn is a statistical data visualization library built on
top of Matplotlib.
● Focus on Aesthetics: Seaborn is designed to make attractive and informative
statistical graphics with fewer lines of code than Matplotlib.
● Statistical Plots: It includes several high-level functions for creating
informative statistical graphics, such as violin plots, box plots, and pair plots.
Key Features:
● High-Level Interface: Seaborn provides functions for creating complex
visualizations with minimal code, simplifying the process of creating statistical
plots.
● Color Palettes: Seaborn includes several color palettes, making it easy to
create visually appealing plots with consistent color schemes.
● Themes: Seaborn has built-in themes that can be applied to enhance the
aesthetics of plots.
● Regression Plots: Seaborn provides functions for visualizing linear regression
models, such as lmplot and regplot.
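A short Seaborn sketch of a statistical plot, using one of the library's bundled example datasets (load_dataset fetches it over the network the first time):
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                   # built-in example dataset

sns.set_theme()                                   # apply Seaborn's default theme
sns.boxplot(data=tips, x="day", y="total_bill")   # distribution per category
plt.show()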
# Read CSV
import pandas as pd

df = pd.read_csv("data.csv")

# Inspect data
print(df.head())
df.info()          # info() prints its summary directly
print(df.describe())
# Drop rows with missing values
df.dropna(inplace=True)
Renaming Columns:
df.rename(columns={'old_name': 'new_name'}, inplace=True)
# Select columns
df[['Name', 'Age']]

# Derive a new column by applying a function
# (categorize is assumed to be a user-defined helper that maps an Age to a label)
df['Category'] = df['Age'].apply(categorize)
Merging and Joining DataFrames
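A minimal sketch of merging two DataFrames on a shared key (the names and columns are illustrative):
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2], "amount": [250, 400]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ravi"]})

merged = pd.merge(orders, customers, on="customer_id", how="inner")
print(merged)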
# Save to CSV
df.to_csv('cleaned_data.csv', index=False)
CHAPTER 4 - DATA VISUALIZATION WITH LOOKER STUDIO
● Supports Decision-Making: Well-designed visuals help identify opportunities,
risks, and areas needing attention.
● Encourages Exploration: Interactive dashboards allow users to drill down into
data and customize views based on their questions.
In the world of data analytics, raw numbers alone are rarely sufficient.
Visualization translates data into a narrative that supports strategic actions.
Connecting Data Sources to Looker Studio
The journey begins with linking your data to Looker Studio. Looker Studio supports
dozens of data connectors, including:
● Google Sheets (ideal for beginners and small datasets)
● BigQuery (for large-scale analytics)
● Google Analytics (web traffic insights)
● SQL databases
● CSV files and more
For most learning scenarios, Google Sheets is the simplest and most
accessible option, allowing you to store your data in a familiar spreadsheet format
and visualize it instantly.
To connect:
1. Navigate to lookerstudio.google.com and sign in with your Google account.
2. Click on “Create” > “Report.”
3. Click “Add Data” and select “Google Sheets.”
4. Browse and select your spreadsheet file.
5. Choose the worksheet tab containing your data.
6. Click “Add” to link the data source to your report.
4.2 Creating Charts and Visual Elements
Looker Studio offers a rich variety of chart types to suit different analytical
needs:
To insert a chart, simply select “Add a chart,” choose the desired type, then
drag and drop the appropriate fields into “Dimension” and “Metric” slots.
Example: Imagine you have sales data with columns: Region, Product
Category, Date, and Sales Amount. You can create:
● A bar chart showing total sales per region (Dimension: Region, Metric:
Sales Amount)
● A line chart showing sales trends over time (Dimension: Date, Metric:
Sales Amount)
● A pie chart showing sales distribution across product categories
By blending these, you can create visuals that correlate customer satisfaction
with sales performance, providing actionable insights for product managers.
This new metric can then be used like any other field in charts and filters.
Looker Studio supports complex formulas and conditional statements (using
CASE) to tailor metrics precisely.
Best Practices for Designing Dashboards
● Keep it Simple: Avoid clutter; use only necessary charts and controls per
dashboard page.
● Use Consistent Colors and Fonts: This improves readability and
professionalism.
● Align Visuals: Proper alignment and spacing make reports easier to scan.
● Add Clear Titles and Labels: Each chart should have a meaningful title and axis
labels.
● Use Legends Effectively: When multiple categories exist, legends help viewers
interpret colors and symbols.
● Test Responsiveness: Preview dashboards on various devices to ensure
usability.
CHAPTER 5 - PREDICTIVE ANALYTICS AND REPORTING
Predictive analytics focuses on anticipating what is likely to happen next, in contrast to descriptive analytics, which summarizes what has already happened, and diagnostic analytics, which explores why events occurred.
At its core, predictive analytics uses mathematical models and computational
algorithms to answer questions such as: What is likely to happen? Which factors are
most influential? How can future risks or opportunities be anticipated? These
insights enable organizations and individuals to optimize decision-making, reduce
uncertainty, and improve performance across numerous domains.
● Enhance planning and forecasting accuracy.
For example, a company might predict customer churn and proactively retain
clients, or a healthcare provider may forecast disease outbreaks to allocate medical
resources efficiently. The versatility and actionable insights offered by predictive
analytics have made it a cornerstone of data-driven strategy.
These processes ensure that models are trained on reliable and meaningful
data, increasing the likelihood of accurate predictions.
Features and Target Variables
● Features (Independent Variables): These are the input variables used
to predict the outcome. Features can be numeric (e.g., age,
temperature) or categorical (e.g., gender, region).
● Target Variable (Dependent Variable): This is the outcome or result the
model aims to predict, such as a customer’s purchase decision or the
price of a stock.
● Correctly defining and preparing these variables is crucial for building
effective models.
Types of Predictive Models
Predictive analytics utilizes various modeling approaches depending on the
problem type:
● Regression Models: Used for predicting continuous numerical values.
For example, forecasting sales revenue or temperature.
● Classification Models: Predict categorical outcomes such as fraud
detection (fraud/not fraud) or email filtering (spam/not spam).
● Time Series Forecasting: Analyzes data collected over time to forecast
future values, such as stock prices or demand forecasting.
● Clustering: Groups similar data points to uncover hidden patterns
without predefined categories.
Selecting the appropriate model depends on the nature of the data and the
specific prediction goal.
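As a hedged sketch of one of these model types, here is a tiny scikit-learn classification example on synthetic data (the features and labels are invented):
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [transaction amount, hour of day]; target: 1 = fraud, 0 = not fraud
X = np.array([[20, 10], [15, 14], [900, 3], [850, 2], [30, 12], [700, 4]])
y = np.array([0, 0, 1, 1, 0, 1])

model = LogisticRegression()
model.fit(X, y)
print(model.predict([[800, 3], [25, 11]]))  # predicted classes for new cases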
1. Problem Definition: Clearly articulate the question or problem to be
solved.
2. Data Collection: Gather historical and relevant data.
3. Data Preparation: Clean, transform, and select features.
4. Model Selection: Choose a suitable algorithm or approach.
5. Model Training: Use the prepared data to teach the model how to
make predictions.
6. Model Evaluation: Test the model on new data to assess accuracy and
robustness.
● Feature Engineering: Creating variables such as “Ad Spend,” “Month,”
and “Competitor Price.”
● Model Training: Using linear regression to find relationships between
features and sales.
● Evaluation: Assessing model accuracy using metrics like Mean Squared
Error (MSE) and R-squared.
● Prediction: Forecasting next quarter’s sales given planned advertising
budgets.
This example illustrates how numeric prediction tasks can guide strategic
decisions.
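A hedged sketch of that workflow with scikit-learn; the data below is synthetic, and the feature names simply follow the example above (Ad Spend, Month, Competitor Price):
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic historical data, for illustration only
df = pd.DataFrame({
    "ad_spend":         [10, 12, 15, 11, 18, 20, 22, 19],
    "month":            [1, 2, 3, 4, 5, 6, 7, 8],
    "competitor_price": [9.5, 9.7, 9.4, 9.9, 9.2, 9.0, 8.8, 9.1],
    "sales":            [100, 110, 130, 105, 150, 165, 180, 160],
})

X = df[["ad_spend", "month", "competitor_price"]]   # features
y = df["sales"]                                      # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)     # model training
pred = model.predict(X_test)

print(mean_squared_error(y_test, pred))              # MSE
print(r2_score(y_test, pred))                        # R-squared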
Interactive dashboards built with tools like Looker Studio allow users to filter and
explore predictions dynamically, improving decision-making.
Ethical Considerations
Predictive analytics raises important ethical questions:
● Bias and Fairness: Models trained on biased data can perpetuate inequalities
or unfair treatment.
● Privacy: Protecting personal and sensitive data during collection, modeling,
and reporting.
● Transparency: Ensuring stakeholders understand how predictions are made
and their limitations.
● Accountability: Using predictions responsibly, avoiding misuse or
over-reliance on automated decisions.