Data Analytics
Module 2
Syllabus
Module -2 DATA EXPLORATION
• Overview, Observations and variables, Types of
Variables, Central Tendency, Distribution of the data,
confidence intervals, preparing data tables, visualizing
relationships between variables, calculating metrics
about relationships, case studies. 8 Hours
Introduction to Describing Data
• Understanding Data Tables
• The starting point for data analysis is a data table, also
referred to as a data set.
• These tables contain raw data values represented as
numbers or text.
• Examples include patient weight measurements (e.g., 150 lb)
or categorizations like industrial sectors (e.g.,
"telecommunications industry").
Structure of Data Tables
• Individual items about which data has been collected or
measured are typically shown as rows, referred to as
observations.
• Information considered interesting across these
observations is displayed as columns, called variables.
• For example, a table about cars might have each car as an
observation (row) and its attributes like weight or fuel
efficiency as variables (columns).
Observations
• Defining Observations
• Observations are the individual items or entities about which
data is collected.
• In a data table, each row represents a single
observation.
• Examples:
• Medical researchers collect data on patients.
• The automotive industry collects data on cars.
• Retail companies collect data on transactions.
Defining Variables
• When an attribute describes some aspect across all
observations, it is called a variable.
• In a data table, each column represents a variable.
• Example: For cars, variables could include Name, MPG,
Cylinders, Displacement, Horsepower, Weight, Acceleration,
Model year, and Origin.
• Importance of Understanding Variables
• It is essential to understand individual variables prior to
performing data analysis or data mining.
• Many data analysis techniques have restrictions on the
types of variables they can process.
Types of Variables: Discrete vs.
Continuous
• Initial Categorization based on Values
• Discrete Variable: A variable with a fixed number of distinct
values.
• Example: An "industrial sector" variable might have values like
"telecommunication industry," or "retail industry".
• Categorical in nature.
• Eg: Number of children in a family, Number of cars in a parking lot, Number
of defects in a batch
• Continuous Variable: A variable that can take any numeric
value within its range.
• Example: A "patient's weight" (e.g., 153.2 lb, 98.2 lb).
• Continuous variables may have an infinite number of possible values
within their range.
Types of Variables:
Measurement Scales
• Scales of Measurement
• Variables are also classified by the scale on which they
are measured.
• Scales help us understand the precision of an individual
variable and are used to make choices about data
visualizations as well as methods.
• Nominal Scale
• Ordinal Scale
• Interval Scale
• Ratio Scale
Nominal Scale:
• Describes a variable with a limited number of different values that cannot
be ordered.
• No numeric meaning, Cannot perform arithmetic operations
• Values merely assign an observation to a particular category.
• Example:
• "Industry" with values such as "financial," "engineering," or "retail" – the order
of these values has no meaning.
• Gender (Male, Female), Blood type (A, B, AB, O), Eye color (Brown, Blue,
Green), Country of origin (India, USA, UK, Japan)
Ordinal Scale:
• Describes a variable whose values can be ordered or ranked.
• Values are assigned to a fixed number of categories.
• Example: A scale with values "low," "medium," and "high" indicates order (high
> medium > low), Education level(High School < Bachelor < Master), Customer
satisfaction (Very dissatisfied to Very satisfied), Pain level (Mild < Moderate <
Severe)
• However, it is impossible to determine the magnitude of the difference between
the values (e.g., the difference between "high" and "medium" cannot be compared).
Interval Scale:
• Describes values where the interval between values can be compared.
• The intervals share the same unit of measurement.
• Example:
• Fahrenheit temperature (5°F, 10°F, 15°F) – the difference between 5 and 10 is 5°, and
between 10 and 15 is also 5°.
• IQ Scores (90-100, 101-110, 111-120)
• Lacks a meaningful zero, so ratios of values cannot be compared (e.g.,
10°F is not twice as hot as 5°F).
Ratio Scale:
• Describes variables where both intervals between values and ratios of
values can be compared.
• Numeric scale with a meaningful zero, allowing for all mathematical
operations, including ratios
• Ordered categories, Equal intervals, Has a true zero point (absence of
quantity),
• Eg: Weight (0-10kg,11-20kg,21-30kg), Income (0-50,000, 50,000-1,00,000), A
bank account balance ($5, $10, $15) – $10 is twice as much as $5.
Types of Variables: Special
Cases & Annotations
• Dichotomous Variable:
• A variable that can contain only two values.
• Example: "Gender" (e.g., "male" or "female").
• Binary Variable:
• A widely used dichotomous variable with values 0 or 1.
• Provides a convenient numeric representation for many
types of discrete data in analysis.
• Example: A "Purchase" variable using 0 for no purchase and 1
for purchase.
Types of Variables: Special
Cases & Annotations
• Unique Identifier Variable:
• Used to identify each observation uniquely in a data table
(e.g., a customer reference number).
• Never used directly in data analysis as its values are for
linking to individual records, not for analytical patterns.
• Identical Value Variable:
• A variable that has identical values across all
observations.
• Example: A "Calibration" setting for a machine that is the
same for all measurements.
• Retained to understand how data was generated or for
accuracy assessment when merging data.
Types of Variables: Special
Cases & Annotations
• Annotations of Variables:
• Provide important additional information about the
context of the data.
• Examples: Is it a count or a fraction? A time or a date? A
financial term? A derived value?.
• Units of measurement are critical for interpreting data and
merging data tables from different sources.
1. For each of the following variables, assign them to one of the following scales:
nominal, ordinal, interval, or ratio:
(a) Name
(b) Age
(c) Gender
(d) Blood group
(e) Weight (kg)
(f) Height (m)
(g) Systolic blood pressure (mmHg)
(h) Diastolic blood pressure (mmHg)
(i) Diabetes
Diabetes is Nominal. The numbers (0 and 1) are just labels used to represent
categories. There is no inherent order or magnitude between them.
Why blood pressure is a ratio scale variable
Criterion | Blood Pressure | Explanation
Has order | ✔ Yes | Higher values mean more pressure.
Equal intervals | ✔ Yes | The difference between 120 mmHg and 130 mmHg is the same as between 140 mmHg and 150 mmHg.
True zero point | ✔ Yes | 0 mmHg means no pressure at all (absolute zero).
Meaningful ratios | ✔ Yes | 120 mmHg is twice the pressure of 60 mmHg.
Celsius (°C) or Fahrenheit (°F) → Interval Scale
Note: Kelvin is a ratio scale measurement.
Property | Celsius / Fahrenheit | Explanation
Ordered values | ✔ Yes | Higher temperature means more heat.
Equal intervals | ✔ Yes | Difference between 10°C and 20°C equals the difference between 20°C and 30°C.
True zero point | ❌ No | 0°C or 0°F does not mean absence of temperature.
Meaningful ratios | ❌ No | Without a true zero, ratios cannot be compared (e.g., 10°F is not twice as hot as 5°F).
• Which of the following is measured on a nominal scale?
a) Temperature in Celsius
b) Blood group
c) Weight in kilograms
d) Exam score
• Which scale of measurement has a true zero and allows all arithmetic operations?
a) Nominal
b) Ordinal
c) Interval
d) Ratio
• Which variable is measured on an ordinal scale?
a) Date of birth
b) Satisfaction rating (e.g., Poor, Fair, Good, Excellent)
c) Height in centimeters
d) Gender
• What scale of measurement is used for the variable ‘marital status’?
a) Nominal
b) Ordinal
c) Interval
d) Ratio
• Classify the following variables and justify your answer:
• Salary
• Eye color
• Academic rank (Professor, Associate Professor, Assistant Professor)
• Date of birth
Central Tendency
• Overview of Central Tendency
• One of the most important ways to summarize a variable is to
quantify the middle or central location of its values.
• This represents the value around which many of the
observations' values for that variable lie.
• The choice of method depends on the variable's classification.
"The mean tells you what’s typical, the median shows what’s central, and the mode
reveals what’s popular."
Mode
• The most commonly reported value for a particular
variable.
• Example: For values (3, 4, 5, 6, 7, 7, 7, 8, 8, 9), the mode is 7.
• Useful for variables measured on a nominal scale. Can also
be calculated for ordinal, interval, and ratio scales.
• If multiple values have the same highest frequency, all can be
reported or a midpoint chosen (e.g., {7, 8} or 7.5 for 3, 4, 5,
6, 7, 7, 7, 8, 8, 8, 9).
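The mode examples above can be checked with a short Python sketch. It uses `multimode` from the standard `statistics` module; the car-origin list at the end is a made-up illustration, not data from the notes.

```python
from statistics import multimode

# multimode returns every value that shares the highest frequency.
single_mode = multimode([3, 4, 5, 6, 7, 7, 7, 8, 8, 9])
tied_modes = multimode([3, 4, 5, 6, 7, 7, 7, 8, 8, 8, 9])

# When several values tie, one option is to report their midpoint.
midpoint = sum(tied_modes) / len(tied_modes)

print(single_mode)  # [7]
print(tied_modes)   # [7, 8]
print(midpoint)     # 7.5

# The mode also works for nominal (text) values (hypothetical data).
origin_mode = multimode(["America", "Europe", "America", "Asia"])
print(origin_mode)  # ['America']
```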
Median
• The middle value of a variable, once it has been sorted
from low to high.
• Example: For sorted values (2, 2, 3, 3, 4, 4, 4, 4, 7, 7, 7), the
median is the 6th value, which is 4.
• For an even number of values, it's the average of the two
values closest to the middle.
• Can be calculated for ordinal, interval, and ratio scales.
• It is not distorted by extreme values, making it a good
indicator of central value for interval or ratio scales.
Mean (Average)
• The sum of all values divided by the number of values.
• Most commonly used summary of central tendency for variables measured on
the interval or ratio scales.
• Example: For (3, 4, 5, 7, 7, 8, 9, 9, 9), the sum is 61, and the mean is
61 ÷ 9 = 6.78.
• Formula (for a sample x with n observations): x̄ = (∑xi) / n.
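As a rough illustration, the median and mean examples can be reproduced with the standard `statistics` module. The appended value of 100 is a hypothetical extreme value, added only to show that the median moves far less than the mean.

```python
from statistics import mean, median

median_values = [2, 2, 3, 3, 4, 4, 4, 4, 7, 7, 7]
mean_values = [3, 4, 5, 7, 7, 8, 9, 9, 9]

med = median(median_values)       # 6th of the 11 sorted values
avg = mean(mean_values)           # 61 / 9
even_med = median([2, 3, 4, 6])   # even count: average of 3 and 4

print(med)            # 4
print(round(avg, 2))  # 6.78
print(even_med)       # 3.5

# Hypothetical extreme value: the mean shifts a lot, the median only slightly.
with_outlier = mean_values + [100]
print(median(mean_values), median(with_outlier))  # 7 7.5
print(round(mean(with_outlier), 2))               # 16.1
```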
Distribution of Data:
Understanding Variation
• Beyond Central Location
• While central tendency gives a single "middle" value, it
provides no insight into the variation or spread of the
data.
• Understanding variation means understanding how different
values are distributed around the central location.
• Frequency Distribution
• A simple count of how many times a value occurs.
• Often it is the starting point for analysing variation.
• Can be explored using simple data visualizations and
calculated metrics.
• Plays a role in selecting appropriate data analysis
approaches.
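A frequency distribution is simply a count per value, which can be sketched with `collections.Counter`. The values reuse the mode example from earlier in these notes.

```python
from collections import Counter

values = [3, 4, 5, 6, 7, 7, 7, 8, 8, 9]
freq = Counter(values)   # maps each value to how often it occurs
n = len(values)

# Print value, count, and relative frequency, ordered by value.
for value in sorted(freq):
    print(value, freq[value], f"{freq[value] / n:.0%}")
```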
Visualizing Distribution
• Purpose of Visualization
• Visualization aids in understanding data distribution,
including:
• The range of values.
• The shape created when values are plotted.
• Outliers (values at the extremes).
Bar Charts for Nominal
Variables
• Used to display the relative frequencies for different
values of a variable measured on a nominal scale.
• Each bar represents a value, and the height of the bars is
proportional to the frequency.
• Example: Car Origin variable ("America," "Europe," "Asia")
showing counts like 244 for "America".
• The ordering of the x-axis is arbitrary (often alphabetical
or by frequency).
• The y-axis can also display proportion or percentage
instead of raw frequency.
Bar Charts for Ordinal Variables
• Can also be used for variables measured on an ordinal scale with
a small number of values.
• Example:
• 1. PLT variable (number of mother's previous premature labors) with values 1,
2, 3, 4, showing decreasing observations as values increase.
• 2. A bar chart for ordinal variables displays categories with a natural order
(e.g., satisfaction levels- unsatisfied, neutral, satisfied, very satisfied) along
the x-axis and their frequencies on the y-axis.
Visualizing Distribution: Frequency Histograms
• Frequency Histograms for Ordered Scales
• Useful for variables with an ordered scale (ordinal, interval, or ratio) that
contain a larger number of values.
• It is used for continuous numerical data
• Variable values are divided into a series of ranges (groups).
• Bar heights are proportional to the number of observations within each
range.
• Ranges are ordered from low to high along the x-axis.
• Example: Acceleration variable grouped into ranges like 6-8, 8-10, etc., showing
most observations between 12 and 20.
• Typically display between 5 and 10 groups with easy-to-interpret boundary
values.
Note:
• Bar Chart is used for categorical data (nominal or ordinal), with spaces between bars,
showing frequencies of distinct categories.
• Histogram is used for continuous numerical data, with no spaces between bars, showing
the distribution of data over intervals (bins).
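The grouping step behind a histogram can be sketched without any plotting library. The acceleration values below are hypothetical, and `bin_counts` is an illustrative helper name; each range includes its lower boundary and excludes its upper boundary.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in n_bins ranges [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:   # ignore values outside the covered ranges
            counts[index] += 1
    return counts

# Hypothetical acceleration values, grouped into ranges 6-8, 8-10, ..., 24-26.
acceleration = [8.5, 11.0, 12.5, 13.0, 14.2, 15.5, 15.9,
                16.4, 17.0, 18.3, 19.1, 21.0, 23.5]
counts = bin_counts(acceleration, start=6, width=2, n_bins=10)

# A text histogram: bar length is proportional to the count in each range.
for i, c in enumerate(counts):
    low = 6 + 2 * i
    print(f"{low:>2}-{low + 2:<2} {'#' * c}")
```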
Understanding the Shape of
Distribution
• Histograms help to understand the shape of the frequency
distribution.
• Common Frequency Distributions:
• Constant: Number of observations remains constant as values
increase.
• Normal Distribution: Most observations centered around the mean,
with fewer at extremes, tapering off symmetrically (bell-shaped). Many
data analysis techniques assume an approximate normal
distribution.
• Bimodal Distribution: Values cluster in two locations.
Identifying Unusual Data:
• Can reveal if data contains two distinct types of
observations (e.g., two approximate normal distributions).
• Can show a small number of high values that don't
follow the main distribution, possibly indicating errors
or anomalies.
Measures of Variation: Range &
Quartiles
• Range
• A simple measure of the variation for a particular variable.
• Calculated as the difference between the highest and
lowest values.
• Example: For values (2, 3, 4, 6, 7, 7, 8, 9), the range is 9 - 2 =
7.
• Can be used with variables measured on an ordinal, interval,
or ratio scale.
Measures of Variation: Range &
Quartiles
• Quartiles
• Divide a continuous variable into four even segments based
on the number of observations.
• First Quartile (Q1): At the 25% mark of the sorted data.
• Second Quartile (Q2): At the 50% mark, which is the same as the
median value.
• Third Quartile (Q3): At the 75% mark.
• Example: For sorted values (2, 2, 3, 3, 4, 4, 4, 4, 7, 7, 7), Q1 is 3 and
Q3 is 7.
• Interquartile Range: The range from Q1 to Q3. In the example, 7 -
3 = 4.
• If boundaries don't fall on specific values, quartiles are calculated
based on adjacent numbers.
Odd number data points
• Data (sorted): 1, 3, 5, 7, 9, 11, 13, 15, 17
• Step 1: Find Q2 (Median). Middle value (5th number): Q2 = 9
• Step 2: Find Q1 (Median of lower half)
• Lower half (left of median, first 4 numbers): 1, 3, 5, 7
• Middle two values: 3 and 5
• Q1 = (3 + 5)/2 = 4
• Step 3: Find Q3 (Median of upper half)
• Upper half (right of median, last 4 numbers): 11, 13, 15, 17
• Middle two: 13 and 15
• Q3 = (13 + 15)/2 = 14
• Q4 = Maximum = 17
What about when the sample size is even?!
• Data: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24 (12 NUMBERS)
• Step 1: Find Q2 (Median)
• Middle two values: 12 and 14
• Q2 = (12 + 14)/2 = 13
• Step 2: Find Q1 (Median of lower half)
• Lower half (left of median): 2, 4, 6, 8, 10, 12
• Middle two: 6 and 8
• Q1 = (6 + 8)/2 = 7
• Step 3: Find Q3 (Median of upper half)
• Upper half (right of median): 14, 16, 18, 20, 22, 24
• Middle two: 18 and 20
• Q3 = (18 + 20)/2 = 19
• Step 4: Q4 (Maximum) = 24
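The median-of-halves steps worked through above can be sketched as a small function. The name `quartiles` is illustrative. Note that the standard library's `statistics.quantiles` uses a different interpolation rule by default and may return slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, the middle value is left out of both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

odd_q = quartiles([1, 3, 5, 7, 9, 11, 13, 15, 17])
even_q = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24])

print(odd_q)   # (4.0, 9, 14.0)
print(even_q)  # (7.0, 13.0, 19.0)

q1, _, q3 = quartiles([2, 2, 3, 3, 4, 4, 4, 4, 7, 7, 7])
print(q3 - q1)  # interquartile range: 4
```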
Visualizing Variation: Box Plots
• Purpose of Box Plots
• Provide a succinct summary of the overall frequency
distribution of a variable.
• Elements Displayed
• Typically display six values: lowest value, lower quartile (Q1),
median (Q2), upper quartile (Q3), highest value, and the
mean.
• Conventional Box Plot Structure
• The box in the middle represents the central 50% of
observations (between Q1 and Q3).
• A vertical line shows the location of the median value.
• A dot represents the location of the mean value.
• Horizontal lines (whiskers) extend from the box to the "lowest
value" and "highest value" (or non-outlier extremes), representing the
values in the first and fourth quartiles.
Identifying Outliers
• Some box plots graphically separate outliers from the whiskers, explicitly drawing them as
small circles outside the main plot.
• The “whiskers” are not necessarily the true minimum and maximum values of the data.
Instead, they usually extend no further than:
• Lower whisker limit = Q1 − 1.5 × IQR
• Upper whisker limit = Q3 + 1.5 × IQR, where IQR = Q3 − Q1.
• The dots/circles you see beyond the whiskers represent outliers. These are values that are:
• < Q1 − 1.5 × IQR (too low)
• > Q3 + 1.5 × IQR (too high)
• Ex: The box spans ~17 to ~30 MPG (Q1 to Q3). Whiskers extend approximately from ~13 to ~35
MPG. Circles at ~8, ~39, ~40, and ~46 are outliers. These values are outside the 1.5×IQR range.
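The 1.5 × IQR rule can be sketched as follows. The MPG values are hypothetical and `iqr_fences` is an illustrative name; quartiles are computed with the median-of-halves method, and the sketch assumes at least two values.

```python
from statistics import median

def iqr_fences(values):
    """Return (lower limit, upper limit) from the 1.5 x IQR rule."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])    # median of the lower half
    q3 = median(data[-half:])   # median of the upper half
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical MPG values.
mpg = [9, 14, 17, 18, 20, 22, 24, 25, 27, 30, 33, 46]
low, high = iqr_fences(mpg)

outliers = [v for v in mpg if v < low or v > high]
inside = [v for v in mpg if low <= v <= high]
whiskers = (min(inside), max(inside))   # most extreme non-outlier values

print(low, high)   # 1.0 45.0
print(outliers)    # [46]
print(whiskers)    # (9, 33)
```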
• Understanding Symmetry
• Box plots help in understanding the symmetry of a frequency distribution.
• If the mean and median have approximately the same value, the distribution will
be roughly symmetric, with about the same number of values above and below the
mean.
Measures of Variation: Variance
• Variance (s²)
• Describes the spread of the data.
• Measures how much the values of a variable differ from the mean.
• For a sample, the formula is: s² = ∑(xi - x̄)² / (n - 1),
where xi = individual value, x̄ = mean, n = number of observations.
• Reflects the average squared deviation.
• Calculated for variables measured on the interval or
ratio scale.
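The sample variance formula can be followed step by step in Python. This sketch reuses the values from the mean example and compares the manual result with `statistics.variance`, which also divides by n - 1.

```python
from statistics import mean, variance

values = [3, 4, 5, 7, 7, 8, 9, 9, 9]
x_bar = mean(values)

# Squared deviation of each value from the mean.
squared_deviations = [(x - x_bar) ** 2 for x in values]

# Sample variance: divide by (n - 1), not n.
s_squared = sum(squared_deviations) / (len(values) - 1)

print(round(s_squared, 3))         # 5.194
print(round(variance(values), 3))  # same value from the standard library
```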
Standard Deviation
• Standard Deviation (s)
• The square root of the variance.
• For a sample, the formula is: s = √[∑(xi - x̄)² / (n - 1)].
• It is the most widely used measure of the deviation of a variable.
• Higher values indicate data values are more widely distributed around the
mean.
• Interpretation for Normal Distribution:
o Approximately 68% of all observations fall within one standard
deviation of the mean.
o Approximately 95% of all observations fall within two standard
deviations of the mean.
• Calculated for variables measured on the interval or ratio scales.
• 68% of data falls within ±1 standard deviation (σ) of the mean (μ): μ−σ ≤ x ≤ μ+σ.
This means that the majority of values are close to the average.
• 95% of data falls within ±2 standard deviations (σ): μ−2σ ≤ x ≤ μ+2σ.
Most values (almost all) are within this range. This is used to check for reasonableness or outliers.
Standard Deviation
• The standard deviation is the most widely used measure of the deviation of a
variable.
• The higher the value, the more widely distributed the variable’s data values are
around the mean.
• Assuming the frequency distribution is approximately normal (i.e., a bell-shaped
curve), about 68% of all observations will fall within one standard deviation of the
mean (34% less than and 34% greater than).
• For example, a variable has a mean value of 45 with a standard deviation value of 6.
• Approximately 68% of the observations should be in the range 39–51 (45 ± one
standard deviation) and approximately 95% of all observations fall within two
standard deviations of the mean (between 33 and 57).
• Standard deviations can be calculated for variables measured on the interval or ratio
scales.
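A minimal sketch of the example above (mean 45, standard deviation 6), assuming the variable is approximately normal. `NormalDist` from the standard library is used only to confirm the 68% and 95% shares. The first lines compute the sample standard deviation of the values from the earlier mean example.

```python
from statistics import NormalDist, stdev

# Sample standard deviation (square root of the sample variance).
s = stdev([3, 4, 5, 7, 7, 8, 9, 9, 9])
print(round(s, 2))  # 2.28

mean_value, sd = 45, 6
one_sd = (mean_value - sd, mean_value + sd)
two_sd = (mean_value - 2 * sd, mean_value + 2 * sd)
print(one_sd, two_sd)  # (39, 51) (33, 57)

# Share of a normal distribution that falls inside each range.
dist = NormalDist(mu=mean_value, sigma=sd)
share_one = dist.cdf(one_sd[1]) - dist.cdf(one_sd[0])
share_two = dist.cdf(two_sd[1]) - dist.cdf(two_sd[0])
print(round(share_one, 2), round(share_two, 2))  # 0.68 0.95
```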
Tutorial
1. Calculate the following statistics
for the variable Age (from Table 2.5):
(a) Mode
(b) Median
(c) Mean
(d) Range
(e) Variance
(f) Standard deviation
Z-Score
A z-score (also called a standard score) tells you how many
standard deviations a data point is from the mean of a dataset.
Formula: z = (xi - x̄) / s.
• where z is the z-score, xi is the actual data value, x̄ is the mean for the
variable, and s is the standard deviation.
• A z-score of 0 indicates that a data element’s value is the same as the
mean.
• Data elements with z-scores greater than 0 have values greater than
the mean; z-scores less than 0 have values less than the mean.
• This calculation can be useful for comparing variables measured on
different scales.
Scenario
• Scenario: You take a math test.
• Your score (x): 85
• Class average (mean, μ): 75
• Standard deviation of class scores (σ): 5
• Your z-score: z = (85 − 75) / 5 = 2, i.e., two standard deviations above the class average.
• The standard deviation tells us how spread out the
scores are. A small standard deviation means scores are
clustered close to the average, while a large standard
deviation means they're more spread out.
• Magnitude of Z-score: The larger the absolute value of the z-score, the
further away from the mean your data point is. A z-score of 2 is quite a bit
above average.
• Comparison: If another student scored 70, their z-score would be
(70−75)/5=−1. This means their score is 1 standard deviation below the
average. By comparing z-scores, you can easily see that you performed
significantly better than them relative to the class.
• Normal distribution: In a normal distribution (bell curve), most data
points fall within 1 to 2 standard deviations of the mean. A z-score of 2
suggests you did exceptionally well compared to your classmates.
Roughly 95% of data falls within 2 standard deviations of the mean,
meaning you scored better than about 97.5% of the class.
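The test-score scenario can be reproduced in a few lines. The function name `z_score` is illustrative, and the final step assumes the class scores are approximately normally distributed.

```python
from statistics import NormalDist

def z_score(x, mean_value, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean_value) / sd

your_z = z_score(85, mean_value=75, sd=5)
other_z = z_score(70, mean_value=75, sd=5)
print(your_z, other_z)  # 2.0 -1.0

# Share of a normal distribution lying below a z-score of 2.
share_below = NormalDist().cdf(your_z)
print(round(share_below, 3))  # 0.977
```

The exact share (about 97.7%) is close to the 97.5% quoted above, which comes from the rounded "95% within two standard deviations" rule.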
Shape of Distribution: Skewness
• Shape Overview
• Beyond central tendency and spread, the overall shape of a variable's frequency
distribution provides additional insights.
• Skewness
• Skewness is a measure of asymmetry in the distribution of data around the
mean. It tells you whether the data is symmetrically distributed, or if it leans
to the left or right.
• Quantifies the lack of symmetry in the distribution of a variable.
• Positive (Right) Skew: The bulk of observations are to the left of the mean, and
the right tail is longer.
• Negative (Left) Skew: The bulk of observations are to the right of the mean, and
the left tail is longer.
• A skewness value of zero indicates a symmetric distribution.
Types of Skewness:
1. Symmetrical (Zero Skewness)
• Mean ≈ Median ≈ Mode
• The left and right sides of the distribution are mirror images.
• Example: Heights of adult men, where most heights are centered around 170 cm with
equal variation above and below.
2. Positive Skew (Right Skewed)
• Tail is longer on the right
• Mean > Median > Mode
• Example: Income distribution: Most people earn lower incomes, but a few individuals
earn very high amounts, pulling the mean to the right.
• Data: [25K, 28K, 30K, 32K, 35K, 50K, 100K, 250K]
• Median = 33.5K
• Mean = 68.75K → much higher due to outliers
• → Right-skewed
3. Negative Skew (Left Skewed)
• Tail is longer on the left
• Mean < Median < Mode
• Example: Age at retirement. Most people retire around 60–65, but a few retire early
(e.g., 40), dragging the mean down.
• Data: [40, 55, 58, 60, 61, 62, 63, 65]
• Median = 60.5
• Mean = 58 → pulled left by the early retiree
• → Left-skewed
Example: alkphos (positive skewness of 0.763) has more observations to the left of the
mean, while mcv (negative skewness of -0.394) has more to the right.
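One way to quantify skewness is the moment-based coefficient m3 / m2^1.5, sketched below on the income and retirement data sets from these notes. This particular formula is an assumption for illustration; statistical packages often apply a sample-size correction, so their values may differ slightly.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based skewness: third central moment / (second moment ** 1.5)."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

incomes = [25, 28, 30, 32, 35, 50, 100, 250]   # in thousands
retirement = [40, 55, 58, 60, 61, 62, 63, 65]

print(mean(incomes), median(incomes))        # 68.75 33.5 -> mean > median
print(mean(retirement), median(retirement))  # 58 60.5    -> mean < median

print(skewness(incomes) > 0)     # True: right-skewed
print(skewness(retirement) < 0)  # True: left-skewed
```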
Kurtosis
• Characterizes the type of peak a distribution has.
• High kurtosis score: Indicates a pronounced peak near the mean.
• Low kurtosis score: Indicates a flat peak.
• Importance for Data Analysis
• Skewness and kurtosis values close to zero indicate that the frequency distribution
for a variable approximates a normal distribution.
• This is important for checking assumptions in certain data analysis
methods.
• Kurtosis ≈ 0: Balanced, bell-shaped curve. Moderate tails and peak.
• Kurtosis > 0: Taller, sharper peak with heavier tails. More likely to have extreme outliers.
• Kurtosis < 0: Flatter peak with lighter tails. Less prone to outliers, values are more spread out.
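A kurtosis score that is zero for a normal distribution is usually the "excess" kurtosis, m4 / m2² − 3. The sketch below assumes that definition; both data sets are made up to show the sign of the result.

```python
from statistics import mean

def excess_kurtosis(values):
    """Fourth central moment / (second moment squared), minus 3."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

flat = list(range(1, 11))                  # evenly spread values
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # sharp peak with two extreme values

print(excess_kurtosis(flat) < 0)    # True: flat peak, light tails
print(excess_kurtosis(peaked) > 0)  # True: pronounced peak, heavy tails
```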
Confidence Intervals: Estimating Population Values
• Up to this point, we have been looking at ways of summarizing information on a set of
randomly collected observations.
• This summary information is usually referred to as statistics as they summarize only a
collection of observations that is a subset of a larger population.
• However, information derived from a sample of observations can only be an
approximation of the entire population.
• To make a definitive statement about an entire population, every member of that
population would need to be measured.
• Eg: Average weight of men in the United States is 194.7 lb.
• We would have to collect the weight measurements for every man living in the United
States and derive a mean from these observations. This is not possible or practical in
most situations.
• It is possible, however, to make estimates about a population by using confidence
intervals.
• Confidence intervals are a measure of our uncertainty about the statistics we
calculate from a single sample of observations.
• A confidence interval (CI) is a range of values that’s used to estimate the true value of
a population parameter (like a mean or proportion)
• It provides an interval estimate instead of a single value, and includes
a confidence level, such as 95% or 99%, which tells us how confident
we are that the interval contains the true population parameter.
• For example, the confidence interval might state that the average
weight of men in the United States is between 191.2 lb and 198.2 lb to
take into account the uncertainty of measuring only a sample of the
total population.
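• For a sample mean, the interval is calculated as x̄ ± z𝛼∕2 × (s / √n), where s / √n is
called the standard error of the sampling distribution.
Note: If you repeated the sampling process 100 times and calculated a 95% confidence
interval each time, about 95 of those 100 intervals would contain the true population mean.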
Example
• Fuel Efficiency
Suppose fuel efficiency of 100 cars: Mean (x̄) = 30.35 MPG, Standard Deviation (s) = 2.01.
Using a 95% confidence level (𝛼 = 0.05), the z𝛼∕2 value is 1.96.
Calculation: 30.35 ± 1.96 × (2.01 / √100).
Result: 30.35 ± 0.394.
Confidence Interval: between 29.956 and 30.744 MPG.
• Note for Small Samples
If the population standard deviation (sigma) is unknown and the number of
observations (n) is less than 30, a t-distribution should be used instead of
the z-distribution.
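The fuel-efficiency interval can be computed as a sketch with the standard library. The function name `z_confidence_interval` is illustrative; the z value is taken from `NormalDist().inv_cdf` rather than a table, and the z-based calculation is only appropriate for large samples, as noted above.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, s, n, confidence=0.95):
    """Return (low, high) for a z-based confidence interval of the mean."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    margin = z * s / sqrt(n)                  # z times the standard error
    return x_bar - margin, x_bar + margin

low, high = z_confidence_interval(x_bar=30.35, s=2.01, n=100)
print(round(low, 2), round(high, 2))  # 29.96 30.74
```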
• Tutorial 2. Using the data in Table 2.6, create a histogram of Sale Price ($) using the
following intervals: 0 to less than 250, 250 to less than 500, 500 to less than 750,
and 750 to less than 1000.
Preparing data tables
• Preparing the data is one of the most time-consuming parts
of a data analysis/data mining project.
• The process of preparing data for analysis includes
Cleaning the data,
Removing certain variables or observations,
Generating consistent scales across variables,
Generating new frequency distributions,
Converting text to numbers,
Converting continuous data to categories,
Combining variables, generating groups,
Preparing unstructured data.
Cleaning the data
• For variables measured on a nominal or ordinal scale (where there are a fixed number of
possible values) it is useful to inspect all possible values to uncover mistakes, duplications
and inconsistencies.
• For example, a variable Company may include a number of different spellings for the same
company such as “General Electric Company,” “General Elec. Co.,” “GE,” “Gen. Electric
Company,” “General electric company,” and “G.E. Company.”
• A common problem with numeric variables is the inclusion of non-numeric terms. For
example, a variable generally consisting of numbers may include a value such as “above
50” or “out of range.”
• Another problem arises when observations for a particular variable are missing data
values.
• It can be more challenging to clean variables measured on an interval or ratio scale since
they can take any possible value within a range.
• Outliers are a single or a small number of data values that differ greatly from the rest of
the values.
• For example, an outlier may be an error in the measurement or the result of
measurements made using a different calibration. An outlier may also be a legitimate and
valuable data point.
• Histograms and box plots can be useful in identifying outliers as
previously described.
• A particular variable may have been measured over different units.
For example, a variable Weight may have been measured using both
pounds and kilograms for different observations.
• These should be standardized to a single scale so that they can be
compared during analysis.
• When data is combined from multiple sources, an observation is
more likely to have been recorded more than once. Duplicate entries
should be removed.
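Two of the cleaning steps above, flagging non-numeric entries and standardizing units, can be sketched as follows. The raw values and weights are hypothetical, and `to_number` is an illustrative helper.

```python
def to_number(raw):
    """Return a float for numeric text, or None for entries such as 'above 50'."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None

# Hypothetical raw column that should contain only numbers.
raw_values = ["42", "17.5", "above 50", "out of range", "", "33"]
parsed = [to_number(v) for v in raw_values]
problems = [v for v, p in zip(raw_values, parsed) if p is None]
print(problems)  # ['above 50', 'out of range', '']

# Standardize weights recorded in mixed units to kilograms.
LB_TO_KG = 0.45359237
weights = [(150, "lb"), (70, "kg"), (150, "lb")]
weights_kg = [round(v * LB_TO_KG, 1) if unit == "lb" else float(v)
              for v, unit in weights]
print(weights_kg)  # [68.0, 70.0, 68.0]
```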
Removing observations and variables
• After an initial categorization of the variables, it may be
possible to remove variables from consideration.
• For example, constants and variables with too many missing
data values would be candidates for removal.
• Similarly, it may be necessary to remove observations that
have data missing for a particular variable.
Generating consistent scales across
variables
• Height vs Weight example
• Normalization uses a mathematical function to transform numeric
columns to a new range.
• For example, when analyzing customer credit card data, the Credit limit
value (whose values might range from $500 to $100,000) should not be
given more weight in the analysis than the Customer’s age (whose
values might range from 18 to 100).
Formula: xi′ = ((xi − OriginalMin) / (OriginalMax − OriginalMin)) × (NewMax − NewMin) + NewMin
where xi′ is the new normalized value,
xi is the original variable’s value,
OriginalMin is the minimum possible value in the original variable,
OriginalMax is the maximum possible value in the original variable,
NewMin is the minimum value for the normalized range,
and NewMax is the maximum value for the normalized range.
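Using the quantities defined above, min-max normalization can be sketched as a one-line function. The function name and the sample credit limit and age are illustrative; the ranges come from the credit card example.

```python
def min_max_normalize(x, original_min, original_max, new_min=0.0, new_max=1.0):
    """Rescale x from [original_min, original_max] to [new_min, new_max]."""
    fraction = (x - original_min) / (original_max - original_min)
    return fraction * (new_max - new_min) + new_min

# Credit limit ranges from $500 to $100,000; age ranges from 18 to 100.
credit_limit = min_max_normalize(25_000, 500, 100_000)
age = min_max_normalize(59, 18, 100)

print(round(credit_limit, 3))  # 0.246
print(age)                     # 0.5
```

After normalization both variables lie between 0 and 1, so neither dominates purely because of its units.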
Min-Max
Task: Calculate Mean and Standard deviation (Hint: Calculate Variance)
Decimal
New frequency distribution
• A variable may not conform to a normal frequency distribution;
however, certain data analysis methods may require that the data
follow a normal distribution.
• A frequency distribution can often be brought closer to normal using one of
three transformations: log, exponential, or Box–Cox.
• Converting into a Normal Distribution
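As a rough illustration of the log transformation, the sketch below applies it to the right-skewed income data from the skewness slide and compares a moment-based skewness before and after. The `skewness` helper is an assumption for illustration, and the log transform requires strictly positive values.

```python
from math import log
from statistics import mean

def skewness(values):
    """Moment-based skewness: third central moment / (second moment ** 1.5)."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

incomes = [25, 28, 30, 32, 35, 50, 100, 250]   # in thousands
log_incomes = [log(x) for x in incomes]

# The long right tail is compressed, so the skewness moves toward zero.
print(skewness(log_incomes) < skewness(incomes))  # True
```

The transformed data is still somewhat right-skewed here; a transformation reduces skew but does not guarantee a normal distribution.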
Converting text to numbers
• To use variables that have been assigned as nominal or ordinal and
described using text values within certain numerical analysis methods, it is
necessary to convert the variable’s values into numbers.
• For example, a variable with values “low,” “medium” and “high” may have
“low” replaced by 0, “medium” replaced by 1, and “high” replaced by 2.
• Another way to handle nominal data is to convert each value into a separate
column with values 1 (indicating the presence of the category) and 0
(indicating the absence of the category). These new variables are often
referred to as dummy variables.
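Both conversions can be sketched in plain Python. The sample observations and the `is_` column-name prefix are illustrative assumptions.

```python
# Ordinal values: replace each label with a number that preserves the order.
ORDINAL_CODES = {"low": 0, "medium": 1, "high": 2}
levels = ["low", "high", "medium", "low"]
encoded = [ORDINAL_CODES[v] for v in levels]
print(encoded)  # [0, 2, 1, 0]

# Nominal values: one dummy (0/1) column per category.
industries = ["financial", "retail", "engineering", "retail"]
categories = sorted(set(industries))
dummies = [{f"is_{c}": int(v == c) for c in categories} for v in industries]

for row in dummies:
    print(row)
```

Each observation has exactly one dummy variable set to 1, so no artificial ordering is imposed on the nominal categories.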
Converting continuous data to categories
• First, where a value is defined on an interval or ratio scale but when
knowledge about how the data was collected suggests the accuracy of
the data does not warrant these scales, a variable may be a candidate
for conversion to a categorical variable that reflects the true variation
in the data.
• Second, because certain techniques can only process categorical data,
converting continuous data into discrete values makes a numeric
variable accessible to these methods.
• For example, a continuous variable credit score may be divided into
four categories: poor, average, good, and excellent; or a variable.
• This process can also be applied to nominal variables, especially in
situations where there are a large number of values for a given nominal
variable.
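Converting a continuous credit score into four categories can be sketched with `bisect`. The cut-off values below are hypothetical and are not taken from the notes; in practice they would come from domain knowledge.

```python
from bisect import bisect_right

# Hypothetical cut-offs: below 580 poor, 580-669 average, 670-739 good, 740+ excellent.
CUTOFFS = [580, 670, 740]
LABELS = ["poor", "average", "good", "excellent"]

def categorize(score):
    """Map a continuous credit score to one of four ordered categories."""
    return LABELS[bisect_right(CUTOFFS, score)]

scores = [500, 580, 700, 800]
print([categorize(s) for s in scores])
# ['poor', 'average', 'good', 'excellent']
```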
Combining variables
• The variable that you are trying to use may not be
present in the dataset but it may be derived from existing
variables.
• Mathematical operations, such as average or sum, could
be applied to one or more variables in order to create an
additional variable.
Generating groups
• Generally, larger data sets take more computational time to analyze
and creating subsets from the data can speed up the analysis.
• One approach is to take a random subset which is effective where the
data set closely matches the target population.
• Another reason is that when building predictive models from a data
set, it is important to keep the models as simple as possible. Breaking
the data set down into subsets based on your knowledge of the data
may allow you to create several simpler models.
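Taking a random subset can be sketched with the standard `random` module. The 1,000 row identifiers are a stand-in for a real data set, and the fixed seed is only there to make the sketch reproducible.

```python
import random

observations = list(range(1, 1001))   # stand-in for 1,000 row identifiers

rng = random.Random(42)               # fixed seed for a reproducible subset
subset = rng.sample(observations, k=100)   # sampling without replacement

print(len(subset))  # 100
```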
Preparing unstructured data
• In many disciplines, the focus of a data analysis or data mining
project is not a simple data table of observations and variables.
• For example, in the life sciences, the focus of the analysis is genes,
proteins, biological pathways, and chemical structures.
• For example, when analyzing a data set of chemicals, an initial step is
to generate variables based on the composition of the chemical such
as its molecular weight or the presence or absence of molecular
components.
Importance of understanding relationships between
variables
•🔍 Reveals Patterns: Helps identify how variables move together (e.g., income & education).
•📊 Supports Decision-Making: Informs business, healthcare, finance, and policy choices.
•🤝 Connects Insights: Turns raw data into meaningful relationships and trends.
•⚠️ Avoids Misinterpretation: Reminds us that association isn’t causation.
•🧠 Enhances Models: Relationships power regression, classification, and prediction models.
•📈 Visual Discovery: Graphs and plots help spot trends that are not otherwise obvious.
Data Visualization Considerations
Ensure that the dataset is complete and relevant. This enables the
data scientist to discover meaningful patterns and apply them
effectively in the right context.
Use appropriate graphical representations to clearly convey the
intended message.
Use efficient visualization techniques that highlight all the data points.
Data Visualization Factors
Python Data Visualization Libraries
Visualizing relationships between variables
1. Scatterplots
2. Summary Tables and Charts
Scatterplots
•A type of graph that shows the relationship between two continuous variables
•Each point represents one observation
•The x-axis and y-axis represent different variables (measured on ratio/interval
scales)
• How to Construct a Scatter Plot?
Step 1: Identify the independent and dependent variables
Step 2: Plot the independent variable on the x-axis
Step 3: Plot the dependent variable on the y-axis
Step 4: Extract the meaningful relationship
Scatterplots Reveal
Whether a relationship exists between two variables
The direction of the relationship:
•Positive: Both variables increase together (e.g., sepal length vs. petal length)
•Negative: One increases while the other decreases
The nature of the relationship:
•Linear: Forms a straight trend line
•Non-linear: Forms a curved trend line
•Positive Linear: 🔼 Both variables increase together (straight upward trend)
•Negative Linear: 🔽 One variable increases while the other decreases (straight downward trend)
•Positive Non-linear: 📈 Both increase but not proportionally (e.g., logarithmic curve)
•Negative Non-linear: 📉 One increases while the other decreases nonlinearly (e.g., decaying curve)
A scatterplot can also show whether there are points that do not
follow the overall relationship. These are referred to as
outliers.
Scatterplots can also show the lack of any relationship. In
Figure 4.5, the points scattered throughout the graph
indicate that there is no obvious relationship between
Alcohol and Nonflavonoid phenols in this data set.
Applications of Scatter Plot
• Correlation Analysis: Scatter plot is useful in the investigation of the
correlation between two different variables. It can be used to find out
whether two variables have a positive correlation, negative correlation
or no correlation.
• Outlier Detection: Outliers are data points that differ markedly from
the rest of the data set. A scatter plot makes these outliers easy to
spot.
• Cluster Identification: In some cases, scatter plots can help identify
clusters or groups within the data.
Example
import matplotlib.pyplot as plt
import numpy as np
# Data
matches_played = [2, 5, 7, 1, 12, 15, 18]
goals_scored = [1, 4, 5, 2, 7, 12, 11]
# Create scatter plot
plt.scatter(matches_played, goals_scored,
            color='blue', label='Data Points')
# Fit and plot the trend line
z = np.polyfit(matches_played, goals_scored, 1)  # Linear fit (slope, intercept)
p = np.poly1d(z)
plt.plot(matches_played, p(matches_played),
         color='steelblue', label='Trend Line')
# Add labels and title
plt.title('Scatter Chart')
plt.xlabel('Matches Played')
plt.ylabel('Goals Scored')
plt.ylim(0, 14)  # Match visual range
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
2. Summary Tables and Charts
•A simple way to understand relationships between variables
•Especially useful when one variable is categorical or discrete
•Rows = Groups (categories or bins of continuous variables)
•Columns = Summary statistics of the second variable
•Examples of stats used: Mean, Median, Min, Max, Std Dev, Count
{Example of a summary table}
•Petal width increases across classes from Setosa to Virginica.
•Iris-setosa shows the smallest petal width.
•Iris-virginica has the largest average petal width.
•Summary statistics help us compare across discrete categories.
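A summary table of this kind can be built by grouping observations on the categorical variable and computing statistics per group. The sketch below uses only the standard library. The petal-width values are made up for illustration and are not the figures from the slide's table. With pandas, `DataFrame.groupby` followed by `agg` is the usual route.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up (class, petal width in cm) pairs, for illustration only.
observations = [
    ("Iris-setosa", 0.2), ("Iris-setosa", 0.3), ("Iris-setosa", 0.2),
    ("Iris-versicolor", 1.3), ("Iris-versicolor", 1.5), ("Iris-versicolor", 1.4),
    ("Iris-virginica", 2.0), ("Iris-virginica", 2.3), ("Iris-virginica", 1.9),
]

# Rows of the summary table = groups (one per class).
groups = defaultdict(list)
for label, width in observations:
    groups[label].append(width)

# Columns of the summary table = statistics of the second variable.
summary = {
    label: {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }
    for label, values in groups.items()
}

for label, stats in summary.items():
    print(label, stats)
```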
Charts
• Line graph
• Bar graph
• Stacked Bar graphs
• Histograms
• Pie charts
• Box charts
• Bubble Graph
• Dials
• Geographical Data maps
• Pictographs
Line graph
• The most basic and popular way of displaying information.
• Shows data as a series of points connected by straight line segments.
• If mining with time-series data, time is usually shown on the x-axis. Multiple variables can be represented on the same scale on the y-axis to compare the line graphs of all the variables.
PC: https://www.edelweiss.in/ewwebimages/WebImages/Learner/Line_Chart_Stocks~b2869c5e-d36c-4bdb-80e1-07d4805e70e0.jpg
Bar graph
• A bar graph shows thin colorful rectangular bars with their lengths being proportional to the values represented.
• The bars can be plotted vertically or horizontally.
• Bar graphs use a lot more ink than line graphs and should be used when line graphs are inadequate.
PC: https://images.twinkl.co.uk/tw1n/image/private/t_630/u/ux/barchart_ver_1.jpg
Box charts
• These are a special form of
chart that shows the
distribution of variables.
• The box shows the
middle half of the values,
while whiskers on both
sides extend to the
extreme values in either
direction.
Bar Charts
•Display mean or sum of a variable for each group
•Easy comparison across categories
Box Plots
•Show data distribution within each category
•Display median, quartiles, and outliers
•Help spot overlap or distinct separation between groups
Stacked Bar graphs
• These are a particular method of doing bar graphs. Values of multiple variables are stacked one on top of the other to tell an interesting story.
• Bars can also be normalized such that the total height of every bar is equal, so the chart shows the relative composition of each bar.
PC: https://chartio.com/learn/charts/stacked-bar-chart-complete-guide/
Histograms
• These are like bar graphs, except that they are useful in showing data frequencies or data values on classes (or ranges) of a numerical variable.
PC: https://statistics.laerd.com/statistical-guides/img/uh/laerd-statistics-example-histogram-frequencies-for-age.png
Pie charts
• These are very
popular to show the
distribution of a
variable, such as
sales by region. The
size of a slice is
representative of the
relative strengths of
each value.
Bubble Graph
• This is an interesting way
of displaying multiple
dimensions in one chart.
• It is a variant of a scatter
plot with many data
points marked on two
dimensions.
• Imagine that each data point on the graph is a bubble (or a circle); the size of the circle and its color fill could represent two additional dimensions.
Dials
• These are charts like the speed dial in a car, which show whether the variable value (such as a sales number) is in the low range, medium range, or high range.
• These ranges could be colored red, yellow, and green to give an instant view of the data.
Pictographs
• One can use pictures to
represent data, where
images are used to
show the product for
easy reference.
• A survey was conducted
for 40 children by a fast
food junction to
understand the demand
for different flavors of
pizza available in their
outlet.
Case Study: Summarizing
Ordinal and Continuous Data
Exploratory Analysis of Vehicle Weights & Diagnostic
Testing
Background
• This case study examines two applied settings:
• Vehicle weight vs. fuel efficiency (MPG
categories)
• Blood-test values vs. infection status
Data & Methods
Case study 1: Vehicle dataset: Cars are binned into three ordinal MPG categories
(0–20, 20–30, 30–50 MPG); for each group, we compute the count
and mean vehicle weight and visualize the results.
Case study 2: Clinical trial dataset: A novel blood test produces a continuous score
(–5.0 to +1.0).
Patients are labeled as infection = 0 (no infection) or infection = 1 (infection).
We summarize test values by group, bin continuous values into ranges,
and construct contingency tables.
•Visualizations: Bar charts, box plots, histograms, and contingency-table
heatmaps to illustrate distributional shifts and classification characteristics.
Case study 1: Summary tables for Ordinal variable and other Variable
Figure: Example of summary table where the categorical variable is ordinal
Three ordered categories are used to group the observations.
It is possible to see how the mean weight changes as the MPG category increases.
It is clear from this table that as the MPG categories increase, the mean weight decreases.
The same information can be seen as a histogram and a series of box plots.
By ordering the categories on the x-axis and plotting the information, the trend becomes visible.
Case study 1: Bar & Box-Plot of Mean Weight by MPG
Bar chart highlights stepwise decrease in mean weight across MPG bins.
Box plots reveal variability and outliers within each category.
Overlap indicates some medium-MPG cars match weights of low-MPG models.
Case study 2: Box Plot & Summary Table by Infection Status
Positive group mean = -1.09; negative group mean = -2.18.
Box plots show central tendency shift but considerable overlap.
Overlap suggests standalone test may misclassify without threshold tuning.
Binned Blood Test Ranges vs Mean Infection Rate
Low ranges (-5 to -2): mean infection ~0 → reliable negatives.
High ranges (0 to +2): mean infection ~1 → reliable positives.
Mid-range (-2 to -1): mean ~0.55 indicates diagnostic grey zone.
CALCULATING METRICS ABOUT RELATIONSHIPS
Metrics are methods for quantifying the strength of the relationship
between two variables.
The choice of metric usually depends on the types of variables being
compared, such as categorical versus continuous variables.
1. Correlation Coefficients
2. Kendall Tau
3. t-Tests Comparing Two Groups
4. Chi-Square
5. ANOVA
Group each metric by the types of variables it compares
Variable Types | Metric(s) | When to Use
Continuous vs. Continuous | Correlation | To quantify linear association between two continuous measures (e.g., height vs. weight)
Ordinal vs. Ordinal | Kendall’s Tau | To assess monotonic association between two ranked/ordinal variables
Continuous vs. Categorical (2 levels) | Independent‑samples t‑test | To compare a continuous outcome across two groups (e.g., mean weight in men vs. women)
Continuous vs. Categorical (>2 levels) | One‑way ANOVA | To compare a continuous outcome across three or more groups (e.g., mean weight by diet type)
Categorical vs. Categorical | Chi‑square test of independence | To test whether two categorical variables are related (e.g., smoking status × disease status)
1. Correlation Coefficient
Quantifies linear association between two continuous variables.
Values range from –1 (perfect negative) to +1 (perfect positive).
Computed as covariance divided by product of standard deviations.
Example r values: 0.83 (strong), 0.59 (moderate).
Example: Raw Data & Initial Scatterplot
• 18 paired observations of x and y.
• Mean x = 106.94; mean y = 6.41.
• Scatterplot shows a clear positive trend.
Detailed Calculation of r
Sum Σ(xi−x̄)(yi−ȳ) = 1,357.06.
Denominator: (n − 1) × sx × sy = 17 × 47.28 × 1.86.
Calculated r = 1,357.06 / (17×47.28×1.86) = 0.91.
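The calculation above can be sketched in plain Python. The five-point data set is hypothetical, since the slide's 18 raw observations are not reproduced in the text. The last line only re-checks the slide's arithmetic from its summary values.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r = sum of cross-deviations / ((n - 1) * s_x * s_y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sample standard deviations (n - 1 in the denominator).
    s_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_small = pearson_r(x, y)

# Re-check of the slide's arithmetic using its summary values.
r_slide = 1357.06 / (17 * 47.28 * 1.86)

print(round(r_small, 4), round(r_slide, 2))
```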
Interpretation & Next Steps
r = 0.91 indicates a very strong positive linear relationship.
Assess predictive utility and linear regression modeling.
Check assumptions: linearity, normality, homoscedasticity.
Perform significance tests and compute confidence intervals.
2. Kendall’s Tau Metric
Measures the association between two ordinal (ranked) variables.
Ranking can be derived by ordering the values and then replacing
the actual values with a rank from 1 to n (where n is the number of
observations in the data set).
Based on counting concordant and discordant pairs.
Value ranges from −1 (perfect disagreement) to +1 (perfect
agreement).
Concordant Pair
• When the ranks/differences of both variables move in the same direction.
• Both differences are positive ⇒ Concordant
• If you flip the subtraction you get two negatives,
but still “same direction.”
Discordant Pair
• When the ranks/differences of the two variables
move in opposite directions.
• One positive, one negative ⇒ Discordant
Example
• To illustrate Kendall Tau, consider two variables, Variable X and Variable Y, each containing a ranking of 10 observations (A through J).
• Total concordant (n_c) and discordant (n_d) pairs are counted over all combinations.
Computed Concordant and Discordant Pairs
• Using These Counts in Tau A
What τₐ ≈ 0.73 Reveals
•Positive Association: Since τₐ is positive, higher ranks in X tend to correspond to
higher ranks in Y.
•Strength of Monotonic Relationship:
•Values of τ between 0.7 and 0.9 are generally considered strong.
•Here, τₐ ≈ 0.73 indicates a substantial agreement between the two rank orderings.
•Interpretation in Context:
•More than six times as many concordant pairs as discordant pairs (39 vs. 6).
•The ranked variables move together in the same direction in the majority of pairwise
comparisons.
Adjusting for Ties with
Kendall’s Tau B
•Problem: In real‐world data, you often get “ties”, pairs where either Xi=Xj or Yi=Yj.
•Impact on Tau A: Tau A ignores those tied pairs, which can bias the coefficient
when there are many ties.
•Solution: Tau B incorporates tie counts and rescales the denominator for a fairer
measure of association.
•Tau B Formula: τB = (nc − nd) / √((nc + nd + tx) × (nc + nd + ty))
•nc: number of concordant pairs
•nd: number of discordant pairs
•tx: number of pairs tied on X (but not on Y)
•ty: number of pairs tied on Y (but not on X)
Example with Ties
Obs X Y
A 1 1
B 2 1
C 2 2
D 3 3
E 4 3
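Compute Tau A vs. Tau B
•Counting the 10 pairs in the table above: nc = 7, nd = 0, tx = 1, ty = 2.
•τA = (7 − 0) / 10 = 0.70; τB = 7 / √(8 × 9) ≈ 0.82.
Interpretation
•τB > τA because ties reduce the “effective” number of comparable
pairs; Tau B corrects for that.
•Use Tau B whenever your data contains tied ranks, common in
ratings, Likert scales, or binned data.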
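The pair counting behind Tau A and Tau B can be sketched as below, using the five observations (A through E) from the ties example. The function names are illustrative. The Tau B denominator follows the definitions of nc, nd, tx and ty given above.

```python
from itertools import combinations
from math import sqrt

def kendall_counts(x, y):
    """Count concordant, discordant, X-only-tied and Y-only-tied pairs."""
    nc = nd = tx = ty = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied on both variables: not counted
        elif dx == 0:
            tx += 1             # tied on X but not on Y
        elif dy == 0:
            ty += 1             # tied on Y but not on X
        elif dx * dy > 0:
            nc += 1             # both differences have the same sign
        else:
            nd += 1             # differences have opposite signs
    return nc, nd, tx, ty

def tau_a(x, y):
    nc, nd, _, _ = kendall_counts(x, y)
    n = len(x)
    return (nc - nd) / (n * (n - 1) / 2)

def tau_b(x, y):
    nc, nd, tx, ty = kendall_counts(x, y)
    return (nc - nd) / sqrt((nc + nd + tx) * (nc + nd + ty))

# Observations A to E from the ties example.
X = [1, 2, 2, 3, 4]
Y = [1, 1, 2, 3, 3]
counts = kendall_counts(X, Y)
print(counts, round(tau_a(X, Y), 3), round(tau_b(X, Y), 3))
```

Checking every pair takes on the order of n² comparisons, which is fine for small ranked data sets like this one.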
3. t-Tests Comparing Two Groups
•Objective: Test whether two independent groups have significantly
different means.
•Applications: Call‑center performance, medical treatment vs.
control, A/B testing, etc.
When to apply a t-test?
•Null hypothesis (H0): μ1=μ2 (no difference in population means)
•Alternative (HA): μ1≠μ2(means differ)
•Preconditions:
•Independent samples
•Approximate normality of each group’s data
•Either equal variances (pooled test) or allow unequal variances (Welch’s test)
Equal‑Variance (Pooled) t‑Test
Degrees of Freedom
(Equal‑Variance Test)
df=n1+n2−2
•Use in t‑distribution to obtain critical t or p‑value.
•Example: n1=8, n2=8 ⇒df=14.
Unequal‑Variance (Welch’s) t‑Test
•Welch’s adjustment compensates for
unequal variances.
•Often more robust in real‑world data.
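Both variants can be sketched with the standard library. The call counts below are hypothetical and are not the slide's call-center data. The pooled version assumes equal variances and uses df = n1 + n2 − 2. The Welch version uses the Welch–Satterthwaite approximation for the degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Equal-variance t statistic and its degrees of freedom."""
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def welch_t(a, b):
    """Unequal-variance t statistic with Welch-Satterthwaite df."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical daily call counts for two centers (8 days each).
center_a = [140, 152, 147, 138, 149, 151, 144, 146]
center_b = [131, 139, 135, 137, 128, 142, 136, 134]

t_pooled, df_pooled = pooled_t(center_a, center_b)
t_welch, df_welch = welch_t(center_a, center_b)
print(round(t_pooled, 3), df_pooled, round(t_welch, 3), round(df_welch, 2))
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ. The resulting t is then compared with a critical value from the t-distribution, or converted to a p-value.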
Step‑By‑Step Calculation
• Example: Compare, for instance, Center A vs. Center B.
Interpreting the t‑Value
Hypothesis & Test Choice
Test Statistic
Results & Inference
4. Chi‑Square Test
•Helps assess relationships between two categorical variables
•Non‑parametric test for two‑way tables (contingency tables)
•Evaluates whether two nominal/ordinal variables are independent or associated
•Widely used in market research, epidemiology, social sciences
Hypotheses
•Null (H₀): No association; the two variables are independent
•Alternative (H₁): There is an association; variables are related
Example Data
•Table structure: Columns(c) = ZIP codes (e.g. 43221, 43026, 43212)
•Rows(r) = Brands (X, Y, Z)
•Observed counts (Oₖ): e.g. 5,521 customers in 43221 bought X; 4,597
bought Y; etc.
Calculating Expected Frequencies
The Chi-Square test compares the observed frequencies with the
expected frequencies.
The expected frequencies are calculated using the following formula:
Er,c = (r × c) / n
• where Er,c is the expected frequency for a
particular cell in a contingency table, r is the row
count (row total), c is the column count (column total) and n is the total
number of observations in the sample.
Example: The expected frequency for the table cell where
the washing powder is brand X and the zip code is 43221 would be
(14,467 × 14,760) / 43,377 ≈ 4,923.
Computing the χ² Statistic
χ² = Σ (Oi − Ei)² / Ei, summed over i = 1 … k
• k is the number of all categories (cells)
• Oi is the observed cell frequency
• Ei is the expected cell frequency.
Sum over all 9 cells (3 ZIP × 3 Brands)
ZIP Code | Brand | O | E | O−E | (O−E)²/E
43221 | X | 5,521 | 4,923 | +598 | (598)²/4,923 ≈ 72.6
43221 | Y | 4,597 | 4,913 | –316 | (–316)²/4,913 ≈ 20.3
43221 | Z | 4,642 | 4,925 | –283 | (–283)²/4,925 ≈ 16.3
43026 | X | 4,522 | 4,764 | –242 | (–242)²/4,764 ≈ 12.3
43026 | Y | 4,716 | 4,754 | –38 | (–38)²/4,754 ≈ 0.3
43026 | Z | 5,047 | 4,766 | +281 | (281)²/4,766 ≈ 16.6
43212 | X | 4,424 | 4,780 | –356 | (–356)²/4,780 ≈ 26.5
43212 | Y | 5,124 | 4,770 | +354 | (354)²/4,770 ≈ 26.3
43212 | Z | 4,784 | 4,782 | +2 | (2)²/4,782 ≈ 0.0008
Total | | | | | 191.2
Degrees of Freedom & Critical Value
df=(r−1)×(c−1)=(3−1)×(3−1)=4
Interpretation & Conclusion
•Since 191.2 ≫ 9.488 (the critical value for df = 4 at α = 0.05), reject H₀
•Inference: A significant relationship exists between ZIP code
and powder brand
•Consumers’ brand choices depend on their ZIP code
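The whole test can be reproduced from the nine observed counts. The sketch below uses unrounded expected frequencies, so the statistic may differ from the slide's 191.2 in the second decimal place. The 9.488 threshold is the df = 4, α = 0.05 critical value used in the conclusion above.

```python
# Observed counts: rows = brands X, Y, Z; columns = ZIP 43221, 43026, 43212.
observed = [
    [5521, 4522, 4424],  # brand X
    [4597, 4716, 5124],  # brand Y
    [4642, 5047, 4784],  # brand Z
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency for each cell: (row total * column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_005_DF4 = 9.488  # chi-square critical value, alpha = 0.05, df = 4

print(round(chi2, 1), df, chi2 > CRITICAL_005_DF4)
```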
5. ANOVA
• One‑Way ANOVA: Comparing Multiple Group Means
• Test if three or more independent groups have the same population mean.
• Example: Daily call volumes at four call centers (A, B, C, D).
Hypotheses
•Null (H₀): μ₁ = μ₂ = μ₃ = μ₄ (all call‑center means equal)
•Alternative (H₁): At least one mean differs
Data Overview (Table 4.5)
•Groups (k = 4): Centers A–D
•Observations (N = 29):
• A: 8 days
• B: 7 days
• C: 8 days
• D: 6 days
•Daily calls: values ranging ~124–157
Step 1: Compute Group Means & Variances
Center | nᵢ | x̄ᵢ | sᵢ²
A | 8 | 139.1 | 16.4
B | 7 | 129.9 | 11.8
C | 8 | 142.4 | 8.6
D | 6 | 153.7 | 9.5
Overall mean x̄ = 140.8 (sum of all 29 values divided by 29)
Step 2: Determine the within-group variation
The variation within groups is defined as the within-group variance or mean square
within (MSW).
To calculate this value, we use a weighted sum of the variance for the individual groups.
The weights are based on the number of observations in each group.
This sum is divided by the number of degrees of freedom calculated by subtracting the
number of groups (k) from the total number of observations (N):
MSW = Σ (nᵢ − 1) sᵢ² / (N − k)
Reflects average variance within each center.
Step 3: Determine the between-group variation
The between-group variation or mean square between (MSB) is calculated next.
The MSB is the variance between the group means.
It is calculated using a weighted sum of the squared difference between the
group mean (x̄ᵢ) and the average of all observations (x̄).
This sum is divided by the number of degrees of freedom, which is calculated by
subtracting one from the number of groups (k). The following formula is
used to calculate the MSB:
MSB = Σ nᵢ (x̄ᵢ − x̄)² / (k − 1)
Measures how far each center’s mean is from the overall mean.
Step 4: Determine the F-statistic
• The F-statistic is the ratio of the MSB and
the MSW: F = MSB / MSW
• A large F suggests between‑group variability ≫ within‑group
variability
Step 5: Test the significance of the F-statistic
• The F-statistic is compared with a critical value from the F-distribution with
k − 1 and N − k degrees of freedom, where N is the total number of observations in
all groups and k is the number of groups.
Inference: Since the calculated F-statistic is greater than the critical value, we reject
the null hypothesis. The means for the different call centers are not equal.
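Steps 1 to 4 can be reproduced from the summary table alone. The sketch below works from the rounded group sizes, means, and variances shown above, so its F value is approximate and may differ slightly from one computed on the raw call counts. The (nᵢ − 1) weighting in MSW is the standard pooled-variance form and is assumed here.

```python
# Summary statistics from the Step 1 table (centers A, B, C, D).
sizes = [8, 7, 8, 6]
means = [139.1, 129.9, 142.4, 153.7]
variances = [16.4, 11.8, 8.6, 9.5]

k = len(sizes)   # number of groups
N = sum(sizes)   # total number of observations

# Overall mean as the size-weighted average of the group means.
grand_mean = sum(n_i * m_i for n_i, m_i in zip(sizes, means)) / N

# Mean square within: pooled variance, assuming (n_i - 1) weights.
msw = sum((n_i - 1) * v_i for n_i, v_i in zip(sizes, variances)) / (N - k)

# Mean square between: weighted squared distance of group means
# from the overall mean, divided by k - 1.
msb = sum(n_i * (m_i - grand_mean) ** 2 for n_i, m_i in zip(sizes, means)) / (k - 1)

f_stat = msb / msw
print(round(grand_mean, 1), round(msw, 2), round(msb, 1), round(f_stat, 1))
```

The F value is then compared with the critical value of the F-distribution for (k − 1, N − k) = (3, 25) degrees of freedom.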