Statistics For Machine Learning Part 01
Foundations:
Descriptive Statistics:
This list is not exhaustive, but it covers the core concepts that form the foundation of
statistical analysis. The specific topics you delve deeper into will depend on your field
of study and the types of data you work with.
2. How Statistics Paves the Way for Machine Learning?
Imagine a treasure trove of information, but without a map to navigate it. That's where
statistics comes in! It acts as the compass and key, helping us unlock the secrets
hidden within data. But how does it connect to the marvels of machine learning (ML)?
Let's delve into this fascinating partnership.
Machine learning algorithms are data-hungry beasts. They require vast amounts of
high-quality data to learn and improve. Statistics plays a vital role in this process.
Statistics and machine learning are like peanut butter and jelly. Together, they create
something far greater than the sum of their parts. Statistics provides the foundation for
understanding data, while machine learning leverages that understanding to build
powerful models. By acknowledging the limitations of both, we can create a robust
and insightful journey through the world of data.
Real-World Example: In clinical trials for new drugs, statistical methods are used to
determine the effectiveness and safety of the drug by analysing data from test subjects.
Statistics and economics are two sides of the same coin: statistical analysis empowers
economic decision-making. Statistics are also crucial for optimizing energy production,
distribution, and consumption.
MEAN (ARITHMETIC)
The mean (or average) is the most popular and well-known measure of central
tendency. It can be used with both discrete and continuous data, although its use is
most often with continuous data (see our Types of Variable guide for data types).
The mean is equal to the sum of all the values in the data set divided by the number
of values in the data set. So, if we have n values in a data set and they have values
x1, x2, x3, ..., xn, the sample mean, usually denoted by $\bar{x}$ (pronounced "x bar"), is:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

This formula is usually written in a slightly different manner using the Greek capital
letter Σ, pronounced "sigma", which means "sum of...":

$$\bar{x} = \frac{\sum x_i}{n}$$
You may have noticed that the above formula refers to the sample mean. So, why
have we called it a sample mean? This is because, in statistics, samples and
populations have very different meanings and these differences are very important,
even if, in the case of the mean, they are calculated in the same way. To
acknowledge that we are calculating the population mean and not the sample mean,
we use the Greek lowercase letter "mu", denoted as µ:

$$\mu = \frac{\sum x_i}{N}$$

where N is the number of values in the population.
The mean is essentially a model of your data set: a single value that summarizes it.
You will notice, however, that the mean is often not one of the actual values that you
have observed in your data set. However, one of its important properties is that it
minimizes error in the prediction of any one value in your data set. That is, it is the
value that produces the lowest total squared error across all the values in the data set.
An important property of the mean is that it includes every value in your data set as
part of the calculation. In addition, the mean is the only measure of central tendency
where the sum of the deviations of each value from the mean is always zero.
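As a minimal sketch in plain Python (the values are illustrative), the following computes a sample mean and checks the zero-sum-of-deviations property just described:

```python
# Illustrative data set
data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

n = len(data)
sample_mean = sum(data) / n   # x-bar = (x1 + x2 + ... + xn) / n

# The deviations from the mean always sum to zero (up to floating-point error).
deviations = [x - sample_mean for x in data]
print(f"Sample mean: {sample_mean}")            # 55.7
print(f"Sum of deviations: {sum(deviations)}")  # ~0.0
```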
MEDIAN
The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data. In order to
calculate the median, suppose we have the data below:

65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92

We first need to rearrange that data into order of magnitude (smallest first):

14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92

Our median mark is the middle mark, in this case 56. It is
the middle mark because there are 5 scores before it and 5 scores after it. This works
fine when you have an odd number of scores, but what happens when you have an
even number of scores? What if you had only 10 scores? Well, you simply have to
take the middle two scores and average the result. So, if we look at the example
below:

14, 35, 45, 55, 55, 56, 56, 65, 87, 89
Only now we have to take the 5th and 6th score in our data set and average them to
get a median of 55.5.
Example of Median:
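As a minimal sketch of the calculation, assuming Python's standard statistics module (the marks are illustrative and chosen to match the medians discussed above):

```python
import statistics

# Odd number of scores (11): the median is the single middle value.
marks_odd = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
print(statistics.median(marks_odd))    # 56

# Even number of scores (10): the median is the average of the
# 5th and 6th sorted values, (55 + 56) / 2.
marks_even = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]
print(statistics.median(marks_even))   # 55.5
```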
MODE
The mode is particularly problematic with continuous data because it is likely that no
value will be more frequent than any other. For example, consider a data set consisting
of the weights of 30 people. How likely is it that two or more people in that sample have
exactly the same weight (e.g., 55.4 kg)? The answer is that it is highly unlikely. Many
people might be close, but finding two people with exactly the same weight (to the
nearest 0.1 kg) is improbable with such a small sample (30 people) and a large range of
possible weights. This is why the mode is very rarely used with continuous data.
Other Limitations of Using Mode
One of the major limitations of the mode is that it is not unique, which causes problems
when two or more values share the highest frequency. For example, in the data set
3, 5, 5, 7, 9, 9, both 5 and 9 occur twice, so the data set has two modes (it is bimodal).
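A quick illustration of this non-uniqueness, assuming Python 3.8+ where statistics.multimode is available:

```python
import statistics

# A bimodal data set: both 5 and 9 appear twice.
values = [3, 5, 5, 7, 9, 9]

# multimode returns every value tied for the highest frequency,
# showing that the mode need not be unique.
print(statistics.multimode(values))   # [5, 9]
```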
The table below will help you choose the best measure of central tendency for
different types of variables.

Type of Variable                 Best Measure of Central Tendency
Nominal                          Mode
Ordinal                          Median
Interval/Ratio (not skewed)      Mean
There are several approaches you can take when dealing with a dataset missing more
than 30% of its values, depending on the specific situation and your goals for the
analysis. Here are some options to consider:
1. Removal:
This involves dropping the rows (or entire columns) that contain missing values. It
is the simplest option, but when more than 30% of the values are missing it can
discard a large share of your data and bias the results if the values are not missing
at random.
2. Imputation:
This involves estimating the missing values based on the available data. There
are various imputation techniques, each with its own advantages and
limitations (a short code sketch of these options appears after this list):
o Mean/Median/Mode Imputation: Replace missing values with the
average (mean), middle value (median), or most frequent value (mode)
of the existing data in that column. This is a simple method but might
not be suitable for skewed data or if the missing values are not randomly
distributed.
o Interpolation: Estimate missing values based on surrounding data
points. This can be linear interpolation (connecting adjacent values with
a straight line) or more complex methods depending on the data.
o Model-based Imputation: Use statistical models like regression
analysis to predict missing values based on the relationships between
variables. This can be effective but requires careful model selection and
validation.
3. Dimensionality Reduction:
If you have a large number of features (columns) and many missing values,
consider reducing the dimensionality of your data. Techniques like Principal
Component Analysis (PCA) can help identify underlying patterns and create
new features with less missing data.
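As referenced above, here is a minimal sketch of the removal and simple imputation options, assuming pandas is available; the dataframe and column names are purely illustrative:

```python
import pandas as pd

# Hypothetical dataset with missing values (column names are illustrative).
df = pd.DataFrame({
    "age":    [25, None, 31, 40, None, 22],
    "income": [30000, 42000, None, 52000, 61000, None],
})

# 1. Removal: drop rows that contain any missing value.
dropped = df.dropna()

# 2. Imputation: replace missing values with the column mean or median.
mean_imputed = df.fillna(df.mean(numeric_only=True))
median_imputed = df.fillna(df.median(numeric_only=True))

# Interpolation: estimate missing values from neighbouring points.
interpolated = df.interpolate(method="linear")

print(mean_imputed)
```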
It's important to be transparent about the chosen approach and the potential
limitations it introduces when presenting your analysis.
Ultimately, the best approach depends on the specific dataset and your research
question.
Outliers and skewed data often push the mean to the extreme, making the
median a better indicator of where most values cluster in the dataset.
Suppose we’re analysing the income of a group of people. Here’s why the median might be a
better measure:
1. Skewed Data: Income distributions are often skewed, with a few extremely high
earners. If we calculate the mean income, these outliers significantly impact the result.
The mean gets pulled toward the higher values, making it less representative of the
typical person’s income.
2. Robustness: The median is a robust statistic. It’s not affected by extreme values
(outliers) as much as the mean. Even if a few people have exceptionally high or low
incomes, the median remains relatively stable.
3. Interpretability: The median represents the middle value when data is sorted. For
income, it’s the income level at which half the population earns more and half earns
less. This intuitively captures the “typical” income.
Example Calculation:
Incomes (in thousands of dollars): 20, 25, 30, 40, 50, 1000
Mean income = (20 + 25 + 30 + 40 + 50 + 1000) / 6 ≈ 194.17, i.e. about $194,170.
Median income = average of the two middle values, (30 + 40) / 2 = 35, i.e. $35,000.
In this case, the median ($35,000) better represents the typical income than the mean
(about $194,170). It is far less affected by the single outlier ($1,000,000).
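The same comparison as a minimal sketch, using Python's standard statistics module (incomes in thousands of dollars):

```python
import statistics

incomes = [20, 25, 30, 40, 50, 1000]   # in thousands of dollars

print(statistics.mean(incomes))    # ~194.17 -> about $194,170
print(statistics.median(incomes))  # 35.0    -> $35,000
```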
Remember, context matters, and choosing the right measure depends on the specific
characteristics of the data!
7. What is the difference between Descriptive and Inferential
Statistics?
Focus:
Descriptive Statistics: Describe the visible characteristics of a dataset (a
population or a sample).
Inferential Statistics: Estimate the likelihood of future event occurrence.
Data Used:
Descriptive Statistics: Summarize the data that have actually been collected.
Inferential Statistics: Use a sample to draw conclusions about a larger population.
Goal:
Descriptive Statistics: Describe and summarize what the data show.
Inferential Statistics: Generalize beyond the data and make predictions.
Techniques:
Descriptive Statistics: Measures of central tendency and variability, tables, and
charts.
Inferential Statistics: Hypothesis tests, confidence intervals, and regression.
Certainty:
Descriptive Statistics: Offer a high degree of certainty about the data being
analysed, since they are based on the entire dataset (ideally) or a representative
sample.
Inferential Statistics: Results have a margin of error due to using samples; there
is always some level of uncertainty when making inferences about a larger
population.
Application:
Descriptive Statistics: Reporting what the collected data show, for example
summarizing survey responses.
Inferential Statistics: Drawing conclusions beyond the data at hand, for example
predicting an election outcome from a poll.
In essence, Descriptive Statistics paints a picture of the data itself, while Inferential
Statistics uses that picture to make educated guesses about the bigger picture - the
entire population.
Partition values, or fractiles, such as quartiles, deciles, and percentiles, are different
sides of the same story. In other words, they are values that divide the same set of
observations in different ways, fragmenting those observations into several equal parts.
QUARTILE
A quartile is a statistical value that divides a sorted dataset into four equal parts. It
helps us understand how the data is distributed and where most of the values fall.
Concept:
Imagine you have a list of exam scores for a class, arranged from lowest to
highest.
Dividing this list into four equal parts gives you three quartile values: Q1, Q2,
and Q3.
Types of Quartiles:
First Quartile (Q1): Also called the lower quartile, it represents the value at
which 25% of the data falls below it and 75% falls above it.
Second Quartile (Q2): This is the median of the dataset. It represents the
middle value when the data is sorted, with 50% of the data falling below it and
50% above it.
Third Quartile (Q3): Also called the upper quartile, it represents the value at
which 75% of the data falls below it and 25% falls above it.
Applications:
Understanding the spread of data: Knowing the quartiles gives you an idea of
how spread out the data is. A small difference between Q1 and Q3 indicates
that most of the data is clustered around the median (Q2). A large difference
suggests a wider spread with more values towards the extremes.
Identifying outliers: Values significantly lower than Q1 or higher than Q3 can
be potential outliers that deserve further investigation.
Comparing datasets: Quartiles allow you to compare the distribution of data
across different groups or populations.
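A minimal sketch of the three cut points, assuming Python 3.8+ (statistics.quantiles); the scores are illustrative, and the exact values depend on the quantile method chosen:

```python
import statistics

# Illustrative exam scores.
scores = [35, 41, 47, 52, 55, 58, 61, 64, 69, 74, 80, 88]

# n=4 returns the three quartile cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
print(f"Middle 50% of scores lies between {q1} and {q3}")
```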
DECILES
Deciles are statistical values that divide a sorted dataset into ten equal parts. They provide
more granularity compared to quartiles (which divide data into fourths).
Number of Deciles:
There are nine deciles (D1 to D9). Each decile marks a point below which a specific
percentage of the data falls.
D1 Explained:
D1 refers to the first decile. Here's a breakdown:
D1: Represents the value at which 10% of the data falls below it and 90% falls above
it.
Example: Imagine a dataset containing exam scores for 100 students, arranged from lowest to
highest.
D1 would be the score at which 10 students (10%) scored lower than or equal to that
value. The remaining 90 students scored higher than D1.
D2: The score where 20% of students scored lower and 80% higher.
D3: The score where 30% scored lower and 70% higher, and so on.
D9: The score where 90% of students scored lower and 10% higher (almost reaching
the highest score).
Applications of Deciles:
Detailed Distribution: Deciles provide a more precise picture of how data is spread
out compared to quartiles.
Identifying Specific Segments: They can be used to identify specific segments
within a dataset. For example, D5 (median) divides the data into half, while D7 might
represent the score above which only the top 30% of students scored.
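A short sketch of decile cut points, again assuming Python 3.8+; the 100 scores are illustrative:

```python
import statistics

# 100 illustrative exam scores.
scores = list(range(1, 101))

# n=10 returns the nine decile cut points D1 to D9.
deciles = statistics.quantiles(scores, n=10)
for i, d in enumerate(deciles, start=1):
    print(f"D{i}: {d}")
```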
PERCENTILES
Concept:
Percentiles are statistical values that divide a sorted dataset into 100 equal parts
(P1 to P99), providing even finer granularity than deciles.
Types of Percentiles:
P1: Represents the value where 1% of the data falls below it and 99% falls
above it.
P25: This is the first quartile (Q1). It represents the value at which 25% of the
data falls below it and 75% falls above it.
P50: This is the median (Q2) of the dataset. It represents the middle value with
50% of the data falling below it and 50% above it.
P75: This is the third quartile (Q3). It represents the value at which 75% of the
data falls below it and 25% falls above it.
P99: Represents the value where 99% of the data falls below it and 1% falls
above it.
Applications:
Quartiles (P25, P50, P75) are commonly used for a basic understanding of data
spread.
Deciles (P10, P20, ..., P90) provide more detailed information.
You can choose any specific percentile (like P90 or P95) to pinpoint a specific
point in the data distribution.
Calculation Steps:
1. Sort the data in ascending order.
2. Compute the Ordinal Rank = (percentile / 100) × n, where n is the number of
data points.
3. Identify the Value: the value at the next rank after the ordinal rank is the
desired percentile value.
Example 1:
Data: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
Percentiles:
o 30th percentile:
Ordinal Rank = (30 / 100) × 10 = 3
(Next rank: 4, value: 40)
o 40th percentile:
Ordinal Rank = (40 / 100) × 10 = 4
(Next rank: 5, value: 50)
o 50th percentile:
Ordinal Rank = (50 / 100) × 10 = 5
(Next rank: 6, value: 60)
Example 2:
Data: 25, 25, 26, 36, 39, 40, 40, 44, 44, 44, 45, 47, 48, 51, 52, 52, 52, 53, 67, 77
Percentiles:
o 10th percentile: 25
o 30th percentile: 36
o 60th percentile: 52
o 80th percentile: 52
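A minimal sketch of the ordinal-rank rule used in Example 1; note this is only one of several percentile conventions, and library defaults that interpolate (e.g. numpy.percentile) will give somewhat different answers:

```python
import math

def percentile_next_rank(data, p):
    """Percentile via the ordinal-rank rule above:
    ordinal rank = (p / 100) * n, then take the value at the next rank
    (rounding a fractional rank up to the next whole position)."""
    values = sorted(data)
    n = len(values)
    rank = (p / 100) * n
    # floor(rank) as a 0-based index is exactly the "next rank" as a 1-based
    # position; clamp so that p = 100 stays inside the list.
    index = min(math.floor(rank), n - 1)
    return values[index]

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(percentile_next_rank(data, 30))  # ordinal rank 3 -> 4th value -> 40
print(percentile_next_rank(data, 50))  # ordinal rank 5 -> 6th value -> 60
```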
9. How Spread Out Is Our Data? Exploring Range and
Interquartile Range (IQR)
We've learned about measures of central tendency, like mean or median, which tell us
the typical value in a dataset. But data isn't always perfectly clustered around that
central point. Some datasets are spread out widely, while others are tightly packed
together. This spread or dispersion is just as important to understand as the average
value. In statistics, we have different terms for this spread: variability, dispersion, or
simply how "spread out" the data is. Similar to having multiple ways to measure the
centre, there are also several ways to quantify how spread out the data points are in a
distribution.
Key Points:
Measures of central tendency (mean, median, and mode) describe the "typical"
value in a dataset.
Measures of variation tell us how spread out the data points are around the
central value.
A low variation indicates the data points are close to the centre.
High variation signifies the data points are scattered further away from the
centre.
Variability, dispersion, and spread all refer to the same concept - how wide the
distribution of data is.
RANGE
Let's start with the range because it is the most straightforward measure of
variability to calculate and the simplest to understand. The range of a dataset
is the difference between the largest and smallest values in that dataset. For
example, if dataset 1 runs from 20 to 38, its range is 38 − 20 = 18, while dataset 2,
running from 11 to 52, has a range of 52 − 11 = 41. Dataset 2 has a wider range and,
hence, more variability than dataset 1.
While the range is easy to understand, it is based on only the two most extreme
values in the dataset, which makes it very susceptible to outliers. If one of those
numbers is unusually high or low, it affects the entire range even if it is atypical.
Additionally, the size of the dataset affects the range. In general, you are less likely to
observe extreme values. However, as you increase the sample size, you have more
opportunities to obtain these extreme values.
Consequently, when you draw random samples from the same population, the range
tends to increase as the sample size increases. For this reason, use the range to compare
variability only when the sample sizes are similar.
Example:
Imagine you have a bunch of leaves and want to know how spread out their sizes are.
The average (mean) size tells you one thing, but it doesn't reveal the whole picture.
The interquartile range (IQR) helps us understand the "middle majority" of leaf sizes.
1. Splitting the Data: We can divide the leaves (or any data set) into four equal
quarters, ordered from smallest to largest. Statisticians call the three cut points
between these quarters "quartiles" and label them Q1 (lower quartile), Q2 (the
median), and Q3 (upper quartile); there is no Q4.
2. Focusing on the Middle: The IQR specifically zooms in on the middle two
quarters of the data, the values lying between Q1 and Q3. Q2, the median, sits at
the exact centre of this region.
3. IQR: The Gap in the Middle: The IQR is the difference between the value in
the upper quartile (Q3) and the value in the lower quartile (Q1). In simpler
terms, it tells you the range of values that encompasses the middle 50% of the
data, excluding the most extreme values at either end.
Key takeaway: The IQR provides a clearer picture of how spread out the majority of
the data points are, focusing on the central area and potentially revealing outliers on
the fringes.
IQR Calculations
We've learned that the median is a good central tendency measure because it's not
easily swayed by extreme values (outliers). The interquartile range (IQR) shares this
strength, making it a robust measure of variability.
In essence, IQR is a powerful tool for understanding data spread, especially when
outliers or skewed distributions might distort other measures of variability.
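A minimal sketch contrasting the IQR with the range on an illustrative leaf-length data set that contains one outlier (assuming Python 3.8+ for statistics.quantiles):

```python
import statistics

# Illustrative leaf lengths in cm; the last value is an outlier.
leaves = [4.1, 4.3, 4.4, 4.6, 4.7, 4.9, 5.0, 5.2, 5.3, 12.0]

q1, q2, q3 = statistics.quantiles(leaves, n=4)
iqr = q3 - q1
full_range = max(leaves) - min(leaves)

print(f"Range: {full_range:.2f}")  # inflated by the single outlier
print(f"IQR:   {iqr:.2f}")         # reflects the spread of the middle 50%
```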
Both mean deviation (MD) and standard deviation (SD) are measures of variability in
statistics, but they differ in their approach and how they handle data. Here's a
breakdown to understand their key differences:
Mean Deviation (MD):
Concept: MD measures the average absolute distance of each data point from the
mean. Each deviation counts only by its magnitude, so no single value is weighted
more heavily than its distance from the mean warrants.

$$\mathrm{MD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \mu \rvert$$

where:
n is the number of data points,
x_i represents each data value, and
µ is the mean of the data set.
Standard Deviation (SD):
Concept: SD is the square root of the average squared distance of each data point
from the mean. It considers not just the direction (positive or negative) of each
deviation but also its magnitude, and squaring the differences emphasizes larger
deviations more heavily.
Calculation:
1. Calculate the mean of the data.
2. Find the squared difference between each data point and the mean.
3. Take the average of those squared differences.
4. Finally, take the square root of the result obtained in step 3 (to bring the
units back to the original scale of the data).
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$$

where:
n is the number of data points,
x_i represents each data value, and
µ is the mean of the data set.
Choosing Between MD and SD:
In general:
Use SD when normality is assumed and outliers are not a major concern.
Use MD when dealing with skewed data or when outliers might
significantly affect the results.
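To close, a minimal sketch computing both measures in plain Python, using the population mean µ as in the formulas above; the data values are illustrative:

```python
import math

def mean_deviation(data):
    """Mean (absolute) deviation: the average of |x_i - mean|."""
    mu = sum(data) / len(data)
    return sum(abs(x - mu) for x in data) / len(data)

def standard_deviation(data):
    """Population standard deviation: the square root of the
    average squared deviation from the mean."""
    mu = sum(data) / len(data)
    return math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))

data = [2, 4, 4, 4, 5, 5, 7, 9]      # mean = 5
print(mean_deviation(data))          # 1.5
print(standard_deviation(data))      # 2.0
```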