20 Important Statistical Approaches Every Data Scientist Should Know
Data science is a multidisciplinary field that combines mathematics, statistics, computer science, and domain expertise to extract insights from data. While machine learning algorithms often steal the spotlight, a solid foundation in statistical methods is just as pivotal. In this article, we explore 20 statistical approaches that every data scientist should be familiar with. Each approach includes a brief explanation, an example use case, and a snippet of Python code to illustrate its usage.
1. Bayesian Inference
What It Is
Bayesian inference uses Bayes’ theorem to update the probability of a hypothesis as more evidence or information becomes available. Unlike frequentist methods, Bayesian approaches allow you to incorporate prior beliefs and update those beliefs with observed data.
Example Use Case
- Spam Filtering: Prior belief on how likely an email is spam can be combined with new evidence (words in the email) to update the spam probability.
Code Snippet
Below is a simple example using PyMC3 (or its successor, PyMC) to perform Bayesian parameter estimation for a Bernoulli process (e.g., coin flips).
!pip install pymc3 # or pymc
import pymc3 as pm
import numpy as np
# Suppose we observed 20 coin flips with 12 heads and 8 tails
observed_heads = 12
observed_tails = 8
with pm.Model() as model:
    # Prior for the bias of the coin (theta)
    theta = pm.Beta('theta', alpha=1, beta=1)
    # Likelihood
    y = pm.Binomial('y', n=observed_heads + observed_tails, p=theta, observed=observed_heads)
    # Posterior sampling
    trace = pm.sample(2000, tune=1000, cores=1, chains=2)
pm.summary(trace)
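Because the Beta prior is conjugate to the binomial likelihood, this particular posterior also has a closed form, which makes a handy sanity check on the sampler. A minimal sketch using scipy.stats:
from scipy.stats import beta
# Beta(1, 1) prior updated with 12 heads and 8 tails gives a Beta(13, 9) posterior
posterior = beta(1 + observed_heads, 1 + observed_tails)
print("Analytic posterior mean:", posterior.mean())
print("Analytic 95% credible interval:", posterior.interval(0.95))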
2. Maximum Likelihood Estimation (MLE)
What It Is
MLE finds the parameter values that maximize the likelihood of the observed data under a certain statistical model.
Example Use Case
- Fitting Distributions: Estimate parameters (mean, variance) of a normal distribution that best fit your data.
Code Snippet
Using scipy.stats to fit a normal distribution:
import numpy as np
from scipy.stats import norm
# Generate synthetic data
data = np.random.normal(loc=5, scale=2, size=1000)
# Estimate MLE for mean and standard deviation
mu_hat, std_hat = norm.fit(data)
print(f"Estimated Mean (mu): {mu_hat:.2f}")
print(f"Estimated Std (sigma): {std_hat:.2f}")
3. Hypothesis Testing (t-test)
What It Is
Hypothesis testing involves formulating a null hypothesis (no difference/effect) and an alternative hypothesis. A t-test specifically checks if the means of two groups are significantly different.
Example Use Case
- A/B Testing: Testing if a new website layout (Group B) leads to a significantly different average session time compared to the old layout (Group A).
Code Snippet
Using scipy.stats.ttest_ind for a two-sample t-test:
import numpy as np
from scipy.stats import ttest_ind
# Synthetic data
group_A = np.random.normal(5, 1, 50)
group_B = np.random.normal(5.5, 1.2, 50)
stat, pvalue = ttest_ind(group_A, group_B)
print(f"T-statistic: {stat:.2f}, p-value: {pvalue:.4f}")
if pvalue < 0.05:
    print("Reject the null hypothesis (Significant difference).")
else:
    print("Fail to reject the null hypothesis (No significant difference).")
4. Analysis of Variance (ANOVA)
What It Is
ANOVA tests whether there is a statistically significant difference between the means of three or more groups.
Example Use Case
- Marketing Experiment: Evaluate the effectiveness of three different advertising strategies by measuring sales lift.
Code Snippet
Using scipy.stats.f_oneway:
import numpy as np
from scipy.stats import f_oneway
# Synthetic data
group1 = np.random.normal(10, 2, 30)
group2 = np.random.normal(12, 2, 30)
group3 = np.random.normal(14, 2, 30)
f_stat, p_val = f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.2f}, p-value: {p_val:.4f}")
5. Principal Component Analysis (PCA)
What It Is
PCA reduces the dimensionality of data by projecting it onto new, orthogonal axes (principal components) that capture the most variance.
Example Use Case
- Image Compression: Reduce high-dimensional pixel data into fewer features for faster processing.
Code Snippet
Using sklearn.decomposition.PCA:
import numpy as np
from sklearn.decomposition import PCA
# Synthetic data: 100 samples with 10 features
X = np.random.rand(100, 10)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Explained variance ratios:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)
6. Factor Analysis
What It Is
Factor Analysis models the observed variables as linear combinations of latent (unobserved) factors, often used for dimensionality reduction or to uncover hidden structure.
Example Use Case
- Psychometrics: Identifying underlying personality traits from questionnaire data.
Code Snippet
Using the factor_analyzer library:
!pip install factor_analyzer
import numpy as np
from factor_analyzer import FactorAnalyzer
# Synthetic data (100 samples, 6 variables)
X = np.random.rand(100, 6)
fa = FactorAnalyzer(n_factors=2, rotation='varimax')
fa.fit(X)
print("Loadings:\n", fa.loadings_)
7. Cluster Analysis (K-means)
What It Is
Clustering partitions data into homogeneous groups (clusters) based on similarity. K-means is a popular centroid-based clustering technique.
Example Use Case
- Customer Segmentation: Group customers by purchasing patterns.
Code Snippet
Using sklearn.cluster.KMeans:
import numpy as np
from sklearn.cluster import KMeans
# Synthetic data: 200 samples, 2D
X = np.random.rand(200, 2)
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
print("Cluster centers:", kmeans.cluster_centers_)
print("Cluster labels:", kmeans.labels_[:10])
8. Bootstrapping
What It Is
Bootstrapping involves repeatedly sampling with replacement from your dataset to estimate the distribution (and uncertainty) of a statistic (e.g., mean, median).
Example Use Case
- Confidence Intervals: Estimate a 95% confidence interval for the mean of a small dataset.
Code Snippet
Illustrating a simple bootstrap for the mean:
import numpy as np
np.random.seed(42)
data = np.random.normal(50, 5, size=100) # Original sample
def bootstrap_mean_ci(data, n_bootstraps=1000, ci=95):
    means = []
    n = len(data)
    for _ in range(n_bootstraps):
        sample = np.random.choice(data, size=n, replace=True)
        means.append(np.mean(sample))
    lower = np.percentile(means, (100 - ci) / 2)
    upper = np.percentile(means, 100 - (100 - ci) / 2)
    return np.mean(means), (lower, upper)
mean_estimate, (lower_ci, upper_ci) = bootstrap_mean_ci(data)
print(f"Bootstrap Mean: {mean_estimate:.2f}")
print(f"{95}% CI: [{lower_ci:.2f}, {upper_ci:.2f}]")
9. Time Series Analysis (ARIMA)
What It Is
ARIMA (AutoRegressive Integrated Moving Average) is a popular model for forecasting univariate time series data by capturing autocorrelation in the data.
Example Use Case
- Sales Forecasting: Predict future sales based on past performance.
Code Snippet
Using statsmodels.tsa.arima.model.ARIMA:
!pip install statsmodels
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# Synthetic time series data
np.random.seed(42)
data = np.random.normal(100, 5, 50)
time_series = pd.Series(data)
# Fit ARIMA model (p=1, d=1, q=1)
model = ARIMA(time_series, order=(1,1,1))
model_fit = model.fit()
# Forecast next 5 points
forecast = model_fit.forecast(steps=5)
print("Forecast:", forecast.values)
10. Survival Analysis
What It Is
Survival Analysis deals with time-to-event data, often focusing on the probability of an event (e.g., churn) occurring after a certain time.
Example Use Case
- Customer Churn: Estimate how long a customer will remain active before canceling a subscription.
Code Snippet
Using lifelines:
!pip install lifelines
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter
# Synthetic survival times and censorship
np.random.seed(42)
durations = np.random.exponential(scale=10, size=100)
event_observed = np.random.binomial(1, 0.8, size=100)
kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed, label='Test Group')
kmf.plot_survival_function()  # renders via matplotlib; call plt.show() outside notebooks
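Beyond the curve itself, summary quantities are often what end up in a report. The fitted object exposes, for example, the median survival time and pointwise survival estimates:
print("Median survival time:", kmf.median_survival_time_)
print("Estimated S(t=5):", kmf.predict(5))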
11. Multivariate Regression (Multiple Linear Regression)
What It Is
Multiple linear regression models the relationship between a dependent variable and multiple independent variables.
Example Use Case
- Pricing Models: Predict house price based on square footage, number of rooms, and location.
Code Snippet
Using sklearn.linear_model.LinearRegression:
import numpy as np
from sklearn.linear_model import LinearRegression
# Synthetic data: price = 100 + 2*rooms + 0.5*sqft + noise
np.random.seed(42)
rooms = np.random.randint(1, 5, 100)
sqft = np.random.randint(500, 2500, 100)
price = 100 + 2*rooms + 0.5*sqft + np.random.normal(0, 50, 100)
X = np.column_stack([rooms, sqft])
y = price
model = LinearRegression()
model.fit(X, y)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
12. Ridge/Lasso Regression
What It Is
- Ridge Regression adds an L2 penalty to reduce overfitting by shrinking coefficients.
- Lasso Regression adds an L1 penalty, which can drive some coefficients to zero, effectively performing feature selection.
Example Use Case
- High-Dimensional Data: Gene expression data with many correlated features.
Code Snippet
Using sklearn.linear_model:
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
# Synthetic data
np.random.seed(42)
X = np.random.rand(100, 10)
y = X[:, 0]*5 + X[:, 1]*3 + np.random.normal(0, 0.1, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
13. Logistic Regression
What It Is
Logistic regression is used for binary classification, modeling the probability of a certain class or event existing.
Example Use Case
- Credit Card Fraud Detection: Classify transactions as fraudulent or legitimate.
Code Snippet
Using sklearn.linear_model.LogisticRegression:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
np.random.seed(42)
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100) # Binary labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)
14. Mixed Effects Models
What It Is
Mixed effects models (or hierarchical linear models) include both fixed effects (common to all groups) and random effects (specific to each group). They are commonly used in longitudinal or grouped data.
Example Use Case
- Educational Data: Test scores across multiple schools, each school having a random intercept.
Code Snippet
Using statsmodels' mixedlm:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
# Synthetic data: Each student belongs to a specific school
np.random.seed(42)
school_ids = np.repeat(np.arange(10), 20)
scores = 50 + 5*np.random.rand(200) + 2*school_ids + np.random.normal(0, 5, 200)
df = pd.DataFrame({"score": scores, "school": school_ids})
model = smf.mixedlm("score ~ 1", df, groups=df["school"])
result = model.fit()
print(result.summary())
15. Nonparametric Tests (Mann-Whitney U)
What It Is
Nonparametric tests do not assume a specific distribution of the data. The Mann-Whitney U test is used to compare two independent samples.
Example Use Case
- Median Comparison: Compare the median of sales in two stores without assuming normality.
Code Snippet
Using scipy.stats.mannwhitneyu:
import numpy as np
from scipy.stats import mannwhitneyu
# Synthetic data
group_A = np.random.exponential(scale=1.0, size=30)
group_B = np.random.exponential(scale=1.2, size=30)
stat, pvalue = mannwhitneyu(group_A, group_B, alternative='two-sided')
print(f"Statistic: {stat:.2f}, p-value: {pvalue:.4f}")
16. Monte Carlo Simulation
What It Is
Monte Carlo simulations use repeated random sampling to estimate the probability of different outcomes under uncertainty.
Example Use Case
- Risk Analysis: Predict the probability of project cost overruns given uncertain variables like labor cost, raw material cost, etc.
Code Snippet
Estimating π using a Monte Carlo approach:
import numpy as np
np.random.seed(42)
n_samples = 1_000_000
xs = np.random.rand(n_samples)
ys = np.random.rand(n_samples)
# Points inside the quarter unit circle (their fraction estimates pi/4)
inside_circle = (xs**2 + ys**2) <= 1.0
pi_estimate = inside_circle.sum() * 4 / n_samples
print("Estimated π:", pi_estimate)
17. Markov Chain Monte Carlo (MCMC)
What It Is
MCMC methods (e.g., Metropolis-Hastings, Gibbs sampling) are used in Bayesian inference to generate samples from a posterior distribution when direct sampling is difficult.
Example Use Case
- Parameter Estimation: Complex hierarchical models where direct integration is not feasible.
Code Snippet
Using PyMC3 to run a simple MCMC:
import pymc3 as pm
import numpy as np
# Synthetic data
np.random.seed(42)
data = np.random.normal(0, 1, 100)
with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=10)
    sigma = pm.HalfNormal('sigma', sd=1)
    likelihood = pm.Normal('likelihood', mu=mu, sd=sigma, observed=data)
    trace = pm.sample(1000, tune=500, chains=2)
pm.summary(trace)
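To demystify what pm.sample does, here is a hand-rolled Metropolis sampler for mu alone (sigma fixed at 1 for brevity). This is a bare-bones sketch rather than a substitute for a tuned sampler; metropolis_mu is our own helper:
def metropolis_mu(data, n_iter=5000, proposal_sd=0.5):
    def log_post(mu):
        # log of the N(0, 10) prior on mu plus the log-likelihood with sigma = 1
        return -mu**2 / (2 * 10**2) - 0.5 * np.sum((data - mu)**2)
    mu_current, samples = 0.0, []
    for _ in range(n_iter):
        mu_proposal = mu_current + np.random.normal(0, proposal_sd)
        # accept with probability min(1, posterior ratio)
        if np.log(np.random.rand()) < log_post(mu_proposal) - log_post(mu_current):
            mu_current = mu_proposal
        samples.append(mu_current)
    return np.array(samples)
mh_samples = metropolis_mu(data)
print("Posterior mean of mu (hand-rolled):", mh_samples[1000:].mean())  # discard burn-in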
18. Robust Regression
What It Is
Robust regression methods (e.g., RANSAC, Huber regression) are less sensitive to outliers compared to ordinary least squares.
Example Use Case
- Outlier-Prone Data: Fitting a model to data that contains extreme values, such as in finance.
Code Snippet
Using sklearn.linear_model.RANSACRegressor:
import numpy as np
from sklearn.linear_model import RANSACRegressor, LinearRegression
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 3 * X.squeeze() + 2 + np.random.normal(0, 2, 100)
# Add outliers
X_outliers = np.array([[8], [9], [9.5]])
y_outliers = np.array([50, 55, 60])
X = np.vstack((X, X_outliers))
y = np.concatenate((y, y_outliers))
ransac = RANSACRegressor(estimator=LinearRegression(), max_trials=100)  # use base_estimator= on scikit-learn < 1.1
ransac.fit(X, y)
print("RANSAC Coefficients:", ransac.estimator_.coef_)
print("RANSAC Intercept:", ransac.estimator_.intercept_)
19. Copulas
What It Is
Copulas capture the dependence structure between random variables separately from their marginal distributions. They are popular in finance for modeling joint distributions of asset returns.
Example Use Case
- Portfolio Risk: Model the joint behavior of multiple stocks that exhibit non-linear dependence.
Code Snippet
A simple example using a Gaussian copula from the copulas library:
!pip install copulas
import numpy as np
from copulas.multivariate import GaussianMultivariate
# Synthetic data
X = np.random.normal(0, 1, (1000, 2))
X[:,1] = 0.8 * X[:,0] + np.random.normal(0, 0.6, 1000) # correlation
model = GaussianMultivariate()
model.fit(X)
# sample() returns a pandas DataFrame; draw enough rows for a stable estimate
samples = model.sample(1000).to_numpy()
print("Original correlation:", np.corrcoef(X[:, 0], X[:, 1])[0, 1])
print("Sample correlation:", np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])
20. Generalized Additive Models (GAMs)
What It Is
GAMs extend linear models by allowing for non-linear functions of predictors while maintaining additivity. They’re more flexible than linear regression but still interpretable.
Example Use Case
- Health Data: Modeling a patient outcome as a smooth non-linear function of age and other variables.
Code Snippet
Using the pyGAM library:
!pip install pygam
import numpy as np
from pygam import LinearGAM, s
# Synthetic data
np.random.seed(42)
X = np.random.rand(200, 1) * 10
y = 2 + 3*np.sin(X).ravel() + np.random.normal(0, 0.5, 200)
gam = LinearGAM(s(0)).fit(X, y)
XX = np.linspace(0, 10, 100).reshape(-1, 1)  # pyGAM expects a 2D feature array
preds = gam.predict(XX)
print("First 5 predictions:", preds[:5])
gam.summary()  # prints the model report; it returns None, so don't wrap it in print()
Conclusion
From Bayesian inference and MLE through more advanced concepts like copulas and GAMs, these 20 statistical approaches form a comprehensive toolkit for any data scientist. The code snippets here are brief illustrations, but each method can be explored in depth to address complex real-world problems, whether in forecasting, inference, or modeling intricate relationships.
Happy Analyzing!
Further Reading
- Statistical Rethinking by Richard McElreath
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Pattern Recognition and Machine Learning by Christopher Bishop
If you found this post helpful, feel free to share it and leave a comment! And remember, mastering these statistical techniques will not only make you a stronger data scientist but also enable you to interpret and communicate data-driven insights with confidence.