Final Report
Name: G, Srinuvasan
Project: Customer Churn
Group: DATA SCIENCE
Date of Submission: 02/06/2024
A study on “Unveiling Customer Churn”
Submitted by:
G, Srinuvasan
USN: 221VMTR01646
Under the guidance of:
Nimesh Marfatia
(Faculty-JAIN Online)
I, G, Srinuvasan, hereby declare that the Research Project Report titled “Unveiling
Customer Churn” has been prepared by me under the guidance of Nimesh
Marfatia. I declare that this Project work is towards the partial fulfilment of the
University Regulations for the award of the degree of Master of Computer Applications
by Jain University, Bengaluru. I have undergone a project for a period of eight weeks. I
further declare that this Project is based on the original study undertaken by me and has
not been submitted for the award of any degree/diploma from any other University /
Institution.
Abstract:
This research project titled "Unveiling Customer Churn" aims to investigate and
predict customer churn within the context of various industries. The study utilizes data
science methodologies to develop predictive models and extract actionable insights for
businesses. The research involves thorough data cleaning, preprocessing, exploratory data
analysis (EDA), and the implementation of predictive models such as logistic regression,
decision trees, random forest, and gradient boosting machine. By analysing key churn drivers
and model performance metrics, the study provides strategies for enhancing customer
retention and profitability.
Introduction:
Background:
In the competitive landscape of modern business, understanding and predicting customer
behavior is crucial for maintaining a stable and loyal customer base. One of the significant
challenges faced by companies is customer churn, where customers stop doing business with
a company. Accurately predicting customer churn can help businesses implement targeted
retention strategies, thereby reducing turnover and increasing profitability.
Problem Statement:
The problem at hand is to effectively predict customer churn using machine learning models.
By accurately identifying customers who are likely to churn, businesses can proactively
implement retention strategies to mitigate churn rates and maintain a loyal customer base.
Objective of Study:
The objective of this study is to evaluate various machine learning models to determine the
most effective approach for predicting customer churn. The models considered include
Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting Machine (GBM).
The study aims to compare these models using evaluation metrics such as accuracy,
precision, recall, F1-score, and Area Under the ROC Curve (AUC). Ultimately, the goal is to
identify the model that provides the most reliable and actionable predictions, guiding the
implementation of effective retention strategies to minimize churn and enhance customer
satisfaction.
Literature Review
Company and Industry Overview
The Telecommunications Industry
The telecommunications industry is known for its intense competition and high customer
churn rates. Customers can easily switch between providers because many companies offer
similar services, making it challenging to maintain a loyal customer base. The ease of
switching, coupled with aggressive marketing strategies and promotional offers from
competitors, contributes to high churn rates.
Churn analysis in this industry is critical because of the significant costs associated with
acquiring new customers compared to retaining existing ones. Extensive customer interaction
data, including call details, internet usage, customer service interactions, and payment history,
provides a rich dataset for churn prediction models. The availability of such detailed data
allows for a nuanced understanding of customer behavior and churn patterns, making the
telecommunications industry a primary focus for churn studies.
Methodology
This section outlines the comprehensive methodology employed in the "Unveiling Customer
Churn" research project. The approach encompasses data preparation, exploratory data
analysis (EDA), model building, and validation, ensuring robustness and actionable insights
for business decision-making.
1. Data Preparation
Data Cleaning and Preprocessing
Missing Values Treatment:
Identification and Imputation: Missing values were identified using techniques such as
exploratory data analysis (EDA) and handled through mean imputation for numerical
variables and mode imputation for categorical variables to maintain data completeness.
Outlier Detection and Treatment:
Statistical Methods: Outliers were detected using statistical measures like z-scores and visual
methods such as box plots. Extreme outliers were either transformed using log transformation
or removed if they were determined to be data entry errors.
Variable Transformation:
Normalization and Standardization: Numerical variables were normalized or standardized
to ensure consistent scaling across features. Log transformation was applied to skewed
variables to achieve a more normal distribution.
Feature Engineering and Selection:
Irrelevant Feature Removal: Redundant or irrelevant variables were eliminated to
streamline the dataset and improve model performance.
Derived Variables: New features were engineered based on domain knowledge, such as
interaction terms, to enhance predictive accuracy.
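As a hedged illustration of this step, the short sketch below derives an interaction term and a tenure band; the column names (Service_Score, Account_user_count, Tenure) follow those referenced in the findings, but the exact derivations are illustrative assumptions rather than the study's actual features.
import pandas as pd
data = pd.read_csv('your_dataset.csv')
# Interaction term combining service quality and account size (illustrative)
data['score_x_users'] = data['Service_Score'] * data['Account_user_count']
# Bucket tenure into coarse bands; bin edges are assumed, values outside them become NaN
data['tenure_band'] = pd.cut(data['Tenure'], bins=[0, 6, 12, 24, 60],
                             labels=['0-6m', '6-12m', '1-2y', '2-5y'])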
Report Text: "In this section, we address missing values, outliers, and scaling of data. We
used mean imputation for numerical variables and mode imputation for categorical variables.
Outliers were detected and treated using z-scores."
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler

# Load dataset
data = pd.read_csv('your_dataset.csv')

# Missing values: mean imputation for numerical, mode imputation for categorical
data['numerical_column'] = data['numerical_column'].fillna(data['numerical_column'].mean())
data['categorical_column'] = data['categorical_column'].fillna(data['categorical_column'].mode()[0])

# Outliers: drop rows where the z-score is greater than 3 (indicating an outlier)
z_scores = stats.zscore(data['numerical_column'])
data = data[abs(z_scores) <= 3]

# Scaling: standardize numerical features to zero mean and unit variance
scaler = StandardScaler()
data[['numerical_feature1', 'numerical_feature2']] = scaler.fit_transform(
    data[['numerical_feature1', 'numerical_feature2']])
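The log transformation for skewed variables mentioned above can be sketched as follows; np.log1p is used so zero values are handled safely, and the column name is illustrative.
import numpy as np
# Log-transform a right-skewed variable; log1p maps 0 to 0 and compresses large values
data['numerical_column_log'] = np.log1p(data['numerical_column'])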
Report Text: "Univariate analysis was conducted to understand the distribution of individual
features. Bivariate analysis helped in identifying relationships between variables, while
multivariate analysis was used to uncover complex interactions."
# Univariate analysis
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting the distribution of a numerical feature
plt.figure(figsize=(10,6))
sns.histplot(data['numerical_feature'], kde=True)
plt.title('Distribution of Numerical Feature')
plt.xlabel('Feature')
plt.ylabel('Frequency')
plt.show()
# Bivariate analysis
# Plotting the relationship between two numerical features
plt.figure(figsize=(10,6))
sns.scatterplot(x='numerical_feature1', y='numerical_feature2', data=data)
plt.title('Feature1 vs Feature2')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.show()
# Correlation heatmap
# Calculating the correlation matrix over numeric columns only
correlation_matrix = data.select_dtypes(include='number').corr()
# Plotting the heatmap of the correlation matrix
plt.figure(figsize=(12,8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
Performance Enhancement:
Hyperparameter Tuning: Grid search and cross-validation were used for hyperparameter
tuning to optimize model settings and performance.
Feature Engineering: Enhanced model predictive power through careful selection and
transformation of features based on domain knowledge.
Model Validation
Validation Techniques:
Holdout Method: Models were validated using a holdout test set to ensure unbiased
performance evaluation.
Cross-Validation: Techniques like k-fold cross-validation were employed to assess model
stability and generalizability.
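A minimal sketch of k-fold cross-validation with scikit-learn is shown below, assuming a prepared feature matrix X and target y; the baseline estimator and scoring choice are illustrative.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# 5-fold cross-validated AUC for a baseline model; X and y are assumed prepared
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f'Mean AUC: {scores.mean():.3f} (+/- {scores.std():.3f})')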
Evaluation Metrics:
Performance Metrics: Accuracy, precision, recall, F1-score, and Area Under the ROC Curve
(AUC) were used to evaluate and compare model performance.
Business Implications: Selection of the best-performing model based on these metrics
ensured reliable churn prediction and actionable insights for retention strategies.
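For concreteness, these metrics can be computed as in the sketch below, assuming y_test (true labels), y_pred (predicted labels), and y_prob (predicted churn probabilities) from a fitted model.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
# y_test, y_pred, and y_prob are assumed to come from a fitted model
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1-score :', f1_score(y_test, y_pred))
print('AUC      :', roc_auc_score(y_test, y_prob))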
[Graph: Comparison of model performance metrics]
Explanation: The graph provides a visual comparison of the performance metrics for the various models implemented in the study. It highlights key metrics such as accuracy, precision, recall, F1-score, and AUC (Area Under the ROC Curve) for each model, facilitating a clear understanding of their relative effectiveness.
Implementation Steps
1. Data Preparation:
Preprocessing: Rigorous data cleaning and preprocessing were conducted to ensure high-
quality input for model building. Data was split into training (80%) and testing (20%) sets to
facilitate model training and validation.
2. Model Training:
Optimization: Models were trained on the training set with hyperparameter tuning to achieve
optimal performance.
3. Model Evaluation:
Assessment: Performance was assessed on the test set using comprehensive metrics to ensure
model reliability.
4. Model Interpretation:
Insight Extraction: Interpretation of logistic regression coefficients and decision tree paths
provided insights into churn drivers. Feature importance scores in ensemble models were
analysed to understand the contribution of each feature; a brief coefficient sketch follows this list.
5. Deployment:
Integration: The best-performing model was integrated into the business's CRM system for
operational use. Regular monitoring and retraining were scheduled to maintain model
accuracy over time.
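As a brief sketch of the coefficient interpretation described in step 4, assuming X_train is a DataFrame and y_train holds the churn labels from the 80/20 split:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
# Fit a logistic regression and express coefficients as odds ratios;
# ratios above 1 increase the odds of churn, below 1 decrease them
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
coefs = pd.Series(logit.coef_[0], index=X_train.columns)
print(np.exp(coefs).sort_values(ascending=False))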
This methodology ensures a rigorous, accurate, and actionable approach to customer churn
prediction, providing valuable insights for enhancing customer retention and profitability.
Report Text: "We implemented several models including Logistic Regression, Decision
Trees, Random Forest, and Gradient Boosting Machines. Hyperparameter tuning was
performed using grid search and cross-validation."
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, roc_curve
import matplotlib.pyplot as plt
# X_train, X_test, y_train, y_test are assumed from the 80/20 split described above
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, {'n_estimators': [100, 200], 'max_depth': [None, 10]}, cv=5)
grid_search.fit(X_train, y_train)
# Model evaluation on the held-out test set
y_pred = grid_search.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred))
# ROC curve from predicted churn probabilities
y_prob = grid_search.best_estimator_.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr)
plt.title('ROC Curve')
plt.show()
Report Text: "The correlation heatmap below highlights the correlations between different features in the dataset. This visual representation helps identify strongly correlated features, which can inform feature selection and engineering steps."
[Graph: Correlation heatmap of dataset features]
Explanation:
"The correlation heatmap above illustrates the relationships between various features in the dataset.
Each cell in the heatmap represents the correlation coefficient between two features, with values
ranging from -1 to 1. A value close to 1 indicates a strong positive correlation, meaning that as one
feature increases, the other tends to increase as well. Conversely, a value close to -1 indicates a
strong negative correlation, where one feature increases as the other decreases. Values near 0
suggest no linear correlation between the features.
- Service Score and Account User Count: There appears to be a moderate positive correlation,
indicating that accounts with higher service scores tend to have more users.
- Complains (L12m) and Churn: A positive correlation is visible, suggesting that customers who have
lodged more complaints in the last 12 months are more likely to churn.
- Revenue per Month and Rev Growth YoY: A strong positive correlation, indicating that higher
monthly revenue is associated with higher year-over-year revenue growth.
These insights are crucial for feature selection and engineering, as highly correlated features may
introduce multicollinearity, potentially affecting the performance and interpretability of machine
learning models."
"In our EDA, we analyzed the distributions of individual features and their relationships. The
correlation heatmap below highlights the correlations between different features in the dataset. This
visual representation helps identify strongly correlated features, which can inform feature selection
and engineering steps."
"The correlation heatmap above illustrates the relationships between various features in the dataset.
Each cell in the heatmap represents the correlation coefficient between two features, with values
ranging from -1 to 1. A value close to 1 indicates a strong positive correlation, meaning that as one
feature increases, the other tends to increase as well. Conversely, a value close to -1 indicates a
strong negative correlation, where one feature increases as the other decreases. Values near 0
suggest no linear correlation between the features.
Service Score and Account User Count: There appears to be a moderate positive correlation,
indicating that accounts with higher service scores tend to have more users.
- Complains (L12m) and Churn: A positive correlation is visible, suggesting that customers who have
lodged more complaints in the last 12 months are more likely to churn.
- Revenue per Month and Rev Growth YoY: A strong positive correlation, indicating that higher
monthly revenue is associated with higher year-over-year revenue growth.
These insights are crucial for feature selection and engineering, as highly correlated features may
introduce multicollinearity, potentially affecting the performance and interpretability of machine
learning models."
Findings Based on Data Analysis
1. Account Usage:
Accounts with higher user counts (Account_user_count) and lower service scores
(Service_Score) have a higher likelihood of churn. The quality of service provided and the
number of active users on an account significantly impact churn rates.
2. Model Performance:
The Random Forest model achieved a high ROC AUC score, indicating good performance in
predicting churn. Further tuning and comparison with other models can optimize results.
Evaluating models using metrics such as accuracy, precision, recall, and F1-score helps in
selecting the best-performing model.
[Graph: Distribution of churned vs. non-churned customers]
This graph illustrates the distribution of churned versus non-churned customers within the
dataset. The x-axis represents the churn status, with '0' indicating non-churned customers and
'1' indicating churned customers. The y-axis shows the count of customers in each category.
From the graph, it is evident that non-churned customers substantially outnumber churned
customers. This visual representation underscores the imbalance in the dataset, highlighting
the need to address class imbalance through techniques such as SMOTE or resampling to
ensure more accurate and reliable model training; a brief sketch follows below.
The graph is placed in the "Findings Based on Data Analysis" section to visually support the
discussion about churn, making it easier to understand the extent of customer churn and its
implications for the analysis.
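A minimal sketch of addressing this imbalance with SMOTE, assuming the imbalanced-learn package is installed and X_train, y_train come from the 80/20 split described in the methodology:
from imblearn.over_sampling import SMOTE
# Oversample the minority (churn) class in the training data only,
# leaving the test set untouched to keep evaluation unbiased
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
print(y_train_bal.value_counts())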
1. Correlation Analysis:
- The correlation matrix reveals that features like tenure, service score, and call center
interactions are significant predictors of churn.
- Strong correlations between variables such as 'Tenure' and 'Churn' indicate that
longer-tenured customers are less likely to churn.
2. Feature Importance:
- Random Forest feature importance analysis highlights 'Tenure', 'Service_Score',
'Account_user_count', and 'CC_Contacted_L12m' as key predictors of churn (see the sketch
after this list).
- These features significantly impact the model's ability to predict customer churn.
3. Class Imbalance:
- The dataset shows an imbalance, with a higher proportion of non-churners compared
to churners. Techniques like SMOTE or resampling are necessary for balanced model
training.
- Addressing class imbalance is crucial for improving model accuracy and reliability.
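These importances can be read off a fitted model as in the sketch below, assuming best_rf is the tuned Random Forest (e.g., grid_search.best_estimator_ from the methodology section) and X_train is a DataFrame:
import pandas as pd
# Rank features by their importance in the fitted Random Forest
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))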
General Findings
1. Churn Patterns:
Customers with short tenure, frequent service issues, and low service scores are more
likely to churn.
Analysing churn patterns helps in identifying high-risk customers and developing
targeted retention strategies.
2. Service Quality Impact:
High service quality and satisfaction, indicated by high Service_Score and
CC_Agent_Score, are crucial for customer retention.
Improving service quality can significantly reduce churn rates.
Recommendations
1. Customer Support Improvement:
Improve customer support and resolve issues promptly to reduce churn related to
service interactions. Implementing robust customer support systems can help address
customer issues more effectively.
2. Data Collection:
Improve data collection methods to ensure completeness and accuracy, especially for
critical features influencing churn. Accurate and complete data is essential for reliable
churn prediction and analysis.
3. Advanced Analytics:
Incorporate advanced analytics techniques like machine learning and deep learning
for more accurate churn prediction. Advanced analytics can provide deeper insights
into customer behaviour and churn patterns.
Scope for Future Research
1. Longitudinal Studies:
Conduct longitudinal studies to understand how customer behavior and churn patterns
evolve over time. Long-term studies can provide insights into trends and changes in
customer behavior.
2. Cross-Industry Analysis:
Expand the study to other industries to identify common churn factors and develop
industry-specific retention strategies. Cross-industry analysis can help in generalizing
findings and applying them to different contexts.
3. Integration with Other Data Sources:
Integrate data from social media, customer reviews, and other sources to enrich the
dataset and provide more comprehensive insights. Additional data sources can enhance
the analysis and provide a holistic view of customer behaviour.
Conclusion
The analysis highlights the importance of understanding and predicting customer
churn to develop effective retention strategies. By focusing on key predictors such as tenure,
service quality, and customer interaction, businesses can proactively address churn and
enhance customer loyalty. Implementing the recommendations and continuously improving
the analytical models will enable sustained growth and customer satisfaction.