Hands-On Activity 3.3 Random Forest Mantaring - Ipynb - Mantaring
Resources:
Jupyter Notebook
loan_data
Procedure:
We will use lending data from 2007-2010 to classify and predict whether or not the borrower paid back their loan in full.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Load the data and check the content of the dataframe using Pandas
loans= pd.read_csv('loan_data.csv')
loans.head()
credit.policy purpose int.rate installment log.annual.inc dti fico days.with.cr.line revol.bal revol.util inq.last
loans.dtypes
https://colab.research.google.com/drive/1rjZpb7g-E8AAE_F6v3yLOeVRyfiJR_pc#scrollTo=CiJdCWviANKY&printMode=true 1/13
10/18/24, 10:35 PM Hands-on Activity 3.3 Random Forest Mantaring.ipynb - Colab
credit.policy int64
purpose object
int.rate float64
installment float64
log.annual.inc float64
dti float64
fico int64
days.with.cr.line float64
revol.bal int64
revol.util float64
inq.last.6mths int64
delinq.2yrs int64
pub.rec int64
not.fully.paid int64
dtype: object
Create a histogram of two FICO distributions on top of each other, one for each credit.policy outcome.
plt.figure(figsize=(10,6))
loans[loans['credit.policy']==1]['fico'].hist(alpha=0.5,color='blue',
bins=30,label='Credit.Policy=1')
loans[loans['credit.policy']==0]['fico'].hist(alpha=0.5,color='red',
bins=30,label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO')
Text(0.5, 0, 'FICO')
Create a similar figure, except this time select by the not.fully.paid column.
plt.figure(figsize=(10,6))
loans[loans['not.fully.paid']==1]['fico'].hist(alpha=0.5,color='blue',
bins=30,label='not.fully.paid=1')
loans[loans['not.fully.paid']==0]['fico'].hist(alpha=0.5,color='red',
bins=30,label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')
Text(0.5, 0, 'FICO')
Create a countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid.
plt.figure(figsize=(11,7))
sns.countplot(x='purpose',hue='not.fully.paid',data=loans,palette='Set1')
Create the following lmplots to see if the trend differed between not.fully.paid and credit.policy.
plt.figure(figsize=(11,7))
sns.lmplot(y='int.rate',x='fico',data=loans,hue='credit.policy',
col='not.fully.paid',palette='Set1')
<seaborn.axisgrid.FacetGrid at 0x7c528965e410>
<Figure size 1100x700 with 0 Axes>
Notice that the purpose column is categorical (dtype object). Therefore, we need to transform it using dummy variables.
cat_feats = ['purpose']
final_data = pd.get_dummies(loans,columns=cat_feats,drop_first=True)
final_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 credit.policy 9578 non-null int64
1 int.rate 9578 non-null float64
2 installment 9578 non-null float64
3 log.annual.inc 9578 non-null float64
4 dti 9578 non-null float64
5 fico 9578 non-null int64
6 days.with.cr.line 9578 non-null float64
7 revol.bal 9578 non-null int64
8 revol.util 9578 non-null float64
9 inq.last.6mths 9578 non-null int64
10 delinq.2yrs 9578 non-null int64
11 pub.rec 9578 non-null int64
12 not.fully.paid 9578 non-null int64
13 purpose_credit_card 9578 non-null bool
14 purpose_debt_consolidation 9578 non-null bool
15 purpose_educational 9578 non-null bool
16 purpose_home_improvement 9578 non-null bool
17 purpose_major_purchase 9578 non-null bool
18 purpose_small_business 9578 non-null bool
dtypes: bool(6), float64(6), int64(7)
memory usage: 1.0 MB
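To make the dummy-variable step concrete, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are illustrative assumptions, not rows from loan_data.csv. With drop_first=True, pandas drops the first category in sorted order, which is why final_data above has six purpose_ columns, one fewer than the number of distinct purposes.

```python
import pandas as pd

# Hypothetical toy data; the purpose labels are borrowed from the column names above.
toy = pd.DataFrame({
    "fico": [700, 650, 720],
    "purpose": ["credit_card", "educational", "small_business"],
})

# One indicator column per category, minus the first category in sorted order.
encoded = pd.get_dummies(toy, columns=["purpose"], drop_first=True)
print(list(encoded.columns))
```

Dropping one category avoids a redundant column: a row with every remaining indicator set to False must belong to the dropped category.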
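from sklearn.model_selection import train_test_split
# Separate the features from the target, then hold out 30% of the rows for testing.
X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)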
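As a quick sanity check on the split, the sketch below works out the expected test-set size by hand. It assumes that scikit-learn converts a fractional test_size to a row count by rounding up; the row count 9578 comes from final_data.info() above, and the variable names are illustrative.

```python
import math

n_rows = 9578       # RangeIndex size reported by final_data.info()
test_size = 0.30

# Assumption: a fractional test_size is converted with a ceiling,
# and the remaining rows go to the training set.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)
```

Under that assumption the test set has 2874 rows, which is also the sum of the four cells of the confusion matrix printed further down.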
Create an instance of the RandomForestClassifier class and fit it to our training data
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train,y_train)
RandomForestClassifier(n_estimators=600)
What is n_estimators?
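In scikit-learn, n_estimators is the number of decision trees in the forest; the forest combines the predictions of all of its trees. A minimal sketch on synthetic data (make_classification and the sizes used here are illustrative assumptions, not the loan data) shows that the fitted model holds exactly that many trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the loan data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# estimators_ is the list of fitted decision trees, one per estimator.
print(len(forest.estimators_))
```

More trees usually make the predictions more stable at the cost of longer training time.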
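from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score
predictions = rfc.predict(X_test)
print(classification_report(y_test,predictions))
print(confusion_matrix(y_test,predictions))
print(accuracy_score(y_test,predictions))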
[[2423 8]
[ 433 10]]
0.8465553235908142
print(roc_auc_score(y_test, predictions))
0.5096412683054564
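Interpret the result of the classification report, confusion matrix, accuracy score and roc_auc_score
The confusion matrix shows that most test loans are classified correctly: 2,423 loans with not.fully.paid=0 are predicted correctly and only 8 are misclassified. However, the number of errors on the other class is high: the model misses 433 of the 443 loans with not.fully.paid=1 and catches only 10. The accuracy of about 0.85 looks good, but it mostly reflects the class imbalance, since predicting not.fully.paid=0 for every loan would score about the same. The roc_auc_score of about 0.51 is poor: it is barely above the 0.5 of random guessing, so the model does a bad job of identifying the borrowers who did not pay back in full.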
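The figures discussed above can be recomputed directly from the confusion matrix. The sketch below uses only the four counts printed earlier; the variable names are illustrative. It relies on the fact that, for hard 0/1 predictions, the ROC AUC equals the average of the recall on each class.

```python
# Counts from the printed confusion matrix: rows are actual classes, columns are predicted.
tn, fp = 2423, 8     # actual 0 (not.fully.paid = 0)
fn, tp = 433, 10     # actual 1 (not.fully.paid = 1)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall_class0 = tn / (tn + fp)   # share of class 0 predicted correctly
recall_class1 = tp / (fn + tp)   # share of class 1 predicted correctly
auc_from_labels = (recall_class0 + recall_class1) / 2
baseline = (tn + fp) / total     # accuracy of always predicting class 0

print(round(accuracy, 4), round(recall_class1, 4),
      round(auc_from_labels, 4), round(baseline, 4))
```

This reproduces the reported accuracy (about 0.8466) and roc_auc_score (about 0.5096). It also shows that the recall on class 1 is only about 0.02 and that always predicting class 0 would already reach about 0.846.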
Supplementary Activity:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
bank= pd.read_csv('bankloan.csv')
bank.head()
ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online Cred
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0
bank.dtypes
ID int64
Age int64
Experience int64
Income int64
ZIP.Code int64
Family int64
CCAvg float64
Education int64
Mortgage int64
Personal.Loan int64
Securities.Account int64
CD.Account int64
Online int64
CreditCard int64
dtype: object
num_datapoints = bank.shape[0]
num_columns = bank.shape[1]
nan_counts = bank.isna().sum()
numerical_features = bank.select_dtypes(include=['number'])
Education Mortgage Personal.Loan Securities.Account CD.Account \
mean 1.881 56.4988 0.096 0.1044 0.0604
median 2.000 0.0000 0.000 0.0000 0.0000
Online CreditCard
mean 0.5968 0.294
median 1.0000 0.000
correlation_matrix = bank.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Bank Loan Data')
plt.show()
plt.figure(figsize=(8, 6))
plt.hist(bank['Age'], bins=10, color='skyblue', edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Age')
plt.show()
plt.figure(figsize=(8, 6))
sns.boxplot(x='Education', y='Income', data=bank)
plt.xlabel('Education Level')
plt.ylabel('Income')
plt.title('Income Distribution by Education Level')
plt.show()
plt.figure(figsize=(8, 6))
plt.scatter(bank['CCAvg'], bank['Income'], alpha=0.5)
plt.xlabel('CCAvg')
plt.ylabel('Income')
plt.title('CCAvg vs. Income')
plt.show()
plt.figure(figsize=(6, 4))
sns.countplot(x='Personal.Loan', data=bank)
plt.xlabel('Personal Loan')
plt.ylabel('Count')
plt.title('Number of Customers with Personal Loan')
plt.show()
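A count plot like this is mainly a check on class balance. As a rough sketch, the majority-class baseline can be worked out from the mean of Personal.Loan reported in the summary statistics above (0.096); the variable names are illustrative.

```python
positive_rate = 0.096                    # mean of Personal.Loan from the summary output
baseline_accuracy = 1 - positive_rate    # accuracy of always predicting "no personal loan"
print(f"{baseline_accuracy:.3f}")
```

Any accuracy obtained from a classifier on this data is best read against that baseline rather than against zero.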
Create a graph to compare the accuracy based on n_estimators.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
n_estimators_list = []
accuracy_scores_list = []
# The features, target, and split do not depend on n_estimators, so build them once.
X = bank.drop('Personal.Loan', axis=1)
y = bank['Personal.Loan']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
for n_estimators in range(100, 1001, 100):
    # Train a forest with the current number of trees and score it on the test set.
    rfc = RandomForestClassifier(n_estimators=n_estimators)
    rfc.fit(X_train, y_train)
    predictions = rfc.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    n_estimators_list.append(n_estimators)
    accuracy_scores_list.append(accuracy)
plt.plot(n_estimators_list, accuracy_scores_list)
plt.xlabel('Number of Estimators (n_estimators)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs. n_estimators in Random Forest')
plt.show()
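Retraining a forest from scratch for every value of n_estimators repeats a lot of work. One alternative, sketched here on synthetic data, is scikit-learn's warm_start option, which keeps the trees already fitted and only adds new ones when n_estimators is raised. The dataset, the step sizes, and the variable names are assumptions for illustration; this is not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not bankloan.csv.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42)

# warm_start=True reuses the trees from the previous fit and grows the forest.
warm_rfc = RandomForestClassifier(warm_start=True, random_state=0)
scores = {}
for n in range(20, 101, 20):
    warm_rfc.set_params(n_estimators=n)
    warm_rfc.fit(Xs_train, ys_train)
    scores[n] = accuracy_score(ys_test, warm_rfc.predict(Xs_test))

for n, acc in scores.items():
    print(n, round(acc, 3))
```

Fixing random_state also makes the curve repeatable, which helps when the differences between neighbouring n_estimators values are small.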
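X = final_data.drop('not.fully.paid', axis=1)
y = final_data['not.fully.paid']
# Re-split the loan data; X_train and X_test otherwise still hold the bank split from the loop above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
print(roc_auc_score(y_test, predictions))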
[[2419 12]
[ 433 10]]
0.8451635351426583
0.5088185616003967
predictions = rfc.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, predictions))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, predictions))
Classification Report:
precision recall f1-score support
Confusion Matrix:
[[2419 12]
[ 433 10]]
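accuracy_scores = []
n_estimator_values = range(100, 1001, 100)
# accuracy_scores needs one score per n_estimators value before it can be plotted.
for n in n_estimator_values:
    rfc = RandomForestClassifier(n_estimators=n)
    rfc.fit(X_train, y_train)
    accuracy_scores.append(accuracy_score(y_test, rfc.predict(X_test)))
plt.figure(figsize=(10, 6))
plt.plot(n_estimator_values, accuracy_scores, marker='o')
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.title("Accuracy vs. n_estimators for Random Forest")
plt.grid(True)
plt.show()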
n_estimators_list = []
accuracy_scores_list = []
for n_estimators in range(100, 1001, 100):
    X = bank.drop('Personal.Loan', axis=1)
    y = bank['Personal.Loan']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    rfc = RandomForestClassifier(n_estimators=n_estimators)
    rfc.fit(X_train, y_train)
    predictions = rfc.predict(X_test)