
Hands-On Activity 3.3 Random Forest Mantaring - Ipynb - Mantaring


10/18/24, 10:35 PM Hands-on Activity 3.3 Random Forest Mantaring.ipynb - Colab

Activity 3.2 Random Forest


Objective(s):

This activity aims to perform classification using Random Forest

Intended Learning Outcomes (ILOs):

Demonstrate how to build the model using Random Forest.


Demonstrate how to evaluate the performance of the model.

Resources:

Jupyter Notebook
loan_data

Procedure:

We will use lending data from 2007-2010 and be trying to classify and predict whether or not the borrower paid back their loan in full.

Import the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load the data and check the content of the dataframe using Pandas

loans= pd.read_csv('loan_data.csv')
loans.head()

   credit.policy  purpose             int.rate  installment  log.annual.inc  dti    fico  days.with.cr.line  revol.bal  revol.util  inq.last
0  1              debt_consolidation  0.1189    829.10       11.350407       19.48  737   5639.958333        28854      52.1
1  1              credit_card         0.1071    228.22       11.082143       14.29  707   2760.000000        33623      76.7
2  1              debt_consolidation  0.1357    366.86       10.373491       11.63  682   4710.000000        3511       25.6
3  1              debt_consolidation  0.1008    162.34       11.350407       8.10   712   2699.958333        33667      73.2
4  1              credit_card         0.1426    102.92       11.299732       14.97  667   4066.000000        4740       39.5


Examine the data types.

loans.dtypes

https://colab.research.google.com/drive/1rjZpb7g-E8AAE_F6v3yLOeVRyfiJR_pc#scrollTo=CiJdCWviANKY&printMode=true 1/13

credit.policy          int64
purpose               object
int.rate             float64
installment          float64
log.annual.inc       float64
dti                  float64
fico                   int64
days.with.cr.line    float64
revol.bal              int64
revol.util           float64
inq.last.6mths         int64
delinq.2yrs            int64
pub.rec                int64
not.fully.paid         int64
dtype: object

Create a histogram of two FICO distributions on top of each other, one for each credit.policy outcome.

plt.figure(figsize=(10,6))
loans[loans['credit.policy']==1]['fico'].hist(alpha=0.5,color='blue',
bins=30,label='Credit.Policy=1')
loans[loans['credit.policy']==0]['fico'].hist(alpha=0.5,color='red',
bins=30,label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO')

Text(0.5, 0, 'FICO')

Interpret the result of the graph

Type your answer here

Create a similar figure, except this time select by the not.fully.paid column.


plt.figure(figsize=(10,6))
loans[loans['not.fully.paid']==1]['fico'].hist(alpha=0.5,color='blue',
bins=30,label='not.fully.paid=1')
loans[loans['not.fully.paid']==0]['fico'].hist(alpha=0.5,color='red',
bins=30,label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')

Text(0.5, 0, 'FICO')

Interpret the result of the graph

Type your answer here

Create a countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid.

plt.figure(figsize=(11,7))
sns.countplot(x='purpose',hue='not.fully.paid',data=loans,palette='Set1')


<Axes: xlabel='purpose', ylabel='count'>

Interpret the result of the graph.

Type your answer here

Create the following lmplots to see if the trend differed between not.fully.paid and credit.policy.

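# Note: sns.lmplot is a figure-level function that creates its own figure,
# so this plt.figure call only produces the empty figure reported in the output.
plt.figure(figsize=(11,7))
sns.lmplot(y='int.rate',x='fico',data=loans,hue='credit.policy',
           col='not.fully.paid',palette='Set1')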

<seaborn.axisgrid.FacetGrid at 0x7c528965e410>
<Figure size 1100x700 with 0 Axes>


Interpret the result of the graph

Type your answer here

Notice that the purpose column is categorical. Therefore, we need to transform it using dummy variables.

cat_feats = ['purpose']

final_data = pd.get_dummies(loans,columns=cat_feats,drop_first=True)

final_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 credit.policy 9578 non-null int64
1 int.rate 9578 non-null float64
2 installment 9578 non-null float64
3 log.annual.inc 9578 non-null float64
4 dti 9578 non-null float64
5 fico 9578 non-null int64
6 days.with.cr.line 9578 non-null float64
7 revol.bal 9578 non-null int64
8 revol.util 9578 non-null float64
9 inq.last.6mths 9578 non-null int64
10 delinq.2yrs 9578 non-null int64
11 pub.rec 9578 non-null int64
12 not.fully.paid 9578 non-null int64
13 purpose_credit_card 9578 non-null bool
14 purpose_debt_consolidation 9578 non-null bool
15 purpose_educational 9578 non-null bool
16 purpose_home_improvement 9578 non-null bool
17 purpose_major_purchase 9578 non-null bool
18 purpose_small_business 9578 non-null bool
dtypes: bool(6), float64(6), int64(7)
memory usage: 1.0 MB
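
To make the effect of drop_first=True concrete, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are assumptions for illustration and are not taken from loan_data.csv.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "purpose": ["all_other", "credit_card", "debt_consolidation"],
    "fico": [700, 710, 720],
})

# Without drop_first: one indicator column per category.
full = pd.get_dummies(toy, columns=["purpose"])

# With drop_first=True: the first category in sorted order ("all_other")
# is dropped and becomes the implicit baseline.
reduced = pd.get_dummies(toy, columns=["purpose"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

This is consistent with the info() output above, where the dummy columns start at purpose_credit_card and there is no column for the first purpose category.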

Split the data into a training set and a testing set

from sklearn.model_selection import train_test_split

X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
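
The split above does not stratify by the target. As an optional variation (not part of the activity), the sketch below shows the `stratify` argument of train_test_split on synthetic, imbalanced labels. The data, the roughly 15% positive rate, and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (an assumption): roughly 15% positives.
rng = np.random.default_rng(0)
y_demo = (rng.random(1000) < 0.15).astype(int)
X_demo = rng.normal(size=(1000, 3))

# stratify=y_demo keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101, stratify=y_demo
)

# The positive rate in the training and test parts should be nearly equal.
print(y_tr.mean(), y_te.mean())
```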

Create an instance of the RandomForestClassifier class and fit it to our training data

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=600)

rfc.fit(X_train,y_train)

▾ RandomForestClassifier i ?

RandomForestClassifier(n_estimators=600)

What is n_estimators?

Type your answer here
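
As a hint for this question, the sketch below fits a small forest on synthetic data and counts the fitted trees. The data and the `_demo` and `forest` names are assumptions for illustration, and 25 trees are used only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the loan data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# n_estimators sets how many decision trees the forest trains;
# the fitted trees are stored in the estimators_ attribute.
print(len(forest.estimators_))
```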

Predict the class of not.fully.paid for the X_test data.


predictions = rfc.predict(X_test)

Create a classification report from the results

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score, roc_auc_score

print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.85      1.00      0.92      2431
           1       0.56      0.02      0.04       443

    accuracy                           0.85      2874
   macro avg       0.70      0.51      0.48      2874
weighted avg       0.80      0.85      0.78      2874

Show the Confusion Matrix for the predictions.

print(confusion_matrix(y_test,predictions))

[[2423 8]
[ 433 10]]

#evaluate the performance using accuracy score


print(accuracy_score(y_test, predictions))

0.8465553235908142

print(roc_auc_score(y_test, predictions))

0.5096412683054564

Interpret the results of the classification report, confusion matrix, accuracy score, and roc_auc_score

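The confusion matrix shows a large number of correct predictions, but also a large number of errors: 2423 of the 2431 class 0 loans are classified correctly, while 433 of the 443 class 1 loans are misclassified. The accuracy looks good at about 0.85, but it mostly reflects the much larger class 0. The roc_auc_score of about 0.51 is poor, which tells me the model does a bad job of predicting the loans that were not fully paid.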
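
The reported scores can be reproduced by hand from the confusion matrix printed above. The sketch below is only arithmetic on those four counts; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = np.array([[2423, 8],
               [433, 10]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # precision for class 1
recall_1 = tp / (tp + fn)      # recall for class 1 (true positive rate)
specificity = tn / (tn + fp)   # recall for class 0 (true negative rate)

# With hard 0/1 predictions the ROC curve has a single operating point,
# so the AUC reduces to the mean of the two recalls.
auc_from_labels = (recall_1 + specificity) / 2

print(accuracy, precision_1, recall_1, auc_from_labels)
```

Because roc_auc_score was given hard class predictions here, it only measures that single operating point. Passing the class 1 probabilities, for example rfc.predict_proba(X_test)[:, 1], would score how well the model ranks the loans and would likely give a more informative value.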


Supplementary Activity:

Choose your own dataset


Import the dataset
Determine the number of datapoints, columns, and data types
Remove unnecessary columns
Do data cleaning such as removing empty values (NaN) and replacing missing data.
Perform descriptive statistics such as mean, median, and mode
Perform data visualization
Build the model using Random Forest
Evaluate the model using classification report, accuracy, confusion matrix, and roc_auc_score
Change the n_estimators from 100 to 1000, in increments of 100.
Create a graph to compare the accuracy based on n_estimators
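
The column removal and cleaning steps in the list above can be sketched as follows. The `raw` frame, its column names, and its values are hypothetical, and the choice of median and mode for filling is one reasonable option rather than a required method.

```python
import numpy as np
import pandas as pd

# Hypothetical raw frame; column names and values are illustrative only.
raw = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Income": [49.0, np.nan, 11.0, 100.0],
    "Education": [1, 1, np.nan, 2],
    "Target": [0, 0, 1, np.nan],
})

# Remove an identifier column that carries no predictive information.
clean = raw.drop(columns=["ID"])

# Remove rows whose label is missing, then fill the remaining gaps:
# median for a numeric feature, mode for a categorical-like feature.
clean = clean.dropna(subset=["Target"]).copy()
clean["Income"] = clean["Income"].fillna(clean["Income"].median())
clean["Education"] = clean["Education"].fillna(clean["Education"].mode().iloc[0])

# Count the missing values that remain after cleaning.
print(clean.isna().sum().sum())
```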

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

bank= pd.read_csv('bankloan.csv')
bank.head()

   ID  Age  Experience  Income  ZIP.Code  Family  CCAvg  Education  Mortgage  Personal.Loan  Securities.Account  CD.Account  Online  Cred
0  1   25   1           49      91107     4       1.6    1          0         0              1                   0           0
1  2   45   19          34      90089     3       1.5    1          0         0              1                   0           0
2  3   39   15          11      94720     1       1.0    1          0         0              0                   0           0
3  4   35   9           100     94112     1       2.7    2          0         0              0                   0           0
4  5   35   8           45      91330     4       1.0    2          0         0              0                   0           0


bank.dtypes

ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIP.Code                int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal.Loan           int64
Securities.Account      int64
CD.Account              int64
Online                  int64
CreditCard              int64
dtype: object

num_datapoints = bank.shape[0]
num_columns = bank.shape[1]

print(f"Number of datapoints: {num_datapoints}")


print(f"Number of columns: {num_columns}")

Number of datapoints: 5000


Number of columns: 14

nan_counts = bank.isna().sum()

print("NaN values in each column:")


print(nan_counts[nan_counts > 0])

NaN values in each column:


Series([], dtype: int64)

numerical_features = bank.select_dtypes(include=['number'])

descriptive_stats = numerical_features.agg(['mean', 'median'])

print("Descriptive Statistics for Numerical Features:")


print(descriptive_stats)

Descriptive Statistics for Numerical Features:


            ID      Age  Experience   Income   ZIP.Code  Family     CCAvg  \
mean    2500.5  45.3384     20.1046  73.7742  93152.503  2.3964  1.937938
median  2500.5  45.0000     20.0000  64.0000  93437.000  2.0000  1.500000

        Education  Mortgage  Personal.Loan  Securities.Account  CD.Account  \
mean        1.881   56.4988          0.096              0.1044      0.0604
median      2.000    0.0000          0.000              0.0000      0.0000

        Online  CreditCard
mean    0.5968       0.294
median  1.0000       0.000
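
The activity also asks for the mode, which the aggregation above does not include. A minimal sketch of one way to add it, using a made-up stand-in frame (the `demo` values are assumptions, not the bank data):

```python
import pandas as pd

# Hypothetical stand-in for two bank columns; the values are made up.
demo = pd.DataFrame({
    "Family": [4, 3, 1, 1, 4, 1],
    "Education": [1, 1, 1, 2, 2, 3],
})

# DataFrame.mode() returns a DataFrame because a column can have several modes;
# row 0 holds the first (smallest) mode of each column.
modes = demo.mode().iloc[0]

# Append the mode as a third row next to the mean and median.
stats = demo.agg(["mean", "median"])
stats.loc["mode"] = modes
print(stats)
```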

correlation_matrix = bank.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Bank Loan Data')
plt.show()

plt.figure(figsize=(8, 6))
plt.hist(bank['Age'], bins=10, color='skyblue', edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Age')
plt.show()


plt.figure(figsize=(8, 6))
sns.boxplot(x='Education', y='Income', data=bank)
plt.xlabel('Education Level')
plt.ylabel('Income')
plt.title('Income Distribution by Education Level')
plt.show()

plt.figure(figsize=(8, 6))
plt.scatter(bank['CCAvg'], bank['Income'], alpha=0.5)
plt.xlabel('CCAvg')
plt.ylabel('Income')
plt.title('CCAvg vs. Income')
plt.show()


plt.figure(figsize=(6, 4))
sns.countplot(x='Personal.Loan', data=bank)
plt.xlabel('Personal Loan')
plt.ylabel('Count')
plt.title('Number of Customers with Personal Loan')
plt.show()

sns.pairplot(bank[['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage', 'Personal.Loan']], hue='Personal.Loan')


plt.show()



import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# The split does not depend on n_estimators, so it is done once before the loop.
X = bank.drop('Personal.Loan', axis=1)
y = bank['Personal.Loan']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

n_estimators_list = []
accuracy_scores_list = []
for n_estimators in range(100, 1001, 100):
    # Train a forest with the current number of trees and score it on the test set.
    rfc = RandomForestClassifier(n_estimators=n_estimators)
    rfc.fit(X_train, y_train)

    predictions = rfc.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    n_estimators_list.append(n_estimators)
    accuracy_scores_list.append(accuracy)

plt.plot(n_estimators_list, accuracy_scores_list)
plt.xlabel('Number of Estimators (n_estimators)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs. n_estimators in Random Forest')
plt.show()

X = final_data.drop('not.fully.paid', axis=1)
y = final_data['not.fully.paid']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train, y_train)

predictions = rfc.predict(X_test)

print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
print(roc_auc_score(y_test, predictions))

              precision    recall  f1-score   support

           0       0.85      1.00      0.92      2431
           1       0.45      0.02      0.04       443

    accuracy                           0.85      2874
   macro avg       0.65      0.51      0.48      2874
weighted avg       0.79      0.85      0.78      2874

[[2419 12]
[ 433 10]]
0.8451635351426583
0.5088185616003967

predictions = rfc.predict(X_test)

print("Classification Report:")
print(classification_report(y_test, predictions))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, predictions))

accuracy = accuracy_score(y_test, predictions)


print("\nAccuracy Score:", accuracy)

roc_auc = roc_auc_score(y_test, predictions)


print("\nROC AUC Score:", roc_auc)

Classification Report:
              precision    recall  f1-score   support

           0       0.85      1.00      0.92      2431
           1       0.45      0.02      0.04       443

    accuracy                           0.85      2874
   macro avg       0.65      0.51      0.48      2874
weighted avg       0.79      0.85      0.78      2874

Confusion Matrix:
[[2419 12]
[ 433 10]]

Accuracy Score: 0.8451635351426583

ROC AUC Score: 0.5088185616003967

accuracy_scores = []
n_estimator_values = range(100, 1001, 100)

for n_estimators in n_estimator_values:
    # Fit a forest with the current number of trees and record its test accuracy.
    rfc = RandomForestClassifier(n_estimators=n_estimators)
    rfc.fit(X_train, y_train)
    predictions = rfc.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    accuracy_scores.append(accuracy)

plt.figure(figsize=(10, 6))
plt.plot(n_estimator_values, accuracy_scores, marker='o')
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.title("Accuracy vs. n_estimators for Random Forest")
plt.grid(True)
plt.show()

# The split does not depend on n_estimators, so it is done once before the loop.
X = bank.drop('Personal.Loan', axis=1)
y = bank['Personal.Loan']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

n_estimators_list = []
accuracy_scores_list = []
for n_estimators in range(100, 1001, 100):
    rfc = RandomForestClassifier(n_estimators=n_estimators)
    rfc.fit(X_train, y_train)

    predictions = rfc.predict(X_test)
