0% found this document useful (0 votes)

29 views48 pages

73035393463850

Please accept

Uploaded by

ANSAR ALI

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

29 views48 pages

73035393463850

Please accept

Uploaded by

ANSAR ALI

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 48

AMRITSAR GROUP OF COLLEGES

Autonomous Status Conferred by UGC | NAAC - A Grade

Lab Report
On
Machine Learning Lab
(MTCS 103-18)
Submitted in the partial fulfillment of the requirement for the award of a degree of

Master of Technology
In
COMPUTER SCIENCE & ENGINEERING
Batch (2024-2026)

Submitted to Submitted by
Er. Jasmeet Kaur Ranjodhbir
ANSAR ALISingh

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Index

S. No. Title Page No.

Experiment 1: Study of pattern for implementation and Download

1. 1-6
opensource software of your interest.

2. Experiment 2: Supervised Learning – Regression 7-12

3. Experiment 3: Supervised Learning – Classification 13-16

4. Experiment 4: Unsupervised Learning 17-19

5. Experiment 5: Dimensionality Reduction 20-23

6. Experiment 6: Supervised Learning and Kernel Methods 24-27

7. Experiment 7: Project
Project name:- Medical Insurance Price Prediction 28-46
Experiment 1: Study of pattern for implementation of Assignments.
Download the opensource software of your interest. Document the distinct features and functionality of the
software platform. You may choose WEKA, R or any other software.
Python is a high-level programming language that has become increasingly popular due to its simplicity,
versatility, and extensive range of applications. The process of How to install Python in Windows, operating
system is relatively easy and involves a few uncomplicated steps.
This article aims to take you through the process of downloading and installing Python on your Windows
computer.
How to Install Python in Windows?
We have provided step-by-step instructions to guide you and ensure a successful installation. Whether you are
new to programming or have some experience, mastering how to install Python on Windows will enable you
to utilize this potent language and uncover its full range of potential applications.
To download Python on your system, you can use the following steps
Step 1: Select Version to Install Python
Visit the official page for Python https://www.python.org/downloads/ on the Windows operating system.
Locate a reliable version of Python 3, preferably version 3.10.11, which was used in testing this tutorial.
Choose the correct link for your device from the options provided: either Windows installer (64-bit) or
Windows installer (32-bit) and proceed to download the executable file.

Python Homepage

1
Step 2: Downloading the Python Installer
Once you have downloaded the installer, open the .exe file, such as python-3.10.11-amd64.exe, by
doubleclicking it to launch the Python installer. Choose the option to Install the launcher for all users by
checking the corresponding checkbox, so that all users of the computer can access the Python launcher
application. Enable users to run Python from the command line by checking the Add python.exe to PATH
checkbox.

Python Installer
After Clicking the Install Now Button the setup will start installing Python on your Windows system. You will
see a window like this.

Python Setup
2
Step 3: Running the Executable Installer
After completing the setup. Python will be installed on your Windows system. You will see a successful
message.

Python Successfully installed

Step 4: Verify the Python Installation in Windows

Close the window after successful installation of Python. You can check if the installation of Python was
successful by using either the command line or the Integrated Development Environment (IDLE), which you
may have installed. To access the command line, click on the Start menu and type “cmd” in the search bar.
Then click on Command Prompt. python --version

Python version

You can also check the version of Python by opening the IDLE application. Go to Start and enter IDLE in the
search bar and then click the IDLE app, for example, IDLE (Python 3.10.11 64-bit). If you can see the Python
IDLE window then you are successfully able to download and installed Python on Windows.

3
Python IDLE
Getting Started with Python

Python is a lot easier to code and learn. Python programs can be written on any plain text editor like Notepad,
notepad++, or anything of that sort. One can also use an online IDE to run Python code or can even install one
on their system to make it more feasible to write these codes because IDEs provide a lot of features like an
intuitive code editor, debugger, compiler, etc. To begin with, writing Python Codes and performing various
intriguing and useful operations, one must have Python installed on their System.

Features and Functionality in Python

In this section we will see what are the features of Python programming language:

1. Free and Open Source

Python language is freely available at the official website and you can download it from the given download
link below click on the Download Python keyword. Download Python Since it is open-source, this means that
source code is also available to the public. So you can download it, use it as well as share it.

2. Easy to code
Python is a high-level programming language. Python is very easy to learn the language as compared to other
languages like C, C#, Java script, Java, etc. It is very easy to code in the Python language and anybody can
learn Python basics in a few hours or days. It is also a developer-friendly language.

4
3. Easy to Read
As you will see, learning Python is quite simple. As was already established, Python’s syntax is really
straightforward. The code block is defined by the indentations rather than by semicolons or brackets.

4. Object-Oriented Language
One of the key features of Python is Object-Oriented programming. Python supports object-oriented language
and concepts of classes, object encapsulation, etc.

5. GUI Programming Support

Graphical User interfaces can be made using a module such as PyQt5, PyQt4, wxPython, or Tk in Python.
PyQt5 is the most popular option for creating graphical apps with Python.

6. High-Level Language
Python is a high-level language. When we write programs in Python, we do not need to remember the system
architecture, nor do we need to manage the memory.

7. Large Community Support

Python has gained popularity over the years. Our questions are constantly answered by the enormous Stack
Overflow community. These websites have already provided answers to many questions about Python, so
Python users can consult them as needed.

8. Easy to Debug
Excellent information for mistake tracing. You will be able to quickly identify and correct the majority of your
program’s issues once you understand how to interpret Python’s error traces. Simply by glancing at the code,
you can determine what it is designed to perform.

5
9. Python is a Portable language
Python language is also a portable language. For example, if we have Python code for Windows and if we want
to run this code on other platforms such as Linux, Unix, and Mac then we do not need to change it, we can run
this code on any platform.

10. Python is an Integrated language

Python is also an Integrated language because we can easily integrate Python with other languages like C,
C++, etc.

11. Interpreted Language

Python is an Interpreted Language because Python code is executed line by line at a time. like other languages
C, C++, Java, etc. there is no need to compile Python code this makes it easier to debug our code. The source
code of Python is converted into an immediate form called bytecode.

12. Large Standard Library

Python has a large standard library that provides a rich set of modules and functions so you do not have to
write your own code for every single thing. There are many libraries present in Python such as regular
expressions, unit-testing, web browsers, etc.

13. Dynamically Typed Language

Python is a dynamically-typed language. That means the type (for example- int, double, long, etc.) for a
variable is decided at run time not in advance because of this feature we don’t need to specify the type of
variable.
14. Frontend and backend development
With a new project py script, you can run and write Python codes in HTML with the help of some simple tags
<py-script>, <py-env>, etc. This will help you do frontend development work in Python like java script.
Backend is the strong forte of Python it’s extensively used for this work cause of its frameworks like Django
and Flask.
6
Experiment 2: Supervised Learning – Regression

Python Code:
import numpy as np import matplotlib.pyplot as plt from
sklearn.model_selection import train_test_split, KFold from
sklearn.linear_model import LinearRegression from
sklearn.metrics import mean_squared_error from
sklearn.preprocessing import PolynomialFeatures from
sklearn.feature_selection import SelectKBest, f_regression

Generate a proper 2-D dataset of N points. Split the data set into Training Data Set and Test Data Set
# Generate Dataset np.random.seed(42)
N = 100 # Number of data points
X = np.random.uniform(0, 10, N).reshape(-1, 1) # Features
true_function = lambda x: 3 * x + 2 noise =
np.random.normal(0, 2, N).reshape(-1, 1) y =
true_function(X) + noise
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Perform Linear Regression analysis with Least Squares Method.

model = LinearRegression() model.fit(X_train,
y_train) y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test) print("Train

7
MSE (Linear Regression):", train_mse) print("Test
MSE (Linear Regression):", test_mse)

Plot the graphs for Training MSE and Test MSE and Comment on Curve Fitting and Generalization
Errors.
#CalculateMSEtrain_mse=mean_squared_error(y_train,y_train_pred) test_mse = mean_squared_error(y_test,
y_test_pred)
# Analyze and Plot Training/Test MSE degrees
= range(1, 10)
train_errors = [] test_errors = [] for degree in
degrees: poly =
PolynomialFeatures(degree=degree)

8
X_train_poly=poly.fit_transform(X_train)X_test_poly=
poly.transform(X_test)model.fit(X_train_poly,y_train)
train_errors.append(mean_squared_error(y_train, model.predict(X_train_poly)))
test_errors.append(mean_squared_error(y_test, model.predict(X_test_poly)))

# Plot MSE vs Polynomial Degree plt.figure(figsize=(10, 5))

plt.plot(degrees, train_errors, label="Training MSE", marker='o')
plt.plot(degrees, test_errors, label="Test MSE", marker='o')
plt.xlabel("Polynomial Degree") plt.ylabel("Mean Squared
Error")
plt.title("Training and Test MSE vs Polynomial Degree")
plt.legend() plt.grid() plt.show()

9
Verify the Effect of Data Set Size and Bias-Variance Trade off.
# Effect of Dataset Size dataset_sizes = np.linspace(10, len(X_train),
10, dtype=int) train_mse_size = [],test_mse_size = [] for size in
dataset_sizes: indices = np.random.choice(range(len(X_train)), size,
replace=False) X_subset = X_train[indices],y_subset =
y_train[indices] model.fit(X_subset, y_subset)
train_mse_size.append(mean_squared_error(y_subset, model.predict(X_subset)))
test_mse_size.append(mean_squared_error(y_test, model.predict(X_test)))
# Plot MSE vs Dataset Size plt.figure(figsize=(10, 5))
plt.plot(dataset_sizes, train_mse_size, label="Training MSE", marker='o')
plt.plot(dataset_sizes, test_mse_size, label="Test MSE", marker='o')
plt.xlabel("Dataset Size") plt.ylabel("Mean Squared Error")
plt.title("MSE vs Dataset Size")
plt.legend(),plt.grid(),plt.show()

10
Apply Cross Validation and plot the graphs for errors.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_errors = [] for train_index, val_index in
kf.split(X_train):
X_train_cv, X_val_cv = X_train[train_index], X_train[val_index]
y_train_cv, y_val_cv = y_train[train_index], y_train[val_index]
model.fit(X_train_cv, y_train_cv) val_error =
mean_squared_error(y_val_cv, model.predict(X_val_cv))
cv_errors.append(val_error) # Plot Cross-Validation Errors
plt.figure(figsize=(10, 5)) plt.plot(range(1, len(cv_errors) + 1),
cv_errors, marker='o') plt.xlabel("Fold Number") plt.ylabel("Validation
MSE") plt.title("Cross-Validation Errors") plt.grid() plt.show()

Apply Subset Selection Methods and plot the graphs for errors.
selector = SelectKBest(score_func=f_regression, k=1) X_train_selected =
selector.fit_transform(X_train, y_train.ravel()) X_test_selected =
selector.transform(X_test) model.fit(X_train_selected, y_train)

11
y_train_pred_selected = model.predict(X_train_selected)
y_test_pred_selected = model.predict(X_test_selected)
train_mse_selected = mean_squared_error(y_train,
y_train_pred_selected) test_mse_selected = mean_squared_error(y_test,
y_test_pred_selected)
# Print Final Findings print("Train MSE (Subset
Selection):", train_mse_selected) print("Test MSE (Subset
Selection):", test_mse_selected)

Output:

12
Experiment 3: Supervised Learning – Classification
Implement Naive Bayes classifier and K-Nearest Neighbour Classifier on Data set of your choice. Test and
compare for Accuracy and Precision.

Dataset Description:
The Wine dataset is a multi-class classification dataset that contains 178 samples of wine, each with 13
chemical properties. These properties are used to classify the wines into one of three cultivars.
Key Details:
• Number of Samples: 178
• Number of Features: 13 (continuous chemical attributes)
• Number of Classes (Target): 3 wine cultivars
• Features: Include alcohol content, malic acid, ash, colour
intensity, phenols, and other chemical properties.
• Target Labels: o Class 0: Cultivar 1 o Class 1: Cultivar 2 o

Class 2: Cultivar 3 Purpose:

The dataset is used to predict the cultivar (wine type) based on chemical attributes, making it suitable for
classification tasks in machine learning.

Python Code:
# Importing necessary libraries
import pandas as pd import
numpy as np
from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB from
sklearn.neighbors import KNeighborsClassifier from sklearn.metrics import accuracy_score,
precision_score, classification_report, ConfusionMatrixDisplay import matplotlib.pyplot as plt

13
# Loading the Wine dataset from sklearn from
sklearn.datasets import load_wine
# Load the dataset data = load_wine() df =
pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Dataset description print("Dataset

Description:") print(data.DESCR)

# Splitting the dataset into features and target

X = df.drop(columns=['target']) y =
df['target']

# Splitting into training and testing datasets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Implementing Naive Bayes Classifier

nb_model = GaussianNB()
nb_model.fit(X_train, y_train) y_pred_nb
= nb_model.predict(X_test)

# Implementing K-Nearest Neighbors Classifier

knn_model =
KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train) y_pred_knn =
knn_model.predict(X_test)
14
# Calculating accuracy and precision nb_accuracy =
accuracy_score(y_test, y_pred_nb) nb_precision =
precision_score(y_test, y_pred_nb, average='weighted') knn_accuracy =
accuracy_score(y_test, y_pred_knn) knn_precision =
precision_score(y_test, y_pred_knn, average='weighted')

# Displaying metrics print("\nComparison

of Models:") comparison_results =
pd.DataFrame({
"Model": ["Naive Bayes", "K-Nearest Neighbors"],
"Accuracy": [nb_accuracy, knn_accuracy],
"Precision": [nb_precision, knn_precision],
}) print(comparison_results)

# Graphical representation: Confusion Matrices fig,

ax = plt.subplots(1, 2, figsize=(12, 5))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_nb, ax=ax[0]) ax[0].set_title("Naive
Bayes Classifier") ConfusionMatrixDisplay.from_predictions(y_test, y_pred_knn, ax=ax[1])
ax[1].set_title("K-Nearest Neighbors Classifier") plt.tight_layout() plt.show()

15
Output:

16
Experiment 4: Unsupervised Learning
Implement K-means Clustering and Hierarchical Clustering on proper data set of your choice. Compare their
Convergence. Dataset Description:
he dataset used in the code is generated using the make_blobs function from sklearn.datasets. This function
creates a synthetic dataset with specified properties such as the number of samples, number of clusters, and
the standard deviation of the clusters.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
• n_samples=300: The dataset consists of 300 data points.
• centers=4: The dataset has 4 clusters.
• cluster_std=1.0: The standard deviation of the clusters is 1.0, meaning the clusters will have a moderate
spread.
• random_state=42: This ensures that the dataset generation is reproducible (i.e., you get the same dataset if
you run the code multiple times).
Explanation of the Variables:
• X: This is a 2D array of shape (300, 2) representing the feature matrix. Each row is a data point with 2
features.
• y: This is a 1D array of length 300, containing the true cluster labels for each point (ranging from 0 to 3 in
this case since there are 4 clusters).
The dataset (X and y) consists of:
• 300 data points with two features.
• 4 clusters, each with a somewhat normal distribution around a center.
• The true labels of the clusters are represented by y.

Python Code:
# Import necessary libraries, import numpy as np
import matplotlib.pyplot as plt from sklearn.cluster

17
import KMeans, AgglomerativeClustering from
sklearn.datasets
import make_blobs from scipy.cluster.hierarchy
import dendrogram, linkage

# Generating a synthetic dataset

# This dataset has 300 samples, 2 features, and 4 clusters
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# K-Means Clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10, max_iter=300) kmeans_labels
= kmeans.fit_predict(X)

# Hierarchical Clustering (Agglomerative) hierarchical =

AgglomerativeClustering(n_clusters=4, linkage='ward')
hierarchical_labels = hierarchical.fit_predict(X)
# Visualizing the clustering results fig, ax
= plt.subplots(1, 3, figsize=(18, 6))
# Scatter plot of original data with true labels ax[0].scatter(X[:,
0], X[:, 1], c=y, cmap='viridis', s=30) ax[0].set_title("Original
Data (True Labels)")
# Scatter plot of K-Means clustering results
ax[1].scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=30) ax[1].set_title("K-
Means Clustering")

18
# Scatter plot of Hierarchical clustering results
ax[2].scatter(X[:, 0], X[:, 1], c=hierarchical_labels, cmap='viridis', s=30) ax[2].set_title("Hierarchical
Clustering")
plt.tight_layout() plt.show()

# Creating a dendrogram for hierarchical clustering

linked = linkage(X, method='ward')
plt.figure(figsize=(10, 7)) dendrogram(linked,
truncate_mode='level', p=5) plt.title("Dendrogram
(Hierarchical Clustering)") plt.xlabel("Sample
Index") plt.ylabel("Distance") plt.show()

Output:

19
Experiment 5: Dimensionality Reduction
Principal Component Analysis-Finding Principal Components, variance and standard Deviation calculations of
principle components.

The dataset used in this code is a synthetically generated, two-dimensional dataset created to illustrate the
process and effects of Principal Component Analysis (PCA).
Characteristics of the Dataset:

1. Two Features:
Feature 1 (X): A set of 100 random values sampled from a standard normal distribution (N (0, 1)).
Feature 2 (Y): A linear function of X, with added random noise. Specifically, where is Gaussian noise sampled
from.

2. Linear Correlation:
The features are linearly correlated, as Feature 2 is derived directly from Feature 1 with added noise. This
makes the dataset an excellent candidate for PCA, as PCA seeks to identify the directions (principal
components) that capture the most variance in the data.

3. Dimensionality:
The dataset has two dimensions, making it simple and visually interpretable. PCA in this case will reduce the
data to uncorrelated components while maintaining the most significant variance.

4. Preprocessing:
The dataset is standardized using Standard Scaler, ensuring that each feature has a mean of 0 and a standard
deviation of 1. Standardization is critical for PCA, as it ensures that features with larger scales do not dominate
the variance calculation.

20
Python Code:
import numpy as np import pandas as pd import
matplotlib.pyplot as plt from
sklearn.decomposition import PCA from
sklearn.preprocessing import StandardScaler

# Generate a synthetic dataset np.random.seed(42) X

= np.random.normal(0, 1, 100) Y = 2 * X +
np.random.normal(0, 0.5, 100) data =
pd.DataFrame({'Feature 1': X, 'Feature 2': Y})

# Standardize the data scaler =

StandardScaler() data_scaled =
scaler.fit_transform(data)

# Perform PCA pca = PCA() principal_components =

pca.fit_transform(data_scaled)

# Variance and Standard Deviation explained_variance

= pca.explained_variance_ratio_ std_deviation =
np.sqrt(pca.explained_variance_)

# Results print("Explained Variance Ratios:",

explained_variance) print("Standard Deviations:",

21
std_deviation) print("Principal Components:\n",
principal_components[:5])

#Visualizationplt.figure(figsize=
(10,5))

# Original data plt.subplot(1, 2, 1) plt.scatter(data['Feature 1'], data['Feature

2'], alpha=0.7, label='Original Data') plt.xlabel('Feature 1') plt.ylabel('Feature
2') plt.title('Original Dataset') plt.legend()

# Principal components plt.subplot(1, 2, 2) plt.quiver(0, 0, pca.components_[0, 0],

pca.components_[0, 1], scale=3, color='r', label='PC1') plt.quiver(0, 0, pca.components_[1, 0],
pca.components_[1, 1], scale=3, color='b', label='PC2') plt.scatter(data_scaled[:, 0],
data_scaled[:, 1], alpha=0.7, label='Standardized Data') plt.xlabel('Standardized Feature 1')
plt.ylabel('Standardized Feature 2') plt.title('Principal Components') plt.legend()

plt.tight_layout() plt.show()

22
Output:

23
Experiment 6: Supervised Learning and Kernel Methods
Design, implement SVM for classification with proper dataset of your choice. Comment on design and
implementation for linearly non-separable Dataset.

The dataset used in this code is generated using the make_circles function from scikit-learn. It creates a
synthetic, two-dimensional, non-linearly separable dataset commonly used for testing and visualizing
classification algorithms, particularly for demonstrating the power of non-linear models like Support Vector
Machines (SVM) with the RBF kernel.

Characteristics of the Dataset:

1. Two Classes:
The dataset consists of two classes:
Points in the inner circle (class 0).
Points in the outer circle (class 1).

2. Dimensionality:
The dataset is two-dimensional, making it easy to visualize. Each data point is represented by two features
(X[:, 0] and X[:, 1]).

3. Non-linearly Separable:
The inner and outer circles are concentric, making the dataset non-linearly separable. A linear classifier like a
basic SVM without a kernel would struggle to separate the classes.

4. Parameters Used:
n_samples=500: The total number of points in the dataset. noise=0.1: Adds random noise to the data, making
the boundaries between the classes less distinct and more realistic.

24
factor=0.5: Specifies the relative size of the inner circle compared to the outer circle. A smaller value makes
the inner circle tighter. random_state=42: Ensures reproducibility of the data generation.

5. Split for Training and Testing:

The data is split into a training set (70%) and a test set (30%) to evaluate the model's performance.

Python Code:

# Importing necessary libraries import numpy as np import

matplotlib.pyplot as plt from sklearn.datasets import
make_circles from sklearn.model_selection import
train_test_split from sklearn.svm import SVC from
sklearn.metrics import classification_report, accuracy_score

# Generate a non-linearly separable dataset

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=42)
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the SVM classifier with RBF kernel svm_rbf =
SVC(kernel='rbf', C=1, gamma=0.5, random_state=42)
svm_rbf.fit(X_train, y_train)
# Predicting on test data y_pred =
svm_rbf.predict(X_test)
# Calculating accuracy
accuracy = accuracy_score(y_test, y_pred) report
= classification_report(y_test, y_pred)

25
# Visualizing the dataset and decision boundary def
plot_decision_boundary(model, X, y):
# Create a mesh grid x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 y_min,
y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 xx, yy = np.meshgrid(np.arange(x_min,
x_max, 0.01), np.arange(y_min, y_max, 0.01))

# Predict across the grid

Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundary and dataset plt.contourf(xx, yy, Z, alpha=0.8,

cmap=plt.cm.coolwarm) plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k',
cmap=plt.cm.coolwarm) plt.title("SVM with RBF Kernel")
plt.xlabel("Feature 1") plt.ylabel("Feature 2") plt.show()

# Plotting dataset and decision boundary

plt.figure(figsize=(12, 5)) plt.subplot(1,
2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='k')
plt.title("Non-linearly Separable Dataset") plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

plt.subplot(1, 2, 2) plot_decision_boundary(svm_rbf,
X, y) # Output the classification report and accuracy
accuracy, report.

26
Output:

27
INTRODUCTION TO THE PROJECT

These days fake news is creating different issues from sarcastic articles to a fabricated news and plan
government propaganda in some outlets. Fake news and lack of trust in the media are growing problems with
huge ramifications in our society. Obviously, a purposely misleading story is “fake news but lately blathering
social media’s discourse is changing its definition. Some of them now use the term to dismiss the facts counter
to their preferred viewpoints. The importance of disinformation within American political discourse was the
subject of weighty attention, particularly following the American president election. The term 'fake news'
became common parlance for the issue, particularly to describe factually incorrect and misleading articles
published mostly for the purpose of making money through page views. In this paper, it seeks to produce a
model that can accurately predict the likelihood that a given article is fake news. Facebook has been at the
epicentre of much critique following media attention. They have already implemented a feature to flag fake
news on the site when a user sees it; they have also said publicly they are working on to distinguish these
articles in an automated way. Certainly, it is not an easy task. A given algorithm must be politically unbiased –
since fake news exists on both ends of the spectrum – and also give equal balance to legitimate news sources
on either end of the spectrum. In addition, the question of legitimacy is a difficult one. However, in order to
solve this problem, it is necessary to have an understanding on what Fake News.

28
OBJECTIVE

The objective of this project is to examine the problems and possible significances related with the spread of
fake news. We will be working on different fake news data set in which we will apply different machine
learning algorithms to train the data and test it to find which news is the real news or which one is the fake
news. As the fake news is a problem that is heavily affecting society and our perception of not only the media
but also facts and opinions themselves. By using the artificial intelligence and the machine learning, the
problem can be solved as we will be able to mine the patterns from the data to maximize well defined
objectives. So, our focus is to find which machine learning algorithm is best suitable for what kind of text
dataset. Also, which dataset is better for finding the accuracies as the accuracies directly depends on the type
of data and the amount of data. The more the data, more are your chances of getting correct accuracy as you
can test and train more data to find out your results.

29
TECHNOLOGY USED FOR PROJECT

PYTHON

Python is a high-level, general-purpose, and very popular programming language. Python programming
language (latest Python 3) is being used in web development, Machine Learning applications, along with all
cutting-edge technology in Software Industry.Python language is being used by almost all tech-giant
companies like – Google, Amazon, Facebook, Instagram, Dropbox, Uber… etc. The biggest strength of Python
is huge collection of standard library which can be used for the following:

• Machine Learning
• GUI Applications (like Tkinter, PyQt etc)
• Web frameworks like Django (used by YouTube, Instagram, Dropbox)
• Image processing (like OpenCV, Pillow)
• Web scraping (like Scrapy, Beautiful Soup, Selenium)
• Test frameworks
• Multimedia
• Scientific computing
• Text processing and many more

30
TOOL USED FOR PROJECT

SPYDER
Spyder is a powerful scientific environment written in Python, for Python, and for scientists, engineers and
data analysts. It features a unique combination of the advanced editing, analysis, debugging and profiling
functionality of a comprehensive development tool with the data exploration, interactive execution, deep
inspection and beautiful visualization capabilities of a scientific package n Furthermore, Spyder offers builtin
integration with many popular scientific packages, including NumPy, SciPy, Pandas, Python, Console,
Matplotlib, SymPy, and more. Beyond its many built-in features, Spyder can be extended even further via
third-party plugins. Spyder can also be used as a PyQt5 extension library, allowing you to build upon its
functionality and embed its components, such as the interactive console or advanced editor, in your own
software.

Features of Spyder

Some of the remarkable features of Spyder are:

• Customizable Syntax Highlighting

• Availability of break points (debugging and conditional breakpoints)
• Interactive execution which allows you to run line, file, cell, etc.
• Run configurations for working directory selections, command-line options, current/ dedicated/
external console, etc

31
ALGORITHM USED

Machine learning algorithms used for fake news detection can be divided into two main categories: Supervised
and Unsupervised learning.

Supervised learning algorithms are trained on labelled datasets, where each news article is labelled as either
real or fake. The algorithm learns from the labelled dataset and is then used to classify new news articles as
real or fake. Supervised learning algorithms include logistic regression, decision trees, support vector
machines, and neural networks.

Unsupervised learning algorithms, on the other hand, do not require labelled datasets. Instead, they use
clustering techniques to group news articles into clusters based on their similarities. The algorithm then
identifies the characteristics of the clusters that contain fake news articles. Unsupervised learning algorithms
include k-means clustering, hierarchical clustering, and association rule learning.

Some of these popular classifiers are given below that are used for this purpose.

Support Vector Machine: This algorithm is mostly used for classification. This is a supervised machine
learning algorithm that learns from the labelled data set. This used various classifiers of machine learning and
the support vector machine have given them the best results in detecting the fake news.

Naïve Bayes: Naïve Bayes is also used for the classification tasks. This can be used to check whether the news
is authentic or fake.

Logistic Regression: This classifier is used when the value to be predicted is categorical. For example, it can
predict or give the result in true or false. This classifier used to detect the news whether it is true or fake.

Random Forests: In this classifier, there are different random forests that give a value and a value with more
votes is the actual result of this classifier.

Recurrent Neural Network: This classifier is also helpful for detecting the fake news. Researcher have used
the recurrent neural network to classify the news as true or false.

Neural Network: There are different algorithms of machine learning that are used to help in classification
problems. One of these algorithms is the neural network.

32
K-Nearest Neighbour: This is a supervised algorithm of machine learning that is used for solving the
classification problems. This stores the data about all the cases to classify the new case on the base of similarity.

Decision Tree: This supervised algorithm of machine learning can help to detect the fake news. It breaks down
the dataset into different smaller subsets.

33
Project Modules
A. Data Use

In this project we are using different packages and to load and read the data set we are using pandas. By using
pandas, we can read the .csv file and then we can display the shape of the dataset with that we can also display
the dataset in the correct form. We will be training and testing the data, when we use supervised learning it
means we are labelling the data. By getting the testing and training data and labels we can perform different
machine learning algorithms but before performing the predictions and accuracies, the data is need to be
preprocessing i.e. the null values which are not readable are required to be removed from the data set and the
data is required to be converted into vectors by normalizing and tokening the data so that it could be understood
by the machine. Next step is by using this data, getting the visual reports, which we will get by using the
matplot Library of Python and sklearn. This library helps us in getting the results in the form of histograms,
pie charts or bar charts.

B. Preprocessing

The data set used is split into a training set and a testing set containing in Dataset 1 -3256 training data and
814 testing data and in Dataset II- 1882 training data and 471 testing data respectively. Cleaning the data is
always the first step. In this, those words are removed from the dataset. That helps in mining the useful
information. Whenever we collect data online, it sometimes contains the undesirable characters like stop
words, digits etc. which creates hindrance while spam detection. It helps in removing the texts which are
language independent entities and integrate the logic which can improve the accuracy of the identification task.

C. Feature Extraction

Feature extractions the process of selecting a subset of relevant features for use in model construction. Feature
extraction methods help in to create an accurate predictive model. They help in selecting features that will give
better accuracy. When the input data to an algorithm is too large to be handled and its supposed to be redundant
then the input data will be transformed into a reduced illustration set of features also named feature vectors.
Altering the input data to perform the desired task using this reduced representation instead of the full-size
input. Feature extraction is performed on raw data prior to applying any machine learning algorithm, on the
transformed data in feature space.
34
D. Training the Classifier

As In this project I am using Scikit-Learn Machine learning library for implementing the architecture.
ScikitLearn is an opensource python Machine Learning library which comes bundled in 3rd distribution
anaconda. This just needs importing the packages and you can compile the command as soon as you write it.
If the command doesn’t run, we can get the error at the same time. I am using 4 different algorithms and I have
trained these 4 models i.e. Naïve Bayes, Support Vector Machine, K Nearest Neighbour and Logistic
Regression which are very popular methods for document classification problem. Once the classifiers are
trained, we can c heck the performance of the models on test-set. We can extract the word count vector for
each mail in test-set and predict it class with the trained models. For the training purpose, they have used a
training data set.

35
IMPLEMENTATION

Steps to be followed

1. Importing Libraries and Datasets

2. Data Preprocessing

3. Preprocessing and analysis of News column

4. Converting text into Vectors

5. Model training, Evaluation, and Prediction

The libraries used are:

• Pandas: For importing the dataset.

• Seaborn/Matplotlib: For data visualization.

Code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Let’s import the downloaded dataset.
data = pd.read_csv('News.csv',index_col=0) data.head()

Output:

36
Data preprocessing

The shape of the dataset can be found by the below code.

data.shape

Output:

(44919, 5)

As the title, subject and date column will not going to be helpful in identification of the news. So, we can drop
these column.

data = data.drop(["title", "subject","date"], axis = 1)

Now, we have to check if there is any null value (we will drop those rows) data.isnull().sum()

Output:

Text 0

class 0

Now we have to shuffle the dataset to prevent the model to get bias. After that we will reset the index and then
drop it. Because index column is not useful to us.

# Shuffling data = data.sample(frac=1)

data.reset_index(inplace=True)

data.drop(["index"], axis=1,inplace=True)

Now Let’s explore the unique values in each category using below code.

sns.countplot(data=data, x='class', order=data['class'].value_counts().index)

37
Output:

unique values in each category

Preprocessing and analysis of News column

First, we will remove all the stopwords, punctuations and any irrelevant spaces from the text. For that NLTK
Library is required and some of it’s module need to be downloaded. So, for that run the below code.

from tqdm import tqdm

import re import nltk

nltk.download('punkt')

nltk.download('stopwords')

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

from nltk.stem.porter importPorterStemmer

38
from wordcloud import WordCloud

Once we have all the required modules, we can create a function name preprocess text. This function will
preprocess all the data given as input.

def preprocess_text(text_data):

preprocessed_text = []

for sentence in tqdm(text_data):

sentence = re.sub(r'[^\w\s]', '', sentence)

preprocessed_text.append(' '.join(token.lower()

for token in str(sentence).split()

if token not in stopwords.words('english')))

return preprocessed_text

To implement the function in all the news in the text column, run the below command.

preprocessed_review = preprocess_text(data['text'].values)

data['text'] = preprocessed_review

This command will take some time (as the dataset taken is very large).

Let’s visualize the Word Cloud for fake and real news separately.

# Real

consolidated = join (word for word in data['text'][data['class'] == 1].astype(str))

wordCloud=WordCloud(width=1600,height=800,random_state=21,max_font_size=110,collocations=False)

plt.figure(figsize=(15, 10)) plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')

39
plt.axis('off') plt.show()

Output :

Word Cloud for real news

# Fake

consolidated = ' '.join(word for word in data['text'][data['class'] == 0].astype(str))

wordCloud = WordCloud(width=1600,height=800,random_state=21,collocations=False)

plt.figure(figsize=(15, 10)) plt.imshow(wordCloud.generate(consolidated),

interpolation='bilinear')

plt.axis('off')

plt.show()

40
Output:

Word Cloud for fake news

Now, Let’s plot the bargraph of the top 20 most frequent words.

from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None):

vec = CountVectorizer().fit(corpus)

bag_of_words = vec.transform(corpus)

sum_words = bag_of_words.sum(axis=0)

words_freq = [(word, sum_words[0, idx])

for word, idx in vec.vocabulary_.items()]

words_freq = sorted(words_freq, key=lambda x: x[1],reverse=True)

return words_freq[:n]
41
common_words = get_top_n_words(data['text'], 20)

df1 = pd.DataFrame(common_words, columns=['Review', 'count'])

df1.groupby('Review').sum()['count'].sort_values(ascending=False).plot(kind='bar',figsize=(10, 6),

xlabel="Top Words", ylabel="Count", title="Bar Chart of Top Words Frequency")

Output:

Bar graph of the top 20 most frequent words

Converting text into Vectors

Before converting the data into vectors, split it into train and test.

from sklearn.model_selection import train_test_split

42
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression x_train, x_test, y_train, y_test =

train_test_split(data['text'], data['class'] ,test_size=0.25)

Now we can convert the training data into vectors using TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer vectorization =

TfidfVectorizer() x_train = vectorization.fit_transform(x_train) x_test =

vectorization.transform(x_test)

Model training, Evaluation, and Prediction

Now, the dataset is ready to train the model.

For training we will use Logistic Regression and evaluate the prediction accuracy using accuracy_score.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression() model.fit(x_train, y_train)

# testing the model

print(accuracy_score(y_train, model.predict(x_train))) print(accuracy_score(y_test,

model.predict(x_test)))

Output:
0.993766511324171
0.9893143365983972

Let’s train with Decision Tree Classifier.

from sklearn.tree import DecisionTreeClassifier model

= DecisionTreeClassifier() model.fit(x_train,y_train)
43
# testing the model

print(accuracy_score(y_train, model.predict(x_train)))

print(accuracy_score(y_test, model.predict(x_test)))

Output :

0.9999703167205913

0.9951914514692787

The confusion matrix for Decision Tree Classifier can be implemented with the code below.

# Confusion matrix of Results from Decision Tree classification

from sklearn import metrics

cm = metrics.confusion_matrix(y_test, model.predict(x_test)) cm_display=metrics

Confusion Matrix Display(confusion_matrix=cm,display_labels=[False, True])

cm_display.plot()

plt.show()

44
Output:

Confusion matrix

45
CONCLUSION

The development of a Fake News Detector using machine learning is a significant step forward in combating
the growing issue of misinformation on the internet. With the proliferation of social media platforms and digital
news outlets, fake news has become a serious concern, influencing public opinion, and even political outcomes.
Machine learning techniques, especially natural language processing (NLP) and classification algorithms, have
shown promising results in identifying and distinguishing fake news from reliable information.

Key insights from the project include:

Data Preprocessing is Crucial: Proper cleaning, tokenization, and feature extraction from text data are
essential for building an effective fake news detection system.

Model Performance: Traditional models like Logistic Regression, Decision Trees, and Naive Bayes, as well
as advanced deep learning techniques like LSTMs and BERT, have demonstrated high accuracy in classifying
fake vs. real news.

Limitations: Despite promising results, detecting fake news remains a complex task due to the subtlety of
misinformation and the constant evolution of the tactics used by creators of fake news. Models might face
difficulties in handling ambiguous or nuanced content, satire, and highly contextual language.

In conclusion, while machine learning offers powerful tools for fake news detection, there is still a long way
to go before these systems are flawless and fully adaptable. As new technologies, research, and data become
available, the future holds great promise for more robust and scalable solutions to this critical problem.

Python for Beginners: A Crash Course to Learn Python Programming in 1 Week
From Everand
Python for Beginners: A Crash Course to Learn Python Programming in 1 Week
Brady Ellison
No ratings yet
Age and Gender Detection-3
67% (12)
Age and Gender Detection-3
20 pages
Learn Python in 10 Minutes
From Everand
Learn Python in 10 Minutes
Victor Ebai
4/5 (30)
Python UNIT 1
No ratings yet
Python UNIT 1
118 pages
Python Material
No ratings yet
Python Material
166 pages
Xi Ip
No ratings yet
Xi Ip
5 pages
Chap1_ Introduction
No ratings yet
Chap1_ Introduction
42 pages
Python Programming E Book
No ratings yet
Python Programming E Book
88 pages
Python Basics
No ratings yet
Python Basics
34 pages
Report of Industrial Training
No ratings yet
Report of Industrial Training
51 pages
PPT
No ratings yet
PPT
21 pages
Python Programming
No ratings yet
Python Programming
54 pages
Class 6 Phython Chapter For Revision
No ratings yet
Class 6 Phython Chapter For Revision
17 pages
Setting Up Python 3.5 and Numpy and Matplotlib On Your Own Windows PC or Laptop
No ratings yet
Setting Up Python 3.5 and Numpy and Matplotlib On Your Own Windows PC or Laptop
18 pages
Lab Programs 1 to 3
No ratings yet
Lab Programs 1 to 3
23 pages
Python Ppt
No ratings yet
Python Ppt
38 pages
How to Install Python
No ratings yet
How to Install Python
6 pages
Python Practicals
100% (2)
Python Practicals
33 pages
pyChapt-1
No ratings yet
pyChapt-1
11 pages
Python Tutorial9
No ratings yet
Python Tutorial9
10 pages
Beginner Guide To Python Programming PDF
No ratings yet
Beginner Guide To Python Programming PDF
78 pages
Getting Started With Python-1
No ratings yet
Getting Started With Python-1
17 pages
Project File
No ratings yet
Project File
9 pages
PYTHON: Practical Python Programming For Beginners & Experts With Hands-on Project
From Everand
PYTHON: Practical Python Programming For Beginners & Experts With Hands-on Project
Mark Chan
5/5 (4)
Python Primer: Patrice Koehl Modified by Xin Liu in Apr., 2011
No ratings yet
Python Primer: Patrice Koehl Modified by Xin Liu in Apr., 2011
33 pages
PythonProgrammingTutorial Day01
No ratings yet
PythonProgrammingTutorial Day01
6 pages
Wall - Eric Python Coding - Learn To Code Fast. Python For Data Analysis and Machine Learning
No ratings yet
Wall - Eric Python Coding - Learn To Code Fast. Python For Data Analysis and Machine Learning
103 pages
Setting Up Python
No ratings yet
Setting Up Python
22 pages
PP Unit I Notes Dbatu-1
100% (1)
PP Unit I Notes Dbatu-1
20 pages
Python - Environment Setup
No ratings yet
Python - Environment Setup
10 pages
BCA III Year Major-II Python
No ratings yet
BCA III Year Major-II Python
66 pages
Lesson 3 Introduction To Python
No ratings yet
Lesson 3 Introduction To Python
4 pages
Introduction to Python Programming: Learn Coding with Hands-On Projects for Beginners
From Everand
Introduction to Python Programming: Learn Coding with Hands-On Projects for Beginners
Kiet Huynh
No ratings yet
A 03 Introduction To Python Programming
No ratings yet
A 03 Introduction To Python Programming
56 pages
ECE PROJECT3
No ratings yet
ECE PROJECT3
56 pages
! - Getting Started With Python6
No ratings yet
! - Getting Started With Python6
17 pages
Lab Manual
No ratings yet
Lab Manual
134 pages
Python
No ratings yet
Python
5 pages
Lesson_03_Introduction_to_Python_Programming
No ratings yet
Lesson_03_Introduction_to_Python_Programming
56 pages
Unit 3. Python Programming-15.09.2022
No ratings yet
Unit 3. Python Programming-15.09.2022
123 pages
PLC Module-1
No ratings yet
PLC Module-1
33 pages
Python Module 1 23MBA
No ratings yet
Python Module 1 23MBA
42 pages
PDF&Rendition=1
No ratings yet
PDF&Rendition=1
83 pages
Python Report
No ratings yet
Python Report
30 pages
Jcs2201-Python Programming Unit-I Notes
No ratings yet
Jcs2201-Python Programming Unit-I Notes
42 pages
Getting Started With Python6
No ratings yet
Getting Started With Python6
16 pages
Python Running Notes
No ratings yet
Python Running Notes
6 pages
python pdf
No ratings yet
python pdf
34 pages
1923 B.E Eee Batchno Intern 54
No ratings yet
1923 B.E Eee Batchno Intern 54
34 pages
Python Programming Language Notes: Starnet Computer Education Fatehgunj, Vadodara Mo:9727203697
80% (5)
Python Programming Language Notes: Starnet Computer Education Fatehgunj, Vadodara Mo:9727203697
131 pages
Installation of Python and Other Packages
No ratings yet
Installation of Python and Other Packages
13 pages
R19 Python UNIT - 1 (Ref-2)
No ratings yet
R19 Python UNIT - 1 (Ref-2)
48 pages
Python Practical Python Programming For Beginners and Experts (PDFDrive)
No ratings yet
Python Practical Python Programming For Beginners and Experts (PDFDrive)
161 pages
Book Python Only For Beginners
No ratings yet
Book Python Only For Beginners
115 pages
2 5285066367077189059
No ratings yet
2 5285066367077189059
48 pages
Software Environment
No ratings yet
Software Environment
27 pages
u1_python
No ratings yet
u1_python
45 pages
Lesson 3 Setting Up The Python Environment
No ratings yet
Lesson 3 Setting Up The Python Environment
16 pages
1. Introduction
No ratings yet
1. Introduction
4 pages
ML Lab 1
No ratings yet
ML Lab 1
24 pages
Introduction To Python
No ratings yet
Introduction To Python
43 pages
Solutions To Deep Learning
No ratings yet
Solutions To Deep Learning
25 pages
1 s2.0 S0016003220302544 Main
No ratings yet
1 s2.0 S0016003220302544 Main
22 pages
DRL GNN
No ratings yet
DRL GNN
12 pages
Deep Audio Classification
No ratings yet
Deep Audio Classification
10 pages
Vision-Based Learning For Drones A Survey
No ratings yet
Vision-Based Learning For Drones A Survey
20 pages
Artificial Neural Networks Kluniversity Course Handout
No ratings yet
Artificial Neural Networks Kluniversity Course Handout
18 pages
Ali Ahmadian (Editor), Soheil Salahshour (Editor) - Soft Computing Approach For Mathematical Modeling of Engineering Problems-CRC Press (2021)
100% (2)
Ali Ahmadian (Editor), Soheil Salahshour (Editor) - Soft Computing Approach For Mathematical Modeling of Engineering Problems-CRC Press (2021)
267 pages
Course: EX801 Computer-Aided Design of Electrical Machines: Departmental
No ratings yet
Course: EX801 Computer-Aided Design of Electrical Machines: Departmental
10 pages
A Comparative Study of Classifying Legal Documents With Neural Networks
No ratings yet
A Comparative Study of Classifying Legal Documents With Neural Networks
9 pages
IEEE - Applications of Deep Learning and Reinforcement Learning To Biological Data PDF
No ratings yet
IEEE - Applications of Deep Learning and Reinforcement Learning To Biological Data PDF
17 pages
Russell 2023
No ratings yet
Russell 2023
69 pages
VISCOSITY CORRELATION Woelflin 1942 Loose
No ratings yet
VISCOSITY CORRELATION Woelflin 1942 Loose
4 pages
AI Masterclass
No ratings yet
AI Masterclass
2 pages
2021 OxfordBibliographies GeoAI
No ratings yet
2021 OxfordBibliographies GeoAI
17 pages
COMP 577 - Soft Computing Techniques: Chapter - 1: Introduction
No ratings yet
COMP 577 - Soft Computing Techniques: Chapter - 1: Introduction
2 pages
Tensorflow and Deep Learning
No ratings yet
Tensorflow and Deep Learning
51 pages
Deep Learning, Theory and Foundation A Brief Review
No ratings yet
Deep Learning, Theory and Foundation A Brief Review
7 pages
Fundamentals of Infrastructure Management
No ratings yet
Fundamentals of Infrastructure Management
210 pages
Pinar 2020
No ratings yet
Pinar 2020
11 pages
DIY Bodies: Surveillance Never Sleeps
No ratings yet
DIY Bodies: Surveillance Never Sleeps
18 pages
New Chapters
No ratings yet
New Chapters
47 pages
Automated Detection of Lunar Craters Using Deep Learning
No ratings yet
Automated Detection of Lunar Craters Using Deep Learning
5 pages
RAMAN_Reinforcement_Learning_Inspired_Algorithm_for_Mapping_Applications_onto_Mesh_Network-on-Chip
No ratings yet
RAMAN_Reinforcement_Learning_Inspired_Algorithm_for_Mapping_Applications_onto_Mesh_Network-on-Chip
7 pages
RTCSE2017 Proceeding
No ratings yet
RTCSE2017 Proceeding
51 pages
Research Article: Machine Learning-Based Predictive Farmland Optimization and Crop Monitoring System
No ratings yet
Research Article: Machine Learning-Based Predictive Farmland Optimization and Crop Monitoring System
12 pages
Alahmari Et Al. - 2022
No ratings yet
Alahmari Et Al. - 2022
23 pages
Gradient Descent Algorithms and Variations - PyImageSearch
No ratings yet
Gradient Descent Algorithms and Variations - PyImageSearch
21 pages
1st 2nd Merged
No ratings yet
1st 2nd Merged
19 pages