Lab Report
On
Machine Learning Lab
(MTCS 103-18)
Submitted in partial fulfillment of the requirements for the award of the degree of
Master of Technology
In
COMPUTER SCIENCE & ENGINEERING
Batch (2024-2026)
Submitted to: Er. Jasmeet Kaur
Submitted by: Ranjodhbir Singh, Ansar Ali
Experiment 1: Study of pattern for implementation of Assignments.
Download the open-source software of your interest. Document the distinct features and functionality of the software platform. You may choose WEKA, R, or any other software.
Python is a high-level programming language that has become increasingly popular due to its simplicity, versatility, and extensive range of applications. Installing Python on the Windows operating system is relatively easy and involves only a few uncomplicated steps.
This section takes you through the process of downloading and installing Python on your Windows computer.
How to Install Python in Windows?
We have provided step-by-step instructions to guide you and ensure a successful installation. Whether you are
new to programming or have some experience, mastering how to install Python on Windows will enable you
to utilize this potent language and uncover its full range of potential applications.
To download Python on your system, use the following steps.
Step 1: Select Version to Install Python
Visit the official page for Python https://www.python.org/downloads/ on the Windows operating system.
Locate a reliable version of Python 3, preferably version 3.10.11, which was used in testing this tutorial.
Choose the correct link for your device from the options provided: either Windows installer (64-bit) or
Windows installer (32-bit) and proceed to download the executable file.
Python Homepage
Step 2: Downloading the Python Installer
Once you have downloaded the installer, open the .exe file, such as python-3.10.11-amd64.exe, by double-clicking it to launch the Python installer. Check the Install launcher for all users checkbox so that all users of the computer can access the Python launcher application. Also check the Add python.exe to PATH checkbox so that users can run Python from the command line.
Python Installer
After clicking the Install Now button, the setup will start installing Python on your Windows system, and you will see a window like this.
Python Setup
Step 3: Running the Executable Installer
After completing the setup, Python will be installed on your Windows system and you will see a success message.
Close the window after the installation finishes. You can check whether the installation was successful by using either the command line or the Integrated Development Environment (IDLE), which you may have installed. To access the command line, click on the Start menu, type "cmd" in the search bar, click on Command Prompt, and run:
python --version
Python version
You can also check the version of Python by opening the IDLE application. Go to Start, enter IDLE in the search bar, and then click the IDLE app, for example IDLE (Python 3.10.11 64-bit). If you can see the Python IDLE window, then you have successfully downloaded and installed Python on Windows.
Python IDLE
Getting Started with Python
Python is comparatively easy to code and learn. Python programs can be written in any plain text editor such as Notepad or Notepad++. One can also use an online IDE to run Python code, or install one locally, which makes writing code more convenient because IDEs provide features like an intuitive code editor, debugger, compiler, etc. To begin writing Python code and performing various interesting and useful operations, one must have Python installed on their system.
2. Easy to Code
Python is a high-level programming language. It is very easy to learn compared to other languages like C, C#, JavaScript, and Java. It is very easy to code in Python, and anybody can learn the basics in a few hours or days. It is also a developer-friendly language.
3. Easy to Read
As you will see, learning Python is quite simple. As already noted, Python's syntax is very straightforward: code blocks are defined by indentation rather than by semicolons or brackets.
4. Object-Oriented Language
One of the key features of Python is object-oriented programming. Python supports object-oriented concepts such as classes, object encapsulation, etc.
6. High-Level Language
Python is a high-level language. When we write programs in Python, we do not need to remember the system
architecture, nor do we need to manage the memory.
8. Easy to Debug
Python provides excellent information for error tracing. Once you understand how to interpret Python's error traces, you will be able to quickly identify and correct most of your program's issues. Often you can determine what a piece of code is designed to do simply by glancing at it.
9. Python is a Portable Language
Python is also a portable language. For example, if we have Python code written on Windows and we want to run it on other platforms such as Linux, Unix, or macOS, we do not need to change it; we can run the same code on any platform.
Python Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
Generate a proper 2-D dataset of N points. Split the data set into Training Data Set and Test Data Set
# Generate Dataset
np.random.seed(42)
N = 100  # Number of data points
X = np.random.uniform(0, 10, N).reshape(-1, 1)  # Features
true_function = lambda x: 3 * x + 2
noise = np.random.normal(0, 2, N).reshape(-1, 1)
y = true_function(X) + noise
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a Linear Regression model and evaluate it
model = LinearRegression()
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculate MSE
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
print("Train MSE (Linear Regression):", train_mse)
print("Test MSE (Linear Regression):", test_mse)
Plot the graphs for Training MSE and Test MSE and Comment on Curve Fitting and Generalization Errors.
# Analyze and Plot Training/Test MSE across polynomial degrees
degrees = range(1, 10)
train_errors = []
test_errors = []
for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    model.fit(X_train_poly, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train_poly)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test_poly)))
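The plot of these error curves is not included in the extracted code above; a minimal sketch that uses the degrees, train_errors and test_errors values computed in the loop might look like this:
# Plot training vs. test MSE against polynomial degree
plt.figure(figsize=(10, 5))
plt.plot(degrees, train_errors, label="Training MSE", marker='o')
plt.plot(degrees, test_errors, label="Test MSE", marker='o')
plt.xlabel("Polynomial Degree")
plt.ylabel("Mean Squared Error")
plt.title("Training vs Test MSE across Polynomial Degrees")
plt.legend()
plt.grid()
plt.show()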
Verify the Effect of Data Set Size and the Bias-Variance Trade-off.
# Effect of Dataset Size
dataset_sizes = np.linspace(10, len(X_train), 10, dtype=int)
train_mse_size = []
test_mse_size = []
for size in dataset_sizes:
    indices = np.random.choice(range(len(X_train)), size, replace=False)
    X_subset = X_train[indices]
    y_subset = y_train[indices]
    model.fit(X_subset, y_subset)
    train_mse_size.append(mean_squared_error(y_subset, model.predict(X_subset)))
    test_mse_size.append(mean_squared_error(y_test, model.predict(X_test)))
# Plot MSE vs Dataset Size
plt.figure(figsize=(10, 5))
plt.plot(dataset_sizes, train_mse_size, label="Training MSE", marker='o')
plt.plot(dataset_sizes, test_mse_size, label="Test MSE", marker='o')
plt.xlabel("Dataset Size")
plt.ylabel("Mean Squared Error")
plt.title("MSE vs Dataset Size")
plt.legend()
plt.grid()
plt.show()
Apply Cross Validation and plot the graphs for errors.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_errors = []
for train_index, val_index in kf.split(X_train):
    X_train_cv, X_val_cv = X_train[train_index], X_train[val_index]
    y_train_cv, y_val_cv = y_train[train_index], y_train[val_index]
    model.fit(X_train_cv, y_train_cv)
    val_error = mean_squared_error(y_val_cv, model.predict(X_val_cv))
    cv_errors.append(val_error)

# Plot Cross-Validation Errors
plt.figure(figsize=(10, 5))
plt.plot(range(1, len(cv_errors) + 1), cv_errors, marker='o')
plt.xlabel("Fold Number")
plt.ylabel("Validation MSE")
plt.title("Cross-Validation Errors")
plt.grid()
plt.show()
Apply Subset Selection Methods and plot the graphs for errors.
selector = SelectKBest(score_func=f_regression, k=1)
X_train_selected = selector.fit_transform(X_train, y_train.ravel())
X_test_selected = selector.transform(X_test)
model.fit(X_train_selected, y_train)
y_train_pred_selected = model.predict(X_train_selected)
y_test_pred_selected = model.predict(X_test_selected)
train_mse_selected = mean_squared_error(y_train, y_train_pred_selected)
test_mse_selected = mean_squared_error(y_test, y_test_pred_selected)

# Print Final Findings
print("Train MSE (Subset Selection):", train_mse_selected)
print("Test MSE (Subset Selection):", test_mse_selected)
Output:
Experiment 3: Supervised Learning – Classification
Implement Naive Bayes classifier and K-Nearest Neighbour Classifier on Data set of your choice. Test and
compare for Accuracy and Precision.
Dataset Description:
The Wine dataset is a multi-class classification dataset that contains 178 samples of wine, each with 13
chemical properties. These properties are used to classify the wines into one of three cultivars.
Key Details:
• Number of Samples: 178
• Number of Features: 13 (continuous chemical attributes)
• Number of Classes (Target): 3 wine cultivars
• Features: Include alcohol content, malic acid, ash, colour
intensity, phenols, and other chemical properties.
• Target Labels:
o Class 0: Cultivar 1
o Class 1: Cultivar 2
o Class 2: Cultivar 3
The dataset is used to predict the cultivar (wine type) based on chemical attributes, making it suitable for
classification tasks in machine learning.
Python Code:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Loading the Wine dataset from sklearn
from sklearn.datasets import load_wine

# Load the dataset
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
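The classifier training and evaluation steps are not shown above; a minimal sketch of how the two classifiers could be trained and compared on this dataset (the split ratio and the value of k are assumptions, not taken from the original report) is:

# Train/test split on the Wine features and target
X_train, X_test, y_train, y_test = train_test_split(df[data.feature_names], df['target'], test_size=0.3, random_state=42)

# Naive Bayes classifier
nb = GaussianNB()
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_test)

# K-Nearest Neighbour classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)

# Compare Accuracy and Precision (macro-averaged over the three classes)
print("Naive Bayes - Accuracy:", accuracy_score(y_test, nb_pred),
      "Precision:", precision_score(y_test, nb_pred, average='macro'))
print("KNN         - Accuracy:", accuracy_score(y_test, knn_pred),
      "Precision:", precision_score(y_test, knn_pred, average='macro'))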
Output:
Experiment 4: Unsupervised Learning
Implement K-Means Clustering and Hierarchical Clustering on a proper data set of your choice. Compare their convergence.
Dataset Description:
The dataset used in the code is generated using the make_blobs function from sklearn.datasets. This function creates a synthetic dataset with specified properties such as the number of samples, the number of clusters, and the standard deviation of the clusters.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
• n_samples=300: The dataset consists of 300 data points.
• centers=4: The dataset has 4 clusters.
• cluster_std=1.0: The standard deviation of the clusters is 1.0, meaning the clusters will have a moderate
spread.
• random_state=42: This ensures that the dataset generation is reproducible (i.e., you get the same dataset if
you run the code multiple times).
Explanation of the Variables:
• X: This is a 2D array of shape (300, 2) representing the feature matrix. Each row is a data point with 2
features.
• y: This is a 1D array of length 300, containing the true cluster labels for each point (ranging from 0 to 3 in
this case since there are 4 clusters).
The dataset (X and y) consists of:
• 300 data points with two features.
• 4 clusters, each with a somewhat normal distribution around a center.
• The true labels of the clusters are represented by y.
Python Code:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate the synthetic dataset described above
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# K-Means Clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10, max_iter=300)
kmeans_labels = kmeans.fit_predict(X)
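The hierarchical clustering fit and the figure setup that the scatter plot below relies on are missing from this copy; a minimal sketch, assuming a three-panel comparison figure (true labels, K-Means, hierarchical), might be:

# Hierarchical (Agglomerative) Clustering
hierarchical = AgglomerativeClustering(n_clusters=4)
hierarchical_labels = hierarchical.fit_predict(X)

# Three-panel comparison figure
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
ax[0].scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=30)
ax[0].set_title("True Clusters")
ax[1].scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=30)
ax[1].set_title("K-Means Clustering")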
# Scatter plot of Hierarchical clustering results
ax[2].scatter(X[:, 0], X[:, 1], c=hierarchical_labels, cmap='viridis', s=30)
ax[2].set_title("Hierarchical Clustering")
plt.tight_layout()
plt.show()
Output:
Experiment 5: Dimensionality Reduction
Principal Component Analysis: finding the principal components, and calculating the variance and standard deviation of the principal components.
The dataset used in this code is a synthetically generated, two-dimensional dataset created to illustrate the
process and effects of Principal Component Analysis (PCA).
Characteristics of the Dataset:
1. Two Features:
Feature 1 (X): A set of 100 random values sampled from a standard normal distribution, N(0, 1).
Feature 2 (Y): A linear function of X with added random Gaussian noise.
2. Linear Correlation:
The features are linearly correlated, as Feature 2 is derived directly from Feature 1 with added noise. This
makes the dataset an excellent candidate for PCA, as PCA seeks to identify the directions (principal
components) that capture the most variance in the data.
3. Dimensionality:
The dataset has two dimensions, making it simple and visually interpretable. PCA in this case will reduce the
data to uncorrelated components while maintaining the most significant variance.
4. Preprocessing:
The dataset is standardized using StandardScaler, ensuring that each feature has a mean of 0 and a standard deviation of 1. Standardization is critical for PCA, as it ensures that features with larger scales do not dominate the variance calculation.
Python Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
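The data generation, standardization and PCA fitting steps are not present in the extracted code; a minimal sketch consistent with the dataset description above (the linear coefficient and noise level are assumptions, not from the original report) is:

# Generate the synthetic 2-D dataset (coefficients assumed for illustration)
np.random.seed(42)
x = np.random.normal(0, 1, 100)              # Feature 1
y = 2 * x + np.random.normal(0, 0.5, 100)    # Feature 2: linear in x plus Gaussian noise
data = np.column_stack((x, y))

# Standardize so each feature has mean 0 and standard deviation 1
data_scaled = StandardScaler().fit_transform(data)

# Fit PCA and extract variance information
pca = PCA(n_components=2)
principal_components = pca.fit_transform(data_scaled)
explained_variance = pca.explained_variance_
std_deviation = np.sqrt(explained_variance)
print("Explained Variance:", explained_variance)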
print("Standard Deviation:", std_deviation)
print("Principal Components:\n", principal_components[:5])

# Visualization
plt.figure(figsize=(10, 5))
plt.tight_layout()
plt.show()
Output:
Experiment 6: Supervised Learning and Kernel Methods
Design, implement SVM for classification with proper dataset of your choice. Comment on design and
implementation for linearly non-separable Dataset.
The dataset used in this code is generated using the make_circles function from scikit-learn. It creates a
synthetic, two-dimensional, non-linearly separable dataset commonly used for testing and visualizing
classification algorithms, particularly for demonstrating the power of non-linear models like Support Vector
Machines (SVM) with the RBF kernel.
2. Dimensionality:
The dataset is two-dimensional, making it easy to visualize. Each data point is represented by two features
(X[:, 0] and X[:, 1]).
3. Non-linearly Separable:
The inner and outer circles are concentric, making the dataset non-linearly separable. A linear classifier like a
basic SVM without a kernel would struggle to separate the classes.
4. Parameters Used:
n_samples=500: The total number of points in the dataset.
noise=0.1: Adds random noise to the data, making the boundaries between the classes less distinct and more realistic.
factor=0.5: Specifies the relative size of the inner circle compared to the outer circle. A smaller value makes the inner circle tighter.
random_state=42: Ensures reproducibility of the data generation.
Python Code:
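The dataset generation and SVM training code did not survive in this copy; a minimal sketch of those steps, assuming an RBF-kernel SVC as discussed above, might be:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Generate the non-linearly separable dataset described above
X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM with an RBF kernel (a plain linear kernel struggles with concentric circles)
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_rbf.fit(X_train, y_train)

# Evaluate on the test split
y_pred = svm_rbf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Plot the raw dataset in the left panel; the decision boundary goes in the right panel below
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='k')
plt.title("make_circles dataset")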
# Visualizing the dataset and decision boundary
def plot_decision_boundary(model, X, y):
    # Create a mesh grid covering the feature space
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
    # Predict over the grid and draw the decision regions with the data points on top
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='k')

plt.subplot(1, 2, 2)
plot_decision_boundary(svm_rbf, X, y)
plt.show()

# Output the classification report and accuracy
print("Accuracy:", accuracy)
print(report)
Output:
INTRODUCTION TO THE PROJECT
These days, fake news is creating different issues, from sarcastic articles to fabricated stories and planned government propaganda in some outlets. Fake news and lack of trust in the media are growing problems with huge ramifications in our society. Obviously, a purposely misleading story is "fake news", but lately the discourse on social media is changing its definition. Some now use the term to dismiss facts that run counter to their preferred viewpoints. The importance of disinformation within American political discourse has been the subject of weighty attention, particularly following the American presidential election. The term 'fake news' became common parlance for the issue, particularly to describe factually incorrect and misleading articles published mostly for the purpose of making money through page views. This project seeks to produce a model that can accurately predict the likelihood that a given article is fake news. Facebook has been at the epicentre of much critique following media attention. It has already implemented a feature to flag fake news on the site when a user sees it, and it has also said publicly that it is working on ways to distinguish these articles in an automated manner. Certainly, it is not an easy task. A given algorithm must be politically unbiased, since fake news exists on both ends of the spectrum, and must also give equal balance to legitimate news sources on either end of the spectrum. In addition, the question of legitimacy is a difficult one. However, in order to solve this problem, it is necessary to have an understanding of what fake news is.
OBJECTIVE
The objective of this project is to examine the problems and possible consequences associated with the spread of fake news. We will be working on different fake news data sets, applying different machine learning algorithms to train on the data and test it in order to determine which news is real and which is fake. Fake news is a problem that is heavily affecting society and our perception of not only the media but also facts and opinions themselves. By using artificial intelligence and machine learning, the problem can be addressed, since we can mine patterns from the data to maximize well-defined objectives. So, our focus is to find which machine learning algorithm is best suited to which kind of text dataset, and which dataset is better for measuring accuracy, since accuracy directly depends on the type and amount of data. The more data there is, the better the chances of obtaining reliable accuracy, because more data can be used for training and testing.
TECHNOLOGY USED FOR PROJECT
PYTHON
Python is a high-level, general-purpose, and very popular programming language. Python (currently Python 3) is used in web development and machine learning applications, along with all cutting-edge technology in the software industry. Python is used by almost all the tech giants, such as Google, Amazon, Facebook, Instagram, Dropbox, and Uber. The biggest strength of Python is its huge collection of standard libraries, which can be used for the following:
• Machine Learning
• GUI Applications (like Tkinter, PyQt etc)
• Web frameworks like Django (used by YouTube, Instagram, Dropbox)
• Image processing (like OpenCV, Pillow)
• Web scraping (like Scrapy, Beautiful Soup, Selenium)
• Test frameworks
• Multimedia
• Scientific computing
• Text processing and many more
TOOL USED FOR PROJECT
SPYDER
Spyder is a powerful scientific environment written in Python, for Python, and designed for scientists, engineers and data analysts. It features a unique combination of the advanced editing, analysis, debugging and profiling functionality of a comprehensive development tool with the data exploration, interactive execution, deep inspection and beautiful visualization capabilities of a scientific package. Furthermore, Spyder offers built-in integration with many popular scientific packages, including NumPy, SciPy, Pandas, IPython Console, Matplotlib, SymPy, and more. Beyond its many built-in features, Spyder can be extended even further via third-party plugins. Spyder can also be used as a PyQt5 extension library, allowing you to build upon its functionality and embed its components, such as the interactive console or advanced editor, in your own software.
Features of Spyder
ALGORITHM USED
Machine learning algorithms used for fake news detection can be divided into two main categories: Supervised
and Unsupervised learning.
Supervised learning algorithms are trained on labelled datasets, where each news article is labelled as either
real or fake. The algorithm learns from the labelled dataset and is then used to classify new news articles as
real or fake. Supervised learning algorithms include logistic regression, decision trees, support vector
machines, and neural networks.
Unsupervised learning algorithms, on the other hand, do not require labelled datasets. Instead, they use
clustering techniques to group news articles into clusters based on their similarities. The algorithm then
identifies the characteristics of the clusters that contain fake news articles. Unsupervised learning algorithms
include k-means clustering, hierarchical clustering, and association rule learning.
Some of these popular classifiers are given below that are used for this purpose.
Support Vector Machine: This algorithm is mostly used for classification. It is a supervised machine learning algorithm that learns from a labelled data set. Studies that compared various machine learning classifiers found that the support vector machine gave some of the best results in detecting fake news.
Naïve Bayes: Naïve Bayes is also used for classification tasks and can be used to check whether a news item is authentic or fake.
Logistic Regression: This classifier is used when the value to be predicted is categorical; for example, it can predict a result as true or false. It can therefore be used to detect whether a news item is true or fake.
Random Forests: In this classifier, many decision trees each produce a prediction, and the prediction with the most votes becomes the final result.
Recurrent Neural Network: This classifier is also helpful for detecting fake news. Researchers have used recurrent neural networks to classify news as true or false.
Neural Network: Several machine learning algorithms are used to help with classification problems; the neural network is one of them.
K-Nearest Neighbour: This is a supervised machine learning algorithm used for solving classification problems. It stores all available cases and classifies a new case on the basis of its similarity to them.
Decision Tree: This supervised machine learning algorithm can help detect fake news. It breaks the dataset down into smaller and smaller subsets.
Project Modules
A. Data Use
In this project we use different packages; to load and read the data set we use pandas. Using pandas, we can read the .csv file, display the shape of the dataset, and display the dataset itself in a proper form. We will be training and testing the data; since we use supervised learning, the data is labelled. With the training and testing data and their labels we can apply different machine learning algorithms, but before making predictions and computing accuracies the data needs to be preprocessed, i.e. the null values which are not readable have to be removed from the data set, and the data has to be converted into vectors by normalizing and tokenizing it so that it can be understood by the machine. The next step is to use this data to produce visual reports, which we obtain using Python's matplotlib library and sklearn. This library helps us present the results in the form of histograms, pie charts or bar charts.
B. Preprocessing
The data set used is split into a training set and a testing set: Dataset I contains 3256 training samples and 814 testing samples, and Dataset II contains 1882 training samples and 471 testing samples. Cleaning the data is always the first step. Here, words that do not help in mining useful information are removed from the dataset. Whenever we collect data online, it often contains undesirable characters such as stop words, digits, etc., which create a hindrance during detection. Cleaning helps in removing text consisting of language-independent entities and lets us integrate logic that can improve the accuracy of the identification task.
C. Feature Extraction
Feature extraction is the process of selecting a subset of relevant features for use in model construction. Feature extraction methods help to create an accurate predictive model by selecting the features that will give better accuracy. When the input data to an algorithm is too large to be handled and is suspected to be redundant, the input data is transformed into a reduced representation set of features, also called a feature vector. The desired task is then performed using this reduced representation instead of the full-size input. Feature extraction is performed on the raw data, prior to applying any machine learning algorithm, and the algorithm then operates on the transformed data in feature space.
D. Training the Classifier
In this project I am using the Scikit-Learn machine learning library for implementing the architecture. Scikit-Learn is an open-source Python machine learning library which comes bundled with the Anaconda distribution. You simply import the packages and can run a command as soon as you write it; if the command doesn't run, you get the error immediately. I am using 4 different algorithms and have trained these 4 models, i.e. Naïve Bayes, Support Vector Machine, K-Nearest Neighbour and Logistic Regression, which are very popular methods for document classification problems. Once the classifiers are trained, we can check the performance of the models on the test set: we extract the word count vector for each article in the test set and predict its class with the trained models. For training, a labelled training data set is used.
IMPLEMENTATION
Steps to be followed
2. Data Preprocessing
Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Let’s import the downloaded dataset.
data = pd.read_csv('News.csv', index_col=0)
data.head()
Output:
Data preprocessing
data.shape
Output:
(44919, 5)
As the title, subject and date columns are not going to be helpful in identifying the news, we can drop these columns (a sketch of this step is shown below).
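The drop itself is not shown in this copy; a minimal sketch, assuming the columns are named title, subject and date as described above:
data = data.drop(["title", "subject", "date"], axis=1)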
Now, we have to check if there is any null value (we will drop those rows).
data.isnull().sum()
Output:
text   0
class  0
Now we have to shuffle the dataset to prevent the model from becoming biased (see the sketch below). After that we will reset the index and then drop the old index column, because it is not useful to us.
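The shuffling line itself is not shown here; a common way to do it with pandas is:
data = data.sample(frac=1)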
data.reset_index(inplace=True)
data.drop(["index"], axis=1,inplace=True)
Now Let’s explore the unique values in each category using below code.
Output:
First, we will remove all the stopwords, punctuation and any irrelevant spaces from the text. For that the NLTK library is required and some of its modules need to be downloaded, so run the code below.
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from wordcloud import WordCloud
Once we have all the required modules, we can create a function named preprocess_text. This function will preprocess all the data given as input.
from nltk.corpus import stopwords

def preprocess_text(text_data):
    preprocessed_text = []
    for sentence in text_data:
        # lower-case each token and drop English stopwords
        preprocessed_text.append(' '.join(token.lower() for token in str(sentence).split()
                                          if token.lower() not in stopwords.words('english')))
    return preprocessed_text
To apply the function to all the news in the text column, run the command below.
preprocessed_review = preprocess_text(data['text'].values)
data['text'] = preprocessed_review
This command will take some time (as the dataset taken is very large).
Let’s visualize the Word Cloud for fake and real news separately.
# Real
consolidated = ' '.join(word for word in data['text'][data['class'] == 1].astype(str))  # assuming class 1 marks real news
wordCloud = WordCloud(width=1600, height=800, random_state=21, max_font_size=110, collocations=False)
plt.figure(figsize=(15, 10))
plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
plt.axis('off')
plt.show()
Output:
# Fake
consolidated = ' '.join(word for word in data['text'][data['class'] == 0].astype(str))  # assuming class 0 marks fake news
wordCloud = WordCloud(width=1600, height=800, random_state=21, collocations=False)
plt.figure(figsize=(15, 10))
plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
plt.axis('off')
plt.show()
Output:
Now, Let’s plot the bargraph of the top 20 most frequent words.
from sklearn.feature_extraction.text import CountVectorizer
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = sorted(((word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()), key=lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_words(data['text'], 20)
df1 = pd.DataFrame(common_words, columns=['Review', 'count'])
df1.groupby('Review').sum()['count'].sort_values(ascending=False).plot(kind='bar', figsize=(10, 6))
Output:
Before converting the data into vectors, split it into train and test.
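The splitting code itself is not shown in this copy; a minimal sketch, assuming the text column as the feature and class as the label:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data['text'], data['class'], test_size=0.25, random_state=42)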
from sklearn.metrics import accuracy_score
Now we can convert the training data into vectors using TfidfVectorizer.
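The construction and fitting of the vectorizer is missing here; a minimal sketch of that step might be:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorization = TfidfVectorizer()
x_train = vectorization.fit_transform(x_train)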
x_test = vectorization.transform(x_test)
For training we will use Logistic Regression and evaluate the prediction accuracy using accuracy_score.
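The model construction itself is not shown; a minimal sketch, assuming scikit-learn's LogisticRegression with default settings:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train, y_train)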
print(accuracy_score(y_train, model.predict(x_train)))
print(accuracy_score(y_test, model.predict(x_test)))
Output:
0.993766511324171
0.9893143365983972
Now let's train a Decision Tree Classifier in the same way.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x_train, y_train)
# testing the model
print(accuracy_score(y_train, model.predict(x_train)))
print(accuracy_score(y_test, model.predict(x_test)))
Output:
0.9999703167205913
0.9951914514692787
The confusion matrix for Decision Tree Classifier can be implemented with the code below.
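The computation of the confusion matrix and the display object used below is missing from this copy; a minimal sketch using scikit-learn's metrics module is:
from sklearn import metrics
cm = metrics.confusion_matrix(y_test, model.predict(x_test))
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[False, True])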
cm_display.plot()
plt.show()
Output:
Confusion matrix
CONCLUSION
The development of a Fake News Detector using machine learning is a significant step forward in combating
the growing issue of misinformation on the internet. With the proliferation of social media platforms and digital
news outlets, fake news has become a serious concern, influencing public opinion, and even political outcomes.
Machine learning techniques, especially natural language processing (NLP) and classification algorithms, have
shown promising results in identifying and distinguishing fake news from reliable information.
Data Preprocessing is Crucial: Proper cleaning, tokenization, and feature extraction from text data are
essential for building an effective fake news detection system.
Model Performance: Traditional models like Logistic Regression, Decision Trees, and Naive Bayes, as well
as advanced deep learning techniques like LSTMs and BERT, have demonstrated high accuracy in classifying
fake vs. real news.
Limitations: Despite promising results, detecting fake news remains a complex task due to the subtlety of
misinformation and the constant evolution of the tactics used by creators of fake news. Models might face
difficulties in handling ambiguous or nuanced content, satire, and highly contextual language.
In conclusion, while machine learning offers powerful tools for fake news detection, there is still a long way
to go before these systems are flawless and fully adaptable. As new technologies, research, and data become
available, the future holds great promise for more robust and scalable solutions to this critical problem.