Ijresm V4 I4 34

This document describes a machine learning-based plagiarism detection system. It discusses using techniques like vectorization and cosine similarity to compare text documents and generate a similarity score between 0 and 1 that indicates the degree of plagiarism. The proposed system uses scikit-learn, a Python machine learning library, to convert text to vectors with TF-IDF vectorization and to calculate the cosine similarity between those vectors in order to detect plagiarized content. It provides a freely available plagiarism detector for teachers and educational institutions.

International Journal of Research in Engineering, Science and Management 152

Volume 4, Issue 4, April 2021

https://www.ijresm.com | ISSN (Online): 2581-5792

Plagiarism Detector Using Machine Learning

Hiten Chavan, Mohd. Taufik, Rutuja Kadave, Nikita Chandra*
Department of Information Technology, Bharati Vidyapeeth College of Engineering, Navi Mumbai, India

Abstract: Plagiarism is the act of stealing someone's idea or work and presenting it as one's own. Plagiarism has been identified as a violation of moral rights in various countries. Today, in a world of evolving technology and ever-growing Internet usage, plagiarism has been increasing on a large scale. It is often observed in research papers, blogs, articles, assignments, and similar material. This paper focuses mainly on the plagiarism that is frequently found in schools and colleges, where many students copy assignments from their classmates. A system can be developed for the convenience of teachers that checks the amount of plagiarism in students' assignments. Such a system is an improvement over the old manual approach, as it eliminates tedious work and increases speed and efficiency.

Keywords: Plagiarism, Detector, Machine Learning, Cosine similarity, TF-IDF, Plagiarism checker.

1. Introduction
Plagiarism detection is the process of spotting plagiarised content using a trustworthy source or system. Similarity of content beyond a certain limit between two or more files is not acceptable and is hence recognized as plagiarism. The task requires several steps: accepting the input in a particular format, identifying the words the files have in common, counting the occurrences of each word in both files, and finally reporting a similarity score. Nowadays, different kinds of strategies are being implemented to analyze and understand similarity behavior in documents, as is also done to support business growth [7].

Fig. 1. Illustration of the cosine of vectors (dot product)

The proposed system is a machine learning-based model. It uses features built into the scikit-learn library. The TF-IDF vectorizer converts the text into vector form, after which the dot product of the normalized vectors, i.e., cos θ (θ being the angle between the two vectors), is computed. This result gives the resemblance between the two text files or, in this case, vectors.

With the outbreak of the COVID-19 pandemic, the whole education system has proven to be dependent on technology through online lectures, assignments, and examinations. With this approach, plagiarism in students' assignments and online examinations can be spotted more easily.

2. Literature Review
Paraphrasing, or rephrasing, is the conversion of a sentence into another by using alternate words or changing the sequence of words. Recognizing paraphrases is considered a difficult task in Natural Language Processing (NLP). The study in [1] aims to identify plagiarism in the form of paraphrasing by applying a Recurrent Neural Network (RNN) model. Paraphrase detection is difficult because it is not always possible to get the correct context of short-length content [1].

The objective of the study in [2] is to propose a unified technique to detect plagiarism. It makes use of four well-known models, namely Bag of Words (BOW), Latent Semantic Analysis (LSA), Support Vector Machine (SVM), and Stylometry. The study uses 25 books by various authors and computes the results using the usage patterns of the Most Common Words (MCW) [2].

The study [3] suggests a new way to recognize cross-language plagiarism using machine learning and natural language methods. The modus operandi of this system involves the following major steps: textual input, translation detection, internet search, and report generation. The approach applies to most electronic input documents.

With the detection of plagiarism in source code as its core objective, the study in [4] proposes a plagiarism detector that is not influenced by changes to identifiers or to the order of program statements. It compares this approach with that of the sim plagiarism detector. The system uses sequence alignment and various abstract syntax tree elements [4].

The study in [5] proposes a model to spot plagiarism in Arabic texts using deep learning features. It puts forward an approach that uses the word2vec model to detect the semantic similarity between Arabic words. Word2vec is a simple deep learning method used to represent words as feature vectors with great

*Corresponding author: [email protected]


accuracy. It uses the concept of cosine similarity to check the similarity between the vectors [5].

Xinhao Wang et al. [6] have proposed a plagiarism detection model for non-native English speakers. The model differentiates between plagiarised and non-plagiarised content, which is present in two forms, namely text-based and speech-based data. The system is designed for detecting plagiarism in assessments of speaking proficiency, with the speech extracted using an automated speech scoring system. In recent years, spam has become a big problem for the Internet and electronic communication, and several techniques have been developed to counter it [8].

3. Proposed System
In the era of the Internet revolution, plagiarism is more prevalent than ever. Hence, the world is in dire need of plagiarism detectors. Most such systems on the market ask for the user's personal details. The aim of our group through this project is to provide users, especially teachers and educational institutions, with a freely available, easy-to-use plagiarism detector.

We propose to build a plagiarism detector using existing machine learning libraries. We focus on the scikit-learn library, which contains various useful and efficient machine learning tools. Basic techniques like vectorization and cosine similarity can together be used to build an efficient plagiarism detection system.

4. Implementation
Scikit-learn is a Python library that provides tools for machine learning and statistical modeling. This library is used in the proposed system for feature extraction from the text. The TF-IDF vectorizer is used to vectorize the text, i.e., to convert textual data into an array of numbers.

This vector form of the textual data is then used to detect the similarity between two text files. Cosine similarity computes the cosine of the angle between the vector forms of the two text files. This computation results in a score that ranges from 0 to 1 and indicates the extent of similarity between the two input files.

The implementation approach involves four crucial steps:
1) Input File
The file is the input to the plagiarism detection system. It should be in text format (.txt extension).
2) Vectorization of text
Scikit-learn's built-in features convert the words obtained from the textual input into a vector format.
3) Compute similarity
The resemblance of two text files is computed using the basic concept of cosine similarity. The similarity between two text files represented as vectors is computed using the dot product of the two normalized vectors, i.e., cos θ (θ being the angle between the two vectors).
4) Similarity Score
A similarity score is generated that signifies the amount of similarity detected between the two text files. The score is on a scale of 0 to 1 (TF-IDF vectors have no negative components, so cos θ ranges from 0 to 1).

Fig. 2. System flowchart

Algorithm:
a) The TF-IDF technique is used in the system. TF-IDF stands for Term Frequency-Inverse Document Frequency. This algorithm weighs how frequently a word recurs against its importance in the given input.
• Term Frequency = (count of the term) / (total word count in the document)
• Inverse Document Frequency = log((number of docs) / (docs containing keyword))

TF-IDF formula,
TF-IDF = Term Frequency × Inverse Document Frequency

b) Cosine similarity takes the vectors as input and computes the cosine of the angle between them. It provides an output between 0 and 1, signifying the similarity score.

5. Result
The system is presented through an easy-to-use User Interface (UI). It accepts the input text files (.txt extension) via the "Choose Files" button. The two chosen files are uploaded to the database. These uploaded files are then checked for plagiarism.
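The check applied to the two uploaded files can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: it assumes scikit-learn is installed, uses `TfidfVectorizer` and `cosine_similarity` with their default settings, and the function name `similarity_score` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similarity_score(text_a: str, text_b: str) -> float:
    """Return the cosine similarity of the TF-IDF vectors of two texts.

    Hypothetical helper for illustration; default vectorizer settings assumed.
    """
    # Fit one vocabulary over both documents so the vectors share dimensions.
    vectors = TfidfVectorizer().fit_transform([text_a, text_b])
    # The pairwise matrix is 2x2; entry [0, 1] compares the first text
    # with the second.
    return float(cosine_similarity(vectors)[0, 1])


if __name__ == "__main__":
    submission = "The cat sat on the mat."
    source = "The dog sat on the log."
    print(similarity_score(submission, source))
```

In practice the two strings would be read from the uploaded .txt files, for example with `pathlib.Path(...).read_text()`. A score close to 1 indicates near-identical wording, and a score of 0 indicates that the two files share no vocabulary.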

Fig. 3. Plagiarism detector: System output

The system displays the output as a "Similarity Score", which shows the extent of similarity between the two text files. The score ranges from 0 to 1.

6. Conclusion
In this paper, a plagiarism detector has been implemented using machine learning features like TF-IDF vectorization and cosine similarity. The system works efficiently and detects the extent of plagiarism between the given text files. The incorporation of a user interface makes it easier for a layperson to use the system. The system can easily be used in institutions like schools and colleges to detect plagiarism in students' assignments. It can also be used in evaluating students' examination answer sheets to detect the expected technical words in the answer sheet. Operation of the system does not require any complex directions or training. It is a time-efficient, easy-to-use, and effective plagiarism detection system.

References
[1] E. Hunt et al., "Machine Learning Models for Paraphrase Identification and its Applications on Plagiarism Detection," 2019 IEEE International Conference on Big Knowledge (ICBK), 2019, pp. 97-104.
[2] M. AlSallal, R. Iqbal, S. Amin, A. James and V. Palade, "An Integrated Machine Learning Approach for Extrinsic Plagiarism Detection," 2016 9th International Conference on Developments in eSystems Engineering (DeSE), 2016, pp. 203-208.
[3] A. Anguita, A. Beghelli and W. Creixell, "Automatic cross-language plagiarism detection," 2011 7th International Conference on Natural Language Processing and Knowledge Engineering, 2011, pp. 173-176.
[4] H. Kikuchi, T. Goto, M. Wakatsuki and T. Nishino, "A source code plagiarism detecting method using alignment with abstract syntax tree elements," 15th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2014, pp. 1-6.
[5] D. Suleiman, A. Awajan and N. Al-Madi, "Deep Learning Based Technique for Plagiarism Detection in Arabic Texts," 2017.
[6] X. Wang, K. Evanini, M. Mulholland, Y. Qian and J. Bruno, "Application of an Automatic Plagiarism Detection System in a Large-scale Assessment of English Speaking Proficiency," 2019, pp. 435-443.
[7] A. Ambre, P. Gaikwad, K. Pawar and V. Patil, "Web and Android Application for Comparison of E-Commerce Products," International Journal of Advanced Engineering, Management and Science, vol. 5, no. 4, pp. 266-268, 2019.
[8] B. S. Dakhare and U. V. Gaikwad, "Spam Detection and Filtering using Different Methods," IJCA Proceedings on National Conference "MEDHA 2012", MEDHA (1), pp. 1-5, September 2012.
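As a supplementary worked example, the TF-IDF weighting and cosine computation of Section 4 can also be written out without any library. This is a hedged sketch and not the authors' implementation: the names `tokenize`, `tfidf_vectors`, and `cosine` are hypothetical, and the IDF term is assumed to use the smoothed form ln((1 + N) / (1 + df)) + 1 that scikit-learn documents as the default for its TF-IDF vectorizer. The plain form log(N / df) is avoided here because, with only two documents, every word that occurs in both files would receive weight log(2/2) = 0 and the shared vocabulary would not count toward the score.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            # TF = count / total words; IDF = smoothed form (an assumption).
            term: (n / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, n in c.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # An empty document has no direction to compare.
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    a, b = tfidf_vectors(["The cat sat on the mat.", "The dog sat on the log."])
    print(cosine(a, b))
```

Because cosine similarity divides by both vector lengths, the score does not depend on document length, only on how the weighted vocabulary of the two files overlaps.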
