Ijresm V4 I4 34
Abstract: Plagiarism is the act of stealing someone's idea or work and
representing it as one's own. Plagiarism has been identified as a
violation of moral rights in various countries. Today, in a world of
evolving technology and ever-growing usage of the Internet, the
unacceptable act of plagiarism has been increasing on a large scale. It
is often observed in many educational areas such as research papers,
blogs, articles, assignments, etc. This paper mainly focuses on the
plagiarism that is frequently found in schools and colleges. Many
students can be found to have copied assignments from their classmates.
A system can be developed for the convenience of teachers that could
check the amount of plagiarism in students' assignments. This system
could be regarded as an improvement over the old manual way as it
eliminates the tedious work with increased speed and efficiency.

Keywords: Plagiarism, Detector, Machine Learning, Cosine similarity,
TF-IDF, Plagiarism checker.

1. Introduction

Plagiarism detection is the process of spotting plagiarised content via
a trustworthy source or system. Similarity of content beyond a certain
limit between two or more files is not acceptable and is hence
recognized as plagiarism. The task requires several steps, such as
accepting the input in a particular format, computing the resembling
words, counting the occurrences of a single word in both files, and
finally disclosing a similarity score. Nowadays, different kinds of
strategies are being implemented to analyze and understand the
similarity behavior in documents, much like those used in the growth of
business [7].

Fig. 1. Illustration of the cosine of vectors (dot product)

The system proposed is a machine learning-based model. It uses features
incorporated in the scikit-learn module. The Tf-Idf Vectorizer converts
the text into vector form, whereupon the normalized dot product, i.e.,
cos θ (θ being the angle between the two vectors), is computed. This
result gives us the resemblance of data between both the text files, or
in this case, vectors.

With the outbreak of the COVID-19 pandemic, the whole education system
has proven to be dependent on technology through online lectures,
assignments, and examinations. Through this approach, plagiarism in
students' assignments and online examinations could be spotted more
easily.

2. Literature Review

Paraphrasing or rephrasing is the conversion of a sentence into another
with alternate use of words or by changing the sequence of words in the
sentence. The recognition of paraphrase in Natural Language Processing
(NLP) is considered a rigorous task. The study aims to identify
plagiarism in the form of paraphrasing through the application of the
Recurrent Neural Network (RNN) algorithm model. Paraphrasing detection
is a difficult process as it is not always possible to get the correct
context of short-length content [1].

The objective of the study is to propose a unified technique to detect
plagiarism. It makes use of four well-known models, namely Bag of Words
(BOW), Latent Semantic Analysis (LSA), Support Vector Machine (SVM), and
Stylometry. The study uses 25 books by various authors and computes the
results using the usage patterns of the Most Common Words (MCW) [2].

The study [3] suggests a new way to recognize cross-language plagiarism
using machine learning and natural language methods. The modus operandi
for this system involves the following steps: textual input, translation
detection, internet search, and report generation. The approach applies
to most electronic input documents.

With the detection of plagiarism in source code as its core objective,
the study proposes a plagiarism detector that is not influenced by
changes to identifiers or to the order of program statements. It
compares its approach with that of a sim plagiarism detector. The study
uses Sequence Alignment and various Syntax tree elements in the system
[4].

The study proposes a model to spot plagiarism in Arabic texts using Deep
Learning features. It puts forward an approach that uses the word2vec
model, which detects the semantic similarity between Arabic words.
Word2vec is a simple deep learning method used to represent words as
feature vectors with great
accuracy. It uses the concept of cosine similarity to check the
similarity between the vectors [5].

Xinhao Wang et al. [6] have proposed a plagiarism detection model for
non-native English speakers. The model differentiates between
plagiarised and non-plagiarised content, which is present in two forms,
namely text-based and speech-based data. The system is designed for
detecting plagiarism in assessments of speaking proficiency; voluntary
speech is extracted using an automated speech scoring system. In recent
years, spam has become a big problem for the Internet and electronic
communication, and several techniques have been developed to combat it
[8].

3. Proposed System

In the era of the Internet revolution, plagiarism is more prevalent than
ever. Hence the world is in dire need of plagiarism detectors. Most such
systems on the market ask for personal details of the user. The aim of
our group through this project is to provide users, especially teachers
and educational institutions, with a freely available, easy-to-use
plagiarism detector.

We propose to build a plagiarism detector using existing machine
learning libraries. We focus on the scikit-learn library, which contains
various useful and efficient machine learning tools. Basic techniques
like vectorization and cosine similarity together could be used in
building an efficient plagiarism detection system.

4. Implementation

Scikit-learn is a library that provides tools for machine learning and
statistical modeling. This library has been used in the proposed system
for feature extraction from the text. The Tf-idf vectorizer is used for
vectorization, i.e., conversion of textual data into an array of
numbers.

This vector form of the textual data is then used to detect the
similarity between two text files. Cosine similarity computes the cosine
of the angle between the vector forms of the two text files. This
computation results in a score that ranges from 0 to 1, providing
information about the extent of similarity between the two input files.

The implementation approach involves four crucial steps:

1) Input File
The file is the input for the plagiarism detection system. It should be
in text format (.txt extension).
2) Vectorization of text
Scikit-learn features convert the words obtained from the textual input
into a vector format.
3) Compute similarity
The resemblance of two text files is computed using the basic concept of
Cosine Similarity. The similarity between two text files, represented as
vectors, is computed using the dot product of both vectors divided by
the product of their magnitudes, i.e., cos θ (θ being the angle between
the two vectors).
4) Similarity Score
A similarity score is generated that signifies the amount of similarity
detected between the two text files. The score is on a scale of 0-1
(positive values of cos θ range from 0 to 1).

Fig. 2. System flowchart

Algorithm:
a) The TF-IDF technique has been used in the mentioned system. TF-IDF
stands for Term Frequency-Inverse Document Frequency. This algorithm
emphasises the frequency of a recurring word and its importance in the
given context of input.
Term Frequency = (count of the term) / (total word count in the
document)
Inverse Document Frequency = log((number of docs) / (docs containing
keyword))
TF-IDF formula,
TF-IDF = Term Frequency × Inverse Document Frequency
b) Cosine similarity takes the vectors as input and computes the cosine
of the angle between them. It provides an output between 0 and 1,
signifying the similarity score.

5. Result

The system is presented through an easy-to-use User Interface (UI). It
accepts the input text files (.txt extension) using the "Choose Files"
button. The two chosen files get uploaded to the database. These
uploaded files are then checked for plagiarism.
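The check performed on the two uploaded files can be illustrated with a
minimal sketch. It is not the system's actual code: it uses only the
Python standard library rather than scikit-learn, and the function names
(tokenize, tfidf_vectors, cosine_similarity, similarity_score) are
hypothetical. One assumption deserves attention: when only two files are
compared, the plain Inverse Document Frequency log(N / df) gives every
word that occurs in both files a weight of log(2/2) = 0, so the sketch
uses a smoothed variant, log((1 + N) / (1 + df)) + 1, modelled on the
default behaviour of scikit-learn's TfidfVectorizer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vector = {}
        for term, count in counts.items():
            # Term Frequency = count of the term / total word count.
            tf = count / total
            # Assumption: smoothed IDF, so that terms shared by all
            # documents keep a nonzero weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


def similarity_score(text_a, text_b):
    """Similarity score between two texts, in the range 0 to 1."""
    vec_a, vec_b = tfidf_vectors([text_a, text_b])
    return cosine_similarity(vec_a, vec_b)
```

To compare two .txt files, their contents could be read with open() and
passed to similarity_score. Identical texts score 1 (up to floating-point
rounding), texts with no words in common score 0, and partial overlap
falls in between. Because TF-IDF weights are never negative, the score
cannot drop below 0.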
H. Chavan et al. International Journal of Research in Engineering, Science and Management, VOL. 4, NO. 4, APRIL 2021 154
Fig. 3. Plagiarism detector: System output

The system displays the output as a "Similarity Score" which shows the
extent of similarity between the two text files. The score ranges from 0
to 1.

6. Conclusion

In this paper, a plagiarism detector has been implemented using machine
learning features like TF-IDF vectorization and cosine similarity. The
system works efficiently and detects the extent of plagiarism between
the given text files. The incorporation of a user interface makes it
easier for a layman to utilize the service of the system. This system
can easily be used in institutions like schools and colleges to detect
plagiarism in students' assignments. The system can also be used for the
evaluation of students' examination mark sheets to detect the expected

References
[1] E. Hunt et al., "Machine Learning Models for Paraphrase
Identification and its Applications on Plagiarism Detection," 2019 IEEE
International Conference on Big Knowledge (ICBK), 2019, pp. 97-104.
[2] M. AlSallal, R. Iqbal, S. Amin, A. James and V. Palade, "An
Integrated Machine Learning Approach for Extrinsic Plagiarism
Detection," 2016 9th International Conference on Developments in
eSystems Engineering (DeSE), 2016, pp. 203-208.
[3] A. Anguita, A. Beghelli and W. Creixell, "Automatic cross-language
plagiarism detection," 2011 7th International Conference on Natural
Language Processing and Knowledge Engineering, 2011, pp. 173-176.
[4] H. Kikuchi, T. Goto, M. Wakatsuki and T. Nishino, "A source code
plagiarism detecting method using alignment with abstract syntax tree
elements," 15th IEEE/ACIS International Conference on Software
Engineering, Artificial Intelligence, Networking and Parallel/Distributed
Computing (SNPD), 2014, pp. 1-6.
[5] D. Suleiman, A. Awajan and N. Al-Madi, "Deep Learning Based
Technique for Plagiarism Detection in Arabic Texts," 2017.
[6] X. Wang, K. Evanini, M. Mulholland, Y. Qian and J. Bruno,
"Application of an Automatic Plagiarism Detection System in a
Large-scale Assessment of English Speaking Proficiency," 2019, pp.
435-443.
[7] A. Ambre, P. Gaikwad, K. Pawar and V. Patil, "Web and Android
Application for Comparison of E-Commerce Products," International
Journal of Advanced Engineering, Management and Science, vol. 5, no. 4,
pp. 266-268, 2019.
[8] B. S. Dakhare and U. V. Gaikwad, "Spam Detection and Filtering using
Different Methods," IJCA Proceedings on National Conference "MEDHA
2012", MEDHA (1), pp. 1-5, September 2012.