
B.E. Project: Department of Computer Engineering and Information Technology

The document is a project synopsis submitted by a group of students for their B.E. Computer Engineering project. It proposes developing a big data architecture to detect phishing emails from security data collected in a honeypot. The goals are to use low-cost and efficient algorithms to accurately detect phishing emails from a large dataset. The proposed work is to collect data from multiple sources into HDFS and use a Naive Bayes algorithm to classify emails as legal or illegal, with an expected accuracy of 99.5%. The synopsis provides details on the problem statement, objectives, related work, system design, and implementation plan.

Uploaded by

swapnil
Copyright
© All Rights Reserved

PVG’s College of Engineering and Technology, Pune.

Department of Computer Engineering and Information Technology.


B.E. PROJECT
(Computer Engineering)
[Academic Year 2016-17]

Application
To,
The Project co-ordinator,
Computer Engineering Department,
PVG’s COET, Pune.
Sub:- Submission of Project Synopsis.
Respected Sir,
We, the undersigned students of B.E. Computer, are submitting our project synopsis. We are
bound by the decision taken by the department regarding our selected project title and are
submitting the final synopsis for the selected project. Henceforth we will not change the project
group or the selected project title/topic for any reason.
Thanking you.
Group Id:- 4

Title of the project:-


A Big Data Architecture for security data and its application to Phishing
Characterization

Sources of Project Idea (tick suitable):-


1. Sponsored Project (Attach Letter)
Name of the Sponsoring Authority:-…………………………………………
2. Based on own interests
3. University-based problem (related to the teaching, research or administration )
4. Extension/Enhancement of previous project

Internal guide:- Mr. A G Dongre


Problem statement:-
As use of the internet grows, cybersecurity problems also arise. Attackers carry out different
activities to gain sensitive information about the victim and, after gaining that information, use
it to perform illegal activities. To address this, we propose a system whose main objective is to
detect phishing emails in the data collected by a honeypot.

Abstract:-
As the internet grows, cybersecurity problems grow with it. Attackers carry out different
malicious activities to obtain the victim's information and then use that information to perform
illegal activities. The development of applications to mitigate those threats presents some
complicating factors, such as the growth in the amount of data and the variety of data that can
come from different sources. In this project we design an architecture, built on top of Big Data
frameworks, that aims to mitigate cybersecurity problems such as phishing. The architecture
enables the implementation of big data applications in the context of cybersecurity. It is
designed so that we are able to detect phishing emails in a large data set and in the information
collected by the honeypot.

Technical Key Words:-


architecture, cyber security, phishing, hadoop

Goals and Objectives:-


The goal of this project is to detect phishing emails in a large input data set, together with
pcap files and blacklisted-site data. The objective is to implement the project at low cost, using
an efficient algorithm that gives higher accuracy in detecting phishing emails.
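
Since the objective is stated in terms of accuracy, it may help to pin down how accuracy and the
related detection measures are computed. The following is a minimal sketch, not the project's
evaluation code; the function name detection_metrics and the label strings "phishing" and
"legitimate" are assumptions made for illustration.

```python
def detection_metrics(true_labels, predicted_labels, positive="phishing"):
    """Return accuracy, precision, recall and false-positive rate as a dict."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive:
            if truth == positive:
                tp += 1  # phishing email correctly flagged
            else:
                fp += 1  # legitimate email wrongly flagged
        else:
            if truth == positive:
                fn += 1  # phishing email missed
            else:
                tn += 1  # legitimate email correctly passed

    def ratio(numerator, denominator):
        # Avoid division by zero when a category is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


truth = ["phishing", "phishing", "legitimate", "legitimate", "legitimate"]
guess = ["phishing", "legitimate", "legitimate", "phishing", "legitimate"]
print(detection_metrics(truth, guess))
```

Accuracy alone can look high when phishing emails are rare in the data set, so reporting recall
and the false-positive rate alongside it gives a fuller picture.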

Introduction:-
Security issues become more critical due to factors such as the large volumes and variety of data
that may be vulnerable, the diversity of data sources and formats, and the velocity at which data
are generated, typically as high-volume streams. Enterprises usually
collect terabytes of security-relevant data, including network traffic, and software application
events, among others. However, well established techniques, most of the time, are not scalable
and typically produce many false positives when dealing with large amounts of data, degrading
their efficacy. To face these emerging problems, big data analytics has attracted the interest of the
security community. The use of big data frameworks for security solutions presents several
benefits, such as the possibility of storing and using large quantities of security data. Although
analyzing logs, network flows, and system events has been used for several decades in security
solutions, conventional technologies are not adequate for such long-term, large-scale volumes.
In general, traditional infrastructure keeps the data only for a limited period.
Besides that, traditional techniques are inefficient when performing analytics and complex
queries on large, unstructured datasets, while big data platforms perform these operations
efficiently. In this paper we present an architecture for cybersecurity applications based on big
data frameworks. Our architecture has the capability of collecting data from different sources,
storing, combining, and processing them effectively. For example, pcap files and other logs from
a honeypot, as well as data streams collected from blacklist sites, can all be stored in our
system.
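
One way such sources might be combined is to cross-check the URLs found in collected emails
against the blacklist data. The sketch below is illustrative only and runs on in-memory data
rather than HDFS; the function names, the sample domains and the dictionary layout are all
assumptions, not part of the proposed system.

```python
import re
from urllib.parse import urlparse

# Rough pattern for http/https URLs embedded in email text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_hosts(email_body):
    """Return the set of lower-cased host names of all URLs in the body."""
    hosts = set()
    for url in URL_PATTERN.findall(email_body):
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower().rstrip("."))
    return hosts


def is_blacklisted(host, blacklist):
    """True if the host or any of its parent domains is in the blacklist."""
    labels = host.split(".")
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels)))


def flag_emails(emails, blacklist):
    """Map each email id to the sorted blacklisted hosts it links to."""
    flagged = {}
    for email_id, body in emails.items():
        hits = sorted(h for h in extract_hosts(body) if is_blacklisted(h, blacklist))
        if hits:
            flagged[email_id] = hits
    return flagged


# Hypothetical sample data for illustration.
emails = {
    "m1": "Please verify at http://login.bad-bank.example/verify now",
    "m2": "Minutes are at https://intranet.example.org/minutes",
}
print(flag_emails(emails, {"bad-bank.example"}))
```

Matching on parent domains means a blacklist entry for a domain also covers its subdomains, which
is a design choice of this sketch rather than something the synopsis specifies.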

Present work related to the project topic:-


The existing architecture stores the emails on top of HDFS in the form of mailboxes. Its authors
implemented an application to process large volumes of spam traffic collected from all over the
world. The honeypots are located in different countries and continents and store the messages in
mailboxes on top of HDFS. The main contribution of the application is its capability to identify
phishing emails in a set of spam messages. Using Natural Language Processing (NLP) and
Locality-Sensitive Hashing (LSH) to inspect the text present in the messages, they were able to
detect phishing campaigns. The drawback of this system is that, in their experiments, the method
detected phishing campaigns with an accuracy of only 98.1%.
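
To give a feel for how Locality-Sensitive Hashing can group similar messages, here is a minimal
MinHash sketch with banding. It is a generic textbook technique and is not claimed to be the
method used in the work described above; the function names, the shingle size and the numbers of
hashes and bands are assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a lower-cased message."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the smallest hash over the set."""
    signature = []
    for seed in range(num_hashes):
        smallest = min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        )
        signature.append(smallest)
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """LSH banding: messages that share an identical band become candidates.

    Expects a non-empty dict mapping message ids to equal-length signatures.
    """
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for msg_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, set()).add(msg_id)
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


# Hypothetical messages for illustration.
messages = {
    "a": "verify your account now to avoid suspension",
    "b": "Verify your account NOW to avoid suspension",
    "c": "quarterly meeting minutes are attached for review",
}
signatures = {mid: minhash_signature(shingles(text)) for mid, text in messages.items()}
print(candidate_pairs(signatures))
```

Banding avoids comparing every pair of messages: only messages that land in the same bucket for
at least one band are compared in detail, which is what makes the approach practical on large
spam collections.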
Proposed work of project topic:-
We place a honeypot on one of the computers, which will be connected to all the computers we are
using. The data collected from the honeypot, along with the network-traffic pcap files, the
blacklist-site data and the Enron email data set, which contains 5 lakh (500,000) emails, is
stored on top of HDFS. These are given as input to the project architecture. In the processing
part, we use an algorithm that differentiates legal from illegal emails and then passes the
suspected emails on to the phishing campaigns. These campaigns conduct different tests, and based
on the results we finally detect the phishing emails. The algorithm we will use is the Naive Bayes
algorithm, which is expected to give an accuracy of 99.5% in detecting phishing emails and thereby
overcome the drawback of the existing system.
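
The Naive Bayes step can be illustrated with a small multinomial classifier using Laplace
smoothing. This is a sketch under stated assumptions, not the project's implementation: the class
name, the tokenizer, the label strings and the tiny training set are hypothetical, and a toy
example like this says nothing about the accuracy reachable on real data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split a message into lower-cased word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesEmailClassifier:
    """Multinomial Naive Bayes with Laplace (add-one by default) smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = Counter()            # emails seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for each label."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        words = [w for w in tokenize(text) if w in self.vocabulary]
        scores = {}
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + self.smoothing * vocab_size
            score = math.log(doc_count / total_docs)
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + self.smoothing) / denominator)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training data for illustration.
train_texts = [
    "verify your account password immediately",
    "urgent: confirm your bank password now",
    "meeting agenda for monday attached",
    "lunch on friday with the project team",
]
train_labels = ["phishing", "phishing", "legitimate", "legitimate"]

model = NaiveBayesEmailClassifier().fit(train_texts, train_labels)
print(model.predict("please confirm your password"))
```

Working with log-probabilities avoids numeric underflow on long emails, and smoothing keeps a
word that was never seen with one label from forcing that label's probability to zero.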

Problem statement feasibility assessment:-


A. Time Complexity
The time complexity of an algorithm quantifies the amount of time the algorithm takes to run as a
function of the length of its input. The time complexity of the proposed system is O(log n),
where n is the number of inputs to the system.

B. Class of problem
When solving a problem we have to decide its difficulty level. Three classes are commonly used
for this:
1) P Class
2) NP-Hard Class
3) NP-Complete Class
A decision problem is in P if there is a known polynomial-time algorithm to get the answer. A
decision problem is in NP if there is a known polynomial-time algorithm for a non-deterministic
machine to get the answer. Problems known to be in P are trivially in NP: the nondeterministic
machine just never troubles itself to fork another process, and acts just like a deterministic one.
But there are some problems which are known to be in NP for which no poly-time deterministic
algorithm is known; in other words, we know they're in NP, but don't know if they're in P. A
problem is NP-complete if you can prove that (1) it's in NP, and (2) a problem already known to
be NP-complete is poly-time reducible to it. A problem is NP-hard if and only if it's at least as
hard as an NP-complete problem. The more conventional Traveling Salesman Problem of finding the
shortest route is NP-hard, not strictly NP-complete.
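
An equivalent way to view NP is that a proposed answer can be checked in polynomial time. For the
decision version of the Traveling Salesman Problem ("is there a tour of length at most a given
bound?"), a checker could look like the sketch below. The function name verify_tour and the
sample distance matrix are assumptions made for illustration.

```python
def verify_tour(distances, tour, bound):
    """Check in polynomial time that the tour visits every city exactly once
    and that its total length, including the return leg, is at most bound."""
    n = len(distances)
    if sorted(tour) != list(range(n)):
        return False  # not a permutation of the cities 0..n-1
    length = sum(distances[tour[i]][tour[(i + 1) % n]] for i in range(n))
    return length <= bound


# Hypothetical symmetric distance matrix for four cities.
distances = [
    [0, 1, 4, 2],
    [1, 0, 1, 5],
    [4, 1, 0, 1],
    [2, 5, 1, 0],
]
print(verify_tour(distances, [0, 1, 2, 3], 5))
```

Checking a given tour takes time polynomial in the number of cities, whereas no polynomial-time
algorithm is known for finding the shortest tour.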

Relevant mathematical models:-


S = {I, O, Fn, S, F}
I = Set of Inputs
O = Set of Outputs
Fn = Set of Functions
S = Success
F = Failure

I = Input = {Emails, pcap files and other logs from a honeypot, data streams collected from
blacklist sites}
O = Output = {Successful detection of spam email}
S = Success = {Detection of spam email}
F = Failure = {Detection of spam email fails, connection loss}

System requirements:-
1. Hardware Requirement
- LAN cable
2. Software Requirement
- Ubuntu OS, Python, Java, virtually installed honeypot

Expected result:-
We expect the designed project to be able to detect phishing emails.
Plan of project execution:

Activity                                    Start Date   End Date   Duration

Section 1 - Planning                        08/01/16     08/26/16   20d
  Resource Planning                         08/01/16     08/10/16   8d
  Quality Planning                          08/11/16     08/22/16   8d
  Phase Review                              08/23/16     08/26/16   4d

Section 2 - Analysis                        09/01/16     09/30/16   22d
  Gathering business req.                   09/01/16     09/09/16   7d
  Documenting Existing System               09/12/16     09/19/16   6d
  Developed Primary data & Process model    09/20/16     09/30/16   9d

Section 3 - Design                          10/15/16     06/01/17   165d
  Developed design models for system        10/15/16     03/15/17   109d

Section 4 - Implementation                  10/17/16     06/01/17   164d
  System construction                       10/17/16     01/17/17   131d
  Installing System                         04/18/17     04/28/17   09d

Section 5 - Completion                      05/01/17     06/01/17   24d
  Evaluation/Testing                        05/01/17     06/01/17   24d
  Maintenance                               06/02/17     06/01/17   1d

References
a) Center for Strategic and International Studies - McAfee, “Net Losses: Estimating the Global
Cost of Cybercrime: Economic impact of cybercrime II,” White paper, June 2014.

b) Y. Yu, Y. Mu, and G. Ateniese, “Recent advances in security and privacy in big data,” j-jucs,
Mar 2015.

c) A. A. Cardenas, P. K. Manadhata, and S. P. Rajan, “Big data analytics for security,” IEEE
Security & Privacy, 2013.

d) P. H. B. Las-Casas, V. Santos Dias, R. Ferreira, W. Meira, and D. Guedes, “A hadoop extension
to process mail folders and its application to a spam dataset,” in International Symposium on
Computer Architecture and High Performance Computing Workshop (SBAC-PADW), Oct 2014,
pp. 108–113.

e) T. White, Hadoop: The Definitive Guide. O’Reilly Media, Inc., 2012.


Project Group members:-

Sr.No.   Roll No.   Name                      Sign

1        4067       Neha Nivrutti Narkhede
2        4068       Shweta Dilip Chaudhari
3        3012       Manoj Vaijnath Nahthane
4        2062       Himanshu Anil Joshi

Counselor remark (if any):-


…………………………………………………………………………………………
Name and Sign of Counselor (with Date):-
Guide remark (if any):-
…………………………………………………………………………………………
Name and Sign of Guide (with Date):-
