Minor Project II Report
Text Mining: Reuters-21578
SUBMITTED BY:
Aarshi Taneja (10104666)
Divya Gautam (10104673)
Nupur (10104676)
Shruti Jadon (10104776)
Batch: IT-B10
Group Code: DMB10G04
TABLE OF CONTENTS
Abstract
Results
Screenshots
References
Abstract
Text Categorization (TC), also known as Text Classification, is the task
of automatically classifying a set of text documents into different
categories from a predefined set. If a document belongs to exactly
one of the categories, it is a single-label classification task; otherwise,
it is a multi-label classification task. TC uses several tools from
Information Retrieval (IR) and Machine Learning (ML) and has
received much attention in recent years from both academic
researchers and industry developers.
Information Retrieval
Problem definition
Our project is about categorizing news articles from the Reuters-21578
collection into various categories. We work on two major scenarios.
Preprocessing applied
In this case, which is the most usual in TC, the weight of a term
in a document increases with the number of times the term occurs in
that document and decreases with the number of documents in the
collection in which the term occurs. In other words, a term is
important for a document if it appears there often, while its
importance is discounted if it is common across the whole collection.
This term-weighting approach is referred to as term
frequency/inverse document frequency (tf-idf). Formally, wij, the
weight of term ti for document dj, is defined as

wij = tfij x log(N / dfi)

where tfij is the number of times ti occurs in dj, dfi is the number of
documents that contain ti, and N is the total number of documents in
the collection.
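A minimal Java sketch of this weighting (illustrative only, not the project's exact code; it assumes documents are already tokenized into lists of terms):

import java.util.*;

public class TfIdfSketch {
    // weight of term t in document doc, given the whole collection
    static double tfIdf(String t, List<String> doc, List<List<String>> collection) {
        long tf = doc.stream().filter(t::equals).count();           // term frequency in the document
        long df = collection.stream()                                // number of documents containing the term
                            .filter(d -> d.contains(t)).count();
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) collection.size() / df);      // tf x idf
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("oil", "price", "rises"),
            Arrays.asList("oil", "exports", "fall"),
            Arrays.asList("stock", "price", "falls"));
        System.out.println(tfIdf("oil", docs.get(0), docs));    // term present in 2 of 3 documents
        System.out.println(tfIdf("rises", docs.get(0), docs));  // term present in 1 of 3 documents
    }
}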
Term Distributions
Stop Words
Words that are of little value in conveying the meaning of a document
and that happen to have a high frequency are dropped entirely during
the tokenization process. These words are called stop words and are
generally detected either by their high frequency or by matching them
against a dictionary. Below is a stop list of twenty-five semantically
non-selective words which are common.
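As an illustration, the sketch below uses the widely cited twenty-five-word stop list from Manning et al.'s Introduction to Information Retrieval (an assumption on our part, which may differ from the project's own list) and filters it out of a token stream:

import java.util.*;
import java.util.stream.Collectors;

public class StopWordSketch {
    // 25 semantically non-selective words (list taken from Manning et al.)
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
        "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
        "to", "was", "were", "will", "with"));

    // drop every token that appears in the stop list
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !STOP_WORDS.contains(t))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "bank", "raised", "its", "rate");
        System.out.println(removeStopWords(tokens));   // [bank, raised, rate]
    }
}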
Case-Folding
A typical strategy is to do case-folding by converting all uppercase
characters to lowercase. This is a form of word normalization in
which all words are reduced to a standard form; it equates Door with
door and university with UNIVERSITY. This works well in most cases;
however, problems arise when a proper noun such as Black is conflated
with the color black, or when the company name VISION is conflated
with the word vision. One remedy is to lowercase only words at the
beginning of a sentence and words located within titles and headings.
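A minimal sketch of the simple strategy (folding every token to lowercase); the selective remedy described above would instead lowercase only sentence-initial words and words inside titles and headings:

import java.util.*;
import java.util.stream.Collectors;

public class CaseFoldingSketch {
    // simple case folding: reduce every token to lowercase
    static List<String> foldCase(List<String> tokens) {
        return tokens.stream()
                     .map(t -> t.toLowerCase(Locale.ROOT))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(foldCase(Arrays.asList("Door", "UNIVERSITY", "Vision")));
        // [door, university, vision] -- note that the company name VISION is now conflated with "vision"
    }
}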
Relevant Algorithms/Techniques
Classification Methods
Naive Bayes
The Naive Bayes classifier has found its way into many applications
due to its simple principle yet powerful accuracy [13]. Bayesian
classifiers are based on a statistical principle: the presence or
absence of a word in a textual document determines the outcome of the
prediction. Each processed term is assigned a probability that it
belongs to a certain category; this probability is calculated from the
occurrences of the term in the training documents, whose categories
are already known. Once all these probabilities are calculated, a new
document can be classified according to the sum, over the terms
occurring in the document, of the probabilities of each category.
However, this classifier does not take the number of occurrences of a
term within a document into account, which is a potentially useful
additional source of information. The classifiers are called "naive"
because the algorithm assumes that all terms occur independently of
each other.
Given a set of r document vectors D = {d1, ..., dr}, classified along a
set C of q classes, C = {c1, ..., cq}, Bayesian classifiers estimate
the probability of each class ck given a document dj as

P(ck | dj) = P(ck) P(dj | ck) / P(dj)

and, under the naive independence assumption, P(dj | ck) is taken to
be the product of the per-term probabilities P(ti | ck) for the terms
ti occurring in dj.
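A small standalone sketch of this idea (illustrative only, not the project's code): it estimates smoothed term probabilities for two classes from tiny hand-made training documents and classifies a new document by comparing summed log-probabilities; unlike the presence/absence variant described above, it counts every token occurrence.

import java.util.*;

public class NaiveBayesSketch {
    // count how often each term occurs in the documents of one class
    static Map<String, Integer> count(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> d : docs)
            for (String t : d) counts.merge(t, 1, Integer::sum);
        return counts;
    }

    // log P(class) + sum of log P(term | class) with Laplace smoothing
    static double logProb(List<String> doc, Map<String, Integer> classCounts,
                          int classTotal, int vocabSize, double prior) {
        double lp = Math.log(prior);
        for (String t : doc)
            lp += Math.log((classCounts.getOrDefault(t, 0) + 1.0) / (classTotal + vocabSize));
        return lp;
    }

    public static void main(String[] args) {
        List<List<String>> yes = Arrays.asList(Arrays.asList("exchange", "stock", "nyse"),
                                               Arrays.asList("stock", "exchange", "listing"));
        List<List<String>> no  = Arrays.asList(Arrays.asList("oil", "price", "barrel"));
        Map<String, Integer> cy = count(yes), cn = count(no);
        int ty = cy.values().stream().mapToInt(Integer::intValue).sum();
        int tn = cn.values().stream().mapToInt(Integer::intValue).sum();
        Set<String> vocab = new HashSet<>(cy.keySet());
        vocab.addAll(cn.keySet());

        List<String> test = Arrays.asList("stock", "exchange");
        double py = logProb(test, cy, ty, vocab.size(), 2.0 / 3);
        double pn = logProb(test, cn, tn, vocab.size(), 1.0 / 3);
        System.out.println(py > pn ? "yes" : "no");   // prints "yes"
    }
}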
Preprocessing
In our term graph model, we will capture the relationships among
terms using the frequent itemset mining method. To do so, we
consider each text document in the training collections as a
transaction in which each word is an item. However, not all words in
the document are important enough to be retained in the
transaction.
To reduce the processing space as well as to increase the accuracy of
our model, the text documents need to be preprocessed by (1) removing
stop words, i.e., words that appear frequently in the documents but
carry no essential meaning, and (2) retaining only the root form of
words by stemming off their affixes (suffixes and prefixes).
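As an illustration of the transaction view, the sketch below (not the project's code) turns an already preprocessed document, i.e. one with stop words removed and terms stemmed, into the set of distinct terms that would act as items for frequent-itemset mining:

import java.util.*;

public class TransactionSketch {
    // a document becomes a transaction: the set of distinct (already stemmed) terms it contains
    static Set<String> toTransaction(List<String> preprocessedTokens) {
        return new TreeSet<>(preprocessedTokens);
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("stock", "exchang", "list", "stock", "exchang");
        System.out.println(toTransaction(doc));   // [exchang, list, stock]
    }
}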
Graph Building
k-Nearest Neighbors
Actual Implementation
For classifying the documents in Reuters-21578 we initially
pre-processed the data using the following techniques:
a. Bag of words
b. Stop word removal
c. Tf-idf
d. Case folding
e. Normalisation (see the sketch below)
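A minimal sketch of step (e), illustrative only: it length-normalizes a tf-idf document vector (dividing every weight by the vector's Euclidean length) so that long and short documents become comparable.

import java.util.*;

public class NormalisationSketch {
    // divide every tf-idf weight by the vector's Euclidean length (L2 normalisation)
    static Map<String, Double> normalise(Map<String, Double> vec) {
        double norm = Math.sqrt(vec.values().stream().mapToDouble(v -> v * v).sum());
        Map<String, Double> out = new HashMap<>();
        if (norm == 0) return out;
        vec.forEach((term, w) -> out.put(term, w / norm));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> vec = new HashMap<>();
        vec.put("oil", 3.0);
        vec.put("price", 4.0);
        System.out.println(normalise(vec));   // weights 0.6 and 0.8
    }
}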
WORK FLOW
The news articles are split into a training set and a test set; the
classifiers are built from the training set and used to classify the
test documents, after which the complexity of each algorithm is
calculated and the accuracies are compared.
FORMULA USED:
K-NEAREST NEIGHBOUR
The similarity between a test-document vector q and a category vector
d is computed with the cosine measure, sim(d, q) = (d . q) / (|d| |q|),
as implemented in calcsim() below.
InputStream inp;
try { FileWriter writer = new FileWriter(traincsv);
inp = new FileInputStream(new File(hdfile));
catg.add(word);
writer.append(word);
writer.append(',');
//System.out.println(word);
writer.append('\n');
String temp = null;
File ignoreFile = new File("E:\\Mining\\longstoplist.txt");
for(int j=0;j<inputFile.length;j++) {
BufferedReader br = new BufferedReader(new FileReader(inputFile[j]));
String line = "";
StringTokenizer st = null;
List<String> keylist = new ArrayList<String>();
while ((line = br.readLine()) != null) {
st = new StringTokenizer(line, " ");
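// the loop below keeps a token only when its line contains a single word; lines with more than one token are skipped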
while (st.hasMoreTokens()) {
temp = st.nextToken();
if (st.hasMoreTokens()) {
break;
}
else {
keylist.add(temp);
}
}
}
// count occurrences of each word: the first time a word is seen its count is set to 1, later occurrences increment it
if (words.contains(wrds[i]) == false) {
if (result.get(wrds[i]) == null)
result.put(wrds[i], 1);
else
result.put(wrds[i], result.get(wrds[i]) + 1);
words.add(wrds[i]);
}
else {
result.put(wrds[i], result.get(wrds[i]) + 1);
}
}
}
// System.out.println();
writer.append('\n');
FileOutputStream out3;
PrintStream p3;
out3 = new FileOutputStream(clist);
p3 = new PrintStream( out3 );
for (Object o: result.entrySet() ) {
Map.Entry entry = (Map.Entry) o;
int val=Integer.parseInt(entry.getValue().toString());
String k=entry.getKey().toString();
if(val>4){
// System.out.println(k+" "+val);
p3.println(k+" "+val);
}
}
//writer.flush();
writer.close();
sc.close();
temp2 = keyarr[i];
if (m.get(temp2) != null) {
wt = (Double)m.get(keyarr[i]) * Double.parseDouble(keyarr[i+1]);
writer.append(keyarr[i]+" "+Integer.parseInt(keyarr[i+1])+"+");
writer2.append(keyarr[i]+" "+wt+"+");
}
}
}
if (flag == 0) {
writer.append(",").append(temp1);
writer2.append(",").append(temp1);
}
while (st2.hasMoreTokens()) {
String z = st2.nextToken();
writer.append(",").append(z);
writer2.append(",").append(z);
}
writer.append("\n");
writer2.append("\n");
}
writer.close();
writer2.close();
br2.close();
}
if(f == 0) {
for(int i =0;i < clnum-1; i++){
st2.nextToken();
}
}
if(f == 0 && st2.nextToken().equals("yes")) {
sb.append(temp3);
}
else if(f==0){
sb2.append(temp3);
}
break;
}
String keys = sb.toString();
String nokeys = sb2.toString();
String[] keyarr = keys.split("[+\\s]");
String[] nokeyarr = nokeys.split("[+\\s]");
for (int i=0; i <(keyarr.length)-1; i=i+2) {
int temp5=Integer.parseInt(keyarr[i+1]);
if (m2.get(keyarr[i]) == null) {
m2.put(keyarr[i], temp5);
}
else {
m2.put(keyarr[i],(Integer)m2.get(keyarr[i])+ temp5);
}
}
for (int i=0; i <(nokeyarr.length)-1; i=i+2) {
int temp5=Integer.parseInt(nokeyarr[i+1]);
if (m3.get(nokeyarr[i]) == null) {
m3.put(nokeyarr[i], temp5);
}
else {
m3.put(nokeyarr[i],(Integer)m3.get(nokeyarr[i])+ temp5);
}
}
}
int numyes = 0;
line = br.readLine();
int numofart = 0;
if(first == 0) {
outBuffer.append(artname).append(",");
numofart++;
}
else {
numofart++;
if((line2 = br2.readLine()) != null) {
st2 = new StringTokenizer(line2, ",");
while(st2.hasMoreTokens()) {
outBuffer.append(st2.nextToken()).append(",");
}
}
}
if(st.hasMoreTokens()) {
String temp1 = st.nextToken();
String[] keyarr = temp1.split("[+\\s]");
//for yes
String temp2 = null;
double temp3 = 0;
double x = 0;
double y = 0;
double probyes = 0;
double probno = 0;
for(int i=0; i < keyarr.length-1; i=i+2) {
temp2 = keyarr[i];
temp3 = Integer.parseInt(keyarr[i+1]);
if (m2.get(temp2) != null) {
x = (Integer)m2.get(temp2);
}
else {
x = 0;
}
if (m3.get(temp2) != null) {
y = (Integer)m3.get(temp2);
}
else {
y = 0;
}
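// log of smoothed class-conditional term probabilities; the +1 numerator and +38 denominator act as Laplace-style smoothing constants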
probyes = probyes + ( (temp3) * (Math.log((x+1)/(x+y+38))) );
probno = probno + ( (temp3) * (Math.log((y+1)/(x+y+38))) );
}
totalprobyes = Math.abs(pyes + probyes);
totalprobno = Math.abs(pno + probno);
if(totalprobyes > totalprobno && totalprobyes > 500) {
outBuffer.append("yes");
numyes++;
}
else {
outBuffer.append("no");
}
}
outBuffer.append("\n");
}
}
br.close();
this.siz = unikeywords.size();
this.adj = new int [siz][siz];
this.nVerts = siz;
this.next = new int[siz];
this.T = new int [siz][siz];
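// adj: term-graph adjacency matrix; T: shortest-path distances from each source term, filled in later; next: next-neighbour index used during traversal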
for (int i=0; i < siz; i++){
for(int j=0; j < siz; j++) {
adj[i][j] = T[i][j] = 0;
}
}
for(int i=0; i < nVerts; i++) { // initialize next neighbor
next[i]=-1;
}
}
}
}
}
br2.close();
for (int i=0; i < siz; i++){
for(int j=0; j < siz; j++) {
}
}
}
catch (IOException iox) {
System.out.println(iox);
}
}
//graph functions
public int vertices() {
return nVerts; // return the number of vertices
}
if(next[v] == nVerts)
break;
}
}
}
dist[s] = 0;
PriorityQueue Q = new PriorityQueue(dist);
while(Q.Empty() == 0) {
u = Q.Delete_root();
v = nextneighbor(u);
}
for(int col=0; col<nVerts; col++) {
T[s][col] = dist[col];
}
}
}
}
}
sim = n/w;
return (sim);
}
}
br.close();
}
public void generatemaps() throws IOException {
String csvFile = "E:\\Mining\\training_csvs\\Exg3_wts.csv";
BufferedReader br2 = new BufferedReader(new FileReader(csvFile));
String line = "";
line = br2.readLine(); // ignore the first line of headers
StringBuffer sb = new StringBuffer();
String temp3 = null;
StringTokenizer st2 = null;
while ((line = br2.readLine()) != null) {
st2 = new StringTokenizer(line, ",");
int f = 0;
st2.nextToken(); //ignore docid
while (st2.hasMoreTokens()) {
temp3 = st2.nextToken();
if (temp3.equals("yes") || temp3.equals("no")) { // to ignore rest of the classes
f = 1;
// System.out.println(temp3);
}
//System.out.println("temp3 is "+temp3);
for(int i =0;i < clnum-1; i++){
st2.nextToken();
}
if(f == 0 && st2.nextToken().equals("yes")) {
sb.append(temp3);
//System.out.print(temp3+" ");
}
else if(f==0){
sb2.append(temp3);
}
break;
}
String keys = sb.toString();
String[] keyarr = keys.split("[+\\s]");
for(int i=0; i<unikeywords.size(); i++) {
m2.put((String)unikeywords.get(i),0.0);
}
for (int i=0; i <(keyarr.length)-1; i=i+2) {
Double temp5=Double.parseDouble(keyarr[i+1]);
if (m2.get(keyarr[i]) == null) {
m2.put(keyarr[i], temp5);
}
else {
m2.put(keyarr[i],(Double)m2.get(keyarr[i])+ temp5);
}
}
for (int i=0; i <(keyarr.length)-1; i=i+2)
{if (m2.get(keyarr[i]) != null)
m2.put(keyarr[i],(Double)m2.get(keyarr[i])/276);
}
// System.out.println("centroid :"+m2);
}
else {
qvec.put(keyar[i],(Double)qvec.get(keyar[i])+ temp5);
}
}
// System.out.println("testvector: "+qvec);
}
public double calcsim() {
double sim = 0;
double dprod = 0;
double dmag = 0;
double qmag = 0;
double sumofsq = 0;
for (Map.Entry<String, Double> entry : m2.entrySet()) {
// System.out.println("hey"+entry.getKey());
dprod = dprod + entry.getValue() * qvec.get(entry.getKey());
// System.out.println(dprod);
}
// System.out.println("hey");
for (Map.Entry<String, Double> entry2 : m2.entrySet()) {
sumofsq = sumofsq + Math.pow(entry2.getValue(),2);
}
dmag = Math.sqrt(sumofsq);
sumofsq = 0;
for (Map.Entry<String, Double> entry3 : qvec.entrySet()) {
sumofsq = sumofsq + Math.pow(entry3.getValue(),2);
}
qmag = Math.sqrt(sumofsq);
sim = dprod/(dmag*qmag);
return sim;
}
public static void main(String[] args) throws IOException {
setList();
InputStream inp;
List catg = new ArrayList();
inp = new FileInputStream(new File("E:\\Mining\\headerfiles\\all-exchanges.txt"));
Scanner sc = new Scanner(inp); // gets one word at a time from input
String word = null;
writer.append("doc-id");
// System.out.println(catg.size());
for (int i=0; i < catg.size(); i++) {
name = (String)catg.get(i);
obj[i] = new knn(name,i+1);
writer.append(",");
writer.append(catg.get(i).toString());
}
writer.append("\n");
while ((line = br.readLine()) != null) {
double s = 0.0;
double value=0.0;
st = new StringTokenizer(line,",");
while(st.hasMoreTokens()) {
artname = st.nextToken();
// System.out.println(artname);
writer.append(artname).append(",");
if(st.hasMoreTokens()) {
temp1 = st.nextToken();
int index=0;
for (int i=0; i < catg.size(); i++) {
obj[i].setQuevec(temp1);
// System.out.println(catg.get(i));
obj[i].generatemaps();
s = obj[i].calcsim();
// int flag;
if(s>value)
{ value=s;
index=i;
}
else
{
for (int j=0; j < catg.size(); j++)
{
writer.append("no").append(",");
}
writer.append("\n");
}
writer.close();
}
public Vsm_Exg() {
this.docname = null;
this.docvec = new HashMap <String,Double>();
}
}
}
br.close();
//System.out.println(unikeywords);
}
sumofsq = 0;
for (Map.Entry<String, Double> entry3 : qvec.entrySet()) {
sumofsq = sumofsq + Math.pow(entry3.getValue(),2);
}
qmag = Math.sqrt(sumofsq);
sim = dprod/(dmag*qmag);
return sim;
}
Accuracy Calculation
public class CalAccuracy_TermGraph {
private String csvFile;
private String csvFile2;
private String catname;
public CalAccuracy_TermGraph(String a, String b, String c){
csvFile = a;
csvFile2 = b;
catname = c;
}
public void acc() throws IOException{
System.out.println(catname+":");
BufferedReader br = new BufferedReader(new FileReader(csvFile));
String line = "";
StringTokenizer st = null;
}
if (temp1.equals("yes") && temp2.equals("no")) {
b++;
}
if (temp1.equals("no") && temp2.equals("yes")) {
c++;
}
if (temp1.equals("no") && temp2.equals("no")) {
d++;
}
}
}
// cast to double so that the ratios are not truncated by integer division
accuracy = ((double)(a + d) * 100) / (a + b + c + d);
precision = (double) a / (a + c);
recall = (double) a / (a + b);
f = (2 * precision * recall) / (precision + recall);
System.out.println(a);
System.out.println(b);
System.out.println(c);
System.out.println(d);
Result:
We compared the accuracy of Naïve Bayes, Term Graph and kNN for text
classification of the Reuters-21578 articles.
As shown in the bar graph (see the Screenshots section), kNN gives the
best result, with accuracies as follows:
FOR THE EXCHANGES CATEGORY
Algorithm      Accuracy (%)
KNN            98.00
NAÏVE BAYES    74.68
TERM GRAPH     97.41
Future Work
In the future we will focus on:
a. Reducing complexity
b. Increasing accuracy
c. Text summarization
Similar techniques are used in applications such as Yahoo! Alerts,
where relevant documents are shown when the user enters keywords.
References
http://www.informatik.uni-hamburg.de/WTM/ps/coling-232.pdf
http://web.mit.edu/6.863/www/fall2012/projects/writeups/newspaper-article-classifier.pdf
http://jatit.org/volumes/research-papers/Vol3No2/9vol3.pdf
Wei Wang, Diep Bich Do, and Xuemin Lin. Term Graph Model for Text Classification.