
MINOR PROJECT II REPORT

TEXT MINING : REUTERS-21578

SUBMITTED BY :
Aarshi Taneja (10104666)
Divya Gautam (10104673)
Nupur (10104676)
Shruti Jadon (10104776)
Batch: IT-B10
Group Code: DMB10G04
TABLE OF CONTENTS

Abstract

Problem definition

Data set chosen

Preprocessing applied

Description of algorithms

Actual implementation

Results

Screenshots

Future work

References
Abstract
Text Categorization (TC), also known as Text Classification, is the task
of automatically classifying a set of text documents into different
categories from a predefined set. If a document belongs to exactly
one of the categories, it is a single-label classification task; otherwise,
it is a multi-label classification task. TC uses several tools from
Information Retrieval (IR) and Machine Learning (ML) and has
received much attention in recent years from both academic
researchers and industry developers.

Information Retrieval

Information Retrieval (IR) is the science of searching for information
within relational databases, documents, text, multimedia files, and
the World Wide Web. The applications of IR are diverse; they include,
but are not limited to, extraction of information from large
documents, searching in digital libraries, information filtering, spam
filtering, object extraction from images, automatic summarization,
document classification and clustering, and web searching.
The breakthrough of the Internet and web search engines has urged
scientists and large firms to create very large-scale retrieval
systems to keep pace with the exponential growth of online data.
The figure below depicts the architecture of a general IR system: the
user first submits a query, which is executed over the retrieval
system; the system consults a database of the document collection and
returns the matching documents.
In general, in order to learn a classifier that is able to correctly
classify unseen documents, it is necessary to train it with some pre-
classified documents from each category, in such a way that the
classifier is then able to generalize the model it has learned from the
pre-classified documents and use that model to correctly classify the
unseen documents.

Problem definition
Our project is about categorizing news articles into various
categories. We work on two major scenarios:

a. Classification of documents into various categories, packaged as an
application where the user can upload an article and we classify it
into one or more categories.
b. On entering keywords, the user is shown the most relevant
document.
Data Set Chosen
As the caption suggests, the data set used for this particular project
is in the form of SGML files. The Reuters-21578 dataset is available at:
http://www.daviddlewis.com/resources/testcollections/reuters21578/
There are 21578 documents; according to the 'ModApte' split there are
9603 training docs, 3299 test docs and 8676 unused docs. They were
labeled manually by Reuters personnel. Labels belong to 5 different
category classes, such as 'people', 'places' and 'topics'. The total
number of categories is 672, but many of them occur only very
rarely. The dataset is divided into 22 files of 1000 documents delimited
by SGML tags.
A sample SGML file:
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="5545" NEWID="2">
<DATE>26-FEB-1987 15:02:20.00</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
&#5;&#5;&#5;F Y
&#22;&#22;&#1;f0708&#31;reute
d f BC-STANDARD-OIL-&lt;SRD>-TO 02-26 0082</UNKNOWN>
<TEXT>&#2;
<TITLE>STANDARD OIL &lt;SRD> TO FORM FINANCIAL UNIT</TITLE>
<DATELINE> CLEVELAND, Feb 26 - </DATELINE><BODY>Standard Oil Co and BP
North America
Inc said they plan to form a venture to manage the money market
borrowing and investment activities of both companies.
BP North America is a subsidiary of British Petroleum Co
Plc &lt;BP>, which also owns a 55 pct interest in Standard Oil.
The venture will be called BP/Standard Financial Trading
and will be operated by Standard Oil under the oversight of a
joint management committee.

Reuter
&#3;</BODY></TEXT>
</REUTERS>

Each article starts with an "open tag" of the form:
<REUTERS TOPICS=?? LEWISSPLIT=?? CGISPLIT=?? OLDID=?? NEWID=??>
where
LEWISSPLIT : The possible values are TRAINING, TEST, and NOT-USED.
TRAINING indicates it was used in the training set in the experiments reported
in LEWIS91d (Chapters 9 and 10), LEWIS92b, LEWIS92e, and LEWIS94b. TEST
indicates it was used in the test set for those experiments, and NOT-USED
means it was not used in those experiments.
NEWID : The identification number (ID) the story has in the Reuters-21578,
Distribution 1.0 collection. These IDs are assigned to the stories in
chronological order.

<TOPICS>: Encloses the list of TOPICS categories, if any, for the document. If
TOPICS categories are present, each will be delimited by the tags <D> and </D>.

<BODY>: The main text of the story.

<AUTHOR>: Author of the story.
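
The workflow described later converts each SGML file to plain text before any
further processing. A minimal sketch of such a conversion is shown below; it is
illustrative only, uses simple regular expressions rather than a full SGML
parser, and the input file name is a hypothetical local copy of one of the
collection's files.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SgmlToText {
    // Print the TITLE and BODY of every <REUTERS> article in one SGML file.
    public static void main(String[] args) throws IOException {
        String sgml = new String(Files.readAllBytes(Paths.get("reut2-000.sgm")));
        Pattern article = Pattern.compile("<REUTERS.*?</REUTERS>", Pattern.DOTALL);
        Pattern title = Pattern.compile("<TITLE>(.*?)</TITLE>", Pattern.DOTALL);
        Pattern body = Pattern.compile("<BODY>(.*?)</BODY>", Pattern.DOTALL);
        Matcher m = article.matcher(sgml);
        while (m.find()) {
            String art = m.group();
            Matcher t = title.matcher(art);
            Matcher b = body.matcher(art);
            // articles with an empty TOPICS list may still have a title and body
            String text = (t.find() ? t.group(1) : "") + "\n" + (b.find() ? b.group(1) : "");
            System.out.println(text.trim());
            System.out.println("----");
        }
    }
}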

Preprocessing applied

Document Term Weighting


Document indexing is the process of mapping a document into a
compact representation of its content that can be interpreted by a
classifier. The techniques used to index documents in TC are
borrowed from Information Retrieval, where text documents are
represented as a set of index terms which are weighted according to
their importance for a particular document. A text document dj is
represented by an n-dimensional vector of index terms or
keywords, where each index term corresponds to a word that
appears at least once in the initial text and has a weight associated
with it, which should reflect how important this index term is.
Term Frequency / Inverse Document Frequency

In this case, which is the most usual in TC, the weight of a term
in a document increases with the number of times that the term
occurs in the document and decreases with the number of times the
term occurs in the collection. This means that the importance of a
term in a document is proportional to the number of times that the
term appears in the document, while the importance of the term is
inversely proportional to the number of times that the term appears
in the entire collection.
This term-weighting approach is referred to as term
frequency / inverse document frequency (tf-idf).
Formally, w_ij, the weight of term t_i for document d_j, is defined as:

w_ij = tf_ij * log( |D| / df_i )

where tf_ij is the number of times that term t_i appears in document
d_j, |D| is the total number of documents in the collection, and df_i is
the number of documents where term t_i appears.
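
To make the weighting concrete, below is a small illustrative sketch of the
tf-idf computation in Java; the term counts and document frequencies are
made-up values, not statistics of the Reuters-21578 collection, and the method
names are our own.

import java.util.HashMap;
import java.util.Map;

public class TfIdfExample {
    // weight of a term in one document: tf * log(|D| / df)
    static double tfIdf(int tf, int totalDocs, int docFreq) {
        return tf * Math.log((double) totalDocs / docFreq);
    }

    public static void main(String[] args) {
        int totalDocs = 21578;                      // size of the collection
        Map<String, Integer> tf = new HashMap<>();  // term frequencies in one document (illustrative)
        tf.put("oil", 4);
        tf.put("venture", 2);
        Map<String, Integer> df = new HashMap<>();  // document frequencies in the collection (illustrative)
        df.put("oil", 520);
        df.put("venture", 310);
        for (String term : tf.keySet()) {
            System.out.println(term + " -> " + tfIdf(tf.get(term), totalDocs, df.get(term)));
        }
    }
}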

Term Distributions

This is a more recent and more sophisticated term-weighting approach
than tf-idf, based on term frequencies within a particular class and
within the collection of training documents.
The weight of a term using term distributions is determined by
combining three different factors that depend on the average term
frequency of term t_i in the documents of class c_k,

tf_avg(t_i, c_k) = (1 / |D_k|) * sum over documents d_j in D_k of tf_ij

where D_k represents the set of documents that belong to class c_k,
|D_k| the number of documents belonging to class c_k, tf_ij the
frequency of term t_i in document d_j of class c_k, and |C|, which will
be used in the following formulas, represents the number of classes in a
collection.

Stop Words
Words that are of little value in conveying the meaning of a document
and which happen to have a high frequency are dropped entirely
during the tokenization process. These words are called stop words
and are generally detected either by their high frequency or by
matching them against a dictionary. A typical stop list consists of
semantically non-selective words that are very common, such as "a",
"an", "and", "the", "of" and "to".

Dropping stop words sounds like a very good approach for discarding
useless, redundant words; however, this is not the case for some
phrases. Imagine a user searching for "President of the United
States" or "Flight to London": the Information Retrieval system
would then search for "President" and "United States" separately, or
for "Flight" and "London" separately. This can return results that do
not reflect the user's initial query.
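
A minimal sketch of stop-word removal during tokenization is shown below; the
stop list here is a tiny illustrative subset, not the project's
longstoplist.txt.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordRemoval {
    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));
        String text = "Standard Oil Co and BP North America Inc said they plan to form a venture";
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!stopWords.contains(word)) {   // keep only non-stop words
                tokens.add(word);
            }
        }
        System.out.println(tokens);
    }
}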

Document Length Normalization


Naturally, long documents contain more terms than short
documents. Considering that the similarity between documents can
be measured by how many terms they have in common, long
documents will have more terms in common with other documents
than short documents, and so will appear more similar to other
documents than short documents are. To counteract this tendency,
term weights for TC tasks are usually normalized so that
document vectors have unit length.

Case-Folding
A typical strategy is to do case-folding by converting all uppercase
characters to lowercase characters. This is a form of word
normalization in which all words are reduced to a standard form.
This would equate between Door and door and between university
and UNIVERSITY. This sounds very nice; however, a problem arises
when a proper noun such as Black is equated with the color black, or
when the company name VISION is equated with the word
vision. To remedy this, one can convert to lowercase only words at
the beginning of a sentence and words located within titles and
headings.

Relevant Algorithms/Techniques

Classification Methods

This report concerns methods for the classification of natural
language text, that is, methods that, given a set of training
documents with known categories and a new document, which is
usually called the query, will predict the query's category.

Naive Bayes
The Naïve Bayes classifier has found its way into many applications
due to its simple principle and yet powerful accuracy [13].
Bayesian classifiers are based on a statistical principle. Here, the
presence or absence of a word in a textual document determines the
outcome of the prediction. In other words, each processed term is
assigned a probability that it belongs to a certain category. This
probability is calculated from the occurrences of the term in the
training documents, where the categories are already known. When
all these probabilities are calculated, a new document can be
classified according to the sum of the probabilities for each category
of each term occurring within the document. However, this classifier
does not take the number of occurrences into account, which is a
potentially useful additional source of information. The classifiers are
called "naïve" because the algorithm assumes that all terms occur
independently of each other.
Given a set of r document vectors D = {d_1, ..., d_r}, classified along a
set C of q classes, C = {c_1, ..., c_q}, Bayesian classifiers estimate the
probability of each class c_k given a document d_j as:

P(c_k | d_j) = P(d_j | c_k) P(c_k) / P(d_j)

In this equation, P(d_j) is the probability that a randomly picked
document has vector d_j as its representation, and P(c_k) the
probability that a randomly picked document belongs to c_k. Because
the number of possible documents is very high, the estimation of
P(d_j | c_k) is problematic.
To simplify the estimation of P(d_j | c_k), Naive Bayes assumes that the
probability of a given word or term is independent of the other terms
that appear in the same document. While this may seem an
oversimplification, in fact Naive Bayes presents results that are very
competitive with those obtained by more elaborate methods.
Moreover, because only words and not combinations of words are
used as predictors, this naive simplification allows the computation
of the model of the data associated with this method to be far more
efficient than other non-naive Bayesian approaches. Using this
simplification, it is possible to determine P(d_j | c_k) as the product of
the probabilities of each term that appears in the document. So:

P(d_j | c_k) = product over terms t_i in d_j of P(t_i | c_k)

where P(t_i | c_k), the probability of term t_i given class c_k, may be
estimated from the relative frequency of t_i in the training documents
of class c_k.
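
A minimal sketch of this kind of Naive Bayes scoring in Java is shown below;
the class priors, training counts and vocabulary size are illustrative values,
not taken from the project, and log-probabilities are summed (with Laplace
smoothing) to avoid numerical underflow, in the same spirit as the
implementation later in this report.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveBayesSketch {
    // termCounts.get(c).get(t) = occurrences of term t in training documents of class c (illustrative)
    static Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    static Map<String, Double> priors = new HashMap<>();  // P(c), estimated from the training split
    static int vocabularySize = 1000;                     // illustrative value for Laplace smoothing

    static double logScore(String cls, List<String> docTerms) {
        Map<String, Integer> counts = termCounts.get(cls);
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        double score = Math.log(priors.get(cls));
        for (String t : docTerms) {
            int tf = counts.getOrDefault(t, 0);
            // Laplace-smoothed estimate of P(t | c)
            score += Math.log((tf + 1.0) / (total + vocabularySize));
        }
        return score;
    }

    public static void main(String[] args) {
        priors.put("exchanges", 0.3);
        priors.put("places", 0.7);
        termCounts.put("exchanges", new HashMap<>(Map.of("stock", 40, "nyse", 25)));
        termCounts.put("places", new HashMap<>(Map.of("usa", 60, "london", 30)));
        List<String> doc = Arrays.asList("stock", "nyse", "usa");
        // the predicted class is the one with the highest log score
        String best = priors.keySet().stream()
                .max((a, b) -> Double.compare(logScore(a, doc), logScore(b, doc))).get();
        System.out.println(best);
    }
}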

The Vector Space Model


The vector space model is a data model for representing documents
and queries in an Information Retrieval system. Every document and
query is represented by a vector whose dimensions, which are called
features, represent the words that occur within them [4]. In that
sense, each vector representing a document or query consists of a
set of features which denote words and the value of each feature is
the frequency or the number of occurrence of that particular word in
the document itself. Since an IR usually contains more than one
document, vectors are stacked together to form a matrix. Figure 2
shows a single vector populated with frequencies of the words
contained in document D.
To make things clearer, let's consider an example of finding the
occurrence of words information, processing, and language in three
documents Doc1, Doc2, and Doc3
After counting the occurrence of the three words in each of the three
documents, Doc1 is represented by the vector D1(1,2,1), Doc2 by the
vector D2(6,0,1), and Doc3 by the vector D3(0,5,1)
In order to emphasize the contribution of the higher-valued
features, those vectors are normalized. By normalization, we
simply mean converting all vectors to a standard length. This can be
done by dividing each dimension in a vector by the length of that
particular vector. The length of a vector can be calculated according
to the following equation: length = sqrt((ax * ax) + (ay * ay) + (az *
az)). Normalizing D1: length = sqrt((1 * 1) + (2 * 2) + (1 * 1)) = sqrt(6)
= 2.449. Now dividing each dimension by the length: 1/2.449 = 0.41 ;
2/2.449 = 0.81 ; 1/2.449 = 0.41. Final result would be D1(0.41, 0.81,
0.41). Same applies for D2 and D3 which will eventually result in
D2(0.98, 0, 0.16) and D3(0, 0.98, 0.19).
Now in order to determine the difference between two documents
or if a query matches a document, we must calculate the cosine of
the angles between the two vectors. When two documents are
identical (or when a query completely matches a document) they will
receive a cosine of 1; when they are orthogonal (share no common
terms) they will receive a cosine of 0.
Back to the previous example, let's consider a query with
corresponding normalized vector Q(0.57, 0.57, 0.57). The first task is
to compute the cosines between this vector and our three document
vectors.
Sim(D1,Q) = 0.41*0.57 + 0.81*0.57 + 0.41*0.57 = 0.92
Sim(D2,Q) = 0.65
Sim(D3,Q) = 0.67
The previous results show clearly that D1 is the closest match to Q,
followed by D3 and then D2 (recall that the closer the cosine is to 1,
the closer the two vectors are).
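
The worked example above can be reproduced with a few lines of Java; this is a
small illustrative sketch, not part of the project code.

import java.util.Arrays;

public class CosineExample {
    // cosine similarity between two vectors of equal length
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] d1 = {1, 2, 1};  // counts of "information", "processing", "language" in Doc1
        double[] d2 = {6, 0, 1};
        double[] d3 = {0, 5, 1};
        double[] q  = {1, 1, 1};  // query containing each of the three words once
        for (double[] d : Arrays.asList(d1, d2, d3)) {
            // prints roughly 0.94, 0.66 and 0.68; the small differences from the
            // 0.92, 0.65 and 0.67 above come from rounding the normalized values
            System.out.println(cosine(d, q));
        }
    }
}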

Term Graph Model


The term graph model is an improved version of the vector space
model [13]: it weights each term according to its relative
"importance" with regard to term associations. Specifically, a text
document Di is represented as a vector of term weights
Di = < w1i, ..., w|T|i >, where T is the ordered set of terms that
occur at least once in at least one document in the collection. Each
weight wji represents how much the corresponding term tj
contributes to the semantics of document di. Although a number of
weighting schemes have been proposed (e.g., boolean weighting,
frequency weighting, tf-idf weighting, etc.), those schemes
determine the weight of each term individually. As a result,
important and rich information regarding the relationships among
the terms is not captured in those weighting schemes.
We propose to determine the weight of each term in a document
collection by constructing a term graph. The basic steps are as
follows:
1. Preprocessing Step: For a collection of documents, extract all the
terms.
2. Graph Building Step:
(a) For each document, we view it as a transaction: the document ID
is the corresponding transaction ID; the terms contained in the
document are the items contained in the corresponding transaction.
Association rule mining algorithms can thus be applied to mine the
frequently co-occurring terms that occur more than minsup times in
the collection.
(b) The frequent co-occurring terms are mapped to a weighted and
directed graph, i.e., the term graph.

Preprocessing
In our term graph model, we will capture the relationships among
terms using the frequent itemset mining method. To do so, we
consider each text document in the training collections as a
transaction in which each word is an item. However, not all words in
the document are important enough to be retained in the
transaction.
To reduce the processing space as well as increase the accuracy of
our model, the text documents need to be preprocessed by (1)
removing stopwords, i.e., words that appear frequently in the
document but carry no essential meaning; and (2) retaining only the
root form of words by stemming away their suffixes and prefixes.

Graph Building

As mentioned above, we will capture the relationships among terms
using the frequent itemset mining method. While this idea has been
explored by previous research [9], our approach is distinguished from
previous approaches in that we maintain all such important
associations in a graph. The graph not only reveals the important
semantics of the document, but also provides a basis for extracting
novel features about the document, as we will show in the next section.
Frequent Itemset Mining. After the preprocessing step, each
document in the text collection is stored as a transaction (list of
items) in which each item (term) is represented by a unique
non-negative integer. Frequent itemset mining algorithms can then be
used to find all the subsets of items that appear more than a
threshold number of times (controlled by minsup) in the collection.
Graph Builder. In our system, our goal is to explore the relationships
among the important terms of the text in a category and try to
define a strategy to make use of these relationships in the classifier
and other text mining tasks. The vector space model cannot express
such rich relationships among terms. A graph is thus the most suitable
data structure in our context, as, in general, each term may be
associated with more than one term. We propose the following simple
method to construct the graph from the set of frequent itemsets
mined from the text collections. First, we construct a node for each
unique term that appears at least once in the frequent itemsets.
Then we create an edge between two nodes u and v if and only if they
are both contained in one frequent itemset. Furthermore, we assign
weights to the edges in the following way: the weight of the edge
between u and v is the largest support value among all the frequent
itemsets that contain both of them.
Example: consider the frequent itemsets and their absolute supports
shown in the figure below; the corresponding term graph is shown in
the figure beside it.
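
A minimal sketch of the graph-building step is shown below, assuming the
frequent itemsets and their supports have already been mined; the itemsets and
support values are illustrative.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TermGraphBuilder {
    public static void main(String[] args) {
        // frequent itemsets (sets of co-occurring terms) and their absolute supports (illustrative)
        Map<Set<String>, Integer> frequentItemsets = new HashMap<>();
        frequentItemsets.put(new HashSet<>(Arrays.asList("oil", "barrel")), 12);
        frequentItemsets.put(new HashSet<>(Arrays.asList("oil", "opec", "barrel")), 7);

        // edge weights of the term graph, keyed by "u|v" with the two terms in alphabetical order
        Map<String, Integer> edgeWeight = new HashMap<>();
        for (Map.Entry<Set<String>, Integer> e : frequentItemsets.entrySet()) {
            List<String> terms = new ArrayList<>(e.getKey());
            for (int i = 0; i < terms.size(); i++) {
                for (int j = i + 1; j < terms.size(); j++) {
                    String u = terms.get(i), v = terms.get(j);
                    String key = u.compareTo(v) < 0 ? u + "|" + v : v + "|" + u;
                    // keep the largest support among all itemsets containing both terms
                    edgeWeight.merge(key, e.getValue(), Math::max);
                }
            }
        }
        System.out.println(edgeWeight); // e.g. {barrel|oil=12, barrel|opec=7, oil|opec=7}
    }
}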

k-Nearest Neighbors

The initial application of k-Nearest Neighbors (k-NN) to text
categorization was reported by Masand and colleagues. The basic
idea is to determine the category of a given query based not only on
the document that is nearest to it in the document space, but on the
categories of the k documents that are nearest to it. With this in
mind, the Vector method can be viewed as an instance of the k-NN
method with k = 1.
This work uses a vector-based, distance-weighted matching function,
as did Yang, calculating document similarity as in the Vector
method. It then uses a voting strategy to find the query's class: each
retrieved document contributes a vote for its class, weighted by its
similarity to the query. The query's possible classifications are then
ranked according to the votes they received in the previous step.
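
A minimal sketch of the distance-weighted voting described above is shown
below; the neighbour similarities and category labels are illustrative.

import java.util.HashMap;
import java.util.Map;

public class KnnVoting {
    public static void main(String[] args) {
        // the k nearest training documents to the query, with their similarity and category (illustrative)
        double[] similarity = {0.92, 0.81, 0.40};
        String[] category   = {"places", "places", "topics"};

        // each neighbour votes for its category, weighted by its similarity to the query
        Map<String, Double> votes = new HashMap<>();
        for (int i = 0; i < similarity.length; i++) {
            votes.merge(category[i], similarity[i], Double::sum);
        }
        // the query is assigned the category with the largest accumulated vote
        String best = votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
        System.out.println(best + " " + votes);
    }
}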

Actual Implementation
For classifying the documents in Reuters-21578 we first pre-processed
the data using various techniques:

a. Bag of words
b. Stop word removal
c. Tf-idf
d. Case Folding
e. Normalisation

After pre-processing we applied the Naïve Bayes algorithm to
classify the documents in the training set into five category classes
(exchanges, organisations, people, places and topics). We then applied
our classifier model to the test documents and calculated the
accuracy by comparing the predictions with the default answers given
for the test documents.

After Naïve Bayes we implemented the Term Graph algorithm to
obtain a better classifier model. Again we checked its accuracy on
the test documents and compared the accuracy of the two
classification algorithms.
To evaluate and compare the algorithms we used the following measures.
Precision is defined as the fraction of the retrieved documents that
are relevant, and can be viewed as a measure of the system's
soundness:

Precision = (relevant documents retrieved) / (total documents retrieved)

Recall is defined as the fraction of the relevant documents that is
actually retrieved, and can be viewed as a measure of the system's
completeness:

Recall = (relevant documents retrieved) / (total relevant documents)

Accuracy, which is defined as the percentage of correctly classified
documents, is generally used to evaluate single-label TC tasks.

The Mean Reciprocal Rank (MRR) can be calculated for each individual
query document as the reciprocal of the rank at which the first
correct category was returned, or 0 if none of the first n choices
contained the correct category. The score for a sequence of
classification queries, considering the first n choices, is the mean of
the individual queries' reciprocal ranks:

MRR = (1 / |Q|) * sum over queries i of ( 1 / rank_i )

where rank_i is the rank of the first correct category for query i,
considering the first n categories returned by the system.
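
A minimal sketch of the MRR computation, assuming the rank of the first correct
category for each query is already known (0 meaning that no correct category
appeared among the first n choices):

public class MeanReciprocalRank {
    // ranks[i] = rank of the first correct category for query i, or 0 if none was found
    static double mrr(int[] ranks) {
        double sum = 0;
        for (int r : ranks) {
            if (r > 0) {
                sum += 1.0 / r;
            }
        }
        return sum / ranks.length;
    }

    public static void main(String[] args) {
        int[] ranks = {1, 3, 0, 2};      // illustrative ranks for four query documents
        System.out.println(mrr(ranks));  // (1 + 1/3 + 0 + 1/2) / 4 = 0.458...
    }
}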

We then created an application where the user can input keywords
and, using the algorithm with the highest accuracy, we show the most
relevant document to the user.

WORK FLOW
1. Split the news articles into a training set and a test set.
2. Convert the SGML files to text files.
3. Build a local dictionary for each document using the bag-of-words
   approach for keyword extraction.
4. Apply Naïve Bayes, Term Graph and k-Nearest Neighbour to predict the
   category of each test document.
5. Calculate the complexity and compare the accuracy of the algorithms.
6. Build an Information Retrieval application using the Vector Space
   Model.
Information Retrieval Application
In this application the user can enter keywords; based on those
keywords we show the relevant documents with the highest similarity
values, and on selecting one of the shown documents, its content is
displayed.

This application is based on the Vector Space Model for Information
Retrieval.

NAÏVE BAYES ALGORITHM

Formula used: following the Naïve Bayes section above, P(c_k | d_j) is
proportional to P(c_k) multiplied by the product of P(t_i | c_k) over the
keywords t_i of the document; the implementation below works with sums of
log-probabilities.

Steps:
1. Check each keyword in the test document and store it in a map.
2. Calculate the yes and no frequency of each keyword in the test
   document.
3. Calculate the probability of each keyword of the test document.
4. Classify the test document into the various categories on the basis
   of the calculated probabilities.
TERM GRAPH ALGORITHM

Steps:
1. Set each unique word occurring in the documents of a category as a
   node of the graph.
2. Build the adjacency matrix of the keywords.
3. Build the distance matrix using Dijkstra's algorithm.
4. Calculate the similarity between the test document keywords and the
   keywords of each category.
5. Classify the test document by choosing the category with the highest
   similarity value.

K-NEAREST NEIGHBOUR

Steps:
1. Make a vector for every document in the test set.
2. Make a centroid vector for each class.
3. Calculate the similarity between each document vector and each class
   vector.
4. The document belongs to the class for which the similarity is
   maximum.

VECTOR SPACE MODEL

Steps:
1. Make the query vector.
2. Make a document vector for each document.
3. Calculate the similarity between the query vector and each document
   vector.
4. The retrieved document is the one for which the similarity is
   maximum.
For Feature Selection (Tf-idf)

import java.io.*;
import java.util.*;

public class WordFrequencyCmd {
private String traindir;
private String traincsv;
private String traincsv2;
private String traincsv3;
private String hdfile;
private String clist;

public WordFrequencyCmd (String a, String b, String c, String d, String e, String f) {


this.traindir = a;
this.traincsv = b;
this.hdfile = c;
this.clist = d;
this.traincsv2 = e;
this.traincsv3 = f;
}
public void generatekeywords() {
Hashtable<String, Integer> result = new Hashtable<String, Integer>();
HashSet<String> words = new HashSet<String>();
List catg = new ArrayList(); //list to store all categories
File file = new File(traindir);
File[] inputFile=file.listFiles();

InputStream inp;
try { FileWriter writer = new FileWriter(traincsv);
inp = new FileInputStream(new File(hdfile));

Scanner sc = new Scanner(inp); // gets one word at a time from input


String word;
writer.append("doc-id");
writer.append(',');
writer.append("keyword-list");
writer.append(',');

while (sc.hasNext()) { // is there another word?

word = sc.next(); // get next word

catg.add(word);

writer.append(word);
writer.append(',');
//System.out.println(word);
} // close the loop over category header words

writer.append('\n');
String temp = null;
File ignoreFile = new File("E:\\Mining\\longstoplist.txt");
for(int j=0;j<inputFile.length;j++) {
BufferedReader br = new BufferedReader(new FileReader(inputFile[j]));
String line = "";
StringTokenizer st = null;
List keylist = new ArrayList();
while ((line = br.readLine()) != null) {
st = new StringTokenizer(line, " ");
while (st.hasMoreTokens()) {
temp = st.nextToken();
if (st.hasMoreTokens()) {
break;
}
else {
keylist.add(temp);
}
}
}

// WordCounter is the project's own utility class for counting word frequencies,
// not a library class
WordCounter counter = new WordCounter();
counter.ignore(ignoreFile);
counter.countWords(inputFile[j]);

String[] wrds = counter.getWords(WordCounter.SortOrder.BY_FREQUENCY);


int[] frequency = counter.getFrequencies(WordCounter.SortOrder.BY_FREQUENCY);
// System.out.println("for the"+j+"th file");
writer.append(inputFile[j].getName());
writer.append(',');
//... Display the results.
int n = counter.getEntryCount();
for (int i=0; i<n; i++) {
if(frequency[i]>1) {
//System.out.println(frequency[i] + " " + wrds[i]);
writer.append(wrds[i]+" "+frequency[i]);
writer.append("+");

//comparing the values from the hash set and table and if match increase to 1
if (words.contains(wrds[i]) == false) {
if (result.get(wrds[i]) == null)
result.put(wrds[i], 1);
else
result.put(wrds[i], result.get(wrds[i]) + 1);
words.add(wrds[i]);
}
else {
result.put(wrds[i], result.get(wrds[i]) + 1);
}
}
}

// code for yes/no


Iterator it = catg.iterator();
for (int i=0; i < catg.size(); i++) {
writer.append(',');
if(keylist.contains(it.next())) {
writer.append("yes");
}
else {
writer.append("no");
}
}

// System.out.println();

writer.append('\n');
} // close the loop over training documents
FileOutputStream out3;
PrintStream p3;
out3 = new FileOutputStream(clist);
p3 = new PrintStream( out3 );
for (Object o: result.entrySet() ) {
Map.Entry entry = (Map.Entry) o;
int val=Integer.parseInt(entry.getValue().toString());
String k=entry.getKey().toString();
if(val>4){
// System.out.println(k+" "+val);
p3.println(k+" "+val);
}
}

//writer.flush();
writer.close();

sc.close();

} catch (IOException iox) {


System.out.println(iox);
}
}

public void indocfreq(int nod) {


double idf = 0;
String word = null;
int df = 0;
Map m = new HashMap();
try {

BufferedReader br = new BufferedReader(new FileReader(clist));


String line = "";
StringTokenizer st = null;
while ((line = br.readLine()) != null) {
st = new StringTokenizer(line, " ");
while(st.hasMoreTokens()) {
word = st.nextToken();
df = Integer.parseInt(st.nextToken());
idf = Math.log((double) nod / df); // cast so the ratio is not truncated by integer division
//System.out.print(word+": "+idf+"\t");
m.put(word,idf);
}
}
FileWriter writer = new FileWriter(traincsv2);
FileWriter writer2 = new FileWriter(traincsv3);
String artname = null;
BufferedReader br2 = new BufferedReader(new FileReader(traincsv));
String line2 = null;
StringTokenizer st2 = null;
String y = br2.readLine();
writer.append(y).append("\n");
writer2.append(y).append("\n");
double wt = 0;

String temp2 = null;


int flag = 0;
while ((line2 = br2.readLine()) != null) {

st2 = new StringTokenizer(line2,",");


artname = st2.nextToken();
writer.append(artname).append(",");
writer2.append(artname).append(",");
String temp1 = st2.nextToken();
flag = 0;

String[] keyarr = temp1.split("[+\\s]");


for(int i=0; i < keyarr.length-1; i=i+2) {
flag = 1;

temp2 = keyarr[i];
if (m.get(temp2) != null) {

wt = (Double)m.get(keyarr[i]) * Double.parseDouble(keyarr[i+1]);

if (wt <= 15) {

writer.append(keyarr[i]+" "+Integer.parseInt(keyarr[i+1])+"+");
writer2.append(keyarr[i]+" "+wt+"+");
}
}
}
if (flag == 0) {
writer.append(",").append(temp1);
writer2.append(",").append(temp1);
}

while (st2.hasMoreTokens()) {
String z = st2.nextToken();
writer.append(",").append(z);
writer2.append(",").append(z);
}
writer.append("\n");
writer2.append("\n");
}
writer.close();
writer2.close();
br2.close();

} catch (IOException iox) {
System.out.println(iox);
}
}
}
For Naive Bayes on Exchange category

import java.io.*;
import java.util.*;

public class Naive_Exg {

private Map m2; //map for yes


private Map m3; //map for no
private String clname;
private int clnum;
private static int first = 0;

public Naive_Exg(String st, int x) {


this.clname = st;

this.m2 = new HashMap();


this.m3 = new HashMap();
this.clnum = x;
}
public void generatemaps() throws IOException {

String csvFile = "E:\\Mining\\training_csvs\\Exg_tf.csv";


BufferedReader br2 = new BufferedReader(new FileReader(csvFile));
String line = "";
line = br2.readLine(); // ignore the first line of headers
StringBuffer sb = new StringBuffer(); // buffer for keywords with frequency in yes cases
StringBuffer sb2 = new StringBuffer(); // buffer for keywords with frequency in no cases
String temp3 = null;
StringTokenizer st2 = null;
while ((line = br2.readLine()) != null) {
st2 = new StringTokenizer(line, ",");
int f = 0;
st2.nextToken(); //ignore docid
while (st2.hasMoreTokens()) {
temp3 = st2.nextToken();
if (temp3.equals("yes") || temp3.equals("no")) { // to ignore rest of the classes
f = 1;

}
if(f == 0) {
for(int i =0;i < clnum-1; i++){
st2.nextToken();
}
}
if(f == 0 && st2.nextToken().equals("yes")) {
sb.append(temp3);
}
else if(f==0){
sb2.append(temp3);
}
break;
}
}
String keys = sb.toString();
String nokeys = sb2.toString();
String[] keyarr = keys.split("[+\\s]");
String[] nokeyarr = nokeys.split("[+\\s]");
for (int i=0; i <(keyarr.length)-1; i=i+2) {

int temp5=Integer.parseInt(keyarr[i+1]);
if (m2.get(keyarr[i]) == null) {
m2.put(keyarr[i], temp5);
}

else {
m2.put(keyarr[i],(Integer)m2.get(keyarr[i])+ temp5);
}

}
for (int i=0; i <(nokeyarr.length)-1; i=i+2) {

int temp5=Integer.parseInt(nokeyarr[i+1]);
if (m3.get(nokeyarr[i]) == null) {
m3.put(nokeyarr[i], temp5);
}

else {
m3.put(nokeyarr[i],(Integer)m3.get(nokeyarr[i])+ temp5);
}

}
}

public int testarticle(double pyes, double pno) throws IOException {


double totalprobyes = 1;
double totalprobno = 1;
System.out.println("class is "+clname);
String csvFile2 = "E:\\Mining\\Naive_result\\Exg_result.csv";
BufferedReader br2 = new BufferedReader(new FileReader(csvFile2));
String line2 = "";
StringTokenizer st2 = null;

int numyes = 0;

StringBuffer outBuffer = new StringBuffer(1024);


outBuffer.append(br2.readLine()).append("\n");

String csvFile = "E:\\Mining\\test_csv\\forTesting.csv";


BufferedReader br = new BufferedReader(new FileReader(csvFile));
String line = "";
StringTokenizer st = null;

line = br.readLine();
int numofart = 0;

while ((line = br.readLine()) != null) {

st = new StringTokenizer(line, ",");


String artname = st.nextToken();

if(first == 0) {
outBuffer.append(artname).append(",");
numofart++;

}
else {
numofart++;
if((line2 = br2.readLine()) != null) {
st2 = new StringTokenizer(line2, ",");
while(st2.hasMoreTokens()) {
outBuffer.append(st2.nextToken()).append(",");
}
}
}

if(st.hasMoreTokens()) {
String temp1 = st.nextToken();
String[] keyarr = temp1.split("[+\\s]");
//for yes
String temp2 = null;
double temp3 = 0;
double x = 0;
double y = 0;
double probyes = 0;
double probno = 0;
for(int i=0; i < keyarr.length-1; i=i+2) {
temp2 = keyarr[i];
temp3 = Integer.parseInt(keyarr[i+1]);
if (m2.get(temp2) != null) {
x = (Integer)m2.get(temp2);
}
else {
x = 0;
}
if (m3.get(temp2) != null) {
y = (Integer)m3.get(temp2);
}
else {
y = 0;
}
probyes = probyes + ( (temp3) * (Math.log((x+1)/(x+y+38))) );
probno = probno + ( (temp3) * (Math.log((y+1)/(x+y+38))) );
}
totalprobyes = Math.abs(pyes + probyes);
totalprobno = Math.abs(pno + probno);
if(totalprobyes > totalprobno && totalprobyes > 500) {

outBuffer.append("yes");
numyes++;
}
else {

outBuffer.append("no");
}
}
outBuffer.append("\n");
}

String out = outBuffer.toString();


try {
FileWriter writer = new FileWriter("E:\\Mining\\Naive_result\\Exg_result.csv");
writer.append(out);
writer.close();
}catch (IOException iox) {
System.out.println(iox);
}
first = 1;
System.out.println(numofart);
return numyes;
}
}
For Term Graph on Exchange category

import java.io.*;
import java.util.*;

public class TermGraph_Exg {


private int[][] adj;
private int [][] T;
private int clnum;
private ArrayList unikeywords;
private String clname;
private int siz;
private int nVerts;
private int[] next;
private int current_edge_weight;
public TermGraph_Exg(String name,int x) {
this.clname = name;
this.clnum = x;
this.unikeywords = new ArrayList();
this.current_edge_weight = 0;
}

public void makeAdj(){


try {
// for unique keywords list
String temp1 = null;
BufferedReader br = new BufferedReader(new
FileReader("E:\\Mining\\training_csvs\\Exg2_tfidf.csv"));
String line = br.readLine(); //avoid first line
StringTokenizer st = null;
while ((line = br.readLine()) != null) {
st = new StringTokenizer(line, ",");
st.nextToken(); // avoid article name
temp1 = st.nextToken();
if (temp1.equals("yes") || temp1.equals("no")) { // check for keyword list
for(int i =0; i < clnum-1-1; i++){
st.nextToken();
}
}
else {
for(int i =0; i < clnum-1; i++){
st.nextToken();
}
}
if(st.nextToken().equals("yes")) {
String[] keyarr = temp1.split("[+\\s]");
for(int i=0; i < keyarr.length-1; i=i+2) {
if( !(unikeywords.contains(keyarr[i])) ) {
unikeywords.add(keyarr[i]);
}
}

}
} // close the loop over lines of the training csv
br.close();
this.siz = unikeywords.size();
this.adj = new int [siz][siz];
this.nVerts = siz;
this.next = new int[siz];
this.T = new int [siz][siz];
for (int i=0; i < siz; i++){
for(int j=0; j < siz; j++) {
adj[i][j] = T[i][j] = 0;
}
}
for(int i=0; i < nVerts; i++) { // initialize next neighbor
next[i]=-1;
}

//for adjacency matrix


int m = 0;
int n = 0;
BufferedReader br2 = new BufferedReader(new
FileReader("E:\\Mining\\training_csvs\\Exg2_tfidf.csv"));
String line2 = br2.readLine();
StringTokenizer st2 = null;
while ((line2 = br2.readLine()) != null) {
st2 = new StringTokenizer(line2, ",");
st2.nextToken(); // avoid article name
temp1 = st2.nextToken();
if (temp1.equals("yes") || temp1.equals("no")) { // check for keyword list
for(int i =0; i < clnum-1-1; i++){
st2.nextToken();
}
}
else {
for(int i =0; i < clnum-1; i++){
st2.nextToken();
}
}
if(st2.nextToken().equals("yes")) {
String[] keyarr = temp1.split("[+\\s]");
for(int i=0; i < keyarr.length-1; i=i+2) {
for(int j=i+2; j < keyarr.length-1; j=j+2) {
m = unikeywords.indexOf(keyarr[i]);
n = unikeywords.indexOf(keyarr[j]);
if(m > -1 && n > -1) {
adj[m][n] = adj[n][m] = 1;
}

}
}
}

}
br2.close();
for (int i=0; i < siz; i++){
for(int j=0; j < siz; j++) {
}
}

}
catch (IOException iox) {
System.out.println(iox);
}
}

//graph functions
public int vertices() {
return nVerts; // return the number of vertices
}

public int edgeLength(int a, int b) {


return adj[a][b]; // return the edge length
}

public int nextneighbor(int v) {

next[v] = next[v] + 1; // initialize next[v] to the next neighbor

if(next[v] < nVerts) {


while(adj[v][next[v]] == 0 && next[v] < nVerts) {
next[v] = next[v] + 1; // initialize next[v] to the next neighbor

if(next[v] == nVerts)
break;
}
}

if(next[v] >= nVerts) {


next[v]=-1; // reset to -1
current_edge_weight = -1;
}
else {
current_edge_weight = adj[v][next[v]];
}

return next[v]; // return next neighbor of v to be processed


}

public void resetnext() {


for (int i=0; i < nVerts; i++) // reset the array next to all -1's
next[i] = -1;
}

public void dijkstra_function(int s) throws IOException {


int u, v;
int [] dist = new int[nVerts];
for(v=0; v<nVerts; v++) {
dist[v] = 99999; // 99999 represents infinity

}
dist[s] = 0;
// PriorityQueue here is the project's own min-priority queue (with Empty,
// Delete_root and Update), not java.util.PriorityQueue
PriorityQueue Q = new PriorityQueue(dist);
while(Q.Empty() == 0) {
u = Q.Delete_root();
v = nextneighbor(u);

while(v != -1) { // for each neighbor of u
if(dist[v] > dist[u] + edgeLength(u,v)) {
dist[v] = dist[u] + edgeLength(u,v);
Q.Update(v, dist[v]);
}
v = nextneighbor(u); // get the next neighbor of u
}
}
// copy the final shortest-path distances from source s into row s of the distance matrix T
for(int col=0; col<nVerts; col++) {
T[s][col] = dist[col];
}
}
public void makeTermGraph() throws IOException {


for(int i = 0; i < siz; i++) {
dijkstra_function(i);
}
for(int i = 0; i < siz; i++) {
for(int j = 0; j < siz; j++) {
}

}
}

public double testarticle(String x){


String[] keyarr = x.split("[+\\s]");
double sim = 0;
double n = 0;
double w = 0;
int u = 0;
int v = 0;
for(int i=0; i < keyarr.length-1; i=i+2) {
for(int j=i+2; j < keyarr.length-1; j=j+2) {
u = unikeywords.indexOf(keyarr[i]);
v = unikeywords.indexOf(keyarr[j]);
if(u > -1 && v > -1) {
w = w + Math.pow(T[u][v],2); // accumulate the squared graph distance for this keyword pair
n = n+1;
}

}
}
sim = n/w;

return (sim);
}
}
k-nearest neighbour on Exchange Category

import java.io.*;
import java.util.*;

public class knn {
private Map<String, Double> m2;
private String clname;
private int clnum;
private static ArrayList unikeywords = new ArrayList();
private Map<String, Double> docvec;
private String docname;
private Map<String, Double> qvec;

public knn(String st, int x) {


this.clname = st;
this.m2 = new HashMap();
this.qvec =new HashMap();
this.clnum = x;
}
public static void setList() throws IOException {

BufferedReader br = new BufferedReader(new FileReader("E:\\Mining\\commonlist\\exg.txt"));


String line = "";
StringTokenizer st = null;
while ((line = br.readLine()) != null) {
st = new StringTokenizer(line, " ");
while(st.hasMoreTokens()) {
unikeywords.add(st.nextToken());
st.nextToken();

}
}
br.close();
}
public void generatemaps() throws IOException {
String csvFile = "E:\\Mining\\training_csvs\\Exg3_wts.csv";
BufferedReader br2 = new BufferedReader(new FileReader(csvFile));
String line = "";
line = br2.readLine(); // ignore the first line of headers
StringBuffer sb = new StringBuffer();  // buffer for keywords with frequency in "yes" cases
StringBuffer sb2 = new StringBuffer(); // buffer for keywords with frequency in "no" cases (used below)
String temp3 = null;
StringTokenizer st2 = null;
while ((line = br2.readLine()) != null) {
st2 = new StringTokenizer(line, ",");
int f = 0;
st2.nextToken(); //ignore docid
while (st2.hasMoreTokens()) {
temp3 = st2.nextToken();
if (temp3.equals("yes") || temp3.equals("no")) { // to ignore rest of the classes
f = 1;
// System.out.println(temp3);
}
//System.out.println("temp3 is "+temp3);
for(int i =0;i < clnum-1; i++){
st2.nextToken();
}
if(f == 0 && st2.nextToken().equals("yes")) {
sb.append(temp3);
//System.out.print(temp3+" ");

}
else if(f==0){
sb2.append(temp3);
}
break;
}
}
String keys = sb.toString();
String[] keyarr = keys.split("[+\\s]");
for(int i=0; i<unikeywords.size(); i++) {
m2.put((String)unikeywords.get(i),0.0);
}

for (int i=0; i <(keyarr.length)-1; i=i+2) {

Double temp5=Double.parseDouble(keyarr[i+1]);
if (m2.get(keyarr[i]) == null) {
m2.put(keyarr[i], temp5);
}

else {
m2.put(keyarr[i],(Double)m2.get(keyarr[i])+ temp5);
}

}
for (int i=0; i <(keyarr.length)-1; i=i+2)
{if (m2.get(keyarr[i]) != null)
m2.put(keyarr[i],(Double)m2.get(keyarr[i])/276);
}
// System.out.println("centroid :"+m2);
}

public void setQuevec(String q) {


// docname = aname;
// System.out.println(aname);
String[] keyar = q.split("[+\\s]");
for(int i=0; i<unikeywords.size(); i++) {
qvec.put((String)unikeywords.get(i),0.0);
}
int temp1 = 0;
for(int i=0; i < (keyar.length)-1; i=i+2) {
// System.out.println(keyar[i]);
// temp1 = qvec.get(keyar[i]);
Double temp5=Double.parseDouble(keyar[i+1]);
if (qvec.get(keyar[i]) == null) {
qvec.put(keyar[i], temp5);
}

else {
qvec.put(keyar[i],(Double)qvec.get(keyar[i])+ temp5);
}
}

// System.out.println("testvector: "+qvec);
}
public double calcsim() {
double sim = 0;
double dprod = 0;
double dmag = 0;
double qmag = 0;
double sumofsq = 0;
for (Map.Entry<String, Double> entry : m2.entrySet()) {
// System.out.println("hey"+entry.getKey());
dprod = dprod + entry.getValue() * qvec.get(entry.getKey());

// System.out.println(dprod);
}
// System.out.println("hey");
for (Map.Entry<String, Double> entry2 : m2.entrySet()) {
sumofsq = sumofsq + Math.pow(entry2.getValue(),2);
}
dmag = Math.sqrt(sumofsq);

sumofsq = 0;
for (Map.Entry<String, Double> entry3 : qvec.entrySet()) {
sumofsq = sumofsq + Math.pow(entry3.getValue(),2);
}
qmag = Math.sqrt(sumofsq);

sim = dprod/(dmag*qmag);
return sim;
}
public static void main(String[] args) throws IOException {
setList();
InputStream inp;
List catg = new ArrayList();
inp = new FileInputStream(new File("E:\\Mining\\headerfiles\\all-exchanges.txt"));
Scanner sc = new Scanner(inp); // gets one word at a time from input
String word = null;

//System.out.printf("My Little Program%n%n");

while (sc.hasNext()) { // is there another word?


word = sc.next(); // get next word
catg.add(word);
} // close the loop that reads the category names

knn [] obj = new knn [catg.size()];


int j = 0;
String csvFile = "E:\\Mining\\test_csv\\forTesting.csv";
BufferedReader br = new BufferedReader(new FileReader(csvFile));

String line = "";


StringTokenizer st = null;
StringBuffer sb=new StringBuffer();
line = br.readLine();
String artname=null;
Iterator it2 = catg.iterator();
String name = null;
String temp1 = null;
FileWriter writer = new FileWriter("E:\\Mining\\knn.csv");

writer.append("doc-id");

// System.out.println(catg.size());
for (int i=0; i < catg.size(); i++) {
name = (String)catg.get(i);
obj[i] = new knn(name,i+1);
writer.append(",");
writer.append(catg.get(i).toString());
}
writer.append("\n");
while ((line = br.readLine()) != null) {
double s = 0.0;
double value=0.0;

st = new StringTokenizer(line,",");
while(st.hasMoreTokens()) {
artname = st.nextToken();
// System.out.println(artname);
writer.append(artname).append(",");

if(st.hasMoreTokens()) {
temp1 = st.nextToken();
int index=0;
for (int i=0; i < catg.size(); i++) {
obj[i].setQuevec(temp1);
// System.out.println(catg.get(i));
obj[i].generatemaps();

s = obj[i].calcsim();
// int flag;
if(s>value)
{ value=s;
index=i;

} //System.out.println("class name : "+catg.get(index));
} // close the loop over categories; index now holds the best-matching class

System.out.println(value);
if(value!=0.0)
{for(int p=0;p<index;p++)
writer.append("no").append(",");
writer.append("yes").append(",");
for(int p=index+1;p<=catg.size()-1;p++)
writer.append("no").append(",");
writer.append("\n");
}
else
{
for (int i=0; i < catg.size(); i++)
{
writer.append("no").append(",");
}
writer.append("\n");
}

}
else
{
for (int i=0; i < catg.size(); i++)
{
writer.append("no").append(",");
}
writer.append("\n");
}

} // close the loop over tokens of this line
} // close the loop over test documents
writer.close();
}
}

Vector Space Model for Information Retrieval

import java.io.*;
import java.util.*;

public class Vsm_Exg {
private static ArrayList unikeywords = new ArrayList();
private Map<String, Double> docvec;
private String docname;
private static Map<String, Double> qvec = new HashMap <String,Double>();

public Vsm_Exg() {
this.docname = null;
this.docvec = new HashMap <String,Double>();
}

public static void setList() throws IOException {

BufferedReader br = new BufferedReader(new FileReader("E:\\Mining\\commonlist\\exg.txt"));


String line = "";
StringTokenizer st = null;
while ((line = br.readLine()) != null) {
st = new StringTokenizer(line, " ");
while(st.hasMoreTokens()) {
unikeywords.add(st.nextToken());
st.nextToken();
//df = Integer.parseInt(st.nextToken());

}
}
br.close();
//System.out.println(unikeywords);
}

public static void setQuevec(String q) {


String[] keyar = q.split(" ");
for(int i=0; i<unikeywords.size(); i++) {
qvec.put((String)unikeywords.get(i),0.0);
}
double temp1 = 0;
for(int i=0; i < keyar.length; i++) {
if(qvec.get(keyar[i]) != null){
temp1 = qvec.get(keyar[i]);
qvec.put(keyar[i],temp1+1);
}
}
double normval = 0;
double dummy = 0;
for (Map.Entry<String, Double> entry : qvec.entrySet()) {
dummy = dummy + Math.pow(entry.getValue(),2);
}
normval = Math.sqrt(dummy);
for(int i=0; i<unikeywords.size(); i++) {
String x = (String)unikeywords.get(i);
double val = qvec.get(x)/normval ;
qvec.put(x,val);
}
System.out.println("Queryvector: "+qvec);
}
public void setDocvec(String aname, String keywrds) throws IOException {
docname = aname;
for(int i=0; i<unikeywords.size(); i++) {
docvec.put((String)unikeywords.get(i),0.0);
}
//System.out.println("Initial vector: "+docvec);
String[] keyarr = keywrds.split("[+\\s]");
for(int i=0; i < keyarr.length-1; i=i+2) {
docvec.put(keyarr[i],Double.parseDouble(keyarr[i+1]));
}
double normval = 0;
double dummy = 0;
for (Map.Entry<String, Double> entry : docvec.entrySet()) {
dummy = dummy + Math.pow(entry.getValue(),2);
}
normval = Math.sqrt(dummy);
for(int i=0; i<unikeywords.size(); i++) {
String x = (String)unikeywords.get(i);
double val = docvec.get(x)/normval ;
docvec.put(x,val);
}
System.out.println("Docvector: "+docvec);
}

public double calcsim() {


double sim = 0;
double dprod = 0;
double dmag = 0;
double qmag = 0;
double sumofsq = 0;
for (Map.Entry<String, Double> entry : docvec.entrySet()) {
dprod = dprod + entry.getValue() * qvec.get(entry.getKey());
}
for (Map.Entry<String, Double> entry2 : docvec.entrySet()) {
sumofsq = sumofsq + Math.pow(entry2.getValue(),2);
}
dmag = Math.sqrt(sumofsq);

sumofsq = 0;
for (Map.Entry<String, Double> entry3 : qvec.entrySet()) {
sumofsq = sumofsq + Math.pow(entry3.getValue(),2);
}
qmag = Math.sqrt(sumofsq);
sim = dprod/(dmag*qmag);
return sim;
}
}
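
A minimal driver for the class above might look like the following. It is a
sketch only: it assumes that the hard-coded keyword file read by setList()
exists, that the document keywords appear in that common keyword list
(otherwise calcsim() would hit a missing query entry), and that document
keywords arrive in the "term weight+term weight" format that setDocvec()
splits on; the file name, query and keyword weights are made up.

public class VsmDriver {
    public static void main(String[] args) throws java.io.IOException {
        Vsm_Exg.setList();                 // load the common keyword list from exg.txt
        Vsm_Exg.setQuevec("oil venture");  // plain keywords entered by the user

        Vsm_Exg doc = new Vsm_Exg();
        // keywords of one document with their weights, in "term weight+term weight" form (illustrative)
        doc.setDocvec("article-42.txt", "oil 3+venture 1");
        System.out.println(doc.calcsim()); // cosine similarity between the query and the document
    }
}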

Accuracy Calculation

import java.io.*;
import java.util.*;

public class CalAccuracy_TermGraph {
private String csvFile;
private String csvFile2;
private String catname;
public CalAccuracy_TermGraph(String a, String b, String c){
csvFile = a;
csvFile2 = b;
catname = c;
}
public void acc() throws IOException{
System.out.println(catname+":");
BufferedReader br = new BufferedReader(new FileReader(csvFile));
String line = "";
StringTokenizer st = null;

BufferedReader br2 = new BufferedReader(new FileReader(csvFile2));


String line2 = "";
StringTokenizer st2 = null;
line = br.readLine();
line2 = br2.readLine();
String temp1 = null;
String temp2 = null;
double a = 0;
double b = 0;
double c = 0;
double d = 0;
double accuracy = 0;
double precision = 0;
double recall = 0;
double f = 0;
while ((line = br.readLine()) != null && (line2 = br2.readLine()) != null) {
st = new StringTokenizer(line, ",");
st2 = new StringTokenizer(line2, ",");
st.nextToken();
st2.nextToken();
while (st2.hasMoreTokens()) {
temp1 = st.nextToken(); //actual
temp2 = st2.nextToken(); //predicted
if (temp1.equals("yes") && temp2.equals("yes")) {
a++;

}
if (temp1.equals("yes") && temp2.equals("no")) {
b++;

}
if (temp1.equals("no") && temp2.equals("yes")) {
c++;

}
if (temp1.equals("no") && temp2.equals("no")) {
d++;

}
}
}
accuracy = (((a+d) * 100)/(a+b+c+d));
precision = (a)/(a+c);
recall = (a)/(a+b);
f = (2*precision*recall)/(precision+recall);
System.out.println(a);
System.out.println(b);
System.out.println(c);
System.out.println(d);

System.out.println("accuracy : "+accuracy+"% precision : "+precision*100+"% recall : "+recall*100+"% f : "+f);
}
}
Result:
We compared the accuracy of Naïve Bayes, Term Graph and k-NN for text
classification of the articles of Reuters-21578. As shown in the bar
graph, k-NN gives the best results, with accuracies (in %) as follows:

Category        KNN      Naïve Bayes    Term Graph
Exchange        98.00    74.68          97.41
Organisation    98.51    51.43          98.23
People                   33.19          99.61
Topics                   81.80          99.19
Places                   72.23          99.19

Our project is about categorizing news articles into various
categories. We have built a web application where the user can enter
some keywords (in the form of a query) and we retrieve the article
relevant to those keywords by applying the Vector Space Model.

Conclusion:
We conclude that k-NN shows the maximum accuracy compared to Naive
Bayes and Term Graph.
The drawback of k-NN is its high time complexity, but it gives better
accuracy than the others.
We used tf-idf with the Term Graph rather than the traditional Term
Graph used with AFOPT. This hybrid shows a better result than the
traditional combination.
Finally, we built an INFORMATION RETRIEVAL APPLICATION using the
Vector Space Model, which answers the query entered by the client by
showing the relevant document.
Screenshots:

Output of feature generation
Screenshot showing the keywords with their respective frequencies for
each document, and stating whether the document belongs to the header
class or not.
Screenshot showing the keywords with their respective weights calculated
using the tf-idf algorithm for each document, and stating whether the
document belongs to the header class or not.

Output of Naive Bayes
Showing the documents and keywords classified on the application of
Naïve Bayes.

Output for Term Graph
Showing the keywords and the document classification obtained by
applying Term Graph.

Output for k-nearest neighbour

Accuracy output
Showing the accuracy of the various categories obtained by applying
Term Graph.

Output for Vector Space Model

Output for Vector Space Model with GUI:
Showing the application where the user enters a query and we output the
appropriate article relating to the query using the Vector Space Model.

Future Work
In future we will focus on:

a. Reducing complexity
b. Increasing accuracy
c. Text summarization

Similar applications are used in Yahoo! alerts, where relevant
documents are shown when the user enters keywords.
References

http://www.informatik.uni-hamburg.de/WTM/ps/coling-232.pdf

http://web.mit.edu/6.863/www/fall2012/projects/writeups/newspaper-article-classifier.pdf

http://jatit.org/volumes/research-papers/Vol3No2/9vol3.pdf

Improved kNN Classification Algorithm Research in Text Categorization -- Lijun Wang, Xiquing Zhao

A Comparison of Event Models for Naïve Bayes Text Classification -- Andrew McCallum, Kamal Nigam

An Improved TF-IDF Approach for Text Classification (2004) -- Zhang Yun-tao, Gong Ling, Wang Yong-cheng

Term Graph Model for Text Classification -- Wei Wang, Diep Bich Do, and Xuemin Lin
