Minor Project II Report
Text Mining: Reuters-21578
SUBMITTED BY:
Aarshi Taneja (10104666)
Divya Gautam (10104673)
Nupur (10104676)
Shruti Jadon (10104776)
Batch: IT-B10
Group Code: DMB10G04
TABLE OF CONTENTS
Abstract
Results
Screenshots
References
Abstract
Text Categorization (TC), also known as Text Classification, is the task
of automatically classifying a set of text documents into different
categories from a predefined set. If a document belongs to exactly
one of the categories, it is a single-label classification task; otherwise,
it is a multi-label classification task. TC uses several tools from
Information Retrieval (IR) and Machine Learning (ML) and has
received much attention in recent years from both academic
researchers and industry developers.
Information Retrieval
Problem definition
Our project is about categorizing news articles from the Reuters-21578
collection into various categories. We work on two major scenarios.
Preprocessing applied
In this case, which is the most usual in TC, the weight of a term
in a document increases with the number of times the term occurs in
that document and decreases with the number of documents in the
collection in which the term occurs. In other words, a term is
important for a document if it appears there often, while its
importance is discounted if it is common across the whole collection.
This term-weighting approach is referred to as term
frequency/inverse document frequency (tf-idf). Formally, wij, the
weight of term ti for document dj, is defined as

wij = tfij x log(N / dfi)

where tfij is the number of times ti occurs in dj, dfi is the number of
documents that contain ti, and N is the total number of documents in
the collection.
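A minimal Java sketch of this weighting (illustrative only, not the project's exact code; it assumes documents are already tokenized into lists of terms):

import java.util.*;

public class TfIdfSketch {
    // weight of term t in document doc, given the whole collection
    static double tfIdf(String t, List<String> doc, List<List<String>> collection) {
        long tf = doc.stream().filter(t::equals).count();           // term frequency in the document
        long df = collection.stream()                                // number of documents containing the term
                            .filter(d -> d.contains(t)).count();
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) collection.size() / df);      // tf x idf
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("oil", "price", "rises"),
            Arrays.asList("oil", "exports", "fall"),
            Arrays.asList("stock", "price", "falls"));
        System.out.println(tfIdf("oil", docs.get(0), docs));    // term present in 2 of 3 documents
        System.out.println(tfIdf("rises", docs.get(0), docs));  // term present in 1 of 3 documents
    }
}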
Term Distributions
Stop Words
Words that are of little value in conveying the meaning of a document
and that happen to have a high frequency are dropped entirely during
the tokenization process. These words are called stop words and are
generally detected either by their high frequency or by matching them
against a dictionary. Below is a stop list of twenty-five semantically
non-selective words which are common.
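As an illustration, the sketch below uses the widely cited twenty-five-word stop list from Manning et al.'s Introduction to Information Retrieval (an assumption on our part, which may differ from the project's own list) and filters it out of a token stream:

import java.util.*;
import java.util.stream.Collectors;

public class StopWordSketch {
    // 25 semantically non-selective words (list taken from Manning et al.)
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
        "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
        "to", "was", "were", "will", "with"));

    // drop every token that appears in the stop list
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !STOP_WORDS.contains(t))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "bank", "raised", "its", "rate");
        System.out.println(removeStopWords(tokens));   // [bank, raised, rate]
    }
}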
Case-Folding
A typical strategy is to do case-folding by converting all uppercase
characters to lowercase. This is a form of word normalization in
which all words are reduced to a standard form; it equates Door with
door and university with UNIVERSITY. This works well in most cases;
however, problems arise when a proper noun such as Black is conflated
with the color black, or when the company name VISION is conflated
with the word vision. One remedy is to lowercase only words at the
beginning of a sentence and words located within titles and headings.
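A minimal sketch of the simple strategy (folding every token to lowercase); the selective remedy described above would instead lowercase only sentence-initial words and words inside titles and headings:

import java.util.*;
import java.util.stream.Collectors;

public class CaseFoldingSketch {
    // simple case folding: reduce every token to lowercase
    static List<String> foldCase(List<String> tokens) {
        return tokens.stream()
                     .map(t -> t.toLowerCase(Locale.ROOT))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(foldCase(Arrays.asList("Door", "UNIVERSITY", "Vision")));
        // [door, university, vision] -- note that the company name VISION is now conflated with "vision"
    }
}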
Relevant Algorithms/Techniques
Classification Methods
Naive Bayes
The Naive Bayes classifier has found its way into many applications
due to its simple principle yet powerful accuracy [13]. Bayesian
classifiers are based on a statistical principle: the presence or
absence of a word in a textual document determines the outcome of the
prediction. Each processed term is assigned a probability that it
belongs to a certain category; this probability is calculated from the
occurrences of the term in the training documents, whose categories
are already known. Once all these probabilities are calculated, a new
document can be classified according to the sum, over the terms
occurring in the document, of the probabilities of each category.
However, this classifier does not take the number of occurrences of a
term within a document into account, which is a potentially useful
additional source of information. The classifiers are called "naive"
because the algorithm assumes that all terms occur independently of
each other.
Given a set of r document vectors D = {d1, ..., dr}, classified along a
set C of q classes, C = {c1, ..., cq}, Bayesian classifiers estimate
the probability of each class ck given a document dj as

P(ck | dj) = P(ck) P(dj | ck) / P(dj)

and, under the naive independence assumption, P(dj | ck) is taken to
be the product of the per-term probabilities P(ti | ck) for the terms
ti occurring in dj.
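A small standalone sketch of this idea (illustrative only, not the project's code): it estimates smoothed term probabilities for two classes from tiny hand-made training documents and classifies a new document by comparing summed log-probabilities; unlike the presence/absence variant described above, it counts every token occurrence.

import java.util.*;

public class NaiveBayesSketch {
    // count how often each term occurs in the documents of one class
    static Map<String, Integer> count(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> d : docs)
            for (String t : d) counts.merge(t, 1, Integer::sum);
        return counts;
    }

    // log P(class) + sum of log P(term | class) with Laplace smoothing
    static double logProb(List<String> doc, Map<String, Integer> classCounts,
                          int classTotal, int vocabSize, double prior) {
        double lp = Math.log(prior);
        for (String t : doc)
            lp += Math.log((classCounts.getOrDefault(t, 0) + 1.0) / (classTotal + vocabSize));
        return lp;
    }

    public static void main(String[] args) {
        List<List<String>> yes = Arrays.asList(Arrays.asList("exchange", "stock", "nyse"),
                                               Arrays.asList("stock", "exchange", "listing"));
        List<List<String>> no  = Arrays.asList(Arrays.asList("oil", "price", "barrel"));
        Map<String, Integer> cy = count(yes), cn = count(no);
        int ty = cy.values().stream().mapToInt(Integer::intValue).sum();
        int tn = cn.values().stream().mapToInt(Integer::intValue).sum();
        Set<String> vocab = new HashSet<>(cy.keySet());
        vocab.addAll(cn.keySet());

        List<String> test = Arrays.asList("stock", "exchange");
        double py = logProb(test, cy, ty, vocab.size(), 2.0 / 3);
        double pn = logProb(test, cn, tn, vocab.size(), 1.0 / 3);
        System.out.println(py > pn ? "yes" : "no");   // prints "yes"
    }
}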
Preprocessing
In our term graph model, we will capture the relationships among
terms using the frequent itemset mining method. To do so, we
consider each text document in the training collections as a
transaction in which each word is an item. However, not all words in
the document are important enough to be retained in the
transaction.
To reduce the processing space as well as to increase the accuracy of
our model, the text documents need to be preprocessed by (1) removing
stop words, i.e., words that appear frequently in the documents but
carry no essential meaning, and (2) retaining only the root form of
words by stemming off their affixes (suffixes and prefixes).
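As an illustration of the transaction view, the sketch below (not the project's code) turns an already preprocessed document, i.e. one with stop words removed and terms stemmed, into the set of distinct terms that would act as items for frequent-itemset mining:

import java.util.*;

public class TransactionSketch {
    // a document becomes a transaction: the set of distinct (already stemmed) terms it contains
    static Set<String> toTransaction(List<String> preprocessedTokens) {
        return new TreeSet<>(preprocessedTokens);
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("stock", "exchang", "list", "stock", "exchang");
        System.out.println(toTransaction(doc));   // [exchang, list, stock]
    }
}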
Graph Building
k-Nearest Neighbors
Actual Implementation
For classifying the documents in Reuters-21578 we initially
pre-processed the data using the following techniques:
a. Bag of words
b. Stop word removal
c. Tf-idf
d. Case folding
e. Normalisation (see the sketch below)
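A minimal sketch of step (e), illustrative only: it length-normalizes a tf-idf document vector (dividing every weight by the vector's Euclidean length) so that long and short documents become comparable.

import java.util.*;

public class NormalisationSketch {
    // divide every tf-idf weight by the vector's Euclidean length (L2 normalisation)
    static Map<String, Double> normalise(Map<String, Double> vec) {
        double norm = Math.sqrt(vec.values().stream().mapToDouble(v -> v * v).sum());
        Map<String, Double> out = new HashMap<>();
        if (norm == 0) return out;
        vec.forEach((term, w) -> out.put(term, w / norm));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> vec = new HashMap<>();
        vec.put("oil", 3.0);
        vec.put("price", 4.0);
        System.out.println(normalise(vec));   // weights 0.6 and 0.8
    }
}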
WORK FLOW
The news articles are split into a training set and a test set; the
classifiers are built from the training set and used to classify the
test documents, after which the complexity of each algorithm is
calculated and the accuracies are compared.
FORMULA USED:
K-NEAREST NEIGHBOUR
The similarity between a test-document vector q and a category vector
d is computed with the cosine measure, sim(d, q) = (d . q) / (|d| |q|),
as implemented in calcsim() below.
InputStream inp;
try { FileWriter writer = new FileWriter(traincsv);
inp = new FileInputStream(new File(hdfile));
catg.add(word);
writer.append(word);
writer.append(',');
//System.out.println(word);
writer.append('\n');
String temp = null;
File ignoreFile = new File("E:\\Mining\\longstoplist.txt");
for(int j=0;j<inputFile.length;j++) {
BufferedReader br = new BufferedReader(new FileReader(inputFile[j]));
String line = "";
StringTokenizer st = null;
List<String> keylist = new ArrayList<String>();
while ((line = br.readLine()) != null) {
st = new StringTokenizer(line, " ");
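// the loop below keeps a token only when its line contains a single word; lines with more than one token are skipped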
while (st.hasMoreTokens()) {
temp = st.nextToken();
if (st.hasMoreTokens()) {
break;
}
else {
keylist.add(temp);
}
}
}
// count occurrences of each word: the first time a word is seen its count is set to 1, later occurrences increment it
if (words.contains(wrds[i]) == false) {
if (result.get(wrds[i]) == null)
result.put(wrds[i], 1);
else
result.put(wrds[i], result.get(wrds[i]) + 1);
words.add(wrds[i]);
}
else {
result.put(wrds[i], result.get(wrds[i]) + 1);
}
}
}
// System.out.println();
writer.append('\n');
FileOutputStream out3;
PrintStream p3;
out3 = new FileOutputStream(clist);
p3 = new PrintStream( out3 );
for (Object o: result.entrySet() ) {
Map.Entry entry = (Map.Entry) o;
int val=Integer.parseInt(entry.getValue().toString());
String k=entry.getKey().toString();
if(val>4){
// System.out.println(k+" "+val);
p3.println(k+" "+val);
}
}
//writer.flush();
writer.close();
sc.close();
temp2 = keyarr[i];
if (m.get(temp2) != null) {
wt = (Double)m.get(keyarr[i]) * Double.parseDouble(keyarr[i+1]);
writer.append(keyarr[i]+" "+Integer.parseInt(keyarr[i+1])+"+");
writer2.append(keyarr[i]+" "+wt+"+");
}
}
}
if (flag == 0) {
writer.append(",").append(temp1);
writer2.append(",").append(temp1);
}
while (st2.hasMoreTokens()) {
String z = st2.nextToken();
writer.append(",").append(z);
writer2.append(",").append(z);
}
writer.append("\n");
writer2.append("\n");
}
writer.close();
writer2.close();
br2.close();
}
if(f == 0) {
for(int i =0;i < clnum-1; i++){
st2.nextToken();
}
}
if(f == 0 && st2.nextToken().equals("yes")) {
sb.append(temp3);
}
else if(f==0){
sb2.append(temp3);
}
break;
}
String keys = sb.toString();
String nokeys = sb2.toString();
String[] keyarr = keys.split("[+\\s]");
String[] nokeyarr = nokeys.split("[+\\s]");
for (int i=0; i <(keyarr.length)-1; i=i+2) {
int temp5=Integer.parseInt(keyarr[i+1]);
if (m2.get(keyarr[i]) == null) {
m2.put(keyarr[i], temp5);
}
else {
m2.put(keyarr[i],(Integer)m2.get(keyarr[i])+ temp5);
}
}
for (int i=0; i <(nokeyarr.length)-1; i=i+2) {
int temp5=Integer.parseInt(nokeyarr[i+1]);
if (m3.get(nokeyarr[i]) == null) {
m3.put(nokeyarr[i], temp5);
}
else {
m3.put(nokeyarr[i],(Integer)m3.get(nokeyarr[i])+ temp5);
}
}
}
int numyes = 0;
line = br.readLine();
int numofart = 0;
if(first == 0) {
outBuffer.append(artname).append(",");
numofart++;
}
else {
numofart++;
if((line2 = br2.readLine()) != null) {
st2 = new StringTokenizer(line2, ",");
while(st2.hasMoreTokens()) {
outBuffer.append(st2.nextToken()).append(",");
}
}
}
if(st.hasMoreTokens()) {
String temp1 = st.nextToken();
String[] keyarr = temp1.split("[+\\s]");
//for yes
String temp2 = null;
double temp3 = 0;
double x = 0;
double y = 0;
double probyes = 0;
double probno = 0;
for(int i=0; i < keyarr.length-1; i=i+2) {
temp2 = keyarr[i];
temp3 = Integer.parseInt(keyarr[i+1]);
if (m2.get(temp2) != null) {
x = (Integer)m2.get(temp2);
}
else {
x = 0;
}
if (m3.get(temp2) != null) {
y = (Integer)m3.get(temp2);
}
else {
y = 0;
}
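// log of smoothed class-conditional term probabilities; the +1 numerator and +38 denominator act as Laplace-style smoothing constants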
probyes = probyes + ( (temp3) * (Math.log((x+1)/(x+y+38))) );
probno = probno + ( (temp3) * (Math.log((y+1)/(x+y+38))) );
}
totalprobyes = Math.abs(pyes + probyes);
totalprobno = Math.abs(pno + probno);
if(totalprobyes > totalprobno && totalprobyes > 500) {
outBuffer.append("yes");
numyes++;
}
else {
outBuffer.append("no");
}
}
outBuffer.append("\n");
}
}
br.close();
this.siz = unikeywords.size();
this.adj = new int [siz][siz];
this.nVerts = siz;
this.next = new int[siz];
this.T = new int [siz][siz];
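// adj: term-graph adjacency matrix; T: shortest-path distances from each source term, filled in later; next: next-neighbour index used during traversal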
for (int i=0; i < siz; i++){
for(int j=0; j < siz; j++) {
adj[i][j] = T[i][j] = 0;
}
}
for(int i=0; i < nVerts; i++) { // initialize next neighbor
next[i]=-1;
}
}
}
}
}
br2.close();
for (int i=0; i < siz; i++){
for(int j=0; j < siz; j++) {
}
}
}
catch (IOException iox) {
System.out.println(iox);
}
}
//graph functions
public int vertices() {
return nVerts; // return the number of vertices
}
if(next[v] == nVerts)
break;
}
}
}
dist[s] = 0;
PriorityQueue Q = new PriorityQueue(dist);
while(Q.Empty() == 0) {
u = Q.Delete_root();
v = nextneighbor(u);
}
for(int col=0; col<nVerts; col++) {
T[s][col] = dist[col];
}
}
}
}
}
sim = n/w;
return (sim);
}
}
br.close();
}
public void generatemaps() throws IOException {
String csvFile = "E:\\Mining\\training_csvs\\Exg3_wts.csv";
BufferedReader br2 = new BufferedReader(new FileReader(csvFile));
String line = "";
line = br2.readLine(); // ignore the first line of headers
StringBuffer sb = new StringBuffer();
String temp3 = null;
StringTokenizer st2 = null;
while ((line = br2.readLine()) != null) {
st2 = new StringTokenizer(line, ",");
int f = 0;
st2.nextToken(); //ignore docid
while (st2.hasMoreTokens()) {
temp3 = st2.nextToken();
if (temp3.equals("yes") || temp3.equals("no")) { // to ignore rest of the classes
f = 1;
// System.out.println(temp3);
}
//System.out.println("temp3 is "+temp3);
for(int i =0;i < clnum-1; i++){
st2.nextToken();
}
if(f == 0 && st2.nextToken().equals("yes")) {
sb.append(temp3);
//System.out.print(temp3+" ");
}
else if(f==0){
sb2.append(temp3);
}
break;
}
String keys = sb.toString();
String[] keyarr = keys.split("[+\\s]");
for(int i=0; i<unikeywords.size(); i++) {
m2.put((String)unikeywords.get(i),0.0);
}
for (int i=0; i <(keyarr.length)-1; i=i+2) {
Double temp5=Double.parseDouble(keyarr[i+1]);
if (m2.get(keyarr[i]) == null) {
m2.put(keyarr[i], temp5);
}
else {
m2.put(keyarr[i],(Double)m2.get(keyarr[i])+ temp5);
}
}
for (int i=0; i <(keyarr.length)-1; i=i+2)
{if (m2.get(keyarr[i]) != null)
m2.put(keyarr[i],(Double)m2.get(keyarr[i])/276);
}
// System.out.println("centroid :"+m2);
}
else {
qvec.put(keyar[i],(Double)qvec.get(keyar[i])+ temp5);
}
}
// System.out.println("testvector: "+qvec);
}
public double calcsim() {
double sim = 0;
double dprod = 0;
double dmag = 0;
double qmag = 0;
double sumofsq = 0;
for (Map.Entry<String, Double> entry : m2.entrySet()) {
// System.out.println("hey"+entry.getKey());
dprod = dprod + entry.getValue() * qvec.get(entry.getKey());
// System.out.println(dprod);
}
// System.out.println("hey");
for (Map.Entry<String, Double> entry2 : m2.entrySet()) {
sumofsq = sumofsq + Math.pow(entry2.getValue(),2);
}
dmag = Math.sqrt(sumofsq);
sumofsq = 0;
for (Map.Entry<String, Double> entry3 : qvec.entrySet()) {
sumofsq = sumofsq + Math.pow(entry3.getValue(),2);
}
qmag = Math.sqrt(sumofsq);
sim = dprod/(dmag*qmag);
return sim;
}
public static void main(String[] args) throws IOException {
setList();
InputStream inp;
List catg = new ArrayList();
inp = new FileInputStream(new File("E:\\Mining\\headerfiles\\all-exchanges.txt"));
Scanner sc = new Scanner(inp); // gets one word at a time from input
String word = null;
writer.append("doc-id");
// System.out.println(catg.size());
for (int i=0; i < catg.size(); i++) {
name = (String)catg.get(i);
obj[i] = new knn(name,i+1);
writer.append(",");
writer.append(catg.get(i).toString());
}
writer.append("\n");
while ((line = br.readLine()) != null) {
double s = 0.0;
double value=0.0;
st = new StringTokenizer(line,",");
while(st.hasMoreTokens()) {
artname = st.nextToken();
// System.out.println(artname);
writer.append(artname).append(",");
if(st.hasMoreTokens()) {
temp1 = st.nextToken();
int index=0;
for (int i=0; i < catg.size(); i++) {
obj[i].setQuevec(temp1);
// System.out.println(catg.get(i));
obj[i].generatemaps();
s = obj[i].calcsim();
// int flag;
if(s>value)
{ value=s;
index=i;
}
else
{
for (int j=0; j < catg.size(); j++)
{
writer.append("no").append(",");
}
writer.append("\n");
}
writer.close();
}
public Vsm_Exg() {
this.docname = null;
this.docvec = new HashMap <String,Double>();
}
}
}
br.close();
//System.out.println(unikeywords);
}
sumofsq = 0;
for (Map.Entry<String, Double> entry3 : qvec.entrySet()) {
sumofsq = sumofsq + Math.pow(entry3.getValue(),2);
}
qmag = Math.sqrt(sumofsq);
sim = dprod/(dmag*qmag);
return sim;
}
Accuracy Calculation
public class CalAccuracy_TermGraph {
private String csvFile;
private String csvFile2;
private String catname;
public CalAccuracy_TermGraph(String a, String b, String c){
csvFile = a;
csvFile2 = b;
catname = c;
}
public void acc() throws IOException{
System.out.println(catname+":");
BufferedReader br = new BufferedReader(new FileReader(csvFile));
String line = "";
StringTokenizer st = null;
}
if (temp1.equals("yes") && temp2.equals("no")) {
b++;
}
if (temp1.equals("no") && temp2.equals("yes")) {
c++;
}
if (temp1.equals("no") && temp2.equals("no")) {
d++;
}
}
}
// cast to double so that the ratios are not truncated by integer division
accuracy = ((double)(a + d) * 100) / (a + b + c + d);
precision = (double) a / (a + c);
recall = (double) a / (a + b);
f = (2 * precision * recall) / (precision + recall);
System.out.println(a);
System.out.println(b);
System.out.println(c);
System.out.println(d);
Result:
We compared the accuracy of Naïve Bayes, Term Graph and kNN for text
classification of the Reuters-21578 articles.
As shown in the bar graph (see the Screenshots section), kNN gives the
best result, with accuracies as follows:
FOR THE EXCHANGES CATEGORY
Algorithm      Accuracy (%)
KNN            98.00
NAÏVE BAYES    74.68
TERM GRAPH     97.41
Future Work
In the future we will focus on:
a. Reducing complexity
b. Increasing accuracy
c. Text summarization
Similar techniques are used in applications such as Yahoo! Alerts,
where relevant documents are shown when the user enters keywords.
References
http://www.informatik.uni-hamburg.de/WTM/ps/coling-232.pdf
http://web.mit.edu/6.863/www/fall2012/projects/writeups/newspaper-article-classifier.pdf
http://jatit.org/volumes/research-papers/Vol3No2/9vol3.pdf
Wei Wang, Diep Bich Do, and Xuemin Lin. Term Graph Model for Text Classification.