Commit 9cb0398

update doc
1 parent bc3ff70 commit 9cb0398

1 file changed, +94 -102 lines changed

README.md

Lines changed: 94 additions & 102 deletions
@@ -1,127 +1,119 @@
 Objective
-=========
+---------
 This is a platform written in Python that applies various data-mining algorithms to a data source (such as a matrix or text documents).
 Algorithms can be chained through an XML configuration so that they run one after another. For example, you might first run PCA (principal component analysis) for feature selection and then run a random forest for classification.
 For now, the algorithms are mainly designed for tasks that fit on a single computer. The architecture is easy to extend, so you can implement the algorithm you want in a short time and use it in your project (believe me, it's faster, better, and easier than Weka). Another important feature is that the platform supports text classification and clustering well.
 
 
 Getting started
-=========
-Just write code like this and you will get an amazing result (naive Bayes training and testing):
---------------------------------------------------------------------------------------------
-```
-# load the configuration from an XML file
-config = Configuration.FromFile("conf/test.xml")
-GlobalInfo.Init(config, "__global__")
-
-# init the module that creates a matrix from a text file
-txt2mat = Text2Matrix(config, "__matrix__")
-
-# create the training matrix (with tags) from the text file "train.txt"
-[trainx, trainy] = txt2mat.CreateTrainMatrix("data/train.txt")
-
-# create a chi-square filter from the config file
-chiFilter = ChiSquareFilter(config, "__filter__")
-
-# fit the filter model on the training matrix
-chiFilter.TrainFilter(trainx, trainy)
-
-# filter the training matrix
-[trainx, trainy] = chiFilter.MatrixFilter(trainx, trainy)
-
-# train the naive Bayes model
-nbModel = NaiveBayes(config, "naive_bayes")
-nbModel.Train(trainx, trainy)
-
-# create the matrix for testing
-[testx, testy] = txt2mat.CreatePredictMatrix("data/test.txt")
-
-# apply the chi-square filter to the test matrix
-[testx, testy] = chiFilter.MatrixFilter(testx, testy)
-
-# test the matrix and get the result (saved in resultY) and the precision
-[resultY, precision] = nbModel.Test(testx, testy)
-
-print(precision)
-```
-
-You also need to define some parameters in a configuration file (it will be loaded by ```Configuration.FromFile(...xml)```).
-```
-<config>
-  <__segmenter__>
-    <main_dict>dict/dict.main</main_dict>
-  </__segmenter__>
-
-  <__matrix__>
-  </__matrix__>
-
-  <__global__>
-    <term_to_id>mining/term_to_id</term_to_id>
-    <id_to_term>mining/id_to_term</id_to_term>
-    <id_to_doc_count>mining/id_to_doc_count</id_to_doc_count>
-    <class_to_doc_count>mining/class_to_doc_count</class_to_doc_count>
-    <id_to_idf>mining/id_to_idf</id_to_idf>
-    <newid_to_id>mining/newid_to_id</newid_to_id>
-  </__global__>
-
-  <__filter__>
-    <rate>0.2</rate>
-    <method>max</method>
-    <log_path>mining/filter.log</log_path>
-    <model_path>mining/filter.model</model_path>
-  </__filter__>
-
-  <naive_bayes>
-    <model_path>mining/naive_bayes.model</model_path>
-    <log_path>mining/naive_bayes.log</log_path>
-  </naive_bayes>
-
-  <twc_naive_bayes>
-    <model_path>mining/naive_bayes.model</model_path>
-    <log_path>mining/naive_bayes.log</log_path>
-  </twc_naive_bayes>
-
-  <smo_csvc>
-    <model_path>mining/smo_csvc.model</model_path>
-    <log_path>mining/smo_csvc.log</log_path>
-    <c>100</c>
-    <eps>0.001</eps>
-    <tolerance>0.000000000001</tolerance>
-    <times>50</times>
-    <kernel>
-      <name>RBF</name>
-      <parameters>10</parameters>
-    </kernel>
-    <cachesize>300</cachesize>
-  </smo_csvc>
-</config>
-
-```
+---------
+*Just write code like this and you will get an amazing result (naive Bayes training and testing):*
+
+    # load the configuration from an XML file
+    config = Configuration.FromFile("conf/test.xml")
+    GlobalInfo.Init(config, "__global__")
+
+    # init the module that creates a matrix from a text file
+    txt2mat = Text2Matrix(config, "__matrix__")
+
+    # create the training matrix (with tags) from the text file "train.txt"
+    [trainx, trainy] = txt2mat.CreateTrainMatrix("data/train.txt")
+
+    # create a chi-square filter from the config file
+    chiFilter = ChiSquareFilter(config, "__filter__")
+
+    # fit the filter model on the training matrix
+    chiFilter.TrainFilter(trainx, trainy)
+
+    # filter the training matrix
+    [trainx, trainy] = chiFilter.MatrixFilter(trainx, trainy)
+
+    # train the naive Bayes model
+    nbModel = NaiveBayes(config, "naive_bayes")
+    nbModel.Train(trainx, trainy)
+
+    # create the matrix for testing
+    [testx, testy] = txt2mat.CreatePredictMatrix("data/test.txt")
+
+    # apply the chi-square filter to the test matrix
+    [testx, testy] = chiFilter.MatrixFilter(testx, testy)
+
+    # test the matrix and get the result (saved in resultY) and the precision
+    [resultY, precision] = nbModel.Test(testx, testy)
+
+    print(precision)
+
+
+You also need to define some parameters in a configuration file (it will be loaded by Configuration.FromFile(...xml)).
+
+    <config>
+      <__segmenter__>
+        <main_dict>dict/dict.main</main_dict>
+      </__segmenter__>
+
+      <__matrix__>
+      </__matrix__>
+
+      <__global__>
+        <term_to_id>mining/term_to_id</term_to_id>
+        <id_to_term>mining/id_to_term</id_to_term>
+        <id_to_doc_count>mining/id_to_doc_count</id_to_doc_count>
+        <class_to_doc_count>mining/class_to_doc_count</class_to_doc_count>
+        <id_to_idf>mining/id_to_idf</id_to_idf>
+        <newid_to_id>mining/newid_to_id</newid_to_id>
+      </__global__>
+
+      <__filter__>
+        <rate>0.2</rate>
+        <method>max</method>
+        <log_path>mining/filter.log</log_path>
+        <model_path>mining/filter.model</model_path>
+      </__filter__>
+
+      <naive_bayes>
+        <model_path>mining/naive_bayes.model</model_path>
+        <log_path>mining/naive_bayes.log</log_path>
+      </naive_bayes>
+
+      <twc_naive_bayes>
+        <model_path>mining/naive_bayes.model</model_path>
+        <log_path>mining/naive_bayes.log</log_path>
+      </twc_naive_bayes>
+
+      <smo_csvc>
+        <model_path>mining/smo_csvc.model</model_path>
+        <log_path>mining/smo_csvc.log</log_path>
+        <c>100</c>
+        <eps>0.001</eps>
+        <tolerance>0.000000000001</tolerance>
+        <times>50</times>
+        <kernel>
+          <name>RBF</name>
+          <parameters>10</parameters>
+        </kernel>
+        <cachesize>300</cachesize>
+      </smo_csvc>
+    </config>
 
 Features
 ========
-Clustering algorithm
---------------------
+*Clustering algorithm*
 + KMeans
 
-Classification algorithm
-------------------------
+*Classification algorithm*
 + Random forest
 + Naive Bayes
 + TWC Naive Bayes
 + SVM
 
-Feature selector
-----------------
+*Feature selector*
 + Chi-square
 + PCA
 
-Mathematics
------------
+*Mathematics*
 + Basic operations (like bagging, transpose, etc.)
 
-Data source support
--------------------
+*Data source support*
 + Matrix (CSV format)
 + Raw text (currently Chinese only; English support to be added)
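
The `__filter__` section of the configuration in this diff sets `rate` to 0.2 and `method` to `max`, but the README does not say what either value means. The following is a minimal sketch of chi-square feature selection under two assumptions: that `rate` is the fraction of terms kept, and that `max` scores each term by its largest chi-square value over all classes. The functions `chi_square_scores` and `select_features` are hypothetical names used for illustration; they are not the platform's `ChiSquareFilter` API.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Return {term: max chi-square score over all classes}.

    docs   -- list of sets of terms (term presence per document)
    labels -- list of class labels, same length as docs
    """
    n = len(docs)
    class_count = Counter(labels)
    term_count = Counter()
    term_class_count = Counter()
    for terms, label in zip(docs, labels):
        for term in terms:
            term_count[term] += 1
            term_class_count[(term, label)] += 1

    scores = {}
    for term, n_term in term_count.items():
        best = 0.0
        for label, n_class in class_count.items():
            n11 = term_class_count[(term, label)]  # in class, has term
            n10 = n_term - n11                     # other class, has term
            n01 = n_class - n11                    # in class, lacks term
            n00 = n - n_term - n01                 # other class, lacks term
            denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
            if denom:
                chi2 = n * (n11 * n00 - n10 * n01) ** 2 / denom
                best = max(best, chi2)
        scores[term] = best
    return scores


def select_features(scores, rate):
    """Keep the top `rate` fraction of terms (at least one), best first."""
    keep = max(1, int(len(scores) * rate))
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:keep]


# Toy data (invented for this example): two classes, two documents each.
docs = [{"goal", "match"}, {"goal", "team"},
        {"stock", "market"}, {"stock", "bank"}]
labels = ["sport", "sport", "finance", "finance"]

scores = chi_square_scores(docs, labels)
selected = select_features(scores, rate=0.34)
print(selected)
```

Terms that appear in every document of one class and in none of the other receive the highest score, so they survive the cut first. Ties are broken alphabetically here only to make the output deterministic.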

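
The configuration file in this commit groups parameters under one element per module (`__filter__`, `naive_bayes`, `smo_csvc`, and so on), and each module is constructed with the name of its section. How `Configuration.FromFile` exposes those values is not shown in the README, so the sketch below only illustrates the idea with the standard library's `xml.etree.ElementTree`. The helper `load_section` is hypothetical, and the XML string is a shortened copy of the file in the diff.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the configuration shown in the README diff.
CONFIG_XML = """
<config>
  <__filter__>
    <rate>0.2</rate>
    <method>max</method>
  </__filter__>
  <smo_csvc>
    <c>100</c>
    <kernel>
      <name>RBF</name>
      <parameters>10</parameters>
    </kernel>
  </smo_csvc>
</config>
"""


def _to_dict(node):
    """Convert an element's children into a (possibly nested) dict of strings."""
    result = {}
    for child in node:
        if len(child):  # the child has its own children, e.g. <kernel>
            result[child.tag] = _to_dict(child)
        else:
            result[child.tag] = (child.text or "").strip()
    return result


def load_section(root, section):
    """Return the parameters of one named section, or raise KeyError."""
    node = root.find(section)
    if node is None:
        raise KeyError(section)
    return _to_dict(node)


root = ET.fromstring(CONFIG_XML)
filter_conf = load_section(root, "__filter__")
svm_conf = load_section(root, "smo_csvc")

# Values arrive as strings; each module converts what it needs.
rate = float(filter_conf["rate"])
kernel_name = svm_conf["kernel"]["name"]
print(rate, filter_conf["method"], kernel_name)
```

Keeping one section per module means several algorithms can share a single file, which fits the README's description of running algorithms one after another from an XML configuration.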