|
1 | 1 | Objective |
2 | | -========= |
| 2 | +--------- |
3 | 3 | This is a platform written in Python that can apply various data-mining algorithms to get results from a source (such as a matrix or text documents). |
4 | 4 | Algorithms can be chained through an XML configuration so that they run one after another. E.g. we may first run PCA (principal component analysis) for feature selection, then run random forest for classification. |
5 | 5 | For now, the algorithms are mainly designed for tasks that can be done on a single computer. The architecture is built so that you can implement the algorithm you want in a very short time and use it in your project (believe me, it's faster, better, and easier than Weka). Another important feature is that this platform supports text classification and clustering very well. |
6 | 6 |
|
7 | 7 |
|
8 | 8 | Getting started |
9 | | -========= |
10 | | -Just write code like this, you will get amazing result (a naive-bayes training and testing), |
11 | | --------------------------------------------------------------------------------------------- |
12 | | -``` |
13 | | -# load configuratuon from xml file |
14 | | -config = Configuration.FromFile("conf/test.xml") |
15 | | -GlobalInfo.Init(config, "__global__") |
16 | | -
|
17 | | -# init module that can create matrix from text file |
18 | | -txt2mat = Text2Matrix(config, "__matrix__") |
19 | | -
|
20 | | -# create matrix for training (with tag) from a text file "train.txt" |
21 | | -[trainx, trainy] = txt2mat.CreateTrainMatrix("data/train.txt") |
22 | | -
|
23 | | -# create a chisquare filter from config file |
24 | | -chiFilter = ChiSquareFilter(config, "__filter__") |
25 | | -
|
26 | | -# get filter model from training matrix |
27 | | -chiFilter.TrainFilter(trainx, trainy) |
28 | | -
|
29 | | -# filter training matrix |
30 | | -[trainx, trainy] = chiFilter.MatrixFilter(trainx, trainy) |
31 | | -
|
32 | | -# train naive bayes model |
33 | | -nbModel = NaiveBayes(config, "naive_bayes") |
34 | | -nbModel.Train(trainx, trainy) |
35 | | -
|
36 | | -# create matrix for test |
37 | | -[testx, testy] = txt2mat.CreatePredictMatrix("data/test.txt") |
38 | | -
|
39 | | -# using chisquare filter do filtering |
40 | | -[testx, testy] = chiFilter.MatrixFilter(testx, testy) |
41 | | -
|
42 | | -# test matrix and get result (save in resultY) and precision |
43 | | -[resultY, precision] = nbModel.Test(testx, testy) |
44 | | -
|
45 | | -print precision |
46 | | -``` |
47 | | - |
48 | | -And you need define some paramters in a configuration file (It will be loaded by ```Configuration.FromFile(...xml)```). |
49 | | -``` |
50 | | -<config> |
51 | | - <__segmenter__> |
52 | | - <main_dict>dict/dict.main</main_dict> |
53 | | - </__segmenter__> |
54 | | -
|
55 | | - <__matrix__> |
56 | | - </__matrix__> |
57 | | -
|
58 | | - <__global__> |
59 | | - <term_to_id>mining/term_to_id</term_to_id> |
60 | | - <id_to_term>mining/id_to_term</id_to_term> |
61 | | - <id_to_doc_count>mining/id_to_doc_count</id_to_doc_count> |
62 | | - <class_to_doc_count>mining/class_to_doc_count</class_to_doc_count> |
63 | | - <id_to_idf>mining/id_to_idf</id_to_idf> |
64 | | - <newid_to_id>mining/newid_to_id</newid_to_id> |
65 | | - </__global__> |
66 | | -
|
67 | | - <__filter__> |
68 | | - <rate>0.2</rate> |
69 | | - <method>max</method> |
70 | | - <log_path>mining/filter.log</log_path> |
71 | | - <model_path>mining/filter.model</model_path> |
72 | | - </__filter__> |
73 | | -
|
74 | | - <naive_bayes> |
75 | | - <model_path>mining/naive_bayes.model</model_path> |
76 | | - <log_path>mining/naive_bayes.log</log_path> |
77 | | - </naive_bayes> |
78 | | -
|
79 | | - <twc_naive_bayes> |
80 | | - <model_path>mining/naive_bayes.model</model_path> |
81 | | - <log_path>mining/naive_bayes.log</log_path> |
82 | | - </twc_naive_bayes> |
83 | | - |
84 | | - <smo_csvc> |
85 | | - <model_path>mining/smo_csvc.model</model_path> |
86 | | - <log_path>mining/smo_csvc.log</log_path> |
87 | | - <c>100</c> |
88 | | - <eps>0.001</eps> |
89 | | - <tolerance>0.000000000001</tolerance> |
90 | | - <times>50</times> |
91 | | - <kernel> |
92 | | - <name>RBF</name> |
93 | | - <parameters>10</parameters> |
94 | | - </kernel> |
95 | | - <cachesize>300</cachesize> |
96 | | - </smo_csvc> |
97 | | -</config> |
98 | | -
|
99 | | -``` |
| 9 | +--------- |
| 10 | +*Write code like this to train and test a naive Bayes classifier:* |
| 11 | + |
| 12 | + # load configuration from xml file |
| 13 | + config = Configuration.FromFile("conf/test.xml") |
| 14 | + GlobalInfo.Init(config, "__global__") |
| 15 | + |
| 16 | + # init module that can create matrix from text file |
| 17 | + txt2mat = Text2Matrix(config, "__matrix__") |
| 18 | + |
| 19 | + # create matrix for training (with tag) from a text file "train.txt" |
| 20 | + [trainx, trainy] = txt2mat.CreateTrainMatrix("data/train.txt") |
| 21 | + |
| 22 | + # create a chisquare filter from config file |
| 23 | + chiFilter = ChiSquareFilter(config, "__filter__") |
| 24 | + |
| 25 | + # get filter model from training matrix |
| 26 | + chiFilter.TrainFilter(trainx, trainy) |
| 27 | + |
| 28 | + # filter training matrix |
| 29 | + [trainx, trainy] = chiFilter.MatrixFilter(trainx, trainy) |
| 30 | + |
| 31 | + # train naive bayes model |
| 32 | + nbModel = NaiveBayes(config, "naive_bayes") |
| 33 | + nbModel.Train(trainx, trainy) |
| 34 | + |
| 35 | + # create matrix for test |
| 36 | + [testx, testy] = txt2mat.CreatePredictMatrix("data/test.txt") |
| 37 | + |
| 38 | + # apply the chisquare filter to the test matrix |
| 39 | + [testx, testy] = chiFilter.MatrixFilter(testx, testy) |
| 40 | + |
| 41 | + # test matrix and get result (save in resultY) and precision |
| 42 | + [resultY, precision] = nbModel.Test(testx, testy) |
| 43 | + |
| 44 | + print precision |
| 45 | + |
| 46 | + |
| 47 | +You also need to define some parameters in a configuration file (it will be loaded by Configuration.FromFile(...xml)). |
| 48 | + |
| 49 | + <config> |
| 50 | + <__segmenter__> |
| 51 | + <main_dict>dict/dict.main</main_dict> |
| 52 | + </__segmenter__> |
| 53 | + |
| 54 | + <__matrix__> |
| 55 | + </__matrix__> |
| 56 | + |
| 57 | + <__global__> |
| 58 | + <term_to_id>mining/term_to_id</term_to_id> |
| 59 | + <id_to_term>mining/id_to_term</id_to_term> |
| 60 | + <id_to_doc_count>mining/id_to_doc_count</id_to_doc_count> |
| 61 | + <class_to_doc_count>mining/class_to_doc_count</class_to_doc_count> |
| 62 | + <id_to_idf>mining/id_to_idf</id_to_idf> |
| 63 | + <newid_to_id>mining/newid_to_id</newid_to_id> |
| 64 | + </__global__> |
| 65 | + |
| 66 | + <__filter__> |
| 67 | + <rate>0.2</rate> |
| 68 | + <method>max</method> |
| 69 | + <log_path>mining/filter.log</log_path> |
| 70 | + <model_path>mining/filter.model</model_path> |
| 71 | + </__filter__> |
| 72 | + |
| 73 | + <naive_bayes> |
| 74 | + <model_path>mining/naive_bayes.model</model_path> |
| 75 | + <log_path>mining/naive_bayes.log</log_path> |
| 76 | + </naive_bayes> |
| 77 | + |
| 78 | + <twc_naive_bayes> |
| 79 | + <model_path>mining/naive_bayes.model</model_path> |
| 80 | + <log_path>mining/naive_bayes.log</log_path> |
| 81 | + </twc_naive_bayes> |
| 82 | + |
| 83 | + <smo_csvc> |
| 84 | + <model_path>mining/smo_csvc.model</model_path> |
| 85 | + <log_path>mining/smo_csvc.log</log_path> |
| 86 | + <c>100</c> |
| 87 | + <eps>0.001</eps> |
| 88 | + <tolerance>0.000000000001</tolerance> |
| 89 | + <times>50</times> |
| 90 | + <kernel> |
| 91 | + <name>RBF</name> |
| 92 | + <parameters>10</parameters> |
| 93 | + </kernel> |
| 94 | + <cachesize>300</cachesize> |
| 95 | + </smo_csvc> |
| 96 | + </config> |
100 | 97 |
|
101 | 98 | Features |
102 | 99 | ======== |
103 | | -Clustering algorithm |
104 | | --------------------- |
| 100 | +*Clustering algorithm* |
105 | 101 | + KMeans |
106 | 102 |
|
107 | | -Classification algorithm |
108 | | ------------------------- |
| 103 | +*Classification algorithm* |
109 | 104 | + Random forest |
110 | 105 | + Naive Bayes |
111 | 106 | + TWC Naive Bayes |
112 | 107 | + SVM |
113 | 108 |
|
114 | | -Feature selector |
115 | | ----------------- |
| 109 | +*Feature selector* |
116 | 110 | + Chisquare |
117 | 111 | + PCA |
118 | 112 |
|
119 | | -Mathematic |
120 | | ----------- |
| 113 | +*Mathematics* |
121 | 114 | + Basic operations (like bagging, transpose, etc.) |
122 | 115 |
|
123 | | -Data source support |
124 | | -------------------- |
| 116 | +*Data source support* |
125 | 117 | + Matrix (in CSV format) |
126 | 118 | + Raw text (currently only Chinese is supported; English to be added) |
127 | 119 |
|
|