Commit 82a15b7

committed
M readme
1 parent 40fd8b6 commit 82a15b7

2 files changed

+262
-0
lines changed

README.md

Lines changed: 262 additions & 0 deletions
@@ -0,0 +1,262 @@
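## Introduction

[fastText](https://github.com/facebookresearch/fastText/) is a C++ library from Facebook Research for text representation and classification. It provides two features: text classification and word-embedding learning.

## mynlp-fasttext

mynlp provides a Java implementation of fastText, written in Kotlin. Its features:

* 100% pure-Java implementation
* Compatible with the original model files

  The pre-trained models published by the fastText project can be read directly.
* Compatible with the original product-quantization compressed models
* The Java version also provides a training API (with performance comparable to the original)
* Supports its own storage format
* In that storage format, model files can be read through mmap

  The official Chinese wiki model is 2.8 GB. Running it requires a JVM with at least 4 GB of memory, and loading takes a long time. With mmap, the model file loads in about 3 seconds and needs only a small amount of memory.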
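
To illustrate why mmap keeps heap usage low, here is a minimal sketch of the general technique in plain Java NIO. It is not the library's loader: the file layout (a bare row-major float matrix) and all names such as `MmapMatrixSketch` are assumptions made for this example. The mapped buffer is backed by the operating system's page cache, so only the rows that are actually read get paged in.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

class MmapMatrixSketch {

    /** Writes a rows x dim float matrix to a temp file; cell (r, c) holds r * 10 + c. */
    static Path writeTempMatrix(int rows, int dim) {
        try {
            Path file = Files.createTempFile("matrix", ".bin");
            file.toFile().deleteOnExit();
            ByteBuffer buf = ByteBuffer.allocate(rows * dim * Float.BYTES);
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < dim; c++) {
                    buf.putFloat(r * 10f + c);
                }
            }
            Files.write(file, buf.array());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Maps the file read-only; the mapping stays valid after the channel is closed. */
    static FloatBuffer mapMatrix(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asFloatBuffer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Copies one row (one word vector) out of the mapped matrix. */
    static float[] row(FloatBuffer matrix, int dim, int index) {
        float[] out = new float[dim];
        FloatBuffer view = matrix.duplicate(); // independent position, shared content
        view.position(index * dim);
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        Path file = writeTempMatrix(4, 3);
        FloatBuffer matrix = mapMatrix(file);
        System.out.println(Arrays.toString(row(matrix, 3, 2)));
    }
}
```

With the library itself, the equivalent switch is the `mmap` flag of `loadModel` described in the API section.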

## Building and Installing

The library is not yet published to Maven Central; it is available from mayabot's public repository.

Add the Maven repository to your Gradle build:
```
repositories {
    maven {
        url = "https://nexus.mayabot.com/content/groups/public/"
    }
}
```

### Gradle
```
compile 'com.mayabot:mynlp-fasttext:1.1.0'
```

### Maven
```
<dependency>
  <!-- mynlp-fasttext @ https://mynlp.info/ -->
  <groupId>com.mayabot</groupId>
  <artifactId>mynlp-fasttext</artifactId>
  <version>1.1.0</version>
</dependency>
```

## API
```java

/**
 * Predicts classification labels with a sup (supervised) model.
 */
List<FloatStringPair> predict(Iterable<String> tokens, int k)

/**
 * Nearest-neighbor search (similar-word search).
 * @param word the query word
 * @param k number of most similar words to return
 */
List<FloatStringPair> nearestNeighbor(String word, int k)

/**
 * Analogy search.
 * Query triplet (A - B + C)?
 */
List<FloatStringPair> analogies(String A, String B, String C, int k)

/**
 * Looks up the vector of the given word.
 */
Vector getWordVector(String word)

/**
 * Returns the vector representation of a phrase.
 */
Vector getSentenceVector(Iterable<String> tokens)

/**
 * Saves the word vectors in text format.
 */
saveVectors(String fileName)

/**
 * Saves the model in binary format.
 */
saveModel(String file)

/**
 * Trains a model.
 * @param trainFile the training file
 * @param model_name
 *  sg   skipgram    skip-gram algorithm for word vectors
 *  cbow cbow        CBOW algorithm for word vectors
 *  sup  supervised  text classification
 * @param args training arguments
 **/
FastText FastText.train(File trainFile, ModelName model_name, TrainArgs args)

/**
 * Loads a model saved by the saveModel method.
 * @param file the model file
 * @param mmap whether to load the model file through mmap, which loads large model files quickly with limited memory
 */
FastText.loadModel(String file, boolean mmap)


/**
 * Loads a model file saved by Facebook's official C++ program; both bin and ftz models are supported.
 */
FastText.loadFasttextBinModel(String binFile)
```

## Example use cases

### 1. Learning word vector representations
```java
File file = new File("data/fasttext/data.txt");

FastText fastText = FastText.train(file, ModelName.sg);

fastText.saveModel("data/fasttext/model.bin");
```
data.txt is the training file, stored as UTF-8. The words in the training text must be segmented in advance and separated by spaces. With the default settings, character n-grams of length 3 to 6 are used.
Besides the sg algorithm, you can also use the cbow algorithm. If you need more parameter settings, pass a TrainArgs object.
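
To make the character n-gram setting concrete, the following sketch enumerates the n-grams of a single word the way the fastText paper describes it: the word is wrapped in `<` and `>` boundary markers and every substring of length `minn` to `maxn` is taken. The class and method names are hypothetical, and details of the real implementation (such as hashing the n-grams into buckets) are left out.

```java
import java.util.ArrayList;
import java.util.List;

class CharNgramSketch {

    /** Returns the character n-grams of length minn..maxn of "<" + word + ">". */
    static List<String> charNgrams(String word, int minn, int maxn) {
        // Work on code points so that multi-byte characters count as one character.
        int[] cps = ("<" + word + ">").codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int start = 0; start < cps.length; start++) {
            for (int n = minn; n <= maxn && start + n <= cps.length; n++) {
                out.add(new String(cps, start, n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // With minn = maxn = 3, "where" yields: <wh, whe, her, ere, re>
        System.out.println(charNgrams("where", 3, 3));
        System.out.println(charNgrams("中国", 3, 6));
    }
}
```

Because a word's vector is built from its n-grams, words that never appeared in the training data can still receive a vector from the n-grams they share with known words.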
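
### 2. Training a classification model
```java
File file = new File("data/fasttext/data.txt");

FastText fastText = FastText.train(file, ModelName.sup);

fastText.saveModel("data/fasttext/model.bin");
```
data.txt is again a UTF-8 file with one example per line, and the text must also be segmented in advance. Each line contains a string prefixed with ```__label__``` that gives the classification target of that example, such as ```__label__正面``` (a label meaning "positive"). An example may carry several labels. You can set the label property of TrainArgs to use a custom prefix.
Once the model is trained, call the predict method to predict classification results.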
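
The training-line format can be sketched with a small parser that separates labels from tokens. This is only an illustration of the format described above, not the library's reader; `LabeledLineSketch` and the sample line are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class LabeledLineSketch {
    final List<String> labels = new ArrayList<>();
    final List<String> tokens = new ArrayList<>();

    /** Splits one pre-segmented line into labels (prefix removed) and ordinary tokens. */
    static LabeledLineSketch parse(String line, String labelPrefix) {
        LabeledLineSketch example = new LabeledLineSketch();
        for (String tok : line.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue; // blank line
            }
            if (tok.startsWith(labelPrefix)) {
                example.labels.add(tok.substring(labelPrefix.length()));
            } else {
                example.tokens.add(tok);
            }
        }
        return example;
    }

    public static void main(String[] args) {
        // Hypothetical training line with two labels and five tokens.
        LabeledLineSketch ex = parse("__label__正面 __label__电影 这 部 电影 很 好看", "__label__");
        System.out.println(ex.labels + " " + ex.tokens);
    }
}
```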

### 3. Loading an official model file and saving it in the Java model format
```java
FastText fastText = FastText.loadFasttextBinModel("data/fasttext/wiki.zh.bin");
fastText.saveModel("data/fasttext/wiki.model");
```

### 4. Classification prediction
```java
// predict takes the text after word segmentation, as a sequence of tokens
FastText fastText = FastText.loadCModel("data/fasttext/wiki.zh.bin");
List<FloatStringPair> predict = fastText.predict(Arrays.asList("fastText在预测标签时使用了非线性激活函数".split(" ")), 5);
```

### 5. Nearest neighbor query
```java
FastText fastText = FastText.loadCModel("data/fasttext/wiki.zh.bin");

List<FloatStringPair> predict = fastText.nearestNeighbor("中国",5);
```
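
Conceptually, a nearest-neighbor query ranks every other word by the similarity of its vector to the query word's vector. The sketch below shows that idea with cosine similarity over a tiny made-up vocabulary. The vectors, the class name and the brute-force scan are assumptions for illustration and say nothing about how the library implements the search.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NearestNeighborSketch {

    /** Made-up 2-dimensional vectors, for illustration only. */
    static Map<String, float[]> toyVectors() {
        Map<String, float[]> v = new LinkedHashMap<>();
        v.put("china", new float[] {0.9f, 0.1f});
        v.put("japan", new float[] {0.8f, 0.3f});
        v.put("apple", new float[] {0.1f, 0.9f});
        return v;
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Returns up to k words most similar to the given word, which must be in the map. */
    static List<String> nearest(Map<String, float[]> vectors, String word, int k) {
        float[] query = vectors.get(word);
        List<String> candidates = new ArrayList<>(vectors.keySet());
        candidates.remove(word); // the query word itself is not a neighbor
        candidates.sort((x, y) ->
                Double.compare(cosine(vectors.get(y), query), cosine(vectors.get(x), query)));
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        System.out.println(nearest(toyVectors(), "china", 2));
    }
}
```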

### 6. Analogies
Given three words A, B and C, returns a list of the words semantically closest to (A - B + C), together with their similarity scores.
```java
FastText fastText = FastText.loadCModel("data/fasttext/wiki.zh.bin");

List<FloatStringPair> predict = fastText.analogies("国王","皇后","",5);
```
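
The arithmetic behind an analogy query can be sketched as follows: build the query vector A - B + C, then pick the closest remaining word by cosine similarity. The two-dimensional vectors are invented so that the classic "king - man + woman" example works out exactly, and the sketch skips any normalization a real implementation may apply. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AnalogySketch {

    /** Made-up vectors: first axis is "royalty", second axis is "gender". */
    static Map<String, float[]> toyVectors() {
        Map<String, float[]> v = new LinkedHashMap<>();
        v.put("king", new float[] {1f, 1f});
        v.put("man", new float[] {0f, 1f});
        v.put("woman", new float[] {0f, -1f});
        v.put("queen", new float[] {1f, -1f});
        v.put("apple", new float[] {-1f, 0f});
        return v;
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Computes the query vector A - B + C component by component. */
    static float[] analogyQuery(float[] a, float[] b, float[] c) {
        float[] q = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            q[i] = a[i] - b[i] + c[i];
        }
        return q;
    }

    /** Returns the word closest to A - B + C, ignoring the three input words. */
    static String bestMatch(Map<String, float[]> vectors, String a, String b, String c) {
        float[] q = analogyQuery(vectors.get(a), vectors.get(b), vectors.get(c));
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            String w = e.getKey();
            if (w.equals(a) || w.equals(b) || w.equals(c)) {
                continue;
            }
            double score = cosine(e.getValue(), q);
            if (score > bestScore) {
                bestScore = score;
                best = w;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(bestMatch(toyVectors(), "king", "man", "woman"));
    }
}
```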

## TrainArgs and related parameters
The parameters of the Java version match those of the C++ version. For reference:
```
The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]
```
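
The `-bucket` parameter bounds the number of rows used for n-gram vectors: each n-gram is hashed and the hash is reduced modulo the bucket count. In the C++ fastText this is a 32-bit FNV-1a hash over the UTF-8 bytes, with each byte treated as signed. The sketch below reproduces that scheme in Java. Whether the Java port computes it byte for byte the same way is an assumption here (model-file compatibility suggests it does), and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class NgramBucketSketch {

    /** 32-bit FNV-1a over the UTF-8 bytes; each byte is sign-extended before the XOR. */
    static int fnv1a(String s) {
        int h = 0x811C9DC5; // offset basis 2166136261 as an unsigned 32-bit value
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= b;          // byte is promoted to int with sign extension
            h *= 16777619;   // FNV prime; int overflow wraps like uint32 arithmetic
        }
        return h;
    }

    /** Maps an n-gram to a bucket index in [0, bucket). */
    static int bucketOf(String ngram, int bucket) {
        return (int) (Integer.toUnsignedLong(fnv1a(ngram)) % bucket);
    }

    public static void main(String[] args) {
        System.out.println(bucketOf("<wh", 2000000));
        System.out.println(bucketOf("中国>", 2000000));
    }
}
```

A smaller bucket count shrinks the model at the cost of more n-grams sharing the same vector.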

## Resources
### Official pre-trained models
Recent state-of-the-art [English word vectors](https://fasttext.cc/docs/en/english-vectors.html).<br/>
Word vectors for [157 languages trained on Wikipedia and Crawl](https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md).<br/>
Models for [language identification](https://fasttext.cc/docs/en/language-identification.html#content) and [various supervised tasks](https://fasttext.cc/docs/en/supervised-models.html#content).

## References

Please cite [1](#enriching-word-vectors-with-subword-information) if using this code for learning word representations or [2](#bag-of-tricks-for-efficient-text-classification) if using for text classification.

### Enriching Word Vectors with Subword Information

[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)

```
@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}
```

### Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)

```
@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}
```

### FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: Compressing text classification models*](https://arxiv.org/abs/1612.03651)

```
@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}
```

(\* These authors contributed equally.)

## License

fastText is BSD-licensed. [Facebook holds patents](https://github.com/facebookresearch/fastText/blob/master/PATENTS)

readme.md

Whitespace-only changes.
