## Introduction

[Fasttext](https://github.com/facebookresearch/fastText/) is a C++ library for text representation and classification from facebookresearch. It provides two capabilities: text classification and word-embedding learning.

## mynlp-fasttext

mynlp includes a Java implementation of fasttext, written in Kotlin. Its features:

* 100% pure Java implementation
* Compatible with the original model file format

  The various pre-trained models published by the fasttext project can be loaded directly.
* Compatible with the original product-quantization compressed models
* The Java version also provides a training API (performance comparable to the original)
* Supports its own storage format
* In this storage format, model files can be loaded via mmap

The official Chinese Wikipedia model is 2.8 GB; running it requires a JVM with at least 4 GB of heap, and loading takes a long time. With mmap, the model file loads in about 3 seconds using only a small amount of memory.
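That mmap load path can be sketched as follows, assuming the official model has first been converted to the library's own format with `saveModel` (the path below is illustrative):

```java
// Load a model previously written by saveModel(). Passing true enables
// mmap, so the file is paged in on demand rather than read fully into
// the JVM heap -- this is what makes the ~3-second load possible.
FastText fastText = FastText.loadModel("data/fasttext/wiki.model", true);
```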
| 19 | + |
## Building and Installing

The library has not yet been published to Maven Central; it is available from mayabot's public repository.

Add the Maven repository in Gradle:
```
repositories {
    maven {
        url = "https://nexus.mayabot.com/content/groups/public/"
    }
}
```

### Gradle
```
compile 'com.mayabot:mynlp-fasttext:1.1.0'
```

### Maven
```
<dependency>
    <!-- mynlp-fasttext @ https://mynlp.info/ -->
    <groupId>com.mayabot</groupId>
    <artifactId>mynlp-fasttext</artifactId>
    <version>1.1.0</version>
</dependency>
```

## API
```java

/**
 * Predict classification labels with a supervised (sup) model
 */
List<FloatStringPair> predict(Iterable<String> tokens, int k)

/**
 * Nearest-neighbor search (similar-word search)
 * @param word
 * @param k the k most similar words
 */
List<FloatStringPair> nearestNeighbor(String word, int k)

/**
 * Analogy search
 * Query triplet (A - B + C)?
 */
List<FloatStringPair> analogies(String A, String B, String C, int k)

/**
 * Look up the vector for a given word
 */
Vector getWordVector(String word)

/**
 * Get the vector representation of a phrase
 */
Vector getSentenceVector(Iterable<String> tokens)

/**
 * Save the word vectors in text format
 */
void saveVectors(String fileName)

/**
 * Save the model in binary format
 */
void saveModel(String file)

/**
 * Train a model
 * @param trainFile
 * @param model_name
 *   sg  skipgram   word vectors via the skipgram algorithm
 *   cow cbow       word vectors via the cbow algorithm
 *   sup supervised text classification
 * @param args training parameters
 **/
FastText FastText.train(File trainFile, ModelName model_name, TrainArgs args)

/**
 * Load a model saved by the saveModel method
 * @param file
 * @param mmap whether to load the model file via mmap, which lets
 *             large model files load quickly with limited memory
 */
FastText.loadModel(String file, boolean mmap)

/**
 * Load a model file saved by the official facebook C++ program;
 * supports both bin and ftz models
 */
FastText.loadFasttextBinModel(String binFile)
```

## Example use cases

### 1. Learning word representations
```java
File file = new File("data/fasttext/data.txt");

FastText fastText = FastText.train(file, ModelName.sg);

fastText.saveModel("data/fasttext/model.bin");
```
data.txt is the training file, stored in UTF-8. The text must be pre-tokenized, with tokens separated by spaces. By default, character ngrams of length 3-6 are used.
Besides the sg algorithm, you can also use the cow (cbow) algorithm. If you need to set more parameters, pass a TrainArgs object.
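The TrainArgs overload might be used as in this sketch; the setter names here are assumptions modeled on the CLI argument names listed under "TrainArgs" below, not confirmed API:

```java
File file = new File("data/fasttext/data.txt");

// Hypothetical setters modeled on -dim, -epoch, -minn, -maxn.
TrainArgs args = new TrainArgs();
args.setDim(200);
args.setEpoch(10);
args.setMinn(2);
args.setMaxn(5);

FastText fastText = FastText.train(file, ModelName.sg, args);
```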

### 2. Training a classification model
```java
File file = new File("data/fasttext/data.txt");

FastText fastText = FastText.train(file, ModelName.sup);

fastText.saveModel("data/fasttext/model.bin");
```
data.txt is likewise a UTF-8 file with one example per line, also pre-tokenized. Each line contains a string prefixed with ```__label__```, marking the example's classification target, e.g. ```__label__正面```; an example may carry multiple labels. You can set the label property of TrainArgs to specify a custom prefix.
Once you have the model, use the predict method to predict classification results.
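For example, a minimal prediction sketch against the model saved above (the input tokens are illustrative):

```java
// Reload the supervised model and ask for the top 3 labels.
// predict expects pre-tokenized input; each FloatStringPair
// pairs a label with its score.
FastText model = FastText.loadModel("data/fasttext/model.bin", false);
List<FloatStringPair> labels = model.predict(Arrays.asList("服务 很 周到".split(" ")), 3);
```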

### 3. Loading an official model file and re-saving it in the Java model format
```java
FastText fastText = FastText.loadFasttextBinModel("data/fasttext/wiki.zh.bin");
fastText.saveModel("data/fasttext/wiki.model");
```

### 4. Classification prediction
```java
// predict takes a pre-tokenized input
FastText fastText = FastText.loadFasttextBinModel("data/fasttext/wiki.zh.bin");
List<FloatStringPair> predict = fastText.predict(Arrays.asList("fastText在预测标签时使用了非线性激活函数".split(" ")), 5);
```

### 5. Nearest-neighbor queries
```java
FastText fastText = FastText.loadFasttextBinModel("data/fasttext/wiki.zh.bin");

List<FloatStringPair> predict = fastText.nearestNeighbor("中国", 5);
```
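fastText ranks nearest neighbors by cosine similarity between word vectors. As a self-contained illustration of that metric, with plain float arrays standing in for the library's Vector type:

```java
public class CosineSimilarity {
    // Cosine similarity between two vectors: dot(a, b) / (|a| * |b|).
    public static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        float[] a = {1f, 0f};
        float[] b = {0f, 1f};
        float[] c = {2f, 0f};
        System.out.println(cosine(a, b)); // orthogonal -> 0.0
        System.out.println(cosine(a, c)); // same direction -> 1.0
    }
}
```

A score of 1.0 means the vectors point in the same direction; 0.0 means they are orthogonal (unrelated).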

### 6. Analogies
Given three words A, B, and C, return the words closest in meaning to (A - B + C), together with their similarity scores.
```java
FastText fastText = FastText.loadFasttextBinModel("data/fasttext/wiki.zh.bin");

List<FloatStringPair> predict = fastText.analogies("国王", "皇后", "男", 5);
```

## TrainArgs and related parameters
The Java version's parameters match the C++ version's. For reference:
```
The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]
```
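These arguments might map onto a supervised training run as in this sketch; the TrainArgs setter names are assumptions modeled on the argument names above, not confirmed API:

```java
// Hypothetical configuration: hierarchical softmax loss and word
// bigrams, which the C++ version enables with -loss hs -wordNgrams 2.
TrainArgs args = new TrainArgs();
args.setLoss("hs");
args.setWordNgrams(2);

FastText model = FastText.train(new File("data/fasttext/data.txt"), ModelName.sup, args);
```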

## Resources
### Official pre-trained models
Recent state-of-the-art [English word vectors](https://fasttext.cc/docs/en/english-vectors.html).<br/>
Word vectors for [157 languages trained on Wikipedia and Crawl](https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md).<br/>
Models for [language identification](https://fasttext.cc/docs/en/language-identification.html#content) and [various supervised tasks](https://fasttext.cc/docs/en/supervised-models.html#content).

## References

Please cite [1](#enriching-word-vectors-with-subword-information) if using this code for learning word representations or [2](#bag-of-tricks-for-efficient-text-classification) if using for text classification.

### Enriching Word Vectors with Subword Information

[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)

```
@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}
```

### Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)

```
@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}
```

### FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: Compressing text classification models*](https://arxiv.org/abs/1612.03651)

```
@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}
```

(\* These authors contributed equally.)

## License

fastText is BSD-licensed. [Facebook holds patents](https://github.com/facebookresearch/fastText/blob/master/PATENTS) covering it.