| Elasticsearch | Plugin | Release date |
| -------------- | -------------- | ------------ |
+ | 2.0.0 | 2.0.0.0 | Nov 12, 2015 |
| 2.0.0-beta2 | 2.0.0-beta2.0 | Sep 19, 2015 |
| 1.6.0 | 1.6.0.0 | Jul 1, 2015 |
| 1.4.0 | 1.4.4.2 | Apr 3, 2015 |

## Installation Elasticsearch 2.x

- ./bin/plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/2.0.0-beta2.0/elasticsearch-langdetect-2.0.0-beta2.0-plugin.zip
+ ./bin/plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/2.0.0.0/elasticsearch-langdetect-2.0.0.0-plugin.zip

Do not forget to restart the node after installing.

@@ -112,75 +113,155 @@ All feedback is welcome! If you find issues, please post them at [Github](https:

# Examples

- ## Language detection mapping example
+ ## A simple language detection example

- curl -XDELETE 'localhost:9200/test'
+ In this example, we create a simple detector field, and write text to it for detection.

- curl -XPUT 'localhost:9200/test'
+ curl -XDELETE 'localhost:9200/test'

- curl -XPOST 'localhost:9200/test/article/_mapping' -d '
- {
- "article" : {
- "properties" : {
- "content" : { "type" : "langdetect" }
- }
- }
- }
- '
+ curl -XPUT 'localhost:9200/test'

- curl -XPUT 'localhost:9200/test/article/1' -d '
- {
- "title" : "Some title",
- "content" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
+ curl -XPOST 'localhost:9200/test/article/_mapping' -d '
+ {
+ "article" : {
+ "properties" : {
+ "content" : { "type" : "langdetect" }
}
- '
+ }
+ }
+ '

- curl -XPUT 'localhost:9200/test/article/2' -d '
- {
- "title" : "Ein Titel",
- "content" : "Einigkeit und Recht und Freiheit für das deutsche Vaterland!"
- }
- '
+ curl -XPUT 'localhost:9200/test/article/1' -d '
+ {
+ "title" : "Some title",
+ "content" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
+ }
+ '

- curl -XPUT 'localhost:9200/test/article/3' -d '
- {
- "title" : "Un titre",
- "content" : "Allons enfants de la Patrie, Le jour de gloire est arrivé !"
- }
- '
+ curl -XPUT 'localhost:9200/test/article/2' -d '
+ {
+ "title" : "Ein Titel",
+ "content" : "Einigkeit und Recht und Freiheit für das deutsche Vaterland!"
+ }
+ '

- curl -XGET 'localhost:9200/test/_refresh'
+ curl -XPUT 'localhost:9200/test/article/3' -d '
+ {
+ "title" : "Un titre",
+ "content" : "Allons enfants de la Patrie, Le jour de gloire est arrivé!"
+ }
+ '

- curl -XPOST 'localhost:9200/test/_search' -d '
- {
- "query" : {
- "term" : {
- "content" : "en"
- }
+ A search for the detected language codes is a simple term query, like this:
+
+ curl -XGET 'localhost:9200/test/_refresh'
+
+ curl -XPOST 'localhost:9200/test/_search' -d '
+ {
+ "query" : {
+ "term" : {
+ "content" : "en"
}
- }
- '
- curl -XPOST 'localhost:9200/test/_search' -d '
- {
- "query" : {
- "term" : {
- "content" : "de"
- }
+ }
+ }
+ '
+ curl -XPOST 'localhost:9200/test/_search' -d '
+ {
+ "query" : {
+ "term" : {
+ "content" : "de"
}
- }
- '
+ }
+ }
+ '

- curl -XPOST 'localhost:9200/test/_search' -d '
- {
- "query" : {
- "term" : {
- "content" : "fr"
+ curl -XPOST 'localhost:9200/test/_search' -d '
+ {
+ "query" : {
+ "term" : {
+ "content" : "fr"
+ }
+ }
+ }
+ '
+
+ ## Show stored language codes
+
+ Using multifields, it is possible to store the text alongside the detected language(s).
+ Here, we use another (short nonsense) example text for demonstration,
+ which has more than one detected language code.
+
+ curl -XDELETE 'localhost:9200/test'
+
+ curl -XPUT 'localhost:9200/test'
+
+ curl -XPOST 'localhost:9200/test/article/_mapping' -d '
+ {
+ "article" : {
+ "properties" : {
+ "content" : {
+ "type" : "multi_field",
+ "fields" : {
+ "content" : {
+ "type" : "string"
+ },
+ "language" : {
+ "type": "langdetect",
+ "store" : true
+ }
}
}
}
- '
+ }
+ }
+ '
+
+ curl -XPUT 'localhost:9200/test/article/1' -d '
+ {
+ "content" : "watt datt"
+ }
+ '

- ## Language detection with attachment mapper plugin example
+ curl -XGET 'localhost:9200/test/_refresh'
+
+ curl -XPOST 'localhost:9200/test/_search?pretty' -d '
+ {
+ "fields" : "content.language",
+ "query" : {
+ "match" : {
+ "content" : "watt datt"
+ }
+ }
+ }
+ '
+
+ The result is
+
+ {
+ "took" : 2,
+ "timed_out" : false,
+ "_shards" : {
+ "total" : 5,
+ "successful" : 5,
+ "failed" : 0
+ },
+ "hits" : {
+ "total" : 1,
+ "max_score" : 0.51623213,
+ "hits" : [ {
+ "_index" : "test",
+ "_type" : "article",
+ "_id" : "1",
+ "_score" : 0.51623213,
+ "fields" : {
+ "content.language" : [ "sv", "it", "nl" ]
+ }
+ } ]
+ }
+ }
+
+
+ ## Language detection with attachment mapper plugin

curl -XDELETE 'localhost:9200/test'

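The three term queries in the simple detection example differ only in the language code, so they could be generated in a loop. The following is a minimal sketch, assuming a POSIX shell and a node on `localhost:9200`; `term_query` is a hypothetical helper and not part of the plugin. The loop only prints the `curl` commands, so nothing is sent to the node until the printed lines are run.

```shell
# Hypothetical helper (not part of the plugin): print the term-query
# body for one language code, in the same shape as the README examples.
term_query() {
  printf '{ "query" : { "term" : { "content" : "%s" } } }\n' "$1"
}

# Dry run: print one search command per language code.
# Paste a printed line into a shell to send it to a local node.
for code in en de fr; do
  echo "curl -XPOST 'localhost:9200/test/_search' -d '$(term_query "$code")'"
done
```

Printing instead of executing keeps the sketch safe to try on a machine without a running node.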
@@ -289,6 +370,36 @@ All feedback is welcome! If you find issues, please post them at [Github](https:
} ]
}

+
+ # Settings
+
+ These settings can be used in `elasticsearch.yml` to modify language detection.
+
+ Use with caution. You don't need to modify settings. This list is just for the sake of completeness.
+ For successful modification of the model parameters, you should study the source code
+ and be familiar with probabilistic matching using naive Bayes with character n-grams.
+ See also Ted Dunning,
+ [Statistical Identification of Language](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.1958), 1994.
+
+ `langdetect.languages` - a comma-separated list of language codes used to restrict the detection
+
+ `langdetect.map.<code>` - a substitution code for a language code
+
+ `langdetect.number_of_trials` - number of trials, affects CPU usage, default: 7
+
+ `langdetect.alpha` - additional smoothing parameter, default: 0.5
+
+ `langdetect.alpha_width` - the width of smoothing, default: 0.05
+
+ `langdetect.iteration_limit` - safeguard to break loop, default: 10000
+
+ `langdetect.prob_threshold` - default: 0.1
+
+ `langdetect.conv_threshold` - detection is terminated when normalized probability exceeds
+ this threshold, default: 0.99999
+
+ `langdetect.base_freq` - default: 10000
+
# Credits

Thanks to Alexander Reelsen for his OpenNLP plugin, from where I have copied and
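As an illustration of the `# Settings` list in the diff above, the sketch below writes a small settings fragment to a scratch file so it can be reviewed before anything is copied into `elasticsearch.yml`. The language codes `en,de,fr` are only an example, and the exact value syntax the plugin expects for each key is an assumption based on the descriptions in the list; the two numeric values repeat the documented defaults.

```shell
# Write an example settings fragment to a scratch file for review.
# Assumption: keys are given as flat "name: value" lines in elasticsearch.yml.
fragment="$(mktemp)"
cat > "$fragment" <<'EOF'
# Restrict detection to three languages (codes are illustrative).
langdetect.languages: en,de,fr
# Documented defaults, shown explicitly.
langdetect.number_of_trials: 7
langdetect.prob_threshold: 0.1
EOF

# Show the fragment; copy the lines into elasticsearch.yml by hand if wanted.
cat "$fragment"
```

Keeping the fragment in a separate file avoids touching a live configuration; as with the plugin installation, a node restart would be needed before changed settings take effect.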