Add support for language detection #44
Comments
Nice article about language detection accuracy [1]. [1] - http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
The langdetect plugin can now be combined with the attachment mapper plugin. See jprante/elasticsearch-langdetect#5
- Using Language detection feature available in Tika: https://tika.apache.org/1.4/detection.html#Language_Detection Closes elastic#44. Closes elastic#45.
Heya, I started to play with your PR (rebasing, changing some parts of the code, adding some other tests). See the branch here: https://github.com/dadoonet/elasticsearch-mapper-attachments/tree/issue/44-langdetect Some comments:
What do you think?
I am not sure about accuracy. I have used elasticsearch-langdetect [1] but did not find it accurate. Regarding performance, I definitely agree that language detection should be disabled by default.
Regarding accuracy, the language-detection library used in elasticsearch-langdetect was found to be more accurate than Tika: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html If you have observed accuracy issues, feel free to open an issue at https://github.com/jprante/elasticsearch-langdetect/issues
This link is the main reason I have been using the elasticsearch-langdetect plugin. I have created a new issue: jprante/elasticsearch-langdetect#8
Based on PR elastic#45, we add a new language detection option using the language detection feature available in Tika: https://tika.apache.org/1.4/detection.html#Language_Detection By default, language detection is disabled (`false`) because it can come with a performance cost. This default value can be changed with the `index.mapping.attachment.detect_language` setting. It can also be set for each indexed document using the `_detect_language` parameter. Closes elastic#45. Closes elastic#44.
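As a rough illustration of the two switches described above, the following sketch builds the JSON bodies one might send to Elasticsearch. Only the names `index.mapping.attachment.detect_language` and `_detect_language` come from the commit message; the field name `file`, the `_content` key holding base64 data, and the helper functions are assumptions made for this example, not a confirmed API.

```python
import base64
import json
from typing import Optional


def index_settings(detect_language: bool) -> dict:
    """Index-level default for language detection (off unless set to True)."""
    return {
        "settings": {
            "index.mapping.attachment.detect_language": detect_language
        }
    }


def attachment_document(field: str, raw: bytes,
                        detect_language: Optional[bool] = None) -> dict:
    """Build a document body for an attachment field.

    `field` and the `_content` key are assumptions for illustration;
    `_detect_language` is the per-document override named in the commit.
    """
    attachment = {"_content": base64.b64encode(raw).decode("ascii")}
    if detect_language is not None:
        # Only send the override when the caller asked for one, so the
        # index-level default applies otherwise.
        attachment["_detect_language"] = detect_language
    return {field: attachment}


if __name__ == "__main__":
    print(json.dumps(index_settings(True), indent=2))
    print(json.dumps(attachment_document("file", b"Bonjour tout le monde", True), indent=2))
```

Leaving `detect_language` as `None` omits the per-document key, so the document falls back to whatever the index-level setting specifies.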
Tika has a language detection feature [1]. It would be very useful to include in this plugin.
There is already a plugin for language detection in ES [2], but I was not able to get it working for the `attachment` type.
[1] - https://tika.apache.org/1.4/detection.html#Language_Detection
[2] - https://github.com/jprante/elasticsearch-langdetect