This repository was archived by the owner on Jun 20, 2023. It is now read-only.

Add support for language detection #44

Closed
richardwilly98 opened this issue Oct 24, 2013 · 6 comments

@richardwilly98
Contributor

Tika has a language detection feature [1]. It would be very useful to include in this plugin.

There is already a plugin for language detection in ES [2] but I was not able to get it working for attachment type.

[1] - https://tika.apache.org/1.4/detection.html#Language_Detection
[2] - https://github.com/jprante/elasticsearch-langdetect

@richardwilly98
Contributor Author

Nice article about language detection accuracy [1].

[1] - http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

@jprante

jprante commented Nov 5, 2013

The langdetect plugin can now be combined with the attachment mapper plugin. See jprante/elasticsearch-langdetect#5

dadoonet pushed a commit to dadoonet/elasticsearch-mapper-attachments that referenced this issue Jan 13, 2014
@dadoonet
Member

Heya,

I started to play with your PR (rebasing, changing some parts of the code, adding some other tests). See the branch here: https://github.com/dadoonet/elasticsearch-mapper-attachments/tree/issue/44-langdetect

Some comments:

What do you think?

@ghost ghost assigned dadoonet Jan 13, 2014
@richardwilly98
Contributor Author

I am not sure about accuracy. I have used elasticsearch-langdetect [1] but did not find it accurate.
I tested the Tika version as well and found it pretty good (but I have not done extensive testing). And your example is funny :-)

Regarding performance, I definitely agree that language detection should be disabled by default.

[1] - https://github.com/jprante/elasticsearch-langdetect

@jprante

jprante commented Jan 14, 2014

Regarding accuracy, the language-detection library used in elasticsearch-langdetect was found to be more accurate than Tika:

http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

If you have observed accuracy issues, feel free to open an issue at https://github.com/jprante/elasticsearch-langdetect/issues

@richardwilly98
Contributor Author

This link is the main reason I have been using the elasticsearch-langdetect plugin.

I have created a new issue: jprante/elasticsearch-langdetect#8

dadoonet pushed a commit to dadoonet/elasticsearch-mapper-attachments that referenced this issue Jan 14, 2014
Based on PR elastic#45, we add a new language detection option using the language detection feature available in Tika:
https://tika.apache.org/1.4/detection.html#Language_Detection

By default, language detection is disabled (`false`) as it could come with a cost.
This default value can be changed by setting the `index.mapping.attachment.detect_language` setting.
It can also be set per indexed document using the `_detect_language` parameter.

Closes elastic#45.
Closes elastic#44.
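
As a rough illustration of the two switches described in the commit message, here is a minimal Python sketch that only builds the JSON bodies one might send to Elasticsearch. The setting name `index.mapping.attachment.detect_language` and the `_detect_language` parameter come from the commit message; the attachment field name `file`, the `_content` key, the helper `attachment_doc`, and the exact placement of the setting under `settings` are assumptions made for the example and should be checked against the plugin's README.

```python
import base64
import json

# Index-level default. The setting name is taken from the commit message;
# placing it as a flat key under "settings" is an assumption.
index_settings = {
    "settings": {"index.mapping.attachment.detect_language": True}
}


def attachment_doc(raw: bytes, detect_language: bool, field: str = "file") -> dict:
    """Build a document body with a per-document `_detect_language` flag.

    `field` (the attachment field name) and the `_content` key are
    hypothetical here; only `_detect_language` is named in the thread.
    """
    return {
        field: {
            "_detect_language": detect_language,
            # Attachment payloads are sent base64-encoded.
            "_content": base64.b64encode(raw).decode("ascii"),
        }
    }


doc = attachment_doc(b"Bonjour tout le monde", detect_language=True)

# Print the bodies so they can be pasted into a REST client.
print(json.dumps(index_settings, indent=2))
print(json.dumps(doc, indent=2))
```

Keeping the per-document flag next to the content mirrors the idea in the commit message: detection stays off by default because of its cost, and is enabled only for the index or the documents that need it.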