url often generates lang:en on small text #17

juliendangers · 2014-07-19T12:25:34Z

On short text that contains a URL, English is almost always detected.

Example:

An Arabic tweet with a URL:

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\" https://www.facebook.com/dralqarnee/posts/675689432512881"
}

Produces:

{
   "languages": [
      {
         "language": "en",
         "probability": 0.857138346512083
      },
      {
         "language": "ar",
         "probability": 0.14285639031760403
      }
   ]
}

English is detected with a higher probability than Arabic.

Without any URL:

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\""
}

Produces:

{
   "languages": [
      {
         "language": "ar",
         "probability": 0.5714272046098048
      },
      {
         "language": "so",
         "probability": 0.42857034099037317
      }
   ]
}

English is not even detected!

I can submit a pull request; I have already made the changes on my side.

jprante · 2014-07-19T13:06:54Z

Yes, the input cannot be processed reliably if the text is either short (single words) or short and mixed. To me the result makes sense: the first text contains the words "facebook" and "posts", while the second contains no English word.

This restriction comes from the underlying lang detect module; this plugin cannot change it.

juliendangers · 2014-07-19T13:19:45Z

Yes, I agree that it makes sense for English to be detected when the URL is included. But I do not see the point of using URLs in language detection.

I've done the following:

Added a pattern for URLs:

private final static Pattern urlPattern = Pattern.compile("(https?|ftp)://[^\\s/$.?#].[^\\s]*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

(I'm not sure Pattern.UNICODE_CHARACTER_CLASS is necessary here.)

Replaced

text.replaceAll(word.pattern(), " ")

with

text.replaceAll(word.pattern(), " ").replaceAll(urlPattern.pattern(), " ")

in Detector.detect and Detector.detectAll.
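
A minimal standalone sketch of the URL-stripping step, for illustration only. The class name UrlStripper and the method stripUrls are hypothetical and are not part of this plugin or of the lang detect library. Two assumptions are built in: the pattern is left unanchored so that it can match a URL embedded in longer text, and the replacement goes through Matcher.replaceAll so that the compiled flags still apply (String.replaceAll with Pattern.pattern() recompiles the expression without them).

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the plugin or the lang detect library.
final class UrlStripper {

    // Unanchored: a pattern wrapped in ^...$ would only match input that is a URL from start to end.
    private static final Pattern URL = Pattern.compile(
            "(?:https?|ftp)://[^\\s/$.?#]\\S*",
            Pattern.CASE_INSENSITIVE);

    // Replaces every URL in the text with a single space.
    static String stripUrls(String text) {
        // Using the Matcher keeps CASE_INSENSITIVE in effect.
        return URL.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String tweet = "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\" "
                + "https://www.facebook.com/dralqarnee/posts/675689432512881";
        // Prints the tweet with the URL replaced by a space.
        System.out.println(stripUrls(tweet));
    }
}
```

If the existing word pattern replaces punctuation with spaces, the URL step would presumably need to run before it, since the "://" that the URL pattern relies on would otherwise already be gone.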

But you're right, this should be done in the underlying lang detect module, I'm going to submit a PR to it.

This issue can be closed, don't you think?

jprante · 2014-07-19T15:16:27Z

I see the point that a URL is not text. But there is a lot of data that is not text, so I think URL/URI is only one example.

For this plugin, I think the most viable approach is to pass lang detect only input that has been preprocessed so that it contains recognizable language.

The most general approach would be part-of-speech (POS) tagging, as in natural language processing / text mining. It would be a good idea to combine a POS tagger with language detection like this plugin does.

jprante closed this as completed Jul 19, 2014
