This repository was archived by the owner on Dec 18, 2021. It is now read-only.

url often generates lang:en on small text #17

Closed
juliendangers opened this issue Jul 19, 2014 · 3 comments

Comments

@juliendangers
Contributor

On short text that contains a URL, English is almost always detected.

Example:

An Arabic tweet with a URL:

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\" https://www.facebook.com/dralqarnee/posts/675689432512881"
}

Produces:

{
   "languages": [
      {
         "language": "en",
         "probability": 0.857138346512083
      },
      {
         "language": "ar",
         "probability": 0.14285639031760403
      }
   ]
}

English is detected with a higher probability than Arabic.

Without the URL:

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\""
}

Produces:

{
   "languages": [
      {
         "language": "ar",
         "probability": 0.5714272046098048
      },
      {
         "language": "so",
         "probability": 0.42857034099037317
      }
   ]
}

English is not even detected!

I can submit a pull request; I have already made the changes in my own copy.

@jprante
Owner

jprante commented Jul 19, 2014

Yes, the input data cannot be reliably processed if the text is either short (single words) or short and mixed. To me the result makes sense: the first text contains the words "facebook" and "posts", while the second contains no English word.

This restriction comes from the underlying lang detect module; this plugin cannot change it.

@juliendangers
Contributor Author

Yes, I agree that it makes sense for English to be detected when the URL is in the text. But I do not see the point of using URLs in language detection.

I've done the following:

  • added a pattern for URLs (unanchored, so that a URL embedded in longer text also matches)
private final static Pattern urlPattern = Pattern.compile("(https?|ftp)://[^\\s/$.?#].[^\\s]*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

(not sure Pattern.UNICODE_CHARACTER_CLASS is necessary here)

  • replaced
text.replaceAll(word.pattern(), " ")

by

urlPattern.matcher(text).replaceAll(" ").replaceAll(word.pattern(), " ")

in Detector.detect and Detector.detectAll (the URL is stripped first, through the compiled pattern, so that its flags apply and the URL is still intact when it is matched)

But you're right, this should be done in the underlying lang detect module; I'm going to submit a PR to it.

This issue can be closed, don't you think?
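
For readers who want to try the URL-stripping idea outside the plugin, here is a minimal standalone sketch. The class name `UrlStripper` and the method `stripUrls` are hypothetical and not part of the plugin or of the lang detect module. The pattern is the one from the comment above, used without `^`/`$` anchors on the assumption that URLs are usually embedded in longer text.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the plugin: removes URLs before language detection.
public class UrlStripper {

    // Unanchored so that a URL inside a longer text is matched.
    // CASE_INSENSITIVE lets "HTTP://..." match as well.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(https?|ftp)://[^\\s/$.?#].[^\\s]*",
            Pattern.CASE_INSENSITIVE);

    // Replaces every URL in the text with a single space.
    public static String stripUrls(String text) {
        // Using the compiled Pattern keeps its flags; String.replaceAll(pattern.pattern(), ...)
        // would recompile the regex without them.
        return URL_PATTERN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String tweet = "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\" "
                + "https://www.facebook.com/dralqarnee/posts/675689432512881";
        // The URL is gone; the rest of the tweet is unchanged.
        System.out.println(stripUrls(tweet));
    }
}
```

The cleaned string could then be sent as `query_string` to `_langdetect`, as in the second request of the original report.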

@jprante
Owner

jprante commented Jul 19, 2014

I see the point that a URL is not text. But there is a lot of data that is not text, so I think URL/URI is only one example.

For this plugin, I think the most viable approach is to pass to lang detect only input that has been preprocessed so that it is recognizable language.

The most general approach would be part-of-speech (POS) tagging, as in natural language processing / text mining. It would be a good idea to combine a POS tagger with language detection like this plugin can do.
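
The preprocessing suggested above could look roughly like the following client-side sketch, applied before the text is sent to the plugin. It is only an illustration: the class `LangDetectPreprocessor`, the method `clean`, and the choice of what counts as non-text (URLs, `@mentions`, the `RT` marker) are assumptions made for the tweet example in this thread, not something the plugin provides.

```java
import java.util.regex.Pattern;

// Hypothetical client-side cleaner; the set of rules is an assumption for tweets.
public class LangDetectPreprocessor {

    // URL pattern from this thread, without anchors.
    private static final Pattern URL = Pattern.compile(
            "(https?|ftp)://[^\\s/$.?#].[^\\s]*", Pattern.CASE_INSENSITIVE);
    // "@user" or "@user:", but not the "@" inside an e-mail address.
    private static final Pattern MENTION = Pattern.compile("(?<!\\w)@\\w+:?");
    // The retweet marker as a standalone word.
    private static final Pattern RETWEET = Pattern.compile("\\bRT\\b");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes non-language tokens and collapses the remaining whitespace.
    public static String clean(String text) {
        String result = URL.matcher(text).replaceAll(" ");
        result = MENTION.matcher(result).replaceAll(" ");
        result = RETWEET.matcher(result).replaceAll(" ");
        return WHITESPACE.matcher(result).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("RT @user: hello world http://t.co/abc"));
    }
}
```

Each rule is a separate pattern so that rules for other kinds of non-text data can be added or dropped independently, which matches the point that URLs are only one example.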

@jprante jprante closed this as completed Jul 19, 2014