This repository was archived by the owner on Dec 18, 2021. It is now read-only.

apply filter on text before ngram detection #18

Closed
juliendangers opened this issue Jul 19, 2014 · 2 comments

Comments

@juliendangers
Contributor

I ran into an issue that could be solved by running custom filters on the text before ngram detection. I do not mean Lucene filters, but simple predefined filters such as lowercase or uppercase.

This French tweet, written entirely in uppercase:

COMMENT DES GENS PEUVENT TROUVER DES CÉLÉBRITÉS DANS LES MAGASINS JE PEUX MÊME PAS TROUVER MA MÈRE

is detected as English:

{
    "language": "en",
    "probability": 0.9999937971825049
}

But when I submit the exact same text in lowercase,

comment des gens peuvent trouver des célébrités dans les magasins je peux même pas trouver ma mère

{
    "language": "fr",
    "probability": 0.9999970343219597
}

French is now detected.

@jprante
Owner

jprante commented Jul 19, 2014

The lang detect module is case sensitive. It is possible that, for French, ngram frequencies for uppercase text have not been recorded. This comes from the upstream lang detect module, not from this plugin.
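
Until such filters exist in the plugin, one possible workaround is to normalize the text on the client side before sending it for detection. The following is only a minimal sketch of that idea: `lowercase_filter` and `apply_filters` are hypothetical helper names, not part of this plugin or of the upstream lang detect module, and the call that submits the text for detection is left out.

```python
from typing import Callable, Iterable


def lowercase_filter(text: str) -> str:
    """Lowercase the text. str.lower() is Unicode-aware, so 'É' becomes 'é'."""
    return text.lower()


def apply_filters(text: str, filters: Iterable[Callable[[str], str]]) -> str:
    """Run each filter in order over the text (hypothetical helper)."""
    for text_filter in filters:
        text = text_filter(text)
    return text


tweet = (
    "COMMENT DES GENS PEUVENT TROUVER DES CÉLÉBRITÉS DANS LES MAGASINS "
    "JE PEUX MÊME PAS TROUVER MA MÈRE"
)

# Normalize first, then submit `normalized` for language detection.
normalized = apply_filters(tweet, [lowercase_filter])
print(normalized)
```

The filter list makes it easy to chain other predefined filters later, which is the kind of hook the issue asks for. Whether lowercasing helps for languages other than French depends on what the upstream profiles contain.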

@juliendangers
Contributor Author

Ok, thanks
