This repository was archived by the owner on Dec 18, 2021. It is now read-only.

problem with decoding escaped unicode string #60

Closed
lexand opened this issue Apr 3, 2017 · 2 comments

Comments

lexand commented Apr 3, 2017

Hi.
My Elasticsearch setup is:

$ curl -XGET http://127.0.0.1:9200
{
  "name" : "mYc2RK-",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "WCHUEzGyR8yTPfvtJWTBFQ",
  "version" : {
    "number" : "5.3.0",
    "build_hash" : "3adb13b",
    "build_date" : "2017-03-23T03:31:50.652Z",
    "build_snapshot" : false,
    "lucene_version" : "6.4.1"
  },
  "tagline" : "You Know, for Search"
}

langdetect plugin version: 5.3.0.1

Example 1

GET _langdetect
{  "text" : "какой-то не очень длинный русский текст"}

{
  "languages": [
    {
      "language": "ru",
      "probability": 0.999997235732777
    }
  ]
}

Example 2

GET _langdetect
{"text":"\u043a\u0430\u043a\u043e\u0439-\u0442\u043e \u043d\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442"}

{
  "languages": [
    {
      "language": "hr",
      "probability": 0.9999997870025434
    }
  ]
}

Both texts are identical, but the first is sent as is and the second is Unicode-escaped.
Only in the first example was the language detected correctly.
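To check that the two request bodies really carry the same text, here is a minimal sketch in Python (not part of the original report; the variable names are only illustrative). It decodes both JSON bodies from Example 1 and Example 2 with the standard `json` module and compares the resulting strings:

```python
import json

# Body from Example 1: the Cyrillic text is sent as is.
plain_body = '{"text": "какой-то не очень длинный русский текст"}'

# Body from Example 2: the same text, written with JSON \uXXXX escapes.
# A raw string keeps the backslashes exactly as they travel over the wire.
escaped_body = r'{"text":"\u043a\u0430\u043a\u043e\u0439-\u0442\u043e \u043d\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442"}'

# A JSON parser resolves the escapes, so both bodies yield the same string.
plain_text = json.loads(plain_body)["text"]
escaped_text = json.loads(escaped_body)["text"]
same = plain_text == escaped_text

print(same)
```

Any consumer that parses the body as JSON should therefore see identical input in both cases.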

The escaped Unicode strings are produced by the ES PHP library v5.1.3
(elasticsearch/elasticsearch/src/Elasticsearch/Serializers/SmartSerializer.php:40)
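For readers who have not seen this behaviour: many JSON encoders escape non-ASCII characters by default. The sketch below is a Python analogue, not the PHP library's code; it uses `json.dumps`, whose `ensure_ascii` parameter defaults to `True`, to show how the same document can be serialized in both forms. The `doc` value is a shortened, hypothetical sample.

```python
import json

# Hypothetical sample document, shortened from the text in the report.
doc = {"text": "русский текст"}

# Default serialization: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(doc)

# With ensure_ascii=False the characters are written out unchanged.
unescaped = json.dumps(doc, ensure_ascii=False)

print(escaped)
print(unescaped)

# Both forms are valid JSON and decode to the same document.
round_trip_equal = json.loads(escaped) == json.loads(unescaped) == doc
```

Both outputs are equivalent JSON, so a server is expected to treat them the same way.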

Here is another example with an escaped Unicode string, which suggests that the problem is in the langdetect plugin rather than in Elasticsearch itself.
Create a new document with the Unicode-escaped string:

POST test/test
{
  "Title":"\u043a\u0430\u043a\u043e\u0439-\u0442\u043e \u043d\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442"
}

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszuHOlZhOlEAMN9jBe",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

GET test/test/AVszuHOlZhOlEAMN9jBe

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszuHOlZhOlEAMN9jBe",
  "_version": 1,
  "found": true,
  "_source": {
    "Title": "какой-то не очень длинный русский текст"
  }
}

Create a new document with the unescaped Unicode string:

POST test/test
{"Title": "какой-то не очень длинный русский текст"}

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszvDhMZhOlEAMN9jBh",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

GET test/test/AVszvDhMZhOlEAMN9jBh

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszvDhMZhOlEAMN9jBh",
  "_version": 1,
  "found": true,
  "_source": {
    "Title": "какой-то не очень длинный русский текст"
  }
}

As you can see, both texts were stored and displayed correctly.

Owner

jprante commented Apr 3, 2017

There was a bug in the REST action. As of 5.3.0.2, the request body is treated as JSON and parsed as JSON:

POST /_langdetect
{
  "text": "..."
}
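To illustrate why parsing the body matters, here is a minimal Python sketch. It rests on an assumption that the thread does not confirm in detail: that a handler which passes the raw body text to a detector, without JSON parsing, sees only the literal backslash escapes. The helper `cyrillic_count` and the shortened sample body are hypothetical.

```python
import json


def cyrillic_count(s: str) -> int:
    """Count characters in the basic Cyrillic block (U+0400 to U+04FF)."""
    return sum(1 for ch in s if "\u0400" <= ch <= "\u04ff")


# Shortened, hypothetical request body with \uXXXX escapes ("русский текст").
raw_body = r'{"text":"\u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442"}'

# Assumption: without JSON parsing, the input is only ASCII letters,
# digits and backslashes, so no Cyrillic characters are visible.
raw_count = cyrillic_count(raw_body)

# After JSON parsing, the escapes are resolved into real Cyrillic letters.
parsed_count = cyrillic_count(json.loads(raw_body)["text"])

print(raw_count, parsed_count)
```

Under that assumption, a detector fed the unparsed body has nothing Cyrillic to work with, which would be consistent with the wrong result in Example 2.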

Author

lexand commented Apr 4, 2017

Thanks a lot.

lexand closed this as completed Apr 4, 2017