This repository was archived by the owner on Dec 18, 2021. It is now read-only.

problem with decoding escaped unicode string #60

Closed
@lexand

Hi.
My Elasticsearch configuration:

$ curl -XGET http://127.0.0.1:9200
{
  "name" : "mYc2RK-",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "WCHUEzGyR8yTPfvtJWTBFQ",
  "version" : {
    "number" : "5.3.0",
    "build_hash" : "3adb13b",
    "build_date" : "2017-03-23T03:31:50.652Z",
    "build_snapshot" : false,
    "lucene_version" : "6.4.1"
  },
  "tagline" : "You Know, for Search"
}

Langdetect plugin version: 5.3.0.1

Example 1

GET _langdetect
{  "text" : "какой-то не очень длинный русский текст"}

{
  "languages": [
    {
      "language": "ru",
      "probability": 0.999997235732777
    }
  ]
}

Example 2

GET _langdetect
{"text":"\u043a\u0430\u043a\u043e\u0439-\u0442\u043e \u043d\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442"}

{
  "languages": [
    {
      "language": "hr",
      "probability": 0.9999997870025434
    }
  ]
}

Both texts are identical, but the first is sent as is and the second is sent with Unicode escapes (\uXXXX).
Only in the first example was the language detected correctly: the escaped request returns "hr" instead of "ru".
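
To back up the claim that both request bodies carry the same text, here is a minimal Python sketch (not part of the plugin or of the PHP client) that parses the two bodies with a standard JSON parser and compares the decoded strings. The variable names are illustrative only.

```python
import json

# Request body from Example 1: the Cyrillic text is sent as is.
raw_body = '{"text": "какой-то не очень длинный русский текст"}'

# Request body from Example 2: the same text, sent as \uXXXX escapes.
# Raw string literals keep the backslashes so the JSON parser sees them.
escaped_body = (
    r'{"text":"\u043a\u0430\u043a\u043e\u0439-\u0442\u043e '
    r'\u043d\u0435 \u043e\u0447\u0435\u043d\u044c '
    r'\u0434\u043b\u0438\u043d\u043d\u044b\u0439 '
    r'\u0440\u0443\u0441\u0441\u043a\u0438\u0439 '
    r'\u0442\u0435\u043a\u0441\u0442"}'
)

decoded_raw = json.loads(raw_body)["text"]
decoded_escaped = json.loads(escaped_body)["text"]

# A conforming JSON parser yields the same string for both bodies.
same_text = decoded_raw == decoded_escaped

# On the wire, however, the escaped body contains only ASCII characters.
escaped_is_ascii = escaped_body.isascii()

print(same_text, escaped_is_ascii)
```

Since the two bodies decode to the same string, any difference in the detection result would have to come from how the request body is read before or instead of JSON decoding. That is only a hypothesis; this report does not confirm where the plugin goes wrong.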

The escaped Unicode strings are produced by the Elasticsearch PHP library v5.1.3
(elasticsearch/elasticsearch/src/Elasticsearch/Serializers/SmartSerializer.php:40).
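
As a possible client-side workaround, the request body can be serialized without Unicode escapes so that the plugin receives raw UTF-8, as in Example 1. The sketch below shows the idea in Python with `json.dumps(..., ensure_ascii=False)`. PHP's `json_encode` has an analogous `JSON_UNESCAPED_UNICODE` flag, but whether the Elasticsearch PHP client exposes a way to set it is an assumption not covered by this report.

```python
import json

payload = {"text": "какой-то не очень длинный русский текст"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes,
# which matches the form of the request body in Example 2.
escaped = json.dumps(payload)

# With ensure_ascii=False the characters are kept as is (Example 1 form).
unescaped = json.dumps(payload, ensure_ascii=False)

# Bytes that would actually be sent as the HTTP request body.
body = unescaped.encode("utf-8")

print(escaped)
print(unescaped)
```

Both serializations are valid JSON for the same document, so this only sidesteps the problem on the client side and does not fix the plugin itself.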

Here is another example with an escaped Unicode string, which suggests that the problem is in the langdetect plugin rather than in Elasticsearch itself.
Create a new document with a Unicode-escaped string:

POST test/test
{
  "Title":"\u043a\u0430\u043a\u043e\u0439-\u0442\u043e \u043d\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442"
}

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszuHOlZhOlEAMN9jBe",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

GET test/test/AVszuHOlZhOlEAMN9jBe

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszuHOlZhOlEAMN9jBe",
  "_version": 1,
  "found": true,
  "_source": {
    "Title": "какой-то не очень длинный русский текст"
  }
}

Create a new document with an unescaped Unicode string:

POST test/test
{"Title": "какой-то не очень длинный русский текст"}

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszuHOlZhOlEAMN9jBe",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

GET test/test/AVszvDhMZhOlEAMN9jBh

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszvDhMZhOlEAMN9jBh",
  "_version": 1,
  "found": true,
  "_source": {
    "Title": "какой-то не очень длинный русский текст"
  }
}

As you can see, both texts were stored and displayed correctly.
