This repository was archived by the owner on Dec 18, 2021. It is now read-only.
problem with decoding escaped unicode string #60
Closed
Description
Hi.
My config is
$ curl -XGET http://127.0.0.1:9200
{
"name" : "mYc2RK-",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "WCHUEzGyR8yTPfvtJWTBFQ",
"version" : {
"number" : "5.3.0",
"build_hash" : "3adb13b",
"build_date" : "2017-03-23T03:31:50.652Z",
"build_snapshot" : false,
"lucene_version" : "6.4.1"
},
"tagline" : "You Know, for Search"
}
Langdetect plugin version: 5.3.0.1
Example 1
GET _langdetect
{ "text" : "какой-то не очень длинный русский текст"}
{
"languages": [
{
"language": "ru",
"probability": 0.999997235732777
}
]
}
Example 2
GET _langdetect
{"text":"\u043a\u0430\u043a\u043e\u0439-\u0442\u043e \u043d\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442"}
{
"languages": [
{
"language": "hr",
"probability": 0.9999997870025434
}
]
}
Both texts are identical, but the first is sent as is and the second is Unicode-escaped.
Only in the first example was the language detected correctly.
The escaped Unicode strings are produced by the ES PHP library v5.1.3
(elasticsearch/elasticsearch/src/Elasticsearch/Serializers/SmartSerializer.php:40)
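For reference, the equivalence of the two request bodies can be checked outside Elasticsearch. The sketch below uses Python's standard `json` module (not the PHP library from this report) to show that both bodies decode to the same string, and that an ASCII-only serializer produces the escaped form seen in Example 2. The variable names are illustrative only.

```python
import json

# Body of Example 1: Cyrillic text sent as is.
raw_body = '{"text": "какой-то не очень длинный русский текст"}'

# Serializing with ensure_ascii=True (Python's default) escapes every
# non-ASCII character as \uXXXX, giving a body like the one in Example 2.
escaped_body = json.dumps(json.loads(raw_body), ensure_ascii=True)

# A JSON parser should yield the same text from both bodies.
same_text = json.loads(raw_body)["text"] == json.loads(escaped_body)["text"]

print(escaped_body)
print("same text after decoding:", same_text)
```

Since both forms are valid JSON for the same string, a server that parses the body as JSON should treat them identically.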
Here is another example with an escaped Unicode string, which suggests that the problem is probably in the langdetect plugin.
Create a new doc with a Unicode-escaped string:
POST test/test
{
"Title":"\u043a\u0430\u043a\u043e\u0439-\u0442\u043e \u043d\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442"
}
{
"_index": "test",
"_type": "test",
"_id": "AVszuHOlZhOlEAMN9jBe",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"created": true
}
GET test/test/AVszuHOlZhOlEAMN9jBe
{
"_index": "test",
"_type": "test",
"_id": "AVszuHOlZhOlEAMN9jBe",
"_version": 1,
"found": true,
"_source": {
"Title": "какой-то не очень длинный русский текст"
}
}
Create a new doc with a plain (unescaped) Unicode string:
POST test/test
{"Title": "какой-то не очень длинный русский текст"}
{
"_index": "test",
"_type": "test",
"_id": "AVszvDhMZhOlEAMN9jBh",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"created": true
}
GET test/test/AVszvDhMZhOlEAMN9jBh
{
"_index": "test",
"_type": "test",
"_id": "AVszvDhMZhOlEAMN9jBh",
"_version": 1,
"found": true,
"_source": {
"Title": "какой-то не очень длинный русский текст"
}
}
As you can see, both texts were stored and displayed correctly.
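One way such a mismatch could arise, offered purely as an assumption and not as a confirmed diagnosis of the plugin, is a code path that looks at the request body before the JSON escapes are decoded. The Python sketch below shows what the escaped body looks like in that case: it contains no Cyrillic characters at all until it is parsed as JSON. The helper `count_cyrillic` is hypothetical, and the body is shortened to the last two words of the example text.

```python
import json

def count_cyrillic(text: str) -> int:
    """Count characters in the basic Cyrillic block (U+0400 to U+04FF)."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")

# Raw request body as sent in Example 2 (shortened); the raw string
# literal keeps the backslash-u sequences as literal ASCII characters.
wire_body = r'{"text":"\u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442"}'

# Escapes left undecoded: only ASCII letters and digits are visible.
undecoded = count_cyrillic(wire_body)

# Escapes decoded by a JSON parser: the Cyrillic text is recovered.
decoded = count_cyrillic(json.loads(wire_body)["text"])

print(undecoded, decoded)
```

If the plugin did read the undecoded form, a language detector would only see Latin letters and digits, which would be consistent with a non-Russian result. Whether this is what actually happens is not established here.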