This plugin enables URL tokenization and token filtering by URL part.
| Elasticsearch Version | Plugin Version |
|---|---|
| 2.3.2 | 2.3.2.1 |
| 2.3.1 | 2.3.1.1 |
| 2.3.0 | 2.3.0.1 |
| 2.2.2 | 2.2.3 |
| 2.2.1 | 2.2.2 |
| 2.2.0 | 2.2.1 |
| 2.1.1 | 2.2.0 |
| 2.1.1 | 2.1.1 |
| 2.0.0 | 2.1.0 |
| 1.6.x, 1.7.x | 2.0.0 |
| 1.6.0 | 1.2.1 |
| 1.5.2 | 1.1.0 |
| 1.4.2 | 1.0.0 |
Install the plugin:

```bash
bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.3.2.1/elasticsearch-analysis-url-2.3.2.1.zip
```

The URL tokenizer accepts the following options:

* `part`: Defaults to `null`. If left `null`, all URL parts will be tokenized, and some additional tokens (`host:port` and `protocol://host`) will be included. Options are `whole`, `protocol`, `host`, `port`, `path`, `query`, and `ref`.
* `url_decode`: Defaults to `false`. If `true`, URL tokens will be URL decoded.
* `allow_malformed`: Defaults to `false`. If `true`, malformed URLs will not be rejected, but will be passed through without being tokenized.
* `tokenize_host`: Defaults to `true`. If `true`, the host will be further tokenized using a reverse path hierarchy tokenizer with the delimiter set to `.`.
* `tokenize_path`: Defaults to `true`. If `true`, the path will be tokenized using a path hierarchy tokenizer with the delimiter set to `/`.
* `tokenize_query`: Defaults to `true`. If `true`, the query string will be split on `&`.
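To illustrate which piece of a URL each `part` value refers to, here is a rough sketch using Python's standard `urllib.parse` (an analogy only, not the plugin's implementation; the example URL is made up):

```python
# Sketch: map each `part` option to the substring it selects from a URL.
from urllib.parse import urlsplit

url = "https://example.com:8080/a/b?x=1&y=2#frag"  # hypothetical example
parts = urlsplit(url)

tokens = {
    "whole": url,
    "protocol": parts.scheme,    # "https"
    "host": parts.hostname,      # "example.com"
    "port": parts.port,          # 8080 (an int here; the tokenizer emits text)
    "path": parts.path,          # "/a/b"
    "query": parts.query,        # "x=1&y=2"
    "ref": parts.fragment,       # "frag"
}
```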
Index settings:

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "url_host": {
          "type": "url",
          "part": "host"
        }
      },
      "analyzer": {
        "url_host": {
          "tokenizer": "url_host"
        }
      }
    }
  }
}
```

Make an analysis request:

```bash
curl 'http://localhost:9200/index_name/_analyze?analyzer=url_host&pretty' -d 'https://foo.bar.com/baz.html'
```
```json
{
  "tokens" : [ {
    "token" : "foo.bar.com",
    "start_offset" : 8,
    "end_offset" : 19,
    "type" : "host",
    "position" : 1
  }, {
    "token" : "bar.com",
    "start_offset" : 12,
    "end_offset" : 19,
    "type" : "host",
    "position" : 2
  }, {
    "token" : "com",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "host",
    "position" : 3
  } ]
}
```

The URL token filter accepts the following options:

* `part`: This option defaults to `whole`, which will cause the entire URL to be returned. In this case, the filter only serves to validate incoming URLs. Other possible values are `protocol`, `host`, `port`, `path`, `query`, and `ref`.
* `url_decode`: Defaults to `false`. If `true`, the desired portion of the URL will be URL decoded.
* `allow_malformed`: Defaults to `false`. If `true`, documents containing malformed URLs will not be rejected, and an attempt will be made to parse the desired URL part from the malformed URL string. If the desired part cannot be found, no value will be indexed for that field.
* `passthrough`: Defaults to `false`. If `true`, `allow_malformed` is implied, and any non-URL tokens will be passed through the filter. Valid URLs will be tokenized according to the filter's other settings.
* `tokenize_host`: Defaults to `true`. If `true`, the host will be further tokenized using a reverse path hierarchy tokenizer with the delimiter set to `.`.
* `tokenize_path`: Defaults to `true`. If `true`, the path will be tokenized using a path hierarchy tokenizer with the delimiter set to `/`.
* `tokenize_query`: Defaults to `true`. If `true`, the query string will be split on `&`.
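The host tokens in the analysis response above come from reverse path hierarchy tokenization on `.`. A minimal sketch of that expansion (an illustration, not the plugin's Java implementation):

```python
# Sketch: reverse path hierarchy tokenization of the host on ".",
# with character offsets into the original URL.
def reverse_host_tokens(url, host):
    start = url.index(host)        # start_offset of the full host
    end = start + len(host)        # end_offset shared by every host token
    labels = host.split(".")
    tokens, offset = [], start
    for i, label in enumerate(labels):
        tokens.append((".".join(labels[i:]), offset, end))
        offset += len(label) + 1   # advance past this label and its "."
    return tokens

host_tokens = reverse_host_tokens("https://foo.bar.com/baz.html", "foo.bar.com")
# [("foo.bar.com", 8, 19), ("bar.com", 12, 19), ("com", 16, 19)]
```

The three tuples match the `token`, `start_offset`, and `end_offset` values in the response above.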
Set up your index like so:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "url_host": {
          "type": "url",
          "part": "host",
          "url_decode": true
        }
      },
      "analyzer": {
        "url_host": {
          "filter": ["url_host"],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "example_type": {
      "properties": {
        "url": {
          "type": "multi_field",
          "fields": {
            "url": {"type": "string"},
            "host": {"type": "string", "analyzer": "url_host"}
          }
        }
      }
    }
  }
}
```

Make an analysis request:

```bash
curl 'http://localhost:9200/index_name/_analyze?analyzer=url_host&pretty' -d 'https://foo.bar.com/baz.html'
```
```json
{
  "tokens" : [ {
    "token" : "foo.bar.com",
    "start_offset" : 0,
    "end_offset" : 32,
    "type" : "word",
    "position" : 1
  } ]
}
```
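As a closing illustration of the `tokenize_query` and `url_decode` options described above, here is a Python sketch with a hypothetical URL (again an analogy, not the plugin's Java implementation):

```python
# Sketch of tokenize_query + url_decode: split the query string on "&"
# and optionally URL-decode each resulting token.
from urllib.parse import urlsplit, unquote

def query_tokens(url, url_decode=False):
    query = urlsplit(url).query
    raw = query.split("&") if query else []
    return [unquote(t) for t in raw] if url_decode else raw

decoded = query_tokens("http://example.com/?foo=bar%20baz&q=1", url_decode=True)
# ["foo=bar baz", "q=1"]
```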