Commit 7dee1ba

Added chapter about unicodeSetFilter. Changed analyzer names in the examples where they were all "collation".
Commits for the unicodeSetFilter changes are coming.
1 parent b2c5f3d commit 7dee1ba

1 file changed: +31 -3 lines changed

guide/reference/index-modules/analysis/icu-plugin.textile

Lines changed: 31 additions & 3 deletions
@@ -19,7 +19,7 @@ p. Normalizes characters as explained "here":http://userguide.icu-project.org/tr
     "index" : {
         "analysis" : {
             "analyzer" : {
-                "collation" : {
+                "normalization" : {
                     "tokenizer" : "keyword",
                     "filter" : ["icu_normalizer"]
                 }
@@ -31,14 +31,15 @@ p. Normalizes characters as explained "here":http://userguide.icu-project.org/tr
 
 h1. ICU Folding
 
-p. Folding of unicode characters based on @UTR#30@. It registers itself under @icu_folding@ and @icuFolding@ names. Sample setting:
+p. Folding of unicode characters based on @UTR#30@. It registers itself under @icu_folding@ and @icuFolding@ names.
+The filter also does lowercasing, which means the lowercase filter can normally be left out. Sample setting:
 
 <pre class="prettyprint">
 {
     "index" : {
         "analysis" : {
             "analyzer" : {
-                "collation" : {
+                "folding" : {
                     "tokenizer" : "keyword",
                     "filter" : ["icu_folding"]
                 }
@@ -48,6 +49,33 @@ p. Folding of unicode characters based on @UTR#30@. It registers itself under @i
 }
 </pre>
 
+h2. Filtering
+
+p. The folding can be restricted to a set of Unicode characters with the @unicodeSetFilter@ parameter. This is useful for a non-internationalized search engine that needs to retain national characters that are primary letters in a specific language. See the UnicodeSet syntax "here":http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html.
+
+p. The following example exempts Swedish characters from the folding. Note that the filtered characters are NOT lowercased, which is why the @lowercase@ filter is added to the chain as well.
+
+<pre class="prettyprint">
+{
+    "index" : {
+        "analysis" : {
+            "analyzer" : {
+                "folding" : {
+                    "tokenizer" : "standard",
+                    "filter" : ["my_icu_folding", "lowercase"]
+                }
+            },
+            "filter" : {
+                "my_icu_folding" : {
+                    "type" : "icu_folding",
+                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
+                }
+            }
+        }
+    }
+}
+</pre>
+
 h1. ICU Collation
 
 p. Uses collation token filter. Allows to either specify the rules for collation (defined "here":http://www.icu-project.org/userguide/Collate_Customization.html) using the @rules@ parameter (can point to a location or expressed in the settings, location can be relative to config location), or using the @language@ parameter (further specialized by country and variant). By default registers under @icu_collation@ or @icuCollation@ and uses the default locale.
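
The effect of the @unicodeSetFilter@ setting added in this commit can be illustrated with a small Python sketch. It is not the ICU implementation: the helper names (@fold_char@, @fold_with_filter@, @analyze@) are hypothetical, and the folding is approximated with the standard library's NFKD decomposition plus case folding, which covers only a small part of what @UTR#30@ describes. It only shows the behaviour stated in the new Filtering section: characters excluded from the set pass through untouched (and therefore keep their case), so a separate lowercase step is still needed.

```python
import unicodedata

# Assumption: the six Swedish letters excluded by "[^åäöÅÄÖ]" in the example.
SWEDISH_LETTERS = frozenset("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6")


def fold_char(ch: str) -> str:
    """Rough stand-in for ICU folding: strip diacritics, then case-fold."""
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def fold_with_filter(text: str, excluded: frozenset) -> str:
    """Fold every character except those in `excluded` (left as-is)."""
    # Normalize to NFC first so precomposed letters match the excluded set.
    text = unicodedata.normalize("NFC", text)
    return "".join(ch if ch in excluded else fold_char(ch) for ch in text)


def analyze(text: str, excluded: frozenset) -> str:
    """Mimic the filter chain ["my_icu_folding", "lowercase"]."""
    return fold_with_filter(text, excluded).lower()


if __name__ == "__main__":
    sample = "\u00c5ngstr\u00f6m Caf\u00e9"
    print(fold_with_filter(sample, SWEDISH_LETTERS))  # folding only
    print(analyze(sample, SWEDISH_LETTERS))           # folding + lowercase
    print(analyze(sample, frozenset()))               # no exclusions
```

With the Swedish letters excluded, the folding step alone leaves the capital Å in place while é is folded to e; the lowercase step then turns Å into å. With an empty exclusion set the same input folds all the way down to plain ASCII letters.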
