Doc improvements for language tags and custom ICU collations.

author Jeff Davis <[email protected]>

Thu, 18 May 2023 17:37:55 +0000 (10:37 -0700)

committer Jeff Davis <[email protected]>

Thu, 18 May 2023 17:37:55 +0000 (10:37 -0700)
author Jeff Davis <[email protected]>
Thu, 18 May 2023 17:37:55 +0000 (10:37 -0700)
committer Jeff Davis <[email protected]>
Thu, 18 May 2023 17:37:55 +0000 (10:37 -0700)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml

index 6dd95b89664b77f6fa8008bab3857c31aefe7fd3..be06f746a5953bd0302c2dfedfd421bcdd1a98db 100644 (file)
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -377,7 +377,134 @@ initdb --locale-provider=icu --icu-locale=en
      variants and customization options.
     </para>
    </sect2>
+  <sect2 id="icu-locales">
+   <title>ICU Locales</title>
+   <sect3 id="icu-locale-names">
+    <title>ICU Locale Names</title>
+    <para>
+     The ICU format for the locale name is a <link
+     linkend="icu-language-tag">Language Tag</link>.
+
+<programlisting>
+CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP');
+CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr');
+</programlisting>
+    </para>
+   </sect3>
+   <sect3 id="icu-canonicalization">
+    <title>Locale Canonicalization and Validation</title>
+    <para>
+     When defining a new ICU collation object or database with ICU as the
+     provider, the given locale name is transformed ("canonicalized") into a
+     language tag if not already in that form. For instance,
+
+<screen>
+CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true');
+NOTICE:  using standard form "en-US-u-kn" for locale "en-US-u-kn-true"
+CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8');
+NOTICE:  using standard form "de-DE" for locale "de_DE.utf8"
+</screen>
+
+     If you see this notice, ensure that the <symbol>PROVIDER</symbol> and
+     <symbol>LOCALE</symbol> are the expected result. For consistent results
+     when using the ICU provider, specify the canonical <link
+     linkend="icu-language-tag">language tag</link> instead of relying on the
+     transformation.
+    </para>
+    <para>
+     A locale with no language name, or the special language name
+     <literal>root</literal>, is transformed to have the language
+     <literal>und</literal> ("undefined").
+    </para>
+    <para>
+     ICU can transform most libc locale names, as well as some other formats,
+     into language tags for easier transition to ICU. If a libc locale name is
+     used in ICU, it may not have precisely the same behavior as in libc.
+    </para>
+    <para>
+     If there is a problem interpreting the locale name, or if the locale name
+     represents a language or region that ICU does not recognize, you will see
+     the following warning:
+
+<screen>
+CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense');
+WARNING:  ICU locale "nonsense" has unknown language "nonsense"
+HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION
+</screen>
+
+     <xref linkend="guc-icu-validation-level"/> controls how the message is
+     reported. Unless set to <literal>ERROR</literal>, the collation will
+     still be created, but the behavior may not be what the user intended.
+    </para>
+   </sect3>
+   <sect3 id="icu-language-tag">
+    <title>Language Tag</title>
+    <para>
+     A language tag, defined in BCP 47, is a standardized identifier used to
+     identify languages, regions, and other information about a locale.
+    </para>
+    <para>
+     Basic language tags are simply
+     <replaceable>language</replaceable><literal>-</literal><replaceable>region</replaceable>;
+     or even just <replaceable>language</replaceable>. The
+     <replaceable>language</replaceable> is a language code
+     (e.g. <literal>fr</literal> for French), and
+     <replaceable>region</replaceable> is a region code
+     (e.g. <literal>CA</literal> for Canada). Examples:
+     <literal>ja-JP</literal>, <literal>de</literal>, or
+     <literal>fr-CA</literal>.
+    </para>
+    <para>
+     Collation settings may be included in the language tag to customize
+     collation behavior. ICU allows extensive customization, such as
+     sensitivity (or insensitivity) to accents, case, and punctuation;
+     treatment of digits within text; and many other options to satisfy a
+     variety of uses.
+    </para>
+    <para>
+     To include this additional collation information in a language tag,
+     append <literal>-u</literal>, which indicates there are additional
+     collation settings, followed by one or more
+     <literal>-</literal><replaceable>key</replaceable><literal>-</literal><replaceable>value</replaceable>
+     pairs. The <replaceable>key</replaceable> is the key for a <link
+     linkend="icu-collation-settings">collation setting</link> and
+     <replaceable>value</replaceable> is a valid value for that setting. For
+     boolean settings, the <literal>-</literal><replaceable>key</replaceable>
+     may be specified without a corresponding
+     <literal>-</literal><replaceable>value</replaceable>, which implies a
+     value of <literal>true</literal>.
+    </para>
+    <para>
+     For example, the language tag <literal>en-US-u-kn-ks-level2</literal>
+     means the locale with the English language in the US region, with
+     collation settings <literal>kn</literal> set to <literal>true</literal>
+     and <literal>ks</literal> set to <literal>level2</literal>. Those
+     settings mean the collation will be case-insensitive and treat a sequence
+     of digits as a single number:
  
+<screen>
+CREATE COLLATION mycollation5 (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'en-US-u-kn-ks-level2');
+SELECT 'aB' = 'Ab' COLLATE mycollation5 as result;
+ result
+--------
+ t
+(1 row)
+
+SELECT 'N-45' &lt; 'N-123' COLLATE mycollation5 as result;
+ result
+--------
+ t
+(1 row)
+</screen>
+    </para>
+    <para>
+     See <xref linkend="icu-custom-collations"/> for details and additional
+     examples of using language tags with custom collation information for the
+     locale.
+    </para>
+   </sect3>
+  </sect2>
    <sect2 id="locale-problems">
     <title>Problems</title>
  
@@ -658,6 +785,13 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
      code byte values.
     </para>
  
+   <note>
+    <para>
+     The <literal>C</literal> and <literal>POSIX</literal> locales may behave
+     differently depending on the database encoding.
+    </para>
+   </note>
+
     <para>
      Additionally, two SQL standard collation names are available:
  
@@ -869,132 +1003,24 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');
     <sect4 id="collation-managing-create-icu">
      <title>ICU Collations</title>
  
-   <para>
-    ICU allows collations to be customized beyond the basic language+country
-    set that is preloaded by <command>initdb</command>.  Users are encouraged
-    to define their own collation objects that make use of these facilities to
-    suit the sorting behavior to their requirements.
-    See <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
-    and <ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink> for
-    information on ICU locale naming.  The set of acceptable names and
-    attributes depends on the particular ICU version.
-   </para>
-
-   <para>
-    Here are some examples:
-
-    <variablelist>
-     <varlistentry id="collation-managing-create-icu-de-u-co-phonebk-x-icu">
-      <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term>
-      <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook');</literal></term>
-      <listitem>
-       <para>German collation with phone book collation type</para>
-       <para>
-        The first example selects the ICU locale using a <quote>language
-        tag</quote> per BCP 47.  The second example uses the traditional
-        ICU-specific locale syntax.  The first style is preferred going
-        forward, and is used internally to store locales.
-       </para>
-       <para>
-        Note that you can name the collation objects in the SQL environment
-        anything you want.  In this example, we follow the naming style that
-        the predefined collations use, which in turn also follow BCP 47, but
-        that is not required for user-defined collations.
-       </para>
-      </listitem>
-     </varlistentry>
-
-     <varlistentry id="collation-managing-create-icu-und-u-co-emoji-x-icu">
-      <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term>
-      <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji');</literal></term>
-      <listitem>
-       <para>
-        Root collation with Emoji collation type, per Unicode Technical Standard #51
-       </para>
-       <para>
-        Observe how in the traditional ICU locale naming system, the root
-        locale is selected by an empty string.
-       </para>
-      </listitem>
-     </varlistentry>
-
-     <varlistentry id="collation-managing-create-icu-en-u-kr-grek-latn">
-      <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');</literal></term>
-      <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en@colReorder=grek-latn');</literal></term>
-      <listitem>
-       <para>
-        Sort Greek letters before Latin ones.  (The default is Latin before Greek.)
-       </para>
-      </listitem>
-     </varlistentry>
-
-     <varlistentry id="collation-managing-create-icu-en-u-kf-upper">
-      <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term>
-      <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper');</literal></term>
-      <listitem>
-       <para>
-        Sort upper-case letters before lower-case letters.  (The default is
-        lower-case letters first.)
-       </para>
-      </listitem>
-     </varlistentry>
-
-    <varlistentry id="collation-managing-create-icu-en-u-kf-upper-kr-grek-latn">
-      <term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');</literal></term>
-      <term><literal>CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=grek-latn');</literal></term>
-      <listitem>
-       <para>
-        Combines both of the above options.
-       </para>
-      </listitem>
-     </varlistentry>
-
-     <varlistentry id="collation-managing-create-icu-en-u-kn-true">
-      <term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');</literal></term>
-      <term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');</literal></term>
-      <listitem>
-       <para>
-        Numeric ordering, sorts sequences of digits by their numeric value,
-        for example: <literal>A-21</literal> &lt; <literal>A-123</literal>
-        (also known as natural sort).
-       </para>
-      </listitem>
-     </varlistentry>
-    </variablelist>
-
-    See <ulink url="https://www.unicode.org/reports/tr35/tr35-collation.html">Unicode
-    Technical Standard #35</ulink>
-    and <ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink> for
-    details.  The list of possible collation types (<literal>co</literal>
-    subtag) can be found in
-    the <ulink url="https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml">CLDR
-    repository</ulink>.
-   </para>
+    <para>
+     ICU collations can be created like:
  
-   <para>
-    Note that while this system allows creating collations that <quote>ignore
-    case</quote> or <quote>ignore accents</quote> or similar (using the
-    <literal>ks</literal> key), in order for such collations to act in a
-    truly case- or accent-insensitive manner, they also need to be declared as not
-    <firstterm>deterministic</firstterm> in <command>CREATE COLLATION</command>;
-    see <xref linkend="collation-nondeterministic"/>.
-    Otherwise, any strings that compare equal according to the collation but
-    are not byte-wise equal will be sorted according to their byte values.
-   </para>
+<programlisting>
+CREATE COLLATION german (provider = icu, locale = 'de-DE');
+</programlisting>
  
-   <note>
+     ICU locales are specified as a BCP 47 <link
+     linkend="icu-language-tag">Language Tag</link>, but can also accept most
+     libc-style locale names. If possible, libc-style locale names are
+     transformed into language tags.
+    </para>
      <para>
-     By design, ICU will accept almost any string as a locale name and match
-     it to the closest locale it can provide, using the fallback procedure
-     described in its documentation.  Thus, there will be no direct feedback
-     if a collation specification is composed using features that the given
-     ICU installation does not actually support.  It is therefore recommended
-     to create application-level test cases to check that the collation
-     definitions satisfy one's requirements.
+     New ICU collations can customize collation behavior extensively by
+     including collation attributes in the langugage tag. See <xref
+     linkend="icu-custom-collations"/> for details and examples.
      </para>
-   </note>
     </sect4>
-
     <sect4 id="collation-copy">
     <title>Copying Collations</title>
  
@@ -1072,6 +1098,421 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      </tip>
     </sect3>
    </sect2>
+  <sect2 id="icu-custom-collations">
+   <title>ICU Custom Collations</title>
+
+   <para>
+    ICU allows extensive control over collation behavior by defining new
+    collations with collation settings as a part of the language tag. These
+    settings can modify the collation order to suit a variety of needs. For
+    instance:
+
+<programlisting>
+-- ignore differences in accents and case
+CREATE COLLATION ignore_accent_case (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level1');
+SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
+SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
+
+-- upper case letters sort before lower case.
+CREATE COLLATION upper_first (PROVIDER=icu, LOCALE = 'und-u-kf-upper');
+SELECT 'B' &lt; 'b' COLLATE upper_first; -- true
+
+-- treat digits numerically and ignore punctuation
+CREATE COLLATION num_ignore_punct (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-kn');
+SELECT 'id-45' &lt; 'id-123' COLLATE num_ignore_punct; -- true
+SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
+</programlisting>
+
+    Many of the available options are described in <xref
+    linkend="icu-collation-settings"/>, or see <xref
+    linkend="icu-external-references"/> for more details.
+   </para>
+   <sect3 id="icu-collation-comparison-levels">
+    <title>ICU Comparison Levels</title>
+    <para>
+     Comparison of two strings (collation) in ICU is determined by a
+     multi-level process, where textual features are grouped into
+     "levels". Treatment of each level is controlled by the <link
+     linkend="icu-collation-settings-table">collation settings</link>. Higher
+     levels correspond to finer textual features.
+    </para>
+    <para>
+     <table id="icu-collation-levels">
+      <title>ICU Collation Levels</title>
+      <tgroup cols="3">
+       <thead>
+        <row>
+         <entry>Level</entry>
+         <entry>Description</entry>
+         <entry><literal>'f' = 'f'</literal></entry>
+         <entry><literal>'ab' = U&amp;'a\2063b'</literal></entry>
+         <entry><literal>'x-y' = 'x_y'</literal></entry>
+         <entry><literal>'g' = 'G'</literal></entry>
+         <entry><literal>'n' = 'ñ'</literal></entry>
+         <entry><literal>'y' = 'z'</literal></entry>
+        </row>
+       </thead>
+       <tbody>
+        <row>
+         <entry>level1</entry>
+         <entry>Base Character</entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>false</literal></entry>
+        </row>
+        <row>
+         <entry>level2</entry>
+         <entry>Accents</entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+        </row>
+        <row>
+         <entry>level3</entry>
+         <entry>Case/Variants</entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+        </row>
+        <row>
+         <entry>level4</entry>
+         <entry>Punctuation</entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+        </row>
+        <row>
+         <entry>identic</entry>
+         <entry>All</entry>
+         <entry><literal>true</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+        </row>
+       </tbody>
+      </tgroup>
+     </table>
+
+     The above table shows which textual feature differences are
+     considered significant when determining equality at the given level. The
+     unicode character <literal>U+2063</literal> is an invisible separator,
+     and as seen in the table, is ignored for at all levels of comparison less
+     than <literal>identic</literal>.
+    </para>
+    <para>
+     At every level, even with full normalization off, basic normalization is
+     performed. For example, <literal>'á'</literal> may be composed of the
+     code points <literal>U&amp;'\0061\0301'</literal> or the single code
+     point <literal>U&amp;'\00E1'</literal>, and those sequences will be
+     considered equal even at the <literal>identic</literal> level. To treat
+     any difference in code point representation as distinct, use a collation
+     created with <symbol>DETERMINISTIC</symbol> set to
+     <literal>true</literal>.
+    </para>
+    <sect4 id="icu-collation-level-examples">
+     <title>Collation Level Examples</title>
+     <para>
+
+<programlisting>
+CREATE COLLATION level3 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level3');
+CREATE COLLATION level4 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level4');
+CREATE COLLATION identic (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-identic');
+
+-- invisible separator ignored at all levels except identic
+SELECT 'ab' = U&amp;'a\2063b' COLLATE level4; -- true
+SELECT 'ab' = U&amp;'a\2063b' COLLATE identic; -- false
+
+-- punctuation ignored at level3 but not at level 4
+SELECT 'x-y' = 'x_y' COLLATE level3; -- true
+SELECT 'x-y' = 'x_y' COLLATE level4; -- false
+</programlisting>
+
+     </para>
+    </sect4>
+   </sect3>
+   <sect3 id="icu-collation-settings">
+    <title>Collation Settings for an ICU Locale</title>
+    <para>
+     <table id="icu-collation-settings-table">
+      <title>ICU Collation Settings</title>
+      <tgroup cols="4">
+       <thead>
+        <row>
+         <entry>Key</entry>
+         <entry>Values</entry>
+         <entry>Default</entry>
+         <entry>Description</entry>
+        </row>
+       </thead>
+       <tbody>
+        <row>
+         <entry><literal>ks</literal></entry>
+         <entry><literal>level1</literal>, <literal>level2</literal>, <literal>level3</literal>, <literal>level4</literal>, <literal>identic</literal></entry>
+         <entry><literal>level3</literal></entry>
+         <entry>
+          Sensitivity (or "strength") when determining equality, with
+          <literal>level1</literal> the least sensitive to differences and
+          <literal>identic</literal> the most sensitive to differences. See
+          <xref linkend="icu-collation-levels"/> for details.
+         </entry>
+        </row>
+        <row>
+         <entry><literal>ka</literal></entry>
+         <entry><literal>noignore</literal>, <literal>shifted</literal></entry>
+         <entry><literal>noignore</literal></entry>
+         <entry>
+          If set to <literal>shifted</literal>, causes some characters
+          (e.g. punctuation or space) to be ignored in comparison. Key
+          <literal>ks</literal> must be set to <literal>level3</literal> or
+          lower to take effect. Set key <literal>kv</literal> to control which
+          character classes are ignored.
+         </entry>
+        </row>
+        <row>
+         <entry><literal>kb</literal></entry>
+         <entry><literal>true</literal>, <literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry>
+          Backwards comparison for the level 2 differences. For example,
+          locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
+          before <literal>'aé'</literal>.
+         </entry>
+        </row>
+        <row>
+         <entry><literal>kk</literal></entry>
+         <entry><literal>true</literal>, <literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry>
+          <para>
+           Enable full normalization; may affect performance. Basic
+           normalization is performed even when set to
+           <literal>false</literal>. Locales for languages that require full
+           normalization typically enable it by default.
+          </para>
+          <para>
+           Full normalization is important in some cases, such as when
+           multiple accents are applied to a single character. For instance,
+           <literal>'ệ'</literal> can be composed of code points
+           <literal>U&amp;'\0065\0323\0302'</literal> or
+           <literal>U&amp;'\0065\0302\0323'</literal>. With full normalization
+           on, these code point sequences are treated as equal; otherwise they
+           are unequal.
+          </para>
+         </entry>
+        </row>
+        <row>
+         <entry><literal>kc</literal></entry>
+         <entry><literal>true</literal>, <literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry>
+          <para>
+           Separates case into a "level 2.5" that falls between accents and
+           other level 3 features.
+          </para>
+          <para>
+           If set to <literal>true</literal> and <literal>ks</literal> is set
+           to <literal>level1</literal>, will ignore accents but take case
+           into account.
+          </para>
+         </entry>
+        </row>
+        <row>
+         <entry><literal>kf</literal></entry>
+         <entry>
+          <literal>upper</literal>, <literal>lower</literal>,
+          <literal>false</literal>
+         </entry>
+         <entry><literal>false</literal></entry>
+         <entry>
+          If set to <literal>upper</literal>, upper case sorts before lower
+          case. If set to <literal>lower</literal>, lower case sorts before
+          upper case. If set to <literal>false</literal>, the sort depends on
+          the rules of the locale.
+         </entry>
+        </row>
+        <row>
+         <entry><literal>kn</literal></entry>
+         <entry><literal>true</literal>, <literal>false</literal></entry>
+         <entry><literal>false</literal></entry>
+         <entry>
+          If set to <literal>true</literal>, numbers within a string are
+          treated as a single numeric value rather than a sequence of
+          digits. For example, <literal>'id-45'</literal> sorts before
+          <literal>'id-123'</literal>.
+         </entry>
+        </row>
+        <row>
+         <entry><literal>kr</literal></entry>
+         <entry>
+          <literal>space</literal>, <literal>punct</literal>,
+          <literal>symbol</literal>, <literal>currency</literal>,
+          <literal>digit</literal>, <replaceable>script-id</replaceable>
+         </entry>
+         <entry></entry>
+         <entry>
+          <para>
+           Set to one or more of the valid values, or any BCP 47
+           <replaceable>script-id</replaceable>, e.g. <literal>latn</literal>
+           ("Latin") or <literal>grek</literal> ("Greek"). Multiple values are
+           separated by "<literal>-</literal>".
+          </para>
+          <para>
+           Redefines the ordering of classes of characters; those characters
+           belonging to a class earlier in the list sort before characters
+           belonging to a class later in the list. For instance, the value
+           <literal>digit-currency-space</literal> (as part of a language tag
+           like <literal>und-u-kr-digit-currency-space</literal>) sorts
+           punctuation before digits and spaces.
+          </para>
+         </entry>
+        </row>
+        <row>
+         <entry><literal>kv</literal></entry>
+         <entry>
+          <literal>space</literal>, <literal>punct</literal>,
+          <literal>symbol</literal>, <literal>currency</literal>
+         </entry>
+         <entry><literal>punct</literal></entry>
+         <entry>
+          Classes of characters ignored during comparison at level 3. Setting
+          to a later value includes earlier values;
+          e.g. <literal>symbol</literal> also includes
+          <literal>punct</literal> and <literal>space</literal> in the
+          characters to be ignored. Key <literal>ka</literal> must be set to
+          <literal>shifted</literal> and key <literal>ks</literal> must be set
+          to <literal>level3</literal> or lower to take effect.
+         </entry>
+        </row>
+        <row>
+         <entry><literal>co</literal></entry>
+         <entry><literal>emoji</literal>, <literal>phonebk</literal>, <literal>standard</literal>, <replaceable>...</replaceable></entry>
+         <entry><literal>standard</literal></entry>
+         <entry>
+          Collation type. See <xref linkend="icu-external-references"/> for additional options and details.
+         </entry>
+        </row>
+       </tbody>
+      </tgroup>
+     </table>
+      Defaults may depend on locale. The above table is not meant to be
+      complete. See <xref linkend="icu-external-references"/> for additional
+      options and details.
+    </para>
+    <note>
+     <para>
+      For many collation settings, you must create the collation with
+      <option>DETERMINISTIC</option> set to <literal>false</literal> for the
+      setting to have the desired effect (see <xref
+      linkend="collation-nondeterministic"/>). Additionally, some settings
+      only take effect when the key <literal>ka</literal> is set to
+      <literal>shifted</literal> (see <xref
+      linkend="icu-collation-settings-table"/>).
+     </para>
+    </note>
+   </sect3>
+   <sect3 id="icu-locale-examples">
+    <title>Examples</title>
+    <para>
+     <variablelist>
+      <varlistentry id="collation-managing-create-icu-de-u-co-phonebk-x-icu">
+       <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term>
+       <listitem>
+        <para>German collation with phone book collation type</para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="collation-managing-create-icu-und-u-co-emoji-x-icu">
+       <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term>
+       <listitem>
+        <para>
+         Root collation with Emoji collation type, per Unicode Technical Standard #51
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="collation-managing-create-icu-en-u-kr-grek-latn">
+       <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');</literal></term>
+       <listitem>
+        <para>
+         Sort Greek letters before Latin ones.  (The default is Latin before Greek.)
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="collation-managing-create-icu-en-u-kf-upper">
+       <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term>
+       <listitem>
+        <para>
+         Sort upper-case letters before lower-case letters.  (The default is
+         lower-case letters first.)
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="collation-managing-create-icu-en-u-kf-upper-kr-grek-latn">
+       <term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');</literal></term>
+       <listitem>
+        <para>
+         Combines both of the above options.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+    </para>
+   </sect3>
+   <sect3 id="icu-external-references">
+    <title>External References for ICU</title>
+    <para>
+     This section (<xref linkend="icu-custom-collations"/>) is only a brief
+     overview of ICU behavior and language tags. Refer to the following
+     documents for technical details, additional options, and new behavior:
+    </para>
+    <itemizedlist>
+     <listitem>
+      <para>
+       <ulink
+           url="https://www.unicode.org/reports/tr35/tr35-collation.html">Unicode
+       Technical Standard #35</ulink>
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       <ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink>
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       <ulink url="https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml">CLDR
+       repository</ulink>
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       <ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink>
+      </para>
+     </listitem>
+    </itemizedlist>
+   </sect3>
+  </sect2>
   </sect1>
  
   <sect1 id="multibyte">
author	Jeff Davis <[email protected]>
	Thu, 18 May 2023 17:37:55 +0000 (10:37 -0700)
committer	Jeff Davis <[email protected]>
	Thu, 18 May 2023 17:37:55 +0000 (10:37 -0700)