You are viewing the version of this documentation from Perl 5.28.3.

NAME

perluniprops - Index of Unicode Version 10.0.0 character properties in Perl

DESCRIPTION

This document provides information about the portion of the Unicode database that deals with character properties, that is, the portion that is defined on single code points. ("Other information in the Unicode data base" below briefly mentions other data that Unicode provides.)

Perl can provide access to all non-provisional Unicode character properties, though not all are enabled by default. The omitted ones are the Unihan properties (accessible via the CPAN module Unicode::Unihan) and certain deprecated or Unicode-internal properties. (An installation may choose to recompile Perl's tables to change this. See "Unicode character properties that are NOT accepted by Perl".)

For most purposes, access to Unicode properties from the Perl core is through regular expression matches, as described in the next section. For some special purposes, and to access the properties that are not suitable for regular expression matching, all the Unicode character properties that Perl handles are accessible via the standard Unicode::UCD module, as described in the section "Properties accessible through Unicode::UCD".
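As a minimal sketch of the non-regex route, the following uses two functions exported by Unicode::UCD, charinfo() and charscript(), to look up properties of individual code points. The code points chosen (U+0041 and U+03B1) are only illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo charscript);

# Look up a few properties of U+0041 without using a regular expression.
# charinfo() returns a hash reference describing the code point.
my $info = charinfo(0x41);
printf "name:     %s\n", $info->{name};      # LATIN CAPITAL LETTER A
printf "category: %s\n", $info->{category};  # Lu

# charscript() returns the Script property value for a code point.
my $script = charscript(0x3B1);              # GREEK SMALL LETTER ALPHA
printf "script of U+03B1: %s\n", $script;
```

For anything that can be expressed as a character class, the \p{} constructs described in the next section are usually the more convenient interface.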

Perl also provides some additional extensions and short-cut synonyms for Unicode properties.

This document merely lists all available properties and does not attempt to explain what each property really means. There is a brief description of each Perl extension; see "Other Properties" in perlunicode for more information on these. There is some detail about Blocks, Scripts, General_Category, and Bidi_Class in perlunicode, but to find out about the intricacies of the official Unicode properties, refer to the Unicode standard. A good starting place is http://www.unicode.org/reports/tr44/.

Note that you can define your own properties; see "User-Defined Character Properties" in perlunicode.

Properties accessible through \p{} and \P{}

The Perl regular expression \p{} and \P{} constructs give access to most of the Unicode character properties. The table below shows all these constructs, both single and compound forms.

Compound forms consist of two components, separated by an equals sign or a colon. The first component is the property name, and the second component is the particular value of the property to match against, for example, \p{Script_Extensions: Greek} and \p{Script_Extensions=Greek} both mean to match characters whose Script_Extensions property value is Greek. (Script_Extensions is an improved version of the Script property.)
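The two compound spellings can be compared directly. This is a small sketch; the sample characters are arbitrary choices and not part of the original text.

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $latin = "a";         # LATIN SMALL LETTER A

# The colon and equals-sign spellings are interchangeable.
my $colon  = ($alpha =~ /\p{Script_Extensions: Greek}/) ? 1 : 0;
my $equals = ($alpha =~ /\p{Script_Extensions=Greek}/)  ? 1 : 0;

# A Latin letter does not have Greek among its Script_Extensions.
my $other  = ($latin =~ /\p{Script_Extensions=Greek}/)  ? 1 : 0;

print "colon form matches alpha:  $colon\n";
print "equals form matches alpha: $equals\n";
print "equals form matches 'a':   $other\n";
```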

Single forms, like \p{Greek}, are mostly Perl-defined shortcuts for their equivalent compound forms. The table shows these equivalences. (In our example, \p{Greek} is just a shortcut for \p{Script_Extensions=Greek}.) There are also a few Perl-defined single forms that are not shortcuts for a compound form. One such is \p{Word}. These are also listed in the table.
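The equivalence between a single form and its compound form can be spot-checked by comparing the two over a range of code points. The sketch below assumes a Perl in which \p{Greek} is defined in terms of Script_Extensions, as this document describes; the range U+0000..U+03FF is an arbitrary choice.

```perl
use strict;
use warnings;

# Count code points on which the single and compound forms disagree.
# Assumption: \p{Greek} is a shortcut for \p{Script_Extensions=Greek},
# as stated in this document.
my $mismatches = 0;
for my $cp (0 .. 0x3FF) {
    my $char     = chr $cp;
    my $single   = ($char =~ /\p{Greek}/)                    ? 1 : 0;
    my $compound = ($char =~ /\p{Script_Extensions=Greek}/)  ? 1 : 0;
    $mismatches++ if $single != $compound;
}
print "disagreements: $mismatches\n";

# \p{Word} is a Perl-defined single form with no compound equivalent.
my $underscore_is_word = ("_" =~ /\p{Word}/) ? 1 : 0;
my $hyphen_is_word     = ("-" =~ /\p{Word}/) ? 1 : 0;
print "underscore matches \\p{Word}: $underscore_is_word\n";
print "hyphen matches \\p{Word}:     $hyphen_is_word\n";
```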

In parsing these constructs, Perl always ignores upper/lower case differences everywhere within the {braces}. Thus \p{Greek} means the same thing as \p{greek}. But note that changing the case of the "p" or "P" before the left brace completely changes the meaning of the construct, from "match" (for \p{}) to "doesn't match" (for \P{}). Casing in this document is for improved legibility.
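A short sketch of both points: case inside the braces is ignored, while the case of the leading "p" flips the sense of the match. The spellings tried here are arbitrary.

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";   # GREEK SMALL LETTER ALPHA

# Every casing of the property name should behave identically.
my %matches;
for my $name (qw(Greek greek GREEK gReEk)) {
    my $pattern = "\\p{$name}";          # build \p{...} at run time
    $matches{$name} = ($alpha =~ /$pattern/) ? 1 : 0;
    print "\\p{$name}: $matches{$name}\n";
}

# An uppercase P negates the property.
my $negated = ($alpha =~ /\P{Greek}/) ? 1 : 0;
print "\\P{Greek}: $negated\n";
```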

Also, white space, hyphens, and underscores are normally ignored everywhere between the {braces}, and hence can be freely added or removed even if the /x modifier hasn't been specified on the regular expression. But in the table below a 'T' at the beginning of an entry means that tighter (stricter) rules are used for that entry.

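Some properties are considered obsolete by Unicode, but are still available. There are several varieties of obsolescence, each flagged in the table by a letter: 'S' (stabilized), 'D' (deprecated), 'O' (obsolete), and 'X' (discouraged). The legend summary below lists these flags.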

The table below has two columns. The left column contains the \p{} constructs to look up, possibly preceded by the flags mentioned above; and the right column contains information about them, like a description, or synonyms. The table shows both the single and compound forms for each property that has them. If the left column is a short name for a property, the right column will give its longer, more descriptive name; and if the left column is the longest name, the right column will show any equivalent shortest name, in both single and compound forms if applicable.

If braces are not needed to specify a property (e.g., \pL), the left column contains both forms, with and without braces.

The right column will also caution you if a property means something different than what might normally be expected.

All single forms are Perl extensions; a few compound forms are as well, and are noted as such.

Numbers in (parentheses) indicate the total number of Unicode code points matched by the property. For the entries that give the longest, most descriptive version of the property, the count is followed by a list of some of the code points matched by it. The list includes all the matched characters in the 0-255 range, enclosed in the familiar [brackets] the same as a regular expression bracketed character class. Following that, the next few higher matching ranges are also given. To avoid visual ambiguity, the SPACE character is represented as \x20.
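The counts in parentheses can be reproduced with prop_invlist() from Unicode::UCD, which returns an inversion list for a property. The helper name count_code_points is hypothetical, and the sketch uses \p{Block=Aegean_Numbers}, which the table lists with a count of 64.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invlist);

# Hypothetical helper: count the Unicode code points matched by a property.
# An inversion list alternates "range starts here" and "range ends before
# here" values; a trailing unpaired start runs to the end of Unicode.
sub count_code_points {
    my ($property) = @_;
    my @invlist = prop_invlist($property);
    my $count = 0;
    for (my $i = 0; $i < @invlist; $i += 2) {
        my $end = ($i + 1 < @invlist) ? $invlist[$i + 1] : 0x110000;
        $count += $end - $invlist[$i];
    }
    return $count;
}

my $aegean = count_code_points("Block=Aegean_Numbers");
print "Block=Aegean_Numbers matches $aegean code points\n";
```

Counts for scripts and other properties can differ between Perl releases, because each release ships a different Unicode version.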

For emphasis, those properties that match no code points at all are listed as well in a separate section following the table.

Most properties match the same code points regardless of whether "/i" case-insensitive matching is specified or not. But a few properties are affected. These are shown with the notation (/i= other_property) in the second column. Under case-insensitive matching they match the same code points as the property other_property.
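As an illustration, \p{Lu} (uppercase letter) is one of the properties whose meaning widens under /i, while a script property such as \p{Greek} is unaffected. The choice of \p{Lu} is an assumption here, since its table entry is not part of this excerpt.

```perl
use strict;
use warnings;

my $lower = "a";   # a lowercase Latin letter

# Without /i, a lowercase letter is not an uppercase letter.
my $plain  = ($lower =~ /\p{Lu}/)  ? 1 : 0;

# With /i, \p{Lu} matches any cased letter, so "a" now matches.
my $folded = ($lower =~ /\p{Lu}/i) ? 1 : 0;

# Most properties are unaffected: "a" is not Greek with or without /i.
my $greek_i = ($lower =~ /\p{Greek}/i) ? 1 : 0;

print "\\p{Lu} without /i: $plain\n";
print "\\p{Lu} with /i:    $folded\n";
print "\\p{Greek} with /i: $greek_i\n";
```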

There is no description given for most non-Perl defined properties; see http://www.unicode.org/reports/tr44/ for those.

For compactness, '*' is used as a wildcard instead of showing all possible combinations. For example, entries like:

\p{Gc: *}                                  \p{General_Category: *}

mean that 'Gc' is a synonym for 'General_Category', and anything that is valid for the latter is also valid for the former. Similarly,

\p{Is_*}                                   \p{*}

means that, for example, \p{Is_Foo} and \p{IsFoo} are valid if and only if \p{Foo} exists, and all three mean the same thing. Similarly, \p{Foo=Bar} means the same as \p{Is_Foo=Bar} and \p{IsFoo=Bar}. "*" here is restricted to something not beginning with an underscore.
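A brief sketch of the "Is" prefix on a single form, using \p{Greek} as the stand-in for \p{Foo}:

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";   # GREEK SMALL LETTER ALPHA

# The bare name and both "Is" spellings should behave identically.
my %matches;
for my $name (qw(Greek Is_Greek IsGreek)) {
    my $pattern = "\\p{$name}";          # build \p{...} at run time
    $matches{$name} = ($alpha =~ /$pattern/) ? 1 : 0;
    print "\\p{$name}: $matches{$name}\n";
}
```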

Also, in binary properties, 'Yes', 'T', and 'True' are all synonyms for 'Y', and 'No', 'F', and 'False' are all synonyms for 'N'. The table shows 'Y*' and 'N*' to indicate this, and doesn't have separate entries for the other possibilities. Note that not every property with the values 'Yes' and 'No' is binary. The non-binary ones have all their values spelled out without this wildcard, and their descriptions carry a NOT clause that highlights that they are not binary. They also require the compound form to match them, whereas true binary properties have both single and compound forms available.
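The value synonyms can be exercised with any binary property. The sketch below uses Alphabetic, which is a choice made for illustration and does not appear in this excerpt; matches_alphabetic is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: test a character against \p{Alphabetic=<value>}.
sub matches_alphabetic {
    my ($value, $char) = @_;
    my $pattern = "\\p{Alphabetic=$value}";   # build \p{...} at run time
    return ($char =~ /$pattern/) ? 1 : 0;
}

# All spellings of "yes" should accept a letter; all spellings of "no"
# should reject it.
for my $yes (qw(Y Yes T True)) {
    printf "Alphabetic=%-5s matches 'a': %d\n", $yes, matches_alphabetic($yes, "a");
}
for my $no (qw(N No F False)) {
    printf "Alphabetic=%-5s matches 'a': %d\n", $no, matches_alphabetic($no, "a");
}

# A true binary property also has a single form.
my $single = ("a" =~ /\p{Alphabetic}/) ? 1 : 0;
print "\\p{Alphabetic} matches 'a': $single\n";
```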

Note that all non-essential underscores are removed in the display of the short names below.

Legend summary:

* is a wildcard.
(\d+) in the info column gives the number of Unicode code points matched by this property.
D means this is deprecated.
O means this is obsolete.
S means this is stabilized.
T means tighter (stricter) name matching applies.
X means use of this form is discouraged, and may not be stable.

      NAME                           INFO

  \p{Adlam}               \p{Script_Extensions=Adlam} (Short:
                            \p{Adlm}; NOT \p{Block=Adlam}) (88)
  \p{Adlm}                \p{Adlam} (= \p{Script_Extensions=Adlam})
                            (NOT \p{Block=Adlam}) (88)
X \p{Aegean_Numbers}      \p{Block=Aegean_Numbers} (64)
T \p{Age: 1.1}            \p{Age=V1_1} (33_979)
  \p{Age: V1_1}           Code point's usage introduced in version
                            1.1 (33_979: U+0000..01F5, U+01FA..0217,
                            U+0250..02A8, U+02B0..02DE,
                            U+02E0..02E9, U+0300..0345 ...)
T \p{Age: 2.0}            \p{Age=V2_0} (144_521)
  \p{Age: V2_0}           Code point's usage was introduced in
                            version 2.0; See