Description
Bit by bit, the C++ standard is gaining more support for Unicode, and the charN_t family of types is becoming more relevant. We should have some rules that guide the usage of these types.
A good starting point would be something like:
Don't mix different character types in expressions
Mixing character types is often wrong, even when no narrowing occurs. Consider the following example:
bool contains_oe(std::u8string_view str) {
    for (char8_t c : str)
        if (c == U'ö') // comparison always fails
            return true;
    return false;
}
The comparison always fails because ö is UTF-8-encoded as the two code units 0xC3 0xB6, whereas U'ö' has the value 0xF6, a byte that never occurs in valid UTF-8. So even if str contains a u8"ö" somewhere, you wouldn't be able to find it this way.
There are certain instances where mixing character types is safe; for instance, u8'x' == U'x' is true. However, safe use of this property requires the developer to memorize the set of ASCII characters.
Mixing char, wchar_t, and other character types in expressions is generally bug-prone because it's encoding-dependent. Treating char as char8_t may be safe if char is UTF-8 anyway, but that's far from universally true.