
Rules for char8_t et al. #2275

Closed
@Eisenwave


Bit by bit, the C++ standard is gaining more support for Unicode, and the charN_t family of types is gaining relevance. We should have some rules that guide the usage of these types.

A good starting point would be something like

Don't mix different character types in expressions

Mixing character types is often wrong, even when no narrowing occurs. Consider the following example:

bool contains_oe(std::u8string_view str) {
    for (char8_t c : str)
        if (c == U'ö') // comparison always fails
            return true;
    return false;
}

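The comparison always fails because ö (U+00F6) is UTF-8-encoded as the two code units 0xC3 0xB6, whereas U'ö' has the value 0xF6, which never occurs as a code unit in valid UTF-8. So even if str contains a u8"ö" somewhere, you wouldn't be able to find it this way.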

There are certain instances where mixing character types is safe; for instance, u8'x' == U'x' is true because ASCII characters have the same numeric value in UTF-8 and UTF-32. However, safe use of this property requires the developer to remember exactly which characters are ASCII.

Mixing char, wchar_t, and other character types in expressions is generally bug-prone because the result is encoding-dependent. Treating char as char8_t may be safe if the char encoding is UTF-8 anyway, but that's far from universally true.
