UTF-8 Is Beautiful

It’s likely that many Hackaday readers will be aware of UTF-8, the mechanism for incorporating diverse alphabets and other characters such as 💩 emojis. It takes the long-established 7-bit ASCII character set and extends it into multiple bytes to represent many thousands of characters. How it does this may well be beyond that basic grasp, and [Vishnu] is here with a primer that’s both fascinating and easy to read.

UTF-8 extends ASCII from codes which fit in a single byte, to codes which can be up to four bytes long. The key lies in the first few bits of each byte: the leading byte’s high bits specify how many bytes make up the character, and every following byte starts with a marker identifying it as a continuation (data) byte. Since 7-bit ASCII codes always have a 0 in their most significant bit when mapped onto an 8-bit byte, compatibility with ASCII is ensured by the first 128 characters always beginning with a zero bit. It’s simple, elegant, and for any of us who had to deal with character set hell in the days before it came along, magic.
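
As a rough sketch of that scheme (ours, not from [Vishnu]’s write-up), here is how a single code point maps onto those leading-bit patterns in Python:

    def utf8_encode(cp: int) -> bytes:
        # Hand-rolled UTF-8 encoder: the leading byte says how long the
        # sequence is, every continuation byte starts with the bits 10.
        if cp < 0x80:                     # 0xxxxxxx – plain ASCII
            return bytes([cp])
        if cp < 0x800:                    # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp < 0x10000:                  # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        if cp < 0x110000:                 # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
        raise ValueError("code point out of range")

    for ch in ("A", "é", "€", "💩"):
        assert utf8_encode(ord(ch)) == ch.encode("utf-8")
        print(ch, utf8_encode(ord(ch)).hex(" "))   # 41 / c3 a9 / e2 82 ac / f0 9f 92 a9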

We’ve talked surprisingly little about the internals of UTF-8 in the past, but it’s worthy of note that this is our second piece ever to use the poop emoji, after our coverage of the billionth GitHub repository.

Emoji bales: Tony Hisgett, CC BY 2.0.

30 thoughts on “UTF-8 Is Beautiful”

    1. The Unicode Consortium has put pretty much everything into Unicode. In addition to all of the traditional written languages that used to be covered by code pages / character sets (even the old box-drawing symbology from DOS and other systems), they’ve adopted the emoji library (which was the Wild West when device vendors controlled it absent shared standards), made emoji composable via zero-width joiners, added the combining glyph sets (the reason it’s now possible to write “Spin̈al Tap” without a special font, despite an umlaut over a consonant not being valid in any language), accepted Klingon iconography, and a couple of years ago they even brought in a raft of retro computing symbols, like the Delete characters from the Apple ][, TRS-80, and Amstrad CPC. (␧, ␨, and ␩ respectively).
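
      A quick illustration of those two composition mechanisms (a separate sketch, not from the comment): combining marks and zero-width joiners are just ordinary code points you drop into a string:

        spinal = "Spin\u0308al Tap"             # 'n' + COMBINING DIAERESIS renders as n̈
        family = "\U0001F469\u200D\U0001F4BB"   # WOMAN + ZERO WIDTH JOINER + LAPTOP -> 👩‍💻
        print(spinal, family, len(family))      # three code points, displayed as one emoji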

  1. In 2008, I was working on a large website migration project from several CMS’s into one. After a while, I could recognize what kind of codepage translation errors were happening based on what characters I got to see, like efficiÃíncy (which should be efficiëncy, I forgot the exact characters). That’s when I started learning about UTF-8 and found out what a wonderful design it is (yes, I know it has its faults).

      1. Yes, but the old CMS didn’t have proper codepage declarations and we extracted it right from their backends, so we didn’t (automatically) know what we were converting from. To us it was all Unicode/UTF-8, not HTML entities.

  2. UTF-8 is a way to encode numbers.
    Unicode is a way to represent characters as numbers.
    UTF-8 is a way to use Unicode that is ASCII compatible.
    If Unicode had decided not to copy ASCII into the first block, UTF-8-encoded Unicode would not be ASCII compatible.
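
    A quick way to see that split (a separate Python sketch, not part of the comment): ord() gives the Unicode number, .encode() gives the UTF-8 bytes, and for the first 128 characters the two coincide in a single byte:

      for ch in ("A", "é", "€"):
          # Unicode assigns the number; UTF-8 turns the number into bytes.
          print(ch, hex(ord(ch)), ch.encode("utf-8").hex(" "))
      # A  0x41    41          – identical to the ASCII byte
      # é  0xe9    c3 a9
      # €  0x20ac  e2 82 ac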

  3. A specific, security-related problem with UTF-8 is overlong encodings, where the same character can be represented with different UTF-8 byte sequences, basically by encoding leading zeros.

    An example with the letter ‘A’ (ASCII 65, 0x41) and four different hex UTF-8 representations (UTF-8 bytes => decoded value).
    0x41 => 0x41
    0xC181 => 0x041
    0xE08181 => 0x0041
    0xF0808181 => 0x00041

    They all end up as an ‘A’ but only the first is valid. This means that a program checking against a specific text may be fooled by letters in overlong UTF-8.
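
    Checking those byte sequences with a conformant decoder (a Python sketch, not part of the original comment) shows only the shortest form being accepted:

      print(bytes([0x41]).decode("utf-8"))          # 'A' – the valid shortest form
      for seq in (b"\xC1\x81", b"\xE0\x81\x81", b"\xF0\x80\x81\x81"):
          try:
              seq.decode("utf-8")                   # a strict decoder must refuse overlong forms
          except UnicodeDecodeError as err:
              print(seq.hex(" "), "rejected:", err.reason)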

    1. Is the security risk here that I could sneak the word “cheese” past a disallow list with that word on it by using overlong notation on the “e” and hoping that the search function gets confused, but trusting that Notepad.exe doesn’t?

      1. That is obviously a huge risk! 🤪

        The risk is more for allowing text through that should have been sanitised or rejected. Think how a not-so-smart way of avoiding SQL injection attacks could be implemented by rejecting apostrophes and semicolons in the input data without also rejecting all overlong codes (the right way) or checking for the characters to reject even if they are specified as overlong UTF-8 (the stupid but possibly working way).
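
        As a hypothetical sketch of that failure mode (the filter and the lenient backend are made up for illustration, not taken from any real system): a check on the raw bytes misses the overlong two-byte form of the apostrophe, and it only becomes dangerous if something downstream wrongly accepts non-shortest forms:

          payload = b"\xC0\xA7 OR 1=1 --"         # overlong 2-byte encoding of "'" (0x27)
          def naive_filter(data: bytes) -> bool:
              return b"'" not in data             # rejects only the literal one-byte apostrophe
          print(naive_filter(payload))            # True – the overlong apostrophe slips through
          try:
              payload.decode("utf-8")             # a spec-compliant decoder refuses it outright
          except UnicodeDecodeError:
              print("strict decoder rejects the overlong form")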

        1. Hard to see how that could be a problem unless confusables are already a worse problem. (Confusables are different characters that look the same … Essentially a generalization of the old 1 vs l vs I problem.)

    2. Such overlong encodings have been expressly forbidden for almost 25 years. This corrigendum is from November 2000. https://www.unicode.org/versions/corrigendum1.html. Here’s the most relevant sentence:

      […] the Unicode Technical Committee has modified the definition of UTF-8 to forbid conformant implementations from interpreting non-shortest forms […]

      Whatever problems have actually arisen are not problems with the UTF-8 definition. The definition is more than just the encoding and decoding specifications; it also puts behavioral restrictions on the software that uses them. The problems in practice come from noncompliant software, that is, from defects in implementation.

      1. Far from all problems are due to bad specifications, this one included; most come from developers reading only the part of the specification that defines the bit patterns to use, implementing those, and then getting on to the next project they can do wrong.

        Buffer overflows, memory leaks, SQL injection attacks … you name it and someone will do it wrong, even today, way more than 25 years after they were made known and the knowledge disseminated.

        Theoretically, it could have been avoided by actually assigning those overlong patterns to special characters, at the cost of easy implementation, sure, but it would have made it more probable that some lazy developers would have implemented that part too and avoided the pitfall.

  4. Having had to deal with code pages, translations and British developers who thought £ was ASCII (it isn’t, it’s just in British code pages), I pretty much got to “just use utf-8”.

    If you’ve seen translation files opened and modified in an editor with the wrong code page (was it an 8859-n or a win-125n?) resulting in horrid mojibake, switching the “just use ASCII” mindset to “just use UTF-8” is a big win.

    For those who’ve not seen it, mojibake is where a string of bytes (e.g. for some Chinese characters in GB-nnnn) gets reinterpreted into some other code page (a win-125x on a PC somewhere in Europe) and then saved, creating a byte sequence that no longer makes sense in the original character set. Using UTF-8 all the way through (maybe apart from an initial ingest from some other known character set) will prevent it.
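
    A toy reproduction of that round trip (a Python sketch; GB2312 and Windows-1252 are just stand-ins for the GB-nnnn and win-125x mentioned above):

      original = "汉字".encode("gb2312")       # bytes intended as GB-2312
      garbled = original.decode("cp1252")      # wrongly interpreted as Windows-1252: 'ºº×Ö'
      saved = garbled.encode("utf-8")          # saved again in the editor's encoding
      print(garbled, saved != original)        # True – the original byte sequence is gone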

  5. UTF8 should have been the default for strings and byte arrays in any programming language / OS. This distinction between text and bytes is annoying me. Let the output functions deal with the presentation, to a fault if need be. Instead I have to litter my code with .decode(UTF8) / .encode(UTF8) calls. Not to mention that most code bases will have bugs/security issues with any incoming external text. Either not checking at all (“it’s base64, it’s safe”) or not handling badly encoded UTF8. Letting each app/program/OS deal with it has to be the poorest choice ever made.

  6. i try to exclude utf-8 from my life, because the iso-latin charset is enough for my purposes. really, 90% of my complaints would be resolved if there was just a good way to search for (‘/’) any character with the high-ascii-bit set for search-and-replace of errant post-modern quotation marks.

    but i think a cool thing about utf-8 is that NUL is still NUL. so if you run UTF-8 through a program that isn’t expecting it, which just uses the regular NUL-terminated string functions like strcmp() and strcpy(), it will tend to “just work.” it will pass through the UTF-8 code points unchanged. you don’t need a major re-architecting to use counted strings or wchar_t.
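
    A quick way to convince yourself of that property (a Python sketch rather than C): every byte of a multi-byte sequence has its high bit set, so neither NUL nor any other ASCII byte can appear inside one:

      data = "naïve 💩".encode("utf-8")
      assert b"\x00" not in data                       # no embedded NUL anywhere
      for ch in "naïve 💩":
          encoded = ch.encode("utf-8")
          if len(encoded) > 1:
              assert all(b & 0x80 for b in encoded)    # lead and continuation bytes are all >= 0x80
      print("strcpy()-style code can pass this through untouched")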

  7. My first contact with UTF-8 was through BeOS, the revolutionary operating system of the late 1990s. Can you imagine how amazing it was back then to be able to handle all those special characters throughout the system in a consistent, easy and native way, all the way down to the filesystem? 🤓
