UTF-8 Is Beautiful

It’s likely that many Hackaday readers will be aware of UTF-8, the mechanism for incorporating diverse alphabets and other characters such as 💩 emojis. It takes the long-established 7-bit ASCII character set and extends it into multiple bytes to represent many thousands of characters. How it does this may well be beyond that basic grasp, and [Vishnu] is here with a primer that’s both fascinating and easy to read.

UTF-8 extends ASCII from codes which fit in a single byte, to codes which can be up to four bytes long. The key lies in the first few bits of each byte: the leading byte’s high bits specify how many bytes make up the character, and every following byte starts with a marker identifying it as a continuation (data) byte. Since 7-bit ASCII codes always have a 0 in their most significant bit when mapped onto an 8-bit byte, compatibility with ASCII is ensured by the first 128 characters always beginning with a zero bit. It’s simple, elegant, and for any of us who had to deal with character set hell in the days before it came along, magic.
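
As a rough sketch of that scheme (ours, not from [Vishnu]’s write-up), here is how a single code point maps onto those leading-bit patterns in Python:

    def utf8_encode(cp: int) -> bytes:
        # Hand-rolled UTF-8 encoder: the leading byte says how long the
        # sequence is, every continuation byte starts with the bits 10.
        if cp < 0x80:                     # 0xxxxxxx – plain ASCII
            return bytes([cp])
        if cp < 0x800:                    # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp < 0x10000:                  # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        if cp < 0x110000:                 # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
        raise ValueError("code point out of range")

    for ch in ("A", "é", "€", "💩"):
        assert utf8_encode(ord(ch)) == ch.encode("utf-8")
        print(ch, utf8_encode(ord(ch)).hex(" "))   # 41 / c3 a9 / e2 82 ac / f0 9f 92 a9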

We’ve talked surprisingly little about the internals of UTF-8 in the past, but it’s worthy of note that this is our second piece ever to use the poop emoji, after our coverage of the billionth GitHub repository.

Emoji bales: Tony Hisgett, CC BY 2.0.

30 thoughts on “UTF-8 Is Beautiful”

    1. The Unicode Consortium has put pretty much everything into Unicode. In addition to all of the traditional written languages that used to be covered by code pages / character sets (even the old box-drawing symbology from DOS and other systems), they’ve adopted the emoji library (which was the Wild West when device vendors controlled it absent shared standards), made emoji composable via zero-width joiners, added the combining glyph sets (the reason it’s now possible to write “Spin̈al Tap” without a special font, despite an umlaut over a consonant not being valid in any language), accepted Klingon iconography, and a couple of years ago they even brought in a raft of retro computing symbols, like the Delete characters from the Apple ][, TRS-80, and Amstrad CPC. (␧, ␨, and ␩ respectively).
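
      A quick illustration of those two composition mechanisms (a separate sketch, not from the comment): combining marks and zero-width joiners are just ordinary code points you drop into a string:

        spinal = "Spin\u0308al Tap"             # 'n' + COMBINING DIAERESIS renders as n̈
        family = "\U0001F469\u200D\U0001F4BB"   # WOMAN + ZERO WIDTH JOINER + LAPTOP -> 👩‍💻
        print(spinal, family, len(family))      # three code points, displayed as one emoji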

  1. In 2008, I was working on a large website migration project from several CMS’s into one. After a while, I could recognize what kind of codepage translation errors were happening based on what characters I got to see, like efficiÃíncy (which should be efficiëncy, I forgot the exact characters). That’s when I started learning about UTF-8 and found out what a wonderful design it is (yes, I know it has its faults).

      1. Yes, but the old CMS didn’t have proper codepage declarations and we extracted it right from their backends, so we didn’t (automatically) know what we were converting from. To us it was all Unicode/UTF-8, not HTML entities.

  2. UTF-8 is a way to encode numbers.
    Unicode is a way to represent characters as numbers.
    UTF-8 is a way to use Unicode that is ASCII compatible.
    If Unicode had decided not to copy ASCII into the first block, UTF-8-encoded Unicode would not be ASCII compatible.
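
    A quick way to see that split (a separate Python sketch, not part of the comment): ord() gives the Unicode number, .encode() gives the UTF-8 bytes, and for the first 128 characters the two coincide in a single byte:

      for ch in ("A", "é", "€"):
          # Unicode assigns the number; UTF-8 turns the number into bytes.
          print(ch, hex(ord(ch)), ch.encode("utf-8").hex(" "))
      # A  0x41    41          – identical to the ASCII byte
      # é  0xe9    c3 a9
      # €  0x20ac  e2 82 ac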

  3. A specific, security-related problem with UTF-8 is overlong encodings, where the same character can be represented with different UTF-8 byte sequences, basically by encoding leading zeros.

    An example with the letter ‘A’ (ASCII 65, 0x41) and four different hex UTF-8 representations (UTF-8 bytes => decoded value).
    0x41 => 0x41
    0xC181 => 0x041
    0xE08181 => 0x0041
    0xF0808181 => 0x00041

    They all end up as an ‘A’ but only the first is valid. This means that a program checking against a specific text may be fooled by letters in overlong UTF-8.
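
    Checking those byte sequences with a conformant decoder (a Python sketch, not part of the original comment) shows only the shortest form being accepted:

      print(bytes([0x41]).decode("utf-8"))          # 'A' – the valid shortest form
      for seq in (b"\xC1\x81", b"\xE0\x81\x81", b"\xF0\x80\x81\x81"):
          try:
              seq.decode("utf-8")                   # a strict decoder must refuse overlong forms
          except UnicodeDecodeError as err:
              print(seq.hex(" "), "rejected:", err.reason)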

    1. Is the security risk here that I could sneak the word “cheese” past a disallow list with that word on it by using overlong notation on the “e” and hoping that the search function gets confused, but trusting that Notepad.exe doesn’t?

      1. That is obviously a huge risk! 🤪

        The risk is more for allowing text through that should have been sanitised or rejected. Think how a not-so-smart way of avoiding SQL injection attacks could be implemented by rejecting apostrophes and semicolons in the input data without also rejecting all overlong codes (the right way) or checking for the characters to reject even if they are specified as overlong UTF-8 (the stupid but possibly working way).
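
        As a hypothetical sketch of that failure mode (the filter and the lenient backend are made up for illustration, not taken from any real system): a check on the raw bytes misses the overlong two-byte form of the apostrophe, and it only becomes dangerous if something downstream wrongly accepts non-shortest forms:

          payload = b"\xC0\xA7 OR 1=1 --"         # overlong 2-byte encoding of "'" (0x27)
          def naive_filter(data: bytes) -> bool:
              return b"'" not in data             # rejects only the literal one-byte apostrophe
          print(naive_filter(payload))            # True – the overlong apostrophe slips through
          try:
              payload.decode("utf-8")             # a spec-compliant decoder refuses it outright
          except UnicodeDecodeError:
              print("strict decoder rejects the overlong form")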

        1. Hard to see how that could be a problem unless confusables are already a worse problem. (Confusables are different characters that look the same … Essentially a generalization of the old 1 vs l vs I problem.)

    2. Such overlong encodings have been expressly forbidden for almost 25 years. This corrigendum is from November 2000. https://www.unicode.org/versions/corrigendum1.html. Here’s the most relevant sentence:

      […] the Unicode Technical Committee has modified the definition of UTF-8 to forbid conformant implementations from interpreting non-shortest forms […]

      Whatever problems have actually arisen are not problems with the UTF-8 definition. The definition is more than just the encoding and decoding specifications; it also puts behavioral restrictions on the software that uses them. The problems in practice come from noncompliant software, that is, from defects in implementation.

      1. Far from all problems are due to bad specifications, this one included; most come from developers reading only the part of the specification that defines the bit patterns to use, implementing those, and then getting on to the next project they can do wrong.

        Buffer overflows, memory leaks, SQL injection attacks … you name it and someone will do it wrong, even today, way more than 25 years after they were made known and the knowledge disseminated.

        Theoretically, it could have been avoided by actually assigning those overlong patterns to special characters, at the cost of easy implementation, sure, but it would have made it more probable that some lazy developers would have implemented that part too and avoided the pitfall.

  4. Having had to deal with code pages, translations and British developers who thought £ was ASCII (it isn’t, it’s just in British code pages), I pretty much got to “just use utf-8”.

    If you’ve seen translation files opened and modified in an editor with the wrong code page (was it an 8859-n or a win-125n?) resulting in horrid mojibake, switching the “just use ASCII” mindset to “just use UTF-8” is a big win.

    For those who’ve not seen it, mojibake is where a string of bytes (e.g. for some Chinese characters in GB-nnnn) gets reinterpreted into some other code page (a win-125x on a PC somewhere in Europe) and then saved, creating a byte sequence that no longer makes sense in the original character set. Using UTF-8 all the way through (maybe apart from an initial ingest from some other known character set) will prevent it.
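
    A toy reproduction of that round trip (a Python sketch; GB2312 and Windows-1252 are just stand-ins for the GB-nnnn and win-125x mentioned above):

      original = "汉字".encode("gb2312")       # bytes intended as GB-2312
      garbled = original.decode("cp1252")      # wrongly interpreted as Windows-1252: 'ºº×Ö'
      saved = garbled.encode("utf-8")          # saved again in the editor's encoding
      print(garbled, saved != original)        # True – the original byte sequence is gone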

  5. UTF8 should have been the default for strings and byte arrays in any programming language / OS. This distinction between text and bytes is annoying me. Let the output functions deal with the presentation, to a fault if need be. Instead I have to litter my code with .decode(UTF8) / .encode(UTF8) calls. Not to mention that most code bases will have bugs/security issues with any incoming external text. Either not checking at all (“it’s base64, it’s safe”) or not handling badly encoded UTF8. Letting each app/program/OS deal with it has to be the poorest choice ever made.

  6. i try to exclude utf-8 from my life, because the iso-latin charset is enough for my purposes. really, 90% of my complaints would be resolved if there was just a good way to search for (‘/’) any character with the high-ascii-bit set for search-and-replace of errant post-modern quotation marks.

    but i think a cool thing about utf-8 is that NUL is still NUL. so if you run UTF-8 through a program that isn’t expecting it, which just uses the regular NUL-terminated string functions like strcmp() and strcpy(), it will tend to “just work.” it will pass through the UTF-8 code points unchanged. you don’t need a major re-architecting to use counted strings or wchar_t.
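
    A quick way to convince yourself of that property (a Python sketch rather than C): every byte of a multi-byte sequence has its high bit set, so neither NUL nor any other ASCII byte can appear inside one:

      data = "naïve 💩".encode("utf-8")
      assert b"\x00" not in data                       # no embedded NUL anywhere
      for ch in "naïve 💩":
          encoded = ch.encode("utf-8")
          if len(encoded) > 1:
              assert all(b & 0x80 for b in encoded)    # lead and continuation bytes are all >= 0x80
      print("strcpy()-style code can pass this through untouched")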

  7. My first contact with UTF-8 was through BeOS, the revolutionary operating system of the late 1990s. Can you imagine how amazing it was back then to be able to handle all those special characters throughout the system in a consistent, easy and native way, all the way down to the filesystem? 🤓
