In previous articles we talked about ASCII and legacy character sets, introduced Unicode, and explained code points. In this article we explore some early attempts at encoding Unicode code points.
Binary encoding of Unicode
I have a vague memory of mentioning in a previous article that Unicode code points are just numbers assigned to characters. That still holds true, but as tempting as it can be to write Unicode messages with pen and paper, it’s slightly more practical to do so with a computer. However, before your computer can read and write messages in Unicode code points, it needs its very own Unicode map. Remember that computers understand everything in binary, so we need a way to convert these characters (i.e. their code points) to binary. This process is what encoding is all about. Once your computer has its own binary Unicode map, it can use it in much the same way you’d use your paper, pencil and Unicode map.
Hold on tight here, because this is usually another corner where people get lost in the dust about Unicode.
If I were to ask some smart folks, such as yourself, to devise a scheme to represent Unicode code points in binary format (ones and zeros), most would just roll their eyes and go: “Duh! You have a Unicode map full of these code points, and you’ve been banging our heads saying they’re just numbers assigned to characters, right? Newsflash dude, binary is also just a number representation that computers can understand. So to write a message with code points, just write each code point value in the document as binary. To read a message and display characters, you do the reverse. You take the code points from the document, find matching values on the map and just display the corresponding characters. That’s how ASCII works, that’s how Latin-1 works, why don’t you just do the same with Unicode?”
Hmm, that actually sounds pretty straightforward and intuitive. Unicode code points are numbers. So, simply put these numbers in binary format, exactly the way it’s done in ASCII, Latin-1 and most other legacy encodings. Ok, that’s fine, let’s give it a go.
So we manage to send our first Unicode encoded message with your little scheme and this is what our correspondent receives:
11011011 10010010 00010110 10001101 11011000
It does indeed represent some code points. But wait a minute, which ones? See, one notable difference between Unicode and other character sets is that Unicode is huge. As I said in a previous article, it has room for over 1.1 million code points (1,114,112 of them, to be exact) and new characters are still being assigned. That is, while some of these code points are as small as 0, 1, 2 or 3, the largest one has a decimal value of 1,114,111 (0x10FFFF). A value that large takes 21 bits, so to represent it in binary you would need a minimum of 3 bytes. How would the decoder know which code point is only 1 byte, and which is 2 or 3?
- Hey pal, Decoding Office here. Yeah, about the string of bytes that you just sent us, is it 1 byte followed by 2 groups of 2 bytes?
- Hmm, I don’t remember. I think it was 2 bytes followed by 1 byte, then followed by 2 other bytes, but I’m not sure anymore.
- Could it be 3 bytes followed by 2 bytes?
- nah, that looks more like 3 bytes, followed by 1 byte and then another. Try it like that and let me know what happens.
- ah fuck it, we’re not getting paid enough for this sh… ahem, thanks for nothing! click.
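To put a number on the Decoding Office's misery, here is a minimal Python sketch of the naive scheme. The names `naive_encode` and `possible_splits` are made up for this illustration, and the three code points are the ones assumed in the conversation above (a 2-byte value, a 1-byte value, then another 2-byte value). It writes each code point with as few bytes as it needs, then lists every way a decoder could cut the result into groups of 1 to 3 bytes.

```python
def naive_encode(code_points):
    """Write each code point with as few bytes as it needs (hypothetical scheme)."""
    out = bytearray()
    for cp in code_points:
        width = max(1, (cp.bit_length() + 7) // 8)  # 1, 2 or 3 bytes
        out += cp.to_bytes(width, "big")
    return bytes(out)


def possible_splits(data, max_width=3):
    """Return every way to cut `data` into groups of 1..max_width bytes."""
    if not data:
        return [[]]
    splits = []
    for width in range(1, min(max_width, len(data)) + 1):
        head = int.from_bytes(data[:width], "big")
        for rest in possible_splits(data[width:], max_width):
            splits.append([head] + rest)
    return splits


message = naive_encode([0xDB92, 0x16, 0x8DD8])
candidates = possible_splits(message)

print(" ".join(f"{b:08b}" for b in message))
print(len(candidates), "possible readings")
```

For our 5-byte message that comes to 13 possible readings, and nothing in the bytes themselves says which one the sender meant.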
This is a pretty naive mapping scheme: there’s no way for a decoder to know how large each code point is just by looking at the bits. They could be 1, 2 or 3 bytes, no way to tell. The only sensible solution in this situation is to agree on a “one-size-fits-all” approach, where all code points have the same size. That way the decoder knows exactly where each code point begins and where it ends. The size also needs to be large enough for the largest values to fit in. In the case of Unicode this is a minimum of 3 bytes. To illustrate, suppose the message above was in fact a 2-byte code point, followed by a single-byte code point, then another 2-byte code point. For the decoder to read them properly, we would have to pad each code point to make them the same size (3 bytes) before sending the message. Here’s the padded version of our message:
00000000 11011011 10010010 : 2-byte code point, padded to 3 bytes
00000000 00000000 00010110 : 1-byte code point, padded to 3 bytes
00000000 10001101 11011000 : 2-byte code point, padded to 3 bytes
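The padding idea fits in a few lines of Python. This is only a sketch of the fixed-width approach described above, with made-up helper names (`fixed_encode`, `fixed_decode`); it is not any real Unicode encoding. The width is derived from the largest code point, 0x10FFFF, which needs 21 bits and therefore 3 bytes.

```python
MAX_CODE_POINT = 0x10FFFF
WIDTH = (MAX_CODE_POINT.bit_length() + 7) // 8  # 21 bits round up to 3 bytes


def fixed_encode(code_points, width=WIDTH):
    """Pad every code point with leading zero bytes to the same width."""
    return b"".join(cp.to_bytes(width, "big") for cp in code_points)


def fixed_decode(data, width=WIDTH):
    """Cut the data into equal slices; each slice is one code point."""
    if len(data) % width:
        raise ValueError("data length is not a multiple of the code point width")
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]


padded = fixed_encode([0xDB92, 0x16, 0x8DD8])
print(" ".join(f"{b:08b}" for b in padded))
print([hex(cp) for cp in fixed_decode(padded)])
```

Because every slice has the same size, the decoder no longer has to guess where one code point ends and the next begins.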
Some of the earliest Unicode encoding schemes actually worked with this approach. One scheme called UCS2 used 2 bytes (16 bits) for every character. Its successor, UTF-16, keeps the 2-byte unit but uses 4 bytes for characters that don’t fit in the first 65,536 code points[1]. Another scheme, UCS4 aka UTF-32, simply settled on 4 bytes (32 bits) for all characters.
It all seemed fine, but look, padding code points with empty bytes is just wasteful, and it also introduces efficiency issues on some processor architectures. It is also very unappealing to most folks who work 99% of the time with characters that only require 1 byte in legacy encodings (ASCII, Latin-1, etc).
0110 0001 : 'a' in ASCII
0110 0001 : 'a' in Latin-1
0110 0001 : 'a' in Unicode code point
0000 0000 0110 0001 : 'a' in Unicode encoded with UCS2
0000 0000 0000 0000 0000 0000 0110 0001 : 'a' in Unicode encoded with UCS4
1110 1001 : 'é' in Latin-1
1110 1001 : 'é' in Unicode code point
0000 0000 1110 1001 : 'é' in Unicode encoded with UCS2
0000 0000 0000 0000 0000 0000 1110 1001 : 'é' in Unicode encoded with UCS4
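You can reproduce these bit patterns with Python's built-in codecs. One assumption to flag: Python has no codec named UCS2 or UCS4, so this sketch uses "utf-16-be" and "utf-32-be" as stand-ins, which give the same bytes for these two characters. The `bits` helper is just a name invented for the example.

```python
def bits(ch, codec):
    """Encode one character and show the resulting bytes as binary."""
    return " ".join(f"{b:08b}" for b in ch.encode(codec))


e_acute = "\u00e9"  # 'é', written as an escape so the source file encoding doesn't matter

print(bits("a", "ascii"))          # 1 byte
print(bits("a", "utf-16-be"))      # 2 bytes, the first one is padding
print(bits("a", "utf-32-be"))      # 4 bytes, the first three are padding
print(bits(e_acute, "latin-1"))    # 1 byte
print(bits(e_acute, "utf-16-be"))  # 2 bytes
print(bits(e_acute, "utf-32-be"))  # 4 bytes
```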
What a waste indeed. Plus, since different processors store multi-byte values in different byte orders, these schemes gave birth to some nasty stuff such as Byte Order Marks (BOM) to tell the decoder which order was used. That complicated the whole thing even more, and people just went “Nah, let’s just go back to Latin-1” or whatever else they were using. So as formats for storing and exchanging text, UCS2 and UCS4 pretty much bombed.
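For the curious, here is a small Python sketch of what the BOM business looks like in practice. Once a character takes 2 bytes, those bytes can be written high byte first (big-endian) or low byte first (little-endian), and the BOM is a marker placed at the start of the text so the decoder can tell which. The constants come from Python's standard `codecs` module; the variable names are mine.

```python
import codecs

text = "a"

big = text.encode("utf-16-be")     # high byte first: 00 61
little = text.encode("utf-16-le")  # low byte first:  61 00

# Guess the wrong byte order and you silently get a different character.
misread = big.decode("utf-16-le")

# Prefixing the matching BOM lets the generic "utf-16" decoder work it out.
with_bom_be = codecs.BOM_UTF16_BE + big        # FE FF 00 61
with_bom_le = codecs.BOM_UTF16_LE + little     # FF FE 61 00

print(big.hex(), little.hex(), hex(ord(misread)))
print(with_bom_be.decode("utf-16"), with_bom_le.decode("utf-16"))
```

So a one-letter message ends up as 4 bytes, half of which are bookkeeping.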
Recap on UTF-16 and UTF-32
- wasteful and inefficient (i.e. it sucked)
[1] UCS2, using only 2 bytes, can’t map more than 65,536 characters (2^16). To get beyond that limitation, UTF-16 uses a trick called surrogate pairs. Unicode reserves a range of 2,048 code points (U+D800 to U+DFFF) within those first 65,536 that will never be assigned to characters. By splitting that range in 2 and combining a value from the first half (a “high surrogate”, 2 bytes) with a value from the second half (a “low surrogate”, 2 other bytes), you get 1024 x 1024 = 1,048,576 new code points (4 bytes each).
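The surrogate trick is easier to follow as arithmetic. Below is a minimal Python sketch of it; the function names are invented for the example, and the sample code point U+1F600 (a grinning face emoji) is my own pick, not something from this series. You subtract 0x10000 from the code point, which leaves a 20-bit number, then put the top 10 bits in the high surrogate and the bottom 10 bits in the low one.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points from U+10000 to U+10FFFF use surrogates")
    offset = code_point - 0x10000     # fits in 20 bits
    high = 0xD800 + (offset >> 10)    # top 10 bits    -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high, low):
    """Rebuild the original code point from a surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Python's own UTF-16 codec produces the same 4 bytes.
print("\U0001F600".encode("utf-16-be").hex())
```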
In the next article of this series I’ll cover UTF-8 and how it saved Unicode.
Thank you for reading.


