We’ve previously discussed Unicode and some less than stellar attempts at encoding its code points.
UTF-8 is a trick to represent Unicode code points in a more economical fashion. It doesn’t map each code point straight to a fixed-width binary value the way UTF-32 does (and the way UTF-16 does for most characters). Instead, it handles bytes as if they were some sort of “containers” (it will be clearer in a minute). The rules are simple. If a code point is between 0 and 127, UTF-8 creates a 1 byte container to hold the code point value. For code points above 127 it creates containers 2, 3 or more bytes long.
That’s right, UTF-8 is able to store code points using variable-sized byte chunks.
“But what about the decoders and such things?” I hear you ask. “How can they tell which code point is 2 bytes and which is 3 and so forth? Won’t they be confused like before?”
Well, I said UTF-8 is a trick, a clever one. When encoding a code point, it first checks its value.
If the code point is less than 128 (0 – 127), UTF-8 creates a 1 byte container that begins with a 0 flag.
UTF-8 Container for code points between 0 – 127:
0xxx xxxx
- The leading 0 here is the flag that indicates to the UTF-8 decoder that the container is 1 byte in size.
- The next xxx xxxx is the space within the container where the code point is actually ‘stored’.
For example, to encode ‘a’ in UTF-8:
0110 0001 : Unicode code point for 'a' (97)
0xxx xxxx : UTF-8 container for code points between 0 and 127
*110 0001 : Code point 97
---------
0110 0001 : 'a' Encoded with UTF-8
Now, when a UTF-8 decoder encounters a byte, it first reads the leading bit. If that bit is 0, it immediately knows that this is a 1 byte container. It then reads the next 7 bits (xxx xxxx) to find out exactly which code point the container holds. Finally, it looks that value up in the Unicode map to output the corresponding character.
There’s one interesting property of UTF-8 encoding of characters between 0 and 127. After encoding, these code points remain exactly the same. That is, ‘a’ encoded in UTF-8 has exactly the same value as its raw Unicode code point, which in turn is the same as ASCII. This property makes UTF-8 encoded texts compatible with ASCII decoders for characters in that specific range.
0110 0001 : ASCII 'a'
0110 0001 : Unicode 'a'
0110 0001 : UTF-8 'a'
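If you want to check this for yourself, here is a small Python sketch (Python is only my choice for the examples, and the variable names are mine). It encodes ‘a’ both ways and compares the raw bytes:

```python
# Encode the same character as ASCII and as UTF-8, then compare the bytes.
text = "a"
code_point = ord(text)              # 97
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

code_point_bits = format(code_point, "08b")   # code point as 8 binary digits
utf8_bits = format(utf8_bytes[0], "08b")      # the single UTF-8 byte as 8 binary digits

print(code_point_bits)              # 01100001
print(utf8_bits)                    # 01100001
print(ascii_bytes == utf8_bytes)    # True

# The same holds for any text made only of code points 0 to 127.
sample = "Hello, world"
same_for_sample = sample.encode("ascii") == sample.encode("utf-8")
print(same_for_sample)              # True
```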
Now for something more interesting, let’s look at the UTF-8 encoding of code points above 127:
UTF-8 Container for code points between 128 and 2047 (0x80 – 0x7ff) :
110x xxxx 10xx xxxx
UTF-8 Container for code points between 2048 and 65,535 (0x800 – 0xffff) :
1110 xxxx 10xx xxxx 10xx xxxx
UTF-8 Container for code points between 65,536 and 2,097,151 (0x10000 – 0x1fffff) :
1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
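The original design continued this pattern up to 6 bytes, but since Unicode code points stop at 1,114,111 (0x10ffff), today’s UTF-8 never needs more than 4.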
- Each container begins with a flag in the leading byte: two or more 1s followed by a 0.
- The 1s in the leading flag (‘110’, ‘1110’, etc.) indicate the size of the container, i.e. 11 for 2 bytes and 111 for 3 bytes. The 0 simply separates the flag from the data.
- Each inner byte begins with a ‘10’ flag.
- As with the 1 byte container seen earlier, the x’s indicate where in the container the actual code point value is stored.
- The largest code point value possible in each container is (2^number_of_x)-1. For instance, the 2 byte container has 11 x’s, so it tops out at 2^11 - 1 = 2047.
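The container rules above can be sketched as a tiny hand-rolled encoder. This is only an illustration under my own assumptions: `utf8_encode` is a name I made up, it handles containers up to 4 bytes, and it does not reject surrogate code points the way a real encoder does. In practice you would simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Illustrative UTF-8 encoder for a single code point (hypothetical helper)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point < 0x80:
        # 0xxx xxxx : 7 data bits
        return bytes([code_point])
    if code_point < 0x800:
        # 110x xxxx 10xx xxxx : 11 data bits
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:
        # 1110 xxxx 10xx xxxx 10xx xxxx : 16 data bits
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx : 21 data bits
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# Compare against Python's built-in encoder for one character per container size.
for char in "aé€😀":
    print(char, utf8_encode(ord(char)).hex(), char.encode("utf-8").hex())
```

Each `0x3F` mask keeps the lowest 6 bits, which is exactly the room an inner ‘10’ byte offers.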
For example, let’s encode ‘é’ in UTF-8:
          1110 1001 : Unicode code point for 'é' (233)
110x xxxx 10xx xxxx : UTF-8 container for code points between 128 and 2047
***0 0011 **10 1001 : realign the code point's bits to fit within the container
-------------------
1100 0011 1010 1001 : 'é' encoded in UTF-8 becomes 0xC3A9
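The same realignment can be replayed step by step in Python. This is a deliberately naive sketch that works on strings of binary digits so it mirrors the diagram (the variable names are mine); real encoders use bit shifts instead.

```python
code_point = ord("é")                    # 233

# The 2 byte container has 11 data bits, so pad the code point to 11 digits.
bits = format(code_point, "011b")        # '00011101001'

# The first 5 bits go after the '110' flag, the last 6 after the '10' flag.
high, low = bits[:5], bits[5:]           # '00011' and '101001'
first_byte = int("110" + high, 2)        # 0xC3
second_byte = int("10" + low, 2)         # 0xA9

encoded = bytes([first_byte, second_byte])
print(encoded.hex())                     # c3a9
print(encoded == "é".encode("utf-8"))    # True
```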
When the UTF-8 decoder encounters such a byte string, it does the same thing as before. It reads the leading flag in the first byte. A ‘110’ flag indicates a code point stored within a 2 byte container, so the decoder reads 2 bytes. From the first byte it extracts all the x’s after the ‘110’ flag, from the second byte it does the same after the ‘10’ flag, and it joins the two groups of bits to get the code point back.
1100 0011 1010 1001 : UTF-8 Encoded code point
110x xxxx 10xx xxxx : UTF-8 Container
-------------------
***0 0011 **10 1001 : Unicode code point 233
It then looks up the Unicode map to find out which character to display.
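Here is a rough sketch of that decoding logic in Python. `decode_first` is a hypothetical helper of mine, not a standard function. It only reads the first container of a byte string, and it skips checks a real decoder performs (overlong encodings and surrogates, for instance).

```python
def decode_first(data: bytes) -> tuple:
    """Return (code_point, container_size) for the first container in data.

    Illustrative only: a simplified reading of the flag rules.
    """
    first = data[0]
    if first & 0x80 == 0x00:        # 0xxx xxxx
        return first, 1
    if first & 0xE0 == 0xC0:        # 110x xxxx
        size, value = 2, first & 0x1F
    elif first & 0xF0 == 0xE0:      # 1110 xxxx
        size, value = 3, first & 0x0F
    elif first & 0xF8 == 0xF0:      # 1111 0xxx
        size, value = 4, first & 0x07
    else:
        raise ValueError("invalid leading byte")

    if len(data) < size:
        raise ValueError("container is cut short")

    for byte in data[1:size]:
        if byte & 0xC0 != 0x80:     # every inner byte must look like 10xx xxxx
            raise ValueError("inner byte without the '10' flag")
        value = (value << 6) | (byte & 0x3F)
    return value, size


code_point, size = decode_first(b"\xc3\xa9")
print(code_point, size, chr(code_point))   # 233 2 é
```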
As you can see, ‘é’ encoded in UTF-8 requires 2 bytes, whereas Latin-1 stores it in 1. It is still a good tradeoff: only characters above 127 cost extra bytes, and in exchange you get the convenience of a universal character set.
Hopefully, by now you grok Unicode and UTF-8. I have one last thing to cover.
Why am I seeing funny characters?
As you’ve seen, character encoding involves some “encoding” and some “decoding”. The decoding part is usually where problems arise. If you encode characters using one system and then try to decode them using another, chances are that you’ll get a mismatch. We’ve already seen that many encoding schemes are compatible for the first 128 characters (the ASCII range), so these characters usually display properly regardless of the chosen encoding. But let’s take the case of ‘é’ (Unicode 233 or 0xE9) encoded with UTF-8 and later read as Latin-1.
We’ve seen that after being encoded in UTF-8, ‘é’ is represented as 0xC3A9. If you then try to decode that value with a Latin-1 decoder, you’ll have a slight problem. Latin-1 only needs 1 byte to represent each of its characters and doesn’t expect to go beyond it. 0xC3A9 is 2 bytes long. To a Latin-1 decoder this looks like 2 code points of 1 byte each: 0xC3 and 0xA9. And this is exactly how it will read it. It’ll first decode 0xC3 and display the corresponding Latin-1 character Ã, and then it’ll decode 0xA9 and display ©. So instead of ‘é’ you’ll see ‘Ã©’.
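You can reproduce this mismatch in a few lines of Python. The sketch below also shows that, in this simple case, the damage can be undone by reversing the wrong step (the variable names are mine):

```python
original = "é"
utf8_bytes = original.encode("utf-8")      # b'\xc3\xa9'

# Read the UTF-8 bytes as if they were Latin-1: one character per byte.
misread = utf8_bytes.decode("latin-1")
print(misread)                             # Ã©

# Undo the wrong step: back to the raw bytes, then decode them as UTF-8.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)                            # é
```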
Now let’s look at the scenario the other way around. You have ‘é’ encoded in Latin-1 as 0xE9 and you try to have a UTF-8 decoder read it:
1110 1001 : 0xE9
According to UTF-8, a byte starting with 1110 announces a 3 byte container. So the UTF-8 decoder will first read the leading byte, and then expect the next 2 bytes to start with the “inner byte flag” 10. Chances are very good that it will encounter something else instead (or run out of bytes), which results in a mismatch. Depending on the decoder’s implementation, it might signal an error, substitute a replacement character, or just ignore the offending byte.
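Python’s built-in decoder happens to offer all three behaviours through the `errors` argument of `bytes.decode`, which makes for a handy demonstration. This shows one decoder’s choices and is not how every implementation behaves.

```python
latin1_bytes = "café".encode("latin-1")    # b'caf\xe9'

# Default (strict) mode: the decoder signals the error.
try:
    latin1_bytes.decode("utf-8")
    strict_outcome = "decoded"
except UnicodeDecodeError:
    strict_outcome = "error"

# 'replace' swaps the offending byte for U+FFFD, the replacement character.
replaced = latin1_bytes.decode("utf-8", errors="replace")

# 'ignore' silently drops the offending byte.
ignored = latin1_bytes.decode("utf-8", errors="ignore")

print(strict_outcome)    # error
print(replaced)          # caf followed by the U+FFFD replacement character
print(ignored)           # caf
```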
I hope that this series of articles has given at least a few readers better foundations to understand Unicode, UTF-8 and their relationship with legacy encodings. Those working with PHP now have the basis to deal with multibyte string functions in internationalized applications. For those working with Python, in an upcoming article I’ll try to explain some cryptic behaviors of the interpreter when dealing with Unicode and byte strings.
Thank you for reading.


