Tag: utf-8

Unicode etc. Part 4: UTF-8

We’ve previously discussed Unicode and some less than stellar attempts at encoding its code points.

UTF-8 is a trick to represent Unicode code points in a more economic fashion. It doesn’t directly map code points to binary value like UTF-16 and 32 do. Instead, it handles bytes as if they’re some sort of “containers” (it will be clearer in a minute). The rules are simple. If a code point is between 0 and 127, UTF-8 creates a 1 byte container to hold the code point value. For code points above 127 it creates containers 2, 3 or more bytes long.

That’s right, UTF-8 is able to store code points using variable sized byte chunks.

“But what about the decoders and such things?” I hear you ask. “How can they tell which code points is 2 bytes and which is 3 and so forth? Won’t they be confused like before?”

Well, I said UTF-8 is a trick, a clever one. When encoding a code point, it first checks its value.

If the code point is less than 128 (0 – 127), UTF-8 creates a 1 byte container that begins with a 0 flag.

UTF-8 Container for code points between 0 – 127:

0xxx xxxx

- The leading 0 here is the flag that indicates to the UTF-8 decoder that the container is 1 byte in size.
- The next xxx xxxx is the space within the container where the code point is actually ’stored’

For example to encode ‘a’ in UTF-8:

0110 0001    :   Unicode code point for 'a' (97)

0xxx xxxx    :   UTF-8 container for code points between 0 and 127
*110 0001    :   Code point 97
---------
0110 0001    :   'a' Encoded with UTF-8

Now, when an UTF-8 decoder encounters a byte, it first reads the leading bit and if it identifies 0, it immediately assumes that this is a 1 byte container. It then proceeds to read the next 7 bits (xxx xxxx) to find out exactly which code point the container holds. It finally simply matches the value in the Unicode map to output the corresponding character.

There’s one interesting property of UTF-8 encoding of characters between 0 and 127. After encoding, these code points remain exactly the same. That is, ‘a’ encoded in UTF-8 has exactly the same value as its raw Unicode code point, which in turn is the same as ASCII. This property makes UTF-8 encoded texts compatible with ASCII decoders for characters in that specific range.

0110 0001 : ASCII 'a'
0110 0001 : Unicode 'a'
0110 0001 : UTF-8 'a'

Now for something more interesting, lets look at UTF-8 encoding of code points above 127:

UTF-8 Container for code points between 128 and 2047 (0×80 – 0×7ff) :

110x xxxx  10xx xxxx

UTF-8 Container for code points between 2048 and 65,535 (0×800 – 0xffff) :

1110 xxxx  10xx xxxx  10xx xxxx

UTF-8 Container for code points between 65,536 and 4,194,303 (0×10000 – 0×3fffff) :

1111 0xxx  10xx xxxx  10xx xxxx  10xx xxxx

etc..

- each container begins with a flag in the leading byte, two or more 1s followed by a 0.
- The 1s in the leading flag (‘110′, ‘1110′, etc) indicate the size of the container. i.e. 11 for 2 bytes and 111 for 3 bytes. The 0 simply separates the flag from the data.
- each inner byte begins with a ‘10′ flag.
- As with 1 Byte containers seen earlier, x’s indicate where in the container the actual code point value is stored
- the largest code point value possible in each container is (2^number_of_x)-1.

e.g. lets encode ‘é’ in UTF-8:

          1110 1001    :   Unicode code point for 'é' (233)
110x xxxx 10xx xxxx    :   UTF-8 container for code points between 128 and 2047
***0 0011 **10 1001    :   realign the code point's bits to fit within the container
-------------------
1100 0011 1010 1001    :   'é' encoded in UTF-8 becomes 0xC3A9

When the UTF-8 decoder encounters such byte strings, it does the same thing as previously. It reads the leading flag in the first byte. A ‘110′ flag indicates a code point stored within a 2 bytes container, so the decoder reads 2 bytes. From the first byte it extracts all x’s after the ‘110′ flag and from the second byte it does the same after the ‘10′ flag.

1100 0011 1010 1001    :  UTF-8 Encoded code point
110x xxxx 10xx xxxx    :  UTF-8 Container
-------------------
***0 0011 **10 1001    :  Unicode code point 233

It then looks up the Unicode map to find out which character to display.

As you can see é encoded in UTF-8 requires 2 bytes, whereas Latin-1 stores it in 1, but it’s already a better tradeoff since it cost 2 bytes only for characters above 127, but with the convenience of an universal character set.

Hopefully, by now you grok Unicode and UTF-8. I have one last thing to cover.

Why am I seeing funny characters?

As you’ve seen, character encoding involves some “encoding” and some “decoding”. The decoding part is usually when problems arise. If you encode characters using one system and then try to decode them using another, there are chances that you’ll have some mismatch. We’ve already seen that many encoding schemes are compatible for the first 128 characters (ASCII range). So, these characters would usually display properly regardless of the chosen encoding. But lets take ‘é’ as a case (Unicode 233 or 0xE9) encoded with UTF-8 to be later read in Latin-1.

We’ve seen that after being encoded in UTF-8 ‘é’ is represented as 0xC3A9. If you then try to decode that value with a Latin-1 decoder, you’ll have a slight problem. Latin-1 only needs 1 byte to represent all its characters and doesn’t expect to go beyond it. 0xC3A9 is 2 bytes long. To a Latin-1 decoder this looks like 2 code points 1 byte each: 0xC3 and 0xA9. And this is exactly how it will read it. It’ll first decode 0xC3 and display the corresponding Latin-1 value à and then it’ll decode 0xA9 and display ©. So instead of ‘é’ you’ll see ‘é’.

Now the scenario the other way around. You have é encoded in Latin-1 as 0xE9 and try to have an UTF-8 decoder read it:

1110 1001  :  0xE9

According to UTF-8, a byte starting with 1110 indicates a 3 bytes container. So the UTF-8 decoder will first read the leading byte, but then it will expect the next 2 bytes to start with the “inner byte flag” 10. Instead there are very good chances that it will encounter some other sequence, which will result in a mismatch. Depending on the decoder’s implementation, it might try to signal the error, or attempt a desperate match of the value with a replacement character, or maybe just ignore it.

I hope that this series of articles on Unicode helped a few of whoever read it to have better foundations to understand Unicode, UTF-8 and their relationship with legacy encodings. Those working with PHP have the basis to now deal with multibyte string functions in internationalized applications. For those working with Python, in an upcoming article I’ll try to explain some cryptic behaviors of the interpreter when dealing with Unicode and byte strings.

Thank you for reading.

Unicode etc. Part 3: Encodings

In previous articles we spoke about ASCII and legacy character sets, we introduced Unicode and explained the code points. In this article we explore some attempts at encoding Unicode code points.

Binary encoding of Unicode

I have the vague memory of mentioning in a previous article that Unicode code points are just numbers assigned to characters. That still holds true, but the fact of the matter is, as tempting as it can be to write Unicode encoded messages with paper and pen, it’s slightly more practical and useful to do so with a computer. However, before your computer can read and write messages in Unicode code points, it also needs its very own Unicode map. Remember that computers understand everything in binary, so we need a way to convert these characters (i.e. their code points) to binary. This process is what encoding is all about. Once your computer has its own binary Unicode map, it can use it in a way similar to what you’d do with your paper, pencil and Unicode map.

Hold on tight here, because this is usually another corner where people get lost in the dust about Unicode.

If I were to ask to some smart folks, such as yourself, to devise a scheme to represent Unicode code points in binary format (ones and zeros), most would just roll their eyes and go: “Duh! You have an Unicode map full of these code points, and you’ve been banging our heads saying they’re just numbers assigned to characters right? Newsflash dude, binary is also just a number representation that computers can understand. So to write a message with code points, just write each code point value in the document as binary. To read a message and display characters, you do the reverse. You take the code points from the document, find matching values on the map and just display the corresponding characters. That’s how ASCII works, that’s how Latin-1 works, why don’t you just do the same with Unicode?”

Hmm, that actually sounds pretty straight forward and intuitive. Unicode code points are numbers. So, simply put these number’s in binary format, exactly the way it’s done in ASCII, Latin-1 and most other legacy encodings. Ok, that’s fine, lets give it a go.

So we manage to send our first Unicode encoded message with your little scheme and this is what our correspondent receives:

11011011  10010010  00010110  10001101  11011000

It does indeed represent some code points. But wait a minute, which ones? See, one notable difference between Unicode and other character sets is that Unicode is huge. As I said in a previous article, it has over 1.1 million characters and is still expanding. That is, while some of these code points are as small as 0, 1, 2 or 3, others have a decimal value as large as 1,114,112 (0×110000). To represent such large code points in binary, you would need a minimum of 3 bytes. How would the decoder know which code point is only 1 byte, which is 2 or 3?

- Hey pal, Decoding Office here. Yeah, about the string of bytes that you just sent us, is it 1 byte followed by 2 groups of 2 bytes?
- hmm, I don’t remember I think it was 2 bytes followed by 1 byte, then followed by 2 other bytes, but I’m not sure anymore.
- Could it be 3 bytes followed by 2 bytes?
- nah, that looks more like 3 bytes, followed by 1 byte and then another. Try it like that and let me know what happens.
- ah fuck it, we’re not getting paid enough for this sh… ahem, thanks for nothing! click.

This is a pretty naive mapping scheme, there’s no way for a decoder to know how large each code point is just by looking at bits. They could be 1, 2 or 3 bytes, no way to tell. The only sensible solution in this situation is to agree on a “one-size fit-all” approach, where all code points have the same size. That way the decoder knows exactly where each code point begins and where it ends. The space also needs to be large enough for the largest values to fit in. In the case of Unicode this is a minimum of 3 bytes. To illustrate, suppose the message above was in fact a 2 bytes code point, followed by a single byte code point, then another double bytes code point. For the decoder to read them properly, we would have to pad each code point to make them the same size (3 bytes) before sending the message. Here’s the padded version of our message:

00000000  11011011  10010010  (2 bytes code point)
00000000  00000000  00010110  (1 byte code point)
00000000  10001101  11011000  (2 bytes code point)

Some of the earliest Unicode encoding schemes actually worked with this approach. One scheme called UCS2 aka UTF-16 required 2 bytes (16 bits) and occasionally used 4 bytes in special circumstances[1]. Another scheme, UCS4 aka UTF-32 just settled with simply using 4 bytes (32 bits) for all characters.

It all seemed fine, but look, padding code points with empty bytes is just considered wasteful and also introduce some efficiency issues with some processors’ architecture. It also is very unappealing to most folks that work 99% of the time with characters that only actually require 1 byte in legacy encodings (ASCII, Latin-1, etc).

                                 0110 0001   :   'a' in ascii
                                 0110 0001   :   'a' in Latin-1
                                 0110 0001   :   'a' in Unicode code point
                      0000 0000  0110 0001   :   'a' in Unicode encoded with UCS2
0000 0000  0000 0000  0000 0000  0110 0001   :   'a' in Unicode encoded with UCS4

                                 1110 1001   :   'é' in Latin-1
                                 1110 1001   :   'é' in Unicode code point
                      0000 0000  1110 1001   :   'é' in Unicode encoded with UCS2
0000 0000  0000 0000  0000 0000  1110 1001   :   'é' in Unicode encoded with UCS4

What a waste indeed. Plus, attempts at optimizing the scheme gave birth to some nasty stuff such as Byte Order Marks (BOM), which complicated the whole thing even more and people just went “Nah, lets just go back to Latin-1″ or whatever else they were using. So UCS2 and UCS4 pretty much bombed.

Recap on UTF-16 and UTF-32
- wasteful and inefficient (i.e. it sucked)

[1] UCS2, using only 2 bytes, can’t map more than 65,536 characters (2^16). To get beyond that limitation it uses a trick called the Surrogate Blocks. Unicode has a range of 2048 unused code points within those first 65,536. By splitting that range in 2 and combining a value from the upper half (2 bytes) to a value from the lower half (2 other bytes), you can get 1024 x 1024 new code points (4 bytes each).

In the next article of this series I’ll cover UTF-8 and how it saved Unicode.

Thank you for reading.

Unicode etc. Part 2: Introducing Unicode

In a previous article I quickly went over some basics about binary, hex, ascii and extended encodings.

To recap:
- ASCII had 128 character positions (0 – 127).
- Note that the last character position was 0111 1111 (127).
- Since ASCII didn’t include characters foreign to english,
- and some people really wanted accents,
- they turned the ASCII leading bit on (1000 0000 aka 128)
- and got 128 more characters (1000 0000 to 1111 1111).
- Though, they couldn’t agree on which character to put where,
- and so ensued various incompatible implementations from position 128 onward.

It was becoming clear that this soup of incompatible charsets would cause problems on the long run. Unicode was an attempt to solve this by creating the mother of character sets. A map of all possible symbols and characters.

Unicode Code Points:

At the time of writing, Unicode includes over 1.1 million characters, each assigned a specific and unique number just like in ASCII and other legacy encodings. Those numbers are generally called code points.

Lets forget about computers for a moment to simply concentrate on Unicode and its code points. If I wrote a sentence using only Unicode code points, you could transcribe it in letters and symbols, just by matching these code points to their assigned character.

Before we proceed, there is a sensible nuance that most articles discussing Unicode don’t spend enough time emphasizing, or go the opposite way by adding so much details that it confuses the reader. As a result, the foundations necessary to understand the general idea remains brittle. In this article I’ll press on it to the point of annoyance, you can thank me later. So pay attention, here it goes:

Unicode code points are just numbers assigned to characters

Repeat this in your head 5 times and then some. Did I mention anything about computers? I think not. Again:

Unicode code points are just numbers assigned to characters

In my opinion, the first thing that confuses people the most about Unicode is that there’s a premature connection made between Unicode, computers and encodings. I believe the reason to be that these code points are most often represented in hexadecimal. But they don’t need to be, that’s just a convention.

Unicode code points are just numbers assigned to characters

I didn’t say Hexadecimal numbers, I said numbers. Also notice how I didn’t mention encoding in that sentence? I haven’t forgotten anything. As far as Unicode code points are concerned, there’s no encoding. The map is straight forward. Code points -> characters.

Unicode code points are just numbers assigned to characters

It’s as simple as that. No need to complicate this explanation by mentioning computers and how characters will be represented in documents and a whole other set of complexities. Think of Unicode as Morse Code, but with numbers instead of dots and dashes. You want to send a message in Unicode you grab the Unicode map, a pen and some paper and you just proceed to write down those numbers instead of characters. I want to decode your message, I grab my Unicode map and do the translation the other way around. That’s it. Unicode code points know nothing about computers, or hex, or encodings.

Now, you hopefully have a clue that Unicode code points are just numbers assigned to characters. Lets continue.

To make the transition from ASCII to Unicode seamless, it was decided that ASCII characters would have the same code point value in Unicode. That is, ‘a’ which is 97 in ASCII is also 97 in Unicode, ‘1′ is still 49 and so forth. The Unicode folks even went as far as making their code points also compatible with the then popular Latin-1 extended code points (characters 128 – 255). That is ‘é’ which is 233 in Latin-1 is also 233 in Unicode.

There’s more to Unicode, but nothing that needs to be covered here to understand the big picture.  By the end of this series you should be able to explore its darker corners with much less apprehensions.

Lets recap on Unicode:
- code points -> characters
- code points 0 – 127 are the same as ASCII’s
- code points 128 – 255 are the same as ISO-8859-1’s (Latin-1)

In the next article of the series we’ll discuss some Unicode encoding attempts.

thank you for reading.

Unicode etc. Part 1: Refresher

This series of posts aims at explaining what Unicode, UTF-8 and other legacy encodings are and how they relate to each other. Although some authors have already attempted the exercise and sometimes succeed at getting the point across, I’ve often had the impression that prior knowledge of certain implicit concepts was assumed. Therefore, in my own attempt at covering this topic, I will linger on a few key details that I believe to be very helpful in having a solid grasp of the subject. In this first post, lets go over some brief reminders. This is really basic stuff that I won’t go into more details than necessary. The post should really serve as nothing more than a refresher on binary or ascii. If you wish to dig a bit more, by all means happy googling.

1- The decimal, binary and hexadecimal numerical systems.

Computers like most electrical devices can only detect 2 states, power-On/power-Off. Electronic operations therefore need to be performed using a numbering system that can accommodate that limitation. The binary numerical system aka base-2, is a complete numerical system using only 2 symbols, 1 and 0. A perfect fit.

Humans work mostly in decimal aka base-10 (10 symbols). Binary is extremely tedious for us. Furthermore, converting between binary and decimal isn’t so obvious.

It just so happens that there’s a direct mathematical relationship between base-2 and base-16 (hexadecimal). Converting between base-10 and base-16 isn’t too complicated either.

Thus, a good consensus in computing is to simply represent numbers in hexadecimal, since it’s a good middle ground between decimal and binary. It’s easier on the eyes and the brains. The trade-off is to learn to be familiar with it, which isn’t so bad.

Hexadecimal numbers are represented using 16 symbols: 0-9 for the first 10, and past these, a-f (for values 10-15). In computing, hex are generally represented with a leading 0 followed by an x (in upper or lowercase). e.g. 0×21 tells me this isn’t a decimal 21 but rather hex 21, a completely different value.

decimal :  binary : hexadecimal
   0    :      0  :     0x0
   1    :      1  :     0x1
   2    :     10  :     0x2
   3    :     11  :     0x3
   4    :    100  :     0x4
 ...    :    ...  :     ...
  10    :   1010  :     0xa
  11    :   1011  :     0xb
  12    :   1100  :     0xc
 ...    :   ...   :     ...
  15    :   1111  :     0xf
  16    :  10000  :    0x10
  17    :  10001  :    0x11
 ...    :    ...  :     ...
  29    :  11101  :    0x1d
  30    :  11110  :    0x1e
  31    :  11111  :    0x1f
  32    : 100000  :    0x20
 ...    :    ...  :     ...
 ...    :    ...  :     ...

Converting between decimal and hex isn’t so important for the topic at hand. Conversions between binary and hex are more relevant and may help, it’s also super easy. To convert from binary to hex, simply gather binary digits in groups of 4 starting from the right. Pad the leftmost remaining numbers with 0 if needed. Now, simply convert each group to its hex equivalent:

        0001 0011 = 0x13
        1011 1111 = 0xbf
   0010 1001 1010 = 0x29a

To convert the other way around, you do the reverse, replace the hex number by its binary equivalent.

2- Bits and bytes

Computers process data in batches of 8 bits called Bytes (sometimes called octets):

1010 1111 = 0xaf  (1 byte)
0001 0011  1010 1111 = 0x13af (2 bytes)

3- An oversimplified explanation of character sets.

Computers don’t actually understand characters. In reality, characters, whether they’re in a document, on a command line, or a browser window, are represented as strings of bytes (i.e. batches of 8 bits packets). To display them on screen, a decoder processes the electronic stream and maps each number it identifies to yet another type of data, containing additional instructions particularly relevant to a graphical application, whose purpose is to draw these shapes on screen (i.e. the characters). This is very low level stuff and I think we can probably get by with this overly simplistic big picture. In a nutshell, characters are stored as numbers and there’s software to map these numbers to shapes drawn on screen for our benefit.

ASCII

ASCII was an effort to create such a character map. 128 characters to be exact, numbered 0 to 127. Including upper and lower case english letters, numbers, punctuations, some symbols, as well as some white spaces and formatting characters (spaces, tabs, new lines, etc). Some examples of ASCII characters and their decimal, binary and hexadecimal representation.

Char :   Dec  :   Binary      :   Hex
--------------------------------------------
'A'  :   65   :   0100 0001   :   0x41
'a'  :   97   :   0110 0001   :   0x61
'b'  :   98   :   0110 0010   :   0x62
'1'  :   49   :   0011 0001   :   0x31
'!'  :   33   :   0010 0001   :   0x21
'~'  :   126  :   0111 1110   :   0x7e

Note that if ~ is at position 126, it means that it’s the 127th character (first character is at 0). Therefore the 128th and last ASCII character (an unprintable) is at 127, aka 0111 1111 (bin), aka 0×7f (hex).

ASCII doesn’t include accents, nor other letters and symbols used in non-english texts.

Extended Character Sets: Latin-1, etc.

As mentioned previously the last ASCII character is at position 0111 1111. You probably noticed the unused leading bit, yes? Well, so did a few folks who really wanted characters such as ç, ñ, ß and other such fancy glyphs. They proceeded to extend the ASCII set for their own purpose by switching that leading bit on. It gave them an additional range of 128 new possible characters (1000 0000 to 1111 1111). Unfortunately, not everyone agreed as to what put where and so, each came with their own implementation.

Thus began a plethora of different character sets that had everything in common from position 0 to 127 (ASCII) and then diverged from 128 to 255 (some would eventually overlap on some subsets). A couple of them, such ISO-8859-1 also known as Latin-1, are still very much in use today.

To recap:
- ASCII had 128 character positions (0 – 127).
- Note that the last character position was 0111 1111 (127).
- Since ASCII didn’t include characters foreign to english,
- and some people really wanted accents,
- they turned the ASCII leading bit on (1000 0000 aka 128)
- and got 128 more characters (1000 0000 to 1111 1111).
- Though, they couldn’t agree on which character to put where,
- and so ensued various incompatible implementations from position 128 onward.

In the next article I’ll introduce Unicode.

thank you for reading.