Unicode etc. Part 1: Refresher

This series of posts aims to explain what Unicode, UTF-8 and the older legacy encodings are, and how they relate to each other. Other authors have already attempted the exercise, and some succeed at getting the point across, but I’ve often had the impression that prior knowledge of certain implicit concepts was assumed. Therefore, in my own attempt at covering this topic, I will linger on a few key details that I believe help in getting a solid grasp of the subject. In this first post, let’s go over some brief reminders. This is really basic stuff, so I won’t go into more detail than necessary. The post should serve as nothing more than a refresher on binary, hexadecimal and ASCII. If you wish to dig a bit deeper, by all means, happy googling.

1- The decimal, binary and hexadecimal numerical systems.

Computers, like most digital devices, can only distinguish 2 electrical states: on and off. Their operations therefore need a numbering system that can accommodate that limitation. The binary numeral system, aka base-2, is a complete numbering system using only 2 symbols, 0 and 1. A perfect fit.

Humans work mostly in decimal aka base-10 (10 symbols). Binary is extremely tedious for us. Furthermore, converting between binary and decimal isn’t so obvious.

It just so happens that there’s a direct mathematical relationship between base-2 and base-16 (hexadecimal): since 16 is 2 to the power of 4, each hexadecimal digit corresponds to exactly 4 binary digits. Converting between base-10 and base-16 isn’t too complicated either.

Thus, a common convention in computing is to represent numbers in hexadecimal, since it’s a good middle ground between decimal and binary. It’s easier on the eyes and the brain. The trade-off is having to become familiar with it, which isn’t so bad.

Hexadecimal numbers are written using 16 symbols: 0-9 for the first 10 values and a-f for values 10-15. In computing, hex numbers are generally written with a leading 0 followed by an x (the letters may be in upper or lowercase), e.g. 0x21 tells me this isn’t decimal 21 but rather hex 21, a completely different value (33 in decimal).
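A quick way to check the notation is a short Python script (Python is only my choice for illustration; the variable names are made up for this example). It shows that the prefix changes the value being written, and that case doesn't matter:

```python
# The same digits "21" denote different values depending on the base.
decimal_21 = 21
hex_21 = 0x21          # hexadecimal literal: 2*16 + 1

print(decimal_21)      # 21
print(hex_21)          # 33
print(0X21 == 0x21)    # True: the x may be upper or lower case
print(0xAF == 0xaf)    # True: so may the digits a-f
print(int("21", 16))   # 33: parsing hex digits given as text
```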

decimal :  binary : hexadecimal
   0    :      0  :     0x0
   1    :      1  :     0x1
   2    :     10  :     0x2
   3    :     11  :     0x3
   4    :    100  :     0x4
 ...    :    ...  :     ...
  10    :   1010  :     0xa
  11    :   1011  :     0xb
  12    :   1100  :     0xc
 ...    :    ...  :     ...
  15    :   1111  :     0xf
  16    :  10000  :    0x10
  17    :  10001  :    0x11
 ...    :    ...  :     ...
  29    :  11101  :    0x1d
  30    :  11110  :    0x1e
  31    :  11111  :    0x1f
  32    : 100000  :    0x20
 ...    :    ...  :     ...
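If you'd rather generate rows of this table than read them, here is a minimal sketch in Python. The helper name table_row is hypothetical; format() is the standard built-in.

```python
def table_row(n: int) -> str:
    """Format one row of the table: decimal, binary, hexadecimal."""
    binary = format(n, "b")    # binary digits, without the "0b" prefix
    hexa = format(n, "#x")     # lowercase hex, with the "0x" prefix
    return f"{n:>4}    : {binary:>6}  : {hexa:>7}"

for n in (0, 1, 2, 10, 15, 16, 31, 32):
    print(table_row(n))
```

Any non-negative integer can be passed to table_row; the columns simply widen once the values outgrow the padding.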

Converting between decimal and hex isn’t so important for the topic at hand. Conversions between binary and hex are more relevant, and they’re also super easy. To convert from binary to hex, gather the binary digits in groups of 4, starting from the right. Pad the leftmost group with leading 0s if needed. Then convert each group to its hex equivalent:

        0001 0011 = 0x13
        1011 1111 = 0xbf
   0010 1001 1010 = 0x29a
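The grouping steps above can be sketched in a few lines of Python. This is an illustration of the manual method rather than how you'd normally convert (the built-in hex() does it directly), and the function name binary_to_hex is hypothetical.

```python
def binary_to_hex(bits: str) -> str:
    """Convert a string of binary digits to hex by grouping in fours."""
    bits = bits.replace(" ", "")
    # Pad on the left so the length becomes a multiple of 4.
    padded_len = -(-len(bits) // 4) * 4
    bits = bits.zfill(padded_len)
    # Each 4-bit group maps to exactly one hex digit.
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    digits = "".join(format(int(group, 2), "x") for group in groups)
    return "0x" + digits

print(binary_to_hex("0001 0011"))      # 0x13
print(binary_to_hex("1011 1111"))      # 0xbf
print(binary_to_hex("10 1001 1010"))   # 0x29a
```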

To convert the other way around, do the reverse: replace each hex digit by its 4-digit binary equivalent.
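The reverse direction is just as short. Again, this is only a sketch under my own naming (hex_to_binary is hypothetical), expanding one hex digit at a time:

```python
def hex_to_binary(hex_str: str) -> str:
    """Expand each hex digit into its 4-bit binary group."""
    digits = hex_str.lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    # "04b" means: binary, zero-padded to a width of 4.
    return " ".join(format(int(digit, 16), "04b") for digit in digits)

print(hex_to_binary("0x13"))    # 0001 0011
print(hex_to_binary("0xbf"))    # 1011 1111
print(hex_to_binary("0x29a"))   # 0010 1001 1010
```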

2- Bits and bytes

Computers process data in groups of 8 bits called bytes (sometimes called octets). Since a hex digit covers 4 bits, one byte is written as exactly 2 hex digits:

1010 1111 = 0xaf  (1 byte)
0001 0011  1010 1111 = 0x13af (2 bytes)
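To see the bytes behind a number, here is a small Python sketch using the second example above. One assumption on my part: the bytes are listed most significant first (byteorder="big"), which matches the way the number is written out; byte order is a separate topic this post doesn't cover.

```python
value = 0x13af

# Number of bytes needed to hold the value (at least 1).
n_bytes = max(1, (value.bit_length() + 7) // 8)
print(n_bytes)                           # 2

# Split the number into individual bytes, most significant first.
raw = value.to_bytes(n_bytes, byteorder="big")
print([hex(b) for b in raw])             # ['0x13', '0xaf']
print([format(b, "08b") for b in raw])   # ['00010011', '10101111']
```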

3- An oversimplified explanation of character sets.

Computers don’t actually understand characters. In reality, characters, whether they’re in a document, on a command line, or in a browser window, are represented as sequences of bytes (i.e. groups of 8 bits). To display them on screen, a decoder processes the byte stream and maps each number it identifies to yet another type of data, containing instructions for a graphical application whose purpose is to draw the corresponding shapes (i.e. the characters) on screen. This is very low-level stuff, and I think we can get by with this overly simplistic big picture. In a nutshell, characters are stored as numbers, and there’s software to map these numbers to shapes drawn on screen for our benefit.

ASCII

ASCII was an effort to create such a character map: 128 characters to be exact, numbered 0 to 127. It includes upper and lower case English letters, digits, punctuation, some symbols, as well as whitespace and formatting characters (space, tab, new line, etc.). Here are some examples of ASCII characters and their decimal, binary and hexadecimal representations:

Char :   Dec  :   Binary      :   Hex
--------------------------------------------
'A'  :   65   :   0100 0001   :   0x41
'a'  :   97   :   0110 0001   :   0x61
'b'  :   98   :   0110 0010   :   0x62
'1'  :   49   :   0011 0001   :   0x31
'!'  :   33   :   0010 0001   :   0x21
'~'  :   126  :   0111 1110   :   0x7e

Note that if ~ is at position 126, it’s the 127th character (the first character is at 0). Therefore the 128th and last ASCII character (the unprintable DEL control character) is at 127, aka 0111 1111 (bin), aka 0x7f (hex).
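The character-to-number mapping in the table can be explored with Python's built-in ord() and chr(). The helper name describe is hypothetical, a minimal sketch for poking at the table:

```python
def describe(ch: str) -> tuple:
    """Return (decimal, binary, hex) for a single character."""
    code = ord(ch)                      # character -> number
    return code, format(code, "08b"), format(code, "#04x")

for ch in ("A", "a", "b", "1", "!", "~"):
    print(repr(ch), describe(ch))

print(chr(65))                          # A (number -> character)
print(list("Hi!".encode("ascii")))      # [72, 105, 33]
```

The last line shows the "characters are stored as numbers" idea on a whole string: encoding to ASCII yields one byte per character.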

ASCII doesn’t include accented letters, nor other letters and symbols used in non-English texts.
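This limitation is easy to observe. In the Python sketch below (the sample word is my own choice), asking for an ASCII encoding of a string containing ç fails, because there is simply no ASCII number for that character:

```python
text = "garçon"

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as err:
    ascii_ok = False
    print(err)                 # explains which character could not be encoded

print(ascii_ok)                # False
print("garcon".isascii())      # True
print(text.isascii())          # False
```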

Extended Character Sets: Latin-1, etc.

As mentioned previously, the last ASCII character is at position 0111 1111. You probably noticed the unused leading bit, yes? Well, so did a few folks who really wanted characters such as ç, ñ, ß and other such fancy glyphs. They proceeded to extend the ASCII set for their own purposes by switching that leading bit on. It gave them an additional range of 128 new possible characters (1000 0000 to 1111 1111). Unfortunately, not everyone agreed on what to put where, and so each group came up with its own implementation.
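The arithmetic of that leading bit can be checked with a few bitwise operations. This is a small Python sketch; the constant names are made up for the example.

```python
LAST_ASCII = 0b0111_1111   # 127, leading bit off
HIGH_BIT = 0b1000_0000     # 128, only the leading bit on

print(LAST_ASCII & HIGH_BIT)                # 0: ASCII never uses the leading bit

first_extended = HIGH_BIT                   # 1000 0000
last_extended = HIGH_BIT | LAST_ASCII       # 1111 1111
print(first_extended, last_extended)        # 128 255
print(last_extended - first_extended + 1)   # 128 additional positions
```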

Thus began a plethora of different character sets that had everything in common from position 0 to 127 (ASCII) and then diverged from 128 to 255 (though some would overlap on some subsets). A couple of them, such as ISO-8859-1, also known as Latin-1, are still very much in use today.
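To make the divergence concrete, here is a Python sketch that decodes the same two bytes with three of these character sets. The choice of ISO-8859-5 (Cyrillic) and ISO-8859-7 (Greek) as comparison points is mine; any other extended set would illustrate the same thing.

```python
raw = bytes([0x41, 0xE7])   # one byte below 128, one byte above

decoded = {}
for charset in ("iso-8859-1", "iso-8859-5", "iso-8859-7"):
    decoded[charset] = raw.decode(charset)
    print(charset, "->", decoded[charset])

# iso-8859-1 -> Aç
# iso-8859-5 -> Aч
# iso-8859-7 -> Aη
```

The first byte (0x41) comes out as A everywhere, since all three agree with ASCII below 128. The second byte (0xE7) becomes a different character in each set, which is exactly the incompatibility described above.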

To recap:
- ASCII had 128 character positions (0 – 127).
- Note that the last character position was 0111 1111 (127).
- Since ASCII didn’t include characters foreign to English,
- and some people really wanted accents,
- they turned the ASCII leading bit on (1000 0000 aka 128)
- and got 128 more characters (1000 0000 to 1111 1111).
- Though, they couldn’t agree on which character to put where,
- and so ensued various incompatible implementations from position 128 onward.

In the next article I’ll introduce Unicode.

Thank you for reading.