A few very insightful articles already do a good job of explaining the general concepts behind character encoding, and I won’t rebuild what can just be reused. The goal of this series is for you to understand, once and for all, the relationship between Unicode, UTF-8 and legacy encodings. I try to focus on a few details that other articles often treat as assumed knowledge and that may remain a bit fuzzy as a result. In this article I’ll skim over the basics with a few assumptions:
1- you already understand the binary numerical system and its conversion to hexadecimal. A brief reminder:
Computers, like most digital electronic devices, work with only 2 states: power on and power off. Electronic operations therefore need to be performed using a numbering system that can accommodate that limitation. The binary numerical system, aka base-2, is a complete numerical system using only 2 symbols, 1 and 0. A perfect fit.
Humans work mostly in decimal aka base-10 (10 symbols). Binary is extremely tedious for us. Furthermore, converting between binary and decimal isn’t so obvious.
It just so happens that there’s a direct mathematical relationship between base-2 and base-16 (hexadecimal): 16 is 2 to the power of 4, so each hex digit stands for exactly 4 binary digits. Converting between base-10 and base-16 isn’t too complicated either.
Thus, a common convention in computing is to represent numbers in hexadecimal, since it’s a good middle ground between decimal and binary. It’s easier on the eyes and the brain. The trade-off is having to become familiar with it, which isn’t so bad.
Hexadecimal numbers are represented using 16 symbols: 0-9 for the first 10 values, and past these, a-f (for values 10-15). In computing, hex numbers are generally written with a leading 0 followed by an x (in upper or lowercase), e.g. 0x21 tells me this isn’t decimal 21 but rather hex 21, a completely different value (decimal 33).
decimal : binary : hexadecimal
0 : 0 : 0x0
1 : 1 : 0x1
2 : 10 : 0x2
3 : 11 : 0x3
4 : 100 : 0x4
... : ... : ...
10 : 1010 : 0xa
11 : 1011 : 0xb
12 : 1100 : 0xc
... : ... : ...
15 : 1111 : 0xf
16 : 10000 : 0x10
17 : 10001 : 0x11
... : ... : ...
29 : 11101 : 0x1d
30 : 11110 : 0x1e
31 : 11111 : 0x1f
32 : 100000 : 0x20
... : ... : ...
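If you have Python 3 at hand, the rows of this table can be reproduced with the built-in `format()` and `int()` functions. This is only a small sketch; the helper name `to_row` is mine, not part of any library.

```python
def to_row(n: int) -> str:
    # format(n, "b") gives the binary digits without a "0b" prefix;
    # format(n, "#x") gives lowercase hex with the "0x" prefix.
    return f"{n} : {format(n, 'b')} : {format(n, '#x')}"

for n in (0, 10, 16, 31, 32):
    print(to_row(n))

# Going the other way, int() accepts the base as its second argument.
thirty_one = int("11111", 2)
thirty_three = int("0x21", 16)  # hex 21 is decimal 33, not 21
```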
Converting between decimal and hex isn’t so important for the topic at hand. Conversions between binary and hex are more relevant and may help. They’re also super easy. To convert from binary to hex, gather the binary digits in groups of 4 starting from the right, and pad the leftmost group with 0s if it has fewer than 4 digits. Then convert each group to its hex equivalent:
0001 0011 = 0x13
1011 1111 = 0xbf
0010 1001 1010 = 0x29a
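The grouping method can be written out step by step. The sketch below is one possible way to do it in Python; `binary_to_hex` and `HEX_DIGITS` are names I made up for illustration, and the function assumes its input contains only 0s, 1s and spaces.

```python
HEX_DIGITS = "0123456789abcdef"

def binary_to_hex(bits: str) -> str:
    bits = bits.replace(" ", "")
    # Pad on the left with 0s so the length is a multiple of 4.
    pad = (-len(bits)) % 4
    bits = "0" * pad + bits
    # Cut into groups of 4 bits and look up each group's hex digit.
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "0x" + "".join(HEX_DIGITS[int(group, 2)] for group in groups)

print(binary_to_hex("0001 0011"))
print(binary_to_hex("1011 1111"))
print(binary_to_hex("10 1001 1010"))  # leftmost group gets padded to 0010
```

In everyday code you would more likely write `hex(int(bits, 2))`; the long form is only there to mirror the manual method.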
To convert the other way around, you do the reverse: replace each hex digit by its 4-digit binary equivalent.
2- you know that computers process binary data in groups of 8 bits called bytes:
1010 1111 = 0xaf (1 byte)
0001 0011 1010 1111 = 0x13af (2 bytes)
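The reverse direction is just as mechanical. Here is a rough Python sketch, with `hex_to_binary` being a hypothetical helper name of my own:

```python
def hex_to_binary(hex_str: str) -> str:
    digits = hex_str.lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    # Each hex digit expands to exactly 4 binary digits.
    return " ".join(format(int(digit, 16), "04b") for digit in digits)

print(hex_to_binary("0x13"))
print(hex_to_binary("0x29a"))
```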
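To see the byte grouping for yourself, Python's built-in `bytes.fromhex()` turns pairs of hex digits into bytes. A short sketch (the helper `as_bits` is my own naming):

```python
one = bytes.fromhex("af")    # 2 hex digits = 8 bits = 1 byte
two = bytes.fromhex("13af")  # 4 hex digits = 16 bits = 2 bytes

def as_bits(data: bytes) -> str:
    # Show every byte as exactly 8 binary digits.
    return " ".join(format(byte, "08b") for byte in data)

print(len(one), as_bits(one))
print(len(two), as_bits(two))
```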
3- you have some basic or unclear understanding of character sets. An oversimplification of the model:
Computers don’t actually understand characters. In reality, characters, whether they’re in a document, on a command line, or a browser window, are represented as strings of bytes (batches of 8 bits such as 10111110). To display them on screen, a decoder processes the electronic stream and maps each number to yet another type of data, containing additional instructions particularly relevant to a graphical application, whose purpose is to draw these shapes on screen (i.e. the characters). This is very low level stuff and I think we can probably get by with this overly simplistic big picture. In a nutshell, characters are stored as numbers and there’s software to map these numbers to shapes drawn on screen for our benefit.
ASCII
ASCII was an effort to create such a character map: 128 characters to be exact, numbered 0 to 127. It includes upper and lower case English letters, digits, punctuation, some symbols, as well as some white space and formatting characters (spaces, tabs, new lines, etc). Some examples of ASCII characters and their decimal, binary and hexadecimal representations:
Char : Dec : Binary    : Hex
--------------------------------------------
'A'  : 65  : 0100 0001 : 0x41
'a'  : 97  : 0110 0001 : 0x61
'b'  : 98  : 0110 0010 : 0x62
'1'  : 49  : 0011 0001 : 0x31
'!'  : 33  : 0010 0001 : 0x21
'~'  : 126 : 0111 1110 : 0x7e
Note that if ~ is at position 126, it means that it’s the 127th character (the first character is at 0). Therefore the 128th and last ASCII character (the unprintable DEL control character) is at 127, aka 0111 1111 (bin), aka 0x7f (hex).
ASCII doesn’t include accents, nor other letters and symbols used in non-English texts.
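You can check these positions with Python's built-in `ord()` and `chr()`, which convert between a character and its number. This is a small sketch and `ascii_row` is a name I picked for illustration:

```python
def ascii_row(ch: str) -> str:
    code = ord(ch)
    # 08b: 8 binary digits, zero padded; #04x: hex with the "0x" prefix.
    return f"{ch!r} : {code} : {code:08b} : {code:#04x}"

for ch in "Aab1!~":
    print(ascii_row(ch))

# A piece of text is stored as one number per character.
encoded = list("Hi!".encode("ascii"))

# Position 127 exists but is a control character, not a drawable shape.
last_is_printable = chr(127).isprintable()
```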
Extended Character Sets: Latin-1, etc.
As mentioned previously, the last ASCII character is at position 0111 1111. You probably noticed the unused leading bit, yes? Well, so did a few folks who really wanted characters such as ç, ñ, ß and other such fancy glyphs. They proceeded to extend the ASCII set for their own purposes by switching that leading bit on. It gave them an additional range (1000 0000 to 1111 1111) of 128 new possible characters. Unfortunately, not everyone agreed on what to put where, and so each group came up with its own implementation.
Thus began a plethora of different character sets that had everything in common from position 0 to 127 (ASCII) and then diverged from 128 to 255 (some would eventually overlap on some subsets). A couple of them, such as ISO-8859-1, also known as Latin-1, are still very much in use today.
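The arithmetic behind that extra range is easy to verify. A quick Python sketch, with constant and function names of my own choosing:

```python
LAST_ASCII = 0b0111_1111  # 127, leading bit off
HIGH_BIT = 0b1000_0000    # 128, only the leading bit on

def uses_high_bit(byte: int) -> bool:
    return (byte & HIGH_BIT) != 0

# Every value a single byte can hold, split by the leading bit.
ascii_range = [b for b in range(256) if not uses_high_bit(b)]
extended_range = [b for b in range(256) if uses_high_bit(b)]

print(len(ascii_range), ascii_range[0], ascii_range[-1])
print(len(extended_range), extended_range[0], extended_range[-1])
```

Each half holds 128 values: 0 to 127 for ASCII, and 128 to 255 for whatever an extended set decides to put there.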
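To make the divergence concrete, here is a sketch that looks up the same byte in three character sets using the codecs shipped with Python. ISO-8859-7 (Greek) and ISO-8859-5 (Cyrillic) are examples I picked; the helper name `decode_with` is hypothetical.

```python
def decode_with(byte_value: int, charset: str) -> str:
    # Wrap the number in a 1-byte sequence and look it up in the charset.
    return bytes([byte_value]).decode(charset)

for charset in ("iso-8859-1", "iso-8859-7", "iso-8859-5"):
    shared = decode_with(0x41, charset)    # below 128: plain ASCII 'A'
    diverged = decode_with(0xE9, charset)  # above 127: depends on the set
    print(charset, shared, diverged)
```

Byte 0x41 comes out as A every time, while byte 0xe9 is é in Latin-1, ι in the Greek set and щ in the Cyrillic one.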
To recap:
- ASCII had 128 character positions (0 – 127).
- Note that the last character position was 0111 1111 (127).
- Since ASCII didn’t include characters foreign to English,
- and some people really wanted accents,
- they turned the ASCII leading bit on (1000 0000 aka 128)
- and got 128 more characters (1000 0000 to 1111 1111).
- Though, they couldn’t agree on which character to put where,
- and so ensued various incompatible implementations from position 128 onward.
In the next article I’ll introduce Unicode.
Thank you for reading.


