In a previous article I quickly went over some basics about binary, hex, ASCII and extended encodings.
To recap:
- ASCII had 128 character positions (0 – 127).
- Note that the last character position was 0111 1111 (127).
- Since ASCII didn’t include characters foreign to English,
- and some people really wanted accents,
- they turned on the eighth bit that ASCII left unused (1000 0000 aka 128)
- and got 128 more characters (1000 0000 to 1111 1111).
- Though, they couldn’t agree on which character to put where,
- and so ensued various incompatible implementations from position 128 onward.
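To make that incompatibility concrete, here is a minimal Python sketch. It decodes the same byte value, 233, with three legacy charsets. The choice of these three codecs is mine, purely for illustration; any set of extended charsets that disagree above 127 would make the same point.

```python
# One byte above 127, interpreted by three different legacy charsets.
raw = bytes([233])  # 1110 1001

# Codec names are the ones Python's standard library uses for these charsets.
decoded = {name: raw.decode(name) for name in ("latin-1", "cp437", "iso8859_5")}

for name, char in decoded.items():
    print(f"{name:>10}: byte 233 -> {char!r}")
```

Same number, three different characters, depending on which map the reader happens to hold.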
It was becoming clear that this soup of incompatible charsets would cause problems in the long run. Unicode was an attempt to solve this by creating the mother of all character sets: a map of all possible symbols and characters.
Unicode Code Points:
At the time of writing, Unicode has room for over 1.1 million characters (1,114,112 positions, of which only a fraction is assigned so far), each position having a specific and unique number just like in ASCII and other legacy encodings. Those numbers are generally called code points.
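If you are curious about the actual numbers, here is a rough way to look them up, assuming Python 3 and its standard `unicodedata` module. What I count as "assigned" below (everything except unassigned, surrogate and private-use code points) is my own simplification, and the result depends on the Unicode version bundled with your interpreter.

```python
import sys
import unicodedata

# sys.maxunicode is the highest code point; the range starts at 0.
code_space = sys.maxunicode + 1

# Skip unassigned (Cn), surrogate (Cs) and private-use (Co) code points.
# Control and format characters are counted as assigned here.
assigned = sum(
    1
    for cp in range(code_space)
    if unicodedata.category(chr(cp)) not in ("Cn", "Cs", "Co")
)

print("Unicode data version:", unicodedata.unidata_version)
print("size of the code space:", code_space)
print("assigned code points:", assigned)
```

The 1.1 million figure is the size of the whole range of available numbers; the count of numbers that actually have a character attached is much smaller and grows with each Unicode release.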
Let’s forget about computers for a moment and simply concentrate on Unicode and its code points. If I wrote a sentence using only Unicode code points, you could transcribe it into letters and symbols just by matching these code points to their assigned characters.
Before we proceed, there is a subtle nuance that most articles discussing Unicode either don’t spend enough time emphasizing, or bury under so many details that the reader ends up confused. As a result, the foundations necessary to understand the general idea remain brittle. In this article I’ll press on it to the point of annoyance; you can thank me later. So pay attention, here it goes:
Unicode code points are just numbers assigned to characters
Repeat this in your head 5 times and then some. Did I mention anything about computers? I think not. Again:
Unicode code points are just numbers assigned to characters
In my opinion, what confuses people most about Unicode is the premature connection made between Unicode, computers and encodings. I believe the reason is that these code points are most often written in hexadecimal. But they don’t need to be; that’s just a convention.
Unicode code points are just numbers assigned to characters
I didn’t say hexadecimal numbers, I said numbers. Also notice how I didn’t mention encoding in that sentence? I haven’t forgotten anything. As far as Unicode code points are concerned, there’s no encoding. The map is straightforward. Code points -> characters.
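If you want to convince yourself that the notation doesn’t matter, here is a small Python sketch. It writes the same code point, 233, in decimal, hexadecimal, octal and binary and looks each one up with the built-in `chr`. The variable names are mine.

```python
code_point = 233  # plain decimal

# The very same number in decimal, hexadecimal, octal and binary notation.
same_number = [233, 0xE9, 0o351, 0b11101001]

# Every notation leads to the same character.
chars = {chr(n) for n in same_number}
print(chars)

# The familiar "U+" spelling is only a way of writing the number down.
print(f"U+{code_point:04X}")
```

The set ends up holding a single character, because there was only ever one number.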
Unicode code points are just numbers assigned to characters
It’s as simple as that. No need to complicate this explanation by mentioning computers, how characters will be represented in documents, and a whole other set of complexities. Think of Unicode as Morse code, but with numbers instead of dots and dashes. If you want to send a message in Unicode, you grab the Unicode map, a pen and some paper, and you write down those numbers instead of characters. If I want to decode your message, I grab my Unicode map and do the translation the other way around. That’s it. Unicode code points know nothing about computers, or hex, or encodings.
Now, you hopefully have a clue that Unicode code points are just numbers assigned to characters. Let’s continue.
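The pen-and-paper exercise can be sketched in a few lines of Python, where the built-ins `ord` and `chr` play the role of the Unicode map. The two helper functions are hypothetical names of my own, not part of any library.

```python
def to_code_points(message):
    """Look up each character's number in the map."""
    return [ord(ch) for ch in message]


def from_code_points(numbers):
    """Translate the numbers back into characters."""
    return "".join(chr(n) for n in numbers)


message = "caf\u00e9"  # "café", with the accented letter spelled out explicitly
numbers = to_code_points(message)

print(numbers)
print(from_code_points(numbers))
```

Nothing here says how those numbers would be stored in a file; the list of integers is the whole message.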
To make the transition from ASCII to Unicode seamless, it was decided that ASCII characters would have the same code point value in Unicode. That is, ‘a’ which is 97 in ASCII is also 97 in Unicode, ‘1’ is still 49 and so forth. The Unicode folks even went as far as making their code points compatible with the then-popular Latin-1 extended code points (characters 128 – 255). That is, ‘é’ which is 233 in Latin-1 is also 233 in Unicode.
There’s more to Unicode, but nothing that needs to be covered here to understand the big picture. By the end of this series you should be able to explore its darker corners with much less apprehension.
Let’s recap on Unicode:
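This compatibility is easy to check, assuming Python’s built-in `ascii` and `latin-1` codecs are faithful tables of those legacy charsets. The sketch below uses them only as lookup tables for the old numbering and compares each position with the Unicode code point of the character found there.

```python
# Positions 0-127 read through the ASCII table.
ascii_text = bytes(range(128)).decode("ascii")
ascii_matches = all(ord(ch) == i for i, ch in enumerate(ascii_text))

# Positions 0-255 read through the Latin-1 table.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_matches = all(ord(ch) == i for i, ch in enumerate(latin1_text))

# The three examples from the text: 'a', '1' and 'é'.
print(ord("a"), ord("1"), ord("\u00e9"))
print(ascii_matches, latin1_matches)
```

Both checks come out true: the old position and the Unicode code point are the same number for every one of those characters.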
- code points -> characters
- code points 0 – 127 are the same as ASCII’s
- code points 128 – 255 are the same as ISO-8859-1’s (Latin-1)
In the next article of the series we’ll discuss some Unicode encoding attempts.
Thank you for reading.


