character_encoding
Overview
What is a character encoding? In the end, when people do things with computers, they tend to work with text, whether that’s programs or some other input that they give to the computer. Text is conceptually a list of characters. That’s what separates it from random images on paper. The idea with character encoding is that computers would definitely prefer numbers. We take these characters and assign them numbers, integers. Then we figure out a way to transcribe those integers into a list of bytes. That whole process of going from text to a list of bytes is known as encoding. The reverse step is known as decoding.
For example, this would be the standard ASCII approach. “Hello” is a five-letter word, so we split it into five separate characters. Each of these characters is assigned a number. At least in this case, we say that each of these numbers corresponds to 1 byte in the final output. Once you start working with more characters, that system breaks down, because when each character is 1 byte, you’re stuck with at most 256 characters. That doesn’t work for more complicated use cases like Chinese characters.
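The two steps for “Hello” can be sketched in Python. This is only an illustration using the built-in `ord`, `str.encode`, and `bytes.decode`; the variable names are made up for the example.

```python
text = "Hello"

# Step 1: split the text into characters and map each one to its number.
code_points = [ord(ch) for ch in text]

# Step 2: write each number out as one byte (this only works for ASCII).
encoded = text.encode("ascii")

# Decoding reverses the process: bytes back to text.
decoded = encoded.decode("ascii")

print(code_points)    # [72, 101, 108, 108, 111]
print(list(encoded))  # the same numbers, one per byte
print(decoded)        # Hello
```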
ASCII
The simplest way to do this is ASCII. At least historically, it is the most important of the first character encodings that came into existence. The idea is that we take 128 characters, not all of which are printable characters that you could see on paper, and we assign each of them a number. Each character gets a value, usually written in decimal or hexadecimal. We say each of these values will just be encoded as a single byte in the final output. I will use hexadecimal representations a lot. It doesn’t really matter, because the exact values don’t matter. ASCII is a 7-bit character encoding, so the values run from 0 to 127. It covers most use cases that appear in English and in languages that use a similar alphabet, which aren’t all that many. That’s pretty much it. There’s not a lot else that you can do with it, which is frustrating when you do want to support other languages.
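As a small sketch of what “7-bit” means in practice, the following Python snippet checks that the top bit of every ASCII byte is clear and shows what happens with a character outside the ASCII range. The sample strings and variable names are assumptions chosen for the example.

```python
# Every ASCII value fits in 7 bits, so the top bit of each byte is 0.
data = "Hi!".encode("ascii")
top_bits_clear = all(byte < 0x80 for byte in data)

# Decimal and hexadecimal are two spellings of the same value.
table = [(ch, ord(ch), hex(ord(ch))) for ch in "Hi!"]

# A character outside ASCII cannot be encoded at all.
try:
    "é".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(top_bits_clear)  # True
print(table)           # [('H', 72, '0x48'), ('i', 105, '0x69'), ('!', 33, '0x21')]
print(ascii_failed)    # True
```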
ISO 8859-* and Windows code pages
The first step towards making ASCII work for other languages is to extend it. That is the idea behind a lot of the character encodings that came next, in particular the ISO 8859-* family and the Windows code pages. ASCII is 7-bit, which means we have another 128 values available in each byte. We’re just going to create a lot of character encodings that each cover some specific languages. For example, there’s Latin-1, which is the common name for ISO 8859-1 but rolls off the tongue a bit more nicely. That is for Western European languages: you can write Spanish, French, and German with it, languages like that. Then there are other encodings for other languages, like the Cyrillic variant in that standard, and there’s also the Cyrillic Windows code page. These are not necessarily compatible. Here we have two different character encodings for the same group of languages, both for Cyrillic. If you encode something as one of them and decode it as the other, it will come out as garbage. That was not a great situation overall. It also doesn’t really cover, for example, the Chinese character use case.
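The incompatibility between the two Cyrillic encodings can be demonstrated with a short Python sketch. It assumes Python’s codec names `iso8859_5` (ISO 8859-5) and `cp1251` (the Cyrillic Windows code page); the sample word is arbitrary.

```python
word = "Привет"  # Russian for "hello"

iso_bytes = word.encode("iso8859_5")  # ISO 8859-5 (Cyrillic)
win_bytes = word.encode("cp1251")     # Windows-1251 (Cyrillic)

# Both use one byte per character, but the byte values differ.
same_length = len(iso_bytes) == len(win_bytes) == len(word)
different_bytes = iso_bytes != win_bytes

# Decoding with the matching encoding gives the text back ...
round_trip_ok = win_bytes.decode("cp1251") == word

# ... while decoding with the other one silently produces garbage.
garbled = win_bytes.decode("iso8859_5", errors="replace")

print(iso_bytes.hex(), win_bytes.hex())
print(same_length, different_bytes, round_trip_ok, garbled == word)
```

Note that the wrong decode does not raise an error. Both encodings are single-byte, so the bytes simply map to different characters.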
GBK
One example of a character encoding that was used for Chinese characters is GBK. The idea is that 128 extra characters are not enough. If a byte is not ASCII, meaning it has the upper bit set, then the next byte also counts towards this character. So it’s either a 1-byte ASCII character or a 2-byte Chinese character. That gives you room for about 30,000 more characters, which is still not enough to cover all of the Chinese characters that exist, but it’s practical enough.
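The “1 byte or 2 bytes” rule can be sketched as a simplified splitter in Python. This is an illustration of the idea only: it assumes well-formed input and does not validate byte ranges the way a real GBK decoder does. The sample text is an assumption.

```python
text = "a中"
encoded = text.encode("gbk")

# Walk the bytes the way the text describes: if the top bit is set,
# the next byte belongs to the same character.
chunks = []
i = 0
while i < len(encoded):
    width = 2 if encoded[i] & 0x80 else 1
    chunks.append(encoded[i:i + width])
    i += width

# Each chunk decodes to exactly one character.
characters = [chunk.decode("gbk") for chunk in chunks]

print(len(encoded))               # 3: one byte for "a", two for the Chinese character
print([len(c) for c in chunks])   # [1, 2]
```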
Unicode
What we ended up with is Unicode, which is not an encoding. It is a character set. Take an example string, which I chose because it has a non-ASCII character in it: to each of these characters we assign a number. That is Unicode, a mapping from characters to numbers. It should ideally cover all the use cases that were previously covered by other encodings. Then we specify an actual encoding, and there are multiple of those, which defines how to translate these integers into byte sequences, like UTF-8, UTF-16, UTF-32, and quite a few others. UTF-8 and UTF-16 are the most important ones.
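To make the split between character set and encoding concrete, here is a minimal Python sketch: one character has a single Unicode number, but different byte sequences depending on the encoding. The character “é” is just an example choice, and the big-endian codec variants are used so that no byte order mark is added.

```python
ch = "é"

# The character set: one number per character, independent of any encoding.
code_point = ord(ch)  # 233, i.e. 0xE9

# The encodings: different byte sequences for the same number.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(code_point)   # 233
print(utf8.hex())   # c3a9
print(utf16.hex())  # 00e9
print(utf32.hex())  # 000000e9
```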
The way that Unicode characters are usually written is U+ followed by four hex digits, or up to six if the character doesn’t fit into the four-hex-digit range. That is how you specify which character you are talking about; it does not specify how it is encoded. The numbering is compatible with Latin-1: the first 256 Unicode characters are the 256 Latin-1 characters. The maximum number is U+10FFFF, which is a little more than 1 million, so we have a little more than 1 million code points available in total for Unicode. Hopefully, that’s enough for the future, we’ll see. Right now there’s no issue with that. It also includes emoji, which is something that the Unicode standard is famous for these days.
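The notation and the two numeric claims can be checked with a short Python sketch. The helper `u_plus` is a hypothetical name introduced for this example; `sys.maxunicode` is the largest code point Python accepts.

```python
import sys


def u_plus(ch: str) -> str:
    """Format a single character in U+ notation with at least four hex digits."""
    return f"U+{ord(ch):04X}"


# Four digits where they suffice, more where they do not.
examples = [u_plus("A"), u_plus("€"), u_plus("😀")]

# The first 256 Unicode numbers match the Latin-1 byte values.
latin1_compatible = all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))

# The largest valid number, a little over 1 million.
max_code_point = sys.maxunicode

print(examples)             # ['U+0041', 'U+20AC', 'U+1F600']
print(latin1_compatible)    # True
print(hex(max_code_point))  # 0x10ffff
```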
UTF-8
The most common character encoding that is used with Unicode is UTF-8. It’s a variable-length encoding: the higher the character number is, the longer the byte sequence in which it is encoded, from 1 up to 4 bytes. In particular, it’s ASCII compatible. ASCII characters are encoded in UTF-8 exactly as they are in ASCII, which is very nice. The scheme has another nice property. We don’t really have to worry here about how the actual bits are laid out in these byte sequences. What matters is that if there is something broken, some invalid byte in there, then when you decode it, that won’t break decoding of the rest of the string.
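The three properties mentioned here (variable length, ASCII compatibility, and recovery after an invalid byte) can be sketched in Python. The sample characters and the deliberately broken byte string are assumptions for illustration; `errors="replace"` substitutes U+FFFD for bytes that cannot be decoded.

```python
# Variable length: higher code points need more bytes.
samples = ["A", "é", "€", "😀"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# ASCII compatibility: ASCII text yields identical bytes in both encodings.
ascii_compatible = "Hello".encode("utf-8") == "Hello".encode("ascii")

# Robustness: one invalid byte (0xFF never appears in valid UTF-8)
# does not prevent the surrounding text from being decoded.
broken = b"ab\xffcd"
recovered = broken.decode("utf-8", errors="replace")

print(lengths)           # [1, 2, 3, 4]
print(ascii_compatible)  # True
print(ascii(recovered))  # 'ab\ufffdcd'
```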
UTF-16
UTF-16 uses 2-byte code units, so there are about 65,000 characters that can be encoded in a single 2-byte unit. Characters that do not fit into that range are split into a pair of two code units, known as a surrogate pair. Because it uses 2-byte units, there are two different variants, depending on byte order. I don’t know if you are familiar with that. There are, generally, little-endian machines and big-endian machines. The little-endian ones put the low byte first and then the higher-value byte; big-endian is the reverse. Most modern processors use little endian.
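Both points, the two byte orders and the pair of code units for characters beyond the 2-byte range, can be sketched in Python. The helper `surrogate_pair` is a hypothetical name for this example; it follows the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves.

```python
import struct


def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into two 16-bit UTF-16 code units."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# Same character, two byte orders.
little = "A".encode("utf-16-le")  # low byte first
big = "A".encode("utf-16-be")     # high byte first

# A character outside the single-unit range needs two code units.
pair = surrogate_pair(ord("😀"))
pair_bytes = struct.pack(">HH", *pair)  # big-endian, two unsigned 16-bit values

print(little.hex(), big.hex())   # 4100 0041
print([hex(u) for u in pair])    # ['0xd83d', '0xde00']
print(pair_bytes == "😀".encode("utf-16-be"))  # True
```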