Unicode and UTF-8

Unicode

Unicode is a character set: every character is assigned a code point (essentially a number). Example:

1 U+0031
A U+0041
a U+0061
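In Python, a quick sketch of this mapping uses the built-in `ord()` (character to code point) and `chr()` (code point to character):

```python
# Print each character alongside its Unicode code point
for ch in "1Aa":
    print(f"{ch} U+{ord(ch):04X}")

# chr() goes the other way: code point -> character
print(chr(0x0061))  # prints "a"
```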

At first glance it may seem that the numbers are stored as 2-byte words - far from it. Unicode only assigns a number to a character; how that number is laid out in memory is specified by an encoding format such as UTF-8.

UTF-8

UTF-8 is a variable-length encoding format. It encodes ASCII characters in a single byte - thus remaining backwards compatible with ASCII.

0xxx xxxx    A single-byte US-ASCII code (one of the first 128 characters, code points 0-127)

For characters whose code points require more than one byte, the leading byte's prefix indicates how many bytes follow.

110x xxxx    One more byte follows
1110 xxxx    Two more bytes follow
1111 0xxx    Three more bytes follow

10xx xxxx    A continuation byte of a multi-byte sequence
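These byte patterns can be observed directly by encoding characters of different sizes and printing the resulting bits (a small sketch; the sample characters are arbitrary):

```python
# Show the UTF-8 byte patterns for 1-, 2-, 3-, and 4-byte characters
for ch in "a\u00e9\u20ac\U0001f642":  # a, é, €, 🙂
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {bits}")
```

Note how the single-byte character starts with `0`, the leading byte of each multi-byte sequence starts with `110`, `1110`, or `11110`, and every continuation byte starts with `10`.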

The 8 in UTF-8 refers to the size of its code unit: 8 bits. UTF-16 uses 16-bit code units, so even ASCII characters occupy two bytes, and so on.
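The code-unit difference shows up in the encoded sizes. A brief comparison (using Python's `utf-16-be` codec to avoid the byte-order mark):

```python
# Compare encoded sizes: UTF-8 uses 8-bit code units, UTF-16 uses 16-bit units
for ch in "a\u20ac":  # a (ASCII), € (3 bytes in UTF-8)
    u8 = ch.encode("utf-8")
    u16 = ch.encode("utf-16-be")  # big-endian, no BOM prepended
    print(f"U+{ord(ch):04X}: UTF-8 {len(u8)} byte(s), UTF-16 {len(u16)} byte(s)")
```

ASCII text is half the size in UTF-8, while some higher code points are actually smaller in UTF-16 - neither encoding wins universally.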