Unicode and UTF-8
Unicode
Unicode is a character encoding standard: every character is assigned a code point (essentially a number). Example:
1 U+0031
A U+0041
a U+0061
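The mapping above can be checked in Python, whose built-in ord() and chr() convert between a character and its code point. This is only a small sketch; the helper name code_point_label is made up for this illustration.

```python
def code_point_label(ch: str) -> str:
    """Format one character's code point in the usual U+XXXX notation."""
    # ord() returns the code point as an integer; 04X pads it to four hex digits.
    return f"U+{ord(ch):04X}"


for ch in "1Aa":
    print(ch, code_point_label(ch))
# 1 U+0031
# A U+0041
# a U+0061

# chr() goes the other way, from code point to character.
print(chr(0x41))  # A
```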
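At first glance it may seem that these numbers are stored as 2-byte words, but that is far from the case. Code points run up to U+10FFFF, so they would not even fit in 2 bytes. Unicode only assigns a number to each character; it says nothing about how that number is laid out in memory. The memory representation is specified by an encoding format such as UTF-8.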
UTF-8
UTF-8 is a variable-length encoding format. It encodes ASCII characters in a single byte, which makes it backwards compatible with ASCII.
0xxx xxxx A single-byte US-ASCII code (one of the first 128 characters, U+0000 to U+007F)
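The backwards compatibility with ASCII can be illustrated with Python's built-in codecs. This is a sketch with an arbitrary sample string: pure ASCII text should produce identical bytes under both encodings, each with its top bit clear.

```python
text = "Hello, UTF-8"  # arbitrary sample containing only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For ASCII-only text the two encodings give the same byte sequence.
print(ascii_bytes == utf8_bytes)  # True

# Every byte matches the 0xxx xxxx pattern, i.e. its value is below 0x80.
print(all(b < 0x80 for b in utf8_bytes))  # True
```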
For characters whose code points need more than one byte, the first byte carries a prefix that indicates how many bytes follow.
110x xxxx One more byte follows
1110 xxxx Two more bytes follow
1111 0xxx Three more bytes follow
10xx xxxx A continuation of one of the multi-byte characters
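To make the prefix table concrete, here is a hand-written encoder for a single code point. It is an illustrative sketch, not how any particular library implements UTF-8: the function name utf8_encode is hypothetical, and in practice Python's built-in str.encode("utf-8") does this job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the prefix table above."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        # Out-of-range values and surrogates are not encodable in UTF-8.
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # 0xxx xxxx: plain ASCII, one byte.
        return bytes([code_point])
    if code_point < 0x800:
        # 110x xxxx + one continuation byte.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110 xxxx + two continuation bytes.
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 1111 0xxx + three continuation bytes.
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# U+20AC (euro sign) needs three bytes: a 1110 xxxx lead byte
# followed by two 10xx xxxx continuation bytes.
encoded = utf8_encode(0x20AC)
print(" ".join(f"{b:08b}" for b in encoded))
# 11100010 10000010 10101100
```

Comparing the result with chr(code_point).encode("utf-8") is a quick way to check the sketch against Python's own encoder.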
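The 8 in UTF-8 refers to the size of its code unit: every code point is encoded as one or more 8-bit units, so 8 bits is the smallest encoding a code point can have. Likewise, UTF-16 uses 16-bit units, so even an ASCII character takes 16 bits, and so on.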
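The difference between the formats shows up when the same characters are encoded in each of them. The sketch below uses Python's built-in codecs; the helper name encoded_sizes is made up for this example, and the little-endian codec variants are chosen only so that no byte order mark is added to the output.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character occupies in each format."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le variant: no byte order mark
        "utf-32": len(ch.encode("utf-32-le")),
    }


# "A" (U+0041), the euro sign (U+20AC), and an emoji (U+1F600).
for ch in ["A", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
# U+0041 {'utf-8': 1, 'utf-16': 2, 'utf-32': 4}
# U+20AC {'utf-8': 3, 'utf-16': 2, 'utf-32': 4}
# U+1F600 {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

Note that UTF-16 is also variable length: code points above U+FFFF take two 16-bit units.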