Unicode and UTF-8!
Date Feb 19, 2011 @ 07:47

The ASCII character set uses the numbers 0-127 to encode characters. That gives only 128 codes, which is sufficient for the English language. But as the Internet expanded to other regions of the world where English is not the primary language, a need emerged for a character set that can encode all the characters of the world. Unicode solves this problem by assigning a unique number to every character that is used anywhere in the world. Unicode calls this number a "code point". A code point is like a pointer: it does not have any specific byte representation.

Unicode code point    Character
U+0041                A
U+00E9                é
U+03B8                θ (the Greek theta)
U+20AC                € (the euro)
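
To make the idea of a code point concrete, here is a small Python sketch that prints the rows of the table from plain integers. The helper name `describe` is made up for this example; `chr()` and `ord()` are Python built-ins that convert between a code point and its character.

```python
# Code points from the table, written as plain integers.
code_points = [0x0041, 0x00E9, 0x03B8, 0x20AC]


def describe(cp: int) -> str:
    """Format a code point in the table's style, e.g. 'U+0041 A'."""
    return f"U+{cp:04X} {chr(cp)}"


for cp in code_points:
    print(describe(cp))

# ord() goes the other way: from a character to its code point.
print(hex(ord("€")))
```

Note that nothing here says how the character is stored in bytes. The code point is only a number.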

Here it says that the code point "U+20AC" is for €. That's it. How we want to represent "U+20AC" is entirely up to us. Some people may say to always use 6 bytes and set all unused bytes to zero (0); others may say to always use 10 bytes and set the unused bytes to zero. So there are different systems for representing a Unicode code point in the computer.
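
As a quick illustration that one code point can have several byte representations, the sketch below encodes € with three encodings that ship with Python. UTF-16 and UTF-32 are not discussed in this post; they are picked here only as two well-known alternatives to UTF-8.

```python
text = "€"  # code point U+20AC

# Same code point, three different byte representations.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")  # big-endian, no byte order mark
utf32 = text.encode("utf-32-be")  # big-endian, no byte order mark

for name, data in [("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)]:
    print(f"{name:7} {len(data)} bytes: {data.hex(' ')}")
```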

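Let's see UTF-8 as an example. Unicode code points are divided into buckets, and the computer representation of a code point depends on which bucket it falls in. (The table shows the original UTF-8 design, which goes up to 6 bytes; the current standard, RFC 3629, stops at 4 bytes and U+10FFFF.)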

Bits  Last Code Point  Byte1     Byte2     Byte3     Byte4     Byte5     Byte6
7     U+007F           0XXXXXXX
11    U+07FF           110XXXXX  10XXXXXX
16    U+FFFF           1110XXXX  10XXXXXX  10XXXXXX
21    U+1FFFFF         11110XXX  10XXXXXX  10XXXXXX  10XXXXXX
26    U+3FFFFFF        111110XX  10XXXXXX  10XXXXXX  10XXXXXX  10XXXXXX
31    U+7FFFFFFF       1111110X  10XXXXXX  10XXXXXX  10XXXXXX  10XXXXXX  10XXXXXX
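
The bucket table can be turned into a short encoder. This is a minimal sketch that follows the table literally, including the 5-byte and 6-byte rows; the function name `utf8_encode` is made up for this example, and unlike a real UTF-8 encoder it does not reject surrogates or code points above U+10FFFF.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by following the bucket table row by row."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp <= 0x7F:
        # First bucket: 0XXXXXXX, a single byte.
        return bytes([cp])

    # (last code point, prefix of the first byte, number of 10XXXXXX bytes)
    buckets = [
        (0x7FF, 0xC0, 1),       # 110XXXXX
        (0xFFFF, 0xE0, 2),      # 1110XXXX
        (0x1FFFFF, 0xF0, 3),    # 11110XXX
        (0x3FFFFFF, 0xF8, 4),   # 111110XX
        (0x7FFFFFFF, 0xFC, 5),  # 1111110X
    ]
    for last, prefix, n in buckets:
        if cp <= last:
            # The first byte carries the highest bits after its prefix.
            out = [prefix | (cp >> (6 * n))]
            # Each following byte carries the next 6 bits behind a 10 prefix.
            for i in range(n - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
    raise ValueError("code point does not fit in 31 bits")


for cp in (0x0041, 0x00E9, 0x20AC):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```

For code points that are valid in modern Unicode, the result should match Python's built-in `str.encode("utf-8")`.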

Let's take each example. The code point for A is "U+0041", which falls in the first bucket. The first bucket has the format 0XXXXXXX. The binary representation of 0x41 is 01000001. Take the lowest seven bits, i.e. drop the one zero on the left side to get "1000001", and put them into the 0XXXXXXX format. This gives 01000001, i.e. 0x41. This means that for the first bucket UTF-8 uses only one byte.
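
The one-byte case is easy to check in Python. This small sketch prints the bits of U+0041 and confirms that its UTF-8 encoding is the same single byte that ASCII uses; the variable names are just for illustration.

```python
cp = 0x0041  # code point for "A"

# All eight bits; the leading 0 is the marker of the first bucket.
bits = format(cp, "08b")
print(bits)

# For code points up to U+007F, UTF-8 and ASCII produce the same byte.
as_utf8 = "A".encode("utf-8")
as_ascii = "A".encode("ascii")
print(as_utf8.hex(), as_ascii.hex())
```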

Similarly, the code point for é is "U+00E9", which falls in the second bucket. The second bucket has the format 110XXXXX 10XXXXXX. E9 in binary is 0000 0000 1110 1001. Filling the least significant 11 bits (000 1110 1001) into the format, we get 11000011 10101001 (0xC3 0xA9).
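
The two-byte case can be replayed step by step with bit strings. This is only a sketch of the manual procedure, with made-up variable names; it is not how an encoder would normally be written.

```python
cp = 0x00E9  # code point for é

# The second bucket has room for 11 bits: 5 in byte 1 and 6 in byte 2.
bits = format(cp, "011b")
high, low = bits[:5], bits[5:]

# Put the prefixes 110 and 10 in front of the two groups.
byte1 = int("110" + high, 2)
byte2 = int("10" + low, 2)

print(bits, high, low)
print(hex(byte1), hex(byte2))
```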

The code point for € is "U+20AC", which falls in the third bucket. The third bucket has the format 1110XXXX 10XXXXXX 10XXXXXX. 20AC in binary is 0010 0000 1010 1100. Filling these 16 bits into the format, we get 11100010 10000010 10101100 (0xE2 0x82 0xAC).
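
Going in the other direction is a useful check on the three-byte case: strip the prefixes from the bytes and glue the remaining bits back together. The sketch below does this by hand for 0xE2 0x82 0xAC, assuming the input really is a well-formed three-byte sequence (it does no validation).

```python
encoded = bytes([0xE2, 0x82, 0xAC])
b1, b2, b3 = encoded

# Mask away the prefixes: 1110 on the first byte, 10 on the other two.
cp = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

print(f"U+{cp:04X}", chr(cp))
```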