Date Feb 19, 2011 @ 07:47
The ASCII character set uses the numeric codes 0-127 to represent characters. That gives us only 128 characters, which is sufficient for the English language. But as the Internet expanded to other regions of the world where English is not the primary language, a need emerged for a character set that can encode all the characters of the world. Unicode solves this problem by assigning a unique number to every character that is used anywhere in the world. Unicode calls this number a "code point". A code point is like a pointer: it is just an abstract number and does not have any specific byte representation.
| Unicode code point | Character |
| --- | --- |
| U+0041 | A |
| U+00E9 | é |
| U+03B8 | θ (the Greek theta) |
| U+20AC | € (the euro) |
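As a quick check of the table, Python's built-in `ord` returns the code point of a character and `chr` goes the other way. The helper name `code_point_label` is made up for this sketch; it only formats the number in the usual `U+XXXX` style.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"

# The four characters from the table, written as escapes so the
# example does not depend on how this file itself is encoded.
for ch in "A\u00e9\u03b8\u20ac":
    print(code_point_label(ch), ch)

# chr() maps a code point back to its character.
print(chr(0x20AC))
```

Note that nothing here says how the character is stored in bytes; `ord` and `chr` work purely with the abstract number.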
Here it says that the code point U+20AC stands for €, and that's it. How we represent U+20AC in bytes is entirely up to us. Some people might say: always use 6 bytes and set all unused bytes to zero. Others might say: always use 10 bytes and set the unused bytes to zero. There are several different systems for representing Unicode code points in the computer. These are:
- UCS-2: encodes each Unicode character in 2 bytes (16 bits); it can only encode U+0000 to U+FFFF (the basic plane)
- UCS-4: encodes each Unicode character in 4 bytes (32 bits); it can encode the entire UCS range
- UTF-8: the variable-length encoding scheme described in the table below (UTF = UCS Transformation Format)
- UTF-16: an extension of UCS-2 that can encode characters outside the basic plane (the 21-bit range of Unicode 3.1) by using a sequence of two 16-bit code units (a surrogate pair)
- UTF-32: the 4-byte encoding of the 21-bit range of Unicode 3.1, in fact the same as UCS-4
- UTF-7: can be safely forgotten... (a 7-bit clean encoding of Unicode)
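To see that these really are different byte representations of the same code point, here is a small sketch using Python's built-in codecs. The helper `hex_bytes` is a name invented for this example, and the big-endian codec names are chosen on the assumption that we do not want a byte order mark in the output.

```python
euro = "\u20ac"        # U+20AC, inside the basic plane
emoji = "\U0001F600"   # U+1F600, outside the basic plane

def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and show the resulting bytes as spaced hex pairs."""
    return text.encode(encoding).hex(" ").upper()

for name in ("utf-8", "utf-16-be", "utf-32-be"):
    print(f"{name:10} {hex_bytes(euro, name):12} {hex_bytes(emoji, name)}")
```

For the euro sign, UTF-16 needs a single 16-bit unit (`20 AC`), while U+1F600 needs two of them (`D8 3D DE 00`). UTF-32 always takes four bytes.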
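| Bits | Last Code Point | Byte1 | Byte2 | Byte3 | Byte4 | Byte5 | Byte6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | U+007F | 0XXXXXXX | | | | | |
| 11 | U+07FF | 110XXXXX | 10XXXXXX | | | | |
| 16 | U+FFFF | 1110XXXX | 10XXXXXX | 10XXXXXX | | | |
| 21 | U+1FFFFF | 11110XXX | 10XXXXXX | 10XXXXXX | 10XXXXXX | | |
| 26 | U+3FFFFFF | 111110XX | 10XXXXXX | 10XXXXXX | 10XXXXXX | 10XXXXXX | |
| 31 | U+7FFFFFFF | 1111110X | 10XXXXXX | 10XXXXXX | 10XXXXXX | 10XXXXXX | 10XXXXXX |

Note that the 5-byte and 6-byte rows belong to the original UTF-8 design. The current standard (RFC 3629) stops at four bytes, and the highest valid code point is U+10FFFF.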
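Picking the bucket is just a matter of comparing the code point against the "Last Code Point" column. A minimal sketch, assuming we follow the table literally, including its five- and six-byte rows (the function name `utf8_length` is made up for this example):

```python
# Last code point of each bucket, in table order (1 byte ... 6 bytes).
BUCKET_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]

def utf8_length(code_point: int) -> int:
    """Return how many bytes the table assigns to this code point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for length, last in enumerate(BUCKET_LIMITS, start=1):
        if code_point <= last:
            return length
    raise ValueError("code point does not fit in 31 bits")

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} -> {utf8_length(cp)} byte(s)")
```

For code points that Python itself accepts, the result should agree with `len(chr(cp).encode("utf-8"))`.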
Let's take each example in turn. The code point for A is U+0041, which falls in the first bucket. The first bucket has the format 0XXXXXXX, and the binary representation of 0x41 is 01000001. Now take the lowest seven bits, i.e. drop the one leading zero on the left to get 1000001, and put them into the 0XXXXXXX format. This becomes 01000001, i.e. 0x41. This means that for the first bucket UTF-8 uses one byte only, and that byte is the same as the ASCII code.
Similarly, the code point for é is U+00E9, which falls in the second bucket. The second bucket has the format 110XXXXX 10XXXXXX. In binary, 0x00E9 is 0000 0000 1110 1001. Filling the least significant 11 bits (00011 101001) into the format, we get 11000011 10101001 (0xC3 0xA9).
The code point for € is U+20AC, which falls in the third bucket. The third bucket has the format 1110XXXX 10XXXXXX 10XXXXXX. In binary, 0x20AC is 0010 0000 1010 1100. Filling these 16 bits (0010 000010 101100) into the format, we get 11100010 10000010 10101100 (0xE2 0x82 0xAC).
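The same bit-filling can be written as a short function. This is only a sketch of the procedure walked through above, not a replacement for a real codec: the name `encode_utf8` is made up here, it handles just the first four buckets, and unlike Python's built-in encoder it does not reject surrogate code points (U+D800 to U+DFFF).

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point by hand, following the bucket formats."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return bytes([code_point])        # first bucket: 0XXXXXXX

    # (last code point, total bytes, lead-byte prefix) for buckets 2 to 4
    buckets = ((0x7FF, 2, 0xC0), (0xFFFF, 3, 0xE0), (0x1FFFFF, 4, 0xF0))
    for last, length, prefix in buckets:
        if code_point <= last:
            break
    else:
        raise ValueError("code point too large for this sketch")

    out = []
    # Peel off 6 bits at a time, lowest first, into 10XXXXXX bytes.
    for _ in range(length - 1):
        out.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # Whatever bits remain go into the lead byte after its prefix.
    out.append(prefix | code_point)
    return bytes(reversed(out))

for cp in (0x41, 0xE9, 0x20AC):
    print(f"U+{cp:04X} ->", encode_utf8(cp).hex(" ").upper())
```

Comparing the result with `chr(cp).encode("utf-8")` is an easy way to confirm that the hand-rolled version matches the built-in codec for these examples.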
