UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding that represents Unicode text as a stream of bytes.
Characters with code points below 128 are encoded as a single byte containing their value: these correspond exactly to the 128 7-bit ASCII characters. All other characters require several bytes, and the uppermost bit of each of those bytes is 1, so that every byte is greater than 127 and cannot be mistaken for a 7-bit ASCII character (particularly the control characters, e.g. carriage return). The bits of the code point are split into several groups, which are then placed in the low-order positions of these bytes.
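The property described above can be checked with a short Python sketch. It relies only on the built-in `str.encode`; the helper names `utf8_bytes_per_char` and `high_bit_set` are illustrative and not part of any standard.

```python
def utf8_bytes_per_char(text):
    """Encode each character separately so the byte groups stay visible."""
    return [ch.encode("utf-8") for ch in text]


def high_bit_set(byte_value):
    """Return True if the uppermost bit of the byte is 1 (value above 127)."""
    return (byte_value & 0x80) != 0


# 'A' and carriage return are ASCII; alef (U+05D0) is not.
sample = "A\r\u05d0"
groups = utf8_bytes_per_char(sample)

for ch, encoded in zip(sample, groups):
    if ord(ch) < 128:
        # ASCII characters are a single byte equal to the code point.
        assert encoded == bytes([ord(ch)])
    else:
        # Every byte of a multi-byte sequence has its top bit set.
        assert all(high_bit_set(b) for b in encoded)
```

Because no byte of a multi-byte sequence falls below 128, a program that scans the byte stream for an ASCII control character such as carriage return will not find a false match inside another character.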
For example, the character alef (א), which is Unicode 0x05D0, needs two bytes and is encoded in UTF-8 as 0xD7 0x90.
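The alef example can be worked through bit by bit. The following Python sketch assumes the two-byte pattern 110xxxxx 10xxxxxx, which applies to code points from 0x0080 to 0x07FF; the variable names are chosen for illustration only.

```python
code_point = 0x05D0  # alef; binary 101 1101 0000 (11 significant bits)

# Split the 11 bits into an upper group of 5 and a lower group of 6.
upper_five = code_point >> 6     # 0b10111
lower_six = code_point & 0x3F    # 0b010000

# Place each group in the low-order positions of its byte.
first_byte = 0xC0 | upper_five   # 110xxxxx -> 11010111
second_byte = 0x80 | lower_six   # 10xxxxxx -> 10010000

encoded = bytes([first_byte, second_byte])

# Cross-check against Python's built-in encoder.
assert encoded == "\u05d0".encode("utf-8")
```

The result is the byte pair 0xD7 0x90, both of which are greater than 127.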
Description
UTF-8 is currently standardized as RFC 3629 (UTF-8, a transformation format of ISO 10646), which is quite extensive and detailed. A short summary follows for readers interested only in a general overview.
Code range (hexadecimal) | UTF-16                              | UTF-8 (binary)                      | Notes
000000 - 00007F          | 00000000 0xxxxxxx                   | 0xxxxxxx                            | ASCII equivalence range; byte begins with zero
000080 - 0007FF          | 00000xxx xxxxxxxx                   | 110xxxxx 10xxxxxx                   | first byte begins with 11, the following byte(s) begin with 10
000800 - 00FFFF          | xxxxxxxx xxxxxxxx                   | 1110xxxx 10xxxxxx 10xxxxxx          | first byte begins with 11, the following byte(s) begin with 10
010000 - 10FFFF          | 110110xx xxxxxxxx 110111xx xxxxxxxx | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | UTF-16 requires surrogates; an offset of 0x10000 is subtracted, so the bit pattern is not identical with UTF-8
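The four rows of the table translate fairly directly into code. The Python sketch below is an illustration under stated assumptions, not a reference implementation: the function names `encode_utf8` and `utf16_surrogates` are hypothetical, and the rejection of the surrogate range 0xD800 to 0xDFFF follows RFC 3629 rather than anything shown in the table itself.

```python
def encode_utf8(code_point):
    """Encode one code point, choosing the byte pattern by code range."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside 000000 - 10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        # Reserved for UTF-16 surrogates; not encodable as UTF-8.
        raise ValueError("surrogate values cannot be encoded")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def utf16_surrogates(code_point):
    """Return the UTF-16 surrogate pair for a code point above 0xFFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("surrogates apply only to 010000 - 10FFFF")
    offset = code_point - 0x10000          # the offset named in the table
    high = 0xD800 | (offset >> 10)         # 110110xx xxxxxxxx
    low = 0xDC00 | (offset & 0x3FF)        # 110111xx xxxxxxxx
    return high, low


# One sample per table row, cross-checked against the built-in encoder.
for cp in (0x41, 0x05D0, 0x20AC, 0x10348):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
```

For the last sample, U+10348, the UTF-8 bytes carry the code point's bits unchanged, while the UTF-16 pair carries the bits of 0x10348 minus 0x10000, which is why the two bit patterns differ in that row.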
So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the UCS-2 characters use three bytes, and characters beyond UCS-2 (code points 0x10000 and above) are encoded in four bytes. (An earlier UTF-8 specification allowed even higher code points to be represented, using 5 or 6 bytes, but this is no longer supported.)
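The counts in this paragraph follow from the range boundaries. A small Python sketch, with the hypothetical helper name `utf8_length`, makes the arithmetic explicit:

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for a code point, per the code ranges."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside 000000 - 10FFFF")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


one_byte_count = 0x80            # 128 code points
two_byte_count = 0x800 - 0x80    # 1920 code points

# Sample letters from the scripts named above, all in the two-byte range:
# Greek alpha, Cyrillic Ya, Coptic Shei, Armenian Ayb, Hebrew alef, Arabic alef.
two_byte_samples = "\u03b1\u042f\u03e2\u0531\u05d0\u0627"
lengths = [utf8_length(ord(ch)) for ch in two_byte_samples]
```

The sample characters are an arbitrary selection; any code point from 0x0080 to 0x07FF gives the same length of two.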
Advantages
Disadvantages
External links