CESU-8
Revision as of 20:35, 2 August 2017
The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26.[1] A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (three bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four.
The encoding of a Unicode supplementary character works out to 11101101 1010yyyy 10xxxxxx 11101101 1011xxxx 10xxxxxx, where yyyy is the value of the top five bits of the 21-bit code point minus one, and the x bits are the remaining 16 bits of the code point in order.
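The bit layout above can be reproduced with a short encoder. The following is a minimal Python sketch, not a reference implementation; the function name `cesu8_encode` is hypothetical, and the sketch assumes the input is an ordinary Python string. BMP code points are handed to Python's own UTF-8 codec, and supplementary code points are split into a surrogate pair whose halves are each written as a three-byte sequence.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative sketch; name is hypothetical)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code point: identical to UTF-8. "surrogatepass" lets a lone
            # surrogate in the input through instead of raising an error.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary code point: build the UTF-16 surrogate pair first.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # carries yyyy and the top x bits
            low = 0xDC00 | (v & 0x3FF)   # carries the low ten x bits
            for s in (high, low):
                # Each surrogate becomes a three-byte UTF-8-style sequence.
                out += bytes((
                    0xE0 | (s >> 12),
                    0x80 | ((s >> 6) & 0x3F),
                    0x80 | (s & 0x3F),
                ))
    return bytes(out)


encoded = cesu8_encode("\U00010400")
print(encoded.hex(" ").upper())
```

For U+10400 the top five bits are 00001, so yyyy is 0000 and the second byte is A0.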
CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only.[2] It should be used exclusively for internal processing and never for external data exchange.
Supporting CESU-8 in HTML documents is prohibited by the W3C[3][4] and WHATWG[5] HTML standards, as it would present a cross-site scripting vulnerability.[6]
CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).
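The difference from Modified UTF-8 can be made concrete with a sketch that encodes one UTF-16 code unit at a time. The function name `encode_code_units` and its `modified` flag are hypothetical and exist only for this illustration. With `modified=False` the output is CESU-8; with `modified=True` the only change is that U+0000 is written as the two bytes C0 80, which is how Java's Modified UTF-8 avoids embedded zero bytes.

```python
import struct


def encode_code_units(text: str, modified: bool = False) -> bytes:
    """Encode each UTF-16 code unit of text as its own UTF-8-style sequence.

    Hypothetical helper: modified=False yields CESU-8, modified=True
    additionally encodes U+0000 as C0 80 (Modified UTF-8 behaviour).
    """
    raw = text.encode("utf-16-be")  # raises on lone surrogates in the input
    units = struct.unpack(f">{len(raw) // 2}H", raw)
    out = bytearray()
    for u in units:
        if u == 0 and modified:
            out += b"\xc0\x80"  # overlong two-byte form of NUL
        elif u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes((0xC0 | (u >> 6), 0x80 | (u & 0x3F)))
        else:
            # Includes surrogate code units D800..DFFF.
            out += bytes((
                0xE0 | (u >> 12),
                0x80 | ((u >> 6) & 0x3F),
                0x80 | (u & 0x3F),
            ))
    return bytes(out)


print(encode_code_units("A\x00").hex(" "))                 # CESU-8
print(encode_code_units("A\x00", modified=True).hex(" "))  # Modified UTF-8
```

For text that contains no NUL character, the two modes in this sketch produce identical bytes.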
The Oracle database uses CESU-8 for its "UTF8" character set. Standard UTF-8 can be obtained using the character set "AL32UTF8" (since Oracle version 9.0).
Examples
Encoding | U+0045 | U+0205 | U+10400 |
---|---|---|---|
Character | E | ȅ | 𐐀 |
UTF-8 | 45 | C8 85 | F0 90 90 80 |
UTF-16 | 0045 | 0205 | D801 DC00 |
CESU-8 | 45 | C8 85 | ED A0 81 ED B0 80 |
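The CESU-8 bytes in the table can be turned back into text with Python's standard codecs. This is a lenient sketch rather than a validating decoder: the name `cesu8_decode` is hypothetical, and the function also accepts ordinary four-byte UTF-8 sequences, which a strict CESU-8 decoder would reject. It relies on the `surrogatepass` error handler to let the encoded surrogates through, then round-trips through UTF-16 so that each surrogate pair is joined into one supplementary character.

```python
def cesu8_decode(data: bytes) -> str:
    """Leniently decode CESU-8 bytes to text (illustrative sketch)."""
    # Step 1: decode as UTF-8, keeping encoded surrogates as lone surrogates.
    halves = data.decode("utf-8", "surrogatepass")
    # Step 2: write the code units out as UTF-16 and decode them normally,
    # which combines each high/low surrogate pair into one code point.
    # An unpaired surrogate makes the final decode raise UnicodeDecodeError.
    return halves.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


row = bytes.fromhex("45 C8 85 ED A0 81 ED B0 80")  # CESU-8 bytes from the table
text = cesu8_decode(row)
print([f"U+{ord(c):04X}" for c in text])
print(text.encode("utf-8").hex(" ").upper())  # the same text in standard UTF-8
```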
References
- ^ McGowan, Rick. "Unicode Technical Report #26 - Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)". Unicode Consortium.
- ^ "About Unicode Technical Reports - Types of Unicode Technical Reports: UAX, UTS, UTR". Unicode Consortium.
- ^ "8.2.2.3. Character encodings". HTML 5.1 Standard. W3C.
- ^ "8.2.2.3. Character encodings". HTML 5 Standard. W3C.
- ^ "12.2.3.3 Character encodings". HTML Living Standard. WHATWG.
- ^ "<meta> - HTML". MDN Web Docs. Mozilla.