{{Short description|Character encoding}}
{{Use dmy dates|date=June 2020}}
{{Infobox character encoding
| standard = {{IETF RFC|2152}}
| lang = International
<!-- Not | extends = [[
| encodes = [[ISO/IEC 10646]] ([[Unicode]])
| status =
| prev = [[HZ-GB-2312]]
| classification = [[Unicode Transformation Format]], [[ASCII armor]], [[variable-width encoding]], [[state (computer science)|stateful encoding]]
}}
'''UTF-7''' (7-[[bit]] [[Unicode Transformation Format]]) is an obsolete variable-length character encoding for representing [[Unicode]] text using a stream of [[ASCII]] characters. It was originally intended as a way of encoding Unicode text for use in [[Internet]] [[e-mail]] messages that was more efficient than the combination of [[UTF-8]] with [[quoted-printable]].
UTF-7 (according to its RFC) isn't a "[[Unicode Transformation Format]]", as the definition can only encode code points in the [[Basic Multilingual Plane|BMP]] (the first 65,536 Unicode code points, which do not include [[emojis]] and many other characters). However, if a UTF-7 translator converts to or from [[UTF-16]], then it can (and probably does){{citation needed|date=August 2023}} encode each surrogate half as though it were a 16-bit code point, and thus can encode all code points. It is unclear whether other UTF-7 software (such as translators to UTF-32 or UTF-8) supports this.
UTF-7 has never
==Motivation==
[[MIME]], the modern standard
Provided certain rules are followed during encoding, UTF-7 can be sent in e-mail without using an underlying MIME [[MIME#Content-Transfer-Encoding|transfer encoding]], but still must be explicitly identified as the text character set.
UTF-7 is generally not used as a native representation within applications as it is very awkward to process. Despite its size advantage over the combination of UTF-8 with either quoted-printable or
[[8BITMIME]] has also been introduced, which reduces the need to encode message bodies in a 7-bit format.
A modified form of UTF-7 (sometimes dubbed 'mUTF-7'
| url = https://doc.dovecot.org/configuration_manual/mail_location/
| title = Configuration Manual
| at = Sec. "Mail Location Settings"
| author = <!--Not stated-->
| date = 8 February 2023
| website = Dovecot Documentation
| access-date = 2023-02-28
| quote = Store mailbox names on disk using UTF-8 instead of modified UTF-7 (mUTF-7).
}}</ref>) was used in the [[Internet Message Access Protocol|Internet Message Access Protocol (IMAP)]] e-mail retrieval protocol, version 4 rev 1, for "international" mailbox names.{{Ref RFC|3501|section=5.1.3 "Mailbox International Naming Convention"
|quote=In modified UTF-7, printable [[US-ASCII]] characters, except for "&", represent themselves…. The character "&" (0x26) is represented by the two-octet sequence "&-". All other characters… are represented in modified BASE64….
}}
The following version, IMAP version 4 rev 2, uses UTF-8 instead.{{Ref RFC|9051|section=5.1. "Mailbox Naming" |quote=In IMAP4rev2, mailbox names are encoded in Net-Unicode (this differs from IMAP4rev1).}}
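The modified form quoted above can be illustrated with a minimal Python sketch. The function name <code>encode_mutf7</code> is hypothetical, and two details are assumptions taken from RFC 3501 rather than from the quotation itself: the "modified BASE64" replaces <code>/</code> with <code>,</code> and omits <code>=</code> padding, and the encoded units are big-endian UTF-16.

```python
import base64


def encode_mutf7(name: str) -> str:
    """Encode a mailbox name in modified UTF-7 (illustrative sketch only)."""
    out = []
    pending = []  # characters waiting to be Base64-encoded as one run

    def flush() -> None:
        # Emit the pending run as "&" + modified Base64 + "-".
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
            out.append("&" + b64.replace("/", ",") + "-")
            pending.clear()

    for ch in name:
        if ch == "&":
            flush()
            out.append("&-")  # "&" is represented by the two-octet sequence "&-"
        elif 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append(ch)  # printable US-ASCII represents itself
        else:
            pending.append(ch)
    flush()
    return "".join(out)


print(encode_mutf7("Entwürfe"))
print(encode_mutf7("A&B"))
```

Under these assumptions, a mailbox named "Entwürfe" is stored as <code>Entw&APw-rfe</code>, and a literal ampersand becomes <code>&-</code>.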
==Description==
There is also a modified version, specified in RFC 2060, which is sometimes identified as UTF-7.
Some characters can be represented directly as single ASCII bytes. The first group is known as "direct characters" and contains 62 alphanumeric characters and 9 symbols: <code>' ( ) , - . / : ?</code>. The direct characters are safe to include literally. The other main group, known as "optional direct characters", contains all other printable characters in the range {{U+|0020}}–U+007E except <code>~ \ +</code> and space (the characters {{code|\}} and {{code|~}} are excluded because they are redefined in "variants of ASCII" such as [[JIS-Roman]]). Using the optional direct characters reduces size and improves human readability, but it also increases the chance of breakage by, for example, badly designed mail gateways, and may require extra escaping when used in encoded words for header fields.
Space, tab, carriage return and line feed may also be represented directly as single ASCII bytes. However, if the encoded text is to be used in e-mail, care is needed to ensure that these characters are used in ways that do not require further content transfer encoding to be suitable for e-mail. The plus sign (<code>+</code>) ''may'' be encoded as <code>+-</code>.
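The character groups described above can be written out as sets, which also makes it easy to check the stated counts. This is a minimal Python sketch; the names <code>DIRECT</code>, <code>OPTIONAL_DIRECT</code> and <code>classify</code> are illustrative and do not come from any specification or library.

```python
import string

# 62 alphanumeric characters plus the 9 symbols ' ( ) , - . / : ?
DIRECT = set(string.ascii_letters + string.digits + "'(),-./:?")

# Printable ASCII other than space: U+0021 to U+007E.
PRINTABLE = {chr(cp) for cp in range(0x21, 0x7F)}

# Everything else that is printable, except ~ \ and +.
OPTIONAL_DIRECT = PRINTABLE - DIRECT - set("~\\+")


def classify(ch: str) -> str:
    """Return how a single character may be represented in UTF-7."""
    if ch in DIRECT:
        return "direct"
    if ch in OPTIONAL_DIRECT:
        return "optional direct"
    if ch in " \t\r\n":
        return "whitespace (direct)"
    if ch == "+":
        return "plus sign (may be written as +-)"
    return "must be Base64-encoded"


print(len(DIRECT), len(OPTIONAL_DIRECT))
print(classify("<"), "|", classify("~"), "|", classify("£"))
```

An encoder that wants maximum robustness would treat the "optional direct" group like the last case and Base64-encode it as well.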
Other characters must be encoded in [[UTF-16]] (hence U+10000 and higher would be encoded into two surrogates), and then in [[
==Examples==
===Encoding===
First, an encoder must decide which characters to represent directly in ASCII form, which <code>+</code>
Using the £† (U+00A3 U+2020) character sequence as an example:
{{ordered list
|1= Express the character's Unicode numbers (UTF-16) in
|<samp>0x00A3 → 0000 0000 1010 0011</samp>
|<samp>0x2020 → 0010 0000 0010 0000</samp>}}
|2= Concatenate the binary sequences:<br />
<samp>0000 0000 1010 0011 and 0010 0000 0010 0000 → 0000 0000 1010 0011 0010 0000 0010 0000</samp>
|3= Regroup the binary into groups of six bits, starting from the left:<br />
<samp>0000 0000 1010 0011 0010 0000 0010 0000 → 000000 001010 001100 100000 001000 00</samp>
|4= If the last group has fewer than six bits, add trailing zeros:<br />
<samp>000000 001010 001100 100000 001000 00 → 000000 001010 001100 100000 001000 000000</samp>
|5= Replace each group of six bits with a respective Base64 code:<br />
<samp>000000 001010 001100 100000 001000 000000 → AKMgIA</samp>
}}
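The five steps above can be followed literally in code. The following Python sketch works on bit strings for clarity rather than speed; the function name <code>utf7_base64_block</code> is hypothetical, and the function produces only the Base64 block, not any surrounding shift characters.

```python
import string

# The standard Base64 alphabet: A-Z, a-z, 0-9, +, /
BASE64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def utf7_base64_block(text: str) -> str:
    """Turn text into the Base64 block used inside a UTF-7 shifted sequence."""
    # Steps 1 and 2: UTF-16 code units (big-endian) as one long bit string.
    utf16 = text.encode("utf-16-be")
    bits = "".join(f"{byte:08b}" for byte in utf16)
    # Step 4 (done before regrouping): pad with zeros to a multiple of six.
    bits += "0" * (-len(bits) % 6)
    # Steps 3 and 5: split into 6-bit groups and map each to a Base64 code.
    groups = [bits[i:i + 6] for i in range(0, len(bits), 6)]
    return "".join(BASE64_ALPHABET[int(group, 2)] for group in groups)


print(utf7_base64_block("£†"))
```

For the £† example this yields the same six codes as the worked steps.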
#Regroup the binary into groups of sixteen bits, starting from the left:<br /><samp>000000 001010 001100 100000 001000 000000 → 0000000010100011 0010000000100000 0000</samp>
#If there is an incomplete group at the end containing only zeros, discard it (if the incomplete group contains any ones, the code is invalid):<br /><samp>0000000010100011 0010000000100000</samp>
#Each group of 16 bits is a character's Unicode (UTF-16) number and can be expressed in other forms:<br /><samp>0000 0000 1010 0011 ≡ 0x00A3 ≡ 163<sub>10</sub></samp>
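The regrouping steps above, including the validity check on the leftover bits, can be sketched in Python as follows. The function name <code>utf7_decode_block</code> is hypothetical, and the input is assumed to be the bare Base64 block with the surrounding shift characters already removed.

```python
import string

# The standard Base64 alphabet: A-Z, a-z, 0-9, +, /
BASE64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def utf7_decode_block(block: str) -> str:
    """Decode the Base64 block of a UTF-7 shifted sequence back to text."""
    # Convert each Base64 code to its six-bit group and concatenate.
    bits = "".join(f"{BASE64_ALPHABET.index(code):06b}" for code in block)
    # Regroup into sixteen-bit units, starting from the left.
    full = len(bits) // 16 * 16
    leftover = bits[full:]
    # An incomplete trailing group may only contain zeros.
    if "1" in leftover:
        raise ValueError("invalid UTF-7: non-zero padding bits")
    # Each 16-bit group is one UTF-16 code unit (big-endian byte order).
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, full, 8))
    return data.decode("utf-16-be")


print(utf7_decode_block("AKMgIA"))
```

Decoding through UTF-16 means that a pair of surrogate code units in the block is combined into a single character above U+FFFF.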
==
A
While
==Security==
UTF-7 allows multiple representations of the same source string. In particular, ASCII characters can be represented as part of Unicode blocks. As such, if standard ASCII-based escaping or validation processes are used on strings that may be later interpreted as UTF-7, then Unicode blocks may be used to slip malicious strings past them. To mitigate this problem, systems should perform decoding before validation and should avoid attempting to autodetect UTF-7.
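The risk of validating before decoding can be demonstrated with Python's built-in <code>utf-7</code> codec. In this sketch, <code>naive_filter_allows</code> is a hypothetical stand-in for an ASCII-based check and is not taken from any real product.

```python
def naive_filter_allows(data: bytes) -> bool:
    """Hypothetical ASCII-only check that rejects raw angle brackets."""
    return b"<" not in data and b">" not in data


# Angle brackets hidden inside Base64 blocks.
payload = b"+ADw-script+AD4-"

# The filter sees only harmless-looking ASCII...
print(naive_filter_allows(payload))

# ...but a consumer that treats the bytes as UTF-7 recovers the markup.
decoded = payload.decode("utf-7")
print(decoded)

# The same applies to any ASCII character: "A" also has an encoded form.
alt_a = b"+AEE-".decode("utf-7")
print(alt_a)
```

Running the filter on the decoded string instead of the raw bytes would reject this payload, which is why decoding should come first.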
Older versions of [[Internet Explorer]] can be tricked into interpreting the page as UTF-7. This can be used for a [[cross-site scripting]] attack as the <code><</code> and <code>></code> marks can be encoded as <code>+ADw-</code> and <code>+AD4-</code> in UTF-7, which most validators let through as simple text.<ref>{{cite web|url=https://code.google.com/p/doctype-mirror/wiki/ArticleUtf7 |title=ArticleUtf7 - doctype-mirror - UTF-7: the case of the missing charset - Mirror of Google Doctype - Google Project Hosting
UTF-7 is considered obsolete, at least for Microsoft software (.NET), with code paths previously supporting it intentionally broken (to prevent security issues) in .NET 5, in 2020.<ref name="dotnet5">{{Cite web
==See also==
* [[Comparison of Unicode encodings]]
==References==
{{reflist}}
{{Unicode navigation}}