:mod:`email.charset`: Representing character sets
-------------------------------------------------

.. module:: email.charset
   :synopsis: Character Sets


This module provides a class :class:`Charset` for representing character sets
and character set conversions in email messages, as well as a character set
registry and several convenience methods for manipulating this registry.
Instances of :class:`Charset` are used in several other modules within the
:mod:`email` package.

Import this class from the :mod:`email.charset` module.

.. versionadded:: 2.2.2

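As a quick orientation, the following sketch imports the class and inspects a
few instances. It assumes that the ``QP`` and ``BASE64`` constants can be
imported from the module level of :mod:`email.charset`, and it uses only the
attributes and methods described below. The values shown reflect the default
registry entries for these character sets.

```python
from email.charset import Charset, QP, BASE64

# With no argument, the character set defaults to 7-bit us-ascii.
default = Charset()

# Names are lower-cased and common aliases are normalized.
latin = Charset('LATIN_1')

# euc-jp is registered with an output conversion to iso-2022-jp.
japanese = Charset('euc-jp')

print(default.input_charset)                # us-ascii
print(latin.input_charset)                  # iso-8859-1
print(latin.header_encoding == QP)          # True
print(japanese.header_encoding == BASE64)   # True
print(japanese.body_encoding)               # None
print(japanese.get_output_charset())        # iso-2022-jp
```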

.. class:: Charset([input_charset])

   Map character sets to their email properties.

   This class provides information about the requirements imposed on email for a
   specific character set. It also provides convenience routines for converting
   between character sets, given the availability of the applicable codecs. Given
   a character set, it will do its best to provide information on how to use that
   character set in an email message in an RFC-compliant way.

   Certain character sets must be encoded with quoted-printable or base64 when used
   in email headers or bodies. Certain character sets must be converted outright,
   and are not allowed in email.

   Optional *input_charset* is as described below; it is always coerced to lower
   case. After alias normalization, it is also used as a lookup key into the
   registry of character sets to find the header encoding, body encoding, and
   output conversion codec to be used for the character set. For example, if
   *input_charset* is ``iso-8859-1``, then headers and bodies will be encoded using
   quoted-printable and no output conversion codec is necessary. If
   *input_charset* is ``euc-jp``, then headers will be encoded with base64, bodies
   will not be encoded, but output text will be converted from the ``euc-jp``
   character set to the ``iso-2022-jp`` character set.

   :class:`Charset` instances have the following data attributes:


   .. attribute:: input_charset

      The initial character set specified. Common aliases are converted to
      their *official* email names (e.g. ``latin_1`` is converted to
      ``iso-8859-1``). Defaults to 7-bit ``us-ascii``.


   .. attribute:: header_encoding

      If the character set must be encoded before it can be used in an email
      header, this attribute will be set to ``charset.QP`` (for
      quoted-printable), ``charset.BASE64`` (for base64 encoding), or
      ``charset.SHORTEST`` for the shortest of QP or BASE64 encoding. Otherwise,
      it will be ``None``.


   .. attribute:: body_encoding

      Same as *header_encoding*, but describes the encoding for the mail
      message's body, which may differ from the header encoding.
      ``charset.SHORTEST`` is not allowed for *body_encoding*.


   .. attribute:: output_charset

      Some character sets must be converted before they can be used in email
      headers or bodies. If the *input_charset* is one of them, this attribute
      will contain the name of the character set output will be converted to.
      Otherwise, it will be ``None``.


   .. attribute:: input_codec

      The name of the Python codec used to convert the *input_charset* to
      Unicode. If no conversion codec is necessary, this attribute will be
      ``None``.


   .. attribute:: output_codec

      The name of the Python codec used to convert Unicode to the
      *output_charset*. If no conversion codec is necessary, this attribute
      will have the same value as the *input_codec*.

   :class:`Charset` instances also have the following methods:


   .. method:: get_body_encoding()

      Return the content transfer encoding used for body encoding.

      This is either the string ``quoted-printable`` or ``base64`` depending on
      the encoding used, or it is a function, in which case you should call the
      function with a single argument, the Message object being encoded. The
      function should then set the :mailheader:`Content-Transfer-Encoding`
      header itself to whatever is appropriate.

      Returns the string ``quoted-printable`` if *body_encoding* is ``QP``,
      returns the string ``base64`` if *body_encoding* is ``BASE64``, and
      returns the string ``7bit`` otherwise.


   .. method:: convert(s)

      Convert the string *s* from the *input_codec* to the *output_codec*.


   .. method:: to_splittable(s)

      Convert a possibly multibyte string to a safely splittable format. *s* is
      the string to split.

      Uses the *input_codec* to try to convert the string to Unicode, so it can
      be safely split on character boundaries (even for multibyte characters).

      Returns the string as-is if it isn't known how to convert *s* to Unicode
      with the *input_charset*.

      Characters that could not be converted to Unicode will be replaced with
      the Unicode replacement character ``U+FFFD``.


   .. method:: from_splittable(ustr[, to_output])

      Convert a splittable string back into an encoded string. *ustr* is a
      Unicode string to "unsplit".

      This method uses the proper codec to try to convert the string from
      Unicode back into an encoded format. Returns the string as-is if it is
      not Unicode, or if it could not be converted from Unicode.

      Characters that could not be converted from Unicode will be replaced with
      an appropriate character (usually ``'?'``).

      If *to_output* is ``True`` (the default), uses *output_codec* to convert
      to an encoded format. If *to_output* is ``False``, it uses *input_codec*.


   .. method:: get_output_charset()

      Return the output character set.

      This is the *output_charset* attribute if that is not ``None``, otherwise
      it is *input_charset*.


   .. method:: encoded_header_len()

      Return the length of the encoded header string, properly calculating for
      quoted-printable or base64 encoding.


   .. method:: header_encode(s[, convert])

      Header-encode the string *s*.

      If *convert* is ``True``, the string will be converted from the input
      charset to the output charset automatically. This is not useful for
      multibyte character sets, which have line length issues (multibyte
      characters must be split on a character, not a byte boundary); use the
      higher-level :class:`~email.header.Header` class to deal with these issues
      (see :mod:`email.header`). *convert* defaults to ``False``.

      The type of encoding (base64 or quoted-printable) will be based on the
      *header_encoding* attribute.


   .. method:: body_encode(s[, convert])

      Body-encode the string *s*.

      If *convert* is ``True`` (the default), the string will be converted from
      the input charset to the output charset automatically. Unlike
      :meth:`header_encode`, there are no issues with byte boundaries and
      multibyte charsets in email bodies, so this is usually safe.

      The type of encoding (base64 or quoted-printable) will be based on the
      *body_encoding* attribute.

   The :class:`Charset` class also provides a number of methods to support
   standard operations and built-in functions.


   .. method:: __str__()

      Returns *input_charset* as a string coerced to lower case.
      :meth:`__repr__` is an alias for :meth:`__str__`.


   .. method:: __eq__(other)

      This method allows you to compare two :class:`Charset` instances for
      equality.


   .. method:: __ne__(other)

      This method allows you to compare two :class:`Charset` instances for
      inequality.

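The methods above can be exercised as in the following sketch. It is an
illustration under stated assumptions rather than a reference: it relies on the
default registry entries for ``iso-8859-1``, ``utf-8`` and ``euc-jp``, passes
only the string argument to :meth:`header_encode` and :meth:`body_encode`
(leaving *convert* at its default), and skips the conversion helpers.

```python
import base64
from email.charset import Charset

latin = Charset('iso-8859-1')
utf8 = Charset('utf-8')
japanese = Charset('euc-jp')

# Content transfer encoding names derived from body_encoding.
print(latin.get_body_encoding())            # quoted-printable
print(utf8.get_body_encoding())             # base64

# The registered output conversion if there is one, else the input charset.
print(japanese.get_output_charset())        # iso-2022-jp
print(latin.get_output_charset())           # iso-8859-1

# Header-encode a short string containing one non-ASCII character.
header = latin.header_encode('caf\xe9')
print(header)                               # =?iso-8859-1?q?caf=E9?=

# Body-encode with base64 and check that the payload round-trips.
body = utf8.body_encode('hello')
print(base64.b64decode(body) == b'hello')   # True

# str() gives the lower-cased name; instances compare by that name.
print(str(Charset('ISO-8859-1')))           # iso-8859-1
print(Charset('latin_1') == latin)          # True
print(Charset('utf-8') != latin)            # True
```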

The :mod:`email.charset` module also provides the following functions for adding
new entries to the global character set, alias, and codec registries:


.. function:: add_charset(charset[, header_enc[, body_enc[, output_charset]]])

   Add character properties to the global registry.

   *charset* is the input character set, and must be the canonical name of a
   character set.

   Optional *header_enc* and *body_enc* are each either ``charset.QP`` for
   quoted-printable, ``charset.BASE64`` for base64 encoding,
   ``charset.SHORTEST`` for the shortest of quoted-printable or base64 encoding,
   or ``None`` for no encoding. ``SHORTEST`` is only valid for
   *header_enc*. The default is ``None`` for no encoding.

   Optional *output_charset* is the character set that the output should be in.
   Conversions will proceed from input charset, to Unicode, to the output charset
   when the method :meth:`Charset.convert` is called. The default is to output in
   the same character set as the input.

   Both the input *charset* and *output_charset* must have Unicode codec entries
   in the module's character set-to-codec mapping; use :func:`add_codec` to add
   codecs the module does not know about. See the :mod:`codecs` module's
   documentation for more information.

   The global character set registry is kept in the module global dictionary
   ``CHARSETS``.

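A minimal sketch of registering a character set with :func:`add_charset`
follows. The names ``x-example`` and ``x-example-2`` are made up for the
illustration, and the constants are assumed to be importable from the module
level. Note that the call modifies the global registry for the whole process.

```python
from email.charset import Charset, add_charset, QP, BASE64, SHORTEST

# 'x-example' is a hypothetical name used only for this illustration.
add_charset('x-example', header_enc=QP, body_enc=BASE64)

cs = Charset('x-example')
print(cs.header_encoding == QP)     # True
print(cs.get_body_encoding())       # base64

# SHORTEST is only valid for header_enc, so this registration is rejected.
try:
    add_charset('x-example-2', body_enc=SHORTEST)
    rejected = False
except ValueError:
    rejected = True
print(rejected)                     # True
```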

.. function:: add_alias(alias, canonical)

   Add a character set alias. *alias* is the alias name, e.g. ``latin-1``.
   *canonical* is the character set's canonical name, e.g. ``iso-8859-1``.

   The global charset alias registry is kept in the module global dictionary
   ``ALIASES``.

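As a small illustration of :func:`add_alias`, the sketch below registers a
made-up alias, ``x-latin``, and checks that :class:`Charset` resolves it to the
canonical name. Like the other registry functions, the call affects the whole
process.

```python
from email.charset import Charset, add_alias

# 'x-latin' is a hypothetical alias used only for this illustration.
add_alias('x-latin', 'iso-8859-1')

cs = Charset('X-Latin')             # the name is lower-cased before lookup
print(cs.input_charset)             # iso-8859-1
print(cs == Charset('latin-1'))     # True
```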

.. function:: add_codec(charset, codecname)

   Add a codec that maps characters in the given character set to and from Unicode.

   *charset* is the canonical name of a character set. *codecname* is the name of a
   Python codec, as appropriate for the second argument to the :func:`unicode`
   built-in, or to the :meth:`~unicode.encode` method of a Unicode string.

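The following sketch shows one way :func:`add_codec` might be used. The
character set name ``x-example-codec`` is made up for the illustration, and
``latin_1`` is simply an existing Python codec chosen arbitrarily; a real
registration would name the codec that actually implements the character set.

```python
from email.charset import Charset, add_codec

# Map the hypothetical character set name to an existing Python codec.
add_codec('x-example-codec', 'latin_1')

cs = Charset('x-example-codec')
print(cs.input_codec)       # latin_1
print(cs.output_codec)      # latin_1
```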