email.charset.rst

Last change on this file was 391, checked in by dmik, 11 years ago
python: Merge vendor 2.7.6 to trunk.
Property svn:eol-style set to `native`
File size: 9.4 KB

:mod:`email.charset`: Representing character sets
-------------------------------------------------

.. module:: email.charset
   :synopsis: Character Sets


This module provides a class :class:`Charset` for representing character sets
and character set conversions in email messages, as well as a character set
registry and several convenience functions for manipulating this registry.
Instances of :class:`Charset` are used in several other modules within the
:mod:`email` package.

Import this class from the :mod:`email.charset` module.

.. versionadded:: 2.2.2


.. class:: Charset([input_charset])

Map character sets to their email properties.

This class provides information about the requirements imposed on email for a
specific character set. It also provides convenience routines for converting
between character sets, given the availability of the applicable codecs. Given
a character set, it will do its best to provide information on how to use that
character set in an email message in an RFC-compliant way.

Certain character sets must be encoded with quoted-printable or base64 when used
in email headers or bodies. Other character sets are not allowed in email at
all and must be converted outright to a different character set.

Optional *input_charset* is as described below; it is always coerced to lower
case. After being alias normalized it is also used as a lookup into the
registry of character sets to find out the header encoding, body encoding, and
output conversion codec to be used for the character set. For example, if
*input_charset* is ``iso-8859-1``, then headers and bodies will be encoded using
quoted-printable and no output conversion codec is necessary. If
*input_charset* is ``euc-jp``, then headers will be encoded with base64, bodies
will not be encoded, but output text will be converted from the ``euc-jp``
character set to the ``iso-2022-jp`` character set.

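The two cases described above can be checked directly. The following is a
minimal sketch, assuming the module-level constants ``QP`` and ``BASE64`` can be
imported from :mod:`email.charset`; the variable names are illustrative only.

```python
from email.charset import Charset, QP, BASE64

# iso-8859-1: quoted-printable for both headers and bodies.
latin1 = Charset('iso-8859-1')
latin1_encodings = (latin1.header_encoding, latin1.body_encoding)

# euc-jp: base64 headers, unencoded bodies, output converted to iso-2022-jp.
eucjp = Charset('euc-jp')
eucjp_encodings = (eucjp.header_encoding, eucjp.body_encoding)
eucjp_output = eucjp.get_output_charset()
```
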
:class:`Charset` instances have the following data attributes:


.. attribute:: input_charset

The initial character set specified. Common aliases are converted to
their official email names (e.g. ``latin_1`` is converted to
``iso-8859-1``). Defaults to 7-bit ``us-ascii``.

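As a small illustration of the normalization described for *input_charset*, the
sketch below builds three instances and collects the resulting names. The
variable names are chosen for this example only.

```python
from email.charset import Charset

default = Charset()                 # no argument: falls back to us-ascii
aliased = Charset('latin_1')        # alias, mapped to its official email name
mixed_case = Charset('ISO-8859-1')  # coerced to lower case

names = [c.input_charset for c in (default, aliased, mixed_case)]
```
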

.. attribute:: header_encoding

If the character set must be encoded before it can be used in an email
header, this attribute will be set to ``Charset.QP`` (for
quoted-printable), ``Charset.BASE64`` (for base64 encoding), or
``Charset.SHORTEST`` for the shortest of QP or BASE64 encoding. Otherwise,
it will be ``None``.


.. attribute:: body_encoding

Same as *header_encoding*, but describes the encoding for the mail
message's body, which may differ from the header encoding.
``Charset.SHORTEST`` is not allowed for *body_encoding*.


.. attribute:: output_charset

Some character sets must be converted before they can be used in email headers
or bodies. If the *input_charset* is one of them, this attribute will
contain the name of the character set that output will be converted to.
Otherwise, it will be ``None``.


.. attribute:: input_codec

The name of the Python codec used to convert the *input_charset* to
Unicode. If no conversion codec is necessary, this attribute will be
``None``.


.. attribute:: output_codec

The name of the Python codec used to convert Unicode to the
*output_charset*. If no conversion codec is necessary, this attribute
will have the same value as the *input_codec*.

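To see these attributes side by side, the following sketch gathers them into a
dictionary for ``euc-jp``, a character set that needs an output conversion.
``describe`` is a hypothetical helper written for this example, not part of the
module.

```python
from email.charset import Charset, BASE64

def describe(name):
    """Collect the documented data attributes of a Charset (hypothetical helper)."""
    cs = Charset(name)
    return {
        'input_charset': cs.input_charset,
        'header_encoding': cs.header_encoding,
        'body_encoding': cs.body_encoding,
        'output_charset': cs.output_charset,
        'input_codec': cs.input_codec,
        'output_codec': cs.output_codec,
    }

info = describe('euc-jp')
```
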
:class:`Charset` instances also have the following methods:


.. method:: get_body_encoding()

Return the content transfer encoding used for body encoding.

This is either the string ``quoted-printable`` or ``base64`` depending on
the encoding used, or it is a function, in which case you should call the
function with a single argument, the Message object being encoded. The
function should then set the :mailheader:`Content-Transfer-Encoding`
header itself to whatever is appropriate.

Returns the string ``quoted-printable`` if *body_encoding* is ``QP``,
returns the string ``base64`` if *body_encoding* is ``BASE64``, and
returns the string ``7bit`` otherwise.

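Because the return value may be either a string or a function, calling code has
to handle both. The sketch below shows one way to do so; ``apply_body_encoding``
is a hypothetical helper for this example, and it only sets the header without
encoding the payload.

```python
from email.charset import Charset
from email.message import Message

def apply_body_encoding(msg, charset):
    """Set Content-Transfer-Encoding on msg (hypothetical helper)."""
    enc = charset.get_body_encoding()
    if callable(enc):
        # A function was returned: it sets the header on the message itself.
        enc(msg)
    else:
        # A string was returned: use it as the header value.
        msg['Content-Transfer-Encoding'] = enc
    return msg['Content-Transfer-Encoding']

msg = Message()
msg.set_payload('plain ASCII text')
cte_ascii = apply_body_encoding(msg, Charset('us-ascii'))

qp_name = Charset('iso-8859-1').get_body_encoding()
b64_name = Charset('utf-8').get_body_encoding()
```
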

.. method:: convert(s)

Convert the string *s* from the *input_codec* to the *output_codec*.

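Conceptually, the conversion decodes with the input codec and re-encodes with
the output codec. The following is a sketch of that idea using plain codec
calls on byte strings; ``convert_bytes`` is a hypothetical stand-in and is not
claimed to match the method's implementation.

```python
from email.charset import Charset

def convert_bytes(raw, charset):
    """Decode with input_codec, re-encode with output_codec (hypothetical helper)."""
    if charset.input_codec == charset.output_codec:
        return raw  # nothing to convert
    return raw.decode(charset.input_codec).encode(charset.output_codec)

text = u'\u65e5\u672c\u8a9e'  # three Japanese characters
eucjp = Charset('euc-jp')
converted = convert_bytes(text.encode('euc-jp'), eucjp)
```
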

.. method:: to_splittable(s)

Convert a possibly multibyte string to a safely splittable format. *s* is
the string to split.

Uses the *input_codec* to try to convert the string to Unicode, so it can
be safely split on character boundaries (even for multibyte characters).

Returns the string as-is if it isn't known how to convert *s* to Unicode
with the *input_charset*.

Characters that could not be converted to Unicode will be replaced with
the Unicode replacement character ``U+FFFD``.


.. method:: from_splittable(ustr[, to_output])

Convert a splittable string back into an encoded string. *ustr* is a
Unicode string to "unsplit".

This method uses the proper codec to try to convert the string from
Unicode back into an encoded format. Returns the string as-is if it is not
Unicode, or if it could not be converted from Unicode.

Characters that could not be converted from Unicode will be replaced with
an appropriate character (usually ``'?'``).

If *to_output* is ``True`` (the default), uses *output_codec* to convert
to an encoded format. If *to_output* is ``False``, it uses *input_codec*.

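The round trip through Unicode can be illustrated with plain codec calls and
the ``'replace'`` error handler. This is a sketch of the idea only;
``to_unicode`` and ``from_unicode`` are hypothetical helpers, not the methods
themselves.

```python
def to_unicode(raw, codec):
    """Decode bytes; undecodable bytes become U+FFFD (hypothetical helper)."""
    return raw.decode(codec, 'replace')

def from_unicode(text, codec):
    """Encode text; unencodable characters become '?' (hypothetical helper)."""
    return text.encode(codec, 'replace')

# Splitting on character boundaries keeps multibyte characters intact.
original = u'\u65e5\u672c\u8a9e'.encode('euc-jp')
chars = to_unicode(original, 'euc-jp')
pieces = [from_unicode(part, 'euc-jp') for part in (chars[:1], chars[1:])]

# Replacement behaviour in both directions.
damaged = to_unicode(b'caf\xe9', 'ascii')
lossy = from_unicode(u'caf\u20ac', 'iso-8859-1')
```
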

.. method:: get_output_charset()

Return the output character set.

This is the *output_charset* attribute if that is not ``None``, otherwise
it is *input_charset*.

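A short sketch of the fallback: character sets with an output conversion report
the converted name, while the others report their own name. The selection of
character sets here is arbitrary.

```python
from email.charset import Charset

outputs = dict(
    (name, Charset(name).get_output_charset())
    for name in ('shift_jis', 'iso-8859-1', 'us-ascii')
)
```
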

.. method:: encoded_header_len()

Return the length of the encoded header string, properly calculating for
quoted-printable or base64 encoding.


.. method:: header_encode(s[, convert])

Header-encode the string *s*.

If *convert* is ``True``, the string will be converted from the input
charset to the output charset automatically. This is not useful for
multibyte character sets, which have line length issues (multibyte
characters must be split on a character boundary, not a byte boundary); use
the higher-level :class:`~email.header.Header` class to deal with these issues
(see :mod:`email.header`). *convert* defaults to ``False``.

The type of encoding (base64 or quoted-printable) will be based on the
*header_encoding* attribute.

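The following sketch header-encodes a short ASCII string with one
quoted-printable character set and one base64 character set. It passes only the
string argument and relies on the default for *convert*; the choice of
``koi8-r`` as the base64 example is an assumption based on the module's
built-in registry.

```python
from email.charset import Charset

qp_header = Charset('iso-8859-1').header_encode('Hello World')
b64_header = Charset('koi8-r').header_encode('Hello World')

# Both results are RFC 2047 encoded words of the form =?charset?x?...?=
headers = (qp_header, b64_header)
```
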

.. method:: body_encode(s[, convert])

Body-encode the string *s*.

If *convert* is ``True`` (the default), the string will be converted from
the input charset to the output charset automatically. Unlike
:meth:`header_encode`, there are no issues with byte boundaries and
multibyte charsets in email bodies, so this is usually safe.

The type of encoding (base64 or quoted-printable) will be based on the
*body_encoding* attribute.

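As a brief sketch, the example below body-encodes the same ASCII text with a
character set whose body encoding is base64 and with one that needs no body
encoding. Only the string argument is passed, so *convert* keeps its default.

```python
from email.charset import Charset

b64_body = Charset('utf-8').body_encode('hello')
plain_body = Charset('us-ascii').body_encode('hello')
```
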
The :class:`Charset` class also provides a number of methods to support
standard operations and built-in functions.


.. method:: __str__()

Returns *input_charset* as a string coerced to lower
case. :meth:`__repr__` is an alias for :meth:`__str__`.


.. method:: __eq__(other)

This method allows you to compare two :class:`Charset` instances for
equality.


.. method:: __ne__(other)

This method allows you to compare two :class:`Charset` instances for
inequality.

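A minimal sketch of these operations: comparison works on the normalized
character set name, so an alias and its canonical name compare equal.

```python
from email.charset import Charset

name = str(Charset('ISO-8859-1'))
same = Charset('latin_1') == Charset('ISO-8859-1')
different = Charset('utf-8') != Charset('us-ascii')
```
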
The :mod:`email.charset` module also provides the following functions for adding
new entries to the global character set, alias, and codec registries:


.. function:: add_charset(charset[, header_enc[, body_enc[, output_charset]]])

Add character set properties to the global registry.

*charset* is the input character set, and must be the canonical name of a
character set.

Optional *header_enc* and *body_enc* are each either ``Charset.QP`` for
quoted-printable, ``Charset.BASE64`` for base64 encoding,
``Charset.SHORTEST`` for the shortest of quoted-printable or base64 encoding,
or ``None`` for no encoding. ``SHORTEST`` is only valid for
*header_enc*. The default is ``None`` for no encoding.

Optional *output_charset* is the character set that the output should be in.
Conversions will proceed from input charset, to Unicode, to the output charset
when the method :meth:`Charset.convert` is called. The default is to output in
the same character set as the input.

Both *input_charset* and *output_charset* must have Unicode codec entries in the
module's character set-to-codec mapping; use :func:`add_codec` to add codecs the
module does not know about. See the :mod:`codecs` module's documentation for
more information.

The global character set registry is kept in the module global dictionary
``CHARSETS``.

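The sketch below registers a made-up character set and reads the entry back.
The names ``x-demo-charset`` and ``x-demo-bad`` are hypothetical and exist only
for this example; note that the call changes the process-wide registry.

```python
from email.charset import (
    Charset, add_charset, CHARSETS, QP, BASE64, SHORTEST,
)

# Hypothetical charset: QP headers, base64 bodies, output converted to utf-8.
add_charset('x-demo-charset', QP, BASE64, 'utf-8')
demo = Charset('x-demo-charset')

# SHORTEST is only valid for header_enc, so this registration is rejected.
try:
    add_charset('x-demo-bad', None, SHORTEST)
    rejected = False
except ValueError:
    rejected = True
```
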

.. function:: add_alias(alias, canonical)

Add a character set alias. *alias* is the alias name, e.g. ``latin-1``.
*canonical* is the character set's canonical name, e.g. ``iso-8859-1``.

The global charset alias registry is kept in the module global dictionary
``ALIASES``.

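A short sketch of registering an alias and seeing it resolved. The alias
``x-demo-latin`` is a hypothetical name used only here, and the registration
affects the process-wide registry.

```python
from email.charset import Charset, add_alias, ALIASES

add_alias('x-demo-latin', 'iso-8859-1')
resolved = Charset('x-demo-latin').input_charset
```
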

.. function:: add_codec(charset, codecname)

Add a codec that maps characters in the given character set to and from Unicode.

*charset* is the canonical name of a character set. *codecname* is the name of a
Python codec, as appropriate for the second argument to the :func:`unicode`
built-in, or to the :meth:`~unicode.encode` method of a Unicode string.

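As a sketch, the example below maps a made-up character set name to the
standard ``utf-8`` codec and then uses the codec name exposed by a
:class:`Charset` instance. ``x-demo-codec`` is a hypothetical name chosen for
this example.

```python
from email.charset import Charset, add_codec

# Hypothetical charset name mapped to an existing Python codec.
add_codec('x-demo-codec', 'utf-8')
demo = Charset('x-demo-codec')

# The registered codec name is what the instance reports for conversions.
decoded = b'caf\xc3\xa9'.decode(demo.input_codec)
```
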


source: python/trunk/Doc/library/email.charset.rst
