Context Navigation

← Previous Change
Next Change →

codecs.rst

Timestamp:

Mar 19, 2014, 11:31:01 PM (11 years ago)

Author:

dmik

Message:

python: Merge vendor 2.7.6 to trunk.

Location:

python/trunk

Files:

: 2 edited

. (modified) (1 prop)
Doc/library/codecs.rst (modified) (17 diffs)

Legend:

: Unmodified
: Added
: Removed

python/trunk
- Property svn:mergeinfo set to
  /python/vendor/Python-2.7.6 merged eligible
  /python/vendor/current merged eligible

python/trunk/Doc/library/codecs.rst

-              r2
+              r391
    *encode* and *decode*: These must be functions or methods which have the same
    interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see
    Codec Interface). The functions/methods are expected to work in a stateless
    mode.
+   interface as the :meth:`~Codec.encode`/:meth:`~Codec.decode` methods of Codec
+   instances (see :ref:`Codec Interface <codec-objects>`). The functions/methods
+   are expected to work in a stateless mode.
    *incrementalencoder* and *incrementaldecoder*: These have to be factory
 …
    The factory functions must return objects providing the interfaces defined by
    the base classes :class:`StreamWriter` and :class:`StreamReader`, respectively.
+   the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
    Stream codecs can maintain state.
 …
 The :class:`Codec` class defines the interface for stateless encoders/decoders.
 To simplify and standardize error handling, the :meth:`encode` and
 :meth:`decode` methods may implement different error handling schemes by
+To simplify and standardize error handling, the :meth:`~Codec.encode` and
+:meth:`~Codec.decode` methods may implement different error handling schemes by
 providing the *errors* string argument.  The following string values are defined
 and implemented by all standard Python codecs:
+.. tabularcolumns:: |l|L|
 +-------------------------+-----------------------------------------------+
 …
 the basic interface for incremental encoding and decoding. Encoding/decoding the
 input isn't done with one call to the stateless encoder/decoder function, but
+with multiple calls to the :meth:`encode`/:meth:`decode` method of the
+incremental encoder/decoder. The incremental encoder/decoder keeps track of the
+encoding/decoding process during method calls.
+The joined output of calls to the :meth:`encode`/:meth:`decode` method is the
+same as if all the single inputs were joined into one, and this input was
+with multiple calls to the
+:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
+the incremental encoder/decoder. The incremental encoder/decoder keeps track of
+the encoding/decoding process during method calls.
+The joined output of calls to the
+:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
+the same as if all the single inputs were joined into one, and this input was
 encoded/decoded with the stateless encoder/decoder.
 …
       *size*, if given, is passed as size argument to the stream's
       :meth:`readline` method.
+      :meth:`read` method.
       If *keepends* is false line-endings will be stripped from the lines
 …
 Unicode strings are stored internally as sequences of codepoints (to be precise
 as :ctype:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
 via :option:`--enable-unicode=ucs2` or :option:`--enable-unicode=ucs4`, with the
 former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit data
+as :c:type:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
+via ``--enable-unicode=ucs2`` or ``--enable-unicode=ucs4``, with the
+former being the default) :c:type:`Py_UNICODE` is either a 16-bit or 32-bit data
 type. Once a Unicode object is used outside of CPU and memory, CPU endianness
 and how these arrays are stored as bytes become an issue.  Transforming a
 …
 character is mapped to which byte value.
 All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints
+All of these encodings can only encode 256 of the 1114112 codepoints
 defined in unicode. A simple and straightforward way that can store each Unicode
 code point, is to store each codepoint as two consecutive bytes. There are two
 possibilities: Store the bytes in big endian or in little endian order. These
 two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
 disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
 will always have to swap bytes on encoding and decoding. UTF-16 avoids this
 problem: Bytes will always be in natural endianness. When these bytes are read
+code point, is to store each codepoint as four consecutive bytes. There are two
+possibilities: store the bytes in big endian or in little endian order. These
+two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
+disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
+will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
+problem: bytes will always be in natural endianness. When these bytes are read
 by a CPU with a different endianness, then bytes have to be swapped though. To
+be able to detect the endianness of a UTF-16 byte sequence, there's the so
+called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
+This character will be prepended to every UTF-16 byte sequence. The byte swapped
+version of this character (``0xFFFE``) is an illegal character that may not
+appear in a Unicode text. So when the first character in an UTF-16 byte sequence
+be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
+there's the so called BOM ("Byte Order Mark"). This is the Unicode character
+``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
+byte sequence. The byte swapped version of this character (``0xFFFE``) is an
+illegal character that may not appear in a Unicode text. So when the
+first character in an ``UTF-16`` or ``UTF-32`` byte sequence
 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
 Unfortunately upto Unicode 4.0 the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
+Unfortunately the character ``U+FEFF`` had a second purpose as
+a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
 With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
 deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
 Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
+Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
 it's a device to determine the storage layout of the encoded bytes, and vanishes
 once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
 …
 characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
 with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
 parts: Marker bits (the most significant bits) and payload bits. The marker bits
 are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are
+parts: marker bits (the most significant bits) and payload bits. The marker bits
+are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
 encoded like this (with x being payload bits, which when concatenated give the
 Unicode character):
 …
 | ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
 +-----------------------------------+----------------------------------------------+
+| ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
++-----------------------------------+----------------------------------------------+
+| ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
++-----------------------------------+----------------------------------------------+
+| ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
+|                                   | 10xxxxxx                                     |
+| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
 +-----------------------------------+----------------------------------------------+
 …
    | INVERTED QUESTION MARK
 in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
+in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
 correctly guessed from the byte sequence. So here the BOM is not used to be able
 to determine the byte order used for generating the byte sequence, but as a
 signature that helps in guessing the encoding. On encoding the utf-8-sig codec
 will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
+decoding utf-8-sig will skip those three bytes if they appear as the first three
+bytes in the file.
+decoding ``utf-8-sig`` will skip those three bytes if they appear as the first
+three bytes in the file.  In UTF-8, the use of the BOM is discouraged and
+should generally be avoided.
 …
 * an IBM PC code page, which is ASCII compatible
+.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|
 +-----------------+--------------------------------+--------------------------------+
 | Codec           | Aliases                        | Languages                      |
 …
 |                 | IBM500                         |                                |
 +-----------------+--------------------------------+--------------------------------+
+| cp720           |                                | Arabic                         |
++-----------------+--------------------------------+--------------------------------+
 | cp737           |                                | Greek                          |
 +-----------------+--------------------------------+--------------------------------+
 …
 +-----------------+--------------------------------+--------------------------------+
 | cp857           | 857, IBM857                    | Turkish                        |
++-----------------+--------------------------------+--------------------------------+
+| cp858           | 858, IBM858                    | Western Europe                 |
 +-----------------+--------------------------------+--------------------------------+
 | cp860           | 860, IBM860                    | Portuguese                     |
 …
 | iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
 +-----------------+--------------------------------+--------------------------------+
 | iso8859_13      | iso-8859-13                    | Baltic languages               |
+| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
 +-----------------+--------------------------------+--------------------------------+
 | iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
 +-----------------+--------------------------------+--------------------------------+
+| iso8859_15      | iso-8859-15                    | Western Europe                 |
+| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
++-----------------+--------------------------------+--------------------------------+
+| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
 +-----------------+--------------------------------+--------------------------------+
 | johab           | cp1361, ms1361                 | Korean                         |
 …
 +-----------------+--------------------------------+--------------------------------+
+A number of codecs are specific to Python, so their codec names have no meaning
+outside Python. Some of them don't convert from Unicode strings to byte strings,
+but instead use the property of the Python codecs machinery that any bijective
+function with one argument can be considered as an encoding.
+For the codecs listed below, the result in the "encoding" direction is always a
+byte string. The result of the "decoding" direction is listed as operand type in
+the table.
++--------------------+---------------------------+----------------+---------------------------+
+| Codec              | Aliases                   | Operand type   | Purpose                   |
++====================+===========================+================+===========================+
+| base64_codec       | base64, base-64           | byte string    | Convert operand to MIME   |
+|                    |                           |                | base64                    |
++--------------------+---------------------------+----------------+---------------------------+
+| bz2_codec          | bz2                       | byte string    | Compress the operand      |
+|                    |                           |                | using bz2                 |
++--------------------+---------------------------+----------------+---------------------------+
+| hex_codec          | hex                       | byte string    | Convert operand to        |
+|                    |                           |                | hexadecimal               |
+|                    |                           |                | representation, with two  |
+|                    |                           |                | digits per byte           |
++--------------------+---------------------------+----------------+---------------------------+
+| idna               |                           | Unicode string | Implements :rfc:`3490`,   |
+|                    |                           |                | see also                  |
+|                    |                           |                | :mod:`encodings.idna`     |
++--------------------+---------------------------+----------------+---------------------------+
+| mbcs               | dbcs                      | Unicode string | Windows only: Encode      |
+|                    |                           |                | operand according to the  |
+|                    |                           |                | ANSI codepage (CP_ACP)    |
++--------------------+---------------------------+----------------+---------------------------+
+| palmos             |                           | Unicode string | Encoding of PalmOS 3.5    |
++--------------------+---------------------------+----------------+---------------------------+
+| punycode           |                           | Unicode string | Implements :rfc:`3492`    |
++--------------------+---------------------------+----------------+---------------------------+
+| quopri_codec       | quopri, quoted-printable, | byte string    | Convert operand to MIME   |
+|                    | quotedprintable           |                | quoted printable          |
++--------------------+---------------------------+----------------+---------------------------+
+| raw_unicode_escape |                           | Unicode string | Produce a string that is  |
+|                    |                           |                | suitable as raw Unicode   |
+|                    |                           |                | literal in Python source  |
+|                    |                           |                | code                      |
++--------------------+---------------------------+----------------+---------------------------+
+| rot_13             | rot13                     | Unicode string | Returns the Caesar-cypher |
+|                    |                           |                | encryption of the operand |
++--------------------+---------------------------+----------------+---------------------------+
+| string_escape      |                           | byte string    | Produce a string that is  |
+|                    |                           |                | suitable as string        |
+|                    |                           |                | literal in Python source  |
+|                    |                           |                | code                      |
++--------------------+---------------------------+----------------+---------------------------+
+| undefined          |                           | any            | Raise an exception for    |
+|                    |                           |                | all conversions. Can be   |
+|                    |                           |                | used as the system        |
+|                    |                           |                | encoding if no automatic  |
+|                    |                           |                | :term:`coercion` between  |
+|                    |                           |                | byte and Unicode strings  |
+|                    |                           |                | is desired.               |
++--------------------+---------------------------+----------------+---------------------------+
+| unicode_escape     |                           | Unicode string | Produce a string that is  |
+|                    |                           |                | suitable as Unicode       |
+|                    |                           |                | literal in Python source  |
+|                    |                           |                | code                      |
++--------------------+---------------------------+----------------+---------------------------+
+| unicode_internal   |                           | Unicode string | Return the internal       |
+|                    |                           |                | representation of the     |
+|                    |                           |                | operand                   |
++--------------------+---------------------------+----------------+---------------------------+
+| uu_codec           | uu                        | byte string    | Convert the operand using |
+|                    |                           |                | uuencode                  |
++--------------------+---------------------------+----------------+---------------------------+
+| zlib_codec         | zip, zlib                 | byte string    | Compress the operand      |
+|                    |                           |                | using gzip                |
++--------------------+---------------------------+----------------+---------------------------+
+Python Specific Encodings
+-------------------------
+A number of predefined codecs are specific to Python, so their codec names have
+no meaning outside Python.  These are listed in the tables below based on the
+expected input and output types (note that while text encodings are the most
+common use case for codecs, the underlying codec infrastructure supports
+arbitrary data transforms rather than just text encodings).  For asymmetric
+codecs, the stated purpose describes the encoding direction.
+The following codecs provide unicode-to-str encoding [#encoding-note]_ and
+str-to-unicode decoding [#decoding-note]_, similar to the Unicode text
+encodings.
+.. tabularcolumns:: |l|L|L|
++--------------------+---------------------------+---------------------------+
+| Codec              | Aliases                   | Purpose                   |
++====================+===========================+===========================+
+| idna               |                           | Implements :rfc:`3490`,   |
+|                    |                           | see also                  |
+|                    |                           | :mod:`encodings.idna`     |
++--------------------+---------------------------+---------------------------+
+| mbcs               | dbcs                      | Windows only: Encode      |
+|                    |                           | operand according to the  |
+|                    |                           | ANSI codepage (CP_ACP)    |
++--------------------+---------------------------+---------------------------+
+| palmos             |                           | Encoding of PalmOS 3.5    |
++--------------------+---------------------------+---------------------------+
+| punycode           |                           | Implements :rfc:`3492`    |
++--------------------+---------------------------+---------------------------+
+| raw_unicode_escape |                           | Produce a string that is  |
+|                    |                           | suitable as raw Unicode   |
+|                    |                           | literal in Python source  |
+|                    |                           | code                      |
++--------------------+---------------------------+---------------------------+
+| rot_13             | rot13                     | Returns the Caesar-cypher |
+|                    |                           | encryption of the operand |
++--------------------+---------------------------+---------------------------+
+| undefined          |                           | Raise an exception for    |
+|                    |                           | all conversions. Can be   |
+|                    |                           | used as the system        |
+|                    |                           | encoding if no automatic  |
+|                    |                           | :term:`coercion` between  |
+|                    |                           | byte and Unicode strings  |
+|                    |                           | is desired.               |
++--------------------+---------------------------+---------------------------+
+| unicode_escape     |                           | Produce a string that is  |
+|                    |                           | suitable as Unicode       |
+|                    |                           | literal in Python source  |
+|                    |                           | code                      |
++--------------------+---------------------------+---------------------------+
+| unicode_internal   |                           | Return the internal       |
+|                    |                           | representation of the     |
+|                    |                           | operand                   |
++--------------------+---------------------------+---------------------------+
 .. versionadded:: 2.3
    The ``idna`` and ``punycode`` encodings.
+The following codecs provide str-to-str encoding and decoding
+[#decoding-note]_.
+.. tabularcolumns:: |l|L|L|L|
++--------------------+---------------------------+---------------------------+------------------------------+
+| Codec              | Aliases                   | Purpose                   | Encoder/decoder              |
++====================+===========================+===========================+==============================+
+| base64_codec       | base64, base-64           | Convert operand to MIME   | :meth:`base64.b64encode`,    |
+|                    |                           | base64 (the result always | :meth:`base64.b64decode`     |
+|                    |                           | includes a trailing       |                              |
+|                    |                           | ``'\n'``)                 |                              |
++--------------------+---------------------------+---------------------------+------------------------------+
+| bz2_codec          | bz2                       | Compress the operand      | :meth:`bz2.compress`,        |
+|                    |                           | using bz2                 | :meth:`bz2.decompress`       |
++--------------------+---------------------------+---------------------------+------------------------------+
+| hex_codec          | hex                       | Convert operand to        | :meth:`base64.b16encode`,    |
+|                    |                           | hexadecimal               | :meth:`base64.b16decode`     |
+|                    |                           | representation, with two  |                              |
+|                    |                           | digits per byte           |                              |
++--------------------+---------------------------+---------------------------+------------------------------+
+| quopri_codec       | quopri, quoted-printable, | Convert operand to MIME   | :meth:`quopri.encodestring`, |
+|                    | quotedprintable           | quoted printable          | :meth:`quopri.decodestring`  |
++--------------------+---------------------------+---------------------------+------------------------------+
+| string_escape      |                           | Produce a string that is  |                              |
+|                    |                           | suitable as string        |                              |
+|                    |                           | literal in Python source  |                              |
+|                    |                           | code                      |                              |
++--------------------+---------------------------+---------------------------+------------------------------+
+| uu_codec           | uu                        | Convert the operand using | :meth:`uu.encode`,           |
+|                    |                           | uuencode                  | :meth:`uu.decode`            |
++--------------------+---------------------------+---------------------------+------------------------------+
+| zlib_codec         | zip, zlib                 | Compress the operand      | :meth:`zlib.compress`,       |
+|                    |                           | using gzip                | :meth:`zlib.decompress`      |
++--------------------+---------------------------+---------------------------+------------------------------+
+.. [#encoding-note] str objects are also accepted as input in place of unicode
+   objects.  They are implicitly converted to unicode by decoding them using
+   the default encoding.  If this conversion fails, it may lead to encoding
+   operations raising :exc:`UnicodeDecodeError`.
+.. [#decoding-note] unicode objects are also accepted as input in place of str
+   objects.  They are implicitly converted to str by encoding them using the
+   default encoding.  If this conversion fails, it may lead to decoding
+   operations raising :exc:`UnicodeEncodeError`.
 …
 to the user.
+Python supports this conversion in several ways: The ``idna`` codec allows to
+convert between Unicode and the ACE. Furthermore, the :mod:`socket` module
+Python supports this conversion in several ways:  the ``idna`` codec performs
+conversion between Unicode and ACE, separating an input string into labels
+based on the separator characters defined in `section 3.1`_ (1) of :rfc:`3490`
+and converting each label to ACE as required, and conversely separating an input
+byte string into labels based on the ``.`` separator and converting any ACE
+labels found into unicode.  Furthermore, the :mod:`socket` module
 transparently converts Unicode host names to ACE, so that applications need not
 be concerned about converting host names themselves when they pass them to the
 …
 :mailheader:`Host` field if it sends that field at all).
+.. _section 3.1: http://tools.ietf.org/html/rfc3490#section-3.1
 When receiving host names from the wire (such as in reverse name lookup), no
 automatic conversion to Unicode is performed: Applications wishing to present

Note: See TracChangeset for help on using the changeset viewer.

/python/vendor/Python-2.7.6	merged	eligible
/python/vendor/current	merged	eligible

Context Navigation

Changeset 391 for python/trunk/Doc/library/codecs.rst

Legend:

python/trunk

python/trunk/Doc/library/codecs.rst

Download in other formats: