Timestamp:
Mar 19, 2014, 11:31:01 PM
Author:
dmik
Message:

python: Merge vendor 2.7.6 to trunk.

Location:
python/trunk
Files:
2 edited

  • python/trunk

  • python/trunk/Doc/library/codecs.rst

    r2 r391  
    4848
    4949   *encode* and *decode*: These must be functions or methods which have the same
    50    interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see
    51    Codec Interface). The functions/methods are expected to work in a stateless
    52    mode.
     50   interface as the :meth:`~Codec.encode`/:meth:`~Codec.decode` methods of Codec
     51   instances (see :ref:`Codec Interface <codec-objects>`). The functions/methods
     52   are expected to work in a stateless mode.
    5353
    5454   *incrementalencoder* and *incrementaldecoder*: These have to be factory
     
    6767
    6868   The factory functions must return objects providing the interfaces defined by
    69    the base classes :class:`StreamWriter` and :class:`StreamReader`, respectively.
     69   the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
    7070   Stream codecs can maintain state.
    7171
     
    316316The :class:`Codec` class defines the interface for stateless encoders/decoders.
    317317
    318 To simplify and standardize error handling, the :meth:`encode` and
    319 :meth:`decode` methods may implement different error handling schemes by
     318To simplify and standardize error handling, the :meth:`~Codec.encode` and
     319:meth:`~Codec.decode` methods may implement different error handling schemes by
    320320providing the *errors* string argument.  The following string values are defined
    321321and implemented by all standard Python codecs:
     322
     323.. tabularcolumns:: |l|L|
    322324
    323325+-------------------------+-----------------------------------------------+
     
    396398the basic interface for incremental encoding and decoding. Encoding/decoding the
    397399input isn't done with one call to the stateless encoder/decoder function, but
    398 with multiple calls to the :meth:`encode`/:meth:`decode` method of the
    399 incremental encoder/decoder. The incremental encoder/decoder keeps track of the
    400 encoding/decoding process during method calls.
    401 
    402 The joined output of calls to the :meth:`encode`/:meth:`decode` method is the
    403 same as if all the single inputs were joined into one, and this input was
     400with multiple calls to the
     401:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
     402the incremental encoder/decoder. The incremental encoder/decoder keeps track of
     403the encoding/decoding process during method calls.
     404
     405The joined output of calls to the
     406:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
     407the same as if all the single inputs were joined into one, and this input was
    404408encoded/decoded with the stateless encoder/decoder.
    405409
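The equivalence described in this hunk (joined incremental output equals the stateless result) can be checked with a short sketch. It assumes a Python 3 interpreter and uses only `codecs.getincrementalencoder` and `codecs.getincrementaldecoder`; the sample text and the chunk boundaries are arbitrary choices made for illustration.

```python
import codecs

# Arbitrary sample input, split into chunks (illustrative only).
chunks = [u"Gr", u"\xfc\xdf", u"e"]
joined_text = u"".join(chunks)

# Incremental encoding: one encode() call per chunk, then a final flush.
encoder = codecs.getincrementalencoder("utf-8")()
pieces = [encoder.encode(chunk) for chunk in chunks]
pieces.append(encoder.encode(u"", final=True))
incremental = b"".join(pieces)

# Stateless encoding of the joined input.
stateless = codecs.encode(joined_text, "utf-8")

# Incremental decoding, deliberately splitting the input in the middle of
# the two-byte sequence for U+00FC; the decoder buffers the partial byte.
decoder = codecs.getincrementaldecoder("utf-8")()
decoded = decoder.decode(stateless[:3]) + decoder.decode(stateless[3:], final=True)
```

Here `incremental` equals `stateless`, and `decoded` equals the original text even though the byte string was cut inside a multi-byte character.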
     
    652656
    653657      *size*, if given, is passed as size argument to the stream's
    654       :meth:`readline` method.
     658      :meth:`read` method.
    655659
    656660      If *keepends* is false line-endings will be stripped from the lines
     
    760764
    761765Unicode strings are stored internally as sequences of codepoints (to be precise
    762 as :ctype:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
    763 via :option:`--enable-unicode=ucs2` or :option:`--enable-unicode=ucs4`, with the
    764 former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit data
     766as :c:type:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
     767via ``--enable-unicode=ucs2`` or ``--enable-unicode=ucs4``, with the
     768former being the default) :c:type:`Py_UNICODE` is either a 16-bit or 32-bit data
    765769type. Once a Unicode object is used outside of CPU and memory, CPU endianness
    766770and how these arrays are stored as bytes become an issue.  Transforming a
     
    783787character is mapped to which byte value.
    784788
    785 All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints
     789All of these encodings can only encode 256 of the 1114112 codepoints
    786790defined in unicode. A simple and straightforward way that can store each Unicode
    787 code point, is to store each codepoint as two consecutive bytes. There are two
    788 possibilities: Store the bytes in big endian or in little endian order. These
    789 two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
    790 disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
    791 will always have to swap bytes on encoding and decoding. UTF-16 avoids this
    792 problem: Bytes will always be in natural endianness. When these bytes are read
     791code point, is to store each codepoint as four consecutive bytes. There are two
     792possibilities: store the bytes in big endian or in little endian order. These
     793two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
     794disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
     795will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
     796problem: bytes will always be in natural endianness. When these bytes are read
    793797by a CPU with a different endianness, then bytes have to be swapped though. To
    794 be able to detect the endianness of a UTF-16 byte sequence, there's the so
    795 called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
    796 This character will be prepended to every UTF-16 byte sequence. The byte swapped
    797 version of this character (``0xFFFE``) is an illegal character that may not
    798 appear in a Unicode text. So when the first character in an UTF-16 byte sequence
     798be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
     799there's the so called BOM ("Byte Order Mark"). This is the Unicode character
     800``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
     801byte sequence. The byte swapped version of this character (``0xFFFE``) is an
     802illegal character that may not appear in a Unicode text. So when the
     803first character in an ``UTF-16`` or ``UTF-32`` byte sequence
    799804appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
    800 Unfortunately upto Unicode 4.0 the character ``U+FEFF`` had a second purpose as
    801 a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
     805Unfortunately the character ``U+FEFF`` had a second purpose as
     806a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
    802807a word to be split. It can e.g. be used to give hints to a ligature algorithm.
    803808With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
    804809deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
    805 Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
     810Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
    806811it's a device to determine the storage layout of the encoded bytes, and vanishes
    807812once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
     
    811816characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
    812817with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
    813 parts: Marker bits (the most significant bits) and payload bits. The marker bits
    814 are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are
     818parts: marker bits (the most significant bits) and payload bits. The marker bits
     819are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
    815820encoded like this (with x being payload bits, which when concatenated give the
    816821Unicode character):
     
    825830| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
    826831+-----------------------------------+----------------------------------------------+
    827 | ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
    828 +-----------------------------------+----------------------------------------------+
    829 | ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
    830 +-----------------------------------+----------------------------------------------+
    831 | ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
    832 |                                   | 10xxxxxx                                     |
     832| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
    833833+-----------------------------------+----------------------------------------------+
    834834
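As a rough illustration of the marker-bit layout in the table, the following sketch prints nothing but computes the bit patterns of a few encoded characters. It assumes Python 3; the helper name `utf8_bits` and the sample code points are made up for this example, one per encoded length from one to four bytes.

```python
# Sample code points, one per encoded length (1 to 4 bytes).
samples = [u"\u0041", u"\u00e9", u"\u20ac", u"\U0001f600"]


def utf8_bits(char):
    """Return the UTF-8 bytes of char as a list of 8-bit binary strings."""
    return [format(byte, "08b") for byte in bytearray(char.encode("utf-8"))]


# Number of bytes used for each sample.
lengths = [len(utf8_bits(char)) for char in samples]

# U+20AC falls in the U-00000800 ... U-0000FFFF row, so its bytes follow
# the 1110xxxx 10xxxxxx 10xxxxxx pattern.
euro_bits = utf8_bits(u"\u20ac")
```

The lead byte of each sequence starts with as many `1` bits as the sequence has bytes (none for a single byte), and every continuation byte starts with `10`.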
     
    855855   | INVERTED QUESTION MARK
    856856
    857 in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
     857in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
    858858correctly guessed from the byte sequence. So here the BOM is not used to be able
    859859to determine the byte order used for generating the byte sequence, but as a
    860860signature that helps in guessing the encoding. On encoding the utf-8-sig codec
    861861will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
    862 decoding utf-8-sig will skip those three bytes if they appear as the first three
    863 bytes in the file.
     862decoding ``utf-8-sig`` will skip those three bytes if they appear as the first
     863three bytes in the file.  In UTF-8, the use of the BOM is discouraged and
     864should generally be avoided.
    864865
    865866
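The behaviour of the ``utf-8-sig`` codec described here can be sketched as follows. The example assumes Python 3 and uses the standard `codecs.BOM_UTF8` constant; the sample string is arbitrary.

```python
import codecs

text = u"caf\xe9"  # arbitrary sample text

# Encoding with utf-8-sig writes the three signature bytes first.
with_sig = text.encode("utf-8-sig")
has_signature = with_sig.startswith(codecs.BOM_UTF8)  # 0xef 0xbb 0xbf

# Decoding with utf-8-sig skips the signature if it is present ...
decoded_sig = with_sig.decode("utf-8-sig")

# ... and is harmless when the signature is absent.
decoded_no_sig = text.encode("utf-8").decode("utf-8-sig")

# Decoding the same bytes as plain utf-8 keeps the BOM as U+FEFF.
decoded_plain = with_sig.decode("utf-8")
```

Decoding signed data with plain ``utf-8`` leaves a leading ``U+FEFF`` in the result, which is one reason a reader that may receive such files would pick ``utf-8-sig`` for decoding.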
     
    891892* an IBM PC code page, which is ASCII compatible
    892893
     894.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|
     895
    893896+-----------------+--------------------------------+--------------------------------+
    894897| Codec           | Aliases                        | Languages                      |
     
    909912|                 | IBM500                         |                                |
    910913+-----------------+--------------------------------+--------------------------------+
     914| cp720           |                                | Arabic                         |
     915+-----------------+--------------------------------+--------------------------------+
    911916| cp737           |                                | Greek                          |
    912917+-----------------+--------------------------------+--------------------------------+
     
    923928+-----------------+--------------------------------+--------------------------------+
    924929| cp857           | 857, IBM857                    | Turkish                        |
     930+-----------------+--------------------------------+--------------------------------+
     931| cp858           | 858, IBM858                    | Western Europe                 |
    925932+-----------------+--------------------------------+--------------------------------+
    926933| cp860           | 860, IBM860                    | Portuguese                     |
     
    10361043| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
    10371044+-----------------+--------------------------------+--------------------------------+
    1038 | iso8859_13      | iso-8859-13                    | Baltic languages               |
     1045| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
    10391046+-----------------+--------------------------------+--------------------------------+
    10401047| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
    10411048+-----------------+--------------------------------+--------------------------------+
    1042 | iso8859_15      | iso-8859-15                    | Western Europe                 |
     1049| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
     1050+-----------------+--------------------------------+--------------------------------+
     1051| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
    10431052+-----------------+--------------------------------+--------------------------------+
    10441053| johab           | cp1361, ms1361                 | Korean                         |
     
    10921101+-----------------+--------------------------------+--------------------------------+
    10931102
    1094 A number of codecs are specific to Python, so their codec names have no meaning
    1095 outside Python. Some of them don't convert from Unicode strings to byte strings,
    1096 but instead use the property of the Python codecs machinery that any bijective
    1097 function with one argument can be considered as an encoding.
    1098 
    1099 For the codecs listed below, the result in the "encoding" direction is always a
    1100 byte string. The result of the "decoding" direction is listed as operand type in
    1101 the table.
    1102 
    1103 +--------------------+---------------------------+----------------+---------------------------+
    1104 | Codec              | Aliases                   | Operand type   | Purpose                   |
    1105 +====================+===========================+================+===========================+
    1106 | base64_codec       | base64, base-64           | byte string    | Convert operand to MIME   |
    1107 |                    |                           |                | base64                    |
    1108 +--------------------+---------------------------+----------------+---------------------------+
    1109 | bz2_codec          | bz2                       | byte string    | Compress the operand      |
    1110 |                    |                           |                | using bz2                 |
    1111 +--------------------+---------------------------+----------------+---------------------------+
    1112 | hex_codec          | hex                       | byte string    | Convert operand to        |
    1113 |                    |                           |                | hexadecimal               |
    1114 |                    |                           |                | representation, with two  |
    1115 |                    |                           |                | digits per byte           |
    1116 +--------------------+---------------------------+----------------+---------------------------+
    1117 | idna               |                           | Unicode string | Implements :rfc:`3490`,   |
    1118 |                    |                           |                | see also                  |
    1119 |                    |                           |                | :mod:`encodings.idna`     |
    1120 +--------------------+---------------------------+----------------+---------------------------+
    1121 | mbcs               | dbcs                      | Unicode string | Windows only: Encode      |
    1122 |                    |                           |                | operand according to the  |
    1123 |                    |                           |                | ANSI codepage (CP_ACP)    |
    1124 +--------------------+---------------------------+----------------+---------------------------+
    1125 | palmos             |                           | Unicode string | Encoding of PalmOS 3.5    |
    1126 +--------------------+---------------------------+----------------+---------------------------+
    1127 | punycode           |                           | Unicode string | Implements :rfc:`3492`    |
    1128 +--------------------+---------------------------+----------------+---------------------------+
    1129 | quopri_codec       | quopri, quoted-printable, | byte string    | Convert operand to MIME   |
    1130 |                    | quotedprintable           |                | quoted printable          |
    1131 +--------------------+---------------------------+----------------+---------------------------+
    1132 | raw_unicode_escape |                           | Unicode string | Produce a string that is  |
    1133 |                    |                           |                | suitable as raw Unicode   |
    1134 |                    |                           |                | literal in Python source  |
    1135 |                    |                           |                | code                      |
    1136 +--------------------+---------------------------+----------------+---------------------------+
    1137 | rot_13             | rot13                     | Unicode string | Returns the Caesar-cypher |
    1138 |                    |                           |                | encryption of the operand |
    1139 +--------------------+---------------------------+----------------+---------------------------+
    1140 | string_escape      |                           | byte string    | Produce a string that is  |
    1141 |                    |                           |                | suitable as string        |
    1142 |                    |                           |                | literal in Python source  |
    1143 |                    |                           |                | code                      |
    1144 +--------------------+---------------------------+----------------+---------------------------+
    1145 | undefined          |                           | any            | Raise an exception for    |
    1146 |                    |                           |                | all conversions. Can be   |
    1147 |                    |                           |                | used as the system        |
    1148 |                    |                           |                | encoding if no automatic  |
    1149 |                    |                           |                | :term:`coercion` between  |
    1150 |                    |                           |                | byte and Unicode strings  |
    1151 |                    |                           |                | is desired.               |
    1152 +--------------------+---------------------------+----------------+---------------------------+
    1153 | unicode_escape     |                           | Unicode string | Produce a string that is  |
    1154 |                    |                           |                | suitable as Unicode       |
    1155 |                    |                           |                | literal in Python source  |
    1156 |                    |                           |                | code                      |
    1157 +--------------------+---------------------------+----------------+---------------------------+
    1158 | unicode_internal   |                           | Unicode string | Return the internal       |
    1159 |                    |                           |                | representation of the     |
    1160 |                    |                           |                | operand                   |
    1161 +--------------------+---------------------------+----------------+---------------------------+
    1162 | uu_codec           | uu                        | byte string    | Convert the operand using |
    1163 |                    |                           |                | uuencode                  |
    1164 +--------------------+---------------------------+----------------+---------------------------+
    1165 | zlib_codec         | zip, zlib                 | byte string    | Compress the operand      |
    1166 |                    |                           |                | using gzip                |
    1167 +--------------------+---------------------------+----------------+---------------------------+
     1103Python Specific Encodings
     1104-------------------------
     1105
     1106A number of predefined codecs are specific to Python, so their codec names have
     1107no meaning outside Python.  These are listed in the tables below based on the
     1108expected input and output types (note that while text encodings are the most
     1109common use case for codecs, the underlying codec infrastructure supports
     1110arbitrary data transforms rather than just text encodings).  For asymmetric
     1111codecs, the stated purpose describes the encoding direction.
     1112
     1113The following codecs provide unicode-to-str encoding [#encoding-note]_ and
     1114str-to-unicode decoding [#decoding-note]_, similar to the Unicode text
     1115encodings.
     1116
     1117.. tabularcolumns:: |l|L|L|
     1118
     1119+--------------------+---------------------------+---------------------------+
     1120| Codec              | Aliases                   | Purpose                   |
     1121+====================+===========================+===========================+
     1122| idna               |                           | Implements :rfc:`3490`,   |
     1123|                    |                           | see also                  |
     1124|                    |                           | :mod:`encodings.idna`     |
     1125+--------------------+---------------------------+---------------------------+
     1126| mbcs               | dbcs                      | Windows only: Encode      |
     1127|                    |                           | operand according to the  |
     1128|                    |                           | ANSI codepage (CP_ACP)    |
     1129+--------------------+---------------------------+---------------------------+
     1130| palmos             |                           | Encoding of PalmOS 3.5    |
     1131+--------------------+---------------------------+---------------------------+
     1132| punycode           |                           | Implements :rfc:`3492`    |
     1133+--------------------+---------------------------+---------------------------+
     1134| raw_unicode_escape |                           | Produce a string that is  |
     1135|                    |                           | suitable as raw Unicode   |
     1136|                    |                           | literal in Python source  |
     1137|                    |                           | code                      |
     1138+--------------------+---------------------------+---------------------------+
     1139| rot_13             | rot13                     | Returns the Caesar-cypher |
     1140|                    |                           | encryption of the operand |
     1141+--------------------+---------------------------+---------------------------+
     1142| undefined          |                           | Raise an exception for    |
     1143|                    |                           | all conversions. Can be   |
     1144|                    |                           | used as the system        |
     1145|                    |                           | encoding if no automatic  |
     1146|                    |                           | :term:`coercion` between  |
     1147|                    |                           | byte and Unicode strings  |
     1148|                    |                           | is desired.               |
     1149+--------------------+---------------------------+---------------------------+
     1150| unicode_escape     |                           | Produce a string that is  |
     1151|                    |                           | suitable as Unicode       |
     1152|                    |                           | literal in Python source  |
     1153|                    |                           | code                      |
     1154+--------------------+---------------------------+---------------------------+
     1155| unicode_internal   |                           | Return the internal       |
     1156|                    |                           | representation of the     |
     1157|                    |                           | operand                   |
     1158+--------------------+---------------------------+---------------------------+
    11681159
    11691160.. versionadded:: 2.3
    11701161   The ``idna`` and ``punycode`` encodings.
     1162
     1163The following codecs provide str-to-str encoding and decoding
     1164[#decoding-note]_.
     1165
     1166.. tabularcolumns:: |l|L|L|L|
     1167
     1168+--------------------+---------------------------+---------------------------+------------------------------+
     1169| Codec              | Aliases                   | Purpose                   | Encoder/decoder              |
     1170+====================+===========================+===========================+==============================+
     1171| base64_codec       | base64, base-64           | Convert operand to MIME   | :meth:`base64.b64encode`,    |
     1172|                    |                           | base64 (the result always | :meth:`base64.b64decode`     |
     1173|                    |                           | includes a trailing       |                              |
     1174|                    |                           | ``'\n'``)                 |                              |
     1175+--------------------+---------------------------+---------------------------+------------------------------+
     1176| bz2_codec          | bz2                       | Compress the operand      | :meth:`bz2.compress`,        |
     1177|                    |                           | using bz2                 | :meth:`bz2.decompress`       |
     1178+--------------------+---------------------------+---------------------------+------------------------------+
     1179| hex_codec          | hex                       | Convert operand to        | :meth:`base64.b16encode`,    |
     1180|                    |                           | hexadecimal               | :meth:`base64.b16decode`     |
     1181|                    |                           | representation, with two  |                              |
     1182|                    |                           | digits per byte           |                              |
     1183+--------------------+---------------------------+---------------------------+------------------------------+
     1184| quopri_codec       | quopri, quoted-printable, | Convert operand to MIME   | :meth:`quopri.encodestring`, |
     1185|                    | quotedprintable           | quoted printable          | :meth:`quopri.decodestring`  |
     1186+--------------------+---------------------------+---------------------------+------------------------------+
     1187| string_escape      |                           | Produce a string that is  |                              |
     1188|                    |                           | suitable as string        |                              |
     1189|                    |                           | literal in Python source  |                              |
     1190|                    |                           | code                      |                              |
     1191+--------------------+---------------------------+---------------------------+------------------------------+
     1192| uu_codec           | uu                        | Convert the operand using | :meth:`uu.encode`,           |
     1193|                    |                           | uuencode                  | :meth:`uu.decode`            |
     1194+--------------------+---------------------------+---------------------------+------------------------------+
     1195| zlib_codec         | zip, zlib                 | Compress the operand      | :meth:`zlib.compress`,       |
     1196|                    |                           | using gzip                | :meth:`zlib.decompress`      |
     1197+--------------------+---------------------------+---------------------------+------------------------------+
     1198
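A minimal sketch of the transforms in the table above. Note the assumption: the table documents Python 2.7, where these are str-to-str codecs reachable through `str.encode`; the sketch below instead targets Python 3 (3.4 or later), where the same codec names operate on bytes and are reached through `codecs.encode` and `codecs.decode`.

```python
import codecs

data = b"hello"  # arbitrary sample payload

# base64_codec: the encoded result always ends with a newline.
b64 = codecs.encode(data, "base64_codec")

# hex_codec: two hexadecimal digits per input byte.
hexed = codecs.encode(b"\x00\xff", "hex_codec")

# zlib_codec: compress, then decompress to recover the original bytes.
compressed = codecs.encode(data * 100, "zlib_codec")
restored = codecs.decode(compressed, "zlib_codec")
```

Each transform is undone by calling `codecs.decode` with the same codec name.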
     1199.. [#encoding-note] str objects are also accepted as input in place of unicode
     1200   objects.  They are implicitly converted to unicode by decoding them using
     1201   the default encoding.  If this conversion fails, it may lead to encoding
     1202   operations raising :exc:`UnicodeDecodeError`.
     1203
     1204.. [#decoding-note] unicode objects are also accepted as input in place of str
     1205   objects.  They are implicitly converted to str by encoding them using the
     1206   default encoding.  If this conversion fails, it may lead to decoding
     1207   operations raising :exc:`UnicodeEncodeError`.
    11711208
    11721209
     
    11961233to the user.
    11971234
    1198 Python supports this conversion in several ways: The ``idna`` codec allows to
    1199 convert between Unicode and the ACE. Furthermore, the :mod:`socket` module
     1235Python supports this conversion in several ways:  the ``idna`` codec performs
     1236conversion between Unicode and ACE, separating an input string into labels
     1237based on the separator characters defined in `section 3.1`_ (1) of :rfc:`3490`
     1238and converting each label to ACE as required, and conversely separating an input
     1239byte string into labels based on the ``.`` separator and converting any ACE
     1240labels found into unicode.  Furthermore, the :mod:`socket` module
    12001241transparently converts Unicode host names to ACE, so that applications need not
    12011242be concerned about converting host names themselves when they pass them to the
     
    12051246:mailheader:`Host` field if it sends that field at all).
    12061247
     1248.. _section 3.1: http://tools.ietf.org/html/rfc3490#section-3.1
     1249
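The per-label conversion performed by the ``idna`` codec can be sketched like this. The example assumes Python 3, where encoding yields a bytes object; the host name is a made-up illustration under the reserved ``.example`` top-level domain.

```python
# Hypothetical host name with one non-ASCII label and one ASCII label.
host = u"b\xfccher.example"

# Encoding converts each label to ACE only where required; the pure
# ASCII label "example" passes through unchanged.
ace = host.encode("idna")

# Decoding splits on "." and converts ACE labels back to Unicode.
round_trip = ace.decode("idna")
```

Only the label containing ``ü`` gains the ``xn--`` ACE prefix.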
    12071250When receiving host names from the wire (such as in reverse name lookup), no
    12081251automatic conversion to Unicode is performed: Applications wishing to present