Changeset 391 for python/trunk/Doc/library/codecs.rst
Timestamp: Mar 19, 2014, 11:31:01 PM
Location: python/trunk
Files: 2 edited
python/trunk
-
Property svn:mergeinfo
set to
/python/vendor/Python-2.7.6 merged eligible /python/vendor/current merged eligible
-
Property svn:mergeinfo
set to
-
python/trunk/Doc/library/codecs.rst
--- python/trunk/Doc/library/codecs.rst (r2)
+++ python/trunk/Doc/library/codecs.rst (r391)
@@ -48,7 +48,7 @@
 
 *encode* and *decode*: These must be functions or methods which have the same
-interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see
-Codec Interface). The functions/methods are expected to work in a stateless
-mode.
+interface as the :meth:`~Codec.encode`/:meth:`~Codec.decode` methods of Codec
+instances (see :ref:`Codec Interface <codec-objects>`). The functions/methods
+are expected to work in a stateless mode.
 
 *incrementalencoder* and *incrementaldecoder*: These have to be factory
@@ -67,5 +67,5 @@
 
 The factory functions must return objects providing the interfaces defined by
-the base classes :class:`StreamWriter` and :class:`StreamReader`, respectively.
+the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
 Stream codecs can maintain state.
 
@@ -316,8 +316,10 @@
 The :class:`Codec` class defines the interface for stateless encoders/decoders.
 
-To simplify and standardize error handling, the :meth:`encode` and
-:meth:`decode` methods may implement different error handling schemes by
+To simplify and standardize error handling, the :meth:`~Codec.encode` and
+:meth:`~Codec.decode` methods may implement different error handling schemes by
 providing the *errors* string argument. The following string values are defined
 and implemented by all standard Python codecs:
+
+.. tabularcolumns:: |l|L|
 
 +-------------------------+-----------------------------------------------+
@@ -396,10 +398,12 @@
 the basic interface for incremental encoding and decoding. Encoding/decoding the
 input isn't done with one call to the stateless encoder/decoder function, but
-with multiple calls to the :meth:`encode`/:meth:`decode` method of the
-incremental encoder/decoder. The incremental encoder/decoder keeps track of the
-encoding/decoding process during method calls.
-
-The joined output of calls to the :meth:`encode`/:meth:`decode` method is the
-same as if all the single inputs were joined into one, and this input was
+with multiple calls to the
+:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
+the incremental encoder/decoder. The incremental encoder/decoder keeps track of
+the encoding/decoding process during method calls.
+
+The joined output of calls to the
+:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
+the same as if all the single inputs were joined into one, and this input was
 encoded/decoded with the stateless encoder/decoder.
 
@@ -652,5 +656,5 @@
 
 *size*, if given, is passed as size argument to the stream's
-:meth:`readline` method.
+:meth:`read` method.
 
 If *keepends* is false line-endings will be stripped from the lines
@@ -760,7 +764,7 @@
 
 Unicode strings are stored internally as sequences of codepoints (to be precise
-as :ctype:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
-via :option:`--enable-unicode=ucs2` or :option:`--enable-unicode=ucs4`, with the
-former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit data
+as :c:type:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
+via ``--enable-unicode=ucs2`` or ``--enable-unicode=ucs4``, with the
+former being the default) :c:type:`Py_UNICODE` is either a 16-bit or 32-bit data
 type. Once a Unicode object is used outside of CPU and memory, CPU endianness
 and how these arrays are stored as bytes become an issue. Transforming a
@@ -783,25 +787,26 @@
 character is mapped to which byte value.
 
-All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints
+All of these encodings can only encode 256 of the 1114112 codepoints
 defined in unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as two consecutive bytes. There are two
-possibilities: Store the bytes in big endian or in little endian order. These
-two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
-disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
-will always have to swap bytes on encoding and decoding. UTF-16 avoids this
-problem: Bytes will always be in natural endianness. When these bytes are read
+code point, is to store each codepoint as four consecutive bytes. There are two
+possibilities: store the bytes in big endian or in little endian order. These
+two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
+disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
+will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
+problem: bytes will always be in natural endianness. When these bytes are read
 by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a UTF-16 byte sequence, there's the so
-called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
-This character will be prepended to every UTF-16 byte sequence. The byte swapped
-version of this character (``0xFFFE``) is an illegal character that may not
-appear in a Unicode text. So when the first character in an UTF-16 byte sequence
+be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
+there's the so called BOM ("Byte Order Mark"). This is the Unicode character
+``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
+byte sequence. The byte swapped version of this character (``0xFFFE``) is an
+illegal character that may not appear in a Unicode text. So when the
+first character in an ``UTF-16`` or ``UTF-32`` byte sequence
 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
-Unfortunately upto Unicode 4.0 the character ``U+FEFF`` had a second purpose as
-a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
+Unfortunately the character ``U+FEFF`` had a second purpose as
+a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
 With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
 deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
-Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
+Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
 it's a device to determine the storage layout of the encoded bytes, and vanishes
 once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
@@ -811,6 +816,6 @@
 characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
 with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
-parts: Marker bits (the most significant bits) and payload bits. The marker bits
-are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are
+parts: marker bits (the most significant bits) and payload bits. The marker bits
+are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
 encoded like this (with x being payload bits, which when concatenated give the
 Unicode character):
@@ -825,10 +830,5 @@
 | ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
 +-----------------------------------+----------------------------------------------+
-| ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
-+-----------------------------------+----------------------------------------------+
-| ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
-+-----------------------------------+----------------------------------------------+
-| ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
-|                                   | 10xxxxxx                                     |
+| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
 +-----------------------------------+----------------------------------------------+
 
@@ -855,11 +855,12 @@
 | INVERTED QUESTION MARK
 
-in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
+in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
 correctly guessed from the byte sequence. So here the BOM is not used to be able
 to determine the byte order used for generating the byte sequence, but as a
 signature that helps in guessing the encoding. On encoding the utf-8-sig codec
 will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
-decoding utf-8-sig will skip those three bytes if they appear as the first three
-bytes in the file.
+decoding ``utf-8-sig`` will skip those three bytes if they appear as the first
+three bytes in the file. In UTF-8, the use of the BOM is discouraged and
+should generally be avoided.
 
 
@@ -891,4 +892,6 @@
 * an IBM PC code page, which is ASCII compatible
 
+.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|
+
 +-----------------+--------------------------------+--------------------------------+
 | Codec           | Aliases                        | Languages                      |
@@ -909,4 +912,6 @@
 |                 | IBM500                         |                                |
 +-----------------+--------------------------------+--------------------------------+
+| cp720           |                                | Arabic                         |
++-----------------+--------------------------------+--------------------------------+
 | cp737           |                                | Greek                          |
 +-----------------+--------------------------------+--------------------------------+
@@ -923,4 +928,6 @@
 +-----------------+--------------------------------+--------------------------------+
 | cp857           | 857, IBM857                    | Turkish                        |
++-----------------+--------------------------------+--------------------------------+
+| cp858           | 858, IBM858                    | Western Europe                 |
 +-----------------+--------------------------------+--------------------------------+
 | cp860           | 860, IBM860                    | Portuguese                     |
@@ -1036,9 +1043,11 @@
 | iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
 +-----------------+--------------------------------+--------------------------------+
-| iso8859_13      | iso-8859-13                    | Baltic languages               |
+| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
 +-----------------+--------------------------------+--------------------------------+
 | iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
 +-----------------+--------------------------------+--------------------------------+
-| iso8859_15      | iso-8859-15                    | Western Europe                 |
+| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
++-----------------+--------------------------------+--------------------------------+
+| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
 +-----------------+--------------------------------+--------------------------------+
 | johab           | cp1361, ms1361                 | Korean                         |
@@ -1092,81 +1101,109 @@
 +-----------------+--------------------------------+--------------------------------+
 
-A number of codecs are specific to Python, so their codec names have no meaning
-outside Python. Some of them don't convert from Unicode strings to byte strings,
-but instead use the property of the Python codecs machinery that any bijective
-function with one argument can be considered as an encoding.
-
-For the codecs listed below, the result in the "encoding" direction is always a
-byte string. The result of the "decoding" direction is listed as operand type in
-the table.
-
-+--------------------+---------------------------+----------------+---------------------------+
-| Codec              | Aliases                   | Operand type   | Purpose                   |
-+====================+===========================+================+===========================+
-| base64_codec       | base64, base-64           | byte string    | Convert operand to MIME   |
-|                    |                           |                | base64                    |
-+--------------------+---------------------------+----------------+---------------------------+
-| bz2_codec          | bz2                       | byte string    | Compress the operand      |
-|                    |                           |                | using bz2                 |
-+--------------------+---------------------------+----------------+---------------------------+
-| hex_codec          | hex                       | byte string    | Convert operand to        |
-|                    |                           |                | hexadecimal               |
-|                    |                           |                | representation, with two  |
-|                    |                           |                | digits per byte           |
-+--------------------+---------------------------+----------------+---------------------------+
-| idna               |                           | Unicode string | Implements :rfc:`3490`,   |
-|                    |                           |                | see also                  |
-|                    |                           |                | :mod:`encodings.idna`     |
-+--------------------+---------------------------+----------------+---------------------------+
-| mbcs               | dbcs                      | Unicode string | Windows only: Encode      |
-|                    |                           |                | operand according to the  |
-|                    |                           |                | ANSI codepage (CP_ACP)    |
-+--------------------+---------------------------+----------------+---------------------------+
-| palmos             |                           | Unicode string | Encoding of PalmOS 3.5    |
-+--------------------+---------------------------+----------------+---------------------------+
-| punycode           |                           | Unicode string | Implements :rfc:`3492`    |
-+--------------------+---------------------------+----------------+---------------------------+
-| quopri_codec       | quopri, quoted-printable, | byte string    | Convert operand to MIME   |
-|                    | quotedprintable           |                | quoted printable          |
-+--------------------+---------------------------+----------------+---------------------------+
-| raw_unicode_escape |                           | Unicode string | Produce a string that is  |
-|                    |                           |                | suitable as raw Unicode   |
-|                    |                           |                | literal in Python source  |
-|                    |                           |                | code                      |
-+--------------------+---------------------------+----------------+---------------------------+
-| rot_13             | rot13                     | Unicode string | Returns the Caesar-cypher |
-|                    |                           |                | encryption of the operand |
-+--------------------+---------------------------+----------------+---------------------------+
-| string_escape      |                           | byte string    | Produce a string that is  |
-|                    |                           |                | suitable as string        |
-|                    |                           |                | literal in Python source  |
-|                    |                           |                | code                      |
-+--------------------+---------------------------+----------------+---------------------------+
-| undefined          |                           | any            | Raise an exception for    |
-|                    |                           |                | all conversions. Can be   |
-|                    |                           |                | used as the system        |
-|                    |                           |                | encoding if no automatic  |
-|                    |                           |                | :term:`coercion` between  |
-|                    |                           |                | byte and Unicode strings  |
-|                    |                           |                | is desired.               |
-+--------------------+---------------------------+----------------+---------------------------+
-| unicode_escape     |                           | Unicode string | Produce a string that is  |
-|                    |                           |                | suitable as Unicode       |
-|                    |                           |                | literal in Python source  |
-|                    |                           |                | code                      |
-+--------------------+---------------------------+----------------+---------------------------+
-| unicode_internal   |                           | Unicode string | Return the internal       |
-|                    |                           |                | representation of the     |
-|                    |                           |                | operand                   |
-+--------------------+---------------------------+----------------+---------------------------+
-| uu_codec           | uu                        | byte string    | Convert the operand using |
-|                    |                           |                | uuencode                  |
-+--------------------+---------------------------+----------------+---------------------------+
-| zlib_codec         | zip, zlib                 | byte string    | Compress the operand      |
-|                    |                           |                | using gzip                |
-+--------------------+---------------------------+----------------+---------------------------+
+Python Specific Encodings
+-------------------------
+
+A number of predefined codecs are specific to Python, so their codec names have
+no meaning outside Python. These are listed in the tables below based on the
+expected input and output types (note that while text encodings are the most
+common use case for codecs, the underlying codec infrastructure supports
+arbitrary data transforms rather than just text encodings). For asymmetric
+codecs, the stated purpose describes the encoding direction.
+
+The following codecs provide unicode-to-str encoding [#encoding-note]_ and
+str-to-unicode decoding [#decoding-note]_, similar to the Unicode text
+encodings.
+
+.. tabularcolumns:: |l|L|L|
+
++--------------------+---------------------------+---------------------------+
+| Codec              | Aliases                   | Purpose                   |
++====================+===========================+===========================+
+| idna               |                           | Implements :rfc:`3490`,   |
+|                    |                           | see also                  |
+|                    |                           | :mod:`encodings.idna`     |
++--------------------+---------------------------+---------------------------+
+| mbcs               | dbcs                      | Windows only: Encode      |
+|                    |                           | operand according to the  |
+|                    |                           | ANSI codepage (CP_ACP)    |
++--------------------+---------------------------+---------------------------+
+| palmos             |                           | Encoding of PalmOS 3.5    |
++--------------------+---------------------------+---------------------------+
+| punycode           |                           | Implements :rfc:`3492`    |
++--------------------+---------------------------+---------------------------+
+| raw_unicode_escape |                           | Produce a string that is  |
+|                    |                           | suitable as raw Unicode   |
+|                    |                           | literal in Python source  |
+|                    |                           | code                      |
++--------------------+---------------------------+---------------------------+
+| rot_13             | rot13                     | Returns the Caesar-cypher |
+|                    |                           | encryption of the operand |
++--------------------+---------------------------+---------------------------+
+| undefined          |                           | Raise an exception for    |
+|                    |                           | all conversions. Can be   |
+|                    |                           | used as the system        |
+|                    |                           | encoding if no automatic  |
+|                    |                           | :term:`coercion` between  |
+|                    |                           | byte and Unicode strings  |
+|                    |                           | is desired.               |
++--------------------+---------------------------+---------------------------+
+| unicode_escape     |                           | Produce a string that is  |
+|                    |                           | suitable as Unicode       |
+|                    |                           | literal in Python source  |
+|                    |                           | code                      |
++--------------------+---------------------------+---------------------------+
+| unicode_internal   |                           | Return the internal       |
+|                    |                           | representation of the     |
+|                    |                           | operand                   |
++--------------------+---------------------------+---------------------------+
 
 .. versionadded:: 2.3
    The ``idna`` and ``punycode`` encodings.
+
+The following codecs provide str-to-str encoding and decoding
+[#decoding-note]_.
+
+.. tabularcolumns:: |l|L|L|L|
+
++--------------------+---------------------------+---------------------------+------------------------------+
+| Codec              | Aliases                   | Purpose                   | Encoder/decoder              |
++====================+===========================+===========================+==============================+
+| base64_codec       | base64, base-64           | Convert operand to MIME   | :meth:`base64.b64encode`,    |
+|                    |                           | base64 (the result always | :meth:`base64.b64decode`     |
+|                    |                           | includes a trailing       |                              |
+|                    |                           | ``'\n'``)                 |                              |
++--------------------+---------------------------+---------------------------+------------------------------+
+| bz2_codec          | bz2                       | Compress the operand      | :meth:`bz2.compress`,        |
+|                    |                           | using bz2                 | :meth:`bz2.decompress`       |
++--------------------+---------------------------+---------------------------+------------------------------+
+| hex_codec          | hex                       | Convert operand to        | :meth:`base64.b16encode`,    |
+|                    |                           | hexadecimal               | :meth:`base64.b16decode`     |
+|                    |                           | representation, with two  |                              |
+|                    |                           | digits per byte           |                              |
++--------------------+---------------------------+---------------------------+------------------------------+
+| quopri_codec       | quopri, quoted-printable, | Convert operand to MIME   | :meth:`quopri.encodestring`, |
+|                    | quotedprintable           | quoted printable          | :meth:`quopri.decodestring`  |
++--------------------+---------------------------+---------------------------+------------------------------+
+| string_escape      |                           | Produce a string that is  |                              |
+|                    |                           | suitable as string        |                              |
+|                    |                           | literal in Python source  |                              |
+|                    |                           | code                      |                              |
++--------------------+---------------------------+---------------------------+------------------------------+
+| uu_codec           | uu                        | Convert the operand using | :meth:`uu.encode`,           |
+|                    |                           | uuencode                  | :meth:`uu.decode`            |
++--------------------+---------------------------+---------------------------+------------------------------+
+| zlib_codec         | zip, zlib                 | Compress the operand      | :meth:`zlib.compress`,       |
+|                    |                           | using gzip                | :meth:`zlib.decompress`      |
++--------------------+---------------------------+---------------------------+------------------------------+
+
+.. [#encoding-note] str objects are also accepted as input in place of unicode
+   objects. They are implicitly converted to unicode by decoding them using
+   the default encoding. If this conversion fails, it may lead to encoding
+   operations raising :exc:`UnicodeDecodeError`.
+
+.. [#decoding-note] unicode objects are also accepted as input in place of str
+   objects. They are implicitly converted to str by encoding them using the
+   default encoding. If this conversion fails, it may lead to decoding
+   operations raising :exc:`UnicodeEncodeError`.
 
 
@@ -1196,6 +1233,10 @@
 to the user.
 
-Python supports this conversion in several ways: The ``idna`` codec allows to
-convert between Unicode and the ACE. Furthermore, the :mod:`socket` module
+Python supports this conversion in several ways: the ``idna`` codec performs
+conversion between Unicode and ACE, separating an input string into labels
+based on the separator characters defined in `section 3.1`_ (1) of :rfc:`3490`
+and converting each label to ACE as required, and conversely separating an input
+byte string into labels based on the ``.`` separator and converting any ACE
+labels found into unicode. Furthermore, the :mod:`socket` module
 transparently converts Unicode host names to ACE, so that applications need not
 be concerned about converting host names themselves when they pass them to the
@@ -1205,4 +1246,6 @@
 :mailheader:`Host` field if it sends that field at all).
 
+.. _section 3.1: http://tools.ietf.org/html/rfc3490#section-3.1
+
 When receiving host names from the wire (such as in reverse name lookup), no
 automatic conversion to Unicode is performed: Applications wishing to present
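
The label-by-label conversion described in the revised ``idna`` paragraph can be illustrated with a short sketch. The host name below is a made-up example, not something taken from the changeset, and the snippet is written so that it should run unchanged on Python 2.7 and Python 3 (it only uses the ``encode``/``decode`` methods with the ``idna`` codec name).

```python
# Hypothetical host name chosen purely for illustration ("bücher.example").
host = u"b\xfccher.example"

# Encoding splits the name into labels and converts each non-ASCII label
# to its ASCII Compatible Encoding (ACE) form, prefixed with "xn--".
ace = host.encode("idna")

# RFC 3490 section 3.1 lists further separator characters; here the
# ideographic full stop (U+3002) is used in place of ".".
alt = u"b\xfccher\u3002example".encode("idna")

# Decoding splits the byte string on "." and converts ACE labels back.
roundtrip = ace.decode("idna")

print(ace)
print(roundtrip == host)
```

Under these assumptions both spellings of the name yield the same ACE byte string, and decoding returns the original Unicode host name.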