codecs.rst

Last change on this file was 391, checked in by dmik, 11 years ago
python: Merge vendor 2.7.6 to trunk.
Property svn:eol-style set to `native`
File size: 64.1 KB

Rev	Line
[2]	1
	2	:mod:`codecs` --- Codec registry and base classes
	3	=================================================
	4
	5	.. module:: codecs
	6	:synopsis: Encode and decode data and streams.
	7	.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
	8	.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
	9	.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>
	10
	11
	12	.. index::
	13	single: Unicode
	14	single: Codecs
	15	pair: Codecs; encode
	16	pair: Codecs; decode
	17	single: streams
	18	pair: stackable; streams
	19
	20	This module defines base classes for standard Python codecs (encoders and
	21	decoders) and provides access to the internal Python codec registry which
	22	manages the codec and error handling lookup process.
	23
	24	It defines the following functions:
	25
	26
	27	.. function:: register(search_function)
	28
	29	Register a codec search function. Search functions are expected to take one
	30	argument, the encoding name in all lower case letters, and return a
	31	:class:`CodecInfo` object having the following attributes:
	32
	33	* ``name`` The name of the encoding;
	34
	35	* ``encode`` The stateless encoding function;
	36
	37	* ``decode`` The stateless decoding function;
	38
	39	* ``incrementalencoder`` An incremental encoder class or factory function;
	40
	41	* ``incrementaldecoder`` An incremental decoder class or factory function;
	42
	43	* ``streamwriter`` A stream writer class or factory function;
	44
	45	* ``streamreader`` A stream reader class or factory function.
	46
	47	The various functions or classes take the following arguments:
	48
	49	encode and decode: These must be functions or methods which have the same
[391]	50	interface as the :meth:`~Codec.encode`/:meth:`~Codec.decode` methods of Codec
	51	instances (see :ref:`Codec Interface <codec-objects>`). The functions/methods
	52	are expected to work in a stateless mode.
[2]	53
	54	incrementalencoder and incrementaldecoder: These have to be factory
	55	functions providing the following interface:
	56
	57	``factory(errors='strict')``
	58
	59	The factory functions must return objects providing the interfaces defined by
	60	the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
	61	respectively. Incremental codecs can maintain state.
	62
	63	streamreader and streamwriter: These have to be factory functions providing
	64	the following interface:
	65
	66	``factory(stream, errors='strict')``
	67
	68	The factory functions must return objects providing the interfaces defined by
[391]	69	the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
[2]	70	Stream codecs can maintain state.
	71
	72	Possible values for errors are
	73
	74	* ``'strict'``: raise an exception in case of an encoding error
	75	* ``'replace'``: replace malformed data with a suitable replacement marker,
	76	such as ``'?'`` or ``'\ufffd'``
	77	* ``'ignore'``: ignore malformed data and continue without further notice
	78	* ``'xmlcharrefreplace'``: replace with the appropriate XML character
	79	reference (for encoding only)
	80	* ``'backslashreplace'``: replace with backslashed escape sequences (for
	81	encoding only)
	82
	83	as well as any other error handling name defined via :func:`register_error`.
	84
	85	In case a search function cannot find a given encoding, it should return
	86	``None``.
	87
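As an illustration of the search-function protocol described above, the following minimal sketch registers a hypothetical alias, ``demoutf8``, that reuses the pieces of the built-in UTF-8 codec. The alias name and the function name ``demo_search`` are assumptions made for this example only; ``u''`` and ``b''`` literals are used so the sketch is not tied to one Python version.

```python
import codecs

def demo_search(name):
    # Search functions receive the encoding name in lower case and
    # return None for names they do not know.
    if name != "demoutf8":
        return None
    utf8 = codecs.lookup("utf-8")
    # Build a CodecInfo from the existing UTF-8 codec components.
    return codecs.CodecInfo(
        name="demoutf8",
        encode=utf8.encode,
        decode=utf8.decode,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8.incrementaldecoder,
        streamreader=utf8.streamreader,
        streamwriter=utf8.streamwriter,
    )

codecs.register(demo_search)

# The alias can now be used wherever an encoding name is accepted.
encoded = u"caf\xe9".encode("demoutf8")
```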
	88
	89	.. function:: lookup(encoding)
	90
	91	Looks up the codec info in the Python codec registry and returns a
	92	:class:`CodecInfo` object as defined above.
	93
	94	Encodings are first looked up in the registry's cache. If not found, the list of
	95	registered search functions is scanned. If no :class:`CodecInfo` object is
	96	found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
	97	is stored in the cache and returned to the caller.
	98
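A short sketch of :func:`lookup` in use: the stateless functions on the returned object yield ``(output, length consumed)`` tuples, and an unknown name raises :exc:`LookupError`. The encoding name ``no_such_codec_demo`` is a made-up name chosen because no such codec exists.

```python
import codecs

info = codecs.lookup("utf-8")

# Stateless encode/decode return (output object, length consumed).
encoded, chars_consumed = info.encode(u"caf\xe9")
decoded, bytes_consumed = info.decode(encoded)

# An unknown encoding is reported with LookupError.
try:
    codecs.lookup("no_such_codec_demo")
    found = True
except LookupError:
    found = False
```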
	99	To simplify access to the various codecs, the module provides these additional
	100	functions which use :func:`lookup` for the codec lookup:
	101
	102
	103	.. function:: getencoder(encoding)
	104
	105	Look up the codec for the given encoding and return its encoder function.
	106
	107	Raises a :exc:`LookupError` in case the encoding cannot be found.
	108
	109
	110	.. function:: getdecoder(encoding)
	111
	112	Look up the codec for the given encoding and return its decoder function.
	113
	114	Raises a :exc:`LookupError` in case the encoding cannot be found.
	115
	116
	117	.. function:: getincrementalencoder(encoding)
	118
	119	Look up the codec for the given encoding and return its incremental encoder
	120	class or factory function.
	121
	122	Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
	123	doesn't support an incremental encoder.
	124
	125	.. versionadded:: 2.5
	126
	127
	128	.. function:: getincrementaldecoder(encoding)
	129
	130	Look up the codec for the given encoding and return its incremental decoder
	131	class or factory function.
	132
	133	Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
	134	doesn't support an incremental decoder.
	135
	136	.. versionadded:: 2.5
	137
	138
	139	.. function:: getreader(encoding)
	140
	141	Look up the codec for the given encoding and return its StreamReader class or
	142	factory function.
	143
	144	Raises a :exc:`LookupError` in case the encoding cannot be found.
	145
	146
	147	.. function:: getwriter(encoding)
	148
	149	Look up the codec for the given encoding and return its StreamWriter class or
	150	factory function.
	151
	152	Raises a :exc:`LookupError` in case the encoding cannot be found.
	153
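The factories returned by :func:`getwriter` and :func:`getreader` are called with a binary stream. A minimal sketch, assuming an in-memory ``io.BytesIO`` object stands in for a real file:

```python
import codecs
import io

raw = io.BytesIO()

# getwriter() returns a factory; calling it wraps the binary stream.
writer = codecs.getwriter("utf-8")(raw)
writer.write(u"gr\xfc\xdf")
stored = raw.getvalue()

# getreader() works the same way for decoding.
reader = codecs.getreader("utf-8")(io.BytesIO(stored))
text = reader.read()
```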
	154
	155	.. function:: register_error(name, error_handler)
	156
	157	Register the error handling function *error_handler* under the name *name*.
	158	*error_handler* will be called during encoding and decoding in case of an error,
	159	when *name* is specified as the *errors* parameter.
	160
	161	For encoding, error_handler will be called with a :exc:`UnicodeEncodeError`
	162	instance, which contains information about the location of the error. The error
	163	handler must either raise this or a different exception or return a tuple with a
	164	replacement for the unencodable part of the input and a position where encoding
	165	should continue. The encoder will encode the replacement and continue encoding
	166	the original input at the specified position. Negative position values will be
	167	treated as being relative to the end of the input string. If the resulting
	168	position is out of bounds, an :exc:`IndexError` will be raised.
	169
	170	Decoding and translating work similarly, except that :exc:`UnicodeDecodeError` or
	171	:exc:`UnicodeTranslateError` will be passed to the handler and that the
	172	replacement from the error handler will be put into the output directly.
	173
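The handler contract for encoding can be sketched as follows. The handler name ``demo_underscore`` and the function ``underscore_errors`` are hypothetical and exist only for this example; the handler returns a replacement string and the position at which encoding should resume.

```python
import codecs

def underscore_errors(exc):
    # This sketch only deals with encoding errors; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Replace each unencodable character with "_" and continue
    # encoding right after the offending span.
    replacement = u"_" * (exc.end - exc.start)
    return (replacement, exc.end)

codecs.register_error("demo_underscore", underscore_errors)

result = u"na\xefve caf\xe9".encode("ascii", "demo_underscore")
```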
	174
	175	.. function:: lookup_error(name)
	176
	177	Return the error handler previously registered under the name *name*.
	178
	179	Raises a :exc:`LookupError` in case the handler cannot be found.
	180
	181
	182	.. function:: strict_errors(exception)
	183
	184	Implements the ``strict`` error handling: each encoding or decoding error
	185	raises a :exc:`UnicodeError`.
	186
	187
	188	.. function:: replace_errors(exception)
	189
	190	Implements the ``replace`` error handling: malformed data is replaced with a
	191	suitable replacement character such as ``'?'`` in bytestrings and
	192	``'\ufffd'`` in Unicode strings.
	193
	194
	195	.. function:: ignore_errors(exception)
	196
	197	Implements the ``ignore`` error handling: malformed data is ignored and
	198	encoding or decoding is continued without further notice.
	199
	200
	201	.. function:: xmlcharrefreplace_errors(exception)
	202
	203	Implements the ``xmlcharrefreplace`` error handling (for encoding only): the
	204	unencodable character is replaced by an appropriate XML character reference.
	205
	206
	207	.. function:: backslashreplace_errors(exception)
	208
	209	Implements the ``backslashreplace`` error handling (for encoding only): the
	210	unencodable character is replaced by a backslashed escape sequence.
	211
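The effect of the predefined handlers can be compared side by side. This sketch encodes a string containing one non-ASCII character to ASCII under each handler name; the sample text is arbitrary.

```python
import codecs

text = u"\u20ac5"  # EURO SIGN followed by an ASCII digit

# Encode under each non-strict handler; only the euro sign is affected.
results = {}
for name in ("replace", "ignore", "xmlcharrefreplace", "backslashreplace"):
    results[name] = text.encode("ascii", name)

# 'strict' raises instead of substituting anything.
try:
    text.encode("ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The handler functions themselves can be fetched by name.
handler = codecs.lookup_error("xmlcharrefreplace")
```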
	212	To simplify working with encoded files or streams, the module also defines these
	213	utility functions:
	214
	215
	216	.. function:: open(filename, mode[, encoding[, errors[, buffering]]])
	217
	218	Open an encoded file using the given mode and return a wrapped version
	219	providing transparent encoding/decoding. The default file mode is ``'r'``
	220	meaning to open the file in read mode.
	221
	222	.. note::
	223
	224	The wrapped version will only accept the object format defined by the codecs,
	225	i.e. Unicode objects for most built-in codecs. Output is also codec-dependent
	226	and will usually be Unicode as well.
	227
	228	.. note::
	229
	230	Files are always opened in binary mode, even if no binary mode was
	231	specified. This is done to avoid data loss due to encodings using 8-bit
	232	values. This means that no automatic conversion of ``'\n'`` is done
	233	on reading and writing.
	234
	235	encoding specifies the encoding which is to be used for the file.
	236
	237	errors may be given to define the error handling. It defaults to ``'strict'``
	238	which causes a :exc:`ValueError` to be raised in case an encoding error occurs.
	239
	240	buffering has the same meaning as for the built-in :func:`open` function. It
	241	defaults to line buffered.
	242
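A minimal sketch of a write/read round trip with :func:`codecs.open`, assuming a temporary file created with ``tempfile`` (the path is generated at run time and removed afterwards). Reading the raw bytes back shows that the ``'\n'`` is stored unchanged, since the file is opened in binary mode.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Unicode text goes in; UTF-8 bytes are written to disk.
    with codecs.open(path, "w", encoding="utf-8") as f:
        f.write(u"sm\xf8rrebr\xf8d\n")

    # Reading decodes the bytes back to Unicode text.
    with codecs.open(path, "r", encoding="utf-8") as f:
        text = f.read()

    # Inspect the raw bytes with the built-in open().
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)
```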
	243
	244	.. function:: EncodedFile(file, input[, output[, errors]])
	245
	246	Return a wrapped version of file which provides transparent encoding
	247	translation.
	248
	249	Strings written to the wrapped file are interpreted according to the given
	250	input encoding and then written to the original file as strings using the
	251	output encoding. The intermediate encoding will usually be Unicode but depends
	252	on the specified codecs.
	253
	254	If output is not given, it defaults to input.
	255
	256	errors may be given to define the error handling. It defaults to ``'strict'``,
	257	which causes :exc:`ValueError` to be raised in case an encoding error occurs.
	258
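As a sketch of the translation described above, the wrapper below accepts Latin-1 encoded byte strings and stores them as UTF-8 in the underlying file. An in-memory ``io.BytesIO`` object is assumed in place of a real file.

```python
import codecs
import io

backing = io.BytesIO()

# Data written to the wrapper is interpreted as Latin-1 and
# stored in the underlying file as UTF-8.
wrapped = codecs.EncodedFile(backing, "latin-1", "utf-8")
wrapped.write(u"caf\xe9".encode("latin-1"))

stored = backing.getvalue()
```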
	259
	260	.. function:: iterencode(iterable, encoding[, errors])
	261
	262	Uses an incremental encoder to iteratively encode the input provided by
	263	iterable. This function is a :term:`generator`. errors (as well as any
	264	other keyword argument) is passed through to the incremental encoder.
	265
	266	.. versionadded:: 2.5
	267
	268
	269	.. function:: iterdecode(iterable, encoding[, errors])
	270
	271	Uses an incremental decoder to iteratively decode the input provided by
	272	iterable. This function is a :term:`generator`. errors (as well as any
	273	other keyword argument) is passed through to the incremental decoder.
	274
	275	.. versionadded:: 2.5
	276
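Both generators can be sketched together. Because an incremental decoder is used, :func:`iterdecode` copes with input chunks that split a multi-byte sequence; the chunk boundaries below are chosen deliberately to fall inside the two-byte UTF-8 encoding of ``ö``.

```python
import codecs

chunks = [u"Hello, ", u"w\xf6rld", u"!"]

# iterencode() yields one encoded chunk per non-empty result.
data = b"".join(codecs.iterencode(chunks, "utf-8"))

# Split the bytes in the middle of the two-byte sequence for U+00F6.
pieces = [data[:9], data[9:]]

# iterdecode() buffers the incomplete sequence between chunks.
decoded = u"".join(codecs.iterdecode(pieces, "utf-8"))
```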
	277	The module also provides the following constants which are useful for reading
	278	and writing to platform dependent files:
	279
	280
	281	.. data:: BOM
	282	BOM_BE
	283	BOM_LE
	284	BOM_UTF8
	285	BOM_UTF16
	286	BOM_UTF16_BE
	287	BOM_UTF16_LE
	288	BOM_UTF32
	289	BOM_UTF32_BE
	290	BOM_UTF32_LE
	291
	292	These constants define various encodings of the Unicode byte order mark (BOM)
	293	used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
	294	stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
	295	:const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
	296	native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
	297	:const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
	298	:const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
	299	encodings.
	300
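One way these constants might be used is to guess an encoding from the first bytes of a file. The helper ``sniff_bom`` below is hypothetical, not part of the module, and the labels it returns are only descriptive. The UTF-32 marks are tested before the UTF-16 ones because ``BOM_UTF32_LE`` begins with the same two bytes as ``BOM_UTF16_LE``.

```python
import codecs
import sys

# BOM_UTF16 matches the platform's native byte order.
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE

def sniff_bom(data):
    """Return a label for the BOM at the start of data, or None."""
    candidates = [
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),  # must precede UTF-16 LE
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, label in candidates:
        if data.startswith(bom):
            return label
    return None
```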
	301
	302	.. _codec-base-classes:
	303
	304	Codec Base Classes
	305	------------------
	306
	307	The :mod:`codecs` module defines a set of base classes which define the
	308	interface and can also be used to easily write your own codecs for use in
	309	Python.
	310
	311	Each codec has to define four interfaces to make it usable as a codec in Python:
	312	stateless encoder, stateless decoder, stream reader and stream writer. The
	313	stream reader and writers typically reuse the stateless encoder/decoder to
	314	implement the file protocols.
	315
	316	The :class:`Codec` class defines the interface for stateless encoders/decoders.
	317
[391]	318	To simplify and standardize error handling, the :meth:`~Codec.encode` and
	319	:meth:`~Codec.decode` methods may implement different error handling schemes by
[2]	320	providing the errors string argument. The following string values are defined
	321	and implemented by all standard Python codecs:
	322
[391]	323	.. tabularcolumns:: |l|L|
	324
[2]	325	+-------------------------+-----------------------------------------------+
	326	| Value                   | Meaning                                       |
	327	+=========================+===============================================+
	328	| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
	329	|                         | this is the default.                          |
	330	+-------------------------+-----------------------------------------------+
	331	| ``'ignore'``            | Ignore the character and continue with the    |
	332	|                         | next.                                         |
	333	+-------------------------+-----------------------------------------------+
	334	| ``'replace'``           | Replace with a suitable replacement           |
	335	|                         | character; Python will use the official       |
	336	|                         | U+FFFD REPLACEMENT CHARACTER for the built-in |
	337	|                         | Unicode codecs on decoding and '?' on         |
	338	|                         | encoding.                                     |
	339	+-------------------------+-----------------------------------------------+
	340	| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
	341	|                         | reference (only for encoding).                |
	342	+-------------------------+-----------------------------------------------+
	343	| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
	344	|                         | (only for encoding).                          |
	345	+-------------------------+-----------------------------------------------+
	346
	347	The set of allowed values can be extended via :func:`register_error`.
	348
	349
	350	.. _codec-objects:
	351
	352	Codec Objects
	353	^^^^^^^^^^^^^
	354
	355	The :class:`Codec` class defines these methods which also define the function
	356	interfaces of the stateless encoder and decoder:
	357
	358
	359	.. method:: Codec.encode(input[, errors])
	360
	361	Encodes the object input and returns a tuple (output object, length consumed).
	362	While codecs are not restricted to use with Unicode, in a Unicode context,
	363	encoding converts a Unicode object to a plain string using a particular
	364	character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).
	365
	366	errors defines the error handling to apply. It defaults to ``'strict'``
	367	handling.
	368
	369	The method may not store state in the :class:`Codec` instance. Use
	370	:class:`StreamWriter` for codecs which have to keep state in order to make
	371	encoding efficient.
	372
	373	The encoder must be able to handle zero length input and return an empty object
	374	of the output object type in this situation.
	375
	376
	377	.. method:: Codec.decode(input[, errors])
	378
	379	Decodes the object input and returns a tuple (output object, length consumed).
	380	In a Unicode context, decoding converts a plain string encoded using a
	381	particular character set encoding to a Unicode object.
	382
	383	input must be an object which provides the ``bf_getreadbuf`` buffer slot.
	384	Python strings, buffer objects and memory mapped files are examples of objects
	385	providing this slot.
	386
	387	errors defines the error handling to apply. It defaults to ``'strict'``
	388	handling.
	389
	390	The method may not store state in the :class:`Codec` instance. Use
	391	:class:`StreamReader` for codecs which have to keep state in order to make
	392	decoding efficient.
	393
	394	The decoder must be able to handle zero length input and return an empty object
	395	of the output object type in this situation.
	396
	397	The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
	398	the basic interface for incremental encoding and decoding. Encoding/decoding the
	399	input isn't done with one call to the stateless encoder/decoder function, but
[391]	400	with multiple calls to the
	401	:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
	402	the incremental encoder/decoder. The incremental encoder/decoder keeps track of
	403	the encoding/decoding process during method calls.
[2]	404
[391]	405	The joined output of calls to the
	406	:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
	407	the same as if all the single inputs were joined into one, and this input was
[2]	408	encoded/decoded with the stateless encoder/decoder.
	409
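The equivalence stated above can be checked with a small sketch: the input is fed to an incremental decoder one byte at a time, and the joined result is compared with a single stateless call. The sample data is arbitrary.

```python
import codecs

data = u"\u20ac100".encode("utf-8")  # the euro sign occupies three bytes

# Feed the incremental decoder one byte at a time.
decoder = codecs.getincrementaldecoder("utf-8")()
parts = [decoder.decode(data[i:i + 1]) for i in range(len(data))]
parts.append(decoder.decode(b"", final=True))
joined = u"".join(parts)

# Decode the same bytes in one stateless call.
stateless, consumed = codecs.getdecoder("utf-8")(data)
```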
	410
	411	.. _incremental-encoder-objects:
	412
	413	IncrementalEncoder Objects
	414	^^^^^^^^^^^^^^^^^^^^^^^^^^
	415
	416	.. versionadded:: 2.5
	417
	418	The :class:`IncrementalEncoder` class is used for encoding an input in multiple
	419	steps. It defines the following methods which every incremental encoder must
	420	define in order to be compatible with the Python codec registry.
	421
	422
	423	.. class:: IncrementalEncoder([errors])
	424
	425	Constructor for an :class:`IncrementalEncoder` instance.
	426
	427	All incremental encoders must provide this constructor interface. They are free
	428	to add additional keyword arguments, but only the ones defined here are used by
	429	the Python codec registry.
	430
	431	The :class:`IncrementalEncoder` may implement different error handling schemes
	432	by providing the errors keyword argument. These parameters are predefined:
	433
	434	* ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
	435
	436	* ``'ignore'`` Ignore the character and continue with the next.
	437
	438	* ``'replace'`` Replace with a suitable replacement character.
	439
	440	* ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.
	441
	442	* ``'backslashreplace'`` Replace with backslashed escape sequences.
	443
	444	The errors argument will be assigned to an attribute of the same name.
	445	Assigning to this attribute makes it possible to switch between different error
	446	handling strategies during the lifetime of the :class:`IncrementalEncoder`
	447	object.
	448
	449	The set of allowed values for the errors argument can be extended with
	450	:func:`register_error`.
	451
	452
	453	.. method:: encode(object[, final])
	454
	455	Encodes object (taking the current state of the encoder into account)
	456	and returns the resulting encoded object. If this is the last call to
	457	:meth:`encode` final must be true (the default is false).
	458
	459
	460	.. method:: reset()
	461
	462	Reset the encoder to the initial state.
	463
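The state kept by an incremental encoder can be observed with the UTF-16 codec, which emits a byte order mark only at the start of the output. A minimal sketch, with arbitrary sample input:

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

# The first call emits the BOM; later calls do not.
first = encoder.encode(u"a")
second = encoder.encode(u"b", final=True)

# After reset() the encoder behaves like a fresh instance.
encoder.reset()
again = encoder.encode(u"a", final=True)
```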
	464
	465	.. _incremental-decoder-objects:
	466
	467	IncrementalDecoder Objects
	468	^^^^^^^^^^^^^^^^^^^^^^^^^^
	469
	470	The :class:`IncrementalDecoder` class is used for decoding an input in multiple
	471	steps. It defines the following methods which every incremental decoder must
	472	define in order to be compatible with the Python codec registry.
	473
	474
	475	.. class:: IncrementalDecoder([errors])
	476
	477	Constructor for an :class:`IncrementalDecoder` instance.
	478
	479	All incremental decoders must provide this constructor interface. They are free
	480	to add additional keyword arguments, but only the ones defined here are used by
	481	the Python codec registry.
	482
	483	The :class:`IncrementalDecoder` may implement different error handling schemes
	484	by providing the errors keyword argument. These parameters are predefined:
	485
	486	* ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
	487
	488	* ``'ignore'`` Ignore the character and continue with the next.
	489
	490	* ``'replace'`` Replace with a suitable replacement character.
	491
	492	The errors argument will be assigned to an attribute of the same name.
	493	Assigning to this attribute makes it possible to switch between different error
	494	handling strategies during the lifetime of the :class:`IncrementalDecoder`
	495	object.
	496
	497	The set of allowed values for the errors argument can be extended with
	498	:func:`register_error`.
	499
	500
	501	.. method:: decode(object[, final])
	502
	503	Decodes object (taking the current state of the decoder into account)
	504	and returns the resulting decoded object. If this is the last call to
	505	:meth:`decode` final must be true (the default is false). If final is
	506	true the decoder must decode the input completely and must flush all
	507	buffers. If this isn't possible (e.g. because of incomplete byte sequences
	508	at the end of the input) it must initiate error handling just like in the
	509	stateless case (which might raise an exception).
	510
	511
	512	.. method:: reset()
	513
	514	Reset the decoder to the initial state.
	515
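The role of *final* and of the ``errors`` attribute can be sketched as follows. A truncated UTF-8 sequence is buffered silently until ``final`` is true, at which point error handling is triggered; the byte values are chosen only for illustration.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Two of the three bytes of U+20AC: nothing can be returned yet.
partial = decoder.decode(b"\xe2\x82")

# With final=True the incomplete sequence is an error under 'strict'.
try:
    decoder.decode(b"", final=True)
    truncated = False
except UnicodeDecodeError:
    truncated = True

# Discard the buffered bytes and switch the error handling strategy.
decoder.reset()
decoder.errors = "replace"
recovered = decoder.decode(b"ok\xff", final=True)
```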
	516
	517	The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
	518	working interfaces which can be used to implement new encoding submodules very
	519	easily. See :mod:`encodings.utf_8` for an example of how this is done.
	520
	521
	522	.. _stream-writer-objects:
	523
	524	StreamWriter Objects
	525	^^^^^^^^^^^^^^^^^^^^
	526
	527	The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
	528	following methods which every stream writer must define in order to be
	529	compatible with the Python codec registry.
	530
	531
	532	.. class:: StreamWriter(stream[, errors])
	533
	534	Constructor for a :class:`StreamWriter` instance.
	535
	536	All stream writers must provide this constructor interface. They are free to add
	537	additional keyword arguments, but only the ones defined here are used by the
	538	Python codec registry.
	539
	540	stream must be a file-like object open for writing binary data.
	541
	542	The :class:`StreamWriter` may implement different error handling schemes by
	543	providing the errors keyword argument. These parameters are predefined:
	544
	545	* ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
	546
	547	* ``'ignore'`` Ignore the character and continue with the next.
	548
	549	* ``'replace'`` Replace with a suitable replacement character.
	550
	551	* ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.
	552
	553	* ``'backslashreplace'`` Replace with backslashed escape sequences.
	554
	555	The errors argument will be assigned to an attribute of the same name.
	556	Assigning to this attribute makes it possible to switch between different error
	557	handling strategies during the lifetime of the :class:`StreamWriter` object.
	558
	559	The set of allowed values for the errors argument can be extended with
	560	:func:`register_error`.
	561
	562
	563	.. method:: write(object)
	564
	565	Writes the object's contents encoded to the stream.
	566
	567
	568	.. method:: writelines(list)
	569
	570	Writes the concatenated list of strings to the stream (possibly by reusing
	571	the :meth:`write` method).
	572
	573
	574	.. method:: reset()
	575
	576	Flushes and resets the codec buffers used for keeping state.
	577
	578	Calling this method should ensure that the data on the output is put into
	579	a clean state that allows appending of new fresh data without having to
	580	rescan the whole stream to recover state.
	581
	582
	583	In addition to the above methods, the :class:`StreamWriter` must also inherit
	584	all other methods and attributes from the underlying stream.
	585
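A brief sketch of a stream writer, showing :meth:`write`, :meth:`writelines` and a change of error handling through the ``errors`` attribute during the lifetime of the object. An ``io.BytesIO`` object is assumed as the underlying binary stream.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw)  # errors defaults to 'strict'

writer.write(u"abc")

# Switch the strategy on the existing writer.
writer.errors = "replace"
writer.writelines([u"d\xe9", u"f"])

written = raw.getvalue()
```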
	586
	587	.. _stream-reader-objects:
	588
	589	StreamReader Objects
	590	^^^^^^^^^^^^^^^^^^^^
	591
	592	The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
	593	following methods which every stream reader must define in order to be
	594	compatible with the Python codec registry.
	595
	596
	597	.. class:: StreamReader(stream[, errors])
	598
	599	Constructor for a :class:`StreamReader` instance.
	600
	601	All stream readers must provide this constructor interface. They are free to add
	602	additional keyword arguments, but only the ones defined here are used by the
	603	Python codec registry.
	604
	605	stream must be a file-like object open for reading (binary) data.
	606
	607	The :class:`StreamReader` may implement different error handling schemes by
	608	providing the errors keyword argument. These parameters are predefined:
	609
	610	* ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
	611
	612	* ``'ignore'`` Ignore the character and continue with the next.
	613
	614	* ``'replace'`` Replace with a suitable replacement character.
	615
	616	The errors argument will be assigned to an attribute of the same name.
	617	Assigning to this attribute makes it possible to switch between different error
	618	handling strategies during the lifetime of the :class:`StreamReader` object.
	619
	620	The set of allowed values for the errors argument can be extended with
	621	:func:`register_error`.
	622
	623
	624	.. method:: read([size[, chars[, firstline]]])
	625
	626	Decodes data from the stream and returns the resulting object.
	627
	628	chars indicates the number of characters to read from the
	629	stream. :meth:`read` will never return more than chars characters, but
	630	it might return fewer, if there are not enough characters available.
	631
	632	size indicates the approximate maximum number of bytes to read from the
	633	stream for decoding purposes. The decoder can modify this setting as
	634	appropriate. The default value -1 indicates to read and decode as much as
	635	possible. size is intended to prevent having to decode huge files in
	636	one step.
	637
	638	firstline indicates that it would be sufficient to only return the first
	639	line, if there are decoding errors on later lines.
	640
	641	The method should use a greedy read strategy meaning that it should read
	642	as much data as is allowed within the definition of the encoding and the
	643	given size, e.g. if optional encoding endings or state markers are
	644	available on the stream, these should be read too.
	645
	646	.. versionchanged:: 2.4
	647	chars argument added.
	648
	649	.. versionchanged:: 2.4.2
	650	firstline argument added.
	651
	652
	653	.. method:: readline([size[, keepends]])
	654
	655	Read one line from the input stream and return the decoded data.
	656
	657	size, if given, is passed as size argument to the stream's
[391]	658	:meth:`read` method.
[2]	659
	660	If keepends is false, line-endings will be stripped from the lines
	661	returned.
	662
	663	.. versionchanged:: 2.4
	664	keepends argument added.
	665
	666
	667	.. method:: readlines([sizehint[, keepends]])
	668
	669	Read all lines available on the input stream and return them as a list of
	670	lines.
	671
	672	Line-endings are implemented using the codec's decoder method and are
	673	included in the list entries if keepends is true.
	674
	675	sizehint, if given, is passed as the size argument to the stream's
	676	:meth:`read` method.
	677
	678
	679	.. method:: reset()
	680
	681	Resets the codec buffers used for keeping state.
	682
	683	Note that no stream repositioning should take place. This method is
	684	primarily intended to be able to recover from decoding errors.
	685
	686
	687	In addition to the above methods, the :class:`StreamReader` must also inherit
	688	all other methods and attributes from the underlying stream.
	689
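The reader methods can be sketched on a small in-memory stream. Here ``chars`` limits the number of decoded characters returned by :meth:`read`, and ``keepends`` controls whether :meth:`readline` keeps the line ending. The sample text is arbitrary.

```python
import codecs
import io

raw = io.BytesIO(u"h\xe9llo\nworld\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

# Return at most two decoded characters.
head = reader.read(chars=2)

# The rest of the first line, line ending included.
line1 = reader.readline()

# The second line with its line ending stripped.
line2 = reader.readline(keepends=False)
```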
	690	The next two base classes are included for convenience. They are not needed by
	691	the codec registry, but may prove useful in practice.
	692
	693
	694	.. _stream-reader-writer:
	695
	696	StreamReaderWriter Objects
	697	^^^^^^^^^^^^^^^^^^^^^^^^^^
	698
	699	The :class:`StreamReaderWriter` allows wrapping streams which work in both read
	700	and write modes.
	701
	702	The design is such that one can use the factory functions returned by the
	703	:func:`lookup` function to construct the instance.
	704
	705
	706	.. class:: StreamReaderWriter(stream, Reader, Writer, errors)
	707
	708	Creates a :class:`StreamReaderWriter` instance. stream must be a file-like
	709	object. Reader and Writer must be factory functions or classes providing the
	710	:class:`StreamReader` and :class:`StreamWriter` interface respectively. Error handling
	711	is done in the same way as defined for the stream readers and writers.
	712
	713	:class:`StreamReaderWriter` instances define the combined interfaces of
	714	:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
	715	methods and attributes from the underlying stream.
	716
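A minimal sketch of building a :class:`StreamReaderWriter` from the factories returned by :func:`lookup`, assuming an ``io.BytesIO`` object as the read/write stream:

```python
import codecs
import io

info = codecs.lookup("utf-8")
stream = io.BytesIO()

# Combine the codec's reader and writer factories around one stream.
rw = codecs.StreamReaderWriter(
    stream, info.streamreader, info.streamwriter, "strict"
)

rw.write(u"\u03b1\u03b2\u03b3\n")
rw.seek(0)  # rewind the underlying stream before reading
text = rw.read()
```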
	717
	718	.. _stream-recoder-objects:
	719
	720	StreamRecoder Objects
	721	^^^^^^^^^^^^^^^^^^^^^
	722
	723	The :class:`StreamRecoder` class provides a frontend-backend view of encoding data
	724	which is sometimes useful when dealing with different encoding environments.
	725
	726	The design is such that one can use the factory functions returned by the
	727	:func:`lookup` function to construct the instance.
	728
	729
	730	.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)
	731
	732	Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
	733	encode and decode work on the frontend (the input to :meth:`read` and output
	734	of :meth:`write`) while Reader and Writer work on the backend (reading and
	735	writing to the stream).
	736
	737	You can use these objects to do transparent direct recodings from e.g. Latin-1
	738	to UTF-8 and back.
	739
	740	stream must be a file-like object.
	741
	742	encode, decode must adhere to the :class:`Codec` interface. Reader,
	743	Writer must be factory functions or classes providing objects of the
	744	:class:`StreamReader` and :class:`StreamWriter` interface respectively.
	745
	746	encode and decode are needed for the frontend translation, Reader and
	747	Writer for the backend translation. The intermediate format used is
	748	determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode
	749	as the intermediate encoding.
	750
	751	Error handling is done in the same way as defined for the stream readers and
	752	writers.
	753
	754
	755	:class:`StreamRecoder` instances define the combined interfaces of
	756	:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
	757	methods and attributes from the underlying stream.
	758
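The frontend/backend split can be sketched for the reading direction: the backend stream holds UTF-8, while the caller of :meth:`read` receives Latin-1 encoded bytes. The in-memory stream and sample text are assumptions made for this example.

```python
import codecs
import io

latin1 = codecs.lookup("latin-1")  # frontend codec
utf8 = codecs.lookup("utf-8")      # backend codec

stream = io.BytesIO(u"d\xe9j\xe0 vu".encode("utf-8"))

recoder = codecs.StreamRecoder(
    stream,
    latin1.encode, latin1.decode,         # frontend translation
    utf8.streamreader, utf8.streamwriter, # backend translation
    "strict",
)

# The UTF-8 data is decoded by the Reader, then re-encoded as Latin-1.
frontend_bytes = recoder.read()
```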
	759
	760	.. _encodings-overview:
	761
	762	Encodings and Unicode
	763	---------------------
	764
	765	Unicode strings are stored internally as sequences of codepoints (to be precise
[391]	766	as :c:type:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
	767	via ``--enable-unicode=ucs2`` or ``--enable-unicode=ucs4``, with the
	768	former being the default) :c:type:`Py_UNICODE` is either a 16-bit or 32-bit data
[2]	769	type. Once a Unicode object is used outside of CPU and memory, CPU endianness
	770	and how these arrays are stored as bytes become an issue. Transforming a
	771	unicode object into a sequence of bytes is called encoding and recreating the
	772	unicode object from the sequence of bytes is known as decoding. There are many
	773	different methods for how this transformation can be done (these methods are
	774	also called encodings). The simplest method is to map the codepoints 0-255 to
	775	the bytes ``0x0``-``0xff``. This means that a unicode object that contains
	776	codepoints above ``U+00FF`` can't be encoded with this method (which is called
	777	``'latin-1'`` or ``'iso-8859-1'``). :meth:`unicode.encode` will raise a
	778	:exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
	779	codec can't encode character u'\u1234' in position 3: ordinal not in
	780	range(256)``.
	781
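The limitation described above can be reproduced in a few lines. The exception object carries the position of the offending character; the sample strings are arbitrary.

```python
# Codepoints up to U+00FF map directly to single bytes.
ok = u"\xff".encode("latin-1")

# A codepoint above U+00FF cannot be represented.
error = None
try:
    u"abc\u1234".encode("latin-1")
except UnicodeEncodeError as exc:
    error = exc
```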
	782	There's another group of encodings (the so called charmap encodings) that choose
	783	a different subset of all unicode code points and how these codepoints are
	784	mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
	785	e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
	786	Windows). There's a string constant with 256 characters that shows you which
	787	character is mapped to which byte value.
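The mapping table can also be inspected from Python. The following sketch assumes that the generated module exposes its table under the name ``decoding_table``, as the charmap codec modules shipped with CPython do:

```python
import encodings.cp1252

table = encodings.cp1252.decoding_table
entries = len(table)                      # one entry per byte value

# Byte 0x80 is a control character in latin-1 but the EURO SIGN in cp1252.
euro = b'\x80'.decode('cp1252')
same_as_table = (table[0x80] == euro)
```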
	788
[391]	789	All of these encodings can only encode 256 of the 1114112 codepoints
defined in unicode. A simple and straightforward way to store each Unicode
code point is to use four consecutive bytes per codepoint. There are two
	792	possibilities: store the bytes in big endian or in little endian order. These
	793	two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
	794	disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
	795	will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
	796	problem: bytes will always be in natural endianness. When these bytes are read
[2]	797	by a CPU with a different endianness, then bytes have to be swapped though. To
[391]	798	be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
	799	there's the so called BOM ("Byte Order Mark"). This is the Unicode character
	800	``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
	801	byte sequence. The byte swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE``, the bytes have to be swapped on decoding.
[391]	805	Unfortunately the character ``U+FEFF`` had a second purpose as
	806	a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
[2]	807	a word to be split. It can e.g. be used to give hints to a ligature algorithm.
	808	With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
	809	deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
[391]	810	Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
[2]	811	it's a device to determine the storage layout of the encoded bytes, and vanishes
	812	once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
	813	NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
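The two roles of ``U+FEFF`` can be observed with the standard codecs. This is a small sketch using an arbitrary one-character string; the ``BOM_*`` constants come from the :mod:`codecs` module:

```python
import codecs

text = u'a'
big_endian = text.encode('utf-32-be')      # most significant byte first
little_endian = text.encode('utf-32-le')   # least significant byte first

# The generic codec writes a BOM in the machine's native byte order.
with_bom = text.encode('utf-16')
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

data = codecs.BOM_UTF16_BE + text.encode('utf-16-be')

# The generic codec uses the BOM to pick the byte order, then drops it.
decoded = data.decode('utf-16')

# With an explicit byte order, U+FEFF is decoded like any other character.
kept = data.decode('utf-16-be')
```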
	814
There's another encoding that is able to encode the full range of Unicode
	816	characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
	817	with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
[391]	818	parts: marker bits (the most significant bits) and payload bits. The marker bits
	819	are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
[2]	820	encoded like this (with x being payload bits, which when concatenated give the
	821	Unicode character):
	822
+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+
	834
	835	The least significant bit of the Unicode character is the rightmost x bit.
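The bit layout in the table can be written out by hand. The function below is only an illustrative sketch (``utf8_encode_code_point`` is a hypothetical helper, not part of the :mod:`codecs` module, and it does not reject surrogate code points as a real encoder might):

```python
def utf8_encode_code_point(cp):
    """Encode one code point (an integer) following the table above."""
    if cp < 0x80:
        data = [cp]                                   # 0xxxxxxx
    elif cp < 0x800:
        data = [0xC0 | (cp >> 6),                     # 110xxxxx
                0x80 | (cp & 0x3F)]                   # 10xxxxxx
    elif cp < 0x10000:
        data = [0xE0 | (cp >> 12),                    # 1110xxxx
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    elif cp < 0x110000:
        data = [0xF0 | (cp >> 18),                    # 11110xxx
                0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    else:
        raise ValueError('code point out of range')
    return bytes(bytearray(data))


euro_by_hand = utf8_encode_code_point(0x20AC)
euro_builtin = u'\u20ac'.encode('utf-8')
```

Comparing the result with the built-in ``utf-8`` codec, as in the last two lines, is a quick way to check the bit manipulation.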
	836
	837	As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
	838	the decoded Unicode string (even if it's the first character) is treated as a
	839	``ZERO WIDTH NO-BREAK SPACE``.
	840
	841	Without external information it's impossible to reliably determine which
	842	encoding was used for encoding a Unicode string. Each charmap encoding can
	843	decode any random byte sequence. However that's not possible with UTF-8, as
	844	UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
	845	sequences. To increase the reliability with which a UTF-8 encoding can be
	846	detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
	847	``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
	848	is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
	849	sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
	850	that any charmap encoded file starts with these byte values (which would e.g.
	851	map to
	852
   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK
	856
[391]	857	in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
[2]	858	correctly guessed from the byte sequence. So here the BOM is not used to be able
	859	to determine the byte order used for generating the byte sequence, but as a
	860	signature that helps in guessing the encoding. On encoding the utf-8-sig codec
	861	will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
[391]	862	decoding ``utf-8-sig`` will skip those three bytes if they appear as the first
	863	three bytes in the file. In UTF-8, the use of the BOM is discouraged and
	864	should generally be avoided.
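A short sketch of the difference between the two codecs, using an arbitrary ASCII string:

```python
import codecs

data = u'abc'.encode('utf-8-sig')
has_signature = data.startswith(codecs.BOM_UTF8)

skipped = data.decode('utf-8-sig')        # signature is removed
kept = data.decode('utf-8')               # U+FEFF stays in the result
optional = b'abc'.decode('utf-8-sig')     # input without a signature is fine
```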
[2]	865
	866
	867	.. _standard-encodings:
	868
	869	Standard Encodings
	870	------------------
	871
	872	Python comes with a number of codecs built-in, either implemented as C functions
	873	or with dictionaries as mapping tables. The following table lists the codecs by
	874	name, together with a few common aliases, and the languages for which the
	875	encoding is likely used. Neither the list of aliases nor the list of languages
	876	is meant to be exhaustive. Notice that spelling alternatives that only differ in
	877	case or use a hyphen instead of an underscore are also valid aliases; therefore,
	878	e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.
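Alias resolution can be checked with :func:`codecs.lookup`. The sketch below only compares the resolved names with each other, since the exact spelling of the canonical name reported by ``CodecInfo.name`` may differ between Python versions:

```python
import codecs

aliases = ('utf_8', 'utf-8', 'UTF-8', 'U8', 'utf8')
names = [codecs.lookup(alias).name for alias in aliases]
all_same = len(set(names)) == 1
```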
	879
	880	Many of the character sets support the same languages. They vary in individual
	881	characters (e.g. whether the EURO SIGN is supported or not), and in the
	882	assignment of characters to code positions. For the European languages in
	883	particular, the following variants typically exist:
	884
	885	* an ISO 8859 codeset
	886
	887	* a Microsoft Windows code page, which is typically derived from a 8859 codeset,
	888	but replaces control characters with additional graphic characters
	889
	890	* an IBM EBCDIC code page
	891
	892	* an IBM PC code page, which is ASCII compatible
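The differences between these variants show up when the same character is encoded with several codecs. A small sketch, with the characters chosen only for illustration:

```python
euro = u'\u20ac'
in_cp1252 = euro.encode('cp1252')         # Windows code page
in_latin9 = euro.encode('iso8859_15')     # ISO 8859 codeset with EURO SIGN

try:
    euro.encode('latin_1')
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False                     # ISO 8859-1 has no EURO SIGN

ebcdic_a = u'A'.encode('cp500')           # EBCDIC: not ASCII compatible
pc_a = u'A'.encode('cp437')               # IBM PC: ASCII compatible
```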
	893
.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp720           |                                | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp858           | 858, IBM858                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280, euc- | Simplified Chinese             |
|                 | cn, euccn, eucgb2312-cn,       |                                |
|                 | gb2312-1980, gb2312-80, iso-   |                                |
|                 | ir-58                          |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | West Europe                    |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman                       | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages (BMP only)       |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages (BMP only)       |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8                  | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+
	1102
[391]	1103	Python Specific Encodings
	1104	-------------------------
[2]	1105
[391]	1106	A number of predefined codecs are specific to Python, so their codec names have
	1107	no meaning outside Python. These are listed in the tables below based on the
	1108	expected input and output types (note that while text encodings are the most
	1109	common use case for codecs, the underlying codec infrastructure supports
	1110	arbitrary data transforms rather than just text encodings). For asymmetric
	1111	codecs, the stated purpose describes the encoding direction.
[2]	1112
[391]	1113	The following codecs provide unicode-to-str encoding [#encoding-note]_ and
	1114	str-to-unicode decoding [#decoding-note]_, similar to the Unicode text
	1115	encodings.
[2]	1116
.. tabularcolumns:: |l|L|L|

+--------------------+---------------------------+---------------------------+
| Codec              | Aliases                   | Purpose                   |
+====================+===========================+===========================+
| idna               |                           | Implements :rfc:`3490`,   |
|                    |                           | see also                  |
|                    |                           | :mod:`encodings.idna`     |
+--------------------+---------------------------+---------------------------+
| mbcs               | dbcs                      | Windows only: Encode      |
|                    |                           | operand according to the  |
|                    |                           | ANSI codepage (CP_ACP)    |
+--------------------+---------------------------+---------------------------+
| palmos             |                           | Encoding of PalmOS 3.5    |
+--------------------+---------------------------+---------------------------+
| punycode           |                           | Implements :rfc:`3492`    |
+--------------------+---------------------------+---------------------------+
| raw_unicode_escape |                           | Produce a string that is  |
|                    |                           | suitable as raw Unicode   |
|                    |                           | literal in Python source  |
|                    |                           | code                      |
+--------------------+---------------------------+---------------------------+
| rot_13             | rot13                     | Returns the Caesar-cypher |
|                    |                           | encryption of the operand |
+--------------------+---------------------------+---------------------------+
| undefined          |                           | Raise an exception for    |
|                    |                           | all conversions. Can be   |
|                    |                           | used as the system        |
|                    |                           | encoding if no automatic  |
|                    |                           | :term:`coercion` between  |
|                    |                           | byte and Unicode strings  |
|                    |                           | is desired.               |
+--------------------+---------------------------+---------------------------+
| unicode_escape     |                           | Produce a string that is  |
|                    |                           | suitable as Unicode       |
|                    |                           | literal in Python source  |
|                    |                           | code                      |
+--------------------+---------------------------+---------------------------+
| unicode_internal   |                           | Return the internal       |
|                    |                           | representation of the     |
|                    |                           | operand                   |
+--------------------+---------------------------+---------------------------+
	1159
.. versionadded:: 2.3
   The ``idna`` and ``punycode`` encodings.
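A brief sketch of two of the text codecs from the table above (the sample strings are arbitrary):

```python
# unicode_escape produces the body of a Python Unicode literal.
escaped = u'caf\xe9 \u20ac'.encode('unicode_escape')
restored = escaped.decode('unicode_escape')

# punycode moves the non-ASCII characters into an ASCII suffix.
puny = u'm\xfcnchen'.encode('punycode')
unpuny = puny.decode('punycode')
```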
	1162
[391]	1163	The following codecs provide str-to-str encoding and decoding
	1164	[#decoding-note]_.
[2]	1165
.. tabularcolumns:: |l|L|L|L|

+--------------------+---------------------------+---------------------------+------------------------------+
| Codec              | Aliases                   | Purpose                   | Encoder/decoder              |
+====================+===========================+===========================+==============================+
| base64_codec       | base64, base-64           | Convert operand to MIME   | :meth:`base64.b64encode`,    |
|                    |                           | base64 (the result always | :meth:`base64.b64decode`     |
|                    |                           | includes a trailing       |                              |
|                    |                           | ``'\n'``)                 |                              |
+--------------------+---------------------------+---------------------------+------------------------------+
| bz2_codec          | bz2                       | Compress the operand      | :meth:`bz2.compress`,        |
|                    |                           | using bz2                 | :meth:`bz2.decompress`       |
+--------------------+---------------------------+---------------------------+------------------------------+
| hex_codec          | hex                       | Convert operand to        | :meth:`base64.b16encode`,    |
|                    |                           | hexadecimal               | :meth:`base64.b16decode`     |
|                    |                           | representation, with two  |                              |
|                    |                           | digits per byte           |                              |
+--------------------+---------------------------+---------------------------+------------------------------+
| quopri_codec       | quopri, quoted-printable, | Convert operand to MIME   | :meth:`quopri.encodestring`, |
|                    | quotedprintable           | quoted printable          | :meth:`quopri.decodestring`  |
+--------------------+---------------------------+---------------------------+------------------------------+
| string_escape      |                           | Produce a string that is  |                              |
|                    |                           | suitable as string        |                              |
|                    |                           | literal in Python source  |                              |
|                    |                           | code                      |                              |
+--------------------+---------------------------+---------------------------+------------------------------+
| uu_codec           | uu                        | Convert the operand using | :meth:`uu.encode`,           |
|                    |                           | uuencode                  | :meth:`uu.decode`            |
+--------------------+---------------------------+---------------------------+------------------------------+
| zlib_codec         | zip, zlib                 | Compress the operand      | :meth:`zlib.compress`,       |
|                    |                           | using gzip                | :meth:`zlib.decompress`      |
+--------------------+---------------------------+---------------------------+------------------------------+
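These byte-oriented codecs can be reached through :func:`codecs.encode` and :func:`codecs.decode`. The sketch below uses the full codec names from the first column and an arbitrary payload; it deliberately avoids the ``str.encode`` shortcut so that it should also run on Python 3:

```python
import codecs

payload = b'hello world'
as_hex = codecs.encode(payload, 'hex_codec')
as_base64 = codecs.encode(payload, 'base64_codec')   # ends with a newline

bulk = payload * 100
compressed = codecs.encode(bulk, 'zlib_codec')
restored = codecs.decode(compressed, 'zlib_codec')
```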
	1198
.. [#encoding-note] str objects are also accepted as input in place of unicode
   objects. They are implicitly converted to unicode by decoding them using
   the default encoding. If this conversion fails, it may lead to encoding
   operations raising :exc:`UnicodeDecodeError`.

.. [#decoding-note] unicode objects are also accepted as input in place of str
   objects. They are implicitly converted to str by encoding them using the
   default encoding. If this conversion fails, it may lead to decoding
   operations raising :exc:`UnicodeEncodeError`.
	1208
	1209
[2]	1210	:mod:`encodings.idna` --- Internationalized Domain Names in Applications
	1211	------------------------------------------------------------------------
	1212
.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis
	1216
	1217	.. versionadded:: 2.3
	1218
This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.
	1223
These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application, if possible invisibly to
the user: the application should transparently convert Unicode domain labels to
IDNA on the wire, and convert ACE labels back to Unicode before presenting them
to the user.
	1234
[391]	1235	Python supports this conversion in several ways: the ``idna`` codec performs
	1236	conversion between Unicode and ACE, separating an input string into labels
	1237	based on the separator characters defined in `section 3.1`_ (1) of :rfc:`3490`
	1238	and converting each label to ACE as required, and conversely separating an input
	1239	byte string into labels based on the ``.`` separator and converting any ACE
	1240	labels found into unicode. Furthermore, the :mod:`socket` module
[2]	1241	transparently converts Unicode host names to ACE, so that applications need not
	1242	be concerned about converting host names themselves when they pass them to the
	1243	socket module. On top of that, modules that have host names as function
	1244	parameters, such as :mod:`httplib` and :mod:`ftplib`, accept Unicode host names
	1245	(:mod:`httplib` then also transparently sends an IDNA hostname in the
	1246	:mailheader:`Host` field if it sends that field at all).
	1247
[391]	1248	.. _section 3.1: http://tools.ietf.org/html/rfc3490#section-3.1
	1249
[2]	1250	When receiving host names from the wire (such as in reverse name lookup), no
	1251	automatic conversion to Unicode is performed: Applications wishing to present
	1252	such host names to the user should decode them to Unicode.
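The conversion in both directions can be sketched with the ``idna`` codec, using the example domain name from above. Note that the round trip does not restore the original capitalisation, because nameprep case-folds the non-ASCII label:

```python
name = u'www.Alliancefran\xe7aise.nu'

ace = name.encode('idna')       # what goes on the wire
shown = ace.decode('idna')      # what is presented to the user
```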
	1253
	1254	The module :mod:`encodings.idna` also implements the nameprep procedure, which
	1255	performs certain normalizations on host names, to achieve case-insensitivity of
	1256	international domain names, and to unify similar characters. The nameprep
	1257	functions can be used directly if desired.
	1258
	1259
.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.
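A minimal sketch of calling the three functions directly on single labels (not on full domain names); the labels are arbitrary examples:

```python
from encodings import idna

# nameprep case-folds and normalizes a single label.
prepared = idna.nameprep(u'Alliancefran\xe7aise')

# ToASCII adds the "xn--" ACE prefix when the label is not pure ASCII.
ascii_label = idna.ToASCII(u'm\xfcnchen')

# ToUnicode reverses the conversion.
unicode_label = idna.ToUnicode(ascii_label)
```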
	1275
	1276
	1277	:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
	1278	-------------------------------------------------------------
	1279
.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald
	1283
	1284	.. versionadded:: 2.5
	1285
	1286	This module implements a variant of the UTF-8 codec: On encoding a UTF-8 encoded
	1287	BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this
	1288	is only done once (on the first write to the byte stream). For decoding an
	1289	optional UTF-8 encoded BOM at the start of the data will be skipped.
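The "only once" behaviour of the stateful encoder can be observed with the incremental codec classes. A small sketch, feeding two arbitrary one-character chunks:

```python
import codecs

encoder = codecs.getincrementalencoder('utf-8-sig')()
first = encoder.encode(u'a')     # BOM is written before the first chunk
second = encoder.encode(u'b')    # later chunks carry no BOM

decoder = codecs.getincrementaldecoder('utf-8-sig')()
text = decoder.decode(first) + decoder.decode(second, True)
```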
	1290
