libcodecs.tex

Last change on this file was 3225, checked in by bird, 18 years ago
Python 2.5
File size: 46.8 KB

1	\section{\module{codecs} ---
2	Codec registry and base classes}
3
4	\declaremodule{standard}{codecs}
5	\modulesynopsis{Encode and decode data and streams.}
6	\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
7	\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
8	\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}
9
10	\index{Unicode}
11	\index{Codecs}
12	\indexii{Codecs}{encode}
13	\indexii{Codecs}{decode}
14	\index{streams}
15	\indexii{stackable}{streams}
16
17
18	This module defines base classes for standard Python codecs (encoders
19	and decoders) and provides access to the internal Python codec
20	registry which manages the codec and error handling lookup process.
21
22	It defines the following functions:
23
24	\begin{funcdesc}{register}{search_function}
25	Register a codec search function. Search functions are expected to
26	take one argument, the encoding name in all lower case letters, and
27	return a \class{CodecInfo} object having the following attributes:
28
29	\begin{itemize}
30	\item \code{name} The name of the encoding;
31	\item \code{encode} The stateless encoding function;
32	\item \code{decode} The stateless decoding function;
33	\item \code{incrementalencoder} An incremental encoder class or factory function;
34	\item \code{incrementaldecoder} An incremental decoder class or factory function;
35	\item \code{streamwriter} A stream writer class or factory function;
36	\item \code{streamreader} A stream reader class or factory function.
37	\end{itemize}
38
39	The various functions or classes take the following arguments:
40
41	\var{encode} and \var{decode}: These must be functions or methods
42	which have the same interface as the
43	\method{encode()}/\method{decode()} methods of Codec instances (see
44	Codec Interface). The functions/methods are expected to work in a
45	stateless mode.
46
47	\var{incrementalencoder} and \var{incrementaldecoder}: These have to be
48	factory functions providing the following interface:
49
50	\code{factory(\var{errors}='strict')}
51
52	The factory functions must return objects providing the interfaces
53	defined by the base classes \class{IncrementalEncoder} and
54	\class{IncrementalDecoder}, respectively. Incremental codecs can maintain
55	state.
56
57	\var{streamreader} and \var{streamwriter}: These have to be
58	factory functions providing the following interface:
59
60	\code{factory(\var{stream}, \var{errors}='strict')}
61
62	The factory functions must return objects providing the interfaces
63	defined by the base classes \class{StreamReader} and
64	\class{StreamWriter}, respectively. Stream codecs can maintain
65	state.
66
67	Possible values for errors are \code{'strict'} (raise an exception
68	in case of an encoding error), \code{'replace'} (replace malformed
69	data with a suitable replacement marker, such as \character{?}),
70	\code{'ignore'} (ignore malformed data and continue without further
71	notice), \code{'xmlcharrefreplace'} (replace with the appropriate XML
72	character reference (for encoding only)) and \code{'backslashreplace'}
73	(replace with backslashed escape sequences (for encoding only)) as
74	well as any other error handling name defined via
75	\function{register_error()}.
76
77	In case a search function cannot find a given encoding, it should
78	return \code{None}.
79	\end{funcdesc}
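
As a rough illustration of the search function protocol, the sketch below
registers a search function for an invented encoding name, \code{demolatin1},
that reuses the functions of the existing \code{latin-1} codec. It assumes a
current Python 3 interpreter (text is \class{str}, encoded data is
\class{bytes}) and that interpreter's \class{CodecInfo} keyword arguments;
\code{search} and \code{demolatin1} are hypothetical names chosen for the
example.

```python
import codecs


def search(name):
    """Hypothetical search function: handles only the made-up name 'demolatin1'."""
    # The registry passes the encoding name in lower case.
    if name != "demolatin1":
        return None  # let other search functions try
    base = codecs.lookup("latin-1")
    return codecs.CodecInfo(
        name="demolatin1",
        encode=base.encode,
        decode=base.decode,
        incrementalencoder=base.incrementalencoder,
        incrementaldecoder=base.incrementaldecoder,
        streamreader=base.streamreader,
        streamwriter=base.streamwriter,
    )


codecs.register(search)

# The new name can now be used wherever an encoding name is accepted.
encoded = "\u00fcber".encode("demolatin1")
```

Because the function returns \code{None} for every other name, lookups for
unrelated encodings fall through to the remaining search functions.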
80
81	\begin{funcdesc}{lookup}{encoding}
82	Looks up the codec info in the Python codec registry and returns a
83	\class{CodecInfo} object as defined above.
84
85	Encodings are first looked up in the registry's cache. If not found,
86	the list of registered search functions is scanned. If no \class{CodecInfo}
87	object is found, a \exception{LookupError} is raised. Otherwise, the
88	\class{CodecInfo} object is stored in the cache and returned to the caller.
89	\end{funcdesc}
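
A minimal sketch of a lookup, assuming a current Python 3 interpreter rather
than the release this file documents. It shows the stateless functions of the
returned \class{CodecInfo} object and the \exception{LookupError} raised for an
unknown name; \code{no-such-encoding} is a deliberately invalid name.

```python
import codecs

info = codecs.lookup("UTF-8")  # the name is matched case-insensitively

# The stateless functions return a tuple (output object, length consumed).
data, consumed = info.encode("caf\u00e9")
text, used = info.decode(data)

# An encoding that no search function recognizes raises LookupError.
try:
    codecs.lookup("no-such-encoding")
    unknown_found = True
except LookupError:
    unknown_found = False
```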
90
91	To simplify access to the various codecs, the module provides these
92	additional functions which use \function{lookup()} for the codec
93	lookup:
94
95	\begin{funcdesc}{getencoder}{encoding}
96	Look up the codec for the given encoding and return its encoder
97	function.
98
99	Raises a \exception{LookupError} in case the encoding cannot be found.
100	\end{funcdesc}
101
102	\begin{funcdesc}{getdecoder}{encoding}
103	Look up the codec for the given encoding and return its decoder
104	function.
105
106	Raises a \exception{LookupError} in case the encoding cannot be found.
107	\end{funcdesc}
108
109	\begin{funcdesc}{getincrementalencoder}{encoding}
110	Look up the codec for the given encoding and return its incremental encoder
111	class or factory function.
112
113	Raises a \exception{LookupError} in case the encoding cannot be found or the
114	codec doesn't support an incremental encoder.
115	\versionadded{2.5}
116	\end{funcdesc}
117
118	\begin{funcdesc}{getincrementaldecoder}{encoding}
119	Look up the codec for the given encoding and return its incremental decoder
120	class or factory function.
121
122	Raises a \exception{LookupError} in case the encoding cannot be found or the
123	codec doesn't support an incremental decoder.
124	\versionadded{2.5}
125	\end{funcdesc}
126
127	\begin{funcdesc}{getreader}{encoding}
128	Look up the codec for the given encoding and return its StreamReader
129	class or factory function.
130
131	Raises a \exception{LookupError} in case the encoding cannot be found.
132	\end{funcdesc}
133
134	\begin{funcdesc}{getwriter}{encoding}
135	Look up the codec for the given encoding and return its StreamWriter
136	class or factory function.
137
138	Raises a \exception{LookupError} in case the encoding cannot be found.
139	\end{funcdesc}
140
141	\begin{funcdesc}{register_error}{name, error_handler}
142	Register the error handling function \var{error_handler} under the
143	name \var{name}. \var{error_handler} will be called during encoding
144	and decoding in case of an error, when \var{name} is specified as the
145	errors parameter.
146
147	For encoding \var{error_handler} will be called with a
148	\exception{UnicodeEncodeError} instance, which contains information about
149	the location of the error. The error handler must either raise this or
150	a different exception or return a tuple with a replacement for the
151	unencodable part of the input and a position where encoding should
152	continue. The encoder will encode the replacement and continue encoding
153	the original input at the specified position. Negative position values
154	will be treated as being relative to the end of the input string. If the
155	resulting position is out of bounds, an \exception{IndexError} will be raised.
156
157	Decoding and translating work similarly, except that \exception{UnicodeDecodeError}
158	or \exception{UnicodeTranslateError} will be passed to the handler and
159	that the replacement from the error handler will be put into the output
160	directly.
161	\end{funcdesc}
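
The handler contract described above can be sketched as follows, assuming a
current Python 3 interpreter. The handler name \code{demo_hex} and the function
\code{hex_placeholder} are invented for the example; the handler replaces each
unencodable character with its code point in angle brackets and resumes after
the failing range.

```python
import codecs


def hex_placeholder(exc):
    """Hypothetical handler: replace unencodable characters with <XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only supports encoding
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<%04X>" % ord(ch) for ch in bad)
    # Return the replacement and the position where encoding should continue.
    return replacement, exc.end


codecs.register_error("demo_hex", hex_placeholder)

result = "a\u20acb".encode("ascii", "demo_hex")
```

The replacement text is itself encoded by the codec, so it has to consist of
characters the target encoding can represent.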
162
163	\begin{funcdesc}{lookup_error}{name}
164	Return the error handler previously registered under the name \var{name}.
165
166	Raises a \exception{LookupError} in case the handler cannot be found.
167	\end{funcdesc}
168
169	\begin{funcdesc}{strict_errors}{exception}
170	Implements the \code{strict} error handling.
171	\end{funcdesc}
172
173	\begin{funcdesc}{replace_errors}{exception}
174	Implements the \code{replace} error handling.
175	\end{funcdesc}
176
177	\begin{funcdesc}{ignore_errors}{exception}
178	Implements the \code{ignore} error handling.
179	\end{funcdesc}
180
181	\begin{funcdesc}{xmlcharrefreplace_errors}{exception}
182	Implements the \code{xmlcharrefreplace} error handling.
183	\end{funcdesc}
184
185	\begin{funcdesc}{backslashreplace_errors}{exception}
186	Implements the \code{backslashreplace} error handling.
187	\end{funcdesc}
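
To compare the predefined error handling schemes side by side, the following
sketch encodes one string containing a non-ASCII character with each of them.
It assumes a current Python 3 interpreter; the sample text is arbitrary.

```python
import codecs

text = "sn\u00f8"  # the last character cannot be encoded as ASCII

results = {
    name: text.encode("ascii", name)
    for name in ("replace", "ignore", "xmlcharrefreplace", "backslashreplace")
}

# 'strict' raises an exception instead of substituting anything.
try:
    text.encode("ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# lookup_error() hands back the same callable that the module exports.
same_handler = codecs.lookup_error("replace") is codecs.replace_errors
```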
188
189	To simplify working with encoded files or streams, the module
190	also defines these utility functions:
191
192	\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
193	errors\optional{, buffering}}}}
194	Open an encoded file using the given \var{mode} and return
195	a wrapped version providing transparent encoding/decoding.
196
197	\note{The wrapped version will only accept the object format
198	defined by the codecs, i.e.\ Unicode objects for most built-in
199	codecs. Output is also codec-dependent and will usually be Unicode as
200	well.}
201
202	\var{encoding} specifies the encoding which is to be used for the
203	file.
204
205	\var{errors} may be given to define the error handling. It defaults
206	to \code{'strict'} which causes a \exception{ValueError} to be raised
207	in case an encoding error occurs.
208
209	\var{buffering} has the same meaning as for the built-in
210	\function{open()} function. It defaults to line buffered.
211	\end{funcdesc}
212
213	\begin{funcdesc}{EncodedFile}{file, input\optional{,
214	output\optional{, errors}}}
215	Return a wrapped version of file which provides transparent
216	encoding translation.
217
218	Strings written to the wrapped file are interpreted according to the
219	given \var{input} encoding and then written to the original file as
220	strings using the \var{output} encoding. The intermediate encoding will
221	usually be Unicode but depends on the specified codecs.
222
223	If \var{output} is not given, it defaults to \var{input}.
224
225	\var{errors} may be given to define the error handling. It defaults to
226	\code{'strict'}, which causes \exception{ValueError} to be raised in case
227	an encoding error occurs.
228	\end{funcdesc}
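
A small sketch of the translation performed by \function{EncodedFile()},
assuming a current Python 3 interpreter, where the wrapper accepts and returns
\class{bytes} and an in-memory \class{io.BytesIO} object stands in for a real
file. Latin-1 data is written to the wrapper and stored as UTF-8.

```python
import codecs
import io

raw = io.BytesIO()  # stands in for a file opened in binary mode
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")

# Bytes written to the wrapper are interpreted as Latin-1 ...
wrapped.write("caf\u00e9".encode("latin-1"))
# ... and reach the underlying file encoded as UTF-8.
stored = raw.getvalue()

# Reading goes the other way: UTF-8 in the file, Latin-1 for the caller.
raw.seek(0)
read_back = wrapped.read()
```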
229
230	\begin{funcdesc}{iterencode}{iterable, encoding\optional{, errors}}
231	Uses an incremental encoder to iteratively encode the input provided by
232	\var{iterable}. This function is a generator. \var{errors} (as well as
233	any other keyword argument) is passed through to the incremental encoder.
234	\versionadded{2.5}
235	\end{funcdesc}
236
237	\begin{funcdesc}{iterdecode}{iterable, encoding\optional{, errors}}
238	Uses an incremental decoder to iteratively decode the input provided by
239	\var{iterable}. This function is a generator. \var{errors} (as well as
240	any other keyword argument) is passed through to the incremental decoder.
241	\versionadded{2.5}
242	\end{funcdesc}
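
The two generators can be exercised as in the sketch below, which assumes a
current Python 3 interpreter. The second half deliberately splits a UTF-8 byte
sequence across two chunks to show that the incremental decoder carries state
from one chunk to the next.

```python
import codecs

# Encode a sequence of text chunks one at a time.
chunks = ["Gr\u00fc", "\u00dfe"]
encoded = b"".join(codecs.iterencode(chunks, "utf-8"))

# The two bytes of U+00FC are split across the chunk boundary here.
pieces = [b"Gr\xc3", b"\xbc\xc3\x9fe"]
decoded = "".join(codecs.iterdecode(pieces, "utf-8"))
```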
243
244	The module also provides the following constants which are useful
245	for reading and writing to platform dependent files:
246
247	\begin{datadesc}{BOM}
248	\dataline{BOM_BE}
249	\dataline{BOM_LE}
250	\dataline{BOM_UTF8}
251	\dataline{BOM_UTF16}
252	\dataline{BOM_UTF16_BE}
253	\dataline{BOM_UTF16_LE}
254	\dataline{BOM_UTF32}
255	\dataline{BOM_UTF32_BE}
256	\dataline{BOM_UTF32_LE}
257	These constants define various encodings of the Unicode byte order mark
258	(BOM) used in UTF-16 and UTF-32 data streams to indicate the byte order
259	used in the stream or file and in UTF-8 as a Unicode signature.
260	\constant{BOM_UTF16} is either \constant{BOM_UTF16_BE} or
261	\constant{BOM_UTF16_LE} depending on the platform's native byte order,
262	\constant{BOM} is an alias for \constant{BOM_UTF16}, \constant{BOM_LE}
263	for \constant{BOM_UTF16_LE} and \constant{BOM_BE} for \constant{BOM_UTF16_BE}.
264	The others represent the BOM in UTF-8 and UTF-32 encodings.
265	\end{datadesc}
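
One possible use of these constants is to guess an encoding from the first
bytes of a file. The helper \code{detect_bom} below is a hypothetical sketch,
not part of the module, and assumes a current Python 3 interpreter. The UTF-32
marks are tested before the UTF-16 marks because \constant{BOM_UTF32_LE} begins
with the same two bytes as \constant{BOM_UTF16_LE}.

```python
import codecs
import sys


def detect_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    candidates = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# BOM_UTF16 follows the byte order of the machine running the code.
native = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
is_native = codecs.BOM_UTF16 == native

sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
guess = detect_bom(sample)
```

The guess is only a heuristic: UTF-16-LE data whose first character after the
BOM is U+0000 starts with the same four bytes as a UTF-32-LE BOM.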
266
267
268	\subsection{Codec Base Classes \label{codec-base-classes}}
269
270	The \module{codecs} module defines a set of base classes which define the
271	interface and can also be used to easily write your own codecs for use
272	in Python.
273
274	Each codec has to define four interfaces to make it usable as codec in
275	Python: stateless encoder, stateless decoder, stream reader and stream
276	writer. The stream reader and writers typically reuse the stateless
277	encoder/decoder to implement the file protocols.
278
279	The \class{Codec} class defines the interface for stateless
280	encoders/decoders.
281
282	To simplify and standardize error handling, the \method{encode()} and
283	\method{decode()} methods may implement different error handling
284	schemes by providing the \var{errors} string argument. The following
285	string values are defined and implemented by all standard Python
286	codecs:
287
288	\begin{tableii}{l|l}{code}{Value}{Meaning}
289	\lineii{'strict'}{Raise \exception{UnicodeError} (or a subclass);
290	this is the default.}
291	\lineii{'ignore'}{Ignore the character and continue with the next.}
292	\lineii{'replace'}{Replace with a suitable replacement character;
293	Python will use the official U+FFFD REPLACEMENT
294	CHARACTER for the built-in Unicode codecs on
295	decoding and '?' on encoding.}
296	\lineii{'xmlcharrefreplace'}{Replace with the appropriate XML
297	character reference (only for encoding).}
298	\lineii{'backslashreplace'}{Replace with backslashed escape sequences
299	(only for encoding).}
300	\end{tableii}
301
302	The set of allowed values can be extended via \function{register_error()}.
303
304
305	\subsubsection{Codec Objects \label{codec-objects}}
306
307	The \class{Codec} class defines these methods which also define the
308	function interfaces of the stateless encoder and decoder:
309
310	\begin{methoddesc}{encode}{input\optional{, errors}}
311	Encodes the object \var{input} and returns a tuple (output object,
312	length consumed). While codecs are not restricted to use with Unicode, in
313	a Unicode context, encoding converts a Unicode object to a plain string
314	using a particular character set encoding (e.g., \code{cp1252} or
315	\code{iso-8859-1}).
316
317	\var{errors} defines the error handling to apply. It defaults to
318	\code{'strict'} handling.
319
320	The method may not store state in the \class{Codec} instance. Use
321	\class{StreamWriter} for codecs which have to keep state in order to
322	make encoding efficient.
323
324	The encoder must be able to handle zero length input and return an
325	empty object of the output object type in this situation.
326	\end{methoddesc}
327
328	\begin{methoddesc}{decode}{input\optional{, errors}}
329	Decodes the object \var{input} and returns a tuple (output object,
330	length consumed). In a Unicode context, decoding converts a plain string
331	encoded using a particular character set encoding to a Unicode object.
332
333	\var{input} must be an object which provides the \code{bf_getreadbuf}
334	buffer slot. Python strings, buffer objects and memory mapped files
335	are examples of objects providing this slot.
336
337	\var{errors} defines the error handling to apply. It defaults to
338	\code{'strict'} handling.
339
340	The method may not store state in the \class{Codec} instance. Use
341	\class{StreamReader} for codecs which have to keep state in order to
342	make decoding efficient.
343
344	The decoder must be able to handle zero length input and return an
345	empty object of the output object type in this situation.
346	\end{methoddesc}
347
348	The \class{IncrementalEncoder} and \class{IncrementalDecoder} classes provide
349	the basic interface for incremental encoding and decoding. Encoding/decoding the
350	input isn't done with one call to the stateless encoder/decoder function,
351	but with multiple calls to the \method{encode}/\method{decode} method of the
352	incremental encoder/decoder. The incremental encoder/decoder keeps track of
353	the encoding/decoding process during method calls.
354
355	The joined output of calls to the \method{encode}/\method{decode} method is the
356	same as if all the single inputs were joined into one, and this input was
357	encoded/decoded with the stateless encoder/decoder.
358
359
360	\subsubsection{IncrementalEncoder Objects \label{incremental-encoder-objects}}
361
362	\versionadded{2.5}
363
364	The \class{IncrementalEncoder} class is used for encoding an input in multiple
365	steps. It defines the following methods which every incremental encoder must
366	define in order to be compatible with the Python codec registry.
367
368	\begin{classdesc}{IncrementalEncoder}{\optional{errors}}
369	Constructor for an \class{IncrementalEncoder} instance.
370
371	All incremental encoders must provide this constructor interface. They are
372	free to add additional keyword arguments, but only the ones defined
373	here are used by the Python codec registry.
374
375	The \class{IncrementalEncoder} may implement different error handling
376	schemes by providing the \var{errors} keyword argument. These
377	parameters are predefined:
378
379	\begin{itemize}
380	\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
381	this is the default.
382	\item \code{'ignore'} Ignore the character and continue with the next.
383	\item \code{'replace'} Replace with a suitable replacement character.
384	\item \code{'xmlcharrefreplace'} Replace with the appropriate XML
385	character reference.
386	\item \code{'backslashreplace'} Replace with backslashed escape sequences.
387	\end{itemize}
388
389	The \var{errors} argument will be assigned to an attribute of the
390	same name. Assigning to this attribute makes it possible to switch
391	between different error handling strategies during the lifetime
392	of the \class{IncrementalEncoder} object.
393
394	The set of allowed values for the \var{errors} argument can
395	be extended with \function{register_error()}.
396	\end{classdesc}
397
398	\begin{methoddesc}{encode}{object\optional{, final}}
399	Encodes \var{object} (taking the current state of the encoder into account)
400	and returns the resulting encoded object. If this is the last call to
401	\method{encode} \var{final} must be true (the default is false).
402	\end{methoddesc}
403
404	\begin{methoddesc}{reset}{}
405	Reset the encoder to the initial state.
406	\end{methoddesc}
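
The state kept by an incremental encoder can be observed with the
\code{utf-16} codec, which writes a byte order mark only at the start of its
output. This is a sketch assuming a current Python 3 interpreter; the byte
order of the output depends on the machine running it.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

first = encoder.encode("a")               # BOM followed by the first character
second = encoder.encode("b", final=True)  # no second BOM

# reset() returns the encoder to its initial state, so the BOM comes back.
encoder.reset()
again = encoder.encode("c", final=True)
```

Joining \code{first} and \code{second} gives the same bytes as encoding the
whole string with the stateless encoder.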
407
408
409	\subsubsection{IncrementalDecoder Objects \label{incremental-decoder-objects}}
410
411	The \class{IncrementalDecoder} class is used for decoding an input in multiple
412	steps. It defines the following methods which every incremental decoder must
413	define in order to be compatible with the Python codec registry.
414
415	\begin{classdesc}{IncrementalDecoder}{\optional{errors}}
416	Constructor for an \class{IncrementalDecoder} instance.
417
418	All incremental decoders must provide this constructor interface. They are
419	free to add additional keyword arguments, but only the ones defined
420	here are used by the Python codec registry.
421
422	The \class{IncrementalDecoder} may implement different error handling
423	schemes by providing the \var{errors} keyword argument. These
424	parameters are predefined:
425
426	\begin{itemize}
427	\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
428	this is the default.
429	\item \code{'ignore'} Ignore the character and continue with the next.
430	\item \code{'replace'} Replace with a suitable replacement character.
431	\end{itemize}
432
433	The \var{errors} argument will be assigned to an attribute of the
434	same name. Assigning to this attribute makes it possible to switch
435	between different error handling strategies during the lifetime
436	of the \class{IncrementalDecoder} object.
437
438	The set of allowed values for the \var{errors} argument can
439	be extended with \function{register_error()}.
440	\end{classdesc}
441
442	\begin{methoddesc}{decode}{object\optional{, final}}
443	Decodes \var{object} (taking the current state of the decoder into account)
444	and returns the resulting decoded object. If this is the last call to
445	\method{decode} \var{final} must be true (the default is false).
446	If \var{final} is true the decoder must decode the input completely and must
447	flush all buffers. If this isn't possible (e.g. because of incomplete byte
448	sequences at the end of the input) it must initiate error handling just like
449	in the stateless case (which might raise an exception).
450	\end{methoddesc}
451
452	\begin{methoddesc}{reset}{}
453	Reset the decoder to the initial state.
454	\end{methoddesc}
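
The role of \var{final} can be sketched as follows, assuming a current Python 3
interpreter. A three-byte UTF-8 sequence is fed to the decoder in two parts;
the same truncated input with \var{final} set to true triggers error handling.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
data = "\u20ac5".encode("utf-8")  # three bytes for U+20AC, then b"5"

# The first call ends in the middle of a character, so nothing is returned yet.
head = decoder.decode(data[:2])
tail = decoder.decode(data[2:], final=True)

# With final=True an incomplete sequence can no longer be buffered.
decoder.reset()
try:
    decoder.decode(data[:2], final=True)
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```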
455
456
457	The \class{StreamWriter} and \class{StreamReader} classes provide
458	generic working interfaces which can be used to implement new
459	encoding submodules very easily. See \module{encodings.utf_8} for an
460	example of how this is done.
461
462
463	\subsubsection{StreamWriter Objects \label{stream-writer-objects}}
464
465	The \class{StreamWriter} class is a subclass of \class{Codec} and
466	defines the following methods which every stream writer must define in
467	order to be compatible with the Python codec registry.
468
469	\begin{classdesc}{StreamWriter}{stream\optional{, errors}}
470	Constructor for a \class{StreamWriter} instance.
471
472	All stream writers must provide this constructor interface. They are
473	free to add additional keyword arguments, but only the ones defined
474	here are used by the Python codec registry.
475
476	\var{stream} must be a file-like object open for writing binary
477	data.
478
479	The \class{StreamWriter} may implement different error handling
480	schemes by providing the \var{errors} keyword argument. These
481	parameters are predefined:
482
483	\begin{itemize}
484	\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
485	this is the default.
486	\item \code{'ignore'} Ignore the character and continue with the next.
487	\item \code{'replace'} Replace with a suitable replacement character.
488	\item \code{'xmlcharrefreplace'} Replace with the appropriate XML
489	character reference.
490	\item \code{'backslashreplace'} Replace with backslashed escape sequences.
491	\end{itemize}
492
493	The \var{errors} argument will be assigned to an attribute of the
494	same name. Assigning to this attribute makes it possible to switch
495	between different error handling strategies during the lifetime
496	of the \class{StreamWriter} object.
497
498	The set of allowed values for the \var{errors} argument can
499	be extended with \function{register_error()}.
500	\end{classdesc}
501
502	\begin{methoddesc}{write}{object}
503	Writes the object's contents encoded to the stream.
504	\end{methoddesc}
505
506	\begin{methoddesc}{writelines}{list}
507	Writes the concatenated list of strings to the stream (possibly by
508	reusing the \method{write()} method).
509	\end{methoddesc}
510
511	\begin{methoddesc}{reset}{}
512	Flushes and resets the codec buffers used for keeping state.
513
514	Calling this method should ensure that the data on the output is put
515	into a clean state that allows appending of new fresh data without
516	having to rescan the whole stream to recover state.
517	\end{methoddesc}
518
519	In addition to the above methods, the \class{StreamWriter} must also
520	inherit all other methods and attributes from the underlying stream.
521
522
523	\subsubsection{StreamReader Objects \label{stream-reader-objects}}
524
525	The \class{StreamReader} class is a subclass of \class{Codec} and
526	defines the following methods which every stream reader must define in
527	order to be compatible with the Python codec registry.
528
529	\begin{classdesc}{StreamReader}{stream\optional{, errors}}
530	Constructor for a \class{StreamReader} instance.
531
532	All stream readers must provide this constructor interface. They are
533	free to add additional keyword arguments, but only the ones defined
534	here are used by the Python codec registry.
535
536	\var{stream} must be a file-like object open for reading (binary)
537	data.
538
539	The \class{StreamReader} may implement different error handling
540	schemes by providing the \var{errors} keyword argument. These
541	parameters are defined:
542
543	\begin{itemize}
544	\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
545	this is the default.
546	\item \code{'ignore'} Ignore the character and continue with the next.
547	\item \code{'replace'} Replace with a suitable replacement character.
548	\end{itemize}
549
550	The \var{errors} argument will be assigned to an attribute of the
551	same name. Assigning to this attribute makes it possible to switch
552	between different error handling strategies during the lifetime
553	of the \class{StreamReader} object.
554
555	The set of allowed values for the \var{errors} argument can
556	be extended with \function{register_error()}.
557	\end{classdesc}
558
559	\begin{methoddesc}{read}{\optional{size\optional{, chars\optional{, firstline}}}}
560	Decodes data from the stream and returns the resulting object.
561
562	\var{chars} indicates the number of characters to read from the
563	stream. \function{read()} will never return more than \var{chars}
564	characters, but it might return fewer if there are not enough
565	characters available.
566
567	\var{size} indicates the approximate maximum number of bytes to read
568	from the stream for decoding purposes. The decoder can modify this
569	setting as appropriate. The default value -1 indicates to read and
570	decode as much as possible. \var{size} is intended to prevent having
571	to decode huge files in one step.
572
573	\var{firstline} indicates that it would be sufficient to only return
574	the first line, if there are decoding errors on later lines.
575
576	The method should use a greedy read strategy meaning that it should
577	read as much data as is allowed within the definition of the encoding
578	and the given size, e.g. if optional encoding endings or state
579	markers are available on the stream, these should be read too.
580
581	\versionchanged[\var{chars} argument added]{2.4}
582	\versionchanged[\var{firstline} argument added]{2.4.2}
583	\end{methoddesc}
584
585	\begin{methoddesc}{readline}{\optional{size\optional{, keepends}}}
586	Read one line from the input stream and return the
587	decoded data.
588
589	\var{size}, if given, is passed as size argument to the stream's
590	\method{readline()} method.
591
592	If \var{keepends} is false line-endings will be stripped from the
593	lines returned.
594
595	\versionchanged[\var{keepends} argument added]{2.4}
596	\end{methoddesc}
597
598	\begin{methoddesc}{readlines}{\optional{sizehint\optional{, keepends}}}
599	Read all lines available on the input stream and return them as a list
600	of lines.
601
602	Line-endings are implemented using the codec's decoder method and are
603	included in the list entries if \var{keepends} is true.
604
605	\var{sizehint}, if given, is passed as the \var{size} argument to the
606	stream's \method{read()} method.
607	\end{methoddesc}
608
609	\begin{methoddesc}{reset}{}
610	Resets the codec buffers used for keeping state.
611
612	Note that no stream repositioning should take place. This method is
613	primarily intended to be able to recover from decoding errors.
614	\end{methoddesc}
615
616	In addition to the above methods, the \class{StreamReader} must also
617	inherit all other methods and attributes from the underlying stream.
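
A brief sketch of a stream writer and a stream reader working on the same
binary stream, assuming a current Python 3 interpreter and an in-memory
\class{io.BytesIO} object in place of a real file. The factories are obtained
with \function{getwriter()} and \function{getreader()}.

```python
import codecs
import io

raw = io.BytesIO()  # binary stream standing in for a file

writer = codecs.getwriter("utf-8")(raw)
writer.write("first line\n")
writer.writelines(["zweite Zeile: \u00e4\n", "third\n"])

# Rewind the underlying stream and decode it again line by line.
raw.seek(0)
reader = codecs.getreader("utf-8")(raw)
first = reader.readline()
rest = reader.readlines()
```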
618
619	The next two base classes are included for convenience. They are not
620	needed by the codec registry, but may prove useful in practice.
621
622
623	\subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}
624
625	The \class{StreamReaderWriter} allows wrapping streams which work in
626	both read and write modes.
627
628	The design is such that one can use the factory functions returned by
629	the \function{lookup()} function to construct the instance.
630
631	\begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
632	Creates a \class{StreamReaderWriter} instance.
633	\var{stream} must be a file-like object.
634	\var{Reader} and \var{Writer} must be factory functions or classes
635	providing the \class{StreamReader} and \class{StreamWriter} interfaces,
636	respectively.
637	Error handling is done in the same way as defined for the
638	stream readers and writers.
639	\end{classdesc}
640
641	\class{StreamReaderWriter} instances define the combined interfaces of
642	\class{StreamReader} and \class{StreamWriter} classes. They inherit
643	all other methods and attributes from the underlying stream.
644
645
646	\subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}
647
648	The \class{StreamRecoder} provides a frontend-backend view of
649	encoding data which is sometimes useful when dealing with different
650	encoding environments.
651
652	The design is such that one can use the factory functions returned by
653	the \function{lookup()} function to construct the instance.
654
655	\begin{classdesc}{StreamRecoder}{stream, encode, decode,
656	Reader, Writer, errors}
657	Creates a \class{StreamRecoder} instance which implements a two-way
658	conversion: \var{encode} and \var{decode} work on the frontend (the
659	data visible to code calling \method{read()} and \method{write()}) while
660	\var{Reader} and \var{Writer} work on the backend (reading and
661	writing to the stream).
662
663	You can use these objects to do transparent direct recodings from
664	e.g.\ Latin-1 to UTF-8 and back.
665
666	\var{stream} must be a file-like object.
667
668	\var{encode}, \var{decode} must adhere to the \class{Codec}
669	interface. \var{Reader}, \var{Writer} must be factory functions or
670	classes providing objects of the \class{StreamReader} and
671	\class{StreamWriter} interface respectively.
672
673	\var{encode} and \var{decode} are needed for the frontend
674	translation, \var{Reader} and \var{Writer} for the backend
675	translation. The intermediate format used is determined by the two
676	sets of codecs, e.g. the Unicode codecs will use Unicode as the
677	intermediate encoding.
678
679	Error handling is done in the same way as defined for the
680	stream readers and writers.
681	\end{classdesc}
682
683	\class{StreamRecoder} instances define the combined interfaces of
684	\class{StreamReader} and \class{StreamWriter} classes. They inherit
685	all other methods and attributes from the underlying stream.
686
687	\subsection{Encodings and Unicode\label{encodings-overview}}
688
689	Unicode strings are stored internally as sequences of codepoints (to
690	be precise as \ctype{Py_UNICODE} arrays). Depending on the way Python is
691	compiled (either via \longprogramopt{enable-unicode=ucs2} or
692	\longprogramopt{enable-unicode=ucs4}, with the former being the default)
693	\ctype{Py_UNICODE} is either a 16-bit or
694	32-bit data type. Once a Unicode object is used outside of CPU and
695	memory, CPU endianness and how these arrays are stored as bytes become
696	an issue. Transforming a Unicode object into a sequence of bytes is
697	called encoding and recreating the Unicode object from the sequence of
698	bytes is known as decoding. There are many different methods for how this
699	transformation can be done (these methods are also called encodings).
700	The simplest method is to map the codepoints 0-255 to the bytes
701	\code{0x0}-\code{0xff}. This means that a Unicode object that contains
702	codepoints above \code{U+00FF} can't be encoded with this method (which
703	is called \code{'latin-1'} or \code{'iso-8859-1'}).
704	\function{unicode.encode()} will raise a \exception{UnicodeEncodeError}
705	that looks like this: \samp{UnicodeEncodeError: 'latin-1' codec can't
706	encode character u'\e u1234' in position 3: ordinal not in range(256)}.
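
The failure described above can be reproduced with the sketch below. It
assumes a current Python 3 interpreter, where the method lives on \class{str}
rather than on \class{unicode}; the sample strings are arbitrary.

```python
# U+00FF is the last code point that Latin-1 can represent.
fits = "abc\u00ff".encode("latin-1")

try:
    "abc\u1234".encode("latin-1")
except UnicodeEncodeError as exc:
    # The exception records where the problem is and why it occurred.
    failure = (exc.start, exc.end, exc.reason)
    offending = exc.object[exc.start:exc.end]
```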
707
708	There's another group of encodings (the so called charmap encodings)
709	that choose a different subset of all Unicode code points and how
710	these codepoints are mapped to the bytes \code{0x0}-\code{0xff}.
711	To see how this is done simply open e.g. \file{encodings/cp1252.py}
712	(which is an encoding that is used primarily on Windows).
713	There's a string constant with 256 characters that shows you which
714	character is mapped to which byte value.
715
716	All of these encodings can only encode 256 of the 65536 (or 1114112)
717	codepoints defined in Unicode. A simple and straightforward way to
718	store each Unicode code point is to store each codepoint as two
719	consecutive bytes. There are two possibilities: Store the bytes in big
720	endian or in little endian order. These two encodings are called
721	UTF-16-BE and UTF-16-LE respectively. Their disadvantage is that if
722	e.g. you use UTF-16-BE on a little endian machine you will always have
723	to swap bytes on encoding and decoding. UTF-16 avoids this problem:
724	Bytes will always be in natural endianness. When these bytes are read
725	by a CPU with a different endianness, then bytes have to be swapped
726	though. To be able to detect the endianness of a UTF-16 byte sequence,
727	there's the so called BOM (the "Byte Order Mark"). This is the Unicode
728	character \code{U+FEFF}. This character will be prepended to every UTF-16
729	byte sequence. The byte swapped version of this character (\code{0xFFFE}) is
730	an illegal character that may not appear in a Unicode text. So when
731	the first character in an UTF-16 byte sequence appears to be a \code{U+FFFE}
732	the bytes have to be swapped on decoding. Unfortunately upto Unicode
733	4.0 the character \code{U+FEFF} had a second purpose as a \samp{ZERO WIDTH
734	NO-BREAK SPACE}: A character that has no width and doesn't allow a
735	word to be split. It can e.g. be used to give hints to a ligature
736	algorithm. With Unicode 4.0 using \code{U+FEFF} as a \samp{ZERO WIDTH NO-BREAK
737	SPACE} has been deprecated (with \code{U+2060} (\samp{WORD JOINER}) assuming
738	this role). Nevertheless Unicode software still must be able to handle
739	\code{U+FEFF} in both roles: As a BOM it's a device to determine the storage
740	layout of the encoded bytes, and vanishes once the byte sequence has
741	been decoded into a Unicode string; as a \samp{ZERO WIDTH NO-BREAK SPACE}
742	it's a normal character that will be decoded like any other.
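
The byte order handling described above can be observed directly. This
is a minimal sketch, assuming a Python~3 interpreter; the sample text is
arbitrary, and \code{codecs.BOM_UTF16_BE} and \code{codecs.BOM_UTF16_LE}
are the two byte orderings of the encoded BOM.

```python
import codecs
import sys

text = "hi"

big = text.encode("utf-16-be")      # most significant byte first
little = text.encode("utf-16-le")   # least significant byte first

# Plain "utf-16" uses the machine's native order and prepends a BOM.
native = text.encode("utf-16")
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE

# On decoding, the BOM selects the byte order and then vanishes.
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")

# A U+FEFF that is not at the start is kept as an ordinary character.
inner = "a\ufeffb".encode("utf-16").decode("utf-16")
```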

There's another encoding that is able to encode the full range of
Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means
there are no issues with byte order in UTF-8. Each byte in a UTF-8
byte sequence consists of two parts: marker bits (the most significant
bits) and payload bits. The marker bits are a sequence of zero to six
1 bits followed by a 0 bit. Unicode characters are encoded like this
(with x being payload bits, which when concatenated give the Unicode
character):

\begin{tableii}{l|l}{textrm}{Range}{Encoding}
\lineii{\code{U-00000000} ... \code{U-0000007F}}{0xxxxxxx}
\lineii{\code{U-00000080} ... \code{U-000007FF}}{110xxxxx 10xxxxxx}
\lineii{\code{U-00000800} ... \code{U-0000FFFF}}{1110xxxx 10xxxxxx 10xxxxxx}
\lineii{\code{U-00010000} ... \code{U-001FFFFF}}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
\lineii{\code{U-00200000} ... \code{U-03FFFFFF}}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
\lineii{\code{U-04000000} ... \code{U-7FFFFFFF}}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
\end{tableii}

The least significant bit of the Unicode character is the rightmost x
bit.
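
The table can be turned into code almost literally. The function below
is a hypothetical helper written for this illustration (it is not part
of the \module{codecs} module) and assumes a Python~3 interpreter. It
follows all six rows of the table, whereas Python's built-in
\code{'utf-8'} codec only produces the forms of up to four bytes.

```python
def utf8_bytes(code_point):
    """Encode one code point following the bit layout in the table."""
    if code_point < 0x80:
        return bytes([code_point])      # 0xxxxxxx
    # (number of payload bits available, total number of bytes)
    for payload_bits, length in ((11, 2), (16, 3), (21, 4), (26, 5), (31, 6)):
        if code_point < (1 << payload_bits):
            break
    else:
        raise ValueError("code point does not fit into six bytes")
    # Peel off six payload bits per continuation byte (10xxxxxx),
    # starting with the least significant bits.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # The lead byte starts with `length` 1 bits followed by a 0 bit.
    marker = (0xFF << (8 - length)) & 0xFF
    return bytes([marker | code_point] + tail[::-1])


# U+20AC (EURO SIGN) falls into the three-byte row of the table.
euro = utf8_bytes(0x20AC)
```

For code points up to \code{U+10FFFF} (excluding surrogates) the result
should agree with \code{chr(code_point).encode("utf-8")}.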

As UTF-8 is an 8-bit encoding, no BOM is required and any \code{U+FEFF}
character in the decoded Unicode string (even if it's the first
character) is treated as a \samp{ZERO WIDTH NO-BREAK SPACE}.

Without external information it's impossible to reliably determine
which encoding was used for encoding a Unicode string. Each charmap
encoding can decode any random byte sequence. However, that's not
possible with UTF-8, as UTF-8 byte sequences have a structure that
doesn't allow arbitrary byte sequences. To increase the reliability
with which a UTF-8 encoding can be detected, Microsoft invented a
variant of UTF-8 (that Python 2.5 calls \code{"utf-8-sig"}) for its Notepad
program: before any of the Unicode characters is written to the file,
a UTF-8 encoded BOM (which looks like this as a byte sequence: \code{0xef},
\code{0xbb}, \code{0xbf}) is written. As it's rather improbable that any
charmap encoded file starts with these byte values (which would e.g. map to

LATIN SMALL LETTER I WITH DIAERESIS \\
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK \\
INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig
encoding can be correctly guessed from the byte sequence. So here the
BOM is not used to be able to determine the byte order used for
generating the byte sequence, but as a signature that helps in
guessing the encoding. On encoding, the utf-8-sig codec will write
\code{0xef}, \code{0xbb}, \code{0xbf} as the first three bytes to the file.
On decoding, utf-8-sig will skip those three bytes if they appear as the
first three bytes in the file.
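
The difference between the plain and the signature variant can be
sketched as follows, assuming a Python~3 interpreter and an arbitrary
sample string. The last lines illustrate why the structure of UTF-8
helps with guessing: \code{'latin-1'} accepts any byte string, while
\code{'utf-8'} rejects malformed input.

```python
import codecs

text = "spam"

with_sig = text.encode("utf-8-sig")   # signature followed by plain UTF-8
plain = text.encode("utf-8")

# utf-8-sig drops a leading signature; data without one is accepted too.
decoded_sig = with_sig.decode("utf-8-sig")
decoded_plain = plain.decode("utf-8-sig")

# The plain utf-8 codec keeps the signature as a U+FEFF character.
kept = with_sig.decode("utf-8")

# Read as iso-8859-1, the signature bytes are three unrelated characters.
as_latin1 = codecs.BOM_UTF8.decode("iso-8859-1")

# latin-1 decodes every possible byte; utf-8 refuses invalid sequences.
all_bytes = bytes(range(256)).decode("latin-1")
try:
    b"\xff\xfe".decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```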


\subsection{Standard Encodings\label{standard-encodings}}

Python comes with a number of codecs built-in, either implemented as C
functions or with dictionaries as mapping tables. The following table
lists the codecs by name, together with a few common aliases, and the
languages for which the encoding is likely used. Neither the list of
aliases nor the list of languages is meant to be exhaustive. Notice
that spelling alternatives that only differ in case or use a hyphen
instead of an underscore are also valid aliases.
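
The alias rules can be checked with \function{codecs.lookup()}. The
following is a small sketch under the assumption of a Python~3
interpreter; the exact canonical name reported by the \code{name}
attribute may differ between versions, so the example only compares
names with each other.

```python
import codecs

# Case and the choice of hyphen or underscore do not matter.
spellings = ["utf_8", "UTF-8", "utf-8", "Utf_8", "U8", "utf8"]
canonical = {codecs.lookup(name).name for name in spellings}

# An alias resolves to the same codec as another spelling of that codec.
same_codec = codecs.lookup("latin1").name == codecs.lookup("iso-8859-1").name

# An unknown name raises LookupError.
try:
    codecs.lookup("no-such-codec")
    found = True
except LookupError:
    found = False
```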

Many of the character sets support the same languages. They vary in
individual characters (e.g. whether the EURO SIGN is supported or
not), and in the assignment of characters to code positions. For the
European languages in particular, the following variants typically
exist:

\begin{itemize}
\item an ISO 8859 codeset
\item a Microsoft Windows code page, which is typically derived from
an 8859 codeset, but replaces control characters with additional
graphic characters
\item an IBM EBCDIC code page
\item an IBM PC code page, which is \ASCII{} compatible
\end{itemize}
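
The differences between these variants show up as soon as a single
character is encoded with several of them. The sketch below assumes a
Python~3 interpreter; the choice of codecs and characters is only an
illustration.

```python
euro = "\u20ac"  # EURO SIGN

# The Windows code page has a slot for the euro sign, ISO 8859-15 uses a
# different one, and ISO 8859-1 has none at all.
in_cp1252 = euro.encode("cp1252")
in_iso8859_15 = euro.encode("iso8859_15")
try:
    euro.encode("latin_1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# The IBM PC code page keeps ASCII letters in place, EBCDIC does not.
a_in_cp437 = "A".encode("cp437")
a_in_cp037 = "A".encode("cp037")
```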

\begin{longtableiii}{l|l|l}{textrm}{Codec}{Aliases}{Languages}

\lineiii{ascii}
{646, us-ascii}
{English}

\lineiii{big5}
{big5-tw, csbig5}
{Traditional Chinese}

\lineiii{big5hkscs}
{big5-hkscs, hkscs}
{Traditional Chinese}

\lineiii{cp037}
{IBM037, IBM039}
{English}

\lineiii{cp424}
{EBCDIC-CP-HE, IBM424}
{Hebrew}

\lineiii{cp437}
{437, IBM437}
{English}

\lineiii{cp500}
{EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500}
{Western Europe}

\lineiii{cp737}
{}
{Greek}

\lineiii{cp775}
{IBM775}
{Baltic languages}

\lineiii{cp850}
{850, IBM850}
{Western Europe}

\lineiii{cp852}
{852, IBM852}
{Central and Eastern Europe}

\lineiii{cp855}
{855, IBM855}
{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{cp856}
{}
{Hebrew}

\lineiii{cp857}
{857, IBM857}
{Turkish}

\lineiii{cp860}
{860, IBM860}
{Portuguese}

\lineiii{cp861}
{861, CP-IS, IBM861}
{Icelandic}

\lineiii{cp862}
{862, IBM862}
{Hebrew}

\lineiii{cp863}
{863, IBM863}
{Canadian}

\lineiii{cp864}
{IBM864}
{Arabic}

\lineiii{cp865}
{865, IBM865}
{Danish, Norwegian}

\lineiii{cp866}
{866, IBM866}
{Russian}

\lineiii{cp869}
{869, CP-GR, IBM869}
{Greek}

\lineiii{cp874}
{}
{Thai}

\lineiii{cp875}
{}
{Greek}

\lineiii{cp932}
{932, ms932, mskanji, ms-kanji}
{Japanese}

\lineiii{cp949}
{949, ms949, uhc}
{Korean}

\lineiii{cp950}
{950, ms950}
{Traditional Chinese}

\lineiii{cp1006}
{}
{Urdu}

\lineiii{cp1026}
{ibm1026}
{Turkish}

\lineiii{cp1140}
{ibm1140}
{Western Europe}

\lineiii{cp1250}
{windows-1250}
{Central and Eastern Europe}

\lineiii{cp1251}
{windows-1251}
{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{cp1252}
{windows-1252}
{Western Europe}

\lineiii{cp1253}
{windows-1253}
{Greek}

\lineiii{cp1254}
{windows-1254}
{Turkish}

\lineiii{cp1255}
{windows-1255}
{Hebrew}

\lineiii{cp1256}
{windows-1256}
{Arabic}

\lineiii{cp1257}
{windows-1257}
{Baltic languages}

\lineiii{cp1258}
{windows-1258}
{Vietnamese}

\lineiii{euc_jp}
{eucjp, ujis, u-jis}
{Japanese}

\lineiii{euc_jis_2004}
{jisx0213, eucjis2004}
{Japanese}

\lineiii{euc_jisx0213}
{eucjisx0213}
{Japanese}

\lineiii{euc_kr}
{euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001}
{Korean}

\lineiii{gb2312}
{chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980,
gb2312-80, iso-ir-58}
{Simplified Chinese}

\lineiii{gbk}
{936, cp936, ms936}
{Unified Chinese}

\lineiii{gb18030}
{gb18030-2000}
{Unified Chinese}

\lineiii{hz}
{hzgb, hz-gb, hz-gb-2312}
{Simplified Chinese}

\lineiii{iso2022_jp}
{csiso2022jp, iso2022jp, iso-2022-jp}
{Japanese}

\lineiii{iso2022_jp_1}
{iso2022jp-1, iso-2022-jp-1}
{Japanese}

\lineiii{iso2022_jp_2}
{iso2022jp-2, iso-2022-jp-2}
{Japanese, Korean, Simplified Chinese, Western Europe, Greek}

\lineiii{iso2022_jp_2004}
{iso2022jp-2004, iso-2022-jp-2004}
{Japanese}

\lineiii{iso2022_jp_3}
{iso2022jp-3, iso-2022-jp-3}
{Japanese}

\lineiii{iso2022_jp_ext}
{iso2022jp-ext, iso-2022-jp-ext}
{Japanese}

\lineiii{iso2022_kr}
{csiso2022kr, iso2022kr, iso-2022-kr}
{Korean}

\lineiii{latin_1}
{iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1}
{Western Europe}

\lineiii{iso8859_2}
{iso-8859-2, latin2, L2}
{Central and Eastern Europe}

\lineiii{iso8859_3}
{iso-8859-3, latin3, L3}
{Esperanto, Maltese}

\lineiii{iso8859_4}
{iso-8859-4, latin4, L4}
{Baltic languages}

\lineiii{iso8859_5}
{iso-8859-5, cyrillic}
{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{iso8859_6}
{iso-8859-6, arabic}
{Arabic}

\lineiii{iso8859_7}
{iso-8859-7, greek, greek8}
{Greek}

\lineiii{iso8859_8}
{iso-8859-8, hebrew}
{Hebrew}

\lineiii{iso8859_9}
{iso-8859-9, latin5, L5}
{Turkish}

\lineiii{iso8859_10}
{iso-8859-10, latin6, L6}
{Nordic languages}

\lineiii{iso8859_13}
{iso-8859-13}
{Baltic languages}

\lineiii{iso8859_14}
{iso-8859-14, latin8, L8}
{Celtic languages}

\lineiii{iso8859_15}
{iso-8859-15}
{Western Europe}

\lineiii{johab}
{cp1361, ms1361}
{Korean}

\lineiii{koi8_r}
{}
{Russian}

\lineiii{koi8_u}
{}
{Ukrainian}

\lineiii{mac_cyrillic}
{maccyrillic}
{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{mac_greek}
{macgreek}
{Greek}

\lineiii{mac_iceland}
{maciceland}
{Icelandic}

\lineiii{mac_latin2}
{maclatin2, maccentraleurope}
{Central and Eastern Europe}

\lineiii{mac_roman}
{macroman}
{Western Europe}

\lineiii{mac_turkish}
{macturkish}
{Turkish}

\lineiii{ptcp154}
{csptcp154, pt154, cp154, cyrillic-asian}
{Kazakh}

\lineiii{shift_jis}
{csshiftjis, shiftjis, sjis, s_jis}
{Japanese}

\lineiii{shift_jis_2004}
{shiftjis2004, sjis_2004, sjis2004}
{Japanese}

\lineiii{shift_jisx0213}
{shiftjisx0213, sjisx0213, s_jisx0213}
{Japanese}

\lineiii{utf_16}
{U16, utf16}
{all languages}

\lineiii{utf_16_be}
{UTF-16BE}
{all languages (BMP only)}

\lineiii{utf_16_le}
{UTF-16LE}
{all languages (BMP only)}

\lineiii{utf_7}
{U7, unicode-1-1-utf-7}
{all languages}

\lineiii{utf_8}
{U8, UTF, utf8}
{all languages}

\lineiii{utf_8_sig}
{}
{all languages}

\end{longtableiii}

A number of codecs are specific to Python, so their codec names have
no meaning outside Python. Some of them don't convert from Unicode
strings to byte strings, but instead use the property of the Python
codecs machinery that any bijective function with one argument can be
considered as an encoding.

For the codecs listed below, the result in the ``encoding'' direction
is always a byte string. The result of the ``decoding'' direction is
listed as operand type in the table.

\begin{tableiv}{l|l|l|l}{textrm}{Codec}{Aliases}{Operand type}{Purpose}

\lineiv{base64_codec}
{base64, base-64}
{byte string}
{Convert operand to MIME base64}

\lineiv{bz2_codec}
{bz2}
{byte string}
{Compress the operand using bz2}

\lineiv{hex_codec}
{hex}
{byte string}
{Convert operand to hexadecimal representation, with two
digits per byte}

\lineiv{idna}
{}
{Unicode string}
{Implements \rfc{3490}.
\versionadded{2.3}
See also \refmodule{encodings.idna}}

\lineiv{mbcs}
{dbcs}
{Unicode string}
{Windows only: Encode operand according to the ANSI codepage (CP_ACP)}

\lineiv{palmos}
{}
{Unicode string}
{Encoding of PalmOS 3.5}

\lineiv{punycode}
{}
{Unicode string}
{Implements \rfc{3492}.
\versionadded{2.3}}

\lineiv{quopri_codec}
{quopri, quoted-printable, quotedprintable}
{byte string}
{Convert operand to MIME quoted printable}

\lineiv{raw_unicode_escape}
{}
{Unicode string}
{Produce a string that is suitable as raw Unicode literal in
Python source code}

\lineiv{rot_13}
{rot13}
{Unicode string}
{Returns the Caesar-cypher encryption of the operand}

\lineiv{string_escape}
{}
{byte string}
{Produce a string that is suitable as string literal in
Python source code}

\lineiv{undefined}
{}
{any}
{Raise an exception for all conversions. Can be used as the
system encoding if no automatic coercion between byte and
Unicode strings is desired.}

\lineiv{unicode_escape}
{}
{Unicode string}
{Produce a string that is suitable as Unicode literal in
Python source code}

\lineiv{unicode_internal}
{}
{Unicode string}
{Return the internal representation of the operand}

\lineiv{uu_codec}
{uu}
{byte string}
{Convert the operand using uuencode}

\lineiv{zlib_codec}
{zip, zlib}
{byte string}
{Compress the operand using gzip}

\end{tableiv}
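
A few of these codecs are exercised in the sketch below. It assumes a
Python~3 interpreter, which differs from the description above in two
respects: codecs that map byte strings to byte strings have to be called
through \function{codecs.encode()} and \function{codecs.decode()}, and
\code{rot_13} returns text rather than a byte string. Some codecs from
the table (for example \code{string_escape}) are not available there.
The sample data is arbitrary.

```python
import codecs

data = b"hello"

as_hex = codecs.encode(data, "hex_codec")         # two digits per byte
as_base64 = codecs.encode(data, "base64_codec")   # MIME base64 plus newline
squeezed = codecs.encode(data * 100, "zlib_codec")

# Each transformation can be reversed with the same codec name.
from_base64 = codecs.decode(as_base64, "base64_codec")
restored = codecs.decode(squeezed, "zlib_codec")

# unicode_escape turns text into the body of a Python string literal.
escaped = "\u20ac".encode("unicode_escape")

# rot_13 is a text-to-text codec on Python 3.
rotated = codecs.encode("Hello", "rot_13")
```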

\subsection{\module{encodings.idna} ---
Internationalized Domain Names in Applications}

\declaremodule{standard}{encodings.idna}
\modulesynopsis{Internationalized Domain Names implementation}
% XXX The next line triggers a formatting bug, so it's commented out
% until that can be fixed.
%\moduleauthor{Martin v. L\"owis}

\versionadded{2.3}

This module implements \rfc{3490} (Internationalized Domain Names in
Applications) and \rfc{3491} (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the
\code{punycode} encoding and \refmodule{stringprep}.

These RFCs together define a protocol to support non-\ASCII{} characters
in domain names. A domain name containing non-\ASCII{} characters (such
as ``www.Alliancefran\c caise.nu'') is converted into an
\ASCII-compatible encoding (ACE, such as
``www.xn--alliancefranaise-npb.nu''). The ACE form of the domain name
is then used in all places where arbitrary characters are not allowed
by the protocol, such as DNS queries, HTTP \mailheader{Host} fields, and so
on. This conversion is carried out in the application, if possible
invisibly to the user: the application should transparently convert
Unicode domain labels to IDNA on the wire, and convert ACE labels back
to Unicode before presenting them to the user.

Python supports this conversion in several ways: the \code{idna} codec
converts between Unicode and the ACE. Furthermore, the
\refmodule{socket} module transparently converts Unicode host names to
ACE, so that applications need not be concerned about converting host
names themselves when they pass them to the socket module. On top of
that, modules that have host names as function parameters, such as
\refmodule{httplib} and \refmodule{ftplib}, accept Unicode host names
(\refmodule{httplib} then also transparently sends an IDNA hostname in
the \mailheader{Host} field if it sends that field at all).
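
The conversion performed by the \code{idna} codec can be sketched with
the sample name used above. The example assumes a Python~3 interpreter,
where the Unicode form is a \class{str} and the ACE form is a
\class{bytes} object.

```python
# The sample name is the one used in the text above.
unicode_name = "www.Alliancefran\u00e7aise.nu"

# Unicode to ACE, as it would be sent on the wire.
ace_name = unicode_name.encode("idna")

# ACE back to Unicode for display; nameprep has lower-cased the label.
shown_to_user = ace_name.decode("idna")
```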

When receiving host names from the wire (such as in reverse name
lookup), no automatic conversion to Unicode is performed: applications
wishing to present such host names to the user should decode them to
Unicode.

The module \module{encodings.idna} also implements the nameprep
procedure, which performs certain normalizations on host names, to
achieve case-insensitivity of international domain names, and to unify
similar characters. The nameprep functions can be used directly if
desired.

\begin{funcdesc}{nameprep}{label}
Return the nameprepped version of \var{label}. The implementation
currently assumes query strings, so \code{AllowUnassigned} is
true.
\end{funcdesc}

\begin{funcdesc}{ToASCII}{label}
Convert a label to \ASCII, as specified in \rfc{3490}.
\code{UseSTD3ASCIIRules} is assumed to be false.
\end{funcdesc}

\begin{funcdesc}{ToUnicode}{label}
Convert a label to Unicode, as specified in \rfc{3490}.
\end{funcdesc}
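
The three functions operate on single labels rather than on complete
host names. The following sketch assumes a Python~3 interpreter (where
\function{ToASCII()} returns a \class{bytes} object) and reuses one label
of the sample domain name from above.

```python
from encodings import idna

label = "Alliancefran\u00e7aise"

prepared = idna.nameprep(label)    # case-folded and normalized
ace_label = idna.ToASCII(label)    # nameprep, punycode and the ACE prefix
back = idna.ToUnicode(ace_label)   # undoes the ACE encoding

# A label that is already plain ASCII passes through unchanged.
plain = idna.ToASCII("python")
```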

\subsection{\module{encodings.utf_8_sig} ---
UTF-8 codec with BOM signature}
\declaremodule{standard}{encodings.utf-8-sig} % XXX utf_8_sig gives TeX errors
\modulesynopsis{UTF-8 codec with BOM signature}
\moduleauthor{Walter D\"orwald}{}

\versionadded{2.5}

This module implements a variant of the UTF-8 codec: on encoding, a
UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For
the stateful encoder this is only done once (on the first write to the
byte stream). For decoding, an optional UTF-8 encoded BOM at the start
of the data will be skipped.
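
The stateful behaviour can be observed with the incremental encoder and
decoder of the codec. This is a minimal sketch, assuming a Python~3
interpreter; the way the data is split into chunks is arbitrary.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-8-sig")()
first_chunk = encoder.encode("ab")    # the signature is written here
second_chunk = encoder.encode("cd")   # and not repeated afterwards

# The decoder copes with a signature that is split across two reads.
decoder = codecs.getincrementaldecoder("utf-8-sig")()
pieces = [
    decoder.decode(b"\xef\xbb"),          # too short to decide, nothing yet
    decoder.decode(b"\xbfab"),            # signature complete and skipped
    decoder.decode(b"cd", final=True),
]
decoded = "".join(pieces)
```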
