\section{\module{codecs} ---
Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}

\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}


This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry which manages the codec and error handling lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a \class{CodecInfo} object having the following attributes:

\begin{itemize}
\item \code{name} The name of the encoding;
\item \code{encode} The stateless encoding function;
\item \code{decode} The stateless decoding function;
\item \code{incrementalencoder} An incremental encoder class or factory function;
\item \code{incrementaldecoder} An incremental decoder class or factory function;
\item \code{streamwriter} A stream writer class or factory function;
\item \code{streamreader} A stream reader class or factory function.
\end{itemize}

The various functions or classes take the following arguments:

\var{encode} and \var{decode}: These must be functions or methods
which have the same interface as the
\method{encode()}/\method{decode()} methods of Codec instances (see
Codec Interface). The functions/methods are expected to work in a
stateless mode.

\var{incrementalencoder} and \var{incrementaldecoder}: These have to be
factory functions providing the following interface:

\code{factory(\var{errors}='strict')}

The factory functions must return objects providing the interfaces
defined by the base classes \class{IncrementalEncoder} and
\class{IncrementalDecoder}, respectively. Incremental codecs can maintain
state.

\var{streamreader} and \var{streamwriter}: These have to be
factory functions providing the following interface:

\code{factory(\var{stream}, \var{errors}='strict')}

The factory functions must return objects providing the interfaces
defined by the base classes \class{StreamReader} and
\class{StreamWriter}, respectively. Stream codecs can maintain
state.

Possible values for errors are \code{'strict'} (raise an exception
in case of an encoding error), \code{'replace'} (replace malformed
data with a suitable replacement marker, such as \character{?}),
\code{'ignore'} (ignore malformed data and continue without further
notice), \code{'xmlcharrefreplace'} (replace with the appropriate XML
character reference (for encoding only)) and \code{'backslashreplace'}
(replace with backslashed escape sequences (for encoding only)) as
well as any other error handling name defined via
\function{register_error()}.

In case a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}

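The following is a minimal sketch of a search function, assuming a Python 3 interpreter, where \class{CodecInfo} exposes the stateless functions as its \code{encode} and \code{decode} attributes. The encoding name \code{x_demo_latin} is hypothetical and chosen only for illustration; the sketch simply reuses the pieces of the built-in Latin-1 codec rather than implementing a new one.

```python
import codecs

# Assumes Python 3: text is str, encoded data is bytes.


def search_function(name):
    """Resolve the hypothetical encoding name 'x_demo_latin'."""
    if name != "x_demo_latin":
        # Unknown name: return None so other search functions get a chance.
        return None
    base = codecs.lookup("latin-1")
    return codecs.CodecInfo(
        name="x_demo_latin",
        encode=base.encode,
        decode=base.decode,
        incrementalencoder=base.incrementalencoder,
        incrementaldecoder=base.incrementaldecoder,
        streamreader=base.streamreader,
        streamwriter=base.streamwriter,
    )


codecs.register(search_function)

info = codecs.lookup("x_demo_latin")
encoded, consumed = info.encode("caf\xe9")
decoded, _ = info.decode(encoded)
```

Returning \code{None} for every other name keeps the function from shadowing codecs that other search functions provide.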
\begin{funcdesc}{lookup}{encoding}
Looks up the codec info in the Python codec registry and returns a
\class{CodecInfo} object as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no \class{CodecInfo}
object is found, a \exception{LookupError} is raised. Otherwise, the
\class{CodecInfo} object is stored in the cache and returned to the caller.
\end{funcdesc}

To simplify access to the various codecs, the module provides these
additional functions which use \function{lookup()} for the codec
lookup:

\begin{funcdesc}{getencoder}{encoding}
Look up the codec for the given encoding and return its encoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getdecoder}{encoding}
Look up the codec for the given encoding and return its decoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

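As a short illustration of \function{getencoder()} and \function{getdecoder()}, the sketch below assumes Python 3 and the built-in UTF-8 codec. The name \code{no-such-encoding-demo} is a made-up, unregistered encoding used only to trigger the \exception{LookupError}.

```python
import codecs

# Assumes Python 3: text is str, encoded data is bytes.
encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")

# Both functions return a tuple (output object, length consumed).
data, chars_consumed = encode("\u20ac")
text, bytes_consumed = decode(data)

try:
    codecs.getencoder("no-such-encoding-demo")  # hypothetical, unregistered name
    lookup_failed = False
except LookupError:
    lookup_failed = True
```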
\begin{funcdesc}{getincrementalencoder}{encoding}
Look up the codec for the given encoding and return its incremental encoder
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found or the
codec doesn't support an incremental encoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{getincrementaldecoder}{encoding}
Look up the codec for the given encoding and return its incremental decoder
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found or the
codec doesn't support an incremental decoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{getreader}{encoding}
Look up the codec for the given encoding and return its StreamReader
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getwriter}{encoding}
Look up the codec for the given encoding and return its StreamWriter
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

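The factories returned by \function{getwriter()} and \function{getreader()} can be applied to any binary file-like object. The sketch below assumes Python 3 and uses \code{io.BytesIO} as a stand-in for a real file opened in binary mode.

```python
import codecs
import io

# Assumes Python 3: text is str, encoded data is bytes.
raw = io.BytesIO()  # stands in for a file opened in binary mode

writer_factory = codecs.getwriter("utf-8")
writer = writer_factory(raw)
writer.write("na\xefve")  # encoded to UTF-8 on the way to the stream

reader_factory = codecs.getreader("utf-8")
reader = reader_factory(io.BytesIO(raw.getvalue()))
text = reader.read()  # decoded from UTF-8 on the way back
```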
\begin{funcdesc}{register_error}{name, error_handler}
Register the error handling function \var{error_handler} under the
name \var{name}. \var{error_handler} will be called during encoding
and decoding in case of an error, when \var{name} is specified as the
errors parameter.

For encoding, \var{error_handler} will be called with a
\exception{UnicodeEncodeError} instance, which contains information about
the location of the error. The error handler must either raise this or
a different exception or return a tuple with a replacement for the
unencodable part of the input and a position where encoding should
continue. The encoder will encode the replacement and continue encoding
the original input at the specified position. Negative position values
will be treated as being relative to the end of the input string. If the
resulting position is out of bounds, an \exception{IndexError} will be raised.

Decoding and translating work similarly, except that a
\exception{UnicodeDecodeError} or \exception{UnicodeTranslateError} will be
passed to the handler and the replacement from the error handler will be
put into the output directly.
\end{funcdesc}

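The handler contract described above can be sketched as follows, assuming Python 3. The handler name \code{demo_underscore} and the function \code{underscore_errors} are hypothetical and exist only for this illustration; the sketch handles encoding errors and re-raises everything else.

```python
import codecs

# Assumes Python 3: text is str, encoded data is bytes.


def underscore_errors(exc):
    """Replace each unencodable character with an underscore."""
    if isinstance(exc, UnicodeEncodeError):
        replacement = "_" * (exc.end - exc.start)
        # Resume encoding right after the offending characters.
        return (replacement, exc.end)
    # Decoding and translating are not handled in this sketch.
    raise exc


codecs.register_error("demo_underscore", underscore_errors)

result = "a\u20acb".encode("ascii", "demo_underscore")
same_handler = codecs.lookup_error("demo_underscore") is underscore_errors
```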
\begin{funcdesc}{lookup_error}{name}
Return the error handler previously registered under the name \var{name}.

Raises a \exception{LookupError} in case the handler cannot be found.
\end{funcdesc}

\begin{funcdesc}{strict_errors}{exception}
Implements the \code{strict} error handling.
\end{funcdesc}

\begin{funcdesc}{replace_errors}{exception}
Implements the \code{replace} error handling.
\end{funcdesc}

\begin{funcdesc}{ignore_errors}{exception}
Implements the \code{ignore} error handling.
\end{funcdesc}

\begin{funcdesc}{xmlcharrefreplace_errors}{exception}
Implements the \code{xmlcharrefreplace} error handling.
\end{funcdesc}

\begin{funcdesc}{backslashreplace_errors}{exception}
Implements the \code{backslashreplace} error handling.
\end{funcdesc}

To simplify working with encoded files or streams, the module
also defines these utility functions:

\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
errors\optional{, buffering}}}}
Open an encoded file using the given \var{mode} and return
a wrapped version providing transparent encoding/decoding.

\note{The wrapped version will only accept the object format
defined by the codecs, i.e.\ Unicode objects for most built-in
codecs. Output is also codec-dependent and will usually be Unicode as
well.}

\var{encoding} specifies the encoding which is to be used for the
file.

\var{errors} may be given to define the error handling. It defaults
to \code{'strict'} which causes a \exception{ValueError} to be raised
in case an encoding error occurs.

\var{buffering} has the same meaning as for the built-in
\function{open()} function. It defaults to line buffered.
\end{funcdesc}

\begin{funcdesc}{EncodedFile}{file, input\optional{,
output\optional{, errors}}}
Return a wrapped version of file which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling. It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in case
an encoding error occurs.
\end{funcdesc}

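A small sketch of the translation performed by \function{EncodedFile()}, assuming Python 3, where the wrapped file accepts and returns \code{bytes}. The two encodings are passed positionally, in the order \var{input}, \var{output}; \code{io.BytesIO} stands in for a real binary file.

```python
import codecs
import io

# Assumes Python 3: both sides of the wrapper deal in bytes.
backing = io.BytesIO()

# Data written in Latin-1 is stored in the backing file as UTF-8.
wrapped = codecs.EncodedFile(backing, "latin-1", "utf-8")
wrapped.write(b"caf\xe9")
stored = backing.getvalue()

# Reading goes the other way: UTF-8 in the file, Latin-1 to the caller.
source = codecs.EncodedFile(io.BytesIO(stored), "latin-1", "utf-8")
round_trip = source.read()
```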
\begin{funcdesc}{iterencode}{iterable, encoding\optional{, errors}}
Uses an incremental encoder to iteratively encode the input provided by
\var{iterable}. This function is a generator. \var{errors} (as well as
any other keyword argument) is passed through to the incremental encoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{iterdecode}{iterable, encoding\optional{, errors}}
Uses an incremental decoder to iteratively decode the input provided by
\var{iterable}. This function is a generator. \var{errors} (as well as
any other keyword argument) is passed through to the incremental decoder.
\versionadded{2.5}
\end{funcdesc}

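The two generators can be exercised as in the sketch below, which assumes Python 3. The chunk boundaries are arbitrary; in the decoding case one of them deliberately falls inside a multi-byte UTF-8 sequence to show that the incremental decoder carries state across items.

```python
import codecs

# Assumes Python 3: text is str, encoded data is bytes.
chunks = ["Gr\xfc", "\xdfe"]
encoded = list(codecs.iterencode(chunks, "utf-8"))

# The euro sign (three bytes in UTF-8) is split across the first two items.
pieces = [b"\xe2", b"\x82\xac", b"!"]
decoded = "".join(codecs.iterdecode(pieces, "utf-8"))
```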
The module also provides the following constants which are useful
for reading and writing to platform dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM_UTF8}
\dataline{BOM_UTF16}
\dataline{BOM_UTF16_BE}
\dataline{BOM_UTF16_LE}
\dataline{BOM_UTF32}
\dataline{BOM_UTF32_BE}
\dataline{BOM_UTF32_LE}
These constants define various encodings of the Unicode byte order mark
(BOM) used in UTF-16 and UTF-32 data streams to indicate the byte order
used in the stream or file and in UTF-8 as a Unicode signature.
\constant{BOM_UTF16} is either \constant{BOM_UTF16_BE} or
\constant{BOM_UTF16_LE} depending on the platform's native byte order,
\constant{BOM} is an alias for \constant{BOM_UTF16}, \constant{BOM_LE}
for \constant{BOM_UTF16_LE} and \constant{BOM_BE} for \constant{BOM_UTF16_BE}.
The others represent the BOM in UTF-8 and UTF-32 encodings.
\end{datadesc}


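As an illustration of how these constants might be used, the sketch below (assuming Python 3, where the constants are \code{bytes}) strips a leading UTF-8 signature and works out which UTF-16 BOM is native to the current platform. The helper \code{strip_utf8_signature} is a hypothetical name, not part of the module.

```python
import codecs
import sys

# Assumes Python 3: the BOM constants are bytes objects.


def strip_utf8_signature(data):
    """Return data without a leading UTF-8 signature, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# BOM_UTF16 follows the platform's native byte order.
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE

payload = strip_utf8_signature(codecs.BOM_UTF8 + b"abc")
```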
\subsection{Codec Base Classes \label{codec-base-classes}}

The \module{codecs} module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use
in Python.

Each codec has to define four interfaces to make it usable as a codec in
Python: stateless encoder, stateless decoder, stream reader and stream
writer. The stream readers and writers typically reuse the stateless
encoder/decoder to implement the file protocols.

The \class{Codec} class defines the interface for stateless
encoders/decoders.

To simplify and standardize error handling, the \method{encode()} and
\method{decode()} methods may implement different error handling
schemes by providing the \var{errors} string argument. The following
string values are defined and implemented by all standard Python
codecs:

\begin{tableii}{l|l}{code}{Value}{Meaning}
\lineii{'strict'}{Raise \exception{UnicodeError} (or a subclass);
this is the default.}
\lineii{'ignore'}{Ignore the character and continue with the next.}
\lineii{'replace'}{Replace with a suitable replacement character;
Python will use the official U+FFFD REPLACEMENT
CHARACTER for the built-in Unicode codecs on
decoding and '?' on encoding.}
\lineii{'xmlcharrefreplace'}{Replace with the appropriate XML
character reference (only for encoding).}
\lineii{'backslashreplace'}{Replace with backslashed escape sequences
(only for encoding).}
\end{tableii}

The set of allowed values can be extended via \function{register_error()}.


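The effect of each predefined \var{errors} value can be seen in the following sketch, which assumes Python 3 and uses the ASCII codec so that the character \code{U+00E9} is unencodable and the byte \code{0xE9} is undecodable.

```python
# Assumes Python 3: text is str, encoded data is bytes.
text = "caf\xe9"

encoded = {
    "ignore": text.encode("ascii", "ignore"),
    "replace": text.encode("ascii", "replace"),
    "xmlcharrefreplace": text.encode("ascii", "xmlcharrefreplace"),
    "backslashreplace": text.encode("ascii", "backslashreplace"),
}

try:
    text.encode("ascii", "strict")
    strict_raised = False
except UnicodeError:
    strict_raised = True

# On decoding, 'replace' substitutes U+FFFD REPLACEMENT CHARACTER.
decoded = b"caf\xe9".decode("ascii", "replace")
```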
\subsubsection{Codec Objects \label{codec-objects}}

The \class{Codec} class defines these methods which also define the
function interfaces of the stateless encoder and decoder:

\begin{methoddesc}{encode}{input\optional{, errors}}
Encodes the object \var{input} and returns a tuple (output object,
length consumed). While codecs are not restricted to use with Unicode, in
a Unicode context, encoding converts a Unicode object to a plain string
using a particular character set encoding (e.g., \code{cp1252} or
\code{iso-8859-1}).

\var{errors} defines the error handling to apply. It defaults to
\code{'strict'} handling.

The method may not store state in the \class{Codec} instance. Use
\class{StreamWriter} for codecs which have to keep state in order to
make encoding efficient.

The encoder must be able to handle zero length input and return an
empty object of the output object type in this situation.
\end{methoddesc}

\begin{methoddesc}{decode}{input\optional{, errors}}
Decodes the object \var{input} and returns a tuple (output object,
length consumed). In a Unicode context, decoding converts a plain string
encoded using a particular character set encoding to a Unicode object.

\var{input} must be an object which provides the \code{bf_getreadbuf}
buffer slot. Python strings, buffer objects and memory mapped files
are examples of objects providing this slot.

\var{errors} defines the error handling to apply. It defaults to
\code{'strict'} handling.

The method may not store state in the \class{Codec} instance. Use
\class{StreamReader} for codecs which have to keep state in order to
make decoding efficient.

The decoder must be able to handle zero length input and return an
empty object of the output object type in this situation.
\end{methoddesc}

The \class{IncrementalEncoder} and \class{IncrementalDecoder} classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function,
but with multiple calls to the \method{encode}/\method{decode} method of the
incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the \method{encode}/\method{decode} method is the
same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.


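The equivalence stated above can be checked with a sketch like the following, which assumes Python 3 and the built-in UTF-8 codec. The split points are arbitrary and chosen so that the first chunk ends in the middle of a multi-byte sequence.

```python
import codecs

# Assumes Python 3: text is str, encoded data is bytes.
data = "\u20ac100".encode("utf-8")

# Split the input so that the first chunk ends mid-character.
parts = [data[:1], data[1:4], data[4:]]

decoder = codecs.getincrementaldecoder("utf-8")()
outputs = []
for index, part in enumerate(parts):
    is_last = index == len(parts) - 1
    outputs.append(decoder.decode(part, final=is_last))
joined = "".join(outputs)

# One call to the stateless decoder on the joined input.
stateless, consumed = codecs.getdecoder("utf-8")(data)
```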
\subsubsection{IncrementalEncoder Objects \label{incremental-encoder-objects}}

\versionadded{2.5}

The \class{IncrementalEncoder} class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.

\begin{classdesc}{IncrementalEncoder}{\optional{errors}}
Constructor for an \class{IncrementalEncoder} instance.

All incremental encoders must provide this constructor interface. They are
free to add additional keyword arguments, but only the ones defined
here are used by the Python codec registry.

The \class{IncrementalEncoder} may implement different error handling
schemes by providing the \var{errors} keyword argument. These
parameters are predefined:

\begin{itemize}
\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
this is the default.
\item \code{'ignore'} Ignore the character and continue with the next.
\item \code{'replace'} Replace with a suitable replacement character.
\item \code{'xmlcharrefreplace'} Replace with the appropriate XML
character reference.
\item \code{'backslashreplace'} Replace with backslashed escape sequences.
\end{itemize}

The \var{errors} argument will be assigned to an attribute of the
same name. Assigning to this attribute makes it possible to switch
between different error handling strategies during the lifetime
of the \class{IncrementalEncoder} object.

The set of allowed values for the \var{errors} argument can
be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{encode}{object\optional{, final}}
Encodes \var{object} (taking the current state of the encoder into account)
and returns the resulting encoded object. If this is the last call to
\method{encode}, \var{final} must be true (the default is false).
\end{methoddesc}

\begin{methoddesc}{reset}{}
Reset the encoder to the initial state.
\end{methoddesc}


\subsubsection{IncrementalDecoder Objects \label{incremental-decoder-objects}}

The \class{IncrementalDecoder} class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.

\begin{classdesc}{IncrementalDecoder}{\optional{errors}}
Constructor for an \class{IncrementalDecoder} instance.

All incremental decoders must provide this constructor interface. They are
free to add additional keyword arguments, but only the ones defined
here are used by the Python codec registry.

The \class{IncrementalDecoder} may implement different error handling
schemes by providing the \var{errors} keyword argument. These
parameters are predefined:

\begin{itemize}
\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
this is the default.
\item \code{'ignore'} Ignore the character and continue with the next.
\item \code{'replace'} Replace with a suitable replacement character.
\end{itemize}

The \var{errors} argument will be assigned to an attribute of the
same name. Assigning to this attribute makes it possible to switch
between different error handling strategies during the lifetime
of the \class{IncrementalDecoder} object.

The set of allowed values for the \var{errors} argument can
be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{decode}{object\optional{, final}}
Decodes \var{object} (taking the current state of the decoder into account)
and returns the resulting decoded object. If this is the last call to
\method{decode}, \var{final} must be true (the default is false).
If \var{final} is true the decoder must decode the input completely and must
flush all buffers. If this isn't possible (e.g.\ because of incomplete byte
sequences at the end of the input) it must initiate error handling just like
in the stateless case (which might raise an exception).
\end{methoddesc}

\begin{methoddesc}{reset}{}
Reset the decoder to the initial state.
\end{methoddesc}


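The role of \var{final} and \method{reset()} can be sketched as follows, assuming Python 3 and the built-in UTF-8 incremental decoder. The input is a deliberately truncated three-byte sequence.

```python
import codecs

# Assumes Python 3: text is str, encoded data is bytes.
decoder = codecs.getincrementaldecoder("utf-8")()

# Two of the three bytes of the euro sign: nothing can be returned yet.
pending = decoder.decode(b"\xe2\x82")

# Signalling the end of input forces error handling for the leftover bytes.
try:
    decoder.decode(b"", final=True)
    truncated = False
except UnicodeDecodeError:
    truncated = True

# After reset() the decoder can be reused for fresh input.
decoder.reset()
recovered = decoder.decode(b"ok", final=True)

# With errors='replace' the incomplete sequence is replaced instead.
lenient = codecs.getincrementaldecoder("utf-8")(errors="replace")
lenient_result = lenient.decode(b"\xe2\x82", final=True)
```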
The \class{StreamWriter} and \class{StreamReader} classes provide
generic working interfaces which can be used to implement new
encoding submodules very easily. See \module{encodings.utf_8} for an
example of how this is done.


\subsubsection{StreamWriter Objects \label{stream-writer-objects}}

The \class{StreamWriter} class is a subclass of \class{Codec} and
defines the following methods which every stream writer must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamWriter}{stream\optional{, errors}}
Constructor for a \class{StreamWriter} instance.

All stream writers must provide this constructor interface. They are
free to add additional keyword arguments, but only the ones defined
here are used by the Python codec registry.

\var{stream} must be a file-like object open for writing binary
data.

The \class{StreamWriter} may implement different error handling
schemes by providing the \var{errors} keyword argument. These
parameters are predefined:

\begin{itemize}
\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
this is the default.
\item \code{'ignore'} Ignore the character and continue with the next.
\item \code{'replace'} Replace with a suitable replacement character.
\item \code{'xmlcharrefreplace'} Replace with the appropriate XML
character reference.
\item \code{'backslashreplace'} Replace with backslashed escape sequences.
\end{itemize}

The \var{errors} argument will be assigned to an attribute of the
same name. Assigning to this attribute makes it possible to switch
between different error handling strategies during the lifetime
of the \class{StreamWriter} object.

The set of allowed values for the \var{errors} argument can
be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{write}{object}
Writes the object's contents encoded to the stream.
\end{methoddesc}

\begin{methoddesc}{writelines}{list}
Writes the concatenated list of strings to the stream (possibly by
reusing the \method{write()} method).
\end{methoddesc}

\begin{methoddesc}{reset}{}
Flushes and resets the codec buffers used for keeping state.

Calling this method should ensure that the data on the output is put
into a clean state that allows appending of new fresh data without
having to rescan the whole stream to recover state.
\end{methoddesc}

In addition to the above methods, the \class{StreamWriter} must also
inherit all other methods and attributes from the underlying stream.


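The sketch below illustrates \method{write()}, \method{writelines()} and switching the \var{errors} attribute during the lifetime of a stream writer. It assumes Python 3 and uses the built-in ASCII stream writer over an \code{io.BytesIO} object standing in for a binary file.

```python
import codecs
import io

# Assumes Python 3: text is str, encoded data is bytes.
raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw, errors="strict")

writer.write("abc")

# Switch the error handling strategy on the existing writer.
writer.errors = "xmlcharrefreplace"
writer.writelines(["\xe9", "\n"])

written = raw.getvalue()
```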
\subsubsection{StreamReader Objects \label{stream-reader-objects}}

The \class{StreamReader} class is a subclass of \class{Codec} and
defines the following methods which every stream reader must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamReader}{stream\optional{, errors}}
Constructor for a \class{StreamReader} instance.

All stream readers must provide this constructor interface. They are
free to add additional keyword arguments, but only the ones defined
here are used by the Python codec registry.

\var{stream} must be a file-like object open for reading (binary)
data.

The \class{StreamReader} may implement different error handling
schemes by providing the \var{errors} keyword argument. These
parameters are predefined:

\begin{itemize}
\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
this is the default.
\item \code{'ignore'} Ignore the character and continue with the next.
\item \code{'replace'} Replace with a suitable replacement character.
\end{itemize}

The \var{errors} argument will be assigned to an attribute of the
same name. Assigning to this attribute makes it possible to switch
between different error handling strategies during the lifetime
of the \class{StreamReader} object.

The set of allowed values for the \var{errors} argument can
be extended with \function{register_error()}.
\end{classdesc}

559 | \begin{methoddesc}{read}{\optional{size\optional{, chars, \optional{firstline}}}}
|
---|
560 | Decodes data from the stream and returns the resulting object.
|
---|
561 |
|
---|
562 | \var{chars} indicates the number of characters to read from the
|
---|
563 | stream. \function{read()} will never return more than \var{chars}
|
---|
564 | characters, but it might return less, if there are not enough
|
---|
565 | characters available.
|
---|
566 |
|
---|
567 | \var{size} indicates the approximate maximum number of bytes to read
|
---|
568 | from the stream for decoding purposes. The decoder can modify this
|
---|
569 | setting as appropriate. The default value -1 indicates to read and
|
---|
570 | decode as much as possible. \var{size} is intended to prevent having
|
---|
571 | to decode huge files in one step.
|
---|
572 |
|
---|
573 | \var{firstline} indicates that it would be sufficient to only return
|
---|
574 | the first line, if there are decoding errors on later lines.
|
---|
575 |
|
---|
576 | The method should use a greedy read strategy meaning that it should
|
---|
577 | read as much data as is allowed within the definition of the encoding
|
---|
578 | and the given size, e.g. if optional encoding endings or state
|
---|
579 | markers are available on the stream, these should be read too.
|
---|
580 |
|
---|
581 | \versionchanged[\var{chars} argument added]{2.4}
|
---|
582 | \versionchanged[\var{firstline} argument added]{2.4.2}
|
---|
583 | \end{methoddesc}
|
---|
584 |
|
---|
585 | \begin{methoddesc}{readline}{\optional{size\optional{, keepends}}}
|
---|
586 | Read one line from the input stream and return the
|
---|
587 | decoded data.
|
---|
588 |
|
---|
589 | \var{size}, if given, is passed as size argument to the stream's
|
---|
590 | \method{readline()} method.
|
---|
591 |
|
---|
592 | If \var{keepends} is false line-endings will be stripped from the
|
---|
593 | lines returned.
|
---|
594 |
|
---|
595 | \versionchanged[\var{keepends} argument added]{2.4}
|
---|
596 | \end{methoddesc}
|
---|
597 |
|
---|
598 | \begin{methoddesc}{readlines}{\optional{sizehint\optional{, keepends}}}
|
---|
599 | Read all lines available on the input stream and return them as a list
|
---|
600 | of lines.
|
---|
601 |
|
---|
602 | Line-endings are implemented using the codec's decoder method and are
|
---|
603 | included in the list entries if \var{keepends} is true.
|
---|
604 |
|
---|
605 | \var{sizehint}, if given, is passed as the \var{size} argument to the
|
---|
606 | stream's \method{read()} method.
|
---|
607 | \end{methoddesc}
|
---|
608 |
|
---|
609 | \begin{methoddesc}{reset}{}
|
---|
610 | Resets the codec buffers used for keeping state.
|
---|
611 |
|
---|
612 | Note that no stream repositioning should take place. This method is
|
---|
613 | primarily intended to be able to recover from decoding errors.
|
---|
614 | \end{methoddesc}
|
---|
615 |
|
---|
616 | In addition to the above methods, the \class{StreamReader} must also
|
---|
617 | inherit all other methods and attributes from the underlying stream.
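As a minimal sketch of the reader methods described above, the following
wraps an in-memory byte stream in the UTF-8 \class{StreamReader} factory
returned by \function{codecs.getreader()}. It assumes a current Python~3
interpreter, where decoded data is a \class{str}; the sample text and the
variable names are illustrative only.

```python
import codecs
import io

# An in-memory byte stream standing in for a file opened in binary mode.
raw = io.BytesIO("first line\nsecond line \u20ac\n".encode("utf-8"))

# getreader() returns the StreamReader factory for the named encoding.
reader = codecs.getreader("utf-8")(raw)

first = reader.readline()                # keepends defaults to true
rest = reader.readlines(keepends=False)  # line endings are stripped

print(repr(first))
print(rest)
```

The first call keeps the line ending because \var{keepends} defaults to
true; the second call strips it from every remaining line.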
|
---|
618 |
|
---|
619 | The next two base classes are included for convenience. They are not
|
---|
620 | needed by the codec registry, but may prove useful in practice.
|
---|
621 |
|
---|
622 |
|
---|
623 | \subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}
|
---|
624 |
|
---|
625 | The \class{StreamReaderWriter} allows wrapping streams which work in
|
---|
626 | both read and write modes.
|
---|
627 |
|
---|
628 | The design is such that one can use the factory functions returned by
|
---|
629 | the \function{lookup()} function to construct the instance.
|
---|
630 |
|
---|
631 | \begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
|
---|
632 | Creates a \class{StreamReaderWriter} instance.
|
---|
633 | \var{stream} must be a file-like object.
|
---|
634 | \var{Reader} and \var{Writer} must be factory functions or classes
|
---|
635 | providing the \class{StreamReader} and \class{StreamWriter} interface
|
---|
636 | respectively.
|
---|
637 | Error handling is done in the same way as defined for the
|
---|
638 | stream readers and writers.
|
---|
639 | \end{classdesc}
|
---|
640 |
|
---|
641 | \class{StreamReaderWriter} instances define the combined interfaces of
|
---|
642 | \class{StreamReader} and \class{StreamWriter} classes. They inherit
|
---|
643 | all other methods and attributes from the underlying stream.
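The construction described above can be sketched as follows, assuming a
current Python~3 interpreter and an \class{io.BytesIO} object as the
underlying stream (both are choices made for this example, not
requirements of the class). The factories are taken from the
\class{CodecInfo} object returned by \function{lookup()}.

```python
import codecs
import io

info = codecs.lookup("utf-8")   # provides the reader and writer factories
raw = io.BytesIO()              # any file-like object that handles bytes

rw = codecs.StreamReaderWriter(raw, info.streamreader, info.streamwriter,
                               "strict")

rw.write("caf\u00e9\n")         # encoded by the StreamWriter
rw.seek(0)                      # rewind the stream and reset codec state
text = rw.read()                # decoded by the StreamReader

print(raw.getvalue())
print(repr(text))
```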
|
---|
644 |
|
---|
645 |
|
---|
646 | \subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}
|
---|
647 |
|
---|
648 | The \class{StreamRecoder} provides a frontend-backend view of
|
---|
649 | encoding data which is sometimes useful when dealing with different
|
---|
650 | encoding environments.
|
---|
651 |
|
---|
652 | The design is such that one can use the factory functions returned by
|
---|
653 | the \function{lookup()} function to construct the instance.
|
---|
654 |
|
---|
655 | \begin{classdesc}{StreamRecoder}{stream, encode, decode,
|
---|
656 | Reader, Writer, errors}
|
---|
657 | Creates a \class{StreamRecoder} instance which implements a two-way
|
---|
658 | conversion: \var{encode} and \var{decode} work on the frontend (the
|
---|
659 | data returned by \method{read()} and passed to \method{write()}) while
|
---|
660 | \var{Reader} and \var{Writer} work on the backend (reading and
|
---|
661 | writing to the stream).
|
---|
662 |
|
---|
663 | You can use these objects to do transparent direct recodings from
|
---|
664 | e.g.\ Latin-1 to UTF-8 and back.
|
---|
665 |
|
---|
666 | \var{stream} must be a file-like object.
|
---|
667 |
|
---|
668 | \var{encode}, \var{decode} must adhere to the \class{Codec}
|
---|
669 | interface. \var{Reader}, \var{Writer} must be factory functions or
|
---|
670 | classes providing objects of the \class{StreamReader} and
|
---|
671 | \class{StreamWriter} interface respectively.
|
---|
672 |
|
---|
673 | \var{encode} and \var{decode} are needed for the frontend
|
---|
674 | translation, \var{Reader} and \var{Writer} for the backend
|
---|
675 | translation. The intermediate format used is determined by the two
|
---|
676 | sets of codecs, e.g. the Unicode codecs will use Unicode as the
|
---|
677 | intermediate encoding.
|
---|
678 |
|
---|
679 | Error handling is done in the same way as defined for the
|
---|
680 | stream readers and writers.
|
---|
681 | \end{classdesc}
|
---|
682 |
|
---|
683 | \class{StreamRecoder} instances define the combined interfaces of
|
---|
684 | \class{StreamReader} and \class{StreamWriter} classes. They inherit
|
---|
685 | all other methods and attributes from the underlying stream.
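A minimal sketch of such a recoding, assuming a current Python~3
interpreter: the caller reads and writes Latin-1 bytes, while the
underlying stream holds UTF-8. The variable names and the choice of
\class{io.BytesIO} as the stream are illustrative.

```python
import codecs
import io

front = codecs.lookup("latin-1")   # encoding seen by the caller
back = codecs.lookup("utf-8")      # encoding stored in the stream
raw = io.BytesIO()

recoder = codecs.StreamRecoder(raw, front.encode, front.decode,
                               back.streamreader, back.streamwriter,
                               "strict")

recoder.write(b"caf\xe9")          # Latin-1 bytes go in
stored = raw.getvalue()            # UTF-8 bytes end up in the stream

raw.seek(0)
returned = recoder.read()          # Latin-1 bytes come back out

print(stored, returned)
```

The intermediate format here is Unicode: \method{write()} decodes the
Latin-1 input and hands the result to the UTF-8 writer, and
\method{read()} does the reverse.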
|
---|
686 |
|
---|
687 | \subsection{Encodings and Unicode\label{encodings-overview}}
|
---|
688 |
|
---|
689 | Unicode strings are stored internally as sequences of codepoints (to
|
---|
690 | be precise as \ctype{Py_UNICODE} arrays). Depending on the way Python is
|
---|
691 | compiled (either via \longprogramopt{enable-unicode=ucs2} or
|
---|
692 | \longprogramopt{enable-unicode=ucs4}, with the former being the default)
|
---|
693 | \ctype{Py_UNICODE} is either a 16-bit or
|
---|
694 | 32-bit data type. Once a Unicode object is used outside of CPU and
|
---|
695 | memory, CPU endianness and how these arrays are stored as bytes become
|
---|
696 | an issue. Transforming a unicode object into a sequence of bytes is
|
---|
697 | called encoding and recreating the unicode object from the sequence of
|
---|
698 | bytes is known as decoding. There are many different methods for how this
|
---|
699 | transformation can be done (these methods are also called encodings).
|
---|
700 | The simplest method is to map the codepoints 0-255 to the bytes
|
---|
701 | \code{0x0}-\code{0xff}. This means that a unicode object that contains
|
---|
702 | codepoints above \code{U+00FF} can't be encoded with this method (which
|
---|
703 | is called \code{'latin-1'} or \code{'iso-8859-1'}).
|
---|
704 | \function{unicode.encode()} will raise a \exception{UnicodeEncodeError}
|
---|
705 | that looks like this: \samp{UnicodeEncodeError: 'latin-1' codec can't
|
---|
706 | encode character u'\e u1234' in position 3: ordinal not in range(256)}.
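The failure described above can be reproduced with a short sketch. It is
written for a current Python~3 interpreter, where the Unicode string type
is \class{str} and the method is \method{str.encode()}; the sample
strings are arbitrary.

```python
text = "abc\u1234"          # U+1234 lies above U+00FF

try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    failed_at = exc.start   # index of the first unencodable character
    reason = exc.reason
else:
    failed_at = reason = None

# Every code point in this string is at most U+00FF, so it round-trips.
encoded = "caf\u00e9".encode("latin-1")
decoded = encoded.decode("latin-1")

print(failed_at, reason)
print(encoded, decoded)
```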
|
---|
707 |
|
---|
708 | There's another group of encodings (the so-called charmap encodings)
|
---|
709 | that choose a different subset of all unicode code points and how
|
---|
710 | these codepoints are mapped to the bytes \code{0x0}-\code{0xff}.
|
---|
711 | To see how this is done simply open e.g. \file{encodings/cp1252.py}
|
---|
712 | (which is an encoding that is used primarily on Windows).
|
---|
713 | There's a string constant with 256 characters that shows you which
|
---|
714 | character is mapped to which byte value.
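The mapping can also be inspected from code. The sketch below assumes a
current Python~3 interpreter and relies on the module-level name
\code{decoding_table}, which is what the generated
\file{encodings/cp1252.py} uses at present; treat that name as an
implementation detail rather than a documented interface.

```python
import encodings.cp1252

# The string constant: the character at index i is what byte i decodes to.
table = encodings.cp1252.decoding_table

euro = "\u20ac"                    # EURO SIGN, not available in Latin-1
as_cp1252 = euro.encode("cp1252")
byte_value = as_cp1252[0]

print(len(table), hex(byte_value), table[byte_value] == euro)
```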
|
---|
715 |
|
---|
716 | All of these encodings can only encode 256 of the 65536 (or 1114112)
|
---|
717 | codepoints defined in unicode. A simple and straightforward way that
|
---|
718 | can store each Unicode code point is to store each codepoint as two
|
---|
719 | consecutive bytes. There are two possibilities: Store the bytes in big
|
---|
720 | endian or in little endian order. These two encodings are called
|
---|
721 | UTF-16-BE and UTF-16-LE respectively. Their disadvantage is that if
|
---|
722 | e.g. you use UTF-16-BE on a little endian machine you will always have
|
---|
723 | to swap bytes on encoding and decoding. UTF-16 avoids this problem:
|
---|
724 | Bytes will always be in natural endianness. When these bytes are read
|
---|
725 | by a CPU with a different endianness, then bytes have to be swapped
|
---|
726 | though. To be able to detect the endianness of a UTF-16 byte sequence,
|
---|
727 | there's the so called BOM (the "Byte Order Mark"). This is the Unicode
|
---|
728 | character \code{U+FEFF}. This character will be prepended to every UTF-16
|
---|
729 | byte sequence. The byte swapped version of this character (\code{0xFFFE}) is
|
---|
730 | an illegal character that may not appear in a Unicode text. So when
|
---|
731 | the first character in a UTF-16 byte sequence appears to be a \code{U+FFFE}
|
---|
732 | the bytes have to be swapped on decoding. Unfortunately, up to Unicode
|
---|
733 | 4.0 the character \code{U+FEFF} had a second purpose as a \samp{ZERO WIDTH
|
---|
734 | NO-BREAK SPACE}: A character that has no width and doesn't allow a
|
---|
735 | word to be split. It can e.g. be used to give hints to a ligature
|
---|
736 | algorithm. With Unicode 4.0 using \code{U+FEFF} as a \samp{ZERO WIDTH NO-BREAK
|
---|
737 | SPACE} has been deprecated (with \code{U+2060} (\samp{WORD JOINER}) assuming
|
---|
738 | this role). Nevertheless Unicode software still must be able to handle
|
---|
739 | \code{U+FEFF} in both roles: As a BOM it's a device to determine the storage
|
---|
740 | layout of the encoded bytes, and vanishes once the byte sequence has
|
---|
741 | been decoded into a Unicode string; as a \samp{ZERO WIDTH NO-BREAK SPACE}
|
---|
742 | it's a normal character that will be decoded like any other.
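The byte orders and the two roles of \code{U+FEFF} can be illustrated
with a short sketch, assuming a current Python~3 interpreter. The
constants \code{BOM_UTF16}, \code{BOM_UTF16_BE} and \code{BOM_UTF16_LE}
come from the \module{codecs} module; the sample text is arbitrary.

```python
import codecs

text = "a\u20ac"

big = text.encode("utf-16-be")     # no BOM, most significant byte first
little = text.encode("utf-16-le")  # no BOM, least significant byte first
native = text.encode("utf-16")     # BOM followed by native byte order

# The "utf-16" decoder consumes the BOM and uses it to pick the byte order.
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")

# The fixed-order codecs do not interpret a BOM: it stays in the result
# as the character U+FEFF.
kept = (codecs.BOM_UTF16_BE + big).decode("utf-16-be")

print(big, little, native)
print(repr(from_big), repr(from_little), repr(kept))
```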
|
---|
743 |
|
---|
744 | There's another encoding that is able to encode the full range of
|
---|
745 | Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means
|
---|
746 | there are no issues with byte order in UTF-8. Each byte in a UTF-8
|
---|
747 | byte sequence consists of two parts: Marker bits (the most significant
|
---|
748 | bits) and payload bits. The marker bits are a sequence of zero to six
|
---|
749 | 1 bits followed by a 0 bit. Unicode characters are encoded like this
|
---|
750 | (with x being payload bits, which when concatenated give the Unicode
|
---|
751 | character):
|
---|
752 |
|
---|
753 | \begin{tableii}{l|l}{textrm}{Range}{Encoding}
|
---|
754 | \lineii{\code{U-00000000} ... \code{U-0000007F}}{0xxxxxxx}
|
---|
755 | \lineii{\code{U-00000080} ... \code{U-000007FF}}{110xxxxx 10xxxxxx}
|
---|
756 | \lineii{\code{U-00000800} ... \code{U-0000FFFF}}{1110xxxx 10xxxxxx 10xxxxxx}
|
---|
757 | \lineii{\code{U-00010000} ... \code{U-001FFFFF}}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
|
---|
758 | \lineii{\code{U-00200000} ... \code{U-03FFFFFF}}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
|
---|
759 | \lineii{\code{U-04000000} ... \code{U-7FFFFFFF}}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
|
---|
760 | \end{tableii}
|
---|
761 |
|
---|
762 | The least significant bit of the Unicode character is the rightmost x
|
---|
763 | bit.
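The bit layout in the table can be checked with a hand-written encoder.
The function below is a sketch for illustration, not the codec's actual
implementation; its name is hypothetical, it covers only the first four
rows of the table (enough for every code point up to \code{U+10FFFF}),
and it does not reject surrogates. It assumes a current Python~3
interpreter.

```python
def utf8_encode_codepoint(cp):
    """Encode one code point by hand, following the bit patterns above."""
    if cp < 0x80:
        return bytes([cp])                        # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),           # 110xxxxx
                      0x80 | (cp & 0x3F)])        # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),          # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),          # 11110xxx
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range for this sketch")


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print("U+%04X" % cp, utf8_encode_codepoint(cp).hex())
```

Each continuation byte carries six payload bits, which is why the code
shifts by multiples of six and masks with \code{0x3F}.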
|
---|
764 |
|
---|
765 | As UTF-8 is an 8-bit encoding, no BOM is required and any \code{U+FEFF}
|
---|
766 | character in the decoded Unicode string (even if it's the first
|
---|
767 | character) is treated as a \samp{ZERO WIDTH NO-BREAK SPACE}.
|
---|
768 |
|
---|
769 | Without external information it's impossible to reliably determine
|
---|
770 | which encoding was used for encoding a Unicode string. Each charmap
|
---|
771 | encoding can decode any random byte sequence. However that's not
|
---|
772 | possible with UTF-8, as UTF-8 byte sequences have a structure that
|
---|
773 | doesn't allow arbitrary byte sequences. To increase the reliability
|
---|
774 | with which a UTF-8 encoding can be detected, Microsoft invented a
|
---|
775 | variant of UTF-8 (that Python 2.5 calls \code{"utf-8-sig"}) for its Notepad
|
---|
776 | program: Before any of the Unicode characters is written to the file,
|
---|
777 | a UTF-8 encoded BOM (which looks like this as a byte sequence: \code{0xef},
|
---|
778 | \code{0xbb}, \code{0xbf}) is written. As it's rather improbable that any
|
---|
779 | charmap encoded file starts with these byte values (which would e.g. map to
|
---|
780 |
|
---|
781 | LATIN SMALL LETTER I WITH DIAERESIS \\
|
---|
782 | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK \\
|
---|
783 | INVERTED QUESTION MARK
|
---|
784 |
|
---|
785 | in iso-8859-1), this increases the probability that a utf-8-sig
|
---|
786 | encoding can be correctly guessed from the byte sequence. So here the
|
---|
787 | BOM is not used to be able to determine the byte order used for
|
---|
788 | generating the byte sequence, but as a signature that helps in
|
---|
789 | guessing the encoding. On encoding the utf-8-sig codec will write
|
---|
790 | \code{0xef}, \code{0xbb}, \code{0xbf} as the first three bytes to the file.
|
---|
791 | On decoding utf-8-sig will skip those three bytes if they appear as the
|
---|
792 | first three bytes in the file.
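A short sketch of the points made above, assuming a current Python~3
interpreter: the signature written by \code{utf-8-sig}, how the plain
\code{utf-8} codec treats the same bytes, and why a charmap encoding
gives no structural hint. The byte string \code{noise} is an arbitrary
example.

```python
import codecs

data = "abc".encode("utf-8-sig")
has_signature = data.startswith(codecs.BOM_UTF8)   # b"\xef\xbb\xbf"

stripped = data.decode("utf-8-sig")   # signature is skipped
kept = data.decode("utf-8")           # signature is decoded as U+FEFF

# The same three bytes are valid, if unlikely, text in a charmap encoding.
as_latin1 = codecs.BOM_UTF8.decode("latin-1")

# Arbitrary bytes always decode as Latin-1, but usually not as UTF-8.
noise = b"\xff\xfe\xfa"
try:
    noise.decode("utf-8")
    noise_is_utf8 = True
except UnicodeDecodeError:
    noise_is_utf8 = False

print(has_signature, repr(stripped), repr(kept), noise_is_utf8)
```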
|
---|
793 |
|
---|
794 |
|
---|
795 | \subsection{Standard Encodings\label{standard-encodings}}
|
---|
796 |
|
---|
797 | Python comes with a number of codecs built-in, either implemented as C
|
---|
798 | functions or with dictionaries as mapping tables. The following table
|
---|
799 | lists the codecs by name, together with a few common aliases, and the
|
---|
800 | languages for which the encoding is likely used. Neither the list of
|
---|
801 | aliases nor the list of languages is meant to be exhaustive. Notice
|
---|
802 | that spelling alternatives that only differ in case or use a hyphen
|
---|
803 | instead of an underscore are also valid aliases.
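The alias rule can be checked with \function{codecs.lookup()}. This
sketch assumes a current Python~3 interpreter, where the returned
\class{CodecInfo} object has a \code{name} attribute; the helper name
\code{canonical} is hypothetical.

```python
import codecs


def canonical(spelling):
    """Return the registry's name for the codec behind this spelling."""
    return codecs.lookup(spelling).name


latin_spellings = ["latin_1", "latin-1", "Latin-1", "LATIN_1",
                   "iso-8859-1", "L1"]
utf8_spellings = ["utf_8", "utf-8", "UTF-8", "utf8", "U8"]

latin_names = {canonical(s) for s in latin_spellings}
utf8_names = {canonical(s) for s in utf8_spellings}

print(latin_names, utf8_names)
```

Each set ends up with a single element, because every spelling in a
group resolves to the same codec.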
|
---|
804 |
|
---|
805 | Many of the character sets support the same languages. They vary in
|
---|
806 | individual characters (e.g. whether the EURO SIGN is supported or
|
---|
807 | not), and in the assignment of characters to code positions. For the
|
---|
808 | European languages in particular, the following variants typically
|
---|
809 | exist:
|
---|
810 |
|
---|
811 | \begin{itemize}
|
---|
812 | \item an ISO 8859 codeset
|
---|
813 | \item a Microsoft Windows code page, which is typically derived from
|
---|
814 | an 8859 codeset, but replaces control characters with additional
|
---|
815 | graphic characters
|
---|
816 | \item an IBM EBCDIC code page
|
---|
817 | \item an IBM PC code page, which is \ASCII{} compatible
|
---|
818 | \end{itemize}
|
---|
819 |
|
---|
820 | \begin{longtableiii}{l|l|l}{textrm}{Codec}{Aliases}{Languages}
|
---|
821 |
|
---|
822 | \lineiii{ascii}
|
---|
823 | {646, us-ascii}
|
---|
824 | {English}
|
---|
825 |
|
---|
826 | \lineiii{big5}
|
---|
827 | {big5-tw, csbig5}
|
---|
828 | {Traditional Chinese}
|
---|
829 |
|
---|
830 | \lineiii{big5hkscs}
|
---|
831 | {big5-hkscs, hkscs}
|
---|
832 | {Traditional Chinese}
|
---|
833 |
|
---|
834 | \lineiii{cp037}
|
---|
835 | {IBM037, IBM039}
|
---|
836 | {English}
|
---|
837 |
|
---|
838 | \lineiii{cp424}
|
---|
839 | {EBCDIC-CP-HE, IBM424}
|
---|
840 | {Hebrew}
|
---|
841 |
|
---|
842 | \lineiii{cp437}
|
---|
843 | {437, IBM437}
|
---|
844 | {English}
|
---|
845 |
|
---|
846 | \lineiii{cp500}
|
---|
847 | {EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500}
|
---|
848 | {Western Europe}
|
---|
849 |
|
---|
850 | \lineiii{cp737}
|
---|
851 | {}
|
---|
852 | {Greek}
|
---|
853 |
|
---|
854 | \lineiii{cp775}
|
---|
855 | {IBM775}
|
---|
856 | {Baltic languages}
|
---|
857 |
|
---|
858 | \lineiii{cp850}
|
---|
859 | {850, IBM850}
|
---|
860 | {Western Europe}
|
---|
861 |
|
---|
862 | \lineiii{cp852}
|
---|
863 | {852, IBM852}
|
---|
864 | {Central and Eastern Europe}
|
---|
865 |
|
---|
866 | \lineiii{cp855}
|
---|
867 | {855, IBM855}
|
---|
868 | {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
|
---|
869 |
|
---|
870 | \lineiii{cp856}
|
---|
871 | {}
|
---|
872 | {Hebrew}
|
---|
873 |
|
---|
874 | \lineiii{cp857}
|
---|
875 | {857, IBM857}
|
---|
876 | {Turkish}
|
---|
877 |
|
---|
878 | \lineiii{cp860}
|
---|
879 | {860, IBM860}
|
---|
880 | {Portuguese}
|
---|
881 |
|
---|
882 | \lineiii{cp861}
|
---|
883 | {861, CP-IS, IBM861}
|
---|
884 | {Icelandic}
|
---|
885 |
|
---|
886 | \lineiii{cp862}
|
---|
887 | {862, IBM862}
|
---|
888 | {Hebrew}
|
---|
889 |
|
---|
890 | \lineiii{cp863}
|
---|
891 | {863, IBM863}
|
---|
892 | {Canadian}
|
---|
893 |
|
---|
894 | \lineiii{cp864}
|
---|
895 | {IBM864}
|
---|
896 | {Arabic}
|
---|
897 |
|
---|
898 | \lineiii{cp865}
|
---|
899 | {865, IBM865}
|
---|
900 | {Danish, Norwegian}
|
---|
901 |
|
---|
902 | \lineiii{cp866}
|
---|
903 | {866, IBM866}
|
---|
904 | {Russian}
|
---|
905 |
|
---|
906 | \lineiii{cp869}
|
---|
907 | {869, CP-GR, IBM869}
|
---|
908 | {Greek}
|
---|
909 |
|
---|
910 | \lineiii{cp874}
|
---|
911 | {}
|
---|
912 | {Thai}
|
---|
913 |
|
---|
914 | \lineiii{cp875}
|
---|
915 | {}
|
---|
916 | {Greek}
|
---|
917 |
|
---|
918 | \lineiii{cp932}
|
---|
919 | {932, ms932, mskanji, ms-kanji}
|
---|
920 | {Japanese}
|
---|
921 |
|
---|
922 | \lineiii{cp949}
|
---|
923 | {949, ms949, uhc}
|
---|
924 | {Korean}
|
---|
925 |
|
---|
926 | \lineiii{cp950}
|
---|
927 | {950, ms950}
|
---|
928 | {Traditional Chinese}
|
---|
929 |
|
---|
930 | \lineiii{cp1006}
|
---|
931 | {}
|
---|
932 | {Urdu}
|
---|
933 |
|
---|
934 | \lineiii{cp1026}
|
---|
935 | {ibm1026}
|
---|
936 | {Turkish}
|
---|
937 |
|
---|
938 | \lineiii{cp1140}
|
---|
939 | {ibm1140}
|
---|
940 | {Western Europe}
|
---|
941 |
|
---|
942 | \lineiii{cp1250}
|
---|
943 | {windows-1250}
|
---|
944 | {Central and Eastern Europe}
|
---|
945 |
|
---|
946 | \lineiii{cp1251}
|
---|
947 | {windows-1251}
|
---|
948 | {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
|
---|
949 |
|
---|
950 | \lineiii{cp1252}
|
---|
951 | {windows-1252}
|
---|
952 | {Western Europe}
|
---|
953 |
|
---|
954 | \lineiii{cp1253}
|
---|
955 | {windows-1253}
|
---|
956 | {Greek}
|
---|
957 |
|
---|
958 | \lineiii{cp1254}
|
---|
959 | {windows-1254}
|
---|
960 | {Turkish}
|
---|
961 |
|
---|
962 | \lineiii{cp1255}
|
---|
963 | {windows-1255}
|
---|
964 | {Hebrew}
|
---|
965 |
|
---|
966 | \lineiii{cp1256}
|
---|
967 | {windows-1256}
|
---|
968 | {Arabic}
|
---|
969 |
|
---|
970 | \lineiii{cp1257}
|
---|
971 | {windows-1257}
|
---|
972 | {Baltic languages}
|
---|
973 |
|
---|
974 | \lineiii{cp1258}
|
---|
975 | {windows-1258}
|
---|
976 | {Vietnamese}
|
---|
977 |
|
---|
978 | \lineiii{euc_jp}
|
---|
979 | {eucjp, ujis, u-jis}
|
---|
980 | {Japanese}
|
---|
981 |
|
---|
982 | \lineiii{euc_jis_2004}
|
---|
983 | {jisx0213, eucjis2004}
|
---|
984 | {Japanese}
|
---|
985 |
|
---|
986 | \lineiii{euc_jisx0213}
|
---|
987 | {eucjisx0213}
|
---|
988 | {Japanese}
|
---|
989 |
|
---|
990 | \lineiii{euc_kr}
|
---|
991 | {euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001}
|
---|
992 | {Korean}
|
---|
993 |
|
---|
994 | \lineiii{gb2312}
|
---|
995 | {chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980,
|
---|
996 | gb2312-80, iso-ir-58}
|
---|
997 | {Simplified Chinese}
|
---|
998 |
|
---|
999 | \lineiii{gbk}
|
---|
1000 | {936, cp936, ms936}
|
---|
1001 | {Unified Chinese}
|
---|
1002 |
|
---|
1003 | \lineiii{gb18030}
|
---|
1004 | {gb18030-2000}
|
---|
1005 | {Unified Chinese}
|
---|
1006 |
|
---|
1007 | \lineiii{hz}
|
---|
1008 | {hzgb, hz-gb, hz-gb-2312}
|
---|
1009 | {Simplified Chinese}
|
---|
1010 |
|
---|
1011 | \lineiii{iso2022_jp}
|
---|
1012 | {csiso2022jp, iso2022jp, iso-2022-jp}
|
---|
1013 | {Japanese}
|
---|
1014 |
|
---|
1015 | \lineiii{iso2022_jp_1}
|
---|
1016 | {iso2022jp-1, iso-2022-jp-1}
|
---|
1017 | {Japanese}
|
---|
1018 |
|
---|
1019 | \lineiii{iso2022_jp_2}
|
---|
1020 | {iso2022jp-2, iso-2022-jp-2}
|
---|
1021 | {Japanese, Korean, Simplified Chinese, Western Europe, Greek}
|
---|
1022 |
|
---|
1023 | \lineiii{iso2022_jp_2004}
|
---|
1024 | {iso2022jp-2004, iso-2022-jp-2004}
|
---|
1025 | {Japanese}
|
---|
1026 |
|
---|
1027 | \lineiii{iso2022_jp_3}
|
---|
1028 | {iso2022jp-3, iso-2022-jp-3}
|
---|
1029 | {Japanese}
|
---|
1030 |
|
---|
1031 | \lineiii{iso2022_jp_ext}
|
---|
1032 | {iso2022jp-ext, iso-2022-jp-ext}
|
---|
1033 | {Japanese}
|
---|
1034 |
|
---|
1035 | \lineiii{iso2022_kr}
|
---|
1036 | {csiso2022kr, iso2022kr, iso-2022-kr}
|
---|
1037 | {Korean}
|
---|
1038 |
|
---|
1039 | \lineiii{latin_1}
|
---|
1040 | {iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1}
|
---|
1041 | {Western Europe}
|
---|
1042 |
|
---|
1043 | \lineiii{iso8859_2}
|
---|
1044 | {iso-8859-2, latin2, L2}
|
---|
1045 | {Central and Eastern Europe}
|
---|
1046 |
|
---|
1047 | \lineiii{iso8859_3}
|
---|
1048 | {iso-8859-3, latin3, L3}
|
---|
1049 | {Esperanto, Maltese}
|
---|
1050 |
|
---|
1051 | \lineiii{iso8859_4}
|
---|
1052 | {iso-8859-4, latin4, L4}
|
---|
1053 | {Baltic languages}
|
---|
1054 |
|
---|
1055 | \lineiii{iso8859_5}
|
---|
1056 | {iso-8859-5, cyrillic}
|
---|
1057 | {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
|
---|
1058 |
|
---|
1059 | \lineiii{iso8859_6}
|
---|
1060 | {iso-8859-6, arabic}
|
---|
1061 | {Arabic}
|
---|
1062 |
|
---|
1063 | \lineiii{iso8859_7}
|
---|
1064 | {iso-8859-7, greek, greek8}
|
---|
1065 | {Greek}
|
---|
1066 |
|
---|
1067 | \lineiii{iso8859_8}
|
---|
1068 | {iso-8859-8, hebrew}
|
---|
1069 | {Hebrew}
|
---|
1070 |
|
---|
1071 | \lineiii{iso8859_9}
|
---|
1072 | {iso-8859-9, latin5, L5}
|
---|
1073 | {Turkish}
|
---|
1074 |
|
---|
1075 | \lineiii{iso8859_10}
|
---|
1076 | {iso-8859-10, latin6, L6}
|
---|
1077 | {Nordic languages}
|
---|
1078 |
|
---|
1079 | \lineiii{iso8859_13}
|
---|
1080 | {iso-8859-13}
|
---|
1081 | {Baltic languages}
|
---|
1082 |
|
---|
1083 | \lineiii{iso8859_14}
|
---|
1084 | {iso-8859-14, latin8, L8}
|
---|
1085 | {Celtic languages}
|
---|
1086 |
|
---|
1087 | \lineiii{iso8859_15}
|
---|
1088 | {iso-8859-15}
|
---|
1089 | {Western Europe}
|
---|
1090 |
|
---|
1091 | \lineiii{johab}
|
---|
1092 | {cp1361, ms1361}
|
---|
1093 | {Korean}
|
---|
1094 |
|
---|
1095 | \lineiii{koi8_r}
|
---|
1096 | {}
|
---|
1097 | {Russian}
|
---|
1098 |
|
---|
1099 | \lineiii{koi8_u}
|
---|
1100 | {}
|
---|
1101 | {Ukrainian}
|
---|
1102 |
|
---|
1103 | \lineiii{mac_cyrillic}
|
---|
1104 | {maccyrillic}
|
---|
1105 | {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
|
---|
1106 |
|
---|
1107 | \lineiii{mac_greek}
|
---|
1108 | {macgreek}
|
---|
1109 | {Greek}
|
---|
1110 |
|
---|
1111 | \lineiii{mac_iceland}
|
---|
1112 | {maciceland}
|
---|
1113 | {Icelandic}
|
---|
1114 |
|
---|
1115 | \lineiii{mac_latin2}
|
---|
1116 | {maclatin2, maccentraleurope}
|
---|
1117 | {Central and Eastern Europe}
|
---|
1118 |
|
---|
1119 | \lineiii{mac_roman}
|
---|
1120 | {macroman}
|
---|
1121 | {Western Europe}
|
---|
1122 |
|
---|
1123 | \lineiii{mac_turkish}
|
---|
1124 | {macturkish}
|
---|
1125 | {Turkish}
|
---|
1126 |
|
---|
1127 | \lineiii{ptcp154}
|
---|
1128 | {csptcp154, pt154, cp154, cyrillic-asian}
|
---|
1129 | {Kazakh}
|
---|
1130 |
|
---|
1131 | \lineiii{shift_jis}
|
---|
1132 | {csshiftjis, shiftjis, sjis, s_jis}
|
---|
1133 | {Japanese}
|
---|
1134 |
|
---|
1135 | \lineiii{shift_jis_2004}
|
---|
1136 | {shiftjis2004, sjis_2004, sjis2004}
|
---|
1137 | {Japanese}
|
---|
1138 |
|
---|
1139 | \lineiii{shift_jisx0213}
|
---|
1140 | {shiftjisx0213, sjisx0213, s_jisx0213}
|
---|
1141 | {Japanese}
|
---|
1142 |
|
---|
1143 | \lineiii{utf_16}
|
---|
1144 | {U16, utf16}
|
---|
1145 | {all languages}
|
---|
1146 |
|
---|
1147 | \lineiii{utf_16_be}
|
---|
1148 | {UTF-16BE}
|
---|
1149 | {all languages (BMP only)}
|
---|
1150 |
|
---|
1151 | \lineiii{utf_16_le}
|
---|
1152 | {UTF-16LE}
|
---|
1153 | {all languages (BMP only)}
|
---|
1154 |
|
---|
1155 | \lineiii{utf_7}
|
---|
1156 | {U7, unicode-1-1-utf-7}
|
---|
1157 | {all languages}
|
---|
1158 |
|
---|
1159 | \lineiii{utf_8}
|
---|
1160 | {U8, UTF, utf8}
|
---|
1161 | {all languages}
|
---|
1162 |
|
---|
1163 | \lineiii{utf_8_sig}
|
---|
1164 | {}
|
---|
1165 | {all languages}
|
---|
1166 |
|
---|
1167 | \end{longtableiii}
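Several entries in the table cover the same language but assign
characters to different code positions. As a small illustration,
assuming a current Python~3 interpreter, the sketch below encodes one
Cyrillic letter with five of the codecs listed for Russian; the choice
of letter and the variable names are arbitrary.

```python
text = "\u042f"   # CYRILLIC CAPITAL LETTER YA

codecs_for_russian = ["cp866", "cp1251", "iso8859_5", "koi8_r",
                      "mac_cyrillic"]

# Encode the same character with each codec and keep the resulting byte.
encoded = {name: text.encode(name) for name in codecs_for_russian}

for name, data in sorted(encoded.items()):
    print(name, data.hex())
```

Every codec produces a single byte, but the byte values are not all the
same, so data must be decoded with the codec that was used to encode it.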
|
---|
1168 |
|
---|
1169 | A number of codecs are specific to Python, so their codec names have
|
---|
1170 | no meaning outside Python. Some of them don't convert from Unicode
|
---|
1171 | strings to byte strings, but instead use the property of the Python
|
---|
1172 | codecs machinery that any bijective function with one argument can be
|
---|
1173 | considered as an encoding.
|
---|
1174 |
|
---|
1175 | For the codecs listed below, the result in the ``encoding'' direction
|
---|
1176 | is always a byte string. The result of the ``decoding'' direction is
|
---|
1177 | listed as operand type in the table.
|
---|
1178 |
|
---|
1179 | \begin{tableiv}{l|l|l|l}{textrm}{Codec}{Aliases}{Operand type}{Purpose}
|
---|
1180 |
|
---|
1181 | \lineiv{base64_codec}
|
---|
1182 | {base64, base-64}
|
---|
1183 | {byte string}
|
---|
1184 | {Convert operand to MIME base64}
|
---|
1185 |
|
---|
1186 | \lineiv{bz2_codec}
|
---|
1187 | {bz2}
|
---|
1188 | {byte string}
|
---|
1189 | {Compress the operand using bz2}
|
---|
1190 |
|
---|
1191 | \lineiv{hex_codec}
|
---|
1192 | {hex}
|
---|
1193 | {byte string}
|
---|
1194 | {Convert operand to hexadecimal representation, with two
|
---|
1195 | digits per byte}
|
---|
1196 |
|
---|
1197 | \lineiv{idna}
|
---|
1198 | {}
|
---|
1199 | {Unicode string}
|
---|
1200 | {Implements \rfc{3490}.
|
---|
1201 | \versionadded{2.3}
|
---|
1202 | See also \refmodule{encodings.idna}}
|
---|
1203 |
|
---|
1204 | \lineiv{mbcs}
|
---|
1205 | {dbcs}
|
---|
1206 | {Unicode string}
|
---|
1207 | {Windows only: Encode operand according to the ANSI codepage (CP_ACP)}
|
---|
1208 |
|
---|
1209 | \lineiv{palmos}
|
---|
1210 | {}
|
---|
1211 | {Unicode string}
|
---|
1212 | {Encoding of PalmOS 3.5}
|
---|
1213 |
|
---|
1214 | \lineiv{punycode}
|
---|
1215 | {}
|
---|
1216 | {Unicode string}
|
---|
1217 | {Implements \rfc{3492}.
|
---|
1218 | \versionadded{2.3}}
|
---|
1219 |
|
---|
1220 | \lineiv{quopri_codec}
|
---|
1221 | {quopri, quoted-printable, quotedprintable}
|
---|
1222 | {byte string}
|
---|
1223 | {Convert operand to MIME quoted printable}
|
---|
1224 |
|
---|
1225 | \lineiv{raw_unicode_escape}
|
---|
1226 | {}
|
---|
1227 | {Unicode string}
|
---|
1228 | {Produce a string that is suitable as raw Unicode literal in
|
---|
1229 | Python source code}
|
---|
1230 |
|
---|
1231 | \lineiv{rot_13}
|
---|
1232 | {rot13}
|
---|
1233 | {Unicode string}
|
---|
1234 | {Returns the Caesar-cypher encryption of the operand}
|
---|
1235 |
|
---|
1236 | \lineiv{string_escape}
|
---|
1237 | {}
|
---|
1238 | {byte string}
|
---|
1239 | {Produce a string that is suitable as string literal in
|
---|
1240 | Python source code}
|
---|
1241 |
|
---|
1242 | \lineiv{undefined}
|
---|
1243 | {}
|
---|
1244 | {any}
|
---|
1245 | {Raise an exception for all conversions. Can be used as the
|
---|
1246 | system encoding if no automatic coercion between byte and
|
---|
1247 | Unicode strings is desired.}
|
---|
1248 |
|
---|
1249 | \lineiv{unicode_escape}
|
---|
1250 | {}
|
---|
1251 | {Unicode string}
|
---|
1252 | {Produce a string that is suitable as Unicode literal in
|
---|
1253 | Python source code}
|
---|
1254 |
|
---|
1255 | \lineiv{unicode_internal}
|
---|
1256 | {}
|
---|
1257 | {Unicode string}
|
---|
1258 | {Return the internal representation of the operand}
|
---|
1259 |
|
---|
1260 | \lineiv{uu_codec}
|
---|
1261 | {uu}
|
---|
1262 | {byte string}
|
---|
1263 | {Convert the operand using uuencode}
|
---|
1264 |
|
---|
1265 | \lineiv{zlib_codec}
|
---|
1266 | {zip, zlib}
|
---|
1267 | {byte string}
|
---|
1268 | {Compress the operand using zlib}
|
---|
1269 |
|
---|
1270 | \end{tableiv}
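A few of these codecs can be exercised as follows. The sketch assumes a
current Python~3 interpreter, where the byte-to-byte and text-to-text
codecs are reached through \function{codecs.encode()} and
\function{codecs.decode()} rather than through string methods; the
sample data is arbitrary.

```python
import codecs
import zlib

data = b"hello world"

as_hex = codecs.encode(data, "hex_codec")        # two hex digits per byte
as_base64 = codecs.encode(data, "base64_codec")  # MIME base64
compressed = codecs.encode(data, "zlib_codec")
restored = codecs.decode(compressed, "zlib_codec")

rotated = codecs.encode("Hello", "rot_13")       # text in, text out

print(as_hex, as_base64, rotated)
print(restored == data, zlib.decompress(compressed) == data)
```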
|
---|
1271 |
|
---|
1272 | \subsection{\module{encodings.idna} ---
|
---|
1273 | Internationalized Domain Names in Applications}
|
---|
1274 |
|
---|
1275 | \declaremodule{standard}{encodings.idna}
|
---|
1276 | \modulesynopsis{Internationalized Domain Names implementation}
|
---|
1277 | % XXX The next line triggers a formatting bug, so it's commented out
|
---|
1278 | % until that can be fixed.
|
---|
1279 | %\moduleauthor{Martin v. L\"owis}
|
---|
1280 |
|
---|
1281 | \versionadded{2.3}
|
---|
1282 |
|
---|
1283 | This module implements \rfc{3490} (Internationalized Domain Names in
|
---|
1284 | Applications) and \rfc{3491} (Nameprep: A Stringprep Profile for
|
---|
1285 | Internationalized Domain Names (IDN)). It builds upon the
|
---|
1286 | \code{punycode} encoding and \refmodule{stringprep}.
|
---|
1287 |
|
---|
1288 | These RFCs together define a protocol to support non-\ASCII{} characters
|
---|
1289 | in domain names. A domain name containing non-\ASCII{} characters (such
|
---|
1290 | as ``www.Alliancefran\c caise.nu'') is converted into an
|
---|
1291 | \ASCII-compatible encoding (ACE, such as
|
---|
1292 | ``www.xn--alliancefranaise-npb.nu''). The ACE form of the domain name
|
---|
1293 | is then used in all places where arbitrary characters are not allowed
|
---|
1294 | by the protocol, such as DNS queries, HTTP \mailheader{Host} fields, and so
|
---|
1295 | on. This conversion is carried out in the application; if possible
|
---|
1296 | invisibly to the user: The application should transparently convert
|
---|
1297 | Unicode domain labels to IDNA on the wire, and convert ACE labels back
|
---|
1298 | to Unicode before presenting them to the user.
|
---|
1299 |
|
---|
1300 | Python supports this conversion in several ways: The \code{idna} codec
|
---|
1301 | converts between Unicode and the ACE. Furthermore, the
|
---|
1302 | \refmodule{socket} module transparently converts Unicode host names to
|
---|
1303 | ACE, so that applications need not be concerned about converting host
|
---|
1304 | names themselves when they pass them to the socket module. On top of
|
---|
1305 | that, modules that have host names as function parameters, such as
|
---|
1306 | \refmodule{httplib} and \refmodule{ftplib}, accept Unicode host names
|
---|
1307 | (\refmodule{httplib} then also transparently sends an IDNA hostname in
|
---|
1308 | the \mailheader{Host} field if it sends that field at all).
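The conversion performed by the \code{idna} codec can be sketched with
the domain name used above. The example assumes a current Python~3
interpreter, where encoding yields a \class{bytes} object.

```python
name = "www.Alliancefran\u00e7aise.nu"

ace = name.encode("idna")    # ASCII-compatible form for use on the wire
back = ace.decode("idna")    # Unicode form for presentation to the user

print(ace)
print(back)
```

The round trip does not restore the capital letter: the non-\ASCII{}
label is nameprepped, and therefore case-folded, before it is encoded.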
|
---|
1309 |
|
---|
1310 | When receiving host names from the wire (such as in reverse name
|
---|
1311 | lookup), no automatic conversion to Unicode is performed: Applications
|
---|
1312 | wishing to present such host names to the user should decode them to
|
---|
1313 | Unicode.
|
---|
1314 |
|
---|
1315 | The module \module{encodings.idna} also implements the nameprep
|
---|
1316 | procedure, which performs certain normalizations on host names, to
|
---|
1317 | achieve case-insensitivity of international domain names, and to unify
|
---|
1318 | similar characters. The nameprep functions can be used directly if
|
---|
1319 | desired.
|
---|
1320 |
|
---|
1321 | \begin{funcdesc}{nameprep}{label}
|
---|
1322 | Return the nameprepped version of \var{label}. The implementation
|
---|
1323 | currently assumes query strings, so \code{AllowUnassigned} is
|
---|
1324 | true.
|
---|
1325 | \end{funcdesc}
|
---|
1326 |
|
---|
1327 | \begin{funcdesc}{ToASCII}{label}
|
---|
1328 | Convert a label to \ASCII, as specified in \rfc{3490}.
|
---|
1329 | \code{UseSTD3ASCIIRules} is assumed to be false.
|
---|
1330 | \end{funcdesc}
|
---|
1331 |
|
---|
1332 | \begin{funcdesc}{ToUnicode}{label}
|
---|
1333 | Convert a label to Unicode, as specified in \rfc{3490}.
|
---|
1334 | \end{funcdesc}
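The three functions operate on single labels, not on complete domain
names. A minimal sketch, assuming a current Python~3 interpreter (where
\function{ToASCII()} returns a \class{bytes} object); the label is the
one from the example domain above, written in upper case.

```python
from encodings import idna

label = "ALLIANCEFRAN\u00c7AISE"

prepared = idna.nameprep(label)   # case-folded and normalized
ace = idna.ToASCII(label)         # nameprep, then punycode with prefix
text = idna.ToUnicode(ace)        # back to the Unicode label

print(prepared, ace, text)
```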
|
---|
1335 |
|
---|
1336 | \subsection{\module{encodings.utf_8_sig} ---
|
---|
1337 | UTF-8 codec with BOM signature}
|
---|
1338 | \declaremodule{standard}{encodings.utf-8-sig} % XXX utf_8_sig gives TeX errors
|
---|
1339 | \modulesynopsis{UTF-8 codec with BOM signature}
|
---|
1340 | \moduleauthor{Walter D\"orwald}{}
|
---|
1341 |
|
---|
1342 | \versionadded{2.5}
|
---|
1343 |
|
---|
1344 | This module implements a variant of the UTF-8 codec: On encoding a
|
---|
1345 | UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For
|
---|
1346 | the stateful encoder this is only done once (on the first write to the
|
---|
1347 | byte stream). For decoding an optional UTF-8 encoded BOM at the start
|
---|
1348 | of the data will be skipped.
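The once-only behaviour of the stateful encoder can be seen with the
incremental codec classes. This is a sketch assuming a current Python~3
interpreter; \function{codecs.getincrementalencoder()} and
\function{codecs.getincrementaldecoder()} return the classes registered
for the codec.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-8-sig")()
first = encoder.encode("ab")     # BOM is written before the first data
second = encoder.encode("cd")    # no BOM on later calls

decoder = codecs.getincrementaldecoder("utf-8-sig")()
decoded = decoder.decode(first) + decoder.decode(second, final=True)

print(first, second, repr(decoded))
```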
|
---|