source: python/trunk/Doc/c-api/unicode.rst

.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:


.. c:type:: Py_UNICODE

   This type represents the storage type which Python uses internally as the
   basis for holding Unicode ordinals. Python's default builds use a 16-bit type
   for :c:type:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
   possible to build a UCS4 version of Python (most recent Linux distributions come
   with UCS4 builds of Python). These builds then use a 32-bit type for
   :c:type:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
   where :c:type:`wchar_t` is available and compatible with the chosen Python
   Unicode build variant, :c:type:`Py_UNICODE` is a typedef alias for
   :c:type:`wchar_t` to enhance native platform compatibility. On all other
   platforms, :c:type:`Py_UNICODE` is a typedef alias for either :c:type:`unsigned
   short` (UCS2) or :c:type:`unsigned long` (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
this in mind when writing extensions or interfaces.


.. c:type:: PyUnicodeObject

   This subtype of :c:type:`PyObject` represents a Python Unicode object.


.. c:var:: PyTypeObject PyUnicode_Type

   This instance of :c:type:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. c:function:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.

   .. versionchanged:: 2.2
      Allowed subtypes to be accepted.


.. c:function:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.

   .. versionadded:: 2.2


.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object. *o* has to be a :c:type:`PyUnicodeObject` (not
   checked).

   .. versionchanged:: 2.5
      This function previously returned an :c:type:`int` type. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :c:type:`PyUnicodeObject` (not checked).

   .. versionchanged:: 2.5
      This function previously returned an :c:type:`int` type. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :c:type:`Py_UNICODE` buffer of the object. *o*
   has to be a :c:type:`PyUnicodeObject` (not checked).


.. c:function:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :c:type:`PyUnicodeObject` (not checked).


.. c:function:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.

   .. versionadded:: 2.6


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. c:function:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. c:function:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. c:function:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. c:function:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. c:function:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. c:function:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. c:function:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. c:function:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. c:function:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. c:function:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. c:function:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. c:function:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. c:function:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.

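The ``-1`` sentinel convention of the conversion macros can be illustrated
with a small stand-alone sketch. It does not use the Python headers: the name
``sketch_to_decimal`` and its two hard-coded digit ranges are assumptions made
for illustration only, whereas the real macros consult Python's Unicode
database.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a "to decimal" lookup. It knows only two ranges:
 * ASCII digits (U+0030..U+0039) and Arabic-Indic digits (U+0660..U+0669).
 * Returns the decimal value, or -1 when ch is not one of those digits. */
static int sketch_to_decimal(uint32_t ch)
{
    if (ch >= 0x0030 && ch <= 0x0039)
        return (int)(ch - 0x0030);
    if (ch >= 0x0660 && ch <= 0x0669)
        return (int)(ch - 0x0660);
    return -1;  /* no error state is set; the caller must test for -1 */
}
```

Because no exception is raised, callers have to compare the result against
``-1`` themselves before using it as a digit value.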

Plain Py_UNICODE
""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:


.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be *NULL* which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object. If the buffer is not *NULL*, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when *u*
   is *NULL*.

   .. versionchanged:: 2.5
      This function previously used an :c:type:`int` type for *size*. This might
      require changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be interpreted
   as being UTF-8 encoded. *u* may also be *NULL* which
   causes the contents to be undefined. It is the user's responsibility to fill in
   the needed data. The buffer is copied into the new object. If the buffer is not
   *NULL*, the return value might be a shared object. Therefore, modification of
   the resulting Unicode object is only allowed when *u* is *NULL*.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :c:func:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   string. The following format characters are allowed:

   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.

   .. tabularcolumns:: |l|l|L|

   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Unicode`.      |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Repr`.         |
   +-------------------+---------------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.

   .. versionadded:: 2.6

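The "calculate the size, then format" approach, and the relationship between
a variadic function and its ``va_list`` twin, can be sketched with the
standard C library alone. This is an analogy, not CPython's implementation:
``sketch_format`` and ``sketch_formatv`` are hypothetical names, they return a
``malloc``\ -allocated C string rather than a Unicode object, and they
understand only the plain ``printf`` conversions (not ``%U``, ``%V``, ``%S``
or ``%R``).

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* va_list flavour: takes exactly two arguments. */
static char *sketch_formatv(const char *format, va_list vargs)
{
    va_list copy;
    int needed;
    char *buf;

    /* Pass 1: measure. With a zero-sized buffer, C99 vsnprintf returns the
     * length the output would have. The list is copied first because a
     * va_list cannot be traversed twice. */
    va_copy(copy, vargs);
    needed = vsnprintf(NULL, 0, format, copy);
    va_end(copy);
    if (needed < 0)
        return NULL;

    /* Pass 2: allocate exactly enough and format for real. */
    buf = malloc((size_t)needed + 1);
    if (buf == NULL)
        return NULL;
    vsnprintf(buf, (size_t)needed + 1, format, vargs);
    return buf;  /* the caller frees the result */
}

/* Variadic flavour: packs its arguments and forwards to the va_list one. */
static char *sketch_format(const char *format, ...)
{
    va_list vargs;
    char *result;

    va_start(vargs, format);
    result = sketch_formatv(format, vargs);
    va_end(vargs);
    return result;
}
```

As with the functions documented above, the arguments must match the format
characters exactly; a mismatch is undefined behaviour in C.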

.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal
   :c:type:`Py_UNICODE` buffer, *NULL* if *unicode* is not a Unicode object.
   Note that the resulting :c:type:`Py_UNICODE*` string may contain embedded
   null characters, which would cause the string to be truncated when used in
   most C functions.


.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.

   .. versionchanged:: 2.5
      This function previously returned an :c:type:`int` type. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given *encoding* and using the error handling defined by *errors*. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :c:type:`wchar_t` and provides a header file wchar.h,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :c:type:`Py_UNICODE` type is identical to
the system's :c:type:`wchar_t`.


wchar_t Support
"""""""""""""""

:c:type:`wchar_t` support for platforms which support it:

.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :c:type:`wchar_t` buffer *w* of the given *size*.
   Return *NULL* on failure.

   .. versionchanged:: 2.5
      This function previously used an :c:type:`int` type for *size*. This might
      require changes in your code for properly supporting 64-bit systems.


.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*. At most
   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :c:type:`wchar_t` characters
   copied or -1 in case of an error. Note that the resulting :c:type:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :c:type:`wchar_t` string is 0-terminated in case this is
   required by the application. Also, note that the :c:type:`wchar_t*` string
   might contain null characters, which would cause the string to be truncated
   when used with most C functions.

   .. versionchanged:: 2.5
      This function previously returned an :c:type:`int` type and used an
      :c:type:`int` type for *size*. This might require changes in your code for
      properly supporting 64-bit systems.

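The two caveats about :c:type:`wchar_t` buffers (the copy may lack a
terminator, and the data may contain embedded nulls) can be shown with a
stand-alone sketch. It does not call the Python API: ``sketch_copy_wide`` is a
hypothetical helper that merely imitates the "copy at most *size* characters"
contract, and ``sketch_copy_wide_terminated`` shows one way a caller might
guarantee termination.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: copies at most size characters, returns the number
 * copied, and appends a terminator only if there is room left for one. */
static size_t sketch_copy_wide(const wchar_t *src, size_t src_len,
                               wchar_t *dst, size_t size)
{
    size_t n = src_len < size ? src_len : size;
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
    if (n < size)
        dst[n] = L'\0';
    return n;
}

/* Caller-side pattern: reserve the final slot and terminate explicitly.
 * capacity must be at least 1. */
static size_t sketch_copy_wide_terminated(const wchar_t *src, size_t src_len,
                                          wchar_t *dst, size_t capacity)
{
    size_t n = sketch_copy_wide(src, src_len, dst, capacity - 1);

    dst[n] = L'\0';
    return n;
}
```

When the data holds an embedded null, ``wcslen()`` on the copy reports fewer
characters than were copied, so the returned count is the value to keep.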

.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*, and they
have the same semantics as the ones of the built-in :func:`unicode` Unicode
object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used, which is
ASCII. The file system calls should use :c:data:`Py_FileSystemDefaultEncoding`
as the encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string, on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to use
the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` built-in function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function previously used an :c:type:`int` type for *size*. This might
      require changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python
   string object. *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`~unicode.encode` method. The codec
   to be used is looked up using the Python codec registry. Return *NULL* if
   an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function previously used an :c:type:`int` type for *size*. This might
      require changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method. The codec to be used is looked up using
   the Python codec registry. Return *NULL* if an exception was raised by the
   codec.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function previously used an :c:type:`int` type for *size*. This might
      require changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function previously used an :c:type:`int` type for *size*. This might
      require changes in your code for properly supporting 64-bit systems.

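What "trailing incomplete UTF-8 byte sequences" means for the *consumed* count
can be sketched without the Python headers. The helper below is hypothetical
(``sketch_utf8_complete_prefix`` is not part of any API) and only inspects the
tail of the buffer; it does not validate the rest of the data the way a real
decoder would.

```c
#include <assert.h>
#include <stddef.h>

/* Return how many leading bytes of s form complete UTF-8 sequences, i.e.
 * size minus the length of an unfinished sequence at the very end. */
static size_t sketch_utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t back;

    /* Walk back from the end over continuation bytes (10xxxxxx) until a
     * lead byte is found; a sequence is at most four bytes long. */
    for (back = 1; back <= 4 && back <= size; back++) {
        unsigned char b = s[size - back];
        size_t need;

        if ((b & 0xC0) == 0x80)
            continue;                    /* continuation byte, keep looking */
        if (b < 0x80)
            need = 1;                    /* ASCII */
        else if ((b & 0xE0) == 0xC0)
            need = 2;
        else if ((b & 0xF0) == 0xE0)
            need = 3;
        else if ((b & 0xF8) == 0xF0)
            need = 4;
        else
            return size;                 /* invalid lead byte: not "incomplete" */
        return need > back ? size - back : size;
    }
    return size;                         /* empty input or no lead byte in range */
}
```

A streaming caller would decode the reported prefix and prepend the remaining
bytes to the next chunk of input.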

.. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function previously used an :c:type:`int` type for *size*. This might
      require changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   In a narrow build, code points outside the BMP will be decoded as surrogate pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single code point.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6

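The byte-order rules for UTF-32 output, together with the joining of surrogate
pairs from 16-bit storage, can be sketched in portable C. This is an
illustration under stated assumptions, not the codec's source:
``sketch_encode_utf32`` and ``put_u32`` are hypothetical names, the input is a
plain ``uint16_t`` array standing in for a narrow-build buffer, and there is
no error handling for lone surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store one 32-bit value in the requested order (little != 0: little endian). */
static void put_u32(unsigned char *out, uint32_t v, int little)
{
    int i;

    for (i = 0; i < 4; i++) {
        int shift = little ? 8 * i : 8 * (3 - i);
        out[i] = (unsigned char)((v >> shift) & 0xFF);
    }
}

/* Encode 16-bit units to UTF-32.
 * byteorder: -1 little endian, 1 big endian, 0 native order with a BOM.
 * out must have room for 4 * (n + 1) bytes. Returns the bytes written. */
static size_t sketch_encode_utf32(const uint16_t *units, size_t n,
                                  int byteorder, unsigned char *out)
{
    const uint16_t probe = 1;
    int native_little = *(const unsigned char *)&probe == 1;
    int little = byteorder == 0 ? native_little : byteorder < 0;
    size_t i = 0, written = 0;

    if (byteorder == 0) {            /* a BOM is written only in native mode */
        put_u32(out, 0xFEFF, little);
        written = 4;
    }
    while (i < n) {
        uint32_t cp = units[i++];

        /* Join a high/low surrogate pair into a single code point. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i < n &&
            units[i] >= 0xDC00 && units[i] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i++] - 0xDC00);
        }
        put_u32(out + written, cp, little);
        written += 4;
    }
    return written;
}
```

With an explicit byte order the output carries no BOM, so the receiver has to
know the order by other means.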

.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python string using the UTF-32 encoding in native byte order. The
   string always starts with a BOM. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function previously used an :c:type:`int` type for *size*. This might
      require changes in your code for properly supporting 64-bit systems.

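The BOM handling described for the UTF-16 decoder can be sketched as a small
stand-alone function. The name ``sketch_utf16_bom`` is hypothetical and the
code covers only the first two bytes of the input; it is meant to mirror the
documented rules, not the decoder's actual source.

```c
#include <assert.h>
#include <stddef.h>

/* Inspect the start of a UTF-16 buffer.
 * *byteorder: -1 little endian, 0 native/undecided, 1 big endian.
 * Returns the number of leading bytes to skip (2 if a BOM was consumed,
 * otherwise 0) and updates *byteorder when a BOM decides the order. */
static size_t sketch_utf16_bom(const unsigned char *s, size_t size,
                               int *byteorder)
{
    if (*byteorder != 0 || size < 2)
        return 0;              /* explicit order: a BOM stays in the data */
    if (s[0] == 0xFF && s[1] == 0xFE) {
        *byteorder = -1;       /* U+FEFF stored little endian */
        return 2;
    }
    if (s[0] == 0xFE && s[1] == 0xFF) {
        *byteorder = 1;        /* U+FEFF stored big endian */
        return 2;
    }
    return 0;                  /* no BOM: the caller continues in native order */
}
```

Passing a variable initialised to ``0`` and reading it back afterwards is how
a caller learns which byte order the data actually used.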

.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF16`. If
   *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function previously used an :c:type:`int` type for *size* and an
      :c:type:`int *` type for *consumed*. This might require changes in your
      code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-16 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :c:type:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :c:type:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function previously used an :c:type:`int` type for *size*. This might
      require changes in your code for properly supporting 64-bit systems.

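How a single wide value "may get represented as a surrogate pair" follows from
the standard UTF-16 arithmetic, sketched here without the Python headers. The
function name ``sketch_to_utf16_units`` is an assumption made for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Split one code point into UTF-16 code units.
 * Returns the number of units written to out (1 or 2), or 0 if cp lies
 * outside the Unicode range (above U+10FFFF). */
static int sketch_to_utf16_units(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP value: stored as is */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: low 10 bits */
    return 2;
}
```

For example, U+1F600 becomes the pair ``0xD83D 0xDE00``, while any value
below U+10000 is written as one unit.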

.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python string using the UTF-16 encoding in native byte order. The
   string always starts with a BOM. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.


UTF-7 Codecs
""""""""""""

These are the UTF-7 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF7`. If
   *consumed* is not *NULL*, trailing incomplete UTF-7 base-64 sections will not
   be treated as an error. Those bytes will not be decoded and the number of
   bytes that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-7 and
   return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.

   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
   special meaning) will be encoded in base-64. If *base64WhiteSpace* is
   nonzero, whitespace will be encoded in base-64. Both are set to zero for the
   Python "utf-7" codec.


714Unicode-Escape Codecs
715"""""""""""""""""""""
716
[2]717These are the "Unicode Escape" codec APIs:
718
719
[391]720.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
[2]721
722 Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
723 string *s*. Return *NULL* if an exception was raised by the codec.
724
725 .. versionchanged:: 2.5
[391]726 This function used an :c:type:`int` type for *size*. This might require
[2]727 changes in your code for properly supporting 64-bit systems.
728
729
[391]730.. c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
[2]731
[391]732 Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
[2]733 return a Python string object. Return *NULL* if an exception was raised by the
734 codec.
735
736 .. versionchanged:: 2.5
[391]737 This function used an :c:type:`int` type for *size*. This might require
[2]738 changes in your code for properly supporting 64-bit systems.
739
740
[391]741.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
[2]742
743 Encode a Unicode object using Unicode-Escape and return the result as Python
744 string object. Error handling is "strict". Return *NULL* if an exception was
745 raised by the codec.
746
[391]747
Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
   and return a Python string object.  Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python string object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals, and the codecs accept only these ordinals during encoding.

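As a rough illustration of the rule above, the following plain-C sketch models
strict Latin-1 encoding: ordinals up to 255 are copied into the output byte for
byte, and anything larger is reported as an error. The helper
``latin1_encode_strict`` is hypothetical and is not part of the Python C API;
the real codec additionally honours the *errors* argument::

   #include <stddef.h>

   /* Hypothetical helper, not part of the Python C API.
      Encode `size` code points to Latin-1 with "strict" semantics.
      Returns 0 on success.  Returns -1 and stores the offending index in
      *bad_index if an ordinal does not fit in a single byte. */
   static int
   latin1_encode_strict(const unsigned long *code_points, size_t size,
                        unsigned char *out, size_t *bad_index)
   {
       size_t i;

       for (i = 0; i < size; i++) {
           if (code_points[i] > 0xFF) {
               *bad_index = i;          /* outside the first 256 ordinals */
               return -1;
           }
           out[i] = (unsigned char)code_points[i];
       }
       return 0;
   }

Given the ordinals 0x41 and 0xE9 the sketch succeeds, while an input containing
0x20AC (the euro sign) makes it return ``-1``.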

.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and return
   a Python string object.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and return a
   Python string object.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.


Character Map Codecs
""""""""""""""""""""

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or ``None``
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or ``None``
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is: its ordinal value is interpreted as a Unicode ordinal when decoding and
as a Latin-1 ordinal when encoding. Because of this, mappings only need to
contain entries for characters that map to different code points.

These are the mapping codec APIs:

.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object.  Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can
   be either a dictionary that maps byte values or a Unicode string, which is
   treated as a lookup table. Byte values greater than or equal to the length of
   the string and U+FFFE "characters" are treated as "undefined mapping".

   .. versionchanged:: 2.4
      Allowed unicode string as mapping argument.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

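The lookup-table form of *mapping* can be pictured with the plain-C sketch
below. It is only a model of the documented behaviour, not CPython's
implementation: ``charmap_decode_table`` and ``UNDEFINED_MAPPING`` are
hypothetical names, the table is an ordinary array of code points indexed by
byte value, and the sketch always fails on an undefined mapping, whereas the
real function consults *errors*::

   #include <stddef.h>

   /* U+FFFE in the table marks a byte as having no mapping. */
   #define UNDEFINED_MAPPING 0xFFFEUL

   /* Hypothetical model, not part of the Python C API.
      Decode `size` bytes through a lookup table of `table_len` code points.
      Returns 0 on success.  Returns -1 and stores the offending index in
      *bad_index when a byte has no defined mapping. */
   static int
   charmap_decode_table(const unsigned char *s, size_t size,
                        const unsigned long *table, size_t table_len,
                        unsigned long *out, size_t *bad_index)
   {
       size_t i;

       for (i = 0; i < size; i++) {
           unsigned long cp = UNDEFINED_MAPPING;

           if (s[i] < table_len)        /* bytes beyond the table stay undefined */
               cp = table[s[i]];
           if (cp == UNDEFINED_MAPPING) {
               *bad_index = i;
               return -1;
           }
           out[i] = cp;
       }
       return 0;
   }

With a three-entry table, byte values 0 to 2 are looked up in the table and
every other byte value is reported as undefined.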

.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given
   *mapping* object and return a Python string object. Return *NULL* if an
   exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object.  Error handling is "strict".  Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a
   character mapping *table* to it and return the resulting Unicode object.  Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

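The three possible outcomes of a table lookup (replace, delete, copy) can be
sketched in plain C as follows. This is a simplified model under stated
assumptions, not the codec itself: ``struct translate_entry`` and
``translate_ordinals`` are hypothetical, the table is a small array searched
linearly, and a ``drop`` flag stands in for a mapping to ``None``::

   #include <stddef.h>

   /* Hypothetical table entry, not part of the Python C API. */
   struct translate_entry {
       unsigned long from;   /* source ordinal */
       unsigned long to;     /* replacement ordinal (ignored if drop is set) */
       int drop;             /* non-zero models a mapping to None */
   };

   /* Translate `size` ordinals from `in` into `out` (which must have room
      for `size` ordinals) and return the number of ordinals written. */
   static size_t
   translate_ordinals(const unsigned long *in, size_t size,
                      const struct translate_entry *table, size_t table_len,
                      unsigned long *out)
   {
       size_t i, j, n = 0;

       for (i = 0; i < size; i++) {
           const struct translate_entry *hit = NULL;

           for (j = 0; j < table_len; j++) {
               if (table[j].from == in[i]) {
                   hit = &table[j];
                   break;
               }
           }
           if (hit == NULL)
               out[n++] = in[i];        /* unmapped: copied as-is */
           else if (!hit->drop)
               out[n++] = hit->to;      /* mapped to another ordinal */
           /* otherwise the entry models None and the character is deleted */
       }
       return n;
   }

For the input ``a``, ``b``, ``c`` and a table that maps ``a`` to ``A`` and
drops ``b``, the sketch writes ``A``, ``c`` and returns 2.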

MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
DBCS) is a class of encodings, not just one.  The target encoding is defined by
the user settings on the machine running the codec.


.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.

   .. versionadded:: 2.5


.. c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return a
   Python string object.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.


Methods & Slots
"""""""""""""""

.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

Unless noted otherwise, they return *NULL* or ``-1`` if an exception occurs.


.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings.  If *sep* is *NULL*, splitting
   will be done at all whitespace substrings.  Otherwise, splits occur at the given
   separator.  At most *maxsplit* splits will be done.  If *maxsplit* is negative,
   no limit is set.  Separators are not included in the resulting list.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *maxsplit*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is 0, the line break
   characters are not included in the resulting strings.


.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL* which indicates to
   use the default error handling.


.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given *separator* and return the resulting
   Unicode string.


.. c:function:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.


.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search).  The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.

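Because :c:func:`PyUnicode_Find` distinguishes "not found" (``-1``) from
"error" (``-2``), callers usually need to test both values. The helper below
is one possible way to do so; ``find_first`` is a hypothetical name, and the
sketch assumes that passing ``PY_SSIZE_T_MAX`` as *end* is clamped to the
string length in the same way as a Python slice bound::

   #include <Python.h>

   /* Hypothetical helper, not part of the Python C API.
      Search forward for `needle` in the whole of `haystack`.
      Returns 1 and stores the index in *pos when a match is found,
      0 when there is no match, and -1 with an exception set on error. */
   static int
   find_first(PyObject *haystack, PyObject *needle, Py_ssize_t *pos)
   {
       Py_ssize_t result = PyUnicode_Find(haystack, needle,
                                          0, PY_SSIZE_T_MAX, 1);

       if (result == -2)
           return -1;                   /* error: an exception has been set */
       if (result == -1)
           return 0;                    /* no match */
       *pos = result;
       return 1;
   }

The caller still owns *haystack* and *needle*; the helper neither steals nor
creates references.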

.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``.  Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function returned an :c:type:`int` type and used an :c:type:`int`
      type for *start* and *end*. This might require changes in your code for
      properly supporting 64-bit systems.

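"Non-overlapping" means that the search resumes after the end of each match
rather than one character later. The plain-C sketch below models that rule on
ordinary NUL-terminated byte strings; ``count_non_overlapping`` is a
hypothetical helper, not CPython's implementation, and it does not model
*start*, *end* or empty patterns::

   #include <stddef.h>
   #include <string.h>

   /* Hypothetical model, not part of the Python C API.
      Count non-overlapping occurrences of `substr` in `str`. */
   static long
   count_non_overlapping(const char *str, const char *substr)
   {
       size_t sub_len = strlen(substr);
       const char *p = str;
       long count = 0;

       if (sub_len == 0)
           return 0;                    /* empty patterns are not modelled here */
       while ((p = strstr(p, substr)) != NULL) {
           count++;
           p += sub_len;                /* skip the whole match, so no overlap */
       }
       return count;
   }

Under this rule ``"aaaa"`` contains ``"aa"`` twice (at offsets 0 and 2), not
three times.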

.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == -1 means replace all
   occurrences.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *maxcount*. This might
      require changes in your code for properly supporting 64-bit systems.


.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.  The *args* argument must be a tuple.


.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one element Unicode string. ``-1`` is returned if
   there was an error.