.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:


.. c:type:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals. Python's default builds use a 16-bit
   type for :c:type:`Py_UNICODE` and store Unicode values internally as UCS2. It
   is also possible to build a UCS4 version of Python (most recent Linux
   distributions come with UCS4 builds of Python). These builds then use a 32-bit
   type for :c:type:`Py_UNICODE` and store Unicode data internally as UCS4. On
   platforms where :c:type:`wchar_t` is available and compatible with the chosen
   Python Unicode build variant, :c:type:`Py_UNICODE` is a typedef alias for
   :c:type:`wchar_t` to enhance native platform compatibility. On all other
   platforms, :c:type:`Py_UNICODE` is a typedef alias for either
   :c:type:`unsigned short` (UCS2) or :c:type:`unsigned long` (UCS4).

   Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
   this in mind when writing extensions or interfaces.
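
The effect of the two storage widths on buffer layout can be illustrated with
a small standalone sketch. It does not use the Python headers; the
``demo_*`` typedefs and the helper function are hypothetical stand-ins for the
16-bit and 32-bit variants of the storage type, chosen only to show why a
buffer produced for one build cannot be handed to the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two possible storage widths. */
typedef uint16_t demo_ucs2_unit;   /* narrow (UCS2) build: 2 bytes per unit */
typedef uint32_t demo_ucs4_unit;   /* wide (UCS4) build: 4 bytes per unit */

/* Bytes occupied by `length` storage units of `unit_size` bytes each. */
static size_t demo_buffer_bytes(size_t length, size_t unit_size)
{
    assert(unit_size == 2 || unit_size == 4);
    return length * unit_size;
}
```

Under these assumptions the same three-character string occupies 6 bytes in
the narrow layout and 12 bytes in the wide one, so code compiled against one
layout misreads buffers from the other.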


.. c:type:: PyUnicodeObject

   This subtype of :c:type:`PyObject` represents a Python Unicode object.


.. c:var:: PyTypeObject PyUnicode_Type

   This instance of :c:type:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. c:function:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.

   .. versionchanged:: 2.2
      Allowed subtypes to be accepted.


.. c:function:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.

   .. versionadded:: 2.2


.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object. *o* has to be a :c:type:`PyUnicodeObject` (not
   checked).

   .. versionchanged:: 2.5
      This function returned an :c:type:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :c:type:`PyUnicodeObject` (not checked).

   .. versionchanged:: 2.5
      This function returned an :c:type:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :c:type:`Py_UNICODE` buffer of the object. *o*
   has to be a :c:type:`PyUnicodeObject` (not checked).


.. c:function:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :c:type:`PyUnicodeObject` (not checked).


.. c:function:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.

   .. versionadded:: 2.6


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. c:function:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. c:function:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. c:function:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. c:function:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. c:function:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. c:function:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. c:function:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. c:function:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. c:function:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. c:function:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. c:function:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. c:function:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. c:function:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.
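
Because the conversion macros signal failure with a sentinel value instead of
an exception, callers are expected to test the result before using it. The
standalone sketch below shows that calling pattern. ``demo_to_decimal`` is a
hypothetical, heavily simplified stand-in that knows only ASCII digits and the
ARABIC-INDIC digits U+0660 to U+0669; it is not how the real lookup is
implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a "to decimal" lookup: handles only ASCII digits
   and ARABIC-INDIC digits (U+0660..U+0669); everything else yields -1. */
static int demo_to_decimal(unsigned long ch)
{
    if (ch >= 0x30 && ch <= 0x39)
        return (int)(ch - 0x30);
    if (ch >= 0x0660 && ch <= 0x0669)
        return (int)(ch - 0x0660);
    return -1;
}

/* Parse a leading run of decimal characters, stopping at the first
   character for which the lookup returns the -1 sentinel.  Stores the
   parsed value and returns the number of characters consumed. */
static size_t demo_parse_decimal(const unsigned long *s, size_t n, long *value)
{
    size_t i;
    long v = 0;

    assert(s != NULL && value != NULL);
    for (i = 0; i < n; i++) {
        int d = demo_to_decimal(s[i]);
        if (d < 0)
            break;                  /* not a decimal character: stop */
        v = v * 10 + d;
    }
    *value = v;
    return i;
}
```

The loop never needs an error path of its own: a negative result simply ends
the run of digits.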


Plain Py_UNICODE
""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:


.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be *NULL*, which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object. If the buffer is not *NULL*, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when *u*
   is *NULL*.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be interpreted
   as being UTF-8 encoded. *u* may also be *NULL*, which causes the contents to be
   undefined. It is the user's responsibility to fill in the needed data. The
   buffer is copied into the new object. If the buffer is not *NULL*, the return
   value might be a shared object. Therefore, modification of the resulting
   Unicode object is only allowed when *u* is *NULL*.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :c:func:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   string. The following format characters are allowed:

   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.

   .. tabularcolumns:: |l|l|L|

   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Unicode`.      |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Repr`.         |
   +-------------------+---------------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.

   .. versionadded:: 2.6
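
The pairing of a variadic function with a ``va_list`` variant, and the idea of
computing the result size before formatting, can be sketched with the C
standard library alone. The ``demo_format`` and ``demo_format_v`` names are
hypothetical and the sketch returns a heap-allocated C string instead of a
Python object; it illustrates the general two-pass technique, not the
interpreter's actual implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* va_list variant: takes exactly two arguments and does the real work. */
static char *demo_format_v(const char *format, va_list vargs)
{
    va_list copy;
    int needed;
    char *buf;

    assert(format != NULL);
    va_copy(copy, vargs);                       /* first pass uses a copy */
    needed = vsnprintf(NULL, 0, format, copy);  /* size only, no output */
    va_end(copy);
    if (needed < 0)
        return NULL;

    buf = malloc((size_t)needed + 1);
    if (buf == NULL)
        return NULL;
    vsnprintf(buf, (size_t)needed + 1, format, vargs);  /* second pass */
    return buf;                                 /* caller must free() */
}

/* Variadic front end: collects its arguments and forwards them. */
static char *demo_format(const char *format, ...)
{
    va_list vargs;
    char *result;

    va_start(vargs, format);
    result = demo_format_v(format, vargs);
    va_end(vargs);
    return result;
}
```

A wrapper that itself accepts ``...`` cannot forward its arguments to another
variadic function, which is why the ``va_list`` form exists: user-defined
variadic helpers call it after ``va_start``.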


.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal
   :c:type:`Py_UNICODE` buffer, or *NULL* if *unicode* is not a Unicode object.
   Note that the resulting :c:type:`Py_UNICODE*` string may contain embedded
   null characters, which would cause the string to be truncated when used in
   most C functions.


.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.

   .. versionchanged:: 2.5
      This function returned an :c:type:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given *encoding* and using the error handling defined by *errors*. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :c:type:`wchar_t` and provides a header file wchar.h,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :c:type:`Py_UNICODE` type is identical to
the system's :c:type:`wchar_t`.


wchar_t Support
"""""""""""""""

:c:type:`wchar_t` support for platforms which support it:

.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :c:type:`wchar_t` buffer *w* of the given *size*.
   Return *NULL* on failure.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*. At most
   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :c:type:`wchar_t` characters
   copied or -1 in case of an error. Note that the resulting :c:type:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :c:type:`wchar_t` string is 0-terminated in case this is
   required by the application. Also, note that the :c:type:`wchar_t*` string
   might contain null characters, which would cause the string to be truncated
   when used with most C functions.

   .. versionchanged:: 2.5
      This function returned an :c:type:`int` type and used an :c:type:`int`
      type for *size*. This might require changes in your code for properly
      supporting 64-bit systems.
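
Since the copied string is not guaranteed to be 0-terminated, callers that
need a C string usually reserve one extra slot and terminate the buffer
themselves. The standalone sketch below shows that pattern. Both ``demo_*``
functions are hypothetical: ``demo_copy_wide`` merely imitates a copy routine
that writes at most *size* characters and adds a terminator only when room is
left over.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical copy routine: copies at most `size` characters and appends
   a terminator only if there is room left.  Returns the number copied. */
static size_t demo_copy_wide(const wchar_t *src, size_t src_len,
                             wchar_t *dst, size_t size)
{
    size_t n = src_len < size ? src_len : size;

    assert(src != NULL && dst != NULL);
    wmemcpy(dst, src, n);
    if (n < size)
        dst[n] = L'\0';
    return n;
}

/* Caller-side pattern: offer one slot less than the real capacity, then
   terminate explicitly so the result is always a valid C wide string. */
static size_t demo_copy_wide_terminated(const wchar_t *src, size_t src_len,
                                        wchar_t *dst, size_t capacity)
{
    size_t n;

    assert(capacity > 0);
    n = demo_copy_wide(src, src_len, dst, capacity - 1);
    dst[n] = L'\0';
    return n;
}
```

Explicit termination does not help with embedded null characters; code that
must preserve them has to carry the returned length alongside the buffer.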


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*, and they
have the same semantics as the ones of the built-in :func:`unicode` Unicode
object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used, which is
ASCII. The file system calls should use :c:data:`Py_FileSystemDefaultEncoding`
as the encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string, on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to use
the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` built-in function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python
   string object. *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`~unicode.encode` method. The codec
   to be used is looked up using the Python codec registry. Return *NULL* if
   an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method. The codec to be used is looked up using
   the Python codec registry. Return *NULL* if an exception was raised by the
   codec.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.
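
The notion of "consumed" bytes in stateful decoding can be made concrete with
a standalone sketch that only measures how much of a chunk consists of
complete UTF-8 sequences. The ``demo_*`` functions are hypothetical and
deliberately simplified: they inspect lead bytes only and perform none of the
validation a real decoder does.

```c
#include <assert.h>
#include <stddef.h>

/* Expected length of a UTF-8 sequence judged from its lead byte, or 0 if
   the byte cannot start a sequence.  Simplified: continuation bytes,
   overlong forms and range limits are not checked. */
static size_t demo_utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Number of leading bytes of s[0..size) that form complete sequences.
   A truncated sequence at the end is held back rather than reported.
   (This sketch also stops at an invalid lead byte; a real decoder would
   treat that case as an error.) */
static size_t demo_utf8_consumed(const unsigned char *s, size_t size)
{
    size_t pos = 0;

    assert(s != NULL);
    while (pos < size) {
        size_t len = demo_utf8_seq_len(s[pos]);
        if (len == 0 || pos + len > size)
            break;
        pos += len;
    }
    return pos;
}
```

When reading from a stream, the bytes beyond the consumed count are kept and
prepended to the next chunk, so a character split across two reads is decoded
once both halves have arrived.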


.. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6
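
The byte order rules above can be sketched in plain C. The ``demo_*``
functions below are hypothetical and only show the decision logic: a BOM is
recognized and skipped solely when the caller passed ``0``, and the detected
order is written back through the pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Inspect the first four bytes for a UTF-32 BOM (U+FEFF).
   *byteorder uses the convention above: -1 little, 0 native, 1 big.
   Returns the number of bytes to skip: 4 if a BOM was recognized and
   removed, otherwise 0. */
static size_t demo_utf32_check_bom(const unsigned char *s, size_t size,
                                   int *byteorder)
{
    assert(s != NULL && byteorder != NULL);
    if (*byteorder != 0 || size < 4)
        return 0;               /* explicit order: a BOM is not stripped */
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
        *byteorder = -1;        /* FF FE 00 00: little endian */
        return 4;
    }
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
        *byteorder = 1;         /* 00 00 FE FF: big endian */
        return 4;
    }
    return 0;                   /* no BOM: stay in native order */
}

/* Read one 32-bit unit in an explicit byte order (-1 or 1). */
static unsigned long demo_utf32_read(const unsigned char *p, int byteorder)
{
    assert(p != NULL && (byteorder == -1 || byteorder == 1));
    if (byteorder == -1)
        return (unsigned long)p[0] | ((unsigned long)p[1] << 8) |
               ((unsigned long)p[2] << 16) | ((unsigned long)p[3] << 24);
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] << 8) | (unsigned long)p[3];
}
```

Passing the order by pointer lets the caller learn which order was detected
and reuse it for subsequent chunks of the same stream.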


.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single codepoint.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6
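
A standalone sketch of the encoding side follows. It takes 16-bit storage
units as input, merges each surrogate pair into a single code point, and
writes a BOM only for ``byteorder == 0``. All ``demo_*`` names are
hypothetical, and the sketch assumes that the native order is little endian,
which is an assumption of this example and not something the API guarantees.

```c
#include <assert.h>
#include <stddef.h>

/* Store one code point as four bytes in the selected layout. */
static void demo_put32(unsigned char *out, unsigned long cp, int big_endian)
{
    int i;

    for (i = 0; i < 4; i++) {
        int shift = big_endian ? 8 * (3 - i) : 8 * i;
        out[i] = (unsigned char)((cp >> shift) & 0xFF);
    }
}

/* Encode n 16-bit units as UTF-32.  byteorder: -1 little, 1 big,
   0 = BOM followed by data in native order (assumed little endian here).
   Returns the number of bytes written; out must hold 4 * (n + 1) bytes. */
static size_t demo_encode_utf32(const unsigned short *s, size_t n,
                                int byteorder, unsigned char *out)
{
    size_t i = 0, written = 0;
    int big = (byteorder == 1);

    assert(s != NULL && out != NULL);
    if (byteorder == 0) {
        demo_put32(out, 0xFEFF, big);       /* BOM only in native mode */
        written += 4;
    }
    while (i < n) {
        unsigned long cp = s[i++];
        /* Merge a high/low surrogate pair into one code point. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i < n &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) +
                 ((unsigned long)s[i++] - 0xDC00);
        }
        demo_put32(out + written, cp, big);
        written += 4;
    }
    return written;
}
```

The pair ``0xD83D 0xDE00`` therefore becomes the single four-byte value
U+1F600 in the output instead of two separate units.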
| 588 |
|
---|
| 589 |
|
---|
[391] | 590 | .. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
|
---|
[2] | 591 |
|
---|
| 592 | Return a Python string using the UTF-32 encoding in native byte order. The
|
---|
| 593 | string always starts with a BOM mark. Error handling is "strict". Return
|
---|
| 594 | *NULL* if an exception was raised by the codec.
|
---|
| 595 |
|
---|
| 596 | .. versionadded:: 2.6
|
---|
| 597 |
|
---|
| 598 |
|
---|
[391] | 599 | UTF-16 Codecs
|
---|
| 600 | """""""""""""
|
---|
| 601 |
|
---|
[2] | 602 | These are the UTF-16 codec APIs:
|
---|
| 603 |
|
---|
| 604 |
|
---|
[391] | 605 | .. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
|
---|
[2] | 606 |
|
---|
[391] | 607 | Decode *size* bytes from a UTF-16 encoded buffer string and return the
|
---|
[2] | 608 | corresponding Unicode object. *errors* (if non-*NULL*) defines the error
|
---|
| 609 | handling. It defaults to "strict".
|
---|
| 610 |
|
---|
| 611 | If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
|
---|
| 612 | order::
|
---|
| 613 |
|
---|
| 614 | *byteorder == -1: little endian
|
---|
| 615 | *byteorder == 0: native order
|
---|
| 616 | *byteorder == 1: big endian
|
---|
| 617 |
|
---|
| 618 | If ``*byteorder`` is zero, and the first two bytes of the input data are a
|
---|
| 619 | byte order mark (BOM), the decoder switches to this byte order and the BOM is
|
---|
| 620 | not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
|
---|
| 621 | ``1``, any byte order mark is copied to the output (where it will result in
|
---|
| 622 | either a ``\ufeff`` or a ``\ufffe`` character).
|
---|
| 623 |
|
---|
| 624 | After completion, ``*byteorder`` is set to the current byte order at the end
|
---|
| 625 | of input data.
|
---|
| 626 |
|
---|
| 627 | If *byteorder* is *NULL*, the codec starts in native order mode.
|
---|
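As a rough illustration of the byte order handling described above, the following standalone C sketch inspects the first two bytes of a buffer and reports how many bytes a decoder would skip. It is not CPython's implementation and does not use the Python C API; the helper name ``sketch_utf16_bom`` is hypothetical and exists only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the Python C API).
   Mirrors the documented rules: a BOM is only honoured when the starting
   byte order is 0 (native); with -1 or 1 the bytes are left in place.
   Returns the number of leading bytes to skip (2 if a BOM was consumed). */
static size_t
sketch_utf16_bom(const unsigned char *buf, size_t len, int *byteorder)
{
    int bo = (byteorder != NULL) ? *byteorder : 0;  /* NULL means native */
    size_t skip = 0;

    assert(buf != NULL || len == 0);

    if (bo == 0 && len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            bo = -1;            /* U+FEFF stored as FF FE: little endian */
            skip = 2;
        }
        else if (buf[0] == 0xFE && buf[1] == 0xFF) {
            bo = 1;             /* U+FEFF stored as FE FF: big endian */
            skip = 2;
        }
    }
    if (byteorder != NULL)
        *byteorder = bo;        /* report the byte order now in effect */
    return skip;
}
```

A caller that does not know the byte order in advance would typically initialise an ``int`` to ``0``, pass its address, and read the detected order back afterwards.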
| 628 |
|
---|
| 629 | Return *NULL* if an exception was raised by the codec.
|
---|
| 630 |
|
---|
| 631 | .. versionchanged:: 2.5
|
---|
[391] | 632 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 633 | changes in your code for properly supporting 64-bit systems.
|
---|
| 634 |
|
---|
| 635 |
|
---|
[391] | 636 | .. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
|
---|
[2] | 637 |
|
---|
[391] | 638 | If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF16`. If
|
---|
| 639 | *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
|
---|
[2] | 640 | trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
|
---|
| 641 | split surrogate pair) as an error. Those bytes will not be decoded and the
|
---|
| 642 | number of bytes that have been decoded will be stored in *consumed*.
|
---|
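The two kinds of trailing incomplete input mentioned above (an odd byte, or the first half of a surrogate pair) can be pictured with the standalone C sketch below. It only computes how many leading bytes form complete UTF-16 units; it is an illustration under that assumption, not the decoder CPython uses, and ``sketch_utf16_complete_prefix`` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the Python C API).
   Returns how many leading bytes of buf can be decoded without running
   into a truncated sequence at the very end of the buffer. */
static size_t
sketch_utf16_complete_prefix(const unsigned char *buf, size_t len,
                             int little_endian)
{
    size_t usable = len & ~(size_t)1;   /* drop a trailing odd byte */

    assert(buf != NULL || len == 0);

    if (usable >= 2) {
        const unsigned char *p = buf + usable - 2;
        unsigned int unit = little_endian
            ? ((unsigned int)p[0] | ((unsigned int)p[1] << 8))
            : (((unsigned int)p[0] << 8) | (unsigned int)p[1]);
        /* A final high surrogate is the first half of a split pair:
           leave it for the next chunk of input. */
        if (unit >= 0xD800 && unit <= 0xDBFF)
            usable -= 2;
    }
    return usable;
}
```

A streaming caller would keep the bytes past the returned offset and prepend them to the next chunk before decoding again.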
| 643 |
|
---|
| 644 | .. versionadded:: 2.4
|
---|
| 645 |
|
---|
| 646 | .. versionchanged:: 2.5
|
---|
[391] | 647 | This function used an :c:type:`int` type for *size* and an :c:type:`int *`
|
---|
[2] | 648 | type for *consumed*. This might require changes in your code for
|
---|
| 649 | properly supporting 64-bit systems.
|
---|
| 650 |
|
---|
| 651 |
|
---|
[391] | 652 | .. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
|
---|
[2] | 653 |
|
---|
| 654 | Return a Python string object holding the UTF-16 encoded value of the Unicode
|
---|
| 655 | data in *s*. Output is written according to the following byte order::
|
---|
| 656 |
|
---|
| 657 | byteorder == -1: little endian
|
---|
| 658 | byteorder == 0: native byte order (writes a BOM mark)
|
---|
| 659 | byteorder == 1: big endian
|
---|
| 660 |
|
---|
| 661 | If byteorder is ``0``, the output string will always start with the Unicode BOM
|
---|
| 662 | mark (U+FEFF). In the other two modes, no BOM mark is prepended.
|
---|
| 663 |
|
---|
[391] | 664 | If *Py_UNICODE_WIDE* is defined, a single :c:type:`Py_UNICODE` value may get
|
---|
| 665 | represented as a surrogate pair. If it is not defined, each :c:type:`Py_UNICODE`
|
---|
[2] | 666 | value is interpreted as a UCS-2 character.
|
---|
| 667 |
|
---|
| 668 | Return *NULL* if an exception was raised by the codec.
|
---|
| 669 |
|
---|
| 670 | .. versionchanged:: 2.5
|
---|
[391] | 671 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 672 | changes in your code for properly supporting 64-bit systems.
|
---|
| 673 |
|
---|
| 674 |
|
---|
[391] | 675 | .. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
|
---|
[2] | 676 |
|
---|
| 677 | Return a Python string using the UTF-16 encoding in native byte order. The
|
---|
| 678 | string always starts with a BOM mark. Error handling is "strict". Return
|
---|
| 679 | *NULL* if an exception was raised by the codec.
|
---|
| 680 |
|
---|
[391] | 681 |
|
---|
| 682 | UTF-7 Codecs
|
---|
| 683 | """"""""""""
|
---|
| 684 |
|
---|
| 685 | These are the UTF-7 codec APIs:
|
---|
| 686 |
|
---|
| 687 |
|
---|
| 688 | .. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)
|
---|
| 689 |
|
---|
| 690 | Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
|
---|
| 691 | *s*. Return *NULL* if an exception was raised by the codec.
|
---|
| 692 |
|
---|
| 693 |
|
---|
| 694 | .. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
|
---|
| 695 |
|
---|
| 696 | If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF7`. If
|
---|
| 697 | *consumed* is not *NULL*, trailing incomplete UTF-7 base-64 sections will not
|
---|
| 698 | be treated as an error. Those bytes will not be decoded and the number of
|
---|
| 699 | bytes that have been decoded will be stored in *consumed*.
|
---|
| 700 |
|
---|
| 701 |
|
---|
| 702 | .. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors)
|
---|
| 703 |
|
---|
| 704 | Encode the :c:type:`Py_UNICODE` buffer of the given size using UTF-7 and
|
---|
| 705 | return a Python bytes object. Return *NULL* if an exception was raised by
|
---|
| 706 | the codec.
|
---|
| 707 |
|
---|
| 708 | If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
|
---|
| 709 | special meaning) will be encoded in base-64. If *base64WhiteSpace* is
|
---|
| 710 | nonzero, whitespace will be encoded in base-64. Both are set to zero for the
|
---|
| 711 | Python "utf-7" codec.
|
---|
| 712 |
|
---|
| 713 |
|
---|
| 714 | Unicode-Escape Codecs
|
---|
| 715 | """""""""""""""""""""
|
---|
| 716 |
|
---|
[2] | 717 | These are the "Unicode Escape" codec APIs:
|
---|
| 718 |
|
---|
| 719 |
|
---|
[391] | 720 | .. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
|
---|
[2] | 721 |
|
---|
| 722 | Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
|
---|
| 723 | string *s*. Return *NULL* if an exception was raised by the codec.
|
---|
| 724 |
|
---|
| 725 | .. versionchanged:: 2.5
|
---|
[391] | 726 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 727 | changes in your code for properly supporting 64-bit systems.
|
---|
| 728 |
|
---|
| 729 |
|
---|
[391] | 730 | .. c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
|
---|
[2] | 731 |
|
---|
[391] | 732 | Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
|
---|
[2] | 733 | return a Python string object. Return *NULL* if an exception was raised by the
|
---|
| 734 | codec.
|
---|
| 735 |
|
---|
| 736 | .. versionchanged:: 2.5
|
---|
[391] | 737 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 738 | changes in your code for properly supporting 64-bit systems.
|
---|
| 739 |
|
---|
| 740 |
|
---|
[391] | 741 | .. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
|
---|
[2] | 742 |
|
---|
| 743 | Encode a Unicode object using Unicode-Escape and return the result as a Python
|
---|
| 744 | string object. Error handling is "strict". Return *NULL* if an exception was
|
---|
| 745 | raised by the codec.
|
---|
| 746 |
|
---|
[391] | 747 |
|
---|
| 748 | Raw-Unicode-Escape Codecs
|
---|
| 749 | """""""""""""""""""""""""
|
---|
| 750 |
|
---|
[2] | 751 | These are the "Raw Unicode Escape" codec APIs:
|
---|
| 752 |
|
---|
| 753 |
|
---|
[391] | 754 | .. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
|
---|
[2] | 755 |
|
---|
| 756 | Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
|
---|
| 757 | encoded string *s*. Return *NULL* if an exception was raised by the codec.
|
---|
| 758 |
|
---|
| 759 | .. versionchanged:: 2.5
|
---|
[391] | 760 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 761 | changes in your code for properly supporting 64-bit systems.
|
---|
| 762 |
|
---|
| 763 |
|
---|
[391] | 764 | .. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
|
---|
[2] | 765 |
|
---|
[391] | 766 | Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
|
---|
[2] | 767 | and return a Python string object. Return *NULL* if an exception was raised by
|
---|
| 768 | the codec.
|
---|
| 769 |
|
---|
| 770 | .. versionchanged:: 2.5
|
---|
[391] | 771 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 772 | changes in your code for properly supporting 64-bit systems.
|
---|
| 773 |
|
---|
| 774 |
|
---|
[391] | 775 | .. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
|
---|
[2] | 776 |
|
---|
| 777 | Encode a Unicode object using Raw-Unicode-Escape and return the result as a
|
---|
| 778 | Python string object. Error handling is "strict". Return *NULL* if an exception
|
---|
| 779 | was raised by the codec.
|
---|
| 780 |
|
---|
[391] | 781 |
|
---|
| 782 | Latin-1 Codecs
|
---|
| 783 | """"""""""""""
|
---|
| 784 |
|
---|
[2] | 785 | These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
|
---|
| 786 | ordinals and only these are accepted by the codecs during encoding.
|
---|
| 787 |
|
---|
| 788 |
|
---|
[391] | 789 | .. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
|
---|
[2] | 790 |
|
---|
| 791 | Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
|
---|
| 792 | *s*. Return *NULL* if an exception was raised by the codec.
|
---|
| 793 |
|
---|
| 794 | .. versionchanged:: 2.5
|
---|
[391] | 795 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 796 | changes in your code for properly supporting 64-bit systems.
|
---|
| 797 |
|
---|
| 798 |
|
---|
[391] | 799 | .. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
|
---|
[2] | 800 |
|
---|
[391] | 801 | Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and return
|
---|
[2] | 802 | a Python string object. Return *NULL* if an exception was raised by the codec.
|
---|
| 803 |
|
---|
| 804 | .. versionchanged:: 2.5
|
---|
[391] | 805 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 806 | changes in your code for properly supporting 64-bit systems.
|
---|
| 807 |
|
---|
| 808 |
|
---|
[391] | 809 | .. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
|
---|
[2] | 810 |
|
---|
| 811 | Encode a Unicode object using Latin-1 and return the result as a Python string
|
---|
| 812 | object. Error handling is "strict". Return *NULL* if an exception was raised
|
---|
| 813 | by the codec.
|
---|
| 814 |
|
---|
[391] | 815 |
|
---|
| 816 | ASCII Codecs
|
---|
| 817 | """"""""""""
|
---|
| 818 |
|
---|
[2] | 819 | These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
|
---|
| 820 | codes generate errors.
|
---|
| 821 |
|
---|
| 822 |
|
---|
[391] | 823 | .. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
|
---|
[2] | 824 |
|
---|
| 825 | Create a Unicode object by decoding *size* bytes of the ASCII encoded string
|
---|
| 826 | *s*. Return *NULL* if an exception was raised by the codec.
|
---|
| 827 |
|
---|
| 828 | .. versionchanged:: 2.5
|
---|
[391] | 829 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 830 | changes in your code for properly supporting 64-bit systems.
|
---|
| 831 |
|
---|
| 832 |
|
---|
[391] | 833 | .. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
|
---|
[2] | 834 |
|
---|
[391] | 835 | Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and return a
|
---|
[2] | 836 | Python string object. Return *NULL* if an exception was raised by the codec.
|
---|
| 837 |
|
---|
| 838 | .. versionchanged:: 2.5
|
---|
[391] | 839 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 840 | changes in your code for properly supporting 64-bit systems.
|
---|
| 841 |
|
---|
| 842 |
|
---|
[391] | 843 | .. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
|
---|
[2] | 844 |
|
---|
| 845 | Encode a Unicode object using ASCII and return the result as a Python string
|
---|
| 846 | object. Error handling is "strict". Return *NULL* if an exception was raised
|
---|
| 847 | by the codec.
|
---|
| 848 |
|
---|
| 849 |
|
---|
[391] | 850 | Character Map Codecs
|
---|
| 851 | """"""""""""""""""""
|
---|
[2] | 852 |
|
---|
| 853 | This codec is special in that it can be used to implement many different codecs
|
---|
| 854 | (and this is in fact what was done to obtain most of the standard codecs
|
---|
| 855 | included in the :mod:`encodings` package). The codec uses mappings to encode and
|
---|
| 856 | decode characters.
|
---|
| 857 |
|
---|
| 858 | Decoding mappings must map single string characters to single Unicode
|
---|
| 859 | characters, integers (which are then interpreted as Unicode ordinals) or None
|
---|
| 860 | (meaning "undefined mapping" and causing an error).
|
---|
| 861 |
|
---|
| 862 | Encoding mappings must map single Unicode characters to single string
|
---|
| 863 | characters, integers (which are then interpreted as Latin-1 ordinals) or None
|
---|
| 864 | (meaning "undefined mapping" and causing an error).
|
---|
| 865 |
|
---|
| 866 | The mapping objects provided need only support the ``__getitem__`` mapping
|
---|
| 867 | interface.
|
---|
| 868 |
|
---|
| 869 | If a character lookup fails with a LookupError, the character is copied as-is,
|
---|
| 870 | meaning that its ordinal value will be interpreted as a Unicode or Latin-1 ordinal,
|
---|
| 871 | respectively. Because of this, mappings only need to contain those mappings which map
|
---|
| 872 | characters to different code points.
|
---|
| 873 |
|
---|
[391] | 874 | These are the mapping codec APIs:
|
---|
[2] | 875 |
|
---|
[391] | 876 | .. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
|
---|
[2] | 877 |
|
---|
| 878 | Create a Unicode object by decoding *size* bytes of the encoded string *s* using
|
---|
| 879 | the given *mapping* object. Return *NULL* if an exception was raised by the
|
---|
| 880 | codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can be a
|
---|
| 881 | dictionary that maps bytes, or a Unicode string that is treated as a lookup table.
|
---|
| 882 | Byte values greater than the length of the string and U+FFFE "characters" are
|
---|
| 883 | treated as "undefined mapping".
|
---|
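The lookup-table form of the mapping can be pictured with the standalone C sketch below, which decodes bytes through an array and stops at the first undefined entry. It is an illustration only, not CPython's implementation: the helper name ``sketch_charmap_decode`` is hypothetical, the table is a plain array of 16-bit values rather than a Unicode object, and the sketch assumes that any byte value at or beyond the end of the table counts as undefined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the Python C API).
   Decodes size bytes from s through a lookup table of table_len entries.
   Returns 0 on success, or -1 at the first undefined mapping, which is
   roughly what "strict" error handling would report as an error. */
static int
sketch_charmap_decode(const unsigned char *s, size_t size,
                      const uint16_t *table, size_t table_len,
                      uint16_t *out)
{
    size_t i;

    assert(s != NULL && table != NULL && out != NULL);

    for (i = 0; i < size; i++) {
        unsigned char b = s[i];
        /* Outside the table, or explicitly marked with U+FFFE. */
        if (b >= table_len || table[b] == 0xFFFE)
            return -1;
        out[i] = table[b];
    }
    return 0;
}
```

With a three-entry table ``{0x0041, 0x20AC, 0xFFFE}``, bytes 0 and 1 decode to U+0041 and U+20AC, while bytes 2 and 3 are both reported as undefined.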
| 884 |
|
---|
| 885 | .. versionchanged:: 2.4
|
---|
| 886 | Allowed unicode string as mapping argument.
|
---|
| 887 |
|
---|
| 888 | .. versionchanged:: 2.5
|
---|
[391] | 889 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 890 | changes in your code for properly supporting 64-bit systems.
|
---|
| 891 |
|
---|
| 892 |
|
---|
[391] | 893 | .. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
|
---|
[2] | 894 |
|
---|
[391] | 895 | Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given
|
---|
[2] | 896 | *mapping* object and return a Python string object. Return *NULL* if an
|
---|
| 897 | exception was raised by the codec.
|
---|
| 898 |
|
---|
| 899 | .. versionchanged:: 2.5
|
---|
[391] | 900 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 901 | changes in your code for properly supporting 64-bit systems.
|
---|
| 902 |
|
---|
| 903 |
|
---|
[391] | 904 | .. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
|
---|
[2] | 905 |
|
---|
| 906 | Encode a Unicode object using the given *mapping* object and return the result
|
---|
| 907 | as a Python string object. Error handling is "strict". Return *NULL* if an
|
---|
| 908 | exception was raised by the codec.
|
---|
| 909 |
|
---|
| 910 | The following codec API is special in that it maps Unicode to Unicode.
|
---|
| 911 |
|
---|
| 912 |
|
---|
[391] | 913 | .. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
|
---|
[2] | 914 |
|
---|
[391] | 915 | Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a
|
---|
[2] | 916 | character mapping *table* to it and return the resulting Unicode object. Return
|
---|
| 917 | *NULL* when an exception was raised by the codec.
|
---|
| 918 |
|
---|
| 919 | The mapping *table* must map Unicode ordinal integers to Unicode ordinal
|
---|
| 920 | integers or None (causing deletion of the character).
|
---|
| 921 |
|
---|
| 922 | Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
|
---|
| 923 | and sequences work well. Unmapped character ordinals (ones which cause a
|
---|
| 924 | :exc:`LookupError`) are left untouched and are copied as-is.
|
---|
| 925 |
|
---|
| 926 | .. versionchanged:: 2.5
|
---|
[391] | 927 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 928 | changes in your code for properly supporting 64-bit systems.
|
---|
| 929 |
|
---|
[391] | 930 |
|
---|
| 931 | MBCS codecs for Windows
|
---|
| 932 | """""""""""""""""""""""
|
---|
| 933 |
|
---|
[2] | 934 | These are the MBCS codec APIs. They are currently only available on Windows and
|
---|
| 935 | use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
|
---|
| 936 | DBCS) is a class of encodings, not just one. The target encoding is defined by
|
---|
| 937 | the user settings on the machine running the codec.
|
---|
| 938 |
|
---|
| 939 |
|
---|
[391] | 940 | .. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
|
---|
[2] | 941 |
|
---|
| 942 | Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
|
---|
| 943 | Return *NULL* if an exception was raised by the codec.
|
---|
| 944 |
|
---|
| 945 | .. versionchanged:: 2.5
|
---|
[391] | 946 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 947 | changes in your code for properly supporting 64-bit systems.
|
---|
| 948 |
|
---|
| 949 |
|
---|
[391] | 950 | .. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
|
---|
[2] | 951 |
|
---|
[391] | 952 | If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeMBCS`. If
|
---|
| 953 | *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode
|
---|
[2] | 954 | a trailing lead byte, and the number of bytes that have been decoded will be stored
|
---|
| 955 | in *consumed*.
|
---|
| 956 |
|
---|
| 957 | .. versionadded:: 2.5
|
---|
| 958 |
|
---|
| 959 |
|
---|
[391] | 960 | .. c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
|
---|
[2] | 961 |
|
---|
[391] | 962 | Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return a
|
---|
[2] | 963 | Python string object. Return *NULL* if an exception was raised by the codec.
|
---|
| 964 |
|
---|
| 965 | .. versionchanged:: 2.5
|
---|
[391] | 966 | This function used an :c:type:`int` type for *size*. This might require
|
---|
[2] | 967 | changes in your code for properly supporting 64-bit systems.
|
---|
| 968 |
|
---|
| 969 |
|
---|
[391] | 970 | .. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
|
---|
[2] | 971 |
|
---|
| 972 | Encode a Unicode object using MBCS and return the result as a Python string
|
---|
| 973 | object. Error handling is "strict". Return *NULL* if an exception was raised
|
---|
| 974 | by the codec.
|
---|
| 975 |
|
---|
| 976 |
|
---|
[391] | 977 | Methods & Slots
|
---|
| 978 | """""""""""""""
|
---|
[2] | 979 |
|
---|
| 980 | .. _unicodemethodsandslots:
|
---|
| 981 |
|
---|
| 982 | Methods and Slot Functions
|
---|
| 983 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
|
---|
| 984 |
|
---|
| 985 | The following APIs are capable of handling Unicode objects and strings on input
|
---|
| 986 | (we refer to them as strings in the descriptions) and return Unicode objects or
|
---|
| 987 | integers as appropriate.
|
---|
| 988 |
|
---|
| 989 | They all return *NULL* or ``-1`` if an exception occurs.
|
---|
| 990 |
|
---|
| 991 |
|
---|
[391] | 992 | .. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
|
---|
[2] | 993 |
|
---|
| 994 | Concatenate two strings, giving a new Unicode string.
|
---|
| 995 |
|
---|
| 996 |
|
---|
[391] | 997 | .. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
|
---|
[2] | 998 |
|
---|
[391] | 999 | Split a string giving a list of Unicode strings. If *sep* is *NULL*, splitting
|
---|
[2] | 1000 | will be done at all whitespace substrings. Otherwise, splits occur at the given
|
---|
| 1001 | separator. At most *maxsplit* splits will be done. If negative, no limit is
|
---|
| 1002 | set. Separators are not included in the resulting list.
|
---|
| 1003 |
|
---|
| 1004 | .. versionchanged:: 2.5
|
---|
[391] | 1005 | This function used an :c:type:`int` type for *maxsplit*. This might require
|
---|
[2] | 1006 | changes in your code for properly supporting 64-bit systems.
|
---|
| 1007 |
|
---|
| 1008 |
|
---|
[391] | 1009 | .. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
|
---|
[2] | 1010 |
|
---|
| 1011 | Split a Unicode string at line breaks, returning a list of Unicode strings.
|
---|
| 1012 | CRLF is considered to be one line break. If *keepend* is 0, the line break
|
---|
| 1013 | characters are not included in the resulting strings.
|
---|
| 1014 |
|
---|
| 1015 |
|
---|
[391] | 1016 | .. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
|
---|
[2] | 1017 |
|
---|
| 1018 | Translate a string by applying a character mapping table to it and return the
|
---|
| 1019 | resulting Unicode object.
|
---|
| 1020 |
|
---|
| 1021 | The mapping table must map Unicode ordinal integers to Unicode ordinal integers
|
---|
| 1022 | or None (causing deletion of the character).
|
---|
| 1023 |
|
---|
| 1024 | Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
|
---|
| 1025 | and sequences work well. Unmapped character ordinals (ones which cause a
|
---|
| 1026 | :exc:`LookupError`) are left untouched and are copied as-is.
|
---|
| 1027 |
|
---|
| 1028 | *errors* has the usual meaning for codecs. It may be *NULL*, which indicates that
|
---|
| 1029 | the default error handling should be used.
|
---|
| 1030 |
|
---|
| 1031 |
|
---|
[391] | 1032 | .. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
|
---|
[2] | 1033 |
|
---|
[391] | 1034 | Join a sequence of strings using the given *separator* and return the resulting
|
---|
[2] | 1035 | Unicode string.
|
---|
| 1036 |
|
---|
| 1037 |
|
---|
[391] | 1038 | .. c:function:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
|
---|
[2] | 1039 |
|
---|
[391] | 1040 | Return 1 if *substr* matches ``str[start:end]`` at the given tail end
|
---|
[2] | 1041 | (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
|
---|
| 1042 | 0 otherwise. Return ``-1`` if an error occurred.
|
---|
| 1043 |
|
---|
| 1044 | .. versionchanged:: 2.5
|
---|
[391] | 1045 | This function used an :c:type:`int` type for *start* and *end*. This
|
---|
[2] | 1046 | might require changes in your code for properly supporting 64-bit
|
---|
| 1047 | systems.
|
---|
| 1048 |
|
---|
| 1049 |
|
---|
[391] | 1050 | .. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
|
---|
[2] | 1051 |
|
---|
[391] | 1052 | Return the first position of *substr* in ``str[start:end]`` using the given
|
---|
[2] | 1053 | *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
|
---|
| 1054 | backward search). The return value is the index of the first match; a value of
|
---|
| 1055 | ``-1`` indicates that no match was found, and ``-2`` indicates that an error
|
---|
| 1056 | occurred and an exception has been set.
|
---|
| 1057 |
|
---|
| 1058 | .. versionchanged:: 2.5
|
---|
[391] | 1059 | This function used an :c:type:`int` type for *start* and *end*. This
|
---|
[2] | 1060 | might require changes in your code for properly supporting 64-bit
|
---|
| 1061 | systems.
|
---|
| 1062 |
|
---|
| 1063 |
|
---|
[391] | 1064 | .. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
|
---|
[2] | 1065 |
|
---|
| 1066 | Return the number of non-overlapping occurrences of *substr* in
|
---|
| 1067 | ``str[start:end]``. Return ``-1`` if an error occurred.
|
---|
| 1068 |
|
---|
| 1069 | .. versionchanged:: 2.5
|
---|
[391] | 1070 | This function returned an :c:type:`int` type and used an :c:type:`int`
|
---|
[2] | 1071 | type for *start* and *end*. This might require changes in your code for
|
---|
| 1072 | properly supporting 64-bit systems.
|
---|
| 1073 |
|
---|
| 1074 |
|
---|
[391] | 1075 | .. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
|
---|
[2] | 1076 |
|
---|
| 1077 | Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
|
---|
| 1078 | return the resulting Unicode object. *maxcount* == -1 means replace all
|
---|
| 1079 | occurrences.
|
---|
| 1080 |
|
---|
| 1081 | .. versionchanged:: 2.5
|
---|
[391] | 1082 | This function used an :c:type:`int` type for *maxcount*. This might
|
---|
[2] | 1083 | require changes in your code for properly supporting 64-bit systems.
|
---|
| 1084 |
|
---|
| 1085 |
|
---|
[391] | 1086 | .. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)
|
---|
[2] | 1087 |
|
---|
| 1088 | Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
|
---|
| 1089 | respectively.
|
---|
| 1090 |
|
---|
| 1091 |
|
---|
[391] | 1092 | .. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
|
---|
[2] | 1093 |
|
---|
| 1094 | Rich compare two Unicode strings and return one of the following:
|
---|
| 1095 |
|
---|
| 1096 | * ``NULL`` in case an exception was raised
|
---|
| 1097 | * :const:`Py_True` or :const:`Py_False` for successful comparisons
|
---|
| 1098 | * :const:`Py_NotImplemented` in case the type combination is unknown
|
---|
| 1099 |
|
---|
| 1100 | Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
|
---|
| 1101 | :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
|
---|
| 1102 | with a :exc:`UnicodeDecodeError`.
|
---|
| 1103 |
|
---|
| 1104 | Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
|
---|
| 1105 | :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
|
---|
| 1106 |
|
---|
| 1107 |
|
---|
[391] | 1108 | .. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
|
---|
[2] | 1109 |
|
---|
| 1110 | Return a new string object from *format* and *args*; this is analogous to
|
---|
| 1111 | ``format % args``. The *args* argument must be a tuple.
|
---|
| 1112 |
|
---|
| 1113 |
|
---|
[391] | 1114 | .. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)
|
---|
[2] | 1115 |
|
---|
| 1116 | Check whether *element* is contained in *container* and return true or false
|
---|
| 1117 | accordingly.
|
---|
| 1118 |
|
---|
| 1119 | *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
|
---|
| 1120 | there was an error.
|
---|