:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database which defines
character properties for all Unicode characters. The data in this database is
based on the :file:`UnicodeData.txt` file version 5.2.0 which is publicly
available from ftp://ftp.unicode.org/.

The module uses the same names and symbols as defined by the UnicodeData File
Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
It defines the following functions:


.. function:: lookup(name)

   Look up character by name. If a character with the given name is found, return
   the corresponding Unicode character. If not found, :exc:`KeyError` is raised.


.. function:: name(unichr[, default])

   Returns the name assigned to the Unicode character *unichr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.


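As a rough illustration of how :func:`lookup` and :func:`name` relate, the
sketch below round-trips a character through both functions and shows the two
ways a missing name can be handled. The helper ``describe`` and the placeholder
string are assumptions made for this example and are not part of the module.
The ``u''`` literals follow this document's usage and are also accepted by
Python 3.3 and later.

```python
import unicodedata


def describe(ch):
    """Return the Unicode name of ch, or a placeholder if it has none.

    `describe` is a hypothetical helper, not part of unicodedata.
    """
    # Passing a default avoids the ValueError raised for unnamed characters.
    return unicodedata.name(ch, u'<unnamed>')


# lookup() and name() are inverses for characters that have a name.
brace = unicodedata.lookup('LEFT CURLY BRACKET')
assert unicodedata.name(brace) == 'LEFT CURLY BRACKET'

# Control characters such as NUL have no name in the database.
print(describe(u'/'))      # SOLIDUS
print(describe(u'\x00'))   # <unnamed>

# An unknown name raises KeyError; lookup() takes no default argument.
try:
    unicodedata.lookup('NO SUCH CHARACTER NAME')
except KeyError:
    print('lookup failed')
```
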
.. function:: decimal(unichr[, default])

   Returns the decimal value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(unichr[, default])

   Returns the digit value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(unichr[, default])

   Returns the numeric value assigned to the Unicode character *unichr* as a
   float. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


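The three numeric properties above are easy to confuse, so a small comparison
may help. The sketch below queries all three for a few characters, passing
``None` as *default* so that an undefined property does not raise. The helper
name ``numeric_properties`` is an assumption made for this example.

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, using None where undefined.

    `numeric_properties` is a hypothetical helper for illustration only.
    """
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


# DIGIT NINE: all three properties are defined.
print(numeric_properties(u'9'))        # (9, 9, 9.0)
# SUPERSCRIPT TWO: has a digit value, but is not a decimal digit.
print(numeric_properties(u'\u00b2'))   # (None, 2, 2.0)
# VULGAR FRACTION ONE HALF: only a numeric value.
print(numeric_properties(u'\u00bd'))   # (None, None, 0.5)
# LATIN SMALL LETTER A: none of the three.
print(numeric_properties(u'a'))        # (None, None, None)
```
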
.. function:: category(unichr)

   Returns the general category assigned to the Unicode character *unichr* as
   a string.


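General categories are two-letter codes whose first letter names the major
class (``L`` for letters, ``N`` for numbers, ``P`` for punctuation, ``Z`` for
separators, and so on). One possible use, sketched here under the assumption
that only the major class matters, is to tally the characters of a string by
that first letter. ``major_categories`` is a hypothetical helper.

```python
import unicodedata
from collections import Counter


def major_categories(text):
    """Count characters in text by the first letter of their general category.

    `major_categories` is a hypothetical helper, not part of unicodedata.
    """
    return Counter(unicodedata.category(ch)[0] for ch in text)


counts = major_categories(u'Hello, World 42!')
# Ten letters, two digits, two punctuation marks and two spaces.
print(sorted(counts.items()))   # [('L', 10), ('N', 2), ('P', 2), ('Z', 2)]
```
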
.. function:: bidirectional(unichr)

   Returns the bidirectional class assigned to the Unicode character *unichr* as
   a string. If no such value is defined, an empty string is returned.


.. function:: combining(unichr)

   Returns the canonical combining class assigned to the Unicode character *unichr*
   as an integer. Returns ``0`` if no combining class is defined.


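A short sketch of the two functions above: base characters report a combining
class of ``0``, while combining marks report a non-zero class that depends on
where the mark attaches. The helper ``is_combining_mark`` is an assumption made
for this example.

```python
import unicodedata


def is_combining_mark(ch):
    """Return True if ch has a non-zero canonical combining class.

    `is_combining_mark` is a hypothetical helper for illustration only.
    """
    return unicodedata.combining(ch) != 0


print(unicodedata.combining(u'a'))        # 0   (base letter)
print(unicodedata.combining(u'\u0301'))   # 230 (COMBINING ACUTE ACCENT, above)
print(unicodedata.combining(u'\u0327'))   # 202 (COMBINING CEDILLA, attached below)
print(is_combining_mark(u'\u0327'))       # True

# Bidirectional classes: left-to-right Latin versus right-to-left Hebrew.
print(unicodedata.bidirectional(u'a'))        # L
print(unicodedata.bidirectional(u'\u05d0'))   # R (HEBREW LETTER ALEF)
```
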
.. function:: east_asian_width(unichr)

   Returns the East Asian width assigned to the Unicode character *unichr* as
   a string.

   .. versionadded:: 2.4


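One common use of the East Asian width property is estimating how many
terminal columns a string occupies. The sketch below is a simplification and
not a complete width algorithm: it assumes Wide (``'W'``) and Fullwidth
(``'F'``) characters take two columns and everything else takes one, which
ignores combining marks and Ambiguous-width characters. ``display_width`` is a
hypothetical helper.

```python
import unicodedata


def display_width(text):
    """Estimate terminal columns, counting Wide and Fullwidth characters as two.

    Simplifying assumption: every other character, including combining marks
    and Ambiguous-width characters, is counted as one column.
    """
    return sum(2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
               for ch in text)


print(unicodedata.east_asian_width(u'A'))        # Na (narrow)
print(unicodedata.east_asian_width(u'\u4e2d'))   # W  (CJK ideograph)
print(unicodedata.east_asian_width(u'\uff21'))   # F  (FULLWIDTH LATIN CAPITAL LETTER A)

print(display_width(u'abc'))            # 3
print(display_width(u'\u4e2d\u6587'))   # 4
```
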
.. function:: mirrored(unichr)

   Returns the mirrored property assigned to the Unicode character *unichr* as
   an integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.


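Brackets and similar paired symbols are typical mirrored characters. As a
minimal sketch, the hypothetical helper ``mirrored_chars`` below collects the
characters of a string for which :func:`mirrored` returns ``1``.

```python
import unicodedata


def mirrored_chars(text):
    """Return the characters of text that are mirrored in bidirectional text.

    `mirrored_chars` is a hypothetical helper for illustration only.
    """
    return [ch for ch in text if unicodedata.mirrored(ch)]


print(unicodedata.mirrored(u'('))   # 1
print(unicodedata.mirrored(u'a'))   # 0
print(mirrored_chars(u'f(x)') == [u'(', u')'])   # True
```
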
.. function:: decomposition(unichr)

   Returns the character decomposition mapping assigned to the Unicode character
   *unichr* as a string. An empty string is returned in case no such mapping is
   defined.


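The mapping string consists of space-separated hexadecimal code points,
optionally preceded by a tag in angle brackets such as ``<compat>`` for a
compatibility mapping. The following sketch splits that string into its parts.
The helper ``parse_decomposition`` and its return shape are assumptions made
for this example.

```python
import unicodedata


def parse_decomposition(ch):
    """Split the decomposition mapping of ch into (tag, code_points).

    tag is None for a canonical mapping or when no mapping is defined.
    `parse_decomposition` is a hypothetical helper, not part of unicodedata.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)
    return tag, [int(field, 16) for field in fields]


print(unicodedata.decomposition(u'\u00c7'))   # 0043 0327
print(parse_decomposition(u'\u00c7'))         # (None, [67, 807])
print(parse_decomposition(u'\u2160'))         # ('<compat>', [73])
print(parse_decomposition(u'a'))              # (None, [])
```
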
.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example, the
   character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.

   Even if two Unicode strings are normalized and look the same to
   a human reader, if one has combining characters and the other
   doesn't, they may not compare equal.

   .. versionadded:: 2.3

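The four forms can be tried out on the two characters used as examples in the
description of :func:`normalize`. The sketch below also shows one way to
compare strings that may mix composed and decomposed characters: normalize
both sides to the same form first. The helper ``same_text`` and its default of
``'NFC'`` are assumptions made for this example.

```python
import unicodedata

composed = u'\u00c7'           # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u'\u0043\u0327'   # LATIN CAPITAL LETTER C + COMBINING CEDILLA
roman_one = u'\u2160'          # ROMAN NUMERAL ONE

# Canonical forms: NFD decomposes, NFC recomposes.
assert unicodedata.normalize('NFD', composed) == decomposed
assert unicodedata.normalize('NFC', decomposed) == composed

# Canonical forms leave compatibility characters alone ...
assert unicodedata.normalize('NFD', roman_one) == roman_one
# ... while the compatibility forms replace them with their equivalents.
assert unicodedata.normalize('NFKD', roman_one) == u'I'
assert unicodedata.normalize('NFKC', roman_one) == u'I'


def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form.

    `same_text` is a hypothetical helper, not part of unicodedata.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)                # False: different code points
print(same_text(composed, decomposed))       # True
print(same_text(roman_one, u'I'))            # False under NFC
print(same_text(roman_one, u'I', 'NFKC'))    # True
```

Which form to compare under depends on the application: the compatibility
forms discard distinctions, such as that between a Roman numeral and a Latin
letter, that some applications need to keep.
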
In addition, the module exposes the following constants:


.. data:: unidata_version

   The version of the Unicode database used in this module.

   .. versionadded:: 2.3


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

   .. versionadded:: 2.5

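A brief sketch of how the two constants can be used together. It assumes that
U+0221 (LATIN SMALL LETTER D WITH CURL) was added to Unicode after version
3.2, so the 3.2 database has no name for it while the current database does.
The value printed for ``unidata_version`` depends on the Python build and is
therefore not shown.

```python
import unicodedata

# Version of the database behind the module-level functions.
print(unicodedata.unidata_version)

# The frozen 3.2 database exposes the same functions as methods.
old = unicodedata.ucd_3_2_0
print(old.unidata_version)   # 3.2.0

# Assumption: U+0221 was assigned after Unicode 3.2.
ch = u'\u0221'
print(old.name(ch, None))           # None: unassigned in 3.2
print(unicodedata.name(ch, None))   # LATIN SMALL LETTER D WITH CURL
```
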
Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   u'{'
   >>> unicodedata.name(u'/')
   'SOLIDUS'
   >>> unicodedata.decimal(u'9')
   9
   >>> unicodedata.decimal(u'a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   ValueError: not a decimal
   >>> unicodedata.category(u'A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional(u'\u0660') # 'A'rabic, 'N'umber
   'AN'