Changeset 391 for python/trunk/Doc/howto/unicode.rst
- Timestamp:
- Mar 19, 2014, 11:31:01 PM (11 years ago)
- Location:
- python/trunk
- Files:
-
- 2 edited
python/trunk
-
Property svn:mergeinfo
set to
/python/vendor/Python-2.7.6 merged eligible /python/vendor/current merged eligible
-
Property svn:mergeinfo
set to
-
python/trunk/Doc/howto/unicode.rst
  r2 r391
   3    3   *****************
   4    4
   5      - :Release: 1.02
   6      -
   7      - This HOWTO discusses Python's support for Unicode, and explains various problems
   8      - that people commonly encounter when trying to work with Unicode.
        5 + :Release: 1.03
        6 +
        7 + This HOWTO discusses Python 2.x's support for Unicode, and explains
        8 + various problems that people commonly encounter when trying to work
        9 + with Unicode. For the Python 3 version, see
       10 + <http://docs.python.org/py3k/howto/unicode.html>.
   9   11
  10   12   Introduction to Unicode
   …    …
 145  147   handle content with embedded zero bytes.
 146  148
 147      - Generally people don't use this encoding, instead choosing other encodings that
 148      - are more efficient and convenient.
      149 + Generally people don't use this encoding, instead choosing other
      150 + encodings that are more efficient and convenient. UTF-8 is probably
      151 + the most commonly supported encoding; it will be discussed below.
 149  152
 150  153   Encodings don't have to handle every possible Unicode character, and most
   …    …
 223  226
 224  227
 225      - Python's Unicode Support
 226      - ========================
      228 + Python 2.x's Unicode Support
      229 + ============================
 227  230
 228  231   Now that you've learned the rudiments of Unicode, we can look at Python's
   …    …
 251  254       >>> type(s)
 252  255       <type 'unicode'>
 253      -     >>> unicode('abcdef' + chr(255))
      256 +     >>> unicode('abcdef' + chr(255)) #doctest: +NORMALIZE_WHITESPACE
 254  257       Traceback (most recent call last):
 255      -       File "<stdin>", line 1, in ?
      258 +         ...
 256  259       UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
 257  260           ordinal not in range(128)
 258  261
 259  262   The ``errors`` argument specifies the response when the input string can't be
   …    …
 263  266   Unicode result). The following examples show the differences::
 264  267
 265      -     >>> unicode('\x80abc', errors='strict')
      268 +     >>> unicode('\x80abc', errors='strict') #doctest: +NORMALIZE_WHITESPACE
 266  269       Traceback (most recent call last):
 267      -       File "<stdin>", line 1, in ?
      270 +         ...
 268  271       UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
 269  272           ordinal not in range(128)
 270  273       >>> unicode('\x80abc', errors='replace')
 271  274       u'\ufffdabc'
   …    …
 273  276       u'abc'
 274  277
 275      - Encodings are specified as strings containing the encoding's name. Python 2.4
      278 + Encodings are specified as strings containing the encoding's name. Python 2.7
 276  279   comes with roughly 100 different encodings; see the Python Library Reference at
 277  280   :ref:`standard-encodings` for a list. Some encodings
   …    …
 310  313   than 127 will cause an exception::
 311  314
 312      -     >>> s.find('Was\x9f')
      315 +     >>> s.find('Was\x9f') #doctest: +NORMALIZE_WHITESPACE
 313  316       Traceback (most recent call last):
 314      -       File "<stdin>", line 1, in ?
 315      -     UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
      317 +         ...
      318 +     UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3:
      319 +         ordinal not in range(128)
 316  320       >>> s.find(u'Was\x9f')
 317  321       -1
   …    …
 331  335       >>> u.encode('utf-8')
 332  336       '\xea\x80\x80abcd\xde\xb4'
 333      -     >>> u.encode('ascii')
      337 +     >>> u.encode('ascii') #doctest: +NORMALIZE_WHITESPACE
 334  338       Traceback (most recent call last):
 335      -       File "<stdin>", line 1, in ?
 336      -     UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
      339 +         ...
      340 +     UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in
      341 +         position 0: ordinal not in range(128)
 337  342       >>> u.encode('ascii', 'ignore')
 338  343       'abcd'
   …    …
 382  387
 383  388       >>> s = u"a\xac\u1234\u20ac\U00008000"
 384      -
 385      -
 386      -
      389 +     ... #      ^^^^ two-digit hex escape
      390 +     ... #          ^^^^^^ four-digit Unicode escape
      391 +     ... #                      ^^^^^^^^^^ eight-digit Unicode escape
 387  392       >>> for c in s: print ord(c),
 388  393       ...
   …    …
 428  433   When you run it with Python 2.4, it will output the following warning::
 429  434
 430      -     amk:~$ python p263.py
      435 +     amk:~$ python2.4 p263.py
 431  436       sys:1: DeprecationWarning: Non-ASCII character '\xe9'
 432  437            in file p263.py on line 2, but no encoding declared;
 433  438            see http://www.python.org/peps/pep-0263.html for details
      439 +
      440 + Python 2.5 and higher are stricter and will produce a syntax error::
      441 +
      442 +     amk:~$ python2.5 p263.py
      443 +       File "/tmp/p263.py", line 2
      444 +     SyntaxError: Non-ASCII character '\xc3' in file /tmp/p263.py
      445 +       on line 2, but no encoding declared; see
      446 +       http://www.python.org/peps/pep-0263.html for details
 434  447
 435  448
   …    …
 473  486   "Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
 474  487   other". See
 475      - <http://unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values> for a
      488 + <http://www.unicode.org/reports/tr44/#General_Category_Values> for a
 476  489   list of category codes.
 477  490
   …    …
 694  707   Version 1.02: posted August 16 2005. Corrects factual errors.
 695  708
 696      -
      709 + Version 1.03: posted June 20 2010. Notes that Python 3.x is not covered,
      710 + and that the HOWTO only covers 2.x.
      711 +
      712 +
      713 + .. comment Describe Python 3.x support (new section? new document?)
 697  714   .. comment Additional topic: building Python w/ UCS2 or UCS4 support
 698  715   .. comment Describe obscure -U switch somewhere?
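
Several of the doctests touched by this changeset exercise the ``errors`` argument (``'strict'``, ``'replace'``, ``'ignore'``) when decoding and encoding. The sketch below reproduces the same results outside the doctest harness. It is an illustration under stated assumptions, not part of the changeset: it assumes Python 2.6 or later (it also runs on Python 3.3+) so that both ``b''`` and ``u''`` literals parse, it uses ``bytes.decode`` where the HOWTO uses the Python 2 ``unicode()`` built-in, and the helper name ``decode_ascii`` is hypothetical.

```python
# Illustrative sketch only; decode_ascii is a hypothetical helper,
# not something defined in the HOWTO or in this changeset.

raw = b'\x80abc'  # byte 0x80 is outside the ASCII range


def decode_ascii(data, errors):
    """Decode bytes as ASCII; return the text, or the exception name on failure."""
    try:
        return data.decode('ascii', errors)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# One result per error policy named in the HOWTO.
results = dict(
    (policy, decode_ascii(raw, policy))
    for policy in ('strict', 'replace', 'ignore')
)

# Encoding side: same string as the HOWTO's u.encode() examples.
u = u'\ua000abcd\u07b4'
utf8_bytes = u.encode('utf-8')              # every character is representable
ascii_ignored = u.encode('ascii', 'ignore')  # non-ASCII characters are dropped
```

Under ``'strict'`` the decode raises ``UnicodeDecodeError``, which is why the doctests in the diff need a ``Traceback`` block; ``'replace'`` substitutes U+FFFD and ``'ignore'`` drops the offending byte.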