Unicode® 6.0.0
Released: 2010 October 11 (Announcement)
Version 6.0.0 has been superseded by the latest version of the Unicode Standard.
Unicode 6.0.0 is a major version of the Unicode Standard. This page summarizes the important changes for the Unicode Standard, Version 6.0.0. In the discussion below, shortened references to "Unicode 6.0" or "Version 6.0" specifically refer to Version 6.0.0.
A. Summary
B. Version Information
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Unicode Character Database Changes
G. Unicode Standard Annex Changes
A. Summary
The Unicode Standard, Version 6.0 is the first major version of the Unicode
Standard to be published solely in online format.
Version 6.0:
- adds 2,088 characters, including
  - over 1,000 additional symbols, chief among them the additional emoji symbols, which are especially important for mobile phones
  - the new official Indian currency symbol: the Indian Rupee Sign
  - 222 additional CJK Unified Ideographs in common use in China, Taiwan, and Japan
  - 603 additional characters for African language support, including extensions to the Tifinagh, Ethiopic, and Bamum scripts
  - three additional scripts: Mandaic, Batak, and Brahmi
- adds new properties and data files
  - a data file, EmojiSources.txt, which maps the emoji symbols to their original Japanese telco source sets
  - two provisional properties for support of Indic scripts: IndicMatraCategory and IndicSyllabicCategory
  - provisional script extension data for use in segmentation, regular expressions, and spoof detection
- corrects character properties for existing characters
  - property value updates to 36 non-CJK characters
  - numerous improvements to provisional properties for CJK Unified Ideographs
  - format updates for many normative IRG source tags, to better synchronize with ISO/IEC 10646 (see UAX #38, Unicode Han Database, for details)
- amends the text of the Standard
  - many changes to the core specification, listed in D. Textual Changes and Character Additions
  - small clarifications of the conformance clauses in UAX #9, The Unicode Bidirectional Algorithm, but no significant changes to conformance requirements
  - major editorial revisions of UAX #44, Unicode Character Database, and UAX #15, Unicode Normalization Forms, but no significant changes to conformance requirements
- provides format improvements, including
  - charts for CJK Compatibility Ideographs are now laid out in a multicolumn format showing sources, comparable to the structure of the charts for the CJK Unified Ideographs
Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and have updates for
Version 6.0:
This version of the Unicode Standard is synchronized with the Second Edition of 10646: ISO/IEC 10646:2011. That Second Edition represents the republication of ISO/IEC 10646:2003 plus the rolled-up content additions from Amendments 1 through 8. The repertoire for Unicode Version 6.0 includes all the characters of the Second Edition, plus one additional character, U+20B9 INDIAN RUPEE SIGN, which is still in the process of being added to 10646.
B. Version Information
Version 6.0 of the Unicode Standard consists of the core specification,
the delta and archival code charts for this version, the Unicode Standard Annexes, and
the Unicode Character Database (UCD).
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
Version 6.0.0 of the Unicode Standard should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 6.0.0, (Mountain View, CA: The Unicode Consortium, 2011. ISBN 978-1-936213-01-6)
http://www.unicode.org/versions/Unicode6.0.0/
A complete specification of the contributory files for Unicode 6.0 is found on the page Components for 6.0.0. That page also provides the recommended reference format for Unicode Standard Annexes.
The navigation bar on the left of this page provides links to the core specification as a single file, to its individual chapters, and to the appendices. Also provided are links to the code charts, the radical-stroke indices to CJK ideographs, the Unicode Standard Annexes, and the data files for Version 6.0 of the Unicode Character Database.
Several sets of code charts are available. They serve different
purposes:
- The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.
For Unicode 6.0.0 in particular, two additional sets of code chart pages are provided:
- A set of delta code charts showing only the
new blocks for Unicode 6.0.0 and any existing blocks for
which new characters were added in Unicode 6.0.0. All
new characters are visually highlighted in those charts.
- A set of archival code charts that represent
the entire set of characters, names and representative glyphs at the time of publication of Unicode 6.0.0.
The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.
Errata incorporated into Unicode 6.0 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 6.0, see the list of current
Updates and Errata.
C. Stability Policy Update
In the Unicode 6.0 timeframe, the Property Alias Uniqueness stability policy has been updated to make it clear that uniqueness is defined specifically by the UAX44-LM3 matching rule, rather than by a generic reference to all of the UAX #44 matching rules. The UAX44-LM3 matching rule has also been clarified regarding the status of any property aliases beginning with the sequence of characters "is" (or "Is" or "IS"). This clarification reflects the prevalence of implementations that expose Unicode character properties or property values through APIs prefixed with "Is", for example IsNumeric() for the Unicode Numeric property.
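As a rough illustration of the kind of loose matching that UAX44-LM3 describes, the following Python sketch reduces a property alias to a comparison key by ignoring case, whitespace, underscores, hyphens, and an initial "is" prefix. The names loose_key and same_alias are hypothetical, and the sketch deliberately leaves out any special cases that UAX #44 spells out; the annex remains the normative statement of the rule.

```python
import re


def loose_key(alias: str) -> str:
    """Reduce a property alias to a comparison key (simplified sketch)."""
    # Ignore case, whitespace, underscores, and hyphens.
    key = re.sub(r"[\s_\-]+", "", alias).lower()
    # Ignore an initial "is" prefix, so "IsNumeric" compares equal to "Numeric".
    if key.startswith("is"):
        key = key[2:]
    return key


def same_alias(a: str, b: str) -> bool:
    """True if two aliases are equal under the simplified loose matching."""
    return loose_key(a) == loose_key(b)


print(same_alias("Is_Numeric", "numeric"))
print(same_alias("Line-Break", "linebreak"))
print(same_alias("Script", "Block"))
```

Because uniqueness is judged on the reduced key, two aliases that differ only in an "Is" prefix or in punctuation count as the same alias under this kind of comparison.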
Another
Property Value Stability constraint has been added, to make it clear that all decimal digits (Numeric_Type=Decimal) only occur in contiguous ranges of 10 characters, with ascending numeric values from 0 to 9.
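The constraint can be checked mechanically. The sketch below uses Python's standard unicodedata module, on the assumption that unicodedata.decimal() reports a value exactly for characters with Numeric_Type=Decimal. The module reflects whichever Unicode version is bundled with the interpreter, not necessarily Version 6.0, and the helper names are hypothetical.

```python
import sys
import unicodedata


def decimal_value(cp: int):
    """Return the decimal digit value of a code point, or None if it has none."""
    return unicodedata.decimal(chr(cp), None)


def zero_of(cp: int) -> int:
    """Code point expected to hold digit zero in the range containing cp."""
    return cp - decimal_value(cp)


def range_is_well_formed(zero_cp: int) -> bool:
    """True if zero_cp..zero_cp+9 carry the values 0..9 in ascending order."""
    return all(decimal_value(zero_cp + i) == i for i in range(10))


def malformed_digits():
    """List decimal digits whose enclosing range of ten is not 0..9."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if decimal_value(cp) is not None
        and not range_is_well_formed(zero_of(cp))
    ]


# Two quick spot checks: ASCII digits and Arabic-Indic digits.
print(hex(zero_of(ord("7"))))
print(range_is_well_formed(0x0660))
```

If the constraint holds for the bundled data, malformed_digits() would be expected to return an empty list. An implementation that relies on the policy can compute a digit's value as an offset from the zero of its range.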
Note: The
Unicode Character Encoding Stability Policy restricts possible future changes to the Unicode Standard, but is not formally a part of the standard itself.
D. Textual Changes and Character Additions
2,088 new character assignments were made to the Unicode Standard, Version 6.0.
Character Assignment Overview
230 characters have been added to the BMP, while 1,858 characters
have been added in the supplementary planes. For the first time
in the history of the Unicode Standard, the majority of the
regular encoded characters (graphic and format) are not in
the BMP.
Most character additions are in new blocks, but there are also
character additions to a number of existing blocks.
The following table shows the allocation of code points in Unicode 6.0, by
character type. It highlights the numbers for the BMP and the supplementary
planes separately. For more information on the specific characters newly
assigned in Unicode 6.0, see the file
DerivedAge.txt in the
Unicode Character Database. For more details
regarding character counts, see
Appendix D, Changes from Previous Versions.
Type | BMP | Supplementary | Total |
Graphic | 54,496 | 54,746 | 109,242 |
Format* | 36 | 106 | 142 |
Control | 65 | | 65 |
Private Use | 6,400 | 131,068 | 137,468 |
Surrogate | 2,048 | | 2,048 |
Noncharacter | 34 | 32 | 66 |
Reserved | 2,457 | 862,624 | 865,081 |
* Format characters include U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR.
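The figures in the table can be cross-checked with a little arithmetic. The sketch below copies the counts from the table, treats the blank cells as zero (an assumption), and confirms that each column covers its planes exactly. It also confirms that the graphic and format characters outside the BMP outnumber those inside it.

```python
# Code point counts for Unicode 6.0, copied from the table: (BMP, supplementary).
# Blank table cells are assumed to mean zero.
COUNTS = {
    "Graphic": (54_496, 54_746),
    "Format": (36, 106),
    "Control": (65, 0),
    "Private Use": (6_400, 131_068),
    "Surrogate": (2_048, 0),
    "Noncharacter": (34, 32),
    "Reserved": (2_457, 862_624),
}

PLANE_SIZE = 0x10000        # 65,536 code points per plane
SUPPLEMENTARY_PLANES = 16   # planes 1 through 16

bmp_total = sum(bmp for bmp, _ in COUNTS.values())
supp_total = sum(supp for _, supp in COUNTS.values())

# "Regular" encoded characters are the graphic and format characters.
regular_bmp = COUNTS["Graphic"][0] + COUNTS["Format"][0]
regular_supp = COUNTS["Graphic"][1] + COUNTS["Format"][1]

print(bmp_total == PLANE_SIZE)
print(supp_total == SUPPLEMENTARY_PLANES * PLANE_SIZE)
print(regular_bmp, regular_supp, regular_supp > regular_bmp)
```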
New Blocks
The newly-defined blocks in Version 6.0 are:
0840..085F | Mandaic |
1BC0..1BFF | Batak |
AB00..AB2F | Ethiopic Extended-A |
11000..1107F | Brahmi |
16800..16A3F | Bamum Supplement |
1B000..1B0FF | Kana Supplement |
1F0A0..1F0FF | Playing Cards |
1F300..1F5FF | Miscellaneous Symbols And Pictographs |
1F600..1F64F | Emoticons |
1F680..1F6FF | Transport And Map Symbols |
1F700..1F77F | Alchemical Symbols |
2B740..2B81F | CJK Unified Ideographs Extension D |
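For implementers who need to test whether a code point falls in one of these new blocks, a sorted range table with a binary search is one simple approach. The sketch below hard-codes the ranges listed above; the names NEW_BLOCKS_6_0 and new_block_of are hypothetical, and a production implementation would more likely read the ranges from the Unicode Character Database.

```python
from bisect import bisect_right

# (first code point, last code point, block name), sorted by first code point.
NEW_BLOCKS_6_0 = [
    (0x0840, 0x085F, "Mandaic"),
    (0x1BC0, 0x1BFF, "Batak"),
    (0xAB00, 0xAB2F, "Ethiopic Extended-A"),
    (0x11000, 0x1107F, "Brahmi"),
    (0x16800, 0x16A3F, "Bamum Supplement"),
    (0x1B000, 0x1B0FF, "Kana Supplement"),
    (0x1F0A0, 0x1F0FF, "Playing Cards"),
    (0x1F300, 0x1F5FF, "Miscellaneous Symbols And Pictographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
    (0x1F680, 0x1F6FF, "Transport And Map Symbols"),
    (0x1F700, 0x1F77F, "Alchemical Symbols"),
    (0x2B740, 0x2B81F, "CJK Unified Ideographs Extension D"),
]
_STARTS = [start for start, _, _ in NEW_BLOCKS_6_0]


def new_block_of(cp: int):
    """Return the name of the new 6.0 block containing cp, or None."""
    # Find the last block whose start is <= cp, then check its upper bound.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        _, end, name = NEW_BLOCKS_6_0[i]
        if cp <= end:
            return name
    return None


print(new_block_of(0x1F600))
print(new_block_of(0x20B9))
```

A block range covers reserved code points as well as assigned ones, so membership in a block does not by itself mean that a character is assigned there.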
Text Changes and Additions
Numbers indicate the chapter or section in the Unicode 6.0 core specification where there are some significant changes or additions. This list is not exhaustive. Select changes for Chapter 3, Conformance, are listed separately under E. Conformance Changes. Many figures have been updated or added throughout.
- Preface: Rewrote extensively
- 5.17: Updated shift/rotate in UTF-8/UTF-16 binary order algorithm
- 6.2: Documented dandas
- 7.1: Added new text on Latvian (and Sorbian) letters in Latin Extended-D
- 8.2: Updates to Arabic, including Arabic pedagogical symbols (nuktas) and Kashmiri additions for Arabic
- 9: Various updates to Indic, including additions to tables of vowel letters
- 9.1: Updates to Devanagari, including Kashmiri additions for Devanagari
- 9.5: Added text on Oriya fraction signs
- 9.6: Improvements to Tamil
- 9.9: Added text on new Malayalam characters, including Dot Reph
- 10.2: Various updates to Tibetan
- 11.13: Various updates to Balinese
- 11.14: Various updates to Javanese
- 12.1: Various updates to Han; added new section on CJK Extension D
- 12.4: Added new subsection on Kana Supplement in Hiragana and Katakana
- 12.6: Various updates to Hangul
- 13.1: New text on Ethiopic additions in Ethiopic Extended-A
- 13.4: Added new text on Tifinagh bi-consonants
- 13.7: Added new text for Bamum Supplement
- 15: New text on Emoji affecting the description of the following code ranges:
  - 2300-23FF Miscellaneous Technical
  - 2700-27BF Dingbats
  - 1F0A0-1F0FF Playing Cards
  - 1F100-1F1FF Enclosed Alphanumeric Supplement
  - 1F300-1F5FF Miscellaneous Symbols And Pictographs
  - 1F600-1F64F Emoticons
  - 1F680-1F6FF Transport And Map Symbols
- 15.1: New text on U+20B9 INDIAN RUPEE SIGN
- 15.8: New subsection on Alchemical Symbols
- 16.8: Updates regarding annotation characters and bidi
- 17.2: Updated to note the new presentation format for compatibility ideographs
- Appendices and Back Matter: various updates
- Han Radical-Stroke Index now online only; introductory material moved to Chapter 12
E. Conformance Changes
There are several changes to conformance requirements in Unicode
6.0 that impact implementations. The most important of these are:
- Clarification of the explanatory text of D92 in Section 3.9 regarding maximal subpart and addition of an example table with subrange restrictions.
- Clarification that the star operator used in the regex for Final_Sigma in Table 3-15 is "greedy" and that Cased and Case_Ignorable properties are not disjoint.
- Update of D136 in Section 3.13 to highlight that Case_Ignorable is intended only for use in the specification of Table 3-15 in the Standard.
- Rewrite of the definitions and descriptions for caseless matching.
- Addition of text in Chapter 3 after C4 to clarify the implications of conformance to Unicode semantics, regarding what other parts of the Standard have normative status.
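To make the Final_Sigma and caseless matching items more concrete, the sketch below shows how one implementation behaves. It relies on CPython's str.lower(), which lowercases a capital sigma differently depending on its position in a word, and on str.casefold() for caseless comparison. This is an observation about that implementation, not a claim that it mirrors Table 3-15 in every detail.

```python
CAPITAL_SIGMA = "\u03A3"  # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03C3"    # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2"    # GREEK SMALL LETTER FINAL SIGMA

# A word ending in capital sigma: the last letter lowercases to final sigma.
word = "\u039F\u0394\u039F\u03A3"
lowered = word.lower()
print(lowered, lowered[-1] == FINAL_SIGMA)

# Sigma at the start of a word is not in a final position, so it lowercases
# to the ordinary small sigma; only the trailing one becomes final sigma.
both = (CAPITAL_SIGMA + "\u0391" + CAPITAL_SIGMA).lower()
print(both[0] == SMALL_SIGMA, both[-1] == FINAL_SIGMA)

# Caseless matching: case folding maps final sigma and small sigma to the same
# character, so the lowercased and uppercase forms compare equal.
print(lowered.casefold() == word.casefold())
```

Comparing folded strings rather than lowercased strings avoids a mismatch between the two lowercase sigma forms.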
F. Unicode Character Database Changes
The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 6.0 can be found in UAX #44, Unicode Character Database. The changes listed there include a number of important property revisions to existing characters that will affect implementations:
- A general category change to two Kannada characters (U+0CF1, U+0CF2), which has the effect of making them newly eligible for inclusion in identifiers
- A general category change to one New Tai Lue numeric character (U+19DA), which would have the effect of disqualifying it from inclusion in identifiers unless grandfathering measures are in place for the defining identifier syntax
- Changes to ten characters affecting the determination of script runs
- The formal deprecation of one Arabic character
- Reversal of the default grapheme cluster boundary determination for Thai and Lao to the behavior specified in Unicode 5.0
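The effect of the general category changes on identifiers can be observed in an implementation whose identifier syntax follows Unicode properties. The sketch below uses Python's unicodedata module and str.isidentifier() as one such example. The results reflect the Unicode version bundled with the interpreter and Python's own identifier rules, and the helper identifier_status is hypothetical.

```python
import unicodedata


def identifier_status(cp: int):
    """Return (general category, may start an identifier, may continue one)."""
    ch = chr(cp)
    return (
        unicodedata.category(ch),
        ch.isidentifier(),          # valid as the first character
        ("a" + ch).isidentifier(),  # valid after an ordinary start character
    )


# The two Kannada characters and the New Tai Lue character mentioned above.
for cp in (0x0CF1, 0x0CF2, 0x19DA):
    name = unicodedata.name(chr(cp), "<unnamed>")
    print(f"U+{cp:04X}", name, identifier_status(cp))
```

Whether U+19DA is still accepted in a continuing position depends on whether the identifier syntax in use applies grandfathering, so the third value for that character is worth checking on the target platform rather than assuming.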
Other significant changes include:
- Addition of the EmojiSources.txt data file, detailing source mapping information for the emoji characters
- Addition of the provisional ScriptExtensions.txt data file, providing information about use of certain characters with multiple scripts
- Addition of new provisional properties related to the structure of syllables in Indic scripts
- Deprecation of several derived properties related to Unicode normalization
- Improvement of the LineBreakTest.txt and BidiTest.txt files
G. Unicode Standard Annex Changes
In Version 6.0, many of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.
Unicode Standard Annex | Changes |
UAX #9 Unicode Bidirectional Algorithm | Added informative text on alternative ways of detecting paragraph direction and recommendations for conversion to plain text |
UAX #11 East Asian Width | No significant changes in this version |
UAX #14 Unicode Line Breaking Algorithm | Incorporated fix for Corrigendum #7; revised the description of the SHY character; removed Ideographic Space from the list of spaces that may be compressed or expanded |
UAX #15 Unicode Normalization Forms | Restructured for better document flow; corrected definitions of classes of characters in the Composition Exclusion Table; corrected statement of the algorithm for guaranteeing process stability |
UAX #24 Unicode Script Property | Added discussion of multiple script values; added documentation regarding the new provisional data file ScriptExtensions.txt |
UAX #29 Unicode Text Segmentation | Updated the Default Grapheme Cluster Boundary specification for Thai and Lao, and added an informative note on tailoring aksaras |
UAX #31 Unicode Identifier and Pattern Syntax | Added new scripts to the tables categorizing script usage; clarified text in Section 2.3, Layout and Format Control Characters; restructured tables; updated discussion of case folding |
UAX #34 Unicode Named Character Sequences | Clarified the scope of use for character sequence notation and the format used in the data files |
UAX #38 Unicode Han Database (Unihan) | Added a history section; clarified the status of on and kun Japanese readings; provided URI for interactive access to the contents of the Unihan database; updated the regular expressions and descriptions for various Unihan data fields |
UAX #41 Common References for Unicode Standard Annexes | No significant changes in this version |
UAX #42 Unicode Character Database in XML | Added attributes for new properties and values; updated the patterns for Unihan properties |
UAX #44 Unicode Character Database | Added tables listing Deprecated and Stabilized properties; updated the Matching Rules; added documentation for new properties and data files; many other clarifications regarding particular properties |