Unicode® 9.0.0
Version 9.0.0 has been superseded by the latest version of the Unicode Standard.
This page summarizes the important changes for the Unicode Standard, Version 9.0.0.
This version supersedes all previous versions of the Unicode Standard.
A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration
Unicode 9.0 adds exactly 7,500 characters, for a total of 128,172 characters. These additions include six new scripts and 72 new emoji characters.
The new scripts and characters in Version 9.0 add support for lesser-used languages worldwide, including:
- Osage, a Native American language
- Nepal Bhasa, a language of Nepal
- Fulani and other African languages
- The Bravanese dialect of Swahili, used in Somalia
- The Warsh orthography for Arabic, used in North and West Africa
- Tangut, a major historic script of China
Important symbol additions include:
- 19 symbols for the new 4K TV standard
- 72 emoji characters, such as new smilies and people, animals and nature, and food and drink
For the full list of emoji, see emoji additions for Unicode 9.0. For a detailed description of support for emoji characters by the Unicode Standard, see UTR #51, Unicode Emoji.
Other important updates in Unicode Version 9.0 include:
- Significant updates to segmentation algorithms
- Improvements in the charts for the Mongolian script
Synchronization
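Three other important Unicode specifications have been updated for Version 9.0:
- UTS #10, Unicode Collation Algorithm
- UTS #39, Unicode Security Mechanisms
- UTS #46, Unicode IDNA Compatibility Processing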
Some of the changes in Version 9.0 and associated Unicode technical standards and reports may require modifications in implementations. For more information, see the migration and modification sections of UTS #10, UTS #39, UTS #46, and UTR #51.
This version of the Unicode Standard is synchronized with ISO/IEC 10646:2014, fourth edition, plus Amd. 1 and Amd. 2, and 273 characters from the forthcoming fifth edition of ISO/IEC 10646.
See Sections D through H below for additional details regarding the changes in this version of
the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.
Version 9.0 of the Unicode Standard consists of the core specification,
the delta and archival code charts for this version, the Unicode Standard Annexes, and
the Unicode Character Database (UCD).
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
A complete specification of the contributory files for Unicode
9.0 is found on the page Components for 9.0.0.
That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.
Version 9.0.0 of the Unicode Standard
should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 9.0.0, (Mountain View, CA: The Unicode Consortium,
2016. ISBN 978-1-936213-13-9)
http://www.unicode.org/versions/Unicode9.0.0/
The terms “Version 9.0” or “Unicode 9.0” are abbreviations for the full version reference, Version 9.0.0.
The citation and permalink for the latest published version of the Unicode Standard are:
The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/
Several sets of code charts are available. They serve different
purposes:
- The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.
For Unicode 9.0.0 in particular, two additional sets of code chart pages are provided:
- A set of delta code charts showing the
new blocks and any blocks in which characters were added for Unicode 9.0.0. The new characters are visually highlighted in the charts.
- A set of archival code charts that represents
the entire set of characters, names and representative glyphs at the time of publication of Unicode 9.0.0.
The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.
Errata incorporated into Unicode 9.0 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 9.0, see the list of current
Updates and Errata.
There were no significant changes to the Stability Policy of the core specification between Unicode 8.0 and Unicode 9.0.
Six new scripts were added with accompanying new block descriptions:
Script | Number of Characters
Adlam | 87
Bhaiksuki | 97
Marchen | 68
Newa | 92
Osage | 72
Tangut | 6881
This version adds 72 additional emoji and 19 television symbols.
Changes in the Unicode Standard Annexes are listed in Section G.
Character Assignment Overview
7,500 characters have been added.
Most character additions are in new blocks, but there are also character additions to a number of existing blocks. For details, see
Delta Code Charts.
There were no significant changes to the conformance clauses of the core specification for Unicode 9.0.
The detailed listing of all changes to the contributory data files of the Unicode Character Database
for Version 9.0 can be found in
UAX #44, Unicode Character Database.
The changes listed there include character additions and property revisions to existing characters that will affect implementations.
Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in
Section M.
In Version 9.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section
of each UAX, linked directly from the following list of UAXes.
Unicode Standard Annex | Changes
UAX #9 Unicode Bidirectional Algorithm | Minor updates have been made to the description of the equivalence between explicit formatting characters and HTML5 and CSS markup.
UAX #11 East Asian Width | Important changes to this annex include acknowledging the presence of emoji characters in legacy East Asian standards and adding a new recommendation that treats emoji style standardized variation sequences as if they are East Asian Wide, regardless of their assigned East_Asian_Width property value. In addition, in EastAsianWidth.txt, 799 characters that default to emoji presentation style have been changed to also be East_Asian_Width=Wide.
UAX #14 Unicode Line Breaking Algorithm | This annex introduces new property values and new algorithm rules. These changes ensure that the various types of character sequences that represent emoji are handled as indivisible units in line breaking. The rules for numeric prefixes and postfixes were refined to prevent line breaking within currency symbols such as CA$ or JP¥, or within the stylized names of some artists such as "Travi$ Scott", "Ke$ha" or "Curren$y".
UAX #15 Unicode Normalization Forms | No significant changes in this version.
UAX #24 Unicode Script Property | This annex has been extensively rewritten, with a new introduction, more discussion of the values for Script_Extensions, and a reorganization of the text for better flow.
UAX #29 Unicode Text Segmentation | This annex introduces new property values and new algorithm rules. These address sequences that represent emoji, to ensure they are handled as indivisible units in the formation of grapheme clusters and word segments. It redefines the formerly empty Prepend class in Table 2, Grapheme_Cluster_Break Property Values, and it adds U+202F NARROW NO-BREAK SPACE (NNBSP) to ExtendNumLet in Table 3, Word_Break Property Values.
UAX #31 Unicode Identifier and Pattern Syntax | Table 6, Aspirational Use Scripts and Table 7, Limited Use Scripts have been updated by adding all six new scripts in Unicode 9.0.0. A recommended syntax for Unicode hashtags, including emoji, has been added. Furthermore, the text has been rewritten to emphasize the preference for the XID_Start/XID_Continue properties over ID_Start/ID_Continue.
UAX #34 Unicode Named Character Sequences | No significant changes in this version.
UAX #38 Unicode Han Database (Unihan) | This annex now clarifies the use of the two fields for Korean pronunciation and the relationship between them. The syntax for radical-stroke data was extended to allow for negative "additional" strokes.
UAX #41 Common References for Unicode Standard Annexes | The references in this annex have been updated for 9.0 and to account for the withdrawal of UTR #20.
UAX #42 Unicode Character Database in XML | Added new code point attributes, values, and patterns.
UAX #44 Unicode Character Database | This annex has been updated to reflect the obsolescence of the file StandardizedVariants.html and the addition of a new property, Prepended_Concatenation_Mark. Recommendations have been added regarding which normalization-related properties are appropriate for exposure in public APIs.
UAX #45 U-Source Ideographs | Additional values were added for the status field, and the status of various characters was updated. New entries for over 1,000 characters were added.
There are also significant revisions in the Unicode Technical Standards whose
versions are synchronized with the Unicode Standard. The most important of these changes are listed below.
For the full details of all changes, see the Modifications section
of each UTS, linked directly from the following list of UTSes.
Unicode Technical Standard | Changes
UTS #10 Unicode Collation Algorithm | UTS #10 modifies the implicit primary weight algorithm and the syntax of allkeys.txt to provide for the new siniform ideographic script, Tangut.
UTS #46 Unicode IDNA Compatibility Processing | UTS #46 fixes the missing xn-- prefix in Processing Step 3.
UTS #39, Unicode Security Mechanisms, has also been updated for Version 9.0. It has a new section, Email Security Profiles for Identifiers, as well as additional text on the use of Script_Extensions. The data file confusablesWholeScript.txt has been withdrawn.
A significant number of changes in Unicode 9.0 may affect implementations that are upgrading to Version 9.0 from earlier versions of the standard. The most important of these are listed and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.
Script-related Changes
Version 9.0 adds six new scripts, so implementations which process script data should be carefully checked. All of these additions are on Plane 1. Some of these scripts have particular attributes which may cause issues for implementations.
Two of the newly encoded scripts, Osage and Adlam, are bicameral. This means that support will require addition of case mapping and case folding tables for them. In addition, Adlam is a right-to-left script with cursive joining, so it requires bidirectional support and has rendering issues similar to those of the Arabic script.
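One quick way to see whether a runtime already carries the case and bidirectional data for the two bicameral scripts is to query it directly. The following is a minimal sketch, assuming a Python build whose Unicode data is at Version 9.0 or later; the helper name `casing_report` and the choice of sample letters (the first capital letter of each script) are illustrative only.

```python
import unicodedata

# First capital letter of each new bicameral script.
# These sample code points are chosen for illustration.
SAMPLES = {"Osage": "\U000104B0", "Adlam": "\U0001E900"}


def casing_report(ch):
    """Return (lowercase form, case-folded form, Bidi_Class) for one character."""
    return ch.lower(), ch.casefold(), unicodedata.bidirectional(ch)


for script, capital in SAMPLES.items():
    lower, folded, bidi = casing_report(capital)
    print(
        f"{script}: U+{ord(capital):04X} lower=U+{ord(lower):04X} "
        f"fold=U+{ord(folded):04X} bidi={bidi}"
    )
```

If the lowercase form comes back identical to the capital letter, the runtime's case tables most likely predate Unicode 9.0.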
Tangut is a very large siniform ideographic script. It is the first siniform ideographic script encoded after the Han (CJK) script. Its implementation requires technology support similar to that used for CJK, including very large fonts and radical/stroke input methods. Special adjustments have also been made to the Unicode Collation Algorithm to account for the introduction of another large ideographic repertoire.
The repertoire of Tangut characters consists of ideographs, components, and one iteration mark. In UnicodeData.txt, the range of Tangut ideographs U+17000..U+187EC uses the same syntax as that for other large ranges of characters with algorithmically derived names, with the identifiers <Tangut Ideograph, First> and <Tangut Ideograph, Last>. The derived character names for Tangut ideographs are TANGUT IDEOGRAPH-17000 through TANGUT IDEOGRAPH-187EC. Parsers of UnicodeData.txt may need to be updated to handle this new range.
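As a rough illustration of what such a parser update might look like, the sketch below collects First/Last range pairs and derives a Tangut ideograph name from the code point. The two sample lines are embedded for the example; only the code point and the range identifier are used, and the function names `parse_ranges` and `derived_name` are hypothetical.

```python
# Two range-marker lines in UnicodeData.txt layout (semicolon-separated fields).
SAMPLE = """\
17000;<Tangut Ideograph, First>;Lo;0;L;;;;;N;;;;;
187EC;<Tangut Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""

FIRST_SUFFIX = ", First>"
LAST_SUFFIX = ", Last>"


def parse_ranges(text):
    """Collect {label: (first, last)} from <label, First> / <label, Last> pairs."""
    ranges, pending = {}, {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 2:
            continue
        cp, name = int(fields[0], 16), fields[1]
        if name.startswith("<") and name.endswith(FIRST_SUFFIX):
            pending[name[1:-len(FIRST_SUFFIX)]] = cp
        elif name.startswith("<") and name.endswith(LAST_SUFFIX):
            label = name[1:-len(LAST_SUFFIX)]
            ranges[label] = (pending.pop(label), cp)
    return ranges


def derived_name(cp, ranges):
    """Return the derived name for a Tangut ideograph, or None outside the range."""
    first, last = ranges.get("Tangut Ideograph", (1, 0))
    if first <= cp <= last:
        return f"TANGUT IDEOGRAPH-{cp:04X}"
    return None


ranges = parse_ranges(SAMPLE)
print(ranges)
print(derived_name(0x17000, ranges))
```

A parser that keys on a fixed list of known range labels would silently skip this pair, which is why matching on the First/Last suffix is used here.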
The Script_Extensions property values of more than 200 ideographic symbols, which formerly contained multiple Script values such as Bopomofo, Hangul, Hiragana, Katakana, as well as Han, were reduced to single-script set values, Script_Extensions={Hani}.
Casing-related Issues
A set of nine historic Cyrillic letter forms (U+1C80..U+1C88) used in Old Church Slavonic were added. These letters are lowercase and have asymmetric case mappings to existing uppercase letters, similar to the asymmetric case mapping of Greek final sigma to capital sigma. Case folding for these nine Cyrillic letters needs to be implemented with care.
An uppercase Latin letter was added, U+A7AE LATIN CAPITAL LETTER SMALL CAPITAL I, forming a case pair with an existing lowercase letter, U+026A LATIN LETTER SMALL CAPITAL I, for which a different uppercase counterpart had been recommended, but not formally mapped, prior to Unicode 9.0.
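The asymmetry of the new Cyrillic mappings can be observed with a simple round-trip check. This is a sketch, assuming a Python runtime with Unicode 9.0 or later data; `round_trips` is a hypothetical helper. It also shows that the new Latin pair, by contrast, maps symmetrically.

```python
# The nine lowercase Cyrillic letters in the range given above.
NEW_CYRILLIC = [chr(cp) for cp in range(0x1C80, 0x1C88 + 1)]


def round_trips(ch):
    """True if lowercasing the uppercase form gives back the original character."""
    return ch.upper().lower() == ch


for ch in NEW_CYRILLIC:
    upper = ch.upper()
    print(
        f"U+{ord(ch):04X} upper=U+{ord(upper):04X} "
        f"round_trips={round_trips(ch)} "
        f"same_fold={ch.casefold() == upper.casefold()}"
    )

# The Latin pair U+026A / U+A7AE is an ordinary symmetric case pair.
print(round_trips("\u026A"))
```

Comparing case-folded forms rather than round-tripped forms is what keeps caseless matching stable for these letters.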
Numeric-related Issues
Some of the newly encoded Malayalam fractions have numeric values which were not part of the UCD data in versions prior to Unicode 9.0. Implementations that process numeric values should be prepared to handle new fractional values, such as 1/20 or 1/40.
The newly added script Bhaiksuki has both script-specific decimal digits and non-decimal unit numerals.
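For the fractional values mentioned above, code that stores numeric values as floats or in a fixed lookup table may lose precision or reject unfamiliar values. A minimal sketch of reading the numeric field of a UnicodeData.txt line as an exact rational follows; the sample line is embedded for illustration and `numeric_value` is a hypothetical helper.

```python
from fractions import Fraction

# One line in UnicodeData.txt layout; field index 8 holds the numeric value.
SAMPLE_LINE = "0D59;MALAYALAM FRACTION ONE FORTIETH;No;0;L;;;;1/40;N;;;;;"


def numeric_value(line):
    """Return the numeric field as an exact Fraction, or None if it is empty."""
    field = line.split(";")[8]
    return Fraction(field) if field else None


value = numeric_value(SAMPLE_LINE)
print(value, float(value))
```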
Segmentation-related Changes
Updates for line breaking include the addition of Emoji Base and Emoji Modifier character classes and associated rule changes to address their behavior. Regional indicator characters are now handled as pairs, to prevent inappropriate breaking within pairs intended to display as emoji flag glyphs. Line breaks are also now prevented between numeric prefixes or postfixes and the letters (or numbers) next to them.
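The pairing behavior for regional indicators can be sketched as follows. This is not an implementation of the line breaking algorithm: it only groups each run of regional indicator characters (assumed here to be U+1F1E6..U+1F1FF) into pairs, counting from the start of the run, and treats every other character as its own chunk. The function names are hypothetical.

```python
def is_regional_indicator(ch):
    """True for the regional indicator symbols, assumed to be U+1F1E6..U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def split_ri_pairs(text):
    """Chunk text so that runs of regional indicators are grouped two by two."""
    chunks, i = [], 0
    while i < len(text):
        if (
            is_regional_indicator(text[i])
            and i + 1 < len(text)
            and is_regional_indicator(text[i + 1])
        ):
            chunks.append(text[i:i + 2])  # keep the pair together
            i += 2
        else:
            chunks.append(text[i])
            i += 1
    return chunks


# Two flags written back to back: a break is only offered between the pairs.
flags = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"
print(len(split_ri_pairs(flags)))
```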
Updates for grapheme breaking were made to handle characters of the class Extend in emoji modifier sequences, and the rules for regional indicator characters were renumbered and moved to group together the rules which impact segmentation for emoji.
Word boundaries were updated with new rules to prevent inappropriate breaks internal to emoji sequences. The sentence boundary specification now includes the addition of ZWJ to Extend in Table 4, Sentence Break Property Values and updates to the derivation of STerm.
For forward compatibility, all of the unassigned code points in the range U+1F000..U+1FFFD, whether inside or outside of allocated blocks, were given the default Line_Break property value Ideographic. These default values allow better interoperability between different versions of applications that support emoji.
The Line_Break property values of the halfwidth Katakana and Hangul jamo variants in the Halfwidth and Fullwidth Forms block changed from Alphabetic to Ideographic, to match the established line breaking behavior of those characters in existing implementations.
Given its use in emoji zwj sequences, U+200D ZERO WIDTH JOINER was factored out into its own class in both the line breaking and text segmentation algorithms, and the affected rules were updated accordingly. In line breaking, in particular, ZWJ behaves as a joiner in emoji zwj sequences, but in other respects it behaves as a regular combining mark. Because regular combining marks are handled transparently, implementations need to process ZWJ carefully.
CJK/Unihan Changes
The syntax of the kRS* fields, such as kRSUnicode and kRSKangXi, has been extended to allow for negative values of residual stroke counts. A negative value indicates that strokes which would normally constitute the indexing radical are intentionally missing. The kRSUnicode and kRSKangXi fields of a few CJK ideographs have been updated accordingly. Implementers should be prepared to handle negative values for residual stroke counts. In sorting, negative values should be replaced with zero to prevent characters with such values from sorting before the characters that represent the radical itself.
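A parser for radical-stroke values might accommodate the extended syntax along these lines. The regular expression is an assumption about the value shape (radical number, optional apostrophe, a dot, and a possibly negative residual stroke count), the sample values in the usage lines are made up for illustration, and `parse_rs` and `sort_key` are hypothetical names. The sort key clamps negative counts to zero, as described above.

```python
import re

# Assumed shape of a single radical-stroke value, e.g. "120.3" or "9.-1".
RS_VALUE = re.compile(r"^([1-9][0-9]{0,2})('?)\.(-?[0-9]{1,2})$")


def parse_rs(value):
    """Return (radical, has_apostrophe, residual_strokes) for one value."""
    m = RS_VALUE.match(value)
    if m is None:
        raise ValueError(f"unrecognized radical-stroke value: {value!r}")
    return int(m.group(1)), bool(m.group(2)), int(m.group(3))


def sort_key(value):
    """Sort by radical, then by residual strokes with negatives treated as zero."""
    radical, _, strokes = parse_rs(value)
    return radical, max(strokes, 0)


# Illustrative values only, not taken from the Unihan data files.
print(parse_rs("9.-1"))
print(sorted(["9.2", "9.-1", "9.0"], key=sort_key))
```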
Many kMandarin readings have been updated. Implementations which depend on the kMandarin readings, such as phonetic sortings of Chinese data, need to be checked against these changes.
Standardized Variation Sequences
The constraints on standardized variation sequences have been relaxed slightly, to allow a spacing combining mark (General_Category = Spacing_Mark) as the initial character of a variation sequence. Nonspacing combining marks and canonical decomposable characters are still disallowed in variation sequences. Implementations should be checked for any assumptions regarding the allowed General_Category property values for the initial characters in variation sequences.
A full set of dotted forms of Myanmar letters for Khamti, Aiton, and Phake were added as standardized variation sequences, to distinguish them from the Burmese and Shan styles. One of these new standardized variation sequences has a spacing combining mark as the initial character of the sequence: <U+1031, U+FE00>.
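An implementation that validates the first character of a variation sequence could express the relaxed constraint roughly as below. This sketch checks only the two exclusions named above (nonspacing marks and canonically decomposable characters) and assumes a Python runtime with Unicode 9.0 or later data; `may_start_variation_sequence` is a hypothetical name.

```python
import unicodedata


def may_start_variation_sequence(ch):
    """Apply only the two exclusions described above to a candidate first character."""
    if unicodedata.category(ch) == "Mn":  # nonspacing combining mark
        return False
    decomp = unicodedata.decomposition(ch)
    # Compatibility decompositions carry a <tag> prefix; canonical ones do not.
    if decomp and not decomp.startswith("<"):
        return False
    return True


# U+1031 is a spacing combining mark, so it now passes the check.
print(unicodedata.category("\u1031"), may_start_variation_sequence("\u1031"))
```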
A set of 278 variation sequences were added to complete the set of text and emoji presentations for all pictographic symbols identified as having a default text presentation. See UTR #51, Unicode Emoji.
New Properties
A new informative binary property was defined: Prepended_Concatenation_Mark, abbreviated as PCM. This property identifies the set of format characters loosely referred to as subtending marks, such as U+0600 ARABIC NUMBER SIGN, U+06DD ARABIC END OF AYAH, or U+070F SYRIAC ABBREVIATION MARK. These marks are used in sequences with other characters, and in rendering they extend under, around, or over the characters they span. These characters are not new to the standard, but the PCM property has been added to simplify the statement of specifications and implementations that refer to them.
The new property is used in the derived formation of extended grapheme clusters in text segmentation.
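To illustrate how a segmenter might consult the property, the sketch below suppresses a cluster boundary immediately after a prepended concatenation mark. The set contains only the three example characters named above, not the full property (which should be read from the UCD), and the function ignores every other segmentation rule, including those that take precedence for controls and line terminators. The names are hypothetical.

```python
import unicodedata

# Partial set for illustration: only the three examples named above.
PCM_EXAMPLES = {0x0600, 0x06DD, 0x070F}


def is_known_pcm(ch):
    """True if ch is one of the example prepended concatenation marks."""
    return ord(ch) in PCM_EXAMPLES


def no_break_after(text):
    """Indices i where a boundary between text[i] and text[i + 1] is suppressed."""
    return [i for i in range(len(text) - 1) if is_known_pcm(text[i])]


# ARABIC NUMBER SIGN followed by two digits: the mark stays with the first digit.
print(no_break_after("\u0600" + "12"))
print({f"U+{cp:04X}": unicodedata.category(chr(cp)) for cp in sorted(PCM_EXAMPLES)})
```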
Code Charts
There have been significant changes to the code charts for Mongolian since the publication of Unicode 8.0. In addition to corrections for omitted glyphs, the charts have been updated to display more as they did in Unicode 7.0, with a summary of all Mongolian standardized variation sequences displayed at the end of the Mongolian block. The names list section now also shows contextual variant glyphs. These appear for each character that also has one or more standardized variation sequences associated with it.
The code charts for Version 9.0 incorporate new fonts for Canadian Aboriginal Syllabics, Cherokee and Egyptian Hieroglyphs. The glyph for the LARI SIGN has been updated, and a few other glyphs have been updated with corrections.