Unicode 9.0.0

Version 9.0.0 has been superseded by the latest version of the Unicode Standard.

This page summarizes the important changes for the Unicode Standard, Version 9.0.0. This version supersedes all previous versions of the Unicode Standard.

A. Summary

Unicode 9.0 adds exactly 7,500 characters, for a total of 128,172 characters. These additions include six new scripts and 72 new emoji characters.

The new scripts and characters in Version 9.0 add support for lesser-used languages worldwide, including:

Synchronization

Some of the changes in Version 9.0 and associated Unicode technical standards and reports may require modifications in implementations. For more information, see the migration and modification sections of UTS #10, UTS #39, UTS #46, and UTR #51.

This version of the Unicode Standard is synchronized with ISO/IEC 10646:2014, fourth edition, plus Amd. 1 and Amd. 2, and 273 characters from the forthcoming fifth edition of ISO/IEC 10646.

See Sections D through H below for additional details regarding the changes in this version of the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.

B. Technical Overview

Version 9.0 of the Unicode Standard consists of the core specification, the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD).

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

A complete specification of the contributory files for Unicode 9.0 is found on the page Components for 9.0.0. That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.

Version Specification

The terms “Version 9.0” or “Unicode 9.0” are abbreviations for the full version reference, Version 9.0.0.

The citation and permalink for the latest published version of the Unicode Standard is:

Code Charts

For Unicode 9.0.0 in particular two additional sets of code chart pages are provided:

The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.

Errata

Errata incorporated into Unicode 9.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 9.0, see the list of current Updates and Errata.

C. Stability Policy Update

There were no significant changes to the Stability Policy of the core specification between Unicode 8.0 and Unicode 9.0.

D. Textual Changes and Character Additions

Character Assignment Overview

7,500 characters have been added. Most character additions are in new blocks, but there are also character additions to a number of existing blocks. For details, see Delta Code Charts.

E. Conformance Changes

There were no significant changes to the conformance clauses of the core specification for Unicode 9.0.

F. Changes in the Unicode Character Database

The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 9.0 can be found in UAX #44, Unicode Character Database. The changes listed there include character additions and property revisions to existing characters that will affect implementations. Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in Section M.

G. Changes in the Unicode Standard Annexes

In Version 9.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.

Unicode Standard Annex	Changes
UAX #9 Unicode Bidirectional Algorithm	Minor updates have been made to the description of the equivalence between explicit formatting characters and HTML5 and CSS markup.
UAX #11 East Asian Width	Important changes to this annex include acknowledging the presence of emoji characters in legacy East Asian standards and adding a new recommendation that treats emoji style standardized variation sequences as if they are East Asian Wide, regardless of their assigned East_Asian_Width property value. In addition, in EastAsianWidth.txt, 799 characters that default to emoji presentation style have been changed to also be East_Asian_Width=Wide.
UAX #14 Unicode Line Breaking Algorithm	This annex introduces new property values and new algorithm rules. These changes ensure that the various types of character sequences that represent emoji are handled as indivisible units in line breaking. The rules for numeric prefixes and postfixes were refined to prevent line breaking within currency symbols such as CA$ or JP¥, or within the stylized names of some artists such as "Travi$ Scott", "Ke$ha" or "Curren$y".
UAX #15 Unicode Normalization Forms	No significant changes in this version.
UAX #24 Unicode Script Property	This annex has been extensively rewritten, with a new introduction, more discussion of the values for Script_Extensions, and with a reorganization of the text for better flow.
UAX #29 Unicode Text Segmentation	This annex introduces new property values and new algorithm rules. These address sequences that represent emoji, to ensure they are handled as indivisible units in the formation of grapheme clusters and word segments. It redefines the formerly empty Prepend class in Table 2, Grapheme_Cluster_Break Property Values and it adds U+202F NARROW NO-BREAK SPACE (NNBSP) to ExtendNumLet in Table 3, Word_Break Property Values.
UAX #31 Unicode Identifier and Pattern Syntax	Table 6, Aspirational Use Scripts and Table 7, Limited Use Scripts have been updated by adding all six new scripts in Unicode 9.0.0. A recommended syntax for Unicode hashtags, including emoji, has been added. Furthermore, the text has been rewritten to emphasize the preference for the XID_Start/XID_Continue properties over ID_Start/ID_Continue.
UAX #34 Unicode Named Character Sequences	No significant changes in this version.
UAX #38 Unicode Han Database (Unihan)	This annex now clarifies the use of the two fields for Korean pronunciation and the relationship between them. The syntax for radical-stroke data was extended to allow for negative "additional" strokes.
UAX #41 Common References for Unicode Standard Annexes	The references in this annex have been updated for 9.0 and to account for the withdrawal of UTR #20.
UAX #42 Unicode Character Database in XML	Added new code point attributes, values, and patterns.
UAX #44 Unicode Character Database	This annex has been updated to reflect the obsolescence of the file StandardizedVariants.html and the addition of a new property, Prepended_Concatenation_Mark. Recommendations have been added regarding which normalization-related properties are appropriate for exposure in public APIs.
UAX #45 U-Source Ideographs	Additional values were added for the status field, and the status of various characters was updated. New entries for over 1,000 characters were added.
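
The UAX #11 recommendation in the table above, to treat emoji style variation sequences as East Asian Wide regardless of the base character's East_Asian_Width value, can be illustrated with a minimal sketch. The function name `display_width` and the two-column/one-column model are assumptions made for illustration; the sketch ignores combining marks, control characters, and ZWJ sequences.

```python
import unicodedata

VS15 = "\uFE0E"  # VARIATION SELECTOR-15 (text presentation)
VS16 = "\uFE0F"  # VARIATION SELECTOR-16 (emoji presentation)


def display_width(text: str) -> int:
    """Estimate the column width of text (hypothetical helper, simplified)."""
    total = 0
    last = 0  # columns contributed by the most recent base character
    for ch in text:
        if ch == VS16:
            # Emoji style sequence: widen a narrow base to two columns.
            if last == 1:
                total += 1
                last = 2
        elif ch == VS15:
            # Text style selector: occupies no column of its own.
            pass
        else:
            wide = unicodedata.east_asian_width(ch) in ("W", "F")
            last = 2 if wide else 1
            total += last
    return total
```

With this model, a base character that is not East Asian Wide on its own, such as U+2764, is counted as two columns when followed by U+FE0F.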

H. Changes in Synchronized Unicode Technical Standards

There are also significant revisions in the Unicode Technical Standards whose versions are synchronized with the Unicode Standard. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UTS, linked directly from the following list of UTSes.

UTS #39, Unicode Security Mechanisms, has also been updated for Version 9.0. It has a new section, Email Security Profiles for Identifiers, as well as additional text on the use of Script_Extensions. The data file confusablesWholeScript.txt has been withdrawn.

M. Implications for Migration

Unicode 9.0 includes a significant number of changes that may affect implementations upgrading to Version 9.0 from earlier versions of the standard. The most important of these are listed and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.

Script-related Changes

Version 9.0 adds six new scripts, so implementations which process script data should be carefully checked. All of these additions are on Plane 1. Some of these scripts have particular attributes which may cause issues for implementations.

Two of the newly encoded scripts, Osage and Adlam, are bicameral. This means that support will require addition of case mapping and case folding tables for them. In addition, Adlam is a right-to-left script with cursive joining, so it requires bidirectional support and has rendering issues similar to those of the Arabic script.

Tangut is a very large siniform ideographic script. It is the first siniform ideographic script encoded after the Han (CJK) script. Its implementation requires technology support similar to that used for CJK, including very large fonts and radical/stroke input methods. Special adjustments have also been made to the Unicode Collation Algorithm to account for the introduction of another large ideographic repertoire.

The repertoire of Tangut characters consists of ideographs, components, and one iteration mark. In UnicodeData.txt, the range of Tangut ideographs U+17000..U+187EC uses the same syntax as that for other large ranges of characters with algorithmically derived names, with the identifiers <Tangut Ideograph, First> and <Tangut Ideograph, Last>. The derived character names for Tangut ideographs are TANGUT IDEOGRAPH-17000 through TANGUT IDEOGRAPH-187EC. Parsers of UnicodeData.txt may need to be updated to handle this new range.
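
As an illustration of the range syntax described above, the following sketch collects First/Last pairs from UnicodeData.txt-style lines and derives Tangut ideograph names. The function names `parse_ranges` and `derived_name` are hypothetical, and the two sample lines only mimic the field layout of UnicodeData.txt; a real parser would read the published file.

```python
# Two lines in the UnicodeData.txt field layout (semicolon-separated).
SAMPLE = [
    "17000;<Tangut Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "187EC;<Tangut Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]


def parse_ranges(lines):
    """Map each range label to its (first, last) code points."""
    ranges = {}
    start = None
    for line in lines:
        fields = line.split(";")
        code, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            start = code
        elif name.endswith(", Last>") and start is not None:
            label = name[1:-len(", Last>")]  # strip "<" and ", Last>"
            ranges[label] = (start, code)
            start = None
    return ranges


def derived_name(cp, ranges):
    """Return the derived name for a Tangut ideograph, or None otherwise."""
    first, last = ranges.get("Tangut Ideograph", (1, 0))
    if first <= cp <= last:
        return "TANGUT IDEOGRAPH-%04X" % cp
    return None


ranges = parse_ranges(SAMPLE)
```

Reading the range bounds from the data, rather than hard-coding them, keeps such a parser usable if the range is extended in a later version.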

The Script_Extensions property values of more than 200 ideographic symbols, which formerly contained multiple Script values such as Bopomofo, Hangul, Hiragana, Katakana, as well as Han, were reduced to single-script set values, Script_Extensions={Hani}.

Casing-related Issues

A set of nine historic Cyrillic letter forms (U+1C80..U+1C88) used in Old Church Slavonic were added. These letters are lowercase and have asymmetric case mappings to existing uppercase letters, similar to the asymmetric case mapping of Greek final sigma to capital sigma. Case folding for these nine Cyrillic letters needs to be implemented with care.
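
The asymmetry can be checked with a short sketch. It assumes a Python runtime whose built-in character data is at Unicode 9.0 or later (reported by `unicodedata.unidata_version`); the helper names are hypothetical. The point illustrated is that uppercasing and then lowercasing one of these letters does not return the original letter, so caseless comparison should rely on case folding rather than on round-tripping.

```python
import unicodedata


def unicode_version():
    """Version of the character data used by this runtime, as a tuple."""
    return tuple(int(p) for p in unicodedata.unidata_version.split("."))


def round_trips_via_upper(ch: str) -> bool:
    """True if lowercasing the uppercase form gives back the same string."""
    return ch.upper().lower() == ch


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using case folding."""
    return a.casefold() == b.casefold()


rounded_ve = "\u1C80"  # CYRILLIC SMALL LETTER ROUNDED VE
small_ve = "\u0432"    # CYRILLIC SMALL LETTER VE
capital_ve = "\u0412"  # CYRILLIC CAPITAL LETTER VE
```

On such a runtime, `round_trips_via_upper(small_ve)` holds while `round_trips_via_upper(rounded_ve)` does not, and `caseless_equal(rounded_ve, capital_ve)` holds.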

An uppercase Latin letter was added, U+A7AE LATIN CAPITAL LETTER SMALL CAPITAL I, forming a case pair with an existing lowercase letter, U+026A LATIN LETTER SMALL CAPITAL I, for which a different uppercase counterpart had been recommended, but not formally mapped, prior to Unicode 9.0.

Numeric-related Issues

Some of the newly encoded Malayalam fractions have numeric values which were not part of the UCD data in versions prior to Unicode 9.0. Implementations that process numeric values should be prepared to handle new fractional values, such as 1/20 or 1/40.
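
One way to prepare for new fractional values is to parse the numeric field as an exact rational rather than matching against a fixed list of known values. The following is a minimal sketch; the function name `parse_numeric_value` is hypothetical, and the input is assumed to be the text of a numeric field such as `7` or `1/40`, with an empty string meaning that no value is assigned.

```python
from fractions import Fraction


def parse_numeric_value(field: str):
    """Parse a numeric field such as '7', '1/40' or '-1/2'.

    Returns None for an empty field (no numeric value).
    """
    if not field:
        return None
    return Fraction(field)
```

Using `Fraction` keeps values such as 1/20 and 1/40 exact, which avoids rounding surprises when values are compared or summed.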

The newly added script Bhaiksuki has both script-specific decimal digits and non-decimal unit numerals.

Segmentation-related Changes

Updates for line breaking include the addition of Emoji Base and Emoji Modifier character classes and associated rule changes to address their behavior. Regional indicator characters are now handled as pairs, to prevent inappropriate breaking within pairs intended to display as emoji flag glyphs. Line breaks are also now prevented between numeric prefixes or postfixes and the letters (or numbers) next to them.
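
The pairing of regional indicators can be sketched as follows. This is a simplified illustration of the pairing idea only, not the full line breaking algorithm; the function names are hypothetical. A break between two adjacent regional indicators is permitted only when an even number of regional indicators precedes the break position.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z


def is_regional_indicator(ch: str) -> bool:
    return RI_FIRST <= ord(ch) <= RI_LAST


def ri_break_allowed(text: str, i: int) -> bool:
    """True if breaking before text[i] does not split a regional indicator pair.

    Only the pairing rule is modeled; all other rules are ignored.
    """
    if not 0 < i < len(text):
        raise ValueError("i must be an interior position")
    if not (is_regional_indicator(text[i - 1]) and is_regional_indicator(text[i])):
        return True
    # Count the run of regional indicators ending just before position i.
    count = 0
    j = i - 1
    while j >= 0 and is_regional_indicator(text[j]):
        count += 1
        j -= 1
    return count % 2 == 0


flags = "\U0001F1FA\U0001F1F8\U0001F1EB\U0001F1F7"  # two pairs: U+S, F+R
```

For the four-character string above, only the position between the two pairs is a permitted break.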

Updates for grapheme breaking were made to handle characters of the class Extend in emoji modifier sequences, and the rules for regional indicator characters were renumbered and moved to group together the rules which impact segmentation for emoji.

Word boundaries were updated with new rules to prevent inappropriate breaks internal to emoji sequences. The sentence boundary specification now includes the addition of ZWJ to Extend in Table 4, Sentence Break Property Values and updates to the derivation of STerm.

For forward compatibility, all of the unassigned code points in the range U+1F000..U+1FFFD, whether inside or outside of allocated blocks, were given the default Line_Break property value Ideographic. These default values allow better interoperability between different versions of applications that support emoji.
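
A lookup that honors this default might look like the following sketch. The table `assigned` is a hypothetical mapping from code point to Line_Break value, standing in for data loaded from LineBreak.txt, and the other default ranges defined by the standard are deliberately omitted.

```python
def line_break_class(cp: int, assigned: dict) -> str:
    """Return a Line_Break value, applying the U+1F000..U+1FFFD default.

    `assigned` is a hypothetical table of explicitly listed code points.
    Other default ranges from the standard are omitted in this sketch.
    """
    if cp in assigned:
        return assigned[cp]
    if 0x1F000 <= cp <= 0x1FFFD:
        return "ID"  # Ideographic, for unassigned code points in this range
    return "XX"      # Unknown
```

An application built against older data then treats a later-assigned pictographic character in this range as Ideographic instead of Unknown.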

The Line_Break property values of the halfwidth Katakana and Hangul jamo variants in the Halfwidth and Fullwidth Forms block changed from Alphabetic to Ideographic, to match the established line breaking behavior of those characters in existing implementations.

Given its use in emoji zwj sequences, U+200D ZERO WIDTH JOINER was factored out into its own class in both the line breaking and text segmentation algorithms, and the affected rules were updated accordingly. In line breaking, in particular, ZWJ behaves as a joiner in emoji zwj sequences, but in other respects it behaves as a regular combining mark. Because regular combining marks are handled transparently, implementations need to process ZWJ carefully.

CJK/Unihan Changes

The syntax of the kRS* fields, such as kRSUnicode and kRSKangXi, has been extended to allow for negative values of residual stroke counts. A negative value indicates that strokes which would normally constitute the indexing radical are intentionally missing. The kRSUnicode and kRSKangXi fields of a few CJK ideographs have been updated accordingly. Implementers should be prepared to handle negative values for residual stroke counts. In sorting, negative values should be replaced with zero to prevent characters with such values from sorting before the characters that represent the radical itself.
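
The following sketch shows one way to accept negative residual stroke counts and clamp them for sorting. The regular expression is an assumption based on the radical.strokes form described above, with an optional apostrophe after the radical number; the value `9.-1` is a made-up example, not taken from the Unihan data, and the function names are hypothetical.

```python
import re

# radical number, optional apostrophe, ".", optionally negative stroke count
_RS = re.compile(r"^(\d{1,3})('?)\.(-?\d{1,2})$")


def parse_rs(value: str):
    """Split a single radical-stroke value into (radical, marked, residual)."""
    m = _RS.match(value)
    if m is None:
        raise ValueError("not a radical-stroke value: %r" % value)
    radical, mark, residual = m.groups()
    return int(radical), bool(mark), int(residual)


def rs_sort_key(value: str):
    """Sort key that replaces a negative residual stroke count with zero."""
    radical, _marked, residual = parse_rs(value)
    return (radical, max(residual, 0))
```

With the clamp in place, a hypothetical value such as `9.-1` sorts together with `9.0` instead of before it.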

Many kMandarin readings have been updated. Implementations which depend on the kMandarin readings, such as phonetic sortings of Chinese data, need to be checked against these changes.

Standardized Variation Sequences

The constraints on standardized variation sequences have been relaxed slightly, to allow a spacing combining mark (General_Category = Spacing_Mark) as the initial character of a variation sequence. Nonspacing combining marks and canonically decomposable characters are still disallowed as the initial character of a variation sequence. Implementations should be checked for any assumptions regarding the allowed General_Category property values for the initial characters in variation sequences.

A full set of dotted forms of Myanmar letters for Khamti, Aiton, and Phake were added as standardized variation sequences, to distinguish them from the Burmese and Shan styles. One of these new standardized variation sequences has a spacing combining mark as the initial character of the sequence: <U+1031, U+FE00>.
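
A check reflecting the relaxed constraint might be sketched as follows. It covers only the two conditions stated above (nonspacing marks and canonically decomposable characters); the function name is hypothetical, and algorithmic Hangul syllable decompositions are not reported by `unicodedata.decomposition`, so they are not covered.

```python
import unicodedata


def may_start_variation_sequence(ch: str) -> bool:
    """Apply the two stated constraints to a candidate initial character."""
    if unicodedata.category(ch) == "Mn":
        return False  # nonspacing combining mark
    decomp = unicodedata.decomposition(ch)
    if decomp and not decomp.startswith("<"):
        return False  # canonical decomposition (no compatibility tag)
    return True  # includes spacing combining marks (Mc)
```

Under this check, U+1031 (a spacing combining mark) is accepted as an initial character, while a nonspacing mark such as U+0301 and a precomposed letter such as U+00E9 are rejected.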

A set of 278 variation sequences were added to complete the set of text and emoji presentations for all pictographic symbols identified as having a default text presentation. See UTR #51, Unicode Emoji.

New Properties

A new informative binary property was defined: Prepended_Concatenation_Mark, abbreviated as PCM. This property identifies the set of format characters loosely referred to as subtending marks, such as U+0600 ARABIC NUMBER SIGN, U+06DD ARABIC END OF AYAH, or U+070F SYRIAC ABBREVIATION MARK. These marks are used in sequences with other characters, and in rendering they extend under, around, or over the characters they span. These characters are not new to the standard, but the PCM property has been added to simplify the statement of specifications and implementations that refer to them.

The new property is used in the derived formation of extended grapheme clusters in text segmentation.

Code Charts

There have been significant changes to the code charts for Mongolian since the publication of Unicode 8.0. In addition to corrections for omitted glyphs, the charts have been updated to display more as they did in Unicode 7.0, with a summary of all Mongolian standardized variation sequences displayed at the end of the Mongolian block. The names list section now also shows contextual variant glyphs. These appear for each character that also has one or more standardized variation sequences associated with it.

The code charts for Version 9.0 incorporate new fonts for Canadian Aboriginal Syllabics, Cherokee and Egyptian Hieroglyphs. The glyph for the LARI SIGN has been updated, and a few other glyphs have been updated with corrections.

Unicode® 9.0.0

2016 June 21 (Announcement)
