UNICODE CHARACTER DATABASE

Revision	5.1.0
Authors	Mark Davis and Ken Whistler
Date	2008-03-25
This Version	http://www.unicode.org/Public/5.1.0/ucd/UCD.html
Previous Version	http://www.unicode.org/Public/5.0.0/ucd/UCD.html
Latest Version	http://www.unicode.org/Public/UNIDATA/UCD.html

Summary

This document describes the format and content of the Unicode Character Database (UCD).

Status

This file and the files described herein are part of the Unicode Character Database and are governed by the terms of use at http://www.unicode.org/terms_of_use.html.

The References provide related information that is useful in understanding this document.

Warning: the information in this file does not completely describe the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the Unicode Character Database, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 5.0.0 of the standard unless otherwise indicated.

Introduction
Conformance
UCD File Format
UCD Property Files
Auxiliary Property Files
Derived Extracted Property Files
Other UCD Files
Properties
Property and Property Value Matching
Property Invariants
Property Values
References
Modification History
UCD Terms of Use

Introduction

The Unicode Character Database (UCD) is a set of files that define the Unicode character properties and internal mappings. This document describes the properties and files that are part of The Unicode Standard, Version 5.1.0 [U5.1.0]. For a description of the changes in this version, see Modification History.

The file structure for the UCD changed in Version 4.1.0. Since then, each version of the UCD is a complete release, so users of the standard no longer need to assemble a full set of files for a version from the update directories of previous versions. Each version is in a directory of the following form:

http://www.unicode.org/Public/5.1.0/ucd/

Within this directory the structure is the same as in versions prior to 4.1.0, with two changes:

The file names are unversioned in the final release (although they may be versioned during beta review of the UCD data). Users of the files therefore do not need to strip release versions from individual file names, and the HTML files in the release can link to specific files.
An auxiliary directory has been added. In Version 5.1.0 it contains properties associated with UAX #29: Text Boundaries [Breaks].
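
The directory layout described above can be sketched as a small URL builder. This is only an illustration: the helper name `ucd_file_url` and its parameters are hypothetical, and the subdirectory names are the ones this document mentions (`auxiliary`, `extracted`).

```python
# Base location taken from the directory form shown above.
BASE = "http://www.unicode.org/Public"

def ucd_file_url(version: str, filename: str, subdir: str = "") -> str:
    """Build the URL of an unversioned UCD file inside a versioned directory."""
    parts = [BASE, version, "ucd"]
    if subdir:                      # e.g. "auxiliary" or "extracted"
        parts.append(subdir)
    parts.append(filename)          # file names carry no version suffix
    return "/".join(parts)

print(ucd_file_url("5.1.0", "UnicodeData.txt"))
print(ucd_file_url("5.1.0", "WordBreakProperty.txt", subdir="auxiliary"))
```

Because the file names are unversioned, only the directory component changes from one release to the next.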

Conformance

For information on the meaning and application of the terms normative, informative, and provisional, see Section 3.5, "Properties" in the Unicode Standard, Version 5.0.

UCD File Format

Files in the UCD use the following format, unless otherwise specified.

Each line of data consists of fields separated by semicolons. The fields are numbered starting with zero. Code points are expressed as hexadecimal numbers with four to six digits. They are written without "U+". Within a sequence of code points, spaces are used for separation. Leading and trailing spaces within a field are not significant.

The first field (0) of each line in the Unicode Character Database files represents a code point or range. The remaining fields (1..n) are properties associated with that code point.

A range of code points is specified by the form "X..Y". Each code point from X to Y has the associated property value. For example (from Blocks.txt):
```
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
```
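As a minimal sketch of the rules above, the following Python reads field 0 of a data line as either a single code point or an "X..Y" range. The function name `parse_code_points` is an assumption made for illustration; sequences of code points separated by spaces are not handled here.

```python
def parse_code_points(field: str) -> range:
    """Parse field 0 of a UCD data line: 'XXXX' or 'XXXX..YYYY' (hex, no 'U+')."""
    text = field.strip()                 # leading/trailing spaces are not significant
    start, sep, end = text.partition("..")
    first = int(start, 16)
    last = int(end, 16) if sep else first
    return range(first, last + 1)        # every code point from X to Y inclusive

line = "0080..00FF; Latin-1 Supplement"
fields = [f.strip() for f in line.split(";")]   # fields are numbered from zero
block = parse_code_points(fields[0])
print(fields[1], f"{block.start:04X}", f"{block.stop - 1:04X}", len(block))
```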
Property values may be omitted if they have a "default" value. For string properties, including the definition of foldings, the default value is the character itself. For others, the default value is listed in a comment. For example (from Scripts.txt):
```
#  All code points not explicitly listed for Script
#  have the value Common (Zyyy).
```
Where a file contains values for multiple properties, the second field will contain the name of the property and the third field will contain the property value. For example (from DerivedNormalizationProps.txt):
```
03D2  ; FC_NFKC; 03C5           # L&  GREEK UPSILON WITH HOOK SYMBOL
03D3  ; FC_NFKC; 03CD           # L&  GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
```
For binary properties, the second field given is the name of the applicable property, with the implied value of the property being "True". Only the ranges of characters with the binary property value of True are listed. For example (from PropList.txt):
```
1680       ; White_Space # Zs      OGHAM SPACE MARK
180E       ; White_Space # Zs      MONGOLIAN VOWEL SEPARATOR
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
```
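The two layouts just described (property name plus value, and property name alone for binary properties) could be read with one routine, sketched below. The function name and the returned structure are assumptions for illustration; text after "#" is treated as a comment and discarded.

```python
from collections import defaultdict

def parse_property_lines(lines):
    """Collect {property: [(first, last, value), ...]} from multi-property UCD lines."""
    table = defaultdict(list)
    for raw in lines:
        data = raw.split("#", 1)[0].strip()      # drop the trailing comment
        if not data:
            continue                             # blank or comment-only line
        fields = [f.strip() for f in data.split(";")]
        start, sep, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if sep else first
        # A missing third field means a binary property with implied value True.
        value = fields[2] if len(fields) > 2 else True
        table[fields[1]].append((first, last, value))
    return dict(table)

sample = [
    "03D2  ; FC_NFKC; 03C5           # L&  GREEK UPSILON WITH HOOK SYMBOL",
    "1680       ; White_Space # Zs      OGHAM SPACE MARK",
    "2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE",
]
props = parse_property_lines(sample)
print(props["White_Space"])
```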
For backwards compatibility, in the file UnicodeData.txt a range is specified not by the form "X..Y", but by an entry for the start and end characters of the range. Instead of a character name in the name field, a range identifier, followed by a comma and the string "First", in angle brackets, is given for the start character: <CJK Ideograph, First>. The end character is indicated with the same range identifier, followed by a comma and the string "Last", in angle brackets: <CJK Ideograph, Last>. In such cases, the names of all characters in the range are algorithmically derivable. See [U5.0] for more information on derivation of character names for such ranges.
Surrogate code points and private-use characters have no names.
Hash marks ("#") are used to indicate comments: all characters from the hash mark to the end of the line are comments, and disregarded when parsing data. In many files, the comments on data lines use a common format.
```
00BC..00BE ; numeric # No [3] VULGAR FRACTION ONE QUARTER..VULGAR FRACTION THREE QUARTERS
```
The first part of the comment is generally the UCD general category. The symbol "L&" indicates characters of type Lu, Ll, or Lt. This is the same as the LC property in PropertyValueAliases. The code point ranges are calculated so that they all have the same General Category (or LC). While this results in more ranges than are strictly necessary, it makes the contents of the ranges clearer. The second part of the comment (in square brackets) indicates the number of items in a range, if there is one. The third part is the name of the character in field zero: if it is a range, then the character names for the ends of the range are separated by "..".
- However, the comments are purely informational, and may change format or be omitted in the future. They should not be parsed for content.
In the QuickCheck property table, NF* refers to one of NFD, NFC, NFKC, or NFKD.
The Unihan data format differs from the standard format, and is described in [UAX38]. That UAX also describes which properties are informative, which are normative, and which are provisional.
In some cases, segments of a data file are distinguished by a line starting with an "@" sign.
The files use UTF-8, with the exception of NamesList.txt, which is encoded in Latin-1. Unless otherwise noted, non-ASCII characters only appear in comments.
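The First/Last convention for ranges in UnicodeData.txt, described in the list above, can be handled roughly as follows. This is a sketch under assumptions: the generator name is hypothetical, and the sample records are written in the UnicodeData.txt layout for illustration rather than read from the file.

```python
def unicode_data_ranges(lines):
    """Yield (first, last, name_or_range_id, general_category) per record or range."""
    pending = None  # (first code point, range identifier) waiting for its "Last" entry
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split(";")
        cp, name, gc = int(fields[0], 16), fields[1], fields[2]
        if name.startswith("<") and name.endswith(", First>"):
            pending = (cp, name[1:-len(", First>")])
        elif name.startswith("<") and name.endswith(", Last>"):
            range_id = name[1:-len(", Last>")]
            if pending is None or pending[1] != range_id:
                raise ValueError(f"unmatched range end at {fields[0]}")
            yield pending[0], cp, range_id, gc
            pending = None
        else:
            yield cp, cp, name, gc

sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FC3;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]
for first, last, label, gc in unicode_data_ranges(sample):
    print(f"{first:04X}..{last:04X} {gc} {label}")
```

The range identifier is returned in place of a name, since the individual names in such ranges are derived algorithmically rather than listed.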

UCD Property Files

The following table describes the format and meaning of each property data file in the main directory of the UCD. (An index by property name, rather than file, is found at Properties.) The first column lists the files and the properties for which they contain data. The second column indicates the type of the property: String, Numeric, Enumeration (non-binary), Binary, Catalog, or Miscellaneous. Catalog properties have enumerated values which are expected to be regularly extended with successive versions of the Unicode Standard. This distinguishes them from Enumeration properties, whose enumerated values constitute a logical partition space, for which new values will generally not be added in successive versions of the standard. An example of a Catalog property is the Block property. Miscellaneous properties do not fit into the other property categories, and currently include character names, comments about characters, or the Unicode_Radical_Stroke property (a combination of numeric values). The third column indicates the status (Normative, Informative, or Provisional), and the fourth column provides a description of the data.

The files with a small number of properties are listed first, followed by the files with a large number of properties: DerivedCoreProperties.txt, DerivedNormalizationProps.txt, PropList.txt, and UnicodeData.txt. For UnicodeData, the field numbers are supplied in the description. In a number of cases, fields in a data file only contribute to a UCD property; for example, the name field in UnicodeData.txt does not provide all the values for the Name property; Jamo.txt must be used as well.

None of these properties should be used without consulting the relevant discussions in the Unicode Standard.

Where a data file does not explicitly list property values for all code points, the code points are given default property values. These default property values are documented in the data files, with the exception of UnicodeData.txt. For that case the default property values are listed below in parentheses after the property name, with (=) indicating the code point itself. The default property values are also documented in any corresponding extracted data file.

ArabicShaping.txt
Joining_Type Joining_Group	E	N	Basic Arabic and Syriac character shaping properties, such as initial, medial and final shapes. See Section 8.2 in [Unicode].
BidiMirroring.txt
Bidi_Mirroring_Glyph	S	I	Properties for substituting characters in an implementation of bidirectional mirroring. See UAX #9: The Bidirectional Algorithm [BIDI]. Do not confuse this with the Bidi_Mirrored property.
Blocks.txt
Block	C	N	List of block names, which are arbitrary names for ranges of code points. See Chapter 17 in [Unicode].
CompositionExclusions.txt
Composition_Exclusion	B	N	Properties for normalization. See UAX #15: Unicode Normalization Forms [Norm]. Unlike other files, CompositionExclusions simply lists the relevant code points.
CaseFolding.txt
Simple_Case_Folding Case_Folding	S	N	Mapping from characters to their case-folded forms. This is an informative file containing normative derived properties. Derived from UnicodeData and SpecialCasing. Note: The case foldings are omitted in the data file if they are the same as the code point itself.
DerivedAge.txt
Age	C	N/I	This file shows when various code points were designated/assigned in successive versions of the Unicode Standard.
EastAsianWidth.txt
East_Asian_Width	E	I	Properties for determining the choice of wide vs. narrow glyphs in East Asian contexts. Property values are described in UAX #11: East Asian Width [Width].
HangulSyllableType.txt
Hangul_Syllable_Type	E	N	The values L, V, T, LV, and LVT used in Chapter 3 in [Unicode].
Jamo.txt
Jamo_Short_Name	M	N	The Hangul Syllable names are derived from the Jamo Short Names, as described in Chapter 3 in [Unicode].
LineBreak.txt
Line_Break	E	N	Properties for line breaking. For more information, see UAX #14: Line Breaking Properties [Line].
NameAliases.txt
Name_Alias	M	N	Normative formal aliases for characters with erroneous names, as described in Chapter 4. These aliases match exactly the formal aliases published in the code charts of the Unicode Standard.
NormalizationCorrections.txt
used in Decomposition Mappings	S	N	NormalizationCorrections lists code point differences for Normalization Corrigenda. For more information, see UAX #15: Unicode Normalization Forms [Norm].
PropertyAliases.txt
n/a	S	N/I	Property names and abbreviations. These names can be used for XML formats of UCD data, for regular-expression property tests, and other programmatic textual descriptions of Unicode data.
PropertyValueAliases.txt
n/a	S	N/I	Property value names and abbreviations. These names can be used for XML formats of UCD data, for regular-expression property tests, and other programmatic textual descriptions of Unicode data.
Scripts.txt
Script	C	I	Default script values for use in regular expressions. For more information, see UAX #24: Script Names [Script].
SpecialCasing.txt
Uppercase_Mapping Lowercase_Mapping Titlecase_Mapping	S	I	Data for producing (in combination with UnicodeData.txt) the full case mappings.
Unihan.txt (for more information, see [UAX38])
Numeric_Type Numeric_Value	E	I	The characters tagged with kPrimaryNumeric, kAccountingNumeric, and kOtherNumeric are given the Numeric_Type numeric, and the values indicated. Most characters have these properties based on values from the UnicodeData.txt data file. See Numeric_Type.
Unicode_Radical_Stroke	M	I	The Unicode radical stroke count, based on the tag kRSUnicode.
DerivedCoreProperties.txt
Alphabetic	B	I	Characters with the Alphabetic property. For more information, see Chapter 4 in [Unicode]. Generated from: Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic
Default_Ignorable_Code_Point	B	N	For programmatic determination of default ignorable code points. New characters that should be ignored in rendering (unless explicitly supported) will be assigned in these ranges, permitting programs to correctly handle the default rendering of such characters when not otherwise supported. For more information, see the FAQ Display of Unsupported Characters, and Section 5.20 in [Unicode]. Generated from Other_Default_Ignorable_Code_Point + Cf (format characters) + Variation_Selector - White_Space - FFF9..FFFB (annotation characters) - 0600..0603, 06DD, 070F (exceptional Cf characters that should be visible)
Lowercase	B	I	Characters with the Lowercase property. For more information, see Chapter 4 in [Unicode]. Generated from: Ll + Other_Lowercase
Grapheme_Base	B	I	For programmatic determination of grapheme cluster boundaries. For more information, see UAX #29: Text Boundaries [Breaks]. Generated from: [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp - Grapheme_Extend
Grapheme_Extend	B	I	For programmatic determination of grapheme cluster boundaries. For more information, see UAX #29: Text Boundaries [Breaks]. Generated from: Me + Mn + Other_Grapheme_Extend. Note: Depending on an application's interpretation of Co (private use), private-use characters may be in Grapheme_Base, in Grapheme_Extend, or in neither.
Grapheme_Link	B	I	Deprecated property, once proposed for programmatic determination of grapheme cluster boundaries. Generated from: Canonical_Combining_Class=Virama
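ID_Start	B	I	Used to determine programming identifiers, as described in UAX #31: Identifier and Pattern Syntax [Pattern].
ID_Continue	B	I	Used to determine programming identifiers, as described in UAX #31: Identifier and Pattern Syntax [Pattern].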
Math	B	I	Characters with the Math property. For more information, see Chapter 4 in [Unicode]. Generated from: Sm + Other_Math
Uppercase	B	I	Characters with the Uppercase property. For more information, see Chapter 4 in [Unicode]. Generated from: Lu + Other_Uppercase
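XID_Start	B	I	Used to determine programming identifiers, as described in UAX #31: Identifier and Pattern Syntax [Pattern].
XID_Continue	B	I	Used to determine programming identifiers, as described in UAX #31: Identifier and Pattern Syntax [Pattern].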
DerivedNormalizationProps.txt
Full_Composition_Exclusion	B	N	Characters that are excluded from composition: those explicitly in CompositionExclusions.txt, plus: (3) Singleton Decompositions (4) Non-Starter Decompositions
Expands_On_NFC Expands_On_NFD Expands_On_NFKC Expands_On_NFKD	B	N	Characters that expand to more than one character in the specified normalization form.
FC_NFKC_Closure	S	N	Characters that require extra mappings for closure under Case Folding plus Normalization Form KC. Characters marked with this property have a third field with the mapping in it. Generated with the following, where Fold is the default fold operation (not Turkic): b = NFKC(Fold(a)); c = NFKC(Fold(b)); if (c != b), add a mapping from a to c to the set of mappings that constitute the FC_NFKC_Closure list. Note: The FC_NFKC_Closure value is omitted in the data file if it is the same as the code point itself.
NFD_Quick_Check NFKD_Quick_Check NFC_Quick_Check NFKC_Quick_Check	E	N	For property values, see Decompositions and Normalization. (Abbreviated names: NFD_QC, NFKD_QC, NFC_QC, NFKC_QC)
PropList.txt
ASCII_Hex_Digit	B	N	ASCII characters commonly used for the representation of hexadecimal numbers.
Bidi_Control	B	N	Those format control characters which have specific functions in the Bidirectional Algorithm.
Dash	B	I	Those punctuation characters explicitly called out as dashes in the Unicode Standard, plus compatibility equivalents to those. Most of these have the Pd General Category, but some have the Sm General Category because of their use in mathematics.
Deprecated	B	N	For a machine-readable list of deprecated characters. No characters will ever be removed from the standard, but the usage of deprecated characters is strongly discouraged.
Diacritic	B	I	Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics.
Extender	B	I	Characters whose principal function is to extend the value or shape of a preceding alphabetic character. Typical of these are length and iteration marks.
Hex_Digit	B	I	Characters commonly used for the representation of hexadecimal numbers, plus their compatibility equivalents.
Hyphen (Stabilized as of 3.2)	B	I	Those dashes used to mark connections between pieces of words, plus the Katakana middle dot. The Katakana middle dot functions like a hyphen, but is shaped like a dot rather than a dash.
Ideographic	B	I	Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs.
IDS_Binary_Operator	B	N	Used in Ideographic Description Sequences.
IDS_Trinary_Operator	B	N	Used in Ideographic Description Sequences.
Join_Control	B	N	Those format control characters which have specific functions for control of cursive joining and ligation.
Logical_Order_Exception	B	N	There are a small number of characters that do not use logical order. These characters require special handling in most processing.
Noncharacter_Code_Point	B	N	Code points that are permanently reserved for internal use.
Other_Alphabetic	B	I	Used in deriving the Alphabetic property.
Other_Default_Ignorable_Code_Point	B	N	Used in deriving the Default_Ignorable_Code_Point property.
Other_Grapheme_Extend	B	N	Used in deriving the Grapheme_Extend property.
Other_ID_Continue	B	N	Used for backwards compatibility of ID_Continue.
Other_ID_Start	B	N	Used for backwards compatibility of ID_Start.
Other_Lowercase	B	I	Used in deriving the Lowercase property.
Other_Math	B	I	Used in deriving the Math property.
Other_Uppercase	B	I	Used in deriving the Uppercase property.
Pattern_Syntax	B	N	Used for pattern syntax as described in UAX #31: Identifier and Pattern Syntax [Pattern].
Pattern_White_Space	B	N	Used for pattern syntax as described in UAX #31: Identifier and Pattern Syntax [Pattern].
Quotation_Mark	B	I	Those punctuation characters that function as quotation marks.
Radical	B	N	Used in Ideographic Description Sequences.
Soft_Dotted	B	N	Characters with a "soft dot", like i or j. An accent placed on these characters causes the dot to disappear. An explicit dot above can be added where required, such as in Lithuanian.
STerm	B	I	Sentence Terminal. Used in UAX #29: Text Boundaries [Breaks].
Terminal_Punctuation	B	I	Those punctuation characters that generally mark the end of textual units.
Unified_Ideograph	B	N	Used in Ideographic Description Sequences.
Variation_Selector	B	N	Indicates all those characters that qualify as Variation Selectors. For details on the behavior of these characters, see StandardizedVariants.html and Section 16.4, Variation Selectors in [Unicode].
White_Space	B	N	Those separator characters and control characters which should be treated by programming languages as "white space" for the purpose of parsing elements. Note: ZERO WIDTH SPACE and ZERO WIDTH NO-BREAK SPACE are not included, since their functions are restricted to line-break control. Their names are unfortunately misleading in this respect. Note: There are other senses of "whitespace" that encompass a different set of characters.
UnicodeData.txt
Name (<none>)	M	N	(1) These names match exactly the names published in the code charts of the Unicode Standard. The Hangul Syllable names are omitted from this file; see Jamo.txt. For UnicodeData.txt, default values for the property are shown in parentheses after the property name in this table. See PropertyValueAliases.txt for information on all default values.
General_Category (Cn)	E	N	(2) This is a useful breakdown into various character types which can be used as a default categorization in implementations. For the property values, see General Category Values.
Canonical_Combining_Class (0)	N	N	(3) The classes used for the Canonical Ordering Algorithm in the Unicode Standard. This property could be considered either an enumerated property or a numeric property: the principal use of the property is in terms of the numeric values. For the property value names associated with different numeric values, see DerivedCombiningClass.txt and Canonical Combining Class Values.
Bidi_Class (L, AL, R)	E	N	(4) These are the categories required by the Bidirectional Behavior Algorithm in the Unicode Standard. For the property values, see Bidi Class Values. For more information, see UAX #9: The Bidirectional Algorithm [BIDI]. The default property values depend on the code point, and are given in extracted/DerivedBidiClass.txt
Decomposition_Type (None) Decomposition_Mapping (<code point>)	E S	N	(5) This field contains both values, with the type in angle brackets. The decomposition mappings match exactly the decomposition mappings published with the character names in the Unicode Standard. For more information, see Character Decomposition Mappings. Note: The decomposition mapping is omitted in the data file if the decomposition mapping is the same as the code point itself.
Numeric_Type (None) Numeric_Value (NaN)	E N	N	(6) If the character has the decimal digit property, as specified in Chapter 4 of the Unicode Standard, then the value of that digit is represented with an integer value in fields 6, 7, and 8.
	E N	N	(7) If the character has the digit property, but is not a decimal digit, then the value of that digit is represented with an integer value in fields 7 and 8. This covers digits that need special handling, such as the compatibility superscript digits.
	E N	N	(8) If the character has the numeric property, as specified in Chapter 4 of the Unicode Standard, the value of that character is represented with a positive or negative integer or rational number in this field. This includes fractions, for example "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Some characters have these properties based on values from the Unihan data file. See Numeric_Type, Han.
Bidi_Mirrored (N)	B	N	(9) If the character has been identified as a "mirrored" character in bidirectional text, this field has the value "Y"; otherwise "N". The list of mirrored characters is also printed in Chapter 4 of the Unicode Standard. Do not confuse this with the Bidi_Mirroring_Glyph property.
Unicode_1_Name (<none>)	M	I	(10) This is the old name as published in Unicode 1.0. This name is only provided when it is significantly different from the current name for the character. The value of field 10 for control characters does not always match the Unicode 1.0 names. Instead, field 10 contains ISO 6429 names for control functions, for printing in the code charts.
ISO_Comment (<none>)	M	I	(11) This is the ISO 10646 comment field. It appears in parentheses in the 10646 names list, or contains an asterisk to mark an Annex P note.
Simple_Uppercase_Mapping (<code point>)	S	N	(12) Simple uppercase mapping (single character result). If a character is part of an alphabet with case distinctions, and has a simple upper case equivalent, then the upper case equivalent is in this field. See the explanation below on case distinctions. The simple mappings have a single character result, where the full mappings may have multi-character results. For more information, see Case Mappings. Note: The simple uppercase is omitted in the data file if the uppercase is the same as the code point itself.
Simple_Lowercase_Mapping (<code point>)	S	N	(13) Simple lowercase mapping (single character result). Similar to Uppercase mapping. Note: The simple lowercase is omitted in the data file if the lowercase is the same as the code point itself.
Simple_Titlecase_Mapping (<code point>)	S	N	(14) Similar to Uppercase mapping (single character result). Note: The simple titlecase may be omitted in the data file if the titlecase is the same as the uppercase.
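
The generation rule given for FC_NFKC_Closure in the table above can be approximated with Python's standard library. This is a sketch, not the procedure used to build the data file: it assumes that `str.casefold()` stands in for the default (non-Turkic) Fold operation, and it uses whatever Unicode version the running Python ships with rather than Version 5.1.0.

```python
import unicodedata

def fc_nfkc_closure(a: str):
    """Return the closure mapping for character a, or None when no mapping is needed."""
    b = unicodedata.normalize("NFKC", a.casefold())   # b = NFKC(Fold(a))
    c = unicodedata.normalize("NFKC", b.casefold())   # c = NFKC(Fold(b))
    return c if c != b else None                      # record a -> c only if c != b

for ch in ("\u03d2", "\u03d3", "A"):
    mapped = fc_nfkc_closure(ch)
    target = " ".join(f"{ord(x):04X}" for x in mapped) if mapped else "(no mapping)"
    print(f"{ord(ch):04X} ; FC_NFKC; {target}")
```

For U+03D2 and U+03D3 the result agrees with the DerivedNormalizationProps.txt lines quoted earlier in this document (03C5 and 03CD respectively).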

Note: Stabilized properties are no longer actively maintained, nor are they extended as new characters are added.

Auxiliary Property Files

A number of auxiliary properties are contained in files in the auxiliary subdirectory. They consist of the following:

GraphemeBreakProperty.txt
Grapheme_Cluster_Break	E	I	See UAX #29: Text Boundaries [Breaks]
SentenceBreakProperty.txt
Sentence_Break	E	I	See UAX #29: Text Boundaries [Breaks]
WordBreakProperty.txt
Word_Break	E	I	See UAX #29: Text Boundaries [Breaks]

Derived Extracted Property Files

The following properties of the UCD have been separated out, reformatted, and listed in range format, one property per file, except as noted. These files are provided purely as a reformatting of existing data; any exceptions are noted in the table below. All files for derived extracted properties are contained in a subdirectory called extracted.

Files	N/I	Definition and Generation
DerivedBidiClass		From UnicodeData.txt, field 4
DerivedBinaryProperties		The Bidi_Mirrored property from UnicodeData.txt, field 9. See Bidi Note.
DerivedCombiningClass		From UnicodeData.txt, field 3
DerivedDecompositionType		From the <tag> in UnicodeData.txt, field 5. For characters with canonical decomposition mappings (no tag), the value "canonical" is used. The value "canonical" is normative; the others are informative.
DerivedEastAsianWidth		From EastAsianWidth.txt, field 1
DerivedGeneralCategory		From UnicodeData.txt, field 2
DerivedJoiningGroup		From ArabicShaping.txt, field 2
DerivedJoiningType		From ArabicShaping.txt, field 1
DerivedLineBreak		From LineBreak.txt, field 1. For more information, see UAX #14: Line Breaking Properties [Line].
DerivedNumericType		The property value is based on which of fields 6 through 8 of UnicodeData.txt are non-empty: decimal (fields 6, 7, and 8), digit (fields 7 and 8), numeric (field 8).
DerivedNumericValues		The numeric value from UnicodeData.txt, field 8
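
The DerivedNumericType rule (which of fields 6, 7, and 8 are non-empty) and the DerivedNumericValues rule (field 8) could be applied to a UnicodeData.txt record as sketched below. The function name is hypothetical, the sample records are written in the UnicodeData.txt layout for illustration, and Python's None stands in for the default Numeric_Type of None.

```python
from fractions import Fraction

def numeric_type_and_value(record: str):
    """Return (Numeric_Type, Numeric_Value) for one UnicodeData.txt record."""
    fields = record.split(";")
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal and digit and numeric:
        kind = "decimal"            # fields 6, 7, and 8 are non-empty
    elif digit and numeric:
        kind = "digit"              # fields 7 and 8 are non-empty
    elif numeric:
        kind = "numeric"            # only field 8 is non-empty
    else:
        return None, None           # defaults: Numeric_Type None, no value
    return kind, Fraction(numeric)  # field 8 may be an integer or a rational like 1/5

samples = [
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;",
    "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;",
    "2155;VULGAR FRACTION ONE FIFTH;No;0;ON;<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
]
for record in samples:
    print(record.split(";")[0], numeric_type_and_value(record))
```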

Bidi Note: The Bidi_Mirrored property and the Bidi_Mirroring_Glyph property are different. The former is a normative property that indicates whether characters are mirrored in a right-to-left context in the Unicode Bidirectional Algorithm. The latter is an informative mapping of a subset of the Bidi_Mirrored characters to characters that normally have the corresponding mirrored glyph.

Other UCD Files

The following files in the Unicode Character Database are not used directly for Unicode properties. For more information about these files, see the referenced technical report(s), files, or sections of the Unicode Standard.

".txt" File	Description	N/I	Summary
Index	Chapter 17	I	Index to Unicode characters, as printed in the Unicode Standard.
NamesList	Chapter 17	I	This file duplicates some of the material in the UnicodeData file, and adds annotations used in the character charts.
NormalizationTest	UAX #15	N	Test file for conformance to Unicode Normalization Forms. See UAX #15: Unicode Normalization Forms [Norm]
StandardizedVariants	Chapter 16	N	Lists all the standardized variant sequences that have been defined, plus a description of the desired appearance. StandardizedVariants.html contains this information, plus a sample glyph showing the desired features.
NamedSequences	UAX #34	N	Lists the names for all approved named sequences.
NamedSequencesProv	UAX #34	P	Lists the names for all provisional named sequences.

Properties

The following table lists the properties in the UCD. They are roughly organized into groups based on the usage of the property (this grouping is purely for convenience, and has no other implications). The link on each property leads to its description in the file index. The contributory properties (those of the form Other_XXX) are sets of exceptions used to generate properties in DerivedCoreProperties.txt. They are incomplete by themselves and not intended for independent use; for example, an API returning property values would implement the corresponding derived core property instead.

Group	Properties
General	Name, Name_Alias, Block, Age, General_Category, Script, White_Space, Alphabetic, Hangul_Syllable_Type, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Deprecated, Logical_Order_Exception, Variation_Selector
Case	Uppercase, Lowercase, Lowercase_Mapping, Titlecase_Mapping, Uppercase_Mapping, Case_Folding, Simple_Lowercase_Mapping, Simple_Titlecase_Mapping, Simple_Uppercase_Mapping, Simple_Case_Folding, Soft_Dotted
Identifiers	ID_Continue, ID_Start, XID_Continue, XID_Start, Pattern_Syntax, Pattern_White_Space
Decomposition and Normalization	Canonical_Combining_Class, Decomposition_Mapping, Composition_Exclusion, Full_Composition_Exclusion, Decomposition_Type, FC_NFKC_Closure, NFC_Quick_Check, NFKC_Quick_Check, NFD_Quick_Check, NFKD_Quick_Check, Expands_On_NFC, Expands_On_NFD, Expands_On_NFKC, Expands_On_NFKD
Shaping and Rendering	Join_Control, Joining_Group, Joining_Type, Line_Break, Grapheme_Cluster_Break, Sentence_Break, Word_Break, East_Asian_Width
Bidi	Bidi_Control, Bidi_Mirrored, Bidi_Class, Bidi_Mirroring_Glyph
Numeric	Numeric_Value, Numeric_Type, Hex_Digit, ASCII_Hex_Digit
CJK	Ideographic, Unified_Ideograph, Radical, IDS_Binary_Operator, IDS_Trinary_Operator, Unicode_Radical_Stroke
Misc	Math, Quotation_Mark, Dash, Hyphen, STerm, Terminal_Punctuation, Diacritic, Extender, Grapheme_Base, Grapheme_Extend, Grapheme_Link (deprecated), Unicode_1_Name, ISO_Comment
Contributory Properties	Other_Alphabetic, Other_Default_Ignorable_Code_Point, Other_Grapheme_Extend, Other_ID_Start, Other_ID_Continue, Other_Lowercase, Other_Math, Other_Uppercase, Jamo_Short_Name

Property and Property Value Matching

Properties and property values may have multiple aliases, such as abbreviated names and longer, more descriptive names. For example, one can write either Line_Break or LB for the Line Break property, and either OP or Open_Punctuation for one of its values. When matching property names and values, it is strongly recommended that all aliases in the UCD be recognized, and that loose matching be applied to all property names and property values according to the rules below.

For a general discussion of Unicode character properties, see Section 3.5, "Properties" in [Unicode], and UTR #23: The Unicode Character Property Model [UTR23].

Numeric Properties

For all numeric properties, and properties such as Unicode_Radical_Stroke that are combinations of numeric values, use the following loose matching rule:

LM1. Apply numeric equivalences

"01.00" is equivalent to "1".
"1.666667" in the UCD is a repeating fraction, and equivalent to 10/6.

Character Names

LM2. Ignore case, whitespace, underscore ('_'), and all medial hyphens except the hyphen in U+1180.

"zero-width space" is equivalent to "zero width space" or "zerowidthspace"
"character -a" is not equivalent to "character a"

Others

For all property names, property value names, and for property values for Enumerated, Binary, or Catalog properties, use the following loose matching rule:

LM3. Ignore case, whitespace, underscore ('_'), and hyphens.

"linebreak" is equivalent to "Line_Break" or "Line-break"

Otherwise loose matching should not be done for the property values of String properties, as case distinctions or other distinctions in those values may be significant.
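
The three loose matching rules above (LM1, LM2, LM3) can be sketched as key functions: two values match when their keys are equal. Several details are assumptions rather than part of the rules as stated: the function names, the `MAX_DENOMINATOR` cutoff used to recover rounded fractions such as "1.666667", the reading of "medial hyphen" as a hyphen directly between two letters or digits, and the character names of U+1180 (HANGUL JUNGSEONG O-E) and U+116C (HANGUL JUNGSEONG OE) used for the exception.

```python
import re
from fractions import Fraction

MAX_DENOMINATOR = 1000  # hypothetical cutoff, not taken from the UCD

def numeric_key(text: str) -> Fraction:
    """LM1: compare numeric values as rationals, so "01.00" matches "1"."""
    return Fraction(text.strip()).limit_denominator(MAX_DENOMINATOR)

def name_key(name: str) -> str:
    """LM2: ignore case, whitespace, underscore, and medial hyphens in character names."""
    no_medial = re.sub(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])", "", name)
    key = re.sub(r"[\s_]", "", no_medial).lower()
    # Keep the hyphen of U+1180 so it stays distinct from U+116C.
    if key == "hanguljungseongoe" and re.search(r"o-e\s*$", name, re.IGNORECASE):
        key = "hanguljungseongo-e"
    return key

def symbol_key(text: str) -> str:
    """LM3: ignore case, whitespace, underscore, and hyphens in property names and values."""
    return re.sub(r"[\s_-]", "", text).lower()

print(numeric_key("1.666667") == Fraction(10, 6))
print(name_key("zero-width space") == name_key("zerowidthspace"))
print(name_key("character -a") == name_key("character a"))
print(symbol_key("Line-break") == symbol_key("linebreak"))
```

None of these keys should be applied to the values of String properties, where case and other distinctions may be significant.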

Property Invariants

Values in the UCD are subject to correction as errors are found; however, some characteristics of the properties and files are considered invariants. Applications may wish to take these invariants into account when choosing how to implement character properties. All formally guaranteed invariants of property values are described in Unicode Policies. The following lists some additional invariants regarding file organization and more detail on a few of the invariants in the Unicode Policies.

UnicodeData Fields

The number of fields in UnicodeData.txt is fixed.
- Any additional information about character properties to be added in the future will appear in separate data files, rather than being added as an additional field or by subdivision or reinterpretation of existing fields.
The order of the fields is also fixed.

Combining Classes

The values of the Canonical_Combining_Class property are invariant.
Combining classes are limited to the values 0 to 255.
- In practice, there are far fewer than 256 values used; Unicode 3.0 used 53 values, and Unicode 4.0 used 54 values total. (For details, see DerivedCombiningClasses.txt in the UCD.) Implementations may take advantage of this fact for compression, since only the ordering of the non-zero values matters for the Canonical Ordering Algorithm. In principle, it would be possible for up to 256 values to be used in the future; however, new combining classes are added very seldom. There are implementation advantages in restricting the number of classes to 128—for example, the ability to use signed bytes without widening to ints in Java.
All characters other than those of General Category M* have the combining class 0.
- Currently, a narrower condition also holds: all characters other than those of General Category Mn have the value 0. However, some characters of General Category Me or Mc may be given non-zero values in the future.
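
The invariant that only characters of General Category M* carry a non-zero combining class can be spot-checked with Python's unicodedata module, as in the sketch below. Note the assumption: unicodedata reflects whichever Unicode version ships with the interpreter, which is normally later than 5.1.0. The function names are hypothetical.

```python
import sys
import unicodedata


def nonzero_ccc_outside_marks() -> list:
    """Code points with a non-zero combining class whose gc is not M*."""
    offenders = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.combining(ch) != 0 and not unicodedata.category(ch).startswith("M"):
            offenders.append(cp)
    return offenders


def combining_classes_in_use() -> set:
    """Distinct Canonical_Combining_Class values present in the data."""
    return {unicodedata.combining(chr(cp)) for cp in range(sys.maxunicode + 1)}


print(nonzero_ccc_outside_marks())
print(len(combining_classes_in_use()))
```
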

Decimal Digits

In Unicode 4.0 and thereafter, the General_Category value Decimal_Number (Nd), and the Numeric_Type value Decimal (de) are defined to be co-extensive, that is, the set of characters having Nd will always be the same as the set of characters having de.
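
A similar check, again a sketch that assumes Python's unicodedata module (bundled Unicode version, not necessarily 5.1.0), compares the set of gc=Nd characters with the set of characters for which a decimal value is defined. The function name nd_decimal_mismatches is hypothetical.

```python
import sys
import unicodedata


def nd_decimal_mismatches() -> list:
    """Code points where gc=Nd and Numeric_Type=Decimal disagree."""
    mismatches = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        is_nd = unicodedata.category(ch) == "Nd"
        # decimal() returns the supplied default when nt is not Decimal.
        has_decimal = unicodedata.decimal(ch, None) is not None
        if is_nd != has_decimal:
            mismatches.append(cp)
    return mismatches


print(nd_decimal_mismatches())
```
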

Property Values

The following gives a summary of property values for certain properties. Other property values are documented in other locations; for example, the line breaking property values are documented in UAX #14: Line Breaking Properties [Line].

General Category Values

The General_Category property of a code point provides the most basic classification of that code point. It is usually determined by the primary characteristic of the assigned character for that code point: for example, whether it is a letter, a mark, a number, punctuation, or a symbol, and if so, of what type. Many characters have multiple uses, and not all such cases can be captured entirely by the General_Category value. For more information, see Chapter 4 in [Unicode].

The values in the General_Category field in UnicodeData.txt are abbreviations for the longer descriptions enumerated in the table below.

Abbr.	Description
Lu	Letter, Uppercase
Ll	Letter, Lowercase
Lt	Letter, Titlecase
Lm	Letter, Modifier
Lo	Letter, Other
Mn	Mark, Nonspacing
Mc	Mark, Spacing Combining
Me	Mark, Enclosing
Nd	Number, Decimal Digit
Nl	Number, Letter
No	Number, Other
Pc	Punctuation, Connector
Pd	Punctuation, Dash
Ps	Punctuation, Open
Pe	Punctuation, Close
Pi	Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
Pf	Punctuation, Final quote (may behave like Ps or Pe depending on usage)
Po	Punctuation, Other
Sm	Symbol, Math
Sc	Symbol, Currency
Sk	Symbol, Modifier
So	Symbol, Other
Zs	Separator, Space
Zl	Separator, Line
Zp	Separator, Paragraph
Cc	Other, Control
Cf	Other, Format
Cs	Other, Surrogate
Co	Other, Private Use
Cn	Other, Not Assigned (no characters in the file have this property)

Note: The term "L&" is used to stand for Uppercase, Lowercase or Titlecase letters (Lu, Ll, or Lt) in comments. The LC value in PropertyValueAliases.txt also stands for Uppercase, Lowercase or Titlecase letters.

Note: The Unicode Standard does not assign information to control characters (except for certain cases). Implementations will generally also assign categories to certain control characters, notably CR and LF, according to platform conventions. See Section 5.8 "Newline Guidelines" in [Unicode] for more information.

Bidi Class Values

Please refer to UAX #9: The Bidirectional Algorithm [BIDI] for an explanation of the algorithm for Bidirectional Behavior and an explanation of the significance of these categories.

Type	Description
L	Left-to-Right
LRE	Left-to-Right Embedding
LRO	Left-to-Right Override
R	Right-to-Left
AL	Right-to-Left Arabic
RLE	Right-to-Left Embedding
RLO	Right-to-Left Override
PDF	Pop Directional Format
EN	European Number
ES	European Number Separator
ET	European Number Terminator
AN	Arabic Number
CS	Common Number Separator
NSM	Non-Spacing Mark
BN	Boundary Neutral
B	Paragraph Separator
S	Segment Separator
WS	Whitespace
ON	Other Neutrals

Character Decomposition Mapping

The tags supplied with certain decomposition mappings generally indicate formatting information. Where no such tag is given, the mapping is canonical. Conversely, the presence of a formatting tag also indicates that the mapping is a compatibility mapping and not a canonical mapping. In the absence of other formatting information in a compatibility mapping, the tag <compat> is used to distinguish it from canonical mappings.

In some instances a canonical mapping or a compatibility mapping may consist of a single character. For a canonical mapping, this indicates that the character is a canonical equivalent of another single character. For a compatibility mapping, this indicates that the character is a compatibility equivalent of another single character. The compatibility formatting tags used are:

Tag	Description
<font>	A font variant (e.g. a blackletter form).
<noBreak>	A no-break version of a space or hyphen.
<initial>	An initial presentation form (Arabic).
<medial>	A medial presentation form (Arabic).
<final>	A final presentation form (Arabic).
<isolated>	An isolated presentation form (Arabic).
<circle>	An encircled form.
<super>	A superscript form.
<sub>	A subscript form.
<vertical>	A vertical layout presentation form.
<wide>	A wide (or zenkaku) compatibility character.
<narrow>	A narrow (or hankaku) compatibility character.
<small>	A small variant form (CNS compatibility).
<square>	A CJK squared font variant.
<fraction>	A vulgar fraction form.
<compat>	Otherwise unspecified compatibility character.
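
The distinction between tagged (compatibility) and untagged (canonical) mappings can be seen by splitting the value returned by Python's unicodedata.decomposition(), which uses the same tag-then-code-points layout as the UnicodeData.txt field. This is a minimal sketch; the helper name decomposition_mapping is hypothetical.

```python
import unicodedata


def decomposition_mapping(ch: str):
    """Return (tag, code points); tag is None for a canonical mapping.

    An empty list means no decomposition mapping is listed.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith("<"):
        tag = fields.pop(0)  # formatting tag marks a compatibility mapping
    return tag, [int(field, 16) for field in fields]


print(decomposition_mapping("\u00bd"))  # VULGAR FRACTION ONE HALF
print(decomposition_mapping("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
```
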

Reminder: There is a difference between decomposition and decomposition mapping. The decomposition mappings are defined in UnicodeData.txt, while the decomposition (also termed "full decomposition") is defined in Chapter 3 to use those mappings recursively.

The canonical decomposition is formed by recursively applying the canonical mappings, then applying the canonical reordering algorithm.
The compatibility decomposition is formed by recursively applying the canonical and compatibility mappings, then applying the canonical reordering algorithm.
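
The canonical case can be sketched as follows: apply the canonical mappings recursively, then sort each run of characters with non-zero combining class by that class, keeping the original order among equal classes. The sketch relies on Python's unicodedata module and deliberately leaves Hangul syllables untouched, because their mappings are algorithmic and are not returned by unicodedata.decomposition(). All function names are hypothetical.

```python
import unicodedata


def canonical_mapping(ch: str):
    """Canonical decomposition mapping of ch, or None if there is none."""
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0].startswith("<"):
        return None  # nothing listed, or a compatibility mapping
    return "".join(chr(int(field, 16)) for field in fields)


def decompose_char(ch: str) -> str:
    """Apply canonical mappings recursively to a single character."""
    mapped = canonical_mapping(ch)
    if mapped is None:
        return ch
    return "".join(decompose_char(c) for c in mapped)


def canonical_reorder(text: str) -> str:
    """Stable-sort each run of non-zero combining classes."""
    result, run = [], []
    for ch in text:
        if unicodedata.combining(ch):
            run.append(ch)
        else:
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)


def canonical_decomposition(text: str) -> str:
    return canonical_reorder("".join(decompose_char(c) for c in text))


sample = "\u01d6"  # LATIN SMALL LETTER U WITH DIAERESIS AND MACRON
print([f"U+{ord(c):04X}" for c in canonical_decomposition(sample)])
```

For text without Hangul syllables, the result should agree with unicodedata.normalize("NFD", text).
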

The normalization of Hangul conjoining jamos and of Hangul syllables depends on algorithmic mapping, as specified in Section 3.12, Conjoining Jamo Behavior in [Unicode]. That algorithm specifies the full decomposition of all precomposed Hangul syllables, but effectively it is equivalent to the recursive application of pairwise decomposition mappings, as for all other Unicode characters. Formally, the Decomposition_Mapping (dm) property value for a Hangul syllable is the pairwise decomposition and not the full decomposition.

Each character with the Hangul_Syllable_Type value LVT will have a decomposition mapping consisting of a character with an LV value and a character with a T value. Thus for U+CE31 the decomposition mapping is <U+CE20, U+11B8>, and not <U+110E, U+1173, U+11B8>.
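
The pairwise mapping can be computed arithmetically. The sketch below uses the constants from the Hangul syllable algorithm in Section 3.12 of [Unicode] (SBase, LBase, VBase, TBase and the L, V, and T counts); the function name hangul_decomposition_mapping is hypothetical.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # 11172 precomposed syllables


def hangul_decomposition_mapping(cp: int) -> tuple:
    """Pairwise Decomposition_Mapping of a precomposed Hangul syllable."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"U+{cp:04X} is not a precomposed Hangul syllable")
    t_index = s_index % T_COUNT
    if t_index:
        # LVT syllable: maps to the LV syllable plus the trailing consonant.
        return (cp - t_index, T_BASE + t_index)
    # LV syllable: maps to the leading consonant plus the vowel.
    l_index = s_index // N_COUNT
    v_index = (s_index % N_COUNT) // T_COUNT
    return (L_BASE + l_index, V_BASE + v_index)


print([f"U+{c:04X}" for c in hangul_decomposition_mapping(0xCE31)])
print([f"U+{c:04X}" for c in hangul_decomposition_mapping(0xCE20)])
```

Applying the function to U+CE31 and then to the resulting LV syllable reproduces the full decomposition given above.
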

Canonical Combining Class Values

Value	Description
0:	Spacing, split, enclosing, reordrant, and Tibetan subjoined
1:	Overlays and interior
7:	Nuktas
8:	Hiragana/Katakana voicing marks
9:	Viramas
10:	Start of fixed position classes
199:	End of fixed position classes
200:	Below left attached
202:	Below attached
204:	Below right attached
208:	Left attached (reordrant around single base character)
210:	Right attached
212:	Above left attached
214:	Above attached
216:	Above right attached
218:	Below left
220:	Below
222:	Below right
224:	Left (reordrant around single base character)
226:	Right
228:	Above left
230:	Above
232:	Above right
233:	Double below
234:	Double above
240:	Below (iota subscript)

Note: some of the combining classes in this list do not currently have members but are specified here for completeness.

Decompositions and Normalization

Decomposition is specified in Chapter 3. UAX #15: Unicode Normalization Forms [Norm] specifies the interaction between decomposition and normalization. That report specifies how the decompositions defined in UnicodeData.txt are used to derive normalized forms of Unicode text.

Note that as of the 2.1.9 update of the Unicode Character Database, the decompositions in the UnicodeData.txt file can be used to recursively derive the full decomposition in canonical order, without the need to separately apply canonical reordering. However, canonical reordering of combining character sequences must still be applied in decomposition when normalizing source text which contains any combining marks.

The QuickCheck property values are as follows:

Property	Value	Description
NF*_QC	No	Characters that cannot ever occur in the respective normalization form. See Decompositions and Normalization.
NFC_QC, NFKC_QC	Maybe	Characters that may occur in the respective normalization, depending on the context. See Decompositions and Normalization.
NF*_QC	Yes	All other characters. This is the default value, and is not listed for individual characters or ranges in the file.

For more information, see Section 14 in UAX #15: Unicode Normalization Forms [Norm].

Case Mappings

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII. For more information, see Chapter 3 in Unicode 5.0.

For compatibility with existing parsers, UnicodeData.txt only contains case mappings for characters where they are one-to-one mappings; it also omits information about context-sensitive case mappings. Information about these special cases can be found in a separate data file, SpecialCasing.txt.
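
The effect of the one-to-one restriction is easy to observe in an implementation that applies the full mappings. The sketch below assumes that Python's str.upper() uses the full (SpecialCasing-style) uppercase mappings, as CPython 3 does; the helper name uppercase_is_one_to_one is hypothetical.

```python
def uppercase_is_one_to_one(ch: str) -> bool:
    """True when the full uppercase mapping of ch is a single character."""
    return len(ch.upper()) == 1


for ch in ("a", "\u00df", "\ufb01"):
    # U+00DF LATIN SMALL LETTER SHARP S and U+FB01 LATIN SMALL LIGATURE FI
    # expand to two characters, so a single-character field cannot hold
    # their uppercase mappings.
    print(f"U+{ord(ch):04X}", repr(ch.upper()), uppercase_is_one_to_one(ch))
```
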

Unihan Tags

A large number of properties specific to Han ideographs are contained in the Unihan Database, where they are called Unihan tags. The Unihan.txt file is described in [UAX38].

Validating Property Values

Binary properties are expressed in the Unicode files with the values:

Value	Abbr	Alias	Abbr
Yes	Y	True	T
No	N	False	F
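
A parser for these values might look like the following sketch, which accepts all four spellings of each value and applies loose matching (case, whitespace, underscores, and hyphens ignored). The name parse_binary_value is hypothetical.

```python
import re

_BINARY_VALUES = {
    "yes": True, "y": True, "true": True, "t": True,
    "no": False, "n": False, "false": False, "f": False,
}


def parse_binary_value(text: str) -> bool:
    """Map any listed value or alias of a binary property to a bool."""
    key = re.sub(r"[\s_\-]+", "", text).lower()
    if key not in _BINARY_VALUES:
        raise ValueError(f"not a binary property value: {text!r}")
    return _BINARY_VALUES[key]


print(parse_binary_value("Yes"), parse_binary_value("F"))
```
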

The property values for strings and catalog values as expressed in the UCD files can be validated by using the following regular expressions. These expressions use Perl syntax, but may be translated for use with other regular expression engines. The last column lists the default values for these properties.

Regular Expressions for Property Values
Abbr	Name	Regex for Allowable Values		Defaults for Unlisted Values
age	Age	/([0-9]+\.[0-9]|unassigned)/		unassigned
nv	Numeric_Value	/-?[0-9]+\.[0-9]+/	Field 2	NaN
nv	Numeric_Value	/-?[0-9]+(\/[0-9]+)?/	Field 3	NaN
blk	Block	/[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*/		No_Block
sc	Script	/[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*/		Unknown (Zzzz)
dm	Decomposition_Mapping	/[\x{0}-\x{10FFFF}]+/		The code point itself, but # can be used to represent that in certain circumstances.
FC_NFKC	FC_NFKC_Closure	/[\x{0}-\x{10FFFF}]+/
cf	Case_Folding	/[\x{0}-\x{10FFFF}]+/
lc	Lowercase_Mapping	/[\x{0}-\x{10FFFF}]+/
tc	Titlecase_Mapping	/[\x{0}-\x{10FFFF}]+/
uc	Uppercase_Mapping	/[\x{0}-\x{10FFFF}]+/
sfc	Simple_Case_Folding	/[\x{0}-\x{10FFFF}]/
slc	Simple_Lowercase_Mapping	/[\x{0}-\x{10FFFF}]/
stc	Simple_Titlecase_Mapping	/[\x{0}-\x{10FFFF}]/
suc	Simple_Uppercase_Mapping	/[\x{0}-\x{10FFFF}]/
bmg	Bidi_Mirroring_Glyph	/[\x{0}-\x{10FFFF}]?/		""
isc	ISO_Comment	/([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\*)?/		""
na1	Unicode_1_Name	/([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*(\ \((CR|FF|LF|NEL)\))?)?/		null or empty string is the default for these property values; however, in files the following can be used: <reserved>, <control>, <private-use>, <surrogate>, <noncharacter>. The code point can also appear, in a form like <private-use-E000>. In some circumstances, such as a compact XML format, # can be used to stand for the code point to allow for name sharing.
na	Name	/([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\*)?/
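
As an example of translating these expressions for another engine, the sketch below transcribes the Age, Numeric_Value (Field 2), Block, and Script patterns into Python's re syntax and anchors them with fullmatch(). The patterns that use \x{...} ranges are omitted because Python's re module does not accept that notation. The names VALIDATORS and is_valid_value, and the key "nv_field2", are hypothetical.

```python
import re

# Transcriptions of the Perl-syntax patterns; "\ " becomes a plain space.
_NAME_LIKE = r"[a-zA-Z0-9]+([_ ][a-zA-Z0-9]+)*"
VALIDATORS = {
    "age": re.compile(r"([0-9]+\.[0-9]|unassigned)"),
    "nv_field2": re.compile(r"-?[0-9]+\.[0-9]+"),
    "blk": re.compile(_NAME_LIKE),
    "sc": re.compile(_NAME_LIKE),
}


def is_valid_value(prop: str, value: str) -> bool:
    """Check that the whole value matches the pattern for the property."""
    return VALIDATORS[prop].fullmatch(value) is not None


print(is_valid_value("age", "5.1"))
print(is_valid_value("blk", "Basic_Latin"))
print(is_valid_value("nv_field2", "one"))
```
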

References

[BIDI]	UAX #9: The Bidirectional Algorithm Latest version: http://www.unicode.org/reports/tr9/ 5.1.0 version: http://www.unicode.org/reports/tr9/tr9-18.html
[Breaks]	UAX #29: Text Boundaries Latest Version: http://www.unicode.org/reports/tr29/ 5.1.0 version: http://www.unicode.org/reports/tr29/tr29-13.html
[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues.
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[Line]	UAX #14: Line Breaking Properties Latest Version: http://www.unicode.org/reports/tr14/ 5.1.0 version: http://www.unicode.org/reports/tr14/tr14-22.html
[Norm]	UAX #15: Unicode Normalization Forms Latest Version: http://www.unicode.org/reports/tr15/ 5.1.0 version: http://www.unicode.org/reports/tr15/tr15-29.html
[Pattern]	UAX #31: Identifier and Pattern Syntax Latest Version: http://www.unicode.org/reports/tr31/ 5.1.0 version: http://www.unicode.org/reports/tr31/tr31-9.html
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[Scripts]	UAX #24: Script Names Latest Version: http://www.unicode.org/reports/tr24/ 5.1.0 version: http://www.unicode.org/reports/tr24/tr24-11.html
[U5.0]	The Unicode Standard Version 5.0 http://www.unicode.org/versions/Unicode5.0.0/
[U5.1.0]	The Unicode Standard Version 5.1.0 http://www.unicode.org/versions/Unicode5.1.0/
[UAX38]	UAX #38: Unicode Han Database (Unihan) Latest version: http://www.unicode.org/reports/tr38/ 5.1.0 version: http://www.unicode.org/reports/tr38/tr38-5.html
[UTR23]	The Unicode Character Property Model http://www.unicode.org/reports/tr23/
[Versions]	Versions of the Unicode Standard http://www.unicode.org/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.
[Width]	UAX #11: East Asian Width Latest Version: http://www.unicode.org/reports/tr11/ 5.1.0 version: http://www.unicode.org/reports/tr11/tr11-16.html

Modification History

This section provides a summary of the changes between update versions of the Unicode Standard. The modifications prior to Unicode 4.0 only listed changes in UnicodeData.txt. From 4.0 onward, the consolidated modifications include the changes in other files.

Unicode 5.1.0

This document:

Added clarification regarding the Decomposition_Mapping for Hangul syllables.
Added specific documentation about First/Last convention for ranges in UnicodeData.txt.
Improved introduction to General Category Values.
Added reference to UTR #23 and updated other references.
Added note re abbreviation of Quick Check property names.
Added notes re omissions of foldings where the value is the same as the code point itself.
Applied correction for erratum about derivation of Default_Ignorable_Code_Point.
Added the section on Validating Property Values, with string property validation, default values, and boolean values.
Removed Special_Case_Condition. (The property values were never defined clearly enough to be applied.)
Corrected typos for PropList and Composition_Exclusion.
Updated property type for Jamo_Short_Name to Miscellaneous (M).
Added clarification of property type for Canonical_Combining_Class.
Updated listing of default values for UnicodeData fields.
Moved documentation of Grapheme_Link from PropList.txt to DerivedCoreProperties.txt section.
Updated current references to Unihan.html, to refer to UAX38, instead. Removed invalid bookmarks on Unihan property tags.

Changes in specific files:

In some of the following entries, references are made to a Public Review Issue (PRI). See http://www.unicode.org/review/resolved-pri.html for more information about those cases.

Appropriate data files were updated to include the 1624 new characters added in Unicode 5.1.

UnicodeData.txt
- The 5 Arabic characters that surround numeral sequences (U+0600..U+0603, U+06DD) were changed from Bidirectional_Class=AL to AN. This has the effect of putting the surrounding sign and the numeral sequence in the same directional run, making them easier to implement correctly.
- 11 directional quotation marks (U+2018..U+201F, U+301D..U+301F) were changed to Bidi_Mirrored=N. This constituted a partial reversion of the change for Version 5.0 related to PRI #91.
- U+05BE was changed from gc=Po to gc=Pd.
- U+02EC and U+0374 were changed from gc=Sk to gc=Lm.
- U+A802 was changed from gc=Mc to gc=Mn.
- 10 compatibility ideographs were given numeric values.
Unihan.txt
- Two existing unified ideographs, U+6F06 and U+9621, were given numeric values.
- One new provisional property was added. Corrections and additions to other properties were made. See [UAX38] for the modification history.
ArabicShaping.txt
- A new joining group, BURUSHASKI YEH BARREE, was added.
BidiMirroring.txt
- Removed glyph mappings for the 11 characters that were changed to Bidi_Mirrored=N.
- Updated glyph mappings for 2278 and 2279 to [BEST FIT].
Blocks.txt
- Added 17 new block definitions.
DerivedNumericValues.txt
- A third field was added to this file, expressing the extracted numeric value as a whole integer, if possible, or as a rational fraction, e.g. 1/6.
LineBreak.txt
- There were numerous updates to linebreaking properties. See the Modification History in UAX #14 for details. Also see PRI #105.
NamedSequences.txt
- Lithuanian named sequences were approved and moved to this file from NamedSequencesProv.txt.
NamedSequencesProv.txt
- A new, complete set of named sequences for Tamil consonants and syllables was added to this file.
PropertyAliases.txt
- Added entry for Jamo_Short_Name.
- Added corrected alias for Simple_Case_Folding.
- Removed entry for Special_Case_Condition.
PropertyValueAliases.txt
- Appropriate aliases were added for new Block and Script values.
- For Block aliases, new values "ASCII", "Latin_1", and "Greek" were added for common use.
- Appropriate aliases were added for new Word_Break and Sentence_Break values.
- Explicit Y/N, T/F aliases were added for all binary properties.
- Additional aliases using underscores were added for aliases that used hyphen-minus.
- Some titlecased aliases were added for consistency.
PropList.txt
- The middle dots (U+00B7, U+0387) were added to identifiers by changing them to Other_ID_Continue=Y. See PRI #100.
- For consistency, the halfwidth Katakana sound marks (U+FF9E, U+FF9F) were added to Grapheme_Extend by making them Other_Grapheme_Extend=Y.
- The tag characters (U+E0001, U+E0020..U+E007F) were changed to Deprecated=Y.
- Other_Math values were adjusted for a number of mathematical symbols.
- U+05BE was changed to Dash=Y, consistent with the change in its General Category.
Scripts.txt
- 11 new Script values were added: Sundanese, Lepcha, Ol_Chiki, Vai, Saurashtra, Kayah_Li, Rejang, Lycian, Carian, Lydian, and Cham.
- 0374 and 0385 were changed from Greek to Common (because of canonical equivalences).
- 0CF1 and 0CF2 were changed from Kannada to Common (because of use in Vedic texts).
- Roman numeral compatibility characters, 2160..2183, were changed from Common to Latin.
- A circled Hangul character, 327E, was changed from Common to Hangul.
- Squared Katakana compatibility characters, 32D0..32FE and 3300..3357, were changed from Common to Katakana.
SpecialCasing.txt
- Clarified the use of language tags for specification of casing contexts.
StandardizedVariants.txt
- Updated documentation to note the existence of ideographic variation sequences and the Ideographic Variation Database (IVD).
GraphemeBreakProperty.txt
- Added Prepend class (for Logical_Order_Exception=Y).
- Added SpacingMark class (for most gc=Mc).
SentenceBreakProperty.txt
- Added Extend and SContinue classes.
- Split 000D and 000A off from Sep class into CR and LF classes.
- Removed 00A0 from OLetter class.
WordBreakProperty.txt
- Added CR, LF, Newline, Extend, and MidNumLet classes.
- Moved 0027 and 2019 from MidLetter class to MidNumLet class.
- Moved 002E from MidNum class to MidNumLet class.
- Added 060C and 066C to MidNum class.
Text Boundary Test Files
- The existing test files, GraphemeBreakTest.txt, SentenceBreakTest.txt, and WordBreakTest.txt were substantially extended.
- A new test file, LineBreakTest.txt, was added, with test cases for UAX #14.

Unicode 5.0.0

This document:

Added new properties.
Updated property invariants for combining classes.
Reorganized order of sections in the document for clarity.

Common file changes:

In many data files an explicit default property assignment range was added (in a machine-readable comment line), to assist implementations in assigning values for code points not otherwise listed in the data file.

Changes in specific files:

In some of the following entries, references are made to a Public Review Issue (PRI). See http://www.unicode.org/review/resolved-pri.html for more information about those cases.

Appropriate data files were updated to include the 1369 new characters added in Unicode 5.0.

Two new data files, NameAliases.txt and NamedSequencesProv.txt, were added to the UCD.

UnicodeData.txt
Note that except for the changes involving U+0294 LATIN LETTER GLOTTAL STOP, changes made to General_Category and Bidirectional_Class impacted primarily a handful of archaic letters.
- U+10341 GOTHIC LETTER NINETY was changed from gc=Lo to gc=Nl. This change also impacted a numeric field, for consistency.
- U+103D0 OLD PERSIAN WORD DIVIDER was changed from gc=So to gc=Po, and from bc=ON to bc=L.
- U+103D1..U+103D5 were changed from bc=ON to bc=L.
- U+23B4..U+23B6 were changed from various punctuation assignments to gc=So.
- U+2132 TURNED CAPITAL F was changed from gc=So to gc=Lu, and from bc=ON to bc=L.
- U+2183 ROMAN NUMERAL REVERSED ONE HUNDRED was changed from gc=Nl to gc=Lu.
- U+0294 LATIN LETTER GLOTTAL STOP was changed from gc=Ll to gc=Lo.
- Casing assignments were added for several characters for new case pairs.
- Case mappings were removed for U+0294 LATIN LETTER GLOTTAL STOP and updated for U+0241 LATIN CAPITAL LETTER GLOTTAL STOP.
- 30 characters were changed to Bidi_Mirrored=Y. These consisted of compatibility paired punctuation and some quotation marks. See PRI #80 and PRI #91.
Unihan.txt
- 4 new provisional properties were added, and extensive corrections and additions to other properties were made. See Unihan.html for the modification history.
ArabicShaping.txt
- New joining classes were added for N'Ko.
BidiMirroring.txt
- 30 entries were added, to give glyph mappings for characters changed to Bidi_Mirrored=Y. See PRI #80 and PRI #91.
Blocks.txt
- Added 9 new block definitions.
DerivedCoreProperties.txt
- The deprecated derived property, Grapheme_Link, was added to this file.
LineBreak.txt
- There were numerous updates to linebreaking properties. See the Modification History in UAX #14 for details. Also see PRI #88.
NamedSequences.txt
- 6 named sequences for Gurmukhi and one for Latin were removed.
PropertyValueAliases.txt
- Appropriate aliases were added for new Block and Script values.
PropList.txt
- The Grapheme_Link property was deprecated and moved to DerivedCoreProperties.txt as derivable. U+034F COMBINING GRAPHEME JOINER was removed from the derivation.
- U+1D6A4 MATHEMATICAL ITALIC SMALL DOTLESS I and U+1D6A5 MATHEMATICAL ITALIC SMALL DOTLESS J were added to Other_Math.
- U+1039F UGARITIC WORD DIVIDER and U+103D0 OLD PERSIAN WORD DIVIDER were added to Terminal_Punctuation.
Scripts.txt
- 5 new Script values were added: Balinese, Cuneiform, Phoenician, Phags-pa, and Nko.
- A new Script value Unknown was added and made the default for unassigned characters. See PRI #87.
- 3 Mongolian punctuation characters used by Phags-pa were changed to Script=Common.
- U+1DBF MODIFIER LETTER SMALL THETA was changed from Script=Latin to Script=Greek.
- U+2132 TURNED CAPITAL F was changed from Script=Common to Script=Latin.
StandardizedVariants.txt
- 6 standardized variation sequences were added for Phags-pa.
WordBreakProperty.txt
- U+2132 TURNED CAPITAL F was added to ALetter.
- 220 characters from the Myanmar, Khmer, Tai Le, and New Tai Lue scripts were removed from ALetter, because those scripts do not customarily use spaces between words and require special handling.

Unicode 4.1.0

This document:

Added description of new directory and release structure, including the Auxiliary files.
Removed exception for field numbering in LineBreak and EastAsianWidth.
Added new properties, and changed some of the documentation of the identifier properties.
Removed the material that is now to be in Unihan.html
Removed the listing of default BIDI properties, referring now to extracted/DerivedBidiClass.txt
Replaced direct links to UAXes with links to references section.

Common file changes:

All remaining files not corrected for Unicode 4.0.1 have had their headers updated to explicitly point to Terms of Use. The headers have also been synchronized somewhat to share a more common format for file version, date, and pointers to documentation. The major exception is UnicodeData.txt, which for legacy reasons, has no header.

Changes in specific files:

In some of the following, reference is made to a Public Review Issue (PRI). See http://www.unicode.org/review/resolved-pri.html for more information about those cases.

Appropriate data files were updated to include the 1273 new characters added in Unicode 4.1.

The description of the Unihan properties was separated out from UCD.html, and extensively revised, and now appears in Unihan.html.

An auxiliary directory has been added. In 4.1.0 it contains properties associated with UAX #29: Text Boundaries [Breaks].

UnicodeData.txt
- The Bidi_Class of U+202F was changed from bc=WS to bc=CS. See PRI #45.
- The Bidi_Class of U+FF0F was changed from bc=ES to bc=CS. See PRI #44.
- The Bidi_Class of U+2212 MINUS SIGN and 9 other characters similar to either a minus sign or a plus sign were changed to bc=ES. See PRI #57.
- U+30FB KATAKANA MIDDLE DOT and U+FF65 HALFWIDTH KATAKANA MIDDLE DOT were changed from gc=Pc to gc=Po. See PRI #55.
- Case mappings were added for Georgian capitals (Asomtavruli) to map them to the newly added Nuskhuri alphabet.
- U+A015 YI SYLLABLE WU was changed from gc=Lo to gc=Lm.
- 9 Ethiopic digits were changed from gc=Nd to gc=No.
- The Numeric_Type of U+1034A GOTHIC LETTER NINE HUNDRED was changed from nt=None to nt=Nu, and it was given a Numeric_Value of 900.
- Uppercase and titlecase mappings were added for U+019A LATIN SMALL LETTER L WITH BAR and U+0294 LATIN LETTER GLOTTAL STOP to map them to newly added capital letters.
Unihan.txt
- Extensive additions and corrections were made for this data file. See Unihan.html for the modification history.
ArabicShaping.txt
- The Joining_Group of U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE was changed to jg=Heh_Goal.
BidiMirroring.txt
- The Bidi_Mirroring_Glyph value for U+2A2D was corrected.
Blocks.txt
- Added 20 new block definitions.
LineBreak.txt
- The Line_Break property of all conjoining jamos was updated from lb=ID to make use of Hangul-specific Line_Break property values, aligned with the Hangul_Syllable_Type property.
- Many other corrections were made to the Line_Break property of characters, particularly for punctuation marks specific to Runic, Mongolian, Tibetan and various Indic scripts. For details on these changes, see UAX #14.
PropertyAliases.txt
- Properties and aliases were added for UAX #29, Text Boundaries: Grapheme_Cluster_Break, Word_Break, and Sentence_Break.
- Properties and aliases were added for: Other_ID_Continue, Pattern_White_Space, and Pattern_Syntax.
- An alias was added for White_Space: "space", for compatibility with POSIX.
PropertyValueAliases.txt
- Property value aliases were added for all new properties, and for new values added to existing catalog properties (blocks and scripts).
- Property value aliases were added for compatibility with POSIX: "cntrl", "digit", and "punct"
PropList.txt
- 3 new properties were added: Other_ID_Continue, Pattern_White_Space, and Pattern_Syntax.
- U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN was given the Dash property.
- U+A015 YI SYLLABLE WU was given the Extender property.
- Golden number runes (U+16EE..U+16F0), Roman numerals (U+2160..U+2183), and U+1034A GOTHIC LETTER NINE HUNDRED were removed from Other_Alphabetic.
- Circled Latin letters (U+24B6..U+24E9) were added to Other_Alphabetic. These changes to Other_Alphabetic were to better align Alphabetic and casing properties. The derived property Alphabetic is now a superset of the derived properties Lowercase and Uppercase, for compatibility with POSIX-style character classes.
- 3 musical symbol combining flags (U+1D170..U+1D172) were added to Other_Grapheme_Extend to fix an inconsistency in the data.
- U+200B ZERO WIDTH SPACE was removed from Other_Default_Ignorable_Code_Point.
Scripts.txt
- 8 new Script values were added: Buginese, Coptic, New_Tai_Lue, Glagolitic, Tifinagh, Syloti_Nagri, Old_Persian, and Kharoshthi.
- The Script value Katakana_Or_Hiragana (Hrkt) was removed.
- The Script for the 14 Coptic letters in the Greek and Coptic block was updated to sc=Copt.
- 10 characters (punctuation and extenders) shared by Katakana and Hiragana were changed from sc=Hrkt to sc=Zyyy.
SpecialCasing.txt
- The case mapping contexts defined in this file were updated.
- A number of clarifying changes were made to comments in the header of this data file.

Unicode 4.0.1

This document:

Added two new properties
Added the property types Catalog and Miscellaneous
Described loose matching of property names and values
Added to file format

Common file changes:

Some property values have different casing (upper vs. lower) for consistency between the data files and the PropertyValueAliases.txt file. There are some additional changes in comments:

Nearly all files changed headers to explicitly point to Terms of Use
Names for code points without names now have a more uniform style, such as <reserved-1234>
Where characters with a default value are not listed, that information is indicated in the total code point counts
The full property name and property value name (for enumerated properties) is usually supplied in a comment

Changes in specific files:

In some of the following, reference is made to a Public Review Issue (PRI). See http://www.unicode.org/review/resolved-pri.html for more information about those cases.

UnicodeData.txt
- Changed general category of Zero Width Space (U+200B) from Zs to Cf. For background information, see PRI #21.
- Bidi Conformance was made much clearer and more rigorous, also resulting in a number of property changes:
  - Several Bidi fixes impact number and date formatting with the following characters: +, -, /
  - Braille symbols were changed to being strong Left-to-right, to reflect usage.
  - A review of BN and Default Ignorable code points resulted in a number of changes: for details, see PRI #28.
  - Some other bidi tweaks were made for consistency.
- While the properties of the Join_Controls have not changed, their role in combining characters sequences has. For more information, see http://www.unicode.org/versions/Unicode4.0.1/.
- Removed an extraneous space at the end of the name field for two characters.
Unihan.txt
- A major update of the Unihan data file, to bring it up-to-date for Unicode 4.0. (It was not released in Version 4.0.0, because of the time required to complete and check corrections to the data file.) This update rolls in fixes for nearly all known errors in the prior version of the file and adds a very large amount of other informative data. For details, see the header of that file.
- Added three new tags: kHanyuPinlu, kGSR, and kIRG_USource.
- Completed data for kCihaiT, kCowles, kGradeLevel, and kLau
- The kMandarin field has been corrected and its order restored to a "frequency" order
ArabicShaping.txt
- Moved one entry into code point order.
Blocks.txt
- Corrected name of the Cyrillic Supplement block.
DerivedCoreProperties.txt
- ZWNJ/ZWJ (U+200C..U+200D) now have the Grapheme_Extend property.
DerivedNormalizationProps.txt
- While not actually changing the particular values associated with the Quick Check properties for characters, a revision was made in how the Quick Check properties are expressed in the file, to bring it more into line with the model for other properties. This resulted in a significant change in the format of the data file and the explicit separation of Yes, No, and Maybe values. In addition, the actual aliases for the property changed in the data file.
Index.txt
- Updated to correspond to the character index published as part of the Unicode Standard, Version 4.0.
LineBreak.txt
- Many changes for consistency and to better match best practice in existing line break implementations; for details, see UAX #14: Line Breaking Properties.
PropertyAliases.txt
- Addition of some property categories, with the order of property aliases adjusted for clarity.
- Addition of alias entries for the new STerm and Variation_Selector properties.
PropertyValueAliases.txt
- Addition of specific values and aliases for age.
- Addition of second alias for the Cyrillic Supplement block.
- Addition of second alias for the Inseparable value of the Line Break property.
- Revision of all the Normalization Quick Check properties, to replace the pseudo-property "qc" with actual specific properties with explicit enumerated value aliases.
- Addition of Katakana_Or_Hiragana script alias.
- Fixed None (so it is used uniformly in first aliases instead of being the only n/a)
PropList.txt
- Major revision of the Other_Math property to align the derived Math property with the explanation given in UTR #25.
- Extension of the list of characters with the Soft_Dotted property.
- Significant update of the list of characters with the Terminal_Punctuation property.
- Addition of a new STerm property, to simplify the description used in UAX #29.
- Addition of the Variation_Selector property.
- Reassignment of the list of characters with the Other_Default_Ignorable_Code_Point property, to enable simpler derivation.
- Addition of ZWNJ/ZWJ (U+200C..U+200D) to Other_Grapheme_Extend.
Scripts.txt
- Significant revision of script assignments, to assign specific script values to many characters that previously had the Common script value.
- Addition of the Katakana_Or_Hiragana script value, with list of characters for it.
- The Common values are now listed, for comparison.
SpecialCasing.txt
- Correction of typo in comments.

Unicode 4.0

UnicodeData.txt
- Decimal Digits
  - Numeric_Type=decimal digit now aligned with General_Category=Nd
- Modifier letters
  - The general category of U+02B9..U+02BA and U+02C6..U+02CF changed to Lm.
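The alignment between Numeric_Type=decimal and General_Category=Nd noted above can be spot-checked with a short scan. This is a sketch only, assuming Python's standard unicodedata module, and assuming that a non-None result from unicodedata.decimal corresponds to Numeric_Type=decimal; the function name decimal_mismatches is hypothetical.

```python
import sys
import unicodedata

def decimal_mismatches():
    """List code points where 'is Nd' and 'has a decimal value' disagree."""
    bad = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        is_nd = unicodedata.category(ch) == "Nd"
        has_decimal = unicodedata.decimal(ch, None) is not None
        if is_nd != has_decimal:
            bad.append(cp)
    return bad

# Number of disagreements found in the data bundled with this interpreter
print(len(decimal_mismatches()))

# U+0663 ARABIC-INDIC DIGIT THREE is a decimal digit (Nd);
# U+00B2 SUPERSCRIPT TWO has a digit value but is not a decimal digit (No).
for ch in ("\u0663", "\u00B2"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.decimal(ch, None))
```

On a runtime whose data follows the alignment described above, the scan would be expected to report no disagreements, although that depends on the bundled UCD version.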
Other Files
- New Properties and Values
  - Hangul_Syllable_Type, Unicode_Radical_Stroke
  - CJK numeric values added.
  - PropertyValueAliases adds block names
  - UCD fallback props more precisely defined, for code points not explicitly in data files
  - Added script value for Braille
  - New line breaking properties: NL, WJ
- Khmer
  - Two Khmer characters are deprecated; four others strongly discouraged.
- Special Casing
  - Fixed for Turkish, Lithuanian
- Default Ignorables
  - Hangul Filler characters
  - Soft-Hyphen, CGJ, ZWS
  - Arabic End of Ayah and Syriac Abbreviation Mark are no longer Default Ignorable (their shaping classes are also fixed).
- Grapheme_Extend
  - Removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)
- Stabilized Properties
  - The Hyphen property is now stabilized.

Unicode 3.2

Modifications made for Version 3.2.0 of UnicodeData.txt include:

Addition of 1016 new entries, to cover new characters encoded in Unicode 3.2.

Updated ISO 6429 names for control functions to match the currently published version of that standard.

Changed general category for Mongolian free variation selectors (U+180B..U+180D) from Cf to Mn.

Changed general category for U+0B83 TAMIL SIGN VISARGA (aytham) from Mc to Lo.

Changed general category for U+06DD ARABIC END OF AYAH from Me to Cf.

Changed general category for U+17D7 KHMER SIGN LEK TOO from Po to Lm.

Changed general category for U+17DC KHMER SIGN AVAKRAHASANYA from Po to Lo.

Changed canonical decomposition for U+F951 from 96FB to 964B (see Corrigendum #3: U+F951 Normalization).
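The Version 3.2.0 changes listed above can be compared against the data available at run time. The sketch below assumes Python's standard unicodedata module, which reports values from the UCD version bundled with the interpreter (later than 3.2.0 in practice); the table EXPECTED_CATEGORIES and the function check_categories are hypothetical names introduced for illustration.

```python
import unicodedata

# Code point -> general category stated in the Version 3.2.0 notes above
EXPECTED_CATEGORIES = {
    0x180B: "Mn",  # Mongolian free variation selectors
    0x180C: "Mn",
    0x180D: "Mn",
    0x0B83: "Lo",  # TAMIL SIGN VISARGA
    0x06DD: "Cf",  # ARABIC END OF AYAH
    0x17D7: "Lm",  # KHMER SIGN LEK TOO
    0x17DC: "Lo",  # KHMER SIGN AVAKRAHASANYA
}

def check_categories(expected):
    """Return {code point: actual category} for entries that differ from expected."""
    differences = {}
    for cp, want in expected.items():
        actual = unicodedata.category(chr(cp))
        if actual != want:
            differences[cp] = actual
    return differences

print(check_categories(EXPECTED_CATEGORIES))

# Canonical decomposition of U+F951 after Corrigendum #3
print(unicodedata.decomposition("\uF951"))
```

An empty dictionary indicates that the bundled data still agrees with the categories listed for Version 3.2.0.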

Unicode 3.1.1

Modifications made for Version 3.1.1 of UnicodeData.txt include:

Modification of ISO 10646 annotation regarding Greek tonos, affecting entries for U+0301 and U+030D.

Unicode 3.1

Modifications made for Version 3.1.0 of UnicodeData.txt include:

Addition of 2237 new entries, to cover new characters and new ranges of unified Han characters encoded in Unicode 3.1.
Changed General Category value of U+16EE..U+16F0 (Runic golden numbers) from No to Nl.

Unicode 3.0.1

Modifications made for Version 3.0.1 of UnicodeData.txt include:

Added 5- and 6-digit representation of code points past U+FFFF.
Added Private Use range definitions for Planes 15 and 16.
Minor additions for the 10646 comment field.
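As an illustration of the 5- and 6-digit representation, the sketch below formats and parses code point fields in Python. It assumes the convention of uppercase hexadecimal with a minimum of four digits; the function names are hypothetical, and the Plane 15 and 16 private use boundaries (F0000..FFFFD and 100000..10FFFD) come from the standard's definition of those areas rather than from this section.

```python
import unicodedata

def format_code_point(cp):
    """Format a code point as uppercase hex, zero-padded to at least 4 digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    return f"{cp:04X}"

def parse_code_point(field):
    """Parse a 4- to 6-digit hexadecimal code point field."""
    cp = int(field, 16)
    if cp > 0x10FFFF:
        raise ValueError(f"out of range: {field!r}")
    return cp

# BMP code points keep 4 digits; supplementary code points use 5 or 6
for cp in (0x0041, 0xFFFF, 0x10000, 0xF0000, 0x10FFFD):
    print(format_code_point(cp), unicodedata.category(chr(cp)))
```

Padding only to four digits, instead of always to six, keeps the fields for Basic Multilingual Plane characters unchanged from earlier versions of the file.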

Unicode 3.0.0

Modifications made for Version 3.0.0 of UnicodeData.txt include many new characters and a number of property changes. These are summarized in Appendix D of The Unicode Standard, Version 3.0.

Unicode 2.1.9

Modifications made for Version 2.1.9 of UnicodeData.txt include:

Corrected combining class for U+05AE HEBREW ACCENT ZINOR.
Corrected combining class for U+20E1 COMBINING LEFT RIGHT ARROW ABOVE.
Corrected combining class for U+0F35 and U+0F37 to 220.
Corrected combining class for U+0F71 to 129.
Added a decomposition for U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR.
Added decompositions for several Greek symbol letters: U+03D0..U+03D2, U+03D5, U+03D6, U+03F0..U+03F2.
Removed decompositions from the conjoining jamo block: U+1100..U+11F8.
Changes to decomposition mappings for some Tibetan vowels for consistency in normalization. (U+0F71, U+0F73, U+0F77, U+0F79, U+0F81)
Updated the decomposition mappings for several Vietnamese characters with two diacritics (U+1EAC, U+1EAD, U+1EB6, U+1EB7, U+1EC6, U+1EC7, U+1ED8, U+1ED9), so that the recursive decomposition can be generated directly in canonically reordered form (not a normative change).
Updated the decomposition mappings for several Arabic compatibility characters involving shadda (U+FC5E..U+FC62, U+FCF2..U+FCF4), and two Latin characters (U+1E1C, U+1E1D), so that the decompositions are generated directly in canonically reordered form (not a normative change).
Changed BIDI category for: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+2028 LINE SEPARATOR.
Changed BIDI category for extenders of General Category Lm: U+3005, U+3021..U+3035, U+FF9E, U+FF9F.
Changed General Category and BIDI category for the Greek numeral signs: U+0374, U+0375.
Corrected General Category for U+FFE8 HALFWIDTH FORMS LIGHT VERTICAL.
Added Unicode 1.0 names for many Tibetan characters (informative).