Unicode 4.0.1

4.0.0 Front Matter
	Title and Copyright
	Acknowledgements
	Table of Contents
	Unicode 4.0 Web Bookmarks
	List of Figures
	List of Tables
	Preface
4.0.0 Chapters
1	Introduction
2	General Structure
3	Conformance
4	Character Properties
5	Implementation Guidelines
6	Writing Systems and Punctuation
7	European Alphabetic Scripts
8	Middle Eastern Scripts
9	South Asian Scripts
10	Southeast Asian Scripts
11	East Asian Scripts
12	Additional Modern Scripts
13	Archaic Scripts
14	Symbols
15	Special Areas and Format Characters
16	Code Charts Introductory Text
Code Charts
•	Code Charts (Latest)
•	Delta Code Charts (additions to 4.0.0 highlighted)
•	Archival Code Charts (4.0.0, 33 MB)
Han Radical-Stroke Index
17	Han Radical-Stroke Index (Introductory Text)
•	Interactive Han Radical-Stroke Index (Latest)
4.0.0 Appendices and Back Matter
A	Han Unification History
B	Abstracts of Unicode Technical Reports
C	Relationship to ISO/IEC 10646
D	Changes from Unicode Version 3.0
G	Glossary
R	References
I	I.1 Unicode Names Index I.2 General Index
4.0.1 Unicode Standard Annexes
UAX #9: The Bidirectional Algorithm
UAX #11: East Asian Width
UAX #14: Line Breaking Properties
UAX #15: Unicode Normalization Forms
UAX #24: Script Names
UAX #29: Text Boundaries
4.0.1 UCD
4.0.1 (files) (about)
For unchanged files see: Components

Version 4.0.1 has been superseded by the latest version of the Unicode Standard.

Version 4.0.1 of the Unicode Standard consists of the core specification, The Unicode Standard, Version 4.0, the additional specifications on this page, the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD).

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Version 4.0.1 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 4.0.1, defined by: The Unicode Standard, Version 4.0 (Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1), as amended by Unicode 4.0.1 (http://www.unicode.org/versions/Unicode4.0.1/).

A complete specification of the contributory files for Unicode 4.0.1 is found on the page Components for Version 4.0.1.

Online Edition

The text of The Unicode Standard, Version 4.0, as well as the delta and archival code charts, is available online via the navigation links on this page. These files may not be printed. The Unicode 4.0 Web Bookmarks page has links to all sections of the online text.

Overview

Unicode 4.0.1 is an update version of the Unicode Standard. It adds no new characters.

The main new features in Unicode 4.0.1 are the following:

The first significant update of the Unihan Database (Unihan.txt) since Unicode 3.2.0, including a large number of fixes and additional data items.

Significant clarifications in four definitions used in conformance.

Unicode Character Database:

New character properties: STerm and Variation_Selector

Updated significantly: Terminal_Punctuation, Math, Script, and Line_Break

Changed: general category of U+200B ZERO WIDTH SPACE

Changed: bidi class of some characters including: +, -, / and FRACTION SLASH

Added: property value aliases

Revised: formats in some of the data files

Changes in the recommended loose comparison of character name values. See Property and Property Value Matching.

Clearer definition of the encoding of Bengali Reph and Ya-phalaa

Changes to Definitions D13, D14, and D17

Unicode 4.0, Chapter 3, section 6 [page 70] contains the following definitions:

D13 Base Character: A character that does not graphically combine with preceding characters, and that is neither a control nor a format character.

D14 Combining character: A character that graphically combines with a preceding base character. The combining character is said to apply to that base character.

D17 Combining character sequence: A character sequence consisting of either a base character followed by a sequence of one or more combining characters, or a sequence of one or more combining characters.

These definitions are modified as follows in Unicode 4.0.1 for greater clarity and to allow U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH NON-JOINER to be used in combining character sequences. (Definition D13 has been split into two parts, D13a and D13b. The bullet items, not formally parts of the definitions, are also modified for clarity. See the above-cited reference for details.)

D13a Graphic character: A character with the General Categories of Letter (L), Combining Mark (M), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).

•	Graphic characters specifically exclude the line and paragraph separators (Zl, Zp) and exclude the characters with the General Categories of Other (Cn, Cs, Cc, Cf).
•	The interpretation of private use characters (Co) as graphic characters or not is determined by the implementation.
•	For more information, see Chapter 2, especially Section 2.4 Code Points and Characters and Table 2-2 Types of Code Points.

D13b Base character: Any graphic character except for those with the General Category of Combining Mark (M).

•	Most Unicode characters are base characters. In terms of General Category values, a base character is any code point that has one of the categories: Letter (L), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).
•	Base characters do not include control characters or format controls.
•	Base characters are independent graphic characters, but this does not preclude the presentation of base characters from adopting different contextual forms or participating in ligatures.
•	The interpretation of private use characters (Co) as base characters or not is determined by the implementation. However, the default interpretation of private use characters should be as base characters, in the absence of other information.

D14 Combining character: A character with the General Category of Combining Mark (M).

•	Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Non-Spacing Mark (Mn), and Enclosing Mark (Me).
•	All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
•	The interpretation of private use characters (Co) as combining characters or not is determined by the implementation.
•	These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
•	The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor one of ZERO WIDTH JOINER or ZERO WIDTH NON-JOINER. The combining character is said to apply to that base character.
•	There may be no such base character, for example when a combining character is at the start of text or follows a control or format character, such as a carriage return, tab, or RIGHT-TO-LEFT MARK. In such cases, the combining characters are called isolated combining characters.
•	With isolated combining characters, or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
•	The representative images of combining characters are depicted with a dotted circle in the code charts; when presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.
•	Combining characters generally take on the properties of their base character, while retaining their combining property.
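
The General Category tests in D13a, D13b, and D14 can be sketched in Python with the standard unicodedata module. This is only an illustration under stated assumptions, not part of the standard: the function names and the private_use_default parameter are hypothetical, private use characters (Co) are treated as non-combining, and unicodedata reflects the Unicode version bundled with the Python interpreter rather than Unicode 4.0.1, so results for individual characters may differ from the 4.0.1 data files.

```python
import unicodedata


def is_graphic(ch: str, private_use_default: bool = True) -> bool:
    """D13a: General Category L, M, N, P, S, or Zs.

    Whether private use characters (Co) count as graphic is left to the
    implementation; private_use_default is a hypothetical knob for that choice.
    """
    cat = unicodedata.category(ch)
    if cat == "Co":
        return private_use_default
    return cat[0] in "LMNPS" or cat == "Zs"


def is_combining(ch: str) -> bool:
    """D14: General Category Mn, Mc, or Me (Co is assumed non-combining here)."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def is_base(ch: str, private_use_default: bool = True) -> bool:
    """D13b: any graphic character that is not a Combining Mark."""
    return is_graphic(ch, private_use_default) and not is_combining(ch)


if __name__ == "__main__":
    # Letter, combining acute, Devanagari visarga, ZWJ, line feed, private use
    for ch in ["a", "\u0301", "\u0903", "\u200D", "\n", "\uE000"]:
        print(
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            "base" if is_base(ch) else "-",
            "combining" if is_combining(ch) else "-",
            "ccc =", unicodedata.combining(ch),
        )
```

U+0903 DEVANAGARI SIGN VISARGA illustrates the second bullet item under D14: it is a combining character (Mc) whose canonical combining class is zero, whereas U+0301 COMBINING ACUTE ACCENT has a non-zero class.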

D17 Combining character sequence: A maximal character sequence consisting of either a base character followed by a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER; or a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER.

•	When identifying a combining character sequence in Unicode text, the definition of the combining character sequence is applied maximally. Thus, for example, in the sequence <c, dot-below, caron, acute, a>, the entire sequence <c, dot-below, caron, acute> is identified as the combining character sequence, rather than the alternative of identifying <c, dot-below> as a combining character sequence followed by a separate (defective) combining character sequence <caron, acute>.

(The changes to D14 and D17 do not imply that any particular sequence is automatically meaningful or interoperable; sequences must still be documented and used in conventional ways to convey specific meanings.)
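
The maximal matching described under D17 can be sketched as a simple left-to-right scan. The following Python is a hedged illustration, not a normative algorithm: the names segment, is_base, and is_extender are hypothetical, private use characters are assumed to be base characters (the default interpretation given under D13b), and the classification relies on the Unicode data bundled with the Python interpreter rather than on the 4.0.1 files. Segments consisting of a single character with nothing attached are returned as well, although D17 does not call those combining character sequences.

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def is_extender(ch: str) -> bool:
    """A combining character (General Category M), ZWJ, or ZWNJ."""
    return ch in (ZWJ, ZWNJ) or unicodedata.category(ch)[0] == "M"


def is_base(ch: str) -> bool:
    """Base character per D13b; Co is assumed to be a base character."""
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat in ("Zs", "Co")


def segment(text: str) -> list[str]:
    """Split text so that each base character keeps all following extenders.

    Extenders that do not follow a base character (for example at the start
    of text or after a control character) are grouped into their own
    sequence, which has no base character.
    """
    segments: list[str] = []
    current = ""
    for ch in text:
        # An extender may join the current segment only if that segment
        # starts with a base character or is itself a run of extenders.
        can_attach = bool(current) and (
            is_base(current[0]) or is_extender(current[0])
        )
        if is_extender(ch) and can_attach:
            current += ch
        else:
            if current:
                segments.append(current)
            current = ch
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    # <c, dot-below, caron, acute, a> from the example under D17
    for seg in segment("c\u0323\u030C\u0301a"):
        print(" ".join(f"U+{ord(ch):04X}" for ch in seg))
```

Because the scan never closes a segment while extenders keep arriving, <c, dot-below, caron, acute> stays together as one sequence, matching the maximal interpretation in the example.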

Change to Definition D9b

Unicode 4.0.1 explicitly acknowledges that provisional properties are not maintained. Unicode 4.0 contains this definition in Chapter 3, section 5 [page 67]:

D9b Provisional property: A Unicode character property whose values are unapproved and tentative, and which may be incomplete or otherwise not in a usable state.

This has been modified by addition of a bullet item, as follows:

D9b Provisional property: A Unicode character property whose values are unapproved and tentative, and which may be incomplete or otherwise not in a usable state.

•	Provisional properties may be removed from future versions of the standard without prior notice.

Clarification of Bengali Reph and Ya-phalaa

The formation of the Reph form is defined in the Unicode 4.0 book, Section 9.1, Rules for Rendering, R2. Basically, the Reph is formed when a Ra whose inherent vowel has been killed by the virama/halant begins a syllable. This is shown in the following example.

The Ya-phalaa is a post-base form of Ya and is formed when the Ya is the final consonant of a syllable cluster. In this case, the previous consonant retains its base shape and the virama/halant is combined with the following Ya. This is shown in the following example.

An ambiguous situation arises with the combination Ra + virama/halant + Ya, which could be read either as a Reph applied to the Ya or as a Ra followed by a Ya-phalaa.

To resolve this ambiguity and to obtain consistent behavior, the processing order of the Bengali script is taken into account. When parsing the text, the ability to form the Reph is identified first, and therefore the Reph form should have priority in processing. Thus, to obtain the Ya-phalaa instead, it is necessary to insert a U+200C ZERO WIDTH NON-JOINER character into the stream between the Ra and the virama/halant, so that the virama/halant and Ya are grouped together during processing.

In the example above, the ZWNJ is used because two characters that would join by default are intended to remain as separate entities. When the Ra is not the first character in the cluster, the ZWNJ is not required for the formation of the Ya-phalaa. However, to make it easy to enter the Ya-phalaa with a single key, it should be permissible for the Ya-phalaa to be consistently formed by “ZWNJ + VIRAMA + YA” (U+200C + U+09CD + U+09AF).
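
The two encodings discussed in this section can be written out as code point sequences. The following Python sketch only spells out the sequences described in the text above; the variable and function names are hypothetical, and how a given sequence is actually displayed depends on the font and rendering engine. Because Version 4.0.1 has been superseded, later versions of the standard should be consulted for current recommendations.

```python
import unicodedata

RA = "\u09B0"      # BENGALI LETTER RA
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA (halant)
YA = "\u09AF"      # BENGALI LETTER YA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

# Without ZWNJ the Reph has priority: Ra + virama/halant forms a Reph on Ya.
reph_on_ya = RA + VIRAMA + YA

# ZWNJ between Ra and virama/halant keeps the Ra separate, so that the
# virama/halant and Ya are grouped together as a Ya-phalaa.
ra_with_ya_phalaa = RA + ZWNJ + VIRAMA + YA

# The sequence a single Ya-phalaa key could emit after any consonant.
YA_PHALAA_KEY = ZWNJ + VIRAMA + YA


def describe(seq: str) -> list[str]:
    """List each code point in seq with its character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]


if __name__ == "__main__":
    for label, seq in [("Reph on Ya", reph_on_ya),
                       ("Ra + Ya-phalaa", ra_with_ya_phalaa)]:
        print(label)
        for line in describe(seq):
            print("  " + line)
```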

Unicode Character Database

The updated Unicode Character Database files for this version are available in the 4.0.1 Update directory. For the unchanged files, see the Components for Version 4.0.1. For more detailed information about the changes in the Unicode Character Database, see the file UCD.html in the Unicode Character Database.