Unicode 4.0.1
Version 4.0.1 has been superseded by the
latest version
of the Unicode Standard.
Version 4.0.1 of the Unicode Standard consists of the core
specification, The Unicode Standard,
Version 4.0, the additional specifications on this page,
the delta and archival code charts for this version, the Unicode Standard Annexes,
and the Unicode Character Database (UCD).
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
Version 4.0.1 of the Unicode Standard should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 4.0.1,
defined by: The Unicode Standard, Version 4.0 (Reading, MA,
Addison-Wesley, 2003. ISBN 0-321-18578-1), as amended by
Unicode 4.0.1 (http://www.unicode.org/versions/Unicode4.0.1/).
A complete specification of the contributory files for Unicode
4.0.1 is found on the page
Components for Version 4.0.1.
Online Edition
The text of The Unicode Standard, Version 4.0, as well as the
delta and archival code charts, is available online via the navigation links
on this page. These files may not be printed. The
Unicode 4.0
Web Bookmarks page has links to all sections of the online text.
Overview
Unicode 4.0.1 is an
update version
of the Unicode Standard. It adds no new characters.
The main new features in Unicode 4.0.1 are the following:
- The first significant update of the Unihan Database (Unihan.txt)
since Unicode 3.2.0, including a large number of fixes and
additional data items.
- Significant clarifications in four definitions used in conformance.
- Unicode Character Database:
  - New character properties: STerm and Variation_Selector
  - Updated significantly: Terminal_Punctuation, Math, Script, and Line_Break
  - Changed: general category of U+200B ZERO WIDTH SPACE
  - Changed: bidi class of some characters including: +, -, / and FRACTION SLASH
  - Added: property value aliases
  - Revised: formats in some of the data files
- Changes in the recommended loose comparison of character name values. See Property and Property Value Matching.
- Clearer definition of the encoding of Bengali Reph and Ya-phalaa
Changes to Definitions D13, D14, and D17
Unicode 4.0, Chapter 3, section 6 [page 70] contains the following definitions:
D13 Base Character:
A character that does not graphically combine with preceding characters,
and that is neither a control nor a format character.
D14 Combining character:
A character that graphically combines with a preceding base character. The combining character is said to apply to that base character.
D17 Combining character sequence:
A character sequence consisting of either a base character followed by a
sequence of one or more combining characters, or a sequence of one or more
combining characters.
These definitions are modified as follows in Unicode 4.0.1 for greater
clarity and to allow U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH NON-JOINER
to be used in combining character sequences. (Definition D13 has been split into two
parts, D13a and D13b. The bullet items, not formally
parts of the definitions, are also modified for clarity. See the above-cited reference for
details.)
D13a Graphic character: A character with the General Categories of
Letter (L), Combining Mark (M), Number (N), Punctuation (P), Symbol (S), or
Space Separator (Zs).
- Graphic characters specifically exclude the line and paragraph separators
(Zl, Zp) and exclude the characters with the General Categories of Other
(Cn, Cs, Cc, Cf).
- The interpretation of private use characters (Co) as graphic characters or
not is determined by the implementation.
- For more information, see Chapter 2, especially Section 2.4 Code Points
and Characters and Table 2-2 Types of Code Points.
D13b Base character: Any graphic character except for those with the
General Category of Combining Mark (M).
- Most Unicode characters are base characters. In terms of General Category
values, a base character is any code point that has one of the categories:
Letter (L), Number (N), Punctuation (P), Symbol (S), or Space Separator
(Zs).
- Base characters do not include control characters or
format controls.
- Base characters are independent graphic characters, but this does not
preclude the presentation of base characters from adopting different
contextual forms or participating in ligatures.
- The interpretation of private use characters (Co) as base characters or
not is determined by the implementation. However, the default interpretation
of private use characters should be as base characters, in the absence of
other information.
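Definitions D13a and D13b above can be read as simple tests on the General Category. The following is a minimal sketch, not part of the standard: the function names `is_graphic` and `is_base` and the `private_use_*` parameters are hypothetical, and Python's `unicodedata` module reflects the Unicode version bundled with the interpreter rather than 4.0.1, so individual characters may be classified differently than in this version.

```python
import unicodedata

# Major General Category classes that D13a treats as graphic: L, M, N, P, S.
# Space Separator (Zs) is handled separately because Zl and Zp are excluded.
GRAPHIC_MAJOR_CLASSES = {"L", "M", "N", "P", "S"}


def is_graphic(ch: str, private_use_is_graphic: bool = True) -> bool:
    """D13a sketch: graphic characters by General Category.

    Private use characters (Co) are left to the implementation, so the
    answer for them is taken from the (hypothetical) flag.
    """
    cat = unicodedata.category(ch)
    if cat == "Co":
        return private_use_is_graphic
    return cat[0] in GRAPHIC_MAJOR_CLASSES or cat == "Zs"


def is_base(ch: str, private_use_is_base: bool = True) -> bool:
    """D13b sketch: any graphic character that is not a Combining Mark (M).

    The default for private use characters is "base", following the
    default interpretation suggested in the bullet items.
    """
    cat = unicodedata.category(ch)
    if cat == "Co":
        return private_use_is_base
    return is_graphic(ch) and cat[0] != "M"
```

For example, `is_base("a")` is true, while `is_base("\u0301")` (COMBINING ACUTE ACCENT) is false even though the character is graphic.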
D14 Combining character: A character with the General Category of Combining Mark (M).
- Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Non-Spacing Mark (Mn), and Enclosing Mark (Me).
- All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
- The interpretation of Private Use characters (Co) as combining characters or not is determined by the implementation.
- These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
- The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor one of ZERO WIDTH JOINER or ZERO WIDTH NON-JOINER. The combining character is said to apply to that base character.
- There may be no such base character, such as when a combining character is at the start of text or follows a control or format character, such as a carriage return, tab, or RIGHT-TO-LEFT MARK. In such cases, the combining characters are called isolated combining characters.
- With isolated combining characters, or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
- The representative images of combining characters are depicted with a dotted circle in the code charts; when presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.
- Combining characters generally take on the properties of their base character, while retaining their combining property.
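The relationship between General Category and canonical combining class described in D14 can be checked with a short script. This is a sketch under stated assumptions: `is_combining` is a hypothetical helper, the sample characters are chosen only for illustration, and the data comes from the Unicode version shipped with Python, not from the 4.0.1 files.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    """D14 sketch: a combining character has General Category Mn, Mc, or Me.

    Private use characters (Co) are reported as non-combining here; D14
    leaves their interpretation to the implementation.
    """
    return unicodedata.category(ch).startswith("M")


# One sample per mark category (illustrative choices):
#   U+0301 COMBINING ACUTE ACCENT
#   U+0903 DEVANAGARI SIGN VISARGA
#   U+20DD COMBINING ENCLOSING CIRCLE
samples = ["\u0301", "\u0903", "\u20DD"]

for ch in samples:
    category = unicodedata.category(ch)
    ccc = unicodedata.combining(ch)  # canonical combining class
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: gc={category}, ccc={ccc}")
```

The second and third samples are combining characters whose canonical combining class is zero, which is why a test based on a non-zero combining class alone would miss them.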
D17 Combining character sequence: A maximal character sequence consisting of either a base character followed by a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER; or a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER.
- When identifying a combining character sequence in Unicode text, the definition of the combining character sequence is applied maximally. Thus, for example, in the sequence <c, dot-below, caron, acute, a>, the entire sequence <c, dot-below, caron, acute> is identified as the combining character sequence, rather than the alternative of identifying <c, dot-below> as a combining character sequence followed by a separate (defective) combining character sequence <caron, acute>.
(The changes to D14 and D17 do not imply that any particular sequence is
automatically meaningful or interoperable; sequences must still be documented and used in
conventional ways to convey specific meanings.)
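The maximal matching in D17 above can be sketched as a single left-to-right scan. The helper names (`is_extender`, `is_base`, `combining_character_sequences`) are hypothetical, private use characters are treated as base characters by default, and character categories come from the Unicode version bundled with Python rather than from 4.0.1.

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def is_extender(ch: str) -> bool:
    """True for combining marks (General Category M*), ZWJ, and ZWNJ."""
    return ch in (ZWJ, ZWNJ) or unicodedata.category(ch).startswith("M")


def is_base(ch: str) -> bool:
    """Base characters per D13b; private use (Co) is assumed to be base."""
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat in ("Zs", "Co")


def combining_character_sequences(text: str):
    """Return (sequence, is_defective) pairs, applying D17 maximally.

    A lone base character with nothing following it is not reported,
    because D17 requires at least one combining character, ZWJ, or ZWNJ.
    """
    result = []
    i, n = 0, len(text)
    while i < n:
        start = i
        has_base = is_base(text[i])
        if has_base:
            i += 1
        # Consume the longest possible run of combining marks, ZWJ, and ZWNJ.
        j = i
        while j < n and is_extender(text[j]):
            j += 1
        if j > i:
            result.append((text[start:j], not has_base))
            i = j
        elif not has_base:
            i += 1  # control, format, or other character: skip it
    return result


# <c, dot-below, caron, acute, a> from the example in D17.
text = "c\u0323\u030C\u0301a"
sequences = combining_character_sequences(text)
```

Here `sequences` contains one entry, the whole run `<c, dot-below, caron, acute>`, flagged as not defective; the trailing `a` stands alone and is not reported.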
Change to Definition D9b
Unicode 4.0.1 explicitly acknowledges that provisional properties
are not maintained. Unicode 4.0 contains this definition in
Chapter 3,
section 5 [page 67]:
D9b Provisional property: A Unicode character property whose
values are unapproved and tentative, and which may be incomplete or
otherwise not in a usable state.
This has been modified by addition of a bullet item, as follows:
D9b Provisional property: A Unicode character property whose
values are unapproved and tentative, and which may be incomplete or
otherwise not in a usable state.
- Provisional properties may be removed from future versions of the standard, without prior notice.
Clarification of Bengali Reph and Ya-phalaa
The formation of the Reph form is defined in the Unicode 4.0 Book, Section 9.1, Rules for Rendering, R2. In short, the Reph is formed when a Ra whose inherent vowel has been killed by the virama/halant begins a syllable. This is shown in the following example.
The Ya-phalaa is a post-base form of Ya and is formed when the Ya is the
final consonant of a syllable cluster. In this case, the previous
consonant retains its base shape and the virama/halant is combined with the
following Ya. This is shown in the following example.
An ambiguous situation arises with the combination Ra + virama/halant + Ya,
because it could form either a Reph or a Ya-phalaa.
To resolve the
ambiguity with this combination and to have consistent behavior, the processing order of the Bengali script
is taken into account. When parsing the
text, the ability to form the Reph is identified first and therefore the
Reph form should have priority in processing. Thus, it is necessary to
insert a U+200C ZERO WIDTH NON-JOINER character into the stream between the Ra and virama/halant
to allow the virama/halant and Ya to be grouped together during
processing.
In the example above, the ZWNJ is used because two characters that would join by default
are intended to remain as separate entities. When the Ra is not the first character
in the cluster, the ZWNJ is not required for the formation of the Ya-phalaa.
However, so that the Ya-phalaa can be entered with a single key input, it
should be permissible for the Ya-phalaa to be consistently formed by “ZWNJ +
VIRAMA + YA” (U+200C + U+09CD + U+09AF).
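The two encodings discussed in this section can be written out as code point sequences. The sketch below only builds and labels the strings as this 4.0.1 text describes them; the variable names and the `describe` helper are hypothetical, and how either sequence is actually displayed depends on the font and rendering engine.

```python
import unicodedata

RA = "\u09B0"      # BENGALI LETTER RA
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA (halant)
YA = "\u09AF"      # BENGALI LETTER YA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

# Ra + virama + Ya: the Reph has priority, so this is read as Reph on Ya.
reph_sequence = RA + VIRAMA + YA

# Ra + ZWNJ + virama + Ya: the ZWNJ keeps Ra apart from the virama, so the
# virama and Ya group together and form the Ya-phalaa.
ya_phalaa_sequence = RA + ZWNJ + VIRAMA + YA


def describe(seq: str) -> list[str]:
    """List each code point with its character name, for inspection."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]


for label, seq in (("Reph", reph_sequence), ("Ya-phalaa", ya_phalaa_sequence)):
    print(label, describe(seq))
```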
Unicode Character Database
The updated
Unicode Character Database files for this version are available in
the 4.0.1 Update
directory. For the unchanged files, see the
Components for
Version 4.0.1. For more detailed information about the changes in the Unicode
Character Database, see the file
UCD.html in the Unicode Character
Database.