Unicode 5.1.0
Version 5.1.0 has been superseded by the
latest version
of the Unicode Standard.
Version 5.1.0 of the Unicode Standard consists of the core
specification (The Unicode Standard,
Version 5.0), as amended by this specification, together with
the delta and archival code charts for this version, the 5.1.0 Unicode Standard Annexes,
and the 5.1.0 Unicode Character Database (UCD).
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies the normative and informative data that
implementers need to implement the Unicode Standard.
Version 5.1.0 of the Unicode Standard
should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 5.1.0,
defined by: The Unicode Standard, Version 5.0 (Boston, MA,
Addison-Wesley, 2007. ISBN 0-321-48091-0), as amended by Unicode
5.1.0 (http://www.unicode.org/versions/Unicode5.1.0/).
A complete specification of the contributory files for Unicode
5.1.0 is found on the page
Components for 5.1.0.
That page also provides the recommended reference format for Unicode Standard Annexes.
A. Online Edition
B. Overview
C. Errata
D. Notable Changes From Unicode 5.0.0 to Unicode 5.1.0
E. Conformance Changes to the Standard
F. Changes to Unicode Standard Annexes
G. Other Changes to the Standard
H. Unicode Character Database
I. Character Assignment Overview
J. Script Additions
K. Significant Character Additions
- Tamil Named Character Sequences
- Malayalam Chillu Characters
- Myanmar
The text of The Unicode Standard, Version 5.0, as well as
the delta and archival code charts,
is available online via the navigation links on this page.
The charts and the Unicode Standard Annexes may be printed, while
the other files may be viewed but not printed. The
Unicode 5.0 Web Bookmarks page has links to all sections of the
online text.
The changes
addressed in this document consist of additional characters, new normative
text, additional clarifications, and corrections.
This specification is a delta
document consisting of changes
to the text, typically with an indication of how the principal
affected text would be changed. The indications of affected text are
not exhaustive; where other text in the core specification conflicts
with this specification, this specification takes precedence.
The Unicode Standard Annexes themselves
are not delta documents; they incorporate all of the textual
changes for their updates for
Version 5.1.0.
Unicode 5.1.0 contains over 100,000 characters, and provides significant additions and improvements that extend text processing for software worldwide. Some of the key features are:
- increased security in data exchange
- significant character additions for Indic and South East Asian scripts
- expanded identifier specifications for Indic and Arabic scripts
- improvements in the processing of Tamil and other Indic scripts
- linebreaking conformance relaxation for HTML and other protocols
- strengthened normalization stability
- new case pair stability
Other changes are described in the sections below.
In addition to updated existing files, implementers will find new test data files (for example, for linebreaking) and new XML data files that encapsulate all of the Unicode character properties.
A major feature of Unicode 5.1.0 is the enabling of ideographic variation
sequences. These sequences allow standardized representation of glyphic variants
needed for Japanese, Chinese, and Korean text. The first registered collection,
from Adobe Systems, is now available at
http://www.unicode.org/ivd/.
Unicode 5.1.0 contains significant changes to properties and behavioral specifications. Several important property definitions were extended, improving linebreaking for Polish and Portuguese hyphenation. The Unicode Text Segmentation Algorithms, covering sentences, words, and characters, were greatly enhanced to improve the processing of Tamil and other Indic languages. The Unicode Normalization Algorithm now defines stabilized strings and provides guidelines for buffering. Standardized named sequences are added for Lithuanian, and provisional named sequences for Tamil.
Unicode 5.1.0 adds 1,624 newly encoded characters. These additions include characters required for Malayalam and Myanmar and important individual characters such as Latin capital sharp s for German. Version 5.1 extends support for languages in Africa, India, Indonesia, Myanmar, and Vietnam, with the addition of the Cham, Lepcha, Ol Chiki, Rejang, Saurashtra, Sundanese, and Vai scripts. Scholarly support includes important editorial punctuation marks, as well as the Carian, Lycian, and Lydian scripts, and the Phaistos disc symbols. Other new symbol sets include dominoes, Mahjong, dictionary punctuation marks, and math additions. This latest version of the Unicode Standard has exactly the same character assignments as ISO/IEC
10646:2003 plus Amendments 1 through 4.
Errata incorporated into Unicode 5.1.0 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 5.1.0, see the list of current
Updates and Errata.
The corrected formulation of the
regular expression for More_Above in Table 3-14, Context Specification for
Casing, may be of particular interest. For a statement of that erratum
for Table 3-14, see Errata
Fixed in Unicode 5.2.0.
Stability Policy Update
The Unicode Character Encoding Stability Policy has been updated.
This update strengthens normalization stability, adds stability
policy for case pairs, and extends constraints on property values.
For the current statement of these policies, see
Unicode Character Encoding Stability Policy.
Characters
- Additions to Malayalam and Myanmar; characters to complete support of Indic scripts
- New symbols: Mahjong, editorial punctuation marks, significant additions for math
- Capital Sharp S for German
- Some new minority scripts for communities in Vietnam, Indonesia, India, Africa
- Some historic scripts and punctuation marks
General Specification
- Important clarification of UTF-8 conformance
- Improved guidance on use of Myanmar and Malayalam scripts
- Definitions of extended base and extended combining character sequences
Unicode Standard Annexes
Properties
- Deprecation of tag characters
- Incorporation of
Corrigendum
#6: Bidi Mirroring, so that directional quotation marks are no
longer mirrored
- Revision of the definition of the Default_Ignorable_Code_Point property
- New standardized named sequences for Lithuanian
- Detailed documentation of provisional named
sequences for Tamil
- Documentation of U-source ideographs
- New property values for text segmentation:
- Sentence_Break property values CR, LF, Extend, and SContinue
- Word_Break property values CR, LF, Newline, Extend, and MidNumLet
- Grapheme_Cluster_Break property values Prepend and SpacingMark
Additional Constraints on Conversion of Ill-formed UTF-8
In Version 5.1, the text regarding ill-formed code unit
sequences is extended, with new definitions that make it
clear how to identify well-formed and ill-formed code
unit subsequences in strings. The implications of this for
UTF-8 conversion in particular are made much clearer.
This change is motivated by a potential security exploit
based on over-consumption of ill-formed UTF-8 code unit
sequences, as discussed in UTR #36: Unicode Security
Considerations.
On p. 100 in Chapter 3 of The Unicode Standard, Version 5.0,
replace the existing text from D85 through D86 with
the following text. (Note that this does not actually
change definitions D85 or D86, but adds two new
definitions and extends the explanatory text considerably,
particularly with exemplifications for UTF-8 code unit
sequences.)
Replacement Text |
D84a Ill-formed code unit subsequence:
A non-empty subsequence of a Unicode code unit sequence X
which does not contain any code units which also belong
to any minimal well-formed subsequence of X.
- In other words, an ill-formed code unit subsequence
cannot overlap with a minimal well-formed subsequence.
D85 Well-formed: A Unicode code unit sequence
that purports to be in a Unicode encoding form is called
well-formed if and only if it does
follow the specification of that Unicode encoding form.
D85a Minimal well-formed code unit subsequence:
A well-formed Unicode code
unit sequence that maps to a single Unicode scalar value.
- For UTF-8, see the specification in D92 and Table 3-7.
- For UTF-16, see the specification in D91.
- For UTF-32, see the specification in D90.
A well-formed Unicode code unit sequence can be partitioned into one or
more minimal well-formed code unit subsequences for the given Unicode encoding
form. Any Unicode code unit sequence can be partitioned into subsequences that
are either well-formed or ill-formed. The sequence as a whole is well-formed if
and only if it contains no ill-formed subsequence. The sequence as a whole is
ill-formed if and only if it contains at least one ill-formed subsequence.
D86 Well-formed UTF-8 code unit sequence: A
well-formed Unicode code unit sequence of UTF-8 code units.
- The UTF-8 code unit sequence <41 C3 B1 42> is well-formed,
because it can be partitioned into subsequences, all of
which match the specification for UTF-8 in Table 3-7. It
consists of the following minimal well-formed code unit subsequences:
<41>, <C3 B1>, and <42>.
- The UTF-8 code unit sequence <41 C2 C3 B1 42> is ill-formed,
because it contains one ill-formed subsequence. There is
no subsequence for the C2 byte which matches the specification for
UTF-8 in Table 3-7. The code unit sequence is partitioned
into one minimal well-formed code unit subsequence, <41>, followed by
one ill-formed code unit subsequence, <C2>, followed by
two minimal well-formed code unit subsequences, <C3 B1> and <42>.
- In isolation, the UTF-8 code unit sequence <C2 C3> would
be ill-formed, but in the context of the UTF-8 code
unit sequence <41 C2 C3 B1 42>, <C2 C3> does not constitute
an ill-formed code unit subsequence, because the C3 byte is
actually the first byte of the minimal well-formed UTF-8 code
unit subsequence <C3 B1>. Ill-formed code unit subsequences
do not overlap with minimal well-formed code unit subsequences.
|
On p. 101 in Chapter 3 of The Unicode Standard, Version 5.0, replace
the existing paragraph just above
Table 3-4 with the following text.
Replacement Text |
If a Unicode string purports to be in a
Unicode encoding form, then it must not contain any ill-formed
code unit subsequence.
If a process which verifies that a Unicode string is in a
Unicode encoding form encounters an ill-formed code unit
subsequence in that string, then it must not identify that
string as being in that Unicode encoding form.
A process which interprets a Unicode string must not
interpret any ill-formed code unit subsequences in the
string as characters. (See conformance clause C10.)
Furthermore, such a process must not treat any adjacent
well-formed code unit sequences as being part of
those ill-formed code unit sequences.
The most important consequence of this requirement on processes
is illustrated by UTF-8 conversion processes, which
interpret UTF-8 code unit sequences as Unicode character
sequences. Suppose that a UTF-8 converter is iterating
through an input UTF-8 code unit sequence. If the converter
encounters an ill-formed UTF-8 code unit sequence which
starts with a valid first byte, but which does not continue
with valid successor bytes (see Table 3-7), it must not
consume the successor bytes as part of the ill-formed
subsequence whenever those successor bytes themselves
constitute part of a well-formed UTF-8 code unit subsequence.
If an implementation of a UTF-8 conversion process stops at
the first error encountered, without reporting the end of
any ill-formed UTF-8 code unit subsequence, then the
requirement makes little practical difference. However,
the requirement does introduce a significant constraint
if the UTF-8 converter continues past the point of a
detected error, perhaps by substituting one or more U+FFFD
replacement characters for the uninterpretable, ill-formed
UTF-8 code unit subsequence. For example, with the input
UTF-8 code unit sequence <C2 41 42>, such a UTF-8 conversion
process must not return <U+FFFD> or <U+FFFD, U+0042>, because
either of those outputs would be the result of
misinterpreting a well-formed subsequence as being part of
the ill-formed subsequence. The expected return value for such
a process would instead be <U+FFFD, U+0041, U+0042>.
For a UTF-8 conversion process to consume valid successor
bytes is not only non-conformant, but also leaves the
converter open to security exploits. See UTR #36,
Unicode Security Considerations.
Although a UTF-8 conversion process is required to never
consume well-formed subsequences as part of its error
handling for ill-formed subsequences, such a process is
not otherwise constrained in how it deals with any ill-formed
subsequence itself. An ill-formed subsequence consisting of
more than one code unit could be treated as a single
error or as multiple errors. For example, in processing
the UTF-8 code unit sequence <F0 80 80 41>, the only
requirement on a converter is that the <41> be processed
and correctly interpreted as <U+0041>. The converter could
return <U+FFFD, U+0041>, handling <F0 80 80> as a single
error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each
byte of <F0 80 80> as a separate error, or could take
other approaches to signalling <F0 80 80> as an
ill-formed code unit subsequence.
|
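As an informal illustration of these requirements, the following Python sketch relies on the standard library's UTF-8 codec; its errors='replace' handler happens to substitute U+FFFD without consuming adjacent well-formed subsequences, matching the examples discussed above.

    # <41 C2 C3 B1 42>: <C2> is ill-formed, but <41>, <C3 B1>, and <42> are
    # minimal well-formed subsequences and must survive conversion.
    assert b"\x41\xC2\xC3\xB1\x42".decode("utf-8", errors="replace") == "A\uFFFD\u00F1B"

    # <C2 41 42>: the converter must not consume <41> as a trailing byte of
    # the ill-formed <C2>.
    assert b"\xC2\x41\x42".decode("utf-8", errors="replace") == "\uFFFDAB"

    # <F0 80 80 41>: how the ill-formed <F0 80 80> is signalled is up to the
    # converter; this codec reports one U+FFFD per uninterpretable byte.
    assert b"\xF0\x80\x80\x41".decode("utf-8", errors="replace") == "\uFFFD\uFFFD\uFFFDA"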
Extended Combining Character Sequences
In order to take into account the normalization behavior of Hangul syllables and conjoining jamo sequences, additional definitions for extended base and extended combining character sequence have been added to the standard.
The following text is added on p. 91 of The Unicode Standard, Version 5.0, just before D52:
Additional Text |
D51a Extended base: Any base character, or any standard Korean syllable block.
- This term is defined to take into account the fact that sequences of Korean conjoining jamo characters behave as if they were a single Hangul syllable character, so that the entire sequence of jamos constitutes a base.
- For the definition of standard Korean syllable block, see D117 in Section 3.12, Conjoining Jamo Behavior.
|
The following text is added on p. 93 of The Unicode Standard, Version 5.0, just before D57:
Additional Text |
D56a Extended combining character sequence: A maximal character sequence consisting of either an
extended base followed by a sequence of one or more characters where each is a combining character,
ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER; or a sequence of one or more characters where each is
a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER.
- Combining character sequence is commonly abbreviated as CCS, and extended combining character sequence is commonly abbreviated as ECCS.
|
In addition, the existing definitions of grapheme cluster and extended grapheme cluster are slightly modified, to bring them into line with UAX #29, "Unicode Text Segmentation," where they are defined algorithmically.
The existing text for D60 and D61 on p. 94 of The Unicode Standard, Version 5.0, is replaced with the following
text:
Replacement Text |
D60 Grapheme cluster: The text between grapheme cluster boundaries as specified by
Unicode Standard Annex #29, "Unicode Text Segmentation."
- The grapheme cluster represents a horizontally segmentable
unit of text, consisting of some grapheme base (which may
consist of a Korean syllable) together with any number of
nonspacing marks applied to it.
- A grapheme cluster is similar, but not identical to a
combining character sequence. A combining character sequence
starts with a base character and extends across any
subsequent sequence of combining marks, nonspacing
or spacing. A combining character sequence is
most directly relevant to processing issues related to
normalization, comparison, and searching.
- A grapheme cluster typically starts with a grapheme base and then extends across any subsequent sequence of nonspacing marks. For completeness in text segmentation, a grapheme cluster may also consist of segments not containing a grapheme base, such as newlines or some default ignorable code points. A grapheme cluster is most directly relevant to text rendering and processes
such as cursor placement and text selection in editing.
- For many processes, a grapheme cluster behaves as if it were a single character with the same properties as its grapheme base. Effectively, nonspacing marks apply graphically to the base, but do not change its properties.
For example, <x, macron> behaves in line breaking or bidirectional layout as if it were the character x.
D61 Extended grapheme cluster: The text between extended grapheme cluster boundaries as specified by Unicode Standard Annex #29, "Unicode Text Segmentation."
- Extended grapheme clusters are defined in a parallel manner to grapheme clusters, but also include sequences of spacing
marks and certain prepending characters.
- Grapheme clusters and extended grapheme clusters do not have
linguistic significance, but are used to break up a string of text
into units for processing.
- Grapheme clusters and extended grapheme clusters may be
adjusted for particular processing requirements, by tailoring the
rules for grapheme cluster segmentation specified in Unicode
Standard Annex #29.
|
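As an informal illustration of grapheme cluster segmentation, the following Python sketch assumes the third-party regex package is installed; its \X pattern matches one extended grapheme cluster according to the default rules of UAX #29.

    import regex

    # A base letter plus a nonspacing mark, followed by a conjoining jamo
    # sequence <L, V, T>, which behaves as a single Hangul syllable block.
    text = "q\u0303" + "\u110E\u1173\u11B8"

    clusters = regex.findall(r"\X", text)
    assert clusters == ["q\u0303", "\u110E\u1173\u11B8"]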
Updates to Table of Named Unicode Algorithms
Table 3-1, Named Unicode Algorithms, and the associated
explanatory text on p. 81 of The Unicode Standard, Version 5.0 should be
updated to account for some slight changes in naming conventions for Unicode algorithms in Version 5.1.
The following text is added at the end of the paragraph above Table
3-1:
Additional Text |
When externally referenced, a named Unicode algorithm may be prefixed with
the qualifier "Unicode", so as to make the connection of the algorithm to the
Unicode Standard and other Unicode specifications clear. Thus, for example, the
Bidirectional Algorithm is generally referred to by the full name, "Unicode
Bidirectional Algorithm". As much as is practical, the titles of Unicode
Standard Annexes which define Unicode algorithms consist of the name of the
Unicode algorithm they specify. In a few cases, named Unicode algorithms are
also widely known by their acronyms, and those acronyms are also listed in Table
3-1. |
The following changes are made to select entries in the table:
Current Text |
Grapheme Cluster Boundary Determination | UAX #29 |
Word Boundary Determination | UAX #29 |
Sentence Boundary Determination | UAX #29 |
Collation Algorithm (UCA) | UTS #10 |
Replacement Text |
Character Segmentation | UAX #29 |
Word Segmentation | UAX #29 |
Sentence Segmentation | UAX #29 |
Unicode Collation Algorithm (UCA) | UTS #10 |
Update of Definition of case-ignorable
On p. 124 in Section 3.13, Default Case Algorithms of The Unicode Standard, Version 5.0,
update D121 to include the
value MidNumLet in the definition of case-ignorable. This change
was occasioned by the split of the Word_Break property value
MidLetter into MidLetter and MidNumLet.
Replacement Text |
D121 A character C is defined to be case-ignorable if C has the value
MidLetter or the value MidNumLet for the Word_Break property or its
General_Category is one of Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format
(Cf), Modifier_Letter (Lm), or Modifier_Symbol (Sk). |
Clarifications of Default Case Conversion
On p. 125 in Section 3.13, Default Case Algorithms of The Unicode Standard, Version 5.0, replace
the first paragraph under "Default Case Conversion" with the following text and add
a new paragraph following rules R1-R4.
Replacement Text |
The following rules specify the default case conversion operations for
Unicode strings. These rules use the full case conversion operations,
Uppercase_Mapping(C), Lowercase_Mapping(C), and Titlecase_Mapping(C), as well
as the context-dependent mappings based on the casing context, as specified
in Table 3-14. [Rules R1-R4 are unchanged from 5.0.]
The default case conversion operations may be tailored for specific
requirements. A common variant, for example, is to make use of simple case
conversion, rather than full case conversion. Language- or locale-specific
tailorings of these rules may also be used. |
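As an informal illustration, Python's built-in str.lower() and str.upper() implement the default full case conversions, including the multi-character mappings from SpecialCasing.txt; a simple-case tailoring, by contrast, maps each code point to exactly one code point.

    # Full case conversion can change the length of a string.
    assert "\uFB03".upper() == "FFI"        # LATIN SMALL LIGATURE FFI
    assert "\u0130".lower() == "i\u0307"    # LATIN CAPITAL LETTER I WITH DOT ABOVE

    # A tailoring that uses only the simple mappings from UnicodeData.txt
    # would leave U+FB03 unchanged and map U+0130 to plain "i".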
Canonical Combining Class Is Immutable
As a result of the strengthening of the
Normalization Stability Policy, the Canonical_Combining_Class property has become an immutable property, rather than merely a stable property.
To adjust for this change, the first bullet after D40 on p. 88 of The Unicode Standard, Version 5.0, is removed from the text.
Clarification of Conformance Clause C7
Because of the potential security issues involved in the deletion of characters, the following explanatory bullet is added to the text of
The Unicode Standard, Version 5.0, on p. 72, after the first two bullet items following Conformance Clause C7.
Additional Text |
- Note that security problems can result if noncharacter
code points are removed from text received from
external sources. For more information, see Section 16.7, Noncharacters, and Unicode Technical Report #36,
"Unicode Security Considerations."
|
In Unicode 5.1.0, some of the Unicode Standard Annexes have minor changes to their titles for consistency with new technical report naming practices. The following summarizes the
more significant changes in the
Unicode Standard Annexes. More detailed notes can be found by looking at the
modifications section of each document.
Minor changes were made to the following Unicode Standard Annexes—primarily clerical changes to update the version and revision numbers and to update references to Unicode 5.1.0.
The following explanatory material
augments the text in Unicode 5.0.
Addition of Dual-Joining Group to Table 8-7
In Table 8-7, on p. 279 of The Unicode Standard, Version 5.0,
add a row for a new joining group, BURUSHASKI YEH BARREE, right below the YEH group.
This joining group is for the newly encoded characters for
Burushaski:
U+077A YEH BARREE WITH DIGIT TWO ABOVE
U+077B YEH BARREE WITH DIGIT THREE ABOVE
These have the isolate and final forms of a Yeh Barree, but
are dual-joining, rather than right-joining, like
U+06D2 ARABIC LETTER YEH BARREE—a letter used in Urdu and
a number of other languages.
On p. 281 of The Unicode Standard, Version 5.0, below Table 8-8, add the following explanatory paragraph:
Additional Text |
The yeh barree is a form of yeh used in languages such as Urdu. It is a right-joining letter, and has no initial or medial forms. However, some letterforms based on yeh barree used in other languages, such as Burushaski, do take initial and medial forms. Such characters are given the dual-joining type and have a separate joining group, BURUSHASKI YEH BARREE, based on this difference in shaping behavior. |
Clarification Regarding Non-decomposition of Overlaid Diacritics
Most characters that people think of as being a character "plus accents"
have formal decompositions in Unicode. For example:
U+00C0 LATIN CAPITAL LETTER A WITH GRAVE → U+0041 LATIN CAPITAL LETTER A + U+0300 COMBINING GRAVE ACCENT
U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA → U+0043 LATIN CAPITAL LETTER C + U+0327 COMBINING CEDILLA
Based on that pattern, people often also expect to see formal Unicode
decompositions for characters with slashes, bars, hooks, and the like
used as diacritics for forming new Latin letters:
U+00D8 LATIN CAPITAL LETTER O WITH STROKE → U+004F LATIN CAPITAL LETTER O + U+0338 COMBINING LONG SOLIDUS OVERLAY
However, such decompositions are not formally defined in Unicode.
For historical and implementation reasons, there are no
decompositions for characters with overlaying diacritics such as bars
or slashes, or for most hooks, swashes, tails, and other similar
modifications to the form of a base character. These
include characters such as:
U+00D8 LATIN CAPITAL LETTER O WITH STROKE
U+049A CYRILLIC CAPITAL LETTER KA WITH DESCENDER
and also characters that would seem analogous in appearance to
a Latin letter with a cedilla, such as:
U+0498 CYRILLIC CAPITAL LETTER ZE WITH DESCENDER
Because these characters with overlaid diacritics or modifications
to their base form shape have no formal decompositions,
some kinds of processing that
would normally use Normalization Form D (NFD) for internal processing may end up
simulating decompositions instead, so that they can treat the diacritic
as if it were a separately encoded combining mark.
For example, a common operation in searching or
matching is to sort as if accents were removed. This is easy to do with
characters that decompose, but for characters with overlaid
diacritics, the effect of ignoring the diacritic has to be simulated
instead with data tables that go beyond simple use of Unicode
decomposition mappings.
The lack of formal decompositions for characters with overlaid
diacritics also means there are increased opportunities for spoofing
with such characters. The display of a base letter plus a combining
overlaid mark such as U+0335 COMBINING SHORT STROKE OVERLAY may
look the same as the encoded base letter with bar diacritic, but
the two sequences are not canonically equivalent and would not
be folded together by Unicode normalization.
For more information and data for handling
these confusable sequences involving overlaid diacritics,
see UTR #36: Unicode Security
Considerations.
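As an informal illustration of this asymmetry, the following Python sketch uses the standard unicodedata module.

    import unicodedata

    # A WITH GRAVE has a canonical decomposition, so NFD separates the accent.
    assert unicodedata.normalize("NFD", "\u00C0") == "A\u0300"

    # O WITH STROKE has no decomposition: NFD leaves it alone, and the
    # visually similar sequence <O, COMBINING LONG SOLIDUS OVERLAY> is not
    # folded together with it by normalization.
    assert unicodedata.normalize("NFD", "\u00D8") == "\u00D8"
    assert unicodedata.normalize("NFC", "O\u0338") != "\u00D8"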
Modifier Letters
Modifier letters, in the sense used in the Unicode Standard,
are letters or symbols that are typically written adjacent to
other letters and which modify their usage in some way.
They are not formally combining marks (gc=Mn or gc=Mc) and do
not graphically combine with the base letter that they
modify. In fact, they are base characters in their own right.
The sense in which they modify other letters is more a matter
of their semantics in usage; they often tend to function as
if they were diacritics, serving to indicate a change in
pronunciation of a letter, or to otherwise indicate something
to be distinguished in a letter's use.
Modifier letters are very commonly used in technical phonetic
transcriptional systems, where they augment the use of
combining marks to make phonetic distinctions. Some of them
have been adapted into regular language orthographies as well.
Many modifier letters take the form of superscript or subscript
letters. Thus the IPA modifier letter that indicates
labialization (U+02B7) is a superscript form of the letter w.
As for all such superscript or subscript form characters in
the Unicode Standard, such modifier letters have compatibility
decompositions.
Many modifier letters are derived from letters in the Latin script, although some modifier letters occur in other scripts, as well. Latin-derived modifier letters may be based on either minuscule (lowercase) or majuscule (uppercase) forms of the letters, but never have case mappings. Modifier letters which have the shape of capital or small capital Latin letters, in particular, are used exclusively in technical phonetic transcriptional systems. Strings of phonetic transcription are notionally lowercase—all letters in them are considered to be lowercase, whatever their shapes. In terms of formal properties in the Unicode Standard, modifier letters based on letter shapes are Lowercase=True; modifier letters not based on letter shapes are simply caseless. All modifier letters, regardless of their shapes, are operationally caseless; they need to be unaffected by casing operations,
because changing them by a casing operation would destroy their meaning for the phonetic transcription.
Modifier letters in the Unicode Standard are indicated by
one of two General_Category values: gc=Lm or gc=Sk.
The General_Category Lm is given to modifier letters derived
from regular letters. It is also given to some other characters
with more punctuation-like shapes, such as raised commas,
which nevertheless have letterlike behavior and which occur
on occasion as part of the orthography for regular words in
one language or another.
The General_Category Sk is given to modifier letters that
typically have more symbol-like origins and which seldom,
if ever, are adapted to regular orthographies outside the
context of technical phonetic transcriptional systems.
This subset of modifier letters is also known as
"modifier symbols".
This general distinction between gc=Lm and gc=Sk is reflected
in other Unicode specifications relevant to identifiers and
word boundary determination. Modifier letters with gc=Lm
are included in the set definitions that result in the
derived properties ID_Start and ID_Continue (and XID_Start
and XID_Continue). As such, they are considered part of the
default definition of Unicode identifiers. Modifier symbols
(gc=Sk), on the other hand, are not included in those set
definitions, and so are excluded by default from Unicode
identifiers.
Modifier letters (gc=Lm) have the derived
property Alphabetic, while modifier symbols (gc=Sk) do not.
Modifier letters (gc=Lm) also have the Word_Break property
value ALetter (wb=ALetter), while modifier symbols (gc=Sk) do not.
This means that for default determination of word break
boundaries, modifier symbols will cause a word break,
while modifier letters proper will not.
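As an informal illustration, the following Python sketch checks the General_Category values of a modifier letter and a modifier symbol; Python's str.isidentifier() reflects the XID_Start/XID_Continue derivations, so it also shows the identifier consequences described above.

    import unicodedata

    assert unicodedata.category("\u02B0") == "Lm"   # MODIFIER LETTER SMALL H
    assert unicodedata.category("\u02C4") == "Sk"   # MODIFIER LETTER UP ARROWHEAD

    # gc=Lm is included in default identifiers; gc=Sk is not.
    assert ("x" + "\u02B0").isidentifier()
    assert not ("x" + "\u02C4").isidentifier()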
Most general use modifier letters (and modifier symbols) were
collected together in the Spacing Modifier Letters block
(U+02B0..U+02FF), the UPA-related Phonetic Extensions block
(U+1D00..U+1D7F), the Phonetic Extensions Supplement block
(U+1D80..U+1DBF), and the Modifier Tone Letters block
(U+A700..U+A71F). However, some script-specific modifier
letters are encoded in the blocks appropriate to those
scripts. They can be identified by checking for their General_Category
values.
There is no requirement that the Unicode names for modifier
letters contain the label "MODIFIER LETTER", although most
of them do.
Clarifications Related to General_Category Assignments
There are several other conventions about how General_Category
values are assigned to Unicode characters.
The General_Category of an assigned character serves as
a basic classification of the character, based on its primary
usage. The General_Category extends the widely used
subdivision of ASCII characters into
letters, digits, punctuation, and symbols—but needed to be elaborated
and subdivided to be appropriate for the much larger and
more comprehensive scope of Unicode.
Many characters have multiple uses, however, and not all such
uses can be captured by a single, simple partition property
such as General_Category. Many letters, for instance, serve
dual functions as numerals in traditional numeral
systems. Examples can be found in the Roman numeral system,
in Greek usage of letters as numbers, in Hebrew, and so on for
many scripts. In such cases the General_Category is assigned
based on the primary letter usage of the character, despite
the fact that it may also have numeric values, occur in
numeric expressions, or may also be used symbolically in
mathematical expressions, and so on.
The General_Category gc=Nl is reserved primarily for letterlike
number forms which are not technically digits.
For example, the compatibility
Roman numeral characters, U+2160..U+217F, all have gc=Nl.
Because of the compatibility status of these characters, the recommended way
to represent Roman numerals is with
regular Latin letters (gc=Ll or gc=Lu). These letters
derive their numeric status from conventional usage to express
Roman numerals, rather than from their General_Category value.
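As an informal illustration using Python's unicodedata module:

    import unicodedata

    # U+2160 ROMAN NUMERAL ONE is a letterlike number (gc=Nl) with a numeric
    # value and a compatibility decomposition to the Latin letter I; the
    # ordinary letter I itself is gc=Lu, and its numeric use is conventional.
    assert unicodedata.category("\u2160") == "Nl"
    assert unicodedata.numeric("\u2160") == 1
    assert unicodedata.decomposition("\u2160") == "<compat> 0049"
    assert unicodedata.category("I") == "Lu"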
Currency symbols (gc=Sc), by contrast, are given their General_Category
value based entirely on their function as symbols for currency,
even though they are often derived from letters and may appear
similar to other diacritic-marked letters that get assigned one of the
letter-related General_Category values.
Pairs of opening and closing punctuation are given their
General_Category values (gc=Ps for opening and gc=Pe for closing)
based on the most typical usage and orientation of such pairs.
Occasional usage of such punctuation marks unpaired or in
opposite orientation certainly occurs, however, and is in
no way prevented by their General_Category values.
Similarly, characters whose General_Category identifies them
primarily as a symbol or as a math symbol may function in
other contexts as punctuation or even paired punctuation.
The most obvious such case is for U+003C "<" LESS-THAN SIGN
and U+003E ">" GREATER-THAN SIGN. These are given the General_Category gc=Sm because their primary identity is as
mathematical relational signs. However, as is obvious from
HTML and XML, they also serve ubiquitously as paired bracket
punctuation characters in many formal syntaxes.
Clarification of Hangul Jamo Handling
The normalization of Hangul conjoining jamos and of Hangul syllables
depends on algorithmic mapping, as specified in Section 3.12,
Conjoining Jamo Behavior in [Unicode]. That algorithm specifies the full
decomposition of all precomposed Hangul syllables, but effectively it is equivalent to the recursive application of pairwise decomposition mappings,
as for all other Unicode characters. Formally, the Decomposition_Mapping (dm)
property value for a Hangul syllable is the pairwise decomposition and not the
full decomposition. Each character with the Hangul_Syllable_Type value LVT
has a decomposition mapping consisting of a character with an LV value
and a character with a T value. Thus for U+CE31 the decomposition
mapping is <U+CE20, U+11B8>, and not <U+110E, U+1173, U+11B8>.
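As an informal illustration with Python's unicodedata module, which exposes the algorithmic Hangul decompositions through normalization:

    import unicodedata

    syllable = "\uCE31"   # a precomposed LVT Hangul syllable

    # NFD applies the decomposition recursively, yielding the full
    # conjoining jamo sequence <L, V, T>.
    assert unicodedata.normalize("NFD", syllable) == "\u110E\u1173\u11B8"

    # The pairwise decomposition <U+CE20, U+11B8> is canonically equivalent:
    # composing it yields the same syllable.
    assert unicodedata.normalize("NFC", "\uCE20\u11B8") == syllable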
Tailored Casing Operations
The Unicode Standard provides default casing operations. There are
circumstances in which the default operations need to be tailored for
specific locales or environments. Some of these tailorings have data
in the standard, in the SpecialCasing.txt file, notably for the Turkish
dotted capital I and dotless small i. In other cases, more specialized
tailored casing operations may be appropriate. These include:
- Titlecasing of IJ at the start of words in Dutch
- Removal of accents when uppercasing letters in Greek
- Uppercasing U+00DF ( ß ) LATIN SMALL LETTER SHARP S to the new
U+1E9E LATIN CAPITAL LETTER SHARP S
However, these tailorings may or may not be desired, depending on the implementation in question.
In particular, capital sharp s is intended for typographical representations of signage and uppercase titles, and other environments where users require the
sharp s to be preserved in uppercase. Overall, such usage is rare. In contrast, standard German orthography uses the string
"SS" as uppercase mapping for
small sharp s. Thus, with the default Unicode casing operations, capital sharp s
will lowercase to small sharp s, but not the reverse: small sharp s uppercases to
"SS". In those instances where the reverse casing operation is needed, a tailored operation would be required.
Clarification of Lowercase and Uppercase
Implementers occasionally find the several ways in which the Unicode Standard uses the concepts of lowercase and uppercase somewhat confusing.
To address this, the following clarifying text is added to The Unicode Standard, Version 5.0, in
Section 4.2, Case—Normative, at the bottom of p. 132:
Additional Text |
For various reasons, the Unicode Standard has more than one formal definition of lowercase and uppercase.
(The additional complications of titlecase are not discussed here.)
The first set of definitions is based on the General_Category property in UnicodeData.txt. The relevant values are General_Category=Ll (Lowercase_Letter) and General_Category=Lu (Uppercase_Letter). For most ordinary letters of bicameral scripts such as Latin, Greek, and Cyrillic, these values are obvious and non-problematical. However, the General_Category property is, by design, a partition of Unicode codespace. This means that each Unicode character can only have one General_Category value, and that situation results in some odd edge cases for modifier letters, letterlike symbols and letterlike numbers.
So not every Unicode character that looks like a lowercase character necessarily ends up with General_Category=Ll, and
the same is true for uppercase characters.
The second set of definitions consists of the derived binary properties, Lowercase and Uppercase. Those derived properties augment the General_Category values by adding the additional characters that ordinary users think of as being lowercase or uppercase, based primarily on their letterforms.
The additional characters are included in the derivations by means of the Other_Lowercase and Other_Uppercase properties defined in PropList.txt. For example, Other_Lowercase adds the various modifier letters that are letterlike in shape, the circled lowercase letter symbols, and the compatibility lowercase
Roman numerals. Other_Uppercase adds the circled uppercase letter symbols, and
the compatibility uppercase Roman numerals.
The third set of definitions is fundamentally different in kind; these definitions are not character properties at all.
The functions isLowercase and isUppercase are string functions returning a binary True/False value. These functions are defined in
Section 3.13, Default Case Algorithms, and depend on case mapping relations, rather than
being based on letterforms per se. Basically, isLowercase is True for a string if the result of
applying the toLowercase mapping operation to that string is the same as the string itself.
The following table illustrates the various possibilities for how these definitions interact,
as applied to exemplary single characters or single character strings.
Code | Character | gc | Lowercase | Uppercase | isLowerCase(S) | isUpperCase(S) |
0068 | h | Ll | True | False | True | False |
0048 | H | Lu | False | True | False | True |
24D7 | ⓗ | So | True | False | True | False |
24BD | Ⓗ | So | False | True | False | True |
02B0 | ʰ | Lm | True | False | True | True |
1D34 | ᴴ | Lm | True | False | True | True |
02BD | ʽ | Lm | False | False | True | True |
Note that for "caseless" characters, such as U+02B0, U+1D34, and
U+02BD, isLowerCase and isUpperCase are both True, because
the inclusion of a caseless letter in a string is not criterial for determining the casing of the string—a caseless letter always case maps to itself.
On the other hand, all modifier letters derived from letter shapes are also notionally lowercase, whether the letterform itself is a minuscule or a majuscule in shape. Thus U+1D34 MODIFIER LETTER CAPITAL H is actually Lowercase=True. Other modifier letters not derived from letter shapes are neither Lowercase nor Uppercase.
The string functions isLowerCase and isUpperCase also apply to strings longer than one character,
of course, in which case the character properties General_Category, Lowercase, and Uppercase are
not relevant. In the following table, the string function isTitleCase is also illustrated, to show
its applicability for the same strings.
Codes | String | isLowerCase(S) | isUpperCase(S) | isTitleCase(S) |
0068 0068 | hh | True | False | False |
0048 0048 | HH | False | True | False |
0048 0068 | Hh | False | False | True |
0068 0048 | hH | False | False | False |
Programmers concerned with manipulating Unicode strings should generally be dealing with the string functions such as isLowerCase (and its functional cousin, toLowerCase), unless they are working directly with single character properties. Care is always advised, however, when dealing with case in
the Unicode Standard, as expectations based simply on the behavior of A..Z, a..z
do not generalize easily across the entire repertoire of Unicode characters, and because case for modifier letters, in particular, can result in unexpected behavior.
|
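As an informal illustration, the following Python sketch approximates the isLowerCase and isUpperCase string functions directly from their definitions, using the default full case mappings of str.lower() and str.upper(); the binary Lowercase and Uppercase properties themselves are not exposed by the standard unicodedata module.

    import unicodedata

    def is_lower_case(s: str) -> bool:
        # Section 3.13 definition: toLowercase(S) == S
        return s.lower() == s

    def is_upper_case(s: str) -> bool:
        return s.upper() == s

    # Single characters from the first table above.
    assert unicodedata.category("\u24D7") == "So"
    assert is_lower_case("\u24D7") and not is_upper_case("\u24D7")   # circled h
    assert is_lower_case("\u02B0") and is_upper_case("\u02B0")       # caseless: both True

    # Strings from the second table above.
    assert is_lower_case("hh") and not is_upper_case("hh")
    assert not is_lower_case("Hh") and not is_upper_case("Hh")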
Canonical Equivalence Issues for Greek Punctuation
Replace the last two sentences of the paragraph on "Other Basic Latin Punctuation Marks," on p. 214 of
The Unicode Standard, Version 5.0, with the following expanded text as new paragraphs:
Replacement Text |
Canonical Equivalence Issues for Greek Punctuation. Some commonly used Greek punctuation marks are encoded in the Greek and Coptic block, but are canonical equivalents to generic punctuation marks encoded in the C0 Controls and Basic Latin block, because they are indistinguishable in shape. Thus, U+037E
";" GREEK QUESTION MARK is canonically equivalent to U+003B ";" SEMICOLON, and U+0387
"·" GREEK ANO TELEIA is canonically equivalent to U+00B7 "·" MIDDLE DOT.
In these cases, as for other canonical singletons, the preferred form is the character that the canonical singletons are mapped to, namely U+003B and U+00B7 respectively. Those are the characters that will appear in any normalized form of Unicode text, even when used in Greek text as Greek punctuation. Text segmentation algorithms need to be aware of this
issue, as the kinds of text units delimited by a semicolon or a middle dot in Greek text will typically
differ from those in Latin text.
The character properties for U+00B7 MIDDLE DOT are particularly problematical, in part because of identifier issues for that character. There is no guarantee that all of its properties will align exactly with those of U+0387 GREEK ANO TELEIA,
because the properties of the latter were established based on the more limited function of the middle dot in Greek as a delimiting punctuation mark.
|
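As an informal illustration with Python's unicodedata module:

    import unicodedata

    # The Greek-specific punctuation characters are canonical singletons, so
    # normalization replaces them with the generic characters.
    assert unicodedata.normalize("NFC", "\u037E") == ";"       # GREEK QUESTION MARK
    assert unicodedata.normalize("NFC", "\u0387") == "\u00B7"  # GREEK ANO TELEIA

    # Normalized Greek text therefore carries U+003B, so segmentation logic
    # cannot rely on seeing U+037E to detect a Greek question mark.
    assert unicodedata.normalize("NFC", "\u03C4\u03AF\u037E") == "\u03C4\u03AF;"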
Coptic Font Style
Update the second paragraph on p. 244 of The Unicode Standard, Version 5.0, to read as follows:
Replacement Text |
Font Styles. Bohairic Coptic uses only a subset of the letters in the Coptic repertoire. It also uses a font style distinct from that for Sahidic. Prior to Version 5.0, the Coptic letters derived from Demotic, encoded in the range U+03E2..U+03EF in the Greek and Coptic block, were shown in the code charts in a Bohairic font style.
Starting from Version 5.0, all Coptic letters in the standard, including those in the range U+03E2..U+03EF, are shown in the code charts in a Sahidic font style, instead. |
Rendering Default Ignorable Code Points
Update the last paragraph on p. 192 of The Unicode Standard, Version 5.0, in
Section 5.20, Default Ignorable Code Points, to read as follows:
Replacement Text |
An implementation should ignore all default ignorable code points in rendering whenever
it does not support those code points, whether they are assigned
or not.
In previous versions of the Unicode Standard, surrogate code
points, private use code points, and some control characters
were also default ignorable code points. However, to avoid
security problems, such characters always should be displayed
with a missing glyph, so that there is a visible indication of
their presence in the text. In Unicode 5.1 these code points
are no longer default ignorable code points. For more
information, see
UTR #36, "Unicode Security Considerations." |
Stateful Format Controls
Update the subsection on "Stateful Format Controls" in Section 5.20, Default Ignorable Code Points, on p. 194 of
The Unicode Standard, Version 5.0, to read as follows:
Replacement Text |
Stateful Format Controls. There are a small number of paired stateful controls.
These characters are used in pairs, with an initiating character
(or sequence) and a terminating character. Even when these
characters are ignored, complications can arise due to their
paired nature. Whenever text is cut, copied, pasted, or deleted,
these characters can become unpaired. To avoid this problem,
ideally both any copied text and its context (site of a
deletion, or target of an insertion) would be modified so as to
maintain all pairings that were in effect for each piece of
text. This process can be quite complicated, however, and is not
often done—or is done incorrectly if attempted. The paired stateful controls recommended for use are listed in
Table 5-6.
Table 5-6. Paired Stateful Controls
Characters | Documentation |
Bidi Overrides and Embeddings | Section 16.2, Layout Controls; UAX #9 |
Annotation Characters | Section 16.8, Specials |
Musical Beams and Slurs | Section 15.11, Western Musical Symbols |
The bidirectional overrides and embeddings and the annotation characters are reasonably robust, because their behavior terminates at paragraph boundaries. Paired format controls for representation of beams and slurs in music are recommended only for specialized musical layout software, and also have limited scope.
Other paired stateful controls in the standard are deprecated, and their use should be avoided. They are listed in
Table 5-7.
Table 5-7. Paired Stateful Controls
(Deprecated)
Characters | Documentation |
Deprecated Format Characters | Section 16.3, Deprecated Format Characters |
Tag Characters | Section 16.9, Tag Characters |
The tag characters, originally intended for the representation of language tags, are particularly fragile under editorial operations that move spans of text around. See
Section 5.10, Language Information in Plain Text, for more information about language tagging. |
Clarification About Handling Noncharacters
The third paragraph of Section 16.7, Noncharacters, on p. 549 of
The Unicode Standard, Version 5.0, is updated to read:
Replacement Text |
Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them.
If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD REPLACEMENT CHARACTER, to
indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters. (See conformance clause C7 in
Section 3.2, Conformance Requirements, and Unicode Technical Report #36, "Unicode Security Considerations.") |
Tag Characters
Update the first paragraph of
Section 16.9, Tag Characters, on pp. 554-555 of The Unicode Standard, Version 5.0, to read as follows:
Replacement Text |
The characters in this block provide a mechanism for language tagging in Unicode plain text. These characters are deprecated, and should not be used—particularly with any protocols that provide alternate means of language tagging. The Unicode Standard recommends the use of higher-level protocols, such as HTML or XML, which provide for language tagging via markup. See Unicode Technical Report #20, "Unicode in XML and Other Markup Languages." The requirement for language information embedded in plain text data is often overstated, and markup or other rich text mechanisms constitute best current practice. See
Section 5.10, Language Information in Plain Text for further discussion. |
Ideographic Variation Database
In Section 12.1, Han, on p. 418 of The Unicode Standard, Version 5.0, replace
the last sentence of the last paragraph of the subsection
"Principles of Han Unification" as follows:
Replacement Text |
Z-axis typeface and stylistic differences are generally ignored for the purpose of encoding Han ideographs, but can be represented in
text by the use of variation sequences; see Section 16.4, Variation Selectors. |
In Section 16.4, Variation Selectors, on p. 545 of The Unicode Standard, Version 5.0, replace the first two paragraphs
of the
"Variation Sequence" subsection by the following:
Replacement Text |
Variation Sequence. A variation sequence always consists of a base character followed by a variation selector character. That sequence is referred to as a variant of the base character. The variation selector affects only the appearance of the base character. The variation selector is not used as a general code extension mechanism; only certain sequences are defined, as follows:
Standardized variation sequences are defined in the file StandardizedVariants.txt in the Unicode Character Database.
Ideographic variation sequences are defined by the registration process described in
Unicode Technical Standard #37,
"Ideographic Variation Database," and are listed in the
Ideographic Variation Database. Only those two types of variation sequences are sanctioned for use by conformant implementations.
In all other cases, use of a variation selector character does not change the visual appearance of the preceding base character from what it would have had in the absence of the variation selector.
|
For more detailed information about the changes in the Unicode
Character Database, see the file UCD.html in the
Unicode Character
Database.
Note that as of Version 5.1.0 an XML version of the complete
Unicode Character Database is available. For details see
UAX #42, An XML
Representation of the UCD.
The Unihan.txt data file is now available only as a zip file; there
is no longer a link for the uncompressed text file.
A total of 1,624 new character assignments were made to the Unicode Standard, Version 5.1.0 (over and above what was in Unicode 5.0.0). These additions include new characters for mathematics, punctuation, and symbols. There are also eleven newly encoded scripts, including eight minority scripts (Cham, Kayah Li, Lepcha, Ol Chiki, Rejang, Saurashtra, Sundanese, and Vai) and three historical scripts (Carian, Lycian, and Lydian).
The new character additions were to both the BMP and the SMP
(Plane 1). The following table shows the allocation of code points in Unicode
5.1.0. For more information on the specific characters, see the file
DerivedAge.txt in the
Unicode Character Database.
Graphic | 100,507 |
Format | 141 |
Control | 65 |
Private Use | 137,468 |
Surrogate | 2,048 |
Noncharacter | 66 |
Reserved | 873,817 |
The character repertoire corresponds to ISO/IEC 10646:2003 plus
Amendments 1 through 4. For more details of character counts, see
Appendix D, Changes from Previous Versions in Unicode 5.0.
Kayah Li: U+A900—U+A92F
The Kayah Li script was devised in 1962 to write the Eastern and Western Kayah Li languages, spoken in Northern Myanmar and Northern Thailand. An orthography for these languages using the Myanmar script also exists.
Kayah Li letterforms are historically related to some other Brahmi-derived scripts, but the Kayah Li script itself is a simple, true alphabet. Some of the vowels are written with spacing letters, while others are written with combining marks applied above the letter a, which serves as a vowel carrier.
The Kayah Li script has its own set of digits. It makes use of common punctuation from Latin typography, but has a few distinct signs of its own. Spaces are used to separate words.
Lepcha: U+1C00—U+1C4F
Lepcha is a Sino-Tibetan language. A Brahmic script derived
directly from Tibetan, Lepcha was likely devised
around 1720 CE by a Sikkimese king. The script is used by many
people in Sikkim and West Bengal, especially in the Darjeeling
district. It is a complex script that uses various combining
marks.
Lepcha digits have distinctive forms. Lepcha has traditional punctuation signs, but
everyday writing now uses punctuation such as comma,
full stop, and question mark, though sometimes Tibetan
tshegs are found.
Opportunities for hyphenation occur after any full orthographic
syllable. Lepcha punctuation marks can be expected to have
behavior similar to that of Devanagari danda and double
danda.
Rejang: U+A930—U+A95F
The Rejang script dates from at least the mid-18th century.
Rejang is spoken by about 200,000 people living in Indonesia on
the island of Sumatra. There are five major dialects of Rejang.
Rejang is a complex, Brahmic script that uses combining marks. It uses
European digits and
common punctuation, as well as one script-specific section mark. Traditional texts
tend not to use spacing and there are no known examples using
hyphenation. Modern use of the script may use spaces between words.
Sundanese: U+1B80—U+1BBF
The Sundanese script is used to write the Sundanese language,
one of the languages of the island of Java in Indonesia. It is a complex
Brahmic script and uses combining marks. Spaces are used between words. Sundanese has
script-specific digits, but uses common punctuation.
Hyphenation may occur after any full orthographic syllable.
Saurashtra: U+A880—U+A8DF
Saurashtra is an Indo-European language, related
to Gujarati and spoken in southern India,
mainly in the area around the cities of Madurai, Salem, and Thanjavur. Saurashtra
is most often written in the Tamil script, augmented with the use
of superscript digits and a colon to indicate sounds not available
in the Tamil script. Saurashtra is a complex, Brahmic script that
uses combining marks and has script-specific digits. It mainly uses common
punctuation, but several script-specific punctuation marks may be used.
Cham: U+AA00—U+AA5F
Cham is an Austronesian language used in Vietnam and Cambodia.
There are two main groups, the Eastern Cham and the Western Cham;
the script is used more by the Eastern Cham. It is a complex, Brahmic
script and uses combining marks.
Cham has script-specific digits, although European digits are also
used. It also has some script-specific punctuation, although, again,
Western punctuation is also used. Opportunities for linebreak occur after any full orthographic
syllable. Spaces are used between words.
Ol Chiki: U+1C50—U+1C7F
The Ol Chiki script was invented in the first half of the 20th century to write Santali, a Munda language spoken mainly in India. There are a few speakers in Nepal and Bangladesh.
Ol Chiki is a simple, alphabetic script, consisting of letters
representing consonants and vowels. It is written from left to
right.
Ol Chiki has script-specific digits. It mainly uses common
punctuation, but has some script-specific punctuation marks. It does not
use full stop. Spaces are used between words.
Vai: U+A500—U+A61F (Vai block: A500—A63F)
The Vai script was probably invented in the 1830s, and was
standardized for modern usage in 1962 at a conference at the
University of Liberia. Used in Liberia, Vai is a simple syllabic
script (a syllabary), written from left to right.
Vai has some script-specific punctuation and also uses common
punctuation. It does not use script-specific digits. Linebreaking within
words can occur after any character. Special consideration is
necessary for U+A608 VAI SYLLABLE LENGTHENER, which should not
begin a line.
Carian: U+102A0—U+102D0
The Carian script is an alphabet used to write the Carian language, an
ancient Indo-European language of southwestern Anatolia. The
script dates from the first millennium BCE and scriptio
continua is common. It does not have
script-specific digits and has no script-specific punctuation. The script is written both left to right
and right to left; it is encoded as a left-to-right script.
Lycian: U+10280—U+1029C
Lycian is used to write an ancient Indo-European language of
Western Anatolia. The script is related to Greek and was used from
about 500 BCE until 200 BCE. It is a simple alphabetic script and
is written from left to right. It uses word dividers. Spaces may
be used in modern editions of the text. The script does not
include any script-specific digits or punctuation.
Lydian: U+10920—U+1093F
Like Lycian, Lydian is an ancient Indo-European language that
was used in Western Anatolia. The script is related to Greek and
is a simple alphabetic script, whose use is documented from the
late-eighth century BCE to the third century BCE. Most Lydian
texts have right-to-left directionality and use spaces. There is
one script-specific punctuation mark, the Lydian quotation mark. The script
does not have distinct digits.
In addition to new scripts, many characters were added to existing scripts or to the repertoire of symbols in the standard. The following sections briefly describe these additions.
For Myanmar and Malayalam, in particular, the additional characters have significant impact on the representation of text, so those sections are more detailed.
For convenience in reference, the discussion of significant character additions is organized roughly by chapters in the standard.
Chapter 6: Writing Systems and Punctuation
Editorial Punctuation Marks
Editorial punctuation marks added in Unicode 5.1 include a set of medievalist
editorial punctuation marks, as well as corner brackets used in critical
editions of ancient and medieval texts. U-brackets and double parentheses
employed by Latinists have also been added, as well as a few punctuation marks
that appear in dictionaries.
The medievalist editorial marks are used by a variety of European traditions
and will be very useful for those working on medieval manuscripts. Corner
brackets are likewise used widely, appearing in transliterated Cuneiform and
ancient Egyptian texts, for example.
Chapter 7: European Alphabetic Scripts
Latin, Greek, and Cyrillic
A number of characters were added to these scripts, including characters
for German, Mayan, Old Church Slavonic, Mordvin, Kurdish, Aleut, Chuvash,
medievalist Latin, and Finnish dictionary use.
The Latin additions include U+1E9E LATIN CAPITAL LETTER SHARP S for use in
German. The recommended uppercase form for most casing operations on
U+00DF LATIN SMALL LETTER SHARP S continues to be "SS", as a capital
sharp s is only used in restricted circumstances. See
Tailored Casing Operations.
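For illustration, the default (untailored) full case mapping of U+00DF remains "SS"; producing U+1E9E instead requires a tailored operation. The following non-normative Python sketch shows both behaviors; the function name and the simple substitution strategy are illustrative only:

    # Default (untailored) uppercasing of U+00DF yields "SS".
    assert "straße".upper() == "STRASSE"

    # A tailored casing operation might map U+00DF to U+1E9E (capital sharp s)
    # before uppercasing; this substitution is an illustrative sketch, not a
    # default Unicode case mapping.
    def uppercase_german_tailored(text: str) -> str:
        return text.replace("\u00df", "\u1e9e").upper()

    print(uppercase_german_tailored("straße"))  # STRAẞE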
Chapter 8: Middle Eastern Scripts
Arabic
A number of Arabic characters were added in Version 5.1: characters in support of minority languages, four Qur'anic Arabic characters, and a greatly extended Arabic math repertoire. Sixteen characters were added in support of the Khowar, Torwali, and Burushaski languages spoken primarily in Pakistan, and a set of eight Arabic characters was added in support of Persian and Azerbaijani. The 27 newly added Arabic math characters include arrows, mathematical operators, and letterlike symbols.
Chapter 9: South Asian Scripts-I
Indic
A number of useful characters were added to Indic scripts. The Devanagari
candra-a, Gurmukhi udaat and
yakash, and additional Oriya, Tamil, and Telugu characters
were added. These new characters expand the support of Sanskrit in those
scripts, further the support of minority languages, and encode old
fraction and number systems.
Tamil is less complex than some of the other Indic scripts and, both conceptually and in processing, can be treated as an atomic set of elements: consonants, stand-alone vowels, and syllables. The following chart shows these atomic elements, with the corresponding Unicode code points. These elements have also been accepted for a future version of the Unicode Standard as Tamil named
character sequences: see the NamedSequencesProv file in
the Unicode Character Database.
In applications such as natural language processing, where it may be useful to treat these
units as single code points for ease of processing, they can be mapped to a segment of the Private Use
Area.
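For example, an implementation might map each atomic element to a private-use code point before processing and map it back afterward. The following non-normative Python sketch shows the idea; the particular private-use assignments are arbitrary, and only a few elements are listed:

    # Hypothetical Private Use Area assignments for a few Tamil atomic elements.
    ATOMIC_TO_PUA = {
        "\u0B95": "\uE000",        # TAMIL LETTER KA
        "\u0B95\u0BBE": "\uE001",  # KA + VOWEL SIGN AA (syllable kaa)
        "\u0B95\u0BBF": "\uE002",  # KA + VOWEL SIGN I (syllable ki)
    }
    PUA_TO_ATOMIC = {pua: seq for seq, pua in ATOMIC_TO_PUA.items()}

    def to_pua(text: str) -> str:
        # Replace longer sequences first so that a syllable is not split
        # into its base consonant plus a stray vowel sign.
        for seq in sorted(ATOMIC_TO_PUA, key=len, reverse=True):
            text = text.replace(seq, ATOMIC_TO_PUA[seq])
        return text

    def from_pua(text: str) -> str:
        for pua, seq in PUA_TO_ATOMIC.items():
            text = text.replace(pua, seq)
        return text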
In the following "Tamil Vowels, Consonants, and Syllables" table, row 1 shows the ASCII representation of the vowel names. Column
1 shows the ASCII representation of the consonant names.
Tamil Vowels, Consonants, and Syllables
The most important new characters for Malayalam are the six new chillu characters, U+0D7A..U+0D7F, encoding dead consonants (those without an implicit vowel). To simplify the discussion here, the formal names of the characters are shortened to use the terms that are typically used in spoken discussion of the chillu characters: chillu-n for MALAYALAM LETTER CHILLU N, and so forth.
In Malayalam-language text, chillu characters never start a word. The chillu letters -nn, -n, -rr, -l, and -ll are quite common; chillu-k is not very common.
Prior to Unicode 5.1, the representation of text with chillus was problematic and not clearly described in the text of the standard. Because older data will use a different representation for chillus, implementations must be prepared to handle both kinds of data.
Table 1 shows the relation between the representation in Unicode Version 5.0 and earlier and the new representation in Version 5.1, for the chillu letters considered in isolation.
Table 1. Atomic Encoding of Chillus
| # | Visual  | Representation in 5.0 and Prior     | Preferred 5.1 Representation     |
| 1 | (glyph) | NNA, VIRAMA, ZWJ (0D23, 0D4D, 200D) | 0D7A MALAYALAM LETTER CHILLU NN  |
| 2 | (glyph) | NA, VIRAMA, ZWJ (0D28, 0D4D, 200D)  | 0D7B MALAYALAM LETTER CHILLU N   |
| 3 | (glyph) | RA, VIRAMA, ZWJ (0D30, 0D4D, 200D)  | 0D7C MALAYALAM LETTER CHILLU RR  |
| 4 | (glyph) | LA, VIRAMA, ZWJ (0D32, 0D4D, 200D)  | 0D7D MALAYALAM LETTER CHILLU L   |
| 5 | (glyph) | LLA, VIRAMA, ZWJ (0D33, 0D4D, 200D) | 0D7E MALAYALAM LETTER CHILLU LL  |
| 6 | (glyph) | undefined                           | 0D7F MALAYALAM LETTER CHILLU K   |
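Implementations that need to treat both kinds of data uniformly can, for example, fold the older sequences of Table 1 into the atomic chillu characters. The following non-normative Python sketch applies the mappings of Table 1 directly (chillu-k has no older representation, so it is not listed):

    # Fold Unicode 5.0-style chillu sequences into the 5.1 atomic characters.
    OLD_TO_ATOMIC = {
        "\u0D23\u0D4D\u200D": "\u0D7A",  # NNA, VIRAMA, ZWJ -> CHILLU NN
        "\u0D28\u0D4D\u200D": "\u0D7B",  # NA,  VIRAMA, ZWJ -> CHILLU N
        "\u0D30\u0D4D\u200D": "\u0D7C",  # RA,  VIRAMA, ZWJ -> CHILLU RR
        "\u0D32\u0D4D\u200D": "\u0D7D",  # LA,  VIRAMA, ZWJ -> CHILLU L
        "\u0D33\u0D4D\u200D": "\u0D7E",  # LLA, VIRAMA, ZWJ -> CHILLU LL
    }

    def fold_chillus(text: str) -> str:
        for old, atomic in OLD_TO_ATOMIC.items():
            text = text.replace(old, atomic)
        return text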
The letter rra (U+0D31) is normally read /r/. Repetition of that sound is written with two occurrences of the letter, side by side. Each occurrence can bear a vowel sign.
Repetition of the letter, written either side by side or stacked, is also used for the sound /tt/. In this case, the two fundamentally behave as a digraph. The digraph can bear a vowel sign, in which case the digraph as a whole acts graphically as an atom: a left vowel part goes to the left of the digraph and a right vowel part goes to the right of the digraph. Historically, the side-by-side form was used until around 1960, when the stacked form began appearing and supplanted the side-by-side form.
The use of the side-by-side form in text is ambiguous. The reader must in general use the context to determine whether it is read /rr/ or /tt/. Only when a vowel part appears between the two occurrences is the reading unambiguously /rr/.
Note: The same situation is common in many other orthographies. For example, th in English can be a digraph (cathode) or two separate letters (cathouse); gn in French can be a digraph (oignon) or two separate letters (gnome).
The sequence <0D31, 0D31> represents the side-by-side form, regardless of the reading of that text. The sequence <0D31, 0D4D, 0D31> represents the stacked form. In both cases, vowel signs can be used as appropriate:
Table 2. /rr/ and /tt/
| #  | Visual  | Code Sequence                                     | Reading       | Gloss                          |
| 1  | (glyph) | 0D2A 0D3E 0D31 0D31                               | /paatta/      | cockroach                      |
| 2  | (glyph) | 0D2A 0D3E 0D31 0D4D 0D31                          | /paatta/      | cockroach                      |
| 3  | (glyph) | 0D2E 0D3E 0D31 0D46 0D31 0D3E 0D32 0D3F           | /maattoli/    | echo                           |
| 4  | (glyph) | 0D2E 0D3E 0D31 0D4D 0D31 0D46 0D3E 0D32 0D3F      | /maattoli/    | echo                           |
| 5  | (glyph) | 0D2C 0D3E 0D31 0D31 0D31 0D3F                     | /baattari/    | battery                        |
| 6  | (glyph) | 0D2C 0D3E 0D31 0D4D 0D31 0D31 0D3F                | /baattari/    | battery                        |
| 7  | (glyph) | 0D38 0D42 0D31 0D31 0D31 0D4D                     | /suuratt/     | (name of a place)              |
| 8  | (glyph) | 0D38 0D42 0D31 0D31 0D4D 0D31 0D4D                | /suuratt/     | (name of a place)              |
| 9  | (glyph) | 0D1F 0D46 0D02 0D2A 0D31 0D31 0D3F                | /temparari/   | temporary (English loan word)  |
| 10 | (glyph) | 0D32 0D46 0D15 0D4D 0D1A 0D31 0D31 0D4B 0D1F 0D4D | /lekcararoot/ | to the lecturer                |
A very similar situation exists for the combination of chillu-n and rra. When the two are written side by side, the combination can be read either /nr/ or /nt/, while the conjunct form is always read /nt/.
The sequence <0D7B, 0D31> represents the side-by-side form, regardless of the reading of that text. The sequence <0D7B, 0D4D, 0D31> represents the conjunct form. In both cases, vowel signs can be used as appropriate:
Table 3. /nr/ and /nt/
| # | Visual  | Code Sequence            | Reading   | Gloss                  |
| 1 | (glyph) | 0D06 0D7B 0D47 0D31 0D3E | /aantoo/  | (proper name)          |
| 2 | (glyph) | 0D06 0D7B 0D4D 0D31 0D47 0D3E | /aantoo/  | (proper name)          |
| 3 | (glyph) | 0D0E 0D7B 0D31 0D47 0D3E 0D7A | /enrool/  | enroll (English word)  |
The Unicode Technical Committee is aware of the existence of a repha form of ra, which looks like a dot. The representation of that form is currently under investigation.
Other New Malayalam Characters
The four new characters, avagraha, vocalic rr sign,
vocalic l sign, and vocalic ll sign, are only used to
write Sanskrit words in the Malayalam script. The avagraha
is the most common of the four, followed by the vocalic l
sign. There are six new characters used for the archaic
number system, including characters for numbers 10, 100, 1000
and fractions. There is also a new character, the date mark,
used only for the day of the month in dates; it is roughly
the equivalent of "th" in "Jan 5th." Although it has been used in
modern times, it is not seen much in contemporary use.
Chapter 11: Southeast Asian Scripts
The following updated text replaces the Myanmar block introduction on
pp. 379-381 of The Unicode Standard, Version 5.0.
The Myanmar script is used to write Burmese, the majority language of Myanmar (formerly
called Burma). Variations and extensions of the script are used to write other languages
of the region, such as Mon, Karen, Kayah, Shan, and Palaung, as well as Pali and Sanskrit. The Myanmar
script was formerly known as the Burmese script, but the term “Myanmar” is now preferred.
The Myanmar writing system derives from a Brahmi-related script borrowed from South
India in about the eighth century to write the Mon language. The first inscription in the
Myanmar script dates from the eleventh century and uses an alphabet almost identical to
that of the Mon inscriptions. Aside from rounding of the originally square characters, this
script has remained largely unchanged to the present. It is said that the rounder forms were
developed to permit writing on palm leaves without tearing the writing surface of the leaf.
The Myanmar script shares structural features with other Brahmi-based scripts such as Khmer: consonant symbols include an inherent “a” vowel; various signs are attached to a
consonant to indicate a different vowel; medial consonants are attached to the consonant; and the overall writing direction is from left to right.
Standards. There is not yet an official national standard for the encoding of Myanmar/Burmese.
The current encoding was prepared with the consultation of experts from the Myanmar
Information Technology Standardization Committee (MITSC) in Yangon (Rangoon).
The MITSC, formed by the government in 1997, consists of experts from the Myanmar
Computer Scientists’ Association, Myanmar Language Commission, and Myanmar Historical
Commission.
Encoding Principles. As with Indic scripts, the Myanmar encoding represents only the
basic underlying characters; multiple glyphs and rendering transformations are required to
assemble the final visual form for each syllable. Characters and
combinations that may appear visually identical in some fonts, such as U+101D MYANMAR LETTER WA and U+1040 MYANMAR DIGIT ZERO, are distinguished by their underlying
encoding.
Composite Characters. As is the case in many other scripts, some Myanmar letters or signs
may be analyzed as composites of two or more other characters and are not encoded separately.
The following are examples of Myanmar letters represented by combining character sequences:
myanmar vowel sign àw
U+1000 ka + U+1031 vowel sign e + U+102C vowel sign aa → kàw
myanmar vowel sign aw
U+1000 ka + U+1031 vowel sign e + U+102C vowel sign aa +
U+103A asat → kaw
myanmar vowel sign o
U+1000 ka + U+102D vowel sign i + U+102F vowel sign u → ko
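In coded form, these composites are simply sequences of the characters listed above; for example (a non-normative Python sketch using the code points from the examples):

    kaw_long  = "\u1000\u1031\u102C"        # ka + vowel sign e + vowel sign aa -> kàw
    kaw_short = "\u1000\u1031\u102C\u103A"  # the same sequence closed by asat -> kaw
    ko        = "\u1000\u102D\u102F"        # ka + vowel sign i + vowel sign u  -> ko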
Encoding Subranges. The basic consonants, medials, independent vowels, and dependent vowel
signs required for writing the Myanmar language are encoded at the beginning of the
Myanmar range. Extensions of each of these categories for use in writing other languages are appended at the end of the range. In between these two sets lie
the script-specific signs, punctuation, and digits.
Conjuncts. As in other Indic-derived scripts, conjunction of two
consonant letters is indicated by the insertion of a virama U+1039 MYANMAR SIGN VIRAMA between them. It causes the second consonant to be displayed in a smaller form
below the first; the virama is not visibly rendered.
Kinzi. The conjunct form of U+1004 MYANMAR LETTER NGA is rendered as a superscript sign called kinzi. That superscript sign is not encoded as a separate mark, but instead is simply the rendering form of the nga in a conjunct context. The nga is represented in logical order first in the sequence, before the consonant which actually bears the visible kinzi superscript sign in final rendered form. For example, kinzi applied to
U+1000
MYANMAR LETTER KA would be written via the following sequence:
U+1004 nga + U+103A asat + U+1039 virama + U+1000 ka → ka
Note that this sequence includes both U+103A asat and U+1039 virama between the nga and the ka. Use of the virama alone would ordinarily indicate stacking of the consonants, with a small ka appearing under the nga. Use of the asat killer in addition to the virama gives a sequence that can be distinguished from normal stacking: the sequence
<U+1004, U+103A, U+1039> always maps unambiguously to a visible kinzi superscript sign on the following consonant.
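The difference between the two coded sequences can be illustrated directly (a non-normative Python sketch):

    # kinzi on ka: nga + asat + virama + ka
    kinzi_ka = "\u1004\u103A\u1039\u1000"
    # ordinary stacking: nga + virama + ka (a small ka rendered under the nga)
    stacked_nga_ka = "\u1004\u1039\u1000"
    assert kinzi_ka != stacked_nga_ka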
Medial Consonants. The Myanmar script traditionally distinguishes a set of subscript “medial” consonants:
forms of ya, ra, wa, and ha that are considered to be modifiers of the syllable’s vowel.
Graphically, these medial consonants are sometimes written as subscripts, but sometimes,
as in the case of ra, they surround the base consonant instead. In the Myanmar encoding,
the medial consonants are encoded separately. For
example, the word krwe [kjwei] (“to drop off”) would be written via the following
sequence:
U+1000 ka + U+103C medial ra + U+103D medial wa + U+1031 vowel sign e → krwe
In Pali and Sanskrit texts written in the Myanmar script, as well as in older orthographies of Burmese, the consonants ya, ra, wa, and ha are sometimes rendered in subjoined form.
In those cases, U+1039 MYANMAR SIGN VIRAMA and the regular form of the consonant are used.
Asat. The asat, or killer, is a visibly displayed sign.
In some cases it indicates that the inherent vowel sound of a consonant
letter is suppressed. In other cases it combines with other characters
to form a vowel letter. Regardless of its function, this visible sign
is always represented by the character U+103A MYANMAR SIGN ASAT.
Contractions. In a few Myanmar words, the repetition of a consonant sound is written with a single occurrence of the letter for the consonant sound together with an asat sign. This asat sign occurs immediately after the double-acting consonant in the coded representation:
U+101A ya + U+1031 vowel sign e + U+102C vowel sign aa + U+1000 ka + U+103A asat + U+103B medial ya + U+102C vowel sign aa + U+1038 visarga → man, husband
U+1000 ka + U+103B medial ya + U+103D medial wa + U+1014 na + U+103A asat + U+102F vowel sign u + U+1015 pa + U+103A asat → I (first person singular)
Great sa. The great sa is encoded as U+103F MYANMAR LETTER GREAT SA. This letter should be represented with <U+103F>, while the sequence <U+101E, U+1039, U+101E> should be used for the regular conjunct form of two sa.
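In coded form (a non-normative Python sketch):

    great_sa = "\u103F"                    # MYANMAR LETTER GREAT SA
    two_sa_stacked = "\u101E\u1039\u101E"  # sa + virama + sa (regular conjunct of two sa)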
Tall aa. Two letter forms, tall aa and aa, are both used to write the sound /aa/. In Burmese orthography, both letters are used, depending on the context. In S'gaw Karen orthography, only the tall form is used. For this reason, two characters are encoded: U+102B MYANMAR VOWEL SIGN TALL AA and U+102C MYANMAR VOWEL SIGN AA. In Burmese texts, the coded character appropriate to the context should be used.
Ordering of Syllable Components. Dependent vowels and other signs are encoded after the
consonant to which they apply, except for kinzi, which precedes the consonant. Characters
occur in the relative order shown in Table 11-3.
Table 11-3. Myanmar Syllabic Structure
| Name                          | Encoding                                                   |
| kinzi                         | <U+1004, U+103A, U+1039>                                   |
| consonant and vowel letters   | [U+1000..U+102A, U+103F, U+104E]                           |
| asat sign (for contractions)  | U+103A                                                     |
| subscript consonant           | <U+1039, [U+1000..U+1019, U+101C, U+101E, U+1020, U+1021]> |
| medial ya                     | U+103B                                                     |
| medial ra                     | U+103C                                                     |
| medial wa                     | U+103D                                                     |
| medial ha                     | U+103E                                                     |
| vowel sign e                  | U+1031                                                     |
| vowel sign i, ii, ai          | [U+102D, U+102E, U+1032]                                   |
| vowel sign u, uu              | [U+102F, U+1030]                                           |
| vowel sign tall aa, aa        | [U+102B, U+102C]                                           |
| anusvara                      | U+1036                                                     |
| asat sign                     | U+103A                                                     |
| dot below                     | U+1037                                                     |
| visarga                       | U+1038                                                     |
U+1031 MYANMAR VOWEL SIGN E is encoded after its consonant (as in the earlier example),
although in visual presentation its glyph appears before (to the left of) the consonant
form.
Table 11-3 nominally refers to the character sequences used in representing the syllabic structure of the Burmese language proper. It would require further extensions and modifications to cover the various other languages, such as Karen, Mon, and Shan, which also use the Myanmar script.
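For Burmese text, the ordering in Table 11-3 can be approximated with a regular expression. The following non-normative Python sketch assumes that each component is optional and occurs at most once, which Table 11-3 itself does not stipulate; it is an illustration of the ordering, not a normative syllable definition:

    import re

    MYANMAR_SYLLABLE = re.compile(
        "(?:\u1004\u103A\u1039)?"                             # kinzi
        "[\u1000-\u102A\u103F\u104E]"                         # consonant or vowel letter
        "\u103A?"                                             # asat sign (for contractions)
        "(?:\u1039[\u1000-\u1019\u101C\u101E\u1020\u1021])?"  # subscript consonant
        "\u103B?\u103C?\u103D?\u103E?"                        # medials ya, ra, wa, ha
        "\u1031?"                                             # vowel sign e
        "[\u102D\u102E\u1032]?"                               # vowel sign i, ii, ai
        "[\u102F\u1030]?"                                     # vowel sign u, uu
        "[\u102B\u102C]?"                                     # vowel sign tall aa, aa
        "\u1036?"                                             # anusvara
        "\u103A?"                                             # asat sign
        "\u1037?"                                             # dot below
        "\u1038?"                                             # visarga
    )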
Spacing. Myanmar does not use any whitespace between words. If word boundary indications
are desired—for example, for the use of automatic line layout algorithms—the character
U+200B ZERO WIDTH SPACE should be used to place invisible marks for such breaks.
The zero width space can grow to have a visible width when justified. Spaces are used to mark phrases. Some phrases are relatively short (two or three syllables).
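Given text that has already been segmented into words (segmentation itself is outside the scope of this example), break opportunities can be marked as follows (a non-normative Python sketch):

    ZWSP = "\u200B"  # ZERO WIDTH SPACE

    def mark_word_breaks(words: list[str]) -> str:
        # Join already-segmented words with invisible break opportunities.
        return ZWSP.join(words)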
Chapter 15: Symbols
Mathematical Symbols
As the mathematical community completes its migration to Unicode, additional
mathematical symbols have been found to be needed; 29 new symbols
were added for the publication of mathematical and technical material, to
support mathematically oriented publications, and to address the needs of
mathematical markup languages, such as MathML.
The new math characters include a non-combining diacritic, U+27CB
MATHEMATICAL SPACING LONG SOLIDUS OVERLAY, and a new operator, U+2064
INVISIBLE PLUS. The non-combining diacritic can be used to decorate a
mathematical variable or even an entire expression. By convention, such a
decoration is indicated with a Unicode character, even in markup
languages, such as MathML. The invisible plus operator is
used to unambiguously represent expressions like 3½.
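For example, the mixed number 3½ can be represented with the invisible plus between the integer and the fraction (a non-normative Python sketch):

    INVISIBLE_PLUS = "\u2064"                           # U+2064 INVISIBLE PLUS
    three_and_a_half = "3" + INVISIBLE_PLUS + "\u00BD"  # 3½, read as 3 + 1/2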
New delimiters, arrows, squares and other math symbols were also added.
Phaistos Disc Symbols: U+101D0—U+101FF
The Phaistos disc was found during an archeological dig in
Phaistos, Crete about a century ago. The disc probably dates from
the mid-18th to the mid-14th century BCE. Unlike other ancient
scripts, relatively little is known about the Phaistos Disc
Symbols. The symbols have not been deciphered and the disc remains
the only known example of the writing. Nonetheless, the disc has
engendered great interest, and numerous scholars and amateurs spend
time discussing the symbols.
Mahjong Tile Symbols: U+1F000—U+1F02F
Mahjong tile symbols encode a set of tiles used to play the
popular Chinese game of Mahjong. The exact origin of the game is
unknown, but it has been around since at least the mid-nineteenth
century, and its popularity spread to Japan, Britain and the US in
the early twentieth century. There is some variation in the set of
tiles used, so the Unicode Standard encodes a superset of the
tiles used in various traditions of the game. The main set of
tiles is made up of three suits with nine tiles each: the Bamboos,
the Circles and the Characters. Additional tiles include the
Dragons, the Winds, the Flowers and the Seasons.
Domino Tile Symbols: U+1F030—U+1F09F
Domino tile symbols encode the "double-six" set of tiles used
to play the game of dominoes, which derives from Chinese tile
games dating back to the twelfth century. The tiles are encoded in
horizontal and vertical orientations, thus, for example, both
U+1F081 DOMINO TILE VERTICAL-04-02 and U+1F04F DOMINO TILE HORIZONTAL-04-02 are
encoded.