Unicode Technical Note #5
Canonical Equivalence in Applications
Version | 1 |
Authors | Markus W. Scherer |
Date | 2002-07-24 |
This Version | http://www.unicode.org/notes/tn5/tn5-1.html |
Previous Version | n/a |
Latest Version | http://www.unicode.org/notes/tn5 |
This document describes methods and formats for efficient processing of text under canonical equivalence, as defined in UAX #15, Unicode Normalization Forms [UAX15].
The reader should be familiar with [UAX15] and with Unicode normalization and canonical equivalence in general.
This document is a Unicode Technical Note. It is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium. For general information on Unicode Technical Notes, see http://www.unicode.org/notes.
As described in [UAX15], multiple Unicode strings can be canonically equivalent. Such strings are not distinguishable by a user and therefore the standard requires them to be treated the same (see Conformance Requirement C9 of Unicode 3.0 [C9]). This means that canonically equivalent strings should normally be displayed identically, sort identically, be case-mapped equivalently, and so on.
It is sometimes necessary to normalize a string before processing in order to fulfill this requirement. Such additional normalization decreases performance.
For most strings it is possible to avoid or minimize an additional normalization step by preparing the supporting data appropriately. This document describes conditions that such strings must meet, including a test algorithm, and describes how to modify the supporting data accordingly. It also reintroduces a variation of the NFC normalization form that transforms strings into a unique, compact form that always fulfills these conditions.
While normalization is very fast, for time-critical processes the cost of simply traversing the text an extra time (or of allocating extra buffers) can be too high.
This Technical Note collects concepts and definitions from earlier documents as referenced.
Unicode text processing is primarily based on per-code point data. At the same time, many processes are specified for text in NFD, because this simplifies the specification of a process, including the requirement to produce equivalent results for canonically equivalent strings. Any conformant implementation must yield equivalent results even if it does not normalize its input.
For example, the Unicode Collation Algorithm [UCA] assigns collation weights to code points (and sometimes to sequences of code points). An implementation must map canonically equivalent strings to the same sequences of collation weights.
Decomposition to NFD is unnecessary for many strings if the implementation data also covers precomposed characters. That is, the data maps each precomposed code point to the same values (here, the same sequence of collation weights) that it assigns to the canonical decomposition of that code point. Such data is called “canonically closed”, and the process of building such data from the original table is called “canonical closure”.
For example, if a collation table maps the character A to the collation weight (A) and the combining ring to the weight (ring), then the string A+ring will be mapped to the sequence of collation weights (A)+(ring). A canonically closed collation table also directly maps the character A-ring to that same sequence of weights.
Character | Weights | Comment |
---|---|---|
A | (A) | From original table |
ring | (ring) | From original table |
A-ring | (A)(ring) | From canonical closure |
The decomposition is effectively performed by the data lookup instead of on the text: a precomposed character carries the same data as its canonical decomposition. This avoids the normalization step.
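As an illustration, here is a minimal sketch of canonical closure in Java, using ICU4J's Normalizer2 for the canonical decomposition (assuming ICU4J 49 or later for getNFDInstance). The weight table and its string “weights” are purely illustrative, not real collation data.

```java
import com.ibm.icu.text.Normalizer2;
import java.util.HashMap;
import java.util.Map;

public class CanonicalClosureSketch {
    public static void main(String[] args) {
        // Hypothetical per-code point weight table; the values are illustrative only.
        Map<Integer, String> weights = new HashMap<>();
        weights.put((int) 'A', "(A)");   // from original table
        weights.put(0x030A, "(ring)");   // COMBINING RING ABOVE, from original table

        // Canonical closure: derive the entry for a precomposed character
        // from the entries of its canonical decomposition.
        Normalizer2 nfd = Normalizer2.getNFDInstance();
        int aRing = 0x00C5;              // LATIN CAPITAL LETTER A WITH RING ABOVE
        StringBuilder closed = new StringBuilder();
        nfd.normalize(new String(Character.toChars(aRing)))
           .codePoints()
           .forEach(cp -> closed.append(weights.get(cp)));
        weights.put(aRing, closed.toString());

        System.out.println(weights.get(aRing)); // prints (A)(ring)
    }
}
```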
An implementation can use canonically closed data without normalizing input strings only if the strings fulfill a certain condition: The decompositions of adjacent code points must not need canonical reordering. This guarantees that the concatenation of per-code point data (like collation weights) yields the same data sequence as for the NFD form of the strings.
Since all NFD strings and most NFC strings fulfill this condition, they are said to be in “Fast C or D” form, or “FCD”. Note that FCD does not describe a unique form: Unlike with regular Normalization Forms, two FCD strings can be canonically equivalent but not identical.
Example:
Text | Weights | Comment |
---|---|---|
A | (A) | |
ring | (ring) | |
cedilla | (cedilla) | |
A-ring | (A)(ring) | NFC, FCD |
A+ring | (A)+(ring) | NFD, FCD |
A+cedilla+ring | (A)+(cedilla)+(ring) | NFD, FCD |
A-ring+cedilla | (A)(ring)+(cedilla) | wrong: NFC but not FCD; needs reordering before data lookup |
Strings can be checked for FCD efficiently. The algorithm is best expressed in pseudo-code:
```
// get the combining class of the first code point of c's decomposition
unsigned byte getLeadCC(code_point c) {
    string d = getCanonicalDecomposition(c);
    code_point first = d.getFirstCodePoint();
    return getCombiningClass(first);
}

// get the combining class of the last code point of c's decomposition
unsigned byte getTrailCC(code_point c) {
    string d = getCanonicalDecomposition(c);
    code_point last = d.getLastCodePoint();
    return getCombiningClass(last);
}

// check if a string is in FCD
boolean checkFCD(string s) {
    code_point c;          // current code point
    unsigned byte prevCC;  // trail cc of the previous code point
    unsigned byte cc;      // lead cc of the current code point

    prevCC = 0;            // initialize
    for (each code point c in s) {
        cc = getLeadCC(c);
        if (cc != 0 && cc < prevCC) {
            // s is not in FCD:
            // the concatenation of per-code point decompositions needs reordering
            return false;
        }
        prevCC = getTrailCC(c);
    }
    return true;           // s is in FCD
}
```
The functions getLeadCC and getTrailCC can be replaced by a single table lookup of the code point, returning a 16-bit value that contains both combining classes, which makes checkFCD very fast.
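For illustration, here is a runnable Java version of checkFCD, a minimal sketch using ICU4J's Normalizer2 and UCharacter (assuming ICU4J 49 or later for getNFDInstance; the single-table-lookup optimization mentioned above is omitted for clarity):

```java
import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.text.Normalizer2;

public class FCDCheckSketch {
    private static final Normalizer2 NFD = Normalizer2.getNFDInstance();

    // combining class of the first code point of c's canonical decomposition
    static int getLeadCC(int c) {
        String d = NFD.getDecomposition(c); // null if c has no decomposition
        int first = (d == null) ? c : d.codePointAt(0);
        return UCharacter.getCombiningClass(first);
    }

    // combining class of the last code point of c's canonical decomposition
    static int getTrailCC(int c) {
        String d = NFD.getDecomposition(c);
        int last = (d == null) ? c : d.codePointBefore(d.length());
        return UCharacter.getCombiningClass(last);
    }

    static boolean checkFCD(String s) {
        int prevCC = 0; // trail cc of the previous code point
        for (int i = 0; i < s.length();) {
            int c = s.codePointAt(i);
            i += Character.charCount(c);
            int cc = getLeadCC(c); // lead cc of the current code point
            if (cc != 0 && cc < prevCC) {
                // concatenated per-code point decompositions would need reordering
                return false;
            }
            prevCC = getTrailCC(c);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(checkFCD("A\u0327\u030A")); // true:  A + cedilla + ring
        System.out.println(checkFCD("\u00C5\u0327"));  // false: A-ring + cedilla
    }
}
```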
Mark Davis first described FCD in the ICU Collation Design Document [Collation].
As demonstrated above, not all NFC strings are also in FCD. The reason is that the composition step in the NFC algorithm sometimes combines two characters even if they are separated by one or more other characters. In the case of A+cedilla+ring the A combines with the ring even though the cedilla is between them. The NFC result, A-ring+cedilla, is not in FCD.
A modified NFC algorithm will produce unique strings that are always in FCD. The only modification needed is to remove the discontiguous composition. This is called here “FCC” for “Fast C Contiguous”. FCC is almost as compact as NFC, and most legacy codepages also convert 1:1 into FCC strings. In fact, for most strings the NFC and FCC Forms are identical, which makes the transformation between them fast (mostly only a test).
Martin Dürst proposed this form in an Internet Draft titled “Normalization of Internationalized Identifiers” [altNFC] as an alternative for the definition of NFC. The current NFC algorithm was chosen because it tends to be more compact, and because NFC strings are more “stable”: appending a combining mark causes a precomposed character to unravel in fewer cases. (In FCC, appending a cedilla to A-ring results in A+cedilla+ring while in NFC the A-ring stays intact and the result is A-ring+cedilla.)
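A sketch using the Normalizer2 API of later ICU4J releases (4.4 and up; this API postdates the ICU version discussed below), where contiguous composition is exposed as the COMPOSE_CONTIGUOUS mode, makes the difference easy to demonstrate:

```java
import com.ibm.icu.text.Normalizer2;

public class FCCSketch {
    public static void main(String[] args) {
        Normalizer2 nfc = Normalizer2.getNFCInstance();
        Normalizer2 fcc = Normalizer2.getInstance(
                null, "nfc", Normalizer2.Mode.COMPOSE_CONTIGUOUS);

        String s = "\u00C5\u0327"; // A-ring + cedilla
        // NFC: the A-ring stays intact; the string is already NFC.
        System.out.println(nfc.normalize(s).equals(s));                   // true
        // FCC: the precomposed character unravels to A + cedilla + ring.
        System.out.println(fcc.normalize(s).equals("A\u0327\u030A"));     // true
    }
}
```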
Sample code for FCC is a simple modification of the NFC sample code (see [UAX15]). In the function internalCompose (in Normalizer.java), replace

```java
lastClass = chClass;
```

with

```java
lastClass = 256;
```

to prevent a discontiguous combination across the current character, which failed to combine with the starter. (This makes one of the conditions that are tested for the actual combination fail: lastClass < chClass.)
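The following runnable sketch shows the shape of such a contiguous composition pass. It is not the actual UAX #15 sample code: composePair and getCombiningClass come from ICU4J (an assumption; ICU4J 49 or later), and the input is assumed to be already canonically decomposed and reordered.

```java
import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.text.Normalizer2;

public class FccComposeSketch {
    private static final Normalizer2 NFC = Normalizer2.getNFCInstance();

    static String composeContiguous(String decomposed) {
        StringBuilder target = new StringBuilder();
        int starterCh = decomposed.codePointAt(0);
        int starterPos = 0;
        target.appendCodePoint(starterCh);
        int lastClass = UCharacter.getCombiningClass(starterCh);
        if (lastClass != 0) {
            lastClass = 256; // defend against a string starting with a combining mark
        }
        for (int i = Character.charCount(starterCh); i < decomposed.length();) {
            int ch = decomposed.codePointAt(i);
            i += Character.charCount(ch);
            int chClass = UCharacter.getCombiningClass(ch);
            int composite = NFC.composePair(starterCh, ch); // negative if no composite
            if (composite >= 0 && (lastClass < chClass || lastClass == 0)) {
                // ch combines with the starter; in the contiguous version the
                // starter is always the last code point in target, so replace it.
                target.setLength(starterPos);
                target.appendCodePoint(composite);
                starterCh = composite;
            } else {
                if (chClass == 0) { // ch is a new starter
                    starterPos = target.length();
                    starterCh = ch;
                }
                target.appendCodePoint(ch);
                // NFC sample: lastClass = chClass; (permits a later mark to
                // combine with the starter across ch)
                lastClass = 256; // FCC: block discontiguous composition
            }
        }
        return target.toString();
    }

    public static void main(String[] args) {
        // A + cedilla + ring stays decomposed under FCC:
        System.out.println(composeContiguous("A\u0327\u030A").equals("A\u0327\u030A"));
        // A + ring composes contiguously:
        System.out.println(composeContiguous("A\u030A").equals("\u00C5"));
    }
}
```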
In order to build canonically closed data, one needs to enumerate all strings that are canonically equivalent to an input string and are in FCD. Each such string is then assigned the same data as the original string.
Mark Davis developed an algorithm for enumerating canonically equivalent strings; for efficiency, an implementation precomputes supporting data rather than following the logical structure directly.
Many of the concepts presented here are implemented in the International Components for Unicode [ICU]. The ICU normalization code supports FCD in its quickCheck and isNormalized functions as well as in the normalization transformation functions. Transforming a string “into FCD” generally only decomposes and reorders as far as is necessary for the result to meet the FCD condition.
Incremental and whole-string FCD checks are used in the collation and string search services.
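For illustration, the later Normalizer2 API (ICU 4.4 and up; an assumption relative to this note's time of writing) exposes FCD directly as a normalization mode. A minimal sketch, assuming that API:

```java
import com.ibm.icu.text.Normalizer2;

public class FcdModeSketch {
    public static void main(String[] args) {
        Normalizer2 fcd = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.FCD);

        System.out.println(fcd.isNormalized("A\u0327\u030A")); // true:  already FCD
        System.out.println(fcd.isNormalized("\u00C5\u0327"));  // false: needs reordering

        // Transforming "into FCD" decomposes and reorders only as far as needed:
        System.out.println(fcd.normalize("\u00C5\u0327").equals("A\u0327\u030A")); // true
    }
}
```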
The CanonicalIterator class implements this algorithm for enumerating equivalent strings. It is used in the collation code for building collation tailoring tables.
As of ICU 2.2, the CanonicalIterator is an internal API and FCC is not yet implemented in ICU.
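In later ICU4J releases, CanonicalIterator became a public API in com.ibm.icu.text. Assuming such a release, enumerating the equivalent strings looks like this sketch:

```java
import com.ibm.icu.text.CanonicalIterator;

public class EquivalentStringsSketch {
    public static void main(String[] args) {
        CanonicalIterator it = new CanonicalIterator("\u00C5\u0327"); // A-ring + cedilla
        for (String s = it.next(); s != null; s = it.next()) {
            System.out.println(s); // one canonically equivalent string per line
        }
    }
}
```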
[C9] | Conformance Requirement C9 of Unicode 3.0 http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode_3_0_0 Requires that canonically equivalent text be treated equivalently. See also C10 and several other places in The Unicode Standard. |
[UAX15] | Unicode Standard Annex #15: Unicode Normalization Forms http://www.unicode.org/reports/tr15/ Specifies NF*, canonical equivalence, etc., together with the related passages of The Unicode Standard. |
[UCA] | Unicode Technical Standard #10: Unicode Collation Algorithm http://www.unicode.org/reports/tr10/ |
[Collation] | ICU Collation Design Document http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm First description of FCD by Mark Davis. |
[OptiNorm] | Optimizing the Usage of Normalization http://www.icu-project.org/docs/papers/normalization_iuc21.ppt Also IUC 21 session C14 http://www.unicode.org/iuc/iuc21/a348.html Presentation about FCD and other normalization optimizations. |
[altNFC] | Martin Dürst, Normalization of Internationalized Identifiers, Internet Draft draft-duerst-i18n-norm-00.txt, University of Zurich, July 1997 (expired). Section 2.1, Normalization of Combining Sequences, proposes what is called FCC here as an alternative to the current NFC. |
[ICU] | International Components for Unicode http://www.icu-project.org Open-source C/C++ and Java libraries for Unicode and I18N support. |
The following summarizes modifications from the previous version of this document.
2008.10.01 | Updated stale links in version 1 |
1 | Initial version |
© 2002 Markus W. Scherer. This publication is protected by copyright, and permission must be obtained from the author and Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use.
Use of this publication is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.