Draft Unicode Technical Report #30

Character Foldings

Authors	Asmus Freytag (asmus@unicode.org)
Date	2004-07-14
This Version	http://www.unicode.org/reports/tr30-4.html
Previous Version	http://www.unicode.org/reports/tr30-3.html
Latest Version	http://www.unicode.org/reports/tr30/
Revision	4

Summary

This report identifies a set of operations that map similar characters to a common target. Such operations, called character foldings, are used to ignore certain distinctions between similar characters. The report also provides an algorithm for applying these operations to searching, plus additional guidelines.

Status

This document is a Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in [References]. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

The foldings specified in this Technical Report cover Unicode, Version 4.0.

[Note to reviewers: significant changes in content have been highlighted as follows: Revision 3, Revision 4.]

Scope
Introduction
- 2.1 Search Term Folding
- 2.2 Relation to Normalization
- 2.3 Relation to Collation
Terms and Conventions used in this Document
- 3.1 Definitions
- 3.2 Notation
Specifications
- 4.1 Folding Algorithm
- 4.2 Specification of Foldings
Notes and Guidelines
- 5.1 General Notes
- 5.2 Problematic Foldings
- 5.3 Versioning and Stability

References
Acknowledgements
Modifications

1 Scope

This report identifies a set of character foldings, in other words, operations that map similar characters to a common target. Folding operations are most often used to temporarily ignore certain distinctions between similar characters. For example, they are useful for "fuzzy" or "loose" searches. More rarely, certain folding operations may be used to permanently remove distinctions. Each of the folding operations specified in this report has well-understood properties, and is appropriate in specific contexts. For example, some identifiers need case folding, some do not. Some text searches need to preserve superscript forms such as the trademark symbol ™, while others do not. For those and similar reasons, not all of these folding operations may be appropriate in a given context. See Section 5.2 Problematic Foldings for some of the more problematic folding or expansion operations.

The report also provides an algorithm for combining these operations for the purpose of searching or programmatic identifier matching. This algorithm combines canonical normalization with optional folding operations. This allows implementers to decide which folding option is useful for a particular purpose.

2 Introduction

A folding function or folding operation removes a distinction between related characters by mapping them to the same target. For example, a case folding may remove the case distinction by replacing upper and title case variants of a character with the lower case. In other words, foldings define equivalence classes, and choose a representative or target member for each equivalence class. Applying a folding maps all members of the equivalence class to the target.

Repeatedly applying the same folding does not change the result, a property called idempotency. For example, case folding an already case folded string makes no further changes to the string.
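Idempotency can be checked directly. The following minimal Python sketch uses the built-in `str.casefold` method as a stand-in for case folding; this is an assumption made for illustration, since that method follows the Unicode version shipped with the Python runtime rather than the data files referenced in this report.

```python
def case_fold(text: str) -> str:
    """Full case folding as provided by the Python runtime (assumed stand-in)."""
    return text.casefold()

# "Straße": LATIN SMALL LETTER SHARP S expands to "ss" under full case folding.
once = case_fold("Stra\u00dfe")
twice = case_fold(once)

# Folding an already folded string makes no further changes.
print(once == twice)
```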

Foldings have a domain of operation. All characters not in this domain are left unchanged. Two foldings that otherwise perform the same operation are distinct if their domain of operation is different. For example, case folding could be separated by script, creating a Latin case folding, Greek case folding, etc. While each implements the same operation, removing case distinction, they would be considered different foldings.
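The idea of a domain of operation can be sketched by wrapping one folding in a domain test. In the Python sketch below, `restrict_domain`, `is_latin`, and `is_greek` are hypothetical helper names, and the character-name prefix is only a rough approximation of a script test, chosen because Python's standard library does not expose the Script property.

```python
import unicodedata

def restrict_domain(folding, in_domain):
    """Return a folding that applies `folding` only to characters in the domain."""
    def restricted(text):
        # Characters outside the domain are left unchanged.
        return "".join(folding(ch) if in_domain(ch) else ch for ch in text)
    return restricted

def is_latin(ch):
    # Assumption: approximate the script by the character name prefix.
    return unicodedata.name(ch, "").startswith("LATIN ")

def is_greek(ch):
    return unicodedata.name(ch, "").startswith("GREEK ")

# Same operation (case folding), two different domains, hence two foldings.
latin_case_folding = restrict_domain(str.casefold, is_latin)
greek_case_folding = restrict_domain(str.casefold, is_greek)

sample = "A\u03a3"  # LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER SIGMA
print(latin_case_folding(sample), greek_case_folding(sample))
```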

Since foldings remove distinctions, they lose information. For that reason it is not possible to construct an inverse operation, except in the trivial case of an identity folding.

Foldings can be applied transiently, for example the same folding can be applied to two strings before comparison, or they can be used to permanently transform a text, for example when applying the Positional Forms Folding to convert legacy data that uses explicit Arabic positional shapes to the generic Arabic characters with implicit directionality.

In a more general sense, the elements of equivalence classes or the target of a folding may be character sequences, such as combining character sequences. Examples are the Katakana foldings where voiced syllables are written with two characters for halfwidth Katakana and single characters otherwise (ｶ + ﾞ → ガ). A character may be folded differently when it is part of different character sequences or when it is by itself. This means that foldings in general are context sensitive. Finally, the output of a folding operation, whether context sensitive or not, may result in a string that is longer or shorter than the input string.
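The change in string length can be observed with the halfwidth Katakana example above. This Python sketch uses NFKC from the standard `unicodedata` module as a rough stand-in for a width folding; that is an assumption for illustration only, since NFKC also applies every other compatibility mapping.

```python
import unicodedata

# HALFWIDTH KATAKANA LETTER KA + HALFWIDTH KATAKANA VOICED SOUND MARK
halfwidth = "\uff76\uff9e"

# Compatibility decomposition followed by canonical composition yields
# the single character KATAKANA LETTER GA.
folded = unicodedata.normalize("NFKC", halfwidth)

print(len(halfwidth), len(folded))
```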

For formal definitions of string and folding functions and their classification see UTR #23, Unicode Character Property Model [PropModel].

2.1 Search Term Folding

For the purpose of fuzzy text matching, including both programmatic identifier matching and general text searching, it is often necessary to selectively ignore otherwise meaningful distinctions between related characters, for example upper and lower case, presence of accent marks, etc. This process can be called search term folding. Depending on the search operation, different foldings need to be applied, and possible interactions between foldings and between folded characters and adjacent text that is not folded must be carefully managed, as they can affect the result of the search and introduce both false positive and false negative matches. The remainder of the document describes various foldings and discusses their use in the context of search term folding.

In the general case, different search term foldings are applied for different languages. For example, accent distinctions are ignorable for some languages, but not for others. In English the accent in words like naïve is optional, while to a Swedish user 'o' and 'ö' are distinct letters.

A significant aspect of string foldings for programmatic identifier matching is that the set of allowable identifier characters is restricted. Limiting the repertoire of identifier characters effectively restricts the domain of any foldings applied to them, thus avoiding some of the complications for identifier matching described in Section 5.2 Problematic Foldings.

2.2 Relation to Normalization

Normalization [Normalization] is part of any robust search term folding algorithm (see Section 4.1.1 Basic Folding Algorithm). However, there are some important differences between normalization and the foldings that make up search term folding. Normalization, in particular the canonical forms (NFC or NFD), is often intended for permanent transformation of data, while search term and other foldings are by nature transient. Further, unlike most of the other foldings considered here, normalization is not context-independent, since the equivalences are not between characters, but between character sequences.

As defined, the normalization forms offer only two broad levels of distinctions (they either preserve or do not preserve compatibility distinctions). The choice of which distinctions may be ignored for search term folding needs to be more specific; it depends on the nature of the operation. One size does not fit all.

For example, two of the normalization forms depend on compatibility mappings which replace characters with their compatibility decompositions. Applying certain of these compatibility mappings may lead to unintended false positive matches, preventing their use in general text searches. In combination with whole word search it could even lead to unintended false negatives. (See Section 5.2 Problematic Foldings.)

Furthermore, normalization and case folding are defined as separate and independent operations, but case folding often occurs together with other foldings in search term folding. In order to avoid inconsistencies, search term folding needs to address the interaction of case folding with the other steps in the algorithm.

Search term folding includes canonical normalization; however, the choice of using the composed (NFC) or decomposed Normalization Form (NFD) is of secondary importance in terms of defining the foldings. Due to the transient nature of search term folding, the distinction between NFC and NFD is immaterial, as long as the two forms are not mixed. However, if the data is known to be in one of the normalized forms, it would be computationally less expensive to operate the search in that form.
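The point about not mixing the two forms can be illustrated with a short Python sketch based on the standard `unicodedata` module. The helper name `equal_under` is hypothetical.

```python
import unicodedata

def equal_under(form, a, b):
    """Compare two strings after bringing both into the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "na\u00efve"      # uses LATIN SMALL LETTER I WITH DIAERESIS
decomposed = "nai\u0308ve"   # uses i + COMBINING DIAERESIS

# Either form works, as long as both sides use the same one.
same_nfc = equal_under("NFC", composed, decomposed)
same_nfd = equal_under("NFD", composed, decomposed)

# Mixing the forms compares a composed string against a decomposed one.
mixed = (unicodedata.normalize("NFC", composed)
         == unicodedata.normalize("NFD", decomposed))

print(same_nfc, same_nfd, mixed)
```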

2.3 Relation to Collation

Like foldings, the comparisons at the heart of the Unicode Collation Algorithm [UCA] also define equivalences. One can derive a specific folding by applying the collation algorithm with a particular strength and specific tailorings (see [UCA]). Foldings derived in this manner can be useful in searches that ignore similar distinctions to those ignored in collation. Such foldings are not the subject of this report.

3 Terms and Conventions used in this Document

3.1 Definitions

This technical report contains no formal definitions. For formal definitions of character properties and related terms, including string function and folding function, see UTR #23, Unicode Character Property Model [PropModel]. All other terms are used as defined in [Unicode], particularly in Chapter 3, Conformance, or in the online [Glossary].

3.2 Notation

The following notational conventions are used in this TR:

Notation	Description
XXXX..YYYY	indicates an inclusive range
XXXX, YYYY	indicates an alternative
<cccc>	refers to a compatibility mapping tag as defined in [Aliases]. This should not be confused with a character code sequence of length 1, which would be <XXXX> where X is an upper case hex digit.
Xy	refers to a particular value for the General Category property defined in [UnicodeData] , e.g. "Pd"
<+>	means a folding is contained in the Unicode data files, but its general use is not recommended
<*>	means that the source set are all characters listed with a mapping in the given data file
[CD]	Canonical Decomposition
[KD]	Compatibility Decomposition

4 Specifications

This section gives the specification of a number of useful foldings together with two algorithms that show how they can be applied in a consistent manner, such that folded data is normalized, and folding of normalized or unnormalized data gives the same results. The specification of the foldings transforms data in both canonical normalization forms without change in normalization form, so that they can be used with canonically composed or decomposed data. Due to the context-dependent nature of normalization, it is necessary to separately ensure that the folded data including any surrounding characters remains normalized.

4.1 Folding algorithm

All specifications of algorithms are in terms of results—all implementations that achieve the same result are fully equivalent. In particular, implementations commonly use optimization techniques, such as normalizing and folding 'on demand'. In each of the following algorithms, implementations may be able to avoid the loop implied by step (c) by performing additional transformations whose effect ensures the folding is stable under normalization.

4.1.1 Basic Folding Algorithm

The basic algorithm for search term folding can be stated as:

a. Apply optional folding operations
b. Apply canonical decomposition
c. Repeat (a) and (b) until stable
d. Apply composition if necessary

where each step is applied on the whole string, and applies to the result of the preceding operation.
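Steps (a) through (d) can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `fold_search_term` and `strip_marks` are hypothetical, `strip_marks` is broader than the accent removal listed in this report (it drops every nonspacing mark), and the `max_rounds` guard is an added assumption to protect against foldings that never stabilize.

```python
import unicodedata

def strip_marks(text):
    """Illustrative folding: drop nonspacing marks (General Category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def fold_search_term(text, foldings, compose=True, max_rounds=10):
    current = text
    for _ in range(max_rounds):
        folded = current
        for folding in foldings:                       # step (a)
            folded = folding(folded)
        folded = unicodedata.normalize("NFD", folded)  # step (b)
        if folded == current:                          # step (c): stable
            break
        current = folded
    else:
        raise ValueError("foldings did not stabilize")
    # step (d): compose only if the caller wants composed output
    return unicodedata.normalize("NFC", current) if compose else current

# The accents only become removable after decomposition, so the loop
# needs more than one round for precomposed input.
print(fold_search_term("R\u00e9sum\u00e9", [str.casefold, strip_marks]))
```

With precomposed input, the first round only case folds and decomposes; the marks are removed in the second round, which is why the loop of step (c) matters when foldings are written against decomposed data.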

4.1.2 Identifier Folding Algorithm

For identifier folding, it is important to account for prohibited characters. This adds a new step (e). The modified basic algorithm then becomes:

a. Apply optional folding operations
b. Apply canonical decomposition
c. Repeat (a) and (b) until stable
d. Apply composition if necessary
e. Eliminate or flag forbidden characters

Foldings in step (a) can be modified to also disallow certain characters by mapping them to forbidden characters, which are then caught in step (e).
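A possible shape for the identifier variant is sketched below in Python. The function `fold_identifier` and the repertoire test `is_allowed` are hypothetical; in particular, the set of allowed characters (letters, decimal digits, and LOW LINE) is an assumption chosen for the example, not a repertoire defined by this report. The sketch flags forbidden characters rather than eliminating them, and it assumes the supplied foldings stabilize.

```python
import unicodedata

def is_allowed(ch):
    # Hypothetical identifier repertoire: letters, decimal digits, LOW LINE.
    return ch == "_" or unicodedata.category(ch) in {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}

def fold_identifier(text, foldings, allowed=is_allowed):
    current = text
    while True:                                        # steps (a) to (c)
        folded = current
        for folding in foldings:
            folded = folding(folded)
        folded = unicodedata.normalize("NFD", folded)
        if folded == current:
            break
        current = folded
    result = unicodedata.normalize("NFC", current)     # step (d)
    # step (e): report position and character for everything outside the repertoire
    forbidden = [(i, ch) for i, ch in enumerate(result) if not allowed(ch)]
    return result, forbidden

print(fold_identifier("My_Var", [str.casefold]))
print(fold_identifier("a-b", [str.casefold]))
```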

4.2 Specification of Folding Operations

The following table summarizes the definition of a number of important and well-defined folding operations for which the data are available in the Unicode Character Database [UCD] or as data files associated with this Technical Report. A machine readable version of this information is available in [Foldings].

Foldings that are multigraph expansions have been collected together. Such a folding replaces a digraph or higher multigraph by its expansion into an equivalent sequence of base characters, such as replacing DOUBLE PRIME or TRIPLE PRIME by two or three PRIME characters respectively. The foldings listed at the end of the table are provisional: only a provisional definition exists, and in some cases there is no associated data file.

The description column identifies the folding.
The source column identifies the set of characters subject to the folding operation by referencing a set of code points, a set of general categories, or a compatibility mapping tag. All characters matching the source condition are subject to the given folding. Note that this column does not indicate the set of characters with which the source characters are equivalenced by the folding.
The target column indicates the result of the folding, either by reference to an operation, or, in some cases, by providing the single Unicode character to which a whole set of source characters is folded.
The data file column indicates which data file carries the character by character information to implement the operation referred to in the target column.

Descriptive Name	Source Characters	Target Characters	Data file specifying the mapping
Accent removal	Latin/Greek/Cyrillic characters with canonical decomposition	base characters of [CD]	[UnicodeData]
Case folding	<*>	case fold according to CaseFolding.txt	[CaseFolding]
Canonical duplicates folding (e.g. Ohm → Omega)	0374, 037E, 0387, 1FBE, 1FEF, 1FFD, 2000, 2001, 2126, 212A, 212B, 2329..232A	[CD]	[UnicodeData]
Dashes folding	Pd	U+002D	[UnicodeData]
Greek letterforms folding	03D0..03D2, 03D5..03D6, 03F0..03F2, 03F4..03F5	[KD]	[UnicodeData]
Hebrew alternates folding	FB20..FB28	[KD]	[UnicodeData]
Jamo folding	3131..3183	[KD]	[UnicodeData]
Math symbol folding	<font> (except FB20..FB28)	[KD]	[UnicodeData]
Native digit folding	Nd	substitute ASCII digit of same numeric property	[UnicodeData]
Nobreak folding	<no-break>	[KD]	[UnicodeData]
Overline folding	FE49..FE4C	[KD] maps to: 203E	[UnicodeData]
Positional forms folding - includes Arabic ligatures	<initial>, <medial>, <final>, <isolate>	[KD]	[UnicodeData]
Small forms folding	<small>	[KD]	[UnicodeData]
Space folding	Zs	U+0020	[UnicodeData]
Spacing Accents <+>	00AF,00B4,00B8,02D8..02DD, 037A,0384,1FBD,1FBE..1FC0, 1FFE,2017,203E,309B..309C	[KD]	[UnicodeData]
Subscript folding	<sub>	[KD]	[UnicodeData]
Symbol folding <+>	00B5, 2107,2135..2138	[KD]	[UnicodeData]
Underline folding	2017, FE4D..FE4F	005F	[UnicodeData]
Vertical forms folding	<vertical>	[KD]	[UnicodeData]
Multigraph expansions
- Circled symbols expansion	<circled>	[KD]	[UnicodeData]
- Dotted	2488..249B	[KD]	[UnicodeData]
- Ellipsis expansion	2024..2026	[KD]	[UnicodeData]
- Fraction expansion	<fraction>	[KD]	[UnicodeData]
- Integral expansion	222C..222D,222F..2230	[KD]	[UnicodeData]
- Ligature expansion misc.	0587, 0675..0678, 0E33, 0EB3, 0EDC..0EDD, 0F77, 0F79, FB00..FB06, FB13..FB17, FB4F	[KD]	[UnicodeData]
- Parenthesized	2474..2487,249C..24B5, 3200..3243	[KD]	[UnicodeData]
- Primes expansion	2033..2034,2036..2037	[KD]	[UnicodeData]
- Roman numerals	2160..2183	[KD]	[UnicodeData]
- Squared	<square>	[KD]	[UnicodeData]
- Squared (unmarked)	3358..3370, 33E0..33FE, 32C0..32CB	[KD]	[UnicodeData]
- Digraphs	0132..0133, 013F..0140, 0149, 01C4..01CC, 01F1..01F3, 1E9A	[KD]	[UnicodeData]
- Other multigraphs, e.g. c/o, TEL	203C, 2047..2049, 20A8, 2100..2101, 2103, 2105..2106, 2109, 2116, 2121	[KD]	[UnicodeData]
Provisional foldings
Diacritic removal (includes stroke, hook, descender)	Latin/Greek/Cyrillic characters with diacritics <*>	related base characters	[DiacriticFolding]
Han Radical folding	2F00..2F5D, 2EF3, 2E9F	corresponding Unified Ideographs	[HanRadicalFolding]
Hiragana folding	Hiragana <*>	Katakana	[HiraganaFolding]
Katakana folding	Katakana <*>	Hiragana	[KatakanaFolding]
Letterforms folding	Variants of letter forms e.g. 017F (long s) <*>	related archetypical form e.g. 0073 (s)	[LetterformFolding]
Simplified Han Folding	traditional Han characters with a corresponding simplified Han character <*>	simplified Han characters	[SimplifiedHanFolding]
Superscript folding	<super>, plus modifier letters 02C0..02C1,06E5..06E6,1D2C..1D61 plus other modifier letters	[KD] with some additions	[SuperScriptFolding]
Suzhou numeral folding	3038..303A, 3021..3029	corresponding Unified Ideographs	[SuzhouFolding]
Width folding	<wide>,<narrow>	[KD] with additional handling of contraction for narrow kana sound marks	[WidthFolding]

Notes:

Some transformations (such as case or width) in principle would allow a free choice of a representative target for each equivalence class (e.g. the upper case or the lower case character for case folding), but the predefined foldings select a preferred default target.
Some target characters (such as 203E) are also subject to another folding, other than case folding.
Some source sets listed include the target characters, others do not.
[CD] = canonical decomposition, applied to characters in 'source characters' column only
[KD] = compatibility decomposition applied to characters in 'source characters' column only
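The three table rows whose source set is a General Category value (Dashes folding, Space folding, Native digit folding) lend themselves to a short Python sketch. The function names are hypothetical, and the category data comes from the `unicodedata` module of the Python runtime, which may reflect a later Unicode version than the one this report covers.

```python
import unicodedata

def fold_dashes(text):
    """Map every character with General Category Pd to U+002D."""
    return "".join("-" if unicodedata.category(ch) == "Pd" else ch for ch in text)

def fold_spaces(text):
    """Map every character with General Category Zs to U+0020."""
    return "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)

def fold_native_digits(text):
    """Replace each Nd character by the ASCII digit with the same numeric value."""
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

print(fold_dashes("a\u2013b"))             # EN DASH
print(fold_spaces("a\u00a0b"))             # NO-BREAK SPACE
print(fold_native_digits("\u0663\u0664"))  # ARABIC-INDIC DIGIT THREE, FOUR
```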

5 Notes and Guidelines

5.1 General Notes

The most important guideline is "discriminate". Understand the effect before applying a folding. Do not apply any of these foldings just because it exists, and certainly never all of them at once. The following notes on individual foldings or issues may be of help in following this guideline.

5.1.1 Case Folding

The [CaseFolding] data file provides case folding information. For more information see Section 5.18 Case Mappings in [Unicode].

5.1.2 Diacritic Folding

Diacritic folding goes beyond the decomposition and removal of accents, umlauts, cedillas, etc. that is provided by the canonical decompositions. It also covers barred and slashed forms, as well as hooks, descenders, etc., so that it is useful for purposes such as cross-language searching. For example, it would allow users to search for words with accented characters in them by supplying the equivalent word spelled in base letters only, eliminating the need to have access to the correct characters on the keyboard. On the other hand, language-specific fuzzy searches would be tailored, usually by being based on collation information, rather than on generic diacritic folding. For more information about collation-based searching, see [UCA].

5.1.3 Letter Forms

Greek letter forms should be folded for Greek text. They should not be folded for mathematical and scientific usage as doing so would conflate very distinct concepts. To give an example of common usage, consider physics, which distinguishes angle encoded by theta "θ" and temperature encoded by theta symbol "ϑ".

The compatibility decompositions contain only a subset of all the variant letter forms that could be folded for search purposes, including Greek final sigma and Latin long s. Greek final sigma should not be folded, unless transiently. Latin long s should be folded for modern text in roman type style but, other than transiently for searches, should not be folded for texts intended to be set in Fraktur type. Other letterform foldings include alternate Hebrew characters.

5.1.4 Han Character Foldings

There are a number of foldings applicable to Han Ideographs. While the Unicode Consortium has not yet published any data file defining these, they can be described in general terms.

Han Radical folding. Han Radical folding substitutes the corresponding Unified Ideograph for a Han Radical.

Note that the existing compatibility decomposition for Han Radicals is inconsistent and should not be used for Han Radical folding.

Simplified / traditional folding. Simplified and traditional forms of Han ideographs are separately encoded in Unicode, even where they represent the same meaning. This folding removes this distinction. In practice, there are additional differences in Han character usage between writers of simplified and traditional Chinese, some of which cannot be folded without context and semantic information. This simple folding is nevertheless useful for many purposes.

The mapping from traditional to simplified Chinese is usually 1:1 but occasionally n:1. Because of this, the default is to fold to simplified Chinese. There are some cases where the traditional to simplified mapping is 1:n or even m:n. In these cases the simplified Han folding also removes some distinctions within simplified Chinese itself. For example, the simplified Chinese character U+753B is folded to U+5212.

Variant folding. As a result of the historical development of Han ideographs, there are multiple variant characters for the same concept; for example, there are 47 variants for 'turtle'. A variant folding would remove such variations. The process of defining such a folding for all such cases will be difficult and lengthy.

Source separation foldings. The Unicode Standard maintains duplications for certain Han ideographs based on the fact that they were separately encoded within a given source character set. A Han source separation folding would treat such separated characters as equivalent. This is a subset of the generalized variant folding.

5.1.5 Kana Foldings

Japanese Katakana and Hiragana are two generally equivalent syllabaries, where each Hiragana syllable has a corresponding Katakana syllable. Foldings in both directions are useful, depending on the situation. However, since Katakana are used to represent the pronunciation of foreign words in Japanese, there are more Katakana than Hiragana characters.

There is one important difference in orthography between the syllabaries, affecting the way the long syllables are expressed. Hiragana uses an additional vowel, while Katakana uses a length mark. If this is taken into account, the folding is no longer context free. In fact, these are better seen as examples of algorithmic transliterations, such as the ISCII transliterations between Indic scripts.

In addition, Katakana occur in both regular and halfwidth forms, with the halfwidth forms using two characters to express voiced or semi-voiced syllables, where the regular Katakana use a single character. The [HiraganaFolding] folds Hiragana to wide Katakana, while the [KatakanaFolding] folds wide Katakana characters to Hiragana.
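For the regular-width syllables, the two directions can be approximated by a fixed code point offset, because the Hiragana block U+3041..U+3096 parallels the Katakana range U+30A1..U+30F6. The Python sketch below relies on that layout; it is an assumption made for illustration and not the content of the [HiraganaFolding] or [KatakanaFolding] data files. It does not handle halfwidth forms or the length mark.

```python
KANA_OFFSET = 0x60
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096
KATAKANA_FIRST, KATAKANA_LAST = 0x30A1, 0x30F6

def fold_hiragana_to_katakana(text):
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST else ch
        for ch in text
    )

def fold_katakana_to_hiragana(text):
    # Katakana letters beyond U+30F6 (for example U+30F7 VA) have no
    # Hiragana counterpart and are left unchanged.
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST else ch
        for ch in text
    )

print(fold_hiragana_to_katakana("\u3072\u3089\u304c\u306a"))
```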

5.1.6 Syllabic vowel foldings

Ethiopic is an example of an "open alphasyllabary" where a single symbol represents a CV (consonant + vowel) pattern. In principle, such a syllabary has several forms for each consonant C: one for each of the vowels V, and one vowel-less form. However, not all forms are actually used in a given syllabary.

Vowels often appear in the wrong form in electronic text due to input errors, incompatible input methods, different spelling conventions, or where grammatical word inflections are primarily expressed by a change of vowel, as is the case for Ethiopic. In such situations it may make sense to fold away the vowel by converting all syllables in a string that have the same consonant into a common reference form.

The choice of reference form depends on the syllabary, or more precisely on features of the languages that are using the syllabary. For Ethiopic, a vowel-less form has been found to be the most practical target for folding.

For other syllabaries, vowel folding may not be as useful.

5.1.7 Semantically neutral foldings

Semantically neutral foldings could be defined as those foldings that simply remove a distinction that is more or less purely an artifact of the encoding itself. Under the right circumstances, these foldings are in principle candidates for permanent data transformation. This is primarily true for the canonical decomposition and composition, but could also apply to text using the Arabic positional forms. If such a text is converted to use non-positional forms, but rendered via a standard Unicode rendering process, the appearance would be the same (except for deliberately odd combinations of positional shapes). In practice it is far more likely that the original data containing the positional forms will display poorly on a system that expects characters with implicit positional shaping.

The following foldings can be considered semantically neutral

Arabic positional forms folding
Vertical forms folding
Canonical decomposition and composition, excepting Han compatibility ideographs.

Note that Arabic positional folding, especially when intended as a permanent data transformation, may need to introduce ZWJ or ZWNJ characters.

5.1.8 Foldings based on tailored collation data

Foldings based on tailored collation data would fold characters that are 'nearly equivalent' in a particular language. For example, a locale-based folding for Swedish could follow common practice in Sweden and match the following pairs of character sequences, among others, based on equivalence in pronunciation.

Ä	Æ
Ö	Ø
ss	ß
y	ü
v	w

In other words, locale-based foldings would be different for some user groups using the same script (in this case Latin). The recommended way to implement locale-based searching based on sorting tables is found in [UCA].

5.1.9 Compatibility decompositions

Compatibility decomposition provides a fixed combination of several foldings and expansions. It is in fact the source of most of the foldings in the table in Section 4.2. There are two ways to subdivide compatibility decompositions:

by compatibility decomposition type (the value in <> in [UnicodeData], e.g. <super>)
by explicitly limiting the range of source characters

The specifications in Section 4.2 use the first method, whenever the compatibility tag is well defined and meaningful. Where it is too broad, e.g., for the <compat> tag, foldings are further subdivided by defining specific ranges of source characters.

Using compatibility decomposition is convenient, since existing algorithms for Normalization may provide them. However, the full decomposition includes several foldings that may not be appropriate for the given purpose. By selectively not applying the decomposition to certain character ranges given in Section 4.2, one can in effect limit the compatibility decomposition to only the desired foldings.
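Subdividing by compatibility tag can be sketched in Python with `unicodedata.decomposition`, which returns the decomposition field of [UnicodeData] including the tag. The function name `fold_by_tag` is hypothetical, and the sketch applies only a single level of decomposition, which is a simplification: a full implementation would decompose recursively and then normalize.

```python
import unicodedata

def fold_by_tag(text, tags):
    """Apply the compatibility mapping only for the given tags, e.g. {"<super>"}."""
    out = []
    for ch in text:
        fields = unicodedata.decomposition(ch).split()
        # Compatibility mappings start with a tag in angle brackets;
        # canonical mappings and unmapped characters are left alone here.
        if fields and fields[0].startswith("<") and fields[0] in tags:
            out.append("".join(chr(int(cp, 16)) for cp in fields[1:]))
        else:
            out.append(ch)
    return "".join(out)

# SUPERSCRIPT TWO is folded, VULGAR FRACTION ONE QUARTER is preserved.
print(fold_by_tag("x\u00b2 \u00bc", {"<super>"}))
```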

5.2 Problematic foldings or expansions

Some foldings can have unintended consequences, including inadvertent changes in the semantics of the text. In most cases, it is best to be conservative and avoid problematic foldings altogether. There are two general exceptions to this rule. The first is the case of identifier matching. If a folding has a prohibited character as one of the output characters, it will not match any legal identifier. Therefore, for properly restricted inputs, one may safely use fixed combinations of foldings, such as NFKC. The other exception is the case of more extensive string pre-processing, discussed below.

5.2.1 Fraction expansion

Fraction expansion as defined in the compatibility decompositions can lead to a drastic change of the semantics of a string and can lead to term boundary issues for searching. For example: Expanding the fraction in this string: <DIGIT 5, VULGAR FRACTION ONE QUARTER> turns it into <DIGIT 5, DIGIT 1, FRACTION SLASH, DIGIT 4>. This now will be found by a search for "51". Because of the semantics of FRACTION SLASH the expansion also changes the numeric value from "5 and a quarter" into "51 over 4". Fraction expansion is therefore best avoided altogether.
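The false positive described above can be reproduced with the compatibility decomposition (NFKD) from Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

text = "5\u00bc"  # DIGIT FIVE, VULGAR FRACTION ONE QUARTER
expanded = unicodedata.normalize("NFKD", text)

# The original string does not contain "51", but the expanded one does.
print("51" in text, "51" in expanded)
```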

By modifying the fraction expansion from the standard compatibility decomposition and inserting an appropriate space character, for example THIN SPACE, before the fraction, it is possible to prevent the expanded fraction from coalescing with preceding digits. However, if there are no preceding digits, THIN SPACE must not be added, or strings containing the expanded fraction would no longer match strings with already expanded fractions, which presumably would not contain THIN SPACE characters. Finally, any space character would be subject to space folding, which, if applied, would introduce a SPACE character, possibly affecting the search term.

5.2.2 Bullet expansions

If a circled bullet character is simply replaced by its contents, as when CIRCLED DIGIT 5 is replaced by DIGIT 5, the separation from the surrounding text is lost, and the DIGIT 5 could run together with adjacent numbers. For bullet characters using parenthesized or dotted letters or digits, this issue is somewhat mitigated by the fact that the bullet itself contains punctuation. Bullet characters are commonly used like footnote marks to refer to other text; in other words, they do not occur just at the beginning of bulleted lines.

5.2.3 Spacing accents substitution

Spacing accents are mapped by compatibility decomposition to SPACE followed by a non-spacing accent. This inappropriately introduces a space character into the term, as well as introducing non-spacing marks where none were in the data before. The former is especially problematic, where the matching operation is affected by these spaces and combining characters.
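The introduced space can be seen by decomposing a single spacing accent. This Python sketch uses NFKD from the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

spacing_acute = "\u00b4"  # ACUTE ACCENT
decomposed = unicodedata.normalize("NFKD", spacing_acute)

# One character becomes SPACE followed by COMBINING ACUTE ACCENT.
print([unicodedata.name(ch) for ch in decomposed])
```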

5.2.4 Math folding

The set of compatibility decompositions includes the folding of letterlike mathematical symbols to their nearest ASCII or Hebrew equivalent. In particular the Hebrew characters used as letterlike symbols do not have RIGHT TO LEFT directionality and the set of such letters in mathematical usage is sufficiently restricted that such folding makes little sense in math, except in pure 'looks like' style searches.

5.2.5 Various "cluster" expansions

Unicode contains many clusters, that is, single characters that are made up of several characters, e.g. square symbols and some of the letterlike characters. 'Decomposing' these may or may not be the right thing for search equivalence. Parenthesized characters and numbers would probably be immune to the term boundary issues raised earlier, but the story is less clear for others.

5.2.6 Jamo expansion

For more information on Jamo expansion see Section 11.4 Hangul in [Unicode].

5.2.7 Preserving semantics

To a limited extent, the problems surrounding bullet expansion can be mitigated by inserting a THIN SPACE around the expansion to set the expanded text off from the surrounding text. However, this cannot be applied together with any space folding, as otherwise the THIN SPACE may become SPACE and might be considered a search term delimiter.

5.3 Versioning and stability

As characters are added to [Unicode], many of them need to be added to the definition of foldings described here. However, there are some exceptions; for example, few characters subject to [WidthFolding] are expected to be added. The data files that are specifically associated with this Technical Report contain a note discussing the expected level and type of future changes.

[Ed.: This description will be updated to reflect the actual versioning for approved versions.]

No attempt is made to version the data files associated with draft versions of this Technical Report. Each version replaces the preceding version. Each file will indicate which version of the Unicode Standard is required to cover all character codes referred to. To recover folding tables for earlier versions of the Standard, simply delete any lines that refer to any characters (whether as source or target characters) for which the [DerivedAge] is more recent than the desired version.
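The recovery procedure just described amounts to a simple filter. The Python sketch below assumes a hypothetical line format of semicolon-separated fields of hexadecimal code points with an optional "#" comment; the actual data files may be laid out differently. It also assumes that the caller supplies a mapping from code point to age as parsed from [DerivedAge]. The names code_points and restrict_to_version, and the sample lines and ages, are illustrative only and should be checked against the real files.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int]   # (major, minor), e.g. (3, 2)

def code_points(line: str) -> List[int]:
    """Return every code point named in the data fields of a line."""
    data = line.split("#", 1)[0]              # strip trailing comment
    points = []
    for field in data.split(";"):
        for token in field.split():
            points.append(int(token, 16))
    return points

def restrict_to_version(lines: List[str],
                        ages: Dict[int, Version],
                        target: Version) -> List[str]:
    """Drop lines naming any source or target character newer than target."""
    kept = []
    for line in lines:
        # Blank and comment-only lines name no characters and are kept.
        if all(ages[cp] <= target for cp in code_points(line)):
            kept.append(line)
    return kept

# Illustrative input only; not copied from the actual data files.
lines = [
    "# sample excerpt",
    "FF21; 0041 # FULLWIDTH LATIN CAPITAL LETTER A",
    "FF5F; 2985 # FULLWIDTH LEFT WHITE PARENTHESIS",
]
ages = {0xFF21: (1, 1), 0x0041: (1, 1), 0xFF5F: (3, 2), 0x2985: (3, 2)}

for line in restrict_to_version(lines, ages, (3, 0)):
    print(line)
```

In this sketch a code point missing from the age mapping raises KeyError rather than being silently kept or dropped, so that gaps in the supplied age data are noticed.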

Where significant changes have been made to the folding data for existing characters they are noted in the change history in each data file.

Foldings derived from data in the Unicode Character Database [UCD] are fully versioned by the combination of the version of the [UnicodeData] file and the version of this report containing the derivation instructions. In the future, data files that are originally associated with this Technical Report may be incorporated into the UCD.

References

[Aliases]	Data files ftp://ftp.unicode.org/Public/UNIDATA/PropertyAliases.txt and http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
[CaseFolding]	Data file ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt
[Charts]	The online code charts can be found at http://www.unicode.org/charts/ An index to character names with links to the corresponding chart is found at http://www.unicode.org/charts/charindex.html
[DerivedAge]	The version for which a given character was added to the Unicode Standard is listed in http://www.unicode.org/Public/UNIDATA/DerivedAge.txt
[DiacriticFolding]	A data file can be found at: http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt.
[EAW]	Unicode Standard Annex #11, East Asian Width. http://www.unicode.org/reports/tr11 For a definition of East Asian Width
[Feedback]	Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html
[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/unicode/faq/ For answers to common questions on technical issues.
[Foldings]	A machine readable listing of all folding operations described in this report can be found at: http://www.unicode.org/reports/tr30/datafiles/Foldings.txt.
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[HanRadicalFolding]	A data file can be found at: http://www.unicode.org/reports/tr30/datafiles/HanRadicalFolding.txt.
[HiraganaFolding]	A data file can be found at: http://www.unicode.org/reports/tr30/datafiles/HiraganaFolding.txt.
[Normalization]	Unicode Standard Annex #15: Unicode Normalization Forms http://www.unicode.org/reports/tr15/
[KatakanaFolding]	A data file can be found at: http://www.unicode.org/reports/tr30/datafiles/KatakanaFolding.txt. Another example of a Hiragana_Katakana transliteration can be found as part of the ICU4j source code.
[LetterformFolding]	A data file can be found at: http://www.unicode.org/reports/tr30/datafiles/LetterformFolding.txt.
[PropModel]	Unicode Technical Report #23: The Unicode Character Property Model, http://www.unicode.org/reports/tr23/
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[SimplifiedHanFolding]	A data file can be found at: http://www.unicode.org/reports/tr30/datafiles/SimplifiedHanFolding.txt.
[SuperScriptFolding]	A data file can be found at: http://www.unicode.org/reports/tr30/datafiles/SuperscriptFolding.txt.
[SuzhouFolding]	A data file can be found at: http://www.unicode.org/reports/tr30/datafiles/SuzhouFolding.txt.
[Unicode]	The Unicode Standard For the latest version see: http://www.unicode.org/versions/latest/. For the last major version see: The Unicode Consortium. The Unicode Standard, Version 4.0. (Boston, MA, Addison-Wesley, 2003. 0-321-18578-1) or online as http://www.unicode.org/versions/Unicode4.0.0/
[UCA]	Unicode Technical Standard #10: Unicode Collation Algorithm http://www.unicode.org/reports/tr10/
[UCD]	Unicode Character Database. http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html For an overview of the Unicode Character Database and a list of its associated files
[UnicodeData]	http://www.unicode.org/Public/UNIDATA/UnicodeData.txt This file contains the combining class and decomposition information needed to carry out canonical and compatibility decompositions as defined in chapter 3 of the Unicode Standard.
[UXML]	Unicode Technical Report #20: Unicode in XML and other Markup Languages http://www.unicode.org/reports/tr20/
[Versions]	Versions of the Unicode Standard http://www.unicode.org/unicode/standard/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.
[WidthFolding]	A data file can be found at: http://www.unicode.org/reports/tr30/datafiles/WidthFolding.txt. Another example of a Fullwidth_Halfwidth folding can be found as part of the ICU4j source code.
[XML]	Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, Eds., Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation 6-October-2000, http://www.w3.org/TR/REC-xml/

Acknowledgements

Thanks to Mark Davis for reformatting the data files and John Cowan for creating the first draft of the DiacriticFolding data file.

Modifications

Changes from Tracking Number

3 Minor text edits throughout. Added data files for DiacriticFolding, HanRadicalFolding, SimplifiedHanFolding, SuzhouFolding and LetterformFolding, plus a description file Foldings.txt

2 Added a description of Syllabic folding, replaced definitions by pointer to definitions in [PropModel], improved introduction. Added data files for HiraganaFolding, KatakanaFolding, SuperscriptFolding, and WidthFolding.

1 Updated to Unicode 4.0, updated Status and References, removed conformance section, added detail throughout

0 First version

Copyright © 2001-2004 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.