number |
|
text |
implementations |
C001 |
[S] [I] [C] |
Specifications, software and content MUST
NOT require or depend on a one-to-one correspondence between
characters and the sounds of a language. |
SSML |
C002 |
[S] [I] [C] |
Specifications, software and content MUST
NOT require or depend on a one-to-one mapping between
characters and units of displayed text. |
CSS, XSL-FO, SVG |
C003 |
[S] [I] [C] |
Protocols, data formats and APIs MUST
store, interchange or process text data in logical order. |
everything that uses Unicode, very widely implemented |
C075 |
[I] |
Independent of whether some implementation uses logical selection
or visual selection, characters selected MUST be kept in logical order in storage. |
many implementations, in particular editors; SVG |
C004 |
[S] |
Specifications of protocols and APIs that involve selection of
ranges SHOULD provide for discontiguous
logical selections, at least to the extent necessary to support
implementation of visual selection on screen on top of those
protocols and APIs. |
XPointer |
C005 |
[S] [I] |
Specifications and software MUST NOT
require nor depend on a single keystroke resulting in a single
character, nor that a single character be input with a single
keystroke (even with modifiers), nor that keyboards are the same all
over the world. |
DOM Events,... |
C006 |
[S] [I] |
Software that sorts or searches text for users SHOULD do so on the basis of appropriate
collation units and ordering rules for the relevant language and/or
application. |
XQuery, various OSes |
C007 |
[S] [I] |
Where searching or sorting is done dynamically, particularly in a
multilingual environment, the 'relevant language' SHOULD be determined to be that of the current
user, and may thus differ from user to user. |
XQuery, various OSes |
C066 |
[S] [I] |
Software that allows users to sort or search text SHOULD allow the user to select alternative
rules for collation units and ordering. |
XQuery, various OSes |
C008 |
[S] [I] |
Specifications and implementations of sorting and searching
algorithms SHOULD accommodate text that
contains any character in Unicode. |
several implementations of UCA and others (C008 mainly warns about
some old problem) |
C009 |
[S] [I] |
Specifications, software and content MUST
NOT require or depend on a one-to-one relationship between
characters and units of physical storage. |
XSLT, XQuery,... |
C010 |
[S] |
When specifications use the term 'character' the specifications MUST define which meaning they intend. |
XML |
C067 |
[S] |
Specifications SHOULD use specific
terms, when available, instead of the general term 'character'. |
various specs |
C013 |
[S] [C] |
Textual data objects defined by protocol or format specifications
MUST be in a single character
encoding. |
HTML, XML, CSS,... |
C014 |
[S] |
All specifications that involve processing of text MUST specify the processing of text according
to the Reference Processing Model, namely:
- Specifications MUST define text in terms of Unicode characters, not bytes or glyphs.
- For their textual data objects specifications MAY allow use of any character encoding
  which can be transcoded to a Unicode encoding form.
- Specifications MAY choose to disallow or deprecate some character encodings and to make
  others mandatory. Independent of the actual character encoding, the specified behavior
  MUST be the same as if the processing happened as follows:
  - The character encoding of any textual data object received by the application
    implementing the specification MUST be determined and the data object MUST be
    interpreted as a sequence of Unicode characters - this MUST be equivalent to
    transcoding the data object to some Unicode encoding form, adjusting any character
    encoding label if necessary, and receiving it in that Unicode encoding form.
  - All processing MUST take place on this sequence of Unicode characters.
  - If text is output by the application, the sequence of Unicode characters MUST be
    encoded using a character encoding chosen among those allowed by the specification.
- If a specification is such that multiple textual data objects are involved (such as an
  XML document referring to external parsed entities), it MAY choose to allow these data
  objects to be in different character encodings. In all cases, the Reference Processing
  Model MUST be applied to all textual data objects.
|
HTML, CSS, XML, XSLT, XQuery,... |
C070 |
[S] |
Specifications SHOULD NOT
arbitrarily exclude code points from the full range of
Unicode code
points from U+0000 to U+10FFFF inclusive. |
HTML, XML, CSS |
C077 |
[S] |
Specifications MUST NOT allow code
points above U+10FFFF. |
HTML, XML, CSS |
C079 |
[S] |
Specifications SHOULD NOT allow the
use of code points reserved by Unicode for internal use. |
discouraged by XML 1.1 |
C078 |
[S] |
Specifications MUST NOT allow the use
of surrogate code points. |
HTML, XML, CSS |
C015 |
[S] |
Specifications MUST either specify a
unique character encoding, or provide character encoding
identification mechanisms such that the encoding of text can be
reliably identified. |
HTML, XML, CSS,... |
C016 |
[S] |
When designing a new protocol, format or API, specifications SHOULD require a unique character
encoding. |
DOM, IRI->URI conversion, some IETF protocols |
C017 |
[S] |
When basing a protocol, format, or API on a protocol, format, or
API that already has rules for character encoding, specifications
SHOULD use rather than change these
rules. |
HTML->MIME, XML->MIME, RFC3023-based media types |
C018 |
[S] |
When a unique character encoding is required, the character
encoding MUST be UTF-8, UTF-16 or
UTF-32. |
DOM, IRIs, some IETF protocols |
C020 |
[S] |
Specifications SHOULD avoid using the
terms 'character set' and 'charset' to refer to a character encoding,
except when the latter is used to refer to the MIME charset parameter or its IANA-registered
values. The term 'character encoding', or
in specific cases the terms 'character encoding
form' or 'character encoding
scheme', are RECOMMENDED. |
lots of specs |
C021 |
[S] |
If the unique encoding approach is not taken, specifications SHOULD require the use of the IANA charset
registry names, and in particular the names identified in the
registry as 'MIME preferred names', to
designate character encodings in protocols, data formats and
APIs. |
recommended by XML |
C022 |
[S] [I] [C] |
Character encodings that are not in the IANA registry SHOULD NOT be used, except by private
agreement. |
XML |
C023 |
[S] [I] [C] |
If an unregistered character encoding is used, the convention of
using 'x-' at the beginning of the name
MUST be followed. |
XML |
C049 |
[I] [C] |
The character encoding of content SHOULD be chosen so that it maximizes the
opportunity to directly represent characters (i.e. minimizes the need
to represent characters by markup
means such as character
escapes) while avoiding obscure encodings that are unlikely to be
understood by recipients. |
wide practice on the Web |
C034 |
[C] |
If facilities are offered for identifying character encoding,
content MUST make use of them; where the facilities offered for
character encoding identification include defaults (e.g. in XML 1.0 [XML 1.0]),
relying on such defaults is sufficient to satisfy this
identification requirement. |
wide (but not yet wide enough) practice on the Web |
C024 |
[I] [C] |
Content and software that label text data MUST use one of the names required by the
appropriate specification (e.g. the XML specification when editing
XML text) and SHOULD use the MIME
preferred name of a character encoding to label data in that
character encoding. |
wide practice |
C025 |
[I] [C] |
An IANA-registered charset name MUST NOT be used to label text data in a
character encoding other than the one identified in the IANA
registration of that name. |
wide practice |
C026 |
[S] |
If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and
UTF-16 encoding forms of Unicode as admissible character encodings
and SHOULD choose at least one of UTF-8
or UTF-16 as required encoding forms (encoding forms that MUST be supported by implementations of the
specification). |
XML |
C027 |
[S] |
Specifications that require a default encoding MUST define either UTF-8 or UTF-16 as the
default, or both if they define suitable means of distinguishing
them. |
XML |
C028 |
[S] |
Specifications MUST NOT propose the
use of heuristics to determine the encoding of data. |
none known |
C029 |
[I] |
Receiving software MUST
determine the encoding of data from available information according
to appropriate specifications. |
widely implemented (although it could be better) |
C030 |
[I] |
When an IANA-registered charset name
is recognized, receiving software MUST
interpret the received data according to the encoding associated with
the name in the IANA registry. |
widely implemented |
C031 |
[I] |
When no charset is provided receiving software MUST adhere to the default character
encoding(s) specified in the specification. |
widely implemented |
C035 |
[S] |
Specifications MUST define
conflict-resolution mechanisms (e.g. priorities) for cases where
there is multiple or conflicting information about character
encoding. |
HTML, XML |
C033 |
[I] |
Software MUST completely implement the
mechanisms for character encoding identification and conflict
resolution. |
browsers, XML parsers |
C073 |
[C] |
Publicly interchanged content SHOULD
NOT use code points in the private use area. |
most Web pages |
C076 |
[C] |
Content MUST NOT use a code point for
any purpose other than that defined by its character encoding. |
most Web pages |
C038 |
[S] |
Specifications MUST NOT require the
use of private use area characters with particular assignments. |
most specs (bad exception that we are trying to avoid repeating:
MathML 1.0) |
C039 |
[S] |
Specifications MUST NOT require the
use of mechanisms for defining agreements of private use code
points. |
all known specs |
C040 |
[S] [I] |
Specifications and implementations SHOULD
NOT disallow the use of private use code points by private
agreement. |
HTML, XML |
C041 |
[S] |
Specifications MAY define markup
to allow the transmission of symbols not in Unicode or to identify
specific variants of Unicode characters. |
SVG, MathML |
C068 |
[S] |
Specifications SHOULD allow the
inclusion of or reference to pictures and graphics where appropriate,
to eliminate the need to (mis)use character-oriented mechanisms for
pictures or graphics. |
HTML, SVG |
C042 |
[S] |
Specifications SHOULD NOT invent a new
escaping mechanism if an appropriate one already exists. |
XHTML, SVG, SMIL,... |
C043 |
[S] |
The number of different ways to escape a character SHOULD be minimized (ideally to one). |
CSS |
C044 |
[S] |
Escape syntax SHOULD require either
explicit end delimiters or a fixed number of characters in each
character escape. Escape syntaxes where the end is determined by any
character outside the set of characters admissible in the character
escape itself SHOULD be avoided. |
HTML, XML |
C045 |
[S] |
Whenever specifications define character escapes that allow the
representation of characters using a number, the number MUST represent the Unicode code point of the
character and SHOULD be in hexadecimal
notation. |
HTML, XML, CSS |
C046 |
[S] |
Escaped characters SHOULD be
acceptable wherever their unescaped forms are; this does not preclude
that syntax-significant
characters, when escaped, lose their significance in the syntax. In
particular, if a character is acceptable in identifiers and comments,
then its escaped form should also be acceptable. |
CSS, would have been ideal for XML |
C047 |
[I] [C] |
Escapes SHOULD only be used when the
characters to be expressed are not directly representable in the
format or the character encoding of the document, or when the visual
representation of the character is unclear. |
most content on the Web |
C048 |
[I] [C] |
Content SHOULD use the hexadecimal
form of character escapes rather than the decimal form when there are
both. |
several implementations, lots of content |
C050 |
[S] |
Specifications SHOULD exclude
compatibility characters in the syntactic elements (markup,
delimiters, identifiers) of the formats they define. |
XML 1.0 |
C011 |
[S] |
Specifications SHOULD NOT define a
string as a 'byte string'. |
all W3C specs |
C012 |
[S] |
The 'character string' definition SHOULD be used by most specifications. |
HTML, XML, XSLT,... |
C051 |
[S] [I] |
The character
string is RECOMMENDED as a basis for
string indexing. |
XSLT, XQuery |
C052 |
[S] [I] |
A code
unit string MAY be used as a basis
for string indexing if this results in a significant improvement in
the efficiency of internal operations when compared to the use of character
string. |
DOM |
C071 |
[S] [I] |
Grapheme
clusters MAY be used as a basis for
string indexing in applications where user interaction is the primary
concern. |
not widely implemented yet |
C074 |
[S] |
Specifications that define indexing in terms of grapheme clusters
MUST either: a) define grapheme clusters
in terms of default grapheme clusters as defined in Unicode Standard
Annex #29, Text Boundaries [UTR #29], or b) define specifically how tailoring is applied to the
indexing operation. |
not widely implemented yet |
C072 |
[S] [I] |
The use of byte
strings for indexing is NOT
RECOMMENDED. |
all W3C specs |
C053 |
[S] |
Specifications that need a way to identify substrings or point
within a string SHOULD provide ways
other than string indexing to perform this operation. |
regular expressions,... |
C054 |
[I] [C] |
Users of specifications (software developers, content developers)
SHOULD whenever possible prefer ways
other than string indexing to identify substrings or point within a
string. |
XSLT? |
C055 |
[S] |
Specifications SHOULD understand and
process single characters as substrings, and treat indices as
boundary positions between counting units, regardless of the
choice of counting units. |
XSLT/XQuery (for first part) |
C056 |
[S] |
Specifications of APIs SHOULD NOT
specify single characters or single 'units of
encoding' as argument or return types. |
DOM |
C057 |
[S] |
When the positions between the units are counted for string
indexing, starting with an index of 0 for the position at the start
of the string is the RECOMMENDED
solution, with the last index then being equal to the number of
counting units in the string. |
many examples in programming languages, unfortunately not XSLT |
C062 |
[S] |
Since specifications in general need both a definition for their
characters and the semantics associated with these characters,
specifications SHOULD include a
reference to the Unicode Standard, whether or not they include a
reference to ISO/IEC 10646. |
many specs |
C063 |
[S] |
A generic reference to the Unicode Standard MUST be made if it is desired that characters
allocated after a specification is published are usable with that
specification. A specific reference to the Unicode Standard MAY be included to ensure that functionality
depending on a particular version is available and will not change
over time. |
XML 1.1 |
C064 |
[S] |
All generic references to the Unicode Standard [Unicode]
MUST refer to the latest version of the
Unicode Standard available at the date of publication of the
containing specification. |
XML 1.1 |
C065 |
[S] |
All generic references to ISO/IEC 10646 [ISO/IEC 10646]
MUST refer to the latest
version of ISO/IEC 10646 available at the date of publication of the
containing specification. |
XML 1.1 |
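
The string indexing criteria in the table above (C009, C051, C052, C055, C057 and C072) distinguish between character strings, code unit strings and byte strings. The following Python sketch is only an illustration of why these counting units differ; it is not part of the criteria. The function name `counting_units` and the sample string are assumptions made for this example, and the sketch relies on Python's `str` type being indexed by Unicode code point.

```python
def counting_units(text: str) -> dict:
    """Return the length of `text` under three different counting units."""
    return {
        # Character string: Python str is indexed by Unicode code point (C051).
        "characters": len(text),
        # Code unit string: UTF-16 uses 2 bytes per code unit (C052).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Byte string: length depends on the encoding; NOT RECOMMENDED (C072).
        "utf8_bytes": len(text.encode("utf-8")),
    }


# 'a', U+00E9 (e with acute), U+1D11E (musical symbol G clef)
sample = "a\u00e9\U0001D11E"
units = counting_units(sample)
print(units)  # {'characters': 3, 'utf16_code_units': 4, 'utf8_bytes': 7}

# C057: indices are boundary positions, starting at 0 and ending at the
# number of counting units, so the slice 0..len covers the whole string.
whole = sample[0:units["characters"]]

# C055: a single character is handled as a substring between two boundaries.
last_char = sample[2:3]
```

The three counts disagree as soon as the text leaves the ASCII range, which is the reason a specification has to state which counting unit its indices refer to.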