Timestamp:
Feb 17, 2001, 3:03:14 PM (25 years ago)
Author:
umoeller
Message:

Updates to XML.

File:
1 edited

  • trunk/src/helpers/xmldefs.c

    r36 r38  
    2222
    2323/*
    24  *@@category: Helpers\XML
    25  *      see xml.c.
     24 *@@gloss: expat expat
     25 *      Expat is one of the most well-known XML processors (parsers).
     26 *      I (umoeller) have ported expat to the XWorkplace Helpers
     27 *      library. See xmlparse.c for an introduction to expat. See
     28 *      xml.c for an introduction to XML support in the XWorkplace
     29 *      Helpers in general.
     30 */
     31
     32
     33/*
     34 *@@gloss: XML XML
     35 *      XML is the Extensible Markup Language, as defined by
     36 *      the W3C. XML isn't really a language, but a meta-language
     37 *      for describing markup languages. It is a simplified subset
     38 *      of SGML.
     39 *
     40 *      You should be familiar with the following:
     41 *
     42 *      -- XML parsers operate on XML @documents.
     43 *
     44 *      -- Each XML document has both a physical and a logical
     45 *         structure.
     46 *
     47 *         Physically, the document is composed of units called
     48 *         @entities.
     49 *
     50 *         Logically, the document is composed of @markup and
     51 *         @content. Among other things, markup separates the content
     52 *         into @elements.
     53 *
     54 *      -- The logical and physical structures must nest properly (be
     55 *         @well-formed) for each entity, which results in the entire
     56 *         XML document being well-formed as well.
    2657 */
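To make the nesting rule from the XML entry above concrete, here is a minimal C sketch that checks whether start-tags and end-tags nest properly. The function name tagsNestProperly and the size limits are hypothetical; this is not how expat does it. Comments, CDATA sections, processing instructions, DTDs and ">" inside attribute values are deliberately not handled.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 64    /* arbitrary limits for this sketch */
#define MAX_NAME  32

/*
 * tagsNestProperly:
 *      returns 1 if every start-tag in psz is closed by a matching
 *      end-tag in the right order, 0 otherwise. Empty-element tags
 *      such as <P /> neither open nor close anything.
 */
int tagsNestProperly(const char *psz)
{
    char        aszStack[MAX_DEPTH][MAX_NAME];
    int         cDepth = 0;
    const char  *p = psz;

    assert(psz != NULL);

    while ((p = strchr(p, '<')) != NULL)
    {
        int         fEndTag = (p[1] == '/');
        const char  *pName = p + (fEndTag ? 2 : 1);
        size_t      cbName = strcspn(pName, " \t\r\n/>");
        const char  *pClose = strchr(pName, '>');

        if (pClose == NULL || cbName == 0 || cbName >= MAX_NAME)
            return 0;           /* malformed, or too long for this sketch */

        if (fEndTag)
        {
            /* an end-tag must match the innermost open element */
            if (    cDepth == 0
                 || strlen(aszStack[cDepth - 1]) != cbName
                 || strncmp(aszStack[cDepth - 1], pName, cbName) != 0
               )
                return 0;
            cDepth--;
        }
        else if (pClose[-1] != '/')     /* start-tag, not empty-element tag */
        {
            if (cDepth == MAX_DEPTH)
                return 0;
            memcpy(aszStack[cDepth], pName, cbName);
            aszStack[cDepth][cbName] = '\0';
            cDepth++;
        }

        p = pClose + 1;
    }

    return (cDepth == 0);       /* everything opened was closed */
}
```

With this sketch, "<a><b>text</b></a>" passes, while "<a><b>text</a></b>" fails because the end-tags are crossed.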
    2758
    2859/*
    2960 *@@gloss: entities entities
    30  *      An "entity" is an XML storage unit. In the simplest case, an
    31  *      XML document has only one entity, which is an XML file.
    32  *      Except for the document entity (which is nameless), all
    33  *      entities are identified by their names.
    34  *
    35  *      Entities are marked as either parsed or unparsed.
    36  *
     61 *      An "entity" is an XML storage unit. It's a very abstract
     62 *      concept, and the term doesn't make much sense, but it was
     63 *      in SGML already, and XML chose to inherit it.
     64 *
     65 *      In the simplest case, an XML document has only one entity,
     66 *      which is an XML file (or memory buffer from wherever).
    3767 *      The document entity serves as the root of the entity tree
    3868 *      and a starting-point for an XML processor. Unlike other
     
    4070 *      appear on a processor input stream without any identification
    4171 *      at all.
     72 *
     73 *      Entities are defined to be either parsed or unparsed.
    4274 *
    4375 *      Other than that, there are @internal_entities,
     
    119151 *          They must be escaped unless used in a @CDATA section.
    120152 *
    121  *      --  To allow values in an @attribute to contain both single and double
     153 *      --  To allow values in @attributes to contain both single and double
    122154 *          quotes, the apostrophe or single-quote character (') may be
    123155 *          represented as "&apos;", and the double-quote character
    124156 *          (") as "&quot;".
    125157 *
    126  *      In addition, a @character_reference is a special case of an entity reference.
     158 *      A numeric @character_reference is a special case of an entity reference.
    127159 *
    128160 *      An internal entity is always parsed.
     
    210242 *
    211243 *      Markup is either @elements, @entity_references, @comments, @CDATA
    212  *      section delimiters, @DTD's, and
    213  *      @processing_instructions.
     244 *      section delimiters, @DTD's, or @processing_instructions.
    214245 *
    215246 *      XML "text" consists of markup and @content.
     
    226257 *      or may not be interested in white space. Whitespace
    227258 *      handling can therefore be handled differently for each
    228  *      element with the use of the special "xml:space" @attribute.
     259 *      element by using the special "xml:space" attribute (see @attributes).
    229260 */
    230261
     
    291322 +           <P /> <IMG align="left" src="http://www.w3.org/Icons/WWW/w3c_home" />
    292323 *
    293  *      An @attribute contains additional an parameter to an element.
     324 *      In addition, @attributes contain extra parameters to elements.
    294325 *      If the element has attributes, they must be in the start-tag
    295326 *      (or empty-element tag).
     
    311342
    312343/*
    313  *@@gloss: attribute attribute
     344 *@@gloss: attributes attributes
    314345 *      "Attributes" are name-value pairs that have been associated
    315346 *      with @elements. Attributes can only appear in start-tags
     
    370401 *      document's @content; an XML processor may, but
    371402 *      need not, make it possible for an application to retrieve
    372  *      the text of comments (expat has a handler for this).
     403 *      the text of comments (@expat has a handler for this).
    373404 *
    374405 *      Comments may contain any text except "--" (double-hyphen).
     
    464495 *@@gloss: valid valid
    465496 *      XML @documents are said to be "valid" if they have a @DTD
    466  *      associated and they confirm to it.
     497 *      associated and they conform to it. While XML documents
     498 *      must always be @well-formed, validation is optional and
     499 *      left to the application.
    467500 *
    468501 *      Validating processors must report violations of the constraints
     
    473506 *      referenced in the document.
    474507 *
    475  *      Non-validating processors are required to check only the
    476  *      document entity (see @entitites), including the entire
    477  *      internal DTD subset, for whether it is @well-formed. While
    478  *      they are  not required to check the document for validity,
     508 *      Non-validating processors (such as @expat) are required to
     509 *      check only the document entity (see @entities), including the
     510 *      entire internal DTD subset, for whether it is @well-formed.
     511 *
     512 *      While they are not required to check the document for validity,
    479513 *      they are required to process all the declarations they
    480514 *      read in the internal DTD subset and in any parameter entity
     
    482516 *      entity that they do not read; that is to say, they must
    483517 *      use the information in those declarations to normalize
    484  *      @attribute values, include the replacement text of
     518 *      values of @attributes, include the replacement text of
    485519 *      @internal_entities, and supply default attribute values.
    486520 *      They must not process entity declarations or attribute-list
     
    492526/*
    493527 *@@gloss: encodings encodings
    494  *      In an encoding declaration, the values "UTF-8", "UTF-16",
    495  *      "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used
    496  *      for the various encodings and transformations of Unicode /
    497  *      ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ...
    498  *      "ISO-8859-9" should be used for the parts of ISO 8859, and
    499  *      the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should
    500  *      be used for the various encoded forms of JIS X-0208-1997.
     528 *      XML supports a wide variety of character encodings. These
     529 *      must be specified in the XML @text_declaration.
     530 *
     531 *      There are too many character encodings on the planet to
     532 *      be listed here. The most common ones are:
     533 *
     534 *      --  "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4"
     535 *          should be used for the various encodings and transformations
     536 *          of Unicode / ISO/IEC 10646.
     537 *
     538 *      --  "ISO-8859-x" (with "x" being a number from 1 to 9) represent
     539 *          the various ISO 8859 ("Latin") encodings.
     540 *
     541 *      --  "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for
     542 *          the various encoded forms of JIS X-0208-1997.
     543 *
     544 *      Example of a @text_declaration:
     545 *
     546 +          <?xml version="1.0" encoding="ISO-8859-2"?>
    501547 *
    502548 *      All XML processors must be able to read @entities in either
    503  *      UTF-8 or UTF-16.
     549 *      UTF-8 or UTF-16. See XML_SetUnknownEncodingHandler for additional
     550 *      encodings directly supported by @expat.
    504551 *
    505552 *      Entities encoded in UTF-16 must begin with the ZERO WIDTH NO-BREAK
     
    508555 *      XML processors must be able to use this character to differentiate
    509556 *      between UTF-8 and UTF-16 encoded documents.
    510  *
    511  *      See XML_ParserCreate for the encodings directly supported
    512  *      by expat.
    513557 */
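As a rough illustration of the byte order mark check described in the encodings entry above, here is a minimal C sketch. The enum and the function name guessEncoding are hypothetical and belong neither to expat nor to the XWorkplace Helpers. A real processor would also have to honour the encoding given in the text declaration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch only. */
typedef enum
{
    ENC_UTF8_OR_UNKNOWN,    /* no UTF-16 byte order mark found */
    ENC_UTF16_BE,           /* bytes FE FF: big-endian UTF-16 */
    ENC_UTF16_LE            /* bytes FF FE: little-endian UTF-16 */
} GUESSED_ENCODING;

/*
 * guessEncoding:
 *      looks at the first two bytes of an entity and checks
 *      them for the byte order mark (U+FEFF) that UTF-16
 *      entities must begin with.
 */
GUESSED_ENCODING guessEncoding(const unsigned char *pb, size_t cb)
{
    assert(pb != NULL || cb == 0);

    if (cb >= 2)
    {
        if (pb[0] == 0xFE && pb[1] == 0xFF)
            return ENC_UTF16_BE;
        if (pb[0] == 0xFF && pb[1] == 0xFE)
            return ENC_UTF16_LE;
    }

    /* no mark: treat as UTF-8 unless a declaration says otherwise */
    return ENC_UTF8_OR_UNKNOWN;
}
```

The caller would pass the first bytes read from the input stream before handing the rest of the buffer to the parser.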
    514558
     
    576620 *      nature of their content. They look like this:
    577621 +
    578  +          <!ELEMENT name contentmodel>
     622 +          <!ELEMENT name contentspec>
    579623 +
    580  *      The "name" of the element is obvious. The "contentmodel"
     624 *      No element may be declared more than once.
     625 *
     626 *      The "name" of the element is obvious. The "contentspec"
    581627 *      is not. This specifies what may appear in the element
    582  *      and can be a list of:
    583  *
    584  *      --  "#PCDATA", meaning "parsed character data" -- in
    585  *          other words, @content.
    586  *
    587  *      --  Another element name with a specification about
    588  *          whether the element may or must appear once or
    589  *          more than once.
    590  *
    591  *      --  "EMPTY" marks the element as being empty (i.e. no
    592  *          start- and end-tags, but a single tag only).
    593  *
    594  *      The element specifyer can be:
    595  *
    596  *      --  None: the subelement _must_ appear exactly once.
    597  *
    598  *      --  "+": the subelement _must_ appear at _least_ once.
    599  *
    600  *      --  "?": the subelement _may_ appear exactly once.
    601  *
    602  *      --  "*": the subelement _may_ appear once or more than
    603  *          once or not at all. Note that this must always be
    604  *          specified with "#PCDATA".
    605  *
    606  *      The list items can be separated with:
     628 *      and can be one of the following:
     629 *
     630 *      --  "EMPTY" marks the element as being empty (i.e.
     631 *          having no content at all).
     632 *
     633 *      --  "ANY" does not impose any restrictions.
     634 *
     635 *      --  (mixed): a "list" which declares the element to have
     636 *          mixed content. See below.
     637 *
     638 *      --  (children): a "list" which declares the element to
     639 *          have child elements only, but no content. See below.
     640 *
     641 *      <B>(mixed): content with elements</B>
     642 *
     643 *      With the (mixed) contentspec, an element may either contain
     644 *      @content only or @content with subelements.
     645 *
     646 *      While the (children) contentspec allows you to define sequences
     647 *      and orders, this is not possible with (mixed).
     648 *
     649 *      "contentspec" must then be a pair of parentheses, optionally
     650 *      followed by "*". In the brackets, there must be at least the
     651 *      keyword "#PCDATA", optionally followed by "|" and element
     652 *      names. Note that if no #PCDATA appears, the (children) model
     653 *      is assumed (see below).
     654 *
     655 *      Examples:
     656 *
     657 +          <!ELEMENT name (#PCDATA)* >
     658 +          <!ELEMENT name (#PCDATA | subname1 | subname2)* >
     659 +          <!ELEMENT name (#PCDATA) >
     660 *
     661 *      Note that if you specify sub-element names, you must terminate
     662 *      the contentspec with "*". Again, there's no way to specify
     663 *      orders etc. with (mixed).
     664 *
     665 *      <B>(children): Element content only</B>
     666 *
     667 *      With the (children) contentspec, an element may contain
     668 *      only other elements (and @whitespace), but no other @content.
     669 *
     670 *      This can become fairly complicated. "contentspec" then must be
     671 *      a "list" followed by a "repeater".
     672 *
     673 *      A "repeater" can be:
     674 *
     675 *      --  Nothing: the preceding item _must_ appear exactly once.
     676 *
     677 *      --  "+": the preceding item _must_ appear at _least_ once.
     678 *
     679 *      --  "?": the preceding item _may_ appear once or not at all.
     680 *
     681 *      --  "*": the preceding item _may_ appear once or more than
     682 *          once or not at all.
     683 *
     684 *      Here's the simplest example (presuming that "SUBELEMENT"
     685 *      is a valid "list" here):
     686 *
     687 +          <!ELEMENT name (SUBELEMENT)* >
     688 *
     689 *      In other words, in (children) mode, "contentspec" must always
     690 *      be in brackets and is followed by a "repeater" (which can be
     691 *      nothing).
     692 *
     693 *      About "lists"... since these declarations may nest, this is
     694 *      where the recursive definition of a "content particle" comes
     695 *      in:
     696 *
     697 *      --  A "content particle" is either a sub-element name or
     698 *          a nested list, followed by a "repeater".
     699 *
     700 *      --  A "list" is defined as an enumeration of content particles,
     701 *          enclosed in parentheses, where the content particles are
     702 *          separated by list separators.
     703 *
     704 *      There are two types of list separators:
    607705 *
    608706 *      --  Commas (",") indicate that the elements must appear
    609  *          in the same order.
     707 *          in the specified order ("sequence").
    610708 *
    611709 *      --  Vertical bars ("|") specify that the elements may
    612  *          occur alternatively.
    613  *
    614  *      Examples:
    615  +
    616  +          <!ELEMENT oldjoke  (burns+, allen, applause?)>
    617  +          <!ELEMENT burns    (#PCDATA | quote)*>
    618  +          <!ELEMENT allen    (#PCDATA | quote)*>
    619  +          <!ELEMENT quote    (#PCDATA)*>
    620  +          <!ELEMENT applause EMPTY>
    621  *
    622  *      This defines that the element "oldjoke" must contain
    623  *      "burns" and "allen" and may contain "applause".
    624  *      Only "burns" may appear more than once.
     710 *          occur alternatively ("choice").
     711 *
     712 *      The list separators cannot be mixed; the list must be
     713 *      either completely "sequence" or "choice".
     714 *
     715 *      Examples of content particles:
     716 *
     717 +              SUBELEMENT+
     718 +              list*
     719 *
     720 *      Examples of lists:
     721 *
     722 +          ( cp | cp | cp | cp )
     723 +          ( cp , cp , cp , cp )
     724 *
     725 *      Full examples for (children):
     726 *
     727 +          <!ELEMENT oldjoke  ( burns+, allen, applause? ) >
     728 +                             | |       +cp-+          | |
     729 +                             | |                      | |
     730 +                             | +------- list ---------+ |
     731 +                             +-------contentspec--------+
     732 *
     733 *      This specifies a "sequence" list for the "oldjoke" element. The
     734 *      list is not nested, so the content particles are element
     735 *      names only.
     736 *
     737 *      Within "oldjoke", "burns" must appear first and can appear
     738 *      once or several times.
     739 *
     740 *      Next must be "allen", exactly once (since there's no repeater).
     741 *
     742 *      Optionally ("?"), there can be "applause" at the end.
     743 *
     744 *      Now, a nested example:
     745 *
     746 +          <!ELEMENT WARPIN (REXX*, VARPROMPT*, MSG?, TITLE?, (GROUP | PCK)+, PAGE+) >
     747 *
    625748 */
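The repeater rules for the "oldjoke" example above can be sketched in C as follows. This is only an illustration under simplifying assumptions: it handles one flat "sequence" list (no nesting, no "choice"), matches greedily, and the PARTICLE type, matchesSequence and the sample arrays are hypothetical names that do not exist in expat or the XWorkplace Helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One content particle of a flat "sequence" list: an element name
   plus a repeater (0 for none, or '+', '?', '*'). */
typedef struct
{
    const char  *pcszName;
    char        chRepeater;
} PARTICLE;

/* ( burns+, allen, applause? ) */
const PARTICLE G_aOldJoke[] =
    {
        { "burns",    '+' },
        { "allen",    0   },
        { "applause", '?' }
    };

/* sample child element lists for trying this out */
const char *G_apcszGood[]    = { "burns", "burns", "allen" };
const char *G_apcszNoBurns[] = { "allen", "applause" };
const char *G_apcszTwice[]   = { "burns", "allen", "allen" };

/*
 * matchesSequence:
 *      returns 1 if the child element names satisfy the
 *      sequence list, 0 otherwise.
 */
int matchesSequence(const PARTICLE *paParticles, size_t cParticles,
                    const char **papcszChildren, size_t cChildren)
{
    size_t  iChild = 0,
            i;

    assert(paParticles != NULL && papcszChildren != NULL);

    for (i = 0; i < cParticles; i++)
    {
        const PARTICLE *p = &paParticles[i];
        size_t  cFound = 0;
        int     fMany = (p->chRepeater == '+' || p->chRepeater == '*');
        int     fOptional = (p->chRepeater == '?' || p->chRepeater == '*');

        /* consume as many matching children as the repeater allows */
        while (    iChild < cChildren
                && strcmp(papcszChildren[iChild], p->pcszName) == 0
                && (fMany || cFound == 0)
              )
        {
            cFound++;
            iChild++;
        }

        if (cFound == 0 && !fOptional)
            return 0;           /* required particle is missing */
    }

    return (iChild == cChildren);   /* no unexpected children left over */
}
```

Here "burns, burns, allen" is accepted, while a list without "burns" or with "allen" twice is rejected.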
    626749
     
    754877 *      in whole or in part within @parameter_entities.
    755878 */
     879
     880/*
     881 *@@gloss: DOM DOM
     882 *      DOM is the "Document Object Model", as defined by the W3C.
     883 *
     884 *      The DOM is a programming interface for @XML @documents.
     885 *      (XML is a metalanguage and describes the documents
     886 *      themselves. DOM is a programming interface -- an API --
     887 *      to access XML documents.)
     888 *
     889 *      The W3C calls this "a platform- and language-neutral
     890 *      interface that allows programs and scripts to dynamically
     891 *      access and update the content, structure and style of
     892 *      documents. The Document Object Model provides
     893 *      a standard set of objects for representing HTML and XML
     894 *      documents, a standard model of how these objects can
     895 *      be combined, and a standard interface for accessing and
     896 *      manipulating them. Vendors can support the DOM as an
     897 *      interface to their proprietary data structures and APIs,
     898 *      and content authors can write to the standard DOM
     899 *      interfaces rather than product-specific APIs, thus
     900 *      increasing interoperability on the Web."
     901 *
     902 *      In short, DOM specifies that an XML document is broken
     903 *      up into a tree of "nodes", representing the various parts
     904 *      of an XML document. Such nodes represent @documents,
     905 *      @elements, @attributes, @processing_instructions,
     906 *      @comments, @content, and more.
     907 *
     908 *      See xml.c for an introduction to XML and DOM support in
     909 *      the XWorkplace helpers.
     910 *
     911 *      Example: Take this HTML table definition:
     912 +
     913 +          <TABLE>
     914 +          <TBODY>
     915 +          <TR>
     916 +          <TD>Column 1-1</TD>
     917 +          <TD>Column 1-2</TD>
     918 +          </TR>
     919 +          <TR>
     920 +          <TD>Column 2-1</TD>
     921 +          <TD>Column 2-2</TD>
     922 +          </TR>
     923 +          </TBODY>
     924 +          </TABLE>
     925 *
     926 *      In the DOM, this would be represented by a tree as follows:
     927 +
     928 +                          ┌────────────┐
     929 +                          │   TABLE    │        (only ELEMENT node in root DOCUMENT node)
     930 +                          └─────┬──────┘
     931 +                                │
     932 +                          ┌─────┴──────┐
     933 +                          │   TBODY    │        (only ELEMENT node in root "TABLE" node)
     934 +                          └─────┬──────┘
     935 +                    ┌───────────┴───────────┐
     936 +              ┌─────┴──────┐          ┌─────┴──────┐
     937 +              │   TR       │          │   TR       │
     938 +              └─────┬──────┘          └─────┬──────┘
     939 +                ┌───┴──────┐            ┌───┴──────┐
     940 +            ┌───┴─┐     ┌──┴──┐     ┌───┴─┐     ┌──┴──┐
     941 +            │ TD  │     │ TD  │     │ TD  │     │ TD  │
     942 +            └──┬──┘     └──┬──┘     └───┬─┘     └──┬──┘
     943 +         ╔═════╩════╗ ╔════╩═════╗ ╔════╩═════╗ ╔══╩═══════╗
     944 +         ║Column 1-1║ ║Column 1-2║ ║Column 2-1║ ║Column 2-2║    (one TEXT node in each parent node)
     945 +         ╚══════════╝ ╚══════════╝ ╚══════════╝ ╚══════════╝
     946 */
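The tree from the DOM entry above could be built in C roughly like this. It is a sketch only: the NODE structure and the functions appendChild, buildTableExample and countNodes are hypothetical and are not the node types or functions of the XWorkplace Helpers (see xml.c for those). Nodes are never freed here to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { NODE_ELEMENT, NODE_TEXT } NODETYPE;

typedef struct NODE
{
    NODETYPE    type;
    const char  *pcszValue;         /* element name or text */
    struct NODE *pFirstChild;
    struct NODE *pNextSibling;
} NODE;

/* Creates a node and appends it as the last child of pParent
   (or creates a root node if pParent is NULL). */
NODE* appendChild(NODE *pParent, NODETYPE type, const char *pcszValue)
{
    NODE *pNew = calloc(1, sizeof(NODE));

    assert(pcszValue != NULL);
    if (pNew == NULL)
        abort();                    /* out of memory: give up in this sketch */

    pNew->type = type;
    pNew->pcszValue = pcszValue;

    if (pParent != NULL)
    {
        NODE **pp = &pParent->pFirstChild;
        while (*pp != NULL)
            pp = &(*pp)->pNextSibling;
        *pp = pNew;
    }
    return pNew;
}

/* Builds TABLE -> TBODY -> 2 x TR -> 2 x TD -> one TEXT node each. */
NODE* buildTableExample(void)
{
    static const char *apcszText[2][2] =
        {
            { "Column 1-1", "Column 1-2" },
            { "Column 2-1", "Column 2-2" }
        };
    NODE *pTable = appendChild(NULL, NODE_ELEMENT, "TABLE");
    NODE *pBody = appendChild(pTable, NODE_ELEMENT, "TBODY");
    int  row, col;

    for (row = 0; row < 2; row++)
    {
        NODE *pRow = appendChild(pBody, NODE_ELEMENT, "TR");
        for (col = 0; col < 2; col++)
        {
            NODE *pCell = appendChild(pRow, NODE_ELEMENT, "TD");
            appendChild(pCell, NODE_TEXT, apcszText[row][col]);
        }
    }
    return pTable;
}

/* Counts the nodes of the given type in pNode, its siblings
   and all their descendants. */
size_t countNodes(const NODE *pNode, NODETYPE type)
{
    size_t c = 0;
    for (; pNode != NULL; pNode = pNode->pNextSibling)
        c += (pNode->type == type) + countNodes(pNode->pFirstChild, type);
    return c;
}
```

Walking the result with countNodes finds the element nodes for TABLE, TBODY, the two TR and the four TD elements, plus one text node per TD.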
     947
     948/*
     949 *@@gloss: DOM_DOCUMENT DOCUMENT
     950 *      representation of XML @documents in the @DOM.
     951 *
     952 *      The xwphelpers implementation has the following differences
     953 *      to the DOM specs:
     954 *
     955 *      -- The "doctype" member points to the document's @DTD, or is NULL.
     956 *         In our implementation, this is the pvExtra pointer, which points
     957 *         to a _DOMDTD.
     958 *
     959 *      -- The "implementation" member points to a DOMImplementation object.
     960 *         This is not supported here.
     961 *
     962 *      -- The "documentElement" member is a convenience pointer to the
     963 *         document's root element. We don't supply this field; instead,
     964 *         the llChildren list only contains a single ELEMENT node for the
     965 *         root element.
     966 *
     967 *      -- The "createElement" method is implemented by xmlCreateElementNode.
     968 *
     969 *      -- The "createAttribute" method is implemented by xmlCreateAttributeNode.
     970 *
     971 *      -- The "createTextNode" method is implemented by xmlCreateTextNode,
     972 *         which has an extra parameter though.
     973 *
     974 *      -- The "createComment" method is implemented by xmlCreateCommentNode.
     975 *
     976 *      -- The "createProcessingInstruction" method is implemented by
     977 *         xmlCreatePINode.
     978 *
     979 *      -- The "createDocumentFragment", "createCDATASection", and
     980 *         "createEntityReference" methods are not supported.
     981 */
     982
     983