CDATA isn’t special

Wednesday 8 September 2004

This is more than 20 years old. Be careful.

Here’s one of those recurring misconceptions about XML: that CDATA is somehow special. I’ve run into it a number of times, most recently from a co-worker. In fact, there’s no difference between character data in XML (you know, the text that isn’t part of any tag) and the text in a CDATA section (the stuff inside the wacky <![CDATA[ .. ]]> markers).

Dealing with an XML stream that would include HTML text, my co-worker asked if we could put the HTML into a CDATA section, because that would make it easier to render later. I assured him it wouldn’t make a bit of difference how the text was communicated, so long as it was properly quoted.

As Tim Bray comments in the Annotated XML Specification:

CDATA sections don’t mean anything; they are strictly a convenience to make XML document authors’ lives easier.

These two chunks of XML behave exactly the same:

<thingy type='html'>
&lt;p&gt;Here is a paragraph&lt;/p&gt;
</thingy>

<thingy type='html'>
<![CDATA[<p>Here is a paragraph</p>]]>
</thingy>
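
A quick way to check the claim is to parse both chunks and compare what comes out. This is only a sketch using Python's standard xml.etree.ElementTree; the variable names are mine, and any conforming parser should behave the same way.

```python
import xml.etree.ElementTree as ET

# The same text, quoted two different ways.
ESCAPED = """<thingy type='html'>
&lt;p&gt;Here is a paragraph&lt;/p&gt;
</thingy>"""

CDATA = """<thingy type='html'>
<![CDATA[<p>Here is a paragraph</p>]]>
</thingy>"""

escaped_text = ET.fromstring(ESCAPED).text
cdata_text = ET.fromstring(CDATA).text

# Once parsed, nothing records which quoting style was used.
print(repr(escaped_text))
print(escaped_text == cdata_text)
```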

It’s just two different ways of quoting text lexically in XML. The data being represented is exactly the same, just as these C strings are all the same:

"Hello"
"\x48\x65\x6c\x6c\x6f"
"He" "llo"

Once they are read in, the data is the same. When the time comes to print them out, there are many options, but they have no correlation with how they were read in.
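
The same holds on the way out. As a rough illustration (again with Python's xml.etree.ElementTree, which is my choice and not anything the XML spec requires), text read from a CDATA section gets written back with whatever quoting the serializer prefers:

```python
import xml.etree.ElementTree as ET

source = "<thingy type='html'><![CDATA[<p>Here is a paragraph</p>]]></thingy>"
element = ET.fromstring(source)

# The serializer picks its own quoting; it has no record of the CDATA section.
serialized = ET.tostring(element, encoding="unicode")
print(serialized)

# Reading the serialized form back gives the same character data.
round_tripped = ET.fromstring(serialized).text
print(round_tripped == element.text)
```

The output uses entity escapes rather than a CDATA section, and the data survives the round trip unchanged.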

When the time comes to transform that XML into HTML with XSLT, you’ll use disable-output-escaping:

<xsl:template match='thingy[@type="html"]'>
    <html><body>
        <xsl:value-of disable-output-escaping='yes' select='text()'/>
    </body></html>
</xsl:template>

Of course, the picture is muddied by things like SAX’s startCDATA method, which is called at the start of each CDATA section. Thankfully, SAX pipes all of the text in the CDATA section through the same characters event it uses for any other text, but the existence of startCDATA seems to encourage assigning meaning to a CDATA section. It’s just another way to escape characters.
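
To see how that plays out, here is a sketch using Python's xml.sax, where the CDATA notifications arrive through a separate lexical handler property. TextCollector, CdataSpotter, and read_text are hypothetical names of my own, and I'm assuming the default expat-based reader, which supports that property.

```python
import io
import xml.sax
import xml.sax.handler


class TextCollector(xml.sax.handler.ContentHandler):
    """Collects character data; characters() may deliver it in several chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


class CdataSpotter:
    """Hypothetical lexical handler that only counts CDATA sections."""

    def __init__(self):
        self.cdata_sections = 0

    def startCDATA(self):
        self.cdata_sections += 1

    def endCDATA(self):
        pass

    # The remaining lexical events are ignored in this sketch.
    def comment(self, content):
        pass

    def startDTD(self, name, public_id, system_id):
        pass

    def endDTD(self):
        pass


def read_text(xml_source):
    """Return (all character data, number of CDATA sections seen)."""
    content, lexical = TextCollector(), CdataSpotter()
    parser = xml.sax.make_parser()
    parser.setContentHandler(content)
    parser.setProperty(xml.sax.handler.property_lexical_handler, lexical)
    parser.parse(io.BytesIO(xml_source.encode("utf-8")))
    return "".join(content.chunks), lexical.cdata_sections


escaped_text, escaped_sections = read_text("<thingy>&lt;p&gt;Hi&lt;/p&gt;</thingy>")
cdata_text, cdata_sections = read_text("<thingy><![CDATA[<p>Hi</p>]]></thingy>")
print(escaped_text == cdata_text, escaped_sections, cdata_sections)
```

The lexical handler notices the CDATA section, but the text itself comes through characters() either way, so a handler that ignores the lexical events can't tell the two documents apart.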

Another misuse of CDATA is to see it as a no-quoting-needed free-for-all. Have a chunk of text you want in your XML, and you don’t want to bother stuffing in the &lt; and &gt; escapes? Just use a CDATA section, and you don’t have to look at the data at all. Wrong. You still have to deal with the ]]> sequence if it appears in your data, because it ends the section; the usual fix is to split it across two CDATA sections. (If you think that won’t happen, you should see the XML source for this page!)
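
One common way to handle that is to end the section in the middle of the ]]> and immediately start a new one. The helper below is a minimal sketch of the idea; cdata_section is a hypothetical name, not part of any library.

```python
import xml.etree.ElementTree as ET


def cdata_section(text):
    """Wrap text in CDATA, splitting any "]]>" across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# This text contains "]]>", which would otherwise end the section early.
tricky = "if (a[b[0]]> 1) { return '<ok>'; }"
document = "<thingy>" + cdata_section(tricky) + "</thingy>"

# The parser joins the adjacent sections back into one run of text.
recovered = ET.fromstring(document).text
print(recovered == tricky)
```

So you do have to look at the data after all, which is the same amount of work as escaping it in the first place.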

Comments

"When the time comes to transform that XML into HTML with XSLT, you'll use disable-output-escaping:"

This is true, of course, but it gets nasty when you have to actually interpret the un-escaped HTML with something other than a full-fledged web browser. :-) (It's not a common situation, but not something unheard of.) ((And this is of course what many RSS readers do, in one form or another.))
