Jesper Tverskov, August 6, 2007

Deleting comments, CDATA sections and PIs from unparsed text with REGEX in XSLT

When loading XML with the unparsed-text() function to test it with Regular Expressions, we must avoid false positives. What looks like markup inside comments, PIs and CDATA sections looks exactly like real markup when the document is treated as a string. To make testing of markup easier it can be necessary to delete comments, CDATA sections and PIs first.

The unparsed-text() function and Regular Expressions in XSLT 2.0 were primarily introduced to make it possible to transform not only XML but also text. But since an XML document is also a text document, we can also use unparsed-text() and REGEX to load and test our XML document as a string when necessary.

1. XML as XML and as text

In XSLT we have all sorts of elements and functions to get to the nodes of XML. For text nodes we have Regular Expressions. But with XML as XML we cannot test if an XML declaration exists or what the values of its pseudo-attributes are, or if a DTD declaration exists. [1]
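As a hedged sketch of the kind of test that only works on the text level (the variable named "unparsed" anticipates the later examples, where the document is loaded with the unparsed-text() function; the expressions are written as they would appear inside an XSLT attribute):

(: true if the document starts with an XML declaration :)
matches($unparsed, '^&lt;\?xml\s')

(: true if the XML declaration has the pseudo-attribute standalone="yes" :)
matches($unparsed, '^&lt;\?xml[^&gt;]*standalone\s*=\s*.yes.')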

With XML as XML we have no problem finding out what whitespace exists in text nodes if the text nodes also contain text. But whitespace-only text nodes, like indentation, are easily stripped or overlooked. Whitespace inside tags, outside attribute values, is not even reported by the XML parser to the XSLT processor. The XML parser also normalizes attribute values: tab (&#x9;), carriage return (&#xD;), newline (&#xA;) and space (&#x20;) are replaced by spaces. [2]

That is why we have a few situations where it can be necessary to load and analyze an XML document as unparsed text.

2. XML and other markup

Text looking like markup in comments, CDATA sections and PIs is not only a problem when we load XML as unparsed text but also when we load XML that is not well-formed, or HTML. That gives us two general use cases:

  1. When we need to test the XML declaration or the DTD declaration, or when whitespace-only text nodes, whitespace in attribute values or whitespace inside tags outside the attribute values are important.
  2. When we need to analyze XML that is not well-formed, or markup that is not XML, such as HTML.

Let me give a concrete example. If we want to count the lines of an XML document we need to load it with unparsed-text(). Newline characters can exist not only in text nodes side by side with text but also in the prolog, in whitespace-only text nodes, in attribute values and inside tags.
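As a minimal sketch of such a line count (assuming the document is loaded into a variable named "unparsed", as in the examples later in this article), we can split the unparsed text on newline sequences and count the tokens:

<xsl:variable name="unparsed" select="unparsed-text(document-uri(.))"/>
<!-- split on CRLF, CR or LF and count the resulting tokens -->
<xsl:variable name="line-count" select="count(tokenize($unparsed, '\r\n|\r|\n'))"/>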

3. Testing for DTD as example

Let us say that we want to test if an XHTML document has a DTD declaration for XHTML 1.0 Strict! The tricky thing is that text looking like such a declaration can exist inside comments and Processing Instructions that can be placed even right after the XML declaration, and that we can also have what looks like a DTD in CDATA sections in element data.

4. Comment

What looks like markup in comments is the order of the day. We use comments for commenting out code and markup:

<!-- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -->

4.1 REGEX to delete comments

The variable named "unparsed" contains our XHTML document loaded with the unparsed-text() function.

replace($unparsed, '&lt;!--.*?--&gt;', '', 's')

The replace() function says: in the variable named "unparsed", find the occurrences of text matching '&lt;!--.*?--&gt;' and replace them with '', that is, delete them. The 's' flag means that the "." is used in "dot all" mode. By default a dot matches any character except a newline; in "dot all" mode it also matches newline.

The REGEX matches any string beginning with "&lt;!--", followed by any combination of characters, ".*" (in dot all mode), but (and that is what the "?" after "*" means) as few characters as possible, ending with "--&gt;".

If we don't use "*?" and we have more than one comment, the end of the first comment and the following comments are all considered part of the first comment, which only ends at the end of the very last comment! [3]
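A small illustration of the difference, with a toy input string and the markup characters escaped as they would be inside an XSLT attribute:

(: non-greedy "*?": each comment is deleted on its own :)
replace('a &lt;!-- one --&gt; b &lt;!-- two --&gt; c', '&lt;!--.*?--&gt;', '', 's')
(: result: 'a  b  c' :)

(: greedy "*": one big match from the first "&lt;!--" to the last "--&gt;" :)
replace('a &lt;!-- one --&gt; b &lt;!-- two --&gt; c', '&lt;!--.*--&gt;', '', 's')
(: result: 'a  c' :)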

5. CDATA section

What looks like markup is also very common in CDATA sections. They were made to make it possible to write what looks like markup as nothing but text:

<![CDATA[A DTD can look like this: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">.]]>

5.1 REGEX to delete CDATA sections

The variable named "unparsed" contains our XHTML document loaded with the unparsed-text() function.

  replace($unparsed, '&lt;!\[CDATA\[.*?\]\]&gt;', '', 's')

The replace() function works as explained in the section about comments. Note that in the REGEX it is necessary to escape the square brackets.

6. Processing Instruction

Processing Instructions start with "<?" and end with "?>". Right after the "<?" we must have a name, the target of the PI. The rest of the PI is just one big string. It is only a convention that the string is most often made to look like one or more pseudo-attributes. What looks like markup is not common in PIs. But some jester could have made a PI like this:

<?jester hi hi just to fool you: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">?>

6.1 REGEX to delete Processing Instructions

In the REGEX we must escape "?" when it is meant as a literal character and not as a quantifier. The rest is like our expression for finding and replacing comments and CDATA sections.

replace($unparsed, '&lt;\?.*?\?&gt;', '', 's')

But we have a problem. The XML declaration is not a Processing Instruction, but it looks like one, so it will also be deleted. One way to overcome this problem is to put the XML declaration back in like this:

concat(substring-before($unparsed, '?&gt;'), '?&gt;', replace($unparsed, '&lt;\?.*?\?&gt;', '', 's'))

The concat() function starts with the XML declaration, that is, what comes before the first "?&gt;" found. As the next argument we add "?&gt;" to get the whole XML declaration back in place. Note that we don't escape the "?" here: of the three functions we use, only replace() takes a regular expression as one of its parameters; substring-before() and concat() work on plain strings. The third argument of concat() is our expression from before, deleting every Processing Instruction including the XML declaration that looks like one.
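As a toy illustration, with a made-up one-line input document written directly into the function calls (escaped the same way as the other expressions in this article):

concat(
  substring-before('&lt;?xml version="1.0"?&gt;&lt;p&gt;&lt;?pi data?&gt;&lt;/p&gt;', '?&gt;'),
  '?&gt;',
  replace('&lt;?xml version="1.0"?&gt;&lt;p&gt;&lt;?pi data?&gt;&lt;/p&gt;', '&lt;\?.*?\?&gt;', '', 's')
)
(: result: '&lt;?xml version="1.0"?&gt;&lt;p&gt;&lt;/p&gt;' :)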

7. Cleaning in multiple steps

Depending on what we are up to, we can tidy up our XML document loaded with the unparsed-text() function step by step, or we could do all the cleaning in one expression. The step by step approach is the easiest to understand and the most flexible:

<xsl:variable name="unparsed" select="unparsed-text(document-uri(.))"/>
<!-- unparsed - comment -->
<xsl:variable name="unparsed2" select="replace($unparsed, '&lt;!--.*?--&gt;', '', 's')"/>
<!-- unparsed - comment - cdata -->
<xsl:variable name="unparsed3" select="replace($unparsed2, &lt;!\[CDATA\[.*?\]\]&gt;', '', 's')"/>
<!-- unparsed - comment - cdata - pi -->
<xsl:variable name="unparsed4" select="if (starts-with($unparsed, '&lt;?xml')) then concat(substring-before($unparsed3, '?&gt;'), '?&gt;', replace($unparsed3, '&lt;\?.*?\?&gt;', '', 's')) else ''"/>''

The "unparsed4" variable contains our unparsed text cleaned for comments, CDATA sections and Processing Instructions, that is for all text that could contain sequences of characters that could be mistaken for markup. We now have four variables to choose from depending on what we might find most convenient.

8. Well-formed XML

All our REGEX examples so far have one prerequisite: that our XML is well-formed in the first place. If it is not well-formed it can be very difficult to test with Regular Expressions. The best way to proceed is to modify it until it becomes well-formed.

In XSLT we have an easy way to test if an XML document is well-formed. The doc-available() function (XPath) returns true if the document exists and is well-formed. This is a nice function to use before we use document() (XSLT function) or doc() (XPath function) to make sure that the XML document exists and is loaded.
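A minimal sketch of that guard (the URI is only a placeholder):

<xsl:if test="doc-available('uri of doc')">
  <xsl:apply-templates select="doc('uri of doc')"/>
</xsl:if>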

When we load an XML document with the unparsed-text() function, the function does not care whether the document is well-formed or not. That is why we have the function. But if we want the input to be well-formed, we can still use doc-available() first:

<xsl:variable name="unparsed-test" select="if(doc-available(document-uri('uri of doc'))) then unparsed-text(document-uri('uri of doc')) else 'not found or not well-formed'"/>
<xsl:if test="$unparsed-test eq 'not found or not well-formed'">
  <xsl:message select="$unparsed-test" terminate="yes"/>
</xsl:if>

9. Whitespace in XML

Even in well-formed markup, we can have whitespace characters in the most unlikely places, not just in the form of whitespace-only text nodes but also inside tags. Let us take the XML declaration as an example. We are so used to it looking like the following when all three pseudo-attributes are used:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

We often forget that the same XML declaration is also well-formed when it looks like this:

<?xml
    version            =
  "1.0"
      encoding      =            "UTF-8"
          standalone      =          "no"
                  ?>

Whitespace in tags that does not break the rules of well-formedness is not reported by the XML parser to the XSLT processor. We must keep that in mind when we start REGEXing our XML document loaded as unparsed text. To make testing easier it is often a good idea to use the normalize-space() function:

<xsl:variable name="unparsed-normalized" select="normalize-space($unparsed)"/>

The normalize-space() function replaces tab (&#x9;), carriage return (&#xD;), newline (&#xA;) and space (&#x20;) characters with spaces, replaces all consecutive spaces with a single space, and removes leading and trailing whitespace.

10. Only deleting or replacing "&lt;"

In the examples so far we have deleted comments, CDATA sections and PIs to avoid false positives. We could also do something less drastic: delete or replace "&lt;" only. It takes more code. To delete or replace only inside comments, CDATA sections and PIs in an XML or HTML document loaded as unparsed text, we need to use the xsl:analyze-string element.

<xsl:variable name="unparsed2">
  <xsl:analyze-string select="$unparsed" regex="&lt;!--.*?--&gt;|&lt;!\[CDATA\[.*?\]\]&gt;|&lt;\?.*?\?&gt;" flags="s">
    <xsl:matching-substring>
      <xsl:value-of select="replace(., '&lt;', '')"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:variable>
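One follow-up worth noting (the variable name is just for illustration): because the variable above is built with content instead of a select attribute, it holds a temporary document node, and it can be turned back into a plain string if further replace() calls are needed:

<xsl:variable name="unparsed2-string" select="string($unparsed2)"/>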

Footnotes

[1]

Considering that XSLT ought to be the state of the art for getting at anything in an XML document, I consider it a grave mistake that we have no functions in XSLT to get to the XML and DTD declarations except as unparsed text. We have no way at all in XPath.

[2]

Attribute values are not normalized the way the normalize-space() function does it: leading and trailing whitespace is not removed, and consecutive spaces are not replaced with a single space.

[3]

In Regular Expressions "?", "+", "*" are quantifiers and we use them all the time. Putting a "?" after a quantifier is less common and many REGEX users often forget that this option exists. It is needed in rare situations to avoid matches inside matches:

? = Zero or one time, preferring one time.
?? = Zero or one time, preferring zero times.
+ = One or more times, as many times as possible.
+? = One or more times, as few times as possible.
* = Zero or more times, as many times as possible.
*? = Zero or more times, as few times as possible.

Updated 2009-08-06