Jesper Tverskov, August 6, 2007

Schematron for XML and XHTML prolog validation

With DTD, XML Schema or other grammar-based schema languages we cannot validate the prolog, that is, everything before the top-element of an XML document. With Schematron we can validate the XML declaration and the DTD declaration, as well as comments, processing instructions and whitespace before the top-element.

In XPath 2.0 and XSLT 2.0 we have no functions that can test the prolog directly. We can test whether it contains comments and processing instructions, but to validate the XML declaration, the DTD declaration and the whitespace in the prolog we need to load our XML document as an unparsed string with the unparsed-text() function and use Regular Expressions.

1. Additional requirements for prolog

Since XHTML is the most widespread and well-known XML application, we use XHTML as our example. To give us something to work with, we have made up the following more or less sensible requirements for the prolog of our XHTML documents:

  1. XML declaration must be used.
  2. XML version must be 1.0.
  3. UTF-8 encoding must be used explicitly.
  4. Standalone pseudo attribute must not be used.
  5. XML declaration must not contain insignificant whitespace.
  6. A DTD declaration must be used.
  7. The DTD must be for XHTML 1.0 Strict!
  8. No comments or PIs between XML and DTD declarations.
  9. Only one newline between XML and DTD declarations. [1]

Some of the tests above could be merged into one test, but we want to be able to return specific error messages to make life easier for the user. We only validate what is not already taken care of by the well-formedness check and validation against the DTD, although we could have decided to duplicate some of those tests in order to report "easy to understand" error messages.

How many errors should be reported at a time can always be debated. Some errors are related, and the more errors we report at once, the harder each one is to pin down and the more confused the user may become. If the XML declaration is missing, there is no need to report that the XML version is not "1.0" and that the "encoding" pseudo-attribute is not used. [2]

2. Prolog validation in XSLT 2.0

In the following we first write the tests in XSLT in a way that can easily be transferred to Schematron. Next we transfer our solution to Schematron. Our sample XSLT stylesheet, validateprolog.xsl, is included below, annotated with footnotes.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0" xmlns:xhtml="http://www.w3.org/1999/xhtml" exclude-result-prefixes="xhtml"> [3]
<!-- Loads the XHTML document as unparsed text -->
<xsl:variable name="unparsed" select="unparsed-text(document-uri(.))"/> [4]

<!-- unparsed - comment --> [5]
<xsl:variable name="unparsed2" select="replace($unparsed, '&lt;!--.*?--&gt;', '', 's')"/> [6]

<!-- unparsed - comment - cdata --> [7]
<xsl:variable name="unparsed3" select="replace($unparsed2, '&lt;!\[CDATA\[.*?\]\]&gt;', '', 's')"/>

<!-- unparsed - comment - cdata - pi --> [8]
<xsl:variable name="unparsed4" select="if (starts-with($unparsed, '&lt;?xml')) then concat(substring-before($unparsed3, '?&gt;'), '?&gt;', replace($unparsed3, '&lt;\?.*?\?&gt;', '', 's')) else replace($unparsed3, '&lt;\?.*?\?&gt;', '', 's')"/> [9]

<xsl:template match="/">
<test>  
<!-- Validate if document starts with an XML declaration. -->
<xsl:if test="not(starts-with($unparsed, '&lt;?xml'))"><error>Your XHTML document must start with an XML declaration.</error></xsl:if>

<!-- Validate if XML declaration is version 1.0. --> [10]
<xsl:variable name="declaration" select="if (starts-with($unparsed, '&lt;?xml')) then concat(substring-before($unparsed, '?>'), '?>') else ''"/>
<xsl:if test="starts-with($unparsed, '&lt;?xml') and not(matches($declaration, '1\.0'))"><error>You must use XML version 1.0.</error></xsl:if> [11]

<!-- Validate if encoding="UTF-8" is used explicitly. -->
<xsl:if test="starts-with($unparsed, '&lt;?xml') and not(matches($declaration, 'utf-8|UTF-8'))"><error>You must use XML encoding="UTF-8" explicitly.</error></xsl:if>

<!-- Validate that the standalone pseudo attribute is not used. -->
<xsl:if test="starts-with($unparsed, '&lt;?xml') and matches($declaration, 'standalone')"><error>The standalone pseudo attribute must not be used.</error></xsl:if>

<!-- Validate that the XML declaration has no insignificant whitespace. --> [12]
<xsl:if test="matches($declaration, 'utf-8|UTF-8') and not(matches($declaration, 'standalone')) and not(string-length($declaration) eq 38)"><error>The XML declaration must not contain insignificant whitespace.</error></xsl:if>

<!-- Validate that a DTD exists. --> [13]
<xsl:if test="not(matches($unparsed4, '&lt;!DOCTYPE'))"><error>The DOCTYPE declaration is missing.</error></xsl:if>

<!-- Validate that there are no comments or processing instruction between XML declaration and DTD. --> [14]
<xsl:if test=" matches($declaration, 'utf-8|UTF-8') and not(matches($declaration, 'standalone')) and string-length($declaration) eq 38 and matches($unparsed4, '&lt;!DOCTYPE') and matches(substring-after(substring-before($unparsed, '&lt;!DOCTYPE'), '?&gt;'), '&lt;(!--|\?)')"><error>There must be no comments or PIs between XML declaration and DTD.</error></xsl:if>

<!-- Validate that there is exactly one newline between XML declaration and the DTD. --> [15]
<xsl:if test="string-length($declaration) eq 38 and not(matches(substring-after(substring-before($unparsed, '&lt;!DOCTYPE'), '?&gt;'), '&lt;(!--|\?)')) and not(string-length(substring-before($unparsed4, '&lt;!DOCTYPE')) - string-length(replace(substring-before($unparsed4, '&lt;!DOCTYPE'), '&#xA;', '')) = 1)"><error>There must be exactly one newline between XML declaration and the DTD.</error></xsl:if>

<!-- Validate that DTD for XHTML 1.0 Strict! is used. --> [16]
<xsl:if test="matches($unparsed4, '&lt;!DOCTYPE') and not(matches($unparsed4, 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd')) and not(matches($unparsed4, '-//W3C//DTD XHTML 1.0 Strict//EN'))"><error>The DTD must be for XHTML 1.0 Strict!.</error></xsl:if>

</test>
</xsl:template>
</xsl:stylesheet>

3. Prolog validation in Schematron

Now that we understand our tests in XSLT, they should also be easy to understand in Schematron, validateprolog.sch.xml (I have added ".xml" so that more browsers will show it. Remember to look in the source code to get to the unformatted XML). Note that it is necessary to negate the tests in the sch:assert element. The xsl:if element in XSLT fires when the test is true. The sch:assert element in Schematron fires when the test is not true.

<?xml version="1.0"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
<sch:ns uri="http://www.w3.org/1999/xhtml" prefix="xhtml"/>

<!-- Loads the XHTML document as unparsed text -->
<sch:let name="unparsed" value="unparsed-text(document-uri(.))"/>

<!-- unparsed - comment -->
<sch:let name="unparsed2" value="replace($unparsed, '&lt;!--.*?--&gt;', '', 's')"/>

<!-- unparsed - comment - cdata -->
<sch:let name="unparsed3" value="replace($unparsed2, '&lt;!\[CDATA\[.*?\]\]&gt;', '', 's')"/>

<!-- unparsed - comment - cdata - pi -->
<sch:let name="unparsed4" value="if (starts-with($unparsed, '&lt;?xml')) then concat(substring-before($unparsed3, '?&gt;'), '?&gt;', replace($unparsed3, '&lt;\?.*?\?&gt;', '', 's')) else replace($unparsed3, '&lt;\?.*?\?&gt;', '', 's')"/>

<!-- Variable for the XML declaration -->
<sch:let name="declaration" value="if (starts-with($unparsed, '&lt;?xml')) then concat(substring-before($unparsed, '?>'), '?>') else ''"/>

<sch:pattern>
<sch:title>Validating XML declaration</sch:title>
<sch:rule context="xhtml:html">

<!-- Validate if document starts with an XML declaration. -->
<sch:assert test="starts-with($unparsed, '&lt;?xml')">Your XHTML document must start with an XML declaration.</sch:assert>

<!-- Validate if XML declaration is version 1.0. -->
<sch:assert test="not(starts-with($unparsed, '&lt;?xml') and not(matches($declaration, '1.0')))">You must use XML version 1.0.</sch:assert>

<!-- Validate if encoding="UTF-8" is used explicitly. -->
<sch:assert test="not(starts-with($unparsed, '&lt;?xml') and not(matches($declaration, 'utf-8|UTF-8')))">You must use XML encoding="UTF-8" explicitly.</sch:assert>

<!-- Validate that the standalone pseudo attribute is not used. -->
<sch:assert test="not(starts-with($unparsed, '&lt;?xml') and matches($declaration, 'standalone'))">The standalone pseudo attribute must not be used.</sch:assert>

<!-- Validate that the XML declaration has no insignificant whitespace. -->
<sch:assert test="not(matches($declaration, 'utf-8|UTF-8') and not(matches($declaration, 'standalone')) and not(string-length($declaration) eq 38))">The XML declaration must not contain insignificant whitespace.</sch:assert>
</sch:rule>
</sch:pattern>

<sch:pattern>
<sch:title>Validating DTD</sch:title>
<sch:rule context="xhtml:html">

<!-- Validate that DTD is used. -->
<sch:assert test="matches($unparsed4, '&lt;!DOCTYPE')">The DOCTYPE declaration is missing.</sch:assert>

<!-- Validate that there are no comments or processing instruction between XML declaration and DTD. -->
<sch:assert test="not(matches($declaration, 'utf-8|UTF-8') and not(matches($declaration, 'standalone')) and string-length($declaration) eq 38 and matches($unparsed4, &lt;!DOCTYPE') and matches(substring-after(substring-before($unparsed, '&lt;!DOCTYPE'), '?&gt;'), '&lt;(!--|\?)'))">There must be no comments or PIs between XML declaration and DTD.</sch:assert>

<!-- Validate that there is exactly one newline between XML declaration and the DTD. -->
<sch:assert test="not(string-length($declaration) eq 38 and not(matches(substring-after(substring-before($unparsed, '&lt;!DOCTYPE'), '?&gt;'), '&lt;(!--|\?)')) and not(string-length(substring-before($unparsed4, '&lt;!DOCTYPE')) - string-length(replace(substring-before($unparsed4, '&lt;!DOCTYPE'), '&#xA;', '')) = 1))">There must be exactly one newline between XML declaration and the DTD.</sch:assert>

<!-- Validate that DTD for XHTML 1.0 Strict! is used. -->
<sch:assert test="not(matches($unparsed4, '&lt;!DOCTYPE') and not(matches($unparsed4, 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd')) and not(matches($unparsed4, '-//W3C//DTD XHTML 1.0 Strict//EN')))">The DTD must be for XHTML 1.0 Strict!</sch:assert>
</sch:rule>
</sch:pattern>
</sch:schema>

Footnotes

[1]

I want to repeat that some of our requirements for the prolog are only a matter of keeping the source code more or less tidy.

[2]

I have not yet decided what constitutes best practice. In this tutorial I only report one error at a time, except if both the XML and DTD declarations are missing. In some validation environments we would probably prefer to get as many independent errors as possible reported at a time.

[3]

Note that in our XSLT stylesheet we declare the XHTML namespace and use "xhtml" as the prefix. We could use any prefix not already in use. But we must declare the XHTML namespace with a prefix in order to get to the markup of the XHTML document in an easy way. See my article Transform XHTML to XHTML with XSLT.

[4]

In the variable named "unparsed" we load our XHTML or XML document as unparsed text using the unparsed-text() function. In our stylesheet the source document to be transformed and the loaded document are one and the same.

[5]

In the variable named "unparsed2", we take the variable named "unparsed" and delete all comments from it. We do this to make it easier to prevent false positives. When we later test whether our unparsed text contains a DTD in the prolog, we are not fooled by commented-out DTD declarations.

[6]

The replace() function takes a Regular Expression as its second argument. The fourth argument is "s", meaning that "." is used in "dot all" mode and also matches newline characters. The "?" after the "*" quantifier makes the quantifier reluctant: the match ends as soon as it can be satisfied. If we forget the "?", a single match runs from the start of the first comment to the end of the last one and swallows everything in between. See my article Deleting comments, CDATA sections and PIs with REGEX from unparsed text in XSLT.
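
The difference between the reluctant and the greedy quantifier is easy to try outside an XSLT processor. The following is a minimal sketch in Python, not part of the stylesheet: it assumes Python's re module as a stand-in for the XPath regex engine, with re.DOTALL playing the role of the "s" flag, and the sample string is made up for illustration.

```python
import re

# Hypothetical sample text: two comments around one element.
sample = "<!-- first -->\n<p>kept</p>\n<!-- second\nspans lines -->"

# Reluctant quantifier (.*?) in dot-all mode: each comment is matched on its own.
lazy = re.sub(r"<!--.*?-->", "", sample, flags=re.DOTALL)

# Greedy quantifier (.*): one match from the first "<!--" to the last "-->".
greedy = re.sub(r"<!--.*-->", "", sample, flags=re.DOTALL)

print(repr(lazy))    # the <p> element survives
print(repr(greedy))  # the <p> element is deleted together with the comments
```

With the greedy version the element between the two comments is lost, which is exactly the kind of damage the "?" prevents.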

[7]

We delete all CDATA sections from our unparsed text to make testing for markup easier. Note that it is necessary to escape the square brackets in the Regular Expression.

[8]

We delete all processing instructions. The XML declaration is not a processing instruction but it looks like one, so it will also be deleted. The expression first tests whether an XML declaration exists. If it does, the concat() function is used: the first two arguments put the XML declaration back in, and the last argument adds the unparsed text with no comments, no CDATA sections and no PIs. If no XML declaration is found, the PIs are simply deleted.
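
The put-it-back-in step can be sketched as follows. This is a Python illustration under the same assumptions as before (Python's re module instead of the XPath regex engine); the function name strip_pis and the sample documents are hypothetical.

```python
import re

# Same pattern as in the stylesheet: "<?", anything (reluctantly), "?>".
PI = re.compile(r"<\?.*?\?>", re.DOTALL)

def strip_pis(text: str) -> str:
    """Delete all PIs, but keep an XML declaration at the very start."""
    without_pis = PI.sub("", text)  # this also deletes the XML declaration
    if text.startswith("<?xml"):
        # Mirrors concat(substring-before(., '?>'), '?>', replace(...)).
        declaration = text.split("?>", 1)[0] + "?>"
        return declaration + without_pis
    return without_pis

with_decl = ('<?xml version="1.0" encoding="UTF-8"?>\n'
             '<?xml-stylesheet href="a.css"?>\n<html/>')
no_decl = '<?target data?><html/>'

print(repr(strip_pis(with_decl)))
print(repr(strip_pis(no_decl)))
```

The declaration survives, the stylesheet PI does not, and the newlines around the deleted PI are left untouched.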

[9]

In the Regular Expression we must escape the question mark when it is not a quantifier.

[10]

The variable named "declaration" is only there to make testing marginally easier. If the unparsed text starts with an XML declaration, it is safe to make a variable consisting of what comes before the first "?>" encountered. The concat() function is used to put "?>" back in, since its first argument only contains what is before "?>".

[11]

We only bother about the "version", "encoding" and "standalone" pseudo attributes if the XML declaration exists.

[12]

An XML declaration with only the version and encoding="utf-8" pseudo attributes consists of 38 characters if there is no insignificant whitespace.
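
The number 38 can be checked by counting. A minimal Python sketch, with a hypothetical helper named declaration_of that mirrors the "declaration" variable of the stylesheet:

```python
def declaration_of(text: str) -> str:
    """Return everything up to and including the first '?>', or '' if
    the text does not start with an XML declaration."""
    if text.startswith("<?xml"):
        return text.split("?>", 1)[0] + "?>"
    return ""

tidy = '<?xml version="1.0" encoding="UTF-8"?>\n<html/>'
padded = '<?xml version="1.0"  encoding="UTF-8" ?>\n<html/>'  # two extra spaces

print(len(declaration_of(tidy)))    # 38
print(len(declaration_of(padded)))  # 40
```

Any extra space or newline inside the declaration makes it longer than 38 characters, which is all the length test relies on.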

[13]

Notice that we use the "unparsed4" version of our unparsed text. All comments, CDATA sections and PIs have been deleted, so we don't have to worry about false positives when we test for the existence of "&lt;!DOCTYPE".

[14]

We run this test only if the XML declaration is exactly as we want it and the DTD exists, to prevent the error message from appearing together with other error messages. The Regular Expression '&lt;(!--|\?)' matches the beginning of a comment, "&lt;!--", or the beginning of a PI, "&lt;\?". When the question mark is not a quantifier we must escape it.
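
The core of the test, looking for the start of a comment or PI in the text between the XML declaration and the DTD, can be sketched like this. It is a Python illustration with hypothetical names; str.partition() is used to imitate substring-before() and substring-after(), which return an empty string when the separator is missing.

```python
import re

def gap_before_doctype(text: str) -> str:
    """Mirror substring-after(substring-before(., '<!DOCTYPE'), '?>')."""
    # substring-before() yields '' when '<!DOCTYPE' does not occur.
    before = text.partition("<!DOCTYPE")[0] if "<!DOCTYPE" in text else ""
    # partition()[2] is '' when '?>' does not occur, like substring-after().
    return before.partition("?>")[2]

def has_comment_or_pi(gap: str) -> bool:
    # "<!--" starts a comment, "<?" starts a PI; "?" is escaped in the regex.
    return re.search(r"<(!--|\?)", gap) is not None

clean = '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE html>\n<html/>'
dirty = ('<?xml version="1.0" encoding="UTF-8"?>\n<!-- note -->\n'
         '<!DOCTYPE html>\n<html/>')

print(has_comment_or_pi(gap_before_doctype(clean)))  # False
print(has_comment_or_pi(gap_before_doctype(dirty)))  # True
```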

[15]

We start with a lot of tests to prevent the error message from appearing together with other error messages. The best way to count newline characters is to delete them from the string and use the string-length() function to count how much shorter the string is.
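
The counting trick looks like this in a minimal Python sketch (the function name count_newlines and the sample strings are hypothetical). The length difference is used on purpose to mirror the XPath expression, even though other languages offer more direct ways to count characters.

```python
def count_newlines(text: str) -> int:
    """Count newlines as the difference in length after deleting them."""
    return len(text) - len(text.replace("\n", ""))

# Text before "<!DOCTYPE" in two made-up documents.
one_newline = '<?xml version="1.0" encoding="UTF-8"?>\n'
two_newlines = '<?xml version="1.0" encoding="UTF-8"?>\n\n'

print(count_newlines(one_newline))   # 1
print(count_newlines(two_newlines))  # 2
```

Only the first document passes the "exactly one newline" requirement.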

[16]

We use the "unparsed4" version of our unparsed text so we don't have to worry about false positives.

Updated 2009-08-06