Jesper Tverskov, February 14, 2006

Using unparsed-text() in XSLT 2.0 to test prolog

In XSLT 2.0 we can use the unparsed-text() function to test the XML declaration and the DOCTYPE declaration. We can read the pseudo-attributes of the XML declaration and the values of PUBLIC or SYSTEM in the DTD in order to recreate or modify them as we please.

In XSLT 1.0 it is impossible to detect whether the input XML document has an XML declaration or a DOCTYPE declaration. We cannot read the values of the XML declaration's pseudo-attributes, "version", "encoding" and "standalone". Likewise, we cannot read the name, path and filename of the DTD. This is surprising, since it is easy to come up with use cases.

1. Use case for testing declarations

Let us say we have 1000 XHTML documents. Some of them are XHTML 1.0 Transitional, some are XHTML 1.0 Strict and some are XHTML 1.1. Not all the documents have an XML declaration, and they use different encodings, not all of them Unicode.

We want to make an XSLT stylesheet that can transform all the XML input documents in order to add a Table of Contents (TOC) at the top of the body section of each document. We only want to add the TOC. Except for the TOC, the documents must not be changed in any way.

Wouldn't it be nice if we could use one and the same XSLT stylesheet to transform all the documents with some sort of "identity"-template (recursive copying) leaving everything as it was except that we want to add, delete or modify something? In this case we want to add a TOC to all the documents.

2. The unparsed-text() function in XSLT 2.0

The unparsed-text() function is new in XSLT 2.0. It is similar to document(), known from XSLT 1.0 and still with us in XSLT 2.0, and to doc() in XPath 2.0. But unparsed-text() is not for loading XML documents into the transformation process as node trees; it is for loading text documents as strings. Since an XML document is also a text document, we can use unparsed-text() to load an XML document as an unparsed string.
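
The difference can be illustrated with a small sketch. The file name "sample.xml" and the parameter name "href" are made up for the example, and the file is assumed to sit next to the stylesheet:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Hypothetical file name, resolved relative to the stylesheet -->
  <xsl:param name="href" select="'sample.xml'"/>
  <xsl:template name="main" match="/">
    <!-- doc() parses the file and returns a document node we can navigate -->
    <xsl:value-of select="count(doc($href)//*)"/>
    <xsl:text> elements, </xsl:text>
    <!-- unparsed-text() returns the same file as one string, markup included -->
    <xsl:value-of select="string-length(unparsed-text($href))"/>
    <xsl:text> characters</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The first call sees elements and attributes but no prolog; the second sees every character of the file, including the XML declaration and the DOCTYPE declaration.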

We can even use unparsed-text() to load the XML input document itself. In this way the input document is not just transformed, it is also available in the transformation process as a string value we can test with string functions.
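
A minimal sketch of this idea, assuming the processor reports a usable URI for the input document through document-uri(). The variable name "source-text" and the output element "has-xml-declaration" are invented for the illustration:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Assumption: document-uri(/) returns a URI that unparsed-text() can read -->
  <xsl:variable name="source-text" select="unparsed-text(document-uri(/))"/>
  <xsl:template match="/">
    <!-- Report whether the raw text of the input starts with an XML declaration -->
    <has-xml-declaration>
      <xsl:value-of select="starts-with($source-text, '&lt;?xml')"/>
    </has-xml-declaration>
  </xsl:template>
</xsl:stylesheet>
```

If the transformation is started without an input document, or if the processor cannot report the document URI, the variable cannot be evaluated, so this only works when transforming a file the processor has loaded itself.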

Having tested the XML declaration and the DOCTYPE declaration and their values, we can use the "result-document" element to recreate or modify the XML declaration and the DOCTYPE declaration of the output XML document dynamically.
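
The following reduced sketch shows only the last step. The three parameters and their default values are placeholders for values that would normally come from testing the unparsed input, and "output.xml" is an arbitrary file name:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Hypothetical parameters standing in for values read from the input prolog -->
  <xsl:param name="encoding" select="'ISO-8859-1'"/>
  <xsl:param name="public-id" select="'-//W3C//DTD XHTML 1.0 Strict//EN'"/>
  <xsl:param name="system-id" select="'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'"/>
  <xsl:template match="/">
    <!-- The serialization attributes are attribute value templates, so they can be computed at run time -->
    <xsl:result-document href="output.xml" method="xml" encoding="{$encoding}" doctype-public="{$public-id}" doctype-system="{$system-id}">
      <xsl:copy-of select="."/>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>
```

With xsl:output the same attributes are fixed when the stylesheet is written. With xsl:result-document they can depend on what we found in the input.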

3. Testing with string functions

It can be very tricky to test for something in a string. In our case we should remember that XML documents can have commented-out XML and DOCTYPE declarations at the very top of the document, and such declarations can also show up in CDATA sections. Even processing instructions can have string content very similar to an XML or DOCTYPE declaration, and the text of the document could talk about DTDs and XML declarations and show what they look like, escaping "<" with '&lt;'.

We can either test with one masterpiece of a regular expression (which no one else understands) or, as I have done, with many small and easy-to-understand steps. Often less meticulous testing is enough, but even my testing is probably not good enough in all situations. Feel free to suggest better testing.
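
To give an impression of the small-steps approach, here is a sketch working on a made-up sample string instead of a real file. The variable names are invented for the example, and the sample contains a commented-out DOCTYPE declaration followed by a real one:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Made-up sample: a commented-out DOCTYPE followed by a real one -->
  <xsl:variable name="sample" select="'&lt;!-- &lt;!DOCTYPE old SYSTEM &quot;old.dtd&quot;> --> &lt;!DOCTYPE html SYSTEM &quot;new.dtd&quot;> &lt;html/>'"/>
  <!-- Step 1: drop comments; the reluctant quantifier stops at the end of each comment -->
  <xsl:variable name="no-comments" select="replace($sample, '&lt;!--.*?-->', '')"/>
  <!-- Step 2: drop everything from the first element start tag onwards -->
  <xsl:variable name="prolog-only" select="replace($no-comments, '&lt;\i.*', '')"/>
  <xsl:template name="main" match="/">
    <xsl:value-of select="normalize-space($prolog-only)"/>
  </xsl:template>
</xsl:stylesheet>
```

Only the real declaration, <!DOCTYPE html SYSTEM "new.dtd">, should be left after the two steps. Each step can be inspected on its own, which is the main advantage over a single large regular expression.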

In XSLT 1.0 and 2.0 there is no way to create or recreate the content of an internal DTD subset. Some XSLT processors, like SAXON, have extensions that can do it, but this is outside the scope of a generic XSLT 2.0 solution. In the XSLT stylesheet I have made, as we are going to see in a moment, the transformation is terminated with a warning if an internal DTD subset is detected.

4. "Prolog identity" template

I have made an XSLT stylesheet, prolog.xslt, that can be used as a "prolog-identity-template" for XML and DOCTYPE declarations. "Prolog.xslt" can be included in other XSLT stylesheets using the <xsl:include> element. I have tried to make a generic solution but just regard it as a test showing how it could be done. [1]

Prolog.xslt can be very useful when transforming any XML document and especially an XML document using a DTD. "Prolog.xslt" is often relevant when we want to transform something to "itself", like transforming DocBook to DocBook or XHTML to XHTML. Basically we need to include "prolog.xslt" in an XSLT stylesheet using the famous "identity template". When transforming XHTML documents, it is important to remember that they must be XML, that is, they must be well-formed. [2]

Often we don't want to recreate everything in the XML and DOCTYPE declarations as they were. For example, we might want to use UTF-8 as encoding in the output documents no matter what encoding was used in the input document. Maybe we want to get rid of irrelevant standalone declarations in the input documents, et cetera. In order to make it easy to modify "prolog.xslt", I have added a "configuration" section to the stylesheet.

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Here we load the input document as an unparsed string -->
<xsl:variable name="unparsed" select="unparsed-text(document-uri(.))"/>
<!-- The next four variables are preparations to make it easier to modify the name of the output file -->
<xsl:variable name="input-file" select="tokenize(document-uri(.), '(\\|/)')[last()]"/> <!-- see footnote [3] -->
<xsl:variable name="input-filename" select="substring-before($input-file, '.')"/>
<xsl:variable name="input-path" select="substring-before(document-uri(.), $input-filename)"/>
<xsl:variable name="input-extension" select="substring-after($input-file, '.')"/>
<!-- The next two variables are for single and double quote to make the next variable easier to read -->
<xsl:variable name="apos" select="&quot;'&quot;"/>
<xsl:variable name="quot" select="'&quot;'"/>
<!-- Here we normalize-space and single quotes are replaced by double quotes -->
<xsl:variable name="unparsed-normalized" select="normalize-space(translate($unparsed, $apos, $quot))"/>
<!-- Here we remove all comments so they don't fool our testing. The reluctant quantifier removes one comment at a time -->
<xsl:variable name="unparsed-no-comment" select="replace($unparsed-normalized, '&lt;!--.*?-->', '')"/>
<!-- Here we remove all PI's so they don't fool our testing -->
<xsl:variable name="unparsed-no-pi">
  <xsl:choose>
    <xsl:when test="contains(substring($unparsed-normalized, 1, 5), '&lt;?xml')">
      <xsl:value-of select="replace(substring-after($unparsed-no-comment, '?>'), '&lt;\?.+?\?>', '')"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="replace($unparsed-no-comment, '&lt;\?.+?\?>', '')"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- Here we remove all CDATA sections so they don't fool our testing -->
<xsl:variable name="unparsed-no-cdatasection" select="replace($unparsed-no-pi, '&lt;!\[CDATA\[.*?\]\]>', '')"/>
<!-- Here we remove the rest, that is everything but what is left of the prolog -->
<xsl:variable name="unparsed-no-topelement" select="replace($unparsed-no-cdatasection, '&lt;\i.*', '')"/>
<!-- dtd-declaration-string -->
<!-- The string contains nothing but the DOCTYPE declaration, and only if it exists. If no DOCTYPE declaration exists, the value is the string 'false' -->
<xsl:param name="dtd-declaration-string">
  <xsl:choose>
    <xsl:when test="normalize-space($unparsed-no-topelement) ne ''">
      <xsl:value-of select="normalize-space($unparsed-no-topelement)"/>
    </xsl:when>
    <xsl:otherwise>false</xsl:otherwise>
  </xsl:choose>
</xsl:param>
<!-- **************************************** -->
<!-- Template match "/" -->
<!-- **************************************** -->
<xsl:template match="/">
<!-- If an internal DTD subset is detected the transformation stops here, if "xsl:message" is supported. -->
<xsl:variable name="afterDoctype" select="substring-after($dtd-declaration-string, 'DOCTYPE')"/>
<xsl:if test="contains($afterDoctype, '&lt;!')">
  <xsl:message terminate="yes">Internal DTD subset detected. Subsets are not recreated since it is not possible in XSLT 2.0 without extensions.</xsl:message>
</xsl:if>
<!-- xml declaration -->
<xsl:variable name="input-xml-declaration">
  <xsl:choose>
    <xsl:when test="contains(substring($unparsed-normalized, 1, 5), '&lt;?xml')">true</xsl:when>
    <xsl:otherwise>false</xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- xml declaration string -->
<xsl:variable name="xml-declaration-string">
  <xsl:choose>
    <xsl:when test="$input-xml-declaration eq 'true'">
      <xsl:value-of select="concat(substring-before($unparsed-normalized,'?>'), '?>')"/>
    </xsl:when>
    <xsl:otherwise>false</xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- xml version pseudo-attribute -->
<xsl:variable name="input-xml-version">
  <xsl:choose>
    <xsl:when test="$input-xml-declaration eq 'true'">
      <xsl:variable name="version-plus" select="substring-after($xml-declaration-string, 'version=&quot;')"/>
      <xsl:value-of select="substring-before($version-plus, '&quot;')"/>
    </xsl:when>
    <xsl:otherwise>false</xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- xml encoding pseudo-attribute -->
<xsl:variable name="input-xml-encoding">
  <xsl:choose>
    <xsl:when test="$input-xml-declaration eq 'true'">
      <xsl:choose>
        <xsl:when test="contains($xml-declaration-string, 'encoding=')">
          <xsl:variable name="encoding-plus" select="substring-after($xml-declaration-string, 'encoding=&quot;')"/>
          <xsl:value-of select="substring-before($encoding-plus, '&quot;')"/>
        </xsl:when>
        <xsl:otherwise>UTF-8</xsl:otherwise>
      </xsl:choose>
    </xsl:when>
    <xsl:otherwise>UTF-8</xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- xml standalone pseudo-attribute -->
<xsl:variable name="input-xml-standalone">
  <xsl:choose>
    <xsl:when test="$input-xml-declaration eq 'true'">
      <xsl:choose>
        <xsl:when test="contains($xml-declaration-string, 'standalone=')">
          <xsl:variable name="standalone-plus" select="substring-after($xml-declaration-string, 'standalone=&quot;')"/>
          <xsl:value-of select="substring-before($standalone-plus, '&quot;')"/>
        </xsl:when>
        <xsl:otherwise>omit</xsl:otherwise>
      </xsl:choose>
    </xsl:when>
    <xsl:otherwise>omit</xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- DOCTYPE public -->
<xsl:variable name="input-dtd-public">
  <xsl:choose>
    <xsl:when test="matches($dtd-declaration-string, '&lt;!DOCTYPE.+PUBLIC.+>')">
      <!-- The public identifier is the first quoted string after PUBLIC -->
      <xsl:variable name="dtd-public-plus" select="substring-after($dtd-declaration-string, 'PUBLIC &quot;')"/>
      <xsl:value-of select="substring-before($dtd-public-plus, '&quot;')"/>
    </xsl:when>
    <xsl:otherwise>false</xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- DOCTYPE system -->
<xsl:variable name="input-dtd-system">
  <xsl:choose>
    <xsl:when test="contains($dtd-declaration-string, 'SYSTEM')">
      <xsl:variable name="dtd-system-plus" select="substring-after($dtd-declaration-string, 'SYSTEM &quot;')"/>
      <xsl:value-of select="substring-before($dtd-system-plus, '&quot;')"/>
    </xsl:when>
    <xsl:when test="contains($dtd-declaration-string, 'PUBLIC')">
      <!-- The system identifier is the second quoted string after PUBLIC -->
      <xsl:variable name="dtd-system-plus" select="substring-after($dtd-declaration-string, 'PUBLIC &quot;')"/>
      <xsl:variable name="dtd-system-plus2" select="substring-after($dtd-system-plus, '&quot;')"/>
      <xsl:variable name="dtd-system-plus3" select="substring-after($dtd-system-plus2, '&quot;')"/>
      <xsl:value-of select="substring-before($dtd-system-plus3, '&quot;')"/>
    </xsl:when>
    <xsl:otherwise>false</xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- Omit XML declaration -->
<xsl:variable name="input-omit-xml-declaration">
  <xsl:choose>
    <xsl:when test="$input-xml-declaration eq 'true'">no</xsl:when>
    <xsl:otherwise>yes</xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- ******************************* -->
<!-- Configuration -->
<!-- ******************************* -->
<!-- Output path and filename -->
<xsl:variable name="output-filename" select="concat($input-path, '/output/', $input-filename, '.', $input-extension)"/>
<!-- omit-xml-declaration -->
<xsl:variable name="output-omit-xml-declaration">
  <xsl:choose>
    <xsl:when test="$input-omit-xml-declaration eq 'no'">no</xsl:when>
    <xsl:when test="$input-omit-xml-declaration eq 'yes'">yes</xsl:when>
  </xsl:choose>
</xsl:variable>
<!-- xml-version -->
<xsl:variable name="output-xml-version">
  <xsl:choose>
    <xsl:when test="$input-xml-version eq '1.0'">1.0</xsl:when>
    <xsl:when test="$input-xml-version eq '1.1'">1.1</xsl:when>
    <xsl:otherwise>1.0</xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- xml-encoding -->
<xsl:variable name="output-xml-encoding">
  <xsl:choose>
    <xsl:when test="upper-case($input-xml-encoding) eq 'US-ASCII'">US-ASCII</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'UTF-8'">UTF-8</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'UTF-16'">UTF-16</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-10646-UCS-2'">ISO-10646-UCS-2</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-10646-UCS-4'">ISO-10646-UCS-4</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-1'">ISO-8859-1</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-2'">ISO-8859-2</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-3'">ISO-8859-3</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-4'">ISO-8859-4</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-5'">ISO-8859-5</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-6'">ISO-8859-6</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-7'">ISO-8859-7</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-8'">ISO-8859-8</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-9'">ISO-8859-9</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-10'">ISO-8859-10</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-11'">ISO-8859-11</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-13'">ISO-8859-13</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-14'">ISO-8859-14</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-8859-15'">ISO-8859-15</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-2022-JP'">ISO-2022-JP</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'SHIFT_JIS'">SHIFT_JIS</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'EUC-JP'">EUC-JP</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'BIG5'">BIG5</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'GB2312'">GB2312</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'KOI8-R'">KOI8-R</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-2022-KR'">ISO-2022-KR</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'EUC-KR'">EUC-KR</xsl:when>
    <xsl:when test="upper-case($input-xml-encoding) eq 'ISO-2022-CN'">ISO-2022-CN</xsl:when>
    <!-- here you can add more encodings -->
    <xsl:otherwise>UTF-8</xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- xml-standalone -->
<xsl:variable name="output-xml-standalone">
  <xsl:choose>
    <xsl:when test="$input-xml-standalone eq 'yes'">yes</xsl:when>
    <xsl:when test="$input-xml-standalone eq 'no'">no</xsl:when>
    <xsl:when test="$input-xml-standalone eq 'omit'">omit</xsl:when>
    <xsl:otherwise>omit</xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- dtd-public -->
<xsl:variable name="output-dtd-public">
  <xsl:choose>
    <xsl:when test="$input-dtd-public eq 'false'">false</xsl:when>
    <xsl:when test="$input-dtd-public ne 'false'">
      <xsl:value-of select="$input-dtd-public"/>
    </xsl:when>
  </xsl:choose>
</xsl:variable>
<!-- dtd-system -->
<xsl:variable name="output-dtd-system">
  <xsl:choose>
    <xsl:when test="$input-dtd-system eq 'false'">false</xsl:when>
    <xsl:when test="$input-dtd-system ne 'false'">
      <xsl:value-of select="$input-dtd-system"/>
    </xsl:when>
  </xsl:choose>
</xsl:variable>
<!-- ******************************* -->
<!-- Result document -->
<!-- ******************************* --> <!-- see footnote [4] -->
<xsl:choose>
  <xsl:when test="$output-dtd-public ne 'false' and $output-dtd-system ne 'false'">
    <xsl:result-document href="{$output-filename}" method="xml" omit-xml-declaration="{$output-omit-xml-declaration}" output-version="{$output-xml-version}" encoding="{$output-xml-encoding}" doctype-public="{$output-dtd-public}" doctype-system="{$output-dtd-system}" standalone="{$output-xml-standalone}" indent="yes"> <!-- see footnote [5] -->
      <xsl:apply-templates/>
    </xsl:result-document>
  </xsl:when>
  <xsl:when test="$output-dtd-system ne 'false'">
    <xsl:result-document href="{$output-filename}" method="xml" omit-xml-declaration="{$output-omit-xml-declaration}" output-version="{$output-xml-version}" encoding="{$output-xml-encoding}" doctype-system="{$output-dtd-system}" standalone="{$output-xml-standalone}" indent="yes">
      <xsl:apply-templates/>
    </xsl:result-document>
  </xsl:when>
  <xsl:otherwise>
    <xsl:result-document href="{$output-filename}" method="xml" omit-xml-declaration="{$output-omit-xml-declaration}" output-version="{$output-xml-version}" encoding="{$output-xml-encoding}" standalone="{$output-xml-standalone}" indent="yes">
      <xsl:apply-templates/>
    </xsl:result-document>
  </xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
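
As a sketch of how the configuration section might be changed: if we always want UTF-8 in the output and never want a standalone pseudo-attribute, the two xsl:choose blocks for output-xml-encoding and output-xml-standalone could be replaced with fixed values. This fragment is meant to replace the two variables of the same names inside the template matching "/" and is not a complete stylesheet:

```xml
<!-- xml-encoding: always write UTF-8, whatever the input used -->
<xsl:variable name="output-xml-encoding" select="'UTF-8'"/>
<!-- xml-standalone: never write a standalone pseudo-attribute -->
<xsl:variable name="output-xml-standalone" select="'omit'"/>
```

The rest of the stylesheet is unchanged, since the xsl:result-document elements only refer to the output variables.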

5. "Identity" template and "prolog.xslt"

Since it is no fun to transform input to output without adding, deleting or modifying anything, let us add something new. At the beginning of this article I mentioned that creating a table of contents could be a nice thing to do, but that would be overkill making this article too long. Let us instead just add a comment about the use of "prolog.xslt" and a datetime stamp at the beginning of the head section of the input XHTML documents just to see how we get the templates working.

In our example identity.xslt has a template matching the "head" element. This is only relevant if we use XHTML documents as input. The "prolog.xslt" will work with any input XML document in need of recreating or modifying XML declaration and DTD based on their use in the input document. Please remember that an internal DTD subset will not be recreated since this is only possible with extensions provided by very few products like SAXON.

<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xhtml">
<!-- The XHTML namespace is declared twice above, see footnote [6] -->
<xsl:include href="prolog.xslt"/>
<xsl:template match="@*|node()"> <!-- see footnote [7] -->
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>
<xsl:template match="xhtml:head">
  <xsl:variable name="dateTimeStamp" select="format-dateTime(xs:dateTime(current-dateTime()), '[Y]-[M01]-[D01]T[H01]:[m01]:[s01]')"/>
  <xsl:comment>
    <xsl:value-of select="concat('The transformation, ', $dateTimeStamp, ', is using prolog.xslt')"/>
  </xsl:comment>
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>
</xsl:stylesheet>

Footnotes

[1]

At the time of the publication of this article, 2006-02-14, it was difficult to find an XSLT processor with enough XSLT 2.0 support to make use of my "prolog.xslt". Saxon 8.0 was probably the only one. Today, 2006-09-28, it still works fine in Saxon 8B but it does not work in XMLSpy version 2006, release 3, sp 2.

[2]

On the web we have millions of documents using the XHTML DOCTYPE. They are only XML if they are well-formed and only XHTML if they are also valid.

[3]

In the "tokenize(document-uri(.), '(\\|/)')[last()]" expression we split the document-uri using "\" or "/" as separator and pick the last piece containing filename and extension of the input file. Some XSLT processors like XMLSpy wrongly use "\" as separator, so to provide for this "bug" the "\" is included in the other half of the choice "(\\|/)". The first "\" is to escape the next.
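
The expression can be tried on its own with a small sketch. The URI below is a made-up stand-in for the value of document-uri(.), and the parameter name "uri" is invented for the example:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Made-up URI standing in for document-uri(.) -->
  <xsl:param name="uri" select="'file:/C:/docs/page.html'"/>
  <xsl:template name="main" match="/">
    <!-- Split on "\" or "/" and keep the last piece -->
    <xsl:variable name="file" select="tokenize($uri, '(\\|/)')[last()]"/>
    <xsl:value-of select="$file"/>
    <xsl:text> | </xsl:text>
    <xsl:value-of select="substring-before($file, '.')"/>
    <xsl:text> | </xsl:text>
    <xsl:value-of select="substring-after($file, '.')"/>
  </xsl:template>
</xsl:stylesheet>
```

For this URI the three values should be "page.html", "page" and "html". A file name with more than one dot would be split at the first dot, which is a limitation of this simple approach.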

[4]

We have three options because doctype-system and doctype-public don't have a "turn me off"-value. A DTD will either be of type PUBLIC or of type SYSTEM. "Otherwise" is used if no DTD exists. If the DTD is of type PUBLIC both doctype-public (DTD name) and doctype-system (path and filename) will be used. If the DTD is of type SYSTEM only doctype-system will be used.

[5]

In XSLT 1.0 "standalone" had the same problem as doctype-public and doctype-system still have: it could not be turned off, only yes/no existed. In XSLT 2.0 "omit" has been added as value.

[6]

In order to transform an input document in a default namespace, we must declare the default namespace of the input XML document twice in the XSLT stylesheet: both with and without a prefix. We use the prefix to get to the nodes of the input XML document and we need the declaration without a prefix to get rid of empty namespace declarations, xmlns="", in the output document. For more information read my article Transform XHTML to XHTML with XSLT.
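
A minimal sketch of the double declaration at work, assuming XHTML input. The added paragraph and its text are invented for the example:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xhtml">
  <!-- Identity template: copy everything unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- The prefix is needed to match the body element of the input -->
  <xsl:template match="xhtml:body">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- The unprefixed declaration puts this new p in the XHTML namespace -->
      <p>Added by the stylesheet</p>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Without the unprefixed declaration the new p element would be in no namespace, and the serializer would have to write it with xmlns="" to say so.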

[7]

This is the famous "identity" template. For more information read my article: Identity Template: xsl:copy with recursion.

Updated 2009-08-06