Jesper Tverskov, March 14, 2007, 2nd edition

Valid XHTML with schema-aware XSLT 2.0

With a schema-aware XSLT 2.0 processor we can test whether XML output is valid as we create it. It is reassuring that no XML with a schema, e.g. XHTML, can be generated in our system unless it is valid. Unfortunately, schema-awareness is not mature yet.

Before we go into the details of how to implement schema-aware validation of XHTML as part of an XSLT transformation, let us remind ourselves that an XHTML document can use not only a DTD but also XML schemas to declare its structure and data types. DTD and schemas can even be used simultaneously.

1. XHTML using DTD and schema

XHTML 1.0, a Recommendation (standard) since 2000-01-26, was created at a time when the XML Schema language was still in the making. Since XHTML 1.0 is just a reformulation of HTML (being SGML) as XML, and since DTD is defined both in SGML and XML, it was natural that XHTML was born with structure and data types declared in a DTD.

XML Schema became a Recommendation (standard) on 2001-05-02. Since structure and data types can be declared more precisely in a schema, it was only natural that W3C also provided schemas for XHTML 1.0 (strict, transitional, frameset) and XHTML 1.1. [1]

W3C has published a note, XHTML™ 1.0 in XML Schema, showing how a schema can be used instead of a DTD or how both a DTD and a schema can be used at the same time. [2] The note links to schemas for XHTML 1.0, and I also give you the schema for XHTML 1.0 strict (xsd) and a link to a schema for XHTML 1.1 (xsd).

2. Schema-aware XSLT 2.0 processors

At the time of this writing we have two major XSLT 2.0 processors, Saxon made by Michael Kay and AltovaXML made by the makers of XMLSpy. Microsoft has announced that it will make its own XSLT 2.0 processor for .NET. [3]

Saxon comes in a basic royalty-free version, 8B (B for basic), at the moment, and in a schema-aware version, 8SA (SA = schema-aware), at around 250 GBP or 486 USD or 369 EUR for a license. Like AltovaXML, Saxon is an all-in-one processor: XML, XML Schema, XSLT and XQuery.

The AltovaXML processor only comes in a royalty-free schema-aware version (nice), and is built into XMLSpy but can also, like Saxon, be used in other XML editors and at the command line, and can be integrated into Java and .NET.

3. A comparison of AltovaXML and Saxon

This is not the time or place to look at all the features of schema-awareness, like type-annotation, matching by types, etc. We will focus on result-document validation of XHTML. We want to validate the XHTML document we generate inside XSLT. If it is not valid, we want to be told right away or for the transformation to stop.

For both AltovaXML and Saxon we need to import the XHTML schema into XSLT by using the import-schema element. In Saxon we have several parameters that can be given at the command line or that are already set, depending on what other software we use. Only Saxon has input document validation. It is triggered by the -val parameter. If it is used, the input document must also be XHTML, or the schema of the input document must also be imported, or the -vlax parameter must be used.

3.1 Runtime validation

In AltovaXML we can only do runtime validation. In Saxon we can also do compile time validation. For runtime validation of output you need to start the transformation just to get it stopped if validation errors are reported.

Runtime validation works well in AltovaXML. All we need to do is to use the validation="strict" attribute in the result-document element. We get a good error message for the first error found and the proper place is highlighted in the stylesheet.
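
As a minimal sketch of this setup (not taken from the stylesheets tested for this article), the following stylesheet imports the XHTML schema and asks for strict validation of the result-document. The output file name out.html and the page content are placeholders chosen for illustration only.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://www.w3.org/1999/xhtml">

  <!-- Import the XHTML schema so the processor knows the element declarations.
       AltovaXML 2007-sp2 is reported to need a local copy instead of an http address. -->
  <xsl:import-schema namespace="http://www.w3.org/1999/xhtml"
    schema-location="http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd"/>

  <xsl:template match="/">
    <!-- validation="strict": the result tree must be valid against the imported schema.
         out.html is a placeholder file name. -->
    <xsl:result-document href="out.html" method="xml" validation="strict">
      <html>
        <head>
          <title>Validation test</title>
        </head>
        <body>
          <p>Placeholder content.</p>
        </body>
      </html>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>
```

Removing the title element from the sketch should be enough to make a schema-aware processor report a validation error when the transformation runs.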

In Saxon runtime validation works the same except that we have two modes for reporting errors. The first mode works as in AltovaXML. The transformation stops when the first error is found. The error message is good but nothing is highlighted in the stylesheet when Saxon is used in the Oxygen XML editor.

In Saxon we must use a parameter at the command line to treat errors more like warnings. Now the error message is useless in Oxygen, "one or more errors found", and nothing is highlighted in the stylesheet. This bug in Oxygen 8.1 has now been fixed for the next release. I will rewrite this section when I have checked that it actually works in Oxygen 8.2.

3.2 Compile-time validation

Saxon also has compile-time validation, that is, the errors are reported right away, and you don't need to start the transformation process. To trigger it you must use the validation attribute in all the top-elements generated by templates, or the xsl:validation attribute if the top-elements are generated the literal way.

<xsl:element name="html" validation="strict">etc.
or
<html xsl:validation="strict">etc.

For compile-time validation Saxon's two modes of error messages are good in Oxygen and the proper place of violation is highlighted in the stylesheet. Compile-time validation is, as I see it, more useful than run-time validation. Nice that we are told about the errors right away and that the validation does not depend on the input. [4]

4. Bugs and problems

4.1 HTTP bug in AltovaXML

Schema-awareness in AltovaXML 2007-sp2 has a major bug. A schema cannot be imported using an http address. I had to download the XHTML schemas and place them locally on my own computer to test schema-awareness. [5]
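
A sketch of the workaround, assuming the schema has been downloaded and saved as xhtml1-strict.xsd in the same folder as the stylesheet (the file name and location are assumptions for illustration):

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://www.w3.org/1999/xhtml">

  <!-- schema-location is a relative URL, resolved against the stylesheet.
       xhtml1-strict.xsd is assumed to be a downloaded local copy of the W3C schema. -->
  <xsl:import-schema namespace="http://www.w3.org/1999/xhtml"
    schema-location="xhtml1-strict.xsd"/>

  <!-- Templates generating the XHTML output go here. -->
</xsl:stylesheet>
```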

4.2 The schema/DTD problem

Both AltovaXML and Saxon have an interesting problem if we forget to use the exclude-result-prefixes attribute in the stylesheet element. If namespace declarations other than the one for XHTML are copied to the result-document, it is no longer valid XHTML 1.0 if we have specified a DTD for the result-document. This is not nice when both processors have just reported "no validation errors".
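
The following sketch shows where the extra namespace declaration comes from. The xs prefix is declared on the stylesheet element because an XML Schema constructor function is used; the date value is only an illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">

  <xsl:template match="/">
    <!-- The xs namespace is in scope for this literal result element.
         Without exclude-result-prefixes="xs" above, the declaration
         xmlns:xs="http://www.w3.org/2001/XMLSchema" would be copied to
         the html element in the output, which the XHTML DTD does not allow. -->
    <html>
      <head>
        <title>Namespace test</title>
      </head>
      <body>
        <p>
          <xsl:value-of select="xs:date('2007-03-14')"/>
        </p>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```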

It is only logical that a document accepted by schema validation does not necessarily pass if validated against a DTD. Even though the schema and the DTD are basically identical, there are minor differences, making it possible for the schema to accept things that will later be rejected by the DTD.

Actually the problem is not only the use of a DTD for a result-document validated against a schema. The result-document could also point to another schema than the one imported. Many users would expect more clever software that could validate against all the relevant schemas and DTDs. DTD is not in scope for schema-awareness in XSLT, but nothing prevents Saxon and AltovaXML, being "all-in-one" processors, from taking the DTD into account if they decided to do so, just as Saxon also supports DTD for input document validation. [6]

4.3 Junk in the result-document

Saxon copies all sorts of default attributes and fixed attributes out of the schema and into the result-document. If you use anchor elements and tables in the XHTML output, you have shape="rect" and colspan="1" and rowspan="1" all over the place for nothing. The result-document is still using the same schema or DTD and does not need those values to be explicitly present in the XHTML document. [7]

It has been interesting to follow the discussion about this issue on the XSL-mailing list as a result of the first edition of this article. We have the usual first reaction: "the spec says so". Such a reply is most often just a bad excuse for not doing things right.

Then follow all sorts of proposals, mostly hacks and bad practices, as a substitute for doing things right in Saxon: let us modify the schema, that is, remove all those defaulted and fixed attributes. Someone even suggested using two identical result-documents, one for validation and one for real. If the one for validation succeeded, then the one for real would be generated nice and clean.

Fortunately someone raised the question: "should the serialised version of the validated result be the pre or post validation instance?" Finally the maker of Saxon, Michael Kay, usually very responsive to user feedback and user needs, saw the light and declared that one can probably do something pragmatic.

5. An XHTML result-document example

I have made a small "product catalog" input document, input.xml, with no schema. For that reason we must use "lax" validation of the source document in Saxon or turn the validation of source documents off in Oxygen. The XSLT stylesheet transforms the source document into an XHTML table.

I have annotated the XSLT stylesheet, xml2xhtml.xsl, with footnotes. Please have a look at the XHTML result-document as Saxon will generate it, myxhtml-saxon.html. If you look at it in a browser remember to look at the source code. Note the space="preserve" in the style element making the output invalid. [8] Also note the shape="rect" junk in the anchor element, and note all the colspan="1" and rowspan="1" junk in the table. In contrast have a look at the nice output AltovaXML is producing, myxhtml-altova.html.

<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs"> [9]
<xsl:import-schema namespace="http://www.w3.org/1999/xhtml" schema-location="http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd"/> [10]
<xsl:strip-space elements="*"/> [11]
<xsl:template match="/">
<xsl:result-document method="xml" href="myxhtml.html"
  doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
  doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" indent="yes"> [12]
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" xsl:validation="strict">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <style type="text/css"> [13]
    body{font-family:verdana, arial, sans-serif}
    tr.heading{background-color:silver; color:black}
    tr.alt{background-color:lightgreen; color:black}
    td.count{text-align:center}
    td.number{text-align:right}
    table, th, td{border: 1px solid silver; border-collapse:collapse}
  </style>
  <title>Catalog of Products</title>
</head>
<body>
  <h1>Catalog of Products</h1>
  <p>
    <a href="http://www.xmlplease.com/xsltcases">www.xmlplease.com/xsltcases</a>
  </p>
  <table cellpadding="5" cellspacing="0">
  <tr class="heading">
    <th>No</th>
    <th>Name</th>
    <th>Price</th>
    <th>Stock</th>
    <th>Country</th>
  </tr>
  <xsl:apply-templates/>
  </table>
</body>
</html>
</xsl:result-document>
</xsl:template>

<xsl:template match="product">
  <tr xsl:validation="strict"> [14]
  <xsl:if test="position() mod 2 = 0"> [15]
  <xsl:attribute name="class">alt</xsl:attribute>
  </xsl:if>
    <td class="count"><xsl:value-of select="position()"/></td>
    <td><xsl:value-of select="name"/></td>
    <td class="number"><xsl:value-of select="price"/></td>
    <td class="number"><xsl:value-of select="stock"/></td>
    <td><xsl:value-of select="country"/></td>
  </tr>
</xsl:template>
</xsl:stylesheet>

6. The winner is

Let us give AltovaXML and Saxon a chance to fix their respective bugs, just like I have hopefully fixed my "bugs" from first to second edition of this article.

If Saxon is going to give us nice and clean output like AltovaXML, not copying default attributes and fixed attribute values found in the schema back to our document, Saxon comes out marginally ahead for the benefits of compile time validation over runtime and for the two error reporting modes.

If Saxon continues to pollute my result-document with all sorts of junk, I don't find it worthwhile using Saxon for result-document validation of my XHTML.

7. Give schema-awareness a chance

I believe validation of result-documents to be important, not least for XHTML output. It would be very nice if no XHTML could be transformed or generated unless valid. The earlier and the closer to the source of origin we find and correct problems, the better. It is nice to know that your XHTML is not just well-formed but also always valid.

I am sure that the problems with schema-awareness listed in this article, and similar feedback from other end-users, will soon result in bug fixes and in better implementations of schema-awareness. I will update this article accordingly.

Footnotes

[1]

In some future edition of XHTML it is probably even more important that schemas make it possible for markup from different XML applications to co-exist in the same document.

[2]

On the web XHTML 1.0 and XHTML 1.1 should use a DTD. Many browsers use the DTD declaration as a switch to a more "standards compliant" mode. Behind the scenes, server-side, we can use XHTML with or without DTD and/or XML schemas as we please.

[3]

2007-01-29, a couple of days after XSLT 2.0 finally became a Recommendation (standard), Microsoft XML Team's WebLog announced: "Our users have made it very clear that they want an XSLT 2.0 implementation once the Recommendation is complete. A team of XSLT experts is now in place to do this…".

[4]

We have many ways of doing validation in the stylesheet. Not only xsl:result-document and xsl:element can use the validation attribute but also xsl:attribute, xsl:copy, xsl:copy-of and xsl:document, etc.
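
As an illustrative sketch only, here is the validation attribute used with xsl:copy-of. It assumes a hypothetical input document containing XHTML table elements, and that table is declared as a top-level element in the imported schema.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xhtml">

  <xsl:import-schema namespace="http://www.w3.org/1999/xhtml"
    schema-location="http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd"/>

  <xsl:template match="/">
    <!-- Each copied table is validated against the imported schema as it is copied.
         The input structure (XHTML tables somewhere in the source) is an assumption. -->
    <xsl:copy-of select="//xhtml:table" validation="strict"/>
  </xsl:template>
</xsl:stylesheet>
```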

[5]

I have reported the bug to Altova, and they have accepted it as such.

[6]

I am not saying that the "schema/DTD" problem can necessarily be solved in a way that is worthwhile. It could create new problems like a more complex interface making extra parameters necessary. The important thing for now is to be aware of the problem.

[7]

We have exactly the same problem when transforming XHTML using the identity template, see my article: The shape="rect" attribute in xhtml to xhtml. But for the input document we can always make additional templates deleting all the unnecessary junk that should never have been copied out of the schema or DTD. For schema-aware validation of result-documents we have no easy method to get rid of the junk. Don't put it there, please.
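
A minimal sketch of such additional templates for the input document case, assuming we want to drop exactly the three defaulted values mentioned in this article, wherever they occur:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: attributes that only repeat a default value are not copied.
       It wins over the identity template because of its higher default priority. -->
  <xsl:template match="@shape[. = 'rect'] | @colspan[. = '1'] | @rowspan[. = '1']"/>
</xsl:stylesheet>
```

Note that the sketch matches on attribute name and value only, so an author who wrote colspan="1" on purpose would lose it as well.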

[8]

This is a bug not related to the junk problem. The XHTML style element, the script element and the pre element have the attribute xml:space="preserve" declared in the schema. This attribute is copied out without the "xml" prefix, like space="preserve", making the result-document invalid. Michael Kay has acknowledged this as a bug.

[9]

If we forget to use the exclude-result-prefixes attribute, namespaces not allowed in XHTML could find their way to the result-document.

[10]

The import-schema element is new in XSLT 2.0 and only relevant for schema-aware processors. AltovaXML 2007-sp2 has a bug not accepting an http-address. You must use a file path or a relative URL.

[11]

The strip-space element is only relevant for XSLT processors respecting the spec, making it possible to use the strip-space and the preserve-space elements. See my article: Tricky whitespace handling in XSLT. If whitespace-only text nodes are not stripped, the position() function will get the counting wrong and the alternating colors in the table rows will not be generated.

[12]

Note that the XHTML result-document uses a DTD but it is validated inside the XSLT stylesheet against an XML Schema schema. Also note that "xml" is used as output method, not "xhtml". The latter option didn't exist in XSLT 1.0, where it could have been useful in the early days of that spec. Today the "xhtml" method is only relevant for output that should work in some obscure software in some software museum.

[13]

I only use the style element for tests. In webpages I always use external CSS stylesheets. In this case the test was useful, revealing an ugly bug in Saxon.

[14]

To get the wonderful compile-time validation working in Saxon, we need to use the validation attributes in all the XHTML top-elements generated by templates.

[15]

The test ensures that even rows get a different background color. The test will go wrong if the strip-space element is not used at the top of the XSLT stylesheet. Some XSLT processors have the bad habit of stripping whitespace-only text nodes themselves, making the strip-space and the preserve-space elements irrelevant.

Appendix

Revision history

2007-03-14

Second edition published making use of input and insight from the XSL-mailing list about the first edition.

2007-03-08

First edition published.

Updated 2009-08-06