Jesper Tverskov, January 1, 2006

Transform XHTML documents into one big document

It is easy to transform many XML documents into one XML document, but a few tricks and a little experience are needed, especially when using XHTML as input and output. In the following we look at different ways of loading all the XML files into the transformation process in XSLT 1.0 and XSLT 2.0.

We will transform the individual "items" of a book marked up with XHTML, like introduction.xml, chapter-1.xml, chapter-2.xml, chapter-3.xml, appendix-a.xml, etc., into one big XML document, book.xml. [1]

We use XHTML both as input and as output, but we could have used any XML markup as an example. XHTML gives us an additional opportunity to show how to solve the problems of the input being in a default namespace. [2]

The article is sketchy in one sense: you need to know XML and the basics of XSLT in advance. Footnotes are used to help you out with some of the details. All the XML files are available online. You can download them for testing. [3]

1. What we can do    

  1. We can load and include as many XML documents in the XSLT transformation as we like.
  2. If we use XSLT only, we must tell the stylesheet the paths or HTTP addresses of the input files. We cannot use wildcards like "*.xml". If we use an ordinary programming language together with XSLT, like C#, Visual Basic, Java, PHP, etc., we can also use wildcards.

  3. In XSLT we need to include a list of the filenames in some XML markup, either in the stylesheet itself or in some XML file loaded as part of the transformation. Such an extra XML file could of course have been generated automatically with an ordinary programming language a split second before the actual XSLT transformation takes place.

  4. In XSLT/XPath 1.0 the XML files we want to load as part of the transformation must exist or we get an error message when we try to load them. In XSLT/XPath 2.0 we can make use of the fn:doc-available() function to test if files exist before we load them.

  5. Most XML editors will ask for an input document when doing a transformation. But we don't need to have an XML document "for real" as input file. When the XSLT stylesheet is loaded, all the XML documents needed in the transformation can be loaded using the document() function. In XPath 2.0 we can also use the doc() function. When the XML editor asks for an input file, any XML file will do if the XSLT stylesheet is not making use of it.

  6. The XSLT stylesheet itself can be loaded as "external" file using document(""), that is: by not specifying a path/filename or an http address. At global level, outside the templates, we can have markup to be used as "internal" lookup-tables, e.g.: supplying the list of filenames.

2. Input: chapter-1.xml, etc

We have five XML test files to represent the different chapters or items of the book: introduction.xml, chapter-1.xml, chapter-2.xml, chapter-3.xml and appendix-a.xml. Below you see chapter-1.xml, the first test chapter of the book. The other files are similar, see footnote 3.

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>First Chapter</title>
</head>
<body>
  <h1>1. First Chapter</h1>
<p>In sagittis tempus orci. Vivamus euismod. Ut at libero non justo ullamcorper vehicula. Nullam et ante. Nunc iaculis eros. Duis libero arcu, condimentum at, volutpat id, placerat lacinia, magna. Vivamus dignissim, leo sit amet egestas lobortis, metus lorem adipiscing nulla, vitae egestas quam sem a odio. Etiam adipiscing ornare urna. Phasellus in mauris. Morbi magna massa, rutrum ornare, porta quis, lobortis vel, eros.</p>
  <h2>1.1 Section Heading</h2>
  <p>Duis vehicula lorem a arcu. Duis ultrices, nunc in lacinia dictum, enim massa iaculis magna, vitae elementum velit sapien vel arcu. Praesent bibendum posuere tortor. Aenean euismod, magna eget placerat interdum, sapien leo porta orci, in hendrerit est risus non felis. Proin sodales, tortor at pellentesque malesuada, enim libero placerat purus, nec semper enim lorem egestas mi.</p>
</body>
</html>

3. Input: filenames.xml

If we just want to transform a few files, we can load them in the XSLT stylesheet with the document() function when needed. If we have many similar files, like the items of a book, it is nice to list them with some markup in order to include them using "xsl:for-each" or "xsl:apply-templates" using recursion.

The XML file, filenames.xml, shown below, is an example of how to do it. In the XSLT stylesheets xhtmls2xhtml-1b.xslt and xhtmls2xhtml-2.xslt, as we are going to see later, an alternative method is used listing the filenames inside the XSLT stylesheet itself.

<?xml version="1.0"?>
<book>
  <item>introduction.xml</item>
  <item>chapter-1.xml</item>
  <item>chapter-2.xml</item>
  <item>chapter-3.xml</item>
  <item>appendix-a.xml</item>
</book>
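
The stylesheets in this article walk the list with "xsl:for-each". As a minimal sketch of the "xsl:apply-templates" alternative, the stylesheet below lets a template matching item load each file. It is not one of the article's files: the wrapper element "merged" is hypothetical, and the sketch assumes that filenames.xml sits next to the stylesheet and that the chapter files sit next to filenames.xml.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xhtml">
  <!-- Hand every item of filenames.xml to the template below,
       instead of looping with xsl:for-each. -->
  <xsl:template match="/">
    <!-- "merged" is a hypothetical wrapper element for this sketch -->
    <merged>
      <xsl:apply-templates select="document('filenames.xml')/book/item"/>
    </merged>
  </xsl:template>
  <!-- One item holds one filename: load that file and copy
       the children of its XHTML body unchanged. -->
  <xsl:template match="item">
    <xsl:copy-of select="document(.)/xhtml:html/xhtml:body/node()"/>
  </xsl:template>
</xsl:stylesheet>
```

Because document(.) is given a node, the filename is resolved relative to the file the node comes from, here filenames.xml.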

4. XSLT 1.0: xhtmls2xhtml-1a.xslt

Below you see the XSLT stylesheet transforming introduction.xml, chapter-1.xml, chapter-2.xml, chapter-3.xml and appendix-a.xml into book.xml. The XSLT stylesheet gets the filenames of the input files from filenames.xml above.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns="http://www.w3.org/1999/xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml" [4]
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
exclude-result-prefixes="xhtml">
<xsl:output method="xml" version="1.0" indent="yes" doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
<xsl:variable name="filenames" select="document('filenames.xml')"/> [5]
<xsl:template match="/">
<html xml:lang="en">
  <head>
    <link rel="stylesheet" href="xhtmls2xhtml.css" type="text/css"/>
    <title>A Book of Merged Items</title>
  </head>
<body>
<h1>A Book of Merged Items</h1>
<xsl:for-each select="$filenames//item"> [6]
  <xsl:variable name="a" select="."/>
  <xsl:comment>
  <xsl:value-of select="."/>
  </xsl:comment>
  <xsl:apply-templates select="document($a)/xhtml:html/xhtml:body"/>
</xsl:for-each>
</body>
</html>
</xsl:template>
<!--the famous identity template-->
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>
<!--template body-->
<xsl:template match="xhtml:body">
  <xsl:apply-templates select="@*|node()"/>
</xsl:template>
<!--template h1--> [7]
<xsl:template match="xhtml:h1">
<h2>
  <xsl:apply-templates select="@*|node()"/>
</h2>
</xsl:template>
<!--template h2-->
<xsl:template match="xhtml:h2">
<h3>
  <xsl:apply-templates select="@*|node()"/>
</h3>
</xsl:template>
</xsl:stylesheet>

5. Output: book.xml

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> [8]
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <link rel="stylesheet" href="xhtmls2xhtml.css" type="text/css"/>
  <title>A Book of Merged Items</title>
</head>
<body>
<h1>A Book of Merged Items</h1>
<!--introduction.xml-->
<h2>Introduction</h2>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Cras tempor, tellus vel cursus porttitor, dolor ipsum bibendum dolor, nec facilisis tortor orci ut leo. Nulla eu nulla et nulla adipiscing facilisis. Vestibulum nonummy consequat purus. Duis at metus quis nisl tristique pretium.</p>
<ol>
  <li>Maecenas fermentum</li>
  <li>Morbi lobortis</li>
  <li>Sed elementum</li>
  <li>Pellentesque dui purus</li>
  <li>Vestibulum ipsum nunc</li>
</ol>
<p>Maecenas fermentum, pede et suscipit faucibus, neque neque adipiscing orci, a gravida sapien mauris at enim. Morbi lobortis, est vel consequat accumsan, mi mauris lacinia augue, ut viverra nisi lacus ac tortor. Sed elementum, nunc id auctor facilisis, libero pede sodales quam, eget interdum odio nibh a lacus. Vestibulum aliquam purus quis tortor.</p>
<p>
  <img src="hansen1.jpg" alt="Hansen my Alpha dog"/>
</p>
<!--chapter-1.xml-->
<h2>1. First Chapter</h2>
<p>In sagittis tempus orci. Vivamus euismod. Ut at libero non justo ullamcorper vehicula. Nullam et ante. Nunc iaculis eros. Duis libero arcu, condimentum at, volutpat id, placerat lacinia, magna. Vivamus dignissim, leo sit amet egestas lobortis, metus lorem adipiscing nulla, vitae egestas quam sem a odio. Etiam adipiscing ornare urna. Phasellus in mauris. Morbi magna massa, rutrum ornare, porta quis, lobortis vel, eros.</p>
<h3>1.1 Section Heading</h3>
<p>Duis vehicula lorem a arcu. Duis ultrices, nunc in lacinia dictum, enim massa iaculis magna, vitae elementum velit sapien vel arcu. Praesent bibendum posuere tortor. Aenean euismod, magna eget placerat interdum, sapien leo porta orci, in hendrerit est risus non felis. Proin sodales, tortor at pellentesque malesuada, enim libero placerat purus, nec semper enim lorem egestas mi.</p>
<!--chapter-2.xml-->
<h2>2. Second Chapter</h2>
<p>Sed fringilla tortor vitae eros. Nulla fermentum fringilla dolor. Nullam interdum, lectus eu vestibulum ultricies, risus dolor condimentum libero, commodo laoreet nunc nisi eget est. Vivamus pretium lorem sit amet sem.</p>
<ul>
  <li>Suspendisse tellus massa</li>
  <li>Sed dictum tempor purus</li>
  <li>Fusce commodo</li>
</ul>
<p>Suspendisse tellus massa, porttitor ac, porta nec, mattis eget, est. Sed dictum tempor purus. Proin erat pede, lobortis sagittis, accumsan et, pellentesque et, ante. Fusce commodo. Nullam nec tellus vitae erat aliquet volutpat. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.</p>
<!--chapter-3.xml-->
<h2>3. Third Chapter</h2>
<p>Suspendisse leo mi, egestas vel, lobortis sit amet, faucibus at, justo. Nam convallis nulla sit amet metus. Aenean turpis. Sed tempor lobortis ante. Mauris aliquam commodo leo. Nam ut nisi. Praesent adipiscing. Maecenas tincidunt auctor augue. Curabitur sit amet urna fermentum ligula eleifend ultrices.</p>
<!--appendix-a.xml-->
<h2>Appendix A</h2>
<p>Duis ut mauris. Nunc sed odio. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Sed faucibus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Ut suscipit semper metus. Sed ante massa, pharetra a, rhoncus ut, fringilla eu, mauris. Nunc faucibus, nisi a rhoncus ullamcorper, felis nisi cursus mauris, et gravida odio pede eu neque. Quisque ut arcu. Integer lectus est, lobortis ac, rhoncus sit amet, bibendum sed, tortor.</p>
</body>
</html>

6. XSLT 1.0: xhtmls2xhtml-1b.xslt

Here we have included the external xml-file with the filenames directly in the stylesheet as a sort of internal lookup table. Note the markup in the "oo" namespace at the end. This can only be done at global level outside the templates.

Also note that when the XSLT stylesheet is loaded it loads itself as an external xml document using document(""), that is without a path or an http address meaning "self".

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns="http://www.w3.org/1999/xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:oo="lookup" [9] xmlns:xsl="http://www.w3.org/1999/XSL/Transform" exclude-result-prefixes="xhtml oo">
<xsl:output method="xml" version="1.0" indent="yes" doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
<xsl:variable name="lookup" select="document('')"/> [10]
<xsl:template match="/">
<html xml:lang="en">
<head> [11]
  <link rel="stylesheet" href="xhtmls2xhtml.css" type="text/css"/>
  <title>A Book of Merged Items</title>
</head>
<body>
<h1>A Book of Merged Items</h1>
<xsl:for-each select="$lookup//oo:item">
  <xsl:variable name="a" select="."/>
  <xsl:comment>
    <xsl:value-of select="."/>
  </xsl:comment>
  <xsl:apply-templates select="document($a)/xhtml:html/xhtml:body"/>
</xsl:for-each>
</body>
</html>
</xsl:template>
<!-- the famous identity template -->
<xsl:template match="@*|node()">
<xsl:copy>
  <xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- template body -->
<xsl:template match="xhtml:body">
  <xsl:apply-templates select="@*|node()"/>
</xsl:template>
<!-- template h1 -->
<xsl:template match="xhtml:h1">
<h2>
  <xsl:apply-templates select="@*|node()"/>
</h2>
</xsl:template>
<!-- template h2 -->
<xsl:template match="xhtml:h2">
<h3>
  <xsl:apply-templates select="@*|node()"/>
</h3>
</xsl:template>
<!-- filenames to include --> [12]
<oo:filenames>
  <oo:item>introduction.xml</oo:item>
  <oo:item>chapter-1.xml</oo:item>
  <oo:item>chapter-2.xml</oo:item>
  <oo:item>chapter-3.xml</oo:item>
  <oo:item>appendix-a.xml</oo:item>
</oo:filenames>
</xsl:stylesheet>

7. XSLT 2.0: xhtmls2xhtml-2.xslt

This XSLT 2.0 stylesheet is the same as the previous XSLT 1.0 stylesheet except for:

  1. The fn:doc-available() function. This function is new in XPath 2.0. We can now use the full list of filenames before all the files exist. Note that several additional files have been added even though they don't exist at the moment. In XSLT 1.0 the non-existing files would have caused an error.
  2. We also use the <xsl:result-document> element new in XSLT 2.0 inside the main template in order to create two output documents (XHTML 1.0 Strict! and XHTML 1.1) with a name in a certain directory.

  3. In XSLT/XPath 1.0 unless used together with an ordinary programming language in a loop, the output file is always just one file.

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns="http://www.w3.org/1999/xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:oo="lookup" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
exclude-result-prefixes="xhtml oo">
<xsl:output method="xml" version="1.0"/>
<xsl:variable name="lookup" select="document('')"/>
<xsl:template match="/">
<!-- We create two output documents, XHTML 1.0 Strict! and XHTML 1.1 -->
<xsl:for-each select="$lookup//oo:version">
<xsl:result-document omit-xml-declaration="{oo:omit-xml-declaration}" href="{concat('book-', position())}.html" doctype-public="{oo:doctype-public}" doctype-system="{oo:doctype-system}" indent="yes"><html xml:lang="en">
<head>
  <xsl:if test="oo:mimetype eq 'text/html'"> [13]
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </xsl:if>
  <link rel="stylesheet" href="xhtmls2xhtml.css" type="text/css"/>
  <title>A Book of Merged Items</title>
</head>
<body>
<h1>A Book of Merged Items</h1>
<xsl:for-each select="$lookup//oo:item">
  <xsl:variable name="a" select="."/>
  <xsl:if test="doc-available($a)"> [14]
    <xsl:text> </xsl:text>
    <xsl:comment>
      <xsl:value-of select="."/>
    </xsl:comment>
    <xsl:text> </xsl:text>
    <xsl:variable name="articles" select="document($a)/xhtml:html/xhtml:body"/>
    <xsl:apply-templates select="$articles"/>
  </xsl:if>
</xsl:for-each>
</body>
</html>
</xsl:result-document>
</xsl:for-each>
</xsl:template>
<!-- the famous identity template -->
<xsl:template match="@*|node()">
<xsl:copy>
  <xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- template body -->
<xsl:template match="xhtml:body">
  <xsl:apply-templates select="@*|node()"/>
</xsl:template>
<!-- template h1 -->
<xsl:template match="xhtml:h1">
  <h2>
    <xsl:apply-templates select="@*|node()"/>
  </h2>
</xsl:template>
<!-- template h2 -->
<xsl:template match="xhtml:h2">
  <h3>
    <xsl:apply-templates select="@*|node()"/>
  </h3>
</xsl:template>
<!-- supplying the filenames --> [15]
<oo:filenames>
  <oo:item>introduction.xml</oo:item>
  <oo:item>chapter-1.xml</oo:item>
  <oo:item>chapter-2.xml</oo:item>
  <oo:item>chapter-3.xml</oo:item>
  <oo:item>chapter-4.xml</oo:item>
  <oo:item>chapter-5.xml</oo:item>
  <oo:item>chapter-6.xml</oo:item>
  <oo:item>chapter-7.xml</oo:item>
  <oo:item>chapter-8.xml</oo:item>
  <oo:item>chapter-9.xml</oo:item>
  <oo:item>appendix-a.xml</oo:item>
  <oo:item>appendix-b.xml</oo:item>
  <oo:item>appendix-c.xml</oo:item>
</oo:filenames>
<!-- supplying the DOCTYPES --> [16]
<oo:doctypes>
  <oo:version>
    <oo:mimetype>application/xhtml+xml</oo:mimetype>
    <oo:omit-xml-declaration>no</oo:omit-xml-declaration>
    <oo:doctype-public>-//W3C//DTD XHTML 1.1//EN</oo:doctype-public>
    <oo:doctype-system>http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd</oo:doctype-system>
  </oo:version>
  <oo:version>
    <oo:mimetype>text/html</oo:mimetype>
    <oo:omit-xml-declaration>yes</oo:omit-xml-declaration>
    <oo:doctype-public>-//W3C//DTD XHTML 1.0 Strict//EN</oo:doctype-public>
    <oo:doctype-system>http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd</oo:doctype-system>
  </oo:version>
</oo:doctypes>
</xsl:stylesheet>

Footnotes

[1]

What file extension should we use for XHTML? When used for storage or as input files over the Internet, "xml" is a good choice. "Xml" is recognised all over the place. When looked at in the browsers, Internet Explorer and most professional XML Editors will show a nicely indented XML node tree. But browsers like Firefox and Opera will, in my opinion wrongly, use their default XHTML CSS stylesheet.

When XHTML is used for presentation "html" is a good choice for file extension. All browsers and XML Editors will do their best to render a "html" file using their default XHTML CSS stylesheet.

Only if you work on your own could other file extensions like "xhtml" be relevant. That is the approach XMLSpy is taking but such a filetype will not work all over the Internet and you must even set up your own web server to handle such a filetype. Most other XML Editors also need to be told what to do with it.

[2]

It is extremely tricky to transform XML in a default namespace like XHTML. Read my article at www.xmltraining.biz: Transform XHTML to XHTML with XSLT.

[3]

Input files

introduction.xml, chapter-1.xml, chapter-2.xml, chapter-3.xml, appendix-a.xml, filenames.xml

XSLT files

xhtmls2xhtml-1a.xslt, xhtmls2xhtml-1b.xslt, xhtmls2xhtml-2.xslt

Output files

book.xml (XHTML 1.0 Strict!), book-1.html (XHTML 1.1, mimetype application/xhtml+xml could be added at the webserver or with scripting before leaving the webserver), book-2.html (XHTML 1.0 Strict!, ready to use with mimetype "text/html").

[4]

We declare the XHTML namespace both as default namespace and with the "xhtml" prefix. The default namespace is necessary to avoid xmlns="" in the output file, and the "xhtml" prefix is necessary to get to markup in a default namespace in the XHTML input files.
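
In XSLT 2.0 the xpath-default-namespace attribute offers another way to reach markup in a default namespace, so the "xhtml" prefix is not needed in match patterns. The following is a minimal sketch only, not one of the article's files, and it assumes an XSLT 2.0 processor:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xpath-default-namespace="http://www.w3.org/1999/xhtml">
  <!-- Identity template: copy everything not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- The unprefixed name "h1" now matches h1 in the XHTML namespace. -->
  <xsl:template match="h1">
    <h2>
      <xsl:apply-templates select="@*|node()"/>
    </h2>
  </xsl:template>
</xsl:stylesheet>
```

The default namespace declaration is still needed for the literal h2 element in the output; xpath-default-namespace only affects unprefixed element names in XPath expressions and match patterns.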

[5]

Here we load the XML file with the filenames.        

[6]

Here we get the content of each item of the book one by one.

[7]

We need the templates xhtml:h1 and xhtml:h2 to change the heading levels of the input documents. What is h1 in the input files must be changed to h2 in "book.xml", etc.

[8]

XHTML 1.0 Strict! is good as an "all-round" XHTML doctype since it works without problems served with mimetype "text/html". We can also serve it with mimetype "application/xhtml+xml" to browsers understanding it. As long as we use our XHTML as XML data store we don't even bother about mimetypes. We mark up and transform without thinking about it.

[9]

"Lookup" is not a professional http address for a namespace but it works well as a test.

[10]

The mono-spaced font makes it look like a space between the single quotes but there is none. This "no content" trick loads the XSLT stylesheet itself as if it is an external XML file you want to include in the transformation. In this way the stylesheet is loaded twice.

[11]

If the XHTML 1.0 Strict! file is to be served with mimetype "application/xhtml+xml", this is how it could look. If it is to be served as "text/html" for old and less advanced browsers like Internet Explorer 6 and 7, we should add the following metatag to the head section: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />.

[12]

It is probably news to many beginners in XSLT that we can have all sorts of internal "lookup tables" in an XSLT stylesheet, but we must keep them at global level outside the templates. I most often have them in separate files.

[13]

If the XHTML output file is not served with mimetype "application/xhtml+xml" but with good old "text/html", it is necessary to add the metatag.

[14]

This is a nice new function in XPath 2.0. We can now make the full list of filenames ready from the beginning even if some of them don't exist yet. In XSLT/XPath 1.0 we must wait until the files are ready or comment them out.

[15]

Note that several files that do not exist yet have been added in order to test the doc-available() function.

[16]

It would have been better to place this lookup section in a separate file so other XSLT stylesheets could share it.
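
As a sketch of that idea, the lookup section could live in a file of its own. The filename doctypes.xml is hypothetical and not among the article's files; the content is the same as in the stylesheet above.

```xml
<?xml version="1.0"?>
<!-- doctypes.xml: hypothetical shared lookup file -->
<oo:doctypes xmlns:oo="lookup">
  <oo:version>
    <oo:mimetype>application/xhtml+xml</oo:mimetype>
    <oo:omit-xml-declaration>no</oo:omit-xml-declaration>
    <oo:doctype-public>-//W3C//DTD XHTML 1.1//EN</oo:doctype-public>
    <oo:doctype-system>http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd</oo:doctype-system>
  </oo:version>
  <oo:version>
    <oo:mimetype>text/html</oo:mimetype>
    <oo:omit-xml-declaration>yes</oo:omit-xml-declaration>
    <oo:doctype-public>-//W3C//DTD XHTML 1.0 Strict//EN</oo:doctype-public>
    <oo:doctype-system>http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd</oo:doctype-system>
  </oo:version>
</oo:doctypes>
```

A stylesheet that declares the same "oo" namespace could then load the file with document('doctypes.xml'), assuming it sits next to the stylesheet, and loop over its oo:version elements as before.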

Updated 2009-08-06