Jesper Tverskov, May 17, 2007

Collection() with REGEX in XSLT

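Collection() is a non-standardized standard function: it is defined in the standard, but how its URI argument is interpreted is left to each implementation. It can be used as a more powerful alternative to document() and doc(), using wild cards and Regular Expressions to load a collection of XML documents. Or it can use a catalog file.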

I have doubts about the concept of non-standardized standard functions [1]. But with some luck all XSLT processors will one day implement collection() the same way. The definition of fn:collection in the W3C spec is not easy reading. The documentation of collection in Saxon could also have been better. It is hard to find any down to earth information about this powerful function.

1. Collection() in XSLT processors

At the moment we have three XSLT 2.0 processors: Saxon, AltovaXML and Gestalt. In this tutorial we will look at how the collection() function is implemented in Saxon. When better documentation for collection() in AltovaXML is published, I will rewrite the tutorial.

2. Wild cards and patterns

One common use case is to load all files in some directory. With document() and doc() we need to know all the file names. With collection() we can select the files using wild cards like: *.*.

<xsl:for-each select="collection('file:///c:/someDir/?select=*.*')">
  <!-- et cetera -->
</xsl:for-each>

We can use patterns made with Regular Expressions. Let us say we want to load all .xsl and .xslt documents in some directory having file names beginning with a letter followed by a number, e.g.: "b34.xsl". The Regular Expression could look like this: [a-z][0-9]+.(xsl|xslt).

The Regular Expression becomes part of the URI. If the Regular Expression uses characters like "|" not allowed in a URI, we must escape them using the iri-to-uri() function inside the collection() function.

<xsl:for-each select="collection(iri-to-uri('file:///c:/someDir/?select=[a-z][0-9]+.(xsl|xslt)'))">
  <!-- et cetera -->
</xsl:for-each>
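
To see what iri-to-uri() does to the pattern, a small sketch like the following can help. It only outputs the escaped string; the variable name "uri" is made up for illustration. According to the definition of fn:iri-to-uri, the "|" should be escaped as %7C, while brackets, parentheses and "+" are left untouched.

<!-- "uri" is a hypothetical variable name, used only for this illustration -->
<xsl:variable name="uri" select="iri-to-uri('file:///c:/someDir/?select=[a-z][0-9]+.(xsl|xslt)')"/>
<!-- should output: file:///c:/someDir/?select=[a-z][0-9]+.(xsl%7Cxslt) -->
<xsl:value-of select="$uri"/>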

3. Relative URI and subdirectories

We can use a relative URI, and we can add more keywords like "recurse" to include subdirectories (see the Saxon documentation for all options).

<xsl:for-each select="collection(iri-to-uri('../someDir/?select=[a-z][0-9]+.(xsl|xslt);recurse=yes'))">
  <!-- et cetera -->
</xsl:for-each>

The relative URI above is resolved against the location of the stylesheet, so the stylesheet only works if it is placed in a directory that is a sibling of "someDir". All relevant files in "someDir" and in its subdirectories will be returned.

4. Stylesheet example

Let us say that we want to make a stylesheet that extracts the filename and XSLT version number from all the xsl and xslt documents in a directory. We will place the file names and the version numbers in a table in an XHTML result document.

We use the collection() function with Regular Expressions to find the documents. We need the iri-to-uri() function if the Regular Expression contains characters not allowed in a URI. We use the document-uri() function to extract the path/filename from the found documents. We use the tokenize() function to extract the filename from path/filename.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
<xsl:template match="/">
<xsl:result-document href="filenames.html" indent="yes">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>Test of collection() function</title>
  <style type="text/css">
    table{border-collapse: collapse; empty-cells:show}
    th, td{border: 1px solid silver}
    th{background-color:whitesmoke}
  </style>
</head>
<body>
  <h1>My XSLT stylesheets</h1>
  <table cellspacing="0" cellpadding="5">
  <tr><th>No</th><th>Filename</th><th>Version</th></tr>
  <xsl:for-each select="collection(iri-to-uri('file:///c:/someDir/anotherDir/?select=*.(xsl|xslt)'))">
  <tr><td><xsl:value-of select="position()"/></td><td><xsl:value-of select="tokenize(document-uri(.), '/')[last()]"/></td><td><xsl:value-of select="./xsl:stylesheet/@version"/></td></tr>
  </xsl:for-each>
</table>
</body>
</html>
</xsl:result-document>
</xsl:template>
</xsl:stylesheet>

In the above stylesheet the xsl:result-document, the collection() function, the iri-to-uri() function and the tokenize() function are all new in XSLT 2.0. How could we ever live with XSLT 1.0!

We get to the loaded documents in these two xsl:value-of instructions:

<xsl:value-of select="tokenize(document-uri(.), '/')[last()]"/>
<xsl:value-of select="./xsl:stylesheet/@version"/>

In both, the "." is the loaded document, which is the context item inside xsl:for-each. In the first, we extract the URI and then the filename. In the second, we move from the loaded document into its top element, "xsl:stylesheet", and then to its version attribute.
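
The filename extraction can be tried on a literal string. This is a small sketch; the URI below is a made-up stand-in for what document-uri(.) might return for one loaded file. The tokenize() function splits the string at every "/", and [last()] keeps the final piece.

<!-- hypothetical URI, standing in for the value of document-uri(.) -->
<xsl:value-of select="tokenize('file:/c:/someDir/anotherDir/b34.xsl', '/')[last()]"/>
<!-- outputs: b34.xsl -->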

The above stylesheet, collection.xsl, does not make use of an input file. You can use some dummy input file like the stylesheet itself to start it up. Or you can change the template's match attribute to a name attribute and specify an initial template if you prefer. If you have problems understanding the code, it might help to take a look at an output file for one of my directories, filenames.html.
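
The named-template variant could look like the sketch below. The template name "main" is an arbitrary choice, and the command line in the comment assumes a Saxon 9 style command line and a jar file name that may well differ on your system.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<!-- "main" is an arbitrary name. Assumed way to run it (jar name is a guess):
     java -jar saxon9he.jar -it:main -xsl:collection.xsl -->
<xsl:template name="main">
  <xsl:result-document href="filenames.html" indent="yes">
    <!-- same content as in the stylesheet in this section -->
  </xsl:result-document>
</xsl:template>
</xsl:stylesheet>

With a named initial template no dummy input file is needed.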

5. XML Schema and DTD

The documentation of collection in Saxon shows that we can use the "validate" keyword to configure how the loaded documents should be validated against a schema. We don't have similar options for DTDs.

If we load a collection of documents that use DTDs, the DTDs are loaded too, and that can take many seconds, if not minutes, e.g. for XHTML documents. If loading of the DTDs is a big issue, we can use additional software [2] or we must transform through a programming language that can solve such problems. [3]

6. Using a catalog file

The URI in the collection() function can also point to a catalog file containing the filenames:

collection('file:///c:/someDir/someCatalogFile.xml')
or
collection('someDir/someCatalogFile.xml')
or
collection('http://www.someWebsite/someDir/someCatalogFile.xml')

In Saxon the catalog file can look like the one below. The href in the catalog file cannot use Regular Expressions. The catalog file itself could have been generated with XSLT using the collection() function.

<collection stable="true">
  <doc href="dir/chap1.xml"/>
  <doc href="dir/chap2.xml"/>
  <doc href="dir/chap3.xml"/>
</collection>

The idea is that since the file format is known, we only need to supply the URI to it.
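
Generating such a catalog file with XSLT could be sketched as below. The directory, the "*.xml" pattern and the template name "main" are placeholders chosen for illustration, and the sketch assumes that document-uri() returns a usable absolute URI for each loaded document, as it does in the stylesheet example in section 4.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" indent="yes"/>
<!-- "main" is an arbitrary name for the initial template -->
<xsl:template name="main">
  <collection stable="true">
    <!-- directory and pattern are placeholders -->
    <xsl:for-each select="collection('file:///c:/someDir/?select=*.xml')">
      <!-- one doc element per loaded document, pointing at its URI -->
      <doc href="{document-uri(.)}"/>
    </xsl:for-each>
  </collection>
</xsl:template>
</xsl:stylesheet>

The hrefs written this way are absolute URIs, unlike the relative ones in the hand-written catalog above.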

7. REGEX flavors

We don't have a standard for Regular Expressions; they always come in flavors. The mixture of "*" wild card and REGEX in the collection() implementation in Saxon is a little strange, but I find it useful.

In "*.(xsl|xslt)" the asterisk or star is a traditional filename wild card, and (xsl|xslt) is more like REGEX syntax. A "proper" REGEX could have looked like this: ".+\.(xsl|xslt)", but that does not work. My tests indicate that a period always means a literal period, not any character. When the asterisk, "*", cannot be mistaken for a quantifier, it is a wild card for legal filename characters.

We must use parentheses where other flavors do not need them. "\w+" does not work, but this does: (\w)+[0-9]+.(xsl|xslt). The expression says: "return all xslt or xsl files having filenames with any mixture of these characters: [a-zA-Z0-9_], but they must end with one or more digits". We could add "-" like this: (\w|\-), etc. But "*[0-9]+" is the new better way to do it! [4]

8. Loading documents

In XSLT we can load documents with the document(), doc(), unparsed-text() and collection() functions. Only the last can use Regular Expression-like syntax. There are many tutorials and examples around about how to load documents one at a time. It is rare to see examples using the document() function to load many files in one call (doc() can't do that). Here is the simple way:

<xsl:variable name="a" select="document(('a.xml', 'b.xml'))"/>
<!-- et cetera -->
<items>
  <xsl:for-each select="$a">
    <item><xsl:value-of select="./topelement/etc."/></item>
  </xsl:for-each>
</items>

The sequence of URIs can also be read from an XML file, somewhat like the catalog file used by the collection() function, but much more flexibly: the XML file can have any format. Below we use the catalog file from section 6 as an example. Because the document() function knows nothing about the file format, we must select the sequence of URIs ourselves with a proper XPath expression:

<items>
  <xsl:for-each select="document(document('someCatalog.xml')/collection/doc/@href)">
    <item><xsl:value-of select="./topelement/etc."/></item>
  </xsl:for-each>
</items>

This tutorial is only to get you started. You must look into the documentation for collection in Saxon, collection in AltovaXML, collection in Gestalt, etc., for the finer details.

Footnotes

[1]

See my question to the XSL-mailing list at mulberrytech.com, The collection() function.

[2]

We have open source tools like Andrew Welch's Kernow designed to make it faster and easier to repeatedly run transforms.

[3]

See also Andrew Welch, Using collection() and saxon:discard-document() to create reports.

[4]

Please note that this REGEX MK manual is based on one hour of testing. I could easily have overlooked something.

Updated 2009-08-06