Jesper Tverskov, August 9, 2007

User-defined function for line-number in XSLT

In XSLT 2.0 we have 130 functions but we don't have a function to return the line-number of an element node. It is a challenging exercise to make a user-defined function for line-number. We need to get a lot of the new stuff in XSLT 2.0 working, like sequences, unparsed text and Regular Expressions.

The user-defined stylesheet function for line-number in this tutorial is nice if we need to get the line-number reported using standard XSLT only. It has been tested with Saxon and AltovaXML and should work with any XSLT 2.0 Processor. [1]

If portability is not a big issue we have alternatives. The Saxon XSLT processor has a saxon:line-number() extension function. We could also make a user-defined extension function with a programming language our XSLT processor supports.
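
A minimal sketch of the Saxon alternative follows. It assumes a Saxon edition that provides the saxon:line-number() extension function, and it assumes that line numbering was switched on when the source document was loaded (on the Saxon command line this is the -l option). The "test" and "warning" element names are only illustrative.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns:saxon="http://saxon.sf.net/"
  exclude-result-prefixes="#all">
<xsl:output indent="yes"/>

<!-- Saxon only: other processors will not recognise saxon:line-number() -->
<xsl:template match="/">
  <test>
    <xsl:for-each select="//xhtml:p">
      <warning>
        <xsl:value-of select="concat('Paragraph in line ', saxon:line-number(.))"/>
      </warning>
    </xsl:for-each>
  </test>
</xsl:template>

</xsl:stylesheet>
```

The price is portability: the stylesheet no longer runs on an XSLT 2.0 processor that does not know the saxon namespace.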

1. Recipe for line-number

To find the line-number of an element we need three pieces of information: the element name, e.g. "p"; the element number, e.g. "3" for the third "p" in the document; and the URI of the document containing the element. We must work these out first in order to pass them as parameters to the line-number function.

In the line-number function we use the unparsed-text() function to load the document. We then use the xsl:analyze-string element to detect and delete all "<" from comments, CDATA sections and Processing Instructions. This avoids false positives when we use the tokenize() function to split the unparsed text with "<" plus the node-name as splitter. Next we use subsequence() to keep only the items from the first up to the item with the number of our element node, and string-join() to assemble those items into a string again.

The rest is a piece of cake. We use string-length() to count the characters in our new string, then we delete all linefeeds from the string and count the characters again. Subtracting the second count from the first gives the number of linefeeds before the element, and adding 1 gives the line-number. [2]
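
The counting trick can be tried in isolation. The sketch below assumes a hypothetical three-line string held in a variable named "text" (not part of the stylesheet in this tutorial), and it can be run against any XML document.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="text"/>

<!-- Hypothetical sample: three lines separated by two linefeeds -->
<xsl:variable name="text" select="'first&#xA;second&#xA;third'"/>

<xsl:template match="/">
  <!-- Length with linefeeds minus length without linefeeds gives the number of linefeeds -->
  <xsl:variable name="linefeeds" select="string-length($text) - string-length(replace($text, '&#xA;', ''))"/>
  <!-- Whatever follows two linefeeds is on line three -->
  <xsl:value-of select="$linefeeds + 1"/>
</xsl:template>

</xsl:stylesheet>
```

The string is 18 characters long with its two linefeeds and 16 without them, so the stylesheet outputs 3.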

2. User-defined functions

We can make functionality similar to functions using named templates. But a named template is often a little clumsy if we just want some value to be returned, and it can only be called with the xsl:call-template element. We need a method to make our own functions that can be called from inside an XPath expression.

In XSLT 2.0 we have such a new method in the form of a new element called xsl:function. We can use it to make our own functions using parameters as we are used to from almost any other programming language. Our own functions must be in some namespace of our own.

2.1 Namespace for user-defined function

In this tutorial I use "xmlplease" as namespace prefix, and the line-number function looks like this: xmlplease:line-number(). The "xmlplease" prefix is my choice and must be bound to a namespace declared in the top element of the XSLT stylesheet.

I use "http://www.xmlplease.com/xslt" as the namespace. For a test anything will do. You can call your prefix "test", and the namespace declaration could look like this: xmlns:test="mynamespace".
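
To see the namespace declaration and xsl:function working together, here is a minimal sketch using the "test" prefix suggested above. The function name test:is-too-long and its two parameters are made up for the illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:test="mynamespace">
<xsl:output method="text"/>

<!-- A user-defined function: its name must carry a namespace prefix -->
<xsl:function name="test:is-too-long" as="xs:boolean">
  <xsl:param name="text" as="xs:string"/>
  <xsl:param name="limit" as="xs:integer"/>
  <xsl:sequence select="string-length($text) gt $limit"/>
</xsl:function>

<!-- The function is called from inside an XPath expression -->
<xsl:template match="/">
  <xsl:value-of select="test:is-too-long('A short paragraph.', 500)"/>
</xsl:template>

</xsl:stylesheet>
```

Run against any XML document this outputs "false". The "as" attributes are optional, but they let the processor report a wrong argument type early.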

3. XSLT for line-number

The easiest way to understand our user-defined function for line-number is to walk through an XSLT stylesheet, linenumber.xsl, that makes use of it.

Let us use XHTML as an example, and let us say that we have decided that our paragraphs must not be longer than 500 characters to make them easy to read. When writing or editing XHTML it is nice to get a warning if a paragraph is longer than 500 characters and to get the line-number for the paragraph reported as part of the warning.

We could use all sorts of tools and programming languages to obtain such functionality. If we were to implement it using standard XSLT only, a basic test stylesheet showing how the functionality should work could look like the following.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0" xmlns:xhtml="http://www.w3.org/1999/xhtml" [3]xmlns:xmlplease="http://www.xmlplease.com/xslt" [4]exclude-result-prefixes="#all">
<xsl:output indent="yes"/>
<xsl:variable name="document-uri" select="document-uri(/)"/> [5]

<xsl:function name="xmlplease:line-number"> [6]
  <xsl:param name="document-uri"/><!-- similar to document-uri() --> [7]
  <xsl:param name="node-name"/><!-- e.g.: 'p' --> [8]
  <xsl:param name="node-number"/><!-- e.g.: '3', that is the third p --> [9]
  <xsl:variable name="unparsed" select="unparsed-text($document-uri)"/> [10]
  <xsl:variable name="unparsed2"> [11]
    <xsl:analyze-string select="$unparsed" regex="&lt;!--.*?--&gt;|&lt;!\[CDATA\[.*?\]\]&gt;|&lt;\?.*?\?&gt;" flags="s"> [12]
      <xsl:matching-substring> [13]
        <xsl:value-of select="replace(., '&lt;', '')"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring> [14]
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:variable>
  
  <xsl:value-of select="string-length(string-join(subsequence(tokenize($unparsed2, concat('&lt;', $node-name)), 1, $node-number), ' ')) - string-length(replace(string-join(subsequence(tokenize($unparsed2, concat('&lt;', $node-name)), 1, $node-number), ' '), '&#xA;', '')) + 1"/> [15]
</xsl:function>

<xsl:template match="/"> [16]
  <test>
    <xsl:apply-templates select="//xhtml:p"/>
  </test>
</xsl:template>

<xsl:template match="xhtml:p">
  <xsl:if test="string-length(.) gt 500">
    <xsl:variable name="n"> [17]
      <xsl:number level="any"/>
    </xsl:variable>

    <warning>
      <xsl:value-of select="concat('Paragraph in line ', xmlplease:line-number($document-uri, name(), $n)), 'has', string-length(), 'characters.' "/> [18]
    </warning>
  </xsl:if>
</xsl:template>
</xsl:stylesheet>

In the code above the xsl:value-of element inside the function, marked [15], is the core of the logic calculating our line-number. Let us explain it in some detail.

  1. If "$node-name" contains "p" then concat('&lt;', $node-name) gives us "<p" and we use that to split the document, tokenize($unparsed2, concat('&lt;', $node-name)).
  2. In the sequence created we only need the items from the first until the "node-number" of the node we are looking for, subsequence(tokenize($unparsed2, concat('&lt;', $node-name)), 1, $node-number).
  3. We use the string-join() function to make the subsequence into a string, string-join(subsequence(tokenize($unparsed2, concat('&lt;', $node-name)), 1, $node-number), ' ').
  4. Finally we count the characters of the string, string-length(string-join(subsequence(tokenize($unparsed2, concat('&lt;', $node-name)), 1, $node-number), ' ')).
  5. In the second half of the expression the first half is repeated except that newline characters are deleted from the string before we count the characters, replace(string-join(subsequence(tokenize($unparsed2, concat('&lt;', $node-name)), 1, $node-number), ' '), '&#xA;', '').
  6. When we subtract the second count from the first we get the number of newline characters, that is the number of lines, before the element node in question. We add "1" to get the line-number of our element node.
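
The six steps can also be written with one variable per step, which avoids repeating the long expression. This is only a sketch: the function name xmlplease:line-from-text and the "sample" variable are hypothetical, and the function expects text from which "<" has already been removed in comments, CDATA sections and PIs.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:xmlplease="http://www.xmlplease.com/xslt">
<xsl:output method="text"/>

<xsl:function name="xmlplease:line-from-text" as="xs:integer">
  <xsl:param name="cleaned" as="xs:string"/>
  <xsl:param name="node-name" as="xs:string"/>
  <xsl:param name="node-number" as="xs:integer"/>
  <!-- Step 1: split the text at every start tag of the element -->
  <xsl:variable name="items" select="tokenize($cleaned, concat('&lt;', $node-name))"/>
  <!-- Step 2: keep the items in front of the element we are looking for -->
  <xsl:variable name="before" select="subsequence($items, 1, $node-number)"/>
  <!-- Step 3: make the items into one string again -->
  <xsl:variable name="joined" select="string-join($before, ' ')"/>
  <!-- Steps 4 to 6: count with and without newlines, subtract, add 1 -->
  <xsl:sequence select="string-length($joined) - string-length(replace($joined, '&#xA;', '')) + 1"/>
</xsl:function>

<!-- Hypothetical sample: body on line 1, the two p elements on lines 2 and 3 -->
<xsl:variable name="sample" select="'&lt;body&gt;&#xA;&lt;p&gt;one&lt;/p&gt;&#xA;&lt;p&gt;two&lt;/p&gt;&#xA;&lt;/body&gt;'"/>

<xsl:template match="/">
  <xsl:value-of select="xmlplease:line-from-text($sample, 'p', 2)"/>
</xsl:template>

</xsl:stylesheet>
```

For the second "p" the first two items contain two newline characters, so the output is 3.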

Footnotes

[1]

At the moment Saxon and AltovaXML are the only commercial XSLT 2.0 processors I know of. We also have an XSLT 2.0 processor named Gestalt, but I feel it is mostly experimental. Microsoft has announced in some MS XML team blog that they are working on an XSLT 2.0 processor of their own.

[2]

Making a user-defined function for line-number is a good exercise because it covers several of the new elements and functions in XSLT 2.0:

  1. xsl:function
  2. unparsed-text()
  3. replace()
  4. tokenize()
  5. string-join()
  6. subsequence()
  7. xsl:analyze-string
  8. xsl:matching-substring
  9. xsl:non-matching-substring

[3]

The easiest way to get to markup in a default namespace is to declare the same namespace with a prefix in the XSLT stylesheet. See my article Transform XHTML to XHTML with XSLT.

[4]

This is my self-made namespace for my user-defined function.

[5]

We need to supply the document URI to the function. We can hard-code the URI, use the document-uri() function at global level where the document is the context, or get the URI supplied as a parameter from outside the XSLT stylesheet. Inside the function element there is no context, so we can only get context information supplied with a parameter. Even in the template calling the function we cannot use document-uri(.) because the context there is the element matched by the template, and document-uri() only returns a URI for a document node.
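
The third option, a URI supplied from outside, could be sketched like this. It assumes we keep the name "document-uri" and turn the global variable into a global parameter. How a value is passed in depends on the XSLT processor.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="text"/>

<!-- Global parameter: the caller may pass a URI in from outside.
     If no value is passed, we fall back to the URI of the source document. -->
<xsl:param name="document-uri" select="document-uri(/)"/>

<xsl:template match="/">
  <xsl:value-of select="$document-uri"/>
</xsl:template>

</xsl:stylesheet>
```

The rest of a stylesheet can use $document-uri exactly as before, because a global parameter is referenced in the same way as a global variable.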

[6]

The xsl:function element can only be used at global level as child of the top-element.

[7]

It is a little confusing that I call the parameter "document-uri". We already have a variable and a function with the same name. But they are all related. We use the function to put the URI into the variable, and we then use the variable as value for the parameter.

[8]

The line-number function only works for element nodes and the "node-name" parameter is simply the name of the node we are looking for like "p" for "<p>".

[9]

The "node-number" parameter is the number of the node in question in the XML node tree only counting nodes with the same name.

[10]

Here we load our XML document as unparsed text.

[11]

We use the xsl:analyze-string element to delete all "<" from comments, CDATA sections and PIs in order to avoid false positives when we use the element node we are looking for to split the unparsed text. We need to keep the comments, CDATA sections and PIs because they could contain newline characters. See my article Deleting comments, CDATA sections and PI from unparsed text with REGEX in XSLT.

[12]

Note the "s" flag to put the "." into dot-all mode also matching newline characters. Also note the question mark after the asterisk. It is all explained in my article mentioned above.
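
The effect of the question mark is easy to see with replace(). The sketch below assumes a hypothetical string containing two comments, the first of them spanning two lines.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="text"/>

<!-- Hypothetical sample: two comments, the first one spanning two lines -->
<xsl:variable name="sample" select="'a&lt;!-- x&#xA;y --&gt;b&lt;!-- z --&gt;c'"/>

<xsl:template match="/">
  <!-- Reluctant quantifier: each comment is matched on its own, leaving "abc" -->
  <xsl:value-of select="replace($sample, '&lt;!--.*?--&gt;', '', 's')"/>
  <xsl:text>&#xA;</xsl:text>
  <!-- Greedy quantifier: one match from the start of the first comment
       to the end of the last one, leaving "ac" -->
  <xsl:value-of select="replace($sample, '&lt;!--.*--&gt;', '', 's')"/>
</xsl:template>

</xsl:stylesheet>
```

Without the "s" flag the dot would not match the newline inside the first comment, so that comment would not be matched at all.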

[13]

For each match the "&lt;", that is "<", is deleted to avoid false positives when we use the node as splitter.

[14]

All non-matches are just copied over from the unparsed text.

[15]

This xsl:value-of is explained later in the article but we basically use the "node-name" to split our unparsed text creating a sequence of items. We then use subsequence() from first until the number of our element node in question and string-join(), to create a new string. We then count the characters, delete the linefeed characters, count again, and subtract the last count from the first to get the line-number.

[16]

Our two templates are just there to have a small example making use of our xmlplease:line-number() function.

[17]

In the "n" variable we use the xsl:number element to get the number of our element node in question. We want its position in the XML node tree only counting elements with the same name.
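
To make clear what xsl:number counts here, the sketch below puts the same count next to it as a plain XPath expression. The variable "m" and the attribute names "number" and "counted" are only there for the illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="#all">
<xsl:output indent="yes"/>

<xsl:template match="/">
  <test>
    <xsl:apply-templates select="//xhtml:p"/>
  </test>
</xsl:template>

<xsl:template match="xhtml:p">
  <!-- The numbering used in the article -->
  <xsl:variable name="n">
    <xsl:number level="any"/>
  </xsl:variable>
  <!-- The same count spelled out in XPath: this p, plus every p that is
       an ancestor or comes before it in document order -->
  <xsl:variable name="m" select="count(preceding::xhtml:p | ancestor-or-self::xhtml:p)"/>
  <p number="{$n}" counted="{$m}"/>
</xsl:template>

</xsl:stylesheet>
```

For every paragraph the two attributes should show the same number.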

[18]

In the concat() function we finally make use of our xmlplease:line-number($document-uri, name(), $n) using our three parameters. The "$document-uri" is our global variable containing the URI of our document. The name() function supplies the element node name for the context node. The $n variable contains the number of our element node in question.

Updated 2009-08-06