Jesper Tverskov, January 24, 2009

XHTML sections: implicit2explicit-hierarchy.xsl

How to unflatten XML often comes up in XSLT Help Fora like the xsl-list. In this tutorial I give a complete and detailed example of how to transform an XHTML document using h1-h6 headings into an explicit hierarchical structure of nested sections.

If an XHTML document makes proper use of h1-h6 heading levels, it is easy to make the implicit hierarchical structure explicit. An explicit XHTML hierarchy is overkill for most webpages but can be very handy for long reports, articles, etc.

1. Benefits of explicit hierarchical structure

An explicit hierarchical structure makes further processing and styling easier than an implicit one. The h and section elements in XHTML 2.0 have the additional benefit of making it easy to merge documents: the heading levels always adjust automatically. Unfortunately, standard browsers offer little support for XHTML 2.0, but we can use XHTML 2.0 server-side.

Below I provide two XSLT stylesheets for transforming XHTML h1-h6 type of documents into explicit hierarchies of nested sections. The first stylesheet transforms to div elements and keeps the heading level numbers. The second stylesheet transforms to XHTML 2.0 style section and h elements.

The XSLT stylesheets build on Michael Kay's example in the 4th edition of "XSLT 2.0 and XPath 2.0", ISBN: 978-0-470-19274-0, 2008, page 340. Thanks also to the members of the XSL-List -- Open Forum on XSL for their help.

2. XHTML with implicit hierarchical structure

See my tutorial Validating implicit XHTML hierarchy with Schematron for details about how h1-h6 elements should be used.

<body>
  <h1/>
  <h2/>
  <h3/>
  <h3/>
  <h2/>
  <h3/>
  <h4/>
  <h5/>
  <h6/>
  <h6/>
  <h2/>
</body>

3. XHTML with nested sections made of div elements

Making sections of div elements and keeping the heading numbering is a nice compromise. The new document is still a valid XHTML 1.0/1.1 document. An XSLT stylesheet like xhtml-hierarchy-1.xsl would transform the above implicit structure into the following explicit structure (a complete XHTML document is needed as input):

<body>
  <h1/>
  <div class="section">
    <h2/>
    <div class="section">
      <h3/>
    </div>
    <div class="section">
      <h3/>
    </div>
  </div>
  <div class="section">
    <h2/>
    <div class="section">
      <h3/>
      <div class="section">
        <h4/>
        <div class="section">
          <h5/>
          <div class="section">
            <h6/>
          </div>
          <div class="section">
            <h6/>
          </div>
        </div>
      </div>
    </div>
  </div>
  <div class="section">
    <h2/>
  </div>
</body>

4. Elements "h" and "section" in XHTML 2.0

XHTML 2.0, still only a draft and probably never to be finished, has h and section elements as an interesting alternative to h1-h6 headings, see: 8.5. The heading elements. If you want to use XHTML 2.0, it could be relevant to transform XHTML 1.0/1.1 or XHTML 2.0 documents using h1-h6 headings into this new explicit hierarchical structure. The XSLT stylesheet needed, xhtml-hierarchy-2.xsl, is the same as before with a few modifications. It creates the following explicit structure:

<body>
  <h/>
  <section>
    <h/>
    <section>
      <h/>
    </section>
    <section>
      <h/>
    </section>
  </section>
  <section>
    <h/>
    <section>
      <h/>
      <section>
        <h/>
        <section>
          <h/>
          <section>
            <h/>
          </section>
          <section>
            <h/>
          </section>
        </section>
      </section>
    </section>
  </section>
  <section>
    <h/>
  </section>
</body>

5. XSLT stylesheet line for line

Unless you already know the ins and outs of xsl:for-each-group and the "identity" template, the XSLT stylesheets provided with this tutorial are not easy to understand. Let us take the last one and explain what is going on line for line.

The input document could be any XHTML 1.0/1.1 document using h1-h6 implicit structure. The headings must be children of body. If not, you must modify the stylesheet.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0" xmlns="http://www.w3.org/1999/xhtml" xpath-default-namespace="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" indent="yes"/>[1]

  <xsl:template match="@*|node()">[2]
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="body">
    <xsl:copy>[3]
      <xsl:for-each-group select="element()|comment()|processing-instruction()" group-starting-with="h1">[4]
        <xsl:apply-templates select="." mode="group"/>[5]
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="h1" mode="group">[6]
    <h>
      <xsl:apply-templates select="@*|node()"/>
    </h>
    <xsl:for-each-group select="current-group() except ." group-starting-with="h2">[7]
      <xsl:apply-templates select="." mode="group"/>[8]
    </xsl:for-each-group>
  </xsl:template>

  <xsl:template match="h2|h3|h4|h5|h6" mode="group">
    <xsl:variable name="this" select="name()"/>[9]
    <xsl:variable name="next" select="translate($this, '23456', '34567')"/>
    <section>[10]
      <h>
        <xsl:apply-templates select="@*|node()"/>
      </h>
      <xsl:for-each-group select="current-group() except ." group-starting-with="*[name() = $next]">[11]
        <xsl:apply-templates select="." mode="group"/>
      </xsl:for-each-group>
    </section>
  </xsl:template>

  <xsl:template match="element()|comment()|processing-instruction()" mode="group">[12]
    <xsl:apply-templates select="current-group()"/>
  </xsl:template>

<!-- Templates to delete what is copied out of DTD. -->
  <xsl:template match="@shape"/><!-- a element -->[13]
  <xsl:template match="@colspan[. = 1]"/><!-- th, td element -->
  <xsl:template match="@rowspan[. = 1]"/><!-- th, td element -->
  <xsl:template match="@version"/><!-- html element, 1.1 -->
  <xsl:template match="@profile"/><!-- head element, 1.1 -->

<!-- Additional templates can be added here -->[14]
</xsl:stylesheet>

To change the above stylesheet so that it creates div elements and keeps the heading numbering, we must change the two instances of the h element to xsl:copy, and section must be changed to "div class='section'".
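The changes just described could look roughly like the sketch below. It is reconstructed from the description above and is not necessarily identical to the xhtml-hierarchy-1.xsl file that accompanies this tutorial. The footnote markers and the optional DTD-related templates are left out to keep it short.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xpath-default-namespace="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" indent="yes"/>

  <!-- "Identity" template, unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- body template, unchanged -->
  <xsl:template match="body">
    <xsl:copy>
      <xsl:for-each-group select="element()|comment()|processing-instruction()" group-starting-with="h1">
        <xsl:apply-templates select="." mode="group"/>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Changed: xsl:copy keeps the h1 element instead of creating h -->
  <xsl:template match="h1" mode="group">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
    <xsl:for-each-group select="current-group() except ." group-starting-with="h2">
      <xsl:apply-templates select="." mode="group"/>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Changed: div class="section" replaces section, xsl:copy keeps h2-h6 -->
  <xsl:template match="h2|h3|h4|h5|h6" mode="group">
    <xsl:variable name="this" select="name()"/>
    <xsl:variable name="next" select="translate($this, '23456', '34567')"/>
    <div class="section">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
      <xsl:for-each-group select="current-group() except ." group-starting-with="*[name() = $next]">
        <xsl:apply-templates select="." mode="group"/>
      </xsl:for-each-group>
    </div>
  </xsl:template>

  <!-- Non-heading content, unchanged -->
  <xsl:template match="element()|comment()|processing-instruction()" mode="group">
    <xsl:apply-templates select="current-group()"/>
  </xsl:template>
</xsl:stylesheet>
```

Because xsl:copy reproduces whatever heading element is the context node, the same two-line change works for h1 and for h2-h6 alike.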

The beauty of the stylesheet is that not only are we creating an explicit hierarchical structure but at the same time the "identity" template is in place and we can add as many additional templates of exceptions as we need. For almost any node in input we can change it to something else.

Footnotes

[1]

When transforming XHTML to XHTML I prefer using method="xml". XHTML is XML and should be treated as XML. Beginners are best served using method="xhtml". You will probably want to add serialization parameters like doctype-public="…" and doctype-system="…" to create a proper DOCTYPE.
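As one possible illustration, assuming the output is meant to be XHTML 1.0 Strict (which fits the div-based output, not the XHTML 2.0 one), the xsl:output element could be written as follows. The choice of DTD is my assumption for the example; use the identifiers of whatever document type you actually target.

```xml
<!-- Assumption for illustration: the target document type is XHTML 1.0 Strict -->
<xsl:output method="xml" indent="yes"
    doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
    doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
```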

[2]

This is the famous "identity" template. If you are new to the "identity" template, see my tutorial, Identity Template: xsl:copy with recursion. In our case, based on input, it creates:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    Etc.
  </head>
  Here the template matching "body" takes over.
</html>

[3]

Here the body element itself is copied over. Its content, including everything in body before h1, is handled by the grouping inside xsl:copy.

[4]

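As children of body we have heading elements, other elements, processing instructions and comments. We group all we have found into groups starting with h1. If nothing precedes h1, there is only one such group, containing h1 and the rest of the document. Anything in body before h1 forms a group of its own, because the first item always starts a group even when it does not match the pattern.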
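To see how the groups are formed, a small diagnostic stylesheet can help. The sketch below is only for experimenting and is not part of the tutorial's stylesheets; the groups and group elements and their attributes are names I made up for the illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xpath-default-namespace="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" indent="yes"/>

  <!-- Go straight to body; html and head are ignored in this sketch -->
  <xsl:template match="/">
    <groups>
      <xsl:apply-templates select="html/body"/>
    </groups>
  </xsl:template>

  <xsl:template match="body">
    <xsl:for-each-group select="element()|comment()|processing-instruction()" group-starting-with="h1">
      <!-- Inside for-each-group the context item is the first item of the group -->
      <group starts-with="{name()}" size="{count(current-group())}"/>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```

For a body containing p, h1, p, h2 in that order, this should report two groups: one starting with p of size 1, and one starting with h1 of size 3.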

[5]

For our group containing h1 and all other elements, processing-instructions and comments in body after h1, we now apply all relevant templates but only if mode="group". The template matching h1 is now fired.

[6]

The h1 element is matched and the h element is created. We apply templates and the "identity" template takes over, copying attributes and content of the h1 element to the new h element, unless it is overruled by other templates with a more specific match attribute.

[7]

The current group, except the h1 already processed, is now grouped starting with h2. Whatever lies between h1 and the first h2 forms a group of its own and is matched by the template matching element()|comment()|processing-instruction(). We then process all the groups starting with h2.

[8]

For each h2 group we now apply all relevant templates. The first matching template is the one matching h2.

[9]

This and the following variable make sure that the next for-each-group will start with h3, etc.
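The effect of the translate() call can be checked in isolation. The following small stylesheet is only a sketch for experimenting; it can be run against any XML input document since it ignores the input.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Each digit in '23456' is replaced by the digit at the same position in '34567' -->
    <xsl:for-each select="('h2', 'h3', 'h4', 'h5', 'h6')">
      <xsl:value-of select="concat(., ' -> ', translate(., '23456', '34567'), '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

It lists h2 -> h3 through h6 -> h7. Since no h7 element exists, the grouping inside an h6 section never starts a nested section, and everything after an h6 stays in that section.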

[10]

The first section element with h element etc. is created.

[11]

New groupings starting with what? "*[name() = $next]" means: "start with the element having a name that is equal to the content of the variable named 'next'".

[12]

For elements, processing-instructions and comments in each group except for heading elements, and for content in body before h1, this template is fired. By itself it copies or creates nothing. It hands every item in the current group over to the templates in the default mode, normally the "identity" template, and the relevant markup and content is copied over, unless another template with a more specific match applies.

[13]

This template prevents the shape attribute from being copied out of the XHTML DTD and into all a elements. Templates for @rowspan and @colspan can also be necessary to prevent a lot of colspan="1" and rowspan="1" in tables in output. If you are new to this issue, see my tutorial: The shape="rect" attribute in xhtml to xhtml.

[14]

This is why the "identity" template is the pride of XSLT. We can always add templates of exceptions. Let us say we notice that the input XHTML document has a lot of markup like <b> where we would prefer <strong>. All we need to do is to add a template like the following:

<xsl:template match="b">
  <strong>
    <xsl:apply-templates select="@*|node()"/>
  </strong>
</xsl:template>

The template says: "Find all b elements and replace them with strong elements. Then copy attributes and content of the b elements over into the new strong elements".

Updated 2009-08-03