Jesper Tverskov, July 10, 2007

Validating implicit XHTML hierarchy with Schematron

Most XHTML documents benefit from an implicit hierarchical structure made with h1-h6 heading elements. The hierarchy makes further processing easier, and it improves usability, accessibility and search engine optimization. We can use Schematron to validate an implicit hierarchy.

1. Some benefits of hierarchy

Here is a list of some of the tasks made easier if a document has a hierarchical structure, implicit or explicit:

  1. Generation of TOC and section numbers.
  2. Merging or splitting up XHTML documents.
  3. Transforming or querying XHTML into other documents or views.
  4. Transforming XHTML into other formats.
  5. Navigating the document, e.g. in screen readers.

2. Schematron is necessary

Considering the usefulness of implicit hierarchical structures, it is a little surprising that we can not validate such XHTML structures with DTD or with other grammar-based schema languages. [1] They are simply not fit to validate implicit hierarchies. [2]

That is one of the reasons why an explicit hierarchical structure using nested section elements (new) and h elements (new) is proposed as an option for XHTML 2.0. If we only have an h element and its level is determined by the depth of nesting, the hierarchy is always correct no matter how we make or merge or split up XHTML documents.

For the time being the only easy way to validate an implicit XHTML hierarchy is by setting up additional validation for our XHTML documents using Schematron, a rules-based schema language.

3. The rules of implicit XHTML hierarchy

An implicit hierarchy made with heading elements consists of the following rules:

  1. There can only be one h1, and it must be the first heading.
  2. The first heading after h1 must be h2.
  3. The first heading after h2 must be h2-h3.
  4. The first heading after h3 must be h2-h4.
  5. The first heading after h4 must be h2-h5.
  6. The first heading after h5 or h6 must be h2-h6. [3]

In bad web design and even in the HTML 4.01 spec [4], we can see code examples using more than one h1 element. This is of course not possible if we want an implicit hierarchical structure. One and only one h1 element must start the hierarchy, just like an XML document can only have one top element.
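To make the rules concrete, here is a small XHTML instance whose heading sequence (h1, h2, h3, h3, h2, h3, h4) satisfies all six of them. It is a hypothetical sample, not taken from a real site: the titles are placeholders and the DOCTYPE is left out for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sample document; all titles are placeholders. -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Heading hierarchy sample</title></head>
  <body>
    <h1>Document title</h1>       <!-- rule 1: the only h1, and the first heading -->
    <h2>First section</h2>        <!-- rule 2: h1 is followed by h2 -->
    <h3>Subsection</h3>           <!-- rule 3: h2 may be followed by h2 or h3 -->
    <h3>Another subsection</h3>   <!-- rule 4: h3 may be followed by h2, h3 or h4 -->
    <h2>Second section</h2>       <!-- moving back up to h2 is always allowed -->
    <h3>Subsection</h3>
    <h4>Detail</h4>
  </body>
</html>
```

Changing the first h3 to h4 would break rule 3, because an h2 would then be followed directly by an h4.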

4. Generic XPath in XSLT

Testing whether a sequence of XHTML headings is correct, that is, whether it actually provides an implicit hierarchical structure, is not one of the easier tasks in XPath 1.0. It is much easier in XPath 2.0, which supports sequences and regular expressions.

With the help of members from the XSL-mailing list, we have this beautiful generic XSLT stylesheet:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xsl:variable name="h" select="//xhtml:*[matches(local-name(),'^h[1-6]')]/number(substring(local-name(),2))"/>
  <xsl:template match="/">
    <xsl:if test="not($h[1]=1 and count($h[.=1])=1)">h1 not right</xsl:if>
    <xsl:for-each select="1 to count($h)-1">
      <xsl:if test="$h[current()+1]-$h[current()] gt 1">section head jumped by more than one level</xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

We can not use xsl:for-each in Schematron and we would like more exact error messages, so we will need to modify the XSLT stylesheet above. But let us first make sure that we understand the code.

The first step is to strip the "h" from the names of the heading elements (h1-h6) and store the remaining numbers (1-6) in a variable as a sequence: "1, 2, 2, 3, 4, 2", etc. This is what the "h" variable does: it finds all heading elements (h1-h6) with //xhtml:*[matches(local-name(),'^h[1-6]')] and takes the last character of each name with substring(local-name(),2). The number() function converts that character to a number.
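To see what the "h" variable actually holds for a given document, a throwaway stylesheet along these lines can print it. This is only a debugging sketch that reuses the expression from the stylesheet above; the text output method and the separator are choices made for this illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <!-- Same expression as the "h" variable in the generic stylesheet -->
  <xsl:variable name="h" select="//xhtml:*[matches(local-name(),'^h[1-6]')]/number(substring(local-name(),2))"/>
  <xsl:template match="/">
    <!-- In XSLT 2.0, xsl:value-of writes every item of a sequence, joined by the separator -->
    <xsl:value-of select="$h" separator=", "/>
  </xsl:template>
</xsl:stylesheet>
```

For a document whose headings are h1, h2, h2, h3, in that order, the output is "1, 2, 2, 3".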

The next step is to make sure that h1 is the first heading, $h[1]=1, and that we only have one h1, count($h[.=1])=1. The not() function negates the combined test, so the error is raised as soon as one of the two conditions fails.

The final step is a for-each that runs once for every item in the sequence except the last, 1 to count($h)-1. The test inside the for-each simply compares each item in the sequence with the next item. If the difference is greater than 1, a heading level has been skipped and an error is raised. "$h" is our variable with the sequence, and current() is in this case the position of the item being processed.
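The pairwise comparison can be made visible by printing the difference between each heading level and the next one. The sketch below is a debugging aid only, not part of the solution: it uses an XPath 2.0 "for" expression, and the variable name "i" is chosen for this illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <xsl:variable name="h" select="//xhtml:*[matches(local-name(),'^h[1-6]')]/number(substring(local-name(),2))"/>
  <xsl:template match="/">
    <!-- One difference per adjacent pair: next level minus current level -->
    <xsl:value-of select="for $i in 1 to count($h) - 1 return $h[$i + 1] - $h[$i]" separator=", "/>
  </xsl:template>
</xsl:stylesheet>
```

For the heading sequence 1, 2, 2, 3, 4, 2 the output is "1, 0, 1, 1, -2". No difference is greater than 1, so no level has been skipped. Negative values mean that we move back up in the hierarchy, which is always allowed.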

4.1 The "xhtml" namespace prefix

In XSLT and also in our Schematron schema we use "xhtml" as prefix for XHTML, xmlns:xhtml="http://www.w3.org/1999/xhtml". Our XHTML instance document is not using a prefix but a default namespace. Default namespaces are tricky. The easiest way to get to markup in such a namespace is by using a prefix. There is none so we must invent one, and it is more or less a tradition to use "xhtml". See my article Transform XHTML to XHTML with XSLT.
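The effect of the default namespace can be checked with a small test stylesheet. This is a minimal sketch, assuming an input document that declares xmlns="http://www.w3.org/1999/xhtml" on its html element and contains exactly one h1.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Unprefixed name: looks for h1 in no namespace, so nothing is found in XHTML -->
    <xsl:text>//h1: </xsl:text>
    <xsl:value-of select="count(//h1)"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Prefixed name: the prefix is bound to the XHTML namespace, so the h1 is found -->
    <xsl:text>//xhtml:h1: </xsl:text>
    <xsl:value-of select="count(//xhtml:h1)"/>
  </xsl:template>
</xsl:stylesheet>
```

With such an input, the first count is 0 and the second is 1. The prefix used in the stylesheet does not have to appear in the instance document; only the namespace URI must be the same.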

5. Modified XPath in XSLT

We are now going to take our generic "XPath in XSLT solution" from above and modify it to get rid of for-each which we can not use in Schematron, and we will also add exact error messages. All this is done to prepare for our XPath to be transferred to Schematron.

Basically we make a template for each heading level (h1-h4). Templates for h5 and h6 are not necessary since they can be followed by any of h2-h6. We keep the "h" variable at global level, but in order to get to the right item in the sequence we make an "a" variable in each template. The "a" variable simply counts how many headings there are in front of the heading being processed and adds "1" to get the position of the heading being processed by the template.

First we list the modified XPath in XSLT solution. Next we make sure that we understand it by taking a closer look at some of the templates.

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml" exclude-result-prefixes="xhtml">
<xsl:output indent="yes"/>

<xsl:variable name="h" select="//xhtml:*[matches(local-name(),'^h[1-6]')]/number(substring(local-name(),2))"/>

<xsl:template match="/">
<test>
  <xsl:apply-templates select="//xhtml:h1|//xhtml:h2|//xhtml:h3|//xhtml:h4"/>
</test>
</xsl:template>

<xsl:template match="//xhtml:h1[not(preceding::xhtml:h1)]">
<xsl:variable name="a" select="count(preceding::xhtml:*[matches(local-name(),'^h[1-6]')]) + 1"/>
  <xsl:if test="not($h[1]=1)"><error>h1 must be the first heading.</error></xsl:if>
  <xsl:if test="not(count($h[.=1])=1)"><error>There can only be one h1.</error></xsl:if>
  <xsl:if test="$h[$a + 1]-$h[$a] gt 1"><error>The first heading after h1 can not be h<xsl:value-of select="$h[$a + 1]"/>. Only h2 is allowed.</error></xsl:if>
</xsl:template>

<xsl:template match="//xhtml:h2">
<xsl:variable name="a" select="count(preceding::xhtml:*[matches(local-name(),'^h[1-6]')]) + 1"/>
<xsl:if test="$h[$a + 1]-$h[$a] gt 1">
  <error>The first heading after h2(<xsl:value-of select="count(preceding::xhtml:h2) + 1" />) can not be h<xsl:value-of select="$h[$a + 1]"/>. Only h2 or h3 is allowed.</error>
</xsl:if>
</xsl:template>

<xsl:template match="//xhtml:h3">
<xsl:variable name="a" select="count(preceding::xhtml:*[matches(local-name(),'^h[1-6]')]) + 1"/>
<xsl:if test="$h[$a + 1]-$h[$a] gt 1">
  <error>The first heading after h3(<xsl:value-of select="count(preceding::xhtml:h3) + 1" />) can not be h<xsl:value-of select="$h[$a + 1]"/>. Only h2 or h3 or h4 is allowed.</error>
</xsl:if>
</xsl:template>

<xsl:template match="//xhtml:h4">
<xsl:variable name="a" select="count(preceding::xhtml:*[matches(local-name(),'^h[1-6]')]) + 1"/>
<xsl:if test="$h[$a + 1]-$h[$a] gt 1">
  <error>The first heading after h4(<xsl:value-of select="count(preceding::xhtml:h4) + 1" />) can not be h6. Only h2 or h3 or h4 or h5 is allowed.</error>
</xsl:if>
</xsl:template>

</xsl:stylesheet>

5.1 Templates for h1-h4

The template for h1 is different from the templates for h2-h4 because we can only have one h1, and it must be the first heading. To make sure that we only get one set of error messages if there are two or more h1, the match attribute of the template is set to find only the first h1: //xhtml:h1[not(preceding::xhtml:h1)]. That is, find any h1 element but only if there is no h1 element before it.

The "a" variable counts how many headings we have before the heading we are processing and adds "1" to that count. The result is the position of the current heading in the sequence held by the "h" variable. In the h1 template it is used in the third if statement.
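The relation between "a" and "h" can be checked with a small listing stylesheet. The sketch below is for illustration only: it uses xsl:for-each, which is fine in plain XSLT even though it can not be carried over to Schematron, and it prints one line per heading.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <xsl:variable name="h" select="//xhtml:*[matches(local-name(),'^h[1-6]')]/number(substring(local-name(),2))"/>
  <xsl:template match="/">
    <xsl:for-each select="//xhtml:*[matches(local-name(),'^h[1-6]')]">
      <!-- Same "a" variable as in the templates: headings before this one, plus 1 -->
      <xsl:variable name="a" select="count(preceding::xhtml:*[matches(local-name(),'^h[1-6]')]) + 1"/>
      <!-- $h[$a] should always be the level of the heading being processed -->
      <xsl:value-of select="concat(local-name(), ': a=', $a, ', $h[$a]=', $h[$a], '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For the headings h1, h2, h2, h3 the second h2 is listed as "h2: a=3, $h[$a]=2": two headings precede it, so it is item 3 in the sequence, and $h[$a + 1] is the level of the heading that follows it.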

To get more exact error messages for the h2-h4 heading being processed, the expression <xsl:value-of select="count(preceding::xhtml:h2) + 1" /> has been added in parentheses. It counts all the headings of the same level, in this case h2, before the one processed and adds "1" to that number to get the count for the heading processed. Normally we would have used the xsl:number element to calculate the count, but xsl:number can not be transferred to Schematron.

6. XPath testing in Schematron

When we transfer our "Modified XPath in XSLT solution" to Schematron we must remember to negate our testing. The xsl:if element in XSLT emits its content when the test is true. The sch:assert element in Schematron emits its message when the test is not true, so each test must state what we expect of a correct document.
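The difference is easiest to see side by side. The sketch below is a cut-down schema for illustration only. It also uses sch:report, the Schematron element that, like xsl:if, emits its message when the test is true. The wording of the second message is made up for this example.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns uri="http://www.w3.org/1999/xhtml" prefix="xhtml"/>
  <sch:let name="h" value="//xhtml:*[matches(local-name(),'^h[1-6]')]/number(substring(local-name(),2))"/>
  <sch:pattern>
    <sch:rule context="xhtml:body">
      <!-- XSLT original: xsl:if test="not($h[1]=1)", message emitted when the test is true -->
      <!-- assert: message emitted when the test is false, so the not() is dropped -->
      <sch:assert test="$h[1]=1">h1 must be the first heading.</sch:assert>
      <!-- report: message emitted when the test is true, so the not() is kept -->
      <sch:report test="not($h[1]=1)">h1 is not the first heading.</sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Both elements fire for exactly the same documents. The schema in this article uses sch:assert throughout, so every test describes the correct document.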

<?xml version="1.0"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
<sch:ns uri="http://www.w3.org/1999/xhtml" prefix="xhtml"/>
<sch:let name="h" value="//xhtml:*[matches(local-name(),'^h[1-6]')]/number(substring(local-name(),2))"/>
<sch:pattern>
  <sch:title>Testing h1</sch:title>
  <sch:rule context="xhtml:body">
    <sch:assert test="$h[1]=1">h1 must be the first heading.</sch:assert>
    <sch:assert test="count($h[.=1])=1">There can only be one h1.</sch:assert>
  </sch:rule>
  <sch:rule context="xhtml:h1[not(preceding::xhtml:h1)]">
    <sch:let name="a" value="count(preceding::xhtml:*[matches(local-name(),'^h[1-6]')]) + 1"/>
    <sch:assert test="not($h[$a + 1]-$h[$a] gt 1)">The first heading after h1 can not be h<sch:value-of select="$h[$a + 1]"/>. Only h2 is allowed.</sch:assert>
  </sch:rule>
</sch:pattern>

<sch:pattern>
  <sch:title>Testing h2</sch:title>
  <sch:rule context="xhtml:h2">
    <sch:let name="a" value="count(preceding::xhtml:*[matches(local-name(),'^h[1-6]')]) + 1"/>
    <sch:assert test="not($h[$a + 1]-$h[$a] gt 1)">The first heading after h2(<sch:value-of select="count(preceding::xhtml:h2) + 1" />) can not be h<sch:value-of select="$h[$a + 1]"/>. Only h2 or h3 is allowed.</sch:assert>
  </sch:rule>
</sch:pattern>

<sch:pattern>
  <sch:title>Testing h3</sch:title>
  <sch:rule context="xhtml:h3">
    <sch:let name="a" value="count(preceding::xhtml:*[matches(local-name(),'^h[1-6]')]) + 1"/>
    <sch:assert test="not($h[$a + 1]-$h[$a] gt 1)">The first heading after h3(<sch:value-of select="count(preceding::xhtml:h3) + 1" />) can not be h<sch:value-of select="$h[$a + 1]"/>. Only h2 or h3 or h4 is allowed.</sch:assert>
  </sch:rule>
</sch:pattern>

<sch:pattern>
  <sch:title>Testing h4</sch:title>
  <sch:rule context="xhtml:h4">
    <sch:let name="a" value="count(preceding::xhtml:*[matches(local-name(),'^h[1-6]')]) + 1"/>
    <sch:assert test="not($h[$a + 1]-$h[$a] gt 1)">The first heading after h4(<sch:value-of select="count(preceding::xhtml:h4) + 1" />) can not be h<sch:value-of select="$h[$a + 1]"/>. Only h2 or h3 or h4 or h5 is allowed.</sch:assert>
  </sch:rule>
</sch:pattern>

</sch:schema>

7. Using our Schematron schema

We can now set up validation for our XHTML documents. First we validate them against their DTD or their other grammar-based XML schemas, and next against the Schematron schema.
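A quick way to try the schema is to run it against a document that breaks the rules on purpose. The following instance is a hypothetical test case with placeholder titles and without a DOCTYPE. Its heading sequence is h1, h3, h2, h4.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical test document that skips heading levels on purpose. -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Broken hierarchy sample</title></head>
  <body>
    <h1>Document title</h1>
    <h3>Subsection without a section</h3>   <!-- h1 followed by h3: h2 is skipped -->
    <h2>Section</h2>                         <!-- h3 followed by h2: allowed -->
    <h4>Detail without a subsection</h4>     <!-- h2 followed by h4: h3 is skipped -->
  </body>
</html>
```

The two tests in the xhtml:body rule pass, since there is exactly one h1 and it comes first. The failed assertions should be "The first heading after h1 can not be h3. Only h2 is allowed." and "The first heading after h2(1) can not be h4. Only h2 or h3 is allowed." How the messages are presented depends on the Schematron implementation used.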

At the moment our Schematron schema only validates the implicit hierarchical structure. We could also add tests for minimum and maximum count of paragraphs in a section, minimum and maximum count of characters in each paragraph and overall for a document, test for desired level of readability, etc.
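As a rough idea of what such additional tests could look like, here is a sketch with one pattern for paragraph length and one for paragraph count. The limits, the variable names "max-chars" and "min-paragraphs", and the message texts are assumptions made for this illustration. The count is taken for the whole document, not per section.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns uri="http://www.w3.org/1999/xhtml" prefix="xhtml"/>
  <!-- Hypothetical limits, chosen only for illustration -->
  <sch:let name="max-chars" value="1000"/>
  <sch:let name="min-paragraphs" value="1"/>
  <sch:pattern>
    <sch:title>Testing paragraph length</sch:title>
    <sch:rule context="xhtml:p">
      <!-- normalize-space() keeps indentation and line breaks out of the count -->
      <sch:assert test="string-length(normalize-space(.)) le $max-chars">A paragraph has <sch:value-of select="string-length(normalize-space(.))"/> characters. Only <sch:value-of select="$max-chars"/> are allowed.</sch:assert>
    </sch:rule>
  </sch:pattern>
  <sch:pattern>
    <sch:title>Testing paragraph count</sch:title>
    <sch:rule context="xhtml:body">
      <sch:assert test="count(.//xhtml:p) ge $min-paragraphs">The document must have at least <sch:value-of select="$min-paragraphs"/> paragraph(s).</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

The patterns could be added to the schema above or kept in a separate schema, so that structural tests and editorial tests can be run independently.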

Footnotes

[1]

Even the now almost forgotten ISO version of HTML, which explicitly says in its DTD that an HTML document should have an implicit hierarchy, has no way of validating such a hierarchy:

"The <H1> element shall not be followed by an <H3>, <H4>, <H5> or <H6> element without an intervening <H2> element. The <H2> element shall not be followed by an <H4>, <H5> or <H6> element without an intervening <H3> element. The <H3> element shall not be followed by an <H5> or <H6> element without an intervening <H4> element. The <H4> element shall not be followed by an <H6> element without an intervening <H5> element. An <H2> element shall be preceded by an <H1> element. An <H3> element shall be preceded by an <H2> element. An <H4> element shall be preceded by an <H3> element. An <H5> element shall be preceded by an <H4> element. An <H6> element shall be preceded by an <H5> element."

[2]

Both DTD and XML Schema can easily validate implicit hierarchical structures for a subset of simple markup. All we need is to declare a choice between sequences. But in XHTML things get too complex too fast. Even in sensible XHTML the heading elements can be children of body, div and td. It is easy to declare that h1 can only exist once as a child of body, but we run into problems when we want to declare that h1 can not be a child of div or of td if it is already a child of body.

In less sensible but still valid XHTML things get even worse. Heading elements can also be children of e.g. span and p elements if we make the heading elements inline with CSS. I actually do that for h1 in some very simple XHTML documents. If a bread-crumb trail is the only navigation, I like to integrate the h1 heading element into the bread-crumb trail for simplicity.

[3]

As long as we only have one h1 and it is the first heading, we don't need to test what follows h5 and h6. They can be followed by any other heading element except h1.

[4]

In the HTML 4.01 spec we have this wonderful formulation:

"Some people consider skipping heading levels to be bad practice. They accept H1 H2 H1 while they do not accept H1 H3 H1 since the heading level H2 is skipped."

No! Some people do not accept H1 H2 H1 breaking the implicit hierarchy. No! We are not talking about "bad practice". The question is simply: Do we want an implicit hierarchical structure that can be used for something, or should heading levels just be a quick and dirty way to implement a certain font-size?

Even in the ISO version of HTML, as we can see from above, the use of more than one h1 is not explicitly disallowed. It was simply taken for granted, I guess, that a hierarchy must start with only one element, just like an XML document can only have one top element.

Updated 2009-08-06