XML Matcher Reference Guide

This document contains a detailed description of each matching strategy. If you need a brief introduction to XML Matcher, read this tutorial.

Table of Contents

Text values matching
Structure matching
Matching XML structures with regular expressions
Advanced matching
Internal design

How Matcher works

The comparison is defined in terms of element tag names, element text values, and attribute names and values. The implementation has one limitation: mixed content elements are not supported. That is, each element can be compared using either its contained text (e.g. "Joe" in <name>Joe</name>) or its child elements (e.g. <name><first>...</first><last>...</last></name>), but not both.

Matcher descends the XML tree. On each step it matches a single element from the template with one or more actual elements. It delegates the actual matching to a chain of specialized matching strategies. For example, one strategy compares the text values of elements while another matches the elements' children.

Strategies are selected based on template annotations. For example, when a template element has the attribute "xm:regex-text", Matcher selects the strategy that matches the template text as a regular expression.

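Each strategy performs a shallow comparison of the template with the actual document; that is, it delegates decisions about nested elements back to the Matcher. Strategy selection happens on each level, so the strategy chosen for a parent element does not influence the strategy chosen for its children.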

Conventions used in this document


  To save space, the XML Matcher namespace declaration is omitted from all examples in this guide.


All Matcher-specific elements and attributes are prefixed by 'xm' and belong to the namespace http://xml.sf.net/xmlmatcher/1.0. Do not forget to use the namespace declaration in your templates:
<root xmlns:xm='http://xml.sf.net/xmlmatcher/1.0'>
...
</root>
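
As an illustration of why the declaration matters, the following minimal Python sketch (using the standard xml.etree.ElementTree module, which is not part of XML Matcher) shows how a parser sees an 'xm'-prefixed attribute once the namespace is declared. The <name> element and its attribute value are only sample data.

```python
import xml.etree.ElementTree as ET

XM_NS = "http://xml.sf.net/xmlmatcher/1.0"

template = ET.fromstring(
    "<root xmlns:xm='http://xml.sf.net/xmlmatcher/1.0'>"
    "<name xm:ignorecase='true'>John Doe</name>"
    "</root>"
)

name = template.find("name")
# ElementTree exposes namespaced attributes as '{namespace-uri}local-name'.
ignorecase = name.get("{%s}ignorecase" % XM_NS)
print(ignorecase)  # prints: true
```

Without the xmlns:xm declaration the same template would not be namespace-well-formed, and a namespace-aware parser would reject it.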

Text matching

This section describes various strategies for matching the text values of XML elements (character data).

Matching texts for equality

By default XML Matcher requires text values to be exactly the same:

Template:
<name>John Doe</name>


Will match with the same text value after normalization:
<name>John Doe</name>
Will not match text that has different character case:
<name>jOHN dOE</name>
or extra spaces
<name> John Doe
</name>
Template specifying empty element:
<name/>
Will match with another form of empty element:
<name></name>
Empty element will not match with text that contains some space characters:
<name> 
</name>
or
<name> </name>
Template:
<name xm:ignorecase='true' xm:trim='true'>
John doe
</name>
Will match using case-insensitive matching, with leading and trailing space characters removed:
<name> jOHN dOE </name>
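
The rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the Matcher's implementation: the function name and its parameters are hypothetical, and the sketch assumes that xm:trim strips leading and trailing whitespace from both values.

```python
def text_matches(template, actual, ignorecase=False, trim=False):
    """Compare two text values the way the examples above describe."""
    if trim:
        # Assumption: trimming is applied to both template and actual text.
        template, actual = template.strip(), actual.strip()
    if ignorecase:
        template, actual = template.lower(), actual.lower()
    return template == actual


print(text_matches("John Doe", "John Doe"))    # True
print(text_matches("John Doe", "jOHN dOE"))    # False
print(text_matches("", " "))                   # False
print(text_matches("\nJohn doe\n", " jOHN dOE ", ignorecase=True, trim=True))  # True
```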

Matching text with wildcard pattern

This strategy allows the definition of a wildcard-based pattern in the template that will be matched with the actual value. Notice that the attribute xm:wild on the template element instructs XML Matcher to use the wildcard matching strategy (instead of the default text equality strategy).


Template:
<street xm:wild="true">Great*St*</street>
Will match:
<street>Great Plain St</street>
Will not match:
<street>Great Ridge Ave</street>
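
A rough Python equivalent of this example, for illustration only. It assumes that '*' stands for any run of characters and that matching is case-sensitive; Python's fnmatch also understands '?' and '[...]', which XML Matcher may or may not support.

```python
from fnmatch import fnmatchcase


def wildcard_matches(pattern, actual):
    """Case-sensitive glob-style match of the whole actual value."""
    return fnmatchcase(actual, pattern)


print(wildcard_matches("Great*St*", "Great Plain St"))   # True
print(wildcard_matches("Great*St*", "Great Ridge Ave"))  # False
```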

Matching text with regular expression pattern

This strategy allows the definition of a regex-based pattern in template that will be matched with actual value. 

Template:
<street xm:regex-text="true">.*30.*</street>
Will match:
<street>Route 30</street>
Will not match:
<street>31 Washington St</street>
Note: this strategy is applicable for text node matching. There is a similarly named strategy for matching XML structure.

Wildcard matching, described in the previous section, is similar and in many cases provides a simpler and faster alternative.
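
The regular expression example can be reproduced with a short Python sketch. It assumes that the pattern must match the whole text value; whether XML Matcher uses whole-value or substring matching is not stated here, and for the pattern above both give the same result.

```python
import re


def regex_text_matches(pattern, actual):
    """Return True when the pattern matches the entire actual value."""
    return re.fullmatch(pattern, actual) is not None


print(regex_text_matches(".*30.*", "Route 30"))          # True
print(regex_text_matches(".*30.*", "31 Washington St"))  # False
```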

Matching numbers

This strategy compares two elements as numbers of type double.

The optional attribute xm:tolerance specifies the allowable difference between the actual and template values (default is 0).

Template:
<x xm:tolerance='0.01'>-72.98</x>
Will match:
<x>-72.9873170132</x>
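
The comparison amounts to a check on the absolute difference, as in this Python sketch. The function name is hypothetical and the boundary case (difference exactly equal to the tolerance) is assumed to match.

```python
def numbers_match(template_text, actual_text, tolerance=0.0):
    """Compare two text values as floating point numbers."""
    difference = abs(float(actual_text) - float(template_text))
    return difference <= tolerance


print(numbers_match("-72.98", "-72.9873170132", tolerance=0.01))  # True
print(numbers_match("-72.98", "-72.9873170132"))                  # False
```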

Matching angle values

This strategy extends the number matching strategy above: it compares two elements as angles with a given tolerance, taking the period given by xm:period into account.
The template defines that the turn angle should be equal to 10 plus or minus 15:
<turnAngle xm:tolerance='15.0' xm:period='360'>10</turnAngle>
The following actual element will match, since with a period of 360 the value 359 is equivalent to -1, which lies within -5...+25:
<turnAngle>359</turnAngle>
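
A minimal Python sketch of a periodic comparison, assuming the difference is measured along the shorter way around the circle. The function name is hypothetical.

```python
def angles_match(template, actual, tolerance, period=360.0):
    """Compare two angles, treating values that differ by a full period as equal."""
    difference = abs(actual - template) % period
    # Take the shorter way around the circle.
    difference = min(difference, period - difference)
    return difference <= tolerance


print(angles_match(10, 359, tolerance=15))  # True: 359 is 11 away from 10
print(angles_match(10, 30, tolerance=15))   # False: 20 exceeds the tolerance
```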

Matching time-of-day values

This strategy compares two time-of-day values given as text (in HH:MM:SS format, see below).

Please note that this strategy uses custom thread-safe parser which does have I18N support.

Time of day string format: H[H][:M[M][:S[S]]] [am|pm|AM|PM]

For example:

5 same as 05:00:00
0:5 same as 00:05:00
5 pm same as 17:00:00

Template:
<start xm:time-tolerance='0:01'>10:00 am</start>
Will match:
<start>10:00:56</start>
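
The format above can be parsed with a small regular expression. The following Python sketch is an illustration, not the Matcher's own parser: the function names are hypothetical, "12 am" is assumed to mean midnight, the tolerance is assumed to use the same format as the values, and wrap-around at midnight is not handled.

```python
import re

# H[H][:M[M][:S[S]]] followed by an optional am/pm suffix.
_TIME_OF_DAY = re.compile(
    r"\s*(\d{1,2})(?::(\d{1,2})(?::(\d{1,2}))?)?\s*(am|pm|AM|PM)?\s*"
)


def parse_time_of_day(text):
    """Return the number of seconds since midnight."""
    match = _TIME_OF_DAY.fullmatch(text)
    if match is None:
        raise ValueError("not a time of day: %r" % text)
    hours = int(match.group(1))
    minutes = int(match.group(2) or 0)
    seconds = int(match.group(3) or 0)
    suffix = (match.group(4) or "").lower()
    if suffix == "pm" and hours < 12:
        hours += 12
    elif suffix == "am" and hours == 12:
        hours = 0  # assumption: "12 am" means midnight
    return hours * 3600 + minutes * 60 + seconds


def times_match(template, actual, tolerance):
    """Compare two time-of-day strings; the tolerance uses the same format."""
    difference = abs(parse_time_of_day(actual) - parse_time_of_day(template))
    return difference <= parse_time_of_day(tolerance)


print(times_match("10:00 am", "10:00:56", "0:01"))  # True: 56 s is within 1 min
```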

Matching complex elements

Strategies described in this section compare DOM trees. TODO: one level.

Matching set of elements

This strategy verifies that the actual element has a matching set of children.
TODO: we have the ability to accept more elements in the actual document; we need a symmetric feature to accept fewer.


Exact element match

In the simple case, a template element matches an instance element when both have the same XML tag name and a matching sequence of child elements.

Template:
<street>16 Tech Circle</street>
Will match with instance:
<street>16 Tech Circle</street>
Will not match with different element tag name
(note that tag names are case-sensitive):
<Street>16 Tech Circle</Street>
Will not match with instance containing different text value:
<street>120 Oak St</street>
Will not match with instance without text value:
<street/>

Template:
<address>
    <street>16 Tech Circle</street>
    <city>Natick</city>
    <state>MA</state>
</address>
Will match with instance:
<address>
<!-- comments are ignored -->
    <street>16 Tech Circle</street>
    <city>Natick</city>
    <state>MA</state>
</address>
Will not match with instance containing extra element:
<address>
    <street>16 Tech Circle</street>
    <city>Natick</city>
    <county>Middlesex</county>
    <state>MA</state>
</address>
Will not match with instance that has missing element (<city>):
<address>
    <street>16 Tech Circle</street>
    <state>MA</state>
</address>
Will not match with instance that has different order of elements:
<address>
    <state>MA</state>
    <city>Natick</city>
    <street>16 Tech Circle</street>
</address>
Will not match with instance that has extra text element
(mixed content case):
<address>  Some Unexpected text
    <street>16 Tech Circle</street>
    <city>Natick</city>
    <state>MA</state>
</address>
Will not match with instance that has non-matching child
(here child has different text value):
<address>
    <street>120 Oak St</street>
    <city>Natick</city>
    <state>MA</state>
</address>
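
The exact-match rules illustrated above can be summarized by a short recursive function. This Python sketch uses xml.etree.ElementTree and is only an approximation under stated assumptions: attributes are not compared, text is compared without normalization, and the function names are hypothetical. ElementTree's default parser already drops comments.

```python
import xml.etree.ElementTree as ET


def _has_stray_text(element):
    """True when non-whitespace text appears next to child elements."""
    texts = [element.text] + [child.tail for child in element]
    return any(text and text.strip() for text in texts)


def exact_match(template, actual):
    if template.tag != actual.tag:
        return False
    t_children, a_children = list(template), list(actual)
    if not t_children and not a_children:
        # Text-only (or empty) elements: compare the text values.
        return (template.text or "") == (actual.text or "")
    if _has_stray_text(actual):
        return False  # mixed content is not supported
    if len(t_children) != len(a_children):
        return False  # extra or missing element
    # Children must match pairwise, in the same order.
    return all(exact_match(t, a) for t, a in zip(t_children, a_children))


template = ET.fromstring(
    "<address><street>16 Tech Circle</street>"
    "<city>Natick</city><state>MA</state></address>"
)
same = ET.fromstring(
    "<address><!-- comments are ignored --><street>16 Tech Circle</street>"
    "<city>Natick</city><state>MA</state></address>"
)
reordered = ET.fromstring(
    "<address><state>MA</state><city>Natick</city>"
    "<street>16 Tech Circle</street></address>"
)
print(exact_match(template, same))       # True
print(exact_match(template, reordered))  # False
```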


Wild cards

If you want to specify a single element with any tag name and any content, use the special <xm:any> element.

Note: an <xm:any> element in the template document may not have any sub-elements, but it can match actual elements with or without sub-elements.

Template:
<address xm:regex-dom="true">
    <xm:any/>
</address>
Will match with instance of <address> that has any single element as a content:
<address>
    <street>16 Tech Circle</street>
</address>
Note that <xm:any> can also match an element that has child elements:
<address>
    <zip>
        <base>01760</base>
        <ext>1029</ext>
    </zip>
</address>
Will not match with instance of <address> that has empty content:
<address/>
Will not match with more than one element
(see the xm:maxOccurs attribute description below on how to match the same template element multiple times):
<address>
    <street>16 Tech Circle</street>
    <city>Natick</city>
</address>


Repetitions

If you want to specify the multiplicity of an element, use the optional xm:minOccurs and xm:maxOccurs attributes. When left unspecified, both default to 1. Use the special value "unbounded" for xm:maxOccurs to remove the upper limit; combined with xm:minOccurs="0" this gives a "zero or many" type of occurrence. The value of xm:minOccurs must be less than or equal to xm:maxOccurs.

These attributes can be defined on any element, including elements from the xm namespace (any, group, choice, except-any-of).

Note: current version only supports the following values: 0, 1, unbounded.

Template:
<address xm:regex-dom="true">
    <street xm:minOccurs="0">16 Tech Circle</street>
</address>
Will match an instance with or without a <street> child:
<address>
    <street>16 Tech Circle</street>
</address>
or
<address/>
Will not match more than one occurrence:
<address>
    <street>16 Tech Circle</street>
    <street>16 Tech Circle</street>
</address>
You can also specify occurrences on the <xm:any> element.
Template:
<address xm:regex-dom="true">
    <xm:any xm:minOccurs="0" xm:maxOccurs="unbounded"/>
    <state>MA</state>
    <xm:any xm:minOccurs="0" xm:maxOccurs="unbounded"/>
</address>
Will match any instance that contains the same <state> element:
<address>
    <street>16 Tech Circle</street>
    <city>Natick</city>
    <state>MA</state>
</address>
or
<address>
    <state>MA</state>
</address>
or
<address>
    <city>Natick</city>
    <state>MA</state>
    <zip>01760</zip>
</address>

Will not match with any instance missing a <state> element:
<address>
    <street>16 Tech Circle</street>
    <city>Natick</city>
</address>
Will not match with instance containing different value of <state> element:
<address>
    <state>NH</state>
</address>
Will match multiple occurrences of <state>:
<address>
    <street>16 Tech Circle</street>
    <city>Natick</city>
    <state>MA</state>
    <street>16 Tech Circle</street>
    <city>Natick</city>
    <state>MA</state>
</address>
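
The occurrence attributes behave like regular expression quantifiers applied to the sequence of child elements. The following Python sketch makes the analogy concrete by translating a template into a regular expression over child tag names. It is an illustration only: it compares tag names but not text values, and the function names and the (tag, min, max) tuple layout are hypothetical.

```python
import re


def occurrence_regex(items):
    """Build a regex from (tag, min_occurs, max_occurs) tuples.

    A tag of None stands for <xm:any/>; a max_occurs of None means "unbounded".
    """
    parts = []
    for tag, lower, upper in items:
        # Every child tag is encoded as the tag name followed by one space.
        token = r"[^ ]+ " if tag is None else re.escape(tag) + " "
        upper_text = "" if upper is None else str(upper)
        parts.append("(?:%s){%d,%s}" % (token, lower, upper_text))
    return re.compile("".join(parts))


def children_match(items, child_tags):
    encoded = "".join(tag + " " for tag in child_tags)
    return occurrence_regex(items).fullmatch(encoded) is not None


# <xm:any minOccurs=0 maxOccurs=unbounded/> <state/> <xm:any minOccurs=0 maxOccurs=unbounded/>
template = [(None, 0, None), ("state", 1, 1), (None, 0, None)]

print(children_match(template, ["street", "city", "state"]))  # True
print(children_match(template, ["state"]))                    # True
print(children_match(template, ["street", "city"]))           # False
```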


Sequence (group)

Sequence allows applying the same multiplicity to an ordered set of elements.

Template:
<address xm:regex-dom="true">
    <xm:group xm:minOccurs="0">
        <street>16 Tech Circle</street>
        <city>Natick</city>
        <state>MA</state>
    </xm:group>
</address>
Will match with instance when all elements of the sequence appear exactly once, in the order they are defined in the template:
<address>
    <street>16 Tech Circle</street>
    <city>Natick</city>
    <state>MA</state>
</address>
or when entire group of elements is missing (minOccurs is 0):
<address/>
Will not match when elements appear in different order:
<address>
    <street>16 Tech Circle</street>
    <state>MA</state>
    <city>Natick</city>
</address>
Will not match when one element from the group is missing:
<address>
    <street>16 Tech Circle</street>
<!-- state>MA</state -->
    <city>Natick</city>
</address>



Choice

Choice provides a set of alternatives for matching. Matching succeeds if at least one alternative matches.

In addition to simple elements the following is allowed as choice alternatives:


Template:
<xm:choice>
    <nickname/>
    <xm:group>
        <first/>
        <last/>
    </xm:group>
</xm:choice>

Will match either a <nickname/> element or the pair of elements <first/><last/>.




Exception (Negation)

The reverse of choice: matches any single element that doesn't match any of the alternatives specified inside <xm:except-any-of>.

Note: the present version of the matcher does not support nullable alternatives (e.g. an element with minOccurs=0) or alternatives that may be longer than one element (let me know if this support is required).

As a result of this rule, the xm:except-any-of element allows xm:choice inside. In that context the alternatives of xm:choice are simply combined with the alternatives of xm:except-any-of.

Template
<xm:except-any-of>
<red/>
<green/>
</xm:except-any-of>
Will match any single element, except element with tagname "red" or "green".
Template:
<xm:except-any-of xm:maxOccurs="unbounded">
<red/>
<green/>
</xm:except-any-of>
Will match any number of elements, each can be anything except simple element with tagname "red" or "green".
Template:
<xm:except-any-of xm:minOccurs='0' xm:maxOccurs='unbounded'>
    <left/>
</xm:except-any-of>
<xm:group xm:minOccurs='1'>
    <left/>
    <right/>
</xm:group>
<xm:except-any-of xm:minOccurs='0' xm:maxOccurs='unbounded'>
    <left/>
</xm:except-any-of>

Will match any sequence of elements that contains exactly one <left/> element, immediately followed by <right/>.


Advanced Matching

Javascript-based assertions

Processing instructions in the template document that have the javascript target are interpreted as JavaScript and may perform additional assertions. The JavaScript context is initialized with the current positions in the matched documents. There are several predefined functions that can navigate the template and actual documents using XPath. Here is the full list of predefined variables and functions:

JavaScript object name                    Description
out                                       java.lang.System.out
err                                       java.lang.System.err
a                                         Object of type org.w3c.dom.Element, initialized to the current element of the actual document.
t                                         Object of type org.w3c.dom.Element, initialized to the current element of the template document.
assert.pathExists(xpath)                  Verifies that the given XPath string selects at least one node in the actual document. The XPath context node is the current element.
assert.equals(xpath1, xpath2)             Verifies that the text values of the two nodes selected by the XPath strings in the actual document are equal. The XPath context node is the current element.
assert.equals(xpath1, xpath2, tolerance)  The same check, with an additional tolerance argument.
assert.isTrue(condition)                  Verifies that the given JavaScript condition is true.
assert.isFalse(condition)                 Verifies that the given JavaScript condition is false.
... what else do we need ? ...


Examples:

Template:
<?javascript assert.pathExists("step[street='Route 30']") ?>
Ensures that the actual document has an element that matches the XPath:
step[street='Route 30']
<steps>
...
<step>
<street>Route 30</street>
</step>
...
</steps>

Template:
<?javascript assert.equals("step[1]/street", "step[last()]/street") ?>
Ensures that the first and last step elements use the same street:
<steps>
<step>
<street>12 Main St</street>
</step>
...
<step>
 <street>12 Main St</street>
</step>
</steps>
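
To experiment with such path checks outside the Matcher, the two assertions can be approximated in Python. This is a sketch only: the helper names are hypothetical, and xml.etree.ElementTree supports just a subset of XPath (relative paths, child text predicates, positional predicates and last()), which is enough for the expressions used here.

```python
import xml.etree.ElementTree as ET


def path_exists(context, xpath):
    """True when the path selects at least one node below the context element."""
    return context.find(xpath) is not None


def texts_equal(context, xpath1, xpath2):
    """True when both paths select a node and the two text values are equal."""
    first, second = context.find(xpath1), context.find(xpath2)
    if first is None or second is None:
        return False
    return first.text == second.text


steps = ET.fromstring(
    "<steps>"
    "<step><street>12 Main St</street></step>"
    "<step><street>Route 30</street></step>"
    "<step><street>12 Main St</street></step>"
    "</steps>"
)

print(path_exists(steps, "step[street='Route 30']"))                 # True
print(texts_equal(steps, "step[1]/street", "step[last()]/street"))   # True
```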


Current implementation uses Mozilla Rhino, but can be switched to support other scripting languages.

Future plans:


Equality sets

Use this feature if you want to validate that two or more elements have similar values. Elements do not need to be declared on the same level, they can appear anywhere in your XML document.
In the following template street elements are compared using wildcard mask:
...
<street xm:wild="true" xm:equ="sameStreet">* Main St</street>
...
<street xm:wild="true" xm:equ="sameStreet">* Main St</street>
...
Will match when both elements match the '* Main St' wildcard and are identical to each other:
...
<street>120 Main St</street>
...
<street>120 Main St</street>
...
The following fragment will not match because two values are not identical (although both match their own template wildcards) :
...
<street>120 Main St</street>
...
<street>666 Main St</street>
...

The following template shows numeric equality with tolerance:
<?equ-tolerance sameCoordinates=0.00001 ?>
...
<x xm:equ="sameCoordinates">-72.123</x>
...
<x xm:equ="sameCoordinates" xm:tolerance="0.001">-72.123</x>
...
The following fragment will not match because the difference between two numbers in the same equality set exceeds the defined tolerance (although they are within their own tolerances):
...
<x>-72.12300000</x>
...
<x>-72.1239999</x>
...
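
The equality set check can be pictured as a second pass over all values that share the same xm:equ name. The Python sketch below is an assumption-laden illustration, not the Matcher's code: the function names are hypothetical, text values are required to be identical, and numeric values are considered equal when the spread between the smallest and largest value stays within the set tolerance.

```python
def text_set_holds(values):
    """All text values collected for one equality set must be identical."""
    return len(set(values)) <= 1


def numeric_set_holds(values, tolerance=0.0):
    """All numbers in one equality set must lie within the set tolerance."""
    numbers = [float(value) for value in values]
    return max(numbers) - min(numbers) <= tolerance


print(text_set_holds(["120 Main St", "120 Main St"]))  # True
print(text_set_holds(["120 Main St", "666 Main St"]))  # False
print(numeric_set_holds(["-72.12300000", "-72.1239999"], tolerance=0.00001))  # False
```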



Internal Design

Main

There is a front-end class, Main, that hides the details of Matcher configuration and can be used in most simple cases.

Matcher

Matcher recursively performs tête-à-tête matching of template and actual elements. There are several different matchers (a matcher that compares tag names, a matcher that compares attribute sets, a matcher that compares content, etc.). They are organized in a chain. Two elements match when every matcher in the chain accepts them.

Matching Strategy

StrategyBasedMatcher delegates the actual task of matching to instances of the MatchingStrategy interface; each instance may specialize in matching one kind of XML data. For example, there are strategies for matching text values of elements, number values, etc. Once the most appropriate strategy is selected, it is solely responsible for the matching result of the current elements.

Matching Strategy Selection

Default strategy selection is controlled by the class RegistryBasedStrategySelector, which contains a simple list of defined strategies. The order of appearance is the following:

Order  Strategy                              Accepts elements without children?  Accepts elements with children?  Selection criteria
1      FloatingPointNumbersMatchingStrategy  yes                                 no                               Presence of xm:tolerance attribute
2      RegExTextMatchingStrategy             yes                                 yes                              Presence of xm:regex-text='true' attribute value
3      WildcardMatchingStrategy              yes                                 yes                              Presence of xm:wild='true' attribute value
4      TimeOfDayMatchingStrategy             yes                                 no                               Presence of xm:time-tolerance attribute
5      ChildrenOkMatchingStrategy            no                                  no                               Presence of xm:children='ignore' attribute value
6      AngleMatchingStrategy                 yes                                 no                               Presence of xm:period attribute
7      EqualTextValueMatchingStrategy        yes                                 yes                              Default for text-only elements, otherwise presence of xm:ignorecase attribute
8      RegExElementsMatchingStrategy         no                                  yes                              Presence of xm:regex-dom='true' attribute value
9      ElementSequenceMatchingStrategy       yes                                 yes                              Presence of xm:children='sequence'
10     ElementSetMatchingStrategy            yes                                 yes                              Presence of xm:children='set'
11     ElementBagMatchingStrategy            yes                                 yes                              Presence of xm:children='bag', also default for complex elements






Regular expression constrained DOM structures


Regular expression construct  Description    XML analogue                                         Description
x                             single symbol  <x>...</x>                                           Matches a single element with tag name x.
.                             any symbol     <xm:any/>                                            Matches a single element with any tag name and any content.
x?  x+  x*  x{n,m}            repetition     <x xm:minOccurs='n' xm:maxOccurs='m'>...</x>         Matches the element repeatedly, from n to m times.
(xyz)                         group          <xm:group> <x/><y/><z/> </xm:group>                  Defines a group of elements.
(x|y|z)                       choice         <xm:choice> <x/><y/><z/> </xm:choice>                Defines matching alternatives.
[^xyz]                        negation       <xm:except-any-of> <x/><y/><z/> </xm:except-any-of>  Matches any single element that doesn't match any of the given alternatives.


NOTE: All examples in this section assume that the regular expression structure matcher is selected by providing the xm:regex-dom="true" attribute on the parent element. Strategy selection is explained in the "Matching Strategy Selection" section of this document.

As with traditional regular expressions, these constructs can be combined into complex patterns. For example:

<!-- any element except a step via the 'Mass Pike' route -->
<xm:except-any-of xm:minOccurs='0' xm:maxOccurs='unbounded'> <!-- Note 'unbounded' represents "zero or more" multiplicity -->
<step>
<route>Mass Pike</route>
</step>
</xm:except-any-of>

<!-- followed by two steps via 'Mass Pike' and 'Route 30' -->
<step>
<route>Mass Pike</route>
</step>
<step>
<route>Route 30</route>
</step>

<!-- followed by at least one element -->
<xm:any xm:maxOccurs='unbounded'/>


Note: this strategy is applicable to XML structure matching. There is a similarly named strategy for matching text node values.

