Traumwind - processing the output of the REX parse

After the shallow parse is done, we have a list of all XML and text tokens as they appear in the original XML data.
That in itself is naturally only the first step.

The method I will be developing here to make sense and use of the XML data closely follows the one used by 'event-driven' XML parsers like expat (which is the built-in XML parser for PHP). That technique is well understood and there's plenty of source code to learn from.
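For readers who have not seen the event-driven style, here is a minimal sketch of it using PHP's built-in xml extension. The sample document and the $events log are made up for illustration; the only point is that the parser calls one handler per start-tag, end-tag and run of text.

```php
<?php
// Sketch of the event-driven style with PHP's built-in xml extension.
// $events and the sample document are illustrative assumptions.
$events = [];

$parser = xml_parser_create();

xml_set_element_handler(
    $parser,
    // Called once for every start-tag.
    function ($parser, $name, $attribs) use (&$events) {
        $events[] = "start:$name";
    },
    // Called once for every end-tag.
    function ($parser, $name) use (&$events) {
        $events[] = "end:$name";
    }
);
xml_set_character_data_handler(
    $parser,
    // Called for text between tags (possibly in several chunks).
    function ($parser, $data) use (&$events) {
        $events[] = "text:$data";
    }
);

// Case folding is on by default, so element names arrive in upper case.
xml_parse($parser, '<note><to>Tove</to></note>', true);

print_r($events);
```

The rest of this essay aims for the same shape of processing, only driven by the token list from the shallow parse instead of by expat.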

The process:

In this essay I will only deal with three kinds of tokens:
XML elements, which can be either a start-tag or an end-tag,
and text data contained within them (or outside them, which is not correct XML, but might happen).

How can we identify a token? Well, we can use the regex parts defined earlier. (I strongly suggest you view the earlier code in a second window to follow along, as it is a nice list of all the regexes used. I will only repeat the parts that are relevant here, not the parts they are built from.)
Identifying a token as text data is easiest: it is any list element not starting with '<', i.e. one matching /[^<]+/. All others are some kind of XML markup (well, hopefully, as we will see).
Now we need to pick out those XML elements proper:
The shallow parse is built so that any token that is not text will start with a '<', and end-tags with '</'.
If, and I mean if, we are dealing with a correct markup token, it will also end with a '>' ('empty' tags will end in '/>').
Therefore we can simply discard any markup token not ending with '>' as incorrect (this will not suffice in real life, but it will do for now).
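The rules so far only look at the first and last characters of a token, so they can be sketched without any regex at all. The function name coarseTokenType and the class labels are my own inventions for this example. The extra 'other-markup' class for tokens starting with '<!' or '<?' (comments, declarations, processing instructions) is an assumption added so that those are not mistaken for start-tags.

```php
<?php
// Hypothetical helper: sorts one token from the shallow parse into a coarse
// class by looking only at its first and last characters.
function coarseTokenType(string $token): string
{
    if ($token === '') {
        return 'invalid';
    }
    if ($token[0] !== '<') {
        return 'text';          // anything not starting with '<' is text data
    }
    if (substr($token, -1) !== '>') {
        return 'invalid';       // markup that was never closed
    }
    if (substr($token, 0, 2) === '</') {
        return 'end-tag';
    }
    $lead = substr($token, 0, 2);
    if ($lead === '<!' || $lead === '<?') {
        return 'other-markup';  // comments, declarations, PIs (assumption)
    }
    if (substr($token, -2) === '/>') {
        return 'empty-tag';
    }
    return 'start-tag';
}

foreach (['<a href="x">', 'some text', '</a>', '<br/>', '<a href="x"'] as $token) {
    echo coarseTokenType($token), "\t", $token, "\n";
}
```

This is only a first filter: a token like '<1abc>' passes it although it is not a valid start-tag, which is why the regexes are still needed.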
So if we take those regex building blocks, we could write an opening element tag regex as:
// this is already defined
$ElemTagCE = "$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/?>?";
$ElemTagRegex = "/<$ElemTagCE/";
There are two problems with this: a) it would grab opening and empty tags alike, and b) the closing '>' is optional.
Let's re-write those regexes slightly by modifying the $ElemTagCE definition:
$OpeningTagRegex = "/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?>/";
$EmptyTagRegex = "/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/>/";
Those changes are slight but important. What happens here is that in the first case I have removed the optional '/' before the closing '>', and made it mandatory in the case of the empty-tag regex. I have also added the opening '<' and made the closing '>' mandatory, which is not strictly necessary, as the shallow parse would have taken care of that.
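To see the two regexes tell opening and empty tags apart, here is a small sketch. The definitions of $S, $Name and $AttValSE below are simplified, ASCII-only stand-ins that I am assuming for the example; the real ones from the earlier code are more complete. The sample tokens are made up.

```php
<?php
// Simplified, ASCII-only stand-ins for the REX building blocks.
// These are assumptions for this example, not the full earlier definitions.
$S        = "[ \\n\\t\\r]+";
$Name     = "[A-Za-z_:][A-Za-z0-9_:.-]*";
$AttValSE = "\"[^<\"]*\"|'[^<']*'";

// The two regexes from the text, unchanged.
$OpeningTagRegex = "/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?>/";
$EmptyTagRegex   = "/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/>/";

$samples = ['<p class="x">', '<br />', "<img src='a.png'/>"];
$kinds = [];
foreach ($samples as $token) {
    if (preg_match($OpeningTagRegex, $token) === 1) {
        $kinds[$token] = 'opening';
    } elseif (preg_match($EmptyTagRegex, $token) === 1) {
        $kinds[$token] = 'empty';
    } else {
        $kinds[$token] = 'no match';
    }
}
print_r($kinds);
```

Because the '/' is forbidden in one pattern and required in the other, no token can match both.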
End-tags are easy:
$EndTagRegex = "/<\/$Name($S)?>/";
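Later on we will probably want to know which element an end-tag closes. A possible way, sketched here, is a variation of the regex with a capture group around the name. The names $EndTagNameRegex and endTagName are hypothetical, and $S and $Name are again simplified ASCII-only stand-ins for the earlier definitions.

```php
<?php
// Simplified stand-ins for the REX building blocks (assumptions).
$S    = "[ \\n\\t\\r]+";
$Name = "[A-Za-z_:][A-Za-z0-9_:.-]*";

// Same shape as $EndTagRegex, plus a capture group around the name.
// The capture group is an addition made for this example.
$EndTagNameRegex = "/<\/($Name)($S)?>/";

// Hypothetical helper: returns the element name of an end-tag token,
// or null if the token is not an end-tag.
function endTagName(string $token, string $regex): ?string
{
    return preg_match($regex, $token, $m) === 1 ? $m[1] : null;
}

var_dump(endTagName('</para>', $EndTagNameRegex));
var_dump(endTagName('</para >', $EndTagNameRegex));
var_dump(endTagName('<para>', $EndTagNameRegex));
```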
Finally, we need to identify text contained inside markup:
$TextRegex = "/$TextSE/";
What do we have so far? A simple script that can parse any given XML file and identify start-tags, end-tags, empty tags and the enclosed text sections.
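The core of such a script could look roughly like the sketch below. It is not the actual script, just an illustration under a few assumptions: the building blocks are simplified ASCII-only stand-ins, the token list is hard-coded instead of coming from the shallow parse, and the patterns are anchored with ^ and $ (an addition to the regexes above) so that each must cover the whole token. Without the anchors the text pattern would also match inside a tag. The names $regexes and classifyToken are hypothetical.

```php
<?php
// Simplified, ASCII-only stand-ins for the REX building blocks (assumptions).
$TextSE   = "[^<]+";
$S        = "[ \\n\\t\\r]+";
$Name     = "[A-Za-z_:][A-Za-z0-9_:.-]*";
$AttValSE = "\"[^<\"]*\"|'[^<']*'";

// The four token regexes, anchored so each must cover the whole token.
$regexes = [
    'start' => "/^<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?>$/",
    'empty' => "/^<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/>$/",
    'end'   => "/^<\/$Name($S)?>$/",
    'text'  => "/^$TextSE$/",
];

// Hypothetical helper: returns the first token type whose regex matches.
function classifyToken(string $token, array $regexes): string
{
    foreach ($regexes as $type => $regex) {
        if (preg_match($regex, $token) === 1) {
            return $type;
        }
    }
    return 'other'; // comments, PIs, malformed markup and so on
}

// Hard-coded stand-in for the token list produced by the shallow parse.
$tokens = ['<doc>', 'Hello ', '<b class="x">', 'world', '</b>', '<br/>', '</doc>', '<oops'];

foreach ($tokens as $token) {
    printf("%-6s %s\n", classifyToken($token, $regexes), $token);
}
```

From here, calling a handler function per token type instead of printing would give the event-driven behaviour described at the start.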
run the script here and view the source code below

All image, text and audio material is © Martin Spernau; use and reproduction require the author's consent.

Martin Spernau
© 1994-2025
