Processing the output of the REX parse
After the shallow parse is done, we have a list of all XML and text tokens as they appear in the original
XML data.
That in itself is naturally only the first step.
The method I will be developing here to make sense and use of the XML data closely follows the one used by 'event-driven' XML parsers like expat (which is the built-in XML parser for PHP). That technique is well understood and there's plenty of source code to learn from.
- The process:
- go through the token list one token at a time
- identify each token (is it start-tag, text-data, end-tag or some other XML element)
- call an appropriate function for each kind of token
In this essay I will only deal with three kinds of tokens:
start-tags and end-tags, which together make up the XML elements,
and text-data contained within them (or outside them, which is not correct XML, but might happen)
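The loop described above can be sketched in a few lines. This is only a minimal illustration: the handler names (startTag, endTag, textData) and dispatchTokens are made up for this sketch, and the kind of token is guessed from its first characters rather than with the regexes of the shallow parser.

```php
<?php
// Hypothetical handlers, loosely modelled on expat-style callbacks.
function startTag($token) { return "start: $token"; }
function endTag($token)   { return "end: $token"; }
function textData($token) { return "text: $token"; }

// Go through the token list one token at a time and call
// one function per kind of token.
function dispatchTokens(array $tokens) {
    $events = [];
    foreach ($tokens as $token) {
        $firstTwo = substr($token, 0, 2);
        if ($firstTwo === '</') {
            $events[] = endTag($token);
        } elseif ($firstTwo === '<!' || $firstTwo === '<?') {
            // comments, CDATA, declarations and processing
            // instructions are ignored in this sketch
            continue;
        } elseif (substr($token, 0, 1) === '<') {
            $events[] = startTag($token);
        } else {
            $events[] = textData($token);
        }
    }
    return $events;
}

$events = dispatchTokens(['<title>', 'dive into mark', '<!-- note -->', '</title>']);
print_r($events);
```

A real event-driven parser would hand the tag name and attributes to the handlers instead of the raw token. Here the raw token is passed through to keep the sketch short.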
- How can we identify a token?
- Well, we can use the regex-parts defined earlier.
(I strongly suggest you view the earlier code
in a second window to follow along, as it is a nice list of all regexes used. I will only repeat the parts here
that are relevant, and not the parts they are built up of.)
Identifying a token as text-data is easiest: it is any list element not starting with '<':
/[^<]+/
All others are some kind of XML markup (well, hopefully, as we will see).
Now we need to pick out those XML elements proper:
The shallow parse is built so that any token that is not text will start with a '<', end-tags with '</'.
If, and I mean if, we are dealing with a correct markup token, it will also end with a '>' ('empty' tags will end in '/>').
Therefore we can simply discard any token not ending with '>' as incorrect (this will not suffice in real life, but we will do so for now).
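As a rough first pass, these rules can be written down without any regex at all. The function name roughKind and the three labels are hypothetical, chosen just for this sketch. It only looks at the first and last character of a token.

```php
<?php
// Coarse classification of a shallow-parse token:
//   'text'      - does not start with '<'
//   'markup'    - starts with '<' and ends with '>'
//   'malformed' - starts with '<' but is not closed (or is empty)
function roughKind($token) {
    if ($token === '') {
        return 'malformed';
    }
    if ($token[0] !== '<') {
        return 'text';
    }
    if (substr($token, -1) === '>') {
        return 'markup';
    }
    return 'malformed';
}

// Made-up sample tokens
foreach (['dive into mark', '<item>', '<br/>', '<item rdf:about="x"'] as $token) {
    echo roughKind($token), " -> ", $token, "\n";
}
```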
So if we take those regex building blocks, we could write an opening element tag regex as:
// this is already defined
$ElemTagCE =
"$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/?>?";
$ElemTagRegex = "/<$ElemTagCE/";
There are several problems with this: a) it would grab opening and empty tags alike, and b) the closing '>' is optional.
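Both problems are easy to see with a small check. The regex parts at the top are an assumption: they are meant to mirror the definitions of the earlier shallow-parse code (which follow REX), and the sample tokens are made up.

```php
<?php
// Building blocks, assumed to match the earlier shallow-parse code
$S        = "[ \\n\\t\\r]+";
$NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
$NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
$Name     = "(?:$NameStrt)(?:$NameChar)*";
$AttValSE = "\"[^<\"]*\"|'[^<']*'";

$ElemTagCE =
"$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/?>?";
$ElemTagRegex = "/<$ElemTagCE/";

// An opening tag, an empty tag and an unterminated tag
$samples = ['<item>', '<br/>', '<item'];
$hits = [];
foreach ($samples as $token) {
    $hits[$token] = preg_match($ElemTagRegex, $token);
}
print_r($hits); // all three tokens match the same regex
```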
Let's re-write those regexes slightly by modifying the $ElemTagCE definition:
$OpeningTagRegex =
"/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?>/";
$EmptyTagRegex =
"/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/>/";
Those changes are slight but important. In the first case I have removed the optional '/' before the closing '>', and in the empty-tag regex I have made it mandatory. I have also added the opening '<' and made the closing '>' mandatory, which is not strictly necessary, as the shallow parse would have taken care of that.
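A quick way to convince yourself that the two regexes no longer overlap is to run them against a few tokens. As before, the building blocks are assumed to match the earlier shallow-parse code, and the tokens are invented for the test.

```php
<?php
// Building blocks, assumed to match the earlier shallow-parse code
$S        = "[ \\n\\t\\r]+";
$NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
$NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
$Name     = "(?:$NameStrt)(?:$NameChar)*";
$AttValSE = "\"[^<\"]*\"|'[^<']*'";

$OpeningTagRegex =
"/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?>/";
$EmptyTagRegex =
"/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/>/";

// For each sample token, record which of the two regexes match it
$samples = ['<item>', '<enclosure url="a.mp3" />', '<item'];
$result = [];
foreach ($samples as $token) {
    $result[$token] = [
        'opening' => preg_match($OpeningTagRegex, $token),
        'empty'   => preg_match($EmptyTagRegex, $token),
    ];
}
print_r($result);
```

The unterminated '<item' is now matched by neither regex, so it would fall through to whatever handles incorrect tokens.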
End-tags are easy:
$EndTagRegex = "/<\/$Name($S)?>/";
Finally, we need to identify text contained inside markup:
$TextRegex = "/^$TextSE$/";
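The end-tag and text regexes can be tried out the same way. In this sketch $TextSE is assumed to be the "[^<]+" fragment shown earlier, the text regex is anchored with ^ and $ so that it has to cover the whole token, and the other building blocks are again assumed to match the earlier shallow-parse code.

```php
<?php
// Building blocks, assumed to match the earlier shallow-parse code
$S        = "[ \\n\\t\\r]+";
$NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
$NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
$Name     = "(?:$NameStrt)(?:$NameChar)*";
$TextSE   = "[^<]+";

$EndTagRegex = "/<\/$Name($S)?>/";
$TextRegex   = "/^$TextSE$/";

// Made-up tokens: two end-tags, one start-tag, one piece of text
$checks = [];
foreach (['</item>', '</item >', '<item>', 'dive into mark'] as $token) {
    $checks[$token] = [
        'endtag' => preg_match($EndTagRegex, $token),
        'text'   => preg_match($TextRegex, $token),
    ];
}
print_r($checks);
```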
- What do we have until now?
- A simple script that can parse any given XML file,
identify open-, end-, empty tags and enclosed text sections.
run the script here and view the source code below
# displaying source of ./process1.php #
<code>
<?php
include("./shallowParser.php");
// no need to repeat that code here ;)
// load and parse some XML
$testxml = file_get_contents("./diveintomark.rss.xml");
$tokenList = shallowParse($testxml);
// define opening tag and empty tag regexes
// (^ and $ tie each regex to the whole token, so a tag quoted
// inside a comment or CDATA token is not reported as a tag)
$OpeningTagRegex =
"/^<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?>$/";
$EmptyTagRegex =
"/^<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/>$/";
$EndTagRegex =
"/^<\/$Name($S)?>$/";
// Text regex
$TextRegex =
"/^$TextSE$/";
// foreach replaces the old each() loop, which PHP 8 no longer supports
foreach ($tokenList[0] as $count => $token) {
    if (preg_match($OpeningTagRegex, $token)) {
        echo "<b>opentag</b> ->" . htmlspecialchars($token) . "<br>\n";
    } elseif (preg_match($EmptyTagRegex, $token)) {
        echo "<b>emptytag</b> ->" . htmlspecialchars($token) . "<br>\n";
    } elseif (preg_match($EndTagRegex, $token)) {
        echo "<b>endtag</b> ->" . htmlspecialchars($token) . "<br>\n";
    } elseif (preg_match($TextRegex, $token)) {
        echo "<b>TEXT</b> ->" . htmlspecialchars($token) . "<br>\n";
    } else {
        // comments, CDATA, declarations, processing instructions, broken markup
        echo "<b>other</b> ->" . htmlspecialchars($token) . "<br>\n";
    }
}
?>
</code>
# source code end #
All image, text and audio material is © Martin Spernau; use and reproduction require the author's consent.