processing the output of the REX parse
After the shallow parse is done, we have a list of all XML and text tokens as they appear in the original
XML data.
That in itself is naturally only the first step.
The method I will be developing here to make sense and use of the XML data closely follows the one used by 'event-driven' XML parsers like expat (which is the built-in XML parser for PHP). That technique is well understood and there's plenty of source code to learn from.
- The process:
- go through the token list one token at a time
- identify each token (is it start-tag, text-data, end-tag or some other XML element)
- call an appropriate function for each kind of token
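The three steps above can be sketched roughly as follows. This is only an illustration, assuming PHP 7 or later: the handler names (`onStartTag`, `onEndTag`, `onText`, `onOther`) and the hand-written token list are made up for the example, and the token test is a crude check on the leading characters rather than the regexes developed further down.

```php
<?php
// Hypothetical handlers, one per kind of token (names are illustrative only).
function onStartTag(string $token): string { return "start: $token"; }
function onEndTag(string $token): string   { return "end: $token"; }
function onText(string $token): string     { return "text: $token"; }
function onOther(string $token): string    { return "other: $token"; }

// Identify one token and call the matching handler.
function dispatchToken(string $token): string
{
    if ($token === '' || $token[0] !== '<') {
        return onText($token);          // anything not starting with '<'
    }
    if (strncmp($token, '</', 2) === 0) {
        return onEndTag($token);        // starts with '</'
    }
    if (preg_match('/^<[A-Za-z_:]/', $token)) {
        return onStartTag($token);      // '<' followed by a name character
    }
    return onOther($token);             // comments, PIs, declarations, ...
}

// A hand-written list standing in for the shallow parser's output.
$tokens = ['<title>', 'Hello', '</title>', '<!-- note -->'];

// Go through the list one token at a time.
$events = array_map('dispatchToken', $tokens);
foreach ($events as $event) {
    echo $event, "\n";
}
```

In a real event-driven parser the handlers would build up some data structure instead of returning strings; the strings just make the flow visible.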
In this essay I will only deal with three kinds of tokens:
XML elements, which can be either start-tags or end-tags,
and text-data contained within them (or outside of them, which is not correct XML, but might happen)
- How can we identify a token?
- Well, we can use the regex-parts defined earlier.
(I strongly suggest you view the earlier code
in a second window to follow along, as it is a nice list of all regexes used. I will only repeat the parts here
that are relevant, and not the parts they are built up of.)
Identifying a token as text-data is easiest: any list element not starting with '<', that is, anything matching /[^<]+/
All others are some kind of XML markup (well, hopefully, as we will see)
Now we need to pick out those XML elements proper:
The shallow parse is built so that any token that is not text will start with a '<', end-tags with '</'.
If, and I mean if, we are dealing with a correct markup token, it will also end with a '>' ('empty' tags will end in '/>').
Therefore we can simply discard markup tokens not ending with '>' as incorrect (this will not suffice in real life, but we will do so for now)
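As a rough sketch of that discard step (assuming PHP 7 or later; the function name `isBrokenMarkup` and the sample tokens are made up for illustration), a token counts as broken markup when it starts with '<' but does not end with '>':

```php
<?php
// True if the token opens like markup but never closes with '>'.
function isBrokenMarkup(string $token): bool
{
    return $token !== ''
        && $token[0] === '<'
        && substr($token, -1) !== '>';
}

// Sample tokens; the third one lost its closing '>'.
$tokens = ['<item>', 'some text', '<link href="x"', '</item>'];

// Keep everything except the broken markup tokens.
$kept = array_values(array_filter($tokens, function (string $t): bool {
    return !isBrokenMarkup($t);
}));

print_r($kept);
```

Text tokens pass through untouched, since the test only applies to tokens that start with '<'.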
So if we take those regex building blocks, we could write an opening element tag regex as:
// this is already defined
$ElemTagCE =
"$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/?>?";
$ElemTagRegex = "/<$ElemTagCE/";
There are several problems with this: a) it would grab opening and empty tags alike, and b) the closing '>' is optional
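Both problems are easy to see in a small test. Note that the `$S`, `$Name` and `$AttValSE` fragments below are simplified, ASCII-only stand-ins I am assuming for this example, not the full definitions from the earlier code:

```php
<?php
// Simplified stand-ins for the REX fragments (assumptions for this example).
$S        = '[ \n\t\r]+';
$Name     = '[A-Za-z_:][A-Za-z0-9_:.-]*';
$AttValSE = '"[^<"]*"|\'[^<\']*\'';

$ElemTagCE    = "$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/?>?";
$ElemTagRegex = "/<$ElemTagCE/";

// An opening tag, an empty tag and a tag missing its '>' are all accepted.
$results = [];
foreach (['<br>', '<br/>', '<br'] as $token) {
    $results[$token] = preg_match($ElemTagRegex, $token);
}
print_r($results);
```

All three tokens match, so this regex alone cannot tell the tag kinds apart or reject the unclosed one.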
Let's re-write those regexes slightly by modifying the $ElemTagCE definition:
$OpeningTagRegex =
"/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?>/";
$EmptyTagRegex =
"/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/>/";
Those changes are slight but important. In the first case I have removed the optional '/' before the closing '>', and in the empty-tag regex I have made it mandatory. I have also added the opening '<' and made the closing '>' mandatory, which is not strictly necessary, as the shallow parse would have taken care of that.
End-tags are easy:
$EndTagRegex = "/<\/$Name($S)?>/";
Finally, we need to identify text contained inside markup:
$TextRegex = "/^$TextSE$/";
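Put together, the four regexes can classify a single token. The sketch below assumes PHP 7 or later and again uses simplified, ASCII-only stand-ins for `$S`, `$Name`, `$AttValSE` and `$TextSE` rather than the real definitions from the earlier code. The function name `classifyToken` is made up, and anchoring every pattern with `^` and `$` is my own addition here, so that a tag-like string inside a comment is not mistaken for a tag.

```php
<?php
// Simplified stand-ins for the REX fragments (assumptions for this example).
$S        = '[ \n\t\r]+';
$Name     = '[A-Za-z_:][A-Za-z0-9_:.-]*';
$AttValSE = '"[^<"]*"|\'[^<\']*\'';
$TextSE   = '[^<]+';

// Attribute list shared by opening and empty tags.
$attrs = "($S$Name($S)?=($S)?($AttValSE))*($S)?";

// One anchored pattern per kind of token, tried in this order.
$patterns = [
    'opentag'  => "/^<$Name$attrs>\$/",
    'emptytag' => "/^<$Name$attrs\/>\$/",
    'endtag'   => "/^<\/$Name($S)?>\$/",
    'text'     => "/^$TextSE\$/",
];

// Return the first kind whose pattern matches, or 'other'.
function classifyToken(string $token, array $patterns): string
{
    foreach ($patterns as $kind => $regex) {
        if (preg_match($regex, $token)) {
            return $kind;
        }
    }
    return 'other';
}

foreach (['<item>', '<link href="x" />', '</item>', 'Hello', '<!-- <b> -->'] as $t) {
    echo classifyToken($t, $patterns), "\n";
}
```

Without the anchors, the comment token in the last test would be reported as an opening tag, because the unanchored regex finds the `<b>` inside it.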
- What do we have so far?
- A simple script that can parse any given XML file and
identify opening tags, end-tags, empty tags and enclosed text sections.
Run the script here and view the source code below
# displaying source of ./process1.php #
<code>
<?php
include("./shallowParser.php"); // no need to repeat that code here ;)

// load and parse some XML
$testxml = join("", file("./diveintomark.rss.xml"));
$tokenList = shallowParse($testxml);

// define opening tag, empty tag and end tag regexes
$OpeningTagRegex = "/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?>/";
$EmptyTagRegex = "/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/>/";
$EndTagRegex = "/<\/$Name($S)?>/";
// Text regex
$TextRegex = "/^$TextSE$/";

// go through the token list and report what kind each token is
foreach ($tokenList[0] as $token) {
    if (preg_match($OpeningTagRegex, $token)) {
        echo "<b>opentag</b> ->".htmlspecialchars($token)."<br>\n";
    } elseif (preg_match($EmptyTagRegex, $token)) {
        echo "<b>emptytag</b> ->".htmlspecialchars($token)."<br>\n";
    } elseif (preg_match($EndTagRegex, $token)) {
        echo "<b>endtag</b> ->".htmlspecialchars($token)."<br>\n";
    } elseif (preg_match($TextRegex, $token)) {
        echo "<b>TEXT</b> ->".htmlspecialchars($token)."<br>\n";
    } else {
        echo "<b>other</b> ->".htmlspecialchars($token)."<br>\n";
    }
}
?>
</code>
# source code end #
All image, text and audio material is © Martin Spernau; use and reproduction require the author's consent