processing the output of the REX parse


After the shallow parse is done, we have a list of all XML and text tokens as they appear in the original XML data.
That in itself is naturally only the first step.

The method I will be developing here to make sense and use of the XML data closely follows the one used by 'event-driven' XML parsers like expat (which is the built-in XML parser for PHP). That technique is well understood and there's plenty of source code to learn from.

The process:
  1. go through the token list one token at a time
  2. identify each token (is it start-tag, text-data, end-tag or some other XML element)
  3. call an appropriate function for each kind of token
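
The three steps above can be sketched roughly like this. It is only an illustration and not the script developed further down: the names `identifyToken` and `processTokens` and the handler array are made up for this example, and the identification step is a deliberately naive string check instead of the regexes introduced later.

```php
<?php
// Hypothetical sketch of an event-driven token loop.
// All names here are invented for illustration.

// Step 2: decide what kind of token we are looking at (naive check).
function identifyToken(string $token): string
{
    if ($token === '' || $token[0] !== '<') {
        return 'text';
    }
    $prefix = substr($token, 0, 2);
    if ($prefix === '</') {
        return 'end';
    }
    if ($prefix === '<!' || $prefix === '<?') {
        return 'other';   // comments, declarations, processing instructions
    }
    return 'start';
}

// Steps 1 and 3: walk the list and call the handler registered for each kind.
function processTokens(array $tokens, array $handlers): void
{
    foreach ($tokens as $token) {
        $kind = identifyToken($token);
        if (isset($handlers[$kind])) {
            $handlers[$kind]($token);
        }
    }
}

// Example handlers that simply record what they were called with.
$events = [];
$handlers = [
    'start' => function (string $t) use (&$events) { $events[] = "start: $t"; },
    'end'   => function (string $t) use (&$events) { $events[] = "end: $t"; },
    'text'  => function (string $t) use (&$events) { $events[] = "text: $t"; },
];

processTokens(['<title>', 'Hello', '</title>'], $handlers);
print_r($events);
```

Keeping the handlers in an array means a different set of functions can be plugged in without touching the loop, which is roughly how expat-style parsers are used.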

In this essay I will only deal with three kinds of tokens:
XML elements, which can be either start-tags or end-tags,
and text-data contained within them (or outside of them, which is not correct XML, but might happen).

How can we identify a token?
Well, we can use the regex parts defined earlier. (I strongly suggest you view the earlier code in a second window to follow along, as it is a nice list of all regexes used. I will only repeat the parts that are relevant here, not the parts they are built from.)
Identifying a token as text-data is easiest: it is any list element not starting with '<':  /[^<]+/. All others are some kind of XML markup (well, hopefully, as we will see).
Now we need to pick out the XML elements proper:
The shallow parse is built so that any token that is not text will start with a '<', end-tags with '</'.
If, and I mean if, we are dealing with a correct markup token, it will also end with a '>' ('empty' tags will end in '/>').
Therefore we can simply discard tokens not ending with '>' as incorrect (this will not suffice in real life, but we will do so for now).
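
As a small sketch of that "discard what does not end in '>'" rule: the helper name `dropUnterminatedMarkup` is my own invention for this example and is not part of the shallow parser. It keeps every text token, and keeps markup tokens only if their last character is '>'.

```php
<?php
// Hypothetical helper: keep text tokens, and markup tokens that end in '>'.
function dropUnterminatedMarkup(array $tokens): array
{
    $keep = function (string $token): bool {
        $isMarkup = strncmp($token, '<', 1) === 0;
        return !$isMarkup || substr($token, -1) === '>';
    };
    // array_values() re-indexes the list after filtering.
    return array_values(array_filter($tokens, $keep));
}

$kept = dropUnterminatedMarkup(['<a>', 'text', '<b', '</a>']);
print_r($kept);
```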
So if we take those regex building blocks, we could write an opening element tag regex as:
// this is already defined
$ElemTagCE =
"$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/?>?";
$ElemTagRegex = "/<$ElemTagCE/";

There are several problems with this: a) it would grab opening and empty tags alike, and b) the closing '>' is optional.
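
Both problems are easy to see in a quick test. Note that `$S`, `$Name` and `$AttValSE` below are simplified stand-ins that I define only for this example (ASCII-only names); the real definitions in the shallow parser are more complete.

```php
<?php
// Simplified stand-ins for the shallow parser's building blocks
// (assumption: ASCII-only names, no non-ASCII name characters).
$S        = '[ \n\t\r]+';
$Name     = '[A-Za-z_:][A-Za-z0-9_:.-]*';
$AttValSE = '"[^<"]*"|\'[^<\']*\'';

// The regex exactly as written in the text.
$ElemTagCE    = "$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/?>?";
$ElemTagRegex = "/<$ElemTagCE/";

// An opening tag, an empty tag, an unterminated tag and an end-tag.
$results = [];
foreach (['<item>', '<br/>', '<item', '</item>'] as $token) {
    $results[$token] = preg_match($ElemTagRegex, $token);
}
print_r($results);
```

The first three tokens all match, so the regex cannot tell an opening tag from an empty one and also accepts a tag that is never closed. Only the end-tag is rejected.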
Let's re-write those regexes slightly by modifying the $ElemTagCE definition:
$OpeningTagRegex =
"/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?>/";
$EmptyTagRegex =
"/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/>/";

Those changes are slight but important. In the first case I have removed the optional '/' before the closing '>'; in the empty-tag regex I have made it mandatory. I have also added the opening '<' and made the closing '>' mandatory, which is not strictly necessary, as the shallow parse would have taken care of that.
End-tags are easy:
$EndTagRegex = "/<\/$Name($S)?>/";
Finally, we need to identify text contained inside markup:
$TextRegex = "/^$TextSE$/";
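
Here is a short sketch of how these four regexes can be used to sort tokens. Two things in it are my own assumptions and not part of the code above: the building blocks are simplified ASCII-only stand-ins, and the tag regexes are anchored with `^` and `$`. The anchors matter because `preg_match` looks for a match anywhere in the token, so without them a comment token such as `<!-- <b> -->` would count as an opening tag.

```php
<?php
// Simplified stand-ins (assumption) for the shallow parser's building blocks.
$S        = '[ \n\t\r]+';
$Name     = '[A-Za-z_:][A-Za-z0-9_:.-]*';
$AttValSE = '"[^<"]*"|\'[^<\']*\'';
$TextSE   = '[^<]+';

// Attribute list plus optional trailing whitespace, shared by both tag kinds.
$attrs = "($S$Name($S)?=($S)?($AttValSE))*($S)?";

// Anchored variants of the regexes from the text.
$regexes = [
    'opentag'  => "/^<$Name$attrs>$/",
    'emptytag' => "/^<$Name$attrs\/>$/",
    'endtag'   => "/^<\/$Name($S)?>$/",
    'text'     => "/^$TextSE$/",
];

// Return the name of the first regex that matches, or 'other'.
function classify(string $token, array $regexes): string
{
    foreach ($regexes as $kind => $regex) {
        if (preg_match($regex, $token)) {
            return $kind;
        }
    }
    return 'other';
}

$samples = ['<item id="1">', '<br />', '</item>', 'some text', '<!-- <b> -->'];
foreach ($samples as $token) {
    echo classify($token, $regexes), ' -> ', $token, "\n";
}
```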

What do we have so far?
A simple script that can parse any given XML file and identify opening, end and empty tags as well as the enclosed text sections.
The source code of the script is shown below.
# displaying source of ./process1.php #
<code>
<?php
include("./shallowParser.php"); 
// no need to repeat that code here ;)

// load and parse some XML
$testxml = join("", file("./diveintomark.rss.xml"));
$tokenList = shallowParse($testxml);

// define opening tag and empty tag regexes
$OpeningTagRegex = 
"/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?>/"; 
$EmptyTagRegex = 
"/<$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/>/"; 
$EndTagRegex = 
"/<\/$Name($S)?>/";
// Text regex
$TextRegex = 
"/^$TextSE$/";

// foreach replaces the old list()/each() idiom; each() no longer exists in PHP 8
foreach ($tokenList[0] as $count => $token) {
        if (preg_match($OpeningTagRegex, $token)) {
                echo "<b>opentag</b>
                -&gt;".htmlspecialchars($token)."<br>\n";
        } elseif (preg_match($EmptyTagRegex, $token)) {
                echo "<b>emptytag</b>
                -&gt;".htmlspecialchars($token)."<br>\n";
        } elseif (preg_match($EndTagRegex, $token)) {
                echo "<b>endtag</b>
                -&gt;".htmlspecialchars($token)."<br>\n";
        } elseif (preg_match($TextRegex, $token)) {
                echo "<b>TEXT</b>
                -&gt;".htmlspecialchars($token)."<br>\n";
        } else {
                echo "<b>other</b>
                -&gt;".htmlspecialchars($token)."<br>\n";
        }
}
?>
</code>
# source code end #


All image, text and audio material is © Martin Spernau; use and reproduction require the author's consent.

Martin Spernau
© 1994-2024

