Traumwind - REX, ported to PHP

REX, ported to PHP

The approach Cameron follows here is to build one big regular expression that matches each part of an XML document, and thereby splits the document into a flat list of markup items (tags, comments, processing instructions and so on) and the text between them.
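To make the idea concrete before looking at the full expression, here is a minimal sketch of the same technique. The two sub-patterns are deliberately simplified stand-ins (an assumption for illustration, not Cameron's grammar): they only distinguish "text" from "something between angle brackets" and do not handle comments, CDATA sections or processing instructions correctly. The function name `tinyShallowParse` is hypothetical.

```php
<?php
// Simplified building blocks (illustrative assumptions, not the REX grammar).
$TextSE   = "[^<]+";    // a run of character data up to the next '<'
$MarkupSE = "<[^>]*>";  // anything tag-like; ignores comments, CDATA, PIs
$SPE      = "$TextSE|$MarkupSE"; // one combined expression built from the parts

// Hypothetical helper: return every match of the combined expression, in order.
function tinyShallowParse(string $xml, string $spe): array {
    preg_match_all("/$spe/", $xml, $matches);
    return $matches[0]; // index 0 holds the full matches
}

$tokens = tinyShallowParse("<a href=\"x\">hi <b>there</b></a>", $SPE);
print_r($tokens);
```

The result is a flat array of six strings: the opening `<a href="x">` tag, the text `hi `, `<b>`, `there`, `</b>` and `</a>`. No nesting is tracked, which is exactly what "shallow" means here.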

I've made a rather direct 'port' of Cameron's JavaScript regexes, because the Perl version uses some modifiers that are not supported in PHP's PCRE (Perl-compatible regular expressions) functions.

To understand the following example, please read the paper "REX: XML Shallow Parsing with Regular Expressions" by Robert D. Cameron, which explains the various parts of the regular expression being put together, especially the Perl code in Appendix A: Shallow Parsing in Perl.

A first working approach in PHP is test2.php. The sample XML file it reads is a snapshot of an RSS file downloaded from Mark Pilgrim's site on 24th September. Note: this file is valid, well-formed XML, so it would give any other XML parser no trouble; it is here only as an example of XML data.

What can be seen clearly when executing the code is the 'shallow' characteristic of this approach. We simply get a flat list of all 'tokens' in the order of their appearance. If we were to 'join' this list again, we would have the exact data we started with, with all whitespace, indentation and line breaks intact. (This is actually an important characteristic of this approach.)
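The round-trip property can be checked directly. The sketch below uses a simplified, hypothetical pattern (text runs or tag-like markup only) in place of the full REX expression, and a small inline sample instead of the RSS file; both are assumptions made to keep the example short.

```php
<?php
// Simplified stand-in pattern (assumption): text runs or tag-like markup.
$pattern = '/[^<]+|<[^>]*>/';

// Small sample with mixed indentation and line breaks.
$xml = "<rss>\n  <channel>\n\t<title>dive into mark</title>\n  </channel>\n</rss>\n";

preg_match_all($pattern, $xml, $matches);
$tokens = $matches[0];

// Whitespace between tags is returned as tokens of its own,
// so concatenating the list reproduces the input byte for byte.
$rebuilt = implode('', $tokens);
var_dump($rebuilt === $xml); // bool(true)
```

Because nothing is discarded or normalised, a program can change individual tokens and write the rest back out unchanged.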

Obviously this list is not much yet, and on its own it doesn't really buy us anything. What to do with it will be discussed in the next part.

test2.php code follows:

# displaying source of ./test2.php #
<code>
<?php
/*
* The following code was adapted from
* (the JavaScript and PHP implementations of regex being more compatible)
*
* REX/Javascript 1.0 
* Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
* Technical Report TR 1998-17, School of Computing Science, Simon Fraser 
* University, November, 1998.
* Copyright (c) 1998, Robert D. Cameron. 
* The following code may be freely used and distributed provided that
* this copyright and citation notice remains intact and that modifications
* or additions are clearly identified.
*
* 'port' to PHP by Martin Spernau 2002
*/
function shallowParse($xmlstring) {
        $TextSE = "[^<]+";
        $UntilHyphen = "[^-]*-";
        $Until2Hyphens = "$UntilHyphen([^-]$UntilHyphen)*-";
        $CommentCE = "$Until2Hyphens>?";
        $UntilRSBs = "[^]]*]([^]]+])*]+";
        $CDATA_CE = "$UntilRSBs([^]>]$UntilRSBs)*>";
        $S = "[ \\n\\t\\r]+";
        $NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
        $NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
        $Name = "($NameStrt)($NameChar)*";
        $QuoteSE = "\"[^\"]*\"|'[^']*'";
        $DT_IdentSE = "$S$Name($S($Name|$QuoteSE))*";
        $MarkupDeclCE = "([^]\"'><]+|$QuoteSE)*>";
        $S1 = "[\\n\\r\\t ]";
        $UntilQMs = "[^?]*\\?+";
        $PI_Tail = "\\?>|$S1$UntilQMs([^>?]$UntilQMs)*>";
        $DT_ItemSE = "<(!(--$Until2Hyphens>|[^-]$MarkupDeclCE)|\\?$Name($PI_Tail))|%$Name;|$S";
        $DocTypeCE = "$DT_IdentSE($S)?(\\[($DT_ItemSE)*]($S)?)?>?";
        $DeclCE = "--($CommentCE)?|\\[CDATA\\[($CDATA_CE)?|DOCTYPE($DocTypeCE)?";
        $PI_CE = "$Name($PI_Tail)?";
        $EndTagCE = "$Name($S)?>?";
        $AttValSE = "\"[^<\"]*\"|'[^<']*'";
        $ElemTagCE = "$Name($S$Name($S)?=($S)?($AttValSE))*($S)?\/?>?";
        $MarkupSPE = "<(!($DeclCE)?|\\?($PI_CE)?|\/($EndTagCE)?|($ElemTagCE)?)";

        $XML_SPE = "$TextSE|$MarkupSPE";
        
        // display regex, only for the brave ;)
        // echo htmlspecialchars($XML_SPE)."\n";
        
        preg_match_all("/$XML_SPE/", $xmlstring, $matches);
        return $matches;
}        

$testxml = file_get_contents("./diveintomark.rss.xml");
$result = shallowParse($testxml);
// $result[0] holds the full matches; foreach replaces each(), which was removed in PHP 8
foreach ($result[0] as $count => $token) {
        echo "$count -> <b>".htmlspecialchars($token)."</b><br>\n";
}
?>
</code>
# source code end #

All image, text and audio material is © Martin Spernau; use and reproduction require the author's consent.
