REX in PHP, Intro
- Introduction
- Why use a regular-expression-based XML parsing technique when there are scores of 'proper' XML parsers, and PHP
has one of the powerful kind built in, you may ask.
The answer is in fact rather simple. Most XML parsers require the XML data to be (at least) 'well-formed', if not valid XML. That is a good thing in itself, being one way of enforcing the excellent machine-readability XML can bring us.
But in 'real life' one all too often has to deal with less-than-well-formed data, sometimes outright illegal XML. The reasons for this are varied, and beside the point for our case. While we can and should evangelize people to create and distribute at least well-formed XML, sometimes we are simply interested in making sense (and use) of the data we find.
- A typical use of XML is RSS
- ...which has its use in syndicating newsfeeds of various sources, weblogs being one of the most popular.
Now, the medium of the weblog being as 'grass-roots' as it is, the RSS output of many a weblog has major deficits XML-wise. The simplest case is HTML markup enclosed unescaped inside an XML element like <description>.
Content-wise this makes absolute sense if you view the RSS XML as a 'container' for your content. XML-wise it's a no-no.
- The Problem
- here is that most standard XML-parsers choke on this kind of illegal XML and simply give us nothing
to work with.
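To make the problem concrete, here is a minimal sketch, assuming a PHP build with the SimpleXML extension (PHP 5 or later). The helper name `isWellFormed` and the sample feed items are made up for illustration. The first item carries unescaped HTML with an unclosed `<br>`, the second carries the same HTML properly escaped.

```php
<?php
// Hypothetical helper: returns true when PHP's libxml-based SimpleXML
// parser accepts the input, false when it rejects it.
function isWellFormed(string $xml): bool
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $ok = simplexml_load_string($xml) !== false;
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// Made-up feed item: the <br> inside <description> is never closed,
// so the element nesting is broken from an XML point of view.
$broken = '<item><title>Post</title><description>Hello <b>world</b><br> more text</description></item>';

// The same item with the HTML escaped, as XML requires.
$escaped = '<item><title>Post</title><description>Hello &lt;b&gt;world&lt;/b&gt;&lt;br&gt; more text</description></item>';

var_dump(isWellFormed($broken));
var_dump(isWellFormed($escaped));
```

With the broken item the strict parser hands back nothing at all, not even the perfectly usable title.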
- What we need is a fault-tolerant XML-parser
- that is still fast and efficient. We need a parser that can work efficiently with correct XML data,
but will still return as much viable information as possible when the data is invalid.
In the best case scenario that kind of parser would simply ignore or skip anything that is not RSS markup and thus return the content.
I will try to develop just such a parser in PHP in the following instalments.
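As a rough illustration of the idea (not the REX grammar itself, and not the parser developed in the later instalments), a single regular expression can already recover element content that a strict parser refuses to touch. The function name `extractDescriptions` and the sample feed are assumptions made up for this sketch.

```php
<?php
// Hypothetical helper: collect the raw content of every <description>
// element, whether or not that content is well-formed XML.
function extractDescriptions(string $feed): array
{
    // Lazy match up to the next closing tag; "s" lets "." span newlines,
    // "i" ignores the case of the tag name.
    preg_match_all('~<description\b[^>]*>(.*?)</description\s*>~is', $feed, $matches);
    return array_map('trim', $matches[1]);
}

// Made-up feed: the first item contains unescaped HTML with an unclosed <br>.
$feed = <<<'XML'
<rss version="0.91"><channel>
<item><title>First</title><description>Hello <b>world</b><br> more text</description></item>
<item><title>Second</title><description>Plain text</description></item>
</channel></rss>
XML;

print_r(extractDescriptions($feed));
```

This deliberately ignores everything the pattern does not name, which is the "skip anything that is not RSS markup" behaviour described above. It makes no attempt to handle CDATA sections, comments, or nested `<description>` elements.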
- Inspiration
- for this work came from two angles:
- Mark Pilgrim's 'ultra liberal RSS parser' (in Python)
- and the paper "REX: XML Shallow Parsing with Regular Expressions" by Robert D. Cameron,
especially the Perl code in Appendix A: Shallow Parsing in Perl
All images, text, and audio material are © Martin Spernau; use and reproduction require the author's consent.