Traumwind - I'm having a real scripter-block here.

As a long-time Perl scripter I'm used to doing string manipulation with regex. I've read and understood Mastering Regular Expressions well enough to have written a rather nice XML parser in pure Perl.

But today I have a problem I can't solve with regex.

I have the following kind of tags in my project:


<span ptal:content="hello">

    some text

</span>

No great problem there:
"/<((\w+)[^>]*)\s+ptal:content=\"([^\"]+)\"([^>]*)>(.*?)<\/\\2>/s"
will match the whole thing (the s modifier is needed so that the dot also matches the line breaks inside the tag).
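
For readers who want to try the pattern outside of a Perl or PHP string, here is a minimal sketch in Python. The name PTAL_CONTENT and the variable names are made up for illustration; re.DOTALL lets the dot match newlines, which the multi-line example needs.

```python
import re

# Python rendering of the pattern above. The constant name is hypothetical.
# Group 2 is the tag name, group 3 the ptal:content value, group 5 the body.
PTAL_CONTENT = re.compile(
    r'<((\w+)[^>]*)\s+ptal:content="([^"]+)"([^>]*)>(.*?)</\2>',
    re.DOTALL,  # let "." match the line breaks inside the element
)

snippet = '<span ptal:content="hello">\n    some text\n</span>'

m = PTAL_CONTENT.search(snippet)
tag, expression, body = m.group(2), m.group(3), m.group(5)

print(tag)           # span
print(expression)    # hello
print(body.strip())  # some text
```

The backreference \2 is what ties the closing tag to the name of the opening tag, which is fine as long as no element of the same name is nested inside.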

But now comes the challenge:


<span ptal:define="say hello">

  <span ptal:content="say">

      some text

  </span>

</span>

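The problem is that any regex of the above kind will match only part of this (highlighted in yellow in the original post): everything from the first opening <span ...> through the first, inner </span>, leaving the final </span> outside the match: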


<span ptal:define="say hello">

  <span ptal:content="say">

      some text

  </span>

</span>

as it will take the first opening tag and match until it finds a matching closing tag, disregarding any nesting.

And converting the regex to be greedy is no solution, as it would then match the first <span> to the very last </span>...
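
Both failure modes can be reproduced in a few lines. This is only a sketch: the patterns LAZY and GREEDY are simplified variants of the regex above, rewritten for ptal:define, and the second <span> appended to the document is an invented example to show what the greedy version swallows.

```python
import re

nested = (
    '<span ptal:define="say hello">\n'
    '  <span ptal:content="say">\n'
    '      some text\n'
    '  </span>\n'
    '</span>'
)
# A second, unrelated element after the nested one (made up for this demo).
doc = nested + '\n<span ptal:define="other">x</span>'

# Simplified variants of the pattern above, one lazy and one greedy.
LAZY = re.compile(r'<(\w+)[^>]*\s+ptal:define="[^"]+"[^>]*>(.*?)</\1>', re.DOTALL)
GREEDY = re.compile(r'<(\w+)[^>]*\s+ptal:define="[^"]+"[^>]*>(.*)</\1>', re.DOTALL)

lazy_match = LAZY.search(doc).group(0)
greedy_match = GREEDY.search(doc).group(0)

# Lazy: stops at the inner </span>, so the outer element is cut short.
print(lazy_match == nested)   # False
# Greedy: runs on to the very last </span> in the document.
print(greedy_match == doc)    # True
```

Neither variant can pair an opening tag with its own closing tag, because the pattern has no way to count how many elements were opened in between.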

I guess some programmatic string parsing is called for here... Darn, if only I could do that...
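
One way such programmatic parsing might look is a small depth counter: find the opening tag with a regex, then walk over every later tag of the same name, adding one for each opening tag and subtracting one for each closing tag until the count returns to zero. The sketch below is in Python, the function name find_element is made up, and it assumes well-formed input with double-quoted attribute values and no '>' inside attributes, comments or CDATA.

```python
import re

def find_element(text, attr):
    """Return (tag, attr_value, inner_text) for the first element carrying
    `attr`, pairing it with its own closing tag by counting nesting depth.
    Hypothetical helper; assumes well-formed markup."""
    opener = re.compile(r'<(\w+)[^>]*\s' + re.escape(attr) + r'="([^"]*)"[^>]*>')
    start = opener.search(text)
    if start is None:
        return None
    tag, value = start.group(1), start.group(2)

    # Matches every opening or closing tag that has the same name.
    same_tag = re.compile(r'<(/?)' + re.escape(tag) + r'\b[^>]*>')
    depth = 1
    for m in same_tag.finditer(text, start.end()):
        if m.group(0).endswith('/>'):
            continue  # self-closing tag: depth is unchanged
        depth += -1 if m.group(1) else 1
        if depth == 0:
            return tag, value, text[start.end():m.start()]
    return None  # ran out of input before the element was closed

nested = (
    '<span ptal:define="say hello">\n'
    '  <span ptal:content="say">\n'
    '      some text\n'
    '  </span>\n'
    '</span>'
)

outer = find_element(nested, 'ptal:define')
inner = find_element(nested, 'ptal:content')
print(outer[1])          # say hello
print(inner[2].strip())  # some text
```

The regex is still doing the low-level work of recognising a single tag; only the pairing of opening and closing tags is moved into ordinary code, which is the part a regex cannot express.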

Some pointers I found:

class HTML parser
HotScripts has some very nice XML Stuff
Jazarsoft.com has a very low-level, character-based HTML Parser
PHP Classes: HTMLParser looks promising, or rather not...
Far out: A Lisp Based HTML Parser
PhpWiki - New Tranformation Engine seems to have some very good pointers to similar parsing problems
- Let's Build a Compiler: this fifteen-part series, written from 1988 to 1995, is a non-technical introduction to compiler construction.
- Compiling Little Languages in Python: "Little languages" such as configuration files or HTML documents are commonplace in computing.
  Judging from this article, it's a 'scanner' I'm looking for, not so much a 'parser':
  Scanning, or lexical analysis. Breaks the input stream into a list of tokens. For example, the expression "2 + 3 * 5" can be broken up into five tokens: number plus number times number. The values 2, 3, and 5 are attributes associated with the corresponding number token.
  Parsing, or syntax analysis. Ensures that a list of tokens has valid syntax according to a grammar - a set of rules that describes the syntax of the language. The article goes on to give a typical expression grammar for the above example.
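
To make the scanning step concrete (this sketches scanning only, it is not the grammar from the article), here is a small Python scanner for the "2 + 3 * 5" example. The names TOKEN_SPEC and scan are made up for illustration, and the token set is limited to what the example needs.

```python
import re

# Token names follow the quoted example: number, plus, times.
# "skip" is an extra, hypothetical token type for whitespace.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("plus", r"\+"),
    ("times", r"\*"),
    ("skip", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def scan(text):
    """Break text into (token_type, attribute) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "skip":
            # Only number tokens carry an attribute (their value).
            attr = int(m.group()) if m.lastgroup == "number" else None
            tokens.append((m.lastgroup, attr))
        pos = m.end()
    return tokens

tokens = scan("2 + 3 * 5")
print([kind for kind, _ in tokens])
# ['number', 'plus', 'number', 'times', 'number']
```

The same idea carries over to the tag problem: a scanner that emits "open tag", "close tag" and "text" tokens leaves only the counting of nesting to the code that consumes them.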

[ by Martin ]
