As a long-time Perl scripter, I'm used to doing string manipulation with regexes.
I've read and understood Mastering Regular Expressions well enough to have written a rather nice XML parser in pure Perl.
But today I have a problem I can't solve with a regex.
I have the following kind of tags in my project:
<span ptal:content="hello">
some text
</span>
No great problem there:
"/<((\w+)[^>]*)\s+ptal:content=\"([^\"]+)\"([^>]*)>(.*?)<\/\\2>/s"
will match the whole thing (the s modifier is needed so that . also matches the line breaks inside the tag).
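To make the capture groups concrete, here is a small sketch of the same pattern ported to Python. The port itself is an assumption (the original is written as a PHP/Perl-style pattern string); `re.DOTALL` plays the role of the s modifier, and the names `PATTERN`, `html` and `m` are made up for this illustration.

```python
import re

# The pattern from the post, ported to Python's re syntax.
# re.DOTALL lets "." match line breaks, like the s modifier.
PATTERN = re.compile(
    r'<((\w+)[^>]*)\s+ptal:content="([^"]+)"([^>]*)>(.*?)</\2>',
    re.DOTALL,
)

html = '<span ptal:content="hello">\nsome text\n</span>'
m = PATTERN.search(html)

tag_name = m.group(2)        # the element name, reused as \2 for the closing tag
content_expr = m.group(3)    # the value of the ptal:content attribute
inner_text = m.group(5)      # everything between the opening and closing tag
```

Group 2 holds the tag name, group 3 the ptal:content value and group 5 the body of the element.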
But now comes the challenge:
<span ptal:define="say hello">
<span ptal:content="say">
some text
</span>
</span>
The problem is that any regex of the above kind will match only the first four lines below, stopping at the inner </span>:
<span ptal:define="say hello">
<span ptal:content="say">
some text
</span>
</span>
as it will take the first opening tag and match until it finds a matching closing tag, disregarding any nesting.
And converting the regex to be greedy is no solution, as it would then match the first <span> to the very last </span>...
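Both failure modes can be reproduced in a few lines. This is only a sketch in Python: the two patterns are hypothetical ptal:define variants of the regex above, and the trailing unrelated <span> is added to the sample purely to show how far the greedy version runs.

```python
import re

html = (
    '<span ptal:define="say hello">\n'
    '<span ptal:content="say">\n'
    'some text\n'
    '</span>\n'
    '</span>\n'
    '<span>unrelated</span>'   # extra element, added for this illustration
)

# Hypothetical variants of the post's pattern, one lazy and one greedy.
lazy = re.compile(r'<(\w+)\s+ptal:define="[^"]+"[^>]*>(.*?)</\1>', re.DOTALL)
greedy = re.compile(r'<(\w+)\s+ptal:define="[^"]+"[^>]*>(.*)</\1>', re.DOTALL)

lazy_match = lazy.search(html).group(0)      # stops at the inner </span>
greedy_match = greedy.search(html).group(0)  # runs on to the very last </span>
```

The lazy match ends one closing tag too early, and the greedy match swallows the unrelated element as well. Neither variant counts nesting levels.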
I guess some programmatic string-parsing is called for here... Darn if only I could do that...
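One possible shape for such programmatic parsing is to let a regex find the individual tags and keep a nesting counter by hand. The following is a minimal sketch in Python, not a full parser: the function name `find_element_end` is invented here, and it assumes well-formed markup without comments, CDATA sections or ">" characters inside attribute values.

```python
import re

def find_element_end(html: str, start: int, tag: str) -> int:
    """Return the index just past the closing tag that matches the
    opening `tag` found at position `start`."""
    # Matches any opening or closing tag with the given name.
    tag_re = re.compile(r'<(/?)' + re.escape(tag) + r'\b[^>]*>')
    depth = 0
    for m in tag_re.finditer(html, start):
        if m.group(1):                       # closing tag: one level up
            depth -= 1
            if depth == 0:
                return m.end()
        elif not m.group(0).endswith('/>'):  # opening tag (not self-closing)
            depth += 1
    raise ValueError("no matching closing tag found")

html = (
    '<span ptal:define="say hello">\n'
    '<span ptal:content="say">\n'
    'some text\n'
    '</span>\n'
    '</span>\n'
    '<span>unrelated</span>'
)
start = html.index('<span ptal:define')
outer = html[start:find_element_end(html, start, 'span')]
```

Here `outer` covers the ptal:define element including the nested span and both closing tags, and nothing after it. A regex can still be used to locate the opening tag and its attributes; only the search for the end of the element is done with the counter.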
Some pointers I found:
- class HTML parser
- HotScripts has some very nice XML Stuff
- Jazarsoft.com has a very low-level, character-based HTML Parser
- PHP Classes: HTMLParser looks promising, or rather not...
- Far out: A Lisp Based HTML Parser
- PhpWiki - New Transformation Engine seems to have some very good pointers to similar parsing problems
- Let's Build a Compiler
This fifteen-part series, written from 1988 to 1995, is a non-technical introduction to compiler construction.
- Compiling Little Languages in Python
"Little languages" such as configuration files or HTML documents are commonplace in computing
Judging from this article, it's a 'scanner' I'm looking for, not so much a 'parser':
Scanning, or lexical analysis. Breaks the input stream into a list of tokens. For example, the expression "2 + 3 * 5" can be broken up into five tokens: number plus number times number. The values 2, 3, and 5 are attributes associated with the corresponding number token.
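The scanning step described above can be sketched in a few lines of Python. This is an illustration only, not code from the quoted article: the token names follow the quote, while `TOKEN_SPEC`, `MASTER` and `scan` are names made up here.

```python
import re

# Token kinds from the quoted example, plus whitespace to be skipped.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("plus", r"\+"),
    ("times", r"\*"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(text):
    """Break `text` into (token, attribute) pairs; only numbers carry a value."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "skip":
            value = int(m.group()) if m.lastgroup == "number" else None
            tokens.append((m.lastgroup, value))
        pos = m.end()
    return tokens

tokens = scan("2 + 3 * 5")
```

The result is the five tokens number, plus, number, times, number, with 2, 3 and 5 attached to the number tokens as their attributes.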
Parsing, or syntax analysis. Ensures that a list of tokens has valid syntax according to a grammar - a set of rules that describes the syntax of the language. For the above example, a typical expression grammar would be:
[ by Martin>]
Martin Spernau
© 1994-2003
Big things to come (TM) 30th Dez 2002