As a long-time Perl scripter I'm used to doing string manipulation with regex.
I've read and understood Mastering Regular Expressions well enough that I've written a rather nice XML parser in pure Perl.
But today I have a problem I can't solve with regex.
I have the following kind of tags in my project:
<span ptal:content="hello">
some text
</span>
No great problem there:
"/<((\w+)[^>]*)\s+ptal:content=\"([^\"]+)\"([^>]*)>(.*?)<\/\\2>/s"
will match the whole thing (the /s modifier is needed so that . also matches the newlines inside the tag).
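For readers who want to try the pattern, here is a rough Python translation, offered as a sketch rather than the original code. The names `PATTERN` and `html` are made up for illustration, and `re.DOTALL` is assumed so that `.` can cross line breaks.

```python
import re

# Python translation of the pattern above (illustrative names).
# re.DOTALL lets "." match newlines, since the tag body spans several lines.
PATTERN = re.compile(
    r'<((\w+)[^>]*)\s+ptal:content="([^"]+)"([^>]*)>(.*?)</\2>',
    re.DOTALL,
)

html = '<span ptal:content="hello">\nsome text\n</span>'

m = PATTERN.search(html)
tag_name = m.group(2)    # name of the element, reused by the \2 backreference
attr_value = m.group(3)  # value of ptal:content
body = m.group(5)        # everything between the opening and closing tag
print(tag_name, attr_value, repr(body))
```

The backreference `\2` is what ties the closing tag to the opening one, which is fine as long as no element of the same name is nested inside.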
But now comes the challenge:
<span ptal:define="say hello">
<span ptal:content="say">
some text
</span>
</span>
The problem is that any regex of the above kind will match only this part:
<span ptal:define="say hello">
<span ptal:content="say">
some text
</span>
as it will take the first opening tag and match until it finds a matching closing tag, disregarding any nesting.
And converting the regex to be greedy is no solution, as it would then match the first <span> to the very last </span>...
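Both failure modes can be reproduced in a few lines of Python. This is only a sketch: the `ptal:define` patterns are a hypothetical variant of the regex above, and `two_blocks` is made-up input with two sibling elements to show what the greedy version swallows.

```python
import re

nested = (
    '<span ptal:define="say hello">\n'
    '<span ptal:content="say">\n'
    'some text\n'
    '</span>\n'
    '</span>'
)
# Two sibling elements with unrelated markup in between (made-up input).
two_blocks = nested + '\n<p>between</p>\n' + nested

# Hypothetical ptal:define variants of the pattern: lazy and greedy body.
lazy = re.compile(
    r'<(\w+)[^>]*\s+ptal:define="[^"]+"[^>]*>(.*?)</\1>', re.DOTALL)
greedy = re.compile(
    r'<(\w+)[^>]*\s+ptal:define="[^"]+"[^>]*>(.*)</\1>', re.DOTALL)

# Lazy: stops at the first </span>, which belongs to the inner element.
lazy_match = lazy.search(nested).group(0)

# Greedy: runs on to the very last </span>, swallowing the second element too.
greedy_match = greedy.search(two_blocks).group(0)

print(lazy_match == nested)        # the outer closing tag is missing
print(greedy_match == two_blocks)  # both elements matched as one
```

Neither quantifier knows about nesting depth, so no choice between them gives the right answer for both inputs.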
I guess some programmatic string-parsing is called for here... Darn if only I could do that...
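The programmatic approach can be quite small: use a regex only to find individual tags, and keep a depth counter to pair the opening tag with its real closing tag. The following Python sketch assumes well-formed input without self-closing tags, comments or CDATA sections. `find_element` is a name invented for this example.

```python
import re

def find_element(text, attr):
    """Return (value, inner, whole) for the first element carrying attr.

    Returns None if no such element exists or the tags are unbalanced.
    """
    opener = re.compile(
        r'<(\w+)\b[^>]*?\s' + re.escape(attr) + r'="([^"]*)"[^>]*>')
    start = opener.search(text)
    if start is None:
        return None
    tag = start.group(1)
    # Matches any opening or closing tag with the same element name.
    same_tag = re.compile(r'<(/?)' + re.escape(tag) + r'\b[^>]*>')
    depth = 1  # we are inside the element we just opened
    for m in same_tag.finditer(text, start.end()):
        depth += -1 if m.group(1) else 1  # "/" present means closing tag
        if depth == 0:
            inner = text[start.end():m.start()]
            whole = text[start.start():m.end()]
            return start.group(2), inner, whole
    return None  # ran out of input before the element was closed

nested = (
    '<span ptal:define="say hello">\n'
    '<span ptal:content="say">\n'
    'some text\n'
    '</span>\n'
    '</span>'
)
value, inner, whole = find_element(nested, "ptal:define")
print(value)
print(whole == nested)
```

Each nested `<span>` raises the depth and each `</span>` lowers it, so the scan only stops at the closing tag that brings the depth back to zero.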
Some pointers I found:
- class HTML parser
- HotScripts has some very nice XML Stuff
- Jazarsoft.com has a very low-level, character-based HTML Parser
- PHP Classes: HTMLParser looks promising, or rather not...
- Far out: A Lisp Based HTML Parser
- PhpWiki - New Transformation Engine seems to have some very good pointers to similar parsing problems
- Let's Build a Compiler
This fifteen-part series, written from 1988 to 1995, is a non-technical introduction to compiler construction.
- Compiling Little Languages in Python
"Little languages" such as configuration files or HTML documents are commonplace in computing
Judging from this article, it's a 'scanner' I'm looking for, not so much a 'parser':
Scanning, or lexical analysis. Breaks the input stream into a list of tokens. For example, the expression "2 + 3 * 5" can be broken up into five tokens: number plus number times number. The values 2, 3, and 5 are attributes associated with the corresponding number token.
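To make the scanning step concrete, here is a minimal Python sketch that splits the example expression into its five tokens. It is not taken from the article. The names `TOKEN_SPEC`, `TOKEN_RE` and `scan` are assumptions made for this illustration, and each token is paired with its text as the attribute.

```python
import re

# Token kinds and the regex that recognises each one (illustrative names).
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("plus", r"\+"),
    ("times", r"\*"),
    ("skip", r"\s+"),  # whitespace separates tokens but is not one itself
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(text):
    """Break text into a list of (kind, attribute) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

tokens = scan("2 + 3 * 5")
print([kind for kind, _ in tokens])
# → ['number', 'plus', 'number', 'times', 'number']
```

The scanner only labels the pieces. Deciding whether their order is valid is left to the parsing step.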
Parsing, or syntax analysis. Ensures that a list of tokens has valid syntax according to a grammar - a set of rules that describes the syntax of the language. For the above example, a typical expression grammar would be:
[by Martin]