Traumwind - Holy Cow cool

One day, I was fiddling about with the Template Toolkit (http://www.template-toolkit.com/) and it dawned on me that all these sites were, at some level, generated with some templating engine. The Template Toolkit takes a template and some data and produces HTML output.

Okay, you might think, very interesting, but how does this relate to scraping web pages for RSS? Well, we know what the HTML looks like, and we can make a reasonable guess at what the template ought to look like, but we want only the data. If only I could apply the Template Toolkit backward somehow. Taking HTML output and a template that could conceivably generate the output, I could retrieve the original data structure and, from then on, generating RSS from the data structure would be a piece of cake.

Like most brilliant ideas, this is hardly original, and an equally brilliant man named Autrijus Tang not only had the idea a long time before me, but - and this is the hard part - actually worked out how to implement it. His Template::Extract Perl module (http://search.cpan.org/author/AUTRIJUS/Template-Extract/) does precisely this: it extracts a data structure from a template and the output that template produced.
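
To make the "template run backward" idea concrete, here is a minimal sketch in Python rather than Perl. It is not Template::Extract and does not use its API. It only handles simple `[% name %]` placeholders (the Template Toolkit's tag style) by turning the template into a regular expression. The helper names `template_to_regex` and `extract`, the sample markup, and the example.com URLs are all made up for illustration.

```python
import re


def template_to_regex(template):
    """Compile a template with [% name %] placeholders into a regex.

    Assumption for this sketch: placeholders are plain variable names and
    each name appears at most once in the template (no loops, no directives).
    """
    # Splitting on a pattern with one capture group yields alternating
    # pieces: literal text, placeholder name, literal text, ...
    parts = re.split(r"\[%\s*(\w+)\s*%\]", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            # Literal template text must match the output exactly.
            pattern += re.escape(part)
        else:
            # A placeholder becomes a non-greedy named capture group.
            pattern += "(?P<%s>.*?)" % part
    return re.compile(pattern, re.DOTALL)


def extract(template, document):
    """Return one dict of placeholder values per match of the template."""
    regex = template_to_regex(template)
    return [match.groupdict() for match in regex.finditer(document)]


# Hypothetical template that could have generated the page below.
template = '<li><a href="[% url %]">[% title %]</a></li>'

html = """<ul>
<li><a href="http://example.com/a">First post</a></li>
<li><a href="http://example.com/b">Second post</a></li>
</ul>"""

items = extract(template, html)
for item in items:
    print(item["title"], item["url"])
```

Each recovered dict has the `url` and `title` that an RSS item needs, which is the "piece of cake" step the excerpt mentions. A regex built this way is brittle against whitespace or markup changes in the page, so treat it as a picture of the idea, not a replacement for the module.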

via Keith Devens .com - Weblog (This tip was an excerpt from O'Reilly's new book, Spidering Hacks)

by Martin

Martin Spernau
© 1994-2003