Home | Geschichten | Kunst | Computer | Tindertraum |
One day, I was fiddling about with the Template Toolkit (http://www.template-toolkit.com/) and it dawned on me that all these sites were, at some level, generated with some templating engine. The Template Toolkit takes a template and some data and produces HTML output.via Keith Devens .com - Weblog (This tip was an excerpt from O'Reilly's new book, Spidering Hacks )
Okay, you might think, very interesting, but how does this relate to scraping web pages for RSS? Well, we know what the HTML looks like, and we can make a reasonable guess at what the template ought to look like, but we want only the data. If only I could apply the Template Toolkit backward somehow. Taking HTML output and a template that could conceivably generate the output, I could retrieve the original data structure and, from then on, generating RSS from the data structure would be a piece of cake.
Like most brilliant ideas, this is hardly original, and an equally brilliant man named Autrijus Tang not only had the idea a long time before me, but - and this is the hard part - actually worked out how to implement it. His Template::Extract Perl module (http://search.cpan.org/author/AUTRIJUS/Template-Extract/) does precisely this: extract a data structure from its template and output.
[ by Martin>] [permalink] [similar entries]
similar entries (vs):
similar entries (cg):
no similar entries (yet?)