NAME
HTML::Untemplate - web scraping assistant
VERSION
version 0.013
DESCRIPTION
Suppose you have a set of HTML documents generated by populating the
same template with data from some kind of database. HTML::Untemplate
is a set of command-line tools ("xpathify", "untemplate") and modules
(HTML::Linear and its dependencies) that assist in retrieving the
original data.
This process is also known as wrapper induction.
To achieve this goal, HTML tree nodes are presented as XPath/content
pairs. HTML documents linearized this way can be easily inspected
manually or with a diff tool. Please refer to "EXAMPLES".
Despite being named similarly to HTML::Template, this distribution is
not directly related to it. Instead, it attempts to reverse the
templating action, whatever templating engine was used.
Why?
Suppose you have a CMS. A typical CMS works roughly like this (data
flows from top to bottom):
1. RDBMS
2. scripting language
3. HTML
4. HTTP server
5. (...)
6. HTTP agent
7. layout engine
8. screen
9. user
user
Consider the first 3 steps: "RDBMS => scripting language => HTML"
This is "applying a template".
Now, consider this: "HTML => scripting language => RDBMS"
I would call that "un-applying a template", or "untemplate" :)
The practical application of this set of tools is to assist in the
creation of web scrapers.
A similar (though completely unrelated) approach is described in the
paper XPath-Wrapper Induction for Data Extraction.
Human-readability
Consider the following HTML node address representations:
* 0.1.3.0.0.4.0.0.0.2 (HTML::TreeBuilder internal address
representation);
* "/html/body/div[4]/div/div[1]/table[2]/tr/td/ul/li[3]"
(HTML::Linear, strict);
* "//td[1]/ul[1]/li[3]" (HTML::Linear, strict, shrink);
* "/html/body[@class='section_home']/div[@id='content_holder'][1]/div[@id='content']/div[@id='main']/table[@class='content_table'][2]/tr/td/ul/li[@class='rss_content rss_content_col'][2]"
(HTML::Linear, non-strict);
* "//li[@class='rss_content rss_content_col'][2]" (HTML::Linear,
non-strict, shrink).
They all point to the same node; however, their verbosity and
readability vary. The *strict* mode specifies tag names and positions
only. Disabling *strict* adds the attributes commonly used by CSS
selectors ("id" and "class", as seen above). *Shrink* mode attempts to
find the shortest XPath unique for every node ("/html/body" is shared
among almost all nodes, thus is likely to be irrelevant).
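To make the idea behind *shrink* more concrete, here is a minimal sketch in Python. It is not the algorithm HTML::Linear implements: it only assumes that "shortest unique XPath" can be approximated by the shortest trailing run of steps that no other path in the document shares. The function name "shrink" and the sample paths are hypothetical, and splitting on "/" assumes no predicate contains a slash.

```python
def shrink(paths):
    """Map each absolute path to its shortest suffix shared by no other path.

    Assumption: paths are plain '/'-separated steps (no '/' inside predicates).
    """
    split = [p.strip("/").split("/") for p in paths]
    result = {}
    for path, steps in zip(paths, split):
        for n in range(1, len(steps) + 1):
            suffix = steps[-n:]
            # Count how many paths end with exactly this run of steps.
            matches = sum(1 for other in split if other[-n:] == suffix)
            if matches == 1:
                # A full-length suffix is still an absolute path.
                prefix = "/" if n == len(steps) else "//"
                result[path] = prefix + "/".join(suffix)
                break
        else:
            # Duplicate paths cannot be disambiguated; keep them as-is.
            result[path] = path
    return result


sample = [
    "/html/body/div[1]/p",
    "/html/body/div[2]/p",
    "/html/body/div[2]/ul/li[1]",
    "/html/body/div[2]/ul/li[2]",
]
for full, short in shrink(sample).items():
    print(full, "=>", short)
```

In this sketch the shared "/html/body" prefix disappears from every shortened path, which is the effect described above; the real tool may choose different steps to keep.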
EXAMPLES
xpathify
The xpathify tool flattens the HTML tree into a key/value list:
Hello HTML
Hello World!
This is a sample HTML
Beware!
HTML is not XML!
Have a nice day.
Becomes:
*(HTML block)*
The keys are in XPath format, while the values are respective content
from the HTML tree. Theoretically, it could be possible to reassemble
the HTML tree from the flat key/value list this tool generates.
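The flattening idea can be illustrated with a short Python sketch built on the standard "html.parser" module. This is not how xpathify is implemented, and the key format is an assumption: every step carries an explicit position, text nodes get a "/text()" suffix, and attributes are ignored. The names "Flattener" and "flatten" are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Flattener(HTMLParser):
    """Collect (xpath, text) pairs for every non-blank text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []        # open elements as (step, child tag counters)
        self.root_counts = {}  # counters for top-level elements
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        counts = self.stack[-1][1] if self.stack else self.root_counts
        counts[tag] = counts.get(tag, 0) + 1
        if tag not in VOID_TAGS:
            self.stack.append((f"{tag}[{counts[tag]}]", {}))

    def handle_endtag(self, tag):
        # Close the nearest matching open element, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0].split("[")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "".join("/" + step for step, _ in self.stack)
            self.pairs.append((path + "/text()", text))


def flatten(html):
    parser = Flattener()
    parser.feed(html)
    parser.close()
    return parser.pairs


doc = "<html><body><h1>Hello</h1><p>One</p><p>Two</p></body></html>"
for key, value in flatten(doc):
    print(key, value, sep="\t")
```

Because each key records the full position of its node, the list keeps enough structure for the tree to be rebuilt in principle, which is the point made above.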
untemplate
The untemplate tool flattens a set of HTML documents using the
algorithm from xpathify. Then, it strips the shared key/value pairs. The
"rest" is composed of original values fed into the template engine.
And this is what the result actually looks like for some simple
real-world examples (quotes 1839 and 2486 from bash.org):
*(HTML block)*
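The stripping step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the untemplate implementation: each document is taken to be already flattened into a dictionary with unique XPath keys, and a pair counts as "shared" only when the same key carries the same value in every document. The function name "strip_shared" and the sample data are hypothetical.

```python
def strip_shared(documents):
    """Drop key/value pairs present in every document; keep the rest.

    Assumption: each document is a dict mapping a unique XPath to its content.
    """
    if not documents:
        return []
    shared = set(documents[0].items())
    for doc in documents[1:]:
        shared &= set(doc.items())
    return [
        {key: value for key, value in doc.items() if (key, value) not in shared}
        for doc in documents
    ]


page_a = {
    "/html/head/title/text()": "Quote database",
    "/html/body/h1/text()": "Quote #1",
    "/html/body/p/text()": "first quote text",
}
page_b = {
    "/html/head/title/text()": "Quote database",
    "/html/body/h1/text()": "Quote #2",
    "/html/body/p/text()": "second quote text",
}
for rest in strip_shared([page_a, page_b]):
    print(rest)
```

Here the title is identical in both pages, so it is treated as part of the template and removed, while the heading and paragraph survive as the data that was fed into it.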
MODULES
These modules may be used to serialize/flatten HTML documents on your own:
* HTML::Linear - represent HTML::Tree as a flat list
* HTML::Linear::Element - represent elements to populate HTML::Linear
* HTML::Linear::Path - represent paths inside HTML::Tree
SEE ALSO
* Wrapper (data mining)
* XPath-Wrapper Induction for Data Extraction
* HTML::TreeBuilder
* HTML::Similarity
* XML::DifferenceMarkup
AUTHOR
Stanislaw Pusep
COPYRIGHT AND LICENSE
This software is copyright (c) 2012 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.