I am indexing a set of websites which have a very large number of pages (tens of millions) that are generated from a small number of templates. I am looking for an algorithm to learn the templates that the pages were generated from and to match templates to pages so that I need to store only the variable part and a template reference for each page fetched.
The algorithm need not produce the greatest compression possible, but it should hopefully become better as it sees more pages and it should behave gracefully when faced with a page generated using a previously unseen template.
I would greatly appreciate any references to literature or existing libraries.
I could run a general purpose compression algorithm on batches of pages. The reason I do not want to do that is that the data of interest to me would be in the variable part of the pages and so the template approach would allow me to retrive it without uncompressing. I want to able to recreate the full page if needed both to ensure future replicability and 开发者_如何转开发to guard against bugs in my scraping program.
In some circles, this problem is known as "HTML Wrapper Induction" or "Wrapper Learning". In this paper you can find an interesting -- albeit old -- review along with links to some commercial applications: http://www.xrce.xerox.com/Research-Development/Historical-projects/IWRAP-Intelligent-Wrapper-Learning-Tools)
You may be interested in this Python library: http://code.google.com/p/templatemaker/ "Well, say you want to get the raw data from a bunch of Web pages that use the same template -- like restaurant reviews on Yelp.com, for instance. You can give templatemaker an arbitrary number of HTML files, and it will create the "template" that was used to create those files." (http://www.holovaty.com/writing/templatemaker/)
Also, another Python library called scrapy seems to have a wrapper induction library: http://dev.scrapy.org/wiki/Scrapy09Changes#Addedwrapperinductionlibrary
I can't tell much about the algorithms, though. If you want to implement one yourself, this looks like a good starting point: http://portal.acm.org/citation.cfm?id=1859138 It features both wrapper induction and online learning, so you can start to classify pages as you continue in the crawling process.
Have you considered clone detection? These tools determine how copy-and-pasted-code has been reused repeatedly, which sound a lot like what you want. Surely these pages have been created this way, or generated as "template instances" perhaps. Such clone detectors figure out the commonalities automatically.
Some clone detectors match maximal-length identical substrings. The problem with this that trivial differences in the web structure (extra whitespace, newlines, HTML comments) prevent many perfectly valid matches, so you miss cases. This also has the defect that it only finds identical strings, not templates with variation points. Algorithms that do induction over strings (other poster's answer) may suffer from these issues.
What you really want are matches over the structures that make up the languages (of the web pages).
Many clone detectors ("token-based") find common token code sequences that vary in at most one token. These can often ignore the whitespace, by virtual of using a lanugage specific lexer. What you get are templates whose variable points are single tokens.
In my experience, variations are more often language substructures (expressions, statement sequences, ...) and so you want you clone detection based on that. Our CloneDR tool finds clones using langauge grammars to drive the process, that is, it does detection on abstract syntax trees obtained by parsing the pages. (Several key clone detection papers including the one on how CloneDR works are listed on that Wikipedia page).
In your case, the language grammar is going to be a mixture of the langauges that make your web pages, e.g. HTML, JavaScript, and whatever dynamic langauge you used to generate web text dynamically. (We have clone detectors for dirty HTML, JavaScript, C#, Java, JSP, PHP, [not one yet for Perl but close!]...). You can see some clone-detection reports for different languages (unfortunately, not one for HTML) at the link.
The CloneDR results are exactly the commonality (templates), the variation points, and how the variation points differ.
It seems to me that, since HTML is in essence the structured representation of a web page, your best approach would be to ignore textual content altogether and concentrate on identifying both similar and repeating page structures.
Due to the design requirements of a single web site, each of its pages will contain the same group of feature substructures--such as menus, side bars, breadcrumbs, titles, footers, etc.--so that, at a certain depth, all pages within the same web site will be structurally the same and below this depth, page structures will differ. By identifying this "similarity depth", you should be able to isolate those portions of a page which are structurally invariate across the entire site corpus.
Then, for pages which present varying data in the same format, like product descriptions or other data-backed pages, the entire structure of the page will be the same, only varying in textual content. By masking off the invariant features, those shared by all pages, you should be able to reduce the page to only the portion or portions of interest.
The rest is just normalizing HTML (using HTMLTidy) and comparing DOM trees.
Given a large number of pages that use a single template, we would expect to find that the longest common subsequence (LCS) of these pages corresponds closely to the template "shell". (If we only have 2 or 3 pages, then letters in the text that happen to appear in the same order in each will creep into the LCS as well -- that isn't a showstopper however.)
Unfortunately finding the LCS of k sequences requires time expontential in k, however it's possible to produce an approximation by computing LCS(a, LCS(b, LCS(c, ...)), where each LCS() operation on 2 sequences of length n takes O(n^2) time. In fact I expect this approximation to do a better job at weeding out spurious textual subsequences than solving the problem exactly would.
Up to this point I've talked about a hypothetical situation in which all of the files use the same template. But the problem we have is that there are multiple templates. To address this, I propose a clustering algorithm in which we build up clusters of files as we go. For each cluster we will maintain the LCS of all files so far included in that cluster, computed using the pairwise approximation given above; call this clcs[i]
for the ith cluster. For each file in turn, for each cluster i
we compute the LCS of the file with clcs[i]
, and compare the length of this LCS with the length of clcs[i]
. If this length is not much less than the length of clcs[i]
, then this file is "a good fit", so it is added to cluster i
and the LCS just computed becomes the new clcs[i]
. If no existing cluster produces a good-enough fit for the file, then a new cluster is created, containing just this file (and its LCS, which is equal to the file itself).
Regarding "not much less than": this is a weighting factor that will need some tweaking. Obviously when a new cluster is freshly born and contains just a single file, we would expect that another file produced using the same template will have an LCS with it that is quite a bit shorter than that cluster's LCS so far, so we should tolerate quite a drop in LCS length. As the cluster grows in size, its overall LCS should stabilise at the template "shell", so we should be less inclined to add a new file to the cluster if it decreases the LCS length considerably.
This algorithm will in general produce a different set of clusters if files are presented in a different order, so it may be a good idea to try a few different orders and check that the same clusters appear.
Finally, to extract the relevant, varying information from a given file in a cluster, the following algorithm works:
i = 0
for (j = 0 to len(file)) {
if (file[j] == lcs[i]) {
++i;
} else {
output file[j]
}
}
精彩评论