I am trying to generate a plain text file containing a list of words that is on a webpage. The problem is that the list is divided into multiple pages.
http://www.whonamedit.com/eponyms/A/?start=50&maxrows=25
This is what I mean. For the letter A, for example, I need all 13 pages of words, and I also need every letter of the alphabet.
I was thinking of maybe modifying a web crawler to do this task; would that be the easiest way?
I prefer Java, but Python is ok.
Sorry if the answer is obvious, but any nudges in the right direction would be SO GREATLY appreciated!!
Assuming this is specifically for the whonamedit website, you can do the following:
List<String> getWordsOnPage(String url) {
    // Read the words within the <ul class="result-list"> element.
}

List<String> getAllWords() {
    List<String> all = new ArrayList<String>();
    for (char letter = 'A'; letter <= 'Z'; ++letter) {
        for (int start = 0; true; start += 25) {
            List<String> page = getWordsOnPage("http://www.whonamedit.com/eponyms/" + letter + "/?start=" + start + "&maxrows=25");
            if (page.isEmpty()) {
                break;
            }
            all.addAll(page);
        }
    }
    return all;
}
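For the getWordsOnPage part, here is a minimal, self-contained sketch of what the extraction could look like using only the JDK's regex support. The class name EponymScraper, the helper extractWords, and the exact markup shape (an anchor inside each list item) are assumptions; the real page's HTML may differ, and a proper HTML parser such as Jsoup or HtmlUnit would be more robust than a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EponymScraper {

    // Hypothetical extraction: pulls the text of each <a> found inside an <li>.
    // Assumes the result list's items look like <li><a href="...">Word</a>...</li>.
    static List<String> extractWords(String html) {
        List<String> words = new ArrayList<String>();
        Matcher m = Pattern.compile("<li[^>]*>\\s*<a[^>]*>([^<]+)</a>").matcher(html);
        while (m.find()) {
            words.add(m.group(1).trim());
        }
        return words;
    }

    public static void main(String[] args) {
        // Hardcoded sample in the assumed markup shape, standing in for a fetched page.
        String sample = "<ul class=\"result-list\">"
                + "<li><a href=\"/doctor/1/\">Aagenaes syndrome</a></li>"
                + "<li><a href=\"/doctor/2/\">Aarskog syndrome</a></li>"
                + "</ul>";
        System.out.println(EponymScraper.extractWords(sample));
    }
}
```

You would still need to fetch each page's HTML (for example with java.net.URL) and feed it to this method; once a page yields an empty list, the outer loop above stops for that letter.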
I use HtmlUnit to write spiders.