I am writing a web scraper that grabs content from the decade articles on Wikipedia (e.g. the articles on the 10s, the 1970s, the 1670s BC, and so on).
I am using logic that resembles this to grab the pages:
    for (i = -1690; i <= 2010; i += 10)
        if (i < 0)
            page = (-i) + "s_BC"
        else
            page = i + "s"
        GrabContentFromURL("http://en.wikipedia.org/wiki/" + page)
This is working, except for one little detail that I hadn't considered.
The problem is that there are two 0s decades. There is a 0s AD and a 0s BC. With the way my loop currently works, the program only grabs the content from the 0s AD page.
This is a pretty simple problem, but I'm having a hard time coming up with a nice way to fix it. I know I can extract the body of the loop to a separate function and use two separate loops, but I feel like there's a more elegant way to do this that I'm missing.
How can I fix this problem without introducing too much complexity?
You mind hitting a few 404 pages along the way?
    for (i = 0; i <= 2010; i += 10)
        GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s")
        GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s_BC")
    end
If the answer to that question was "yes, I mind", then you can still toss in an if:
    for (i = 0; i <= 2010; i += 10)
        GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s")
        if (i <= 1690) // <= so that the 1690s BC page is still fetched
            GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s_BC")
        end
    end
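A minimal Python sketch of the same idea, for illustration only. The generator name `decade_pages` and its parameters are hypothetical; it yields page names instead of calling GrabContentFromURL, and it assumes the BC articles should run up to and including the 1690s BC, as in the question's loop.

```python
def decade_pages(last_bc=1690, last_ad=2010):
    """Yield Wikipedia page names for each decade: the AD page, then the BC page."""
    for year in range(0, last_ad + 1, 10):
        yield f"{year}s"
        # Only emit BC pages that are in range, so no 404s are requested.
        if year <= last_bc:
            yield f"{year}s_BC"


pages = list(decade_pages())
```

Each name in `pages` would then be appended to "http://en.wikipedia.org/wiki/" and fetched.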
If you only want one function call, how about something like:
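    // i runs over decade midpoints (-1695, ..., -5, 5, ..., 2015),
    // so -5 maps to the 0s BC and 5 maps to the 0s AD
    for (int i = -1695; i <= 2015; i += 10)
        if (i < 0)
            page = (-(i + 5)) + "s_BC";
        else
            page = (i - 5) + "s";
        GrabContentFromURL("http://en.wikipedia.org/wiki/" + page)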
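The midpoint trick can be checked with a short Python sketch. The helper name `page_for_midpoint` is hypothetical, and the sketch only builds the page names rather than fetching anything.

```python
def page_for_midpoint(mid):
    """Map a decade midpoint (e.g. -5 or 1975) to its Wikipedia page name."""
    if mid < 0:
        return f"{-(mid + 5)}s_BC"
    return f"{mid - 5}s"


# Midpoints never land on 0, so the two 0s decades get distinct loop values.
pages = [page_for_midpoint(mid) for mid in range(-1695, 2016, 10)]
```

Because the loop variable steps from -5 straight to 5, the 0s BC and the 0s AD each get their own iteration and the pages come out in chronological order.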
There is a logical problem: when i = 0, the BC branch is never run. I'd change it like so:
    for (i = -1690; i <= 2010; i += 10)
        if (i <= 0) // includes zero so will run for 0 BC
            processDecade((-i) + "s_BC")
        if (i >= 0) // not else-if so will match 0 AD after 0 BC (above)
            processDecade(i + "s")

    function processDecade(page)
        GrabContentFromURL("http://en.wikipedia.org/wiki/" + page)
Another approach is to use two loops, one over [-1690, 0] by 10 (or [1690, 0] by -10) and then one over [0, 2010] by 10. (In languages with nice sequence support, the two ranges can be joined and handled in one loop.)
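As a rough illustration of the sequence-based variant, here is a Python sketch using `itertools.chain`. It assumes the 1690s BC starting point from the question's loop, and the names `bc_pages` and `ad_pages` are made up for the example.

```python
from itertools import chain

# 1690s BC down to 0s BC, then 0s AD up to 2010s.
bc_pages = (f"{year}s_BC" for year in range(1690, -1, -10))
ad_pages = (f"{year}s" for year in range(0, 2011, 10))

# One flat, chronologically ordered sequence that a single loop can walk.
pages = list(chain(bc_pages, ad_pages))
```

Both ranges include 0, so the 0s BC and the 0s AD each appear exactly once.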
Happy coding.
In Python (this could also be translated to CoffeeScript):
    # range(N) goes from 0 to N-1, so range(170) covers 0..1690 and range(202) covers 0..2010
    for i, sign in [(j * 10, -1) for j in range(170)] + \
                   [(j * 10, 1) for j in range(202)]:
        grab_url("%d%s" % (i, "s_BC" if sign < 0 else "s"))