Is there an elegant way to prevent my program from skipping a decade?

devze.com (https://www.devze.com), 2023-04-10 09:10, source: the web
I am writing a web scraper that grabs content from Wikipedia's decade articles (e.g., the articles on the 10s, the 1970s, the 1670s BC, and so on).

I am using logic that resembles this to grab the pages.

for (i = -1690; i <= 2010; i += 10)
    if (i < 0)
        page = (-i) + "s_BC"
    else
        page = i + "s"
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + page)

This is working, except for one little detail that I hadn't considered.

The problem is that there are two 0s decades. There is a 0s AD and a 0s BC. With the way my loop currently works, the program only grabs the content from the 0s AD page.

This is a pretty simple problem, but I'm having a hard time coming up with a nice way to fix it. I know I can extract the body of the loop into a separate function and use two separate loops, but I feel like there's a more elegant way to do this that I'm missing.

How can I fix this problem without introducing too much complexity?


Do you mind hitting a few 404 pages along the way?

for (i = 0; i <= 2010; i+=10)
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s")
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s_BC")
end

If the answer to that question was "yes, I mind" then you can still toss in some ifs:

for (i = 0; i <= 2010; i+=10)
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s")
    if (i <= 1690) // include the 1690s BC, the earliest decade in the original loop
        GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s_BC")
end


If you only want one function call, how about something like:

for (int i = -1695; i <= 2015; i += 10)
    if (i < 0)
        page = (- (i + 5)) + "s_BC";
    else
        page = (i - 5) + "s";
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + page)
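Shifting the counter by 5 means i is never exactly 0, so -5 maps to the 0s BC and 5 maps to the 0s AD. A minimal Python sketch of the same idea, where page_name is a hypothetical helper (not part of the original code) that only builds the page names and fetches nothing:

```python
def page_name(i):
    # i is an offset counter that never equals 0:
    # -1695, -1685, ..., -5, 5, 15, ..., 2015
    if i < 0:
        return "%ds_BC" % -(i + 5)
    return "%ds" % (i - 5)

# range() excludes its stop value, so 2016 makes 2015 the last counter value.
pages = [page_name(i) for i in range(-1695, 2016, 10)]
```

Each name can then be appended to "http://en.wikipedia.org/wiki/" and passed to the existing fetch function.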


There is a logical problem: when i = 0, the "BC branch" is never run. I'd change it like so:

for (i = -1690; i <= 2010; i+= 10)
    if (i <= 0) // includes zero, so this will run for the 0s BC
      processDecade((-i) + "s_BC")
    if (i >= 0) // not else-if, so this will match the 0s AD after the 0s BC (above)
      processDecade(i + "s")

function processDecade (page)
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + page)

Another approach is to use two loops, one from [-1690, 0] by 10 (or [1690, 0] by -10) and then one from [0, 2010] by 10. (In languages with nice sequence support, this is easy to do in one loop.)
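As a rough sketch of that single-loop idea in Python, two ranges can be chained into one sequence that counts down from the 1690s BC to the 0s BC and then up from the 0s to the 2010s. The name decade_pages and its parameters are made up for illustration, and the actual fetch call is left out:

```python
from itertools import chain

def decade_pages(first_bc=1690, last_ad=2010):
    # 1690s BC down to 0s BC (stop value -1 is excluded, so 0 is included)
    bc = ("%ds_BC" % d for d in range(first_bc, -1, -10))
    # 0s up to 2010s
    ad = ("%ds" % d for d in range(0, last_ad + 1, 10))
    return chain(bc, ad)

# One loop (here a comprehension) visits both "0s_BC" and "0s".
urls = ["http://en.wikipedia.org/wiki/" + page for page in decade_pages()]
```

Because both ranges include 0, no special case is needed for the two 0s decades.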

Happy coding.


In Python (this could also be translated to CoffeeScript):

# BC decades run from the 0s BC to the 1690s BC, AD decades from the 0s to the 2010s
for i, sign in [(j * 10, -1) for j in range(170)] + \
               [(j * 10, 1) for j in range(202)]:  # range(N) goes from 0 to N-1
    grab_url("%d%s" % (i, "s_BC" if sign < 0 else "s"))