I'm scraping some data from the frontpages of a list of website domains. Som开发者_高级运维e of them are not answering, or are very slow, causing the scraper to halt.
I wanted to solve this by using a timeout. The various HTTP libraries available don't seem to support that, but System.Timeout.timeout seems to do what I need.
Indeed, it seems to work fine when I test the scraping function, but it crashes as soon as I run the enclosing function: (Sorry for bad/ugly code. I'm learning.)
fetchPage domain =
-- Try to read the file from disk.
catch
(System.IO.Strict.readFile $ "page cache/" ++ domain)
(\e -> downloadAndCachePage domain)
downloadAndCachePage domain =
catch
(do
-- Failed, so try to download it.
-- This craches when called by fetchPage, but works fine when called from directly.
maybePage <- timeout 5000000 (simpleHTTP (getRequest ("http://www." ++ domain)) >>= getResponseBody)
let page = fromMaybe "" maybePage
-- This mostly works, but wont timeout if the domain is slow. (lswb.com.cn)
-- page <- (simpleHTTP (getRequest ("http://www." ++ domain)) >>= getResponseBody)
-- Cache it.
writeFile ("page cache/" ++ domain) page
return page)
(\e -> catch
(do
-- Failed, so just fuggeddaboudit.
writeFile ("page cache/" ++ domain) ""
return "")
(\e -> return "")) -- Failed BIG, so just don't give a crap.
downloadAndCachePage works fine with the timeout, when called from the repl, but fetchPage crashes. If I remove the timeout from downloadAndCachePage, fetchPage will work.
Anyone who can explain this, or know an alternative solution?
Your catch handler in fetchPage looks wrong -- it seems you're trying to read a file, and on file not found exception are directly calling into your http function from the exception handler. Don't do this. For complicated reasons, as I recall, code in exception handlers doesn't always behave like normal code -- particularly when it attempts to handle exceptions itself. And indeed, under the covers, timeout uses asynchronous exceptions to kill threads.
In general, you should put as little code as possible in exception handlers, and especially not put code that tries to handle further exceptions (although it is generally fine to reraise a handled exception to "pass it on" [as with bracket
]).
That said, even if you're not doing the right thing, a crash (if it is a segfault type crash as opposed to a <<loop>>
type crash), even from weird code, is nearly always wrong behavior from GHC, and if you're on GHC 7 then you should consider reporting this.
精彩评论