I took some code from "Expert F# 2.0" that shows how to build a web crawler using MailboxProcessor. As you can see, I have added a printfn call that prints the current number of URLs in the visited Set. The number of URLs to crawl is also capped: the loop stops once visited.Count reaches 50.

open System
open System.Net
open System.Text.RegularExpressions
open Microsoft.FSharp.Control.WebExtensions

let getLinks (txt:string) =
    [ for m in Regex.Matches(txt, "href=\s*\"[^\"h]*(http://[^&\"]*)\"") -> m.Groups.Item(1).Value ]

let collectLinks (url:string) =
    async { let web = new WebClient()
            let! data = web.AsyncDownloadString <| Uri url
            let links = getLinks data
            return links }

let urlCollector =
    MailboxProcessor.Start(fun self ->
        let rec waitForUrl (visited : Set<string>) =
            async { // Checks whether we have reached the limit of pages to crawl
                    if visited.Count < 50 then
                        // Waits for a URL...
                        let! url = self.Receive()
                        printfn "%A | %A" visited.Count url
                        // If not the URL already has been crawled...
                        if not (visited.Contains url) then
                            // Start
                            do! Async.StartChild(
                                    async { let! links = collectLinks url
                                            Seq.iter self.Post links}) |> Async.Ignore
                        return! waitForUrl (visited.Add url) }
        waitForUrl Set.empty)

urlCollector.Post "http://news.google.com/"

That seems alright, eh? But the output looks like this:
0 | "http://news.google.com/"
1 | "http://www.gstatic.com/news/img/favicon.ico"
2 | "http://mail.google.com/mail/?tab=nm"
3 | "http://www.google.com/intl/en/options/"
4 | "http://docs.google.com/?tab=no"
5 | "http://www.google.com/reader/?tab=ny"
6 | "http://sites.google.com/?tab=n3"
7 | "http://www.google.com/intl/en/options/"
7 | "http://www.google.com/preferences?hl=en"
8 | "http://www.guardian.co.uk/uk/2011/aug/07/tottenham-riots-police-had-not-anticipated-violence"
9 | "http://www.bloomberg.com/news/2011-08-07/london-rioters-clash-with-police-loot-in-tottenham-after-shooting-death.html"
10 | "http://www.hindustantimes.com/Rioters-battle-police-after-shooting-protest/Article1-730371.aspx"
11 | "http://www.telegraph.co.uk/news/uknews/crime/8687177/Tottenham-riot-live.html"
12 | "http://www.guardian.co.uk/uk/2011/aug/07/tottenham-riots-police-had-not-anticipated-violence"
12 | "http://www.montrealgazette.com/London+wakes+riot+aftermath/5218849/story.html"
13 | "http://themediablog.typepad.com/the-media-blog/2011/08/daily-mail-tottenham-violence-twitter.html"
14 | "http://en.wikipedia.org/wiki/2011_Tottenham_riots"
15 | "http://www.babnet.net/festivaldetail-37897.asp"
16 | "http://www.youtube.com/watch?v=l9UImSbegj4"
17 | "http://www.babnet.net/festivaldetail-37897.asp"
17 | "http://www.youtube.com/watch?v=l9UImSbegj4"
17 | "http://www.telegraph.co.uk/news/uknews/crime/8687177/Tottenham-riot-live.html"
17 | "http://www.telegraph.co.uk/news/uknews/crime/8687177/Tottenham-riot-live.html"
17 | "http://www.guardian.co.uk/uk/2011/aug/07/tottenham-riots-police-had-not-anticipated-violence"
17 | "http://www.guardian.co.uk/uk/2011/aug/07/tottenham-riots-police-had-not-anticipated-violence"
17 | "http://www.bbc.co.uk/news/uk-14436001"
18 | "http://www.bbc.co.uk/news/uk-14436001"
18 | "http://www.kbc.co.ke/news.asp?nid=71755"
19 | "http://www.kbc.co.ke/news.asp?nid=71755"
19 | "http://news.sky.com/skynews/Home/UK-News/Tottenham-Riots-Simmering-Anger-Erupts-In-North-London-After-Protest-At-Mans-Shooting-Death/Article/201108116045172?f=rss"
20 | "http://news.sky.com/skynews/Home/UK-News/Tottenham-Riots-Simmering-Anger-Erupts-In-North-London-After-Protest-At-Mans-Shooting-Death/Article/201108116045172?f=rss"
20 | "http://www.irishtimes.com/newspaper/breaking/2011/0807/breaking2.html?via=mr"
21 | "http://www.irishtimes.com/newspaper/breaking/2011/0807/breaking2.html?via=mr"
21 | "http://www.cbc.ca/news/world/story/2011/08/07/tottenham-riot.html"
22 | "http://www.cbc.ca/news/world/story/2011/08/07/tottenham-riot.html"
22 | "http://www.newsday.com/news/police-officer-hospitalized-7-injured-in-uk-riot-1.3079769"
23 | "http://www.newsday.com/news/police-officer-hospitalized-7-injured-in-uk-riot-1.3079769"
23 | "http://www.msnbc.msn.com/id/44049721/ns/world_news-europe/"
24 | "http://www.msnbc.msn.com/id/44049721/ns/world_news-europe/"
24 | "http://www.timeslive.co.za/world/2011/08/07/eight-london-police-hospitalised-after-riots"
25 | "http://www.timeslive.co.za/world/2011/08/07/eight-london-police-hospitalised-after-riots"
25 | "http://www.cnn.com/2011/WORLD/europe/08/07/uk.riots/"
26 | "http://www.cnn.com/2011/WORLD/europe/08/07/uk.riots/"
26 | "http://www.dailymail.co.uk/news/article-2023348/Tottenham-anarchy-Grim-echo-1985-Broadwater-farm-riot.html"
27 | "http://www.dailymail.co.uk/news/article-2023348/Tottenham-anarchy-Grim-echo-1985-Broadwater-farm-riot.html"
27 | "http://www.mirror.co.uk/news/top-stories/2011/08/06/tottenham-riot-protesters-torch-police-cars-shops-and-a-bus-115875-23325724/"
28 | "http://www.mirror.co.uk/news/top-stories/2011/08/06/tottenham-riot-protesters-torch-police-cars-shops-and-a-bus-115875-23325724/"
28 | "http://www.theglobeandmail.com/news/world/images-of-the-destruction-from-londons-tottenham-riots/article2122026/"
29 | "http://www.theglobeandmail.com/news/world/images-of-the-destruction-from-londons-tottenham-riots/article2122026/"
29 | "http://thelede.blogs.nytimes.com/2011/08/06/shops-and-cars-burn-in-anti-police-riot-in-london/"
30 | "http://thelede.blogs.nytimes.com/2011/08/06/shops-and-cars-burn-in-anti-police-riot-in-london/"
30 | "http://www.stuff.co.nz/world/5403614/Crowds-attack-police-after-UK-protest"
31 | "http://www.stuff.co.nz/world/5403614/Crowds-attack-police-after-UK-protest"
31 | "http://www.google.com/hostednews/afp/article/ALeqM5jOCV_DVSYR1S50v6vdSBjsR5H9Jw?docId=CNG.36dce69df0a155bfd2fa1a3a5f92f6e1.5c1"
32 | "http://www.google.com/hostednews/afp/article/ALeqM5jOCV_DVSYR1S50v6vdSBjsR5H9Jw?docId=CNG.36dce69df0a155bfd2fa1a3a5f92f6e1.5c1"
32 | "http://fallenscoop.com/16993/tottenham-riot-2011-north-london-burns-after-protest-of-mark-duggan"
33 | "http://fallenscoop.com/16993/tottenham-riot-2011-north-london-burns-after-protest-of-mark-duggan"
33 | "http://www.thedailybeast.com/cheats/2011/08/07/riots-grip-north-london.html"
34 | "http://www.thedailybeast.com/cheats/2011/08/07/riots-grip-north-london.html"
34 | "http://www.thehindu.com/news/article2333142.ece"
35 | "http://www.sfgate.com/cgi-bin/article.cgi?f=/g/a/2011/08/07/bloomberg1376-LPHCT11A1I4H01-3ULNPF643I4ERSIU09MO54CQ4B.DTL"
36 | "http://online.wsj.com/community/groups/question-day-229/topics/do-you-agree-sps-decision?commentid=2864110"
37 | "http://www.businessweek.com/ap/financialnews/D9OUMJVO1.htm"
38 | "http://www.cnn.com/2011/BUSINESS/08/06/global.economy.cnn/"
39 | "http://www.chicagotribune.com/news/opinion/editorials/ct-edit-credit-20110806,0,6468631.story"
40 | "http://www.foxbusiness.com/markets/2011/08/07/treasury-hits-back-against-sp-downgrade/"
41 | "http://en.wikipedia.org/wiki/Standard_%26_Poor%27s"
42 | "http://www.usatoday.com/money/companies/management/2011-08-07-verizon-strike_n.htm"
43 | "http://www.businessweek.com/ap/financialnews/D9OV028O3.htm"
44 | "http://www.nbcnewyork.com/news/local/Verizon-Workers-Demonstrate-in-Manhattan-Part-of-45K-Worker-Strike-127087478.html"
45 | "http://www.poughkeepsiejournal.com/article/20110807/NEWS03/110807003/45K-Verizon-workers-strike-over-new-labor-contract-?odyssey=tab%7Ctopnews%7Ctext%7CPoughkeepsieJournal.com"
46 | "http://www.nypost.com/p/news/national/verizon_hit_by_strike_Ga9JjKphZrKCEAr608bqkI"
47 | "http://www.nytimes.com/2011/08/07/us/07verizon.html"
48 | "http://www.ctv.ca/CTVNews/World/20110807/afghanistan-helicopter-crash-fighting-110807/"
49 | "http://abcnews.go.com/International/nato-crash-team-seal-members-killed-afghanistan/story?id=14249189"
What's up with all the duplicates? Also, why do some of them print the same count for the visited Set (like 17, 33, 34, etc.)? I'm pretty sure I'm missing something fundamental, but I can't figure out what.
In your snippet, the printfn call runs before you check whether the URL is already present in the set. This means the URL is printed even if it will not be added in the next step. (You can see that it wasn't added by looking at the numbers in the left column: if the count wasn't incremented, the number on the next line is the same.)
Moving printfn into the body of the if expression should give the expected results:

    // Waits for a URL...
    let! url = self.Receive()
    // If not the URL already has been crawled...
    if not (visited.Contains url) then
        printfn "%A | %A" visited.Count url
        // Start
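
To see the effect without touching the network, here is a minimal sketch that replaces the MailboxProcessor with a plain fold over a list of incoming URLs. The function name crawlOrder and the example.com-style URLs are made up for illustration and are not part of the book's code; the only point is that logging inside the "not visited" branch produces one line per unique URL, with a count that always increases.

```fsharp
// Hypothetical stand-in for the mailbox loop: walks the URLs in arrival
// order and records a log line only for URLs not yet in the visited set.
let crawlOrder (incoming : string list) =
    let step (visited : Set<string>, log : string list) (url : string) =
        if visited.Contains url then
            // Duplicate: nothing is logged and the set is left unchanged.
            (visited, log)
        else
            // New URL: log the count *before* adding, as in the original printfn.
            let line = sprintf "%d | %s" visited.Count url
            (visited.Add url, line :: log)
    let (visited, log) = List.fold step (Set.empty, []) incoming
    (visited, List.rev log)

// Made-up input containing one duplicate.
let visited, log =
    crawlOrder [ "http://a.example/"
                 "http://b.example/"
                 "http://a.example/"
                 "http://c.example/" ]

log |> List.iter (printfn "%s")
// 0 | http://a.example/
// 1 | http://b.example/
// 2 | http://c.example/
```

If the sprintf line were moved above the if, the duplicate "http://a.example/" would be logged a second time with the count 2, and "http://c.example/" would then repeat that same 2, which is the pattern you saw in your output.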