Webcrawler - Fetch links

开发者 https://www.devze.com 2023-03-26 04:38 Source: web

I'm trying to crawl a webpage, collect all of its links, and add them to a list<string> that the function returns at the end.

My code:

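open System.Net
open HtmlAgilityPack

let getUrls (s : string) : seq<string> =
    let doc = HtmlDocument()
    doc.LoadHtml s

    // Note: SelectNodes returns null when no node matches the XPath
    doc.DocumentNode.SelectNodes "//a[@href]"
    |> Seq.map (fun z -> z.Attributes.["href"].Value)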

let crawler uri : seq<string> =
    let rec crawl url =
      let web = new WebClient() 
      let data = web.DownloadString url
      getUrls data |> Seq.map crawl (* <-- ERROR HERE *)

    crawl uri

The problem is on the last line of the crawl function (the getUrls data |> Seq.map crawl line), where the compiler reports:

Type mismatch. Expecting a string -> 'a but given a string -> seq<'a>. The resulting type would be infinite when unifying ''a' and 'seq<'a>'


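crawl maps itself over every URL it finds, so its result would have to be a sequence of its own results (seq<seq<...>> without end), whereas it is expected to return seq<string>. I think you want something like: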

let crawler uri =
  let rec crawl url =
    seq {
      let web = new WebClient() 
      let data = web.DownloadString url
      for url in getUrls data do
        yield url
        yield! crawl url
    }
  crawl uri

Adding a type annotation to crawl should point out the issue.
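
One caveat with the recursive sequence above: pages often link back to each other, so the recursion may never finish on a real site. Below is a minimal sketch of the same seq / yield! shape with a visited set. It assumes a hypothetical fetch function standing in for getUrls (web.DownloadString url), and the fakeWeb table is made up so the example runs without a network.

```fsharp
open System.Collections.Generic

// Hypothetical stand-in for the web: page name -> links found on that page.
let fakeWeb =
    dict [
        "a", [ "b"; "c" ]
        "b", [ "a" ]        // links back to "a", i.e. a cycle
        "c", []
    ]

// Hypothetical replacement for `getUrls (web.DownloadString url)`.
let fetchLinks (url : string) : seq<string> =
    match fakeWeb.TryGetValue url with
    | true, links -> Seq.ofList links
    | _ -> Seq.empty

let crawler (fetch : string -> seq<string>) (uri : string) : seq<string> =
    let rec crawl (visited : HashSet<string>) (url : string) : seq<string> =
        seq {
            // HashSet.Add returns false when the URL has already been seen,
            // in which case this branch yields nothing.
            if visited.Add url then
                for link in fetch url do
                    yield link
                    yield! crawl visited link
        }
    // A fresh visited set is created each time the sequence is enumerated.
    seq { yield! crawl (HashSet<string>()) uri }

let found = crawler fetchLinks "a" |> List.ofSeq
```

With the table above, found lists each discovered link in depth-first order and stops when it meets "a" again instead of looping. A link can still appear more than once in the output; only the fetching is deduplicated.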


I think something like this:

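let crawler (uri : string) =
    // Visits every linked page recursively; it does not collect the URLs
    let rec crawl (url : string) : unit =
        use web = new WebClient()
        let data = web.DownloadString url
        getUrls data 
        |> Seq.toList
        |> function
            | h :: t -> 
                crawl h
                t |> List.iter crawl
            | _ -> ()

    crawl uri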


In order to fetch links:

    open System.Net
    open System.IO
    open System.Text.RegularExpressions

    type Url(x:string)=
     member this.tostring = sprintf "%A" x
     member this.request  = System.Net.WebRequest.Create(x)
     member this.response = this.request.GetResponse()
     member this.stream   = this.response.GetResponseStream()
     member this.reader   = new System.IO.StreamReader(this.stream)
     member this.html     = this.reader.ReadToEnd()

    let linkex                = "href=\s*\"[^\"h]*(http://[^&\"]*)\""

    let getLinks (txt:string) = [ 
                                 for m in Regex.Matches(txt,linkex) 
                                 -> m.Groups.Item(1).Value 
                                 ]

    let collectLinks (url:Url) =   url.html
                                |> getLinks
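
To try the pattern without touching the network, it can be run against a literal string. This is only a sketch: the sample HTML is made up for illustration, and the pattern is the linkex from above rewritten as a verbatim string.

```fsharp
open System.Text.RegularExpressions

// Same pattern as `linkex`, as a verbatim string: backslashes are literal
// and "" stands for a single double quote.
let linkex = @"href=\s*""[^""h]*(http://[^&""]*)"""

let getLinks (txt : string) =
    [ for m in Regex.Matches(txt, linkex) -> m.Groups.[1].Value ]

// Made-up sample page, for illustration only.
let sample =
    """<a href="http://example.com/a">A</a>
<a href="/relative">relative</a>
<a href="https://example.com/secure">secure</a>
<a href="http://example.com/b?x=1&y=2">B</a>"""

let links = getLinks sample
```

On this sample only http://example.com/a is returned. The pattern requires a literal http:// prefix, so relative and https links are skipped, and because the captured group excludes & while a closing quote must follow it, a URL containing & does not match at all.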