implement multithreaded crawler

I would like to implement a multithreaded crawler using the single-threaded crawler code I have now. Basically, I read the URLs from a text file, take each one, and crawl and parse it. I know the basics of creating a thread and assigning a procedure to it, but I'm not sure how to implement it in the following way:

I need at least 3 threads and need to assign a URL to each thread from a list of URLs; each thread then needs to go and fetch its page and parse it before adding the contents to a database.

Dim gthread, tthread, ithread As Thread

        gthread = New Thread(AddressOf processUrl)
        gthread.Start(url)

        tthread = New Thread(AddressOf processUrl)
        tthread.Start(url)

        ithread = New Thread(AddressOf processUrl)
        ithread.Start(url)

WaitUntilAllAreOver:

        ' poll until all three threads have finished
        If gthread.ThreadState = ThreadState.Running _
           OrElse tthread.ThreadState = ThreadState.Running _
           OrElse ithread.ThreadState = ThreadState.Running Then
            Thread.Sleep(5)
            GoTo WaitUntilAllAreOver
        End If

'etc..

Now, the code may not make sense, but what I need to do is assign a unique URL to each thread to go and process.

Any ideas appreciated


The best way to wait for the Thread instances to finish is to call the .Join method. Take the following example:

Public Sub ParseAll(ByVal ParamArray urls As Uri())
  Dim list As New List(Of Thread)
  For Each url As Uri In urls
    ' give each thread its own URL via the ParameterizedThreadStart overload
    Dim worker As New Thread(AddressOf ProcessUrl)
    worker.Start(url)
    list.Add(worker)
  Next
  For Each worker As Thread In list
    worker.Join()   ' block until that thread has finished
  Next
End Sub

Though you may want to consider using the ThreadPool here. The ThreadPool is designed for spawning off lots of small tasks very efficiently.
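
For illustration, here is a minimal sketch of the ThreadPool approach. It assumes a ProcessUrl routine like the one in the question (taking an Object argument), and .NET 4.0 or later for CountdownEvent and statement lambdas; ParseAllPooled is just an illustrative name.

Imports System.Threading
Imports System.Collections.Generic

Public Sub ParseAllPooled(ByVal urls As Uri())
    Using remaining As New CountdownEvent(1)
        For Each url As Uri In urls
            remaining.AddCount()
            ThreadPool.QueueUserWorkItem(
                Sub(state)
                    Try
                        ProcessUrl(state)       ' crawl and parse one URL (assumed routine)
                    Finally
                        remaining.Signal()      ' this work item is done
                    End Try
                End Sub, url)
        Next
        remaining.Signal()                      ' release the initial count
        remaining.Wait()                        ' block until every URL has been processed
    End Using
End Sub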


You could use a synchronized Queue that you push the URLs into; every crawler thread takes the next URL it should visit out of this Queue. When the crawlers detect new URLs, they push them into the Queue too.
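
A rough sketch of that idea follows. The class and method names are illustrative, ProcessUrl is assumed to hold the question's crawl/parse/database code, and the worker loop is simplified: a worker stops as soon as it sees the queue empty, even if other workers might still discover new URLs.

Imports System.Threading
Imports System.Collections.Generic

Public Class CrawlerQueue
    Private ReadOnly _pending As New Queue(Of Uri)()
    Private ReadOnly _lock As New Object()

    ' New URLs discovered while crawling are pushed here as well.
    Public Sub Enqueue(ByVal url As Uri)
        SyncLock _lock
            _pending.Enqueue(url)
        End SyncLock
    End Sub

    ' Returns False when the queue is empty.
    Private Function TryDequeue(ByRef url As Uri) As Boolean
        SyncLock _lock
            If _pending.Count = 0 Then Return False
            url = _pending.Dequeue()
            Return True
        End SyncLock
    End Function

    ' Each worker keeps taking the next URL until the queue runs dry.
    Private Sub WorkLoop()
        Dim url As Uri = Nothing
        While TryDequeue(url)
            ProcessUrl(url)   ' crawl, parse, store - as in the question
        End While
    End Sub

    Public Sub RunWorkers(ByVal workerCount As Integer)
        Dim workers As New List(Of Thread)
        For i As Integer = 1 To workerCount
            Dim t As New Thread(AddressOf WorkLoop)
            t.Start()
            workers.Add(t)
        Next
        For Each t As Thread In workers
            t.Join()          ' wait for every worker to finish
        Next
    End Sub

    Private Sub ProcessUrl(ByVal url As Uri)
        ' placeholder for the question's crawl/parse/database code
    End Sub
End Class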


I recommend using a BackgroundWorker to accomplish this.
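
As a minimal sketch, something like the following could work; StartCrawl is a hypothetical name, ProcessUrl is assumed to accept an Object argument as in the question, and the lambda handlers need VB 2010 or later.

Imports System.ComponentModel

Public Sub StartCrawl(ByVal url As Uri)
    Dim worker As New BackgroundWorker()
    AddHandler worker.DoWork,
        Sub(sender, e) ProcessUrl(e.Argument)        ' runs on a thread-pool thread
    AddHandler worker.RunWorkerCompleted,
        Sub(sender, e) Console.WriteLine("Finished " & url.ToString())  ' completion notification (raised on the UI thread in a WinForms/WPF app)
    worker.RunWorkerAsync(url)                       ' pass this worker its own URL
End Sub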


Look into the Concurrency and Coordination Runtime (CCR). I have built a few crawlers based on that framework, and it makes things very easy once you understand how the CCR works.

It should take you a few hours to get up to speed with the CCR.
