Queue(maxsize=) not working?

I've implemented some threading into a project I've been working on in another thread, but the comments and questions have grown way off topic of the original post, so I figured the best thing to do was to make a new question. The problem is this: I want my program to stop iterating over a while loop after a number of iterations specified on the command line. I'm passing Queue.Queue(maxsize=10) in the following segments of code:

import sys
import Queue      # Python 2's stdlib queue module
import threading

THREAD_NUMBER = 5

def main():
    queue = Queue.Queue(maxsize=sys.argv[2])
    mal_urls = set(make_mal_list())

    for i in xrange(THREAD_NUMBER):
        crawler = Crawler(queue, mal_urls)
        crawler.start()

    queue.put(sys.argv[1])
    queue.join()
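
For reference, this is how a bounded queue behaves when maxsize really is an integer (a minimal standalone sketch, not part of my actual code):

import Queue

q = Queue.Queue(maxsize=2)   # a genuinely bounded queue
q.put("a")
q.put("b")
print(q.full())              # True -- the queue is at capacity
try:
    q.put("c", block=False)  # a non-blocking put on a full queue
except Queue.Full:
    print("put() refused: queue is full")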

And here is the run function:

class Crawler(threading.Thread):

    def __init__(self, queue, mal_urls):
        self.queue = queue
        self.mal_list = mal_urls
        self.crawled_links = []

        threading.Thread.__init__(self) 

    def run(self):
        while True:
            self.crawled = set(self.crawled_links)
            url = self.queue.get()
            if url not in self.mal_list:
                self.crawl(url)
            else:
                print("Malicious Link Found: {0}".format(url))

            self.queue.task_done()

self.crawl is a function that does some lxml.html parsing, then calls another function that does some string handling on the links parsed with lxml and finally calls self.queue.put(link), like so:

def queue_links(self, link, url):

    if link.startswith('/'):
        link = "http://" + url.netloc + link

    elif link.startswith("#"):
        return

    elif not link.startswith("http"):
        link = "http://" + url.netloc + "/" + link

    # Add urls extracted from the HTML text to the queue to fetch them
    if link not in self.crawled:
        self.queue.put(link)
    else:
        return
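
Here url is (I believe) the result of calling urlparse on the page being crawled, so url.netloc holds the host. A quick interpreter sketch under that assumption:

>>> from urlparse import urlparse
>>> url = urlparse("http://example.com/some/page")
>>> url.netloc
'example.com'
>>> "http://" + url.netloc + "/about.html"   # how a relative link gets absolutized
'http://example.com/about.html'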

Does anyone spot where I might have messed up, causing the program to never stop running, and why links that have already been crawled are not recognized as such?


You're not actually passing the integer 10 as the maxsize. You're passing sys.argv[2]. sys.argv is a list of strings, so at best you're passing "10" as the maxsize argument. And unfortunately, in Python 2.x, any integer is less than any string. You probably want to use int(sys.argv[2]) instead.
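
To see the pitfall concretely (a quick Python 2 interpreter sketch):

>>> 10 < "10"
True    # in Python 2.x, any int compares less than any str
>>> import Queue
>>> q = Queue.Queue(maxsize="10")   # maxsize is the *string* "10"
>>> for i in range(50): q.put(i)    # never blocks: the queue never looks full
...
>>> q.qsize()
50

So the one-line fix in main() is:

    queue = Queue.Queue(maxsize=int(sys.argv[2]))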
