I've added threading to a project I've been working on in another thread, but the comments and questions there have drifted well off the topic of the original post, so I figured the best thing to do was to ask a new question. The problem is this: I want my program to stop iterating over a while loop after a number of iterations specified on the command line. I'm passing Queue.Queue(maxsize=10) in the following segments of code:
    THREAD_NUMBER = 5

    def main():
        queue = Queue.Queue(maxsize=sys.argv[2])
        mal_urls = set(make_mal_list())
        for i in xrange(THREAD_NUMBER):
            crawler = Crawler(queue, mal_urls)
            crawler.start()
        queue.put(sys.argv[1])
        queue.join()
And here is the run function:
    class Crawler(threading.Thread):
        def __init__(self, queue, mal_urls):
            self.queue = queue
            self.mal_list = mal_urls
            self.crawled_links = []
            threading.Thread.__init__(self)

        def run(self):
            while True:
                self.crawled = set(self.crawled_links)
                url = self.queue.get()
                if url not in self.mal_list:
                    self.crawl(url)
                else:
                    print("Malicious Link Found: {0}".format(url))
                self.queue.task_done()
self.crawl is a function which does some lxml.html parsing and then calls another function which does some string handling with the links parsed using lxml, and then calls self.queue.put(link), like so:
        def queue_links(self, link, url):
            if link.startswith('/'):
                link = "http://" + url.netloc + link
            elif link.startswith("#"):
                return
            elif not link.startswith("http"):
                link = "http://" + url.netloc + "/" + link
            # Add urls extracted from the HTML text to the queue to fetch them
            if link not in self.crawled:
                self.queue.put(link)
            else:
                return
Does anyone spot where I might have messed up that would be causing the program to never stop running, and why links that have already been crawled are not being recognized as such?
You're not actually passing the integer 10 as the maxsize. You're passing sys.argv[2]. sys.argv is a list of strings, so at best you're passing "10" as the maxsize argument. And unfortunately, in Python 2.x, any integer is less than any string, so the queue's size never reaches that limit and the queue is never treated as full. You probably want to use int(sys.argv[2]) instead.
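As a rough sketch of that conversion (not a drop-in replacement for your main()), something like the following should work. The helper names parse_maxsize and make_bounded_queue and the example argument list are made up for illustration. The import fallback is only there so the snippet also runs on Python 3, where the Queue module was renamed to queue.

```python
try:
    import Queue as queue_module  # Python 2, as used in the question
except ImportError:
    import queue as queue_module  # Python 3 renamed the module


def parse_maxsize(argv):
    """Hypothetical helper: read the queue size from argv[2] as an int."""
    return int(argv[2])


def make_bounded_queue(argv):
    """Hypothetical helper: build a queue whose maxsize is a real integer."""
    return queue_module.Queue(maxsize=parse_maxsize(argv))


# Stand-in for sys.argv: script name, start URL, queue size (assumed layout).
example_argv = ["crawler.py", "http://example.com", "2"]

demo_queue = make_bounded_queue(example_argv)
demo_queue.put("first")
demo_queue.put("second")

# With an integer maxsize the queue reports itself as full once it holds
# maxsize items; a further blocking put() would wait for a free slot.
is_full = demo_queue.full()
```

Note that int() raises ValueError if the argument is not a number, so you may want to catch that and print a usage message.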