
Can't determine what's causing a regex error, and would like some input on the efficiency of my program

Developer https://www.devze.com 2023-01-22 13:13 Source: the web
I'm nearing what I would like to think is completion of a tool I've been working on. The code essentially does this:

Open several files and URLs that list known malware- and phishing-related websites/domains, and create a list from each. Parse the HTML of a URL passed in when the method is called, pull out all the a href links, and place them in a separate list.

For every link placed in the new list, create a regex for every item in the malware and phishing lists, and then compare against it to determine whether any of the links parsed from the passed URL are malicious.

The problem I've run into is in iterating over the items of all three lists. Obviously I'm doing something wrong, since it's throwing this error at me:

  File "./test.py", line 95, in <module>
    main()
  File "./test.py", line 92, in main
    crawler.crawl(url)
  File "./test.py", line 41, in crawl
    self.reg1 = re.compile(link1)
  File "/usr/lib/python2.6/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat

The following is the segment of code I'm having problems with; the creation of the malware-related lists is omitted, as that part is working fine for me:

    def crawl(self, url):
        try:
            doc = parse("http://" + url).getroot()
            doc.make_links_absolute("http://" + url, resolve_base_href=True)
            for tag in doc.xpath("//a[@href]"):
                old = tag.get('href')
                fixed = urllib.unquote(old)
                self.links.append(fixed)

        except urllib.error.URLERROR as err:
            print(err)

        for tgt in self.links:
            for link in self.mal_list:
                self.reg = re.compile(link)
            for link1 in self.phish_list:
                self.reg1 = re.compile(link1)

            found = self.reg.search(tgt)
            if found:
                print(found.group())
            else:
                print("No matches found...")

Can anyone spot what I've done wrong with the for loops and list iteration that would cause that regex error? How might I fix it? And, probably most importantly, is the way I'm going about this 'Pythonic', or even efficient? Considering what I'm trying to do here, is there a better way of doing it?


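It seems like your problem is that some of the URLs in your lists contain special regex characters, such as ? and +; for instance, the string ++ is quite likely to turn up, and on the Python 2.6 shown in your traceback that is enough to raise "multiple repeat". The other problem is that you keep overwriting the regex you're using to test: each pass through the inner loops replaces self.reg and self.reg1, so only the last entry of the malware list is ever searched for, and self.reg1 is never used at all. If you just need to check whether one string is contained in another, there's no need for a regex; just use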

for tgt in self.links:
    for link in (self.mal_list + self.phish_list):
        if link in tgt:
            print(link)
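If a regex is still wanted, for example to test each link against every entry in a single pass, the entries have to be escaped first so that characters like ? and + are matched literally. The following is only a sketch of that idea: the function names and the sample lists are made up for illustration and are not part of the original program.

```python
import re


def build_blocklist_pattern(entries):
    """Compile one regex that matches any of the entries literally."""
    # re.escape neutralises regex metacharacters such as ?, + and *
    escaped = [re.escape(entry) for entry in entries if entry]
    if not escaped:
        return None  # nothing to match against
    return re.compile("|".join(escaped))


def find_listed(links, pattern):
    """Return (link, matched entry) pairs for links that contain an entry."""
    if pattern is None:
        return []
    hits = []
    for link in links:
        found = pattern.search(link)
        if found:
            hits.append((link, found.group()))
    return hits


# Hypothetical sample data standing in for mal_list, phish_list and links
mal_list = ["bad.example/c++", "evil.example/?id="]
phish_list = ["phish.example"]
links = [
    "http://bad.example/c++/index.html",
    "http://good.example/",
    "http://phish.example/login",
]

pattern = build_blocklist_pattern(mal_list + phish_list)
hits = find_listed(links, pattern)
for link, entry in hits:
    print("%s matches %s" % (link, entry))
```

The pattern is compiled once, outside the loop over the links, so nothing gets overwritten along the way.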

And if you're just comparing for equality, you can use == instead of in.
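For the equality case, putting the list entries in a set avoids the inner loop altogether, which also speaks to the efficiency part of the question. This is a rough sketch that assumes the lists hold bare domain names and that comparing each link's hostname is what is wanted; neither is confirmed by the question, and the function name and sample data are invented for the example.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def listed_hosts(links, blocked_domains):
    """Return the links whose hostname is exactly one of the blocked domains."""
    # A set gives constant-time membership tests on average
    blocked = set(domain.lower() for domain in blocked_domains)
    flagged = []
    for link in links:
        # hostname is None when the link has no network location
        host = (urlparse(link).hostname or "").lower()
        if host in blocked:
            flagged.append(link)
    return flagged


# Hypothetical sample data (assumption: entries are bare domain names)
blocked_domains = ["bad.example", "phish.example"]
sample_links = [
    "http://bad.example/page",
    "http://good.example/",
    "http://PHISH.example/login",
]

flagged = listed_hosts(sample_links, blocked_domains)
for link in flagged:
    print(link)
```

With a set, the cost per link no longer grows with the length of the malware and phishing lists.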
