Can't determine what's causing a regex error, and would like some input on the efficiency of my program

Developer https://www.devze.com 2023-01-22 13:13 Source: the web

Related topics: python regex

I'm nearing what I'd like to think is completion of a tool I've been working on. The code essentially does this:

Open several files and URLs that list known malware- and phishing-related websites/domains, and create a list from each.

Parse the HTML of the URL passed when the method is called, pull out all the a href links, and place them in a separate list.

For every link placed in that new list, create a regex from every item in the malware and phishing lists, then compare against it to determine whether any of the links parsed from the page are malicious.

The problem I've run into is in iterating over the items of all 3 lists. Obviously I'm doing it wrong, since it's throwing this error at me:

  File "./test.py", line 95, in <module>
    main()
  File "./test.py", line 92, in main
    crawler.crawl(url)
  File "./test.py", line 41, in crawl
    self.reg1 = re.compile(link1)
  File "/usr/lib/python2.6/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat

The following is the segment of code I'm having problems with. The creation of the malware-related lists is omitted, as that part is working fine for me:

    def crawl(self, url):
        try:
            doc = parse("http://" + url).getroot()
            doc.make_links_absolute("http://" + url, resolve_base_href=True)
            for tag in doc.xpath("//a[@href]"):
                old = tag.get('href')
                fixed = urllib.unquote(old)
                self.links.append(fixed)

        except urllib.error.URLERROR as err:
            print(err)

        for tgt in self.links:
            for link in self.mal_list:
                self.reg = re.compile(link)
            for link1 in self.phish_list:
                self.reg1 = re.compile(link1)

            found = self.reg.search(tgt)
            if found:
                print(found.group())
            else:
                print("No matches found...")

Can anyone spot what I've done wrong with the for loops and list iteration that would be causing that regex error? How might I fix it? And, probably most importantly, is the way I'm going about this 'pythonic' or even efficient? Considering what I'm trying to do here, is there a better way of doing it?

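It seems like your problem is that some of the list entries contain characters that are special in a regex, such as ? and +; for instance, the string ++ is quite likely to turn up in a URL, and on the Python 2.6 shown in your traceback two repeat operators in a row are what raise "multiple repeat". The other problem is that you keep overwriting the regex you're using to test: every pass through the inner loops reassigns self.reg and self.reg1, so only the last entry of each list ends up in the pattern, and self.reg1 is never searched at all. If you just need to check whether one string is contained in another, there's no need for a regex; just use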

for tgt in self.links:
    for link in self.mal_list + self.phish_list:
        if link in tgt:
            print(link)

And if you're just comparing for equality, you can use == instead of in.
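
If a regex is still wanted (say, to get a match object back), one possible approach is to escape every entry with re.escape and compile a single alternation once, rather than recompiling inside the loop. The following is only a minimal sketch written for Python 3: the list contents, the link URLs, and the helper names build_matcher and find_hits are made up for illustration and are not from the original program.

```python
import re

# Hypothetical stand-ins for self.mal_list / self.phish_list; the real
# list contents are not shown in the question.
mal_list = ["evil.example/a**b", "bad.example/login?id=1"]
phish_list = ["phish.example/c++"]

# Compiling a raw entry fails because "**" is read as two repeat operators.
# ("++" fails the same way on Python 2.6; Python 3.11+ instead accepts it
# as a possessive quantifier, which still is not a literal match.)
try:
    re.compile(mal_list[0])
    raw_error = None
except re.error as exc:
    raw_error = str(exc)


def build_matcher(entries):
    """Compile one pattern that matches any entry as literal text."""
    # An empty pattern would match every string, so return None instead.
    if not entries:
        return None
    # re.escape neutralises ?, +, *, . and the other metacharacters.
    return re.compile("|".join(re.escape(entry) for entry in entries))


def find_hits(links, matcher):
    """Return (link, matched_entry) pairs for links that contain an entry."""
    hits = []
    if matcher is None:
        return hits
    for link in links:
        found = matcher.search(link)
        if found:
            hits.append((link, found.group()))
    return hits


matcher = build_matcher(mal_list + phish_list)
links = [
    "http://evil.example/a**b/index.html",
    "http://good.example/about",
    "http://phish.example/c++/home",
]
hits = find_hits(links, matcher)
for link, entry in hits:
    print(link, "matches", entry)
```

The pattern is built once, so each link costs a single search instead of one compile per list entry. If exact equality is all that is needed, putting the entries in a set and testing membership avoids the regex entirely.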
