re.findall: how to return all matches, including repeated ones

Developer https://www.devze.com 2023-03-12 00:31 Source: Internet
I have a list of IP:PORT entries in HTML. When I use findall to search for all the IPs, I get every IP, because the IPs are unique. Some of the ports are the same, though, and I end up with, for example, a list of 100 IPs and only 87 ports. How can I find all the ports, including the repeated ones?

import re

proxies = re.findall(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}", html)

ports = re.findall(r"[0-9]{1,3},[0-9]{1,3},[0-9]{1,3},[0-9]{1,3}", html)
# ports are coded to look like this: 47,46,47,46

print(len(proxies))
print(len(ports))


Without seeing the source file, I can only make some basic points.

  • Port numbers are not limited to 3 digits (they go up to 65535), so a {1,3} quantifier excludes any port over 999.
  • Do the port numbers only show up as a list of 4 ports? You said the format was a list of IP:PORT, but that is not what you are checking for.
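
A small sketch of both points, using made-up sample strings (the variable names and values here are assumptions, not taken from the page in question). It suggests that re.findall itself keeps repeated matches, so a shortfall in the count more likely means the pattern failed to match some entries:

```python
import re

# Hypothetical port cells: two identical four-number entries and one
# entry that has only two numbers.
sample = "47,46,47,46 47,46,47,46 50,42"

# Original-style pattern: exactly four comma-separated numbers of 1-3 digits.
four_numbers = re.findall(r"[0-9]{1,3},[0-9]{1,3},[0-9]{1,3},[0-9]{1,3}", sample)
print(four_numbers)  # duplicates are kept, but the two-number entry is missed

# Looser pattern: one or more comma-separated numbers of up to 5 digits.
any_count = re.findall(r"\d{1,5}(?:,\d{1,5})*", sample)
print(any_count)

# A {1,3} quantifier splits a four-digit port into two matches.
print(re.findall(r"[0-9]{1,3}", "8080"))
print(re.findall(r"[0-9]{1,5}", "8080"))
```

With this sample, the first pattern returns the repeated entry twice and skips "50,42", while the looser pattern returns all three entries.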

EDIT:

Look at the source of the page more carefully. There are entries that do not have 4 port numbers.

<tr>
    <td class="t_ip">151.9.233.6</td>
    <td class="t_port">50,42</td>
    <td class="t_country"><img src="/images/flags/it.png" alt="it" />Italy</td>
    <td class="t_anonymity">

            High

    </td>
    <td class="t_https">-</td>
    <td class="t_checked">00:02:16</td>
    <td class="t_check">
        <a href="" class="a_check" >check</a>
    </td>
</tr>

It also seems like it would be a lot easier to check for class="t_ip" and class="t_port" and grab the contents of that element.

<td class="t_ip">(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>
<td class="t_port">((\d,?)+)</td>

Note: The IP address expression will match invalid IP addresses.
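
As a rough sketch of how these two expressions might be used with re.findall: the html string below is a hypothetical fragment modelled on the row above, and is_valid_ipv4 is a helper name introduced only for illustration. One detail to watch is that with the nested capturing group in ((\d,?)+), findall returns a tuple per match, so the inner group is made non-capturing here.

```python
import re

# Hypothetical fragment; the real page has more cells per row.
html = """
<tr>
    <td class="t_ip">151.9.233.6</td>
    <td class="t_port">50,42</td>
</tr>
<tr>
    <td class="t_ip">10.0.0.1</td>
    <td class="t_port">47,46,47,46</td>
</tr>
<tr>
    <td class="t_ip">999.1.1.1</td>
    <td class="t_port">50,42</td>
</tr>
"""

ip_re = re.compile(r'<td class="t_ip">(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>')
# (?:...) keeps a single capturing group, so findall returns plain strings.
port_re = re.compile(r'<td class="t_port">((?:\d,?)+)</td>')

ips = ip_re.findall(html)
ports = port_re.findall(html)  # repeated port strings are all returned

def is_valid_ipv4(ip):
    # Illustrative check only: every octet must be in the range 0-255.
    return all(0 <= int(part) <= 255 for part in ip.split("."))

# Pair each IP with its port cell by position, dropping invalid addresses.
pairs = [(ip, port) for ip, port in zip(ips, ports) if is_valid_ipv4(ip)]

print(len(ips), len(ports))
print(pairs)
```

Pairing by position assumes every row has exactly one t_ip cell and one t_port cell; if that does not hold on the real page, matching both cells in a single expression per row would be safer.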


Not sure that this will help you too much, but just another option:

txt = """
<tr>
    <td class="t_ip">151.9.233.6</td>
    <td class="t_port">50,42</td>
    <td class="t_country"><img src="/images/flags/it.png" alt="it" />Italy</td>
    <td class="t_anonymity">

            High

    </td>
    <td class="t_https">-</td>
    <td class="t_checked">00:02:16</td>
    <td class="t_check">
        <a href="" class="a_check" >check</a>
    </td>
</tr>    
"""

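# Strip surrounding whitespace from every line so startswith() can be used.
lines = [line.strip() for line in txt.split('\n')]

CLOSE_TAG_LEN = len('</td>')  # 5

def get_vals(start_txt):
    # Return the text between start_txt and the trailing </td> on each matching line.
    return [line[len(start_txt):-CLOSE_TAG_LEN] for line in lines if line.startswith(start_txt)]

print(get_vals('<td class="t_ip">'))
print(get_vals('<td class="t_port">'))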