Screen Scraping Twitter Page with Unicode Equal Comparison Failure Python

Developer https://www.devze.com 2023-03-29 15:02 Source: Internet
I'm using the following code to obtain a list of a user's followers on twitter:

import urllib
from BeautifulSoup import BeautifulSoup

#code only looks at one page of followers instead of continuing to all of a user's followers
#decided to only use a small sample 

site = "http://mobile.twitter.com/NYTimesKrugman/following"
friends = set()
response = urllib.urlopen(site)
html = response.read()
soup = BeautifulSoup(html)
names = soup.findAll('a', {'href': True})
for name in names:
    a = name.renderContents()
    b = a.lower()
    if ("http://mobile.twitter.com/" + b) == name['href']:
        c = str (b)
        friends.add(c)

for friend in friends:
    print friend
print ("Done!")

However, I get the following results:

NYTimeskrugman
nytimesphoto
rasermus

Warning (from warnings module):
   File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter     Crawler\crawlerversion14.py", line 42
    if ("http://mobile.twitter.com/" + b) == name['href']:
 UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
amnesty_norge
zynne_
fredssenteret
oljestudentene
solistkoret

....(and so it continues)

It looks like I got most of the names in the following list, but I also received a seemingly random warning partway through. It didn't stop the code from finishing, however. I was hoping that someone could enlighten me as to what happened?


Don't know if my answer will be useful several years later, but I rewrote your code using requests instead of urllib.
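
As for the warning itself, my understanding is that in Python 2 the old BeautifulSoup's renderContents() returns an encoded byte string, while name['href'] is a unicode string. Comparing the two makes Python 2 decode the bytes as ASCII first, so any link text containing a non-ASCII character fails to decode, triggers the UnicodeWarning, and is simply treated as unequal. The sketch below mimics that behaviour in Python 3 so it can be run today; the helper names and the accented sample name are made up for illustration and are not taken from the page.

```python
# Python 3 sketch of what Python 2 did implicitly for `str == unicode`.
# The helper names and the sample values are hypothetical.

def py2_style_equal(byte_value, text_value):
    """Mimic Python 2: decode the bytes as ASCII before comparing."""
    try:
        return byte_value.decode("ascii") == text_value
    except UnicodeDecodeError:
        # Python 2 emitted a UnicodeWarning at this point and
        # interpreted the two values as unequal.
        return False


def utf8_equal(byte_value, text_value):
    """Decode explicitly as UTF-8, so non-ASCII text compares correctly."""
    return byte_value.decode("utf-8") == text_value


prefix = b"http://mobile.twitter.com/"
ascii_name = b"nytimesphoto"
accented_name = "jos\u00e9".encode("utf-8")  # made-up link text with a non-ASCII character

# Pure ASCII link text: the implicit decode succeeds.
print(py2_style_equal(prefix + ascii_name, "http://mobile.twitter.com/nytimesphoto"))

# Non-ASCII link text: the ASCII decode fails, so the values count as unequal.
print(py2_style_equal(prefix + accented_name, "http://mobile.twitter.com/jos\u00e9"))

# Decoding explicitly as UTF-8 makes the same comparison work.
print(utf8_equal(prefix + accented_name, "http://mobile.twitter.com/jos\u00e9"))
```

Working with decoded text throughout, as requests does with response.text, avoids the implicit ASCII step altogether.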

I think it's better to make another selection using the class "username", so that only follower names are considered.

Here's the code:

import requests
from bs4 import BeautifulSoup

site = "http://mobile.twitter.com/paulkrugman/followers"
friends = set()
response = requests.get(site)
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly so bs4 does not have to guess one
names = soup.find_all('a', {'href': True})
for name in names:
    pseudo = name.find("span", {"class": "username"})
    if pseudo:
        pseudo = pseudo.get_text()
        friends.add(pseudo)

for friend in friends:
    print (friend)
print("Done !")

@paulkrugman itself appears in every result set, so don't forget to remove it!
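
A minimal sketch of that last step, assuming the scraped names are strings such as "@paulkrugman". The helper name drop_own_handle and the sample set are hypothetical; the two other handles are borrowed from the question's output purely as sample data.

```python
def drop_own_handle(usernames, handle):
    """Return a copy of the set without the account's own handle."""
    cleaned = set(usernames)
    # discard() does nothing if the handle is absent, unlike remove(),
    # which would raise KeyError.
    cleaned.discard(handle)
    return cleaned


# Sample data for illustration only, not real scraped results.
friends = {"@paulkrugman", "@nytimesphoto", "@amnesty_norge"}
friends = drop_own_handle(friends, "@paulkrugman")

for friend in sorted(friends):
    print(friend)
```

Using discard() rather than remove() keeps the script from failing on a page where the handle happens not to show up.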
