Apologies in advance for trying to keep a long story short; feel free to ask for clarification. I am writing a web scraping script that pulls listings from Zillow into a pandas DataFrame, mainly so that I can learn both pandas and beautifulsoup4 in the process. I would like to avoid the Zillow API, but it is starting to look like my only option.

The problem: when I scrape the location the user enters, only 7 properties come back. I was told this is because Zillow uses JavaScript to lazy-load the results ("infinite scrolling"), so the remaining properties are not loaded until the user scrolls. I tried Selenium instead of requests, but then I get stopped by a bot-verification CAPTCHA. I have also tried setting request headers, but I can't find a solution other than the API.
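To check whether lazy loading really is the cause, a rough sketch like the one below counts how many list items in a chunk of HTML actually carry a price. It uses only the standard library's `html.parser`. The sample markup and the `CardCounter` class are made up for illustration, and it assumes that unloaded cards show up as empty `<li>` placeholders, which I have not confirmed for Zillow.

```python
from html.parser import HTMLParser

# Hypothetical markup: two filled cards and two empty placeholders.
SAMPLE_HTML = """
<ul>
  <li><span data-test="property-card-price">$259,900</span></li>
  <li><span data-test="property-card-price">$270,000</span></li>
  <li></li>
  <li></li>
</ul>
"""


class CardCounter(HTMLParser):
    """Counts <li> items and how many of them contain a price span.

    Nested <li> elements are not handled; this is only a diagnostic sketch.
    """

    def __init__(self):
        super().__init__()
        self.total = 0
        self.filled = 0
        self._in_li = False
        self._li_has_price = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.total += 1
            self._in_li = True
            self._li_has_price = False
        elif tag == "span" and self._in_li:
            if dict(attrs).get("data-test") == "property-card-price":
                self._li_has_price = True

    def handle_endtag(self, tag):
        if tag == "li" and self._in_li:
            if self._li_has_price:
                self.filled += 1
            self._in_li = False


counter = CardCounter()
counter.feed(SAMPLE_HTML)
print(f"{counter.filled} of {counter.total} list items contain a price")
```

If the same count on the real response shows many more list items than filled ones, that would be consistent with the lazy-loading explanation.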
Here's my code BEFORE using selenium (aka when it semi-worked):
from bs4 import BeautifulSoup
import pandas as pd
from uszipcode import SearchEngine
import requests
search = SearchEngine()
zipcode = input("What is your zipcode: ")
zipcode_info = search.by_zipcode(zipcode)
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    # The language list belongs in accept-language (with q=), not accept-encoding
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
with requests.Session() as session:
    url = "https://www.zillow.com/homes/for_sale/" + zipcode_info.major_city + "/"
    response = session.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
soup.prettify()
df = pd.DataFrame()
address = list()
price = list()
bed_bath = list()
links = list()
properties = soup.find_all("li", attrs={"class": "ListItem-c11n-8-73-8__sc-10e22w8-0 srp__hpnp3q-0 enEXBq with_constellation"})
for li in properties:
    # find() returns None when a card lacks the element, so .text / .get()
    # raise AttributeError; such cards are skipped.
    try:
        address.append(li.find("a", attrs={"class": "StyledPropertyCardDataArea-c11n-8-73-8__sc-yipmu-0 lhIXlm property-card-link"}).text)
    except AttributeError:
        pass
    try:
        price.append(li.find("span", attrs={"data-test": "property-card-price"}).text)
    except AttributeError:
        pass
    try:
        span = li.find("span", attrs={"class": "StyledPropertyCardHomeDetails-c11n-8-73-8__sc-1mlc4v9-0 jlVIIO"})
        for subspan in span:
            bed_bath.append(subspan.find("b").text)
    except (AttributeError, TypeError):
        # bed_bath is collected but not added to the DataFrame below
        pass
    try:
        links.append(li.find("a", attrs={"data-test": "property-card-link"}).get("href"))
    except AttributeError:
        pass
df['Address'] = address
df['Price'] = price
df['Links'] = links
print(df)
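The columns in the code above only line up if every card contributes exactly one value per field. A sketch of one way to guarantee that is to build one record per listing and fill gaps with None. Plain dictionaries stand in for parsed cards here; the `cards` data, the `FIELDS` tuple, and the `build_rows` helper are all hypothetical.

```python
FIELDS = ("Address", "Price", "Links")


def build_rows(cards):
    """Return one dict per card, using None for any field a card lacks."""
    rows = []
    for card in cards:
        row = {field: card.get(field) for field in FIELDS}
        # Skip cards with no data at all (for example, unloaded placeholders).
        if any(value is not None for value in row.values()):
            rows.append(row)
    return rows


# Hypothetical stand-ins for parsed listing cards.
cards = [
    {"Address": "1 Example St", "Price": "$100,000", "Links": "https://example.com/1"},
    {"Address": "2 Example St", "Links": "https://example.com/2"},  # no price
    {},  # empty placeholder card
]

rows = build_rows(cards)
for row in rows:
    print(row)
```

A list like `rows` can be passed directly to `pd.DataFrame(rows)`, which creates one column per key, so a missing price shows up as an empty cell instead of shifting later values up a row.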
And the output is:
   Address                                     Price     Links
0  525 W River Dr, Pennsauken, NJ 08110        $259,900  https://www.zillow.com/homedetails/525-W-River...
1  7519 Remington, Merchantville, NJ 08109     $270,000  https://www.zillow.com/homedetails/7519-Reming...
2  2269 Marlon Ave, Pennsauken, NJ 08110       $220,000  https://www.zillow.com/homedetails/2269-Marlon...
3  8129 River Rd, Pennsauken, NJ 08110         $324,999  https://www.zillow.com/homedetails/8129-River-...
4  1653 Springfield Ave, Pennsauken, NJ 08110  $259,900  https://www.zillow.com/homedetails/1653-Spring...
5  5531 Jackson Ave, Pennsauken, NJ 08110      $265,000  https://www.zillow.com/homedetails/5531-Jackso...
6  8141 Stow Rd, Pennsauken, NJ 08110          $359,000  https://www.zillow.com/homedetails/8141-Stow-R...
7  2203 42nd St, Pennsauken, NJ 08110          $275,000  https://www.zillow.com/homedetails/2203-42nd-S...