I need to parse a url to get a list of urls that link to a detail page. Then from that page I need to get all the details from that page. I need to do it this way because the detail page url is not regularly incremented and changes, but the event list page stays the same.
Basically:
example.com/events/
<a href="http://example.com/events/1">Event 1</a>
<a href="http://example.com/events/2">Event 2</a>
example.com/eve开发者_StackOverflow中文版nts/1
...some detail stuff I need
example.com/events/2
...some detail stuff I need
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)
soup.prettify()
for anchor in soup.findAll('a', href=True):
print anchor['href']
It will give you the list of urls. Now You can iterate over those urls and parse the data.
inner_div = soup.findAll("div", {"id": "y-shade"})
This is an example. You can go through the BeautifulSoup tutorials.
For the next group of people that come across this, BeautifulSoup has been upgraded to v4 as of this post as v3 is no longer being updated..
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
To use in Python...
import bs4 as BeautifulSoup
Use urllib2 to get the page, then use beautiful soup to get the list of links, also try scraperwiki.com
Edit:
Recent discovery: Using BeautifulSoup through lxml with
from lxml.html.soupparser import fromstring
is miles better than just BeautifulSoup. It lets you do dom.cssselect('your selector') which is a life saver. Just make sure you have a good version of BeautifulSoup installed. 3.2.1 works a treat.
dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in htm.cssselect('#navigation a')]
FULL PYTHON 3 EXAMPLE
Packages
# urllib (comes with standard python distribution)
# pip3 install beautifulsoup4
Example:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen('https://www.wikipedia.org/') as f:
data = f.read().decode('utf-8')
d = BeautifulSoup(data)
d.title.string
The above should print out 'Wikipedia'
精彩评论