BeautifulSoup newbie... need help
Here is the code sample:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mec = Browser()
#url1 = "http://www.wines.com/catalog/index.php?cPath=21"
url2 = "http://www.wines.com/catalog/product_info.php?products_id=4866"
page = mec.open(url2)
html = page.read()
soup = BeautifulSoup(html)
print soup.prettify()
When I use url1 I get a nice dump of the page. When I use url2 (the one I need), I get output without the body:
<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN">
<html dir="LTR" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>
2005 Jordan Cabernet Sauvignon Sonoma 2005
</title>
</head>
</html>
Any ideas?
Yes. The HTML is bad.
Step 1a. Run print soup.prettify() and see where it stops indenting correctly.
Step 1b (if 1a doesn't work). Run the raw HTML through any HTML prettifier. I use BBEdit for things that confuse Beautiful Soup.
Look closely at the HTML. There will be some kind of horrible error. Misplaced " characters are common. Also, a CSS background-image given as an inline style often has bad quotes:
<tag style="background-image:url("something")">
Note the "improper" quotes: the inner pair ends the style attribute early. You'll need to write a regex to find and fix these.
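As a rough aid for the "look closely" step, here is a minimal sketch that flags tags containing an odd number of double quotes. The helper name suspicious_tags and the sample markup are made up for illustration, and the heuristic is deliberately crude: it catches a single stray quote, but not the background-image case above, where the bad quotes come in pairs.

```python
import re

# Anything that looks like a tag: "<", then no angle brackets, then ">".
TAG = re.compile(r'<[^<>]+>')

def suspicious_tags(html):
    """Yield (line_number, tag_text) for tags with an odd number of double quotes."""
    for match in TAG.finditer(html):
        tag = match.group(0)
        if tag.count('"') % 2 == 1:
            # Line number = newlines before the tag, plus one.
            line_number = html.count('\n', 0, match.start()) + 1
            yield line_number, tag

# Abbreviated, made-up sample with one stray quote inside an attribute value.
sample = '''<html>
<head>
<META NAME="description" CONTENT="$49 at Wines.com "Deep red...">
<title>ok</title>
</head>
<body class="main"><p>text</p></body>
</html>'''

for line_number, tag in suspicious_tags(sample):
    print("%d: %s" % (line_number, tag))
```

On a real page you would pass in the raw string returned by page.read() and inspect whatever it reports by hand.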
Step 2. Write a "massage" regular expression and function to fix this. Read the "Sanitizing Bad Data with Regexps" section of the documentation (http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps) for how to create a markup massage.
Here's what I often use:
import re

# Fix background-image:url("some URI")
# by replacing the inner quotes with &quot;
background_image = re.compile(r'background-image:url\("([^"]+)"\)')
def fix_background_image(match):
    return 'background-image:url(&quot;%s&quot;)' % (match.group(1),)

# Fix <img src="some URI name="someString""> -- note the out-of-place quotes
bad_img = re.compile(r'src="([^ ]+) name="([^"]+)""')
def fix_bad_img(match):
    return 'src="%s" name="%s"' % (match.group(1), match.group(2))

# List of (compiled regex, replacement function) pairs
fix_style_quotes = [
    (background_image, fix_background_image),
    (bad_img, fix_bad_img),
]
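To see what such a list does before handing it to the parser, you can apply the pairs by hand with re.sub. This is only a sketch: apply_massage is a hypothetical helper written for this example (it is not part of Beautiful Soup), the broken markup is invented, and the patterns are repeated so the snippet runs on its own.

```python
import re

# Same two patterns as above, repeated so this snippet is self-contained.
background_image = re.compile(r'background-image:url\("([^"]+)"\)')
bad_img = re.compile(r'src="([^ ]+) name="([^"]+)""')

def fix_background_image(match):
    return 'background-image:url(&quot;%s&quot;)' % (match.group(1),)

def fix_bad_img(match):
    return 'src="%s" name="%s"' % (match.group(1), match.group(2))

fix_style_quotes = [
    (background_image, fix_background_image),
    (bad_img, fix_bad_img),
]

def apply_massage(markup, massage):
    """Hypothetical helper: run each (regex, function) pair over the markup in order."""
    for pattern, replacement in massage:
        markup = pattern.sub(replacement, markup)
    return markup

# Invented sample showing both kinds of broken quoting.
broken = '<td style="background-image:url("bg.gif")"><img src="a.gif name="logo""></td>'
cleaned = apply_massage(broken, fix_style_quotes)
print(cleaned)
```

With Beautiful Soup 3 you would not normally call a helper like this yourself. The documentation section linked above describes passing the list through the markupMassage argument, for example BeautifulSoup(html, markupMassage=fix_style_quotes).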
It seems to be getting tripped up by this bad tag:
<META NAME="description" CONTENT="$49 at Wines.com "Deep red. Red- and blackcurrant, cherry and menthol on the nose, with subtle vanilla, cola and tobacco notes adding complexity. Tightly wound red berry and bitter cherry flavors are framed by dusty...">
Clearly here they have failed to escape a quote inside the attribute value (uh-oh... site might be vulnerable to cross-site scripting?), and that's making the parser think the rest of the content of the page is all in attribute values. (It would take another " or a > inside one of the real attribute values to make it realise the mistake, I think.)
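If you wanted to patch this one tag before parsing, a regex along the following lines might do. Treat it as a sketch under assumptions: the pattern meta_description and the function escape_inner_quotes are invented for this example, the sample is an abbreviated copy of the tag, and the approach assumes the description ends at the first "> (optionally "/>) after CONTENT=", which will not hold if the text itself contains that sequence.

```python
import re

# Assumed shape: <META NAME="description" CONTENT="..."> with stray quotes inside "...".
# Group 1 = everything up to the opening quote of CONTENT,
# group 2 = the (possibly quote-laden) description text,
# group 3 = the closing quote and end of tag.
meta_description = re.compile(
    r'(<META\s+NAME="description"\s+CONTENT=")(.*?)("\s*/?>)',
    re.IGNORECASE | re.DOTALL,
)

def escape_inner_quotes(match):
    """Replace any double quote inside the description text with &quot;."""
    inner = match.group(2).replace('"', '&quot;')
    return match.group(1) + inner + match.group(3)

# Abbreviated version of the offending tag from the page.
sample = '<META NAME="description" CONTENT="$49 at Wines.com "Deep red. Red- and blackcurrant...">'
fixed = meta_description.sub(escape_inner_quotes, sample)
print(fixed)
```

A pair like (meta_description, escape_inner_quotes) has the same shape as the entries in a markup massage list, so it could be tried alongside other fixes.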
Unfortunately this is quite a tricky error to fix up after the fact. You could try a slightly different parser, perhaps? E.g. try Beautiful Soup 3.0.x instead of 3.1.x if you're using that version, or vice versa. Or try html5lib.
The HTML is indeed horrible :-) BeautifulSoup 3.0.7 is much better at handling malformed HTML than the current version. The project website warns: "Currently the 3.0.x series is better at parsing bad HTML than the 3.1 series."... and there's a great page devoted to the reason why, which boils down to the fact that SGMLParser was removed in Python 3, and BS 3.1.x was written to be convertible to Py3k.
The good news is that you can still download 3.0.7a (the last version), which on my machine parses the url you mentioned perfectly: http://www.crummy.com/software/BeautifulSoup/download/3.x/
Running a validator on the HTML in question shows 116 errors, just too many to track down which one BeautifulSoup is proving unable to recover from, I guess :-(
html5lib seems to survive the ordeal of parsing this horror page, and leaves a lot of stuff in (the prettify has just about all of the original page, it seems to me, when you use html5lib's parser to produce a BeautifulSoup object). Hard to say if the resulting parse tree will do what you need, since we don't really know what that is;-).
Note: I've installed html5lib right from the hg clone sources (just python setup.py install from the html5lib/python directory), since the last official release is a bit long in the tooth.