
Python script using BeautifulSoup to scrape a webpage

Developer https://www.devze.com 2023-04-11 06:37 Source: web

I am trying to scrape the contents of the following page using BeautifulSoup,

<div data-referrer="pagelet_123" id="pagelet_123">
<div id="1" class="p1">
<div class="uiHeader uiHeaderTopAndBottomBorder uiHeaderSection">
<div class="clearfix uiHeaderTop">
<div>
<h4 class="uiHeaderTitle">info - 1</h4>
</div></div></div><div class="phs">
<table class="uicontenttable">
<tbody>
<tr>
<th class="label">Other</th>
<td class="data"><div id="ua94ty_3" class="uiCollapsedList uiCollapsedListHidden uiCollapsedListNoSeparate pagesListData">
<span class="visible">
<a href="http://abc.com/Federer">info-2</a>, 
<a href="http://abc.com/pages/Ian-Wright-Out-of-Bounds/117602014955747">info-3</a>, 
<a href="http://abc.com/JuniperNetworks">info-4</a>, 
<a href="http://abc.com/pages/Join-Diaspora/118635234836351">info-5</a>
</span>
</div>
</td>
<td class="rightCol">
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div data-referrer="pagelet_ent" id="pagelet_ent">
<div id="2" class="section2">
<div class="uiHeader uiHeaderTopAndBottomBorder uiHeaderSection">
<div class="clearfix uiHeaderTop">
<div>
<h4 class="uiHeaderTitle">info-6</h4>
</div></div></div>
<div class="phs"><table class="uiInfoTable mtm profileInfoTable">
<tbody>
<tr>
<th class="label">info - 7</th><td class="data">
<div class="mediaRowWrapper ">
<ul class="uiList uiListHorizontal clearfix pbl mediaRow">
<li class="uiListItem  uiListHorizontalItemBorder uiListHorizontalItem">
<a href="URL - 1">
<div class="mediaPortrait">
<div style="height: 75px; width: 75px;" class="fbProfileScalableThumb photo">
<img width="87.00090480941" style="margin: -6px 0 0 -6px;" title="Hans Zimmer" alt="" src="http://profile.ak.fbcdn.net/hprofile-ak-snc4/203614_7170054127_6578457_s.jpg" class="img"></div><div class="mediaPageName">info - 8</div></div></a></li><li class="pls uiListItem  uiListHorizontalItemBorder uiListHorizontalItem">

<a href="URL - 2">
<div class="mediaPortrait"><div style="height: 75px; width: 75px;" class="fbProfileScalableThumb photo"><img width="87.00090480941" style="margin: -6px 0 0 -6px;" title="test" alt="" src="http://external.ak.fbcdn.net/safe_image.php?d=AQCVRllyopjA_z5F&amp;w=100&amp;h=300&amp;url=http%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F5%2F59%2F-2.jpg&amp;fallback=hub_music&amp;prefix=s" class="img"></div><div class="mediaPageName">test</div></div></a>
</div>
<div class="mediaPageName">info - 8
</div>
</div>
</a>

This page contains multiple nested divs and tables. I need help using BeautifulSoup to extract only info - 1, info-2, ... info-6, plus URL - 1 and URL - 2.

I read BeautifulSoup's documentation, but it was not much help. Please also suggest a BeautifulSoup reference document or book for parsing complex web pages.

Thanks for your help, appreciated!

sat


Their documentation doesn't serve your purposes?

http://www.crummy.com/software/BeautifulSoup/documentation.html

It looks to me like you're going to want something like:

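import re
from bs4 import BeautifulSoup  # BeautifulSoup 4; version 3 used "from BeautifulSoup import BeautifulSoup"

soup = BeautifulSoup(theXMLAsAString, "html.parser")
# Filter on text nodes, not tag names; \s* accepts both "info - 1" and "info-2"
results = soup.find_all(string=re.compile(r'info\s*-\s*[1-6]'))
for r in results:
    # Only text sitting directly inside an <a> has a link; .get() returns None otherwise
    myurl = r.parent.get('href')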

That code isn't tested, but is the general idea of how to use BeautifulSoup.
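
For anyone on BeautifulSoup 4, here is a minimal, self-contained sketch of the same idea. It assumes the bs4 package and the standard-library html.parser backend. SAMPLE_HTML is a trimmed-down stand-in for the markup in the question, not the real page, and the names INFO_PATTERN and extract_info are made up for this illustration. It collects the "info" text nodes with a regex and reads the href of each link that wraps a mediaPageName div.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (not the full markup).
SAMPLE_HTML = """
<div id="pagelet_123">
  <h4 class="uiHeaderTitle">info - 1</h4>
  <span class="visible">
    <a href="http://abc.com/Federer">info-2</a>,
    <a href="http://abc.com/JuniperNetworks">info-4</a>
  </span>
</div>
<div id="pagelet_ent">
  <h4 class="uiHeaderTitle">info-6</h4>
  <ul>
    <li><a href="URL - 1"><div class="mediaPageName">info - 8</div></a></li>
    <li><a href="URL - 2"><div class="mediaPageName">test</div></a></li>
  </ul>
</div>
"""

# Hypothetical pattern: "info", optional spaces around the dash, then 1 to 6.
INFO_PATTERN = re.compile(r"info\s*-\s*[1-6]")


def extract_info(html):
    """Return (labels, media_urls) found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # string= filters text nodes; a bare regex argument would filter tag names.
    labels = [s.strip() for s in soup.find_all(string=INFO_PATTERN)]
    # Each media entry is an <a> wrapping a div with class mediaPageName.
    media_urls = []
    for div in soup.find_all("div", class_="mediaPageName"):
        link = div.find_parent("a")
        if link is not None and link.has_attr("href"):
            media_urls.append(link["href"])
    return labels, media_urls


labels, media_urls = extract_info(SAMPLE_HTML)
print(labels)
print(media_urls)
```

Matching on class names such as mediaPageName is probably more robust than walking the nested divs level by level, but that depends on how stable the page's markup is.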
