How do I print a line following a line containing certain text in a saved file in Python?

Developer https://www.devze.com 2022-12-21 06:30 Source: the web
I have written a Python program to find the carrier of a cell phone given the number. It downloads the source of http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1 (where 1112223333 is the phone number to look up) and saves it as carrier.html. In the source, the carrier is on the line after the [div class="carrier_result"] tag. (Substitute < and > for [ and ]; Stack Overflow treated the HTML as formatting and would not display it.)

My program currently searches the file and finds the line containing the div tag, but now I need a way to store the next line after that as a string. My current code is: http://pastebin.com/MSDN0vbC


What you really want to be doing is parsing the HTML properly. Use the BeautifulSoup library - it's wonderful at doing so.

Sample code:

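# Python 2, with BeautifulSoup 3 (the `BeautifulSoup` module)
import urllib2, BeautifulSoup

opener = urllib2.build_opener()
# Replace the default User-agent header with a browser-like one
opener.addheaders[0] = ('User-agent', 'Mozilla/5.1')

response = opener.open('http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1').read()

bs = BeautifulSoup.BeautifulSoup(response)
# .next is the node right after the opening div tag, here the carrier text
print bs.findAll('div', attrs={'class': 'carrier_result'})[0].next.strip()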


You should be using an HTML parser such as BeautifulSoup or lxml instead.
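If installing a third-party parser is not an option, the same idea can be sketched with the standard library's html.parser (Python 3). This swaps in a different parser from the ones named above, so treat it as an illustration only. The names CarrierParser and find_carrier are hypothetical, "Example Wireless" is a placeholder carrier name, and the sketch assumes the carrier text sits directly inside the carrier_result div as the question describes.

```python
from html.parser import HTMLParser


class CarrierParser(HTMLParser):
    """Collect the text inside the first <div class="carrier_result">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside the target div
        self.chunks = []    # text fragments found inside it
        self.done = False   # set once the target div has been closed

    def handle_starttag(self, tag, attrs):
        if self.done or tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested div inside the target
        elif "carrier_result" in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def find_carrier(html):
    """Return the stripped text of the carrier_result div, or None."""
    parser = CarrierParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return text or None


# Placeholder markup shaped like the page described in the question.
sample = '<body>\n<div class="carrier_result">\nExample Wireless\n</div>\n</body>'
print(find_carrier(sample))  # Example Wireless
```

Because the parser tracks tags rather than lines, it gives the same answer whether or not the carrier text is on its own line.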


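To get the next line, you can use

htmlsource = open('carrier.html', 'r')
for line in htmlsource:
    if '<div class="carrier_result">' in line:
        # The file object is its own iterator, so this consumes the following line.
        # The '' default avoids StopIteration if the tag is on the last line.
        nextline = next(htmlsource, '')
        print nextline
htmlsource.close()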

A "better" way is to split on </div> and then pick out the part you want, because sometimes the tag and the text you want are all on one line. In that case next() gives the wrong result. For example:

data=open("carrier.html").read().split("</div>")
for item in data:
    if '<div class="carrier_result">' in item:
        print item.split('<div class="carrier_result">')[-1].strip()
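To make the difference between the two approaches concrete, here is a minimal Python 3 sketch that runs both against in-memory strings instead of carrier.html. The function names and the sample markup (including the carrier name "Example Wireless") are made up for illustration; only the marker string comes from the question.

```python
import io

MARKER = '<div class="carrier_result">'


def carrier_from_next_line(lines):
    """Return the stripped line that follows the marker line, or None."""
    lines = iter(lines)
    for line in lines:
        if MARKER in line:
            following = next(lines, None)
            return following.strip() if following is not None else None
    return None


def carrier_from_split(html):
    """Return the text between the marker and the next </div>, or None."""
    for item in html.split("</div>"):
        if MARKER in item:
            return item.split(MARKER)[-1].strip()
    return None


# Hypothetical page fragments: carrier on its own line vs. on the tag's line.
multi_line = '<body>\n<div class="carrier_result">\nExample Wireless\n</div>\n</body>\n'
one_line = '<body>\n<div class="carrier_result">Example Wireless</div>\n</body>\n'

print(carrier_from_next_line(io.StringIO(multi_line)))  # Example Wireless
print(carrier_from_next_line(io.StringIO(one_line)))    # </body>  (wrong)
print(carrier_from_split(multi_line))                   # Example Wireless
print(carrier_from_split(one_line))                     # Example Wireless
```

The split version still relies on the exact spelling of the opening tag, so it would miss the div if the site added another attribute or changed the quoting.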

By the way, if it's possible, try to use Python's own web modules, such as urllib or urllib2, instead of calling an external wget.
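As a rough sketch of that suggestion, the download step could look like the following in Python 3, where urllib2's functionality lives in urllib.request. The function names, the User-Agent value, and the timeout are assumptions for illustration; the URL and the carrier.html file name are taken from the question, and whether the site still serves that page is not something this sketch can confirm.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://www.whitepages.com/carrier_lookup"


def build_lookup_url(number):
    """Build the lookup URL from the question for a given phone number."""
    query = urlencode([("carrier", "other"), ("number_0", number), ("response", "1")])
    return BASE_URL + "?" + query


def download_page(number, path="carrier.html", timeout=10):
    """Fetch the lookup page and save it to `path` (makes a network request)."""
    # The User-Agent value is an arbitrary browser-like string, not a requirement.
    request = Request(build_lookup_url(number), headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        body = response.read()
    with open(path, "wb") as f:
        f.write(body)
    return path


print(build_lookup_url("1112223333"))
```

Calling download_page("1112223333") would then write carrier.html without shelling out to wget, and the saved file can be searched exactly as before.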
