Python: how to fetch these strings

Developer https://www.devze.com 2023-01-27 14:14 Source: Internet
text=u'<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'

I am new to Python. I want to get \ue6ec, \ue6f6, \ue6ec. How can I fetch these strings using the re module? Thank you very much!


Regular expressions are not a good tool for working with HTML. Use Beautiful Soup instead.


>>> from BeautifulSoup import BeautifulSoup
>>> text=u'<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
>>> t = BeautifulSoup(text)
>>> t.findAll(text=True)
[u'\ue689', u'\ue6ec', u'\ue6f6']
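
The session above is written for Python 2 and BeautifulSoup 3. A rough Python 3 equivalent is sketched below, assuming the `beautifulsoup4` package is installed (imported as `bs4`). The variable name `chars` and the choice of the built-in `html.parser` backend are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, split across lines for readability.
text = (
    '<a href="#5" accesskey="5"></a>'
    '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
    '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
    '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
)

# "html.parser" is the parser bundled with Python, so no extra backend is needed.
soup = BeautifulSoup(text, "html.parser")

# string=True matches every text node; str() turns each NavigableString
# into a plain Python string.
chars = [str(node) for node in soup.find_all(string=True)]
print(chars)
```

The empty first `<a>` element contributes no text node, so only the three characters inside the `<font>` tags are collected.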


Don't use regular expressions to parse HTML. Use BeautifulSoup; see the BeautifulSoup documentation.


If you know that the page will always have that format, use the BeautifulSoup parser to find what you need in the HTML.

However, BeautifulSoup may sometimes break on malformed HTML. I'd suggest using lxml, which is a Python binding for libxml2. It will parse and usually correct malformed HTML.
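
As a minimal sketch of the lxml suggestion, assuming the `lxml` package is installed: the snippet below parses the same fragment with `lxml.html` and pulls the text out of each `<font>` element with an XPath query. The names `root` and `chars` are illustrative only.

```python
import lxml.html

# Same markup as in the question, split across lines for readability.
text = (
    '<a href="#5" accesskey="5"></a>'
    '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
    '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
    '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
)

# The fragment has several top-level elements, so ask lxml to wrap them
# in a parent <div> and return that single element.
root = lxml.html.fragment_fromstring(text, create_parent=True)

# Select the text node inside every <font> element under the wrapper.
chars = [str(t) for t in root.xpath('.//font/text()')]
print(chars)
```

Querying by element (`font`) rather than collecting every text node keeps the result stable if other text appears elsewhere in the page.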
