Python: how to fetch these strings

Developer (https://www.devze.com), 2023-01-27 14:14. Source: the web



text=u'<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'

I am new to Python. I want to get \ue6ec, \ue6f6, \ue6ec. How can I fetch these strings using the re module? Thank you very much!

Regular expressions are not a good tool for working with HTML. Use Beautiful Soup instead.

>>> from BeautifulSoup import BeautifulSoup
>>> text=u'<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
>>> t = BeautifulSoup(text)
>>> t.findAll(text=True)
[u'\ue689', u'\ue6ec', u'\ue6f6']
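
For readers without BeautifulSoup installed, the same idea (let an HTML parser collect the text nodes instead of matching them with re) can be sketched with Python 3's standard library. This is only an illustration, not the answer's method: html.parser stands in for BeautifulSoup, the class name TextCollector is made up for this example, and the u prefix is dropped because Python 3 strings are already Unicode.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every text node found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.chunks.append(data)


text = ('<a href="#5" accesskey="5"></a>'
        '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
        '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
        '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>')

collector = TextCollector()
collector.feed(text)
collector.close()  # flush any text still buffered by the parser

chunks = collector.chunks  # one entry per <font> element's text
```

The empty first link contributes nothing, so chunks holds one string per font element, in document order.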

Don't use regular expressions to parse HTML. Use BeautifulSoup, and see its documentation for details.

If you know that the page will always have that format, use the BeautifulSoup parser to find what you need in the HTML.

However, BeautifulSoup sometimes breaks on malformed HTML. I'd suggest using lxml, which is a Python binding for libxml2. It will parse and usually correct malformed HTML.
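
A minimal sketch of the lxml route, assuming Python 3 with the third-party lxml package installed. The function name font_texts and the malformed sample string are invented for illustration. The XPath expression simply selects the text inside every font element, which is where the wanted characters sit in the question's markup.

```python
from lxml import html


def font_texts(markup):
    """Hypothetical helper: return the text of every <font> element."""
    # fromstring() accepts a fragment and repairs what it can while parsing.
    root = html.fromstring(markup)
    # Convert lxml's string results to plain str for easy comparison.
    return [str(s) for s in root.xpath('.//font/text()')]


text = ('<a href="#5" accesskey="5"></a>'
        '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
        '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
        '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>')

wanted = font_texts(text)

# Assumed sample of broken markup: the <font> tag is never closed.
malformed = '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</a>'
recovered = font_texts(malformed)
```

How lxml repairs a given piece of broken markup depends on libxml2's recovery rules, so it is worth checking the result on real pages rather than relying on this sample.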