I want to extract the one-liner texts from this website using Python. The messages in the HTML look like this:
<div class="olh_message">
<p>foobarbaz <img src="/static/emoticons/support-our-fruits.gif" title=":necta:" /></p>
</div>
My code looks like this so far:
import lxml.html
url = "http://www.scenemusic.net/demovibes/oneliner/"
xpath = "//div[@class='olh_message']/p"
tree = lxml.html.parse(url)
texts = tree.xpath(xpath)
texts = [text.text_content() for text in texts]
print(texts)
However, this only gives me foobarbaz. I would also like to get the title attribute of the img elements inside the message, so in this example the result should be foobarbaz :necta:. It seems I need lxml's DOM parser to do this, but I have no idea how. Can anyone give me a hint?
Thanks in advance!
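Try this:
import lxml.etree
url = "http://www.scenemusic.net/demovibes/oneliner/"
# Parse the page with the lenient HTML parser rather than the XML one
parser = lxml.etree.HTMLParser()
tree = lxml.etree.parse(url, parser)
# Selects only the title attributes of the img elements, not the surrounding text
texts = tree.xpath("//div[@class='olh_message']/p/img/@title")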
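The XPath above returns only the title strings, so the message text still has to be combined with them. A minimal sketch of one way to do that, assuming the markup matches the sample in the question: walk each p element and join its text, the title of every img child, and each child's tail text. The names SAMPLE and message_text are made up for illustration, and an inline HTML string stands in for the live page.

```python
import lxml.html

# Inline sample standing in for the live page (assumption: same markup as in the question)
SAMPLE = """
<div class="olh_message">
<p>foobarbaz <img src="/static/emoticons/support-our-fruits.gif" title=":necta:" /></p>
</div>
"""

def message_text(p):
    """Join a <p>'s text with the title of each <img> child, in document order."""
    parts = [p.text or ""]
    for child in p:
        if child.tag == "img":
            # Replace the emoticon image with its title attribute
            parts.append(child.get("title", ""))
        elif isinstance(child.tag, str):
            # Any other element contributes its text; comments and PIs are skipped
            parts.append(child.text_content())
        # Text that follows the child element inside the <p>
        parts.append(child.tail or "")
    return "".join(parts).strip()

doc = lxml.html.fromstring(SAMPLE)
texts = [message_text(p) for p in doc.xpath("//div[@class='olh_message']/p")]
print(texts)
```

For the sample message this should print ['foobarbaz :necta:']. To run it against the real page, lxml.html.parse(url) from the question could replace the fromstring call.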
Use:
//div[@class='olh_message']/p/node()
This selects all child nodes (elements, text nodes, PIs and comment nodes) of any p element that is a child of any div element whose class attribute is 'olh_message'.
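The same expression can also be used from Python. A rough sketch, assuming lxml's documented behaviour of returning text nodes as strings and elements as element objects from xpath(): each p is queried with node(), and every img element is swapped for its title. The names SAMPLE_HTML and oneliner_texts are hypothetical, and the inline string stands in for the real page.

```python
import lxml.html

# Inline sample standing in for the live page (assumption: same markup as in the question)
SAMPLE_HTML = """
<div class="olh_message">
<p>foobarbaz <img src="/static/emoticons/support-our-fruits.gif" title=":necta:" /></p>
</div>
"""

def oneliner_texts(doc):
    """Return one string per message, with each <img> replaced by its title."""
    results = []
    for p in doc.xpath("//div[@class='olh_message']/p"):
        parts = []
        # node() yields the child nodes of this <p> in document order
        for node in p.xpath("node()"):
            if isinstance(node, str):
                # Text nodes come back as (smart) strings
                parts.append(node)
            elif node.tag == "img":
                parts.append(node.get("title", ""))
        results.append("".join(parts).strip())
    return results

messages = oneliner_texts(lxml.html.fromstring(SAMPLE_HTML))
print(messages)
```

Child elements other than img are ignored in this sketch; they would need their own branch if the messages can contain links or other inline markup.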
Verification using XSLT as the host of XPath:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select="//div[@class='olh_message']/p/node()"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the following XML document:
<div class="olh_message">
<p>foobarbaz
<img src="/static/emoticons/support-our-fruits.gif" title=":necta:" />
</p>
</div>
the wanted, correct result is produced (showing that exactly the wanted nodes have been selected by the XPath expression):
foobarbaz
<img src="/static/emoticons/support-our-fruits.gif" title=":necta:"/>