XPath: Select Current and Next Node's text by Current Node Attributes

Developer https://www.devze.com 2023-02-14 19:05 Source: Internet
If this is a repeat question, I apologize, but I can't find another question either on SO or elsewhere that seems to handle what I need. Here is my question:

I'm using scrapy to get some information out of this webpage. For clarity, following is a block of the source code from that webpage, which is of interest to me:

<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
                        <span class='distribution'>(SCI)</span></p> 

<span class='normaltext'> 
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is  directed  to answering the question: What makes us human? This course is a survey of  biological  anthropology and  archaeology.  [<span class='Helpcourse'
            onMouseover="showtip(this,event,'24 Lectures')"
            onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
            onMouseover="showtip(this,event,'12 Tutorials')"
            onMouseout="hidetip()">12T</span>]<br> 

<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>

<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br> 
</span><br/><br/<br/> 

Almost all of the code on that page looks like the above block.

From all of this, I need to grab:

  1. ANT101H5 Introduction to Biological Anthropology and Archaeology
  2. Exclusion: ANT100Y5
  3. Prerequisite: ANT102H5

The problem is that Exclusion: is inside a <span class="title2"> and ANT100Y5 is inside the following <a>.

I don't seem to be able to grab both of them out of this source code. Currently, I have code that attempts (and fails) to grab ANT100Y5 which looks like:

hxs = HtmlXPathSelector(response)
sites = hxs.select("//*[(name() = 'p' and @class = 'titlestyle') or (name() = 'a' and @href and preceding-sibling::'//span/@class=title2')]")

I'd appreciate any help with this, even if it's a "you're blind for not seeing this other SO question which answers this perfectly" (in which case, I will vote to close this myself). I really am at my wits' end.

Thanks in advance

EDIT: Complete original code after changes suggested by @Dimitre

I'm using the following code:

class regcalSpider(BaseSpider):
    name = "disc"
    allowed_domains = ['www.utm.utoronto.ca']
    start_urls = ['http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html']

    def parse(self, response):
            items = []
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("/*/p/text()[1] | \
                              (//span[@class='title2'])[1]/text() | \
                              (//span[@class='title2'])[1]/following-sibling::a[1]/text() | \
                              (//span[@class='title2'])[2]/text() | \
                              (//span[@class='title2'])[2]/following-sibling::a[1]/text()")

            for site in sites:
                    item = RegcalItem()
                    item['title'] = site.select("a/text()").extract()
                    item['link'] = site.select("a/@href").extract()
                    item['desc'] = site.select("text()").extract()
                    items.append(item)
            return items

            filename = response.url.split("/")[-2]
            open(filename, 'wb').write(response.body)

Which gives me this result:

[{"title": [], "link": [], "desc": []},
 {"title": [], "link": [], "desc": []},
 {"title": [], "link": [], "desc": []}]

This is not the output that I need. What am I doing wrong? Keep in mind that I'm running this script on the page mentioned above.


1. ANT101H5 Introduction to Biological Anthropology and Archaeology

p[@class='titlestyle']/text()

2. Exclusion: ANT100Y5

concat(
    span/span[@class='title2'][1],
    span/span[@class='title2'][1]/following-sibling::a[1]
    )

3. Prerequisite: ANT102H5

concat(
    span/span[@class='title2'][2],
    span/span[@class='title2'][2]/following-sibling::a[1]
    )


It's not difficult to select the three nodes you refer to (using techniques such as those of Flack). What's difficult is (a) selecting them without also selecting other things that you don't want, and (b) making your selection robust enough that it still selects them if the input is slightly different. We have to assume that you don't know exactly what's in the input - if you did, you wouldn't need to write an XPath expression to find out.

You've told us three things that you want to grab. But what are your criteria for selecting these three things, and not selecting something else? How much is known about what you are looking for?

You've expressed your problem as an XPath problem, but I would tackle it differently. I would start by transforming the input you have shown to something with better structure, using XSLT. In particular, I would try to wrap all the sibling elements that aren't within a <p> element into <p> elements, treating each group of successive elements ending in <br> as a paragraph. That can be done without too much difficulty using the <xsl:for-each-group group-ending-with> construct in XSLT 2.0.
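The grouping idea can be tried out without XSLT. The following is only a rough Python sketch of it, not the XSLT 2.0 construct itself: it walks the children of one parent element and closes a group each time it meets a <br>. The function name groups_ending_with_br and the shortened SNIPPET are assumptions made for illustration, and the sketch relies on the markup already being well-formed XML.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for one course block (description trimmed).
SNIPPET = """<span class='normaltext'>This course is a survey. [<span class='Helpcourse'>24L</span>]<br/>
<span class='title2'>Exclusion: </span><a href='#'>ANT100Y5</a><br/>
<span class='title2'>Prerequisite: </span><a href='#'>ANT102H5</a><br/>
</span>"""


def groups_ending_with_br(parent):
    """Split the mixed content of `parent` into text groups, closing a group at each <br>."""
    groups = []
    current = [parent.text or ""]
    for child in parent:
        if child.tag == "br":
            # A <br> ends the current group; collapse runs of whitespace.
            groups.append(" ".join("".join(current).split()))
            current = []
        else:
            current.append("".join(child.itertext()))
        # Text that follows the child belongs to whichever group is now open.
        current.append(child.tail or "")
    leftover = " ".join("".join(current).split())
    if leftover:
        groups.append(leftover)
    return groups


for group in groups_ending_with_br(ET.fromstring(SNIPPET)):
    print(group)
```

Once each group is its own string, the "Exclusion:" and "Prerequisite:" lines can be picked out by their prefix instead of by position.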


My answers are quite like those of @Flack:

Given this XML document (the provided fragment, corrected by closing the numerous unclosed <br> elements and wrapping everything in a single top element):

<body>
    <p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
        <span class='distribution'>(SCI)</span>
    </p>
    <span class='normaltext'> Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [
        <span class='Helpcourse' onMouseover="showtip(this,event,'24 Lectures')" onMouseout="hidetip()">24L</span>, 
        <span class='Helpcourse' onMouseover="showtip(this,event,'12 Tutorials')" onMouseout="hidetip()">12T</span>]
        <br/>
        <span class='title2'>Exclusion: </span>
        <a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a>
        <br/>
        <span class='title2'>Prerequisite: </span>
        <a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a>
        <br/>
    </span>
    <br/>
    <br/>
    <br/>
</body>

This XPath expression:

normalize-space(/*/p/text()[1])

when evaluated produces the wanted string (the surrounding quotes are not part of the result; I added them to show the exact string produced):

"ANT101H5 Introduction to Biological Anthropology and Archaeology"

This XPath expression:

concat((//span[@class='title2'])[1],
            (//span[@class='title2'])[1]
                   /following-sibling::a[1]
            )

when evaluated produces the following wanted result:

"Exclusion: ANT100Y5"

This XPath expression:

concat((//span[@class='title2'])[2],
            (//span[@class='title2'])[2]
                   /following-sibling::a[1]
            )

when evaluated produces the following wanted result:

"Prerequisite: ANT102H5"

Note: In this particular case the // abbreviation is not needed, and it should be avoided whenever possible, because it slows evaluation, in many cases forcing a traversal of a complete (sub)tree. I am using // intentionally, because the provided XML fragment doesn't give us the full structure of the XML document. This also demonstrates how to correctly index the results of // (note the surrounding parentheses), which helps prevent a very frequent mistake.
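The indexing mistake mentioned in the note can be seen in a small Python sketch. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and has no parenthesized expressions, so plain list indexing stands in for (//a)[1] here. The markup is simplified, and the second course code (XYZ100H5) is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two simplified course blocks; XYZ100H5 is a made-up course code.
XML = """<body>
    <span class='normaltext'>
        <a>ANT100Y5</a>
        <a>ANT102H5</a>
    </span>
    <span class='normaltext'>
        <a>XYZ100H5</a>
    </span>
</body>"""

root = ET.fromstring(XML)

# Like //a[1]: the [1] is applied per parent, so every <a> that is the
# first <a> child of its own parent is selected.
first_per_parent = [a.text for a in root.findall(".//a[1]")]

# Like (//a)[1]: gather all matches in document order, then index once.
first_overall = root.findall(".//a")[0].text

print(first_per_parent)
print(first_overall)
```

The first query returns one link per course block, the second returns a single link for the whole document, which is the behaviour the parenthesized form gives in a full XPath engine.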

UPDATE: The OP has requested a single XPath expression that selects all the required text nodes -- here it is:

/*/p/text()[1]
   |
    (//span[@class='title2'])[1]/text()
   |
    (//span[@class='title2'])[1]/following-sibling::a[1]/text()
   |
    (//span[@class='title2'])[2]/text()
   |
    (//span[@class='title2'])[2]/following-sibling::a[1]/text()

When applied on the same XML document as above, the concatenation of the text nodes is exactly what is required:

ANT101H5 Introduction to Biological Anthropology and Archaeology          
        Exclusion: ANT100Y5Prerequisite: ANT102H5

This result can be confirmed by running the following XSLT transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/p/text()[1]
   |
    (//span[@class='title2'])[1]/text()
   |
    (//span[@class='title2'])[1]/following-sibling::a[1]/text()
   |
    (//span[@class='title2'])[2]/text()
   |
    (//span[@class='title2'])[2]/following-sibling::a[1]/text()
   "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the same XML document (specified previously in this answer), the wanted, correct result is produced:

ANT101H5 Introduction to Biological Anthropology and Archaeology          
        Exclusion: ANT100Y5Prerequisite: ANT102H5
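The three strings produced by the normalize-space() and concat() expressions earlier in this answer can also be reproduced in plain Python, as a way to check the logic step by step. This is a sketch under assumptions: it uses the standard library's xml.etree.ElementTree, which has no following-sibling axis and no concat(), so those two steps are done by hand. The helper names title and label_with_link are hypothetical, and the description text in the sample is shortened. With a full XPath 1.0 engine such as lxml, the expressions could instead be passed to its xpath() method unchanged.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the corrected document (description shortened).
XML = """<body>
    <p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
        <span class='distribution'>(SCI)</span>
    </p>
    <span class='normaltext'> Anthropology is the global and holistic study of human biology and behaviour.
        <br/>
        <span class='title2'>Exclusion: </span>
        <a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a>
        <br/>
        <span class='title2'>Prerequisite: </span>
        <a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a>
        <br/>
    </span>
</body>"""


def string_value(elem):
    """XPath string-value of an element: all of its descendant text joined."""
    return "".join(elem.itertext())


def title(root):
    """Rough equivalent of normalize-space(/*/p/text()[1])."""
    first_text = root.find("p").text or ""
    return " ".join(first_text.split())


def label_with_link(root, n):
    """Rough equivalent of concat((//span[@class='title2'])[n], .../following-sibling::a[1])."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # findall returns matches in document order; n is 1-based as in XPath.
    span = root.findall(".//span[@class='title2']")[n - 1]
    siblings = list(parent_of[span])
    following = siblings[siblings.index(span) + 1:]
    link = next((s for s in following if s.tag == "a"), None)
    return string_value(span) + (string_value(link) if link is not None else "")


root = ET.fromstring(XML)
print(title(root))
print(label_with_link(root, 1))
print(label_with_link(root, 2))
```

Keeping the label and the link as one string per call avoids the run-together "Exclusion: ANT100Y5Prerequisite: ANT102H5" output that results from concatenating all selected text nodes at once.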

Finally: The following single XPath expression selects exactly the wanted text nodes in the HTML page at the provided link (after tidying the page into well-formed XML):

  (//p[@class='titlestyle'])[2]/text()[1]
|
  (//span[@class='title2'])[2]/text()
|
  (//span[@class='title2'])[2]/following-sibling::a[1]/text()
|
  (//span[@class='title2'])[3]/text()
|
  (//span[@class='title2'])[3]/following-sibling::a[1]/text()