Retrieving relative DOM nodes in PHP

I want to retrieve the element that comes right after a given tag in a document. For example, in the markup below I would like to retrieve only <blockquote>Content 1</blockquote> for each span.

<html>
<body>


<span id=12341></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>

<!-- misc html in between including other spans w/ no relative blockquotes-->

<span id=12342></span>
<blockquote>Content 1</blockquote>

<!-- misc html in between including other spans w/ no relative blockquotes-->

<span id=12343></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>
<blockquote>Content 4</blockquote>

<!-- misc html in between including other spans w/ no relative blockquotes-->    

<span id=12344></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>


</body>
</html>

Now two things I'm wondering:

1.) How can I write an expression that matches and outputs only a blockquote that comes right after an empty element (<span></span>)?

2.) How could I get Content 2, Content 3, etc. if I ever need to output them in the future, while still applying the rule from the previous question?

1.) How can I write an expression that matches and outputs only a blockquote that comes right after an empty element (<span></span>)?

Assuming that the provided text is converted to a well-formed XML document (you need to enclose the values of the id attributes in quotes), use:

/*/*/span/following-sibling::*[1][self::blockquote]

In English: select every blockquote element that is the first following sibling element of a span, where that span is a grandchild of the top element of the document.
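
A minimal sketch of running this expression from PHP, assuming the markup has been made well-formed by quoting the id values. The sample document and the variable names ($xml, $firstQuotes) are made up for illustration; the second span is deliberately followed by a p element to show that its blockquote is not selected.

```php
<?php
// Illustrative markup modelled on the question, with the id values quoted
// so that it parses as XML.
$xml = <<<'XML'
<html>
<body>
<span id="12341"></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<p>misc</p>
<span id="12342"></span>
<p>not a blockquote</p>
<blockquote>Skipped</blockquote>
<span id="12343"></span>
<blockquote>Content 1</blockquote>
</body>
</html>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Take the first following sibling element of each span and keep it
// only if it is a blockquote.
$nodes = $xpath->query('/*/*/span/following-sibling::*[1][self::blockquote]');

$firstQuotes = array();
foreach ($nodes as $node) {
    $firstQuotes[] = $node->textContent;
}
print_r($firstQuotes);
```

The following-sibling::* step only considers elements, so the whitespace between the tags does not get in the way.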

2.) How could I get Content 2, Content 3, etc. if I ever need to output them in the future, while still applying the rule from the previous question?

Yes.

You can get all sets of contiguous blockquote elements following a span:

 /*/*/span/following-sibling::blockquote
          [preceding-sibling::*[not(self::blockquote)][1][self::span]]

You can get the contiguous set of blockquote elements following the (N+1)-st span by:

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=$vN]
           ]

where $vN should be substituted by the number N.

Thus, the contiguous set of blockquote elements following the first span is selected by:

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=0]
           ]

The contiguous set of blockquote elements following the second span is selected by:

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=1]
           ]

etc. ...

For example, the following expression, which you can try in the XPath Visualizer, selects the blockquote elements following the fourth span:

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=3]
           ]
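
The parameterised expression can be tried from PHP along these lines. This is only a sketch: as far as I know DOMXPath has no way to bind a variable such as $vN, so the number is formatted into the query string instead. The helper name blockquotesAfterSpan() and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of the contiguous blockquotes that
// follow the ($n + 1)-st span. $n is cast to int before it goes into the query.
function blockquotesAfterSpan(DOMXPath $xpath, $n)
{
    $query = sprintf(
        '/*/*/span/following-sibling::blockquote'
        . '[preceding-sibling::*[not(self::blockquote)][1]'
        . '[self::span and count(preceding-sibling::span)=%d]]',
        (int) $n
    );
    $texts = array();
    foreach ($xpath->query($query) as $node) {
        $texts[] = $node->textContent;
    }
    return $texts;
}

// Illustrative well-formed markup (id values quoted).
$xml = <<<'XML'
<html>
<body>
<span id="12341"></span>
<blockquote>A1</blockquote>
<blockquote>A2</blockquote>
<p>misc</p>
<span id="12342"></span>
<blockquote>B1</blockquote>
</body>
</html>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$first  = blockquotesAfterSpan($xpath, 0); // blockquotes after the first span
$second = blockquotesAfterSpan($xpath, 1); // blockquotes after the second span
print_r($first);
print_r($second);
```

Note that count(preceding-sibling::span) counts every earlier sibling span, including spans that have no blockquotes after them.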


Short answer: Load your HTML into DOMDocument, and select the nodes you want with XPath.

http://www.php.net/DOM

Long answer:

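// 'yourFile.html' is a placeholder path; $body must be the <body> element.
libxml_use_internal_errors(true); // tolerate markup that is not well-formed
$dom = new DOMDocument;
$dom->loadHTMLFile('yourFile.html');
$body = $dom->getElementsByTagName('body')->item(0);

$flag = false;   // true only while the previous element was a span
$TEXT = array();
foreach ($body->childNodes as $el) {
    // Skip text nodes, comments, etc.; only elements matter here.
    if ($el->nodeType !== XML_ELEMENT_NODE) continue;
    if ($el->nodeName === 'span') {
        $flag = true;
        continue;
    }
    if ($flag && $el->nodeName === 'blockquote') {
        $TEXT[] = $el->textContent;
    }
    // Any element other than a span ends the "right after a span" state.
    $flag = false;
}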

Try the following *

/html/body/span/following-sibling::*[1][self::blockquote]

to match every blockquote that comes immediately after a span element that is a direct child of body, or

//span/following-sibling::*[1][self::blockquote]

to match every blockquote that comes immediately after a span element anywhere in the document.

* Edit: fixed XPath. Credits to Dimitre. My initial version would match any first blockquote after the span, e.g. it would match span p blockquote, which is not what you wanted.

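Both of the above match the "Content 1" blockquotes. If you want to match the other blockquotes following the span as well (siblings, not descendants), remove the [1]. Note that this then matches every following sibling blockquote, including those that come after later spans.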

Example:

$dom = new DOMDocument;
$dom->load('yourFile.xml');
$xp = new DOMXPath($dom);
$query = '/html/body/span/following-sibling::*[1][self::blockquote]';
foreach($xp->query($query) as $blockquote) {
    echo $dom->saveXml($blockquote), PHP_EOL;
}

If you want to do that without XPath, you can do:

$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load('yourFile.xml');
$body = $dom->getElementsByTagName('body')->item(0);
foreach($body->getElementsByTagName('span') as $span) {
    if($span->nextSibling !== NULL &&
       $span->nextSibling->nodeName === 'blockquote')
    {
        echo $dom->saveXml($span->nextSibling), PHP_EOL;
    }
}

If the HTML you scrape is not valid XHTML, use loadHTMLFile() instead of load() to load the markup. You can suppress the resulting parser warnings with libxml_use_internal_errors(TRUE) and discard them afterwards with libxml_clear_errors().
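
A rough sketch of that approach, using loadHTML() on a string rather than loadHTMLFile() on a file so that it runs on its own. The markup is a shortened, made-up variant of the question's sample with the id values left unquoted, and the variable names are only for illustration.

```php
<?php
// Markup with unquoted id values, as in the question: not well-formed XML.
$html = '<html><body>'
      . '<span id=12341></span><blockquote>Content 1</blockquote>'
      . '<blockquote>Content 2</blockquote>'
      . '<p>misc</p>'
      . '<span id=12342></span><blockquote>Content 1</blockquote>'
      . '</body></html>';

// Collect parser warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();                 // discard whatever was collected
libxml_use_internal_errors($previous); // restore the earlier setting

$xpath = new DOMXPath($dom);
$found = array();
foreach ($xpath->query('//span/following-sibling::*[1][self::blockquote]') as $bq) {
    $found[] = $bq->textContent;
}
print_r($found);
```

The same XPath queries work unchanged, because the HTML parser still produces an ordinary DOM tree.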

Also see Best methods to parse HTML for alternatives to DOM (though I find DOM a good choice).

Besides @Dimitre's good answer, you could also use:

/html
   /body
      /blockquote[preceding-sibling::*[not(self::blockquote)][1]
                     /self::span[@id='12341']]
