Retrieving relative DOM nodes in PHP

Developer site https://www.devze.com, 2023-01-27 09:08. Source: the web
I want to retrieve the data of the next element tag in a document, for example:

For each span, I would like to retrieve only the <blockquote>Content 1</blockquote> that immediately follows it.

<html>
<body>


<span id=12341></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>

<!-- misc html in between including other spans w/ no relative blockquotes-->

<span id=12342></span>
<blockquote>Content 1</blockquote>

<!-- misc html in between including other spans w/ no relative blockquotes-->

<span id=12343></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>
<blockquote>Content 4</blockquote>

<!-- misc html in between including other spans w/ no relative blockquotes-->    

<span id=12344></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>


</body>
</html>

Now two things I'm wondering:

1.) How can I write an expression that matches, and outputs only, a blockquote that comes right after a closed element (<span></span>)?

2.) If I ever need to output Content 2, Content 3, etc. in the future, how could I get them while still following the rules of the previous question?


Now two things I'm wondering:

1.) How can I write an expression that matches, and outputs only, a blockquote that comes right after a closed element (<span></span>)?

Assuming that the provided text is first converted to a well-formed XML document (the values of the id attributes must be enclosed in quotes), use:

/*/*/span/following-sibling::*[1][self::blockquote]

In plain English: select every blockquote element that is the first element sibling immediately following a span element, where that span is a grandchild of the top element of the document.
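
A minimal PHP sketch of how this expression could be evaluated with DOMXPath. The sample markup (a shortened, well-formed variant of the question's document) and the helper name firstBlockquotesAfterSpans() are assumptions made for illustration only:

```php
<?php
// Hypothetical sample: well-formed markup with quoted id values.
$xml = <<<'XML'
<html>
<body>
<span id="12341"></span>
<blockquote>A1</blockquote>
<blockquote>A2</blockquote>
<p>misc</p>
<span id="12342"></span>
<p>not a blockquote</p>
<blockquote>B-skipped</blockquote>
<span id="12343"></span>
<blockquote>C1</blockquote>
</body>
</html>
XML;

/**
 * Hypothetical helper: returns the text of every blockquote that is the
 * first element sibling after a span two levels below the document root.
 */
function firstBlockquotesAfterSpans(string $xml): array
{
    $dom = new DOMDocument;
    $dom->loadXML($xml);
    $xp = new DOMXPath($dom);

    $texts = array();
    $query = '/*/*/span/following-sibling::*[1][self::blockquote]';
    foreach ($xp->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

print_r(firstBlockquotesAfterSpans($xml));
```

In this sample the second span is followed by a p element, so the blockquote after that p is not selected; only A1 and C1 are returned.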

2.) If I ever need to output Content 2, Content 3, etc. in the future, how could I get them while still following the rules of the previous question?

Yes.

You can get all sets of contiguous blockquote elements following a span:

 /*/*/span/following-sibling::blockquote
          [preceding-sibling::*[not(self::blockquote)][1][self::span]]

You can get the contiguous set of blockquote elements following the (N+1)-st span by:

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=$vN]
           ]

where $vN should be substituted by the number N.
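
One way the substitution might look in PHP, as a sketch rather than a definitive implementation. As far as I know, DOMXPath has no way to bind XPath variables, so the integer is formatted into the expression with sprintf(). The sample markup and the helper name blockquotesAfterNthSpan() are hypothetical:

```php
<?php
// Hypothetical sample: three spans, each followed by a run of blockquotes.
$xml = <<<'XML'
<html>
<body>
<span id="12341"></span>
<blockquote>A1</blockquote>
<blockquote>A2</blockquote>
<p>misc</p>
<span id="12342"></span>
<blockquote>B1</blockquote>
<span id="12343"></span>
<blockquote>C1</blockquote>
<blockquote>C2</blockquote>
<blockquote>C3</blockquote>
</body>
</html>
XML;

/**
 * Hypothetical helper: returns the texts of the contiguous blockquotes
 * that follow the ($n + 1)-st span, where $n counts preceding sibling spans.
 */
function blockquotesAfterNthSpan(string $xml, int $n): array
{
    $dom = new DOMDocument;
    $dom->loadXML($xml);
    $xp = new DOMXPath($dom);

    // %d forces an integer, so nothing but digits reaches the expression.
    $query = sprintf(
        '/*/*/span/following-sibling::blockquote'
        . '[preceding-sibling::*[not(self::blockquote)][1]'
        . '[self::span and count(preceding-sibling::span)=%d]]',
        $n
    );

    $texts = array();
    foreach ($xp->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

print_r(blockquotesAfterNthSpan($xml, 0)); // blockquotes after the first span
```

Note that count(preceding-sibling::span) counts every earlier sibling span, including spans that have no blockquotes of their own.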

Thus, the contiguous set of blockquote elements following the first span is selected by:

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=0]
           ]

The contiguous set of blockquote elements following the second span is selected by:

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=1]
           ]

etc. ...

For example, the following expression, which can be checked in the XPath Visualizer, selects the blockquote elements following the fourth span:

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=3]
           ]


Short answer: Load your HTML into DOMDocument, and select the nodes you want with XPath.

http://www.php.net/DOM

Long answer:

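// Assumes $body is the <body> DOMElement of an already loaded DOMDocument.
$flag = false;    // true while the previous element was a <span>
$TEXT = array();  // collected blockquote texts
foreach ($body->childNodes as $el) {
    // Skip text and comment nodes; only elements matter here.
    if ($el->nodeType !== XML_ELEMENT_NODE) continue;
    if ($el->nodeName === 'span') {
        $flag = true;
        continue;
    }
    if ($flag && $el->nodeName === 'blockquote') {
        $TEXT[] = $el->textContent;
    }
    // Any element other than a span ends the "right after a span" state.
    $flag = false;
}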


Try the following *

/html/body/span/following-sibling::*[1][self::blockquote]

to match any first blockquotes after a span element that are direct children of body or

//span/following-sibling::*[1][self::blockquote]

to match any first blockquotes following a span element anywhere in the document

* Edit: fixed the XPath. Credits to Dimitre. My initial version would match any first blockquote after the span, e.g. it would match span p blockquote, which is not what you wanted.

Both of the above match the "Content 1" blockquotes. If you want to match the other blockquotes following the span as well (siblings, not descendants), remove the [1].

Example:

$dom = new DOMDocument;
$dom->load('yourFile.xml');
$xp = new DOMXPath($dom);
$query = '/html/body/span/following-sibling::*[1][self::blockquote]';
foreach($xp->query($query) as $blockquote) {
    echo $dom->saveXml($blockquote), PHP_EOL;
}

If you want to do that without XPath, you can do

$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load('yourFile.xml');
$body = $dom->getElementsByTagName('body')->item(0);
foreach($body->getElementsByTagName('span') as $span) {
    if($span->nextSibling !== NULL &&
       $span->nextSibling->nodeName === 'blockquote')
    {
        echo $dom->saveXml($span->nextSibling), PHP_EOL;
    }
}

If the HTML you scrape is not valid XHTML, use loadHTMLFile() instead of load() to read the markup. You can suppress parser errors with libxml_use_internal_errors(TRUE) and discard them afterwards with libxml_clear_errors().
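
A short sketch of that approach, using loadHTML() on a string instead of loadHTMLFile() so the example is self-contained. The sample markup keeps the unquoted id values from the question; the helper name firstBlockquotesFromHtml() is hypothetical:

```php
<?php
// Hypothetical sample: HTML with unquoted attribute values, not valid XML.
$html = <<<'HTML'
<html>
<body>
<span id=12341></span>
<blockquote>A1</blockquote>
<blockquote>A2</blockquote>
<span id=12342></span>
<blockquote>B1</blockquote>
</body>
</html>
HTML;

/**
 * Hypothetical helper: parses lenient HTML and returns the text of each
 * blockquote that is the first element sibling after a span.
 */
function firstBlockquotesFromHtml(string $html): array
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();                  // drop the collected warnings
    libxml_use_internal_errors($previous);  // restore the earlier setting

    $xp = new DOMXPath($dom);
    $texts = array();
    foreach ($xp->query('//span/following-sibling::*[1][self::blockquote]') as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

print_r(firstBlockquotesFromHtml($html));
```

Restoring the previous libxml error setting keeps the helper from changing error handling for the rest of the script.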

Also see "Best methods to parse HTML" for alternatives to DOM (though I find DOM a good choice).


Besides @Dimitre's good answer, you could also use:

/html
   /body
      /blockquote[preceding-sibling::*[not(self::blockquote)][1]
                     /self::span[@id='12341']]
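
As a rough sketch, this id-based expression could be wrapped in PHP as follows. The sample markup and the helper name blockquotesForSpanId() are hypothetical, and the sketch assumes the span ids are numeric, as they are in the question:

```php
<?php
// Hypothetical sample: well-formed markup with quoted, numeric id values.
$xml = <<<'XML'
<html>
<body>
<span id="12341"></span>
<blockquote>A1</blockquote>
<blockquote>A2</blockquote>
<p>misc</p>
<span id="12342"></span>
<blockquote>B1</blockquote>
</body>
</html>
XML;

/**
 * Hypothetical helper: returns the texts of the contiguous blockquotes
 * that directly follow the span with the given numeric id.
 */
function blockquotesForSpanId(string $xml, int $id): array
{
    $dom = new DOMDocument;
    $dom->loadXML($xml);
    $xp = new DOMXPath($dom);

    // %d forces an integer, so the id cannot break out of the quoted literal.
    $query = sprintf(
        '/html/body/blockquote'
        . '[preceding-sibling::*[not(self::blockquote)][1]'
        . "/self::span[@id='%d']]",
        $id
    );

    $texts = array();
    foreach ($xp->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

print_r(blockquotesForSpanId($xml, 12341));
```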