Preserve line breaks inside <p> tags using DOMXPath?

Developer (https://www.devze.com), 2023-02-05 08:05. Source: the web
I'm currently using PHP and DOMXPath to get the contents of all of the <p> elements of a web page:

<?php
// ...
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}

My problem is that the string resulting from textContent does not respect <br /> tags that exist within those <p> elements. Instead it removes the line break and pushes words together that would normally be on separate lines. For example:

Sample HTML:

<p>
Some happy talk goes here talking about our great product.<br />
We would love for you to buy it!
</p>

<p>
Random information and what not<br />
Isn't that cool?
</p>

Current Output from PHP above:

Some happy talk goes here talking about our great product.We would love for you to buy it!

Random information and what notIsn't that cool?

I have tried $paragraphs = $doc->getElementsByTagName("p"); as well and it gives me the same thing.

Is there a way to make DOMXPath/DOMDocument preserve the line breaks? I need to be able to separate each of the words within a paragraph, and the current output makes that impossible.

If there is an alternative method for retrieving the string within <p> elements while preserving <br /> or '\n' that would also be great.

EDIT


Upon further investigation, the HTML in question is actually a list of anchors separated by <br> tags but with no actual line breaks:

<p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>

It turns out that the code above works properly with the sample HTML originally given, because that sample has an actual newline after each <br />.
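To illustrate the difference, here is a minimal sketch (the two sample paragraphs are made up for this example and are not the real page). It assumes only the standard DOM extension: textContent drops the <br> elements themselves, so the only line breaks that survive are the ones already present in the text.

```php
<?php
// Two hypothetical paragraphs: one with a real newline after <br />,
// one where the anchors are separated by <br> only.
$html = '<html><body>'
      . "<p>First line<br />\nSecond line</p>"
      . '<p><a href="#">One</a><br><a href="#">Two</a></p>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$texts = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    // textContent concatenates the text nodes and ignores element boundaries.
    $texts[] = $p->textContent;
}

var_dump($texts[0]); // the newline from the markup is still there
var_dump($texts[1]); // the two link texts are pushed together
```

In the second paragraph nothing but the <br> separates the link texts, which is why the words run together.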

UPDATE: Solved


With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, this has been solved. It's not very elegant, but it will do for now.

Example:

foreach ($paragraphs as $paragraph) {
    // DOMinnerHTML() is a user-defined helper (not a PHP built-in) that returns the paragraph's inner markup.
    $p_text = new DOMDocument();
    $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
    // Do whatever; in this case, get all of the words in an array.
    $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
    print_r($words);
}

This makes use of DOMinnerHTML (as suggested by @netcoder) to replace the instances of <br> with "\r\n" (as suggested by @ircmaxell), so that the line breaks are still present when textContent is read afterwards.
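DOMinnerHTML is not part of PHP, and its definition is not shown in this thread. The following is only a sketch of what such a helper might look like, assuming it simply serializes each child node with DOMDocument::saveHTML(); the sample markup and hrefs are made up.

```php
<?php
// Hypothetical helper: return the markup of all children of $element.
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes just that node as HTML.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><p><a href="/a.html">One</a><br><a href="/b.html">Two</a></p></body></html>'
);
$p = $doc->getElementsByTagName('p')->item(0);

// The inner markup still contains the <br> tag, so it can be swapped for "\r\n".
$inner = DOMinnerHTML($p);
echo $inner, "\n";
```

Because the helper returns markup rather than text, the result has to be parsed again (as the loop above does) before textContent can be read from it.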

Obviously there's some room for improvement, but it has solved my current issue.

Thanks for the help everyone,

Ben


Well, what I would do is replace the <br> elements with literal line breaks:

$doc = new DOMDocument();
$doc->loadHTML($html);

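// getElementsByTagName() returns a live list; copy it to an array first so
// that replacing nodes inside the loop does not skip any <br> elements.
$brs = iterator_to_array($doc->getElementsByTagName('br'));
foreach ($brs as $node) {
    $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
}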


$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph){
    echo $paragraph->textContent . "<br />";
}
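As a rough end-to-end sketch of this approach applied to a list of anchors like the one in the question (the hrefs here are shortened, made-up values): the <br> nodes are selected with an XPath query, whose result is a snapshot rather than a live list, and the paragraph text is then split on the inserted line breaks.

```php
<?php
// Hypothetical, shortened version of the anchor list from the question.
$html = '<html><body><p class="home_page_list">'
      . '<a href="/classic-checking.html">Classic Checking</a><br>'
      . '<a href="/interest-checking.html">Interest Checking</a><br>'
      . '<a href="/basic-savings.html">Savings Plans</a>'
      . '</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Replace every <br> element with a text node holding a newline.
foreach ($xpath->query('//br') as $br) {
    $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
}

$lines = [];
foreach ($xpath->query('/html/body//p') as $p) {
    // \R matches any newline sequence; empty pieces are dropped.
    foreach (preg_split('/\R/', $p->textContent, -1, PREG_SPLIT_NO_EMPTY) as $line) {
        $lines[] = trim($line);
    }
}

print_r($lines);
```

Each entry in $lines can then be split further into words, for example with explode(" ", $line).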


One possibility is to serialize the paragraph as XML, which keeps the <br/> tags in the output:

echo simplexml_import_dom($paragraph)->asXML();
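A small sketch of how that serialized string might then be turned into separate lines, assuming the SimpleXML extension is available; the sample paragraph is made up, and splitting on the <br> tag with a regular expression is just one option.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><p>Random information<br>Is that cool?</p></body></html>'
);
$paragraph = $doc->getElementsByTagName('p')->item(0);

// asXML() returns the element's markup, so the line break is still in the string.
$xml = simplexml_import_dom($paragraph)->asXML();

// Split on any spelling of the <br> tag, then strip the remaining tags.
$lines = array_map('strip_tags', preg_split('~<br\s*/?>~i', $xml));

print_r($lines);
```

Note that asXML() returns markup, so characters such as & come back as entities and may need decoding (for example with html_entity_decode()).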


I had the same situation. I use:

$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));

I then use urldecode() to change it back for display or before inserting it into the database.
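A minimal sketch of this placeholder round trip, using a made-up sample paragraph. Here the placeholder is swapped back with str_replace() instead of calling urldecode() on the whole string, on the assumption that the text could contain other %xx sequences or plus signs that should stay untouched.

```php
<?php
$html = '<html><body><p>Random information<br>Is that cool?</p></body></html>';

// urlencode('<br>') yields a plain-text token that the parser treats as text.
$placeholder = urlencode('<br>');

$doc = new DOMDocument();
$doc->loadHTML(str_replace('<br>', $placeholder, $html));

// The token survives in textContent where the <br> used to be.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// Swap only the token back, here for a newline (use '<br>' for display).
$restored = str_replace($placeholder, "\n", $text);

echo $restored, "\n";
```

str_replace() is case-sensitive and matches the exact string, so variants such as <br /> or <BR> would need their own replacements.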
