Preserve line breaks inside <p> tags using DOMXPath?

I'm currently using PHP and DOMXPath to get the contents of all of the <p> elements of a web page:

<?php
...    
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}

My problem is that the string resulting from textContent does not respect <br /> tags that exist within those <p> elements. Instead it removes the line break and pushes together words that would normally be on separate lines. For example:

Sample HTML:

<p>
Some happy talk goes here talking about our great product.<br />
We would love for you to buy it!
</p>

<p>
Random information and what not<br />
Isn't that cool?
</p>

Current Output from PHP above:

Some happy talk goes here talking about our great product.We would love for you to buy it!

Random information and what notIsn't that cool?

I have tried $paragraphs = $doc->getElementsByTagName("p"); as well and it gives me the same thing.

Is there a way to make DOMXPath/DOMDocument preserve the line breaks? I need to be able to separate each of the words within a paragraph, and the current output disallows that.

If there is an alternative method for retrieving the string within <p> elements while preserving <br /> or '\n', that would also be great.

EDIT

Upon further investigation, the HTML in question is actually a list of anchors separated by <br> tags but with no actual line breaks:

<p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>

It turns out that the original code works properly with the sample HTML given above, which has a real newline after each <br />.

UPDATE: Solved

With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, this has been solved. It's not very elegant, but it will do for now.

Example:

foreach ($paragraphs as $paragraph) {
    // DOMinnerHTML() is a user-defined helper (not a built-in) that returns
    // the paragraph's inner markup as a string.
    $p_text = new DOMDocument();
    $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
    // Do whatever; in this case, get all of the words in an array.
    $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
    print_r($words);
}

This makes use of DOMinnerHTML (as suggested by @netcoder) to replace the instances of <br> with "\r\n" (as suggested by @ircmaxell), which can then be evaluated after reading textContent.
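
The DOMinnerHTML() helper itself is not shown in the post. A minimal sketch of what such a helper might look like is below; the function body and the shortened sample markup are assumptions for illustration, not the asker's actual code. It relies on DOMDocument::saveHTML() accepting a node argument (PHP 5.3.6 and later).

```php
<?php
// Hypothetical reconstruction of the DOMinnerHTML() helper referenced above;
// the original implementation is not shown in the question.
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes just that node and its descendants.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

// Shortened, made-up markup in the style of the list from the EDIT.
$doc = new DOMDocument();
$doc->loadHTML('<p><a href="/a.html">Classic Checking</a><br> <a href="/b.html">Savings Plans</a></p>');
$paragraph = $doc->getElementsByTagName('p')->item(0);

$inner = DOMinnerHTML($paragraph);

// The <br> survives as markup, so it can be swapped for "\r\n" before re-parsing.
echo str_ireplace(array('<br>', '<br />'), "\r\n", $inner), "\n";
```

Because the helper returns markup rather than text, the <br> tags are still present in the string and can be replaced before the fragment is parsed again.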

Obviously there's some room for improvement, but it has solved my current issue.

Thanks for the help everyone,

Ben

Well, what I would do is replace the <br> elements with literal line breaks:

$doc = new DOMDocument();
$doc->loadHTML($html);

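// getElementsByTagName() returns a live list, so copy it to an array
// before replacing nodes; otherwise some <br> elements are skipped.
$brs = iterator_to_array($doc->getElementsByTagName('br'));
foreach ($brs as $node) {
    $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
}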

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph){
    echo $paragraph->textContent . "<br />";
}
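
For the stated goal of separating the words in each paragraph, the same idea can be sketched end to end. The sample markup below is made up, modeled on the list in the question's EDIT. Copying the node list with iterator_to_array() and splitting with preg_split() are choices made for this sketch, not part of the answer as posted; the copy is a precaution because getElementsByTagName() returns a live list.

```php
<?php
// Made-up markup modeled on the question's EDIT: <br> tags, no real newlines.
$html = '<p><a href="/a.html">Classic Checking</a><br><a href="/b.html">Interest Checking</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Copy the live node list into an array so replacing nodes does not skip entries.
foreach (iterator_to_array($doc->getElementsByTagName('br')) as $br) {
    $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
}

$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// Split on any run of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
$words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($words);
```

Splitting on whitespace treats the inserted newline like any other separator, so the words on either side of a former <br> no longer run together.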

One possibility is to serialize the paragraph as markup, which keeps the <br> tags in the resulting string:

echo simplexml_import_dom($paragraph)->asXML();
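
A short sketch of how that serialized string might be turned back into text with line breaks follows. The preg_replace() and strip_tags() steps are an assumption about how one could continue from here; the answer itself stops at asXML().

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<p>Random information and what not<br>Isn\'t that cool?</p>');
$paragraph = $doc->getElementsByTagName('p')->item(0);

// asXML() returns the element as markup, so the line-break tag is still in the string.
$xml = simplexml_import_dom($paragraph)->asXML();

// Turn each <br>, <br/> or <br /> into a newline, then drop the remaining tags.
$text = strip_tags(preg_replace('~<br\s*/?>~i', "\n", $xml));
echo $text, "\n";
```

Note that strip_tags() leaves character entities such as &amp; encoded, so html_entity_decode() may be needed if the text contains them.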

I have the same situation. I use:

$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));

Then I use urldecode() to change it back for display or for inserting into a database.
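
The placeholder round trip can be sketched as follows. The sample paragraph is taken from the question; swapping the placeholder back with str_replace() instead of decoding the whole string is a variation chosen for this sketch, so that unrelated "%" or "+" characters in the text are left alone. Like the one-liner above, it only catches the exact spelling <br>.

```php
<?php
$html = '<p>Random information and what not<br>Isn\'t that cool?</p>';

// urlencode('<br>') is "%3Cbr%3E", which the HTML parser treats as plain text.
$placeholder = urlencode('<br>');

$doc = new DOMDocument();
$doc->loadHTML(str_replace('<br>', $placeholder, $html));

$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// Swap only the placeholder back, here into a newline.
$restored = str_replace($placeholder, "\n", $text);
echo $restored, "\n";
```

Replacing the placeholder with "\n" suits word splitting; replacing it with '<br>' instead would restore the original tag for display.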
