I'm currently using PHP and DOMXPath to get the contents of all of the <p> elements of a web page:
<?php
...
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");
foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}
My problem is that the string resulting from textContent does not respect <br /> tags that exist within those <p> elements. Instead, it removes the line break and pushes together words that would normally be on separate lines. For example:
Sample HTML:
<p>
Some happy talk goes here talking about our great product.<br />
We would love for you to buy it!
</p>
<p>
Random information and what not<br />
Isn't that cool?
</p>
Current Output from PHP above:
Some happy talk goes here talking about our great product.We would love for you to buy it!
Random information and what notIsn't that cool?
I have tried $paragraphs = $doc->getElementsByTagName("p"); as well, and it gives me the same thing.
Is there a way to make DOMXPath/DOMDocument preserve the line breaks? I need to be able to separate each of the words within a paragraph, and the current output disallows that.
If there is an alternative method for retrieving the string within <p> elements while preserving <br /> or '\n', that would also be great.
EDIT
Upon further investigation, the HTML in question is actually a list of anchors separated by <br> tags but with no actual line breaks:
<p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>
Turns out that this works properly with the original HTML given.
UPDATE: Solved
With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, this has been solved. It's not very elegant, but it will do for now.
Example:
foreach ($paragraphs as $paragraph) {
    $p_text = new DOMDocument();
    $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
    // Do whatever, in this case get all of the words in an array.
    $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
    print_r($words);
}
This makes use of DOMinnerHTML (as suggested by @netcoder) to replace the instances of <br> with "\r\n" (as suggested by @ircmaxell), which can then be evaluated post textContent.
Obviously there's some room for improvement, but it has solved my current issue.
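The DOMinnerHTML helper is not shown in this post. As a rough sketch of one possible improvement, assuming a hypothetical stand-in named innerHtml() that serializes each child node with DOMDocument::saveHTML(), the <br> replacement and word splitting could be done without loading a second document. The shortened hrefs below are made up for illustration:

```php
<?php
// Hypothetical stand-in for the DOMinnerHTML() helper mentioned above.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes only that child node.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
// Shortened version of the anchor list; the hrefs are placeholders.
$doc->loadHTML('<p><a href="/a.html">Classic Checking</a><br> <a href="/b.html">Savings Plans</a></p>');
$p = $doc->getElementsByTagName('p')->item(0);

// Turn every <br> variant into a newline, then drop the remaining tags.
$text = preg_replace('/<br\s*\/?>/i', "\n", innerHtml($p));
$text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');

// Split on whitespace and the same punctuation as above, skipping empty pieces.
$words = preg_split('/[\s,.&:-]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($words);
```

Using preg_split() with PREG_SPLIT_NO_EMPTY avoids the empty strings that explode(" ", ...) produces when two separators sit next to each other.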
Thanks for the help everyone,
Ben
Well, what I would do is replace the <br> elements with literal line breaks:
$doc = new DOMDocument();
$doc->loadHTML($html);
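// getElementsByTagName() returns a live list, so copy it before replacing nodes;
// otherwise some <br> elements can be skipped during iteration.
$brs = iterator_to_array($doc->getElementsByTagName('br'));
foreach ($brs as $node) {
    $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
}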
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");
foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}
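To show how this could feed the word splitting the question asks for, here is a minimal sketch using one paragraph from the sample HTML. The function name paragraphWords() is made up for illustration, and it uses "\n" rather than "\r\n" since any whitespace works as a separator here:

```php
<?php
// Hypothetical helper: returns one array of words per <p> element.
function paragraphWords(string $html): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Copy the live node list first so replacing nodes does not disturb iteration.
    foreach (iterator_to_array($doc->getElementsByTagName('br')) as $br) {
        $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
    }

    $result = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('/html/body//p') as $p) {
        // Split on any run of whitespace, dropping empty pieces.
        $result[] = preg_split('/\s+/', $p->textContent, -1, PREG_SPLIT_NO_EMPTY);
    }
    return $result;
}

$html = '<html><body><p>Random information and what not<br />Isn\'t that cool?</p></body></html>';
print_r(paragraphWords($html));
```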
One possibility is to serialize the paragraph as XML, which keeps the <br /> tags in the output:
echo simplexml_import_dom($paragraph)->asXML();
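As a sketch of how that serialized markup might then be used, the string can be split at the <br> tags before the remaining tags are stripped. The variable names here are illustrative, and the exact form of the serialized <br> is not assumed, which is why the pattern accepts several variants:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Random information and what not<br />Isn\'t that cool?</p></body></html>');
$paragraph = $doc->getElementsByTagName('p')->item(0);

// asXML() keeps the markup, including the <br> element, unlike textContent.
$xml = simplexml_import_dom($paragraph)->asXML();

// Split at each <br> variant, then strip the remaining tags from each piece.
$lines = [];
foreach (preg_split('/<br\s*\/?>/i', $xml) as $piece) {
    $lines[] = trim(html_entity_decode(strip_tags($piece), ENT_QUOTES, 'UTF-8'));
}
print_r($lines);
```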
I have the same situation. I use:
$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));
I then use urldecode() to change it back for display or before inserting it into the database.
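A small sketch of this placeholder idea, assuming the page only uses the plain <br> spelling. Here the placeholder is swapped back with str_replace() rather than by decoding the whole string, so that any other percent sequences or plus signs in the text are left alone; the variable names are illustrative:

```php
<?php
$html = '<html><body><p>Random information and what not<br>Isn\'t that cool?</p></body></html>';

// urlencode('<br>') yields a tag-free token that survives textContent.
$placeholder = urlencode('<br>');

$document = new DOMDocument();
$document->loadHTML(str_replace('<br>', $placeholder, $html));
$text = $document->getElementsByTagName('p')->item(0)->textContent;

// Swap the placeholder for a real line break, then split into lines.
$lines = explode("\n", str_replace($placeholder, "\n", $text));
print_r($lines);
```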