Possible Duplicates:
Preg match text in php between html tags
RegEx match open tags except XHTML self-contained tags
I have a large amount of text formatted in the following way:
<P><B>1- TITLE</B>
<P>
<DL><DD> Text text text text text
text text
</DL><P>
<P><B>2 - Title 2</B>
<P>
<DL><DD> Text text text text text
text text Text text text text text
text text Text text text text text
text text
<br><I>Additional irrelevant information</I>
</DL><P>
I'm trying to use PHP's Regexp functions to retrieve the Title-Text value pairs while stripping out the extra characters as well as the irrelevant info that follows some of the text blocks. Preferably I'd like to:
Grab everything between <P><B> and </B> as the title
Grab all the text between <DL><DD> and the next HTML tag (<) as the text
and somehow keep the two associated together for further processing. Any idea how to do this with PHP's Regexp functions?
As the comments on your question suggest, questions along these lines are asked frequently on Stack Overflow, and the usual answer is "Don't try to parse HTML with regular expressions". Beyond making that point, however, I think it's useful to show an example of the suggested approach, which is to load the markup with a DOM parser and query it. For the case in your question, one could do:
<?php
$html = <<<EOF
<P><B>1- TITLE</B>
<P>
<DL><DD> Text text text text text
text text
</DL><P>
<P><B>2 - Title 2</B>
<P>
<DL><DD> Text text text text text
text text Text text text text text
text text Text text text text text
text text
<br><I>Additional irrelevant information</I>
</DL><P>
EOF;
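// Parse the markup; libxml repairs the unclosed tags and lowercases tag names.
$d = new DOMDocument;
$d->loadHTML($html);
$xp = new DOMXPath($d);
// Every <b> directly inside a <p> is treated as a title.
$matches = $xp->query("//p/b", $d);
foreach ($matches as $dn) {
    echo "Title is: " . $dn->nodeValue . "\n";
    // Step from the title's <p> across its following siblings to reach the text block.
    // This depends on how the parser nested the repaired tags, so it is fragile.
    $dl = $dn->parentNode->nextSibling->nextSibling->firstChild;
    $dd = $dl->firstChild;
    echo "Content is: " . $dd->nodeValue . "\n";
}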
?>
Depending on how robust you need this to be, you would probably want to check that the nextSiblings and children are elements with the tag names you expect, but this shows the idea anyway.
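If you do need the more defensive version, here is a minimal sketch of one way to go about it. It is not the only approach: instead of counting siblings, it asks XPath for the first <dd> that follows each title and then keeps only the text that comes before the first child element (such as the <br>). The function name extractPairs and the 'title'/'text' array keys are made up for this illustration, and it assumes every title is followed by its own <dd>; if a title had no text block, the next title's <dd> would be picked up instead.

```php
<?php
// Hypothetical helper: pairs each <p><b> title with the leading text of the next <dd>.
function extractPairs(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml's warnings about the unclosed tags internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);
    $pairs = [];
    foreach ($xp->query('//p/b') as $b) {
        // First <dd> after the title in document order, however the parser nested the tags.
        $dd = $xp->query('following::dd[1]', $b)->item(0);
        if (!$dd instanceof DOMElement) {
            continue; // no text block follows this title, so skip it
        }
        // Keep only the text nodes that precede the first child element of the <dd>.
        $text = '';
        foreach ($dd->childNodes as $child) {
            if ($child->nodeType !== XML_TEXT_NODE) {
                break;
            }
            $text .= $child->nodeValue;
        }
        $pairs[] = [
            'title' => trim($b->textContent),
            // Collapse line breaks and repeated spaces into single spaces.
            'text'  => trim(preg_replace('/\s+/', ' ', $text)),
        ];
    }
    return $pairs;
}
```

Calling extractPairs($html) on the sample above should give one array entry per title, which keeps the title and its text associated for whatever processing comes next.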