I have some HTML source code and I need to extract some text from it. I can't use the DOM, because the document isn't well-formed.
The source may change later, and I won't know when it does, so the solution must hold up in most situations.
I'm fetching the source with curl, and I plan to process it with the preg_match_all function and regular expressions.
Source:
...<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>: </TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>: </B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
...
...
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>: </TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD> </B></TD>
<TD width="40%"> </TD>
</TR>
...
As you can see, the source is not well-formed. In fact, it's terrible! But there is nothing I can do about that. The real source is longer than this.
How can I get the data out of the source? I could strip all of the HTML tags, but then how would I know the order of the data? What can I do with preg_match_all and regex? What else can I do?
Thanks in advance for your help.
If you can use the DOM, that is far better than regexes. Take a look at PHP Tidy - it's designed to handle badly formed HTML.
You can use DOMDocument to load badly formed HTML:
$doc = new DOMDocument();
@$doc->loadHTML('<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>: </TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD> </B></TD>
<TD width="40%"> </TD>
</TR>');
$tds = @$doc->getElementsByTagName('td');
foreach ($tds as $td) {
echo $td->textContent, "\n";
}
I'm suppressing warnings in the above code for brevity.
Output:
Age
:
32
data
<!-- space -->
<!-- space -->
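To recover which value belongs to which label, the cell texts can be paired up once they are in document order. The following is only a sketch: the helper name extractPairs is hypothetical, the surrounding <table> tag is assumed, and it relies on the assumption (taken from the sample rows) that each label cell is followed by a cell containing just ":" and then by the value cell. It uses libxml_use_internal_errors() rather than @ to keep parse warnings quiet.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Assumption: label cell, then a cell holding only ":", then the value cell.
function extractPairs(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Text of every cell, in document order.
    $cells = [];
    foreach ($doc->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }

    // A ":" cell marks "label before me, value after me".
    $pairs = [];
    $last = count($cells) - 1;
    for ($i = 1; $i < $last; $i++) {
        if ($cells[$i] === ':') {
            $pairs[$cells[$i - 1]] = $cells[$i + 1];
        }
    }
    return $pairs;
}

// Sample rows from the question, wrapped in an assumed <table>.
$html = <<<'HTML'
<table>
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>: </TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>: </B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>: </TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD> </B></TD>
<TD width="40%"> </TD>
</TR>
</table>
HTML;

print_r(extractPairs($html));
```

For the sample rows this should pair Name with Alex, Job with Doctor and Age with 32. The "data" cell is skipped because no ":" cell follows it. If the real page uses a different separator, the comparison against ':' is the part to adjust.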
Using regex to parse HTML can be a futile effort as HTML is not a regular language.
Don't use RegEx. The link is funny but not informative, so the long and short of it is that HTML markup is not a regular language, and hence cannot be parsed with regular expressions alone.
You could use RegEx to parse individual 'tokens' (a single open tag, a single attribute name or value...) as part of a recursive parsing algorithm, but you cannot use a magic RegEx to parse HTML all on its own.
Or you could use a parser.
Since the markup isn't valid, maybe you could use TagSoup or PHP:Tidy.
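As a rough illustration of the "regex for tokens only" idea, here is a minimal sketch in which the regular expression recognises one token at a time (a closing tag, an opening tag, or a run of text) and ordinary PHP code keeps track of which cell it is in. The function names tokenize and cellTexts are hypothetical, and the pattern assumes that attribute values never contain a ">" character.

```php
<?php
// Hypothetical tokenizer: the regex only recognises single tokens.
function tokenize(string $html): array
{
    // Alternatives: closing tag | opening tag (attributes ignored) | text run.
    $pattern = '~</\s*([a-z][a-z0-9]*)\s*>|<([a-z][a-z0-9]*)\b[^>]*>|([^<]+)~i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $tokens = [];
    foreach ($matches as $m) {
        if (!empty($m[1])) {
            $tokens[] = ['close', strtolower($m[1])];
        } elseif (!empty($m[2])) {
            $tokens[] = ['open', strtolower($m[2])];
        } elseif (isset($m[3]) && trim($m[3]) !== '') {
            $tokens[] = ['text', trim($m[3])];
        }
    }
    return $tokens;
}

// Hypothetical walker: the loop, not the regex, tracks the current cell.
function cellTexts(array $tokens): array
{
    $cells = [];
    $current = null; // null means "not inside a <td>"
    foreach ($tokens as [$kind, $value]) {
        if ($kind === 'open' && $value === 'td') {
            if ($current !== null) {
                $cells[] = $current; // previous cell was never closed
            }
            $current = '';
        } elseif ($kind === 'close' && $value === 'td') {
            if ($current !== null) {
                $cells[] = $current;
                $current = null;
            }
        } elseif ($kind === 'text' && $current !== null) {
            $current .= $value;
        }
        // Stray tags such as </B> or a late </font> are simply ignored.
    }
    if ($current !== null) {
        $cells[] = $current;
    }
    return $cells;
}

// Second sample row from the question.
$row = <<<'HTML'
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>: </TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD> </B></TD>
<TD width="40%"> </TD>
</TR>
HTML;

print_r(cellTexts(tokenize($row)));
```

Because stray or misordered tags are ignored by the walker, the broken nesting in the sample does not stop it from producing one entry per cell. A real parser will still cope with far more cases than this sketch does.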
$regex = <<<'EOF'
~<TR Class="Head2">\s+<TD width="15%" align="left">Age</B></TD>\s+<TD>: </TD>\s+<TD align="center"><font color="red">(\d+)</font></TD>\s+<TD width="15%"><font size="10">(\w+)</TD></font>\s+<TD> </B></TD>\s+<TD width="40%"> </TD>\s+</TR>~
EOF;
// $text holds the HTML source fetched with curl
preg_match_all($regex, $text, $result);
var_dump($result);
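The pattern above is tied to the exact attributes of one row, so it stops matching as soon as the markup changes. A looser variant, offered only as a sketch, captures whatever sits between each <TD ...> and the next </TD> and then removes leftover tags with strip_tags(). The helper name tdTexts is hypothetical, and the approach assumes every cell is still closed by a </TD> tag.

```php
<?php
// Hypothetical helper: visible text of every <TD> cell, in document order.
function tdTexts(string $html): array
{
    // i: case-insensitive tag names; s: "." may span newlines inside a cell.
    preg_match_all('~<td\b[^>]*>(.*?)</td\s*>~is', $html, $matches);

    // Drop any leftover tags (including stray closers) and trim whitespace.
    return array_map(
        static fn(string $cell): string => trim(strip_tags($cell)),
        $matches[1]
    );
}

// First sample row from the question.
$text = <<<'HTML'
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>: </TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>: </B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
HTML;

print_r(tdTexts($text));
```

The cells come back in document order, so the position of each ":" entry shows which label and value belong together.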