How Can I Get Data From HTML Source Code with PHP and RegEx?

Developer https://www.devze.com 2023-02-07 03:09 Source: Internet
I have some HTML source code, and I need to extract some text from it. I cannot use DOM, because the document isn't well-formed.

The source may also change later without my knowing, so the solution must work in most situations.

I'm fetching the source with cURL, and I want to process it with the preg_match_all function and regular expressions.

Source :

...

<TR Class="Head1">

<TD width="15%"><font size="12">Name</font></TD>

<TD>:&nbsp;</TD>

<TD align="center"><font color="red">Alex</font></TD>

<TD width="25%"><b>Job</b></TD>

<TD>:&nbsp;</B></TD>

<TD align="center" width="25%"><font color="red">Doctor</font></TD>

</TR>

...

...

<TR Class="Head2">

<TD width="15%" align="left">Age</B></TD>

<TD>:&nbsp;</TD>

<TD align="center"><font color="red">32</font></TD>

<TD width="15%"><font size="10">data</TD></font>

<TD>&nbsp;</B></TD>

<TD width="40%">&nbsp;</TD>

</TR>

...

As you can see, the source is not well-formed. In fact, it's terrible! But there is nothing I can do about it. The real source is longer than this.

How can I get the data from the source? I can strip all of the HTML tags, but then how would I know the order of the data? What can I do with preg_match_all and regex? What else can I do?

I'm waiting for your help.


If you can use the DOM, this is far better than regexes. Take a look at PHP Tidy - it's designed to manage badly formed HTML.
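As a rough sketch of the Tidy suggestion, assuming the tidy extension is installed: the helper name repairHtml and the chosen options are my own picks for illustration, not something from the question. If the extension is missing, the function simply hands the input back unchanged.

```php
<?php
/**
 * Hypothetical helper: try to repair badly formed HTML with the tidy extension.
 * Returns the input unchanged when tidy is unavailable or the repair fails.
 */
function repairHtml(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // tidy extension not loaded
    }

    // Assumed options: emit XHTML and disable line wrapping.
    $config = ['output-xhtml' => true, 'wrap' => 0];
    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? $html : $repaired;
}

// Sample fragment with a stray </B>, similar to the markup in the question.
echo repairHtml('<table><TR><TD>Alex</B></TD></TR></table>'), "\n";
```

The repaired markup can then be handed to DOMDocument or another parser instead of being matched with regular expressions.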


You can use DOMDocument to load badly formed HTML:

$doc = new DOMDocument();
@$doc->loadHTML('<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>');

// getElementsByTagName() emits no warnings, so it needs no @ operator
$tds = $doc->getElementsByTagName('td');
foreach ($tds as $td) {
    echo $td->textContent, "\n";
}

I'm suppressing warnings in the above code for brevity.

Output:

Age
: 
32
data
  <!-- space -->
  <!-- space -->

Using regex to parse HTML can be a futile effort as HTML is not a regular language.
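To address the "how do I know the order of the data" part, here is a minimal sketch that builds on the DOMDocument approach above. It assumes every label is followed by a cell containing only ":" and then by the value cell, as in the two sample rows. The function name extractPairs and the wrapping <table> are my own additions for illustration.

```php
<?php
// Sample input: the two rows from the question, wrapped in a <table> (my assumption).
$html = <<<'HTML'
<table>
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>:&nbsp;</B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>
</table>
HTML;

/**
 * Hypothetical helper: returns label => value pairs taken from
 * "label cell, ':' cell, value cell" runs inside each table row.
 */
function extractPairs(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $pairs = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            // Strip whitespace and non-breaking spaces (U+00A0) from both ends.
            $cells[] = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $td->textContent);
        }
        // A ":" cell separates a label (cell before) from its value (cell after).
        foreach ($cells as $i => $cell) {
            if ($cell === ':' && $i > 0 && isset($cells[$i + 1])) {
                $pairs[$cells[$i - 1]] = $cells[$i + 1];
            }
        }
    }
    return $pairs;
}

print_r(extractPairs($html));
```

Because the pairing relies on cell positions rather than on attributes such as width or align, it should survive cosmetic changes to the markup, though it will miss values if the ":" separator cells ever disappear.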


Don't use RegEx. The link is funny but not informative, so the long and short of it is that HTML markup is not a regular language, hence cannot be parsed simply using regular expressions.

You could use RegEx to parse individual 'tokens' (a single open tag, a single attribute name or value, ...) as part of a recursive parsing algorithm, but you cannot use a magic RegEx to parse HTML all on its own.

Or you could use a parser.

Since the markup isn't valid, maybe you could use TagSoup or PHP:Tidy.
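For what the token-level use of RegEx mentioned above might look like, here is a small sketch. It only pulls out the text of each cell and is not a real parser: it assumes cells are never nested and that every cell has a closing </TD>. The function name cellTexts is made up for this example.

```php
<?php
// Sample input: the second row from the question.
$row = <<<'ROW'
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>
ROW;

/**
 * Hypothetical helper: returns the trimmed text of every <td> cell, in document order.
 */
function cellTexts(string $html): array
{
    // One "token" per cell: <td ...> up to the nearest </td>,
    // case-insensitive (i) and spanning newlines (s).
    preg_match_all('~<td\b[^>]*>(.*?)</td>~is', $html, $matches);

    $texts = [];
    foreach ($matches[1] as $inner) {
        // Drop any tags left inside the cell, then decode entities such as &nbsp;.
        $text = html_entity_decode(strip_tags($inner), ENT_QUOTES | ENT_HTML5, 'UTF-8');
        // Strip whitespace and non-breaking spaces (U+00A0) from both ends.
        $texts[] = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text);
    }
    return $texts;
}

echo implode(' | ', cellTexts($row)), "\n";
```

This keeps the pattern independent of attribute values, but a parser is still the safer choice once the markup contains nested tables or unclosed cells.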


// PCRE patterns need delimiters (~ here); the nowdoc keeps the backslashes literal.
$regex = <<<'EOF'
~<TR Class="Head2">\s+<TD width="15%" align="left">Age</B></TD>\s+<TD>:&nbsp;</TD>\s+<TD align="center"><font color="red">(\d+)</font></TD>\s+<TD width="15%"><font size="10">(\w+)</TD></font>\s+<TD>&nbsp;</B></TD>\s+<TD width="40%">&nbsp;</TD>\s+</TR>~
EOF;

// $text holds the HTML source fetched earlier
preg_match_all($regex, $text, $result);

var_dump($result);