PHP regex problem

Developer https://www.devze.com 2023-02-03 15:22 Source: the web
I want to get the <form> from the site, but in this case there is still a lot of other HTML code inside the form part. How can I remove it? That is, how can I use a PHP regular expression to match just the form part of the site?

$str = file_get_contents('http://bingphp.codeplex.com');
preg_match_all('~<form.+</form>~iUs', $str, $match);
var_dump($match); 


You should not use regular expressions for extracting HTML content. Use a DOM parser.

E.g.

$doc = new DOMDocument();
$doc->loadHTMLFile("http://bingphp.codeplex.com");

$forms = $doc->getElementsByTagName('form');
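To turn the node list back into markup, each form node can be serialized with DOMDocument::saveHTML(). The following is a minimal sketch, not the answerer's own code: the inline $html string is a hypothetical stand-in for the downloaded page, and $formsHtml is a name introduced here for illustration.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<html><body><p>intro</p>'
      . '<form action="/search" method="get"><input type="text" name="q"></form>'
      . '<p>outro</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Collect the serialized markup of every <form> element.
$formsHtml = [];
foreach ($doc->getElementsByTagName('form') as $form) {
    // Passing a node to saveHTML() serializes only that node and its children.
    $formsHtml[] = $doc->saveHTML($form);
}

print_r($formsHtml);
```

Everything inside the form (inputs, nested markup) is kept, while sibling content such as the two paragraphs is left out.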

Update: If you want to remove the forms (not sure if you meant that):

// The node list is live, so iterate backwards while removing nodes.
for ($i = $forms->length; $i--;) {
    $node = $forms->item($i);
    $node->parentNode->removeChild($node);
}

Update 2:

I just noticed that they have one form that wraps the whole body content. So one way or another, you will actually get the whole page.


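The regex problem lies in greediness. For such cases a lazy quantifier such as .+? is advisable. Note that the U modifier already inverts greediness, so under it a plain .+ is lazy and .+? becomes greedy; use one or the other, not both.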
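The difference between greedy and lazy matching can be seen on a small input. This is only a sketch: the $str value is a hypothetical sample, not the markup of the site in the question, and the variable names $greedy, $lazy and $ungreedy are introduced here for illustration.

```php
<?php
// Hypothetical sample with two forms and unrelated markup between them.
$str = '<form id="a"><input></form><p>other html</p><form id="b"><input></form>';

// Greedy: .+ runs to the LAST </form>, so both forms and the <p> end up in one match.
preg_match_all('~<form.+</form>~is', $str, $greedy);

// Lazy: .+? stops at the first </form> that follows each <form.
preg_match_all('~<form.+?</form>~is', $str, $lazy);

// The U modifier inverts greediness, so a plain .+ is already lazy here.
preg_match_all('~<form.+</form>~iUs', $str, $ungreedy);

echo count($greedy[0]), "\n";
echo count($lazy[0]), "\n";
echo count($ungreedy[0]), "\n";
```

The pattern in the question already carries the U modifier, so it behaves like the lazy variant. Neither variant copes with a form nested inside another form, which is one more reason to prefer a parser.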

But do what @Felix said. A regular expression can work for HTML extraction, but you are usually looking for something specific and should therefore parse the document instead. It is also much simpler if you use QueryPath:

 $str = file_get_contents('http://bingphp.codeplex.com');
 print qp($str)->find("form")->html();


The best way I can think of is to use the Simple HTML DOM library with PHP to get the form(s) from the HTML page using DOM queries.

It is a little more convenient than using built-in XML parsers like SimpleXML or DOMDocument.

You can find the library here.


Normally you should use DOM to parse HTML, but in this case the web site is very far from being standard HTML, with some of the code being modified in place by JavaScript. It therefore cannot be loaded into the DOM object. This might be intentional, as a way of obfuscating the code.

In any case, it is not so much your RE (although using a non-greedy match would help), but the design of the site itself which is preventing you from parsing out what you want.
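For markup that is merely malformed, DOMDocument can often still build a tree if libxml's warnings are collected instead of emitted. The sketch below assumes a hypothetical broken fragment in $html; it does not reproduce the site from the question, and it does nothing for content that only exists after JavaScript has run in a browser.

```php
<?php
// Hypothetical non-standard markup: unknown tag, unclosed <p>, stray </div>.
$html = '<body><custom-widget><p>text'
      . '<form action="/go"><input name="q"></form></div></body>';

// Collect parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Inspect and then discard whatever the parser complained about.
$errors = libxml_get_errors();
libxml_clear_errors();

$forms = $doc->getElementsByTagName('form');
printf(
    "loaded: %s, parse errors: %d, forms found: %d\n",
    var_export($loaded, true),
    count($errors),
    $forms->length
);
```

The number of reported parse errors depends on the libxml version, so it is printed here rather than relied upon.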
