Is regex the right tool to find a line of HTML?

Developer https://www.devze.com 2022-12-12 01:55 Source: Internet
I have a PHP script that pulls some content off of a server, but the line that holds the content changes every day, so I can't just pull a specific line. However, the content is contained within a div that has a unique id. Is it possible (and is it the best way) to use regex to search for this unique id and then pass the number of the line it is on back to my script?

Example:

HTML file:

<html><head><title>Example</title></head>
<body>
<div id="Alpha"> Blah blah blah </div>
<div id="Beta"> Blah Blah Blah </div>
</body>
</html>

So let's say that I'm looking for the line with an opening div tag with an id of Alpha. The code should return 3, because the div with the id of Alpha is on the third line.


At the risk of providing more up-votes for Jeff who has already crossed the mountains of madness... see here

The argument rages back and forth, but... if it's a simple one-off or little-used script you are writing, then sure, use regex; if it's more complex and needs to be reliable with little future tweaking, then I'd suggest using an HTML parser. HTML is a nasty, often non-regular beast to tame. Use the right tool for the job... maybe in your case it's regex, or maybe it's a full-blown parser.


Generally, NO. But if you are sure that the div will always be on one line and that there is no other div inside it, you can use regex without a problem. Something like /<div id=\"mydivid\">(.*?)<\/div>/ or something similar.

Otherwise, DOMDocument would be a more sane way.
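A minimal sketch of the DOMDocument route, assuming the page has already been fetched into a string. The helper name getDivTextById is made up for illustration; DOMDocument, DOMXPath and libxml_use_internal_errors() come from PHP's DOM and libxml extensions.

```php
<?php
// Hypothetical helper: return the trimmed text of the div with the given id,
// or null when the document contains no such div.
function getDivTextById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally so messy markup does not spam the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $id contains no double quote, since it is pasted into the XPath.
    $nodes = $xpath->query('//div[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$html = "<html><head><title>Example</title></head>
<body>
<div id=\"Alpha\"> Blah blah blah </div>
<div id=\"Beta\"> Blah Blah Blah </div>
</body>
</html>";

echo getDivTextById($html, 'Alpha'), "\n";
```

Unlike the regex, this does not care whether the div spans several lines or contains other divs. If the line number is what you are after, DOMNode::getLineNo() on the matched node may be worth a look, but check what it reports for your input before relying on it.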

EDIT: Having seen your HTML example, my answer would be "YES". RegEx is a very good tool for this.

I assume that you have the HTML as one continuous string, not as an array of lines (which would be slightly different). I also assume that you want the line number more than the line content.

Here is some rough PHP code to extract it (just to give some idea).

$HTML =
"<html><head><title>Example</title></head>
<body>
<div id=\"Alpha\"> Blah blah blah </div>
<div id=\"Beta\"> Blah Blah Blah </div>
</body>
</html>";

$ID = "Alpha";

function GetLineOfDIV($HTML, $ID) {
    // Quote the id so regex metacharacters in it are matched literally.
    // With /m, ^ and $ anchor at line boundaries, so the first and last lines work too.
    $RegEx_Alpha = '/^(<div id="'.preg_quote($ID, '/').'">.*?<\/div>)$/m';
    $Index       = preg_match($RegEx_Alpha, $HTML, $Match, PREG_OFFSET_CAPTURE);
    if ($Index !== 1) // 0 means no match, false means an error
        return -1;

    // With PREG_OFFSET_CAPTURE each entry is array(matched string, byte offset).
    //$MatchStr    = $Match[1][0]; Since you do not want it, so we comment it out.
    $MatchOffset = $Match[1][1]; // Only the one in '(...)'

    // The line number is one more than the number of newlines before the match.
    return substr_count(substr($HTML, 0, $MatchOffset), "\n") + 1;
}

echo GetLineOfDIV($HTML, $ID);

I hope this gives you some idea.


According to Jeff Atwood, you should never parse HTML using regex.


Since the line number is important to you here and not the actual contents of the div, I'd be inclined not to use regex at all. I'd probably explode() the string into an array and loop through that array looking for your marker. Like so:

<?php
$myContent = "[your string of html here]";
$myArray = explode("\n", $myContent);
$arraylen = count($myArray); // So you don't waste time counting the array at every loop
$lineNo = 0;
for($i = 0; $i < $arraylen; $i++)
{
     $pos = strpos($myArray[$i], 'id="Alpha"');
     if($pos !== false)
     {
          $lineNo = $i+1;
          break;
     }
}
?>

Disclaimer: I haven't got a php installation readily available to test this so some debugging may be required.

Hope this helps as I think it's probably just going to be a waste of time for you to implement a parsing engine just to do something so simple - especially if it's a one-off.


Edit: if the content is important to you at this stage too, then you can use this in combination with the other answers, which provide an adequate regex for the job.


Edit #2: Oh what the hey... here's my two cents:

"/<div[^>]*id=\"Alpha\"[^>]*>(?:(?!<\/?div).|<div[^>]*>(?:(?!<\/?div).)*<\/div>)*<\/div>/s"

The <div[^>]*>(?:(?!<\/?div).)*<\/div> alternative tells the regex engine that it may find a nested div tag and to just incorporate it into the match if it finds one rather than just stopping at the first </div>. However this only solves the problem if there is only one level of nesting. If there's more, then regex is not for you, sorry :(.

The /s makes . match line breaks as well, so you don't have to dirty up your expressions with [\S\s] everywhere. (/m would not do this; it only changes where ^ and $ match.)
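To make the one-level-nesting idea concrete, here is a small sketch. The function name matchDivOneLevel and the exact pattern are my own illustration, not a standard recipe, and it assumes the id attribute is written with double quotes.

```php
<?php
// Hypothetical helper: return the full markup of the div with the given id,
// allowing at most one level of nested <div> inside it, or null if nothing matches.
function matchDivOneLevel(string $html, string $id): ?string
{
    // Any run of characters that neither opens nor closes a div.
    $plain  = '(?:(?!<\/?div\b).)*';
    // One complete inner div that itself contains no further div.
    $nested = '<div\b[^>]*>' . $plain . '<\/div>';
    // /s lets . match newlines, so the div may span several lines.
    $regex  = '/<div\b[^>]*\sid="' . preg_quote($id, '/') . '"[^>]*>'
            . $plain . '(?:' . $nested . $plain . ')*<\/div>/s';

    return preg_match($regex, $html, $m) === 1 ? $m[0] : null;
}

$html = "<div id=\"Alpha\">a <div>b</div>\n c</div>\n<div id=\"Beta\">x</div>";
echo matchDivOneLevel($html, 'Alpha'), "\n";
```

With two or more levels of nesting inside the target div the pattern simply fails to match, which is where a real parser takes over.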

Again, sorry, I've no environment to test this in at the moment so you may need to debug.

Cheers Iain


The fact that a unique id is involved sounds promising, but since it will be a DIV, and not necessarily a single line of HTML, it will be difficult to construct a regular expression, and the usual objections to parsing HTML with regexes apply.

Not recommended.


Instead of RegEx, use a parser that is made especially to handle (messy) HTML. This will make your application less brittle in case the HTML changes slightly, and you don't have to hand-craft custom RegEx each time you want to pull out a new piece of data.

See this Stack Overflow page: Mature HTML Parsers for PHP


@OP since your requirement is that easy, you can just use string methods

$f = fopen("file","r");
if($f){
    $i = 0; // current line number
    // fgets() returns false at end of file, which ends the loop
    while( ($line = fgets($f)) !== false ){
        $i+=1;
        if (stripos($line,'<div id="Alpha">')!==FALSE){
            print "line number: $i\n";
        }
    }
    fclose($f);
}