Perl REGEX Question

Developer https://www.devze.com 2023-03-06 22:13 Source: Internet
As a PHP programmer new to Perl working through 'Programming Perl', I have come across the following regex:

/^(.*?): (.*)$/;

This regex is intended to parse an email header and insert it into a hash. The email header is contained in a separate .txt file and is in the following format:

From: person@site.com
To: email@site.com
Date: Mon, 1st Jan 2000 09:00:00 -1000
Subject: Subject here

The entire code I am using to work with this example regex is as follows:

use warnings;
use strict;

my %fields = ();

open(FILE, 'header.txt') or die('Could not open.');

while(<FILE>)
{
    /^(.*?): (.*)$/;
    $fields{$1} = $2;
}

foreach(%fields)
{
    print;
    print "\n";
}

Now, on to my question. I am unsure why the first subpattern has been modified to use a minimal quantifier. It is perhaps a small point to get hung up on, but I cannot see why it has been done.

Thanks for any replies.


If it hadn't, there is a risk that it wouldn't match correctly if the value contains :<space>.

Imagine:

Subject: Urgent: Need a regex

Without the minimal match, $1 would get "Subject: Urgent" and $2 would be "Need a regex".
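
A minimal sketch of the difference on that line, with the question's pattern next to a greedy variant of it (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $line = "Subject: Urgent: Need a regex";

# Minimal first group: stops at the first ": " in the line.
my ($lazy_key, $lazy_val) = $line =~ /^(.*?): (.*)$/;

# Greedy first group: runs to the last ": " in the line.
my ($greedy_key, $greedy_val) = $line =~ /^(.*): (.*)$/;

print "minimal: [$lazy_key] => [$lazy_val]\n";
print "greedy:  [$greedy_key] => [$greedy_val]\n";
```

The minimal version gives Subject as the key and keeps the rest of the line as the value; the greedy one moves the split to the second colon.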


Consider what happens if the subject is Subject: RE: reply to something.

A minimal quantifier will stop after Subject, but the greedy quantifier will match up to RE.


Because otherwise it will match all characters up to the last ": " (colon followed by a space). For example, without the minimal quantifier, this string:

Test: My: Weird: String

will match "Test: My: Weird" as the first group. But with minimal quantifier it will match only "Test".


The reason it uses a minimal quantifier is that it does not need to read any further than the colon. And in fact, it should not. I'm not sure what characters can exist in these keywords, but I am pretty sure . is a bit too wide, and that is the problem. If your fields contain any colons, a non-minimal regex would gobble it all up, for example:

Subject: Counter Strike: Source

If the first subpattern was greedy, it would grab Subject: Counter Strike, and not just Subject.
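
If . really is too wide for the field name, one possible alternative (a sketch only, not what the book uses) is a negated character class that cannot cross a colon at all, so greediness stops mattering. The parse_header name is made up for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: split one header line into (name, value).
# [^:]+ cannot match a colon, so the name always ends at the first one.
# In list context a failed match returns an empty list.
sub parse_header {
    my ($line) = @_;
    return $line =~ /^([^:]+): (.*)$/;
}

my ($name, $value) = parse_header("Subject: Counter Strike: Source");
print "$name => $value\n";
```

This assumes field names never contain a colon, which holds for the sample header in the question.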


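Without a minimal quantifier, the first capture extends to the last ": " (colon followed by a space) in the line. For the Date line in the question that is still just "Date", because the colons in 09:00:00 are not followed by a space; the two versions only differ when the value itself contains ": ".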


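Perl quantifiers are greedy by default, so without that minimal quantifier $1 would run up to the last ": " in the line. On the "Date:" line the colons in the time are followed by digits rather than a space, so $1 would still be "Date" there, but any value containing ": " would be split in the wrong place.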
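
The Date line is worth checking directly. A small sketch running both variants on it; note that the pattern needs a colon followed by a space, so a colon inside 09:00:00 cannot end the first group:

```perl
use strict;
use warnings;

my $date = "Date: Mon, 1st Jan 2000 09:00:00 -1000";

my ($lazy)   = $date =~ /^(.*?): /;   # minimal first group
my ($greedy) = $date =~ /^(.*): /;    # greedy first group

print "minimal: [$lazy]\n";
print "greedy:  [$greedy]\n";
```

Both variants capture Date on this line; they diverge only when the value itself contains a colon followed by a space.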
