I want to match dates that have the following format:
2010-08-27 02:11:36
i.e. yyyy-mm-dd hh:mm:ss
.
Right now I am not very particular about the date being actually feasible, but just that it is in the correct format.
Possible formats that should match are (for this example)
2010
2010-08
2010-08-27
2010-08-27 02
2010-08-27 02:11
2010-08-27 02:11:36
In Perl, what can be a concise regex for this?
I have this so far (which works, btw)
/\d{4}(-\d{2}(-\d{2}( \d{2}(:开发者_JAVA技巧\d{2}(:\d{2})?)?)?)?)?/
Can this be improved performance-wise?
Based on the lack of a capturing group around the year, I assume you care only whether a date matches.
I tried a few different patterns related to the one from your question, and the one that gave a ten- to fifteen-percent improvement was disabling capturing, i.e.,
/\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?/
The perlre documentation covers (?:...)
:
(?:pattern)
(?imsx-imsx:pattern)
This is for clustering, not capturing; it groups subexpressions like
()
, but doesn't make backreferences as()
does. So@fields = split(/\b(?:a|b|c)\b/)
is like
@fields = split(/\b(a|b|c)\b/)
but doesn't spit out extra fields. It's also cheaper not to capture characters if you don't need to.
Any letters between
?
and:
act as flags modifiers as with(?imsx-imsx)
. For example,/(?s-i:more.*than).*million/i
is equivalent to the more verbose
/(?:(?s-i)more.*than).*million/i
Benchmark output:
Rate U U/NC CH/NC/A CH/NC/A/U CH CH/NC null U 31811/s -- -32% -58% -59% -61% -66% -93% U/NC 46849/s 47% -- -38% -39% -42% -50% -90% CH/NC/A 76119/s 139% 62% -- -1% -6% -18% -84% CH/NC/A/U 76663/s 141% 64% 1% -- -6% -17% -84% CH 81147/s 155% 73% 7% 6% -- -13% -83% CH/NC 92789/s 192% 98% 22% 21% 14% -- -81% null 481882/s 1415% 929% 533% 529% 494% 419% --
Code:
#! /usr/bin/perl
use warnings;
use strict;
use Benchmark qw/ :all /;
sub option_chain {
local($_) = @_;
/\d{4}(-\d{2}(-\d{2}( \d{2}(:\d{2}(:\d{2})?)?)?)?)?/
}
sub option_chain_nocap {
local($_) = @_;
/\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?/
}
sub option_chain_nocap_anchored {
local($_) = @_;
/\A\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?\z/
}
sub option_chain_anchored_unrolled {
local($_) = @_;
/\A\d\d\d\d(-\d\d(-\d\d( \d\d(:\d\d(:\d\d)?)?)?)?)?\z/
}
sub simple_split {
local($_) = @_;
split /[ :-]/;
}
sub unrolled {
local($_) = @_;
grep defined($_), /\A (\d\d\d\d)-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d) \z
|\A (\d\d\d\d)-(\d\d)-(\d\d) (\d\d):(\d\d) \z
|\A (\d\d\d\d)-(\d\d)-(\d\d) (\d\d) \z
|\A (\d\d\d\d)-(\d\d)-(\d\d) \z
|\A (\d\d\d\d)-(\d\d) \z
|\A (\d\d\d\d) \z
/x;
}
sub unrolled_nocap {
local($_) = @_;
grep defined($_), /\A \d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d \z
|\A \d\d\d\d-\d\d-\d\d \d\d:\d\d \z
|\A \d\d\d\d-\d\d-\d\d \d\d \z
|\A \d\d\d\d-\d\d-\d\d \z
|\A \d\d\d\d-\d\d \z
|\A \d\d\d\d \z
/x;
}
sub id { $_[0] }
my @examples = (
"xyz",
"2010",
"2010-08",
"2010-08-27",
"2010-08-27 02",
"2010-08-27 02:11",
"2010-08-27 02:11:36",
);
cmpthese -1 => {
"CH" => sub { option_chain $_ for @examples },
"CH/NC" => sub { option_chain_nocap $_ for @examples },
"CH/NC/A" => sub { option_chain_nocap_anchored $_ for @examples },
"CH/NC/A/U" => sub { option_chain_anchored_unrolled $_ for @examples },
"U" => sub { unrolled $_ for @examples },
"U/NC" => sub { unrolled_nocap $_ for @examples },
"null" => sub { id $_ for @examples },
};
How about something from Regexp::Common::time?
Your regex is just fine except for missing anchors (unless you want to match 2008 in "abc200890"?). Assuming you want to match the whole string:
/^\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?\z/
(?:...)
should be used if you don't actually want the captured substrings, which I'd guess to be the case.
I would use the split function :
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @dates = (
'2010',
'2010-08',
'2010-08-27',
'2010-08-27 02',
'2010-08-27 02:11',
'2010-08-27 02:11:36',
);
for (@dates) {
my @list = split /[ :-]/;
print Dumper(\@list);
}
output :
$VAR1 = [
'2010'
];
$VAR1 = [
'2010',
'08'
];
$VAR1 = [
'2010',
'08',
'27'
];
$VAR1 = [
'2010',
'08',
'27',
'02'
];
$VAR1 = [
'2010',
'08',
'27',
'02',
'11'
];
$VAR1 = [
'2010',
'08',
'27',
'02',
'11',
'36'
];
This matches all the above (but also other stuff - see the comment!) and may be slightly easier to read:
/(\d{4})(-\d{2})?(\w{1}\d{2})?(:\d{2})?/
If you want faster, then look away from regex, and look at XS modules: Date::Calc is a good one.
精彩评论