I am trying to figure out how to capture one statement if the other one doesn't exist using preg_match.
Sample Text:
<!-- InstanceBeginEditable name="doctitle" -->
<title>BU Libraries | Research Guides | Citing Your Sources</title>
<!-- InstanceEndEditable -->
<div id="standardpgt"><h1><!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable --></h1></div>
Because pagetitle exists, I want to pull it instead of the doctitle tag. Of course, there are tons of other characters in between them, but I wanted to show you a small sample.
If pagetitle didn't exist I would want to grab the contents of doctitle.
The twist is that I'm not using the PHP code directly: I pass a regex through a config file, and a script then takes it and pulls out the first group from the match.
This is what I came up with:
((?!.*?<!--\s*?InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->.*?<!--\s*?InstanceEndEditable\s*?-->)<!--\s*?InstanceBeginEditable\s*?name=\x22doctitle\x22\s*?-->\s*?<title>(.*?)<\/title>\s*?<!--\s*?InstanceEndEditable\s*?-->|<!-- InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->(.*?)<!--\s*?InstanceEndEditable\s*?-->)
The issue is that, for some reason, PHP always reports the first (empty) group as group 1 when that alternative did not match.
For example in the sample text above it would return
0 -> <!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable -->
1 ->
2 -> <strong>Citing Your Sources</strong>
I can't for the life of me figure out how to make this work. I also wrote this regex:
(?(?=.*?<!--\s*?InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->.*?<!--\s*?InstanceEndEditable\s*?-->).*?<!-- InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->(.*?)<!--\s*?InstanceEndEditable\s*?-->|.*?<!--\s*?InstanceBeginEditable\s*?name=\x22doctitle\x22\s*?-->\s*?<title>(.*?)<\/title>\s*?<!--\s*?InstanceEndEditable\s*?-->)
But that didn't work either. Thank you very much for the help.
Chris
Just use the branch reset pattern: (?|...) around your whole expression, as in:
((?|(?!.*?<!--\s*?InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->.*?<!--\s*?InstanceEndEditable\s*?-->)<!--\s*?InstanceBeginEditable\s*?name=\x22doctitle\x22\s*?-->\s*?<title>(.*?)<\/title>\s*?<!--\s*?InstanceEndEditable\s*?-->|<!-- InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->(.*?)<!--\s*?InstanceEndEditable\s*?-->))s
From "man perlre":
"(?|pattern)" This is the "branch reset" pattern, which has the special property that the capture buffers are numbered from the same starting point in each alternation branch. It is available starting from perl 5.10.0.
Capture buffers are numbered from left to right, but inside this construct the numbering is restarted for each branch.
The numbering within each branch will be as normal, and any buffers following this construct will be numbered as though the construct contained only one branch, that being the one with the most capture buffers in it.
This construct will be useful when you want to capture one of a number of alternative matches.
Consider the following pattern. The numbers underneath show in which buffer the captured content will be stored.
# before  ---------------branch-reset----------- after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4
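To see the renumbering in PHP terms, here is a minimal sketch. The toy patterns and the subject string 'b42' are made up for illustration and are not from the question; the point is only to compare plain alternation with the same alternation wrapped in (?|...).

```php
<?php
// Plain alternation: each branch has its own group number,
// so a match on the second branch leaves group 1 empty.
preg_match('/a(\d+)|b(\d+)/', 'b42', $plain);

// Branch reset: numbering restarts in each branch,
// so whichever branch matches fills group 1.
preg_match('/(?|a(\d+)|b(\d+))/', 'b42', $reset);

print_r($plain);  // groups 0, 1 (empty) and 2
print_r($reset);  // groups 0 and 1 only
```

With the plain pattern the digits land in group 2 and group 1 is an empty string, which is the behaviour described in the question. With the branch reset they land in group 1, which is what a script that only reads group 1 needs.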
user178551 is absolutely correct in recommending the use of a branch reset construct. There is fundamentally nothing wrong with your original regex (other than the fact that it is more than 300 characters long and is ALL ON ONE LINE! - and that it is unable to put one of two alternatives in a single capture group). A non-trivial (to put it mildly) regex like this needs to be written in free-spacing mode with indentation so you can actually read it. Here is your original regex with some reasonable whitespace added:
$re_OP1 = '%
    (                               # $1: entire match (outer group)
      (?!
        .*?<!--\s*?InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->
        .*?<!--\s*?InstanceEndEditable\s*?-->
      )
      <!--\s*?InstanceBeginEditable\s*?name=\x22doctitle\x22\s*?-->\s*?
      <title>(.*?)<\/title>\s*?     # $2: doctitle contents
      <!--\s*?InstanceEndEditable\s*?-->
    | <!-- InstanceBeginEditable\s*?name=\x22pagetitle\x22\s*?-->
      (.*?)                         # $3: pagetitle contents
      <!--\s*?InstanceEndEditable\s*?-->
    )
    %six';
Looking at this regex now, you can see where you have hard-coded one space on the line with the OR operator (i.e. |<!-- InstanceBegin...). This will cause the regex to fail to match when the 'x' modifier is applied. So, replacing this space with a \s* and running it on your test data, here are the results I get (php-5.2.14):
Array
(
    [0] => <!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable -->
    [1] => <!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable -->
    [2] =>
    [3] => <strong>Citing Your Sources</strong>
)
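As a side check on the hard-coded space mentioned above, the following sketch isolates the effect of the 'x' modifier. The shortened patterns and subject are illustrative only, not the full regex.

```php
<?php
$subject = '<!-- InstanceBeginEditable';

// Under /x the literal space in the pattern is ignored,
// so the pattern effectively reads "<!--Instance".
$literal = preg_match('/<!-- Instance/x', $subject);

// \s* still matches the space in the subject under /x.
$flexible = preg_match('/<!--\s*Instance/x', $subject);

// A space inside a character class stays literal under /x.
$bracket = preg_match('/<!--[ ]Instance/x', $subject);

var_dump($literal, $flexible, $bracket);
```

The first call finds no match, while the other two do, which is why the space has to be written as \s* (or [ ]) once the regex is switched to free-spacing mode.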
The array above is similar to the results you posted (although, for some reason, yours show only 2 capture groups). All we need to do now is to apply user178551's branch reset suggestion, and the regex solution becomes:
$re_jmr = '%
    (?|   # Branch reset construct. (restart counting for each alternative)
      (?!
        .*?<!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
        .*?<!--\s*InstanceEndEditable\s*-->
      )
      <!--\s*InstanceBeginEditable\s*name="doctitle"\s*-->\s*
      <title>(.*?)<\/title>\s*      # $1: Group 1A
      <!--\s*InstanceEndEditable\s*-->
    | <!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
      (.*?)                         # $1: Group 1B
      <!--\s*InstanceEndEditable\s*-->
    )
    %six';
I've gone ahead and changed all the lazy \s*? to greedy (because greedy is what you want here). I also changed all the \x22 to just " (shorter and more readable IMHO). And here are the results from running with this new, branch reset regex:
Array
(
    [0] => <!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable -->
    [1] => <strong>Citing Your Sources</strong>
)
Which is (if I'm not mistaken) exactly what you are looking for. (You did not provide a test case for the other alternative, so that has not yet been tested.) Other than that, your original regex was pretty close.
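The doctitle-only alternative can be exercised with a sketch like the one below. Two things here are assumptions, not part of the original posts: the helper name extractTitle is hypothetical, and the second input is simply the question's sample with the pagetitle region removed by hand.

```php
<?php
// Branch-reset regex from the answer above, unchanged.
$re_jmr = '%
    (?|   # Branch reset construct. (restart counting for each alternative)
      (?!
        .*?<!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
        .*?<!--\s*InstanceEndEditable\s*-->
      )
      <!--\s*InstanceBeginEditable\s*name="doctitle"\s*-->\s*
      <title>(.*?)<\/title>\s*      # $1: Group 1A
      <!--\s*InstanceEndEditable\s*-->
    | <!--\s*InstanceBeginEditable\s*name="pagetitle"\s*-->
      (.*?)                         # $1: Group 1B
      <!--\s*InstanceEndEditable\s*-->
    )
    %six';

// Hypothetical helper: mimics a script that only reads group 1.
function extractTitle($re, $html)
{
    return preg_match($re, $html, $m) ? $m[1] : null;
}

// Sample from the question: doctitle and pagetitle both present.
$both = <<<'HTML'
<!-- InstanceBeginEditable name="doctitle" -->
<title>BU Libraries | Research Guides | Citing Your Sources</title>
<!-- InstanceEndEditable -->
<div id="standardpgt"><h1><!-- InstanceBeginEditable name="pagetitle" --><strong>Citing Your Sources</strong><!-- InstanceEndEditable --></h1></div>
HTML;

// Assumed test case: the same page without a pagetitle region.
$doctitleOnly = <<<'HTML'
<!-- InstanceBeginEditable name="doctitle" -->
<title>BU Libraries | Research Guides | Citing Your Sources</title>
<!-- InstanceEndEditable -->
<div id="standardpgt"><h1>Citing Your Sources</h1></div>
HTML;

var_dump(extractTitle($re_jmr, $both));
var_dump(extractTitle($re_jmr, $doctitleOnly));
```

On the first input, group 1 should hold the pagetitle markup (<strong>Citing Your Sources</strong>); on the second, the negative lookahead passes and group 1 should hold the text inside <title>. Real pages with other editable regions may need further testing.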