Python regex negative lookbehind

Posted on https://www.devze.com, 2023-04-13 05:37. Source: the web

We parse logs created by automated scripts. A typical thing that we'd care about is the string: '1.10.07-SNAPSHOT (1.10.07-20110303.024749-7)' from the following line:

15:28:02.115 - INFO   - TestLib: Successfully retrieved build version: '1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)'

The trouble is, some logs are manually created, with users entering this information themselves. To remind themselves of the format they have added a dialog with the template:

02:24:50.655 - INFO   - gui: Step Dialog: For test results management purposes, specify the build in which the test is executed in the following format, build version: 'specify version here'
02:25:04.905 - INFO   - gui:     Response: OK
02:25:04.905 - INFO   - gui:     Comments: 'build version: '1.11.11''

My regex for this currently is `.*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`. The `(?!.*<)` was my first attempt to avoid this problem, because some users would write ''. That doesn't catch the above case, though. I think the correct approach is a negative lookbehind that refuses to match if 'Step Dialog' is present on the line, but my attempts to write one are failing, according to regexr (for some reason it's not letting me share the link to my saved form). I thought a negative lookbehind would look like this: `(?<!Step Dialog)`, resulting in this:

`(?<!Step Dialog).*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`

but that's matching both the first and third line of the above for some reason.

Edit:

The `[Bb]` and `:*\s*` parts are for users who didn't enter the information in precisely the right format: they capitalized 'Build' or used multiple colons and spaces. Suggestions for cleaning this up in general are appreciated; I'm relatively new to regexes.

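You are close, but it is still matching because it can find a string that satisfies `.*` without being preceded by `Step Dialog`. A lookaround assertion is checked only at the position where it appears in the pattern. Here that is the very start of the match, where nothing precedes it, so the lookbehind always passes and `.*` is then free to run over `Step Dialog`. Thus, you have to force the check at every character position that you don't want to start the string `Step Dialog`.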
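To see this concretely, here is a minimal sketch that runs the question's pattern against the dialog template line with Python's `re` module. The variable names (`NAIVE`, `dialog_line`) are just illustrative.

```python
import re

# Pattern from the question: the lookbehind is tested only at the position
# where the match starts, not at every position that .* later walks over.
NAIVE = re.compile(r"(?<!Step Dialog).*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'")

# The dialog template line from the logs in the question.
dialog_line = (
    "02:24:50.655 - INFO   - gui: Step Dialog: For test results management "
    "purposes, specify the build in which the test is executed in the "
    "following format, build version: 'specify version here'"
)

match = NAIVE.search(dialog_line)

# The match begins at offset 0, where nothing precedes the cursor, so the
# lookbehind passes and .* then consumes "Step Dialog" along with the rest.
print(match.start(), repr(match.group(1)))  # 0 'specify version here'
```

So the unwanted template text is captured even though `Step Dialog` is on the line.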

Try this:

`^(?:(?!Step Dialog).)*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`

Now, it ensures that no position between `^` (the beginning of the line) and `[Bb]uild [Vv]ersion` is the start of the string `Step Dialog`.

You'll notice I also changed the lookbehind to a negative lookahead, because it's easier to understand what's going on.
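As a quick check, this sketch applies the `^(?:(?!Step Dialog).)*` version to the four sample log lines from the question. The helper name `extract_version` is made up for illustration and is not part of any existing code.

```python
import re

# Each character consumed before "build version" is first checked to make
# sure it does not begin the string "Step Dialog".
TEMPERED = re.compile(
    r"^(?:(?!Step Dialog).)*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'"
)


def extract_version(line):
    """Return the captured version string, or None if the line is rejected."""
    m = TEMPERED.search(line)
    return m.group(1) if m else None


lines = [
    "15:28:02.115 - INFO   - TestLib: Successfully retrieved build version: "
    "'1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)'",
    "02:24:50.655 - INFO   - gui: Step Dialog: For test results management "
    "purposes, specify the build in which the test is executed in the "
    "following format, build version: 'specify version here'",
    "02:25:04.905 - INFO   - gui:     Response: OK",
    "02:25:04.905 - INFO   - gui:     Comments: 'build version: '1.11.11''",
]

for line in lines:
    print(repr(extract_version(line)))
# '1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)'
# None
# None
# '1.11.11'
```

The automated line and the user's comment line are captured, while the dialog template and the unrelated response line are skipped.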

There are a couple of ways you can do this, but you're pretty close.

`.*(?<!Step Dialog.*)[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`
`^(?!.*Step Dialog).*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`

Chriszuma's pattern should work, too. Use whichever you like best. If performance is a consideration, you could benchmark the three patterns and see which is faster. My feeling is that it'll be the one starting with `.*(?<!`, but I can't say for sure.

Edit: As ekhumoro points out, the Python regex engine requires fixed-length lookbehinds, so the first one won't work in Python. The second one should be fine, though.
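A small sketch of that difference, using only the standard `re` module. The names `TAIL`, `compiles`, `variable_lookbehind` and `line_lookahead` are illustrative, and the two sample lines are shortened versions of the ones in the question.

```python
import re

# Shared tail of both patterns, split out only to keep the lines short.
TAIL = r"[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'"

variable_lookbehind = r".*(?<!Step Dialog.*)" + TAIL  # first pattern
line_lookahead = r"^(?!.*Step Dialog).*" + TAIL       # second pattern


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # re rejects lookbehinds whose length is not fixed, such as one
        # containing .*
        return False


print(compiles(variable_lookbehind))  # False
print(compiles(line_lookahead))       # True

dialog = "gui: Step Dialog: use the following format, build version: 'specify version here'"
comment = "gui:     Comments: 'build version: '1.11.11''"

rx = re.compile(line_lookahead)
print(rx.search(dialog))            # None
print(rx.search(comment).group(1))  # 1.11.11
```

One design difference worth knowing: the second pattern rejects a line that contains `Step Dialog` anywhere, while the `^(?:(?!Step Dialog).)*` form only inspects the text before `build version`. For the log lines shown in the question, the two behave the same.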