How can I match "anything up until this sequence of characters" in a regular expression?

Developer https://www.devze.com 2023-03-29 14:21 Source: Internet
Take this regular expression: /^[^abc]/. This will match any single character at the beginning of a string, except a, b, or c.

If you add a * after it – /^[^abc]*/ – the regular expression will continue to add each subsequent character to the result, until it meets either an a, or b, or c.

For example, with the source string "qwerty qwerty whatever abc hello", the expression will match up to "qwerty qwerty wh".
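
The behavior described above can be reproduced with a short sketch. Python's re module is an assumption here, since the question does not name a regex flavor.

```python
import re

source = "qwerty qwerty whatever abc hello"

# [^abc]* stops at the first single character that is a, b, or c,
# which is the "a" inside "whatever".
class_match = re.match(r"^[^abc]*", source).group()
print(repr(class_match))  # 'qwerty qwerty wh'
```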

But what if I wanted the matching string to be "qwerty qwerty whatever "?

In other words, how can I match everything up to (but not including) the exact sequence "abc"?


You didn't specify which flavor of regex you're using, but this will work in any of the most popular ones that can be considered "complete".

/.+?(?=abc)/

How it works

The .+? part is the un-greedy version of .+ (one or more of anything). When we use .+, the engine first matches as much as it can, in effect everything. Then, if the regex has a following part, it backs off one character at a time until that part matches. This is the greedy behavior: match as much as possible while still satisfying the whole pattern.

With .+?, instead of matching everything at once and backing off, the engine adds one character at a time, stopping as soon as the following part of the regex (if any) matches. This is the un-greedy behavior: match as few characters as possible while still satisfying the whole pattern.

/.+X/  ~ "abcXabcXabcX"        /.+/  ~ "abcXabcXabcX"
          ^^^^^^^^^^^^                  ^^^^^^^^^^^^

/.+?X/ ~ "abcXabcXabcX"        /.+?/ ~ "abcXabcXabcX"
          ^^^^                          ^

Following that we have (?={contents}), a zero-width assertion, also called a lookaround. This grouped construct tests whether its contents match at the current position, but the characters it examines are not added to the match (zero width). It only reports whether there is a match or not (assertion).

Thus, in other terms the regex /.+?(?=abc)/ means:

Match as few characters as possible until an "abc" is found, without counting the "abc".
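
As a rough check of the greedy and un-greedy diagrams and of the lookahead, here is a minimal sketch using Python's re module (the choice of Python is an assumption; the pattern itself is the one from this answer).

```python
import re

sample = "abcXabcXabcX"

greedy_x = re.match(r".+X", sample).group()   # backs off only to the last X
lazy_x = re.match(r".+?X", sample).group()    # stops at the first X
greedy = re.match(r".+", sample).group()      # takes everything
lazy = re.match(r".+?", sample).group()       # takes a single character

source = "qwerty qwerty whatever abc hello"
# The lookahead (?=abc) must succeed, but "abc" is not part of the match.
before_abc = re.search(r".+?(?=abc)", source).group()

print(greedy_x, lazy_x, greedy, lazy, repr(before_abc))
```

Note that the "a" in "whatever" does not end the match, because the lookahead requires the full sequence "abc".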


If you're looking to capture everything up to "abc":

/^(.*?)abc/

Explanation:

( ) capture the expression inside the parentheses for access using $1, $2, etc.

^ match start of line

.* match anything, ? non-greedily (match the minimum number of characters required) - [1]

[1] The reason why this is needed is that otherwise, in the following string:

whatever whatever something abc something abc

by default, regexes are greedy, meaning they match as much as possible. Therefore /^(.*)abc/ would capture "whatever whatever something abc something " in group 1. Adding the non-greedy quantifier ? makes the group capture only "whatever whatever something ".
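
The difference between the greedy and non-greedy capture can be sketched as follows, assuming Python's re module, where group(1) plays the role of $1.

```python
import re

text = "whatever whatever something abc something abc"

# Non-greedy: group 1 ends before the first "abc".
lazy_group = re.match(r"^(.*?)abc", text).group(1)

# Greedy: group 1 runs up to the last "abc".
greedy_group = re.match(r"^(.*)abc", text).group(1)

print(repr(lazy_group))
print(repr(greedy_group))
```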


As Jared Ng and @Issun pointed out, the key to solving this kind of problem, such as "matching everything up to a certain word or substring" or "matching everything after a certain word or substring", is "lookaround" zero-length assertions. Read more about them here.

In your particular case, it can be solved by a positive look ahead: .+?(?=abc)

A picture is worth a thousand words. See the detailed explanation in the screenshot.

[Screenshot: annotated explanation of the regex; image not available]


Solution

/[\s\S]*?(?=abc)/

This will match

everything up to (but not including) the exact sequence "abc"

as the OP asked, even if the source string contains newlines and even if the source string begins with abc. No extra flag is needed for newlines: [\s\S] already matches them, and in most flavors the multiline flag m only changes where ^ and $ match, neither of which appears in this pattern.

How it works

\s means any whitespace character (e.g. space, tab, newline)

\S means any non-whitespace character; i.e. opposite to \s

Together [\s\S] means any character. This is almost the same as . except that . doesn't match a newline by default.

* means 0+ occurrences of the preceding token. I've used this instead of + in case the source string starts with abc.

(?= is known as positive lookahead. It requires a match to the string in the parentheses, but stops just before it, so (?=abc) means "up to but not including abc, but abc must be present in the source string".

? between [\s\S]* and (?=abc) means lazy (aka non greedy). i.e. stop at the first abc. Without this it would capture every character up to the final occurrence of abc if abc occurred more than once.
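
A minimal sketch of these edge cases, assuming Python's re module (the sample strings are made up for illustration):

```python
import re

pattern = re.compile(r"[\s\S]*?(?=abc)")

# Newlines are matched by [\s\S] without any flag.
multiline = pattern.search("first line\nsecond line abc tail").group()

# With *, a source string that starts with "abc" yields an empty match.
starts_with = pattern.search("abc tail").group()

# The lookahead makes "abc" mandatory: no "abc", no match.
absent = pattern.search("no target here")

# For comparison, the + variant finds nothing when "abc" comes first.
plus_variant = re.search(r".+?(?=abc)", "abc tail")

print(repr(multiline), repr(starts_with), absent, plus_variant)
```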


You need a lookaround assertion, like .+?(?=abc).

See: Lookahead and Lookbehind Zero-Length Assertions

Be aware that [abc] isn't the same as abc. Inside brackets it's not a string - each character is just one of the possibilities. Outside the brackets it becomes the string.
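
The distinction between the bracketed and unbracketed forms can be seen in a short sketch, assuming Python's re module:

```python
import re

# Inside brackets: each of a, b, c is matched on its own, in any order.
as_class = re.findall(r"[abc]", "cab")

# Outside brackets: only the exact sequence "abc" matches.
as_sequence = re.findall(r"abc", "cab")

print(as_class, as_sequence)
```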


For regex in Java, and I believe also in most regex engines, if you want to include the last part this will work:

.+?(abc)

For example, in this line:

I have this very nice senabctence

Select all characters until "abc" and also include abc.

Using our regex, the result will be: I have this very nice senabc

Test this out: https://regex101.com/r/mX51ru/1
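
The same pattern can be tried in Python as well. This is a sketch that assumes Python's re module behaves like Java here, which it does for this simple pattern.

```python
import re

line = "I have this very nice senabctence"

m = re.search(r".+?(abc)", line)
whole_match = m.group()     # everything up to and including "abc"
captured = m.group(1)       # just the "abc" inside the parentheses

print(repr(whole_match), repr(captured))
```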


In Python:

.+?(?=abc) works for the single line case.

[^]+?(?=abc) does not work, since Python doesn't recognize [^] as a valid regex. To make multiline matching work, you'll need to use the re.DOTALL option, for example:

re.findall(r'.+?(?=abc)', data, re.DOTALL)
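
To illustrate the effect of re.DOTALL, here is a small sketch; the contents of data are a made-up example.

```python
import re

data = "first line\nsecond abc"  # hypothetical two-line input

# Without DOTALL, "." cannot cross the newline, so the match
# can only start on the line that contains "abc".
single_line = re.findall(r".+?(?=abc)", data)

# With DOTALL, "." also matches "\n" and the match spans both lines.
dot_all = re.findall(r".+?(?=abc)", data, re.DOTALL)

print(single_line)
print(dot_all)
```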


So I had to improvise... after some time I managed to reach the regex I needed:

[Screenshot of the regex and its match; image not available]

As you can see, I needed to match up to one folder past the "grp-bps" folder, without including the trailing slash. And it was required to have at least one folder after the "grp-bps" folder.

The text version for copy-paste (change 'grp-bps' for your text):

.*\/grp-bps\/[^\/]+
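
A quick sketch of this pattern in Python follows. The paths are hypothetical, and the slashes need no escaping in a Python pattern.

```python
import re

pattern = re.compile(r".*/grp-bps/[^/]+")

# Hypothetical path with several folders after grp-bps.
deep = pattern.match("/srv/data/grp-bps/reports/2023/summary.txt")
one_folder_ahead = deep.group()

# No folder after grp-bps: [^/]+ needs at least one character, so no match.
nothing_after = pattern.match("/srv/data/grp-bps/")

print(one_folder_ahead, nothing_after)
```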

I ended in this Stack Overflow question after looking for help to solve my problem, but I didn't find any solution to it :(


Match From Start Till "Before ABC" or "Line End" if no ABC

(1) Matches whole string if string does not contain ABC anywhere

(2) Does not match empty string

(Not checked for strings with line breaks)

^.+?(?=ABC|$)
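
The two properties listed above can be checked with a minimal sketch, assuming Python's re module. The helper name head is hypothetical.

```python
import re

pattern = re.compile(r"^.+?(?=ABC|$)")

def head(text):
    """Return the part before ABC, or the whole string if ABC is absent."""
    m = pattern.search(text)
    return m.group() if m else None

print(repr(head("hello ABC world")))  # stops before ABC
print(repr(head("hello world")))      # no ABC: runs to the end
print(repr(head("")))                 # .+? needs at least one character
```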


Here is a related example of lazy matching.

The quoted text can be extracted with the following regex:

/("(.*?)")/g

With the global flag g, this finds every piece of text enclosed in double quotes.

For example, if our search text is

This is the example for "double quoted" words

then we will get "double quoted" from that sentence.
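
In Python, a rough equivalent would use re.findall, which already returns all matches, so no g flag is involved. This is a sketch under that assumption, not the original poster's code.

```python
import re

text = 'This is the example for "double quoted" words'

# Two groups: the outer one keeps the quotes, the inner one drops them.
pairs = re.findall(r'("(.*?)")', text)

print(pairs)
```

Each result is a tuple of (outer group, inner group), so the quoted and unquoted forms are both available.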


I would like to extend the answer from sidyll for the case insensitive version of the regex.

If you want to match abc/Abc/ABC... case insensitively, which I needed to do, use the following regex.

.+?(?=(?i)abc)

Explanation:

(?i) - This will make the following abc match case insensitively.

The other explanation of the regex remains same as sidyll pointed out.
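
Whether (?i) is accepted in the middle of a pattern depends on the flavor. As an assumption-laden sketch for Python: recent versions of the re module reject a global (?i) that is not at the start of the pattern, so the scoped form (?i:abc) is used here instead.

```python
import re

# (?i:abc) applies case-insensitivity to "abc" only (Python 3.6+).
pattern = re.compile(r".+?(?=(?i:abc))")

samples = ("left abc right", "left Abc right", "left ABC right")
results = [pattern.search(s).group() for s in samples]

print(results)
```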


Your question doesn't specify whether the succeeding character sequence is optional or not, but all other answers assume that the sequence is always given. So here is one, if the sequence is optional.

For instance, if matching code up to a line comment like foo # ... or foo // ..., the line comment itself may be optional, but one may still want to match the preceding code.

In this case, I would use ^(?:(?!abc).)* (or for line comments: ^(?:(?!#).)* or ^(?:(?!\/\/).)*).

Explanation:
^ marks the beginning of the line. (?:) is a non-capturing group, because a regular group would additionally capture the last matching letter in a group, which we don't need.
Inside the group, we use negative lookahead (?!) and a ., so everything is matched, except for a specific sequence. This is repeated 0 to unlimited times with *. Use + instead, if you only want to match non-empty strings.
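
The optional-sequence pattern can be sketched in Python as follows. The helper code_before and the sample lines are hypothetical.

```python
import re

def code_before(line, marker):
    """Return everything before marker, or the whole line if marker is absent."""
    # Builds ^(?:(?!marker).)* with the marker escaped as a literal.
    pattern = r"^(?:(?!" + re.escape(marker) + r").)*"
    return re.match(pattern, line).group()

print(repr(code_before("x = 1 // note", "//")))
print(repr(code_before("x = 1", "//")))
print(repr(code_before("foo # bar", "#")))
print(repr(code_before("# only a comment", "#")))
```

Because of the *, the pattern always matches, possibly with an empty result when the line starts with the marker.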


I believe you need subexpressions. You can use the normal () brackets for subexpressions.

This part is from the grep manual:

Back References and Subexpressions

The back-reference \n, where n is a single digit, matches the substring previously matched by the nth parenthesized subexpression of the regular expression.

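Doing something like ^[^(abc)] may look like it should do the trick, but note that inside square brackets the parentheses are literal characters, so this excludes the single characters (, a, b, c, and ) rather than the sequence abc.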


The $ marks the end of a string, so something like this should work: [[^abc]*]$ where you're looking for anything not ending in any iteration of abc, but it would have to be at the end

Also, if you're using a scripting language with regex support (like PHP or JavaScript), it has search functions that stop at the first occurrence of a pattern (and you can specify whether to search from the left or from the right, or in PHP you can reverse the string first).


Try this:

.+?efg

Query:

select REGEXP_REPLACE ('abcdefghijklmn','.+?efg', '') FROM dual;

Output:

hijklmn
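
For readers without an Oracle database at hand, the same replacement can be sketched with Python's re.sub. This is an assumed equivalent, not part of the original answer.

```python
import re

# Remove everything up to and including the first "efg".
result = re.sub(r".+?efg", "", "abcdefghijklmn")

print(result)  # hijklmn
```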