I am reading a book on regular expressions and I came across this example for \b
:
The cat scattered his food all over the room.
Using regex - \bcat\b
will match the word cat
but not the cat
in scattered
.
For \B
the author uses the following example:
Please enter the nine-digit id as it
appears on your color - coded pass-key.
Using regex \B-\B
matches -
between the word color - coded
. Using \b-\b
on the other hand matches the -
in nine-digit
and pass-key
.
How come in the first example we use \b
to separate cat
and in the second use \B
to separate -
? Using \b
in the second example does the opposite of what it did earlier.
Please explain the difference to me.
EDIT: Also, can anyone please explain with a new example?
The confusion stems from your thinking \b
matches spaces (probably because "b" suggests "blank").
\b
matches the empty string at the beginning or end of a word. \B
matches the empty string not at the beginning or end of a word. The key here is that "-" is not a part of a word. So <left>-<right>
matches \b-\b
because there are word boundaries on either side of the -
. On the other hand for <left> - <right>
(note the spaces), there are not word boundaries on either side of the dash. The word boundaries are one space further left and right.
On the other hand, when searching for \bcat\b
word boundaries behave more intuitively: it matches the standalone word cat as expected (the surrounding spaces are not part of the match).
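To see this on the sentence from the question, here is a minimal sketch. It assumes Python's `re` module (the book may use a different regex flavor), and `hyphen_contexts` is a helper name made up for this illustration. It shows the characters on either side of each hyphen that a pattern matches.

```python
import re

sentence = ("Please enter the nine-digit id as it "
            "appears on your color - coded pass-key.")

def hyphen_contexts(pattern):
    """Return the 3-character window around each matched hyphen."""
    return [sentence[m.start() - 1:m.end() + 1]
            for m in re.finditer(pattern, sentence)]

# Hyphen with a word character directly on both sides.
between_words = hyphen_contexts(r"\b-\b")
# Hyphen with a non-word character (here a space) on both sides.
between_spaces = hyphen_contexts(r"\B-\B")

print(between_words)   # ['e-d', 's-k']
print(between_spaces)  # [' - ']
```

The boundary is decided by the two characters next to the position, not by whether a space is present.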
\b
is a zero-width word boundary. Specifically:
Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W) as well as at the start and/or end of the string if the first and/or last characters in the string are word characters.
Example: .\b
matches c
in abc
\B
is a zero-width non-word boundary. Specifically:
Matches at the position between two word characters (i.e the position between \w\w) as well as at the position between two non-word characters (i.e. \W\W).
Example: \B.\B
matches b
in abc
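Both examples can be checked quickly. This is a small sketch assuming Python's `re` module; other flavors should behave the same for plain ASCII input.

```python
import re

# `.\b` needs a word boundary right after the matched character.
after_boundary = re.findall(r".\b", "abc")

# `\B.\B` needs a non-boundary on both sides of the matched character.
inside_word = re.findall(r"\B.\B", "abc")

print(after_boundary)  # ['c']
print(inside_word)     # ['b']
```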
See regular-expressions.info for more great regex info
With a different example:
Consider the following string, where the pattern to be searched for is 'cat':
text = "catmania thiscat thiscatmaina";
Now the definitions:
'\b' matches the pattern only where it sits at the beginning or end of a word.
'\B' matches the pattern only where it does not sit at the beginning or end of a word.
Different Cases:
Case 1: At the beginning of each word
result = text.replace(/\bcat/g, "ct");
Now, result is "ctmania thiscat thiscatmaina"
Case 2: At the end of each word
result = text.replace(/cat\b/g, "ct");
Now, result is "catmania thisct thiscatmaina"
Case 3: Not in the beginning
result = text.replace(/\Bcat/g, "ct");
Now, result is "catmania thisct thisctmaina"
Case 4: Not in the end
result = text.replace(/cat\B/g, "ct");
Now, result is "ctmania thiscat thisctmaina"
Case 5: Neither beginning nor end
result = text.replace(/\Bcat\B/g, "ct");
Now, result is "catmania thiscat thisctmaina"
Hope this helps :)
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
Source: http://www.regular-expressions.info/wordboundaries.html
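The three rules collapse into one test: a position is a word boundary when exactly one of its two sides is a word character. The sketch below hand-rolls that test and compares it with Python's `re` module. The helper names (`is_word_char`, `word_boundaries`, `regex_offsets`) and the sample sentence are my own, and the ASCII-only definition of a word character is an assumption made to keep the example simple.

```python
import re

def is_word_char(ch):
    # Assumption: ASCII-only \w, i.e. letters, digits and underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def word_boundaries(text):
    """Return every offset in 0..len(text) where exactly one side is a word character."""
    offsets = []
    for i in range(len(text) + 1):
        left = i > 0 and is_word_char(text[i - 1])
        right = i < len(text) and is_word_char(text[i])
        if left != right:
            offsets.append(i)
    return offsets

def regex_offsets(pattern, text):
    """Offsets where a zero-width pattern matches, using ASCII semantics."""
    return [m.start() for m in re.finditer(pattern, text, flags=re.ASCII)]

sample = "The cat scattered his food."
manual = word_boundaries(sample)
# Every remaining offset is a non-word-boundary.
others = [i for i in range(len(sample) + 1) if i not in manual]

print(manual)                                  # [0, 3, 4, 7, 8, 17, 18, 21, 22, 26]
print(manual == regex_offsets(r"\b", sample))  # True
print(others == regex_offsets(r"\B", sample))  # True
```

Note that offset 27 (after the final period) is not a boundary, because the last character is not a word character.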
Source © Copyright RexEgg.com
Word Boundary: \b*
The word boundary \b matches positions where one side is a word character (usually a letter, digit or underscore—but see below for variations across engines) and the other side is not a word character (for instance, it may be the beginning of the string or a space character).
The regex \bcat\b would, therefore, match cat in a black cat, but it wouldn't match it in catatonic, tomcat or certificate. Removing one of the boundaries, \bcat would match cat in catfish, and cat\b would match cat in tomcat, but not vice-versa. Both, of course, would match cat on its own.
Not-a-word-boundary: \B
\B matches all positions where \b doesn't match. Therefore, it matches:
✽ When neither side is a word character, for instance at any position in the string $=(@-%++) (including the beginning and end of the string)
✽ When both sides are a word character, for instance between the H and the i in Hi!
This may not seem very useful, but sometimes \B is just what you want. For instance,
✽ \Bcat\B will find cat fully surrounded by word characters, as in certificate, but neither on its own nor at the beginning or end of words.
✽ cat\B will find cat both in certificate and catfish, but neither in tomcat nor on its own.
✽ \Bcat will find cat both in certificate and tomcat, but neither in catfish nor on its own.
✽ \Bcat|cat\B will find cat in embedded situation, e.g. in certificate, catfish or tomcat, but not on its own.
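The claims above can be tabulated with a short script. This is a sketch assuming Python's `re` module; the word list is taken from the examples in the text.

```python
import re

words = ["cat", "catfish", "tomcat", "certificate"]
patterns = [r"\bcat\b", r"\bcat", r"cat\b",
            r"\Bcat\B", r"cat\B", r"\Bcat", r"\Bcat|cat\B"]

# For each pattern, collect the words in which it finds a match.
matches = {p: [w for w in words if re.search(p, w)] for p in patterns}

for pattern, found in matches.items():
    print(f"{pattern:<12} {found}")
# \bcat\b      ['cat']
# \bcat        ['cat', 'catfish']
# cat\b        ['cat', 'tomcat']
# \Bcat\B      ['certificate']
# cat\B        ['catfish', 'certificate']
# \Bcat        ['tomcat', 'certificate']
# \Bcat|cat\B  ['catfish', 'tomcat', 'certificate']
```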
\b is used as a word boundary.
import re
word = "categorical cat"
Find all occurrences of "cat" in the above string
without \b:
re.findall(r'cat', word)
['cat', 'cat']
with \b:
re.findall(r'\bcat\b', word)
['cat']
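\b matches a word boundary. \B matches non-word-boundaries, and is equivalent to (?!\b), not to [^\b] (thanks to @Alan Moore for the correction!). In most flavors, \b inside a character class stands for a backspace character, so [^\b] consumes one character instead of asserting a position. Both \b and \B are zero-width.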
See http://www.regular-expressions.info/wordboundaries.html for details. The site is extremely useful for many basic regex questions.
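A quick way to convince yourself of the equivalence is to compare match offsets. This sketch assumes Python's `re` module; the sample text and the `offsets` helper are made up for illustration, and the behavior of \b inside a character class may differ in other flavors.

```python
import re

text = "nine-digit color - coded"

def offsets(pattern):
    """Offsets at which a zero-width pattern matches in `text`."""
    return [m.start() for m in re.finditer(pattern, text)]

# \B and the negative lookahead (?!\b) succeed at exactly the same positions.
same_offsets = offsets(r"\B") == offsets(r"(?!\b)")
print(same_offsets)  # True

# Inside a character class, Python treats \b as the backspace character (\x08).
backspace_matches = re.findall(r"[\b]", "a\x08b")
print(backspace_matches)  # ['\x08']

# [^\b] therefore consumes any single character except a backspace.
negated_class_matches = re.findall(r"[^\b]", "a-b")
print(negated_class_matches)  # ['a', '-', 'b']
```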
Let's take a string like:
XIX IXI XX X I II IIXX XXII I-I X-X -X X- X-I I-X -X- -I-X -X-I I-X- X-I- X_X _X-
Note: Underscore ( _ ) is not considered a special character in this case, because \w includes it. In each result below, every matched X is shown as a lowercase x.
/\bX\b/g
Should begin and end with a special character or white space
XIX IXI XX x I II IIXX XXII I-I x-x -x x- x-I I-x -x- -I-x -x-I I-x- x-I- X_X _X-
/\bX/g
Should begin with a special character or white space
xIX IXI xX x I II IIXX xXII I-I x-x -x x- x-I I-x -x- -I-x -x-I I-x- x-I- x_X _X-
/X\b/g
Should end with a special character or white space
XIx IXI Xx x I II IIXx XXII I-I x-x -x x- x-I I-x -x- -I-x -x-I I-x- x-I- X_x _x-
/\BX\B/g
Should not begin and not end with a special character or white space
XIX IxI XX X I II IIxX XxII I-I X-X -X X- X-I I-X -X- -I-X -X-I I-X- X-I- X_X _X-
/\BX/g
Should not begin with a special character or white space
XIx IxI Xx X I II IIxx XxII I-I X-X -X X- X-I I-X -X- -I-X -X-I I-X- X-I- X_x _x-
/X\B/g
Should not end with a special character or white space
xIX IxI xX X I II IIxX xxII I-I X-X -X X- X-I I-X -X- -I-X -X-I I-X- X-I- x_X _X-
/\bX\B/g
Should begin and not end with a special character or white space
xIX IXI xX X I II IIXX xXII I-I X-X -X X- X-I I-X -X- -I-X -X-I I-X- X-I- x_X _X-
/\BX\b/g
Should not begin and should end with a special character or white space
XIx IXI Xx X I II IIXx XXII I-I X-X -X X- X-I I-X -X- -I-X -X-I I-X- X-I- X_x _x-
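To reproduce which X each pattern picks, one option is to replace every match with a lowercase x and compare the output with the input. This is a sketch assuming Python's `re` module rather than the JavaScript-style /.../g literals used above; the lowercase marking is my own convention.

```python
import re

text = ("XIX IXI XX X I II IIXX XXII I-I X-X -X X- X-I I-X -X- "
        "-I-X -X-I I-X- X-I- X_X _X-")

patterns = [r"\bX\b", r"\bX", r"X\b", r"\BX\B",
            r"\BX", r"X\B", r"\bX\B", r"\BX\b"]

# Replace each matched X with a lowercase x so the matches stand out.
marked = {p: re.sub(p, "x", text) for p in patterns}

for pattern, result in marked.items():
    print(f"{pattern:<8} {result}")
```

Because the underscore is a word character, the X's in X_X and _X- behave as if they were next to a letter.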
As mentioned in https://www.regular-expressions.info/wordboundaries.html :
There are three different positions that qualify as word boundaries for \b:
- Before the first character in the string, if the first character is a word character \w.
- After the last character in the string, if the last character is a word character \w.
- Between two characters in the string, where one is a word character \w and the other is not a word character \W.
To have a better understanding of \b
, I'd like to consider the string by putting the word boundaries on it using arrows.
Click this link for the array visualization of the string - 'THE CAT SCATTERED'.
Click this link for the array visualization of the string - 'THE NINE-DIGIT COLOR - CODED PASS-KEY'
In the string THE CAT SCATTERED:
- The word boundary at index 0 is assigned by following condition 1 mentioned above.
- The word boundary at index 16 is assigned by following condition 2.
- The word boundaries at indices 2, 4, 6 and 8 are assigned by following condition 3.
In the string THE NINE-DIGIT COLOR - CODED PASS-KEY:
- The word boundary at index 0 is assigned by following condition 1.
- All the remaining word boundaries are assigned by following condition 3. Note here that since the string ends with a '.' character (which is not a word character \w), condition 2 is not applied.
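A text-only version of the same visualization can be produced by inserting a marker at every word boundary. This is a sketch assuming Python 3.7+ and its `re` module; the `|` marker and the `mark_boundaries` helper are my own choices, and the second string is shortened to the part that matters, with the trailing period included.

```python
import re

def mark_boundaries(text):
    """Insert '|' at every position where \\b matches."""
    return re.sub(r"\b", "|", text)

cat_marked = mark_boundaries("THE CAT SCATTERED")
key_marked = mark_boundaries("COLOR - CODED PASS-KEY.")

print(cat_marked)  # |THE| |CAT| |SCATTERED|
print(key_marked)  # |COLOR| - |CODED| |PASS|-|KEY|.
```

In the second output there is no marker after the final period, which is condition 2 not applying, and none next to the spaced hyphen in "COLOR - CODED".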
A similar array visualization can be done for the non-word boundary \B using the following condition:
(Credits: see @Ganesh M S's answer to the same question)
\B matches all positions where \b doesn't match, i.e.:
- When neither side is a word character (i.e. when both sides are \W), for instance at any position in the string $=(@-%++) (including the beginning and end of the string)
- When both sides are a word character \w, for instance between the H and the i in Hi!
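The two cases can be made visible by inserting a marker wherever \B matches. This is a sketch assuming Python 3.7+ and its `re` module; the `^` marker and the `mark_non_boundaries` helper are made up for illustration.

```python
import re

def mark_non_boundaries(text):
    """Insert '^' at every position where \\B matches."""
    return re.sub(r"\B", "^", text)

symbols_marked = mark_non_boundaries("$=(@-%++)")
hi_marked = mark_non_boundaries("Hi!")

print(symbols_marked)  # ^$^=^(^@^-^%^+^+^)^
print(hi_marked)       # H^i!^
```

In "Hi!" the only interior match is between H and i; the position after "!" also matches because neither side of it is a word character.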
\B is "not \b", i.e. the negation of \b.
In color - coded there is no word boundary on either side of the -, so it matches \B-\B.
In your first example there are word boundaries on both sides of cat, so it matches \bcat\b.
Similar rules apply to the other escapes too: \W is the negation of \w.
In general, an UPPER CASE escape is the negation of the corresponding LOWER CASE escape.
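The upper/lower case convention can be checked for the character-class escapes as well. This is a sketch assuming Python's `re` module; the sample characters and the `complementary` helper are arbitrary choices for illustration.

```python
import re

pairs = [(r"\w", r"\W"), (r"\d", r"\D"), (r"\s", r"\S")]
sample = "a1_ -!"

def complementary(lower, upper):
    """True if every sample character matches exactly one of the two escapes."""
    return all(
        bool(re.fullmatch(lower, ch)) != bool(re.fullmatch(upper, ch))
        for ch in sample
    )

results = {lower: complementary(lower, upper) for lower, upper in pairs}
print(all(results.values()))  # True
```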