I'm trying to use a Python regex to find a mathematical expression in a string. The problem is that the forward slash seems to do something unexpected. I'd have thought that [\w\d\s+-/*]* would work for finding math expressions, but it finds commas too for some reason. A bit of experimenting reveals that forward slashes are the culprit. For example:
>>> import re
>>> re.sub(r'[/]*', 'a', 'bcd')
'abacada'
Apparently forward slashes match between characters (even when they are in a character class, though only when the asterisk is present). Backslashes do not escape them. I've hunted for a while and not found any documentation on it. Any pointers?
See the documentation for Python's re module.
I think the culprit is not the / but rather the - in your first character class: [+-/] matches +, / and every ASCII character in between, namely the comma, the hyphen and the period. That is why the comma is matched.
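To see exactly which characters the range covers, here is a small sketch that lists the code points from + to /. The variable names are only illustrative.

```python
import re

# List every character covered by the range +-/ inside a character class.
chars_in_range = [chr(code) for code in range(ord('+'), ord('/') + 1)]
print(chars_in_range)  # ['+', ',', '-', '.', '/']

# The comma therefore matches the class even though it was never typed.
comma_matches = re.fullmatch(r'[+-/]', ',') is not None
print(comma_matches)  # True
```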
Maybe this hint from the docs helps:
If you want to include a ']' or a '-' inside a set, precede it with a backslash, or place it as the first character.
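As a rough illustration of that hint, the sketch below compares three spellings of a class holding a literal hyphen: escaped, placed first, and placed last (the last position also works in Python's re, although the quote above does not mention it). The dictionary keys are just labels chosen for this example.

```python
import re

# Three ways to spell a class containing a literal hyphen, + and /.
patterns = {
    'escaped': r'[+\-/]',
    'first': r'[-+/]',
    'last': r'[+/-]',
}

for name, pattern in patterns.items():
    matches_hyphen = re.fullmatch(pattern, '-') is not None
    matches_comma = re.fullmatch(pattern, ',') is not None
    # Each spelling should accept the hyphen and reject the comma.
    print(name, matches_hyphen, matches_comma)
```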
You are telling it to replace zero or more slashes with 'a'. So it does replace each "no character" with 'a'. :)
You probably meant [/]+, i.e. one or more slashes.
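A quick sketch of the difference between the two quantifiers, using a made-up sample string:

```python
import re

text = 'b//c/d'

# + needs at least one slash, so only real runs of slashes are replaced.
plus_result = re.sub(r'[/]+', 'a', text)
print(plus_result)  # bacad

# With no slashes in the input, + finds nothing to replace,
# while * matches the empty string at every position.
plus_no_slash = re.sub(r'[/]+', 'a', 'bcd')
star_no_slash = re.sub(r'[/]*', 'a', 'bcd')
print(plus_no_slash)  # bcd
print(star_no_slash)  # abacada
```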
EDIT: Read Ber's answer for a solution to the original problem. I didn't read the whole question carefully enough.
r'[/]*' means "Match 0 or more forward-slashes". There are exactly 0 forward-slashes before 'b', between 'b' & 'c', between 'c' & 'd', and after 'd'. Hence, each of those empty matches is replaced with 'a'.
The * matches the preceding element zero or more times, and thus can match the empty string. The empty string is (logically) between any two consecutive characters, as well as at the start and end of the string. Hence:
>>> import re
>>> re.sub(r'x*', 'a', 'bcd')
'abacada'
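To make the empty matches visible, this sketch prints where each match starts and ends. Every span has zero width, and there is one at each position in the string, including the very start and the very end.

```python
import re

# Collect the (start, end) span of every match of x* in 'bcd'.
spans = [m.span() for m in re.finditer(r'x*', 'bcd')]
print(spans)  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```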
As for the forward slash, it receives no special treatment:
>>> re.sub(r'/', 'a', 'b/c/d')
'bacad'
The documentation describes the syntax of regular expressions in Python. As you can see, the forward slash has no special function.
The reason that [\w\d\s+-/*]* also finds commas is that inside square brackets the dash - denotes a range. In this case you don't want all characters between + and /, but the literal characters +, - and /. So write the dash as the last character: [\w\d\s+/*-]*. That should fix it.
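Here is a short check of both patterns against a made-up input; the sample string and variable names are assumptions for illustration only.

```python
import re

original = r'[\w\d\s+-/*]*'  # +-/ is read as a range, which includes the comma
fixed = r'[\w\d\s+/*-]*'     # dash placed last, so it is a literal hyphen

sample = '1 + 2, 3'

# The original pattern runs straight through the comma.
original_match = re.match(original, sample).group()
print(repr(original_match))  # '1 + 2, 3'

# The fixed pattern stops at the comma.
fixed_match = re.match(fixed, sample).group()
print(repr(fixed_match))  # '1 + 2'
```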