How do I add tags to certain strings in python using re.sub?

I'm trying to add tags around given query strings, so that the tags wrap every matching string. For example, I want to wrap tags around all the words that match the query iphone games mac in the sentence I love downloading iPhone games from my mac. The result should be I love downloading <em>iPhone games</em> from my <em>mac</em>.

So far, I have tried

import re

sentence = "I love downloading iPhone games from my mac."
query = r'((iphone|games|mac)\s*)+'
regex = re.compile(query, re.I)  # re.I makes the match case-insensitive
sentence = regex.sub(r'<em>\1</em> ', sentence)

The resulting sentence is

I love downloading <em>games </em> from my <em>mac</em> .

Here \1 is replaced by only one word (games instead of iPhone games), and there are some unnecessary spaces after the words. How do I write the regular expression to get the desired output? Thanks!

Edit: I just realized that both Fred's and Chris's solutions have problems when the query words appear inside other words. For instance, if my query is game, then games turns into <em>game</em>s, whereas I don't want it highlighted at all. Likewise, the inside either shouldn't be highlighted.

Edit 2: I took Chris' new solution and it works.

First of all, to get the spaces as you want them, replace \s* with \s*? to make it non-greedy.

First fix:

>>> import re
>>> sentence = "I love downloading iPhone games from my mac."
>>> re.compile(r'(((iphone|games|mac)\s*?)+)', re.I).sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone</em> <em>games</em> from my <em>mac</em>.'

Unfortunately, once the \s* is non-greedy, it splits phrases, as you can see. With the greedy \s*, the two words are grouped together, but the trailing space ends up inside the tag:

>>> re.compile(r'(((iPhone|games|mac)\s*)+)').sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone games </em>from my <em>mac</em>.'

I can't yet think of how to fix this.

Note also that in these I have added an extra set of parentheses around the + repetition, so that \1 captures the whole run of matched words rather than only the last one. That is the difference from your original pattern.
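To see why the extra parentheses matter, here is a small sketch comparing what each group holds after a match. The variable names are mine, for illustration only. A capture group inside a repetition keeps just its last iteration, which is why the original \1 held a single word.

```python
import re

sentence = "I love downloading iPhone games from my mac."

# Original pattern: group 1 sits inside the repetition, so it is overwritten
# on every iteration and ends up holding only the last word matched.
inner = re.compile(r'((iphone|games|mac)\s*)+', re.I)
m_inner = inner.search(sentence)
inner_whole = m_inner.group(0)    # everything the pattern consumed
inner_group1 = m_inner.group(1)   # only the final iteration

# With an extra pair of parentheses around the repetition, group 1 spans
# the entire run of matched words.
outer = re.compile(r'(((iphone|games|mac)\s*)+)', re.I)
m_outer = outer.search(sentence)
outer_group1 = m_outer.group(1)

print(repr(inner_whole))
print(repr(inner_group1))
print(repr(outer_group1))
```

The whole match is the same in both cases; only what \1 refers to changes.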

Further update: actually, I can think of a way to get around it. You decide whether you want it like that.

>>> regex = re.compile(r'((iphone|games|mac)(\s*(iphone|games|mac))*)', re.I)
>>> regex.sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'

Update: taking your point about word boundaries into account, we only need to add in a few instances of \b, the word boundary matcher.

>>> regex = re.compile(r'(\b(iphone|games|mac)\b(\s*(iphone|games|mac)\b)*)', re.I)
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone games from my mac')
'I love downloading <em>iPhone games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone gameses from my mac')
'I love downloading <em>iPhone</em> gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney games from my mac')
'I love downloading iPhoney <em>games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney gameses from my mac')
'I love downloading iPhoney gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone gameses from my mac')
'I love downloading miPhone gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone games from my mac')
'I love downloading miPhone <em>games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone igames from my mac')
'I love downloading <em>iPhone</em> igames from my <em>mac</em>'
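If the query arrives as a plain string rather than a hand-written pattern, the same regex can be assembled from it. This is only a sketch: highlight is a hypothetical helper name, and it assumes the query words are separated by whitespace and consist of word characters (\b behaves differently around words that start or end with punctuation, such as c++).

```python
import re

def highlight(sentence, query):
    """Wrap each run of whole-word query matches in <em>...</em>.

    Hypothetical helper, not part of the original answer.
    """
    # Escape each query word so regex metacharacters are matched literally.
    words = [re.escape(word) for word in query.split()]
    if not words:
        return sentence  # nothing to highlight
    alternation = '|'.join(words)
    # Same shape as the pattern above: one word, then any further words
    # separated only by whitespace, each required to be a whole word.
    pattern = re.compile(
        r'\b(?:%s)\b(?:\s*(?:%s)\b)*' % (alternation, alternation), re.I)
    # \g<0> refers to the entire match, so no capturing group is needed.
    return pattern.sub(r'<em>\g<0></em>', sentence)

print(highlight('I love downloading iPhone games from my mac.', 'iphone games mac'))
print(highlight('I love downloading games', 'game'))
```

Using non-capturing groups with \g<0> is a design choice of this sketch; the capturing version with \1 from the answer works equally well.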

>>> r = re.compile(r'(\s*)((?:\s*\b(?:iphone|games|mac)\b)+)', re.I)
>>> r.sub(r'\1<em>\2</em>', sentence)
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'

The extra group that fully contains the + repetition keeps every word in \2 instead of only the last one. Matching the whitespace before each word rather than after it, and capturing the leading whitespace separately in \1 so that it can be put back outside the tag, gets rid of the stray spaces. The \b word-boundary assertions require whole-word matches for the three query words. However, NLP is hard, and there will still be cases where this doesn't work as expected.
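To make the division of labour between the two groups concrete, here is a short sketch that lists what each one captures. The names parts and with_comma are mine. The last line shows one case where the behaviour may or may not be what you want: punctuation between two query words ends the run, so each word gets its own tag.

```python
import re

sentence = "I love downloading iPhone games from my mac."
r = re.compile(r'(\s*)((?:\s*\b(?:iphone|games|mac)\b)+)', re.I)

# Each match splits into the leading whitespace (group 1), which is put back
# outside the tag, and the run of query words (group 2), which goes inside it.
parts = [(m.group(1), m.group(2)) for m in r.finditer(sentence)]
print(parts)

# A comma between two query words breaks the run into two separate matches.
with_comma = r.sub(r'\1<em>\2</em>', 'iPhone, games')
print(with_comma)
```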