I'm looking for a clean way to get a set (list, array, whatever) of words starting with #
inside a given string.
In C#, I would write
var hashtags = input
    .Split (' ')
    .Where (s => s[0] == '#')
    .Select (s => s.Substring (1))
    .Distinct ();
What is comparatively elegant code to do this in Python?
EDIT
Sample input: "Hey guys! #stackoverflow really #rocks #rocks #announcement"
Expected output: ["stackoverflow", "rocks", "announcement"]
With @inspectorG4dget's answer, if you want no duplicates, you can use set comprehensions instead of list comprehensions.
>>> tags="Hey guys! #stackoverflow really #rocks #rocks #announcement"
>>> {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
set(['announcement', 'rocks', 'stackoverflow'])
Note that the { } syntax for set comprehensions only works starting with Python 2.7.
If you're working with older versions, feed the list comprehension ([ ]) output to the set function, as suggested by @Bertrand.
[i[1:] for i in line.split() if i.startswith("#")]
This version will not trip over empty strings (I have read such concerns in the comments), because split() with no arguments never produces them. Note that a token that is only "#" still passes the filter and comes through as an empty string. Also, as in Bertrand Marron's code, it's better to turn this into a set as follows (to avoid duplicates and for O(1) lookup time):
set([i[1:] for i in line.split() if i.startswith("#")])
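As a minimal sketch of the same idea in Python 3 syntax: the function name extract_hashtags is illustrative and not from the answers, and the len(word) > 1 check is an added assumption that drops a bare "#", which would otherwise come through as an empty string.

```python
def extract_hashtags(line):
    """Return the distinct hashtags in line, without the leading '#'."""
    # split() with no arguments never yields empty strings;
    # len(word) > 1 skips a bare "#" that would otherwise become "".
    return {word[1:] for word in line.split()
            if word.startswith("#") and len(word) > 1}


tags = extract_hashtags("Hey guys! #stackoverflow really #rocks #rocks # #announcement")
print(sorted(tags))  # ['announcement', 'rocks', 'stackoverflow']
```

A set does not remember the order in which the tags appeared, which is why the result is sorted before printing.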
The findall method of regular expression objects can get them all at once:
>>> import re
>>> s = "this #is a #string with several #hashtags"
>>> pat = re.compile(r"#(\w+)")
>>> pat.findall(s)
['is', 'string', 'hashtags']
>>>
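findall keeps duplicates, while the question asks for distinct tags. One way to drop them while keeping the order of first appearance is dict.fromkeys; this is a sketch assuming Python 3.7 or later (where dicts preserve insertion order), and the names HASHTAG and unique_hashtags are illustrative.

```python
import re

HASHTAG = re.compile(r"#(\w+)")


def unique_hashtags(text):
    # dict.fromkeys keeps only the first occurrence of each key,
    # in insertion order (guaranteed from Python 3.7 on).
    return list(dict.fromkeys(HASHTAG.findall(text)))


result = unique_hashtags("Hey guys! #stackoverflow really #rocks #rocks #announcement")
print(result)  # ['stackoverflow', 'rocks', 'announcement']
```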
I'd say
hashtags = [word[1:] for word in input.split() if word[0] == '#']
Edit: this will create a set without any duplicates.
set(hashtags)
There are some problems with the answers presented here.
1. {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
[i[1:] for i in line.split() if i.startswith("#")]
won't work if you have a hashtag like '#one#two#'.
2. re.compile(r"#(\w+)")
won't work for many Unicode languages (even using re.UNICODE).
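To see the first problem concretely, here is a small check in Python 3 syntax (the variable names are illustrative). Splitting on whitespace treats '#one#two#' as a single token, so the two tags are never separated:

```python
import re

text = "#one#two#"

# Whitespace splitting sees one token, so the tags stay glued together.
by_strip = {tag.strip("#") for tag in text.split() if tag.startswith("#")}
by_slice = [i[1:] for i in text.split() if i.startswith("#")]

# The regex does separate the tags for this particular input.
by_regex = re.findall(r"#(\w+)", text)

print(by_strip)  # {'one#two'}
print(by_slice)  # ['one#two#']
print(by_regex)  # ['one', 'two']
```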
I have seen more ways to extract hashtags, but found none of them covering all cases,
so I wrote some small Python code to handle most of the cases. It works for me.
def get_hashtagslist(string):
    ret = []
    s = ''
    hashtag = False
    for char in string:
        if char == '#':
            # A '#' starts a new hashtag and ends the one in progress, if any.
            hashtag = True
            if s:
                ret.append(s)
                s = ''
            continue
        # Take only the prefix of the hashtag if it contains one of these chars
        # (e.g. for '#happy,but i..' only 'happy' is taken).
        if hashtag and char in [' ', '.', ',', '(', ')', ':', '{', '}']:
            if s:
                ret.append(s)
                s = ''
            # End the hashtag even when nothing was collected (e.g. '# word').
            hashtag = False
        if hashtag:
            s += char
    if s:
        ret.append(s)
    return set(ret)
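For comparison, roughly the same rule can be written as a single regular expression. This is only a sketch: it assumes the same delimiter set as the loop above (space . , ( ) : { }) plus '#', the names HASHTAG_RE and get_hashtags_regex are illustrative, and results may differ on edge cases such as a '#' followed directly by a delimiter.

```python
import re

# Assumption: a hashtag is '#' followed by one or more characters that are
# not '#' and not one of the delimiters used by the loop above.
HASHTAG_RE = re.compile(r"#([^ #.,(){}:]+)")


def get_hashtags_regex(text):
    # findall returns only the captured group, i.e. the tag without '#'.
    return set(HASHTAG_RE.findall(text))


found = get_hashtags_regex("#one#two# and #happy,but i")
print(sorted(found))  # ['happy', 'one', 'two']
```

Because the character class excludes specific delimiters instead of relying on \w, it does not depend on which characters the regex engine counts as word characters.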
Another option is a regular expression:
import re
inputLine = "Hey guys! #stackoverflow really #rocks #rocks #announcement"
re.findall(r'(?i)\#\w+', inputLine) # includes the leading #
re.findall(r'(?i)(?<=\#)\w+', inputLine) # does not include the #