I have a string, say "Hello_World I am Learning,Ruby". I would like to split this string into each distinct word. What's the best way?
Thanks! C.
You could use \W for any non-word character, adding _ to the class because the underscore itself counts as a word character:
"Hello_World I am Learning,Ruby".split /[\W_]/
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
"Hello_World I am Learning, Ruby".split /[\W_]+/
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
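The + in the second pattern matters once delimiters come in runs, such as the comma followed by a space in the second string. A small sketch of the difference (the variable names are only for illustration):

```ruby
str = "Hello_World I am Learning, Ruby"

# Without the quantifier, ", " is two separate delimiters,
# so an empty string appears between them.
single = str.split(/[\W_]/)
# => ["Hello", "World", "I", "am", "Learning", "", "Ruby"]

# With +, a run of delimiters is treated as one separator.
run = str.split(/[\W_]+/)
# => ["Hello", "World", "I", "am", "Learning", "Ruby"]
```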
You can use String#split with a regex pattern as the argument, like this:
"Hello_World I am Learning,Ruby".split /[ _,.!?]/
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
ruby-1.9.2-p290 :022 > str = "Hello_World I am Learning,Ruby"
ruby-1.9.2-p290 :023 > str.split(/\s|,|_/)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
String#scan seems to be an appropriate method for this task:
irb(main):018:0> "Hello_World I am Learning,Ruby".scan(/[a-z]+/i)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
Or you might use the built-in \w matcher, though note that it treats the underscore as a word character:
irb(main):020:0> "Hello_World I am Learning,Ruby".scan(/\w+/)
=> ["Hello_World", "I", "am", "Learning", "Ruby"]
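One difference from split that may be worth knowing: scan collects matches instead of cutting at separators, so leading or repeated delimiters never produce empty strings. A quick sketch with a made-up input string:

```ruby
messy = ",,Hello__World  I am"

# split keeps an empty first element when the string starts with a delimiter
by_split = messy.split(/[\W_]+/)
# => ["", "Hello", "World", "I", "am"]

# scan only returns what matched, so there is nothing to clean up
by_scan = messy.scan(/[a-z]+/i)
# => ["Hello", "World", "I", "am"]
```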
While the above examples work, I think that when splitting a string into words it is better to split on characters that are not considered part of any kind of word. Here is what I did:
str = "Hello_World I am Learning,Ruby"
str.split(/[^a-zA-Z]/).reject(&:empty?).compact
This statement does the following:
- Splits the string on every character that is not an ASCII letter
- Rejects any resulting empty strings
- Calls compact to remove nil values (split never returns nil, so this last step is redundant and can be dropped)
This handles most combinations of words. The earlier examples require you to list every delimiter you want to match against. It's far easier to say which characters belong to a word and split on everything else.
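The approach could be wrapped in a small helper, roughly like this. The method name words is hypothetical, and the sketch assumes that only ASCII letters count as part of a word, so digits and accented letters act as delimiters:

```ruby
# Hypothetical helper: split on anything that is not an ASCII letter,
# then drop the empty strings left by adjacent or leading delimiters.
def words(str)
  str.split(/[^a-zA-Z]/).reject(&:empty?)
end

plain = words("Hello_World I am Learning,Ruby")
# => ["Hello", "World", "I", "am", "Learning", "Ruby"]

messy = words("  Hello,, World!")
# => ["Hello", "World"]
```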
Just for fun, a Unicode-aware version for 1.9 (or 1.8 with Oniguruma):
>> "This_µstring has words.and thing's".split(/[^\p{Word}']|\p{Connector_Punctuation}/)
=> ["This", "µstring", "has", "words", "and", "thing's"]
Or maybe:
>> "This_µstring has words.and thing's".split(/[^\p{Word}']|_/)
=> ["This", "µstring", "has", "words", "and", "thing's"]
The real problem is determining which sequences of characters constitute a "word" in this context. You might want to have a look at the Oniguruma docs for the character properties that are supported; Wikipedia has some notes on the properties as well.
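One way to make that decision explicit is to describe a word positively and collect matches with scan. The sketch below assumes a word is a run of Unicode letters plus the apostrophe, which is only one possible definition, and it assumes a Ruby version where source files default to UTF-8:

```ruby
text = "This_µstring has words.and thing's"

# \p{L} matches any Unicode letter; the apostrophe is added so that
# contractions and possessives stay in one piece.
unicode_words = text.scan(/[\p{L}']+/)
# => ["This", "µstring", "has", "words", "and", "thing's"]
```

Under this definition digits and underscores are delimiters; widening the character class (for example with \p{N} for numbers) changes what counts as a word.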