Ruby string split on more than one character

Developer https://www.devze.com 2023-04-11 23:28 Source: Internet

I have a string, say "Hello_World I am Learning,Ruby". I would like to split this string into each distinct word. What's the best way?

Thanks! C.

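You could use \W to match any non-word character. The underscore counts as a word character, so add it to the character class explicitly:

"Hello_World I am Learning,Ruby".split(/[\W_]/)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]

Append + to treat a run of separators as a single delimiter:

"Hello_World I am Learning,   Ruby".split(/[\W_]+/)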
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
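To see why the + matters, here is a minimal sketch that runs both patterns on the same input. The variable names are only illustrative, and the input is the string from the second example above.

```ruby
# Input with a comma followed by several spaces (from the example above).
text = "Hello_World I am Learning,   Ruby"

# Without +, every separator character is its own delimiter, so empty
# strings appear between adjacent separators.
single = text.split(/[\W_]/)

# With +, a run of separators counts as one delimiter.
runs = text.split(/[\W_]+/)

p single  # => ["Hello", "World", "I", "am", "Learning", "", "", "", "Ruby"]
p runs    # => ["Hello", "World", "I", "am", "Learning", "Ruby"]
```

Note that a separator at the very start of the string still produces one leading empty string, even with the + in place.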

You can pass a regex pattern to String#split. Like this:

"Hello_World I am Learning,Ruby".split(/[ _,.!?]/)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]

ruby-1.9.2-p290 :022 > str =  "Hello_World I am Learning,Ruby"
ruby-1.9.2-p290 :023 > str.split(/\s|,|_/)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]

String#scan seems to be an appropriate method for this task:

irb(main):018:0> "Hello_World    I am Learning,Ruby".scan(/[a-z]+/i)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]

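Or you might use the built-in \w matcher, but note that it treats the underscore as a word character, so Hello_World stays in one piece: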

irb(main):020:0> "Hello_World    I am Learning,Ruby".scan(/\w+/)
=> ["Hello_World", "I", "am", "Learning", "Ruby"]
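The difference between the two patterns is easier to see on input that contains underscores and digits. The following is a small sketch; the sample string is made up for illustration and is not from the question.

```ruby
# Hypothetical sample containing an underscore and a digit.
text = "snake_case and Ruby2"

# Letters only: underscores and digits act as separators.
letters_only = text.scan(/[a-z]+/i)

# \w matches letters, digits and the underscore.
word_chars = text.scan(/\w+/)

p letters_only  # => ["snake", "case", "and", "Ruby"]
p word_chars    # => ["snake_case", "and", "Ruby2"]
```

Because scan collects matches instead of cutting at delimiters, it never returns empty strings, whatever the separators look like.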

While the examples above work, I think it is better, when splitting a string into words, to split on any character that cannot be part of a word. Here is what I did:

str =  "Hello_World I am Learning,Ruby"
str.split(/[^a-zA-Z]/).reject(&:empty?).compact

This statement does the following:

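Splits the string on every character that is not an ASCII letter
Then rejects any empty strings, which appear wherever two separators are adjacent
And calls compact to drop nil entries (split never returns nil, so this step is only a safeguard)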

This handles most combinations of words. The other examples require you to list every separator character you want to match. It is far easier to specify the characters that make up a word and split on everything else.
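As a rough sketch, the same idea can be wrapped in a helper. The method name words is hypothetical, and compact is left out here on the assumption that split never yields nil.

```ruby
# Hypothetical helper; the name `words` is not from the original answer.
def words(str)
  # Split on every non-letter, then drop the empty strings that
  # adjacent or leading separators leave behind.
  str.split(/[^a-zA-Z]/).reject(&:empty?)
end

p words("Hello_World I am Learning,Ruby")
# => ["Hello", "World", "I", "am", "Learning", "Ruby"]

p words("  --Hello,,World!!  ")
# => ["Hello", "World"]
```

The character class only covers ASCII letters, so digits and accented characters are treated as separators in this version.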

Just for fun, a Unicode-aware version for 1.9 (or 1.8 with Oniguruma):

>> "This_µstring has words.and thing's".split(/[^\p{Word}']|\p{Connector_Punctuation}/)
=> ["This", "µstring", "has", "words", "and", "thing's"]

Or maybe:

>> "This_µstring has words.and thing's".split(/[^\p{Word}']|_/)
=> ["This", "µstring", "has", "words", "and", "thing's"]
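A scan-based variant of the same idea might look like the sketch below. It assumes a word is a run of Unicode letters plus the apostrophe, so \p{L} stands in for \p{Word} here; that is a narrower definition than the one above, since digits are excluded as well as the underscore.

```ruby
# Same sample string as in the examples above.
text = "This_µstring has words.and thing's"

# Collect runs of Unicode letters and apostrophes. \p{L} excludes the
# underscore, so no extra alternative is needed for it.
words = text.scan(/[\p{L}']+/)

p words  # => ["This", "µstring", "has", "words", "and", "thing's"]
```

Using scan also avoids the empty strings that split returns when two separators sit next to each other.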

The real problem is determining what sequence of characters constitutes a "word" in this context. You might want to have a look at the Oniguruma docs for the character properties that are supported. Wikipedia has some notes on the properties as well.