I wonder if there is a way to get offsets and delimiters while splitting a string in Ruby, analogous to PHP's preg_split:
preg_split("/(&nbsp;| |<|>|\t|\n|\r|;|\.)/i", $html_string, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);
I imagine I can achieve it by traversing the string character by character, or by using something as heavy as Treetop, but I would like to use something more convenient.
You're looking for MatchData#offset or MatchData#begin, which you can access on Regexp.last_match or $~:
html_string.scan(/(&nbsp;| |<|>|\t|\n|\r|;|\.)/i) do |match|
  # Prints the begin and end position of this match, e.g. [5, 10]
  p Regexp.last_match.offset(0)
end
You can fetch offsets from $~ in Ruby, for example:
"foobarbaz".scan(/[oa]+/) { p [$~.begin(0), $~.end(0), $~.to_s] }
prints
[1, 3, "oo"]
[4, 5, "a"]
[7, 8, "a"]
Based on this you can write a loop which generates the same offsets as your PHP code did.
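As a rough sketch of such a loop: the helper name split_with_offsets is made up for illustration, and the delimiter pattern is assumed from the question. It keeps empty pieces, as preg_split does without PREG_SPLIT_NO_EMPTY. Note that MatchData#begin reports character offsets, whereas PHP reports byte offsets, so the numbers can differ for multibyte text.

```ruby
# Hypothetical helper (not a built-in): splits +str+ on +pattern+ and
# returns [text, offset] pairs for both the pieces and the delimiters.
def split_with_offsets(str, pattern)
  pieces = []
  pos = 0 # position just past the previous delimiter
  str.scan(pattern) do
    m = Regexp.last_match
    # Text between the previous delimiter and this one (may be empty)
    pieces << [str[pos...m.begin(0)], pos]
    # The delimiter itself
    pieces << [m[0], m.begin(0)]
    pos = m.end(0)
  end
  # Whatever follows the last delimiter
  pieces << [str[pos..-1], pos]
  pieces
end

# Assumed delimiter set, modelled on the pattern in the question
result = split_with_offsets("a<b>&nbsp;c", /&nbsp;|[\s<>;.]/i)
result.each { |piece| p piece }
```

To drop the empty pieces, reject the pairs whose text is empty after the call.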
Thanks for both solutions; it is very helpful to know this approach. If I use scan, I have to add logic to get the text between matches. The same effect can be achieved in a similar number of lines using String#index. Too bad String#split does not take a block.
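# Defined outside the method: assigning a constant inside a def is a syntax error
DELIMITERS = /(&nbsp;|[\s<>;.])/i

def html_split(str)
  data = []
  offset = 0
  i = str.index(DELIMITERS)
  while i
    if i > 0
      # Text before the delimiter
      data << [str[0...i], offset]
      offset += i
    end
    # "&nbsp;" is six characters long; every other delimiter is one
    delimiter = str[i] == '&' ? str[i, 6] : str[i]
    data << [delimiter, offset]
    offset += delimiter.size
    str = str[(i + delimiter.size)..-1]
    i = str.index(DELIMITERS)
  end
  # Keep whatever follows the last delimiter
  data << [str, offset] unless str.empty?
  data
end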
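A possible variant of the same String#index idea, sketched under a few assumptions: the method name html_split_by_index and the constant DELIMITER_PATTERN are made up here, and the delimiter set is taken from the question. It passes a start position to String#index instead of re-slicing the string, and reads the matched delimiter from Regexp.last_match, so the offsets are already relative to the original string.

```ruby
# Assumed delimiter set, modelled on the pattern in the question
DELIMITER_PATTERN = /&nbsp;|[\s<>;.]/i

# Hypothetical variant: returns [text, offset] pairs, skipping empty pieces.
def html_split_by_index(str)
  data = []
  pos = 0 # where the next search starts
  while (i = str.index(DELIMITER_PATTERN, pos))
    # String#index sets the last match, so the delimiter text is available here
    delimiter = Regexp.last_match[0]
    data << [str[pos...i], pos] if i > pos
    data << [delimiter, i]
    pos = i + delimiter.size
  end
  # Text after the last delimiter, if any
  data << [str[pos..-1], pos] if pos < str.size
  data
end

p html_split_by_index("x&nbsp;y;z")
```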