Text manipulation in Ruby

Developer https://www.devze.com 2022-12-21 07:28 Source: web
I'm trying to write a word counter for LyX files.

Life is almost simple, since most lines that need to be ignored begin with a \ (I'm prepared to assume that no textual lines begin with a backslash). However, some lines that look like real text aren't; they are enclosed by \begin_inset and \end_inset:

I'm genuine text.

\begin_inset something
I'm not real text
Perhaps there will be more than one line! Or none at all! Who knows.
\end_inset

\begin_layout
I also need to be counted, and thus not removed
\end_layout

Is there a quick way in Ruby to strip the (smallest amount of) text between two markers? I imagine regular expressions are the way forward, but I can't figure out what they'd have to be.

Thanks in advance


Is there a quick way in ruby to strip the (smallest amount of) text between two markers?

str = "lala BEGIN_MARKER \nlu\nlu\n END_MARKER foo BEGIN_MARKER bar END_MARKER baz"
str.gsub(/BEGIN_MARKER.*?END_MARKER/m, "")
#=> "lala  foo  baz"
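Applied to the markers from the question, the same lazy match could look roughly like the sketch below. The helper name lyx_word_count and the sample text are made up for illustration, and it assumes insets are never nested (a lazy match stops at the first \end_inset it finds) and that a "word" is anything separated by whitespace.

```ruby
# Hypothetical helper (not part of any library): counts words in LyX
# source held in a single string.
def lyx_word_count(text)
  # Remove every inset; the lazy .*? stops at the nearest \end_inset,
  # and /m lets "." match newlines so multi-line insets are covered.
  stripped = text.gsub(/\\begin_inset.*?\\end_inset/m, "")

  stripped.each_line
          .reject { |line| line.start_with?("\\") } # drop remaining markup lines
          .sum { |line| line.split.size }           # whitespace-separated words
end

sample = <<~'LYX'
  I'm genuine text.

  \begin_inset something
  I'm not real text
  Perhaps there will be more than one line! Or none at all! Who knows.
  \end_inset

  \begin_layout
  I also need to be counted, and thus not removed
  \end_layout
LYX

puts lyx_word_count(sample) # 3 words + 10 words = 13
```

The single-quoted heredoc keeps the backslashes literal, which avoids having to double them in the sample.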


gsub could be expensive for longer files if you're reading the whole file in as one string.

So if you have to process the file in chunks anyway, you might want to use a stateful parser that works line by line:

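in_block = false
File.foreach(fname) do |line|
  if in_block
    # Inside a block: skip lines until the end marker shows up.
    in_block = false if line =~ /END_MARKER/
    next
  elsif line =~ /BEGIN_MARKER/
    # A block starts here: skip this line and everything up to the end marker.
    in_block = true
    next
  end
  # Only lines outside a block reach this point.
  count_words(line)
end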
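For the LyX case, a self-contained version of this line-by-line idea might look like the sketch below. It is only an illustration: count_lyx_words is a made-up name, it assumes the markers always start their line, and it keeps a depth counter on the assumption that insets can be nested inside each other (which a single true/false flag would not handle).

```ruby
require "stringio"

# Hypothetical helper: counts words from any object that responds to
# each_line (a File, a StringIO, ...), reading one line at a time.
def count_lyx_words(io)
  depth = 0 # number of insets currently open (assumption: insets may nest)
  total = 0
  io.each_line do |line|
    if line.start_with?("\\begin_inset")
      depth += 1
    elsif line.start_with?("\\end_inset")
      depth -= 1 if depth > 0
    elsif depth.zero? && !line.start_with?("\\")
      # Plain text outside every inset: count whitespace-separated words.
      total += line.split.size
    end
  end
  total
end

nested = StringIO.new(<<~'LYX')
  Two words
  \begin_inset Outer
  hidden
  \begin_inset Inner
  also hidden
  \end_inset
  still hidden
  \end_inset
  three more words
LYX

puts count_lyx_words(nested) # 2 + 3 = 5
```

For a real file, something like File.open(path) { |f| count_lyx_words(f) } should work the same way, since only one line is held in memory at a time.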


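You should look at str.scan(). Assuming your text is in the variable s, something like this should work:

# gsub (not sub!) removes every inset and never returns nil; /m lets .*? span newlines
s_strip_inset = s.gsub(/\\begin_inset.*?\\end_inset/m, "")
# count runs of word characters and hyphens
word_count = s_strip_inset.scan(/[\w-]+/).size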
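One thing to watch with a scan-based count is what the pattern treats as a word. The small sketch below compares two patterns on a line from the question; the helper names are made up, and treating an apostrophe as part of a word is just one possible choice.

```ruby
# Hypothetical helpers comparing two definitions of "word".
def count_word_chars_only(text)
  # \w excludes the apostrophe, so "I'm" is split into "I" and "m".
  text.scan(/[\w-]+/).size
end

def count_with_apostrophes(text)
  # Allowing ' inside the character class keeps contractions whole.
  text.scan(/[\w'-]+/).size
end

line = "I'm genuine text."
puts count_word_chars_only(line)  # "I", "m", "genuine", "text" => 4
puts count_with_apostrophes(line) # "I'm", "genuine", "text"    => 3
```

Which count is "right" depends on what you want the word counter to report, so it is worth checking the pattern against a few real lines.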