Word count in Rails?

Developer https://www.devze.com 2022-12-17 10:12 Source: Internet

Say I have a blog model with Title and Body. How do I show the number of words in Body and the number of characters in Title? I want the output to be something like this:

Title: Lorem
Body: Lorem Lorem Lorem

This post has word count of 3.

"Lorem Lorem Lorem".scan(/\w+/).size
=> 3

UPDATE: if you need to match rock-and-roll as one word, you could do something like this:

"Lorem Lorem Lorem rock-and-roll".scan(/[\w-]+/).size
=> 4

Also:

"Lorem Lorem Lorem".split.size
=> 3
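
To tie this back to the question, here is a minimal sketch of how both counts could be exposed on the post. Post is a plain Struct standing in for the blog model, and word_count and title_length are hypothetical method names, not part of Rails; in a real app the same two methods could live on the model or in a view helper.

```ruby
# Hypothetical stand-in for the blog model; a real app would use an
# ActiveRecord model with title and body columns instead of a Struct.
Post = Struct.new(:title, :body) do
  # Number of whitespace-separated words in the body (0 if body is nil).
  def word_count
    body.to_s.split.size
  end

  # Number of characters in the title (0 if title is nil).
  def title_length
    title.to_s.length
  end
end

post = Post.new("Lorem", "Lorem Lorem Lorem")
puts "Title: #{post.title} (#{post.title_length} characters)"
puts "Body: #{post.body}"
puts "This post has word count of #{post.word_count}."
```

The to_s calls are only there so that a post with a missing title or body reports 0 instead of raising.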

If you're interested in performance, I wrote a quick benchmark:

require 'benchmark'
require 'bigdecimal/math'
require 'active_support/core_ext/string/filters'

# Where "shakespeare" is the full text of The Complete Works of William Shakespeare...

puts 'Benchmarking shakespeare.scan(/\w+/).size x50'
puts Benchmark.measure { 50.times { shakespeare.scan(/\w+/).size } }
puts 'Benchmarking shakespeare.squish.scan(/\w+/).size x50'
puts Benchmark.measure { 50.times { shakespeare.squish.scan(/\w+/).size } }
puts 'Benchmarking shakespeare.split.size x50'
puts Benchmark.measure { 50.times { shakespeare.split.size } }
puts 'Benchmarking shakespeare.squish.split.size x50'
puts Benchmark.measure { 50.times { shakespeare.squish.split.size } }

The results:

Benchmarking shakespeare.scan(/\w+/).size x50
 13.980000   0.240000  14.220000 ( 14.234612)
Benchmarking shakespeare.squish.scan(/\w+/).size x50
 40.850000   0.270000  41.120000 ( 41.109643)
Benchmarking shakespeare.split.size x50
  5.820000   0.210000   6.030000 (  6.028998)
Benchmarking shakespeare.squish.split.size x50
 31.000000   0.260000  31.260000 ( 31.268706)

In other words, squish is slow with Very Large Strings™. Other than that, split is faster than scan: roughly 2.4 times as fast without squish (6.0 s vs 14.2 s) and about 1.3 times as fast with it (31.3 s vs 41.1 s).
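
For anyone who wants to try this without the Shakespeare file or ActiveSupport, below is a scaled-down sketch of the same comparison. The sample text is a synthetic stand-in (an assumption, not the original data), so the timings will not match the numbers above. It also shows why squish is not needed for counting: split with no arguments already ignores leading, trailing and repeated whitespace, and scan(/\w+/) never looks at whitespace at all.

```ruby
require 'benchmark'

# Synthetic stand-in for the large text: 5 words per repetition, with a
# double space and a newline mixed in to mimic messy whitespace.
text = "Lorem ipsum  dolor\nsit amet " * 20_000

scan_count  = nil
split_count = nil

# Time 10 runs of each approach and keep the last word count.
scan_time  = Benchmark.realtime { 10.times { scan_count  = text.scan(/\w+/).size } }
split_time = Benchmark.realtime { 10.times { split_count = text.split.size } }

puts "scan:  #{scan_count} words in #{scan_time.round(3)}s"
puts "split: #{split_count} words in #{split_time.round(3)}s"
```

Both approaches report the same count on this input even though the whitespace was never squished.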

The answers here have a couple of issues:

They don't account for non-ASCII (Unicode) letters such as those with diacritics: áâãêü etc.
They don't account for apostrophes and hyphens. So Joe's will be counted as two words, Joe and s, which is obviously incorrect. The same goes for twenty-two, which is a single compound word.

Something like this works better and accounts for those issues:

foo.scan(/[\p{Alpha}\-']+/)
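
As a quick illustration of the difference, the sketch below runs both patterns over one sample string. WORD and count_words are hypothetical names introduced just for this example.

```ruby
# Words are runs of Unicode letters, hyphens and apostrophes.
WORD = /[\p{Alpha}\-']+/

# Hypothetical helper wrapping the scan above.
def count_words(text)
  text.scan(WORD).size
end

sample = "Joe's twenty-two caçapão"

p sample.scan(/\w+/)   # ASCII-only \w breaks at ', - and accented letters
p sample.scan(WORD)    # apostrophes, hyphens and accents stay inside words
puts count_words(sample)
```

Two caveats with this pattern: it ignores digits, since \p{Alpha} matches letters only, and a free-standing hyphen or apostrophe surrounded by spaces would be counted as a word.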

You might want to look at my Words Counted gem. It lets you count words, their occurrences, their lengths, and a couple of other things. It's also very well documented.

counter = WordsCounted::Counter.new(post.body)
counter.word_count #=> 3
counter.most_occuring_words #=> [["lorem", 3]]
# This also takes capitalisation into account.
# So `Hello` and `hello` are counted as the same word.

"Lorem Lorem Lorem".scan(/\S+/).size
=> 3

"caçapão adipisicing elit".scan(/[\w-]+/).size 
=> 5

But as we can see, the sentence has only 3 words. The problem is the accented characters: in Ruby, \w matches only [A-Za-z0-9_], so it doesn't treat them as word characters and the words get split at every accent.

An improved solution would be

I18n.transliterate("caçapão adipisicing elit").scan(/[\w-]+/).size
=> 3
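
If pulling in I18n is not an option, a rough approximation of the same idea can be sketched with Ruby's built-in String#unicode_normalize. This is a different technique from I18n.transliterate, not a drop-in replacement, and strip_accents is a hypothetical helper name.

```ruby
# Decompose accented letters (NFD) and drop the combining marks (\p{Mn}),
# so "ç" becomes "c" and "ã" becomes "a". Only characters that decompose
# into a base letter plus marks are affected.
def strip_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

sentence = "caçapão adipisicing elit"
puts strip_accents(sentence)
puts strip_accents(sentence).scan(/[\w-]+/).size
```

Letters without a decomposition, such as ø or ß, pass through unchanged and would still split words under \w, so this only covers the common diacritics.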
