Is there an efficient way to perform hundreds of text substitutions in Ruby?

I'm trying to use a list of hundreds of common misspellings to clean some input before searching for duplicates.

It's a time-critical process, so I'm hoping that there's a quicker way than having hundreds of regexes (or one with a hundred branches).

Is there an efficient way to perform hundreds of text substitutions in Ruby?


An alternative approach, if your input data is already separated into words, would simply be to build a hash table of {error => correction}.

Hash table lookup is fast, so if you can bend your input data to this format, it will almost certainly be fast enough.
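Here's a minimal sketch of that idea, assuming whitespace-separated input; the CORRECTIONS hash and the correct method are hypothetical names, and unknown words pass through unchanged:

CORRECTIONS = {
  'abandonned' => 'abandoned',
  'abotu'      => 'about',
}

def correct(text)
  # One hash lookup per word; fetch falls back to the word itself.
  text.split.map { |word| CORRECTIONS.fetch(word, word) }.join(' ')
end

correct('abotu an abandonned mine') # => "about an abandoned mine"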


I'm happy to say I just found "RegexpTrie", which is a usable replacement for the code below, removing the need for Perl's Regexp::Assemble.

Install it, and give it a try:

require 'regexp_trie'

foo = %w(miss misses missouri mississippi)

RegexpTrie.union(foo)
# => /miss(?:(?:es|ouri|issippi))?/

RegexpTrie.union(foo, option: Regexp::IGNORECASE)
# => /miss(?:(?:es|ouri|issippi))?/i
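The result is an ordinary Regexp, so (on Ruby 2.4+ for match?) you can test strings against it directly:

RegexpTrie.union(foo).match?('mississippi') # => true
RegexpTrie.union(foo).match?('mist')        # => false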

Here's a comparison of the outputs. The commented patterns inside the array are from Regexp::Assemble, and the output at the end is from RegexpTrie:

require 'regexp_trie'

[
  'how now brown cow',                           # /(?:[chn]ow|brown)/
  'the rain in spain stays mainly on the plain', # /(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)/
  'jackdaws love my giant sphinx of quartz',     # /(?:jackdaws|quartz|sphinx|giant|love|my|of)/
  'fu foo bar foobar',                           # /(?:f(?:oo(?:bar)?|u)|bar)/
  'ms miss misses missouri mississippi'          # /m(?:iss(?:(?:issipp|our)i|es)?|s)/
].each do |s|
  puts "%-43s # /%s/" % [s, RegexpTrie.union(s.split).source]
end

# >> how now brown cow                           # /(?:how|now|brown|cow)/
# >> the rain in spain stays mainly on the plain # /(?:the|rain|in|s(?:pain|tays)|mainly|on|plain)/
# >> jackdaws love my giant sphinx of quartz     # /(?:jackdaws|love|my|giant|sphinx|of|quartz)/
# >> fu foo bar foobar                           # /(?:f(?:oo(?:bar)?|u)|bar)/
# >> ms miss misses missouri mississippi         # /m(?:iss(?:(?:es|ouri|issippi))?|s)/

Regarding how to use the Wikipedia list of misspelled words:

require 'nokogiri'
require 'open-uri'
require 'regexp_trie'

URL = 'https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines'

doc = Nokogiri::HTML(URI.open(URL))
# Each line looks like "misspelling->correction1, correction2, ...";
# skip the header line, then build {misspelling => [corrections]}.
corrections = doc.at('div#mw-content-text pre').text.lines[1..-1].map { |s|
  a, b = s.chomp.split('->', 2)
  [a, b.split(/,\s+/)]
}.to_h
#  {"abandonned"=>["abandoned"],
#   "aberation"=>["aberration"],
#   "abilityes"=>["abilities"],
#   "abilties"=>["abilities"],
#   "abilty"=>["ability"],
#   "abondon"=>["abandon"],
#   "abbout"=>["about"],
#   "abotu"=>["about"],
#   "abouta"=>["about a"],
#   ...
#   }

misspelled_words_regex = /\b(?:#{RegexpTrie.union(corrections.keys, option: Regexp::IGNORECASE).source})\b/i
# => /\b(?:(?:a(?:b(?:andonned|eration|il(?:ityes|t(?:ies|y))|o(?:ndon(?:(?:ed|ing|s))?|tu|ut(?:it|the|a)...

At this point you could use gsub(misspelled_words_regex, corrections); however, the values in corrections are arrays, because multiple words or phrases could have been used to replace a given misspelled word. You'll have to do something to determine which of the choices to use.
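One simple policy, as a sketch, is to always take the first suggestion (input here stands for whatever text you're cleaning):

cleaned = input.gsub(misspelled_words_regex) { |word|
  # The hash keys are lowercase and the regex is case-insensitive,
  # so normalize before looking up. Taking the first suggestion is
  # a naive policy; smarter selection is up to you.
  corrections[word.downcase].first
}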


Ruby is missing a very useful module found in Perl, called Regexp::Assemble. Python has hachoir-regex, which appears to do the same sort of thing.

Regexp::Assemble creates a very efficient regular expression, based on lists of words and simple expressions. It's really remarkable ... or ... diabolical?

Check out the example for the module; it's extremely simple to use in its basic form:

use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+-' );
$ra->add( 'a\w\d+' );
$ra->add( 'a\d+' );
print $ra->re; # prints a(?:\w?\d+|b+[-c])

Notice how it combines the patterns. It does the same with regular words, only even more efficiently, because common substrings get merged:

use Regexp::Assemble;

my $lorem = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.';

my $ra = Regexp::Assemble->new('flags' => 'i');

$lorem =~ s/[^a-zA-Z ]+//g;

$ra->add(split(' ', lc($lorem)));
print $ra->anchor_word(1)->as_string, "\n";

Which outputs:

\b(?:a(?:dipisicing|liqua|met)|(?:consectetu|tempo)r|do(?:lor(?:emagna)?)?|e(?:(?:li)?t|iusmod)|i(?:ncididunt|psum)|l(?:abore|orem)|s(?:ed|it)|ut)\b

This code ignores case and honors word boundaries.

I'd recommend writing a little Perl app that can take a list of words and use that module to output the stringified version of the regex pattern. You should be able to import that pattern into Ruby. That would let you very quickly find misspelled words. You could even have it output the pattern to a YAML file, then load that file into your Ruby code. Periodically parse the misspelled word pages, run the output through the Perl code, and your Ruby code would have an updating pattern.
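On the Ruby side, importing the exported pattern is a one-liner. This sketch assumes the Perl script's output was captured in a plain-text file, pattern.txt (a hypothetical name):

# Rebuild the assembled pattern from its stringified source.
misspelled_regex = Regexp.new(File.read('pattern.txt').chomp, Regexp::IGNORECASE)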

You could use that pattern against a chunk of text just to see whether it contains any misspelled words. If it does, break the text down into sentences or words and test again. Don't test word-by-word from the start, because most words will be spelled correctly. It's almost like a binary search against your text: test the whole thing, and if there's a hit, break it into smaller blocks to narrow the search until you've found the individual misspellings. How you break down the chunks depends on the amount of incoming text. A regex can test an entire text block and return nil or an index, and it can test individual words the same way, so you gain a lot of speed by checking big chunks of the text first.

Then, if you know you have a misspelled word, you can do a hash lookup for the correct spelling. It would be a big hash, but sifting the good spellings from the bad is what will take the longest; the lookup itself would be extremely fast.
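Here's a rough sketch of that narrowing strategy; find_misspellings is a hypothetical name, and misspelled_regex is the imported pattern from above:

def find_misspellings(text, misspelled_regex)
  return [] unless text =~ misspelled_regex          # one cheap test of the whole block
  text.split(/(?<=[.!?])\s+/).flat_map do |sentence|
    next [] unless sentence =~ misspelled_regex      # skip clean sentences entirely
    sentence.split.grep(misspelled_regex)            # isolate the offending words
  end
end

Each word it returns can then be corrected with the hash lookup described above.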


Here's some example code:

get_words.rb

#!/usr/bin/env ruby

require 'open-uri'
require 'nokogiri'
require 'yaml'

words = {}
# Wikipedia splits the misspellings list across pages named 0-9 and A-Z.
['0-9', *('A'..'Z').to_a].each do |l|
  begin
    print "Reading #{l}... "
    html = URI.open("http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/#{l}").read
    puts 'ok'
  rescue StandardError => e
    puts "got \"#{e}\""
    next
  end

  doc = Nokogiri::HTML(html)
  # Each list item looks like "misspelling (correction)".
  doc.search('div#bodyContent > ul > li').each do |n|
    n.content =~ /^(\w+) \s+ \(([^)]+)/x
    words[$1] = $2 if $1 && $2
  end
end

File.open('wordlist.yaml', 'w') do |wordfile|
  wordfile.puts words.to_yaml
end

regex_assemble.pl

#!/usr/bin/env perl

use Regexp::Assemble;
use YAML;

use warnings;
use strict;

my $ra = Regexp::Assemble->new('flags' => 'i');

my %words = %{YAML::LoadFile('wordlist.yaml')};
$ra->add(map{ lc($_) } keys(%words));

print $ra->chomp(1)->anchor_word(1)->as_string, "\n";

Run the first, then run the second, piping its output to a file to capture the emitted regex.


And more examples of words and the generated output:

'how now brown cow' => /\b(?:[chn]ow|brown)\b/
'the rain in spain stays mainly on the plain' => /\b(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)\b/
'jackdaws love my giant sphinx of quartz' => /\b(?:jackdaws|quartz|sphinx|giant|love|my|of)\b/
'fu foo bar foobar' => /\b(?:f(?:oo(?:bar)?|u)|bar)\b/
'ms miss misses missouri mississippi' => /\bm(?:iss(?:(?:issipp|our)i|es)?|s)\b/

Ruby's Regexp.union is nowhere close to the sophistication of Regexp::Assemble. The captured list of misspelled words contains 4,225 words totaling 41,817 characters. Running Perl's Regexp::Assemble against that list generated a 30,954-character regex. I'd say that's efficient.
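For comparison, Regexp.union does no factoring at all; it simply joins the escaped alternatives:

Regexp.union(%w(ms miss misses missouri mississippi))
# => /ms|miss|misses|missouri|mississippi/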


Try it the other way around. Rather than correcting misspellings and checking for duplicates on the result, drop everything to a soundalike format (like Metaphone or Soundex) and check for duplicates in that format.

Now, I don't know which way is likely to be faster - on the one hand, you've got hundreds of regexes, each of which will fail to match almost instantly and return. On the other, you've got 30-odd potential regex replacements, one or two of which will definitely match for every word.

Now, Metaphone is pretty fast - there's really not a lot to the algorithm - so I can only suggest that you try it out and measure whether either approach is fast enough for your use.
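For example, with the Metaphone implementation from the text gem (gem install text); the entries array is made-up sample data:

require 'text'

entries = ['Mississippi', 'Missisippi', 'Missouri']
entries.group_by { |e| Text::Metaphone.metaphone(e.downcase) }
# Strings that sound alike share a key, so any group with more than one
# member is a likely duplicate.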
