开发者

Extract individual existing words in domain names

开发者 https://www.devze.com 2023-01-27 09:48 出处:网络
I\'m looking 开发者_JS百科for a Ruby gem (preferably) that will cut domain names up into their words.

I'm looking 开发者_JS百科for a Ruby gem (preferably) that will cut domain names up into their words.

whatwomenwant.com => 3 words, "what", "women", "want".

If it can ignore things like numbers and gibberish then great.


You'll need a word list such as those produced by Project Gutenberg or available in the source for ispell &c. Then you can use the following code to decompose a domain into words:

WORD_LIST = [
  'experts',
  'expert',
  'exchange',
  'sex',
  'change',
]

def words_that_phrase_begins_with(phrase)
  WORD_LIST.find_all do |word|
    phrase.start_with?(word)
  end
end

def phrase_to_words(phrase, words = [], word_list = [])
  if phrase.empty?
    word_list << words
  else
    words_that_phrase_begins_with(phrase).each do |word|
      remainder = phrase[word.size..-1]
      phrase_to_words(remainder, words + [word], word_list)
    end
  end
  word_list
end

p phrase_to_words('expertsexchange')
# => [["experts", "exchange"], ["expert", "sex", "change"]]

If given a phrase that has any unrecognized words, it returns an empty array:

p phrase_to_words('expertsfoo')
# => []

If the word list is long, this will be slow. You can make this algorithm faster by preprocessing the word list into a tree. The preprocessing itself will take time, so whether it's worth it will depend upon how many domains you want to test.

Here's some code to turn the word list into a tree:

def add_word_to_tree(tree, word)
  first_letter = word[0..0].to_sym
  remainder = word[1..-1]
  tree[first_letter] ||= {}
  if remainder.empty?
    tree[first_letter][:word] = true
  else
    add_word_to_tree(tree[first_letter], remainder)
  end
end

def make_word_tree
  root = {}
  WORD_LIST.each do |word|
    add_word_to_tree(root, word)
  end
  root
end

def word_tree
  @word_tree ||= make_word_tree
end

This produces a tree that looks like this:

{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :s=>{:e=>{:x=>{:word=>true}}}, :e=>{:x=>{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :p=>{:e=>{:r=>{:t=>{:word=>true, :s=>{:word=>true}}}}}}}}

It looks like Lisp, doesn't it? Each node in the tree is a hash. Each hash key is either a letter, with the value being another node, or it is the symbol :word with the value being true. Nodes with :word are words.

Modifying words_that_phrase_begins_with to use the new tree structure will make it faster:

def words_that_phrase_begins_with(phrase)
  node = word_tree
  words = []
  phrase.each_char.with_index do |c, i|
    node = node[c.to_sym]
    break if node.nil?
    words << phrase[0..i] if node[:word]
  end
  words
end


I don't know gems for this, but if I had to solve this problem, I would download some english words dictionary and read about text searching algorythms.

When you have more than one variant to divide letters (like in sepp2k's expertsexchange), than you can have two hints:

  1. Your dictionary is sorted by... for example, popularity of a word. So dividings with most popular words will be more valuable.
  2. You can go to the main page of site with domain you are anazyling and just read the content, searching your words. I don't think that you'll find sex on a page for some experts. But... hm... experts can be so different ,.)


Update


I've been working with this challenge and came up with the following code. Please refactor if I'm doing something wrong :-)

Benchmark:


Runtime: 11 sec.
f- file: 13.000 lines of domain names
w- file: 2000 words (to check against)

Code:


f           = File.open('resource/domainlist.txt', 'r')
lines       = f.readlines
w           = File.open('resource/commonwords.txt', 'r')
words       = w.readlines

results  = {}

lines.each do |line|
  # Start with words from 2 letters on, so ignoring 1 letter words like 'a'
  word_size = 2
  # Only get the .com domains
  if line =~ /^.*,[a-z]+\.com.*$/i then
    # Strip the .com off the domain
    line.gsub!(/^.*,([a-z]+)\.com.*$/i, '\\1')
    # If the domain name is between 3 and 12 characters
    if line.size > 3 and line.size < 15 then
      # For the length of the string run ...
      line.size.times do |n|
        # Set the counter
        i = 0
        # As long as we're within the length of the string
        while i <= line.size - word_size do
          # Get the word in proper DRY fashion
          word = line[i,word_size]
          # Check the word against our list
          if words.include?(word) 
            results[line] = [] unless results[line]
            # Add all the found words to the hash
            results[line] << word
          end
          i += 1
        end
        word_size += 1
      end
    end
  end
end
p results
0

精彩评论

暂无评论...
验证码 换一张
取 消