How would you write a program to find the shortest pangram in a list of words?_问答_开发者

Given a list of words which contains the letters a-z at least once, how would you write a program to find the shortest pangram counted by number o开发者_如何转开发f characters (not counting spaces) as a combination of the words?

Since I am not sure whether short answers exist, this is not code golf, but rather just a discussion of how you would approach this. However, if you think you can manage to write a short program that would do this, then go ahead, and this might turn into code golf :)

I would approach this by proving that the problem is NP-hard, and by checking heuristics for the NP-hard problems that look similar.

We can reduce a Set Cover problem to our one. Set Cover is different in that not a number of letters used is minimized, but a number of words used is minimized instead. Assume we want to solve Set Cover problem, given N words, each of length less than M. Let's build another set of words by cloning the given set, but concatenating to each of them N*M non-english letters, say, Ж. If we could build a pangram (over a,b,c...x,y,z,ж alphabet) that requires minimum symbols, that would be a pangram with minimum words, if we remove all Ж letters.

This proves that the original problem is NP-hard, but, unfortunately we need a reduction to some NP-hard problem to reuse its (hopefully already known) heuristic. Set-Cover has a greedy heuristic with logarithmic approximation, but I don't think it applies to the original problem (nature of the Set-Cover problem requires taking letter-rich, long words; it's not the way to solve our problem).

So I'd search a list of related NP-hard problems, and check if there's something of interest. That's how I'd approach this one.

This is an variant of the set cover problem (a.k.a. hitting set problem):

As input you are given several sets. They may have some elements in common. You must select a minimum number of these sets so that the sets you have picked contain all the elements that are contained in any of the sets in the input. It was [...] shown to be NP-complete in 1972[,] and the optimization version of set cover is NP-hard.

It is a variant because we're looking for the minimum number of letters, not the minimum number of words. But I'd think it's still NP-hard, which means that you will not be able to do much better than brute force.

Here's an O(n) algorithm for a different problem for when you have a string instead of a list of words as input.. It was my oversight, but will leave the solution here cause I don't feel like deleting it :)

Since we are only interested in characters, it makes the problem a lot easier. Maintain a map of each character [a-z] to its position in the string. This map alone is sufficient do determine if we have a pangram and what's its length.

1. Initialize a map of all alphabets to null
2. Initialize shortest_pangram to { length: ∞, value: undefined }
3. Loop through each "character" in given string
  3.1 Update the value of map[character] to current string index
  3.2 If we have a pangram, and its the shortest so far, record its length/value
4. shortest_pangram should have our result

The map we created is enough to determine if we have a pangram - if all values in our map are non null, we have a pangram.

To find the length of the current pangram, subtract the max value from the min value in our map. Remember that before finding the length, we must check if it is a pangram.

Here's a naive non-optimized implementation in Ruby:

class Pangram
  def initialize(string)
    @input = string.downcase.split('')
    @map = {}
    ('a'..'z').each { |c| @map[c] = nil }
    infinity = 1.0/0.0
    @best = { :length => infinity, :string => nil }
  end

  def shortest
    @input.each_with_index do |c, index|
      @map[c] = index if @map.key?(c)
      if pangram? and length < @best[:length]
        @best[:length] = length
        @best[:string] = value
      end
    end
    @best
  end

  def pangram?
    @map.values.all? { |value| !value.nil? }
  end

  def length
    @map.values.max - @map.values.min
  end

  def value
    @input[@map.values.min..@map.values.max].join('')
  end
end

To use, instantiate the class and pass it the entire string. Call .shortest to find the length of the shortest pangram and the matching substring.

pangram = Pangram.new("..")
print pangram.shortest

This is an old question, so probably you've found some heuristics you already like. I came across this question while exploring ways to generate perfect pangrams, which will be the fewest number of characters (since they are only allowed to use each letter in the alphabet once). Anyway, for future finders like myself:

I wrote a program which has some success. I treated this problem more like graph search than set cover and used A* as a starting point for the algorithm. You can explore the code on github.

The things that helped the most were:

Compress the State Space

I took a dictionary and transformed all the words into their sorted letter set. For example, this way "BAD" and "DAB" are both stored as "ABD". The compressed dictionary I used took ~250,000 words down to ~31,000 unique letter combos which is a massive win.

Heuristics

As mentioned other places, this is NP hard so I started using heuristics. The three I'm currently using are:

Vowel Ratio

When I examine the letters remaining after picking a word, I compute #vowels / #unusedLetters. The motivation for this is pretty simple - having more vowels remaining makes it more likely that I'll be able select words using those letters.

Letter Commonality

When I read in the initial word set, I create a dictionary for each letter in the alphabet and count the number of times each letter appears across all the words. I used this dictionary to prefer nodes where the remaining letters had more common letters. (I believe OP mentioned this one in one of the comments)

Shared 3-Letter Combos

This is similar to the letter commonality heuristic. Again, when processing the initial word set, I created a dictionary which contains all 3-letter combinations which can be made with that word. So for example the letter-set ABC has only one valid combo, but ABCD has [ABC, ABD, BCD]. Remember, I only care about sorted letter-sets after having compressed the initial wordset.

So in the end, must like the letter commonality measure, I have a dictionary mapping all 26 choose 3 possible letter sets mapped to the number of times those combos appear across my wordset. Then I use this to prefer searching nodes where the remaining letters have more valid 3-letter combos.