Writing an algorithm for scrabble_问答_开发者_运维开发者技术经验分享

I'm working on a crossword-like problem, but I don't know how to design the algorithm.

For example:

there are words like 'car', 'apple' in the dictionary开发者_JAVA技巧.
the word 'app' is given on the board.
there are letters like 'l' 'e' 'c' 'r'....for making words.

So the algorithm's task is to make correct words which are stored in dictionary.

app -> lapp -> leapp -> lecapp -> .... -> lappe -> eappc -> ... -> appl -> apple (correct answer)

What is the best solution for this algorithm?

You might be interested in Googling the research paper "The World's Fastest Scrabble Program" by Appel and Jacobson (1988). The algorithms are outlined in pseudocode, so it takes a bit of work to mold that into useable form and glue it all together; however, the program the authors outline works great.

Store your dictionary as a tree, for example:

          *
          |
     +----+----+
     |         |
     A*        B
     |         |
  +--+--+      E*
  |     |      |
  P     S    +-+-+
  |     |    |   |
  P*    K*   A   E*
  |          |
+-+-+      +-+-+
|   |      |   |
E   L      D*  N*
|   |
A   E*
|
L*

Thanks paxdiablo for making my tree more readable.

This tree has the words a, app, appeal, apple, ask, bead, bean, be, and bee. The nodes marked with an asterisk indicate "If I were to stop here, this would be a valid word" such as the 'e' below 'b' for 'be'.

When you find a letter you don't know, use a wildcard (i.e., pick all children and recurse down all paths).

You said crossword, but then your "letters...for making words" seemed to indicate Scrabble. This would work for either. Not the fastest, but plenty fast.

Thanks Andreas for reminding us that this is called a trie.

If you wanted to say "the second letter is a P" you would start from the root node and take every branch (which would be every letter in the alphabet, assuming it's a proper dictionary) and then the "P" branch and then go on from there.

I've actually written a crossword program before (cryptic but the theory behind the construction is identical).

I had a database of words and their clues which could be sorted by times used (so that I wouldn't tend to get duplicate crosswords on subsequent runs).

The first thing you should do is design your patterns (blacks where you can't put letters and whites where you can). Trying to fit words into a grid while creating the pattern on the fly is very time consuming and prone to errors. If you look at most crosswords, they tend to follow certain rules to make it easier. Things like being symmetrical around one of the diagonals and disallowing a square of four white cells (to ease the task of selecting suitable words).

Once you have the pattern, then you start finding words to place in it. That way, you would know that "app" was the start of the word and be able to limit your searches to those starting with "app", not every word that has "app" in it. Similarly for words where you have known letters at any positions. It's a lot easier to locate words with letters at known positions than to evaluate those letters at any starting position within a word.

Mine ended up being written in shell script (believe it or not) and using the dictionary that came from Linux as a word search tool. If you know you have a 5-letter word starting with "app", it's quite easy to use:

grep '^app..$' words.txt

to get a list of all valid possibilities.

And, as each word was found, it was copied to a clues.txt file that contained both the word and multiple possible clues. The actual format was use {count, word, clue} where the same word may exist on multiple lines with a different clue - this allowed piping of the grep through sort so that lesser used words/clues floated to the top (whenever a word/clue was used, its count was incremented, making it less likely to be used next time).

Once that file was a decent size, the program would use it first to locate words and, only if one wasn't found would it revert to the words file (sans clues) where manual intervention would be required.

It actually ended up being quite good at doing the job. It wasn't blindingly fast but I didn't need to generate one every three seconds - this was for a community newsletter sent out once a week.

Now that you've changed the question to a Scrabble variant, that's actually much harder.

You need to take into account the letters you have, the letters on the board and the fact that there's a lot more places that you have to evaluate. This makes a brute force method much harder.

What I would do as an initial cut would be to select possibilities (starting position and direction on the board) chosen randomly, then use the same algorithm as for the crossword variant above to locate all words that can fit there. Then, if you have the letters to satisfy that word, store it (along with its score) in a list.

Keep in mind that you need to watch out for interfering with other words on the board.

I would continue to examine possibilities until one of:

your list is big enough to choose from.
you run out of time.
you've examined enough possibilities to satisfy your competency level.

That last one's important - if you're playing a beginner, you don't want to exhaustively examine millions of possibilities.

Then, choose the best move from your list (or maybe not the best if playing at beginner level - it all depends on how good you want the computer to be).

Steven A. Gordon wrote an interesting paper on how to search for possible Scrabble(tm I guess) moves (see Gordon's paper on GADDAG). While there is a big gap between searching for moves and winning at Scrabble - as the paper mentions - this isn't relevant for the original question.

In case you'd find it most useful just to go straight to reading some code, there is a good open-source player, Quackle.

Most of the Scrabble papers talk about searching the whole board for the best word to play. But to solve your problem, as stated, there is a pretty simple algorithm.

First, you know that the word you want contains 'app', and you know that the largest word you can make is seven letters long (3 letters already on the board, and 4 in your tray). So, search your database with a sql statement such as:

Select word from dictionary where word LIKE '%app%' and len(word) <= 7

Next, put all seven letters into an array {l,e,c,r,a,p,p}

Read each word from the database,one at a time. Then look at each character of the dictionary word and see if it exists in the array. If the first letter of the dictionary word is found in the array, then delete that element in the array and move on to the next dictionary letter.

If any dictionary word letter is not found in the array, then the word does not qualify, so, move on to the next word.

If you have looked at all the letters in the dictionary and all have been found in the array, then the word qualifies, and so you write it to the list.

Note, the reason you put your tiles into an array is so that once you match a letter from the dictionary word to a tile in your array, you'll need to remove that letter from further consideration, by deleting that element in the array.

So, for example, the dictionary database returns the word 'appeal'. The first four letters are found in your array, and those elements are deleted, leaving only {l,c,r} left in the array. When you look for the fifth letter 'a' you won't find it, so the word is disqualified.

The word 'apple' will qualify, leaving {c,r} left in your array.

It's pretty easy to code this in any language. However, it is not the fastest way to do it. I'm looking for a faster way myself!

If you're trying to create an index of words such that you could attempt to "solve" (or create) crosswords then I guess you'd start with a dictionary of words indexed by length. Then you'd create another dictionary of dictionaries of dictionaries... the first index is by total word length while the second one is by length, then by letter position, and finally by letter (six letter words with a second letter of "i" for example).

After you've built this index you can then express each step in trying to set or solve a puzzle in terms of set operations performed on these. (For example a 8 letter word starting with "w" and ending with "k" would be the intersection of all 8 letter words starting with "w" and all those ending with "k" --- which, unsurprisingly would include "homework"). Having built the indexed data structure I described, of course, allow for much more efficient searches for possible matches then would be possible by performing linear scans of just the global word list or even linear scans of the length separated lists).

Once you have this basic data structure then the rest of the program would, presumably, be a tree generation and traversal (with backtracking of course). Create a program that generates every possibility (using the data structure described) and a whenever it "gets stuck" have it backtrack until it finds new possibilities.

As implied by paxdiablo, you'll have to include a large pool of "words" for the generator to have a reasonable chance of creating a completed "solution." Anyone who is experienced with crosswords realizes that they allow the setter to take quite a few liberties (such as frequent use of compass points, archaic terms and poetic contracts) in order to get themselves o'er the hump as it were.

I haven't personally written a crossword generator. I have written cryptogram solvers which used a similar, though much simpler indexing structure. (To find every word which zyzxw could be in a cryptogram you "abstract" it into a pattern: abacd. Your dictionary contains every word indexed by its abstraction, and you can easily find that "every" matches "zyzxw"). In that case linear searches through the lists started at each abstraction is reasonably quick even when you're correlating to find out that "uvz" with "zyzxw" could, indeed, be "the" ... for example). I've also written a simple "Jotto" game which doesn't benefit from indexing at all --- linear scanning through the few thousand 5 or 6 letter words at each elimination step used to take far less than a second on my old 6 Mhz XT in the pre-history of modern PC computing).

Look for the PhD thesis paper called "Towards Perfect Play of Scrabble" by Brian Sheppard (author of Maven). It's informative and a very interesting. But also very long.

If I understood the question correctly (you start w hint letters, a substring of word, and you try to rearrange the letters to get a correct word) here is another solution:

You can start by going backwards. You already have the words in the dictionary and need to show part of the word (a substring) and a list of letters from the word so the people can arrange them. Given all this, you start with the word from the dictionary and create a graph of words of 1 edit distance away.

Example

Start with apple and keep removing a letter. Here is a small graph (for which I didn't draw all edges to reduce clutter):

apple -> appe -> ape -> ...
\ \
\_-> appl -> app -> ...

As you remove the letter, you place it in a hints list.

hints: l, p

hints: l, e

As the player uses letters from the list to form the original word, you accept only correct entries which are nodes which lead to a previous parent. You just traverse the graph backwards to find the original word.

Example

If the word is app and hints: l, p

If the user gives you l : appl you move to a prev node of app which is appl.

If the user gives you e : appe you move to a prev node of app which is appe in this case.

Any other letter which the user inputs, you disallow by remaining at current node.

The highest scoring move on a given turn, is not necessarily the winning move. Sometimes the best move involves blocking the opponents moves. Depending on if it's a hidden letters in the bag or not hidden, then it changes the strategy.

The opponents tray can be computed easily if the bag contents are known. Then the best move is the one such that you get the most net points relative to your opponents next move.

Now suppose the opponents tray cannot be deduced because the bag is hidden. Nonetheless the bag and opponents tray letters combined are known. So one can statistically determine the most probably letters in the opponents tray. Then the analysis is very costly, and requires scanning a huge space for points and probability of each so the best move can be determined.

Scrabble has randomness and not total information in some forms, making best moves a statistics problem. This is unlike in chess where no randomness or hidden information exist, and the theoretical best move is solely based on deductive reasoning, even if computers are nowhere near powerful enough to exactly solve it.

What you are looking for is the ability for your anagram solver to find "wildcard" letters to see what over words it can make with additional letters. I have a anagram solver I wrote which does this exact thing. One important thing I discovered to make this, and also for the speed of the solver is to predefine how many letter and the score of each word in your table of words.

For Instance You table should be structured like this

word | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z | score
-------------------------------------------------------------------------------------------------------------
test | 0 | 0 | 0 | 0 | 1 | 0 | 0 | h | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 4

As you can see there is a separate column for the word, letter and how many of that letter they contain and the score of that word. I created this ahead of time with a separate script that simply ran for each word and filled it in for me until it was done.

Here is the script I wrote to calculate how many letters were in each word as well as the score and updated each record. You have to start with a table that just has the words in it before you can run this script. Once you run it then you are done and do not have to run it again unless you add new words.

<?
include('/includes/connect.php');
$sql = "SELECT * FROM SOWPODS WHERE word LIKE 'z%' ORDER BY word ASC";
$result = mysql_query($sql);
while($row = mysql_fetch_array($result)) {
$string = $row['word'];
$rowwordid = $row['ID'];
echo $thisword = strtoupper($row['word']);
echo " - ";
for ($ii = 0; $ii < strlen($string); ++$ii) {
    $thisletter = strtolower($string{$ii});
    if ($thisletter == 'a') {
        $a = $a+1;
    } elseif ($thisletter == 'b') {
        $b = $b+1;
    } elseif ($thisletter == 'c') {
        $c = $c+1;
    } elseif ($thisletter == 'd') {
        $d = $d+1;
    } elseif ($thisletter == 'e') {
        $e = $e+1;
    } elseif ($thisletter == 'f') {
        $f = $f+1;
    } elseif ($thisletter == 'g') {
        $g = $g+1;
    } elseif ($thisletter == 'h') {
        $h = $h+1;
    } elseif ($thisletter == 'i') {
        $i = $i+1;
    } elseif ($thisletter == 'j') {
        $j = $j+1;
    } elseif ($thisletter == 'k') {
        $k = $k+1;
    } elseif ($thisletter == 'l') {
        $l = $l+1;
    } elseif ($thisletter == 'm') {
        $m = $m+1;
    } elseif ($thisletter == 'n') {
        $n = $n+1;
    } elseif ($thisletter == 'o') {
        $o = $o+1;
    } elseif ($thisletter == 'p') {
        $p = $p+1;
    } elseif ($thisletter == 'q') {
        $q = $q+1;
    } elseif ($thisletter == 'r') {
        $r = $r+1;
    } elseif ($thisletter == 's') {
        $s = $s+1;
    } elseif ($thisletter == 't') {
        $t = $t+1;
    } elseif ($thisletter == 'u') {
        $u = $u+1;
    } elseif ($thisletter == 'v') {
        $v = $v+1;
    } elseif ($thisletter == 'w') {
        $w = $w+1;
    } elseif ($thisletter == 'x') {
        $x = $x+1;
    } elseif ($thisletter == 'y') {
        $y = $y+1;
    } elseif ($thisletter == 'z') {
        $z = $z+1;
    }
}
$scorea = $a*1;
$scoreb = $b*4;
$scorec = $c*4;
$scored = $d*2;
$scoree = $e*1;
$scoref = $f*4;
$scoreg = $g*3;
$scoreh = $h*3;
$scorei = $i*1;
$scorej = $j*10;
$scorek = $k*5;
$scorel = $l*2;
$scorem = $m*4;
$scoren = $n*2;
$scoreo = $o*1;
$scorep = $p*4;
$scoreq = $q*10;
$scorer = $r*1;
$scores = $s*1;
$scoret = $t*1;
$scoreu = $u*2;
$scorev = $v*5;
$scorew = $w*4;
$scorex = $x*8;
$scorey = $y*3;
$scorez = $z*10;

$totalscore = $scorea + $scoreb + $scorec + $scored + $scoree + $scoref + $scoreg +     $scoreh + $scorei + $scorej + $scorek + $scorel + $scorem + $scoren + $scoreo + $scorep +      $scoreq + $scorer + $scores + $scoret + $scoreu + $scorev + $scorew + $scorex + $scorey + $scorez;
$SQL_update_count = "UPDATE TWL06 SET a = '$a', b = '$b', c = '$c', d = '$d', e = '$e', f = '$f', g = '$g', h = '$h', i = '$i', j = '$j', k = '$k', l = '$l', m = '$m', n= '$n', o = '$o', p = '$p', q = '$q', r = '$r', s = '$s', t = '$t', u = '$u', v = '$v', w = '$w', x = '$x', y = '$y', z = '$z', score = '$totalscore' WHERE ID = '$rowwordid'";
echo "<br>";
$result_update_count = mysql_query($SQL_update_count);

$a = 0;
$b = 0;
$c = 0;
$d = 0;
$e = 0;
$f = 0;
$g = 0;
$h = 0;
$i = 0;
$j = 0;
$k = 0;
$l = 0;
$m = 0;
$n = 0;
$o = 0;
$p = 0;
$q = 0;
$r = 0;
$s = 0;
$t = 0;
$u = 0;
$v = 0;
$w = 0;
$x = 0;
$y = 0;
$z = 0;
 }
?>

Once that is done then all you have to do is make a script that counts the letters in the columns and matches it with the letters you give it. You are going to have to explode the letters first and find out how many of each letter you have. Then run a sql statement that finds those amounts of letter or less.