I built a word generator: it picks a length and then randomly picks letters of the alphabet to make up words.
The program works, but 99% of the output is rubbish because it does not observe the constructs of the English language; I am getting as many words with x and z in them as I do with e.
What are my options for biasing the RNG so it will use common letters more often?
I am using rand() from the standard library, seeded with the time.
The output will still be rubbish, because biasing the random number generator is not enough to construct proper English words. But one approach to biasing the RNG is:
- Make a histogram of the occurrences of letters in a large English text (the corpus). You'll get something like 500 'e', 3 'x', 1 'q', 450 'a', 200 'b', and so on.
- Divide an interval into ranges, where each letter gets a slice whose length is that letter's number of occurrences. 'a' gets [0, 450), 'b' gets [450, 650), ..., 'q' gets [3500, 3501).
- Generate a random number between 0 and the total length of the interval and check where it lands. Any number in [450, 650) gives you a 'b', but only 3500 gives you a 'q'.
One method would be to use the letter frequencies. For each letter, define a range: a = [0, 2) (if 'a' has a 2% chance of being used), b = [2, 5) (a 3% chance), and so forth. Then generate a random number between 0 and 100 and choose the letter whose range it falls in.
Another method is to use a nondeterministic finite automaton where you define certain transitions (you could parse the Bible and build your probabilities from it). You end up with a lot of transitions like: the transition from 'a' to 'b' is 5%. Then you walk through the automaton and generate some words.
I just saw that the proper term is Markov chain, which is probably a better fit than an NFA.
You can do an n-gram analysis of some body of text and use that as the base for the bias. You can do this either by letters or by syllables; doing the analysis by syllables is probably more complicated.
Doing it by letters is easy. You iterate through each character in the source text and keep track of the last n-1 characters you came across. Then, for each next character, you add the last n-1 characters plus the new one (an n-gram) to your table of frequencies.
What does this table of frequencies look like? You could use a map from n-grams to their frequencies, but that representation is not very good for the algorithm suggested below. For that, it is better to map each (n-1)-gram to a map from the last letter of the n-gram to its frequency. Something like: std::map<std::string, std::map<char, int>>
Having made the analysis and collected the statistics, the algorithm goes like this:
- pick a random starting n-gram (your analysis may contain weighted data for which letters usually start words);
- from all the n-grams that start with the previous n-1 letters, pick a random last letter (considering the weights from the analysis);
- repeat until you reach the end of a word (either using a predefined length or data about word-ending frequencies).
To pick random values from a set of values with different weights, you can start by setting up a table of the cumulative frequencies. Then you pick a random number less than the sum of the frequencies and see which interval it falls in.
For example:
- A happens 10 times;
- B happens 7 times;
- C happens 9 times;
You build the following cumulative table: { A: 10, B: 17, C: 26 }. Then you pick a number between 0 and 25. If it is less than 10, it's A; if it's at least 10 but less than 17, it's B; otherwise it's C.
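The same idea can be written with a binary search over the cumulative table; `pickIndex` is a hypothetical name, and the weights are the A/B/C example above:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Returns an index into `weights`, chosen with probability
// proportional to the weight at that index.
int pickIndex(const std::vector<int>& weights)
{
    // Cumulative table: {10, 7, 9} becomes {10, 17, 26}.
    std::vector<int> cumulative;
    int running = 0;
    for (int w : weights) cumulative.push_back(running += w);

    int r = std::rand() % cumulative.back();      // r in [0, 26)
    // upper_bound finds the first cumulative value greater than r:
    // r in [0,10) -> index 0 (A), [10,17) -> 1 (B), [17,26) -> 2 (C).
    return static_cast<int>(
        std::upper_bound(cumulative.begin(), cumulative.end(), r)
        - cumulative.begin());
}
```

Calling pickIndex({10, 7, 9}) repeatedly yields 0, 1, 2 with probabilities 10/26, 7/26, and 9/26.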
You may want to use the English language's letter frequencies to get more realistic output: http://en.wikipedia.org/wiki/Letter_frequency.
But if you want pronounceable words, you should probably generate them from syllables. You can find more information online, e.g. here: http://spell.psychology.wustl.edu/SyllStructDistPhon/CVC.html
You could derive a Markov model by reading a source text and then generate words which are "like" the source.
This also works for generating sentences from words. Well, it sort of works.
If you want to change just the letter frequency in the words, without further lexical analysis (like the qu pair), get a list of English-language letter frequencies.
Then create a weighted random generator that is more likely to output an e (about a 1 in 8 chance) than an x (around 1 in 1000).
To build such a weighted random generator (rand() generates integers, IIRC):
1. Normalize the letter frequencies so that they are all integers (for the Wikipedia frequencies, which are given as fractions, basically multiply by 100000).
2. Make some sort of lookup table where each letter is assigned a certain range, like the table below.
letter | weight | start | end
a | 8.17% | 0 | 8166
b | 1.49% | 8167 | 9658
c | 2.78% | 9659 | 12440
d | 4.25% | 12441 | 16693
e | 12.70% | 16694 | 29395
f | 2.23% | 29396 | 31623
g | 2.02% | 31624 | 33638
.....
z | 0.07% | 99926 | 99999
3. Generate a random number between 0 and 99999, and use that to find the corresponding letter. This way, you will have the correct letter frequencies.
First, you need a table with the letters and their weights, something like:
#include <cstdlib>    // rand
#include <iterator>   // std::begin, std::end

struct WeightedLetter
{
    char letter;
    int weight;
};

static WeightedLetter const letters[] =
{
    { 'a', 82 },
    { 'b', 15 },
    { 'c', 28 },
    // ...
};

char getLetter()
{
    // Sum the weights so we know the size of the total interval.
    int totalWeight = 0;
    for ( WeightedLetter const* iter = std::begin( letters );
          iter != std::end( letters );
          ++ iter ) {
        totalWeight += iter->weight;
    }

    int choice = rand() % totalWeight;
    // but you probably want a better generator

    // Walk the table, subtracting each letter's weight until the
    // choice falls inside the current letter's slice. Note >=, so a
    // choice equal to the weight moves on to the next letter.
    WeightedLetter const* result = std::begin( letters );
    while ( choice >= result->weight ) {
        choice -= result->weight;
        ++ result;
    }
    return result->letter;
}
This is just off the top of my head, so it's likely to contain errors; at the very least, the second loop requires some verification. But it should give you the basic idea.
Of course, this still isn't going to result in English-like words. The sequence "uq" is just as likely as "qu", and there's nothing to prevent a word without a vowel, or a ten letter word with just vowels. The Wikipedia page on English Phonology has some good information as to what combinations can occur where, but it doesn't have any statistics on them. On the other hand, if you're trying to make up possible words, like Jabberwocky, then that may not be a problem: choose a random number of syllables, from 1 to some maximum, then an onset, a nucleus and a coda. (Don't forget that the onset and the coda can be empty.)
If you want to create pronounceable words, do not try to join letters together.
Join sounds. Make a list of sounds to select from: "abe", "ape", "gre", etc.