I know there have been several posts about random word generation based on large dictionaries or web lookups. However, I'm looking for a word generator that I can use to create strong passwords without symbols. What I need is a reliable mechanism to generate a random, unrecognised, English-looking word of a given length.
An example of the type of word would be "ratanta".
Are there any algorithms that understand compatible syllables and therefore generate a pronounceable output string? I know that certain CAPTCHA-style controls generate these types of words, but I'm unsure whether they use an algorithm or whether they are sourced from a large set as well.
If there are any .NET implementations of this type of functionality, I would be very interested to know.
There are several things you can do:
1) Research English syllable structure, and generate syllables following those rules
2) Employ Markov chains to get a statistical model of English phonology.
There are plenty of resources on Markov chains, but the main idea is to record how likely each letter is to follow a given sequence. For instance, after "q", "u" is very likely and, after "k", "q" is very unlikely (an order-1 chain, which conditions on one preceding letter); after "th", "e" is very likely (an order-2 chain, which conditions on two preceding letters).
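As a rough illustration of that bookkeeping, here is a minimal sketch that counts which letter follows each context of `order` letters and turns the counts into an estimated probability. The class and method names (`TransitionModel`, `Count`, `Probability`) are made up for this example, and the word list you feed it is up to you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TransitionModel
{
    // Counts how often each letter follows each context of `order` letters.
    public static Dictionary<string, Dictionary<char, int>> Count(
        IEnumerable<string> words, int order)
    {
        var counts = new Dictionary<string, Dictionary<char, int>>();
        foreach (var word in words.Select(w => w.Trim().ToLowerInvariant()))
        {
            for (var i = 0; i + order < word.Length; i++)
            {
                var context = word.Substring(i, order);
                var next = word[i + order];
                if (!counts.TryGetValue(context, out var followers))
                {
                    followers = new Dictionary<char, int>();
                    counts[context] = followers;
                }
                followers.TryGetValue(next, out var seen); // seen is 0 if absent
                followers[next] = seen + 1;
            }
        }
        return counts;
    }

    // Estimated probability that `next` follows `context` (0 if never seen).
    public static double Probability(
        Dictionary<string, Dictionary<char, int>> counts, string context, char next)
    {
        if (!counts.TryGetValue(context, out var followers)) return 0.0;
        followers.TryGetValue(next, out var seen);
        return (double) seen / followers.Values.Sum();
    }
}
```

With `order` set to 1 the contexts are single letters such as "q"; with 2 they are pairs such as "th". A longer context gives more English-looking output but needs more sample text to fill the table.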
If you go the syllable model route, you can use resources like this to help you elucidate your intuitions about your language.
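A very small sketch of the syllable route could look like this. The onset, vowel and coda lists are hand-picked and deliberately incomplete (they are my assumption, not a real model of English syllable structure), and `SyllableWords` is a made-up name. It also assumes .NET Core 3.0 or later for `RandomNumberGenerator.GetInt32`.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SyllableWords
{
    // Hand-picked, deliberately incomplete lists; not a full model of English.
    private static readonly string[] Onsets =
        { "b", "d", "f", "g", "l", "m", "n", "p", "r", "s", "t", "br", "st", "tr", "pl" };
    private static readonly string[] Nuclei = { "a", "e", "i", "o", "u" };
    // The empty entries make open syllables (no coda) more likely.
    private static readonly string[] Codas = { "", "", "n", "r", "s", "nt" };

    public static string Generate(int length)
    {
        if (length < 1) throw new ArgumentOutOfRangeException(nameof(length));
        var word = new StringBuilder();
        while (word.Length < length)
        {
            word.Append(Pick(Onsets)).Append(Pick(Nuclei)).Append(Pick(Codas));
        }
        // Cut to the requested length; this can clip the last syllable.
        return word.ToString(0, length);
    }

    // Uniform pick using the cryptographic RNG, since the words are for passwords.
    private static string Pick(string[] items) =>
        items[RandomNumberGenerator.GetInt32(items.Length)];
}
```

Calling `SyllableWords.Generate(7)` returns a seven-letter string built from onset + vowel + optional coda groups. Richer lists give more variety, at the cost of occasionally awkward clusters where two syllables meet.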
UPDATE:
3) You can make it much simpler by simulating not full English but, say, Japanese or Italian, where the rules are much simpler; a nonsense word built that way is as easy to remember as a nonsense English one. For instance, Japanese only has about 94 valid syllables (47 short, 47 long), so you can easily list all of them and pick at random.
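To show how little code the "list the syllables and pick at random" idea needs, here is a sketch that uses a subset of romanized Japanese syllables. The list below is only an illustrative selection, not the full inventory, and `KanaWords` is a made-up name; it assumes .NET Core 3.0 or later for `RandomNumberGenerator.GetInt32`.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class KanaWords
{
    // Illustrative subset of romanized syllables; extend it as needed.
    private static readonly string[] Syllables =
    {
        "a", "i", "u", "e", "o",
        "ka", "ki", "ku", "ke", "ko",
        "sa", "shi", "su", "se", "so",
        "ta", "chi", "tsu", "te", "to",
        "na", "ni", "nu", "ne", "no",
        "ha", "hi", "fu", "he", "ho",
        "ma", "mi", "mu", "me", "mo",
        "ya", "yu", "yo",
        "ra", "ri", "ru", "re", "ro",
        "wa"
    };

    // Builds a word from `syllableCount` syllables picked uniformly at random.
    public static string Generate(int syllableCount)
    {
        if (syllableCount < 1) throw new ArgumentOutOfRangeException(nameof(syllableCount));
        var word = new StringBuilder();
        for (var i = 0; i < syllableCount; i++)
        {
            word.Append(Syllables[RandomNumberGenerator.GetInt32(Syllables.Length)]);
        }
        return word.ToString();
    }
}
```

Because every syllable ends in a vowel, any sequence is pronounceable, so no compatibility rules are needed. Note that the word length varies with the syllables picked, so this version is sized by syllable count rather than by character count.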
I'd use a Markov chain algorithm for this.
In summary:
- Build a dictionary. Iterate through the letters in an example piece of English text. Build a data structure that maps pairs of letters. Against each pair, record a probability that the second letter appears immediately after the first.
- Generate your text. Using the map that you built in the first step, pick a sequence of random letters. When deciding what letter to write next, look at the letter you wrote most recently, and use that letter to determine the probability of the next letter.
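The two steps above can be sketched roughly as follows, using a first-order chain over single letters. `LetterChain` is a made-up name, the restart on a dead end is my own choice rather than part of the description above, and the code assumes .NET Core 3.0 or later for `RandomNumberGenerator.GetInt32`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public sealed class LetterChain
{
    // counts[a][b] = number of times letter b was seen right after letter a.
    private readonly Dictionary<char, Dictionary<char, int>> counts =
        new Dictionary<char, Dictionary<char, int>>();

    // Step 1: build the map of letter pairs from a sample of English text.
    public LetterChain(string sampleText)
    {
        var text = sampleText.ToLowerInvariant();
        for (var i = 0; i + 1 < text.Length; i++)
        {
            char first = text[i], second = text[i + 1];
            // Skip pairs that span spaces, digits or punctuation.
            if (!IsLetter(first) || !IsLetter(second)) continue;
            if (!counts.TryGetValue(first, out var row))
            {
                row = new Dictionary<char, int>();
                counts[first] = row;
            }
            row.TryGetValue(second, out var seen); // seen is 0 if absent
            row[second] = seen + 1;
        }
        if (counts.Count == 0)
            throw new ArgumentException("No letter pairs found.", nameof(sampleText));
    }

    // Step 2: walk the map, choosing each letter based on the previous one.
    public string Generate(int length)
    {
        if (length < 1) throw new ArgumentOutOfRangeException(nameof(length));
        var keys = counts.Keys.ToList();
        var current = keys[RandomNumberGenerator.GetInt32(keys.Count)];
        var word = new StringBuilder().Append(current);
        while (word.Length < length)
        {
            // Dead end (letter never seen with a follower): restart from a random known letter.
            current = counts.TryGetValue(current, out var row)
                ? PickWeighted(row)
                : keys[RandomNumberGenerator.GetInt32(keys.Count)];
            word.Append(current);
        }
        return word.ToString();
    }

    // Picks a follower with probability proportional to its count.
    private static char PickWeighted(Dictionary<char, int> row)
    {
        var ticket = RandomNumberGenerator.GetInt32(row.Values.Sum());
        foreach (var pair in row)
        {
            if (ticket < pair.Value) return pair.Key;
            ticket -= pair.Value;
        }
        throw new InvalidOperationException("Unreachable: weights sum mismatch.");
    }

    private static bool IsLetter(char c) => c >= 'a' && c <= 'z';
}
```

Usage would be along the lines of `new LetterChain(someEnglishText).Generate(8)`. A single letter of context produces fairly rough words; conditioning on two or three letters tends to read more naturally.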
Some answers suggest a Markov chain but don't tell you how you'd build one. Here is an implementation:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

namespace PseudoWord
{
    public sealed class PseudoWordGenerator : IDisposable
    {
        // Cryptographic RNG, since the output is meant for passwords.
        // RNGCryptoServiceProvider is marked obsolete from .NET 6 on;
        // RandomNumberGenerator.Create() is the replacement there.
        private readonly RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();

        // Grams seen at the end of a word; generation only stops on one of these.
        private readonly HashSet<string> enders = new HashSet<string>();

        // Grams seen at the start of a word (repeats kept on purpose).
        private readonly IList<string> starters = new List<string>();

        // For each letter, every gram that followed it in the input (repeats kept).
        private readonly Dictionary<char, IList<string>> gramDict =
            Enumerable
                .Range('a', 26) // Range(start, count): the 26 letters 'a'..'z'
                .ToDictionary(a => (char) a, _ => (IList<string>) new List<string>());

        private readonly byte[] randomBytes = new byte[4];

        public PseudoWordGenerator(IEnumerable<string> words, int gramLen)
        {
            // Keep only purely alphabetic words that are longer than one gram.
            foreach (var word in words
                .Select(w => w.Trim().ToLowerInvariant())
                .Where(w => w.Length > gramLen)
                .Where(w => Regex.IsMatch(w, "^[a-z]+$")))
            {
                this.starters.Add(word.Substring(0, gramLen));
                this.enders.Add(word.Substring(word.Length - gramLen, gramLen));
                for (var i = 0; i < word.Length - gramLen; i++)
                {
                    // Cannot miss for [a-z] input; give up on this word if it ever does.
                    if (!this.gramDict.TryGetValue(word[i], out var grams))
                    {
                        break;
                    }

                    grams.Add(word.Substring(i + 1, gramLen));
                }
            }
        }

        // Returns a pseudo-word of at least `length` characters that ends on a
        // gram seen at the end of a real word.
        public string BuildPseudoWord(int length)
        {
            var result = new StringBuilder(this.GetRandomStarter());
            var lastGram = string.Empty;
            while (result.Length < length || !this.enders.Contains(lastGram))
            {
                lastGram = this.GetRandomGram(result[result.Length - 1]);
                result.Append(lastGram);
            }

            return result.ToString();
        }

        private string GetRandomStarter() => this.GetRandomElement(this.starters);

        private string GetRandomGram(char preceding) =>
            this.GetRandomElement(this.gramDict[preceding]);

        // Throws DivideByZeroException on an empty list, which can happen
        // when the word list is very small.
        private T GetRandomElement<T>(IList<T> collection) =>
            collection[this.GetRandomUnsigned(collection.Count - 1)];

        public void Dispose()
        {
            // Wipe the random buffer before releasing the RNG.
            for (var i = 0; i < this.randomBytes.Length; i++)
            {
                this.randomBytes[i] = 0;
            }

            this.rng?.Dispose();
        }

        // Returns a random integer in [0, max].
        private int GetRandomUnsigned(int max)
        {
            this.rng.GetBytes(this.randomBytes);
            // Read as unsigned: Math.Abs(int.MinValue) would throw OverflowException.
            // The modulo leaves a very slight bias toward low indexes.
            return (int) (BitConverter.ToUInt32(this.randomBytes, 0) % (uint) (max + 1));
        }
    }
}
With gramLen = 3 and the "linuxwords" Linux dictionary as input, here's some sample output of at least 12 characters:
larisommento
damentivesto
honsgranspireas
incenctorsed
opemelersult
spenedriarblast
devokepocian
newmenaryrofile
perocererich
trerwhusinis
This implementation keeps things simple by storing repeated grams in the lists, so that a gram seen more often in the input is picked proportionally more often. Additionally, the beginnings and ends of words are treated specially to generate more plausible words.