FortuneCookie101 - 7 months ago
SQL Question

How can I generate a random logical word?

I'm wondering how I can generate a random logical word list in PHP.

I have a MySQL database full of English words (A - Z) and I want to generate logical words to go with each one.

For example, word number 26 in my list is 'abandon'. I would like to generate a made-up word for it, maybe using regex or something, so I can translate a whole page of words back and forth.

The problem with using straight-up random words is that they don't look authentic enough: 'abandon' might become (purely randomly generated) 'qdbskp' or something like that. That doesn't look like a word at all; it really just looks like someone slammed their face into the keyboard.

Instead, I would like some logic to it, so maybe a mix of vowels and consonants that makes the word look "real".

Hopefully I'm explaining myself correctly.

Thanks.

TLDR: I'm trying to create a dictionary of randomly generated words, each linked to a word in an English word list, with enough logic behind them that the words look real.

Answer

Method & Data

A word looks somewhat logical when its characters come in an order you're used to seeing. One way to get that is with a weighted list of trigrams - sequences of 3 characters.

Basically you take any two letters, like "so", and add another that commonly comes after it, like "l". Then take the last two letters, "ol", and find what comes after that. Rinse/repeat until you've got a word of whatever length you'd like - "solverom".

Sourcing from Peter Norvig's n-gram data (which itself was compiled from the Google Books Ngrams), I've put together a few JSON files on GitHub. I'd include the data directly here, but trigrams.json in particular is a bit big for that at ~128KB.

The data can actually be compiled from any dictionary or other hulking word list, and is structured like so...


distinct_word_lengths.json

[0,26,622,4615,6977,10541,13341,14392,13284,11079,8468,5769,3700,2272,1202,668,283,158,64,40,16,1,5,2]

This one is complete. It is a (0-indexed) distribution of lengths of distinct words. Each index is the word length and each value how many words of that length were found. So, for example, there were 4615 distinct words that were 3 characters long.

We'll use this to decide how long our new word should be. Basically we add up all the values, pick a random number between 1 and the total, then find where in the list it lies by walking through the values and accumulating them. The index of the element it lands in is how long the word will be.
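
To make the "find where it lies" step concrete, here is a small sketch with the random number replaced by a fixed roll, so the result is predictable. The helper name index_for_roll() is made up for this illustration and is not part of the code further down.

```php
<?php
// Hypothetical helper: maps a fixed "roll" onto an index of a weight list.
// $roll stands in for the random number between 1 and the total.
function index_for_roll(array $weights, int $roll): int
{
    foreach ($weights as $index => $weight) {
        $roll -= $weight;      // use up this bucket's share of the range
        if ($roll <= 0) {
            return $index;     // the roll landed inside this bucket
        }
    }
    throw new OutOfRangeException('roll is larger than the total weight');
}

$lengths = [0,26,622,4615,6977,10541,13341,14392,13284,11079,8468,5769,
            3700,2272,1202,668,283,158,64,40,16,1,5,2];

$total = array_sum($lengths);                 // 97525 distinct words overall

echo index_for_roll($lengths, 26), "\n";      // 1: rolls 1..26 are one-letter words
echo index_for_roll($lengths, 27), "\n";      // 2: rolls 27..648 are two-letter words
echo index_for_roll($lengths, 649), "\n";     // 3: rolls 649..5263 are three-letter words
echo index_for_roll($lengths, $total), "\n";  // 23: the very last roll
```

Because index 0 has a weight of 0, no roll can ever land on it, so a zero-length word is never picked.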


word_start_bigrams.json

{
    "TH": "82191954206",
    "HE": "9112438473",
    "IN": "27799770674",
    "ER": "324230831",
    ...

This one couples bigrams, two-character combinations, with how often they're found at the beginning of words. Yes, everything is in capital letters.

We'll use this to decide what to start our word with.


trigrams.json

{
    "TH": {
        "E": "69221160871",
        "A": "9447439870",
        "I": "6357454845",
        "O": "3369505315",
        "R": "1673179164",
        ...
    },
    "AN": {
        "D": "26468697834",
        "T": "3755591976",
        "C": "3061152975",
        ...

This one is a little more interesting. Each key in this data set is a bigram, and its value maps each character that can follow that bigram to how often it does.

"D" shows up after "AN" a lot.

This is what we'll use to build up the rest of the word.
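
Since the data can be compiled from any word list, here is a rough sketch of how the three tables could be built. The function name build_ngram_tables() is hypothetical, and it makes two assumptions that differ from the published files: each distinct word counts exactly once (the files above carry corpus frequencies), and only plain A-Z words are kept.

```php
<?php
// Hypothetical helper (not from the original answer): builds the three
// tables from a plain list of words, counting every distinct word once.
function build_ngram_tables(array $words): array
{
    $lengths  = [];
    $starts   = [];
    $trigrams = [];

    foreach (array_unique(array_map('strtoupper', $words)) as $word) {
        if (!ctype_alpha($word)) {
            continue; // assumption: keep plain A-Z words only
        }
        $n = strlen($word);
        $lengths[$n] = ($lengths[$n] ?? 0) + 1;

        if ($n >= 2) {
            $start = substr($word, 0, 2);
            $starts[$start] = ($starts[$start] ?? 0) + 1;
        }
        // Slide a 3-character window: a bigram plus the letter after it.
        for ($i = 0; $i + 2 < $n; $i++) {
            $pair = substr($word, $i, 2);
            $next = $word[$i + 2];
            $trigrams[$pair][$next] = ($trigrams[$pair][$next] ?? 0) + 1;
        }
    }

    // Fill the gaps so the list is 0-indexed like distinct_word_lengths.json.
    $max   = $lengths ? max(array_keys($lengths)) : 0;
    $dense = [];
    for ($i = 0; $i <= $max; $i++) {
        $dense[] = $lengths[$i] ?? 0;
    }
    return [$dense, $starts, $trigrams];
}

list($lengths, $starts, $trigrams) =
    build_ngram_tables(['the', 'then', 'than', 'and', 'hand']);

echo json_encode($lengths), "\n";        // [0,0,0,2,3]
echo json_encode($trigrams['TH']), "\n"; // {"E":2,"A":1}
```

The counts here are integers rather than the numeric strings in the JSON files, which is fine for small lists.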


Functions

First we need a few utility functions.

gmp_rand()

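function gmp_rand($min, $max) {
    // Work in the range 0..($max - $min) and shift back at the end.
    $max -= $min;
    // Number of binary digits needed to represent the largest offset.
    $bit_length = strlen(gmp_strval($max, 2));

    // Rejection sampling: build a candidate one random bit at a time and
    // start over whenever it overshoots $max.
    do {
        $rand = gmp_init(0);
        for ($i = $bit_length - 1; $i >= 0; $i--) {
            gmp_setbit($rand, $i, rand(0, 1));
            // Bits are set from the top down, so once we exceed $max
            // the remaining bits cannot bring us back under it.
            if ($rand > $max) break;
        }
    } while ($rand > $max);

    return $rand + $min;
}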

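Because some of the numbers we need to generate can be larger than PHP_INT_MAX, we'll use the PHP GMP extension to deal with them. This is a simple enough rand() work-alike for GMP numbers. The arithmetic and comparison operators on GMP values need PHP 5.6 or later; from PHP 5.6.3 the extension also provides gmp_random_range(), which covers the same need.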


array_weighted_rand()

function array_weighted_rand ($list) {
    $total_weight = gmp_init(0);
    foreach ($list as $weight) {
        $total_weight += $weight;
    }

    $rand = gmp_rand(1, $total_weight);
    foreach ($list as $key => $weight) {
        $rand -= $weight;
        if ($rand <= 0) return $key;
    }
}

This is much like the built-in array_rand() in that you pass it an array and it'll return a random key. Only this one factors in the weight when picking it.

So if you pass in an array that looks like:

array (
  'foo' => 2,
  'bar' => 4,
  'baz' => 12
)

It'll return bar about twice as often as it'll return foo, and baz about three times as often as bar.
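
If you want to check those ratios yourself, a quick tally does it. This sketch uses an integer-only stand-in, weighted_pick_int(), which is a made-up name and assumes the weights sum to less than PHP_INT_MAX so GMP isn't needed for the demo.

```php
<?php
// Integer-only stand-in for array_weighted_rand(); assumes the weights
// sum to less than PHP_INT_MAX, so no GMP is needed for this small demo.
function weighted_pick_int(array $list)
{
    $rand = mt_rand(1, array_sum($list));
    foreach ($list as $key => $weight) {
        $rand -= $weight;
        if ($rand <= 0) {
            return $key;
        }
    }
}

mt_srand(42); // fixed seed so repeated runs tally the same way

$weights = ['foo' => 2, 'bar' => 4, 'baz' => 12];
$tally   = array_fill_keys(array_keys($weights), 0);

for ($i = 0; $i < 18000; $i++) {
    $tally[weighted_pick_int($weights)]++;
}

print_r($tally); // expect roughly 2000 / 4000 / 12000
```

The weights add up to 18, so over 18000 draws each unit of weight should account for about 1000 picks.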


fill_word()

function fill_word ($word, $length, $trigrams) {
    while (strlen($word) < $length) {
        $word .= array_weighted_rand($trigrams[substr($word, -2)]);
    }
    return $word;
}

This takes a string $word and fills it to $length from the set of given $trigrams. Each iteration it picks from the data set based on the last two characters in the string.
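
One thing fill_word() takes for granted is that every pair of trailing letters has an entry in the data. Here is a sketch of a more defensive variant, traced on a toy table. Everything in it is made up for illustration: fill_word_checked(), heaviest_key() (a deterministic stand-in for array_weighted_rand() so the trace is repeatable) and the toy weights.

```php
<?php
// Deterministic stand-in for array_weighted_rand(): always returns the
// heaviest key, so the trace below comes out the same on every run.
function heaviest_key(array $list)
{
    return array_search(max($list), $list);
}

// Hypothetical variant of fill_word() that takes the picker as a callable
// and gives up (returns null) when the last two letters have no entry.
function fill_word_checked(string $word, int $length, array $trigrams, callable $pick): ?string
{
    while (strlen($word) < $length) {
        $tail = substr($word, -2);
        if (empty($trigrams[$tail])) {
            return null; // dead end: nothing is known to follow $tail
        }
        $word .= $pick($trigrams[$tail]);
    }
    return $word;
}

// Toy table, invented for this example (the real one is trigrams.json).
$toy = [
    'SO' => ['L' => 5, 'M' => 3],
    'OL' => ['V' => 4, 'D' => 2],
    'LV' => ['E' => 9],
    'VE' => ['R' => 7, 'N' => 1],
    'ER' => ['O' => 2, 'S' => 1],
    'RO' => ['M' => 6],
];

echo fill_word_checked('SO', 8, $toy, 'heaviest_key'), "\n"; // SOLVEROM
var_dump(fill_word_checked('SO', 9, $toy, 'heaviest_key'));  // NULL, 'OM' has no entry
```

A caller can treat null as "try again with a new start", the same way the vowel check further down retries.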


Usage

$lengths  = json_decode(file_get_contents('distinct_word_lengths.json'), true);
$bigrams  = json_decode(file_get_contents('word_start_bigrams.json'), true);
$trigrams = json_decode(file_get_contents('trigrams.json'), true);

for ($i = 0; $i < 10; $i++) {
    do {
        $length = array_weighted_rand($lengths);
        $start  = array_weighted_rand($bigrams);
        $word   = fill_word($start, $length, $trigrams);
    } while (!preg_match('/[AEIOUY]/', $word));

    $word = strtolower($word);
    echo "$word\n";
}

What we're doing is getting a random length, and random bigram to begin the word with, then filling it up. The preg_match() is just to validate that the word contains a vowel, which isn't otherwise guaranteed. If it doesn't, try again.

You can replace this with any sort of validation you might want to do, such as making sure it doesn't match a real word in your database or whatever.
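
For the translating-back-and-forth use case in the question, the validation probably also needs to make sure no made-up word is handed out twice. A rough sketch of that, assuming the English words are already loaded into an array: build_translation_map() and the $generate callable are hypothetical names, and the canned generator below only exists to make the demo predictable.

```php
<?php
// Hypothetical helper: pairs every English word with a unique made-up word.
// $generate is any callable returning a candidate word; it is an assumption
// for this sketch, not part of the original answer.
function build_translation_map(array $english, callable $generate, int $max_tries = 1000): array
{
    $real = array_flip(array_map('strtolower', $english)); // quick real-word lookup
    $used = [];
    $map  = [];

    foreach ($english as $word) {
        for ($try = 0; $try < $max_tries; $try++) {
            $candidate = strtolower($generate());
            $ok = preg_match('/[aeiouy]/', $candidate) // must contain a vowel
               && !isset($real[$candidate])            // not a real word from the list
               && !isset($used[$candidate]);           // not already handed out
            if ($ok) {
                $used[$candidate] = true;
                $map[strtolower($word)] = $candidate;
                continue 2;                            // on to the next English word
            }
        }
        throw new RuntimeException("no free word found for '$word'");
    }
    return $map;
}

// Demo with a canned generator so the result is predictable.
$queue = ['ABANDON', 'BRRT', 'PLESURI', 'PLESURI', 'ORKNO'];
$next  = function () use (&$queue) { return array_shift($queue); };

$map     = build_translation_map(['abandon', 'ability'], $next);
$reverse = array_flip($map); // made-up word => English word

print_r($map); // abandon => plesuri, ability => orkno
```

Flipping the map gives the lookup for translating in the other direction, which only works because every made-up word is unique.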

Yeah, you might generate a real word. Just pronounce it different if you want to say you made it up.


Output

Running a handful of times landed me with these:

ancover             ingennized          plesuri             asymbablew
orkno               oftedi              nestrat             arlysect
welvency            thembe              therespaid          frokedgerition
judeth              ist                 rectede             privede
aprommautu          offeleal            townerislo          callynerly
thentsi             perma               themenum            agesputherflone
pecticangenti       whoult              ifileyea            onster
flatco              powne               prative             betion
inegansith          meraddin            theste              mysistai
skerest             uppre               ongdonc             hadmints

All of which my spell-checker hates.


Full data and code can be grabbed from GitHub.