oucil oucil - 4 months ago 12
MySQL Question

Is there a formula to find shortest length required to achieve uniqueness across a set

I'd like to be able to calculate the length of the shortest sub-string required to achieve complete uniqueness.

Lets say I have a varying length list of 32 character UUIDs, but what I'd like to achieve is shortenening them during reference to only be as long as is required to achieve uniqueness in their set. For instance, if I have the following set of UUID's (pipes inserted to illustrate the answer)...

428|07082e1f445e79501bebfa87396af
723|0785bffaf4747865c202dd0924c7f
b65|634be909d4e5590aa0cdc97251eef
3c4|d94c683624d75a273e3186ec65b78
09e|bd42af0404bcf90413e11c5b40fbb
011|004743d65466dae8a9a6bc814ef4b
1f1|889e04e3a453fbf57521de0a70b60
1ac|44707af8d4681875171ad47c61037
42f|7a6236deb4a9ead32ab2e816d73a3
83a|fe22086064eec87704127622b8165


I would only require the first three characters to achieve the same level of uniqueness as if I had used the full 32 character strings.

I'm curious if there is a formula for reaching that value. I know that I could put this in a couple nested loops, but I'd like to know if there is a more elegant or programmatic way of achieving this.




Edit: Just to be clear, the pipes are only to illustrate that I can achieve uniqueness after only 3 characters. The result of the formula/method should be an array of equal length with only the shortest strings derived from the given set, in this case, the first three chars only. Imagine that I want to use these in a URL, and that I can't have any ambiguity, but still want to be able to reference the same records as if I used the full string in each case.

EDIT2: Actually... as I think about it, no need for a result array, only an integer, the min length required in characters.

Answer

I managed to create some codes to achieve that. Take a look:

  • Code 1:
function check_un($array){
    $array = array_values($array);
    $len = 1;
    $tmp = array();
    for($i = 0; $i < strlen($array[0]); $i++){
        if( count(array_unique( $tmp = array_map(function($v) use($len){ return substr($v, 0, $len); }, $array ) )) != count($array) ){
            $len++;
        }
    }
    return $tmp; // this was set in the array_map part
}
  • Code 2:
function check_un($array){
    $arr = $array;
    $len = 1;

    $tmp = array();

    while (list($key, $value) = each($arr)) {
        $v = substr($value, 0, $len);
        if (isset($tmp[$v])) {
            $tmp = array();
            $len++;
            reset($arr); // start again
        }
        $tmp[$v] = true;
    }
    $tmp = array_keys($tmp);
    array_shift($tmp);
    return $tmp;
}

There used to be a code 3 (the first I tried), but it's only available in the edit history.


You can test them with this:

$values = array(
    '42807082e1f445e79501bebfa87396af',
    '7230785bffaf4747865c202dd0924c7f',
    'b65634be909d4e5590aa0cdc97251eef',
    '3c4d94c683624d75a273e3186ec65b78',
    '09ebd42af0404bcf90413e11c5b40fbb',
    '011004743d65466dae8a9a6bc814ef4b',
    '1f1889e04e3a453fbf57521de0a70b60',
    '1ac44707af8d4681875171ad47c61037',
    '42f7a6236deb4a9ead32ab2e816d73a3',
    '83afe22086064eec87704127622b8165'
    //,'42807082e1f445e795aaaaaaaaaaaaa' // add this to test with more letters
);

$val = check_un($values);

The result (for both cases):

Array
(
    [0] => 428
    [1] => 723
    [2] => b65
    [3] => 3c4
    [4] => 09e
    [5] => 011
    [6] => 1f1
    [7] => 1ac
    [8] => 42f
    [9] => 83a
)

See them in action here:


You can change the returned value to get only the $len variable.