Oleksii Tsebinoga - 3 months ago
C Question

How to make my hashing algorithm faster

My question concerns a task from CS50, pset5. For those who don't know it, I'll try to explain; nothing very special. I need to write a function that takes a dictionary file (it is provided, and all the words in it are uppercase) containing more than 20K words, and sorts them somehow. I made a simple, naive algorithm that builds a hash table sorting words by their first letters. I passed all the CS50 checks, so my program works correctly, but compared to the course's solution it is too slow: the staff's version runs in 0.1s, while mine takes 5.0s to 7.0s. What can I improve in this code to make it faster? Or should I change everything completely? I have no experience with optimization, because I've just started learning. It would be great to learn from any of you =) Thanks in advance!

// Some constant values, which are declared before the function
#define LENGTH 46
#define ALPHALENGTH 26
/* Definition of node struct. Nothing special, in fact =) */
typedef struct node {
char word[LENGTH +1];
struct node *next;
} node;

node *hashTable[ALPHALENGTH];

bool load(const char *dictionary) {
FILE *f = fopen(dictionary, "r");

if (f == NULL) {
return false;
}

char word[LENGTH + 1];
int hash = 0;

for (int i = 0; i < ALPHALENGTH; i++) {
hashTable[i] = NULL;
}

// 46 is LENGTH, but a macro name cannot be expanded
// inside a string literal, so the width has to be written out
while (fscanf(f, "%46s", word) == 1) {
// make every letter lowercase to get result from 0 - 25
hash = tolower(word[0]) - 'a';
node *new_node = malloc(sizeof(node));
strcpy(new_node->word, word);

// check if element is first in the list
if (hashTable[hash] == NULL) {
new_node->next = NULL;
hashTable[hash] = new_node;
} else {
node *ptr = hashTable[hash];

do {
if (ptr->next == NULL) {
break;
}
ptr = ptr->next;
} while (true);

ptr->next = new_node;
new_node->next = NULL;
}
}

fclose(f);

return true;
}

Answer

The problem is not in your hash function, nor in the size of your hash table; it is in your list management: your method for appending words to the corresponding lists has a complexity of O(N²).
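The cost comes from re-walking each list for every new word: appending the k-th word to a bucket traverses all the nodes already there. If you want true O(1) appends that preserve insertion order without any reversal pass, a common alternative is to keep one tail pointer per bucket. A minimal sketch reusing your node type (the `append` helper and the `head`/`tail` names are illustrative, not from your code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LENGTH 46
#define ALPHALENGTH 26

typedef struct node {
    struct node *next;
    char word[LENGTH + 1];
} node;

/* One head and one tail pointer per bucket: the tail pointer
   lets us link a new node at the end in O(1) instead of
   walking the whole list. */
static node *head[ALPHALENGTH];
static node *tail[ALPHALENGTH];

static int append(int bucket, const char *word) {
    node *new_node = malloc(sizeof(node));
    if (new_node == NULL)
        return 0;
    strcpy(new_node->word, word);
    new_node->next = NULL;
    if (head[bucket] == NULL)
        head[bucket] = new_node;        /* first element in this bucket */
    else
        tail[bucket]->next = new_node;  /* O(1) link at the end */
    tail[bucket] = new_node;
    return 1;
}
```

This keeps the exact semantics of your original loop while removing the quadratic traversal.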

By the way, your hash function is not used for hashing, but for dispatching. You are sorting the table only on the first letter of each word, keeping the words with the same initial in the same order. If you meant to sort the dictionary completely, you would still need to sort each list.
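For comparison, a genuine string hash mixes every character of the word, so a larger table gets much shorter chains than 26 first-letter buckets. A sketch using the well-known djb2 function (the name `hash_djb2` and the bucket count are illustrative, not part of the CS50 code):

```c
#include <assert.h>

/* djb2: a classic string hash that folds in every character,
   so words sharing a first letter still spread across buckets. */
unsigned int hash_djb2(const char *s, unsigned int nbuckets) {
    unsigned long h = 5381;
    for (; *s != '\0'; s++)
        h = h * 33 + (unsigned char)*s;
    return (unsigned int)(h % nbuckets);
}
```

With a real hash like this, however, the buckets no longer group words by initial, so a first-letter ordering would be lost.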

You can drastically improve the performance while keeping the same semantics by prepending to the lists and reversing the lists at the end of the parsing phase.

For a dictionary with 20 thousand words, the code below runs 50 times faster (as expected by the CS50 site):

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LENGTH  46
#define ALPHALENGTH  26
typedef struct node {
    struct node *next;
    char word[LENGTH +1];
} node;

node *hashTable[ALPHALENGTH];

bool load(const char *dictionary) {
    FILE *f = fopen(dictionary, "r");

    if (f == NULL) {
        return false;
    }

    char word[LENGTH + 1];
    int hash = 0;

    for (int i = 0; i < ALPHALENGTH; i++) {
        hashTable[i] = NULL;
    }

    while (fscanf(f, "%46s", word) == 1) {
        node *new_node = malloc(sizeof(node));
        if (new_node == NULL) {
            fclose(f);  /* don't leak the file handle on failure */
            return false;
        }
        // make every letter lowercase to get result from 0 - 25
        hash = tolower(word[0]) - 'a';
        strcpy(new_node->word, word);
        /* prepending to the list */
        new_node->next = hashTable[hash];
        hashTable[hash] = new_node;
    }

    for (int i = 0; i < ALPHALENGTH; i++) {
        node *n, *prev, *next;
        /* reverse list */
        for (prev = NULL, n = hashTable[i]; n != NULL; ) {
            next = n->next;
            n->next = prev;
            prev = n;
            n = next;
        }
        hashTable[i] = prev;
    }

    fclose(f);

    return true;
}

void save(void) {
    for (int i = 0; i < ALPHALENGTH; i++) {
        for (node *n = hashTable[i]; n != NULL; n = n->next) {
            puts(n->word);
        }
    }
}

int main(int argc, char *argv[]) {
    if (argc > 1) {
        if (load(argv[1]))
            save();
    }
}

Changing the fscanf() to a simpler fgets() might provide a marginal performance improvement, at the cost of more restrictive semantics for the dictionary format.
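Assuming the dictionary holds exactly one word per line, the fgets() version of the read step could look like this (a sketch; `next_word` is an illustrative helper, not part of the CS50 distribution code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LENGTH 46

/* Read one word per line with fgets(); returns 1 on success, 0 at EOF.
   Assumes the dictionary stores exactly one word per line. */
int next_word(FILE *f, char word[LENGTH + 1]) {
    char line[LENGTH + 2];               /* word + '\n' + '\0' */
    if (fgets(line, sizeof line, f) == NULL)
        return 0;
    line[strcspn(line, "\r\n")] = '\0';  /* strip the trailing newline */
    strcpy(word, line);
    return 1;
}
```

Unlike `fscanf(f, "%46s", word)`, this rejects multiple words on one line, which is the more restrictive semantics mentioned above.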
