HugoB HugoB - 14 days ago 5
C Question

Fastest way to search a string in a file

I need to do a "find function" for a project. I have to search all the same strings (of course only one written by the operator) and how many they are in a huge file in the fastest way possible.
I thought about a tree connected with a hash table but I don't know if It's correct.


  1. How can I do it with strings (I usually use numbers)?

  2. What should be the best data structure to use (complexity)?


Answer

Assuming worst case:

  • huge (1 Tebibyte) file
  • highly varied and highly repetitive content. Let's take /usr/share/dict/words with its ~100,000 words (here), concatenate until we have one Tebibyte which gives about 1.1 mio. repetitions and mix it up.
  • a non-repetitive (or close to non-repetitive) short (say 1-20 bytes, 10 on average) input.

The choice of algorithm depends on

  • number of inputs (inputs/second)
  • available memory

If you have only a handful (numbers intentionally kept vague) of inputs and/or not much memory available you can just search for it linearly (Boyer-Moor(-Horspool), Rabin-Karp, Apostolico-Giancarlo, Knuth-Morris-Pratt).

Ir you have a lot of inputs and some memory available you can index the file first (O(n), obviously) and search either in O(1) with a hashtable or O(log n) with a binary search tree (there are several optimizations possible but let's keep it simple).

Not much memory is needed. No matter what you do, a hashtable or a tree, you need to keep the position somewhere and because you have more than four Gibibytes you need a 64-bit counter. Eight bytes multiplied with the table size of 1.1 mio: just 8 Mebibytes. Plus space for the words themselves (less than one Mebibyte with my /usr/share/dict/words) or indices for the hashtable (a bit less because you don't need large integers for them with such a short wordlist).

You have some overhead for holding and managing the indices of the individual words in the big-file. A binary search tree is fast and quickly build, although it has quite some memory overhead. If you don't need to search the indices: just put them into a simple array.

tl;dr: index the file, that is make a hastable of the words and their places. Put the places (may need 64 bit integers!) in a simple array if you need them all at once but use a (binary) search tree if you need to search these indices. I assume here that you know how to build a perfect hash.

Comments