I need to implement a "find" function for a project: given a single search string entered by the operator, I have to find all of its occurrences in a huge file, and count them, as fast as possible.
I thought about a tree combined with a hash table, but I don't know whether that's the right approach.
Assuming the worst case: take /usr/share/dict/words with its ~100,000 words, concatenate it until we have one Tebibyte (which gives about 1.1 million repetitions of the list), and mix it up.
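A scaled-down version of that worst-case file could be generated like this (the function name and paths are mine; writing a full Tebibyte is impractical for testing, so the target size is a parameter):

```python
import random

def build_test_file(wordlist_path, out_path, target_bytes):
    """Concatenate a shuffled word list over and over until the output
    reaches roughly target_bytes -- a small stand-in for the
    one-Tebibyte worst case described above."""
    with open(wordlist_path) as f:
        words = f.read().split()
    written = 0
    with open(out_path, "w") as out:
        while written < target_bytes:
            random.shuffle(words)  # "mix it up"
            chunk = " ".join(words) + " "
            out.write(chunk)
            written += len(chunk)

# Small-scale example:
# build_test_file("/usr/share/dict/words", "big.txt", 10_000_000)
```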
The choice of algorithm depends on how many search strings you have and how much memory is available:
If you have only a handful (numbers intentionally kept vague) of inputs and/or not much memory available, you can just search for them linearly with a string-matching algorithm (Boyer-Moore(-Horspool), Rabin-Karp, Apostolico-Giancarlo, Knuth-Morris-Pratt).
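As a sketch of the linear option, here is a Boyer-Moore-Horspool occurrence counter (the function name is mine); it precomputes a bad-character shift table and skips ahead on mismatches instead of advancing one position at a time:

```python
def horspool_count(text, pattern):
    """Count occurrences of pattern in text with Boyer-Moore-Horspool."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0
    # Shift table: for each character in pattern (except the last),
    # the distance from its last occurrence to the end of the pattern.
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    count = 0
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            count += 1
        # Skip by the shift of the character under the window's end;
        # characters not in the pattern allow a full-length skip.
        pos += shift.get(text[pos + m - 1], m)
    return count
```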
If you have a lot of inputs and some memory available, you can index the file first (O(n), obviously) and then search either in O(1) with a hash table or in O(log n) with a binary search tree (several optimizations are possible, but let's keep it simple).
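A minimal sketch of the hash-table variant (assuming, for simplicity, that words are separated by single spaces; function names are mine): one O(n) pass builds a map from each word to its byte offsets, after which every query is an O(1) average-case dictionary lookup.

```python
from collections import defaultdict

def build_index(text):
    """One O(n) pass: map every word to the list of byte offsets
    at which it occurs (words assumed single-space separated)."""
    index = defaultdict(list)
    offset = 0
    for word in text.split(" "):
        if word:
            index[word].append(offset)
        offset += len(word) + 1  # +1 for the separating space
    return index

def lookup(index, word):
    """O(1) average case: return (count, positions) for a query word."""
    positions = index.get(word, [])
    return len(positions), positions
```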
Not much memory is needed. Whatever you do, hash table or tree, you have to store each position somewhere, and because the file is larger than four Gibibytes you need 64-bit offsets. Eight bytes multiplied by the table size of 1.1 million: just about 8 Mebibytes. Plus space for the words themselves (less than one Mebibyte with my /usr/share/dict/words) or indices into the hash table (a bit less, because such a short word list doesn't need large integers).
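The back-of-envelope arithmetic behind that estimate, spelled out (the ~1 MB wordlist size is my assumption for /usr/share/dict/words):

```python
# Size of the position list for one word in the worst-case file.
TEBIBYTE = 2 ** 40
wordlist_bytes = 1_000_000                  # ~1 MB wordlist (assumed)
repetitions = TEBIBYTE // wordlist_bytes    # ~1.1 million copies,
                                            # i.e. ~1.1 million positions per word
bytes_per_position = 8                      # 64-bit offsets, file > 4 GiB
table_bytes = repetitions * bytes_per_position
print(repetitions, table_bytes / 2**20)     # ~1.1e6 positions, ~8.4 MiB
```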
You have some overhead for holding and managing the indices of the individual words in the big file. A binary search tree is fast to search and quickly built, although it has quite some memory overhead. If you don't need to search the indices, just put them into a simple array.
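To illustrate the "search the indices" case: since the positions come out of the indexing pass already in ascending order, a sorted array plus binary search gives the same O(log n) bound as a balanced BST without the per-node pointer overhead (the function name is mine):

```python
import bisect

def first_position_at_or_after(positions, offset):
    """Binary search, O(log n): first occurrence at or after a given
    byte offset. positions must be sorted ascending, which it is if
    collected during a single left-to-right indexing pass."""
    i = bisect.bisect_left(positions, offset)
    return positions[i] if i < len(positions) else None

# Example positions of one word, collected during indexing:
positions = [5, 17, 42, 100, 256]
```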
tl;dr: index the file, that is, build a hash table of the words and their positions. Put the positions (you may need 64-bit integers!) into a simple array if you only need them all at once, but use a (binary) search tree if you need to search within these indices. I assume here that you know how to build a perfect hash.