Deftware - 9 days ago
PHP Question

Fastest approach to search within file contents of a directory

I have a directory that contains files for the users of a program of mine. There are around 70k files in that directory.

The current search method uses glob() and a foreach loop. It's getting quite slow and hogging the server. Is there any good way to search through these files more efficiently? I'm running this on an Ubuntu 16.04 machine, and I can use exec() if needed.

Update:

These are JSON files, and each file needs to be opened to check whether it contains the search query. Looping over the files is quite fast, but opening each file takes quite a while.
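
The search is essentially something like this (a simplified sketch; the path and query variable are placeholders):

```php
<?php
// Sketch of the current approach: glob() the directory, then open every
// file to test for the query. One read per file is what makes this slow
// across 70k files.
$query   = 'needle';
$matches = [];

foreach (glob('/path/to/userfiles/*.json') as $file) {
    $contents = file_get_contents($file);
    if ($contents !== false && strpos($contents, $query) !== false) {
        $matches[] = $file;
    }
}
```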

These cannot be indexed using SQL or memcached, as I'm using memcached for some other things.

Answer

The answer differs depending on whether you're storing the files on an SSD or an HDD.

HDD

With an HDD, the most probable bottleneck isn't PHP but the low number of I/O operations HDDs can handle. I would strongly advise moving to an SSD, or to a RAM disk, if that's feasible.

Let's assume you're not able to move the directory to an SSD. That means you're stuck on an HDD, which can perform roughly 70-200 IOPS (I/O operations per second), assuming your system doesn't cache the directory's files in RAM. Your best bet is to minimize metadata calls like fstat(), filemtime(), file_exists(), etc., and focus on the operations that actually read file contents (file_get_contents() etc.).
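
In practice that means dropping every per-file check you can. A minimal sketch of the idea (path and query are placeholders):

```php
<?php
// Keep exactly one I/O call per file. Every commented-out line below
// would cost an additional disk operation on an uncached HDD.
foreach (glob('/path/to/userfiles/*.json') as $file) {
    // if (!file_exists($file)) continue; // redundant: glob() already listed it
    // $age = filemtime($file);           // extra stat call, skip unless needed
    $contents = @file_get_contents($file); // the one read; @ covers files deleted mid-loop
    if ($contents !== false && strpos($contents, 'needle') !== false) {
        echo $file, PHP_EOL;
    }
}
```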

The operating system and HDD controller can group I/O operations to work around the low IOPS available. For example, if two files sit close to each other on the disk, you can read both (or more) of them at the cost of reading just one (I'm simplifying here, but let's not get too technical). So, contrary to some beliefs, reading multiple files at once (for example with a threaded program, xargs, etc.) might greatly improve performance.

Unfortunately, this only works if the files are physically close to each other on the disk. If you really want to speed things up, first decide in what order your application is going to read the files, as that's crucial for the next step. Once you've figured it out, you can erase the drive completely (assuming you can) and write the files back to it sequentially in that order. This should place the files side by side and improve the effective IOPS when reading them in parallel.
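
If you do go that far, the rewrite step could be as simple as copying the files back in the planned order. A sketch; $plannedOrder and both mount points are assumptions for illustration:

```php
<?php
// After reformatting the drive, write the files back sequentially in the
// exact order the search will later read them, so they end up adjacent.
$plannedOrder = json_decode(file_get_contents('/root/read-order.json'), true);

foreach ($plannedOrder as $name) {
    copy('/mnt/backup/' . $name, '/mnt/userfiles/' . $name);
}
```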

Next, go to the shell and use a program that can process files in parallel. PHP has support for pthreads, but don't go down that route. xargs with multiple processes (the -P option) might be helpful if your application itself is single-threaded. Read the shell_exec() output and process it in your PHP program.
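
A hedged sketch of that pipeline: grep does the content matching, xargs -P runs four grep processes in parallel, and PHP only collects and parses the matching files. The directory, query, and process count are all assumptions:

```php
<?php
// Let grep -l report only the names of matching files, with xargs fanning
// the work out over 4 parallel processes. escapeshellarg() guards the
// user-supplied query against shell injection.
$dir   = '/path/to/userfiles';
$query = 'needle';

$cmd = sprintf(
    'find %s -maxdepth 1 -name %s -print0 | xargs -0 -P 4 grep -l %s',
    escapeshellarg($dir),
    escapeshellarg('*.json'),
    escapeshellarg($query)
);

$output  = shell_exec($cmd);
$matches = $output === null ? [] : array_filter(explode("\n", trim($output)));

foreach ($matches as $file) {
    // Only now parse the few files that actually matched.
    $data = json_decode(file_get_contents($file), true);
    // ... use $data ...
}
```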

SSD

As with an HDD, parallel processing might help. It would be best, however, to see your code first, as I/O might not be the problem at all.
