unresolved_external unresolved_external - 21 days ago 12
C++ Question

What is an optimal way to find duplicate files in C++?

I want to find duplicate files on the file system in C++. Is there an algorithm that does this as fast as possible? And do I need to create a multi-threaded application, or can I just use one thread?


I concur with Kerrek SB that there are better tools for this than C++. However, assuming you really need to do this in C++, here are some suggestions and things to consider in your implementation:

  1. use boost::filesystem for portable filesystem traversal

  2. the "hash every file" suggestion is very reasonable, but it might be more efficient to first build a multimap keyed on file size, and then hash only the files that share their size with at least one other file.

  3. decide how you want to treat empty files and symbolic links/shortcuts

  4. decide how you want to treat special files, e.g. on Unix you have directories, FIFOs, sockets, etc.

  5. account for the fact that files or directory structure may change, disappear or move while your algorithm is running

  6. account for the fact that some files or directories may be inaccessible or broken (e.g. recursive directory links)

  7. Make the number of threads configurable, since the amount of parallelization that makes sense depends on the underlying disk hardware and configuration. It will be different on a simple hard drive than on an expensive SAN. Don't make assumptions, though; test it out. For instance, Linux is very good at caching files, so many of your reads will come from memory and thus not block on I/O.
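
To make points 2 to 6 a bit more concrete, here is a rough single-threaded sketch of the size-first approach. It rests on several assumptions that are mine, not requirements: it uses C++17 `std::filesystem` in place of boost::filesystem (the interfaces are very similar), a `std::map` of vectors rather than a literal multimap, and 64-bit FNV-1a as a simple stand-in hash. The names `hash_file` and `find_duplicates` are made up for illustration. It skips symlinks, empty files and special files, which is only one of the possible policies, and it silently ignores anything that cannot be read or that disappears mid-scan.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <map>
#include <system_error>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// 64-bit FNV-1a over the file contents: a non-cryptographic stand-in hash.
// Returns false if the file cannot be opened or read (e.g. it vanished).
bool hash_file(const fs::path& p, std::uint64_t& out) {
    std::ifstream in(p, std::ios::binary);
    if (!in) return false;
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    char buf[1 << 16];
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        for (std::streamsize i = 0; i < in.gcount(); ++i) {
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 1099511628211ULL;  // FNV prime
        }
    }
    if (in.bad()) return false;
    out = h;
    return true;
}

// Returns groups of paths whose size and content hash both match.
std::vector<std::vector<fs::path>> find_duplicates(const fs::path& root) {
    // Pass 1: bucket regular files by size; no file contents are read here.
    std::map<std::uintmax_t, std::vector<fs::path>> by_size;
    std::error_code ec;
    fs::recursive_directory_iterator it(
        root, fs::directory_options::skip_permission_denied, ec), end;
    for (; !ec && it != end; it.increment(ec)) {
        std::error_code fec;
        // Policy choice for this sketch: ignore symlinks and special files.
        if (it->is_symlink(fec) || !it->is_regular_file(fec)) continue;
        const std::uintmax_t size = it->file_size(fec);
        // Policy choice for this sketch: ignore empty and unreadable files.
        if (fec || size == 0) continue;
        by_size[size].push_back(it->path());
    }

    // Pass 2: hash only the files that share their size with another file.
    std::vector<std::vector<fs::path>> groups;
    for (auto& [size, paths] : by_size) {
        if (paths.size() < 2) continue;
        std::map<std::uint64_t, std::vector<fs::path>> by_hash;
        for (auto& p : paths) {
            std::uint64_t h = 0;
            if (hash_file(p, h)) by_hash[h].push_back(std::move(p));
        }
        for (auto& [h, same] : by_hash)
            if (same.size() > 1) groups.push_back(std::move(same));
    }
    return groups;
}
```

Calling `find_duplicates("/some/dir")` returns one vector of paths per set of likely duplicates. Since a 64-bit non-cryptographic hash can collide, you would probably want a stronger hash or a final byte-by-byte comparison before deleting anything. The hashing pass is also the natural place to add a configurable number of worker threads.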