Akitake - 6 months ago
C# Question

Find duplicate files in a directory using LINQ

I'm currently writing a program that mass downloads images from various sources with given parameters from the user.

My issue is that I don't want duplicates to happen.
I should point out that I'm dealing with mass downloads of 100 max at a time (not so massive), and that each file has a different name, so simply searching by file name wouldn't work, I need to check hashes.

Anyways, here's what I've already found:

IEnumerable<string> duplicates = Directory.GetFiles(folderPath)
    .Select(f => new
    {
        FileName = f,
        FileHash = Encoding.UTF8.GetString(new SHA1Managed().ComputeHash(new FileStream(f, FileMode.Open, FileAccess.Read)))
    })
    .GroupBy(f => f.FileHash)
    .Select(g => new { FileHash = g.Key, Files = g.Select(z => z.FileName).ToList() })
    .SelectMany(f => f.Files.Skip(1));

foreach (string duplicate in duplicates)
    File.Delete(duplicate);

My issue is that on the "File.Delete" line, I get the oh so famous error that the file is already in use by another process. I think this is because the code above never closes the FileStream it opens to compute the FileHash, so the file is still open when I try to delete it, but I don't know how to resolve that. Any ideas?

I should also point out that I've tried other solutions, like this one (without LINQ): https://www.bhalash.com/archives/13544802709
I replaced the print call with a delete; it raised no errors, but nothing was actually deleted.

Thanks in advance, I stay available for any additional information required! :)



You forgot to dispose the FileStream, so the file is still open until the GC collects the object.

You can replace the Select clause with:

.Select(f => {
    using (var fs = new FileStream(f, FileMode.Open, FileAccess.Read))
        return new
        {
            FileName = f,
            FileHash = BitConverter.ToString(SHA1.Create().ComputeHash(fs))
        };
})

Do NOT use Encoding.UTF8 to decode arbitrary bytes (which a hash is): the bytes may not form a valid UTF-8 sequence, and invalid sequences are replaced with the replacement character during decoding, so two different hashes can end up producing the same string. Use BitConverter.ToString if you must have a string, or better yet: find a different way which does not involve strings at all.

For instance, you could write:

.Select(f => {
    // Same as above, but with:
    // FileHash = SHA1.Create().ComputeHash(fs)
})
.GroupBy(f => f.FileHash, StructuralComparisons.StructuralEqualityComparer)

There is a better approach, though: group the files by size first, and compute hashes only for groups that contain more than one file. Since files with a unique size cannot be duplicates, most files are never read at all, which performs much better when there are few duplicates.
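A minimal sketch of that size-first strategy (the class and method names here are mine, not from your code):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateFinder
{
    // Returns every file that duplicates an earlier file in the directory.
    public static string[] FindDuplicates(string directory)
    {
        return Directory.GetFiles(directory)
            // Group by size first: files with a unique length cannot be duplicates,
            // so they are never opened or hashed.
            .GroupBy(f => new FileInfo(f).Length)
            .Where(g => g.Count() > 1)
            // Only now pay the cost of reading and hashing each candidate.
            .SelectMany(g => g
                .GroupBy(f => HashFile(f), StringComparer.Ordinal)
                .SelectMany(h => h.Skip(1))) // keep the first copy, flag the rest
            .ToArray();
    }

    static string HashFile(string path)
    {
        using (var sha1 = SHA1.Create())
        using (var fs = File.OpenRead(path)) // stream is disposed, so a later
            return BitConverter.ToString(    // File.Delete won't fail
                sha1.ComputeHash(fs));
    }
}
```

The caller can then delete (or just inspect) the returned paths; since every stream is disposed inside `HashFile`, the "file in use" error from your original code cannot occur.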