I'm writing a backup solution (of sorts). Put simply, it copies a file from location C:\ and pastes it to location Z:\
To keep it fast, before copying and pasting it checks whether the original file exists. If it does, it performs a few 'calculations' to work out whether the copy should continue or whether the backup file is already up to date. It is these calculations I'm finding difficult.
Originally, I compared the file size, but this is not good enough because a file can easily change and still be the same size (for example, saving the character C in Notepad produces the same size as saving the character T).
So, I need to find out if the modified date differs. At the moment, I get the file info using the
Going by modified date will be unreliable: the computer clock can go backwards when it synchronizes or when it is manually adjusted, and some programs do not manage the modified date well when they modify or copy files.
Going by the archive bit might work in a controlled environment, but what happens if another piece of software that also uses the archive bit is running?
If you want (almost) complete reliability then what you should do is store a hash value of the last backed up version using a good hashing function like SHA1, and if the hash value changes then you upload the new copy.
Here is the SHA1 class along with a code sample on the bottom:
Just run the file bytes through it and store the hash value. Pass a FileStream to it instead of loading your file into memory as a byte array; this reduces memory usage, especially for large files.
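To make the streaming idea concrete, here is a minimal sketch in Python rather than .NET (my choice of language for illustration only; the function name `file_sha1` and the 1 MiB chunk size are assumptions, not part of the SHA1 class mentioned above). It reads the file in fixed-size chunks, so memory use stays flat no matter how large the file is.

```python
import hashlib


def file_sha1(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of the file at `path`.

    The file is read in chunks of `chunk_size` bytes (an arbitrary,
    illustrative default) instead of being loaded into memory at once.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as stream:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two one-byte files containing C and T have the same size but produce different digests, which is exactly the case a size comparison misses.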
You can combine this with the modified date in various ways to tune your program for speed and reliability. For example, you can check modified dates for most backups and periodically run a hash check while the system is idle to make sure nothing got missed. Sometimes the modified date will change while the file contents stay the same (e.g. the file got overwritten with the same data); in that case, once you recompute the hash and see that it has not changed, you can avoid resending the whole file.
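One possible way to combine the two checks is sketched below in Python (for illustration only; `needs_backup`, `sha1_of`, and the layout of the stored `record` are hypothetical names I made up, not an existing API). The cheap size and modified-date comparison runs first, and the hash is only computed when that comparison fails or when a full check is forced, as an idle-time verifier might do.

```python
import hashlib
import os


def sha1_of(path, chunk_size=1024 * 1024):
    """SHA-1 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha1()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_backup(path, record, force_hash=False):
    """Decide whether `path` has to be copied again.

    `record` is the (hypothetical) entry saved at the last backup,
    {"size": int, "mtime_ns": int, "sha1": str}, or None if the file
    was never backed up. Returns (copy_needed, record_to_store).
    """
    stat = os.stat(path)
    if record is not None and not force_hash:
        same_size = stat.st_size == record["size"]
        same_mtime = stat.st_mtime_ns == record["mtime_ns"]
        if same_size and same_mtime:
            # Fast path: metadata matches, so assume the content is unchanged.
            return False, record
    # Slow path: metadata differs, no record exists, or a full check was requested.
    digest = sha1_of(path)
    new_record = {"size": stat.st_size, "mtime_ns": stat.st_mtime_ns, "sha1": digest}
    if record is not None and digest == record["sha1"]:
        # Content is identical; refresh the stored metadata but skip the copy.
        return False, new_record
    return True, new_record
```

The fast path is where the risk lives: a same-size edit whose modified date was reset would slip through, which is why an occasional run with `force_hash=True` is worth scheduling.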
Most version control systems use some kind of combined approach with hashes and modified dates.
If you don't want to do a full backup and send all the data over each time, your approach will generally involve some kind of risk management: a compromise between performance and reliability. For this reason, it's important to do "full backups" once in a while.