FocusedEnergy - 6 months ago
Linux Question

Bash command to archive files daily based on date added

I have a suite of scripts that involve downloading files from a remote server and then parsing them. Each night, I would like to create an archive of the files downloaded that day.

Some constraints are:


  • Downloading from a Windows server to an Ubuntu server.

  • Inability to delete files on the remote server.

  • Require the date added to the local directory, not the date the file was created.

  • I have deduplication running at the download stage (using ncftp), but the check works by comparing the remote and local directories. One strategy is to create a new folder each day, download files into it and then tar it sometime after midnight. The problem is that the first scheduled download of the new day will grab ALL files on the remote server, because the new local folder is empty.



Because of the constraints, I considered simply archiving files based on "date added" to a central folder. This works very well using a Mac because HFS+ stores extended metadata such as date created and date added. So I can combine a tar command with something like below:

mdls -name kMDItemFSName -name kMDItemDateAdded -raw *.xml | \
xargs -0 -I {} echo {} | \
sed 'N;s/\n/ /' | \


but there doesn't seem to be an analogue under Linux (at least not with ext4, as far as I am aware).

I am open to any form of solution to get around doubling up files into a subsequent day. The end result should be an archives directory full of tar.gz files looking something like:

files_$(date +"%Y-%m-%d").tar.gz

Answer

Depending on the method used to back up the files, either the modified or the changed time should reflect when the file was copied. For example, if you used cp -p to back them up, the modified time would be preserved from the original, but the changed time would reflect the time of the copy.

You can get this information using the stat command:

stat <filename>

which will return the following (along with other file related info not shown):

Access: 2016-05-28 20:35:03.153214170 -0400
Modify: 2016-05-28 20:34:59.456122913 -0400
Change: 2016-05-29 01:39:52.070336376 -0400

This output is from a file that I copied using cp -p at the time shown as 'change'.
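
To see this behaviour for yourself, here is a minimal sketch, assuming GNU coreutils (stat -c, touch -d). The file names and the backdated timestamp are made up for illustration. It backdates a file's modified time, copies it with cp -p, and compares the timestamps in seconds since the epoch:

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)

# Create a file and backdate its modified time (illustrative timestamp).
touch -d '2016-05-28 20:34:59' "$workdir/original.xml"

# Copy it while preserving timestamps, as cp -p does.
cp -p "$workdir/original.xml" "$workdir/copy.xml"

# %Y = modified time, %Z = changed time, both in seconds since the epoch.
orig_mtime=$(stat -c '%Y' "$workdir/original.xml")
copy_mtime=$(stat -c '%Y' "$workdir/copy.xml")
copy_ctime=$(stat -c '%Z' "$workdir/copy.xml")

echo "original modified: $(date -d "@$orig_mtime" '+%F %T')"
echo "copy modified:     $(date -d "@$copy_mtime" '+%F %T')"
echo "copy changed:      $(date -d "@$copy_ctime" '+%F %T')"

rm -rf "$workdir"
```

The copy should keep the old modified time while its changed time is the moment of the copy, which is why the changed time is the closest thing to a "date added" here.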

You can get just the change time by calling stat with a specified format:

stat -c '%z' <filename>
2016-05-29 01:39:56.037433640 -0400

or with a capital Z for that time in seconds since the epoch. You could combine that with the date command to pull out just the date (or use grep, etc.):

date -d "$(stat -c '%z' <filename>)" -I
2016-05-29
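
If you need this for many files, the two steps can be wrapped in a small function. This is only a sketch, assuming GNU stat and date; change_date is a hypothetical helper name, not a standard command. It uses the capital-Z (epoch seconds) form so that date does not have to re-parse the human-readable string:

```shell
# change_date FILE: print the date (YYYY-MM-DD) of FILE's last status change.
# Hypothetical helper; relies on GNU stat (-c) and GNU date (-d @SECONDS).
change_date() {
    local epoch
    epoch=$(stat -c '%Z' -- "$1") || return 1
    date -d "@$epoch" +%F
}

# Example: a file created just now reports today's date.
tmpfile=$(mktemp)
change_date "$tmpfile"
rm -f "$tmpfile"
```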

The find command can select files by time frame, in this case using the flags -cmin 'changed minutes', -mmin 'modified minutes' or, less likely, -amin 'accessed minutes'. The sequence of commands to get the minutes since midnight is a little ugly, but it works.

We have to pass find an argument of "minutes since a file was last changed" (or modified, if that criterion works). So first you have to calculate the minutes since midnight, then run find.

min_since_mid=$(echo $(( $(date +%s) - $(date -d "$(date -I) 0" +%s) )) / 60 | bc)

Unrolling that a bit:

  • $(date +%s) == seconds since epoch until 'now'
  • "$(date -I) 0" == today's date in the format "YYYY-MM-DD 0", with the trailing 0 setting the time to the start of the day
  • $(date -d "$(date -I) 0" +%s) == seconds from epoch until today at midnight
  • Then we (effectively) echo ( $now - $midnight ) / 60 to bc to convert the result into minutes.
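
The same arithmetic can be sketched step by step with shell arithmetic alone, which avoids the dependency on bc. The intermediate variable names now and midnight are only for illustration, and GNU date is assumed:

```shell
# Seconds since the epoch until now.
now=$(date +%s)

# Seconds since the epoch until today at midnight (local time).
midnight=$(date -d "$(date -I) 0" +%s)

# Integer division converts the difference into whole minutes.
min_since_mid=$(( (now - midnight) / 60 ))

echo "$min_since_mid minutes since midnight"
```

Because the division truncates, the value is 0 during the first minute after midnight.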

The find call is passed the minutes since midnight with a leading '-', meaning "changed less than X minutes ago". A leading '+' would mean more than X minutes ago.

find /path/to/base/folder -cmin -"$min_since_mid"
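
As an alternative (not the method described above), GNU find also has -newerXY tests that compare against a timestamp directly, which skips the minute arithmetic altogether. A minimal sketch, assuming GNU find and a throwaway directory created only for the demonstration:

```shell
# Throwaway directory with one freshly created file (illustrative names).
demo_dir=$(mktemp -d)
touch "$demo_dir/downloaded_today.xml"

# -newerct: changed time (c) is newer than the given time string (t).
# A bare date such as 2016-05-29 is read as midnight at the start of that day.
today=$(date +%F)
changed_today=$(find "$demo_dir" -type f -newerct "$today")

echo "$changed_today"
```

Swap -newerct for -newermt if the modified time turns out to be the field that matches your downloads.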

The actual answer

Finally, to create a tgz archive of the files in the given directory (and subdirectories) that have been changed since midnight today, use these two commands:

min_since_mid=$(echo $(( $(date +%s) - $(date -d "$(date -I) 0" +%s) )) / 60 | bc)

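find /path/to/base/folder -type f -cmin -"${min_since_mid:-0}" -print0 | tar czvf /path/to/new/tarball.tgz --null -T -

The -print0 argument tells find to delimit the file names with a null character, which prevents issues with spaces in names, among other things; tar then reads that list from standard input with --null -T -. Piping into a single tar call, rather than using -exec ... {} +, avoids the case where find splits a long file list across several tar invocations, each of which would overwrite the tarball. The -type f test stops a changed directory from dragging all of its contents into the archive.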

The only thing I'm not sure about is whether you should use the changed time (-cmin), the modified time (-mmin) or the accessed time (-amin). Take a look at your backup files and see which field accurately reflects the date/time of the backup. I would think changed time, but I'm not certain.

Update: changed -"$min_since_mid" to -"${min_since_mid:-0}" so that if min_since_mid isn't set, find won't error out with an invalid argument; you just won't get any results. You could also surround the find with an if statement to block the call if that variable isn't set properly.
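
Putting the pieces together for a nightly cron job, here is one possible sketch rather than a definitive script. It assumes GNU date, find and tar. The function name archive_today and its two arguments are hypothetical. It writes to the files_YYYY-MM-DD.tar.gz name from the question and includes the if guard mentioned in the update:

```shell
#!/usr/bin/env bash
# archive_today BASE_DIR ARCHIVE_DIR   (hypothetical helper)
# Archives regular files under BASE_DIR whose changed time is after midnight.
archive_today() {
    local base_dir=$1 archive_dir=$2
    local now midnight min_since_mid archive

    [ -d "$base_dir" ] || { echo "no such directory: $base_dir" >&2; return 1; }

    now=$(date +%s)
    midnight=$(date -d "$(date -I) 0" +%s)
    min_since_mid=$(( (now - midnight) / 60 ))

    # Guard: skip the find call unless we have a positive whole number.
    if ! [[ $min_since_mid =~ ^[0-9]+$ ]] || (( min_since_mid == 0 )); then
        echo "nothing to archive yet (min_since_mid=$min_since_mid)" >&2
        return 1
    fi

    mkdir -p "$archive_dir"
    archive="$archive_dir/files_$(date +%Y-%m-%d).tar.gz"

    # Names are listed relative to base_dir and null-delimited, then handed
    # to a single tar invocation so the archive is written exactly once.
    (cd "$base_dir" && find . -type f -cmin -"$min_since_mid" -print0) |
        tar -czf "$archive" -C "$base_dir" --null -T -

    echo "$archive"
}
```

It could be called late in the evening as archive_today /path/to/base/folder /path/to/archives. Keep the archives directory outside the base folder; otherwise that day's tarball, and any older ones whose changed time is recent, could end up inside the new archive.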