Peter Piper - 5 months ago
Ruby Question

Delete files with almost similar name with a Ruby script

I have a list of photos with names that look like this:

/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59-1.jpg
/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59.jpg
/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.54-1.jpg
/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.54.jpg
[...]


I am trying to delete all the photos whose names are "similar". What I am trying to do is some kind of pattern matching.

How can I find out if the first n characters are the same for two strings?

Answer

The problem

Reading between the lines, it appears that you have determined that, in

arr = [
  '/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59-1.jpg',
  '/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59.jpg',
  '/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59-2.jpg',
  '/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.55-21.jpg',
  '/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.54-2.jpg',
  '/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.54.jpg',
  '/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.55-32.jpg',
  '/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.56-1.jpg'
]

the files arr[0,3] are the same, because their names differ only by an optional hyphen followed by one or more digits immediately before ".jpg", and you want to delete all but one file in the group (i.e. delete two of the three files). Moreover, if one file in the group lacks the optional hyphen and digits before ".jpg", that file (the "base file") is the one that is not to be deleted. Similarly, all but one file of the group [arr[3], arr[6]] and of the group arr[4,2] are to be deleted. The group [arr[7]] contains only one file, so no file from that group is to be deleted.

Code

You can do that by using the regular expression

r = /
    \A           # match beginning of string
    .+?          # match one or more of any character, lazily
    (?=          # begin a positive lookahead
      (?:-\d+)?  # match hyphen, one or more digits, in non-capture group, optionally ("?")
      \.jpg\z    # match ".jpg" followed by end of string
    )            # end positive lookahead
    /x           # free-spacing regex definition mode

in conjunction with the instance methods Enumerable#group_by, String#[], Hash#values, Enumerable#flat_map and Array#-, and the class method File::delete:

arr.group_by { |f| f[r] }.values.flat_map { |a| a-[a.max] }.each { |f| File.delete(f) } 
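To see what the regular expression contributes on its own, here is a minimal sketch that applies it to a few paths. The constant GROUP_KEY and the helper group_key are names made up for this illustration; the pattern is the same as r above.

```ruby
# Same pattern as r above, stored in a constant (hypothetical name).
GROUP_KEY = /
  \A           # beginning of string
  .+?          # one or more characters, lazily
  (?=          # positive lookahead
    (?:-\d+)?  # optional hyphen followed by digits
    \.jpg\z    # ".jpg" at the end of the string
  )
/x

# Hypothetical helper: returns the part of the path that a group of
# "similar" files shares, or nil if the path does not end in ".jpg".
def group_key(path)
  path[GROUP_KEY]
end

dir = '/Users/foo/Desktop/argentinien-chile 2'
puts group_key("#{dir}/2009-12-21 17.05.59-1.jpg")
  # prints /Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59
puts group_key("#{dir}/2009-12-21 17.05.59.jpg")
  # prints the same key, so both files land in one group
p group_key("#{dir}/notes.txt")
  # prints nil
```

Because String#[] returns nil when the pattern does not match, any file that does not end in ".jpg" would be grouped under the key nil, so it is probably worth making sure arr contains only ".jpg" paths before deleting anything.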

Explanation

The steps follow.

h = arr.group_by { |f| f[r] }
  #=> {"/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59"=>
  #      ["/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59-1.jpg",
  #       "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59.jpg",
  #       "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59-2.jpg"],
  #    "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.55"=>
  #      ["/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.55-21.jpg",
  #       "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.55-32.jpg"],
  #    "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.54"=>
  #      ["/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.54-2.jpg",
  #       "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.54.jpg"],
  #    "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.56"=>
  #      ["/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.56-1.jpg"]} 
v = h.values
  #=> [["/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59-1.jpg",
  #     "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59.jpg",
  #     "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59-2.jpg"],
  #    ["/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.55-21.jpg",
  #     "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.55-32.jpg"],
  #    ["/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.54-2.jpg",
  #     "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.54.jpg"],
  #    ["/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.56-1.jpg"]]
b = v.flat_map { |a| a-[a.max] }
  #=> ["/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59-1.jpg",
  #    "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.05.59-2.jpg",
  #    "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.55-21.jpg",
  #    "/Users/foo/Desktop/argentinien-chile 2/2009-12-21 17.16.54-2.jpg"] 

Note that "." > "-" #=> true, so if a group contains a "base file", that file is a.max and is therefore the one that is not deleted.
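The comparison can be checked directly. Ruby compares strings byte by byte, and "." (byte 46) sorts after "-" (byte 45). The sketch below uses shortened file names and a helper called to_delete, both made up for this illustration.

```ruby
# Hypothetical helper: everything in the group except its maximum.
def to_delete(group)
  group - [group.max]
end

# Shortened sample names; only the endings matter for the comparison.
group = ['17.05.59-1.jpg', '17.05.59.jpg', '17.05.59-2.jpg']

p '.'.ord            # 46
p '-'.ord            # 45
p '.' > '-'          # true
p group.max          # "17.05.59.jpg"
p to_delete(group)   # ["17.05.59-1.jpg", "17.05.59-2.jpg"]

# With no base file, max is simply the lexicographically largest name.
p to_delete(['17.16.55-21.jpg', '17.16.55-32.jpg'])   # ["17.16.55-21.jpg"]
```

The ordering is lexicographic rather than numeric, so a name ending in "-9.jpg" sorts after one ending in "-10.jpg". That only affects which duplicate survives in a group that has no base file.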

b.each { |f| File.delete(f) } 
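Since File.delete cannot be undone, it may be worth trying the whole pipeline on throwaway files first. The following is a rough sketch under a few assumptions: the helper name duplicates, the temporary directory and the three sample file names are invented for the demo, and the select step is an added guard (not part of the code above) that skips paths not ending in ".jpg".

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: returns the paths that would be deleted, keeping
# one file (the max) per group. Paths not ending in ".jpg" are skipped.
def duplicates(paths)
  r = /\A.+?(?=(?:-\d+)?\.jpg\z)/
  paths.select { |f| f.match?(r) }
       .group_by { |f| f[r] }
       .values
       .flat_map { |a| a - [a.max] }
end

Dir.mktmpdir do |dir|
  # Create empty sample files in a temporary directory.
  names = ['2009-12-21 17.05.59-1.jpg',
           '2009-12-21 17.05.59.jpg',
           '2009-12-21 17.16.56-1.jpg']
  names.each { |n| FileUtils.touch(File.join(dir, n)) }

  # Build the list of full paths, as arr is in the answer.
  arr = Dir.glob('*.jpg', base: dir).map { |n| File.join(dir, n) }

  doomed = duplicates(arr)
  puts doomed.map { |f| File.basename(f) }   # dry run: inspect before deleting
    # prints 2009-12-21 17.05.59-1.jpg

  doomed.each { |f| File.delete(f) }
  puts Dir.children(dir).sort
    # prints 2009-12-21 17.05.59.jpg
    #        2009-12-21 17.16.56-1.jpg
end
```

Printing the list before calling File.delete gives a chance to confirm that only the intended duplicates are selected.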