Dellu Dellu - 1 month ago 4

Find and remove duplicates in a BibTeX library (BibDesk) using AppleScript

I have more than a thousand duplicates in my BibTeX library. The duplicates do not have identical cite keys; they have identical titles.
I have tried both BibDesk and JabRef to remove the duplicates. However, neither manages to find them all, not even half of them.

I found one promising AppleScript here:

But since I am a total beginner with AppleScript, I couldn't adapt it to my needs.

Here is the AppleScript:

on run {}
    CleanupDuplicates() -- call the handler; with an empty run handler the script does nothing
end run

-- IMPORTANT NOTE: The following routine is an identical copy as contained in files 'Cleanup Duplicates.scpt' and 'Fix PDF and URL Links.scpt'. Make sure the two copies are always kept identical.
on CleanupDuplicates()
    set theBibDeskDocu to document 1 of application "BibDesk"
    tell document 1 of application "BibDesk"
        -- get all publications sorted by cite key ensuring that in any set of publications with the same cite key the youngest comes first and the oldest, typically the only one of the set that is still member of any static groups, comes last. To retain static group memberships we have to ensure that such "membership info" is copied from the last to the first publication of any set of publications with the same cite key (see vars 'aPub', 'prevPub', 'youngestPub').
        set thePubs to (sort (get publications) by "Cite Key" subsort by "Date-Added" without ascending)
        set theDupes to {}
        set prevCiteKey to missing value
        set prevPub to missing value
        set youngestPub to missing value
        repeat with aPub in thePubs
            set aCiteKey to cite key of aPub
            ignoring case
                if aCiteKey is prevCiteKey then
                    set end of theDupes to aPub
                    -- we fix the static group membership redundantly in cases where aPub is also merely an obsolete duplicate, since we have possibly not yet advanced to the end of the set with the same cite key. But this is unavoidable with this algorithm looping simply through all publications. The end result will be that youngestPub (first in set of publications with same cite key) will be member of all static groups of the publications in the set (unification). The latter should be no big issue, since typically in multiple sets of publications it is only the last publication that matters. If this should be an issue, then we would need to first delete all static group membership info in 'youngestPub' in case we encounter a 3rd, or 4th etc. same cite key in 'aPub', and copy only those of 'aPub'. However, for the sake of efficiency I wish not to support this behavior.
                    my fixGroupMembership(theBibDeskDocu, aCiteKey, aPub, youngestPub)
                else
                    -- remember in 'youngestPub' possible candidate for a new set of publications with the same cite key
                    set youngestPub to aPub
                end if
            end ignoring
            set prevCiteKey to aCiteKey
            set prevPub to aPub
        end repeat
        repeat with aPub in theDupes
            delete aPub
        end repeat
    end tell
end CleanupDuplicates

on fixGroupMembership(theBibDeskDocu, theCiteKey, oldPub, newPub)
    tell application "BibDesk"
        tell theBibDeskDocu
            set thePubsGroups to (get static groups whose publications contains oldPub)
            if (count of thePubsGroups) is greater than 0 then
                repeat with aGroup in thePubsGroups
                    add newPub to aGroup
                end repeat
            end if
        end tell
    end tell
end fixGroupMembership

So what I want is to find the duplicates by title and, within each set of duplicates, delete the oldest (that is, oldest by modification date).

Can you guys help me modify this script please?


Use this script:

on run {}
    CleanupDuplicates() -- call the handler; with an empty run handler the script does nothing
end run

on CleanupDuplicates()
    script o
        property thePubs : {}
    end script
    tell document 1 of application "BibDesk"
        -- get all publications sorted by Title (same titles are sorted by Date-Modified, descending)
        set o's thePubs to (sort (get publications) by "Title" subsort by "Date-Modified" without ascending)
        set tc to count o's thePubs
        set i to 1

        repeat while i < tc
            set theTitle to title of item i of o's thePubs
            repeat with j from (i + 1) to tc -- check the next title
                considering case --  match the case, *** remove this if you want to ignore the case
                    if (title of item j of o's thePubs) is not theTitle then exit repeat ---  not the same title, so exit this loop ---
                end considering

                delete item j of o's thePubs --- the title is the same, so remove this publication (a duplicate, oldest modification date) ---
            end repeat
            set i to j
        end repeat
    end tell
end CleanupDuplicates
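
Before letting the script delete anything, it may help to see which positions the scan would flag. Here is a minimal sketch of the same adjacent-title comparison on a plain list of strings, with no BibDesk involved. The handler name duplicatePositions and the sample titles are made up for illustration, and the sketch assumes the list is already sorted so that equal titles sit next to each other (newest first within each run), as the sort command above is meant to arrange.

```applescript
-- Hypothetical helper for a dry run: it only reports positions, it deletes nothing.
-- Assumes sortedTitles is sorted so that identical titles are adjacent.
on duplicatePositions(sortedTitles)
    set dupes to {}
    set previousTitle to missing value
    repeat with k from 1 to (count sortedTitles)
        set currentTitle to item k of sortedTitles
        considering case -- remove this block wrapper to ignore case, as in the script above
            if currentTitle is previousTitle then
                -- same title as the entry before it, so this one would be deleted
                set end of dupes to k
            end if
        end considering
        set previousTitle to currentTitle
    end repeat
    return dupes
end duplicatePositions

duplicatePositions({"Alpha", "Alpha", "Beta", "Gamma", "Gamma", "Gamma"})
--> {2, 5, 6}
```

The first entry of each run of equal titles is kept, which corresponds to the most recently modified publication when the list is sorted as described.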


Caveat: some publications have no modification date.

To sort publications by modification date properly, you need to define the Date-Modified field on publications that have not been modified.

An AppleScript can't change the date property of a publication in BibDesk because these dates are read-only.

Here's a solution:

  1. Close the document in BibDesk.
  2. Open the ".bib" file in the "TextWrangler" application.
  3. Run this script:


-- This script adds a modification date to publications that have no "Date-Modified" field; the date used is that of "Date-Added".
-- Open the ".bib" file in "TextWrangler", then run this script.
tell application "TextWrangler"
    tell text document 1
        select line 1 -- to start the search at the beginning of the document

        repeat -- until not found
            -- search "Date-Added" + (a blank line or the end of the document)
            set r to find "(?s)^\\tDate-Added = {.+?(^$|\\z)" searching in it options {search mode:grep, wrap around:false} with selecting match
            if found of r then
                if "Date-Modified = {" is not in (found text of r) then -- the Date-Modified field is not in this publication
                    set x to startLine of found object of r
                    set t to text 12 thru -1 of (get contents of line x) -- get the value of the Date-Added field --> " = {2016.09.10 03:34}," as example
                    add suffix (line x) suffix "\\n\\tDate-Modified" & t -- append (a line break + a tab + "Date-Modified" + the value of the Date-Added) to this line
                end if
            else
                exit repeat -- not found, or the end of the document was reached
            end if
        end repeat
    end tell
end tell
  4. From TextWrangler, Save or "Save as..." and close the document.
  5. Open the ".bib" file in BibDesk.
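
To make it clearer what the TextWrangler step does to each entry, here is a rough sketch of the same transformation applied to a single entry held in a plain string. It is only an illustration: the handler name addMissingDateModified and the sample entry are hypothetical, and it assumes each field sits on its own line starting with a tab, as in the search pattern above.

```applescript
-- Hypothetical helper (not part of BibDesk or TextWrangler): returns the entry text
-- with a Date-Modified line added right after Date-Added when the entry has none.
on addMissingDateModified(entryText)
    if entryText contains "Date-Modified = {" then return entryText
    set outLines to {}
    repeat with aLine in paragraphs of entryText
        set lineText to contents of aLine
        set end of outLines to lineText
        if lineText starts with (tab & "Date-Added = {") then
            -- characters 12 onward are what follows the tab and "Date-Added",
            -- for example " = {2016.09.10 03:34},"
            set end of outLines to tab & "Date-Modified" & (text 12 thru -1 of lineText)
        end if
    end repeat
    -- join the lines again with linefeeds, then restore the previous delimiters
    set savedDelimiters to AppleScript's text item delimiters
    set AppleScript's text item delimiters to linefeed
    set resultText to outLines as text
    set AppleScript's text item delimiters to savedDelimiters
    return resultText
end addMissingDateModified

set sampleEntry to "@article{key1," & linefeed & tab & "Date-Added = {2016.09.10 03:34}," & linefeed & tab & "Title = {Some title}}"
addMissingDateModified(sampleEntry)
```

An entry that already contains a Date-Modified field is returned unchanged, mirroring the check in the TextWrangler script.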