gizmo gizmo - 1 year ago 47
Bash Question

Delete lines that only contain 1 of each item

I have a rather interesting problem that I'm not sure how to approach. My file looks something like this:

GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe,
GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree,
GROUP30, 2 Green.bow, 4 Big.tree,
GROUP170, 1 Yellow.banana, 2 Green.apple, 1 Blue.skirt, 1 Purple.top, 1 Silver.shoe,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe,
GROUP7, 20 Red.apple, 20 Black.pencil, 20 Blue.pen, 20 Green.pen, 20 Little.tree,


Each line contains a group which contains items, e.g. GROUP1 contains 1 Tall.hat, 1 Bow.tie and 1 Shiny.shoe. Columns are separated by commas. I want to delete lines (or GROUPs) that only contain 1 of each item.

Desired output:

GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree,
GROUP30, 2 Green.bow, 4 Big.tree,
GROUP170, 1 Yellow.banana, 2 Green.apple, 1 Blue.skirt, 1 Purple.top, 1 Silver.shoe,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe,
GROUP7, 20 Red.apple, 20 Black.pencil, 20 Blue.pen, 20 Green.pen, 20 Little.tree,


So GROUP1 has been deleted because it only contains 1 of each item. All other groups have at least one item with two copies or more.

Thoughts so far:

I need to ignore (but retain) column 1, since that contains the group number. So start off with something like awk -F "," 'NF>1'. Then for each row, cycle through all the columns and record all the possible numbers found, e.g. GROUP1=1; GROUP2=1350 or 1; GROUP30=2 or 4; GROUP170=1 or 2. If the only unique number found is 1, then delete that line.

I'm not sure how to actually implement this, though. Any ideas would be great!

Answer Source

Here's a solution using awk:

awk -F', *' '{ 
    split("", counts) # empty the counts array at the start of each line
    for (i = 2; i <= NF; ++i) { # loop through fields, starting from 2nd
        split($i, a, /[. ]/) # split each field into parts
        counts[a[3]] += a[1] # accumulate count for each type
        if (counts[a[3]] > 1) { print; next } # print and skip to next line
    }
}' file

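counts will contain keys like "apple", "pencil", "pen", etc. For each key, the value is the running total of that item type on the current line. As soon as any total exceeds 1, the line is printed and awk moves on to the next one.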

If you want to keep separate counts for "Blue.pen" and "Green.pen", then just split on a single space with split($i, a, / /), rather than on spaces and dots. Each field will then be split into only two parts, so replace a[3] with a[2] in the subsequent lines.
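The variant described above can be sketched as follows. This is only an illustration: the function name filter_groups and the extra GROUP8 sample line are made up for the example and are not part of the original question or answer.

```shell
# Hypothetical helper: print lines where at least one full item name
# (e.g. "Blue.pen") has a total quantity greater than 1.
filter_groups() {
    awk -F', *' '{
        split("", counts)              # clear counts for this line
        for (i = 2; i <= NF; ++i) {    # skip field 1 (the group name)
            split($i, a, / /)          # a[1] = quantity, a[2] = full item name
            counts[a[2]] += a[1]
            if (counts[a[2]] > 1) { print; next }
        }
    }' "$@"
}

# Sample input; GROUP8 is an invented line showing that Blue.pen and
# Green.pen are now counted separately.
sample='GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe,
GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree,
GROUP30, 2 Green.bow, 4 Big.tree,
GROUP8, 1 Blue.pen, 1 Green.pen,'

result=$(printf '%s\n' "$sample" | filter_groups)
printf '%s\n' "$result"
```

With this version GROUP8 is dropped, whereas the original script would keep it because its two pens add up to 2.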

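Splitting an empty string to clear the counts array is a portable workaround for awk implementations that cannot delete a whole array in one statement. In GNU awk (and most other current implementations), it can be replaced by delete counts.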
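For reference, here is a sketch of the same script using whole-array deletion instead of the split trick. It assumes an awk that accepts delete on an entire array (GNU awk does). The function name filter_by_type and the GROUP8 sample line are invented for the example.

```shell
# Hypothetical helper: same logic as the answer, but clears the array
# with "delete counts" (assumes the awk in use supports this form).
filter_by_type() {
    awk -F', *' '{
        delete counts                  # clear counts for this line
        for (i = 2; i <= NF; ++i) {    # skip field 1 (the group name)
            split($i, a, /[. ]/)       # a[1] = quantity, a[3] = item type
            counts[a[3]] += a[1]
            if (counts[a[3]] > 1) { print; next }
        }
    }' "$@"
}

# GROUP8 is an invented line: its two pens add up to 2, so it is kept.
lines='GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe,
GROUP8, 1 Blue.pen, 1 Green.pen,'

kept=$(printf '%s\n' "$lines" | filter_by_type)
printf '%s\n' "$kept"
```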
