More Than Five - 4 months ago
Linux Question

Uniq but only on part of the string

I have strings such as:

import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
import z.y.x.d.f.Class5
import z.y.x.d.f.Class6


I want to get all unique occurrences of the first part of each string, specifically everything up to the third period. So I do:

grep "import curam" -hr --include \*.java | sort | gawk -F "." '{print $1"."$2"."$3}' | uniq


which gives me:

import a.b.c
import a.b.g
import a.b.h
import z.y.x


However, I'd like to get the full string of the first occurrence for each unique prefix (the part up to the third period). So, I want to get:

import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4


Any ideas?

Answer

Just keep track of the unique 2nd field:

awk -F '[ .]' '!uniq[$2]++' file

That is, start by setting the field separators to either a space or a dot. This way, the second field is always the first word in the dot-separated name:

$ awk -F '[ .]' '{print $2}' file
a
a
a
z
z
z

Then, just check when they appear for the first time:

$ awk -F '[ .]' '!uniq[$2]++' file
import a.b.c.d.f.Class1
import z.y.x.d.f.Class4
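
The one-liner works because an awk pattern with no action prints the line whenever the pattern is true, and `!uniq[$2]++` is true only the first time a given key is seen. A longer sketch of the same logic, assuming the sample input is written to a temporary file created only for this demonstration (the names `input`, `seen`, `key` and `first_seen` are illustrative, not from the answer):

```shell
# Sample input from the question, stored in a temporary file (illustrative path).
input=$(mktemp)
cat > "$input" <<'EOF'
import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
import z.y.x.d.f.Class5
import z.y.x.d.f.Class6
EOF

# Expanded form of '!uniq[$2]++': print a line only the first time its key appears.
first_seen=$(awk -F '[ .]' '
{
    key = $2                 # first component of the package name
    if (!(key in seen)) {    # key not recorded yet, so this is its first occurrence
        seen[key] = 1        # remember the key
        print                # print the whole original line
    }
}' "$input")

printf '%s\n' "$first_seen"
rm -f "$input"
```

With the sample input this should print the Class1 and Class4 lines, matching the output of the one-liner above.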

There are some subtle variations in the first three tokens between the strings, so I need to split on just [.]; I can't split on space. I updated the question.

So if you have:

import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
import z.y.x.d.f.Class5
import z.y.x.d.f.Class6

Then you need to take the second whitespace-separated field (the package name), split it on dots, and check whether its first three components have been seen before. This uses the same approach as above, except that split() breaks the field into pieces and the first three pieces form the uniqueness key:

$ awk '{split($2, a, ".")} !uniq[a[1], a[2], a[3]]++' file
import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
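
If, as the comment above suggests, splitting on space is not an option, one possible variation is to split on the dot only and use the first three dot-separated fields as the key. This is a sketch rather than part of the original answer; it assumes the text before the first dot (for example `import a`) should count as part of the key, and the names `input`, `seen` and `by_prefix` are illustrative:

```shell
# Sample input from the question, stored in a temporary file (illustrative path).
input=$(mktemp)
cat > "$input" <<'EOF'
import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
import z.y.x.d.f.Class5
import z.y.x.d.f.Class6
EOF

# Split on "." only. $1 is then "import a", $2 is "b", $3 is "c", and so on.
# The comma-separated subscript joins the three fields with awk's SUBSEP,
# so different prefixes cannot collapse into the same key.
by_prefix=$(awk -F '.' '!seen[$1, $2, $3]++' "$input")

printf '%s\n' "$by_prefix"
rm -f "$input"
```

On the sample input this should keep the Class1, Class2, Class3 and Class4 lines, the same result as the split() version.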