dlite922 - 7 months ago
Perl Question

Delete a SPECIFIC duplicate line from XML file in place

I've been reading about deleting duplicate lines all over Stack Overflow. There are Perl, awk, and sed solutions, but none is as specific as what I need, and I'm at a loss.

I want to delete the duplicate <upath> tags from this XML, case-insensitively, with a quick bash/shell Perl command. Leave all other duplicate lines (like <start> and <end>) intact!

Input XML:

<package>
<id>1523456789</id>
<models>
<model type="A">
<start>2016-04-20</start> <------ Duplicate line to keep
<end>2017-04-20</end> <------ Duplicate line to keep
</model>
<model type="B">
<start>2016-04-20</start> <------ Duplicate line to keep
<end>2017-04-20</end> <------ Duplicate line to keep
</model>
</models>
<userinterface>
<upath>/Example/Dir/Here</upath>
<upath>/Example/Dir/Here2</upath>
<upath>/example/dir/here</upath> <------ Duplicate line to REMOVE
</userinterface>
</package>


So far I've been able to find the duplicate lines, but I don't know how to remove them. The following command

grep -H path *.[Xx][Mm][Ll] | sort | uniq -id


Gives the result:

test.xml: <upath>/example/dir/here</upath>


How do I remove that line now?

Running the Perl or awk version below erases the duplicate <start> and <end> dates as well.

perl -i.bak -ne 'print unless $seen{lc($_)}++' test.xml
awk '!a[tolower($0)]++' test.xml > test.xml.new

Answer
$ awk '!(/<upath>/ && seen[tolower($1)]++)' file
  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
      <model type="B">
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
    </userinterface>
  </package>