
Bash script to filter out non-adjacent duplicates in logs

I'm trying to create a script to filter out duplicates in my logs and keep the latest occurrence of each message. A sample is below:

May 29 22:25:19 servername.com Fdm: this is error message 1 error code=0x98765
May 29 22:25:19 servername.com Fdm: this is just a message
May 29 22:25:19 servername.com Fdm: error code=12345 message 2
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890
May 29 22:25:20 servername.com Vpxa: just another message
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
May 29 22:25:30 servername.com Fdm: another error message 3 76543


The logs are split between two files. I've already gotten as far as making the script merge the two files and sort them by date using sort -s -r -k1.
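For reference, that step boils down to something like the line below (log1, log2 and merged.log are placeholder filenames). Note that -k1 with no end field makes sort compare from the first field to the end of the line, and the reverse lexical sort is only chronological within a single month:

# merge both logs and sort them newest-first
sort -s -r -k1 log1 log2 > merged.log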

I've also managed to make the script ask for the date I want and then use grep to filter by that date.
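That part is roughly the following, again with placeholder filenames:

printf "Please key in the date: "
read -r logdate
# keep only that day's entries
grep -e "$logdate" merged.log > bydate.log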

Right now, I only need to find a way to remove the non-adjacent duplicate lines, which also have different timestamps. I tried awk, but my knowledge of awk isn't that great. Any awk gurus out there able to assist me?

P.S. One of the issues I'm encountering is that there are identical lines with different error codes. I want to remove those lines, but I can only go so far with grep -v "Constant part of line". If there were a way to remove duplicates by percentage of similarity, that would be great. Also, I can't make the script ignore certain fields or columns, because the error codes appear at different fields/columns on different lines.

Expected output as below:

May 29 22:25:30 servername.com Fdm: another error message 3 76543
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890


I only want the errors, but that's easily done with grep -i error. The only remaining issue is the duplicate lines with different error codes.
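To illustrate the kind of de-duplication I mean, here is a rough awk sketch. It assumes the first five space-separated fields are the timestamp, hostname and daemon name, and that the varying codes always look like error code=...; bare numeric codes (like the 76543 above) would still slip through. Because the input is sorted newest-first, keeping the first copy of each key keeps the latest message:

awk '{
        key = $0
        # drop the timestamp, hostname and daemon name (first five fields)
        for (i = 1; i <= 5; i++) sub(/^[^ ]+ +/, "", key)
        # blank out the varying error code so near-duplicates collide
        gsub(/error code=[^ ]+/, "", key)
        # newest-first input: the first copy of each key is the latest
        if (!seen[key]++) print
}' merged.log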

Answer

I managed to find a way to do it. Just to give you more detail about the issue I had and what this script does:

Issue: I had logs which I had to clean up, but the logs contain multiple lines with repeating errors. Unfortunately, the repeating errors have different error codes, so I'm not able to just grep -v them. Plus, the logs have tens of thousands of lines, so repeatedly "grep -v"-ing them would consume a lot of time. I've decided to semi-automate the process with the script below. If you have ideas on how to improve it, please do comment!

#!/usr/local/bin/bash

rm -f /tmp/tmp.log /tmp/tmpfiltered.log

printf "Please key in full location of logs: "
read -r log1loc log2loc

# Merge the two logs and sort them so the newest entries come first.
cat "$log1loc" "$log2loc" > /tmp/tmp.log
sort -s -r -k1 /tmp/tmp.log -o /tmp/tmp.log

printf "Please key in the date: "
read -r logdate

while [[ $firstlineedit != "n" ]]; do
        # Show the remaining error lines for that date.
        grep -e "$logdate" /tmp/tmp.log | grep -i error | less

        # Offer the current top line as the default pattern to remove.
        firstline=$(head -n 1 /tmp/tmp.log)
        read -p "Enter line to remove (enter n to quit): " -e -i "$firstline" firstlineedit

        if [ "$firstlineedit" != "n" ]; then
                # Keep one copy of the removed line in the filtered log.
                printf '%s\n' "$firstline" >> /tmp/tmpfiltered.log

                # Count how often the pattern appears, then drop those lines.
                firstlinecount=$(grep -e "$logdate" /tmp/tmp.log | grep -i error | grep -o "$firstlineedit" | wc -l)
                grep -e "$logdate" /tmp/tmp.log | grep -i error | grep -v "$firstlineedit" > /tmp/tmp2.log
                mv /tmp/tmp2.log /tmp/tmp.log

                echo "That line and its variations have appeared $firstlinecount times in the log!"
        fi
done

less /tmp/tmpfiltered.log
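One improvement I can think of (a sketch, untested): create the temp files with mktemp instead of fixed names under /tmp, so two runs can't clobber each other and the files are cleaned up on exit.

# private temp files, removed automatically when the script exits
tmplog=$(mktemp) || exit 1
filteredlog=$(mktemp) || exit 1
trap 'rm -f "$tmplog" "$filteredlog"' EXIT

cat "$log1loc" "$log2loc" > "$tmplog"
sort -s -r -k1 "$tmplog" -o "$tmplog"

The rest of the script would then use "$tmplog" and "$filteredlog" wherever it currently has /tmp/tmp.log and /tmp/tmpfiltered.log.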