ellsusan - 2 months ago
Linux Question

Using grep cmd to filter by first letter, @, and "."

I have a file (testdata.txt) with many email addresses and random text.
Using the grep command:

I want to make sure they are email addresses and not text, so I want to keep only the lines that contain "@".

I also want to filter them out so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name.

Eg. john.doe@gmail.com
However, johndoe@gmail.com would be included.

Lastly, I want to get the count of all the email addresses that follow these rules.

So far I've only been able to make sure they are email addresses by doing

grep -c "@" testdata.txt



Using the grep cmd I also want to check how many email addresses have a government domain ("gov").

I wanted to check that the line has an @ sign and that it also contains gov. However, I don't get the answer I want with any of the following.

grep -c "@\|gov" testdata.txt    # I get the number of lines that have an @, not the lines that have both @ and gov
grep -c "@/|gov" testdata.txt    # I get 0
grep -c "@|gov" testdata.txt     # I get 0

Answer

Going bottom-up with your questions.

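You are using grep in its basic regular expressions mode. In this mode, \| means OR (a GNU grep extension), | means the literal character |, and /| means the literal characters /|. Note that OR is not what you want here: "@\|gov" matches every line that contains either an @ or gov, which is why your first command counts all the lines with an @.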
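As a rough sketch of the difference, the following runs each spelling of the pattern against a small made-up file (the temporary file and its contents are assumptions for illustration, not data from the question). It assumes GNU grep. The -E switch selects extended regular expressions, where a bare | is the alternation operator, and the last command shows one possible way to require both an @ and gov on the same line.

```shell
# Hypothetical sample data; the contents are made up for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
john.doe@gmail.com
alice@agency.gov
some random text about gov things
bob@state.gov
EOF

# Basic mode: \| is alternation (GNU extension), so this counts lines with @ OR gov.
either_bre=$(grep -c "@\|gov" "$sample")

# Extended mode (-E): a bare | is alternation, so the count is the same.
either_ere=$(grep -cE "@|gov" "$sample")

# Basic mode: a bare | is a literal character, so no line matches.
# grep exits non-zero when nothing matches, hence the "|| true".
literal_pipe=$(grep -c "@|gov" "$sample" || true)

# AND: an @ followed later on the same line by gov.
both=$(grep -c "@.*gov" "$sample")

echo "$either_bre $either_ere $literal_pipe $both"   # prints: 4 4 0 2
rm -f "$sample"
```

The "@.*gov" pattern is still loose (it would accept gov anywhere after the @), so it is only a first step towards a proper domain match.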

If you were looking for emails in the .gov domain, you would probably be looking for a sequence starting with @ and followed by symbols that are permitted in an Internet domain name and the symbols .gov, or .GOV, or .Gov.

Borrowing from another post on this site, you would end up with something like

   grep -c "@[A-Za-z0-9][A-Za-z0-9.-]*\.\(gov\|Gov\|GOV\)" testdata.txt

which skips the other 5 possible spellings of the top-level domain, e.g. GoV. However, I would use the -i switch, which means ignore case, to simplify the expression:

   grep -ci "@[a-z0-9][a-z0-9.-]*\.gov" testdata.txt
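To see how the two forms behave, here is a small sketch against made-up data (the temporary file and the addresses are assumptions for illustration). It also shows a limitation of both patterns: nothing requires .gov to be the end of the domain, so a host such as mail.governor.example is counted too. The third command is one possible tightening, assuming the address is the last thing on the line.

```shell
# Hypothetical sample data; the contents are made up for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
alice@agency.gov
BOB@STATE.GOV
carol@example.com
talk to the gov office
dave@city.Gov
eve@mail.governor.example
EOF

# Case-sensitive pattern with three spellings listed explicitly.
explicit=$(grep -c "@[A-Za-z0-9][A-Za-z0-9.-]*\.\(gov\|Gov\|GOV\)" "$sample")

# Same idea with -i, which covers every upper/lower-case spelling.
ignore_case=$(grep -ci "@[a-z0-9][a-z0-9.-]*\.gov" "$sample")

# Stricter variant: .gov must be the end of the line.
# Single quotes keep the shell from touching the trailing $.
anchored=$(grep -ci '@[a-z0-9][a-z0-9.-]*\.gov$' "$sample")

echo "$explicit $ignore_case $anchored"   # prints: 4 4 3
rm -f "$sample"
```

The first two counts include the eve@mail.governor.example line; only the end-anchored variant excludes it.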

Now you were not very clear regarding the use of dots separating parts of the name:

"I also want to filter them out so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name. Eg. john.doe@gmail.com However, johndoe@gmail.com would be included."

So I will not touch this part.

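Finally, you could use a range expression to keep only the addresses that start with the letters A-M. The leading ^ assumes that each address starts its line; without an anchor, [a-m] could match a letter in the middle of the name instead of the first one.

   grep -ci "^[a-m][a-z0-9._%+-]*@[a-z0-9][a-z0-9.-]*\.gov" testdata.txt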
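One thing to watch with a pattern of this kind is anchoring: if nothing ties [a-m] to the start of the address, it can match any letter inside the name. The sketch below compares an unanchored and a ^-anchored version on made-up data (the temporary file and the addresses are assumptions for illustration, and the ^ anchor assumes one address per line, starting in the first column).

```shell
# Hypothetical sample data; the contents are made up for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
alice.smith@agency.gov
zack@agency.gov
Mary@state.gov
nora.jones@state.gov
bob@example.com
EOF

# Unanchored: [a-m] may match in the middle of the name,
# e.g. the "a" in zack or in nora, so those lines are counted as well.
unanchored=$(grep -ci "[a-m][a-z0-9._%+-]*@[a-z0-9][a-z0-9.-]*\.gov" "$sample")

# Anchored: the first character of the line must be in a-m (any case, due to -i).
anchored=$(grep -ci "^[a-m][a-z0-9._%+-]*@[a-z0-9][a-z0-9.-]*\.gov" "$sample")

echo "$unanchored $anchored"   # prints: 4 2
rm -f "$sample"
```

With the anchor, only alice.smith@agency.gov and Mary@state.gov are counted.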

Please note that this is not an implementation of the address specification in RFC 5322 (Internet Message Format), only an approximation meant mainly for didactic purposes. Do not leave an approximation like this in production code.
