I have a file (testdata.txt) with many email addresses and random text.
Using the grep command:
I want to make sure they are email addresses and not text, so I want to filter them out so that only lines with "@" are included.
I also want to filter them out so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name.
However, email@example.com would be included.
Lastly, I want to get the count of all the email addresses that follow these rules.
So far I've only been able to make sure they are email addresses by doing
grep -c "@" testdata.txt
grep -c "@\|gov" testdata.txt I get the number of lines that contain @ or gov, not the lines that contain both @ and gov
grep -c "@/|gov" testdata.txt I get 0
grep -c "@|gov" testdata.txt I get 0
Going bottom-up with your questions.
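You are using grep in its basic regular expressions (BRE) mode. In this mode
\| means OR (a GNU extension to BRE),
| means the literal symbol |, and
/| means the literal symbols /|.
That is why your last two commands report 0: they search for the literal strings @|gov and @/|gov, which presumably never occur in the file.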
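The difference is easy to check on a tiny made-up file. The following is only a sketch: the file name sample.txt and its contents are invented for illustration, and it assumes a grep that supports \| in basic mode (GNU grep does).

```shell
# Hypothetical sample data; sample.txt is made up for this sketch.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
printf '%s\n' \
  'alice@example.gov' \
  'bob@example.com' \
  'the gov said so' \
  'just some text' > "$sample"

# BRE: \| is alternation, so lines containing @ OR gov are counted.
bre_or=$(grep -c '@\|gov' "$sample" || true)
# BRE: a bare | is a literal pipe, which no line contains.
# grep exits non-zero when nothing matches, hence the "|| true".
bre_pipe=$(grep -c '@|gov' "$sample" || true)
# ERE (-E): a bare | is alternation.
ere_or=$(grep -Ec '@|gov' "$sample" || true)

printf 'bre_or=%s bre_pipe=%s ere_or=%s\n' "$bre_or" "$bre_pipe" "$ere_or"
rm -r "$tmpdir"
```

With this sample the first and third commands count the same three lines, while the second counts none.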
If you were looking for emails in the .gov domain, you would probably be looking for a sequence starting with @ and followed by symbols that are permitted in an Internet domain name and the symbols .gov, or .GOV, or .Gov.
Borrowing from another post on this site, you would end up with something like
grep -c "@[A-Za-z0-9][A-Za-z0-9.-]*\.\(gov\|Gov\|GOV\)" testdata.txt
which skips the other five possible spellings of the top-level domain, e.g. GoV.
However, I would use the -i switch, which means ignore case, to simplify the expression:
grep -ci "@[a-z0-9][a-z0-9.-]*\.gov" testdata.txt
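To see what -i buys you, here is a small sketch comparing the two expressions on invented data (the file name and the addresses are hypothetical, and GNU grep is assumed for the \| alternation).

```shell
# Hypothetical sample data covering several spellings of the TLD.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
printf '%s\n' \
  'ann@agency.gov' \
  'ben@agency.Gov' \
  'cat@agency.GOV' \
  'dan@agency.GoV' \
  'eve@agency.com' > "$sample"

# Explicit alternation: only the three listed spellings are counted.
explicit=$(grep -c '@[A-Za-z0-9][A-Za-z0-9.-]*\.\(gov\|Gov\|GOV\)' "$sample" || true)
# Case-insensitive match: every spelling of .gov is counted.
nocase=$(grep -ci '@[a-z0-9][a-z0-9.-]*\.gov' "$sample" || true)

printf 'explicit=%s nocase=%s\n' "$explicit" "$nocase"
rm -r "$tmpdir"
```

The explicit version misses the GoV line, so it counts one line fewer than the -i version on this sample.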
Now you were not very clear regarding the use of dots separating parts of the name:
I also want to filter them out so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name. Eg. firstname.lastname@example.org However, email@example.com would be included.
So I will not touch this part.
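Finally, you could use a range expression to keep only the addresses that start with the letters A-M. The address must begin either at the start of the line or right after a character that cannot be part of the name; otherwise [a-m] could also match in the middle of a name such as zoe.adams.
grep -ci "\(^\|[^a-z0-9._%+-]\)[a-m][a-z0-9._%+-]*@[a-z0-9][a-z0-9.-]*\.gov" testdata.txt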
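A quick way to see why the start of the address matters is to run the range expression with and without a leading boundary on a made-up sample. This is only a sketch: the file and addresses are hypothetical, GNU grep is assumed, and the boundary used here (start of line, or a character outside the set allowed in the name) is one possible choice rather than the only one.

```shell
# Hypothetical sample data; only three addresses really start with A-M.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
printf '%s\n' \
  'adam.smith@agency.gov' \
  'Mary.Jones@agency.gov' \
  'zoe.adams@agency.gov' \
  'nina@agency.gov' \
  'contact: bob@agency.gov' \
  'carl@example.com' > "$sample"

# No boundary: [a-m] may match any letter inside the name,
# so zoe.adams and nina are counted as well.
unanchored=$(grep -ci '[a-m][a-z0-9._%+-]*@[a-z0-9][a-z0-9.-]*\.gov' "$sample" || true)
# With a boundary: [a-m] must be the first character of the address.
anchored=$(grep -ci '\(^\|[^a-z0-9._%+-]\)[a-m][a-z0-9._%+-]*@[a-z0-9][a-z0-9.-]*\.gov' "$sample" || true)

printf 'unanchored=%s anchored=%s\n' "$unanchored" "$anchored"
rm -r "$tmpdir"
```

Keep in mind that -c counts matching lines, not matches, so a line holding two addresses is still counted once.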
Please note that this is not an implementation of the address specification in RFC 5322 (Internet Message Format) but only an approximation, used mainly for didactic purposes. Do not leave an approximation like this in production code.