ellsusan - 5 months ago
Linux Question

Using grep cmd to filter

Using grep or egrep

How many email addresses are in ‘first.last’ name format AND involve someone
whose first name starts with a letter in the first half of the alphabet?
(I want to get the count)

excerpt of testingfile.txt

glorious@uole.com
hhhhhh
ItzStatic
jackass
The_Epic_Turtle
david.webb@cia.gov
overthemoon34
smiley362
emilio
rico@uole.com
ddc44ever
check.it@geocities.com
dickens@uole.com
middle614
IntegrityJeff
5432
jdm-mojo@geocities.com
zack.gertrude@gmail.com


To do this, I wanted to check each line for three things: that it contains an "@", that its first letter is in A-M or a-m, and that it contains a period.

grep -c "@" testingfile.txt


grep -c "\." testingfile.txt
(although this only checks whether the line contains at least one period.)

grep -c "[a-mA-M]" testingfile.txt
(still haven't gotten this one to work)

How would I combine the 3 statements together, and how would I check to see if the first character of each line is a letter between a-m or A-M?

Answer

Finding email addresses that start with [a-mA-M]

Because you were interested in the problem of more than one email on a line, let's consider this test file:

$ cat testingfile.txt 
glorious@uole.com
hhhhhh
david.webb@cia.gov overthemoon34 rico@uole.com
Check.it@geocities.com dickens@uole.com
IntegrityJeff
5432
jdm-mojo@geocities.com
zack.gertrude@gmail.com

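This shows the first parts of the email addresses matched by a pattern that looks for a letter from the first half of the alphabet at a word boundary. Note that \b also matches right after a period, so gertrude@ (from zack.gertrude@gmail.com) is reported even though that address starts with z: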

$ grep -o "\b[a-mA-M][^[:blank:]]*@" testingfile.txt 
glorious@
david.webb@
Check.it@
dickens@
jdm-mojo@
gertrude@

This counts them:

$ grep -o "\b[a-mA-M][^[:blank:]]*@" testingfile.txt | wc -l
6
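
Because \b also matches right after a period, the pattern above reports gertrude@ from zack.gertrude@gmail.com. One possible way around that, offered as a sketch rather than the only fix, is to put every blank-separated token on its own line first and then anchor the pattern with ^. The function name count_first_half and the temporary sample file are illustrative assumptions, not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper: count addresses whose first character is a-m or A-M.
# tr puts each blank-separated token on its own line, so ^ marks the true
# start of an address instead of any word boundary inside it.
count_first_half() {
    tr -s '[:blank:]' '\n' < "$1" | LC_ALL=C grep -c '^[a-mA-M][^@]*@'
}

# Sample data copied from the test file shown earlier.
sample=$(mktemp)
cat > "$sample" <<'EOF'
glorious@uole.com
hhhhhh
david.webb@cia.gov overthemoon34 rico@uole.com
Check.it@geocities.com dickens@uole.com
IntegrityJeff
5432
jdm-mojo@geocities.com
zack.gertrude@gmail.com
EOF

count_first_half "$sample"   # prints 5
rm -f "$sample"
```

Here grep -c is enough, since after the tr step each line holds at most one address. The count drops from 6 to 5 because zack.gertrude@gmail.com no longer matches.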

Being more strict about the "first" name

If we want to restrict the match to email addresses whose name part includes a period:

$ grep -o "\b[a-mA-M][^[:blank:]]*\.[^[:blank:]]*@" testingfile.txt 
david.webb@
Check.it@

And to count them:

$ grep -o "\b[a-mA-M][^[:blank:]]*\.[^[:blank:]]*@" testingfile.txt | wc -l
2
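
For a file with one entry per line, like the excerpt in the question, the three separate checks can be folded into a single anchored pattern so that grep -c gives the count directly. This is a minimal sketch, assuming that 'first.last' means letters, one period, letters, and then the @ sign; the function name count_first_last is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper for files with one entry per line.
#   ^[a-mA-M]       first character is in the first half of the alphabet
#   [[:alpha:]]*    rest of the first name
#   \.              the period between first and last name
#   [[:alpha:]]+@   last name followed by the @ sign
count_first_last() {
    LC_ALL=C grep -cE '^[a-mA-M][[:alpha:]]*\.[[:alpha:]]+@' "$1"
}

# A few of the addresses from the question's excerpt, one per line.
sample=$(mktemp)
printf '%s\n' david.webb@cia.gov rico@uole.com check.it@geocities.com \
    dickens@uole.com jdm-mojo@geocities.com zack.gertrude@gmail.com > "$sample"

count_first_last "$sample"   # prints 2
rm -f "$sample"
```

Only david.webb@cia.gov and check.it@geocities.com satisfy all three conditions at once. Loosen the [[:alpha:]] classes if names with digits or hyphens should count.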

Notes

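  1. The regex used here, \b[a-mA-M][^[:blank:]]*@, is quite simple: it only looks for a run of non-blank characters that starts with a letter from a-m at a word boundary and ends with @. Regexes exist that accurately select true email addresses, but they are quite complex.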

  2. grep -c counts matching lines, not matches. Because a line can hold more than one address, we first use grep -o to put each match on a separate line and then use wc -l to count those lines.

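  3. The bracket expression [a-mA-M] is not Unicode-safe: it is meant for unaccented ASCII letters, and how a range such as a-m treats accented letters depends on the locale and the grep implementation.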
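
The difference described in note 2 can be seen on a single line that holds two addresses. The snippet below is only an illustration; it assumes a grep that supports -o and \b, as GNU grep does, and the variable names are arbitrary.

```shell
#!/bin/sh
# One line holding two addresses that both start with a letter in a-m.
line='Check.it@geocities.com dickens@uole.com'
pattern='\b[a-mA-M][^[:blank:]]*@'

# grep -c counts matching lines: the single line counts once.
lines=$(printf '%s\n' "$line" | grep -c "$pattern")

# grep -o prints each match on its own line, so wc -l counts matches.
# tr strips the padding that some wc implementations add.
matches=$(printf '%s\n' "$line" | grep -o "$pattern" | wc -l | tr -d '[:space:]')

echo "lines=$lines matches=$matches"   # prints lines=1 matches=2
```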