ellsusan - 2 months ago
Linux Question

Using grep cmd to filter

Using grep or egrep

How many email addresses are in ‘first.last’ name format AND involve someone
whose first name starts with a letter in the first half of the alphabet?
(I want to get the count)

excerpt of testingfile.txt

glorious@uole.com
hhhhhh
ItzStatic
jackass
The_Epic_Turtle
david.webb@cia.gov
overthemoon34
smiley362
emilio
rico@uole.com
ddc44ever
check.it@geocities.com
dickens@uole.com
middle614
IntegrityJeff
5432
jdm-mojo@geocities.com
zack.gertrude@gmail.com


To do this, I wanted to filter each line on three conditions: it contains an "@", its first letter is in A-M or a-m, and it contains a period.

grep -c "@" testingfile.txt


grep -c "\." testingfile.txt
(although this only checks whether the line contains at least one period.)

grep -c "[a-mA-M]" testingfile.txt
(still haven't gotten this one to work)

How would I combine the 3 statements together, and how would I check to see if the first character of each line is a letter between a-m or A-M?

Answer

Finding email addresses that start with [a-mA-M]

Because you were interested in the problem of more than one email on a line, let's consider this test file:

$ cat testingfile.txt 
glorious@uole.com
hhhhhh
david.webb@cia.gov overthemoon34 rico@uole.com
Check.it@geocities.com dickens@uole.com
IntegrityJeff
5432
jdm-mojo@geocities.com
zack.gertrude@gmail.com

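This shows the part up to the @ of every address in which a word starting with a letter from the first half of the alphabet runs up to the @. Note that \b matches at any word boundary, including after a period, so gertrude@ (from zack.gertrude@gmail.com) is reported as well: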

$ grep -o "\b[a-mA-M][^[:blank:]]*@" testingfile.txt 
glorious@
david.webb@
Check.it@
dickens@
jdm-mojo@
gertrude@

This counts them:

$ grep -o "\b[a-mA-M][^[:blank:]]*@" testingfile.txt | wc -l
6
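
Because \b also matches after a period or hyphen, a count made this way can include names such as gertrude@ whose address actually starts with a letter from the second half of the alphabet. A minimal sketch of one way around that, assuming tr and grep behave as in GNU coreutils and GNU grep: split the input into whitespace-separated tokens first, so that ^ anchors at the true start of each address. The temporary file and the variable name count are only for illustration.

```shell
#!/bin/sh
# Recreate the sample file from this answer in a temporary location.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
glorious@uole.com
hhhhhh
david.webb@cia.gov overthemoon34 rico@uole.com
Check.it@geocities.com dickens@uole.com
IntegrityJeff
5432
jdm-mojo@geocities.com
zack.gertrude@gmail.com
EOF

# Put each whitespace-separated token on its own line, then count the
# tokens that start with a-m/A-M and contain an @ later on.
# LC_ALL=C keeps the bracket ranges limited to plain ASCII letters.
count=$(tr -s '[:blank:]' '\n' < "$tmp" | LC_ALL=C grep -c '^[a-mA-M][^@]*@')
echo "$count"   # prints 5: zack.gertrude@gmail.com is no longer counted

rm -f "$tmp"
```

Since every token is on its own line after the tr step, grep -c is enough here and wc -l is not needed.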

Being more strict about the "first" name

If we want to restrict the match to email addresses whose name part (the part before the @) includes a period, as in first.last:

$ grep -o "\b[a-mA-M][^[:blank:]]*\.[^[:blank:]]*@" testingfile.txt 
david.webb@
Check.it@

And to count them:

$ grep -o "\b[a-mA-M][^[:blank:]]*\.[^[:blank:]]*@" testingfile.txt | wc -l
2
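
For the layout in the question, with at most one address per line, the three separate checks can also be combined into one anchored pattern and counted with grep -c directly. The following is a sketch under that assumption; it additionally assumes that "first.last" means exactly one period before the @, which the question does not state explicitly. The temporary file and the variable name count are illustrative.

```shell
#!/bin/sh
# Recreate the excerpt from the question in a temporary location.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
glorious@uole.com
hhhhhh
ItzStatic
jackass
The_Epic_Turtle
david.webb@cia.gov
overthemoon34
smiley362
emilio
rico@uole.com
ddc44ever
check.it@geocities.com
dickens@uole.com
middle614
IntegrityJeff
5432
jdm-mojo@geocities.com
zack.gertrude@gmail.com
EOF

# ^[a-mA-M]            first character of the line is a-m or A-M
# [^.@[:blank:]]*      rest of the first name (no period, @ or blank)
# \.                   the period separating first and last name
# [^.@[:blank:]]+@     a non-empty last name followed by the @
count=$(LC_ALL=C grep -Ec '^[a-mA-M][^.@[:blank:]]*\.[^.@[:blank:]]+@' "$tmp")
echo "$count"   # prints 2 (david.webb and check.it)

rm -f "$tmp"
```

Anchoring with ^ instead of \b is what keeps zack.gertrude@gmail.com out of the count, but it only works when each address starts its own line.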

Notes

  1. The regex used here, \b[a-mA-M][^[:blank:]]*@ is quite simple. Regexes exist that accurately select true email addresses but they are quite complex.

  2. grep -c counts lines. We first have to use grep -o to put each match on a separate line and then use wc -l to count the lines.

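  3. The bracket expression [a-mA-M] is not unicode-safe: it covers only unaccented ASCII letters, so a name starting with a letter such as É is not matched, and the behavior of ranges can vary with the locale.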
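
The difference described in note 2 can be seen on the sample file from this answer. This is a small sketch assuming GNU grep (for \b and -o); the temporary file and the variable names line_count and match_count are only for illustration.

```shell
#!/bin/sh
# Recreate the sample file from this answer in a temporary location.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
glorious@uole.com
hhhhhh
david.webb@cia.gov overthemoon34 rico@uole.com
Check.it@geocities.com dickens@uole.com
IntegrityJeff
5432
jdm-mojo@geocities.com
zack.gertrude@gmail.com
EOF

pattern='\b[a-mA-M][^[:blank:]]*@'

# grep -c counts matching LINES: a line with two matches counts once.
line_count=$(grep -c "$pattern" "$tmp")

# grep -o prints every match on its own line, so wc -l counts MATCHES.
# tr strips the padding that some wc implementations add.
match_count=$(grep -o "$pattern" "$tmp" | wc -l | tr -d ' ')

echo "lines: $line_count, matches: $match_count"   # lines: 5, matches: 6

rm -f "$tmp"
```

The two numbers differ because the line with Check.it@ and dickens@ contains two matches but is a single line.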
