aistesk aistesk -4 years ago 105
Python Question

How to print only the matched regex strings from a block of text?

My end goal is to take a block of multi-line text that contains domains and output just a list of those domains.

Example input:

2017-03-02: 173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02: 173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04: 173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04: 54.202.16.39 port 80 - pentsshoperqunity.top -


The output I want in this case:

www.hlowdolax.top
www.hjaoopoa.top
www.foolalexas.top
pentsshoperqunity.top


Eventually I found out that the best tool for this purpose is re.findall() and tried to do it this way:

matchedDomains=re.findall(myRegex, fileWithMessyText.read())
print(matchedDomains)


And in the output I see that it matched all the domains but the result looks like this:

[('www', 'hlowdolax', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('www', 'hjaoopoa', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('www', 'foolalexas', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('pentsshoperqunity', 't', 'o', 'p'), ('search', 'p', 'h', 'p'), ('nikesportweardewvv', 't', 'o', 'p'), ('search', 'p', 'h', 'p'), ('www', 'dpooldoopl', 'a', 'top'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('fordfocuscommunoityesz', 't', 'o', 'p'), ('www', 'sosgenerga', 'lz', 'top'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('fordfocuscommunoityesz', 't', 'o', 'p'), ('search', 'p', 'h', 'p')]


If that's relevant, here is the regex I use:

([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})


I googled a variety of keywords, tested my regex with pythex.org, and learned about the term "match captures" and that it has something to do with "capture groups". But all the advice I found here about using group appears to be incompatible with findall, and if I try search or match instead, it only works for the first line and prints the whole line instead of just the match (this may sound like rambling, but I didn't document my attempts, so I don't remember exactly what I tried). Also, intuitively, looping and matching line by line seems like a workaround when there is a tool that matches the whole block. The problem is, I don't know how to use it.

I'm not looking for someone to write the code for me, but I'm really lost at this point. Is there a way to use findall and output just nicely formatted matches?

Answer Source

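The parentheses in your regex create capturing groups, and when a pattern contains groups, re.findall() returns the groups (as tuples, if there is more than one) instead of the whole match. Just remove them: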

[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}
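To see the difference the parentheses make, here is a minimal sketch on a made-up string. The sample text, the simplified patterns, and the variable names are illustrative assumptions, not taken from the question.

```python
import re

# Hypothetical one-line sample, shaped like the log lines in the question.
text = "port 80 - www.example.top - GET /"

# Two capturing groups: findall returns one tuple of group texts per match.
with_groups = re.findall(r"([a-z]+)\.([a-z]+)\.top", text)

# No groups: findall returns the full matched string for each match.
without_groups = re.findall(r"[a-z]+\.[a-z]+\.top", text)

print(with_groups)     # [('www', 'example')]
print(without_groups)  # ['www.example.top']
```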

Here is a demonstration.

>>> re.findall(r'[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}', s)
['www.hlowdolax.top', 'www.hjaoopoa.top', 'www.foolalexas.top', 
 'pentsshoperqunity.top']
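If the parentheses are needed for some other reason (alternation or repetition, say), there are two other ways to get whole matches from the entire block at once: make the groups non-capturing with (?:...), or keep the groups and use re.finditer() with group(0). The sketch below runs both on the sample lines from the question. The variable names are illustrative, and the pattern is the question's own rather than a general-purpose domain matcher.

```python
import re

# Sample input copied from the question.
log = """\
2017-03-02: 173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02: 173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04: 173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04: 54.202.16.39 port 80 - pentsshoperqunity.top -
"""

# Option 1: same pattern, with every group made non-capturing via (?:...).
non_capturing = (
    r"(?:[A-Za-z0-9]{1,})\.(?:[A-Za-z0-9]{1,10})"
    r"\.?(?:[A-Za-z]{1,})\.?(?:[A-Za-z]{1,})"
)
domains = re.findall(non_capturing, log)

# Option 2: keep the capturing groups and ask each match object
# for group(0), which is always the entire match.
capturing = (
    r"([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})"
    r"\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})"
)
domains_via_finditer = [m.group(0) for m in re.finditer(capturing, log)]

# Print one domain per line, as in the desired output.
print("\n".join(domains))
```

Both lists contain the same four domains. Joining with a newline gives the one-per-line output asked for in the question instead of a Python list literal.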