aistesk aistesk -4 years ago 105
Python Question

How to print only the matched regex strings from a block of text?

The end goal I have is to input a block of text (multiple lines) which contains domains and output just a list of domains.

Example input:

2017-03-02: port 80 - - GET /usp?f=1if
2017-03-02: port 80 - - GET /uf=1if
2017-03-04: port 80 - - GET /userif
2017-03-04: port 80 - -

The output I want in this case:

Eventually I found out that the best tool for this purpose is re.findall and tried to do it this way:

print matchedDomains

And in the output I see that it matched all the domains but the result looks like this:

[('www', 'hlowdolax', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('www', 'hjaoopoa', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('www', 'foolalexas', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('pentsshoperqunity', 't', 'o', 'p'), ('search', 'p', 'h', 'p'), ('nikesportweardewvv', 't', 'o', 'p'), ('search', 'p', 'h', 'p'), ('www', 'dpooldoopl', 'a', 'top'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('fordfocuscommunoityesz', 't', 'o', 'p'), ('www', 'sosgenerga', 'lz', 'top'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('fordfocuscommunoityesz', 't', 'o', 'p'), ('search', 'p', 'h', 'p')]

If that's relevant, here is the regex I use:


I googled a variety of keywords, tested my regex, and learned the term "match captures" and that it has something to do with "capture groups". However, all the advice I found here appears not to be compatible with re.findall, and the alternative I tried only works for the first line and prints the whole line instead of just the match (this may look like rambling, but I didn't document my wanderings, so I don't remember exactly what I tried). Also, intuitively it seems like a workaround to loop and match line by line when there is a tool that matches the whole block. The problem is, I don't know how to use it.

I'm not looking for someone to write the code for me, but I'm really lost at this point. Is there a way to use re.findall and output just nicely formatted matches?

Answer Source

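The parentheses in your regex create capturing groups, and when a pattern contains groups, re.findall returns a tuple of the groups for each match instead of the whole matched string. Just remove them: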


Here is a demonstration.

>>> re.findall(r'[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}', s)
['', '', '', 
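
As a rough, self-contained sketch of the difference (not the asker's actual data): the sample text and hostnames below are made up, and the grouped pattern is only an assumption about what the original regex looked like, reconstructed by wrapping each part of the pattern above in parentheses. The last pattern is a simplified variant added for illustration, showing that non-capturing groups, written (?:...), allow grouping without changing what re.findall returns.

```python
import re

# Hypothetical sample text; the hostnames are invented for illustration.
text = """\
2017-03-02: www.example.top port 80 - - GET /usp?f=1
2017-03-04: shop.example.org port 80 - -
"""

# Assumed shape of the original pattern: every part wrapped in a capturing group.
# With groups present, findall returns one tuple of group contents per match.
grouped = re.findall(
    r'([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})',
    text,
)

# Same pattern without parentheses: findall returns each whole match as a string.
plain = re.findall(
    r'[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}',
    text,
)

# Simplified illustrative variant: a non-capturing group (?:...) still groups
# the repeated "label." part, but findall keeps returning whole matches.
non_capturing = re.findall(r'(?:[A-Za-z0-9]+\.)+[A-Za-z]+', text)

print(grouped)
print(plain)
# One match per line instead of a Python list.
print("\n".join(non_capturing))
```

With this sample text, the grouped version yields 4-tuples such as ('www', 'example', 'to', 'p'), which is the same odd splitting seen in the question, while the other two yield plain strings such as 'www.example.top'.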