BitFlow - 1 year ago
Python Question

set() not removing duplicates

I'm trying to find unique IP addresses in a file using a regex. I find them fine, append them to a list, and later call set() on the list to remove duplicates. Each item is found and there are duplicates, but the list never gets any shorter: printing the set shows the same items as printing ips as a list, and nothing is removed.

ips = [] # make a list
count = 0
count1 = 0
for line in f: #loop through file line by line
    match = re.search("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line) #find IPs
    if match: #if there's a match append and keep track of the total number of IPs
        ips.append(match) #append to list
        count = count + 1
ipset = set(ips)
print(ipset, count)

This string
<_sre.SRE_Match object; span=(0, 13), match=''>
shows up 60+ times in the output, both before and after calling set() on the list.

Answer Source

You are not storing the matched strings. You are storing the re.Match objects. These don't compare equal even if they matched the same text, so they are all seen as unique by a set object:

>>> import re
>>> line = '\n'
>>> match1 = re.search("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
>>> match1
<_sre.SRE_Match object; span=(0, 13), match=''>
>>> match2 = re.search("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
>>> match2
<_sre.SRE_Match object; span=(0, 13), match=''>
>>> match1 == match2
False

Extract the matched text instead:

ips.append(match.group()) #append to list

match.group() without arguments returns the part of the string that was matched (group 0):

>>> match1.group() == match2.group()
True
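
Putting it together, here is a rough sketch of the whole loop with the matched text stored instead of the match object. The unique_ips helper and the sample lines are made up for illustration (they stand in for your file f), and I'm assuming re.search is the call you were using:

```python
import re

# Same pattern as in the question, written as a raw string.
IP_PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")


def unique_ips(lines):
    """Return (set of matched IP strings, number of lines that matched)."""
    ips = set()
    count = 0
    for line in lines:
        match = IP_PATTERN.search(line)
        if match:
            # Store the matched text; strings with equal content
            # compare equal, so the set can drop the duplicates.
            ips.add(match.group())
            count += 1
    return ips, count


# Hypothetical sample data standing in for the lines of the file.
sample_lines = [
    "10.0.0.1 GET /index.html\n",
    "10.0.0.2 GET /about.html\n",
    "10.0.0.1 GET /contact.html\n",
]

ipset, count = unique_ips(sample_lines)
print(sorted(ipset), count)  # ['10.0.0.1', '10.0.0.2'] 3
```

Adding straight into a set skips the intermediate list; if you still need the list with duplicates, keep ips.append(match.group()) and call set(ips) afterwards as before.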