xNightmare67x - 6 months ago
Python Question

How do I delete duplicate lines and create a new file without duplicates?

I searched on here and found many postings, but none that I can adapt to the following code:

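with open('TEST.txt') as f:
    seen = set()  # lowercased lines encountered so far
    for line in f:
        line_lower = line.lower()
        # a non-blank line already in the set is a duplicate
        if line_lower in seen and line_lower.strip():
            print(line.strip())
        else:
            seen.add(line_lower)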

With this I can find the duplicate lines inside my TEST.txt file, which contains hundreds of URLs.

However, I need to remove these duplicates and create a new text file with them removed and all other URLs intact.

I will be checking this newly created file for 404 errors using r.status_code.

In a nutshell, I need help getting rid of duplicates so I can check for dead links. Thanks for your help.


Sounds simple enough, but what you did looks overcomplicated. I think the following should be enough:

with open('TEST.txt', 'r') as f:
    unique_lines = set(f.readlines())
with open('TEST_no_dups.txt', 'w') as f:
    f.writelines(unique_lines)

A couple things to note:

  • If you are going to use a set, you might as well pass it all the lines at creation; f.readlines(), which returns a list of all the lines in your file, is perfect for that.
  • f.writelines() writes a sequence of lines to your file, but a set does not preserve the order of the lines. If order matters to you, I suggest replacing the last line with f.writelines(sorted(unique_lines, key=...)), where key is whatever sort key you need.
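If you would rather keep the URLs in their original order than sort them afterwards, here is a minimal sketch of one way to do it. The function names unique_lines_in_order and write_without_duplicates are made up for illustration. It assumes you want the case-insensitive comparison from the question (note that URL paths can be case-sensitive, so drop the .lower() call if that matters) and that blank lines can be discarded. Stripping each line also avoids treating a last line without a trailing newline as different from the same URL earlier in the file.

```python
def unique_lines_in_order(lines):
    """Return the non-blank lines, keeping only the first occurrence of each."""
    seen = set()
    result = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # assumption: blank lines are not worth keeping
        key = stripped.lower()  # assumption: compare case-insensitively, as in the question
        if key not in seen:
            seen.add(key)
            result.append(stripped)
    return result


def write_without_duplicates(src_path, dst_path):
    """Copy src_path to dst_path without duplicate lines; return the kept lines."""
    with open(src_path) as src:
        kept = unique_lines_in_order(src)
    with open(dst_path, "w") as dst:
        dst.writelines(line + "\n" for line in kept)
    return kept
```

Calling write_without_duplicates('TEST.txt', 'TEST_no_dups.txt') should then produce the cleaned file. If exact, case-sensitive matching is enough, list(dict.fromkeys(lines)) is a shorter way to drop duplicates while keeping order on Python 3.7+.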