G. Rossin - 4 months ago
Python Question

How can I find where files are used as links in HTML pages?

I have a static website where old versions of pages are still stored in the root. I want to find these pages and check whether any file under the root still links to them.

So I made a list of all the files under the root using the PowerShell command

ls -R -Name

and stored it in a file, 'filelist.txt'. Now I have something like:

directory1
directory2
5s.htm
5s.html
5s_introduction.htm
...
images\icons
images\icons\linkedin.png
images\icons\project-slider-arrow-left.png
images\icons\project-slider-arrow-right.png


I now want to find where these files are used, so I thought I could use a simple Python script (as I don't know Windows PowerShell) that takes each line from the list and looks for occurrences of it in every HTML page under the root.

To extract only the file name, I first tried this regex in Notepad++:

[^\\^\n]+\.[a-z]{0,4}


and it seemed to work... (the ^\n is meant to exclude the lines that represent directories)

As a second step, I tried to adapt these Python lines I found on Stack Overflow:

import re
with open('filelist.txt') as f:
    for l in f:
        m = re.match('([^\\^\n]+\.[a-z]{0,4})', l)
        if m:
            print(m.group(1))


but it returns completely wrong strings, full of spaces or single letters, as if the regex were wrong.
Then I thought I could use the regex result as a variable and somehow check for it in each HTML page in my root directory, but I'm stuck here.

Answer

Since you are sure that the file names contain '.', each path can be split on '\' and its last component checked for a '.'. Also, stripping each line removes the newline characters.

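with open('filelist.txt') as f:
    for l in f:
        # Drop the trailing newline and any surrounding whitespace
        l = l.strip()
        # The last backslash-separated component is the file or directory name
        name = l.split('\\')[-1]
        # Names without a '.' are treated as directories and skipped
        if '.' in name:
            print(name)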

Output:

5s.htm
5s.html
5s_introduction.htm
linkedin.png
project-slider-arrow-left.png
project-slider-arrow-right.png
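
The snippet above only extracts the names. The second step from the question, checking which HTML pages mention each name, could be sketched roughly as below. This is a minimal sketch under several assumptions, not a definitive solution: the function names `read_file_names` and `find_references` are made up for illustration, the pages are searched as plain text rather than parsed as HTML, pages are read as UTF-8 with undecodable bytes replaced, and matching is case-insensitive. The `encoding` parameter is there because a list written with the `>` redirection in Windows PowerShell may be saved as UTF-16, which would also be one possible explanation for output "full of spaces or single letters"; in that case `encoding="utf-16"` may help.

```python
import ntpath
import os
import re


def read_file_names(list_path, encoding="utf-8"):
    """Return the set of base file names in the listing, skipping directories."""
    names = set()
    with open(list_path, encoding=encoding) as f:
        for line in f:
            # ntpath splits on backslashes whatever OS the script runs on
            base = ntpath.basename(line.strip())
            # Same heuristic as above: entries without a '.' are directories
            if "." in base:
                names.add(base)
    return names


def find_references(root, names, extensions=(".htm", ".html")):
    """Map each name to the HTML files under root whose text mentions it."""
    # The lookarounds keep "5s.htm" from matching inside "5s.html"
    # or inside a longer name such as "old-5s.htm".
    patterns = {
        name: re.compile(
            r"(?<![\w-])" + re.escape(name) + r"(?![\w-])", re.IGNORECASE
        )
        for name in names
    }
    hits = {name: [] for name in names}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.lower().endswith(extensions):
                continue
            page = os.path.join(dirpath, filename)
            # Assumption: pages are UTF-8; undecodable bytes are replaced
            with open(page, encoding="utf-8", errors="replace") as f:
                text = f.read()
            for name, pattern in patterns.items():
                if pattern.search(text):
                    hits[name].append(page)
    return hits
```

Calling `find_references('.', read_file_names('filelist.txt'))` from the site root would then give a dictionary in which names with an empty list are candidates for unused files. Because only base names are compared, two files with the same name in different directories are not told apart, and any textual mention of a name counts as a hit, not just an href or src attribute.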