antred antred - 5 months ago 45x
Python Question

Excluding directories in os.walk

I'm writing a script that descends into a directory tree (using os.walk()) and then visits each file matching a certain file extension. However, since some of the directory trees that my tool will be used on also contain sub directories that in turn contain a LOT of useless (for the purpose of this script) stuff, I figured I'd add an option for the user to specify a list of directories to exclude from the traversal.

This is easy enough with os.walk(). After all, it's up to me to decide whether I actually want to visit the respective files / dirs yielded by os.walk() or just skip them. The problem is that if I have, for example, a directory tree like this:

--- dirA
--- dirB
--- uselessStuff --
--- moreJunk
--- yetMoreJunk

and I want to exclude uselessStuff and all its children, os.walk() will still descend into all the (potentially thousands of) sub directories of uselessStuff, which, needless to say, slows things down a lot. In an ideal world, I could tell os.walk() to not even bother yielding any more children of uselessStuff, but to my knowledge there is no way of doing that (is there?).

Does anyone have an idea? Maybe there's a third-party library that has implemented something like this?


Modifying dirs in-place will prune the (subsequent) files and directories visited by os.walk:

# exclude = set([...])
for root, dirs, files in os.walk(top, topdown=True):
    dirs[:] = [d for d in dirs if d not in exclude]

From help(os.walk):

When topdown is true, the caller can modify the dirnames list in-place (e.g., via del or slice assignment), and walk will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search...