user3535074 - 4 months ago 49
Python Question

Python os.walk complex directory criteria

I need to scan a directory containing hundreds of GB of data, which has structured parts (which I want to scan) and non-structured parts (which I don't want to scan).

Reading up on the os.walk function, I see that I can use a set of criteria to exclude or include certain directory names or patterns.

For this particular scan I would need to add specific include/exclude criteria per level in a directory, for example:

In a root directory, imagine there are two useful directories, 'Dir A' and 'Dir B', and a non-useful trash directory, 'Trash'. In Dir A there are two useful subdirectories, 'Subdir A1' and 'Subdir A2', and a non-useful 'SubdirA Trash' directory. In Dir B there are two useful subdirectories, 'Subdir B1' and 'Subdir B2', plus a non-useful 'SubdirB Trash' subdirectory. It would look something like this:

[Image: Example Directory]

I need to have a specific criteria list for each level, something like this:

level1DirectoryCriteria = {"Dir A", "Dir B"}

level2DirectoryCriteria = {"Subdir A1", "Subdir A2", "Subdir B1", "Subdir B2"}

The only ways I can think of to do this are quite obviously non-Pythonic, using complex and lengthy code with a lot of variables and a high risk of instability. Does anyone have any ideas for how to resolve this problem? If successful, it could cut the code's running time by several hours per run.


You could try something like this:

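import os

to_scan = {'set', 'of', 'good', 'directories'}
for dirpath, dirnames, filenames in os.walk(root):  # root: path of the directory to scan
    # Assign to the slice so os.walk sees the pruned list and skips everything else.
    dirnames[:] = [d for d in dirnames if d in to_scan]
    # whatever you wanted to do in this directory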

This solution is simple, but it fails if you want to scan directories with a certain name only when they appear under one directory and not under another. Another option would be a dictionary that maps directory names to lists or sets of whitelisted or blacklisted subdirectories.
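The dictionary idea could be sketched roughly as follows. This is only a minimal sketch: the function name walk_with_whitelist and the allowed_children mapping are hypothetical, and it assumes the keys are paths relative to root, with '.' standing for root itself.

```python
import os

def walk_with_whitelist(root, allowed_children):
    """Yield (dirpath, filenames) for root and every whitelisted directory below it.

    allowed_children maps a directory's path relative to root ('.' for root
    itself) to the set of subdirectory names that may be entered from it.
    Directories missing from the mapping are visited but not descended into.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        allowed = allowed_children.get(rel, set())
        # In-place slice assignment is what makes os.walk skip the pruned directories.
        dirnames[:] = [d for d in dirnames if d in allowed]
        yield dirpath, filenames

# Hypothetical mapping for the layout described in the question.
allowed_children = {
    ".": {"Dir A", "Dir B"},
    "Dir A": {"Subdir A1", "Subdir A2"},
    "Dir B": {"Subdir B1", "Subdir B2"},
}
```

With this mapping, 'Trash', 'SubdirA Trash' and 'SubdirB Trash' are never entered. Anything nested below 'Subdir A1' would also be skipped unless a key such as os.path.join("Dir A", "Subdir A1") is added to the mapping.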

Edit: We can use dirpath.count(os.path.sep) to determine depth.

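import os

root_depth = root.count(os.path.sep)  # subtract this from all depths to normalize root to 0
sets_by_level = [{'root', 'level'}, {'one', 'deep'}]
for dirpath, dirnames, filenames in os.walk(root):
    depth = dirpath.count(os.path.sep) - root_depth
    # Only filter levels that have a set; indexing past the end would raise IndexError.
    # Deeper levels are walked unfiltered here (an assumption; prune them if you prefer).
    if depth < len(sets_by_level):
        dirnames[:] = [d for d in dirnames if d in sets_by_level[depth]]
    # process this directory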
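
Counting separators in dirpath gives the wrong depth if root is passed with a trailing separator (for example 'data/'), because the first level of subdirectories then has the same separator count as root. A variant that derives the depth from the path relative to root might look like the sketch below. The function name walk_by_level is hypothetical, and treating levels beyond the end of sets_by_level as unfiltered is an assumption about what is wanted.

```python
import os

def walk_by_level(root, sets_by_level):
    """Walk root, entering only the directory names allowed at each depth.

    sets_by_level[0] filters the children of root, sets_by_level[1] filters
    their children, and so on. Below the last listed level nothing is pruned.
    """
    root = os.path.normpath(root)  # drop a trailing separator so depths count correctly
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # relpath returns '.' for root itself; otherwise depth = number of path components.
        depth = 0 if rel == os.curdir else rel.count(os.path.sep) + 1
        if depth < len(sets_by_level):
            dirnames[:] = [d for d in dirnames if d in sets_by_level[depth]]
        yield dirpath, dirnames, filenames

# Criteria from the question, one set per level.
sets_by_level = [
    {"Dir A", "Dir B"},
    {"Subdir A1", "Subdir A2", "Subdir B1", "Subdir B2"},
]
```

Looping over walk_by_level(root, sets_by_level) then visits root, 'Dir A', 'Dir B', the four whitelisted subdirectories that exist, and everything beneath them, while the trash directories are never opened.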