PyNEwbie - 5 months ago
Python Question

How can I avoid getting duplicate paths from os.walk()?

I have the following directory structure

/mnt/type/split/v2/doc/RESOURCE_ID/YYYY/FY/DOCUMENT_ID


for example, one path might be

/mnt/type/split/v2/doc/100045/2008/FY/28


where

RESOURCE_ID = 100045
YYYY = 2008
DOCUMENT_ID = 28


Note that DOCUMENT_ID is the last directory in the path; there will be files in the DOCUMENT_ID directory.

I was trying to take inventory of this structure using the following code:

def survey():
    magic_paths = []
    for (resource_id, dirname, filename) in os.walk('/mnt/type/split/v2/doc'):
        if resource_id:
            for (magic_path, dirname2, filename2) in os.walk(resource_id):
                if len(magic_path.split(os.sep)) == 10:
                    magic_paths.append(magic_path + os.linesep)
    write_survey(magic_paths)
    x = len(magic_paths)
    return x


I am getting five copies of each path in my magic_paths list. I have 1,500,000 paths, so I am getting 7,500,000 items in my list.

The first 1,500,000 are the unique values. The next 6,000,000 consist of groups rooted on the RESOURCE_ID, each repeated 4 times:

/mnt/type/split/v2/doc/100045/2008/FY/28 #obs_1
/mnt/type/split/v2/doc/100045/2008/FY/29 #obs_2
/mnt/type/split/v2/doc/100045/2008/FY/30 #obs_3
/mnt/type/split/v2/doc/100045/2008/FY/31 #obs_4
/mnt/type/split/v2/doc/1028/2008/FY/28 #obs_5 # see the new RESOURCE_ID
.
. 1,499,995 more unique values
.
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of first repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of second repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of third repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of fourth repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/1028/2008/FY/28 #series of 4 repetitions based on RESOURCE ID 1028


There are various files in the directories and subdirectories at each level; I just need to inventory the paths to the DOCUMENT_IDs.

I do not understand why the results are patterned as they are. I believed that I was starting at RESOURCE_ID and finding only the directories that were 9 deep, since splitting on os.sep gives me a list with ten items.

'/mnt/type/split/v2/doc/100045/2008/FY/31'.split(os.sep) == ['', 'mnt', 'type', 'split', 'v2', 'doc', '100045', '2008', 'FY', '31']


In response to the questions in the comments:


  1. I believed that I was getting each RESOURCE_ID directory and then walking it, and that the other items returned from the first os.walk (dirnames and filenames) would be ignored.

  2. I did not think os.listdir would work. I can make this work with glob, but I am worried about it eating my memory.


Answer

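os.walk() already walks a directory structure recursively: it yields the top directory first and then every directory nested below it, at every level. Your outer loop therefore visits every directory in the tree, not just the RESOURCE_ID directories, and because resource_id is always a non-empty string, the `if resource_id:` test is always true. So for every directory you start another full recursive walk of that directory plus everything nested in it. Each DOCUMENT_ID directory is then reported once by the walk that starts at the top level, once each by the walks that start at its RESOURCE_ID, YYYY and FY ancestors, and once by the walk that starts at the DOCUMENT_ID directory itself. That is where the five copies come from, and why the first 1,500,000 entries (from the top-level walk) are unique.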
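A small reproduction makes the multiplication visible. The sketch below builds a throwaway tree with the same four levels under a temporary directory and counts how often each leaf directory is reported by a nested walk versus a single walk. The helper names (`build_tree`, `nested_walk`, `single_walk`) and the sample ids are made up for illustration; the depth is computed relative to the temporary root rather than hard-coded to 10.

```python
import os
import tempfile
from collections import Counter


def build_tree(root):
    """Create root/RESOURCE_ID/YYYY/FY/DOCUMENT_ID for a few sample ids."""
    for resource_id in ("100045", "1028"):
        for document_id in ("28", "29"):
            os.makedirs(os.path.join(root, resource_id, "2008", "FY", document_id))


def nested_walk(root, depth):
    """Mimic the question's code: start a new os.walk() from every directory."""
    found = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        for inner_path, _d, _f in os.walk(dirpath):
            if len(inner_path.split(os.sep)) == depth:
                found.append(inner_path)
    return found


def single_walk(root, depth):
    """Walk the tree once and keep only directories at the target depth."""
    return [
        dirpath
        for dirpath, _dirnames, _filenames in os.walk(root)
        if len(dirpath.split(os.sep)) == depth
    ]


with tempfile.TemporaryDirectory() as root:
    build_tree(root)
    # Leaf (DOCUMENT_ID) directories sit four levels below the root.
    depth = len(root.split(os.sep)) + 4
    nested_counts = Counter(nested_walk(root, depth))
    single_counts = Counter(single_walk(root, depth))

print(sorted(set(nested_counts.values())))  # [5]: each leaf is reported five times
print(sorted(set(single_counts.values())))  # [1]: each leaf is reported once
```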

Call os.walk() just once:

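import os

def survey():
    magic_paths = []
    # A single os.walk() visits every directory in the tree exactly once.
    for (dirpath, dirnames, filenames) in os.walk('/mnt/type/split/v2/doc'):
        # Ten components after splitting on os.sep means a DOCUMENT_ID directory.
        if len(dirpath.split(os.sep)) == 10:
            magic_paths.append(dirpath + os.linesep)
    write_survey(magic_paths)  # your own helper from the question
    return len(magic_paths)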
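
With 1,500,000 paths, a possible refinement is to yield the paths lazily and to stop os.walk() from descending below the DOCUMENT_ID level. When walking top-down (the default), os.walk() lets you modify `dirnames` in place to prune the walk. The following is only a sketch under assumptions: `iter_document_dirs`, its `levels` parameter, and `write_inventory` are hypothetical names, and depth is measured relative to the root instead of comparing against a hard-coded 10.

```python
import os


def iter_document_dirs(root, levels=4):
    """Yield directories exactly `levels` below `root`, walking the tree once.

    levels=4 assumes the RESOURCE_ID/YYYY/FY/DOCUMENT_ID layout.
    """
    root = os.path.abspath(root)
    for dirpath, dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # The root itself is reported as "." and counts as depth 0.
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth == levels:
            yield dirpath
            # Emptying dirnames in place tells os.walk() not to descend further.
            dirnames[:] = []


def write_inventory(root, out_path):
    """Write one path per line to out_path and return how many were written."""
    count = 0
    with open(out_path, "w") as out:
        for path in iter_document_dirs(root):
            out.write(path + "\n")
            count += 1
    return count
```

Because the paths are written as they are found, the full list never has to sit in memory, and anything nested below a DOCUMENT_ID directory is not scanned at all.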