Fgblanch Fgblanch - 12 days ago 6
Java Question

How to list a 2 million files directory in java without having an "out of memory" exception

I have to deal with a directory of about 2 million xml's to be processed.

I've already solved the processing distributing the work between machines and threads using queues and everything goes right.

But now the big problem is the bottleneck of reading the directory with the 2 million files in order to fill the queues incrementally.

I've tried using the

File.listFiles()
method, but it gives me a java
out of memory: heap space
exception. Any ideas?

Answer

First of all, do you have any possibility to use Java 7? There you have a FileVisitor and the Files.walkFileTree, which should probably work within your memory constraints.

Otherwise, the only way I can think of is to use File.listFiles(FileFilter filter) with a filter that always returns false (ensuring that the full array of files is never kept in memory), but that catches the files to be processed along the way, and perhaps puts them in a producer/consumer queue or writes the file-names to disk for later traversal.

Alternatively, if you control the names of the files, or if they are named in some nice way, you could process the files in chunks using a filter that accepts filenames on the form file0000000-filefile0001000 then file0001000-filefile0002000 and so on.

If the names are not named in a nice way like this, you could try filtering them based on the hash-code of the file-name, which is supposed to be fairly evenly distributed over the set of integers.


Update: Sigh. Probably won't work. Just had a look at the listFiles implementation:

public File[] listFiles(FilenameFilter filter) {
    String ss[] = list();
    if (ss == null) return null;
    ArrayList v = new ArrayList();
    for (int i = 0 ; i < ss.length ; i++) {
        if ((filter == null) || filter.accept(this, ss[i])) {
            v.add(new File(ss[i], this));
        }
    }
    return (File[])(v.toArray(new File[v.size()]));
}

so it will probably fail at the first line anyway... Sort of disappointing. I believe your best option is to put the files in different directories.

Btw, could you give an example of a file name? Are they "guessable"? Like

for (int i = 0; i < 100000; i++)
    tryToOpen(String.format("file%05d", i))