aa8y aa8y - 3 months ago 17
Java Question

Filtering input files using globStatus in MapReduce

I have a lot of input files and I want to process selected ones based on the date that has been appended in the end. I am now confused on where do I use the globStatus method to filter out the files.

I have a custom RecordReader class and I was trying to use globStatus in its next method but it didn't work out.

public boolean next(Text key, Text value) throws IOException {
Path filePath = fileSplit.getPath();

if (!processed) {
key.set(filePath.getName());

byte[] contents = new byte[(int) fileSplit.getLength()];
value.clear();
FileSystem fs = filePath.getFileSystem(conf);
fs.globStatus(new Path("/*" + date));
FSDataInputStream in = null;

try {
in = fs.open(filePath);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
} finally {
IOUtils.closeStream(in);
}
processed = true;
return true;
}
return false;
}


I know it returns a FileStatus array, but how do I use it to filter the files. Can someone please shed some light?

Answer

The globStatus method takes 2 complimentary arguments which allow you to filter your files. The first one is the glob pattern, but sometimes glob patterns are not powerful enough to filter specific files, in which case you can define a PathFilter.

Regarding the glob pattern, the following are supported:

Glob   | Matches
-------------------------------------------------------------------------------------------------------------------
*      | Matches zero or more characters
?      | Matches a single character
[ab]   | Matches a single character in the set {a, b}
[^ab]  | Matches a single character not in the set {a, b}
[a-b]  | Matches a single character in the range [a, b] where a is lexicographically less than or equal to b
[^a-b] | Matches a single character not in the range [a, b] where a is lexicographically less than or equal to b
{a,b}  | Matches either expression a or b
\c     | Matches character c when it is a metacharacter

PathFilter is simply an interface like this:

public interface PathFilter {
    boolean accept(Path path);
}

So you can implement this interface and implement the accept method where you can put your logic to filter files.

An example taken from Tom White's excellent book which allows you to define a PathFilter to filter files that match a certain regular expression:

public class RegexExcludePathFilter implements PathFilter {
    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}

You can directly filter your input with a PathFilter implementation by calling FileInputFormat.setInputPathFilter(JobConf, RegexExcludePathFilter.class) when initializing your job.

EDIT: Since you have to pass the class in setInputPathFilter, you can't directly pass arguments, but you should be able to do something similar by playing with the Configuration. If you make your RegexExcludePathFilter also extend from Configured, you can get back a Configuration object which you will have initialized before with the desired values, so you can get back these values inside your filter and process them in the accept.

For example if you initialize like this:

conf.set("date", "2013-01-15");

Then you can define your filter like this:

public class RegexIncludePathFilter extends Configured implements PathFilter {
    private String date;
    private FileSystem fs;

    public boolean accept(Path path) {
        try {
            if (fs.isDirectory(path)) {
                return true;
            }
        } catch (IOException e) {}
        return path.toString().endsWith(date);
    }

    public void setConf(Configuration conf) {
        if (null != conf) {
            this.date = conf.get("date");
            try {
                this.fs = FileSystem.get(conf);
            } catch (IOException e) {}
        }
    }
}

EDIT 2: There were a few issues with the original code, please see the updated class. You also need to remove the constructor since it's not used anymore, and check if that's a directory in which case you should return true so the content of the directory can be filtered too.

Comments