user3705662 - 6 months ago

How to read multiple text files into a single RDD?

I want to read a set of text files from an HDFS location and map over their contents iteratively using Spark.

JavaRDD<String> records = ctx.textFile(args[1], 1);
is capable of reading only one file at a time.

I want to read more than one file and process them as a single RDD. How?


You can pass a whole directory, a wildcard pattern, or even a comma-separated list of directories and wildcard patterns as the path argument.
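As a minimal sketch of the comma-separated form: the paths below are hypothetical placeholders, and joinPaths is a helper name made up for this illustration, not part of Spark. The block only builds the path string so that it runs without Spark on the classpath; the textFile call from the question is shown in a comment.

```java
import java.util.Arrays;
import java.util.List;

public class MultiPathInput {
    // Hypothetical helper: joins directories, glob patterns, and single
    // files into one comma-separated path string.
    public static String joinPaths(List<String> paths) {
        return String.join(",", paths);
    }

    public static void main(String[] args) {
        // Placeholder paths, for illustration only.
        List<String> inputs = Arrays.asList(
            "/my/dir1",                 // a whole directory
            "/my/paths/part-00[0-5]*",  // a wildcard (glob) pattern
            "/another/dir/data.txt");   // a single file
        String joined = joinPaths(inputs);
        System.out.println(joined);
        // Assuming ctx is the JavaSparkContext from the question, the
        // joined string would be passed in place of a single path:
        //   JavaRDD<String> records = ctx.textFile(joined);
    }
}
```

The resulting RDD should then contain the lines of every matched file, so the rest of the job can treat it as a single input.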


As Nick Chammas points out, this behaviour is exposed from Hadoop's FileInputFormat, so the same path syntax also works with Hadoop (and Scalding).