eboni eboni - 5 months ago 60
Scala Question

How to use regex to include/exclude some input files in sc.textFile?

I am trying to select specific files by date in Apache Spark, inside the file-to-RDD function (sc.textFile).


I have attempted to do the following:


This should match the following:


Any idea how to achieve this?


Looking at the accepted answer, it seems to use some form of glob syntax. It also reveals that the API exposes Hadoop's FileInputFormat.

Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPaths "may represent a file, a directory, or, by using glob, a collection of files and directories". Perhaps SparkContext also uses those APIs to set the path.

The syntax of the glob includes:

  • * (match 0 or more characters)
  • ? (match a single character)
  • [ab] (character class)
  • [^ab] (negated character class)
  • [a-b] (character range)
  • {a,b} (alternation)
  • \c (escape character)
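
To make the constructs in the list above concrete, here is a minimal sketch that translates such a glob into a Java regex and tests it against plain strings. It is not Hadoop's implementation: the names GlobSketch, globToRegex and matches are made up for illustration, and it assumes that * and ? never match the path separator /.

```scala
import java.util.regex.Pattern

// Hypothetical helper, for illustration only; not part of Spark or Hadoop.
object GlobSketch {
  // Emit a character so that the regex engine treats it literally.
  // Letters and digits are safe as-is; anything else gets a backslash.
  private def lit(ch: Char): String =
    if (ch.isLetterOrDigit) ch.toString else "\\" + ch

  // Translate a glob into a regex string.
  // Assumption: wildcards do not cross '/'.
  def globToRegex(glob: String): String = {
    val sb = new StringBuilder
    var i = 0
    var braceDepth = 0   // > 0 while inside {a,b}
    var inClass = false  // true while inside [...]
    while (i < glob.length) {
      val c = glob.charAt(i)
      c match {
        case '\\' if i + 1 < glob.length =>
          // \c: take the next character literally
          i += 1
          sb.append(lit(glob.charAt(i)))
        case _ if inClass =>
          // inside [...]: copy through, ']' closes the class
          if (c == ']') inClass = false
          sb.append(c)
        case '[' => inClass = true; sb.append('[')
        case '*' => sb.append("[^/]*")
        case '?' => sb.append("[^/]")
        case '{' => braceDepth += 1; sb.append("(?:")
        case '}' if braceDepth > 0 => braceDepth -= 1; sb.append(')')
        case ',' if braceDepth > 0 => sb.append('|')
        case _ => sb.append(lit(c))
      }
      i += 1
    }
    sb.toString
  }

  // True if the whole path matches the glob.
  def matches(glob: String, path: String): Boolean =
    Pattern.matches(globToRegex(glob), path)
}
```

For example, GlobSketch.matches("2016-03-1[0-9].log", "2016-03-15.log") is true, while GlobSketch.matches("*.log", "a/b.log") is false under the separator assumption stated above. The file names here are invented and are not the ones from the question.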

Following the example in the accepted answer, it is possible to write your path as:


It's not clear how alternation syntax can be used here, since a comma is used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary: