So I have just 1 parquet file I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting
You should write your parquet files with a smaller block size. Default is 128Mb per block, but it's configurable by setting
parquet.block.size configuration in the writer.
The source of ParquetOuputFormat is here, if you want to dig into details.
The block size is minimum amount of data you can read out of a parquet file which is logically readable (since parquet is columnar, you can't just split by line or something trivial like this), so you can't have more reading threads than input blocks.