user1261445 - 1 month ago
Java Question

Using threadpools/threading for reading large txt files?

On a previous question of mine I posted:

I have to read several very large txt files and have to either use multiple threads or a single thread to do so depending on user input.
Say I have a main method that gets user input, and the user requests a single thread and wants to process 20 txt files for that thread. How would I accomplish this? Note that the below isn't my code or its setup but just what the "idea" is.


Example:

int numFiles = 20;
int threads = 1;

String[] list = new String[numFiles];
for (int i = 0; i < numFiles; i++) {
    // Array indices run from 0 to 19, file names from 1 to 20.
    list[i] = "hello" + (i + 1) + ".txt"; // so the list is hello1.txt, hello2.txt, ..., hello20.txt
}

public void run(){
//processes txt file
}


So in summary, how would I accomplish this with a single thread? With 20 threads?

And a user suggested using threadPools:

When the user specifies how many threads to use, you'd configure the pool appropriately, submit the set of file-read jobs, and let the pool sort out the executions.
In the Java world, you'd use the Executors.newFixedThreadPool factory method, and submit each job as a Callable. Here's an article from IBM on Java thread pooling.


So now I have a method called sortAndMap(String x), which takes a txt file name and does the processing. For the example above, I would have

Executors.newFixedThreadPool(numThreads);

How do I use this with threadPools so that my example above is doable?

Answer

Ok, bear with me on this, because I need to explain a few things.

First off, unless the files are spread over multiple disks, or your single disk is an SSD, it's not recommended to use more than one thread to read from the disk. Many questions on this topic have been posted and the conclusion was the same: using multiple threads to read from a single mechanical disk will hurt performance instead of improving it.

The above happens because the disk's mechanical head needs to keep seeking the next position to read. Using multiple threads means that when each thread gets a chance to run it will direct the head to a different section of the disk, thus making it bounce between disk areas inefficiently.

The accepted solution for processing multiple files is a single-producer, multiple-consumer system: one reader thread that produces work and several processing threads that consume it. A thread pool is the ideal mechanism in this case, with one thread acting as the producer and putting tasks in the pool's queue for the workers to process.

Something like this:

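import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
    public static void main(String[] args) throws IOException, InterruptedException {
        int numFiles = 20;
        int threads = 4;

        ExecutorService exec = Executors.newFixedThreadPool(threads);

        // The main thread is the only reader (producer); the pool threads only process.
        for (int i = 1; i <= numFiles; i++) {
            // File names follow the question's example: hello1.txt ... hello20.txt
            List<String> fileContents = Files.readAllLines(Paths.get("hello" + i + ".txt"));
            exec.submit(new ThreadTask(fileContents));
        }

        exec.shutdown(); // no new tasks are accepted; queued tasks still run
        exec.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS); // wait for the workers to finish
        // ...
    }
}

class ThreadTask implements Runnable {

    private final List<String> fileContents;

    public ThreadTask(List<String> fileContents) {
        this.fileContents = fileContents;
    }

    @Override
    public void run() {
        // processes txt file
    }
}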
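
If you also need a result back from each file, the Callable approach mentioned in the question can be combined with the same single-reader idea. What follows is only a rough sketch under a few assumptions: the class name FilePool and the method processAll are made up for illustration, and sortAndMap here is a placeholder that takes the already-read lines (not a file name, as in the question) and just counts non-empty lines. The thread count is a plain parameter, so passing 1 gives the single-threaded case and passing 20 gives the 20-thread case.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FilePool {

    // Hypothetical stand-in for the real processing: counts non-empty lines.
    static int sortAndMap(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Reads each file on the calling thread (single reader) and processes it
    // on a fixed pool of numThreads workers. Results are returned in the same
    // order as the input files.
    static List<Integer> processAll(List<Path> files, int numThreads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService exec = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (Path file : files) {
                final List<String> lines = Files.readAllLines(file);
                Callable<Integer> task = () -> sortAndMap(lines);
                futures.add(exec.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } finally {
            exec.shutdown(); // let the worker threads exit
        }
    }

    public static void main(String[] args) throws Exception {
        // Thread count comes from the user; defaults to a single thread.
        int numThreads = args.length > 0 ? Integer.parseInt(args[0]) : 1;
        List<Path> files = new ArrayList<>();
        for (int i = 1; i <= 20; i++) {
            Path p = Paths.get("hello" + i + ".txt");
            if (Files.exists(p)) { // skip missing files so the sketch runs anywhere
                files.add(p);
            }
        }
        System.out.println(processAll(files, numThreads));
    }
}
```

Because the reading loop and the pool are decoupled, workers can start on the first file while the main thread is still reading the next one. Keep in mind that every file read but not yet processed sits in memory, so with very large files you may want to limit how far the reader runs ahead.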