Jeni Jeni - 6 days ago 7
Java Question

Efficient way to replace all special characters and numbers in a large text file in Java

I'm currently working on a program that creates a pie chart based on the frequencies of letters in a text file. My test file is relatively large, and although my program works great on smaller files, it is very slow on large ones. I want to cut down the time it takes by finding a more efficient way to go through the text file and remove special characters and numbers. This is the code I have right now for this portion:

import java.io.BufferedReader;
import java.io.FileReader;
import javax.swing.JPanel;

public class readFile extends JPanel {
    protected static String stringOfChar = "";

    public static String openFile() {
        String s = "";
        try {
            BufferedReader reader = new BufferedReader(new FileReader("xWords.txt"));
            while ((s = reader.readLine()) != null) {
                // Replace everything that is not a letter or a space with a space
                String newstr = s.replaceAll("[^a-z A-Z]", " ");
                stringOfChar += newstr;
            }
            reader.close();
            return stringOfChar;
        }
        catch (Exception e) {
            System.out.println("File not found.");
        }
        return stringOfChar;
    }
}


The code reads through the text file line by line, replacing all special characters and numbers with a space. After this is done, I sort the string into a HashMap of characters and frequencies.

I know from testing that this portion of the code is what is causing the bulk of extra time to process the file, but I'm not sure how I could replace all the characters in an efficient manner.

Answer

Your code has two inefficiencies:

  • It constructs throw-away strings with special characters replaced by space in s.replaceAll
  • It builds large strings by concatenating String objects with +=

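Both of these operations create a lot of unnecessary objects. The concatenation is the more expensive of the two: each += copies everything accumulated so far, so the total work grows quadratically with the size of the file. On top of this, the final String object is itself thrown away as soon as the final result, the map of counts, is constructed.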
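If the cleaned text is still needed as a string for some other purpose, the concatenation cost on its own can be avoided with a StringBuilder, which appends into a growable buffer instead of copying the whole string on every line. The following is only a sketch under assumptions: CleanText and clean are hypothetical names, the input is passed in as a Reader rather than opened from xWords.txt, and the check keeps only unaccented a-z and A-Z, like the regular expression in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper class, not part of the original program.
final class CleanText {
    // Returns the input with every character other than a-z / A-Z
    // replaced by a space. Line terminators are dropped, as with readLine().
    static String clean(Reader in) throws IOException {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
                    // Append in place; no intermediate String is created per line
                    out.append(asciiLetter ? c : ' ');
                }
            }
        }
        return out.toString();
    }
}
```

For the file in the question this could be called as CleanText.clean(new FileReader("xWords.txt")). It still builds a string that the counting step does not need.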

You should be able to fix both these deficiencies by constructing the map as you read through the file, avoiding both the replacements and concatenations:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public static Map<Character,Integer> openFileAndCount() throws IOException {
    Map<Character,Integer> res = new HashMap<Character,Integer>();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new FileReader("xWords.txt"))) {
        String s;
        while ((s = reader.readLine()) != null) {
            for (int i = 0 ; i != s.length() ; i++) {
                char c = s.charAt(i);
                // The check below lets through all letters, not only Latin ones.
                // Use a different check to get rid of accented letters
                // e.g. è, à, ì and other characters that you do not want.
                if (!Character.isLetter(c)) {
                    c = ' ';
                }
                // Add 1 to the count for c, starting at 1 the first time c is seen
                res.merge(c, 1, Integer::sum);
            }
        }
    }
    return res;
}
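
The comments in the code mention using a different check to exclude accented letters. As a rough sketch of what that could look like, the version below counts only unaccented a-z and A-Z and maps everything else to a space, which matches the regular expression in the question. AsciiLetterCounter, isAsciiLetter and count are hypothetical names, and the method takes a Reader so it can be tried on a short string instead of xWords.txt.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class, shown only to illustrate a stricter letter check.
final class AsciiLetterCounter {
    // True only for unaccented Latin letters, unlike Character.isLetter
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Counts each ASCII letter; every other character is counted as a space
    static Map<Character, Integer> count(Reader in) throws IOException {
        Map<Character, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (!isAsciiLetter(c)) {
                        c = ' ';
                    }
                    counts.merge(c, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Upper and lower case are counted separately here, as in the code above. If the pie chart should treat 'A' and 'a' as the same letter, applying Character.toLowerCase(c) before the merge would be one way to do it.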