lydiaP - 6 months ago
Java Question

How to use regular expressions in Java to remove certain characters

The general question is: how can I parse a string, eliminate punctuation, and replace some of it?

I'm trying to modify some input text. I have a normal text file with punctuation, and I want all of it removed. If the symbol is one of . ! ? ... I want to replace it with a "<s>" string.

I have never used regex, so I tried string comparison, but obviously that isn't sufficient for all cases. I have trouble when there are two punctuation marks in a row, as in the text "the second Day (the 4th).", where ). appear together.

For example, from the given input I expect the following:

Input : [...] at it!" This speech caused
Expected output : at it <s> this speech caused


Every word in my code is added to an ArrayList because I need to work with that later.

Thanks a lot!

FileInputStream fileInputStream = new FileInputStream("TEXT.txt");
InputStreamReader inputStreamReader = new InputStreamReader(
        fileInputStream, "UTF-8");
BufferedReader bf = new BufferedReader(inputStreamReader);

words.add("<s>");
String s;
while ((s = bf.readLine()) != null) {
    String[] var = s.split(" ");

    for (int i = 0; i < var.length; i++) {
        if (var[i].endsWith(",") || var[i].endsWith(")")
                || var[i].endsWith("(") || var[i].endsWith(":")
                || var[i].endsWith(";") || var[i].endsWith("'")) {
            var[i] = var[i].substring(0, var[i].length() - 1);
            words.add(var[i].toLowerCase());
        } else if (var[i].startsWith("'")) {
            var[i] = var[i].substring(1, var[i].length());
            words.add(var[i].toLowerCase());
        } else if (var[i].endsWith(".") || var[i].endsWith("...")
                || var[i].endsWith("!") || var[i].endsWith("?")) {
            var[i] = var[i].substring(0, var[i].length() - 1);
            words.add(var[i].toLowerCase());
            words.add("<s>");
        } else {
            words.add(var[i].toLowerCase());
            // System.out.println("\n newly read word: " + var[i]);
        }
    }
}

Answer

First use a regex to filter out the punctuation, and only then split the line on spaces and add the result to your list:

FileInputStream fileInputStream = new FileInputStream("TEXT.txt");
InputStreamReader inputStreamReader = new InputStreamReader(
        fileInputStream, "UTF-8");
BufferedReader bf = new BufferedReader(inputStreamReader);
words.add("<s>");
String s;
while ((s = bf.readLine()) != null) {
    s = s.replaceAll("[^a-zA-Z ]", ""); // remove every character that is not an ASCII letter or a space
    String[] var = s.split(" ");
    words.addAll(Arrays.asList(var)); // addAll needs a Collection, not an array (requires java.util.Arrays)
}
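
Note that the snippet above strips the punctuation but does not insert the <s> marker after . ! ? ... and does not lowercase the words, both of which the question asks for. One possible way to cover that is sketched below. The class name SentenceTokenizer and the method tokenize are made up for illustration, and the character class is the same ASCII-letters-only one as above (so digits and accented letters are dropped too), which is an assumption you may want to widen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original question or answer.
public class SentenceTokenizer {
    // Sentence marker taken from the question.
    static final String MARKER = "<s>";

    static List<String> tokenize(String line) {
        // 1. Drop everything except ASCII letters, whitespace and . ! ?
        //    (assumption: same letter set as the regex in the answer).
        String cleaned = line.replaceAll("[^a-zA-Z.!?\\s]", "");
        // 2. Turn each run of sentence-ending marks (".", "...", "?!") into one marker.
        String marked = cleaned.replaceAll("[.!?]+", " " + MARKER + " ");
        // 3. Split on runs of whitespace, skip empty tokens, lowercase the rest.
        List<String> tokens = new ArrayList<>();
        for (String token : marked.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [at, it, <s>, this, speech, caused]
        System.out.println(tokenize("at it!\" This speech caused"));
    }
}
```

Inside the reading loop you would then call words.addAll(tokenize(s)) for each line. Because the punctuation is handled before splitting, combinations like ). or !" no longer need a special case.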