The Time The Time - 1 month ago 17
Java Question

parse a document with million words

I have implemented some code to find the anagrams word in the txt

sample.txt
file and output them on the console. The txt document contains String (word) in each of line.

Is that the right Approach to use if I want to find the anagram words in txt.file with Million or 20 Billion of words? If not which Technologie should I use in this case?

I appreciate any help.

Sample

abac
aabc
hddgfs
fjhfhr
abca
rtup
iptu
xyz
oifj
zyx
toeiut
yxz
jrgtoi


oupt

abac aabc abca
xyz zyx yxz


Code

package org.reader;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Test {
// To store the anagram words
static List<String> match = new ArrayList<String>();
// Flag to check whether the checkWorld1InMatch() was invoked.
static boolean flagCheckWord1InMatch;

public static void main(String[] args) {
String fileName = "G:\\test\\sample2.txt";
StringBuilder sb = new StringBuilder();
// In case of matching, this flag is used to append the first word to
// the StringBuilder once.
boolean flag = true;

BufferedReader br = null;
try {
// convert the data in the sample.txt file to list
List<String> list = Files.readAllLines(Paths.get(fileName));

for (int i = 0; i < list.size(); i++) {

flagCheckWord1InMatch = true;
String word1 = list.get(i);

for (int j = i + 1; j < list.size(); j++) {

String word2 = list.get(j);

boolean isExist = false;

if (match != null && !match.isEmpty() && flagCheckWord1InMatch) {
isExist = checkWord1InMatch(word1);

}

if (isExist) {
// A word with the same characters was checked before
// and there is no need to check it again. Therefore, we
// jump to the next word in the list.
// flagCheckWord1InMatch = true;
break;
} else {
boolean result = isAnagram(word1, word2);
if (result) {

if (flag) {
sb.append(word1 + " ");
flag = false;
}

sb.append(word2 + " ");

}
if (j == list.size() - 1 && sb != null && !sb.toString().isEmpty()) {
match.add(sb.toString().trim());
sb.setLength(0);
flag = true;

}

}

}
}

} catch (

IOException e) {
e.printStackTrace();
} finally {
try {
if (br != null) {
br.close();
}
} catch (IOException ex) {
ex.printStackTrace();
}
}

for (String item : match) {
System.out.println(item);
}

// System.out.println("Sihwail");

}

private static boolean checkWord1InMatch(String word1) {
flagCheckWord1InMatch = false;
boolean isAvailable = false;
for (String item : match) {
String[] content = item.split(" ");
for (String word : content) {
if (word1.equals(word)) {
isAvailable = true;
break;

}
}
}
return isAvailable;
}

public static boolean isAnagram(String firstWord, String secondWord) {
char[] word1 = firstWord.toCharArray();
char[] word2 = secondWord.toCharArray();
Arrays.sort(word1);
Arrays.sort(word2);
return Arrays.equals(word1, word2);
}

}

Answer

For 20 billion words you will not be to able to hold all of them in RAM so you need an approach to process them in chunks.

20,000,000,000 words. Java needs quite a lot of memory to store strings so you can count 2 bytes per character and at least 38 bytes overhead.

This means 20,000,000,000 words of one character would need 800,000,000,000 bytes or 800 GB, which is more than any computer I know has.

Your file will contain much less than 20,000,000,000 different words, so you might avoid the memory problem if you store every word only once (e.g. in a Set).