Aleksandr Aleksandr - 7 days ago 5
Java Question

Counting distinct words V2

I've asked this question before (Counting distinct words) and have since cleaned up the code. As described in the first question, I need to count the distinct words in a file.

The debugger shows that all my words are stored and sorted correctly, but the issue now is an infinite "while" loop in the Test class that keeps going after all the words have been read (debugging really helped to figure out some points...).
I'm testing the code on a small file now, with no more than 10 words.

The DataSet class has been modified the most.

I need some advice on how to get out of the loop.

Test looks like this:

package test;

import java.io.File;
import java.io.IOException;

import junit.framework.Assert;
import junit.framework.TestCase;
import main.DataSet;
import main.WordReader;

public class Test extends TestCase
{

    public void test2() throws IOException
    {
        File words = new File("resources" + File.separator + "test2.txt");

        if (!words.exists())
        {
            System.out.println("File [" + words.getAbsolutePath()
                    + "] does not exist");
            Assert.fail();
        }

        WordReader wr = new WordReader(words);
        DataSet ds = new DataSet();

        String nextWord = wr.readNext();
        // This is the loop
        while (nextWord != "" && nextWord != null)
        {
            if (!ds.member(nextWord))
            {
                ds.insert(nextWord);
            }
            nextWord = wr.readNext();
        }
        wr.close();
        System.out.println(ds.toString());
        System.out.println(words.toString() + " contains " + ds.getLength()
                + " distinct words");

    }

}


Here is my updated DataSet class. I'm still not sure about it, especially the member() method, because at some point I used to get a NullPointerException (I don't know why...):

package main;

import sort.Sort;

public class DataSet
{

    private String[] data;
    private static final int DEFAULT_VALUE = 200;
    private int nextIndex;
    private Sort bubble;

    public DataSet(int initialCapacity)
    {
        data = new String[initialCapacity];
        nextIndex = 0;
        bubble = new Sort();
    }

    public DataSet()
    {
        this(DEFAULT_VALUE);
        nextIndex = 0;
        bubble = new Sort();
    }

    public void insert(String value)
    {
        if (nextIndex < data.length)
        {
            data[nextIndex] = value;
            nextIndex++;
            bubble.bubble_sort(data, nextIndex);
        }
        else
        {
            expandCapacity();
            insert(value);
        }
    }

    public int getLength()
    {
        return nextIndex + 1;
    }


    public boolean member(String value)
    {
        for (int i = 0; i < data.length; i++)
        {

            if (data[i] != null && nextIndex != 10)
            {
                if (data[i].equals(value))
                    return true;
            }
        }
        return false;
    }

    private void expandCapacity()
    {
        String[] larger = new String[data.length * 2];
        for (int i = 0; i < data.length; i++)
        {
            data = larger;
        }
    }
}


The WordReader class didn't change much. The ArrayList was replaced with a simple array, and the storing method has also been modified:

package main;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class WordReader
{

    private File file;

    private String[] words;

    private int nextFreeIndex;

    private BufferedReader in;

    private int DEFAULT_SIZE = 200;

    private String word;

    public WordReader(File file) throws IOException
    {
        words = new String[DEFAULT_SIZE];
        in = new BufferedReader(new FileReader(file));
        nextFreeIndex = 0;
    }

    public void expand()
    {
        String[] newArray = new String[words.length * 2];
        // System.arraycopy(words, 0, newArray, 0, words.length);
        for (int i = 0; i < words.length; i++)
            newArray[i] = words[i];
        words = newArray;
    }

    public void read() throws IOException
    {

    }

    public String readNext() throws IOException
    {
        char nextCharacter = (char) in.read();

        while (in.ready())
        {
            while (isWhiteSpace(nextCharacter) || !isCharacter(nextCharacter))
            {
                // word = "";
                nextCharacter = (char) in.read();

                if (!in.ready())
                {
                    break;
                }
            }

            word = "";
            while (isCharacter(nextCharacter))
            {
                word += nextCharacter;
                nextCharacter = (char) in.read();
            }
            storeWord(word);

            return word;
        }

        return word;
    }

    private void storeWord(String word)
    {
        if (nextFreeIndex < words.length)
        {
            words[nextFreeIndex] = word;
            nextFreeIndex++;
        }
        else
        {
            expand();
            storeWord(word);
        }

    }

    private boolean isWhiteSpace(char next)
    {
        if ((next == ' ') || (next == '\t') || (next == '\n'))
        {
            return true;
        }
        return false;
    }

    private boolean isCharacter(char next)
    {
        if ((next >= 'a') && (next <= 'z'))
        {
            return true;
        }
        if ((next >= 'A') && (next <= 'Z'))
        {
            return true;
        }
        return false;
    }

    public boolean fileExists()
    {
        return file.exists();
    }

    public boolean fileReadable()
    {
        return file.canRead();
    }

    public Object wordsLength()
    {
        return words.length;
    }

    public void close() throws IOException
    {
        in.close();
    }

    public String[] getWords()
    {
        return words;
    }

}


And the Bubble Sort class has been changed to work with strings:

package sort;

public class Sort
{
    public void bubble_sort(String a[], int length)
    {
        for (int j = 0; j < length; j++)
        {
            for (int i = j + 1; i < length; i++)
            {
                if (a[i].compareTo(a[j]) < 0)
                {
                    String t = a[j];
                    a[j] = a[i];
                    a[i] = t;
                }
            }
        }
    }
}

Answer

I suppose the method that actually blocks is WordReader.readNext(). My suggestion is to use a Scanner instead of a BufferedReader; it is more suitable for parsing a file into words.
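
One plausible reading of why the loop never ends, judging only from the code in the question: Reader.read() returns -1 at the end of the stream, but readNext() casts the result to char before looking at it, and once in.ready() reports false the method falls through to "return word", which still holds the previous word. The loop in Test then never sees null or an empty string. If you want to keep a plain Reader, a minimal sketch that checks for -1 and returns null at the end of input could look like this. The class name EofAwareWordReader is hypothetical, and wrapping IOException in UncheckedIOException is my own simplification, not something the question's code does.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not part of the question's code.
class EofAwareWordReader {
    private final Reader in;

    EofAwareWordReader(Reader in) {
        this.in = in;
    }

    // Returns the next run of ASCII letters, or null once the input is exhausted.
    String readNext() {
        try {
            // Keep the result as an int so that -1 (end of stream) stays recognizable.
            int c = in.read();
            // Skip everything that is not a letter, stopping at end of stream.
            while (c != -1 && !isLetter(c)) {
                c = in.read();
            }
            if (c == -1) {
                return null; // no more words
            }
            StringBuilder word = new StringBuilder();
            while (c != -1 && isLetter(c)) {
                word.append((char) c);
                c = in.read();
            }
            return word.toString();
        } catch (IOException e) {
            // Assumption: callers prefer an unchecked exception here.
            throw new UncheckedIOException(e);
        }
    }

    private static boolean isLetter(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

With this shape, a caller can loop until readNext() returns null, and a stale word is never returned twice.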

Your readNext() method could be rewritten like this (where scan is a Scanner):

public String readNext() {
    if (scan.hasNext()) {
        String word = scan.next();
        if (!word.matches("[A-Za-z]+"))
            word = "";
        storeWord(word);
        return word;
    }
    return null;
}

This gives roughly the same functionality as your code without using isCharacter() or isWhiteSpace(): the regex inside matches() checks that a word contains only letters, and the whitespace handling is built into the next() method, which separates the words. Note that a token containing anything other than letters is returned as an empty string. The added functionality is that it returns null when there are no more words in the file.
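
One caveat with the snippet above: a token such as "word," or "42" makes it return "", and the loop in the question's Test class stops at the first empty string. A minimal sketch that skips such tokens instead, so that null is the only end-of-input signal, might look like this. The class name ScannerWordReader is hypothetical and storeWord() is left out for brevity.

```java
import java.util.Scanner;

// Hypothetical variant for illustration; not the answer's exact method.
class ScannerWordReader {
    private final Scanner scan;

    ScannerWordReader(Scanner scan) {
        this.scan = scan;
    }

    // Returns the next whitespace-separated token made only of ASCII letters,
    // or null when the input has no such token left.
    String readNext() {
        while (scan.hasNext()) {
            String token = scan.next();
            if (token.matches("[A-Za-z]+")) {
                return token;
            }
            // Tokens with digits or punctuation are skipped, not returned as "".
        }
        return null;
    }
}
```

Unlike the character-by-character reader in the question, this drops a token like "two," entirely rather than trimming it to "two".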

You'll have to change the while loop in your Test class for this to work properly. Compare strings with equals() or isEmpty() instead of !=, which only compares references, and put the null check first: calling a method on a null nextWord throws a NullPointerException, so a null check placed second is useless.
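
The loop with the null check first can be sketched as follows. This is only an illustration under assumptions: the Iterator stands in for WordReader (null marks the end of input), a TreeSet stands in for the question's DataSet, and isEmpty() replaces the != "" comparison, which compares references rather than contents. The class name DistinctWordLoop is hypothetical.

```java
import java.util.Iterator;
import java.util.TreeSet;

// Hypothetical stand-in for the loop in the Test class.
class DistinctWordLoop {

    // Counts distinct words until the source signals the end with null
    // (or with an empty string, as the original loop intended).
    static int countDistinct(Iterator<String> source) {
        TreeSet<String> seen = new TreeSet<>(); // stand-in for DataSet
        String nextWord = next(source);
        // The null check comes first, so isEmpty() is never called on null.
        while (nextWord != null && !nextWord.isEmpty()) {
            seen.add(nextWord); // a set ignores duplicates
            nextWord = next(source);
        }
        return seen.size();
    }

    // Mimics a readNext() that returns null when nothing is left.
    private static String next(Iterator<String> source) {
        return source.hasNext() ? source.next() : null;
    }
}
```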

To create the Scanner, you can pass it a BufferedReader or the File directly, like this:

Scanner scan = new Scanner(file);