R. Williams - 4 months ago 19

C++ Question

I am writing a program for class that reads in a set of ordered pairs from a file and implements the k-means algorithm to identify data clusters. This involves the use of the distance formula, which requires the calculation of a square root. As such, I have included the cmath library, and used the sqrt() function, and everything compiles correctly. However, at run-time, the program generates an infinite loop, which I have determined in gdb is caused by the use of the sqrt function, which causes gdb to generate the line "w_sqrt.c: No such file or directory." I have a similar unresolved problem with the rand() function from the cstdlib library, however I am unable to replicate it until the sqrt() problem is resolved.

Note: This is my first time posting a question on StackOverflow so I apologize in advance if I have overlooked any conventions or rule of posting here.

Additional note: Please limit feedback to that directly related to the problem I have described above. For reasons of academic honesty, I am not seeking advice on the implementation of the k-means function, I am simply trying to get the program to run so that, if there are further problems, I can identify and solve them myself.

Thank you in advance.

`#include <iostream>`

#include <fstream>

#include <cstdlib>

#include <cmath>

using namespace std;

const int ARRAY_MAX = 1000;

//DECIMAL_MULT = 10^(required number of decimal places)

const int DECIMAL_MULT = 10000;

struct node{

double xVal;

double yVal;

char symbol;

int clusterIndex;

};

//Returns the distance between the two input nodes

double getDistance(node A, node B) {

double xSquare = ((B.xVal - A.xVal) * (B.xVal - A.xVal));

double ySquare = ((B.yVal - A.yVal) * (B.yVal - A.yVal));

double result = sqrt(xSquare + ySquare);

return result;

}

int main(){

int numNodes = 0;

char temp[ARRAY_MAX];

string filename;

ifstream file;

cout << "Enter the name of the file to be read: ";

getline(cin, filename);

file.open(filename.c_str());

//Check the number of data entries present in the file

while (!(file.eof())) {

numNodes++;

file.getline(temp, ARRAY_MAX);

}

numNodes--;

file.close();

node list[numNodes];

file.open(filename.c_str());

//Build a list of data points

double xMin, xMax, yMin, yMax;

double value;

file >> value;

file.ignore();

xMin = value;

xMax = value;

list[0].xVal = value;

file >> value;

file.ignore();

yMin = value;

yMax = value;

list[0].yVal = value;

for (int i = 1; i < numNodes; i++) {

file >> value;

file.ignore();

list[i].xVal = value;

if (value < xMin) (xMin = value);

if (value > xMax) (xMax = value);

file >> value;

file.ignore();

list[i].yVal = value;

if (value < yMin) (yMin = value);

if (value > yMax) (yMax = value);

}

//Prompt user for number of clusters and symbol to be used for each

int numClusters;

cout << "Please enter the number of clusters to be analyzed: ";

cin >> numClusters;

cin.ignore();

node centerList[numClusters];

char usedSymbols[numClusters];

char entry;

bool validEntry;

for (int i = 0; i < numClusters; i++) {

centerList[i].clusterIndex = i;

do {

validEntry = true;

cout << "Please enter the character representing "

<< "cluster " << i + 1 << ": ";

cin >> entry;

cin.ignore();

for (int j = 0; j < i; j++) {

if (centerList[j].symbol == entry){

cout << "Character has already been "

<< "used.\n";

validEntry = false;

break;

}

}

} while (!validEntry);

centerList[i].symbol = entry;

}

//Assign random starting points to cluster centers

srand(time(NULL));

int xMaxCast = xMax * DECIMAL_MULT;

int xMinCast = xMin * DECIMAL_MULT;

int yMaxCast = yMax * DECIMAL_MULT;

int yMinCast = yMin * DECIMAL_MULT;

int xRange = xMaxCast - xMinCast;

int yRange = yMaxCast - yMinCast;

int randValue;

for (int i = 0; i < numClusters; i++) {

randValue = std::rand() % xRange + xMinCast;

centerList[i].xVal = randValue / DECIMAL_MULT;

randValue = std::rand() % yRange + yMinCast;

centerList[i].yVal = randValue / DECIMAL_MULT;

}

//Determine the cluster of each node

for (int i = 0; i < numNodes; i++) {

list[i].clusterIndex = centerList[0].clusterIndex;

for (int j = 1; j < numClusters; j++) {

if (getDistance(list[i], centerList[list[i].clusterIndex])

> getDistance(list[i], centerList[j])) {

list[i].clusterIndex = j;

}

}

list[i].symbol = centerList[list[i].clusterIndex].symbol;

}

bool proceed = true;

double average;

int clusterCount;

while (proceed) {

proceed = false;

//Move each cluster center to the centroid of its currently

// assigned points; if all centers are already in the

// correct positions, discontinue this operation

for (int i = 0; i < numClusters; i++) {

average = 0;

clusterCount = 0;

for (int j = 0; j < numNodes; j++) {

if (list[j].clusterIndex == i) {

average += list[j].xVal;

clusterCount++;

}

}

average /= clusterCount;

if (centerList[i].xVal != average) {

proceed = true;

centerList[i].xVal = average;

}

average = 0;

clusterCount = 0;

for (int j = 0; j < numNodes; j++) {

if (list[j].clusterIndex == i) {

average += list[j].yVal;

clusterCount++;

}

}

average /= clusterCount;

if (centerList[i].yVal != average) {

proceed = true;

centerList[i].yVal = average;

}

}

if (proceed) {

//Update cluster assignment of each node

for (int i = 0; i < numNodes; i++) {

for (int j = 0; j < numClusters; j++) {

if (getDistance(list[i],

centerList[list[i].clusterIndex])

> getDistance(list[i],

centerList[j])) {

list[i].clusterIndex = j;

}

}

list[i].symbol =

centerList[list[i].clusterIndex].symbol;

}

}

}

}

Answer

Note that in an expression like `centerList[i].xVal != average`

(i.e., an int is not equal to a double), both values will be compared as doubles. This ends up being a terrible way of comparing floating point numbers, and this is widely discussed.

But here's where the real problem happens. `centerList[i].xVal = average;`

does not assign `centerList[i].xVal`

to `average`

but `(int) average`

(the nearest integer to `average`

towards zero). This leads to a condition that is infinitely repeated. If `node::xVal`

and `node::yVal`

were doubles or the not-equals comparison was fixed by comparing integers or the not-equals comparison was fixed by appropriately comparing floating point numbers, then the program will likely not loop indefinitely (it did not loop indefinitely for the input that I tried).

Note that using the flag `-Wconversion`

, warnings for `centerList[i].xVal = average;`

and `centerList[i].yVal = average;`

appear. Additionally, `-Wconversion`

is not covered by either `-Wall`

or `-Wextra`

. I have found that `-Wconversion`

is a worthwhile flag to use when developing code for numerical algorithms even though it may produce many seemingly benign warnings. Everything is fine and dandy until a devastating implicit conversion from a `float`

to an `int`

causes a massive relative error.

I would hypothesize that `gdb`

indicated `sqrt`

when you inserted a break point because `sqrt`

is relatively expensive and called frequently. `gdb`

simply couldn't find the source for `sqrt`

(maybe because the source isn't installed on your system).