alessandro alessandro - 11 days ago 5
C++ Question

partitioning an float array into similar segments (clustering)

I have an array of floats like this:

[1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200]


Now, I want to partition the array like this:

[[1.91, 2.87, 3.61] , [10.91, 11.91, 12.82] , [100.73, 100.71, 101.89] , [200]]


// [200] will be considered as an outlier because of less cluster support

I have to find this kind of segment for several arrays and I don't know what should be the partition size. I tried to do it by using hierarchical clustering (Agglomerative) and it gives satisfactory results for me. However, issue is, I was suggested not to use clustering algorithms for one-dimensional problem as their is no theoretical justification (as they are for multidimensional data) to do that.

I spent lots of time to find solution. However, suggestions seem quite different like: this and this VS. this and this and this.

I found another suggestion rather than clustering i.e. natural breaks optimization. However, this also needs to declare the partition number like K-means (right ?).

It is quite confusing (specially because I have to perform those kind of segmentation on several arrays and it is impossible to know the optimal partition number).

Are there any ways to find partitions (thus we can reduce the variance within partitions and maximize the variance between partitions) with some theoretical justification?

Any pointers to article/papers (if available C/C++/Java implementation) with some theoretical justification will be very helpful for me.

Answer

I think I'd sort the data (if it's not already), then take adjacent differences. Divide the differences by the smaller of the numbers it's a difference between to get a percentage change. Set a threshold and when the change exceeds that threshold, start a new "cluster".

Edit: Quick demo code in C++:

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <numeric>
#include <functional>

int main() {
    std::vector<double> data{ 
        1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200 
    };

    // sort the input data
    std::sort(data.begin(), data.end());

    // find the difference between each number and its predecessor
    std::vector<double> diffs;
    std::adjacent_difference(data.begin(), data.end(), std::back_inserter(diffs));

    // convert differences to percentage changes
    std::transform(diffs.begin(), diffs.end(), data.begin(), diffs.begin(),
        std::divides<double>());

    // print out the results
    for (int i = 0; i < data.size(); i++) {

        // if a difference exceeds 40%, start a new group:
        if (diffs[i] > 0.4)
            std::cout << "\n";

        // print out an item:
        std::cout << data[i] << "\t";
    }

    return 0;
}

Result:

1.91    2.87    3.61
10.91   11.91   12.82
100.71  100.73  101.89
200