Zack Newsham - 1 year ago 57

Java Question

I am trying to implement the above community detection algorithm in Java, and while I have access to C++ code, and the original paper - I can't make it work at all. My major issue is that I don't understand the purpose of the code - i.e. how the algorithm works. In practical terms, my code gets stuck in what seems to be an infinite loop at

`mergeBestQ`

`heap`

`topQ`

The graph I am testing this on is quite large (300,000 nodes, 650,000 edges). The original code I am using for my implementation is from the SNAP library (https://github.com/snap-stanford/snap/blob/master/snap-core/cmty.cpp). What would be great is if someone could explain to me the intuition of the algorithm, it seems to be initially setting each node to be in it's own community, then recording the modularity value (whatever that is) of each pair of connected nodes in the graph, then finding the pair of nodes which have the highest modularity and moving them to the same community. In addition, if someone could provide some mid level pseudo code, that would be great. Here is my implementation thus far, I have tried to keep it in one file for the sake of brevity, however CommunityGraph and CommunityNode are elsewhere (should not be required). Graph maintain a list of all nodes and each node maintains a list of its connections to other nodes. When running it never gets past the line

`while(this.mergeBestQ()){}`

UPDATE - found several bugs in my code after a thorough review. The code now completes VERY quickly, but doesnt fully implement the algorithm, for example of the 300,000 nodes in the graph, it states there are approximately 299,000 communities (i.e. roughly 1 node per community). I have listed the updated code below.

/// Clauset-Newman-Moore community detection method.

/// At every step two communities that contribute maximum positive value to global modularity are merged.

/// See: Finding community structure in very large networks, A. Clauset, M.E.J. Newman, C. Moore, 2004

public class CNMMCommunityMetric implements CommunityMetric{

private static class DoubleIntInt implements Comparable{

public double val1;

public int val2;

public int val3;

DoubleIntInt(double val1, int val2, int val3){

this.val1 = val1;

this.val2 = val2;

this.val3 = val3;

}

`@Override`

public int compareTo(DoubleIntInt o) {

//int this_sum = this.val2 + this.val3;

//int oth_sum = o.val2 + o.val3;

if(this.equals(o)){

return 0;

}

else if(val1 < o.val1 || (val1 == o.val1 && val2 < o.val2) || (val1 == o.val1 && val2 == o.val2 && val3 < o.val3)){

return 1;

}

else{

return -1;

}

//return this.val1 < o.val1 ? 1 : (this.val1 > o.val1 ? -1 : this_sum - oth_sum);

}

@Override

public boolean equals(Object o){

return this.val2 == ((DoubleIntInt)o).val2 && this.val3 == ((DoubleIntInt)o).val3;

}

@Override

public int hashCode() {

int hash = 3;

hash = 79 * hash + this.val2;

hash = 79 * hash + this.val3;

return hash;

}

}

private static class CommunityData {

double DegFrac;

TIntDoubleHashMap nodeToQ = new TIntDoubleHashMap();

int maxQId;

CommunityData(){

maxQId = -1;

}

CommunityData(double nodeDegFrac, int outDeg){

DegFrac = nodeDegFrac;

maxQId = -1;

}

void addQ(int NId, double Q) {

nodeToQ.put(NId, Q);

if (maxQId == -1 || nodeToQ.get(maxQId) < Q) {

maxQId = NId;

}

}

void updateMaxQ() {

maxQId=-1;

int[] nodeIDs = nodeToQ.keys();

double maxQ = nodeToQ.get(maxQId);

for(int i = 0; i < nodeIDs.length; i++){

int id = nodeIDs[i];

if(maxQId == -1 || maxQ < nodeToQ.get(id)){

maxQId = id;

maxQ = nodeToQ.get(maxQId);

}

}

}

void delLink(int K) {

int NId=getMxQNId();

nodeToQ.remove(K);

if (NId == K) {

updateMaxQ();

}

}

int getMxQNId() {

return maxQId;

}

double getMxQ() {

return nodeToQ.get(maxQId);

}

};

private TIntObjectHashMap<CommunityData> communityData = new TIntObjectHashMap<CommunityData>();

private TreeSet<DoubleIntInt> heap = new TreeSet<DoubleIntInt>();

private HashMap<DoubleIntInt,DoubleIntInt> set = new HashMap<DoubleIntInt,DoubleIntInt>();

private double Q = 0.0;

private UnionFind uf = new UnionFind();

@Override

public double getCommunities(CommunityGraph graph) {

init(graph);

//CNMMCommunityMetric metric = new CNMMCommunityMetric();

//metric.getCommunities(graph);

// maximize modularity

while (this.mergeBestQ(graph)) {

}

// reconstruct communities

HashMap<Integer, ArrayList<Integer>> IdCmtyH = new HashMap<Integer, ArrayList<Integer>>();

Iterator<CommunityNode> ns = graph.getNodes();

int community = 0;

TIntIntHashMap communities = new TIntIntHashMap();

while(ns.hasNext()){

CommunityNode n = ns.next();

int r = uf.find(n);

if(!communities.contains(r)){

communities.put(r, community++);

}

n.setCommunity(communities.get(r));

}

System.exit(0);

return this.Q;

}

private void init(Graph graph) {

double M = 0.5/graph.getEdgesList().size();

Iterator<Node> ns = graph.getNodes();

while(ns.hasNext()){

Node n = ns.next();

uf.add(n);

int edges = n.getEdgesList().size();

if(edges == 0){

continue;

}

CommunityData dat = new CommunityData(M * edges, edges);

communityData.put(n.getId(), dat);

Iterator<Edge> es = n.getConnections();

while(es.hasNext()){

Edge e = es.next();

Node dest = e.getStart() == n ? e.getEnd() : e.getStart();

double dstMod = 2 * M * (1.0 - edges * dest.getEdgesList().size() * M);//(1 / (2 * M)) - ((n.getEdgesList().size() * dest.getEdgesList().size()) / ((2 * M) * (2 * M)));// * (1.0 - edges * dest.getEdgesList().size() * M);

dat.addQ(dest.getId(), dstMod);

}

Q += -1.0 * (edges*M) * (edges*M);

if(n.getId() < dat.getMxQNId()){

addToHeap(createEdge(dat.getMxQ(), n.getId(), dat.getMxQNId()));

}

}

}

void addToHeap(DoubleIntInt o){

heap.add(o);

}

DoubleIntInt createEdge(double val1, int val2, int val3){

DoubleIntInt n = new DoubleIntInt(val1, val2, val3);

if(set.containsKey(n)){

DoubleIntInt n1 = set.get(n);

heap.remove(n1);

if(n1.val1 < val1){

n1.val1 = val1;

}

n = n1;

}

else{

set.put(n, n);

}

return n;

}

void removeFromHeap(Collection<DoubleIntInt> col, DoubleIntInt o){

//set.remove(o);

col.remove(o);

}

DoubleIntInt findMxQEdge() {

while (true) {

if (heap.isEmpty()) {

break;

}

DoubleIntInt topQ = heap.first();

removeFromHeap(heap, topQ);

//heap.remove(topQ);

if (!communityData.containsKey(topQ.val2) || ! communityData.containsKey(topQ.val3)) {

continue;

}

if (topQ.val1 != communityData.get(topQ.val2).getMxQ() && topQ.val1 != communityData.get(topQ.val3).getMxQ()) {

continue;

}

return topQ;

}

return new DoubleIntInt(-1.0, -1, -1);

}

boolean mergeBestQ(Graph graph) {

DoubleIntInt topQ = findMxQEdge();

if (topQ.val1 <= 0.0) {

return false;

}

// joint communities

int i = topQ.val3;

int j = topQ.val2;

uf.union(i, j);

Q += topQ.val1;

CommunityData datJ = communityData.get(j);

CommunityData datI = communityData.get(i);

datI.delLink(j);

datJ.delLink(i);

int[] datJData = datJ.nodeToQ.keys();

for(int _k = 0; _k < datJData.length; _k++){

int k = datJData[_k];

CommunityData datK = communityData.get(k);

double newQ = datJ.nodeToQ.get(k);

//if(datJ.nodeToQ.containsKey(i)){

// newQ = datJ.nodeToQ.get(i);

//}

if (datI.nodeToQ.containsKey(k)) {

newQ = newQ + datI.nodeToQ.get(k);

datK.delLink(i);

} // K connected to I and J

else {

newQ = newQ - 2 * datI.DegFrac * datK.DegFrac;

} // K connected to J not I

datJ.addQ(k, newQ);

datK.addQ(j, newQ);

addToHeap(createEdge(newQ, Math.min(j, k), Math.max(j, k)));

}

int[] datIData = datI.nodeToQ.keys();

for(int _k = 0; _k < datIData.length; _k++){

int k = datIData[_k];

if (!datJ.nodeToQ.containsKey(k)) { // K connected to I not J

CommunityData datK = communityData.get(k);

double newQ = datI.nodeToQ.get(k) - 2 * datJ.DegFrac * datK.DegFrac;

datJ.addQ(k, newQ);

datK.delLink(i);

datK.addQ(j, newQ);

addToHeap(createEdge(newQ, Math.min(j, k), Math.max(j, k)));

}

}

datJ.DegFrac += datI.DegFrac;

if (datJ.nodeToQ.isEmpty()) {

communityData.remove(j);

} // isolated community (done)

communityData.remove(i);

return true;

}

}

UPDATE:the currently listed code is fairly quick, and has half the memory usage compared to the "quickest" solution, while only being ~5% slower. the difference is in the use of hashmap + treest vs priority queue, and ensuring only a single object for a given i, j pair exists at any time.

Answer

So here's the original paper, a neat lil' six pages, only two of which are about the design & implementation. Here's a cliffnotes:

- For a partition of a given graph, the authors define the
*modularity*,`Q`

, of the partition to be the ratio of the number of edges*within*each community to the number of edges*between*each community, minus the ratio you'd expect from a completely random partition. - So it's effectively "how much better is this partition at defining communities than a completely random one?"
- Given two communities
`i`

and`j`

of a partition, they then define`deltaQ_ij`

to be how much the modularity of the partition would change if communities`i`

and`j`

were merged. So if`deltaQ_ij > 0`

, merging`i`

and`j`

will improve the modularity of the partition. - Which leads to a simple greedy algorithm: start with every node in its own community. Calculate
`deltaQ_ij`

for every pair of communities. Whichever two communities`i, j`

have the largest`deltaQ_ij`

, merge those two. Repeat. - You'll get maximum modularity when the
`deltaQ_ij`

all turn negative, but in the paper the authors let the algorithm run until there's only one community left.

That's pretty much it for understanding the algorithm. The details are in how to compute `deltaQ_ij`

quickly and store the information efficiently.

Edit: Data structure time!

So first off, I think the implementation you're referencing does things a different way to the paper. I'm not quite sure how, because the code is impenetrable, but it seems to use union-find and hashsets in place of the author's binary trees and multiple heaps. Not a clue *why* they do it a different way. You might want to email the guy who wrote it and ask.

Anyway, the algorithm in the *paper* needs several things from the format `deltaQ`

is stored in:

- First, it needs to be able to recover the largest value in
`dQ`

quickly. - Second, it needs to be able to remove all
`deltaQ_ik`

and`deltaQ_ki`

for a fixed`i`

quickly. - Third, it needs to be able to update all
`deltaQ_kj`

and`deltaQ_jk`

for a fixed`j`

quickly.

The solution the authors come up to for this is as follows:

- For each community
`i`

, each**non-zero**`deltaQ_ik`

is stored in a balanced binary tree, indexed by`k`

(so elements can be found easily), and in a heap (so the maximum for that community can be found easily). - The maximum
`deltaQ_ik`

from each community`i`

's heap is then stored in*another*heap, so that the overall maximums can be found easily.

When community `i`

is merged with community `j`

, several things happen to the binary trees:

- First, each element from the
`i`

th community is added to the`j`

th community's binary tree. If an element with the same index`k`

already exists, you sum the old and new values. - Second, we update all the remaining "old" values in the
`j`

th community's binary tree to reflect the fact that the`j`

th community has just increased in size. - And for each other community's binary tree
`k`

, we update any`deltaQ_kj`

. - Finally, the tree for community
`i`

is thrown away.

And similarly, several things must happen to the heaps:

- First, the heap for community
`i`

is thrown away. - Then the heap for community
`j`

is rebuilt from scratch using the elements from the community's balanced binary tree. - And for each other community
`k`

's heap, the position of entry`deltaQ_kj`

is updated. - Finally, the entry for community
`i`

in the overall heap is thrown away (causing bubbling) and the entries for community`j`

and each community`k`

connected to`i`

or`j`

are updated.

Strangely, when two communities are merged there's no reference in the paper as to removing `deltaQ_ki`

values from the `k`

th community's heap or tree. I think this might be dealt with by the setting of `a_i = 0`

, but I don't understand the algorithm well enough to be sure.

Edit: Trying to decipher the implementation you linked. Their primary datastructures are

`CmtyIdUF`

, a union-find structure that keeps track of which nodes are in which community (something that's neglected in the paper, but seems necessary unless you want to reconstruct community membership from a trace of the merge or something),`MxQHeap`

, a heap to keep track of which`deltaQ_ij`

is largest overall. Strangely, when they update the value of a`TFltIntIntTr`

in the heap, they don't ask the heap to re-heapify itself. This is worrying. Does it do it automatically or something?`CmtyQH`

, a hashmap that maps a community ID`i`

to a structure`TCmtyDat`

which holds what looks heap of the`deltaQ_ik`

for that community. I think. Strangely though, the`UpdateMaxQ`

of the`TCmtyDat`

structure takes linear time, obviating any need for a heap. What's more, the`UpdateMaxQ`

method only appears to be called when an element of the heap is deleted. It should definitely also be getting called when the value of any element in the heap is updated.