Shengjie Zhang Shengjie Zhang - 3 months ago 12
Python Question

Python networkx link prediction with adamic_adar_index

I have a networkx graph object, which is weighted and undirected.
I'm trying to predict 10 new links for every nodes with Adamic Adar Index. The function adamic_adar_index in Networkx returns a generator of tuples, which are formatted as (nodeid1, nodeid2, adamic_adar_index). I'm not familiar generators in Python. What I want to do is group the generator by nodeid1 and return the largest 10 index for nodeid1.

Here is my code, where "coauthor" is the network object and "preds" is the generator. The data file is here https://www.dropbox.com/s/hyr1hgjs4yt03x2/coauthor.csv?dl=0

import csv
import networkx as nx
g = nx.read_weighted_edgelist("coauthor.csv", delimiter='\t', encoding='utf-8')
coauthor = nx.read_weighted_edgelist("coauthor.csv", delimiter='\t', encoding='utf-8')
preds = nx.adamic_adar_index(coauthor)

Answer

Take a look at heapq.nlargest It takes an iterable and returns the n largest things in that iterable. Since I don't have your coauthor list, I'll use the karate graph. Instead of looking at all non-edges right away (as adamic_adar_index does by default), I'm going to go through each node u in G and do this for all non neighbors of u

import networkx as nx
import heapq


def nonedges(G,u):  #a generator with (u,v) for every non neighbor v
    for v in nx.non_neighbors(G, u):
        yield (u, v)


G = nx.karate_club_graph()

for u in G.nodes_iter():# you may want to check that there will be at least 10 choices.
    preds = nx.adamic_adar_index(G,nonedges(G,u))
    tenlargest = heapq.nlargest(10, preds, key = lambda x: x[2])
    print tenlargest

Warning: if you aren't careful here there's a bug in the algorithm you described: for node 1 you might find that some of the tuples are being returned as (1, 2, 3.2), (1, 3, 0.3), (4, 1, 100). The way you've described your grouping, you'll miss the (4,1) pair. My example checks each pair twice to avoid this. It may be possible to eliminate this duplication of computer effort with some effort on your part.

Generators and iterators are closely related. More info on iterators is at https://docs.python.org/2/glossary.html#term-iterator (you can also find generators on that page). You can think of it as being a list, but there are rules about how you're allowed to access it. Each time you look at it, you get the next element. Once you look at the element, it's removed from the iterator. You can only get one thing at a time from the iterator. In the computer memory, it doesn't have to hold the entire thing (it generates the next element when it's asked for it). So for example, you can see I used an iterator rather than G.nodes() in my loop. This means that the computer never had to hold all the nodes in G in its memory.

for u in G.nodes_iter():

versus

for u in G.nodes()