gogurt gogurt - 20 days ago 5
Python Question

What's the most efficient way to extract an edge list from a huge adjacency list?

This might be a stupid question, but let's say I have a large (~billions of lines) CSV file which contains adjacency lists where the vertices are represented by strings like:

| id | neighbors |
| 'james' | 'michael, jane, pete' |
| 'doug' | 'cliff' |
| 'amy' | 'bobby, russell, richard' |
| 'richard' | 'kam, earl, cliff' |
| 'marshawn' | |
| 'bobby' | 'emily, james, doug' |

From these kinds of adjacency lists, all I want to do is output a vertex set and an edge set comprised of undirected pairs of vertices. That is all.

What is the most efficient strategy to accomplish this, and how do we implement it in Python?

For the sake of brevity in outlining the algorithm below, let:

  • add('bobby')
    : operation to add vertex 'bobby' to the vertex set

  • edge('bobby','emily')
    : operation to add ('bobby', 'emily') to the edge set

  • ingraph('bobby')
    : check if vertex 'bobby' is in vertex set

Suppose we take the approach of starting with an empty graph and adding vertices sequentially. Then my first try (in very raw psuedocode) would be something like:

ids = [...all id's in the CSV...]
unexplored = list(ids)

for i in ids:
for j in unexplored:
if i in neighbors(j):
if not ingraph(j): add(j)
edge(i, j)
del unexplored[0]

  1. Is there an obvious way to improve this algorithm in general (independent of Python)?

  2. What's the best way to implement such a solution in Python? Iterating through the raw CSV file? Loading it into
    and using
    to vectorize this somehow (assuming I have enough memory...)?

EDIT: By writing "neighbors" I hoped to make it clear that I just want an undirected graph. Sorry if this wasn't obvious.


If I understand you right, you want to have the graph represented as G(V, E) where V and E are two sets with Vertices and Edges

As the Edges are undirected you need to think about some way, to represent them. Either you don't care for their direction, and always check if there is an edge in one of both directions, or you canonicalize them, e.g. by using alphanumeric sorting for tuples.

So, we assume you choose the latter, then E is a set of tuples, where the entries follow a strict order

e = (v1, v2), v1 < v2.

With this defined you can just process your file, line by line, adding the ID to the Set V, creating the tuple containing the the neighbors (ID, neighbor) or (neighbor, ID) depening on their alphanumerical order, and add this to your Set E.

If you stick to the canonical representation of edges, Python will take care, that there will be no duplicates of your edges in the Set as it is defined as an unordered set of unique elements. https://docs.python.org/2/library/sets.html

As long as you can assume that your file is correct, and there is no edge, that has no end (for the ID is missing), you can just create the edges first, and later - once you arrive at the corresponding line, you'll create the vertex.
If you can't hold this assumption, you can still create your graph representation in this manner, you just need to implement some clean-up at the end, where you iterate over the edge-set once again, checking if there is any edge left dangeling in the nowhere (points to an non-existing vertex), and handle this, by either deleting this edge, or creating the vertex - what ever suits your needs.