gogurt - 1 year ago 36

Python Question

This might be a stupid question, but let's say I have a large (~billions of lines) CSV file which contains adjacency lists where the vertices are represented by strings like:

`+------------+---------------------------+`

| id | neighbors |

+------------+---------------------------+

| 'james' | 'michael, jane, pete' |

| 'doug' | 'cliff' |

| 'amy' | 'bobby, russell, richard' |

| 'richard' | 'kam, earl, cliff' |

| 'marshawn' | |

| 'bobby' | 'emily, james, doug' |

+------------+---------------------------+

From these kinds of adjacency lists, all I want to do is output a vertex set and an edge set comprised of

For the sake of brevity in outlining the algorithm below, let:

- : operation to add vertex 'bobby' to the vertex set
`add('bobby')`

- : operation to add ('bobby', 'emily') to the edge set
`edge('bobby','emily')`

- : check if vertex 'bobby' is in vertex set
`ingraph('bobby')`

Suppose we take the approach of starting with an empty graph and adding vertices sequentially. Then my first try (in very raw psuedocode) would be something like:

`ids = [...all id's in the CSV...]`

unexplored = list(ids)

for i in ids:

add(i)

for j in unexplored:

if i in neighbors(j):

if not ingraph(j): add(j)

edge(i, j)

del unexplored[0]

- Is there an obvious way to improve this algorithm in general (independent of Python)?
- What's the best way to implement such a solution in Python? Iterating through the raw CSV file? Loading it into and using
`pandas`

to vectorize this somehow (assuming I have enough memory...)?`numpy`

Answer Source

If I understand you right, you want to have the graph represented as G(V, E) where V and E are two sets with **V**ertices and **E**dges

As the **E**dges are undirected you need to think about some way, to represent them. Either you don't care for their direction, and always check if there is an edge in one of both directions, or you canonicalize them, e.g. by using alphanumeric sorting for tuples.

So, we assume you choose the latter, then E is a set of tuples, where the entries follow a strict order

```
e = (v1, v2), v1 < v2.
```

With this defined you can just process your file, line by line, adding the ID to the `Set V`

, creating the tuple containing the the neighbors `(ID, neighbor)`

or `(neighbor, ID)`

depening on their alphanumerical order, and add this to your `Set E`

.

If you stick to the canonical representation of edges, Python will take care, that there will be no duplicates of your edges in the `Set`

as it is defined as an unordered set of unique elements.
https://docs.python.org/2/library/sets.html

As long as you can assume that your file is correct, and there is no edge, that has no end (for the ID is missing), you can just create the edges first, and later - once you arrive at the corresponding line, you'll create the vertex.

If you can't hold this assumption, you can still create your graph representation in this manner, you just need to implement some clean-up at the end, where you iterate over the edge-set once again, checking if there is any edge left dangeling in the nowhere (points to an non-existing vertex), and handle this, by either deleting this edge, or creating the vertex - what ever suits your needs.