user248237dfsf - 1 month ago 8

Python Question

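I'm trying to find an efficient way to pair together rows of data containing integer points and store them as Python objects. The data is made up of `X` and `Y` coordinates, stored as comma-separated strings, which need to be paired up as `(x_1, y_1), (x_2, y_2), ...` and stored as a list of point objects. The `get_data` function below generates this data: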

```
def get_data(N=100000, M=10):
    import random
    data = []
    for n in range(N):
        pair = [[str(random.randint(1, 10)) for x in range(M)],
                [str(random.randint(1, 10)) for x in range(M)]]
        row = [",".join(pair[0]),
               ",".join(pair[1])]
        data.append(row)
    return data
```

The parsing code I have now is:

```
class Point:
    def __init__(self, a, b):
        self.a = a
        self.b = b

def test():
    import time
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1]) \
                       for p in paired_points]
        all_point_sets.append(curr_points)
    time_end = time.time()
    print "total time: ", (time_end - time_start)
```

Currently, this takes nearly 7 seconds for 100,000 rows of points, which seems very inefficient. Part of the inefficiency seems to stem from the calculation of `first_points`, `second_points`, and `paired_points`. Another part seems to be the building up of `all_point_sets` through repeated `all_point_sets.append(...)` calls.

How can this be sped up? Thanks.

Answer

Simply running the script with PyPy makes a big difference:

```
$ python pairing_strings.py
total time: 2.09194397926
$ pypy pairing_strings.py
total time: 0.764246940613
```

Disabling the garbage collector didn't help under PyPy:

```
$ pypy pairing_strings.py
total time: 0.763386964798
```
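The timing above doesn't come with the code that switched the collector off. One way it could be done, as a rough sketch only, is with the standard `gc` module around the parsing loop. The helper names `parse_rows` and `parse_rows_without_gc` and the tiny sample data are made up for illustration and are not from the original script.

```python
import gc

class Point(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

def parse_rows(data):
    """Parse rows of comma-separated strings into lists of Point objects."""
    all_point_sets = []
    for xs, ys in data:
        pairs = zip(xs.split(","), ys.split(","))
        all_point_sets.append([Point(int(x), int(y)) for x, y in pairs])
    return all_point_sets

def parse_rows_without_gc(data):
    """Hypothetical wrapper: run parse_rows with the cyclic collector paused."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        return parse_rows(data)
    finally:
        # Restore the collector only if it was on before we started.
        if was_enabled:
            gc.enable()

# Tiny hand-written sample in the same shape as get_data() output.
sample = [["1,2,3", "4,5,6"], ["7,8", "9,10"]]
result = parse_rows_without_gc(sample)
```

Whether this helps depends on the interpreter; the PyPy timing above suggests it makes no real difference there.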

Using a namedtuple for `Point` makes it worse:

```
$ pypy pairing_strings.py
total time: 0.888827085495
```
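For reference, the namedtuple variant presumably looked something like the sketch below. The field names `a` and `b` are taken from the `Point` class in the question; the single hand-written row is only there to show the result.

```python
from collections import namedtuple

# Drop-in replacement for the Point class, with the same field names.
Point = namedtuple("Point", ["a", "b"])

# One row in the same shape as get_data() output.
xs, ys = ["1,2,3", "4,5,6"]
points = [Point(int(x), int(y))
          for x, y in zip(xs.split(","), ys.split(","))]
```

Each `Point` here is still a tuple, so `points[0].a` and `points[0][0]` refer to the same value.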

Using `itertools.imap` and `itertools.izip` is faster:

```
$ pypy pairing_strings.py
total time: 0.615751981735
```
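A minimal sketch of what that change might look like is below. `imap` and `izip` only exist in Python 2; the fallback for Python 3, where the built-in `map` and `zip` are already lazy, and the helper name `parse_row` are my additions for illustration.

```python
try:
    from itertools import imap, izip  # Python 2
except ImportError:
    imap, izip = map, zip  # Python 3: the built-ins are already lazy

class Point(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

def parse_row(row):
    """Pair up one row lazily, without building intermediate lists."""
    xs, ys = row
    first_points = imap(int, xs.split(","))
    second_points = imap(int, ys.split(","))
    return [Point(a, b) for a, b in izip(first_points, second_points)]

points = parse_row(["1,2,3", "4,5,6"])
```

The only list built per row is the final list of `Point` objects; the converted integers and the pairs are consumed as they are produced.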

Using a memoized version of `int` and an iterator to avoid the `zip` is faster still:

```
$ pypy pairing_strings.py
total time: 0.423738002777
```

Here is the code I finished with.

```
def test():
    import time

    def m_int(s, memo={}):
        # Cache conversions; the data contains only a few distinct strings
        if s in memo:
            return memo[s]
        else:
            retval = memo[s] = int(s)
            return retval

    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for xs, ys in data:
        point_set = []
        # Convert points from strings to integers
        y_iter = iter(ys.split(","))
        curr_points = [Point(m_int(i), m_int(next(y_iter))) for i in xs.split(",")]
        all_point_sets.append(curr_points)
    time_end = time.time()
    print "total time: ", (time_end - time_start)
```