Vahid Mir - 3 months ago 35

Python Question

I have a NumPy array of strings, and I want to calculate the edit distance between every pair of elements using `scipy.spatial.distance.pdist` (http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html).

A sample of my array is as follows:

```
>>> d[0:10]
array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT', 'AATTT', 'CATTT',
       'GATTT', 'TCTTT', 'ACTTT'],
      dtype='|S5')
```

However, `pdist` doesn't have an 'editdistance' option, so I want to pass it a custom distance function. I tried the following and got this error:

```
>>> import editdist
>>> import scipy
>>> import scipy.spatial
>>> scipy.spatial.distance.pdist(d[0:10], lambda u,v: editdist.distance(u,v))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1150, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 153, in _convert_to_double
    X = np.double(X)
ValueError: could not convert string to float: TTTTT
```

Answer

If you really must use `pdist`, you first need to convert your strings to a numeric format. If you know that all strings will be the same length, you can do this rather easily:

```
numeric_d = d.view(np.uint8).reshape((len(d),-1))
```

This simply views your array of strings as one long array of `uint8` bytes, then reshapes it so that each original string is on a row by itself. In your example, this would look like:

```
In [18]: d.view(np.uint8).reshape((len(d),-1))
Out[18]:
array([[84, 84, 84, 84, 84],
       [65, 84, 84, 84, 84],
       [67, 84, 84, 84, 84],
       [71, 84, 84, 84, 84],
       [84, 65, 84, 84, 84],
       [65, 65, 84, 84, 84],
       [67, 65, 84, 84, 84],
       [71, 65, 84, 84, 84],
       [84, 67, 84, 84, 84],
       [65, 67, 84, 84, 84]], dtype=uint8)
```
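As a small check on the conversion above, the sketch below rebuilds the byte strings from the numeric rows. It assumes byte strings (the `'|S5'` dtype from the question) and a recent NumPy where `.tobytes()` is available; the `padded` array is a made-up example showing what happens when the equal-length assumption does not hold.

```python
import numpy as np

# Byte strings, matching the dtype='|S5' array from the question.
d = np.array([b'TTTTT', b'ATTTT', b'CTTTT'], dtype='S5')

# One row per string, one column per byte.
numeric_d = d.view(np.uint8).reshape((len(d), -1))

# Each row converts back to the original bytes.
recovered = [row.tobytes() for row in numeric_d]

# Hypothetical unequal-length input: the fixed-width dtype pads the
# shorter string with zero bytes, which end up in the numeric row.
padded = np.array([b'AT', b'CTTTT'], dtype='S5')
padded_rows = padded.view(np.uint8).reshape((len(padded), -1))
```

If the strings can differ in length, the trailing zero bytes would need to be stripped before computing an edit distance, otherwise they count as characters.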

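Then you can use `pdist` as you normally would. Just make sure that your `editdist` function expects arrays of numbers rather than strings. Because `pdist` converts its input to doubles (the `_convert_to_double` call in your traceback), cast each row back to `uint8` first, then recover the original string by calling `.tobytes()` (named `.tostring()` in older NumPy versions):

```
def editdist(x, y):
    # Rows arrive as floats, so cast back to uint8 before rebuilding the strings
    s1 = x.astype(np.uint8).tobytes()
    s2 = y.astype(np.uint8).tobytes()
    ... rest of function as before ...
```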