wrwt - 30 days ago
Python Question

Fast checking if a string can be converted to float or int in python

I need to convert every string in a large array to an int or a float, where such a conversion is possible. The usual suggestions are a try-except or a regex approach (as in "Checking if a string can be converted to float in Python"), but in my case this turns out to be very slow.

The question is: what is the fastest way to write this code?

I found that strings have an .isdigit() method. Is there something similar for floats?
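For reference, here is a short sketch of why .isdigit() on its own does not settle the matter, even for integers: it rejects signs and decimal points, and it accepts some Unicode digit characters that int() refuses. The sample strings and the helper name int_accepts are illustrative assumptions, not part of the original data.

```python
# Illustrative samples only; none of these come from the original data.
samples = ["12", "-5", "1.5", "1e3", "\u00b2"]  # "\u00b2" is SUPERSCRIPT TWO


def int_accepts(s):
    """Return True if int() can parse the string (hypothetical helper)."""
    try:
        int(s)
        return True
    except ValueError:
        return False


for s in samples:
    # Compare the cheap string test with what int() really accepts.
    print(repr(s), s.isdigit(), int_accepts(s))
```

The two columns disagree for "-5" (a valid integer that .isdigit() rejects) and for the superscript digit (accepted by .isdigit() but not by int()).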

Here is the current (slow) code.

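import numpy as np

# Wrapper name is assumed; the original snippet showed only the function body.
def convert_lines(lines):
    result = []
    for line in lines:
        resline = []
        for item in line:
            # Try int first, then float; keep the string if both fail.
            try:
                resline.append(int(item))
            except ValueError:
                try:
                    resline.append(float(item))
                except ValueError:
                    resline.append(item)
        result.append(resline)
    return np.array(result)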


There is also some evidence (https://stackoverflow.com/a/2356970/3642151) that the regex approach is even slower.

Answer Source

All generalizations are false (irony intended). One cannot say that try: except: is always faster than regex or vice versa. In your case, regex is not overkill and would be much faster than the try: except: method. However, based on our discussions in the comments section of your question, I went ahead and implemented a C library that efficiently performs this conversion (since I see this question a lot on SO); the library is called fastnumbers. Below are timing tests using your try: except: method, using regex, and using fastnumbers.


from __future__ import print_function
import timeit

prep_code = '''\
import random
import string
x = [''.join(random.sample(string.ascii_letters, 7)) for _ in range(10)]
y = [str(random.randint(0, 1000)) for _ in range(10)]
z = [str(random.random()) for _ in range(10)]
'''

try_method = '''\
def converter_try(vals):
    resline = []
    for item in vals:
        try:
            resline.append(int(item))
        except ValueError:
            try:
                resline.append(float(item))
            except ValueError:
                resline.append(item)
    return resline
'''

re_method = '''\
import re
int_match = re.compile(r'[+-]?\d+$').match
float_match = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?$').match
def converter_re(vals):
    resline = []
    for item in vals:
        if int_match(item):
            resline.append(int(item))
        elif float_match(item):
            resline.append(float(item))
        else:
            resline.append(item)
    return resline
'''

fn_method = '''\
from fastnumbers import fast_real
def converter_fn(vals):
    resline = []
    for item in vals:
        resline.append(fast_real(item))
    return resline
'''

print('Try with non-number strings', timeit.timeit('converter_try(x)', prep_code+try_method), 'seconds')
print('Try with integer strings', timeit.timeit('converter_try(y)', prep_code+try_method), 'seconds')
print('Try with float strings', timeit.timeit('converter_try(z)', prep_code+try_method), 'seconds')
print()
print('Regex with non-number strings', timeit.timeit('converter_re(x)', prep_code+re_method), 'seconds')
print('Regex with integer strings', timeit.timeit('converter_re(y)', prep_code+re_method), 'seconds')
print('Regex with float strings', timeit.timeit('converter_re(z)', prep_code+re_method), 'seconds')
print()
print('fastnumbers with non-number strings', timeit.timeit('converter_fn(x)', prep_code+fn_method), 'seconds')
print('fastnumbers with integer strings', timeit.timeit('converter_fn(y)', prep_code+fn_method), 'seconds')
print('fastnumbers with float strings', timeit.timeit('converter_fn(z)', prep_code+fn_method), 'seconds')
print()

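The output looks like this on my machine (each figure is the total for timeit's default of 1,000,000 calls, each call converting a list of 10 strings):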

Try with non-number strings 55.1374599934 seconds
Try with integer strings 11.8999788761 seconds
Try with float strings 41.8258318901 seconds

Regex with non-number strings 11.5976541042 seconds
Regex with integer strings 18.1302199364 seconds
Regex with float strings 19.1559209824 seconds

fastnumbers with non-number strings 4.02173805237 seconds
fastnumbers with integer strings 4.21903610229 seconds
fastnumbers with float strings 4.96900391579 seconds
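
One caveat when comparing these numbers: the try: except: and regex converters are not strictly equivalent, because int() and float() accept some strings that the two patterns reject. The sketch below reuses the patterns from the benchmark and runs both approaches on a few edge cases. The per-item function names and the sample strings are assumptions made for illustration.

```python
import re

# Same patterns as in the benchmark above.
int_match = re.compile(r'[+-]?\d+$').match
float_match = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?$').match


def convert_try(item):
    """Per-item version of the try/except converter (hypothetical name)."""
    try:
        return int(item)
    except ValueError:
        try:
            return float(item)
        except ValueError:
            return item


def convert_re(item):
    """Per-item version of the regex converter (hypothetical name)."""
    if int_match(item):
        return int(item)
    if float_match(item):
        return float(item)
    return item


# Made-up edge cases: trailing dot, surrounding whitespace, and "inf".
for s in ["42", "3.14", "1.", " 7 ", "inf", "abc"]:
    print(repr(s), repr(convert_try(s)), repr(convert_re(s)))
```

With these inputs, the two functions agree on "42", "3.14" and "abc", but the regex version leaves "1.", " 7 " and "inf" as strings while the try: except: version converts them. If the data can contain such forms, the patterns would need to be extended before the regex approach can serve as a drop-in replacement.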

A few things are pretty clear:

  • try: except: is very slow for non-numeric input; regex beats that handily
  • try: except: becomes more efficient if exceptions don't need to be raised
  • fastnumbers beats the pants off both in all cases

So, if you don't want to use fastnumbers, you need to assess whether you are more likely to encounter invalid strings or valid strings, and base your choice of algorithm on that.
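
One way to make that assessment is to classify a random sample of the input before choosing a converter. The following is a minimal sketch under stated assumptions: the function names, the sample size, the fixed seed, and the example data are all hypothetical, and the classification itself uses plain try: except:, which is acceptable here because it only runs on a small sample.

```python
import random
from collections import Counter


def classify(item):
    """Label a string as 'int', 'float', or 'other' (hypothetical helper)."""
    try:
        int(item)
        return 'int'
    except ValueError:
        pass
    try:
        float(item)
        return 'float'
    except ValueError:
        return 'other'


def input_mix(items, sample_size=1000, seed=0):
    """Estimate the share of each category from a random sample."""
    items = list(items)
    if not items:
        return {}
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    sample = rng.sample(items, min(sample_size, len(items)))
    counts = Counter(classify(s) for s in sample)
    return {label: n / len(sample) for label, n in counts.items()}


# Made-up data for illustration only.
data = ['1', '2', '3.5', 'abc', '4', 'x', '7', '8']
print(input_mix(data))
```

In the timings above, try: except: came out ahead of regex only for integer strings, so a sample dominated by 'int' would favor try: except:, while a large share of 'float' or 'other' would favor the regex approach. Those timings are from one machine, so it is worth re-running the benchmark on the real data before committing to either.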