Master-chip - 7 months ago
Python Question

Reading Very Large One Liner Text File

I have a 30MB .txt file containing a single line of data (a 30-million-digit number).

Unfortunately, every method I've tried (mmap.read(), readline(), allocating 1GB of RAM, for loops) takes 45+ minutes to completely read the file.
Every method I found on the internet seems to rely on each line being small, so that memory consumption is only as large as the longest line in the file. Here's the code I've been using.

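import mmap
import time

start = time.perf_counter()   # time.clock() was removed in Python 3.8
z = open('Number.txt','r+')
m = mmap.mmap(z.fileno(), 0)  # map the whole file into memory
global a
a = int(m.read())             # parse the mapped bytes as a decimal integer
z.close()
end = time.perf_counter()
secs = (end - start)
# f is a log file opened elsewhere in the program
print("Number read in","%s" % (secs),"seconds.", file=f)
print("Number read in","%s" % (secs),"seconds.")
f.flush()
del end,start,secs,z,m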


Other than splitting the number across multiple lines, which I'd rather not do, is there a cleaner method that won't take the better part of an hour?

By the way, I don't necessarily have to use text files.

I have: Windows 8.1 64-Bit, 16GB RAM, Python 3.5.1

Answer

The file read is quick (<1s):

with open('number.txt') as f:
    data = f.read()

Converting a 30-million-digit string to an integer is the slow part:

z=int(data) # still waiting...
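
To see where the time goes, the conversion can be timed on its own with smaller inputs. This is only a rough sketch, not a benchmark: the helper name time_int_conversion and the digit counts are made up for illustration. Note that Python 3.11 and later refuse to convert strings longer than 4300 digits by default, so the sketch lifts that limit where the setting exists.

```python
import sys
import time

# Python 3.11+ caps str -> int conversion at 4300 digits by default;
# lift the cap where the setting exists (it does not exist on Python 3.5).
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)

def time_int_conversion(n_digits):
    """Return (value, seconds) for converting an n_digits-long decimal string."""
    data = "7" * n_digits              # stand-in for the file contents
    start = time.perf_counter()
    value = int(data)
    return value, time.perf_counter() - start

if __name__ == "__main__":
    for n in (10000, 20000, 40000):
        _, secs = time_int_conversion(n)
        print(n, "digits:", round(secs, 4), "seconds")
```

On interpreters where the decimal conversion is quadratic in the number of digits, doubling the digit count should roughly quadruple the time, which is why 30 million digits takes so long.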

If you store the number as raw big- or little-endian binary data, then int.from_bytes(data,'big') is much quicker.
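
A minimal sketch of that approach, assuming the number is available as a Python int at least once (for example after a single slow conversion from the text file). The helper names save_int and load_int and the file name number.bin are hypothetical, and the sketch only handles non-negative values.

```python
import os
import tempfile

def save_int(value, path):
    """Write a non-negative int to path as raw big-endian bytes."""
    n_bytes = max(1, (value.bit_length() + 7) // 8)   # at least one byte, even for 0
    with open(path, "wb") as f:
        f.write(value.to_bytes(n_bytes, "big"))

def load_int(path):
    """Read raw big-endian bytes from path back into an int."""
    with open(path, "rb") as f:
        return int.from_bytes(f.read(), "big")

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "number.bin")
    save_int(2**100 + 12345, path)
    print(load_int(path) == 2**100 + 12345)
```

The file must be opened in binary mode ("wb"/"rb"), and the same byte order has to be used for writing and reading.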

If I did my math right:

>>> import math
>>> math.log(10)/math.log(2)  # Number of bits to represent a base 10 digit.
3.3219280948873626
>>> math.ceil(30000000*_)     # Number of bits to represent 30M-digit #.
99657843
>>> math.ceil(_/8)            # Number of bytes to represent 30M-digit #.
12457231                      # Only ~12.5MB so file will be smaller :^)
>>> import os
>>> data=os.urandom(12457231) # Generate some random bytes
>>> z=int.from_bytes(data,'big')  # Convert to integer (<1s)
>>> z.bit_length()            # At most 8*12457231 = 99657848; varies per run
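
The size estimate can also be cross-checked with exact integer arithmetic, because an n-digit number is at most 10**n - 1. The helper name bytes_for_digits below is hypothetical, and building 10**n for tens of millions of digits may itself take a noticeable amount of time.

```python
def bytes_for_digits(n_digits):
    """Bytes needed to store the largest n_digits-long decimal number."""
    largest = 10 ** n_digits - 1       # e.g. 999 for n_digits = 3
    return (largest.bit_length() + 7) // 8

if __name__ == "__main__":
    print(bytes_for_digits(3))         # 999 needs 10 bits, so 2 bytes
```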