Swoldier - 3 months ago
Python Question

How to read only part of a list of strings in python

I need a way to read x bytes of data at a time from a list of strings. Each item in the list is ~36 MB. I need to run through each item in the list, grabbing only about 1 KB of that item at a time.

Essentially it looks like this:

for item in list:
    # grab part of item
    # do something with that part
    # move on to the next part, until you've gone through the whole item
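That loop can be sketched as a generator (the name `iter_chunks` and the 1 KB default are illustrative, not from the original code):

```python
def iter_chunks(items, chunk_size=1024):
    """Yield successive chunk_size-character slices of each string in items."""
    for item in items:
        for start in range(0, len(item), chunk_size):
            yield item[start:start + chunk_size]

# Small demo: 3-character chunks of one 10-character "item"
chunks = list(iter_chunks(["abcdefghij"], chunk_size=3))
```

Because it yields slices lazily, it never builds a second full-size copy of an item.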

My current code (which kind of works, but seems rather slow and inefficient) is:

for character in bucket:
    print character
    packet = "".join(character)
    if len(packet.encode("utf8")) >= packetSizeBytes:
        print "Bytes: " + str(len(packet.encode("utf8")))
        return packet

I'm wondering if there exists anything like f.read(n), but for strings.
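For what it's worth, the standard library's io.StringIO (or io.BytesIO for bytes) wraps an in-memory string in a file-like object whose read(n) behaves just like f.read(n) on a file:

```python
import io

# A file-like view over an in-memory string: each read(n) consumes n characters.
buf = io.StringIO("abcdefghij")
first = buf.read(3)
second = buf.read(3)
```

Successive reads pick up where the previous one left off, with no copying of the untouched remainder.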

Not sure if it's relevant, but for more context this is what I'm doing:

I'm reading data from a very large file (several GB) into much smaller (and more manageable) chunks. I chunk the file using f.read(), and store those chunks as strings in a list. However, even those buckets are still too large for what I ultimately need to do with the data, so I want to grab only parts of a bucket at a time.

Originally, I bypassed the whole bucket thing, and just chunked the file into chunks that were small enough for my purposes. However, this led to me having to chunk the file hundreds of thousands of times, which got kind of slow. My hope now is to be able to have buckets queued up so that while I'm doing something with one bucket, I can begin reading from others. If any of this sounds confusing, let me know and I'll try to clarify.



If you're using strs (in Python 2; or bytes in Python 3), each character is a single byte, so reading 5 bytes with f.read(5) gives you the same data as slicing the first 5 bytes of the string with s[:5]. If you want just the first 5 bytes from every string in a list, you could do

[s[:5] for s in buckets]

But be aware that this makes a copy of all those strings. It would be more memory-efficient to take just the data you want as you read it, rather than build a bunch of intermediate lists, and then hand that data off to another thread for processing while you continue reading the file.

import threading

def worker(chunk):
    # do stuff with chunk
    pass

def main():
    with open('file', 'r') as f:
        bucket = f.read(500)
        while bucket:
            chunk = bucket[:5]
            thread = threading.Thread(target=worker, args=(chunk,))
            thread.start()  # hand the chunk off so reading can continue
            bucket = f.read(500)
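To get the "buckets queued up" behavior the question describes, one sketch is a bounded queue.Queue with a single consumer thread; the in-memory stream and the tiny bucket/chunk sizes below are placeholders for the real file and sizes:

```python
import io
import queue
import threading

def consume(q, results):
    # Pull buckets off the queue until a None sentinel arrives.
    while True:
        bucket = q.get()
        if bucket is None:
            break
        # Process the bucket in small chunks while the reader keeps going.
        for start in range(0, len(bucket), 3):
            results.append(bucket[start:start + 3])

q = queue.Queue(maxsize=4)  # bounded, so the reader can't race too far ahead
results = []
t = threading.Thread(target=consume, args=(q, results))
t.start()

# Stand-in for the real multi-GB file: read 6-byte buckets from a stream.
f = io.StringIO("abcdefghijkl")
bucket = f.read(6)
while bucket:
    q.put(bucket)  # blocks if the consumer falls behind
    bucket = f.read(6)
q.put(None)  # sentinel: no more buckets
t.join()
```

The bounded queue caps memory use: at most maxsize buckets are in flight, and reading overlaps with processing instead of alternating with it.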