For a deep learning model I need to load my data in batches. For every epoch (one full pass over all the data) every row needs to be used exactly once, but it's important that the data is fed to the algorithm in a random order. My dataset is too big to read fully into memory. It's sequence data of variable length, and the input format can be changed, since it's a dump that another script of mine outputs on a cluster. Currently each row is some meta info followed by the sequences, joined with ';'.
My current solution is a generator that shuffles all the line numbers, splits them into 4 chunks, and reads the file, parsing only the lines whose numbers fall in the current chunk. It yields batch-sized groups of sequences until the chunk is exhausted, then moves on to the next chunk of line numbers. It works, but I feel like there might be a better solution. Who has a better workflow? This is a problem I run into regularly. The issue is that I'm fully scanning the file for every chunk, every epoch. Even though I can get it to work with just 4 chunks, with 30 epochs that is 120 full reads of a big file.
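For reference, the chunked-rescan workflow described above can be sketched like this. The parsing details are assumptions (one tab-separated meta field followed by the ';'-joined sequences), not the actual format:

```python
import random

def batch_generator(path, batch_size, n_chunks=4):
    # Count lines once so we know which indices to shuffle.
    with open(path) as f:
        n_lines = sum(1 for _ in f)

    order = list(range(n_lines))
    random.shuffle(order)
    chunk_size = (n_lines + n_chunks - 1) // n_chunks

    for start in range(0, n_lines, chunk_size):
        wanted = set(order[start:start + chunk_size])
        # Full scan of the file per chunk -- this is the expensive part.
        rows = []
        with open(path) as f:
            for i, line in enumerate(f):
                if i in wanted:
                    meta, _, seqs = line.rstrip("\n").partition("\t")
                    rows.append((meta, seqs.split(";")))
        random.shuffle(rows)
        for j in range(0, len(rows), batch_size):
            yield rows[j:j + batch_size]
```

With 4 chunks and 30 epochs this scans the file 120 times, which is exactly the cost the question is asking to avoid.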
Build an in-memory index of the line start offsets (this takes a single pass through the file, without holding its contents in memory), and then you can access any line randomly and fast.
This isn't robust (no validation/range-checking, etc.) but:
BUFFER_LEN = 1024

def findNewLines(s):
    # Return the offsets just past every b"\n" in the buffer.
    retval = []
    lastPos = 0
    while True:
        pos = s.find(b"\n", lastPos)
        if pos >= 0:
            retval.append(pos + 1)
            lastPos = pos + 1
        else:
            break
    return retval

class RandomAccessFile(object):
    def __init__(self, fileName):
        self.fileName = fileName
        self.startPositions = [0]  # line i starts at startPositions[i]
        with open(fileName, "rb") as f:
            looking = True
            fileOffset = 0
            while looking:
                chunk = f.read(BUFFER_LEN)
                if len(chunk) < BUFFER_LEN:
                    looking = False
                for newLine in findNewLines(chunk):
                    self.startPositions.append(fileOffset + newLine)
                fileOffset += len(chunk)

    def GetLine(self, index):
        start, stop = self.startPositions[index], self.startPositions[index + 1]
        with open(self.fileName, "rb") as f:
            f.seek(start)
            return f.read(stop - start)

raf = RandomAccessFile('/usr/share/dict/words')
print(raf.GetLine(0).decode().rstrip())
print(raf.GetLine(10).decode().rstrip())
print(raf.GetLine(456).decode().rstrip())
print(raf.GetLine(71015).decode().rstrip())
$ python indexedFile.py
A
Aaronic
abrim
flippantness
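To tie this back to the batching problem: build the offset index once, then each epoch is just a fresh shuffle of the indices plus seeks, so the file is scanned once in total instead of chunks × epochs times. A self-contained sketch, assuming one training example per line:

```python
import random

def build_index(path):
    # One pass: record the byte offset where each line starts.
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def epoch_batches(path, offsets, batch_size):
    # Fresh shuffle per epoch; seek straight to each line, no rescans.
    order = list(range(len(offsets)))
    random.shuffle(order)
    with open(path, "rb") as f:
        for start in range(0, len(order), batch_size):
            batch = []
            for idx in order[start:start + batch_size]:
                f.seek(offsets[idx])
                batch.append(f.readline().decode().rstrip("\n"))
            yield batch
```

The seeks are random-access, so this is fastest on an SSD; on spinning disks the per-seek cost may still make it worth sorting each batch's offsets before reading.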