user3591955 - 12 days ago
C Question

Fastest way to count number of lines?

The easiest way to count the lines in a file is something like this:

int ch; /* int, not char, so that EOF can be told apart from real bytes */
while ((ch = fgetc(fp)) != EOF)
{
    if (ch == '\n')
    {
        lines++;
    }
}


But now I have to count the number of lines in large files, and reading one character at a time will hurt performance.

Is there a better approach?

Answer

For fastest I/O, you usually want to read/write in multiples of the block size of your filesystem/OS.

You can query the block size by calling statfs or fstatfs on your file or file descriptor (read the man pages).

The struct statfs has a field f_bsize and sometimes also f_iosize:

optimal transfer block size

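The f_bsize field exists on every system that has statfs (Linux, the BSDs, Mac OS X), AFAIK. Note that statfs itself is not part of POSIX; POSIX specifies statvfs/fstatvfs instead, whose struct statvfs also has an f_bsize field. On Mac OS X and iOS, there's also f_iosize, which is the one you'd prefer on these platforms (but f_bsize works on Mac OS X/iOS as well and should usually be the same as f_iosize, IIRC).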

#include <stdio.h>      // fileno
#include <sys/vfs.h>    // fstatfs on Linux; on Mac OS X/BSD use <sys/param.h> and <sys/mount.h>

struct statfs fsInfo = {0};
int fd = fileno(fp); // Get file descriptor from FILE*.
long optimalSize;

if (fstatfs(fd, &fsInfo) == -1) {
    // Querying failed! Fall back to a sane value, for example 8kB or 4MB.
    optimalSize = 4 * 1024 * 1024;
} else {
    optimalSize = fsInfo.f_bsize;
}
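
If you need something that does not depend on statfs, the same query can be sketched with POSIX's fstatvfs. This is only a minimal sketch: the helper name preferred_block_size and its fallback parameter are made up for illustration, and whether f_bsize is really the best transfer size depends on the platform.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/statvfs.h>

/* Hypothetical helper: returns the block size reported for the file system
 * that holds fp, or `fallback` if the query fails or reports zero. */
static size_t preferred_block_size(FILE *fp, size_t fallback)
{
    assert(fp != NULL);

    struct statvfs info;
    if (fstatvfs(fileno(fp), &info) == -1 || info.f_bsize == 0) {
        return fallback;
    }
    return (size_t)info.f_bsize;
}
```

A call such as preferred_block_size(fp, 4 * 1024 * 1024) would then take the place of the optimalSize logic above.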

Now allocate a buffer of that size and read (using read or fread) blocks of that size. Then iterate over this in-memory block and count the number of newlines. Repeat until EOF.
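
The read loop described above could look roughly like this. It is a sketch, not a tuned implementation: the helper name count_newlines is made up, buf_size is assumed to be the block size you queried (or your fallback), and memchr is used for the scan on the assumption that your libc's version is faster than a plain byte loop.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts '\n' bytes in fp, reading buf_size bytes at a
 * time. Returns the count, or -1 on allocation or read failure. */
static long long count_newlines(FILE *fp, size_t buf_size)
{
    assert(fp != NULL && buf_size > 0);

    char *buf = malloc(buf_size);
    if (buf == NULL) {
        return -1;
    }

    long long lines = 0;
    size_t got;
    while ((got = fread(buf, 1, buf_size, fp)) > 0) {
        /* Scan the bytes actually read for newlines. */
        const char *p = buf;
        const char *end = buf + got;
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            lines++;
            p++; /* continue right after the newline just found */
        }
    }

    int failed = ferror(fp); /* fread returns 0 for both EOF and errors */
    free(buf);
    return failed ? -1 : lines;
}
```

Like the loop in the question, this counts newline characters, so a last line without a trailing '\n' is not counted.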

A different approach is the one @Ioan proposed: use mmap to map the file into memory and iterate over that buffer. This probably gives you the best performance, since the kernel can read the data in whatever way is most efficient, but the mapping can fail for files that are "too large" to fit into the process's address space (mostly a concern on 32-bit systems). The approach I've described above always works with files of arbitrary size and gives you near-optimal performance.
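
For comparison, here is a minimal sketch of the mmap variant. The helper name count_newlines_mmap is made up, it maps the whole file in one go (so it simply gives up when the file doesn't fit into the address space), and it assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: counts '\n' bytes in the file at `path` via mmap.
 * Returns -1 if the file cannot be opened, stat'ed, or mapped. */
static long long count_newlines_mmap(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; } /* mmap rejects length 0 */
    if ((unsigned long long)st.st_size > SIZE_MAX) { close(fd); return -1; }

    size_t len = (size_t)st.st_size;
    char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (data == MAP_FAILED) return -1;

    long long lines = 0;
    const char *p = data;
    const char *end = data + len;
    while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        lines++;
        p++; /* continue right after the newline just found */
    }

    munmap(data, len);
    return lines;
}
```

If the -1 return comes from a failed mapping, you can still fall back to the block-wise read loop.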
