KBriggs KBriggs - 3 months ago 18
C Question

Optimizing disk IO

I have a piece of code that analyzes streams of data from very large (10-100GB) binary files. It works well, so it's time to start optimizing, and currently disk IO is the biggest bottleneck.

There are two types of files in use. The first type of file consists of a stream of 16-bit integers, which must be scaled after I/O to convert to a floating point value which is physically meaningful. I read the file in chunks, and I read in the chunks of data by reading one 16-bit code at a time, performing the required scaling, and then storing the result in an array. Code is below:

int64_t read_current_chimera(FILE *input, double *current,
int64_t position, int64_t length, chimera *daqsetup)
{
int64_t test;
uint16_t iv;

int64_t i;
int64_t read = 0;

if (fseeko64(input, (off64_t)position * sizeof(uint16_t), SEEK_SET))
{
return 0;
}

for (i = 0; i < length; i++)
{
test = fread(&iv, sizeof(uint16_t), 1, input);
if (test == 1)
{
read++;
current[i] = chimera_gain(iv, daqsetup);
}
else
{
perror("End of file reached");
break;
}
}
return read;
}


The chimera_gain function just takes a 16-bit integer, scales it and returns the double for storage.

The second file type contains 64-bit doubles, but it contains two columns, of which I only need the first. To do this I fread pairs of doubles and discard the second one. The double must also be endian-swapped before use. The code I use to do this is below:

int64_t read_current_double(FILE *input, double *current, int64_t position, int64_t length)
{
int64_t test;
double iv[2];

int64_t i;
int64_t read = 0;

if (fseeko64(input, (off64_t)position * 2 * sizeof(double), SEEK_SET))
{
return 0;
}

for (i = 0; i < length; i++)
{
test = fread(iv, sizeof(double), 2, input);
if (test == 2)
{
read++;
swapByteOrder((int64_t *)&iv[0]);
current[i] = iv[0];
}
else
{
perror("End of file reached: ");
break;
}
}
return read;
}


Can anyone suggest a method of reading these file types that would be significantly faster than what I am currently doing?

Answer

First off, it would be useful to use a profiler to identify the hot spots in your program. Based on your description of the problem, you have a lot of overhead going on by the sheer number of freads. As the files are large there will be a big benefit to increasing the amount of data you read per io.

Convince yourself of this by putting together 2 small programs that read the stream.

1) read it as you are in the example above, of 2 doubles.

2) read it the same way, but make it 10,000 doubles.

Time both runs a few times, and odds are you will be observe #2 runs much faster.

Best of luck.