ameerosein ameerosein - 1 month ago 15
C++ Question

How to find the length of a string in a file without reading the entire file

I have a file containing a header and a very long string like:

>Ecoli100k
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTG
GTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGAC
....


I tried to retrieve the file size and header size using:

ifstream file(fileName.c_str(), ifstream::in | ifstream::binary);

string line1;
getline(file,line1);
int line1Size = line1.size();

file.seekg(0, ios::end);
long long fileSize = file.tellg();
file.close();


For example, for a file containing a string of length 100k with the header >Ecoli100k, fileSize is 101261 and line1Size is 10. Now, to calculate the length of the string without reading any more of the file:

101261 - (10+1) = 101250, which means that without the header line this file contains 101250 more characters.

101250/81 = 1250, which means there are 1250 full lines of 80 characters plus a \n each. Assuming the last line has no \n, we must subtract 1249 from 101250 to get the length of the string, but the result is wrong: we get 100k+1 instead of 100k.

In code:

int remainedLineCount =
(fileSize - line1Size - 1 - 1 /*the last line has no \n*/)/81 ;
cout<<(fileSize - line1Size - 1 - remainedLineCount )<<"\n";


In another example I added just one more character. Because of the extra newline in the file, the size changes to 101263, and the same calculation gives 100k+2 instead of 100k+1.

Does anyone know where this extra 1 comes from? Is there anything at the end of a file?

Edit:

As requested, here are the binary values (in hexadecimal) of the bytes at the beginning and end of the file:


offset 0: 3e 45 63 6f 6c 69 31 30 30 6b

offset 0000018b83: 54 47 47 43 41 47 41 41 43 0a


Thanks All.

Answer

There are several candidates:

  • If you're on Windows and the file was written in text mode, the first line plus its newline is stored in 10+2 chars, because '\n' is translated into '\r'+'\n'.
  • Again, if the file was written in text mode, an end-of-file char may have been added (not visible in text mode) that becomes readable in binary mode.
  • It is also implementation-dependent whether or not a '\n' is added to the last line of the file (see the explanation under my second edit).

Edit:

In case of doubt about the encoding, you could display the binary values (in hexadecimal) of the bytes at the beginning and end of your file:

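#include <iomanip>
#include <iostream>
using namespace std;

// Utility: print the current offset, then the next `count` bytes in hexadecimal.
// Note: hex and the '0' fill stay set on cout, so later offsets print as zero-padded hex.
void show (istream &ifs, int count) {
    cout <<"offset "<<setw(10)<<ifs.tellg()<<": ";
    for (int i=0; i<count; i++)   // was hard-coded to 10, ignoring count
        cout << setw(2) << setfill('0') <<hex<<ifs.get()<<" ";
    cout <<endl;
}

// with your newly opened filestream (opened in binary mode):
show(ifs, 10);              // first 10 bytes
ifs.seekg(-10,ios::end);
show(ifs, 10);              // last 10 bytes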

Edit 2:

So it appears that you have a newline at the end of your last line (ending ASCII code 0a in your output).

It's important to understand that text mode and binary mode may differ. The C++ standard doesn't detail these differences but relies, in its section 27.9.1.4, on C stdio, which is described in the C11 standard:

7.21.2/2: A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. Whether the last line requires a terminating new-line character is implementation-defined. Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation. Data read in from a text stream will necessarily compare equal to the data that were earlier written out to that stream only if: the data consist only of printing characters and the control characters horizontal tab and new-line; no new-line character is immediately preceded by space characters; and the last character is a new-line character. Whether space characters that are written out immediately before a new-line character appear when read in is implementation-defined.
