user6615434 - 1 month ago
C Question

Parsing token strings in C

I am trying to parse a CSV file in C. I scan each line of my file into an array called lines, and that part works. Then I check each character in the line to see whether it is a comma (ASCII 44).

I am having trouble with the last else statement, which should start a new token when there is a comma.

The first token of the line is always read correctly, but the rest are not (strange symbols/characters appear in output). I tried removing the '\0' statement, since I'm not sure that I needed it, but I have the same problem. I am guessing this is some kind of undefined behavior, but I am not sure.

Thanks!

//[rows = num strings] [max num chars per string]
int max_len = 21;
int num_strings = 12;
char lines[num_strings][max_len];

//Open file
data = fopen("data.txt", "r");

//Check if file opened correctly
if (data == NULL) {
    printf ("File did not open correctly.\n");
}

//Parse each token
char tokens[60][21];
int counter = 0;
//Read each line
for(int i=0; i<num_strings; i++)
{
    //Scan line into lines[i]
    fscanf(data, "%s", lines[i]);

    printf("\nThis line = %s\n",lines[i]);

    //Read each char in line
    for(int j=0; j<strlen(lines[i]); j++)
    {
        char *c = &lines[i][j];
        //printf("Current char of line: %c\n", c[0]);

        //If it's not a comma (or null character), add to current token
        if(c[0] != 44) {
            tokens[counter][j] = c[0];
        } else {//If it is, terminate string and go to next token
            tokens[counter][j] = '\0';
            printf("This token = %s\n",tokens[counter]);
            counter++;
        }
    }
}

Answer

My suggestion is to draw a diagram of your strings. Say you have this line and you have just found the first comma:

      .          1         2
      .01234567890123456789012
 i -> |aaaa,bbb,cccccc,dddd,e\0
      .    ^ 
           j

This is the tokens array:

          01234      
 counter |aaaa\0 

Now you increment counter, but j is not reset: it keeps advancing and is still used as the write index into tokens[counter]. So at the next comma you will have:

      .          1         2
      .01234567890123456789012
 i -> |aaaa,bbb,cccccc,dddd,e\0
      .        ^ 
               j

and the next row of the tokens array will look like this, where ? marks bytes that were never written:

            01234 567     
           |aaaa\0 
   counter |????? bbb\0 

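Not exactly what you intended, right? Printing that row with %s starts at index 0, so you get whatever uninitialized bytes happen to be there, which is where the strange characters come from.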

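You should find another way to copy the characters into the tokens array, for example with a separate index for the position inside the current token that is reset to 0 at every comma.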
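
One way to do the copy is to keep a second index for the write position inside the current token and reset it to 0 at every comma. Here is a minimal sketch of that, assuming the same 60 x 21 tokens array as in the question. The function name split_line, the macros MAX_TOKENS and MAX_LEN, and the index k are names I made up for the example, not anything from your code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TOKENS 60   /* rows in the tokens array (assumed, as in the question) */
#define MAX_LEN    21   /* chars per token, including the terminating '\0' */

/* Split one comma-separated line into tokens, starting at row `counter`.
   Returns the updated counter, i.e. the index of the next free row.
   Hypothetical helper, written only to illustrate the separate index. */
static int split_line(const char *line, char tokens[][MAX_LEN], int counter)
{
    assert(line != NULL);
    int k = 0;  /* write position inside the current token */

    for (size_t j = 0; line[j] != '\0' && counter < MAX_TOKENS; j++) {
        if (line[j] != ',') {
            /* Leave room for '\0'; characters that do not fit are dropped. */
            if (k < MAX_LEN - 1)
                tokens[counter][k++] = line[j];
        } else {
            tokens[counter][k] = '\0';  /* terminate the current token */
            counter++;                  /* move to the next row ...    */
            k = 0;                      /* ... and start at column 0   */
        }
    }

    /* The last token ends at the end of the line, not at a comma. */
    if (counter < MAX_TOKENS) {
        tokens[counter][k] = '\0';
        counter++;
    }
    return counter;
}
```

You would call it once per line, as in counter = split_line(lines[i], tokens, counter);. Note that the loop in the question only terminates a token when it sees a comma, so the last token of each line is never terminated or printed; the block after the loop in the sketch takes care of that.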

May I suggest that, if you just need to fill the tokens array, you get rid of the lines array entirely and read the file one character at a time?
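
Reading one character at a time could look roughly like the sketch below. It uses fgetc and treats both ',' and '\n' as token separators. The name read_tokens and the two macros are my own, and the sketch makes simplifying assumptions: no quoted fields, no '\r' handling, characters beyond 20 per token are dropped, and an empty field at the very end of a file that has no final newline is not recorded.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_TOKENS 60   /* rows in the tokens array (assumed, as in the question) */
#define MAX_LEN    21   /* chars per token, including the terminating '\0' */

/* Read comma- or newline-separated tokens from fp, one character at a time.
   Returns the number of tokens stored. Hypothetical helper for illustration. */
static int read_tokens(FILE *fp, char tokens[][MAX_LEN])
{
    assert(fp != NULL);
    int count = 0;  /* current row in tokens */
    int k = 0;      /* write position inside the current token */
    int ch;         /* int rather than char, so EOF can be told apart */

    while (count < MAX_TOKENS && (ch = fgetc(fp)) != EOF) {
        if (ch == ',' || ch == '\n') {
            tokens[count][k] = '\0';  /* end of the current token */
            count++;
            k = 0;
        } else if (k < MAX_LEN - 1) {
            tokens[count][k++] = (char)ch;
        }
    }

    /* A final token that is not followed by a newline still needs its '\0'. */
    if (k > 0 && count < MAX_TOKENS) {
        tokens[count][k] = '\0';
        count++;
    }
    return count;
}
```

With this there is no lines array and no j at all, so the indexing mix-up from the diagrams cannot happen.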

Also, I suppose this is just for practice, as you did not mention that a CSV field may contain a comma inside a quoted string:

  aaaa,"bb,bb",ccc

has three fields.
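
If you ever want to handle that case, a flag that tracks whether you are inside double quotes is usually enough for simple files. The sketch below is only an illustration: split_csv_line, MAX_LEN and the parameter names are mine, and it assumes one record per line, treats a doubled quote inside a quoted field as a literal quote, and drops characters that do not fit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LEN 21   /* chars per field, including the terminating '\0' */

/* Split `line` into at most max_fields fields, keeping commas that appear
   inside double quotes. Returns the number of fields stored.
   Hypothetical helper for illustration, not a full CSV parser. */
static int split_csv_line(const char *line, char fields[][MAX_LEN], int max_fields)
{
    assert(line != NULL && max_fields > 0);
    int count = 0;          /* current field */
    int k = 0;              /* write position inside the current field */
    bool in_quotes = false;

    for (size_t j = 0; line[j] != '\0'; j++) {
        char c = line[j];
        if (c == '"') {
            if (in_quotes && line[j + 1] == '"') {
                /* "" inside a quoted field stands for one literal quote */
                if (k < MAX_LEN - 1)
                    fields[count][k++] = '"';
                j++;  /* skip the second quote */
            } else {
                in_quotes = !in_quotes;  /* opening or closing quote */
            }
        } else if (c == ',' && !in_quotes) {
            fields[count][k] = '\0';     /* separator outside quotes */
            k = 0;
            if (++count == max_fields)
                return count;            /* no room left; rest is ignored */
        } else if (k < MAX_LEN - 1) {
            fields[count][k++] = c;      /* ordinary char, or comma in quotes */
        }
    }

    fields[count][k] = '\0';  /* last field ends at the end of the line */
    return count + 1;
}
```

For the example line above this gives the three fields aaaa, bb,bb and ccc, with the surrounding quotes removed.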