Allie H - 2 months ago
C Question

Parsing a tab-delimited text file

I'm trying to write code that will parse a tab-delimited text file by assigning each string between tabs to a given element of a sample struct that I've defined. In the input file, the first row will have all the class identifiers (c_name), the second row will have all the sample identifiers (s_name), and the rest of the rows will contain data.

I know it's going to be a bit more complicated because the first column will actually just contain labels, but I figured I'd start with trying to figure out the general parsing scheme.

I gather that, for the class identifiers for example, I should probably be using fscanf in a for loop to add each identifier to the class field of a given sample, but I'm getting lost in the actual implementation. Based on one post I saw, I thought I could do something along the lines of using

%[^\t]\t
in fscanf to read into an array everything that's not a tab up to a tab, but I don't think I have this quite right.

Any suggestions would be greatly appreciated.

#define LENGTH 30
#define MAX_OBS 80000

typedef struct
{
    char c_name[LENGTH];
    char s_name[LENGTH];
    double value[MAX_OBS];
}
sample;

// I've already calculated the number of columns in the file
sample sample[total_columns];
for (int i = 0; i < total_columns; i++)
{
    fscanf(input, "%[^\t]\t", sample[i].s_name);
}


Edit: I've tried several variations of the format string in the code below ("%[^\t\n\r]\t\n\r", "%[^\t\n\r]%*1[\t\n\r]", and " %[^\t\n\r]"). They all seem to work in general, except that, depending on the size I allocate for data and how many iterations I run, I get a segmentation fault at some point. The code below segfaults immediately, but if I arbitrarily change total_columns in both places to 3, it prints Class Case Case. This seems to work up until 14, at which point the whole program crashes with a segmentation fault. I'm fairly confused about the issue here. I've also tried allocating the sample data array with malloc, in case it was a stack vs. heap issue, but that doesn't seem to help either. Thanks so much for your help!

sample data[total_columns];
fseek(input, 0, SEEK_SET);
for (int i = 0; i < total_columns; i++)
{
    fscanf(input, "%[^\t\n\r]\t\n\r", data[i].s_name);
    printf("%s\n", data[i].s_name);
}


An example input file would look like:

Class Case Case Case Case Case Case Case Case Case Case Case Case Case Case Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control
Subject G038 G144 G135 G161 G116 G165 G133 G069 G002 G059 G039 G026 G125 G149 G108 G121 G060 G140 G127 G113 G023 G147 G011 G019 G148 G132 G010 G142 G020 G021
Data1 0.000741628 0.00308607 0.000267431 0.001418697 0.001237904 0.000761145 0.0008281 0.002426075 0.000236698 0.004924871 0.000722752 0.003758006 0.000104813 0.000986619 0.000121803 0.000666854 0 0.000171394 0.000877993 0.002717391 0.001336501 0.000812089 0.001448743 5.28E-05 0.001944298 0.000292529 0.000469631 0.001674047 0.000651526 0.000336615
Data2 0.102002396 0.108035127 0.015052531 0.079923731 0.020643362 0.086480609 0.017907667 0.016279315 0.076263965 0.034876124 0.187481931 0.090615572 0.037460171 0.143326961 0.029628502 0.049487575 0.020175439 0.122975405 0.019754837 0.006702899 0.014033264 0.040024363 0.076610375 0.069287599 0.098896479 0.011813681 0.293331246 0.037558052 0.303052867 0.137591517
Data2 0.218495065 0.242891829 0.23747851 0.101306336 0.309040188 0.237477347 0.293837554 0.34351816 0.217572429 0.168651691 0.179387106 0.166516699 0.099970652 0.181003474 0.076126675 0.10244981 0.449561404 0.139257863 0.127579104 0.355797101 0.354544105 0.262855651 0.10167146 0.186068602 0.316763006 0.187466247 0.05701315 0.123825467 0.064780343 0.069847682
Data4 0.141137543 0.090948286 0.102502388 0.013063365 0.162060849 0.166292135 0.070215996 0.063535037 0.333743609 0.131011609 0.140936687 0.150108506 0.07812762 0.230704405 0.069792935 0.120770743 0.164473684 0.448110378 0.42599534 0.074094203 0.096525097 0.157661185 0.036737518 0.213931398 0.091119285 0.438073807 0.224921728 0.187034237 0.06611442 0.086005218
Data5 0.003594044 0.003948354 0.008137536 0.001327901 0.002161974 0.003552012 0.002760334 0.001898667 0.001420186 0.003165988 0.001011853 0.001217382 0.000314439 0.004254794 0.000213155 0.003650147 0 0.002742309 0.002633978 0 0.002524503 0.002146234 0.001751465 0.006543536 0.003941146 0.00049505 0.00435191 0.001944054 0.001303053 0.004207692
Data6 0.000285242 2.27E-05 0 1.13E-05 0.0002964 3.62E-05 0.000138017 0.000210963 0.000662753 0 0 0 0 4.11E-05 0 0 0 0 0.000101307 0 0 0 0 5.28E-05 0.00152391 0 0 0 0 0
Data7 0.002624223 0.001134584 0.00095511 0.000419934 0.000401011 0.001739761 0.00272583 0.002566717 0.000520735 0.002311674 0.006287944 0 6.29E-05 0.000143882 3.05E-05 0.000491366 0 0 3.38E-05 0 0.001782002 0.000957104 0.002594763 0.000527704 0.000105097 0.001192619 3.13E-05 0 0.000744602 0.000252461
Data8 0.392777683 0.383875286 0.451499522 0.684663315 0.387394299 0.357992026 0.488406597 0.423473155 0.27267563 0.47454646 0.331020526 0.484041709 0.735955056 0.338841956 0.781699147 0.625403622 0.313596491 0.270545891 0.379259109 0.498913043 0.372438372 0.446271644 0.606698813 0.305593668 0.360535996 0.29889739 0.328710081 0.521222594 0.419924299 0.584111756


Edit: I seem to have fixed it by changing the MAX_OBS definition - pretty sure I have a fundamental misunderstanding of what that actually means. I'll have to look into that. Thanks again for the help!
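
For reference, MAX_OBS sets how many doubles are embedded in every sample, so it multiplies the size of the whole array. The sketch below is only illustrative: array_bytes is a hypothetical helper, and the 8 MiB figure in the comments assumes a common Linux default stack limit and 8-byte doubles, neither of which is stated in the question.

```c
#include <stddef.h>
#include <stdio.h>

#define LENGTH 30
#define MAX_OBS 80000

typedef struct
{
    char c_name[LENGTH];
    char s_name[LENGTH];
    double value[MAX_OBS];   /* MAX_OBS doubles are stored inside every sample */
}
sample;

/* Hypothetical helper: bytes needed for an array of n_columns samples. */
size_t array_bytes(size_t n_columns)
{
    return n_columns * sizeof(sample);
}

/* Print the footprint of one sample and of a few array sizes.
   With 8-byte doubles, one sample is roughly 640 KB, so a local
   (stack) array of them passes an assumed 8 MiB stack limit quickly. */
void print_footprint(void)
{
    const size_t assumed_stack_limit = (size_t)8 * 1024 * 1024; /* assumption */

    printf("one sample: %zu bytes\n", sizeof(sample));
    for (size_t n = 3; n <= 30; n += 9) {
        printf("%2zu samples: %zu bytes (%s the assumed limit)\n",
               n, array_bytes(n),
               array_bytes(n) > assumed_stack_limit ? "over" : "under");
    }
}
```

Under those assumptions, array_bytes(14) already exceeds 8 MiB, which would be consistent with the crash appearing around 14 columns when the array is a local variable. Lowering MAX_OBS shrinks every sample, which is one plausible reason the change made the crash go away.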

Answer

Try this:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define LENGTH 30
#define MAX_OBS 80000

typedef struct{
    char c_name[LENGTH];
    char s_name[LENGTH];
    double value[MAX_OBS];
} Sample; // Type and variable names should not be duplicated (pointed out by Jonathan Leffler).

int main(void){
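    char line[1024]; // assumes the header row fits in 1023 characters
    FILE *input = fopen("data.txt", "r");
    if (input == NULL) {
        perror("data.txt");
        return EXIT_FAILURE;
    }

    // Read the header row so its columns can be counted.
    if (fgets(line, sizeof(line), input) == NULL) {
        fprintf(stderr, "data.txt is empty\n");
        fclose(input);
        return EXIT_FAILURE;
    }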

    int total_columns = 0;
    char *p = strtok(line, "\t\n");

    while(p){
        ++total_columns;
        p = strtok(NULL, "\t\n");
    }
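    --total_columns; // the first column holds row labels, not a sample
    if (total_columns < 1) {
        fprintf(stderr, "no sample columns found\n");
        fclose(input);
        return EXIT_FAILURE;
    }
    rewind(input);

    // Each Sample holds MAX_OBS doubles, which is too large for the stack, so allocate with malloc.
    Sample *sample = malloc(total_columns * sizeof(*sample));
    if (sample == NULL) {
        perror("malloc");
        fclose(input);
        return EXIT_FAILURE;
    }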

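    // In a scanf format, "\t" matches any run of whitespace, including the newline that ends a row.
    // The width 29 is LENGTH - 1, so a long field cannot overflow the 30-byte arrays.
    fscanf(input, "%*s\t"); // skip the row label in the first column
    for (int i = 0; i < total_columns; i++){
        fscanf(input, "%29[^\t\n]\t", sample[i].c_name); // \n in the scanset ends the last column
    }
    fscanf(input, "%*s\t"); // skip the row label in the first column
    for (int i = 0; i < total_columns; i++){
        fscanf(input, "%29[^\t\n]\t", sample[i].s_name);
    }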
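    int r; // number of data rows read
    for(r = 0; r < MAX_OBS; ++r){
        if(EOF==fscanf(input, "%*s")) break; // skip the row label; stop at end of file
        for (int i = 0; i < total_columns; i++){
            fscanf(input, "%lf", &sample[i].value[r]); // %lf skips leading whitespace itself
        }
    }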
    fclose(input);

    // test print: first and last sample
    int last = total_columns - 1; // avoids hard-coding index 29
    printf("%s\n", sample[0].c_name);
    printf("%s\n", sample[0].s_name);
    for(int i = 0; i < r; ++i)
        printf("%f\n", sample[0].value[i]);
    printf("\n%s\n", sample[last].c_name);
    printf("%s\n", sample[last].s_name);
    for(int i = 0; i < r; ++i)
        printf("%f\n", sample[last].value[i]);
    free(sample);
}
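
A note on the scanset conversions: a field width keeps a long field from writing past the LENGTH-byte arrays, and the return value tells you whether a field was actually read. The following is a minimal sketch of that idea on a string, using sscanf so it can be tried without a file. split_tabs is a hypothetical helper, not part of the program above, and it assumes fields are never empty (two tabs in a row stop the scan).

```c
#include <stdio.h>
#include <string.h>

#define LENGTH 30

/* Hypothetical helper: copy up to max_fields tab-separated fields from
   one line into out, truncating each field to LENGTH - 1 characters.
   Returns the number of fields stored. Stops at an empty field. */
int split_tabs(const char *line, char out[][LENGTH], int max_fields)
{
    int count = 0;
    const char *p = line;

    while (count < max_fields) {
        int used = 0;

        /* The width 29 must equal LENGTH - 1; %n records how many
           characters were consumed and is not counted in the result. */
        if (sscanf(p, "%29[^\t\n\r]%n", out[count], &used) != 1)
            break;                      /* empty field or end of input */
        ++count;

        p += used;
        p += strcspn(p, "\t\n\r");      /* discard the tail of an over-long field */
        if (*p != '\t')
            break;                      /* end of line or end of string */
        ++p;                            /* step over the tab */
    }
    return count;
}
```

For example, calling split_tabs("Class\tCase\tControl\n", out, 4) with char out[4][LENGTH] stores "Class", "Case" and "Control" and returns 3. Reading a whole line with fgets and splitting it this way also keeps the rows aligned if one line turns out to be malformed, which is harder to guarantee with consecutive fscanf calls.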