RoadRunner RoadRunner - 2 months ago 15
C Question

Reading file into linked list

I am trying to read a text file I made into a linked list, the text file looks like this:

around 1 2 1
bread 2 4 3 5 1
four 1 3 2
head 3 1 2 2 1 5 1
has 2 3 1 5 2


Where the first string of each line are just words from a paragraph. The first number after the word is the number of lines the word was found in, in the paragraph. Then the following numbers are pairs of (line, occurrences) in the paragraph.

For example, for the word
bread
:

It was found in
2
lines in the paragraph. In the first line, line
4
, it was found
3
times. Then in the second line, line
5
, it was found
1
time.

I am trying to create a linked list from this text file, my program looks like this so far:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <ctype.h>

#define MAXWORD 999

typedef struct node node_t;

struct node {
char *word;
int num_lines;
int paragraph;
int freq;
node_t *next;
};

int
main(int argc, char *argv[]) {
FILE *fp;
char word[MAXWORD+1];
int ch, line_count = 0, len = 0;
node_t *node = (node_t*)malloc(sizeof(*node));
node_t *curr, *prev;

fp = fopen(argv[1], "r");

if (fp == NULL) {
fprintf(stderr, "Error reading file\n");
exit(EXIT_FAILURE);
}

/* Just trying to store the string so far */
while ((ch = getc(fp)) != EOF) {
if (ch == '\n') {
line_count++;
strcpy(node->word, word);
}

if (isalpha(ch)) {
word[len] = ch;
len++;
word[len] = '\0';
}

if (isdigit(ch)) {
len = 0;
}
}

printf("line count = %d", line_count);

free(node)

fclose(fp);

return 0;
}


In this snippet, I have been trying to store the string in the linked list data structure, but I have not yet used dynamic arrays to store the numbers after the word which occur in the text file. I know I will need to build this data structure using
malloc()
and
realloc()
, but I am unsure of how to do this.

How should I do this?

My desired output would look like this:

There are five words in the text file,
and 9 pairs of (line, occurences)

Word: pairs
"around": 2,1
"bread": 4,3; 5,1
"four": 3,2
"head": 1,2; 2,1; 5,1
"has": 3,1; 5,2


UPDATE

I have been researching this and it seems to be very similar to the inverted index problem, where I have seen that using a binary search tree would be best.

Could I implement my binary search tree like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <ctype.h>

#define MAXWORD 999

typedef char word_t[MAXWORD+1];

typedef struct node node_t;

struct node {
void *data;
int *ints;
node_t *rght;
node_t *left;
};

typedef struct {
node_t *root;
int (*cmp)(void*, void*);
} tree_t;

int
main(int argc, char *argv[]) {
FILE *fp;

fp = fopen(argv[1], "r");

if (fp == NULL) {
fprintf(stderr, "Error reading file\n");
exit(EXIT_FAILURE);
}

while ((ch = getc(fp)) != EOF) {
if (ch == '\n') {
line_count++;
}
}

fclose(fp);

return 0;
}

Answer

I wrote a program that does what I think you are looking for. I modified the structs I was thinking about before:

typedef node node_t;

struct node {
    char *word;
    int num_lines;
    int *location;
    int *frequency;
    node_t *next;
};

This way the nodes contain pointers to arrays of int to store the location and frequency information. Nodes and storage for the word strings, location arrays, and frequency arrays are all dynamically allocated. I didn't add any code to free this memory yet, but that should be simple. The main code looks like:

struct node *head, *current;
char *pline, *pword;
char line[MAXLINE + 1];
char word[MAXWORD + 1];
int i, n, line_count = 0;

head = NULL;

...

/* Read in lines and process words */
pline = line;
pword = word;
while (fgets(pline, MAXLINE, fp) != NULL) {
    ++line_count;
    strip(pline);
    while ((pword = strtok(pline, " ")) != NULL) {
        normalize_word(pword);
        if (*pword != '\0')      // don't add empty words
            head = update_word(pword, line_count, head);
        pline = NULL;
    }
}

/* Display list contents */

But most of the interesting stuff happens in the update_word() function. Here is a Pastebin link to the source code: http://pastebin.com/yDkUZUGu

I have no doubt that the code can be improved, but this does seem to work. I wouldn't say that this is exactly simple, but relatively straightforward. I ran it against this text file:

Three blind mice. Three blind mice.
See how they run. See how they run.
They all ran after the farmer's wife,
Who cut off their tails with a carving knife,
Did you ever see such a sight in your life,
As three blind mice?

Here are the results:

There are 31 words in the text file,
and 37 pairs of (line, occurrences)
Word: pairs
three: 1, 2; 6, 1;
blind: 1, 2; 6, 1;
mice: 1, 2; 6, 1;
see: 2, 2; 5, 1;
how: 2, 2;
they: 2, 2; 3, 1;
run: 2, 2;
all: 3, 1;
ran: 3, 1;
after: 3, 1;
the: 3, 1;
farmer's: 3, 1;
wife: 3, 1;
who: 4, 1;
cut: 4, 1;
off: 4, 1;
their: 4, 1;
tails: 4, 1;
with: 4, 1;
a: 4, 1; 5, 1;
carving: 4, 1;
knife: 4, 1;
did: 5, 1;
you: 5, 1;
ever: 5, 1;
such: 5, 1;
sight: 5, 1;
in: 5, 1;
your: 5, 1;
life: 5, 1;
as: 6, 1;