Melissa goodall - 20 days ago
C++ Question

locating matched words in a string

I have a file A that contains multiple paragraphs, and I need to identify where words from another file B occur in it. For every word in A, including those that match a word in file B, I need to report the paragraph number, line (sentence) number, and word number.

I've finally gotten this far after giving up on vectors, arrays, and string splitting, and learning (I think) stringstream. Currently I read in a line, split it on "." into sentences, then read each sentence back in and split it on " ". The line numbers and word numbers are counting, and the matching works, but I just can't get the paragraph numbers right: I've realized that p++ is actually counting the lines, and l++ is counting words as well. Could someone please help me?

Edit: Each paragraph is separated by "\n" and each sentence is separated by ".". I'll still have to figure out a way to ignore all other punctuation so that words match 100% and are not thrown off by a comma, semicolon, or other punctuation. I'm guessing a regex will be involved somewhere.

input from file with the text would look like:

    My dog has fleas in his weak knees. This is a line.  The paragraph is ending.'\n'
    Fleas is a word to be matched. here is another line. The paragraph is ending.'\n'


output should look something like:


w1, p1, s1, word, My
w2, p1, s1, word, dog
w3, p1, s1, word, has
w4, p1, s1, namedEntity, fleas


#include <iostream>
#include <string>
#include <fstream> // FILE I/O
#include <sstream> // used for splitting strings based upon a delimiting character

using namespace std;

int main() {

ofstream fout;
ifstream fin;
ifstream fmatch;
string line;
string word;
string para;
string strmatch;
int p = 0, l = 0, w = 0;

stringstream pbuffer, lbuffer;
fin.open("text.txt");

while (getline(fin, para)) { //get the paragraphs
pbuffer.clear();
pbuffer.str("."); //split on periods
pbuffer << para;
p++; //increase paragraph number

while (pbuffer >> line) { //feed back into a new buffer

lbuffer.clear();
lbuffer.str(" "); //splitting on spaces
lbuffer << line;
l++; //line counter

while (lbuffer >> word) { //feed back in
cout << "l " << l << " W: " << w << " " << word;
fmatch.open("match.txt");
while (fmatch >> strmatch) { //did I find a match?
if (strmatch.compare(word) == 0) {
cout << " Matched!\n";
}
else {
cout << "\n";
}

}
fmatch.close();
w++; //increase word count
}

}
}

fin.close();

cin.sync();
cin.get();
}

Answer

Since you say that you can write out each word as you read it, we won't bother with a collection. We'll just use istringstream and istream_iterator and count the indices.
Assuming that fin is good, I'm going to simply write to cout; you can make the appropriate adjustments to write to your file.

First you'll need to read your "match.txt" (the fmatch stream) into a vector<string> like so:

const vector<string> strmatch{ istream_iterator<string>(fmatch), istream_iterator<string>() };
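To try this step without a file on disk, the same iterator pair can be fed from any std::istream. The sketch below is only an illustration: the helper name read_match_words is an assumption, not something from the question or the answer.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original answer): reads every
// whitespace-separated token from `in` into a vector.
std::vector<std::string> read_match_words(std::istream& in) {
    return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>());
}
```

Passing an std::ifstream opened on the match file, or an std::istringstream holding test words, should both work because each is an std::istream.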

Then you'll just wanna use that in a nested loop:

string paragraph;
string sentence;

for(auto p = 1; getline(fin, paragraph, '\n'); ++p) {
    istringstream sentences{ paragraph };

    for(auto s = 1; getline(sentences, sentence, '.'); ++s) {
        istringstream words{ sentence };

        for_each(istream_iterator<string>(words), istream_iterator<string>(),
                 [&, i = 1](const auto& word) mutable {
                     cout << 'w' << i++ << ", p" << p << ", s" << s
                          << (find(cbegin(strmatch), cend(strmatch), word) == cend(strmatch) ? ", word, " : ", namedEntity, ")
                          << word << endl;
                 });
    }
}
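For anyone who wants to test the nested loops above in isolation, here is a minimal sketch that wraps the same idea in a function. The name annotate and its stream parameters are assumptions made for this illustration, and the innermost for_each is swapped for a plain loop so the per-sentence word counter is easier to see.

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper (an assumption for this sketch): reads paragraphs
// from `text`, and writes one "w, p, s, kind, word" row per word to `out`.
void annotate(std::istream& text, const std::vector<std::string>& matches,
              std::ostream& out) {
    std::string paragraph;
    std::string sentence;

    // One paragraph per '\n'-terminated line.
    for (int p = 1; std::getline(text, paragraph, '\n'); ++p) {
        std::istringstream sentences{ paragraph };

        // One sentence per '.'-terminated chunk of the paragraph.
        for (int s = 1; std::getline(sentences, sentence, '.'); ++s) {
            std::istringstream words{ sentence };
            std::string word;

            // w restarts at 1 for every sentence.
            for (int w = 1; words >> word; ++w) {
                const bool matched =
                    std::find(matches.cbegin(), matches.cend(), word) != matches.cend();
                out << 'w' << w << ", p" << p << ", s" << s
                    << (matched ? ", namedEntity, " : ", word, ") << word << '\n';
            }
        }
    }
}
```

Note that the comparison is exact and case-sensitive, and that a blank line in the input would still be counted as a paragraph in this sketch.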


EDIT:

By way of explanation, I'm using a for_each to call a lambda on each word in the sentence.

Let's break apart the lambda and explain what each section does:

  • [& This exposes, by reference, any variable in the scope in which the lambda was declared to the lambda for use: http://en.cppreference.com/w/cpp/language/lambda#Lambda_capture Because I'm using strmatch, p, and s in the lambda, those will be captured by reference.
  • , i = 1] C++14 allows us to declare a variable of deduced (auto) type in the lambda capture, so i is an int that is reinitialized each time the scope in which the lambda is declared is re-entered; here that's the nested for-loop.
  • (const auto& word) This is the parameter list passed into the lambda: http://en.cppreference.com/w/cpp/language/lambda Here for_each will just be passing in strings.
  • mutable Because I'm modifying i, which is owned by the lambda, I need it to be non-const, so I declare the lambda mutable.
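The pieces listed above can be seen working together in a smaller setting. This is a minimal sketch, assuming a hypothetical helper number_words that is not part of the answer; it only shows the capture-by-reference, the C++14 init-capture, and mutable.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper for illustration: labels each word with its position.
std::vector<std::string> number_words(const std::vector<std::string>& words) {
    std::vector<std::string> labelled;

    // [&, i = 1]: labelled is captured by reference, while i is an int
    // owned by the lambda; mutable lets the body modify that i.
    std::for_each(words.cbegin(), words.cend(),
                  [&, i = 1](const auto& word) mutable {
                      labelled.push_back('w' + std::to_string(i++) + ' ' + word);
                  });
    return labelled;
}
```

Each call builds a new lambda, so the counter starts again at 1, which mirrors how the word index restarts for every sentence in the nested loop.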

In the lambda's body I just use find to check whether the word is in strmatch, and the standard insertion operators to write the values.
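Because find compares strings exactly, a word such as "Fleas" or "knees," would not match "fleas" or "knees". The question mentions wanting to ignore punctuation; one possible approach, sketched here with a hypothetical normalize helper (an assumption, not part of the answer, and not the regex the question guesses at), is to clean each word before comparing.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: drops punctuation characters and lowercases the rest.
std::string normalize(std::string word) {
    // Remove every character that std::ispunct classifies as punctuation.
    word.erase(std::remove_if(word.begin(), word.end(),
                              [](unsigned char c) { return std::ispunct(c) != 0; }),
               word.end());
    // Lowercase what remains so "Fleas" and "fleas" compare equal.
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}
```

If this were used, both the words read from the text and the words read from the match file would need to go through the same function. Be aware that it also strips apostrophes and hyphens, which may or may not be what you want.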
