user3010126 - 3 months ago
R Question

R - Counting characters of strings in different cells, but including spaces

I have a data frame that looks like this:

SentenceID IA_ID label dt indx IA_TYPE count
1 1 This 271 1 non_target 4
1 2 is 98 2 non_target 2
1 3 an 159 3 non_target 2
1 4 example 319 4 non_target 7
1 5 of 284 5 non_target 2
1 6 a 235 6 non_target 1
1 7 data 218 7 non_target 4
1 8 file. 303 8 non_target 5
1 9 The 173 9 non_target 3
1 10 goal 387 10 target 4
1 11 is 155 11 non_target 2
1 12 to 278 12 non_target 2
1 13 extract 97 13 non_target 7
1 14 content 248 14 non_target 7
1 15 from 273 15 non_target 4
1 16 specific 225 16 non_target 8
1 17 cells 119 17 non_target 5
1 18 in 207 18 non_target 2
1 19 this 199 19 non_target 4
1 20 column. 93 20 non_target 7
2 1 The 206 21 non_target 3
2 2 cells 195 22 non_target 5
2 3 to 220 23 non_target 2
2 4 be 247 24 non_target 2
2 5 extracted 368 25 target 9
2 6 for 213 26 non_target 3
2 7 each 215 27 non_target 4
2 8 sentence 386 28 non_target 8
2 9 are 186 29 non_target 3
2 10 identified 137 30 non_target 10
2 11 by 154 31 non_target 2
2 12 an 101 32 non_target 2
2 13 ID 197 33 non_target 2
2 14 number 297 34 non_target 6
2 15 in 344 35 non_target 2
2 16 the 333 36 non_target 3
2 17 second 386 37 non_target 6
2 18 column. 346 38 non_target 7


And so on, with the value of "SentenceID" (first column) increasing every few lines when a new sentence begins. I was able to get a character count for each word (i.e. each cell in the column "label") and the total number of characters in each sentence with:

library(sqldf)
data$count <- nchar(as.character(data$label))
sentence.count <- sqldf("SELECT SentenceID, sum(count) as sentChar FROM data GROUP BY SentenceID")


However, that sentence.count does not include spaces, which I need. Essentially, I would need to add to it "n-1", where "n" is the total number of words in a sentence, or the total number of rows that have each sentence ID (-1 because there is no space to be counted after the final word). I can't seem to figure out the syntax for it, though. All the options I seem to find would work if I were dealing with a single string (i.e. if all the words in "label" were concatenated with spaces), rather than a series of strings in subsequent cells of a column in a data frame. Any ideas?

Answer

> where "n" is the total number of words in a sentence, or the total number of rows that have each sentence ID

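Shouldn't you be able to get that from your SQL call with a small modification? For example, this returns the number of rows (words) per sentence alongside the character total: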

 sentence.count <- sqldf("SELECT SentenceID, count(count) as nWords, sum(count) as sentChar 
                          FROM data GROUP BY SentenceID")

or maybe even do the arithmetic directly in the query, adding one space for every word except the last:

 sentence.count <- sqldf("SELECT SentenceID, sum(count) + count(count) - 1 as sentChar 
                          FROM data GROUP BY SentenceID")
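
If you would rather not go through sqldf, the same "sum + n - 1" idea can probably be written in base R with tapply. This is only a sketch: the toy data frame below is made up for illustration (it mimics the SentenceID and label columns from the question), and the cross-check rebuilds each sentence with paste() to confirm that the two counts agree.

```r
# Toy data in the same shape as the question (values are hypothetical)
data <- data.frame(
  SentenceID = c(1, 1, 1, 2, 2),
  label = c("This", "is", "fine.", "Two", "words"),
  stringsAsFactors = FALSE
)

# Per-word character count, as in the question
data$count <- nchar(as.character(data$label))

# Per sentence: characters in all words plus (number of words - 1) spaces
sentChar <- tapply(data$count, data$SentenceID,
                   function(x) sum(x) + length(x) - 1)

# Cross-check: join each sentence's words with single spaces and count directly
check <- tapply(as.character(data$label), data$SentenceID,
                function(w) nchar(paste(w, collapse = " ")))

# Same layout as the sqldf result: one row per sentence
sentence.count <- data.frame(SentenceID = as.numeric(names(sentChar)),
                             sentChar = as.vector(sentChar))
print(sentence.count)
```

Both routes assume the words in a sentence are separated by exactly one space and that every row of a sentence is one word, which is what the question describes.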