PD1 PD1 - 3 months ago 9
R Question

Search and replace entries in a dataframe in two columns

I have a certain data set in which there are few missing values.
the dataset looks like the following:

a b c0 d0 c1 d1 g h
1 5 20 10 NA NA 2 NA
1 6 NA NA 8 2 NA 4
2 5 25 10 NA NA 2.5 NA
2 7 NA NA 2 2 NA 1
2 8 50 10 NA NA 5 NA
3 9 10 10 NA NA 1 NA
3 6 NA NA 8 4 NA 2
3 10 NA NA 5 1 NA 5
4 5 NA NA 6 2 NA 3
4 11 25 10 NA NA 2.5 NA


My data is in the above mentioned format. Column
a
is a kind of time period which is in sequence and has multiple codes corresponding to it.
Column
b
just shows an item. This item either has a repeated entry in time or has an unique value.
Column
g
and
h
are just the columns made by dividing column
c0
/
d0
=
g
and
c1
/
d1
=
h
. Out here, column
g
holds more importance.

Now, since it is clear that there are few NA and some of the column
b
entries are duplicate whereas rest are unique.

I have to perform the following steps in order to compute the NA's in column 'g':


  1. I have to find in the 'column b' that is the entry repetitive or has an unique value.Eg : Entry 6 and 5 are repeated, whereas 7,8 9,10 and 11 are unique.

  2. Once it has been found, next step is to that whether there is some value in 'column g' already for the item or not.

  3. If there is, then we need to take average of the repetaed value in 'column g' if it's other than NA, like for item 5, I can find that the values are 2 and 2.5 and hence the average of 2.25 should be place in 'column g' for the repeated 5 value at a=4.

  4. Now, if there is a repeated value but still column g is NA, then I can simply take the 'column h' value as value of 'column g'.

  5. For the non repetitive items, like 9,10,7, etc. since they are unique, just replace the column g entry by column h.



The final output should be as follows:

a b c0 d0 c1 d1 g h
1 5 20 10 NA NA 2 NA
1 6 NA NA 8 2 4 4
2 5 25 10 NA NA 2.5 NA
2 7 NA NA 2 2 1 1
2 8 50 10 NA NA 5 NA
3 9 10 10 NA NA 1 NA
3 6 NA NA 8 4 2 2
3 10 NA NA 5 1 5 5
4 5 NA NA 6 2 2.25 3
4 11 25 10 NA NA 2.5 NA


Request you to help me out with it. In case, you have any question in understanding the question, do let me know or even if some more details are required.

Answer

We can reduce it to "one" step once we recognize that when grouped by b, duplicates imply that there are more than one row grouped. Therefore, the condition to replace the NA values in g by the mean of its group (that are not NA) is if:

the number of rows grouped by b is greater than one and not all of g in the group is NA

Otherwise, replace the NA values in g with h:

library(data.table)
setDT(df)[, g := if (.N > 1 & !all(is.na(g))) {
                   replace(g, is.na(g), mean(g, na.rm = TRUE)) 
                 } else {
                   replace(g, is.na(g), as.double(h)) 
                 }, by=b][]
##    a  b c0 d0 c1 d1    g  h
## 1: 1  5 20 10 NA NA 2.00 NA
## 2: 1  6 NA NA  8  2 4.00  4
## 3: 2  5 25 10 NA NA 2.50 NA
## 4: 2  7 NA NA  2  2 1.00  1
## 5: 2  8 50 10 NA NA 5.00 NA
## 6: 3  9 10 10 NA NA 1.00 NA
## 7: 3  6 NA NA  8  2 4.00  4
## 8: 3 10 NA NA  5  1 5.00  5
## 9: 4  5 NA NA  6  2 2.25  3
##10: 4 11 25 10 NA NA 2.50 NA