Kevin Sun - 1 year ago 79
R Question

Efficiently assigning a new column value based on multiple column conditions in R

I have a dataframe that contains information about many seller's IDs, and the period they made a sell. I want to create a new column called inactive if they didn't make a sell for the next 6 periods.

Here is the dput of a sample dataset:

``````structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2,
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10)), .Names = c("SellerID",
"Period"), row.names = c(NA, -13L), class = "data.frame")
``````

Here is the dput of my ideal outcome(row 5 has an Inactive of 1 because for that row, sellerID 1 made a sale at Period 2, but his next sale was at period 9 [row 10]. Thus, he was inactive for at least 6 periods and thus we want to record that in order to predict when a seller will be inactive):

``````structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2,
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10), Inactive = c(0,
0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0)), .Names = c("SellerID",
"Period", "Inactive"), row.names = c(NA, -13L), class = "data.frame")
``````

I tried solving this problem using a nest-for loop approach but my dataset is very large and it will take a very long time to run (about 200,000 rows). I also tried my approach on the sample dataset, but it seems to not work. Here is my approach below:

``````full.df\$Inactive <- NA
for (i in 1:nrow(full.df)){
temp = subset(full.df, SellerID = unique(full.df\$SellerID[i]))
for(j in 1:(nrow(temp) -1)){
if(temp\$Period[j+1] - temp\$Period[j] <6)
temp\$Inactive[j] <-0
else
temp\$Inactive[j] <-1
}
full.df[rownames(full.df) %in% rownames(temp), ]\$Inactive <- temp\$Inactive
}
``````

The output from the dummy dataset, using my approach puts a 0 in all the rows in "Inactive" except the last row is NA. Here is the dput of the output that I got:

``````structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2,
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10), Inactive = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA)), .Names = c("SellerID",
"Period", "Inactive"), row.names = c(NA, -13L), class = "data.frame")
``````

I am assuming 1 things here. Max range of period variable is 12.

Here is the logic: You order the data frame. Then you append 12 to the end of the list and take a difference. This would also categorise seller 3 who is inactive withing the range of 7 days.

``````df_s=df[with(df, order(SellerID, Period)),]
g=split(df\$Period, df\$SellerID)
l=lapply(g, function(x) c(x,12) )
j=lapply(l, diff)
u=unlist(j, use.names = F)
df_s\$ind=ifelse(u>=7,1,0)
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download