Phdaml Phdaml - 1 month ago 9
R Question

find First/ Last observation value by group?

I am trying to find the first/last observation by group. I tired both R and excel (because it is so slow in R so I tried excel). The excel took less than one second, but R took 8 MINUTES!!!. The code logic for both are almost the same.

The data is a panel data regarding purchasing fruit. The same shopper could buy multiple times at different time. I have 233,000 observations. Data is like(sorted by day shopper first and day):

Day Shopper Choice
1 A apple
2 A apple
1 B Banana
1 C apple
2 C Banana
3 C apple
1 D berry
2 D berry


My r code for the first obseravtion. I want to tag the first observation in a group by indicator "1", a new column.

for (i in 1:n)
{ ifelse (dt$shopper[i+1]==dt$shopper[i],newcol[i+1]<-0,newcol[i+1]<-1)
}


My excel code is:
if(B2<>B1,1,0)

I need the repurchase matrix given the same shopper. To define the repurchase : the repurchase of the "1st purchase" is the "2nd purchase"; And the repurchase of the "2nd purchase" is the "3rd purchase" The last purchase has no repurchase. Sorry it sounds like a twister.
so my solution is get two choice col and move the second col one row above so I can calculate the repurchase matrix by shopper/or aggregate.The desired output for tagging first obs by group should be as follow. With the columns of choice and choice 2 and I can calculate the repurchase matrix by nrow.

Day Shopper Choice tagging choice 2
1 A apple 0 *apple*
2 A apple *apple* 0
1 B Banana 0 0
1 C apple 0 Banana
2 C Banana Banana apple
3 C apple apple 0
1 D berry 0 *berry*
2 D berry *berry* 0


[update]. If there is only one purchase for this user, there is no repurchase. If the purchase is the last purchase of the user, there is no repurchase. So the final repurchase matrix by choice in this case is

second inside bracket are the probability
first apple banana berry
apple 1 (0.5) 1 0
banana 1 0 (0) 0
berry 0 0 1 (1)


Below is how I calculate the repurchase matrix after tagging the first purchase of the user. i is the row (apple, banana, Berry etc) and j (apple, banana, Berry etc) is the column. [the speed is ok give the fact that I repurchase matrix is 40*40 after tagging and adding the seconding choice column)

for (i in 1:n){
for(j in 1:n){
repurchase_matrix[i,j]=nrow(dt[dt[,1]==i&dt[,2]==j,])}}

Answer

First, assuming the data are sorted by Shopper and then by Day in ascending order, you can add a column indicating the purchase number with

df$Purchase <- unlist(with(df, tapply(Shopper, Shopper, seq_along)))
df
#  Day Shopper Choice Purchase
#1   1       A  apple        1
#2   2       A  apple        2
#3   1       B Banana        1
#4   1       C  apple        1
#5   2       C Banana        2
#6   3       C  apple        3
#7   1       D  berry        1
#8   2       D  berry        2

Then reshape the data-frame to "wide" format with

df.w <- reshape(df[c('Shopper', 'Choice', 'Purchase')],
                idvar='Shopper', v.names='Choice', timevar='Purchase',
                direction='wide')
df.w
#  Shopper Choice.1 Choice.2 Choice.3
#1       A    apple    apple     <NA>
#3       B   Banana     <NA>     <NA>
#4       C    apple   Banana    apple
#7       D    berry    berry     <NA>

Finally you calculate the repurchase matrix of the first two purchases

with(df.w, prop.table(table(First=Choice.1, Second=Choice.2)))
#        Second
#First        apple    Banana     berry
#  apple  0.3333333 0.3333333 0.0000000
#  Banana 0.0000000 0.0000000 0.0000000
#  berry  0.0000000 0.0000000 0.3333333

To calculate the repurchase matrix of all purchases, start with the repurchase matrices of every two consecutive purchases

repurchase <- lapply(seq(2, ncol(df.w) - 1),
                     function(i) table(First=df.w[[i]], Second=df.w[[i + 1]]))
repurchase <- simplify2array(repurchase)
repurchase
#, , 1
#
#        Second
#First    apple Banana berry
#  apple      1      1     0
#  Banana     0      0     0
#  berry      0      0     1
#
#, , 2
#
#        Second
#First    apple Banana berry
#  apple      0      0     0
#  Banana     1      0     0
#  berry      0      0     0

then add all matrices to get the "total" repurchase matrix

apply(repurchase, 1:2, sum)
#        Second
#First    apple Banana berry
#  apple      1      1     0
#  Banana     1      0     0
#  berry      0      0     1

(absolute frequencies)

prop.table(apply(repurchase, 1:2, sum))
#        Second
#First    apple Banana berry
#  apple   0.25   0.25  0.00
#  Banana  0.25   0.00  0.00
#  berry   0.00   0.00  0.25

(relative frequencies)