Phdaml - 8 months ago 46

R Question

I am trying to find the **first/last observation by group**. I tired both R and excel (because it is so slow in R so I tried excel). The **excel** took less than **one** **second**, but **R** took **8 MINUTES!!!**. The code logic for both are almost the same.

The data is a **panel data regarding purchasing fruit**. The same shopper could buy multiple times at different time. **I have 233,000 observations.** Data is like(sorted by day shopper first and day):

`Day Shopper Choice`

1 A apple

2 A apple

1 B Banana

1 C apple

2 C Banana

3 C apple

1 D berry

2 D berry

My r code for the first obseravtion. I want to tag the first observation in a group by indicator "1", a new column.

`for (i in 1:n)`

{ ifelse (dt$shopper[i+1]==dt$shopper[i],newcol[i+1]<-0,newcol[i+1]<-1)

}

My excel code is:

if(B2<>B1,1,0)

I need the

so my solution is get two choice col and move the second col one row above so I can calculate the repurchase matrix by shopper/or aggregate.The desired output for

`Day Shopper Choice tagging choice 2`

1 A apple 0 *apple*

2 A apple *apple* 0

1 B Banana 0 0

1 C apple 0 Banana

2 C Banana Banana apple

3 C apple apple 0

1 D berry 0 *berry*

2 D berry *berry* 0

[

`second inside bracket are the probability`

first apple banana berry

apple 1 (0.5) 1 0

banana 1 0 (0) 0

berry 0 0 1 (1)

Below is how I calculate the repurchase matrix after tagging the first purchase of the user. i is the row (apple, banana, Berry etc) and j (apple, banana, Berry etc) is the column. [the speed is ok give the fact that I repurchase matrix is 40*40 after tagging and adding the seconding choice column)

`for (i in 1:n){`

for(j in 1:n){

repurchase_matrix[i,j]=nrow(dt[dt[,1]==i&dt[,2]==j,])}}

Answer

First, assuming the data are sorted by `Shopper`

and then by `Day`

in ascending order, you can add a column indicating the purchase number with

```
df$Purchase <- unlist(with(df, tapply(Shopper, Shopper, seq_along)))
df
# Day Shopper Choice Purchase
#1 1 A apple 1
#2 2 A apple 2
#3 1 B Banana 1
#4 1 C apple 1
#5 2 C Banana 2
#6 3 C apple 3
#7 1 D berry 1
#8 2 D berry 2
```

Then reshape the data-frame to "wide" format with

```
df.w <- reshape(df[c('Shopper', 'Choice', 'Purchase')],
idvar='Shopper', v.names='Choice', timevar='Purchase',
direction='wide')
df.w
# Shopper Choice.1 Choice.2 Choice.3
#1 A apple apple <NA>
#3 B Banana <NA> <NA>
#4 C apple Banana apple
#7 D berry berry <NA>
```

Finally you calculate the repurchase matrix of the first two purchases

```
with(df.w, prop.table(table(First=Choice.1, Second=Choice.2)))
# Second
#First apple Banana berry
# apple 0.3333333 0.3333333 0.0000000
# Banana 0.0000000 0.0000000 0.0000000
# berry 0.0000000 0.0000000 0.3333333
```

To calculate the repurchase matrix of all purchases, start with the repurchase matrices of every two consecutive purchases

```
repurchase <- lapply(seq(2, ncol(df.w) - 1),
function(i) table(First=df.w[[i]], Second=df.w[[i + 1]]))
repurchase <- simplify2array(repurchase)
repurchase
#, , 1
#
# Second
#First apple Banana berry
# apple 1 1 0
# Banana 0 0 0
# berry 0 0 1
#
#, , 2
#
# Second
#First apple Banana berry
# apple 0 0 0
# Banana 1 0 0
# berry 0 0 0
```

then add all matrices to get the "total" repurchase matrix

```
apply(repurchase, 1:2, sum)
# Second
#First apple Banana berry
# apple 1 1 0
# Banana 1 0 0
# berry 0 0 1
```

(absolute frequencies)

```
prop.table(apply(repurchase, 1:2, sum))
# Second
#First apple Banana berry
# apple 0.25 0.25 0.00
# Banana 0.25 0.00 0.00
# berry 0.00 0.00 0.25
```

(relative frequencies)

Source (Stackoverflow)