User2321 - 8 months ago 40

R Question

I have the following data table in R:

`set.seed(5)`

my_data <- data.table(cat_1=c(1,1,1,2,2,1,1,1,3,4,5,4,5),

cat_2 = sample(c("A","B"), 13, replace = T),

ao = rnorm(13,500,10))

And I would like to know the number of rows per cat_1, the sum of ao per cat_1, and the number of As in cat_2 per cat_1. So ideally I would like to get this:

`merge(my_data[, .(cat1_lines = .N, total_ao = sum(ao, na.rm = T)), by = cat_1],`

my_data[cat_2 == "A", .(A_lines = .N), by = cat_1], by = "cat_1", all.x = T)

cat_1 cat1_lines total_ao A_lines

1: 1 6 3015.5034 1

2: 2 2 1015.8838 2

3: 3 1 516.9518 NA

4: 4 2 984.0768 2

5: 5 2 983.8361 2

Is there a way of doing this in the same by statement without having to merge? Something like (I know this does not work):

`my_data[, .(cat1_lines = .N, A_lines = .N[cat_2 == "A"],`

total_ao = sum(ao, na.rm = T)), by = cat_1]

Answer

You can easily do this with a `by`

statement in your `data.table`

. Try this:

```
my_data[,.(cat1_lines=.N,total_ao=sum(ao),A_lines=sum(cat_2=="A")),by=.(cat_1)]
cat_1 cat1_lines total_ao A_lines
1: 1 6 3015.5034 1
2: 2 2 1015.8838 2
3: 3 1 516.9518 0
4: 4 2 984.0768 2
5: 5 2 983.8361 2
```

Source (Stackoverflow)