Teja - 1 year ago 52

R Question

Below is my code. I am trying to create multiple dataframe subsets (df) from an original 'dataframe', along with some dummy matrices.

Then, depending on the column values in the subset dataframes, I want to populate the matrices.

    # Here I create some 20 dataframes (df) by subsetting the 'dataframe',
    # and also create 20 dummy matrices.
    for (i in min(dataframe$week):max(dataframe$week)) {
      assign(paste0("df",i), dataframe[dataframe$week==i,] )
      assign(paste0("mat",i), matrix(10, nrow=length(issue.list), ncol=length(category.list)))
      # Now, I fill values for the matrices by evaluating the contents of the 'df' created above
      for (j in 1:length(issue.list)) {
        for (k in 1:length(category.list)) {
          issue=issue.list[j]
          category=category.list[k]
          assign(paste0("mat",i,"[j,k]"), sum(paste0("df",i,"[issue]")==1 & paste0("df",i,"[category]")==1) )
        }
      }
    }

Now, the issue is: in the above, `paste0("mat",i,"[j,k]")` is treated as the string `"mati[j,k]"` rather than as the matrix element `mat1[1,2]`. I cannot use `eval(parse(text=paste0(...)))`.

Similarly, how do I get `paste0("df",i,"[issue]")==1` to be evaluated as `df1[issue] ==1`?

Answer Source

It is much better to keep track of related data using a list, rather than assigning loose variables in the global environment.
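As a minimal sketch of that advice (the names `weeks` and `mats` and the 3x3 size are placeholders of mine, not taken from your code), a named list lets you reach a single cell of a matrix chosen by a computed name, using ordinary indexing and no `assign()` or `eval(parse())`:

```r
## hypothetical week numbers; in the question these would come from dataframe$week
weeks <- 2:4
## one placeholder matrix per week, stored in a single named list
mats <- setNames(
  lapply(weeks, function(w) matrix(0L, nrow = 3L, ncol = 3L)),
  paste0("mat", weeks)
)
## pick the matrix by its computed name, then the cell by position
i <- 2L; j <- 1L; k <- 2L
mats[[paste0("mat", i)]][j, k] <- 5L
mats$mat2[1, 2]
```

The string built by `paste0()` is only ever used as a list key here, which is why no evaluation tricks are needed.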

Based on your code, I think we can accomplish your required task with the following:

```
## generate data
set.seed(3L);
N <- 15L; NI <- 3L; NC <- 3L;
dataframe <- cbind(
    week=sample(2:4,N,TRUE),
    setNames(nm=paste0('issue',1:NI),as.data.frame(replicate(NI,sample(0:1,N,TRUE)))),
    setNames(nm=paste0('category',1:NC),as.data.frame(replicate(NC,sample(0:1,N,TRUE))))
);
issue.list <- grep(value=TRUE,'^issue',names(dataframe));
category.list <- grep(value=TRUE,'^category',names(dataframe));
```

```
dataframe;
##    week issue1 issue2 issue3 category1 category2 category3
## 1     2      1      0      0         1         1         0
## 2     4      0      0      0         0         0         0
## 3     3      1      0      0         1         0         0
## 4     2      1      0      0         0         1         1
## 5     3      0      0      1         1         1         1
## 6     3      0      0      0         0         1         0
## 7     2      0      1      0         1         1         0
## 8     2      0      0      1         1         0         0
## 9     3      0      1      1         0         0         0
## 10    3      0      0      1         0         1         1
## 11    3      1      0      1         1         1         1
## 12    3      1      1      0         1         0         1
## 13    3      1      0      0         1         0         0
## 14    3      1      1      0         1         1         1
## 15    4      1      0      0         1         1         1
issue.list;
## [1] "issue1" "issue2" "issue3"
category.list;
## [1] "category1" "category2" "category3"
```

Solution:

```
## compute dfs: one data.frame per week
dfs <- split(dataframe,dataframe$week);
## compute mats: one issue-by-category count matrix per week
cmb <- expand.grid(issue=issue.list,category=category.list);
mats <- lapply(dfs,function(df) matrix(
    apply(cmb,1L,function(x,il,cl) sum(il[,x['issue']] & cl[,x['category']]),
        df[issue.list]==1,df[category.list]==1),
    length(issue.list),dimnames=list(issue.list,category.list)
));
```

```
dfs;
## $`2`
##   week issue1 issue2 issue3 category1 category2 category3
## 1    2      1      0      0         1         1         0
## 4    2      1      0      0         0         1         1
## 7    2      0      1      0         1         1         0
## 8    2      0      0      1         1         0         0
##
## $`3`
##    week issue1 issue2 issue3 category1 category2 category3
## 3     3      1      0      0         1         0         0
## 5     3      0      0      1         1         1         1
## 6     3      0      0      0         0         1         0
## 9     3      0      1      1         0         0         0
## 10    3      0      0      1         0         1         1
## 11    3      1      0      1         1         1         1
## 12    3      1      1      0         1         0         1
## 13    3      1      0      0         1         0         0
## 14    3      1      1      0         1         1         1
##
## $`4`
##    week issue1 issue2 issue3 category1 category2 category3
## 2     4      0      0      0         0         0         0
## 15    4      1      0      0         1         1         1
##
```

```
mats;
## $`2`
##        category1 category2 category3
## issue1         1         2         1
## issue2         1         1         0
## issue3         1         0         0
##
## $`3`
##        category1 category2 category3
## issue1         5         2         3
## issue2         2         1         2
## issue3         2         3         3
##
## $`4`
##        category1 category2 category3
## issue1         1         1         1
## issue2         0         0         0
## issue3         0         0         0
##
```

```
mats <- lapply(dfs,function(df) ...);
```

This processes each data.frame in the list individually, aliasing the current data.frame as the parameter `df` of the lambda. The result of the `lapply()` call will be a list whose components are the return values of each evaluation of the lambda.
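To see that behavior in isolation, here is a small sketch on made-up data (the data.frame `d` and its `value` column are illustrative only):

```r
## tiny illustrative data set, not the data used in this answer
d <- data.frame(week = c(2, 2, 3), value = c(10, 20, 30))
## split() returns a list of data.frames named after the week values
pieces <- split(d, d$week)
## the lambda runs once per component; the results are collected under the same names
totals <- lapply(pieces, function(df) sum(df$value))
str(totals)
```

Each component of `totals` is simply whatever the lambda returned for the corresponding piece, so returning a matrix instead of a number works the same way.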

```
apply(cmb,1L,function(x,il,cl) ...,df[issue.list]==1,df[category.list]==1)
```

Inside the lambda we call `apply()` on `cmb`. Recall that `cmb` (built above with `expand.grid()`) is a data.frame with two columns, `issue` and `category`, where each row holds one unique combination of the two sets, with all possible combinations being represented within `cmb`. Running `apply()` with `MARGIN=1L` executes yet another lambda (we can call this the "inner lambda" to distinguish it from the "outer lambda") once for each row of `cmb` (which is actually coerced to a matrix first inside `apply()`, although that's not significant). The inner lambda will receive in its first parameter (which I've called `x`) the current row as a two-element character vector. Conveniently, `x` carries the column names of the input object (specifically on its `names` attribute), which we will make use of in the body of the lambda when we index `x`.
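A short sketch of what the inner lambda sees, using a cut-down `cmb` with two issues and two categories (the `labels` name and the pasted output format are just for illustration):

```r
cmb <- expand.grid(issue = c("issue1", "issue2"),
                   category = c("category1", "category2"))
## x arrives as a character vector named c("issue", "category"),
## so its elements can be picked out by name
labels <- apply(cmb, 1L, function(x) paste(x["issue"], x["category"], sep = "/"))
unname(labels)
```

One string comes back per row of `cmb`, in row order, which is the same pattern the real lambda follows when it returns one count per row.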

Take a look at the documentation for `apply()`. Observe that after the target object parameter `X`, the margin parameter `MARGIN`, and the lambda parameter `FUN`, the `apply()` function accepts optional variadic arguments which will be relayed directly to the calls to `FUN()` which are made internally within `apply()`. I am making use of that feature here. I am effectively precomputing a logical matrix that represents which cells of the issue columns of `df` are equal to 1, and I'm doing the same for the category columns. These two logical matrices will end up being passed as two additional arguments to the inner lambda calls. That is why I wrote the lambda to take 3 parameters: the current row of the target object `x`, the issue logical matrix `il`, and the category logical matrix `cl`. Note that the variadic arguments are only evaluated once (specifically when they are instantiated for the first call to `FUN()` made within `apply()`, due to R's lazy evaluation mechanism), so there is no performance penalty here due to redundant reevaluation of a constant expression. Also note that when you index out a subset of the columns of a data.frame (e.g. `df[issue.list]`) the column names come with the subset, and when you compute a logical matrix from a data.frame using a comparison operation (e.g. `df[issue.list]==1`) the column names once again are brought along into the new matrix; we will make use of this in the body of the lambda when we index `il` and `cl`.
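Both points can be checked on toy inputs. In this sketch the data.frame `df`, the matrix `m`, and the `offset` parameter are made up for the demonstration:

```r
## comparing a data.frame subset against 1 yields a logical matrix
## that keeps the column names of the subset
df <- data.frame(issue1 = c(1, 0, 1), issue2 = c(0, 0, 1))
il <- df[c("issue1", "issue2")] == 1
colnames(il)

## anything after FUN in the apply() call is relayed to every call of FUN;
## here the value 100 is bound to the hypothetical second parameter `offset`
m <- matrix(1:4, 2L)
res <- apply(m, 1L, function(x, offset) sum(x) + offset, 100)
res
```

The rows of `m` are `c(1, 3)` and `c(2, 4)`, so `res` holds 104 and 106.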

```
sum(il[,x['issue']] & cl[,x['category']])
```

Finally, we reach the body of the inner lambda. Here, we carry out the logic you showed in your question. Namely, for the current issue/category combination, we find which rows of `df` are equal to 1 in *both* the issue and category columns, and count the number of rows for which that condition is true.

Recall that the test of which cells in (all) the issue and category columns are equal to 1 was already precomputed in the variadic arguments to the `apply()` call, and we have those two logical matrices available as `il` and `cl`. But, we need to retrieve the specific columns of the two logical matrices that correspond to the current issue/category combination.

First, we index `x` with its column name `'issue'` to get the current issue as a character string, and then we index `il` with that string, since its column names came from `dataframe`, which has the specific issue strings as column names. This gives us a logical vector representing which rows of `df` are equal to 1 for this issue column. We can do the same for the category, namely, index `x` with column name `'category'`, then index `cl` with the resulting string. We can then perform the vectorized AND operation `&` against those two logical vectors to get a single logical vector representing which rows of `df` are equal to 1 in *both* columns. Taking the `sum()` of the logical vector effectively counts how many of its elements are TRUE, and that integer count will be the return value from the inner lambda.
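Here is that body evaluated by hand for one combination. The values are the `issue1` and `category2` columns of the week 2 rows shown earlier, typed in directly; only those two columns are included to keep the sketch short:

```r
## week 2 rows (1, 4, 7, 8): issue1 is 1,1,0,0 and category2 is 1,1,1,0
il <- cbind(issue1 = c(TRUE, TRUE, FALSE, FALSE))
cl <- cbind(category2 = c(TRUE, TRUE, TRUE, FALSE))
## the row of cmb for this combination, as the inner lambda would receive it
x <- c(issue = "issue1", category = "category2")
both <- il[, x["issue"]] & cl[, x["category"]]
both
sum(both)
```

The sum is 2, which matches the `issue1`/`category2` cell of the week 2 matrix in `mats`.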

In general, the type and dimensions of the return value of a call to `apply()` depend on the dimensions of the input object, the margin, and what is returned from each evaluation of the lambda (this is complex!), but for the relatively simple case of a matrix input, row margin, and a scalar integer being returned from every evaluation of the lambda, the return value of the `apply()` call will be an integer vector with one element per row of the input matrix. Hence, because there are 9 rows in `cmb` (which is the case ultimately because there are 9 issue/category combinations), our `apply()` call will return an integer vector of length 9. This is true for every evaluation of the outer lambda, because `cmb` is constant for all data.frames in `dfs`.

```
matrix(...,length(issue.list),dimnames=list(issue.list,category.list))
```

Finally, since you want the result as a matrix, we must construct a matrix out of the vector. This can be done with a call to `matrix()`.

Now we must consider, what will the dimensions of the matrix need to be? There are `length(issue.list)` issues and `length(category.list)` categories. The dimensions will have to correspond to those lengths. But which way should they go? In other words, should we have `length(issue.list)` rows and `length(category.list)` columns, or the other way around?

Recall that the vector we received from `apply()` corresponds to the rows of `cmb`. This means the order of combinations in `cmb` will determine the meaning of the received vector.

```
cmb;
##    issue  category
## 1 issue1 category1
## 2 issue2 category1
## 3 issue3 category1
## 4 issue1 category2
## 5 issue2 category2
## 6 issue3 category2
## 7 issue1 category3
## 8 issue2 category3
## 9 issue3 category3
```

Observe how the combinations in `cmb` have issues changing more "rapidly" than categories. In other words, as you go down the rows of `cmb`, the issues cycle through their values first-and-foremost, and only secondarily do the categories cycle. This means that for every `length(issue.list)` elements of the vector, we cycle through all issues, and we cover only and entirely one category. This means the `length(issue.list)` length should follow whichever dimension is filled most "rapidly" by `matrix()`. As it happens, we can control this behavior using the `byrow` argument of `matrix()`. Observe:

```
matrix(1:4,2L); ## default is byrow=F
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
matrix(1:4,2L,byrow=T);
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
```

I prefer to use the default `byrow=F` fill order, which means we need `length(issue.list)` rows and `length(category.list)` columns.
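To convince yourself that this orientation lines up with the order of `cmb`, here is a sketch that fills a matrix with labels instead of counts. It uses three issues and only two categories so that the two dimensions cannot be confused; the vector `v` of pasted labels is purely illustrative:

```r
issue.list <- c("issue1", "issue2", "issue3")
category.list <- c("category1", "category2")
cmb <- expand.grid(issue = issue.list, category = category.list)
## one label per row of cmb, in the same order apply() would return its counts
v <- paste(cmb$issue, cmb$category)
## default column-wise fill with length(issue.list) rows
m <- matrix(v, length(issue.list))
m
```

Each column of `m` holds a single category and each row a single issue, so for instance `m[2, 1]` is `"issue2 category1"`.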

To achieve this, we only need to specify one of the `nrow` or `ncol` arguments of `matrix()`, which I've done by specifying `nrow`. Internally, `matrix()` derives the required number of columns based on `nrow` and the length of the input vector `data`.

Finally, it is desirable to label each row and column of the resulting matrix with the issue or category it represents. This can be achieved by specifying the `dimnames` argument of `matrix()` with `issue.list` as the row names and `category.list` as the column names, which obviously must correspond to the `byrow` and dimension size choice described above.
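As a concrete sketch of this last step, take the nine week 2 counts in `cmb` order (read column by column off the week 2 matrix shown earlier and typed in by hand as `v`) and rebuild the matrix from them:

```r
issue.list <- c("issue1", "issue2", "issue3")
category.list <- c("category1", "category2", "category3")
## week 2 counts in cmb order: all issues for category1, then category2, then category3
v <- c(1L, 1L, 1L, 2L, 1L, 0L, 1L, 0L, 0L)
## only nrow is given; ncol is derived as length(v) / nrow
m <- matrix(v, length(issue.list), dimnames = list(issue.list, category.list))
m
m["issue1", "category2"]
```

With the dimnames in place, cells can be looked up by issue and category name rather than by position.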

Hence, the outer lambda will end up returning this matrix to be used as the respective component of the list that will be returned from the `lapply()` call and assigned to `mats`.
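If the one-liner is hard to follow, the same steps can be spelled out in a named function. This is only a sketch: `count_pairs` is a name I've made up, it reads `il` and `cl` from its enclosing scope rather than receiving them as extra `apply()` arguments, and the data are just the week 2 and week 4 rows from above, entered by hand. The `crossprod()` line is not part of the solution; it is an independent cross-check, since multiplying the transposed 0/1 issue matrix by the 0/1 category matrix produces the same counts.

```r
## week 2 and week 4 rows of the example data (rows 1, 2, 4, 7, 8, 15)
dataframe <- data.frame(
  week      = c(2, 4, 2, 2, 2, 4),
  issue1    = c(1, 0, 1, 0, 0, 1),
  issue2    = c(0, 0, 0, 1, 0, 0),
  issue3    = c(0, 0, 0, 0, 1, 0),
  category1 = c(1, 0, 0, 1, 1, 1),
  category2 = c(1, 0, 1, 1, 0, 1),
  category3 = c(0, 0, 1, 0, 0, 1)
)
issue.list <- grep("^issue", names(dataframe), value = TRUE)
category.list <- grep("^category", names(dataframe), value = TRUE)
cmb <- expand.grid(issue = issue.list, category = category.list)

## hypothetical helper: one issue-by-category count matrix for one data.frame
count_pairs <- function(df) {
  il <- df[issue.list] == 1      # logical matrix of issue hits
  cl <- df[category.list] == 1   # logical matrix of category hits
  counts <- apply(cmb, 1L, function(x) sum(il[, x["issue"]] & cl[, x["category"]]))
  matrix(counts, length(issue.list), dimnames = list(issue.list, category.list))
}

dfs <- split(dataframe, dataframe$week)
mats <- lapply(dfs, count_pairs)

## cross-check via matrix multiplication on 0/1 matrices
check <- lapply(dfs, function(df)
  crossprod(1 * (df[issue.list] == 1), 1 * (df[category.list] == 1)))
mats$`2`
all(mats$`2` == check$`2`)
```

For these rows the week 2 and week 4 matrices agree with the ones printed under `mats` above.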