Teja Teja - 1 month ago 11
R Question

How to treat a character string as a matrix/data frame object or element rather than a plain character string

Below is my code. I am trying to create multiple data frame subsets (df) from an original 'dataframe', along with some dummy matrices.
Then, depending on the column values in the subset data frames, I want to populate the matrices.

# Here I create some 20 data frames (df) by subsetting 'dataframe', and also create 20 dummy matrices.
for (i in min(dataframe$week):max(dataframe$week)) {

  assign(paste0("df",i), dataframe[dataframe$week==i,] )
  assign(paste0("mat",i), matrix(10, nrow=length(issue.list), ncol=length(category.list)))

  # Now, I fill values for the matrices by evaluating the contents of the 'df' created above
  for (j in 1:length(issue.list)) {
    for (k in 1:length(category.list)) {
      issue=issue.list[j]
      category=category.list[k]
      assign(paste0("mat",i,"[j,k]"), sum(paste0("df",i,"[issue]")==1 & paste0("df",i,"[category]")==1) )
    }
  }

}


Now, the issue is: in the code above, paste0("mat",i,"[j,k]") evaluates to a character string such as "mat1[j,k]". How do I get it evaluated as, for example, mat1[1,2], which refers to an element of a matrix created in the initial for loop?

I cannot use eval(parse(text=paste0(...))), as it does not work in my case: it returns the value of the matrix element, which is 10. I want to refer to the element itself, so that I can change it.

Similarly, how do I get paste0("df",i,"[issue]")==1 evaluated as a comparison on a data frame column (df1[issue]==1)?

I am looking for functionality similar to the SAS macro language and its use of &.

Answer

It is much better to keep track of related data using a list, rather than assigning loose variables in the global environment.
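To make the list idea concrete in the question's own terms, here is a minimal sketch. The dimensions, week numbers, and the names mat.list and key are made up for illustration. A list component can be selected by a computed character name with [[ ]], and one of its elements can then be assigned to directly, which is what assign(paste0(...)) cannot do.

```r
## hypothetical setup: one 2x3 dummy matrix per week, kept in a named list
weeks <- 2:4
mat.list <- setNames(lapply(weeks, function(w) matrix(10, nrow = 2, ncol = 3)),
                     paste0("mat", weeks))

## select a matrix by a computed name and change one element in place
i <- 3; j <- 1; k <- 2
key <- paste0("mat", i)          # the character string "mat3"
mat.list[[key]][j, k] <- 0

mat.list[["mat3"]]               # element [1,2] is now 0, the rest are still 10
```

The other components of mat.list are left untouched, and nothing is created in the global environment apart from the list itself.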

Based on your code, I think we can accomplish your required task with the following:

## generate data
set.seed(3L);
N <- 15L; NI <- 3L; NC <- 3L;
dataframe <- cbind(week=sample(2:4,N,T),setNames(nm=paste0('issue',1:NI),as.data.frame(replicate(NI,sample(0:1,N,T)))),setNames(nm=paste0('category',1:NC),as.data.frame(replicate(NC,sample(0:1,N,T)))));
issue.list <- grep(value=T,'^issue',names(dataframe));
category.list <- grep(value=T,'^category',names(dataframe));

dataframe;
##    week issue1 issue2 issue3 category1 category2 category3
## 1     2      1      0      0         1         1         0
## 2     4      0      0      0         0         0         0
## 3     3      1      0      0         1         0         0
## 4     2      1      0      0         0         1         1
## 5     3      0      0      1         1         1         1
## 6     3      0      0      0         0         1         0
## 7     2      0      1      0         1         1         0
## 8     2      0      0      1         1         0         0
## 9     3      0      1      1         0         0         0
## 10    3      0      0      1         0         1         1
## 11    3      1      0      1         1         1         1
## 12    3      1      1      0         1         0         1
## 13    3      1      0      0         1         0         0
## 14    3      1      1      0         1         1         1
## 15    4      1      0      0         1         1         1
issue.list;
## [1] "issue1" "issue2" "issue3"
category.list;
## [1] "category1" "category2" "category3"

Solution:

## compute dfs
dfs <- split(dataframe,dataframe$week);

## compute mats
cmb <- expand.grid(issue=issue.list,category=category.list);
mats <- lapply(dfs,function(df) matrix(apply(cmb,1L,function(x,il,cl) sum(il[,x['issue']] & cl[,x['category']]),df[issue.list]==1,df[category.list]==1),length(issue.list),dimnames=list(issue.list,category.list)));

dfs;
## $`2`
##   week issue1 issue2 issue3 category1 category2 category3
## 1    2      1      0      0         1         1         0
## 4    2      1      0      0         0         1         1
## 7    2      0      1      0         1         1         0
## 8    2      0      0      1         1         0         0
##
## $`3`
##    week issue1 issue2 issue3 category1 category2 category3
## 3     3      1      0      0         1         0         0
## 5     3      0      0      1         1         1         1
## 6     3      0      0      0         0         1         0
## 9     3      0      1      1         0         0         0
## 10    3      0      0      1         0         1         1
## 11    3      1      0      1         1         1         1
## 12    3      1      1      0         1         0         1
## 13    3      1      0      0         1         0         0
## 14    3      1      1      0         1         1         1
##
## $`4`
##    week issue1 issue2 issue3 category1 category2 category3
## 2     4      0      0      0         0         0         0
## 15    4      1      0      0         1         1         1
##

mats;
## $`2`
##        category1 category2 category3
## issue1         1         2         1
## issue2         1         1         0
## issue3         1         0         0
##
## $`3`
##        category1 category2 category3
## issue1         5         2         3
## issue2         2         1         2
## issue3         2         3         3
##
## $`4`
##        category1 category2 category3
## issue1         1         1         1
## issue2         0         0         0
## issue3         0         0         0
##

Explanation of last line of code

mats <- lapply(dfs,function(df) ...);

This processes each data.frame in the list individually, aliasing the current data.frame as parameter df of the lambda. The result of the lapply() call will be a list whose components will consist of the return values of each evaluation of the lambda.
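As a small illustration of that behaviour, here is a sketch on made-up data (toy, toy.dfs, and flag.counts are hypothetical names, not part of the solution). The result is a list with one component per input component, carrying the same names.

```r
## hypothetical toy data: two weeks, one 0/1 flag column
toy <- data.frame(week = c(1, 1, 2), flag = c(1, 0, 1))
toy.dfs <- split(toy, toy$week)      # list with components named "1" and "2"

## the anonymous function runs once per component; df is the current data.frame
flag.counts <- lapply(toy.dfs, function(df) sum(df$flag == 1))

names(flag.counts)
flag.counts[["1"]]
```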


apply(cmb,1L,function(x,il,cl) ...,df[issue.list]==1,df[category.list]==1)

Inside the lambda we call apply() on cmb. Recall that cmb is a data.frame with two columns, issue and category, where each row holds one unique combination of the two sets, with all possible combinations being represented within cmb. Running apply() with MARGIN=1L executes yet another lambda (we can call this the "inner lambda" to distinguish it from the "outer lambda") once for each row of cmb (which is actually coerced to a matrix first inside apply(), although that's not significant). The inner lambda will receive in its first parameter (which I've called x) the current row as a two-element character vector. Conveniently, x possesses the same names as the input object (specifically on its names attribute), which we will make use of in the body of the lambda when we index x.
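To see what the inner lambda actually receives, the following sketch runs apply() over a small made-up grid (toy.cmb and seen are illustrative names) and records the names and values of x for each row.

```r
## hypothetical 2x2 grid of combinations
toy.cmb <- expand.grid(issue = c("i1", "i2"), category = c("c1", "c2"))

## for each row, x is a character vector named "issue" and "category"
seen <- apply(toy.cmb, 1L, function(x) paste(names(x), x, sep = "=", collapse = ","))
seen
```

Each element of seen has the form "issue=...,category=...", which confirms that x can be indexed by name inside the function.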

Take a look at the documentation for apply(). Observe that after the target object parameter X, the margin parameter MARGIN, and the lambda parameter FUN, the apply() function accepts optional variadic arguments which will be relayed directly to the calls to FUN() which are made internally within apply(). I am making use of that feature here. I am effectively precomputing a logical matrix that represents which cells of the issue columns of df are equal to 1, and I'm doing the same for the category columns. These two logical matrices will end up being passed as two additional arguments to the inner lambda calls. That is why I wrote the lambda to take 3 parameters: the current row of the target object x, the issue logical matrix il, and the category logical matrix cl.

Note that the variadic arguments are only evaluated once (specifically when they are instantiated for the first call to FUN() made within apply(), due to R's lazy evaluation mechanism), so there is no performance penalty here due to redundant reevaluation of a constant expression.

Also note that when you index out a subset of the columns of a data.frame (e.g. df[issue.list]) the column names come with the subset, and when you compute a logical matrix from a data.frame using a comparison operation (e.g. df[issue.list]==1) the column names once again are brought along into the new matrix; we will make use of this in the body of the lambda when we index il and cl.
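Both points can be checked on toy input. In this sketch the data frame toy, the matrix m, and the parameter name offset are made up for illustration. The first part shows that a comparison on a data.frame subset yields a logical matrix that keeps the column name, and the second shows an extra argument being passed through apply() to every call of the function.

```r
## hypothetical toy data
toy <- data.frame(issue1 = c(1, 0, 1), category1 = c(1, 1, 0))

## comparison on a data.frame subset gives a logical matrix with the column name kept
il <- toy["issue1"] == 1
colnames(il)

## arguments given after FUN are relayed to each call of FUN
m <- matrix(1:4, 2L)
shifted <- apply(m, 1L, function(x, offset) sum(x) + offset, 100)
shifted      # row sums 4 and 6, each shifted by 100
```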


sum(il[,x['issue']] & cl[,x['category']])

Finally, we reach the body of the inner lambda. Here, we carry out the logic you showed in your question. Namely, for the current issue/category combination, we find which rows of df are equal to 1 in both the issue and category columns, and count the number of rows for which that condition is true.

Recall that the test of which cells in (all) the issue and category columns are equal to 1 was already precomputed in the variadic arguments to the apply() call, and we have those two logical matrices available as il and cl. But, we need to retrieve the specific columns of the two logical matrices that correspond to the current issue/category combination.

First, we index x with its element name 'issue' to get the current issue as a character string, and then we index il with that string, since its column names came from dataframe, whose column names are the issue strings. This gives us a logical vector representing which rows of df are equal to 1 for this issue column. We can do the same for the category, namely, index x with the element name 'category', then index cl with the resulting string. We can then perform the vectorized AND operation & against those two logical vectors to get a single logical vector representing which rows of df are equal to 1 in both columns. Taking the sum() of the logical vector effectively counts how many of its elements are TRUE, and that integer count will be the return value from the inner lambda.
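The indexing and counting step can be tried in isolation. In this sketch il, cl, and x are built by hand with made-up values, standing in for what the inner lambda receives on one call.

```r
## hand-built stand-ins for the two logical matrices and the current row
il <- cbind(issue1 = c(TRUE, FALSE, TRUE), issue2 = c(FALSE, TRUE, TRUE))
cl <- cbind(category1 = c(TRUE, TRUE, FALSE))
x  <- c(issue = "issue1", category = "category1")

## pick the two relevant columns by name and AND them row by row
both <- il[, x["issue"]] & cl[, x["category"]]
both         # TRUE FALSE FALSE
sum(both)    # 1: only the first row is TRUE in both columns
```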

In general, the type and dimensions of the return value of a call to apply() depend on the dimensions of the input object, the margin, and what is returned from each evaluation of the lambda (this is complex!), but for the relatively simple case of a matrix input, row margin, and a scalar integer being returned from every evaluation of the lambda, the return value of the apply() call will be an integer vector corresponding to the rows of the input matrix. Hence, because there are 9 rows in cmb (which is the case ultimately because there are 9 issue/category combinations), our apply() call will return an integer vector of length 9. This is true for every evaluation of the outer lambda, because cmb is constant for all data.frames in dfs.


matrix(...,length(issue.list),dimnames=list(issue.list,category.list))

Finally, since you want the result as a matrix, we must construct a matrix out of the vector. This can be done with a call to matrix().

Now we must consider, what will the dimensions of the matrix need to be? There are length(issue.list) issues and length(category.list) categories. The dimensions will have to correspond to those lengths. But which way should they go? In other words, should we have length(issue.list) rows and length(category.list) columns, or the other way around?

Recall that the vector we received from apply() corresponds to the rows of cmb. This means the order of combinations in cmb will determine the meaning of the received vector.

cmb;
##    issue  category
## 1 issue1 category1
## 2 issue2 category1
## 3 issue3 category1
## 4 issue1 category2
## 5 issue2 category2
## 6 issue3 category2
## 7 issue1 category3
## 8 issue2 category3
## 9 issue3 category3

Observe how the combinations in cmb have issues changing more "rapidly" than categories. In other words, as you go down the rows of cmb, the issues cycle through their values first, and only secondarily do the categories cycle. This means that for every length(issue.list) elements of the vector, we cycle through all issues, and we cover exactly one category. This means the length(issue.list) length should follow whichever dimension is filled most "rapidly" by matrix(). As it happens, we can control this behavior using the byrow argument of matrix(). Observe:

matrix(1:4,2L); ## default is byrow=F
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
matrix(1:4,2L,byrow=T);
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

I prefer to use the default byrow=F fill order, which means we need length(issue.list) rows and length(category.list) columns.

To achieve this, we only need to specify one of the nrow or ncol arguments of matrix(), which I've done by specifying nrow. Internally, matrix() derives the required number of columns based on nrow and the length of the input vector data.

Finally, it is desirable to capture as dimension names the issues and categories that correspond to each index of each dimension in the resulting matrix, which can be achieved by specifying the dimnames argument of matrix() with issue.list as the row names and category.list as the column names, which obviously must correspond to the byrow and dimension size choice described above.
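Here is a minimal sketch of that reshaping step, with a made-up vector counts standing in for the result of apply() and hypothetical name vectors issues and categories. Only nrow is given; the number of columns is derived from the length of the data.

```r
## hypothetical names and a stand-in for the vector returned by apply()
issues     <- c("issue1", "issue2", "issue3")
categories <- c("category1", "category2")
counts     <- 1:6

## default column-wise fill: the first three values go down the first column
m <- matrix(counts, length(issues), dimnames = list(issues, categories))
dim(m)
m["issue2", "category2"]     # the fifth element of counts
```

With dimnames set, cells can be looked up by issue and category name rather than by position.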

Hence, the outer lambda will end up returning this matrix to be used as the respective component of the list that will be returned from the lapply() call and assigned to mats.
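To check the whole pipeline by hand, the sketch below repeats the same steps on a tiny fixed data frame small enough to count manually. All names prefixed with toy, and the column name vectors il.names and cl.names, are made up for this illustration.

```r
## hypothetical three-row data set covering two weeks
toy <- data.frame(week = c(1, 1, 2),
                  issue1 = c(1, 1, 0), issue2 = c(0, 1, 1),
                  category1 = c(1, 0, 1), category2 = c(1, 1, 0))
il.names <- c("issue1", "issue2")
cl.names <- c("category1", "category2")

toy.dfs <- split(toy, toy$week)
toy.cmb <- expand.grid(issue = il.names, category = cl.names)

## same structure as the solution: count rows where both columns equal 1
toy.mats <- lapply(toy.dfs, function(df)
  matrix(apply(toy.cmb, 1L,
               function(x, il, cl) sum(il[, x['issue']] & cl[, x['category']]),
               df[il.names] == 1, df[cl.names] == 1),
         length(il.names), dimnames = list(il.names, cl.names)))

toy.mats[["1"]]   # week 1: issue1 row is 1 2, issue2 row is 0 1
```

In week 1 both rows have issue1 and category2 equal to 1, which gives the count of 2 in that cell.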
