Christoph P. Christoph P. - 1 month ago 19
R Question

Loop to create dummy variables

I'm currently working on a large data set (ca. 30k rows), And I'm working on creating a

hedonic
regression. The next step would be to create weekly dummy variables.

Now my data has assigned weekly number depending on the day the data was measured. There are 50 different weeks (1-52, 2 missing unaccounted). These weekly numbers are repeated until the change after about
10 rows,
however they are also recurring, as new product categories are measures. There are 132 are available in the dataset and one category contains between
100 - 300 rows
.

This is an example of the dataset

UPC Weeks
1111112016 1
1111112016 1
1111112016 2
1111112016 2
1111112016 3
1111112016 3
1111112440 1
1111112440 1
1111112440 2
1111112440 2
1111112440 3
1111112440 3


Now to create dummy variables, I created 50 columns, each having about 30k rows to represent the dataset. I would like to assign 1 to the row of the dummy week whenever dummy week (hence the column name) and real week (row of the orig. dataset) are equal.

Example Dummy (DW = Dummy Week):

DW1 DW2
NA NA
NA NA
NA NA


I tried the following:

for (i in 1:seq(Soap$WEEK)){
if Soap$WEEK[i] == seq(from=1, by=1, to=52){
for (j in names(x)){
x$DW[[j]] = 1
else {
x$DW[[j]] = 0
}}}}


I know it is wrong, however I'm unable to resolve my problem. I would appreciate any help in this matter.

Answer

We can use model.matrix() from the stats package to dummify your data. First, we'll need to convert Weeks to a factor column.

df$Weeks <- as.factor(df$Weeks)

Now we can run model.matrix():

model.matrix(~ Weeks + UPC + 0, data = df)
#   Weeks1 Weeks2 Weeks3        UPC
#1       1      0      0 1111112016
#2       1      0      0 1111112016
#3       0      1      0 1111112016
#4       0      1      0 1111112016
#5       0      0      1 1111112016
#6       0      0      1 1111112016
#7       1      0      0 1111112440
#8       1      0      0 1111112440
#9       0      1      0 1111112440
#10      0      1      0 1111112440
#11      0      0      1 1111112440
#12      0      0      1 1111112440

You can also just use model.matrix(~ . + 0 , data = df), as numeric columns will be automatically passed over. The + 0 in the formula avoids replacing the first level by the Intercept. To see the difference try to run it without 0.

Alternatively, you can also use dummyVars from the caret package. Here, no Intercept is the default behaviour:

library(caret)

dm <- dummyVars(" ~ .", data = df)
data.frame(predict(dm, newdata = df))
#          UPC Weeks.1 Weeks.2 Weeks.3
#1  1111112016       1       0       0
#2  1111112016       1       0       0
#3  1111112016       0       1       0
#4  1111112016       0       1       0
#5  1111112016       0       0       1
#6  1111112016       0       0       1
#7  1111112440       1       0       0
#8  1111112440       1       0       0
#9  1111112440       0       1       0
#10 1111112440       0       1       0
#11 1111112440       0       0       1
#12 1111112440       0       0       1