xineers - 1 month ago
R Question

R arules preparing dataset for transactions

I prepared a data set to read as transactions with the arules package in R. However, one of my pre-processing steps causes a problem when I call itemFrequencyPlot: the most frequent item is " ". Does anyone have suggestions for resolving this?

Original data:

data <- as.data.frame(matrix(NA, nrow = 10, ncol = 3))
colnames(data) <- c("Customer", "OrderDate", "Product")
data$Customer <- c("John", "John", "John", "Tom", "Tom", "Tom", "Sally", "Sally", "Sally", "Sally")
data$OrderDate <- c("1-Oct", "2-Oct", "2-Oct", "2-Oct","2-Oct", "2-Oct", "3-Oct", "3-Oct", "3-Oct", "3-Oct")
data$Product <- c("Milk", "Eggs", "Bread", "Butter", "Eggs", "Milk", "Bread", "Butter", "Eggs", "Wine")


I then apply the following transformation, which spreads each order's products across columns Product1 to Product4:

library(reshape2)
library(dplyr)

newdata <- data %>%
  group_by(Customer, OrderDate) %>%
  mutate(ProductValue = paste0("Product", 1:n())) %>%
  dcast(Customer + OrderDate ~ ProductValue, value.var = "Product") %>%
  arrange(OrderDate)

newdata[is.na(newdata)] <- " "
newdata <- newdata[ , 3:6]
# convert character columns to factors
newdata[sapply(newdata, is.character)] <- lapply(newdata[sapply(newdata, is.character)], as.factor)


I used write.table to create a CSV file without column names, to be read by arules:

write.table(newdata, "transactions.csv", row.names = FALSE, col.names = FALSE, sep = ",")


Then I used the arules package to read the CSV file as transactions:

library(arules)

transactiondata <- read.transactions("transactions.csv", sep = ",", format = "basket")


This does not work: it throws an error. After reading earlier questions on Stack Overflow, I was able to resolve it as follows:

transactiondata <- read.transactions("transactions.csv", sep = ",", format = "basket", rm.duplicates = TRUE)

itemFrequencyPlot(transactiondata, topN = 5)


The resulting plot shows " " as the most frequent item, which is not true of the real data and is an artifact of my pre-processing. Suggestions for resolving this would be greatly appreciated!

Answer

I would do it this way (following the examples in the manual page for transactions):

data_list <- split(data$Product, paste(data$OrderDate, data$Customer))
trans <- as(data_list, "transactions")
inspect(trans)

    items                    transactionID
[1] {Milk}                   1-Oct John   
[2] {Bread,Eggs}             2-Oct John   
[3] {Butter,Eggs,Milk}       2-Oct Tom    
[4] {Bread,Butter,Eggs,Wine} 3-Oct Sally
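
The inspect() output above contains no blank item. As a quick numeric check, the sketch below rebuilds the same transactions and looks at the absolute item counts. The names item_counts and blank_items are only illustrative and do not come from the question.

```r
library(arules)

# Same sample data as in the question
data <- data.frame(
  Customer  = c("John", "John", "John", "Tom", "Tom", "Tom",
                "Sally", "Sally", "Sally", "Sally"),
  OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct",
                "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
  Product   = c("Milk", "Eggs", "Bread", "Butter", "Eggs", "Milk",
                "Bread", "Butter", "Eggs", "Wine"),
  stringsAsFactors = FALSE
)

# One list element per (OrderDate, Customer) pair, coerced to transactions
data_list <- split(data$Product, paste(data$OrderDate, data$Customer))
trans <- as(data_list, "transactions")

# Absolute number of transactions containing each item
item_counts <- itemFrequency(trans, type = "absolute")
print(sort(item_counts, decreasing = TRUE))

# Item labels that are empty or whitespace only (none are expected here)
blank_items <- itemLabels(trans)[trimws(itemLabels(trans)) == ""]
print(length(blank_items))
```

Because the list holds only real products, no padding value is ever created, so nothing like " " can become an item.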

itemFrequencyPlot(trans, topN = 5)
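
If a CSV file on disk is still needed, one possible alternative (a sketch, not something the question requires) is to write the data in long form, one transaction ID and one item per row, and read it back with format = "single". Rows then never need padding. The TransactionID and Item column names, the long and trans_from_file variables, and the temporary file path are assumptions made for this illustration.

```r
library(arules)

# Same sample data as in the question
data <- data.frame(
  Customer  = c("John", "John", "John", "Tom", "Tom", "Tom",
                "Sally", "Sally", "Sally", "Sally"),
  OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct",
                "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
  Product   = c("Milk", "Eggs", "Bread", "Butter", "Eggs", "Milk",
                "Bread", "Butter", "Eggs", "Wine"),
  stringsAsFactors = FALSE
)

# Long form: one row per (transaction, item), so no NA padding is needed
long <- data.frame(
  TransactionID = paste(data$OrderDate, data$Customer),
  Item          = data$Product,
  stringsAsFactors = FALSE
)

# Hypothetical file location; a temporary file keeps the sketch self-contained
csv_path <- tempfile(fileext = ".csv")
write.table(long, csv_path, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Column 1 holds the transaction ID, column 2 the item
trans_from_file <- read.transactions(csv_path, format = "single",
                                     sep = ",", cols = c(1, 2))
inspect(trans_from_file)
```

The in-memory coercion above remains the shorter route; the file-based variant is mainly useful when the transactions have to be exchanged with another tool.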

Hope this helps!
