John Lynch - 3 months ago
R Question

Creating new columns in R data set within function

I have a dataset for a class that I'm taking, which comes from the UCI Machine Learning repository. I have to subset it by date, and then plot various measurements by date and time. To prep the dataset, I use the following code:

prep <- function(x) {
  setwd("/Users/johnlynch/Google Drive/DataToolbox/Exploring/Week 1")
  power <- read.csv("poweruse.txt", sep = ";", stringsAsFactors = FALSE)
  power$Date <- strptime(power$Date, "%d/%m/%Y")
  power <- subset(power, Date == "2007-02-01" | Date == "2007-02-02")
}


Then, in the console, I type "mydata <- prep()" and the subsetted data is put into the variable "mydata", exactly as I expect:


head(mydata)

            Date     Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
66637 2007-02-01 00:00:00               0.326                 0.128 243.150            1.400          0.000          0.000              0
66638 2007-02-01 00:01:00               0.326                 0.130 243.320            1.400          0.000          0.000              0
66639 2007-02-01 00:02:00               0.324                 0.132 243.510            1.400          0.000          0.000              0
66640 2007-02-01 00:03:00               0.324                 0.134 243.900            1.400          0.000          0.000              0
66641 2007-02-01 00:04:00               0.322                 0.130 243.160            1.400          0.000          0.000              0
66642 2007-02-01 00:05:00               0.320                 0.126 242.290            1.400          0.000          0.000              0


However, as I did the plots I discovered that, in order to match the course plots, I needed to create a new column in the data frame, $newDate, by combining the Date and Time columns into one. So I tried adjusting my script as follows:

prep <- function(x) {
  setwd("/Users/johnlynch/Google Drive/DataToolbox/Exploring/Week 1")
  power <- read.csv("poweruse.txt", sep = ";", stringsAsFactors = FALSE)
  power$Date <- strptime(power$Date, "%d/%m/%Y")
  power <- subset(power, Date == "2007-02-01" | Date == "2007-02-02")
  power$newDate <- with(power, paste(Date, Time))
}


I thought, hey, that should create a new column in the data frame that would be output along with the rest of the data into the mydata variable. However, when I run that function, the ONLY output that I get is the contents of the $newDate column:


head(mydata)

[1] "2007-02-01 00:00:00" "2007-02-01 00:01:00" "2007-02-01 00:02:00" "2007-02-01 00:03:00"
[5] "2007-02-01 00:04:00" "2007-02-01 00:05:00"


What am I doing wrong? Why doesn't the second script output the entire dataset, with a new column added at the end? And can someone tell me how to correct that?

Answer

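A function in R returns the value of the last expression it evaluates. An assignment is itself an expression, and its value is whatever was assigned. Your first version ends by assigning the whole subsetted data frame to power, so that data frame is what comes back. Your second version ends with power$newDate <- ..., so only the pasted character vector comes back. Consider these two functions: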

f1 <- function(x) {
  x$a <- 2
  x
}

f2 <- function(x) {
  x$a <- 2
}

Given a list, f1 will return a list, whereas f2 will return a numeric vector of length 1 (the number 2):

> x <- list(a = 1)
> str(f1(x))
List of 1
 $ a: num 2
> str(f2(x))
 num 2

For more details, Hadley Wickham's tutorial on functions is worth reading.
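
Applied to the question, the fix is to name the data frame on the last line of the function. The following is a minimal sketch rather than a drop-in replacement: prep_power() and the small sample_power data frame are hypothetical stand-ins so the example runs without poweruse.txt, and as.Date() is swapped in for strptime() so the Date column stays a plain Date vector.

```r
# Sketch only: prep_power() and sample_power are illustrative names,
# not part of the original script or the course material.
prep_power <- function(power) {
  # Parse day/month/year strings (as.Date used here instead of strptime).
  power$Date <- as.Date(power$Date, format = "%d/%m/%Y")
  # Keep only the two days of interest.
  power <- subset(power, Date %in% as.Date(c("2007-02-01", "2007-02-02")))
  # On its own, this assignment would make the function return
  # only the pasted character vector.
  power$newDate <- paste(power$Date, power$Time)
  # Naming the data frame last makes the function return all of it.
  power
}

# Tiny stand-in for the real file, with dates in the same d/m/yyyy style.
sample_power <- data.frame(
  Date = c("31/1/2007", "1/2/2007", "2/2/2007", "3/2/2007"),
  Time = c("23:59:00", "00:00:00", "00:01:00", "00:02:00"),
  Global_active_power = c(0.330, 0.326, 0.326, 0.324),
  stringsAsFactors = FALSE
)

mydata <- prep_power(sample_power)
str(mydata)
```

In your own prep(), the same change amounts to adding a final line containing just power (or return(power)) after the power$newDate assignment.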
