Ian Gow Ian Gow - 1 month ago 7
R Question

How can I append data to a PostgreSQL table with `dplyr` without `collect()`?

The table

reg_data
is a PostgreSQL table. It turns out to be faster to run the regressions in PostgreSQL. But, as I am running it for 100,000s of data sets, I want to do it data set by data set and append the results of each to a table.

Is there a way to append PostgreSQL data to a PostgreSQL table using native
dplyr
verbs? I'm not sure that there's a huge cost to bringing the data to R then sending them back to PostgreSQL (it's just 6 numbers and a couple of identifying fields), but it does seem inelegant.

library(dplyr)

pg <- src_postgres()

reg_data <- tbl(pg, "reg_data")

reg_results <-
reg_data %>%
summarize(r_squared=regr_r2(y, x),
num_obs=regr_count(y, x),
constant=regr_intercept(y, x),
slope=regr_slope(y, x),
mean_analyst_fog=regr_avgx(y, x),
mean_manager_fog=regr_avgy(y, x)) %>%
collect() %>%
as.data.frame()

# Push to database.
dbWriteTable(pg$con, c("bgt", "within_call_data"), reg_results,
append=TRUE, row.names=FALSE)

Answer

dplyr does not include commands to insert or update records in a database, so there is not a complete native dplyr solution for this. But you could combine dplyr with regular SQL statements to avoid bringing the data to R.

Let's start by reproducing your steps before the collect() statement

library(dplyr)

pg <- src_postgres()

reg_data <- tbl(pg, "reg_data")

reg_results <-
    reg_data %>%
    summarize(r_squared=regr_r2(y, x),
              num_obs=regr_count(y, x),
              constant=regr_intercept(y, x),
              slope=regr_slope(y, x),
              mean_analyst_fog=regr_avgx(y, x),
              mean_manager_fog=regr_avgy(y, x))

Now, you could use compute() instead of collect() to create a temporary table in the database.

reg_results <- reg_results %>% compute()

All we need now is the name of the temporary table. We can get it from reg_results

temp.tbl.name <- as.character(reg_results$from)

Now, we will use the library RPostgreSQL to create an insert query that uses the results stored in the temporary table. As the temporary table only lives in the connection created by src_postgresql() we need to reuse it.

library(RPostgreSQL)
copyconn <- pg$con
class(copyconn) <- "PostgreSQLConnection" # I get an error if I don't fix the class

Finally the insert query

sql <- paste0("INSERT INTO destination_table SELECT * FROM ", temp.tbl.name,";")

dbSendQuery(copyconn, sql)

So, everything is happening in the database and the data is not brought into R.

EDIT

At least in dplyr 0.5.0, the name of the table has been moved from $from to $ops$x$x . So, this line:

temp.tbl.name <- as.character(reg_results$from)

should be replaced by:

temp.tbl.name <- as.character(reg_results$ops$x$x)