forkandwait - 4 months ago 29
R Question

Workflow for statistical analysis and report writing

Does anyone have any wisdom on workflows for data analysis related to custom report writing? The use-case is basically this:

  1. Client commissions a report that uses data analysis, e.g. a population estimate and related maps for a water district.

  2. The analyst downloads some data, munges the data and saves the result (e.g. adding a column for population per unit, or subsetting the data based on district boundaries).

  3. The analyst analyzes the data created in (2), gets close to her goal, but sees that she needs more data and so goes back to (1).

  4. Rinse and repeat until the tables and graphics meet QA/QC and satisfy the client.

  5. Write report incorporating tables and graphics.

  6. Next year, the happy client comes back and wants an update. This should be as simple as updating the upstream data by a new download (e.g. get the building permits from the last year), and pressing a "RECALCULATE" button, unless specifications change.

At the moment, I just start a directory and ad-hoc it the best I can. I would like a more systematic approach, so I am hoping someone has figured this out... I use a mix of spreadsheets, SQL, ArcGIS, R, and Unix tools.



Below is a basic Makefile that checks for dependencies on various intermediate datasets (with the .RData suffix) and scripts (.R suffix). Make uses timestamps to check dependencies, so if you touch ss07por.csv, it will see that this file is newer than all the files/targets that depend on it, and execute the given scripts in order to update them accordingly. This is still a work in progress: a step for putting the data into an SQL database and a step for a templating language like Sweave are yet to be added. Note that Make relies on tabs in its syntax, so read the manual before cutting and pasting. Enjoy and give feedback!


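# Assumes R is on the PATH; otherwise set R to the full path of the executable.
R = R

# Import the raw CSV into an R workspace file
persondata.RData : ImportData.R ../../DATA/ss07por.csv Functions.R
	$(R) --slave -f ImportData.R

# Munge the imported data
persondata.Munged.RData : MungeData.R persondata.RData Functions.R
	$(R) --slave -f MungeData.R

# Tabulate and graph; the script's output becomes the report
report.txt : TabulateAndGraph.R persondata.Munged.RData Functions.R
	$(R) --slave -f TabulateAndGraph.R > report.txt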
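
To make the timestamp rule concrete, here is a minimal R sketch of the check Make performs for each target. It is only an illustration, not how Make is implemented: needs_rebuild() is a hypothetical helper, and the files are throwaway placeholders in a temporary directory.

```r
# A target is stale if it is missing or older than any of its dependencies.
# needs_rebuild() is a hypothetical helper, not part of R or Make.
needs_rebuild <- function(target, deps) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(deps) > file.mtime(target))
}

# Throwaway files standing in for a dependency and its target
tmp <- tempfile("makedemo")
dir.create(tmp)
dep    <- file.path(tmp, "ss07por.csv")
target <- file.path(tmp, "persondata.RData")
writeLines("id,age", dep)
writeLines("placeholder", target)

# Target newer than its dependency: nothing to do
Sys.setFileTime(dep,    Sys.time() - 60)
Sys.setFileTime(target, Sys.time())
fresh <- needs_rebuild(target, dep)

# "touch" the dependency: the target is now out of date
Sys.setFileTime(dep, Sys.time() + 60)
stale <- needs_rebuild(target, dep)
```

Here fresh is FALSE and stale is TRUE, which mirrors what happens when you touch ss07por.csv and run make again.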


I generally break my projects into 4 pieces:

  1. load.R
  2. clean.R
  3. func.R
  4. do.R

load.R: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project, at this point I'll either write out the workspace using save() or just keep things in memory for the next step.
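
As a rough sketch of what such a load.R might contain (load_raw(), the object name persons, and the made-up CSV are all assumptions for illustration):

```r
# load.R (sketch): read the raw inputs and cache them for later stages.
# load_raw() and both file paths are hypothetical.
load_raw <- function(csv_path, cache_path) {
  persons <- read.csv(csv_path, stringsAsFactors = FALSE)
  save(persons, file = cache_path)   # later stages can load() this file
  invisible(persons)
}

# Example run against a tiny made-up CSV in a temporary directory
csv_path   <- tempfile(fileext = ".csv")
cache_path <- tempfile(fileext = ".RData")
writeLines(c("id,age,district", "1,34,A", "2,NA,B", "3,51,A"), csv_path)
load_raw(csv_path, cache_path)

# A later script restores the object under its original name
load(cache_path)
```

Note that save() stores objects by name, so the later load() call recreates persons rather than returning a value.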

clean.R: This is where all the ugly stuff lives - taking care of missing values, merging data frames, handling outliers.
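
A small illustration of the three chores mentioned, using base R only. The data, column names, and the age cutoff are invented for the example:

```r
# clean.R (sketch): all column names and thresholds here are made up.
persons <- data.frame(
  id       = 1:5,
  age      = c(34, NA, 51, 230, 18),
  district = c("A", "B", "A", "C", "B"),
  stringsAsFactors = FALSE
)
districts <- data.frame(
  district = c("A", "B"),
  area_km2 = c(12.5, 40.0),
  stringsAsFactors = FALSE
)

# Missing values: drop rows with no recorded age
clean <- persons[!is.na(persons$age), ]

# Outliers: treat implausible ages as data-entry errors
clean <- clean[clean$age <= 120, ]

# Merging: attach district attributes; rows without a match are dropped
clean <- merge(clean, districts, by = "district")
```

Whether to drop, impute, or flag such rows depends on the project; dropping is just the shortest thing to show.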

func.R: Contains all of the functions needed to perform the actual analysis. source()'ing this file should have no side effects other than loading up the function definitions. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.

do.R: Calls the functions defined in func.R to perform the analysis and produce charts and tables.
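
The split between func.R and do.R could look something like the following sketch. Both function names and the data are hypothetical; in a real project the two halves would live in separate files, with do.R calling source("func.R") first.

```r
# --- func.R (sketch): definitions only, so source()'ing it has no side effects
# pop_by_district() and pop_density() are hypothetical names.
pop_by_district <- function(df) {
  counts <- table(df$district)
  data.frame(district   = names(counts),
             population = as.integer(counts),
             stringsAsFactors = FALSE)
}

pop_density <- function(population, area_km2) population / area_km2

# --- do.R (sketch): call the functions on the cleaned data
clean <- data.frame(district = c("A", "A", "B"),
                    area_km2 = c(12.5, 12.5, 40),
                    stringsAsFactors = FALSE)

tab   <- pop_by_district(clean)
areas <- unique(clean[, c("district", "area_km2")])
tab   <- merge(tab, areas, by = "district")
tab$density <- pop_density(tab$population, tab$area_km2)
print(tab)
```

Because the top half only defines functions, it can be edited and re-sourced while clean stays in memory.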

The main motivation for this setup is working with large data, where you don't want to have to reload the data each time you make a change to a subsequent step. Also, keeping my code compartmentalized like this means I can come back to a long-forgotten project and quickly read load.R and work out what data I need to update, and then look at do.R to work out what analysis was performed.
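
One way to avoid the reload, sketched under assumptions: cached() is a hypothetical helper, and it uses saveRDS()/readRDS() rather than the save() mentioned earlier, simply because those return the object directly.

```r
# cached() is a hypothetical helper: run `build` only when no cached copy exists
cached <- function(cache_path, build) {
  if (file.exists(cache_path)) {
    return(readRDS(cache_path))      # fast path: reuse the earlier result
  }
  result <- build()                  # slow path: load and clean from scratch
  saveRDS(result, cache_path)
  result
}

# Stand-in for the expensive load.R + clean.R steps; counts its own calls
n_builds <- 0
slow_load_and_clean <- function() {
  n_builds <<- n_builds + 1
  data.frame(id = 1:3, age = c(34, 51, 18))
}

cache_path <- tempfile(fileext = ".rds")
first  <- cached(cache_path, slow_load_and_clean)  # builds and caches
second <- cached(cache_path, slow_load_and_clean)  # read back from the cache
```

Deleting the cache file forces a rebuild, which is a crude counterpart to Make noticing that an upstream file has changed.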