MichaelChirico - 3 months ago
R Question

Strange error plotting by group

Sorry for the massive data dump, but I can't reproduce this on the subsets of the data I've tried. Copy-pasted the dput of the data (165 obs., not crazy) to this Gist.

I'm trying to plot the data in DT by sport, according to:


  1. Create an empty plot with proper limits to accommodate all data

  2. Plot the column gini as a scatterplot, with colors varying by sport

  3. Plot the column five_yr_ma as a line, with color matching that in 2.



This should be simple and I've done things like it before. Here's what should work:

#empty plot with proper axes
DT[ , plot(
  NA, ylim = range(gini), xlim = range(season),
  xlab = "Season", ylab = "Gini",
  main = "Comparison of Gini Coefficient Across Sports")]

#pick colors for each sport
cols <- c(NHL="black", NBA="red")

DT[ , {
  #add points to current plot
  points(season, gini, col = cols[.BY$sport])

  #add lines to current plot
  lines(season, five_yr_ma, col = cols[.BY$sport], lwd = 3)},
  by = sport]


But this gives me output/error:

# Empty data.table (0 rows) of 1 col: sport



Error: x and y lengths differ in plot.xy()



This is strange. If we skip the grouping and just do it manually, it works perfectly fine:

all_sports[sport == "NBA", {
  points(season, gini, col = "red")
  lines(season, five_yr_ma, col = "red", lwd = 3)}]

all_sports[sport == "NHL", {
  points(season, gini, col = "black")
  lines(season, five_yr_ma, col = "black", lwd = 3)}]


[Image: the expected plot]

Moreover, even in the context of grouping, it's unclear why plot.xy has received arguments of different length -- if we make the following adjustment to force R to record the inputs just before they're sent, there doesn't appear to be any issue:

all_sports[ , {
  cat("\n\nPlotting for sport: ", .BY$sport)
  points(x1 <- season, y1 <- gini, col = cols[.BY$sport])
  lines(x2 <- season, y2 <- five_yr_ma, col = cols[.BY$sport], lwd = 3)
  cat("\npoints/season: ", length(x1),
      "\npoints/gini: ", length(y1),
      "\nlines/season: ", length(x2),
      "\nlines/five_yr_ma: ", length(y2))},
  by = sport]


Has output:

# Plotting for sport: NHL
# points/season: 98
# points/gini: 98
# lines/season: 98
# lines/five_yr_ma: 98

# Plotting for sport: NBA
# points/season: 67
# points/gini: 67
# lines/season: 67
# lines/five_yr_ma: 67


What could be going on??




Since it appears like this is not common across machines, here's my sessionInfo():

R version 3.2.4 (2016-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.9.7

loaded via a namespace (and not attached):
[1] rsconnect_0.4.1.11 tools_3.2.4

Answer

Indeed, as @Arun points out, it seems this is a resurfacing of the (as yet unsolved) issue which was causing the error in this question:

Values of the wrong group are used when using plot() within a data.table() in RStudio

As @Arun discovered there, RStudio's native graphics device seems to get tripped up by the changing pointers used for the different subgroups that are created when j is evaluated with by present. That suggests the workaround of simply copying the data for each group (either the columns being plotted or all of .SD), like:

points(copy(season), copy(gini),
       col = cols[.BY$sport])
lines(copy(season), copy(five_yr_ma), 
      col = cols[.BY$sport], lwd = 3)

Or

#copy the whole group once, then plot from the copy
x <- copy(.SD)
with(x, {points(season, gini, col = cols[.BY$sport]);
         lines(copy(season), copy(five_yr_ma),
           col = cols[.BY$sport], lwd = 3)})

Both of which worked for me (since the subgroups are so small, there's no computational efficiency concern at play here -- we can copy away without affecting performance noticeably).
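For anyone who wants to try the workaround without downloading the Gist, here is a minimal self-contained sketch. The toy table and its values are made up (only the column names match the question), five_yr_ma is approximated by a simple running mean purely for illustration, and pdf(NULL) is used as an off-screen device so it also runs outside RStudio. It copies each column once into a local variable rather than inside each call, which I'd expect to be equivalent:

```r
library(data.table)

# Hypothetical stand-in for the real data: same column names, invented values
set.seed(1)
toy <- data.table(
  sport  = rep(c("NHL", "NBA"), times = c(6, 4)),
  season = c(2001:2006, 2003:2006),
  gini   = runif(10, 0.1, 0.3)
)
# Running mean per sport, used here only as a placeholder for the real five_yr_ma
toy[ , five_yr_ma := cumsum(gini) / seq_len(.N), by = sport]

cols <- c(NHL = "black", NBA = "red")

pdf(NULL)  # off-screen device; nothing is written to disk

# Empty plot with limits that cover every group
toy[ , plot(NA, ylim = range(gini), xlim = range(season),
            xlab = "Season", ylab = "Gini")]

# Grouped plotting, with each column copied before it reaches the device
n_drawn <- toy[ , {
  x  <- copy(season)
  y  <- copy(gini)
  ma <- copy(five_yr_ma)
  points(x, y, col = cols[.BY$sport])
  lines(x, ma, col = cols[.BY$sport], lwd = 3)
  .(n = length(x))  # return the group size so the result is easy to inspect
}, by = sport]

invisible(dev.off())
print(n_drawn)
```

Returning the group size from j is just a convenience here: it replaces the empty data.table that the original call prints with something that confirms each group was visited.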

This is #1524 at the data.table GitHub page and I've filed this bug report at RStudio Support; will update this if a fix is pushed.
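If you want to look at the memory reuse that the copies are meant to sidestep, data.table's address() can be compared with and without copy() inside a grouped j. This is only an illustrative sketch on a made-up table: the addresses printed are machine-specific, and whether the uncopied address repeats across groups is an internal detail that may vary by version.

```r
library(data.table)

# Hypothetical two-group table, unrelated to the sports data
toy <- data.table(g = rep(c("a", "b"), times = c(3, 2)), v = 1:5)

# For each group, record where the column lives as data.table hands it to j,
# and where a fresh copy of it lives
addr <- toy[ , .(in_place = address(v),
                 copied   = address(copy(v))),
             by = g]
print(addr)
```

Within a group, the copied address should always differ from the in-place one, since copy() allocates a new vector while the original is still alive.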