Clayton Stanley Clayton Stanley - 3 months ago 12
R Question

Profiling data.table's setkey operation with Rprof

I am working with a relatively large

data.table
dataset and trying to profile/optimize the code. I am using
Rprof
, but I'm noticing that the majority of time spent within a setkey operation is not included in the Rprof summary. Is there a way to include this time spent?

Here is a small test to show how time spent setting the key for a data table is not represented in the Rprof summary:

Create a test function that runs a profiled setkey operation on a data table:

testFun <- function(testTbl) {
Rprof()
setkey(testTbl, x, y, z)
Rprof(NULL)
print(summaryRprof())
}


Then create a test data table that is large enough to feel the weight of the setkey operation:

testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))


Then run the code, and wrap it within a system.time operation to show the difference between the system.time total time and the rprof total time:

> system.time(testFun(testTbl))
$by.self
self.time self.pct total.time total.pct
"sort.list" 0.88 75.86 0.88 75.86
"<Anonymous>" 0.08 6.90 1.00 86.21
"regularorder1" 0.08 6.90 0.92 79.31
"radixorder1" 0.08 6.90 0.12 10.34
"is.na" 0.02 1.72 0.02 1.72
"structure" 0.02 1.72 0.02 1.72

$by.total
total.time total.pct self.time self.pct
"setkey" 1.16 100.00 0.00 0.00
"setkeyv" 1.16 100.00 0.00 0.00
"system.time" 1.16 100.00 0.00 0.00
"testFun" 1.16 100.00 0.00 0.00
"fastorder" 1.14 98.28 0.00 0.00
"tryCatch" 1.14 98.28 0.00 0.00
"tryCatchList" 1.14 98.28 0.00 0.00
"tryCatchOne" 1.14 98.28 0.00 0.00
"<Anonymous>" 1.00 86.21 0.08 6.90
"regularorder1" 0.92 79.31 0.08 6.90
"sort.list" 0.88 75.86 0.88 75.86
"radixorder1" 0.12 10.34 0.08 6.90
"doTryCatch" 0.12 10.34 0.00 0.00
"is.na" 0.02 1.72 0.02 1.72
"structure" 0.02 1.72 0.02 1.72
"is.unsorted" 0.02 1.72 0.00 0.00
"simpleError" 0.02 1.72 0.00 0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 1.16

user system elapsed
31.112 0.211 31.101


Note the 1.16 and 31.101 time differences.

Reading ?Rprof, I see why this difference might have occurred:


Functions will only be recorded in the profile log if they put a
context on the call stack (see sys.calls). Some primitive functions do
not do so: specifically those which are of type "special" (see the ‘R
Internals’ manual for more details).


So is this the reason why time spent within the setkey operation isn't represented in Rprof? Is there a workaround have Rprof watch all of data.table's operations (including setkey, and maybe others that I haven't noticed)? I essentially want to have the system.time and Rprof time match up.

Here is the most-likely relevant sessionInfo():

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

data.table_1.8.11


I still observe this issue when Rprof() isn't within a function call:

> testFun <- function(testTbl) {
+ setkey(testTbl, x, y, z)
+ }

> Rprof()
> system.time(testFun(testTbl))
user system elapsed
28.855 0.191 28.854
> Rprof(NULL)
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"sort.list" 0.86 71.67 0.88 73.33
"regularorder1" 0.08 6.67 0.92 76.67
"<Anonymous>" 0.06 5.00 0.98 81.67
"radixorder1" 0.06 5.00 0.10 8.33
"gc" 0.06 5.00 0.06 5.00
"proc.time" 0.04 3.33 0.04 3.33
"is.na" 0.02 1.67 0.02 1.67
"sys.function" 0.02 1.67 0.02 1.67

$by.total
total.time total.pct self.time self.pct
"system.time" 1.20 100.00 0.00 0.00
"setkey" 1.10 91.67 0.00 0.00
"setkeyv" 1.10 91.67 0.00 0.00
"testFun" 1.10 91.67 0.00 0.00
"fastorder" 1.08 90.00 0.00 0.00
"tryCatch" 1.08 90.00 0.00 0.00
"tryCatchList" 1.08 90.00 0.00 0.00
"tryCatchOne" 1.08 90.00 0.00 0.00
"<Anonymous>" 0.98 81.67 0.06 5.00
"regularorder1" 0.92 76.67 0.08 6.67
"sort.list" 0.88 73.33 0.86 71.67
"radixorder1" 0.10 8.33 0.06 5.00
"doTryCatch" 0.10 8.33 0.00 0.00
"gc" 0.06 5.00 0.06 5.00
"proc.time" 0.04 3.33 0.04 3.33
"is.na" 0.02 1.67 0.02 1.67
"sys.function" 0.02 1.67 0.02 1.67
"formals" 0.02 1.67 0.00 0.00
"is.unsorted" 0.02 1.67 0.00 0.00
"match.arg" 0.02 1.67 0.00 0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 1.2


EDIT2: Same issue with 1.8.10 on my machine with only the data.table package loaded. Times are not equal even when the Rprof() call is not within a function:

> library(data.table)
data.table 1.8.10 For help type: help("data.table")
> base::source("/tmp/r-plugin-claytonstanley/Rsource-86075-preProcess.R", echo=TRUE)

> testFun <- function(testTbl) {
+ setkey(testTbl, x, y, z)
+ }
> testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
> Rprof()
> system.time(testFun(testTbl))
user system elapsed
29.516 0.281 29.760
> Rprof(NULL)
> summaryRprof()


EDIT3: Doesn't work even if
setkey
is not within a function:

> library(data.table)
data.table 1.8.10 For help type: help("data.table")
> testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
> Rprof()
> setkey(testTbl, x, y, z)
> Rprof(NULL)
> summaryRprof()


EDIT4: Doesn't work even when R is called from a
--vanilla
bare-bones terminal prompt.

EDIT5: Does work when tested on a Linux VM. But still does not work on darwin machine for me.

EDIT6: Doesn't work after watching the Rprof.out file get created, so it isn't a write access issue.

EDIT7: Doesn't work after compiling data.table from source and creating a new temp user and running on that account.

EDIT8: Doesn't work when compiling R 3.0.2 from source for darwin via MacPorts.

EDIT9: Does work on a different darwin machine, a Macbook Pro laptop running the same OS version (10.6.8). Still doesn't work on a MacPro desktop machine running same OS version, R version, data.table version, etc.

I'm thinking it's b/c the desktop machine is running in 64-bit kernel mode (not default), and the laptop is 32-bit (default). Confirmed.

Answer

The problem was that the darwin machine was running Snow Leopard with a 64-bit kernel, which is not the default for that OS X version.

I also verified that this is not a problem for another darwin machine running Mountain Lion which uses a 64-bit kernel by default. So it's an interaction between Snow Leopard and running a 64-bit kernel specifically.

As a side note, the official OS X binary installer for R is still built with Snow Leopard, so I do think that this issue is still relevant, as Snow Leopard is still a widely-used OS X version.

When the 64-bit kernel in Snow Leopard is enabled, no kernel extensions that are compatible only with the 32-bit kernel are loaded. After booting into the default 32-bit kernel for Snow Leopard, kextfind shows that these 32-bit only kernel extensions are on the machine and (most likely) loaded:

$ kextfind -not -arch x86_64
/System/Library/Extensions/ACard6280ATA.kext
/System/Library/Extensions/ACard62xxM.kext
/System/Library/Extensions/ACard67162.kext
/System/Library/Extensions/ACard671xSCSI.kext
/System/Library/Extensions/ACard6885M.kext
/System/Library/Extensions/ACard68xxM.kext
/System/Library/Extensions/AppleIntelGMA950.kext
/System/Library/Extensions/AppleIntelGMAX3100.kext
/System/Library/Extensions/AppleIntelGMAX3100FB.kext
/System/Library/Extensions/AppleIntelIntegratedFramebuffer.kext
/System/Library/Extensions/AppleProfileFamily.kext/Contents/PlugIns/AppleIntelYonahProfile.kext
/System/Library/Extensions/IO80211Family.kext/Contents/PlugIns/AirPortAtheros.kext
/System/Library/Extensions/IONetworkingFamily.kext/Contents/PlugIns/AppleRTL8139Ethernet.kext
/System/Library/Extensions/IOSerialFamily.kext/Contents/PlugIns/InternalModemSupport.kext
/System/Library/Extensions/IOSerialFamily.kext/Contents/PlugIns/MotorolaSM56KUSB.kext
/System/Library/Extensions/JMicronATA.kext
/System/Library/Extensions/System.kext/PlugIns/BSDKernel6.0.kext
/System/Library/Extensions/System.kext/PlugIns/IOKit6.0.kext
/System/Library/Extensions/System.kext/PlugIns/Libkern6.0.kext
/System/Library/Extensions/System.kext/PlugIns/Mach6.0.kext
/System/Library/Extensions/System.kext/PlugIns/System6.0.kext
/System/Library/Extensions/ufs.kext

So it could be any one of those loaded extensions that is enabling something for the Rprof package to use, so that the setkey operation in data.table is profiled correctly.

If anyone wants to investigate this further, dig a bit deeper, and get to the root cause of the problem, please post an answer and I'll happily accept that one.

Comments