Niek de Klein - 3 months ago
R Question

When I read in a large table using fread it slightly changes the numbers in one of the columns

I have a large file that looks like this

region type coeff p-value distance count
82365593523656436 A -0.9494 0.050 -16479472.5 8
82365593523656436 B 0.47303 0.526 57815363.0 8
82365593523656436 C -0.8938 0.106 42848210.5 8


When I read it in using fread, suddenly 82365593523656436 is not found anymore

correlations <- data.frame(fread('all_to_all_correlations.txt'))
> "82365593523656436" %in% correlations$region
[1] FALSE


I can find a slightly different number

> "82365593523656432" %in% correlations$region
[1] TRUE


but this number is not in the actual file

grep 82365593523656432 all_to_all_correlations.txt


gives no results, while

grep 82365593523656436 all_to_all_correlations.txt


does.

When I read in the small sample file shown above instead of the full file, I get

Warning message:
In fread("test.txt") :
Some columns have been read as type 'integer64' but package bit64 isn't loaded.
Those columns will display as strange looking floating point data.
There is no need to reload the data.
Just require(bit64) to obtain the integer64 print method and print the data again.


and the data looks like

region type coeff p.value distance count
1 3.758823e-303 A -0.94940 0.050 -16479472 8
2 3.758823e-303 B 0.47303 0.526 57815363 8
3 3.758823e-303 C -0.89380 0.106 42848210 8


So I think during reading 82365593523656436 was changed into 82365593523656432. How can I prevent this from happening?

Answer

IDs (which is apparently what the first column contains) should usually be read as character, so that no numeric conversion can alter their digits:

library(data.table)  # provides fread() and setDF()

correlations <- setDF(fread('region              type    coeff      p-value  distance    count
                                 82365593523656436   A      -0.9494     0.050    -16479472.5 8
                                 82365593523656436   B      0.47303     0.526    57815363.0  8
                                 82365593523656436   C      -0.8938     0.106    42848210.5  8',
                            colClasses = c(region = "character")))
str(correlations)
#'data.frame':  3 obs. of  6 variables:
# $ region  : chr  "82365593523656436" "82365593523656436" "82365593523656436"
# $ type    : chr  "A" "B" "C"
# $ coeff   : num  -0.949 0.473 -0.894
# $ p-value : num  0.05 0.526 0.106
# $ distance: num  -16479473 57815363 42848211
# $ count   : int  8 8 8
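
As for why the last digits can change: R's numeric type is a double with a 53-bit significand, so above 2^53 not every integer can be stored exactly. At the magnitude of this ID, neighbouring doubles are 16 apart, and the closest one to 82365593523656436 is 82365593523656432, which matches the number observed in the question. Whether the full file went through exactly this conversion is an assumption. The sketch below only demonstrates the rounding in base R and then repeats the read-as-character fix with read.table instead of fread (a substitution made here for illustration; the names id_text, id_num, sample_text and correlations_chr are hypothetical).

```r
# Doubles cannot represent every integer above 2^53.
id_text <- "82365593523656436"
id_num  <- as.numeric(id_text)

id_num > 2^53                  # TRUE: outside the exactly representable range
id_num == 82365593523656432    # TRUE: both values map to the same double

# Base R analogue of the fix above: force the ID column to character.
# (Assumption: read.table stands in for fread here.)
sample_text <- "region type coeff p-value distance count
82365593523656436 A -0.9494 0.050 -16479472.5 8
82365593523656436 B 0.47303 0.526 57815363.0 8
82365593523656436 C -0.8938 0.106 42848210.5 8"

correlations_chr <- read.table(text = sample_text, header = TRUE,
                               colClasses = c(region = "character"),
                               check.names = FALSE)

# The ID is kept digit for digit, so the lookup from the question succeeds.
"82365593523656436" %in% correlations_chr$region   # TRUE
```

Because the column never becomes a number, the same lookup works no matter how large the IDs are.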