Greg Rov - 1 year ago
Linux Question

Pearson Correlation between two columns

Good Morning. Here is my problem:
I have several files like the one below:

104 0.1697 12.3513214 15.9136214
112 -0.3146 12.0517303 14.8027303
122 0.2718 10.881109 13.259109
123 -0.4185 11.2880142 14.0237142
128 0.0205 13.0585763 15.4365763
132 0.1562 13.3956582 16.9579582
136 -0.4602 12.2567041 14.6347041
157 0.8142 13.6455927 17.2078927
158 -0.9244 8.0012967 11.5635967


There are approximately 10000 files, each with several rows.
For each file I need to compute the Pearson correlation between columns 2 and 4, and then take the average of these correlations across all files. I would like to do everything with Linux commands. Can anyone help me, please?
Thanks

Answer

Try this script. You will need bash and bc (to operate on floating point numbers).

  • make the script executable: chmod +x /path/to/pearson.sh
  • change FILES to the directory where all your files are stored
  • run the script with no parameters: bash /path/to/pearson.sh

It should produce the mean of all Pearson correlation coefficients calculated on data from those files.

#! /bin/bash

FILES=/path/to/files/

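# Thin wrappers around bc, because bash has no floating-point arithmetic.
# The expressions are quoted so the shell never glob-expands the "*".
function add {
  echo "$1 + $2" | bc
}
function sub {
  echo "$1 - $2" | bc
}
function mult {
  # -l sets scale=20 so products are not truncated
  echo "$1 * $2" | bc -l
}
function div {
  echo "$1 / $2" | bc -l
}
function sqrt {
  echo "sqrt ($1)" | bc -l
}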

X=0
X2=0
Y=0
Y2=0
XY=0

r=0
R=0
N=0

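for f in "$FILES"/*; do
  # Reset the per-file sums.
  X=0; X2=0; Y=0; Y2=0; XY=0; n=0
  # The "|| [ -n ... ]" part also picks up a last line with no trailing newline.
  while IFS= read -r l || [ -n "$l" ]; do
    read -r -a rows <<< "$l"
    # Skip blank or short lines so bc never receives an empty operand.
    [ "${#rows[@]}" -ge 4 ] || continue
    n=$((n+1))
    x=${rows[1]}   # column 2 (bash arrays are 0-indexed)
    y=${rows[3]}   # column 4
    X=$(add "$X" "$x")
    X2=$(add "$X2" "$(mult "$x" "$x")")
    Y=$(add "$Y" "$y")
    Y2=$(add "$Y2" "$(mult "$y" "$y")")
    XY=$(add "$XY" "$(mult "$x" "$y")")
  done < "$f"
  # The coefficient is undefined for fewer than two rows.
  [ "$n" -ge 2 ] || continue
  r=$(sub "$XY" "$(div "$(mult "$X" "$Y")" "$n")")
  d1=$(sub "$X2" "$(div "$(mult "$X" "$X")" "$n")")
  d2=$(sub "$Y2" "$(div "$(mult "$Y" "$Y")" "$n")")
  # Skip files where a column is constant (this would divide by zero).
  [ "$(echo "$d1 * $d2 > 0" | bc -l)" -eq 1 ] || continue
  r=$(div "$r" "$(sqrt "$(mult "$d1" "$d2")")")
  R=$(add "$R" "$r")
  N=$((N+1))   # count only files that produced a coefficient
done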

echo "Mean=$(div "$R" "$N")"
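
For reference, this is the computational form of Pearson's r that the script above appears to evaluate for each file. Here x is column 2, y is column 4 and n is the number of rows; the sums correspond to the variables X, Y, X2, Y2 and XY, and the two factors under the square root to d1 and d2:

```latex
r = \frac{\sum x_i y_i - \frac{1}{n}\sum x_i \sum y_i}
         {\sqrt{\left(\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right)
                \left(\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right)}}
```

The final Mean is then the plain average of the per-file r values.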

PS: I assumed that all files have the same format as the one you showed. The formula used to evaluate the coefficients was taken from the link you gave (http://www.stat.wmich.edu/s216/book/node122.html).
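
The bash script above starts a new bc process for every arithmetic operation, which can get slow with around 10000 files. As an alternative, here is a minimal sketch of the same computation done in a single awk process. It is not the script above but a different technique applying the same formula. It assumes the same whitespace-separated, four-column layout; the function names pearson_mean and finish_file and the DIR placeholder are mine, and files with fewer than two usable rows or a constant column are skipped.

```shell
#!/bin/bash
# Sketch: mean of the per-file Pearson correlations between columns 2 and 4.
# DIR is a placeholder (like FILES in the script above); adjust it or pass
# the directory as the first argument.
DIR=${1:-/path/to/files}

pearson_mean() {
  # One awk process reads every file given as an argument.
  awk '
    function finish_file(   num, d1, d2) {
      # Close out the file just read; skip it when r is undefined.
      if (n > 1) {
        num = sxy - sx * sy / n
        d1  = sxx - sx * sx / n
        d2  = syy - sy * sy / n
        if (d1 > 0 && d2 > 0) { total += num / sqrt(d1 * d2); files++ }
      }
      n = sx = sy = sxx = syy = sxy = 0
    }
    FNR == 1 && NR > 1 { finish_file() }   # a new file has started
    NF >= 4 {                              # ignore blank or short lines
      n++
      sx  += $2;      sy  += $4
      sxx += $2 * $2; syy += $4 * $4; sxy += $2 * $4
    }
    END {
      finish_file()
      if (files > 0) printf "Mean=%.6f\n", total / files
    }
  ' "$@"
}

if [ -d "$DIR" ]; then
  pearson_mean "$DIR"/*
fi
```

It can be run as bash pearson_awk.sh /path/to/files (the script name is only an example). If the directory holds so many files that the shell reports "Argument list too long", the file names would have to be fed to awk in batches instead.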