Greg Rov - 1 year ago 76
Linux Question

# Pearson Correlation between two columns

Good Morning. Here is my problem:
I have several files like the one below:

```
104 0.1697 12.3513214 15.9136214
112 -0.3146 12.0517303 14.8027303
122 0.2718 10.881109 13.259109
123 -0.4185 11.2880142 14.0237142
128 0.0205 13.0585763 15.4365763
132 0.1562 13.3956582 16.9579582
136 -0.4602 12.2567041 14.6347041
157 0.8142 13.6455927 17.2078927
158 -0.9244 8.0012967 11.5635967
```

There are approximately 10000 files, each with several rows.
I need to compute the Pearson correlation between columns 2 and 4 for each file, and then take the average of these correlations. I would like to do everything with Linux commands. Can anyone help me, please?
Thanks

Try this script. You will need bash and bc (to operate on floating point numbers).

• make the script executable: `chmod +x /path/to/pearson.sh`
• change FILES to the directory where all your files are stored
• run the script with no parameters: `bash /path/to/pearson.sh`

It should produce the mean of all Pearson correlation coefficients calculated on data from those files.

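```
#!/bin/bash

# Directory containing the data files
FILES=/path/to/files/

# Arithmetic helpers built on bc (bash has no floating-point arithmetic)
function add {
    echo "$1 + $2" | bc
}
function sub {
    echo "$1 - $2" | bc
}
function mult {
    # -l sets scale=20 so that products are not truncated
    echo "$1 * $2" | bc -l
}
function div {
    echo "$1 / $2" | bc -l
}
function sqrt {
    echo "sqrt ($1)" | bc -l
}

# Per-file running sums: sum(x), sum(x^2), sum(y), sum(y^2), sum(x*y)
X=0
X2=0
Y=0
Y2=0
XY=0

r=0   # correlation of the current file
R=0   # sum of correlations over all files
N=0   # number of files

for f in "$FILES"/*; do
    N=$((N+1))
    n=0
    while read -r l; do
        n=$((n+1))
        read -r -a rows <<< "$l"
        x=${rows[1]}   # column 2
        y=${rows[3]}   # column 4
        X=$(add "$X" "$x")
        X2=$(add "$X2" "$(mult "$x" "$x")")
        Y=$(add "$Y" "$y")
        Y2=$(add "$Y2" "$(mult "$y" "$y")")
        XY=$(add "$XY" "$(mult "$x" "$y")")
    done < "$f"
    # r = (XY - X*Y/n) / sqrt((X2 - X^2/n) * (Y2 - Y^2/n))
    r=$(sub "$XY" "$(div "$(mult "$X" "$Y")" "$n")")
    d1=$(sub "$X2" "$(div "$(mult "$X" "$X")" "$n")")
    d2=$(sub "$Y2" "$(div "$(mult "$Y" "$Y")" "$n")")
    r=$(div "$r" "$(sqrt "$(mult "$d1" "$d2")")")
    R=$(add "$R" "$r")
    # Reset the sums before the next file
    X=0
    X2=0
    Y=0
    Y2=0
    XY=0
    r=0
    n=0
done

echo Mean=$(div "$R" "$N")
```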
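
Since the script above starts a new `bc` process for every arithmetic operation, it may be slow on roughly 10000 files. As a rough alternative sketch (not part of the original answer), the same sums can be accumulated in a single `awk` call per file. The function names `pearson_file` and `pearson_mean` are hypothetical, and the sketch assumes whitespace-separated files with at least four columns, as in the sample shown in the question.

```shell
#!/bin/bash

# Hypothetical helper: print the Pearson r between columns 2 and 4 of one file.
# Prints nothing (and returns non-zero) if r is undefined for that file.
pearson_file() {
    awk '
        NF >= 4 {
            n++; x = $2; y = $4
            sx += x; sy += y; sxx += x*x; syy += y*y; sxy += x*y
        }
        END {
            if (n < 2) exit 1
            d1 = sxx - sx*sx/n
            d2 = syy - sy*sy/n
            if (d1 <= 0 || d2 <= 0) exit 1   # constant column: r undefined
            printf "%.6f\n", (sxy - sx*sy/n) / sqrt(d1*d2)
        }' "$1"
}

# Hypothetical helper: average the per-file coefficients over a directory.
# Files for which r is undefined are skipped.
pearson_mean() {
    local f
    for f in "$1"/*; do
        [ -f "$f" ] && pearson_file "$f"
    done | awk '{ s += $1; n++ } END { if (n) printf "Mean=%.6f\n", s/n }'
}
```

After sourcing these definitions, `pearson_mean /path/to/files` would print the mean in the same `Mean=...` form as the script above. Note that `awk` works in double precision, so the result can differ from the `bc` version in the last digits.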

P.S. I assumed that all files have the same format as the one you showed. The formula for the coefficient was taken from the link you gave (http://www.stat.wmich.edu/s216/book/node122.html).
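
For reference, the arithmetic in the script corresponds to the usual computational form of the Pearson coefficient, sketched below. Here the sums run over the $n$ rows of one file, with $x$ taken from column 2 and $y$ from column 4; the script's variables `X`, `X2`, `Y`, `Y2` and `XY` hold $\sum x$, $\sum x^2$, $\sum y$, $\sum y^2$ and $\sum xy$.

$$
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}
$$

The value reported at the end is the plain average of the per-file coefficients: with $N$ files and coefficients $r_1, \dots, r_N$, the mean is $\frac{1}{N}\sum_{i=1}^{N} r_i$.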
