Greg Rov - 8 months ago 33

Linux Question

Good Morning. Here is my problem:

I have several files like the one below:

```
104 0.1697 12.3513214 15.9136214
112 -0.3146 12.0517303 14.8027303
122 0.2718 10.881109 13.259109
123 -0.4185 11.2880142 14.0237142
128 0.0205 13.0585763 15.4365763
132 0.1562 13.3956582 16.9579582
136 -0.4602 12.2567041 14.6347041
157 0.8142 13.6455927 17.2078927
158 -0.9244 8.0012967 11.5635967
```

There are approximately 10,000 files, each with several rows.

I need to compute the Pearson correlation between columns 2 and 4 for each file, and then take the average of these correlations. I would like to do everything with Linux commands. Can anyone help me, please?

Thanks

Answer

Try this script. You will need bash and bc (for the floating-point arithmetic).

- Make the script executable: `chmod +x /path/to/pearson.sh`
- Change FILES to the directory where all your files are stored.
- Run the script with no parameters: `bash /path/to/pearson.sh`

It should print the mean of the Pearson correlation coefficients computed from the data in those files.

```
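#!/bin/bash
# Directory that holds the data files (edit this path).
FILES=/path/to/files

# Thin wrappers around bc. The -l flag loads the math library and sets
# scale=20, so products and quotients keep their decimal places.
function add {
    echo "($1) + ($2)" | bc -l
}
function sub {
    echo "($1) - ($2)" | bc -l
}
function mult {
    echo "($1) * ($2)" | bc -l
}
function div {
    echo "($1) / ($2)" | bc -l
}
function sqrt {
    echo "sqrt($1)" | bc -l
}

R=0   # sum of the per-file correlation coefficients
N=0   # number of files that contributed a coefficient

for f in "$FILES"/*; do
    [ -f "$f" ] || continue
    # Per-file running sums: X, Y, X^2, Y^2, X*Y, and the row count n.
    X=0
    X2=0
    Y=0
    Y2=0
    XY=0
    n=0
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while read -r l || [ -n "$l" ]; do
        read -r -a rows <<< "$l"
        [ "${#rows[@]}" -ge 4 ] || continue   # skip blank or short lines
        n=$((n+1))
        x=${rows[1]}   # column 2
        y=${rows[3]}   # column 4
        X=$(add "$X" "$x")
        X2=$(add "$X2" "$(mult "$x" "$x")")
        Y=$(add "$Y" "$y")
        Y2=$(add "$Y2" "$(mult "$y" "$y")")
        XY=$(add "$XY" "$(mult "$x" "$y")")
    done < "$f"
    [ "$n" -ge 2 ] || continue   # r needs at least two rows
    # r = (XY - X*Y/n) / sqrt((X2 - X*X/n) * (Y2 - Y*Y/n))
    # Note: a constant column makes the denominator zero and r undefined.
    r=$(sub "$XY" "$(div "$(mult "$X" "$Y")" "$n")")
    d1=$(sub "$X2" "$(div "$(mult "$X" "$X")" "$n")")
    d2=$(sub "$Y2" "$(div "$(mult "$Y" "$Y")" "$n")")
    r=$(div "$r" "$(sqrt "$(mult "$d1" "$d2")")")
    R=$(add "$R" "$r")
    N=$((N+1))
done

# Average over the files that were actually processed.
[ "$N" -gt 0 ] && echo "Mean=$(div "$R" "$N")"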
```
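
Since the script above starts a new `bc` process for every arithmetic step, it may be slow on roughly 10,000 files. As a possible alternative, here is a minimal sketch that does the same sums in awk, one awk process per file. The helper names `pearson_file` and `mean_pearson` are made up for this illustration, and the sketch assumes whitespace-separated columns as in the sample; files with fewer than two usable rows or a constant column are skipped.

```shell
#!/bin/bash
# Hypothetical helper: print the Pearson r of columns 2 and 4 for one file.
pearson_file() {
    awk 'NF >= 4 {
             n++
             x = $2; y = $4
             sx += x; sy += y
             sxx += x * x; syy += y * y; sxy += x * y
         }
         END {
             if (n < 2) exit 1                 # not enough rows
             dx = sxx - sx * sx / n
             dy = syy - sy * sy / n
             if (dx <= 0 || dy <= 0) exit 1    # constant column: r undefined
             printf "%.10f\n", (sxy - sx * sy / n) / sqrt(dx * dy)
         }' "$1"
}

# Hypothetical helper: average r over every regular file in a directory.
mean_pearson() {
    local dir=$1 f r
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Only files that yield a coefficient are passed on to the average.
        r=$(pearson_file "$f") && echo "$r"
    done | awk '{ s += $1; n++ } END { if (n) printf "Mean=%.6f\n", s / n }'
}
```

After sourcing these functions, `mean_pearson /path/to/files` would print one `Mean=` line. awk uses double-precision floating point rather than the fixed decimal scale of bc, so the last digits may differ slightly from the bc version.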

P.S. I assumed that all files have the same format as the one you showed. The formula for the coefficient was taken from the link you gave (http://www.stat.wmich.edu/s216/book/node122.html).
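
For reference, this is the standard computational form of the Pearson coefficient, written out as I understand the script to apply it. The sums run over the $n$ rows of one file, with $x$ taken from column 2 and $y$ from column 4; in the script they correspond to the variables `X`, `Y`, `X2`, `Y2` and `XY`. The index $k$ over files is notation introduced here, not something from the script.

$$
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}
         {\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)
                \left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}
\qquad
\bar{r} = \frac{1}{N} \sum_{k=1}^{N} r_k
$$

Here $N$ is the number of files and $\bar{r}$ is the value printed as `Mean`.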