Greg Rov - 1 year ago 51

Linux Question

Good Morning. Here is my problem:

I have several files like the one below:

104 0.1697 12.3513214 15.9136214

112 -0.3146 12.0517303 14.8027303

122 0.2718 10.881109 13.259109

123 -0.4185 11.2880142 14.0237142

128 0.0205 13.0585763 15.4365763

132 0.1562 13.3956582 16.9579582

136 -0.4602 12.2567041 14.6347041

157 0.8142 13.6455927 17.2078927

158 -0.9244 8.0012967 11.5635967

There are approximately 10,000 files, each with several rows.

I need to compute the Pearson correlation between columns 2 and 4 for each file, and then take the average of these correlations. I would like to do everything with Linux commands. Can anyone help me, please?

Thanks

Answer

Try this script. You will need bash and bc (to operate on floating point numbers).

- make it executable
`chmod +x /path/to/pearson.sh`

- change FILES to the directory where all your files are stored
- run the script with no parameters
`bash /path/to/pearson.sh`

It should produce the mean of all Pearson correlation coefficients calculated on data from those files.
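For reference, this is the per-file coefficient the script appears to compute, written out from its running sums. Here x is column 2, y is column 4, and n is the number of rows in the file; the symbols are my own shorthand, not notation from the linked page.

```latex
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}
         {\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)
                \left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}
```

The final result is the plain average of r over all files.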

```
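#!/bin/bash
# Directory where all the data files are stored
FILES=/path/to/files
# Floating-point helpers; bc -l keeps 20 decimal places
function add {
    echo "$1 + $2" | bc -l
}
function sub {
    echo "$1 - $2" | bc -l
}
function mult {
    echo "$1 * $2" | bc -l
}
function div {
    echo "$1 / $2" | bc -l
}
function sqrt {
    echo "sqrt($1)" | bc -l
}
R=0   # running sum of the per-file coefficients
N=0   # number of files processed
for f in "$FILES"/*; do
    N=$((N+1))
    # Per-file sums of x, x^2, y, y^2, x*y, and the row count n
    X=0
    X2=0
    Y=0
    Y2=0
    XY=0
    n=0
    while read -r l; do
        read -r -a rows <<< "$l"
        # Skip blank lines and lines with fewer than 4 columns
        [ "${#rows[@]}" -lt 4 ] && continue
        n=$((n+1))
        x=${rows[1]}   # column 2
        y=${rows[3]}   # column 4
        X=$(add "$X" "$x")
        X2=$(add "$X2" "$(mult "$x" "$x")")
        Y=$(add "$Y" "$y")
        Y2=$(add "$Y2" "$(mult "$y" "$y")")
        XY=$(add "$XY" "$(mult "$x" "$y")")
    done < "$f"
    # r = (XY - X*Y/n) / sqrt((X2 - X^2/n) * (Y2 - Y^2/n))
    r=$(sub "$XY" "$(div "$(mult "$X" "$Y")" "$n")")
    d1=$(sub "$X2" "$(div "$(mult "$X" "$X")" "$n")")
    d2=$(sub "$Y2" "$(div "$(mult "$Y" "$Y")" "$n")")
    r=$(div "$r" "$(sqrt "$(mult "$d1" "$d2")")")
    R=$(add "$R" "$r")
done
echo "Mean=$(div "$R" "$N")"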
```

P.S. I assumed that all files have the same format as the one you posted. The formula for the coefficients was taken from the link you gave (http://www.stat.wmich.edu/s216/book/node122.html).
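With around 10,000 files, spawning bc for every arithmetic step can get slow. As a possible alternative, here is a minimal sketch that does the same sums in a single awk pass. The function name `mean_pearson` and the awk helper `finish_file` are made up for this example, and it assumes whitespace-separated columns as in the sample. Unlike the bc script, it skips files where either column is constant (the coefficient is undefined there) and uses double precision rather than bc's arbitrary precision.

```shell
#!/bin/bash
# Sketch: mean of per-file Pearson correlations between columns 2 and 4.
# mean_pearson and finish_file are hypothetical names used for illustration.
mean_pearson() {
    awk '
        # Turn the sums of the file just read into r, then reset them.
        function finish_file(   dx, dy) {
            if (n > 1) {
                dx = sxx - sx * sx / n
                dy = syy - sy * sy / n
                # Skip files where a column is constant (r is undefined)
                if (dx > 0 && dy > 0) {
                    total += (sxy - sx * sy / n) / sqrt(dx * dy)
                    files++
                }
            }
            n = sx = sy = sxx = syy = sxy = 0
        }
        # First line of every file after the first: close out the previous one
        FNR == 1 && NR > 1 { finish_file() }
        # Accumulate sums; blank or short lines are ignored
        NF >= 4 {
            x = $2 + 0; y = $4 + 0
            n++; sx += x; sy += y
            sxx += x * x; syy += y * y; sxy += x * y
        }
        END {
            finish_file()
            if (files > 0) printf "Mean=%.6f\n", total / files
        }
    ' "$@"
}
```

It would be called as `mean_pearson /path/to/files/*`. If the expanded file list ever gets too long for one command line, the files would have to be fed to awk in some other way.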