Giulio - 1 year ago

Python Question

This is my current problem: sometimes, when I draw a scatter plot and then a regression line from the same data, the regression line calculated with

`scipy.stats.linregress(y,X)`

does not match the points. As I already said, the scatter plot and the line use the same data, and usually it works. This is the code:

```
### "results" comes from a MySQL query and is converted to an np.array
Data=np.array(results)
X=Data[:,1]
y=Data[:,0]
slope, intercept, r_value, p_value, std_err=linregress(y,X)
line = slope * X + intercept
plt.scatter(y, X, marker='o',color='#33ffe6',alpha=1,edgecolors='black',linewidths=0.5)
#### Plot
plt.plot(X, line, 'r', label="Regression Line", antialiased=True)
#### Label
plt.xlabel(labels[0])
plt.ylabel(labels[col])
#### Scale
plt.xlim(y.min()-((y.max() / float(100)) * 5), y.max() + (y.max() / float(100)) * 5)
plt.ylim(X.min()-((X.max() / float(100)) * 5), X.max() + (X.max() / float(100)) * 5)
#### Legend
plt.legend(loc="upper center", bbox_to_anchor=(0.5, 1.07), fancybox=True, shadow=True)
```

This is a Dropbox link to a .csv file containing the data: here. The first column is y and the second is X.

Answer Source

The problem is that `linregress(y, X)` fits X as a function of y, not y as a function of X, so when you construct the line you must use y as the independent variable instead of X. Because neither X nor y is sorted, it is also probably a good idea to plot the line against a regularly sampled array. To avoid this confusion altogether, I would suggest following the usual convention and simply swapping the columns you read X and y from.
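To see why the line can miss the points, here is a minimal sketch on synthetic data (the arrays and coefficients are made up for illustration and are not your CSV). `linregress(y, X)` models X from y, so evaluating and plotting the fit at the X values draws the same straight line, but over X's range of values rather than y's, which may lie outside the visible axis limits.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration, not the asker's CSV).
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 100.0, size=200)                      # horizontal axis in the plot
X = 3.0 * y + 500.0 + rng.normal(0.0, 5.0, size=200)       # vertical axis in the plot

# linregress(a, b) fits b ~ slope * a + intercept, so this models X from y.
fit = stats.linregress(y, X)

# Horizontal coordinates used when plotting the line in each variant.
wrong_horizontal = X   # plt.plot(X, slope * X + intercept): spans X's range
right_horizontal = y   # plt.plot(y, slope * y + intercept): spans y's range

print("scatter spans  :", y.min(), "to", y.max())
print("wrong line span:", wrong_horizontal.min(), "to", wrong_horizontal.max())
print("right line span:", right_horizontal.min(), "to", right_horizontal.max())
```

When X and y happen to cover similar ranges, the two spans overlap and the plot looks fine, which may be why the problem only shows up sometimes.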

From your csv file, I believe this code produces your desired plot (minus the axis labels).

```
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\Users\smith\Downloads\orders.csv')
Data = df.values
X = Data[:, 1]
y = Data[:, 0]

# linregress(y, X) fits X as a function of y
slope, intercept, r_value, p_value, std_err = stats.linregress(y, X)

# Evaluate the line on an evenly spaced grid of y values
yvals = np.linspace(y.min(), y.max())
line = slope * yvals + intercept  # This is the critical change

plt.scatter(y, X, marker='o', color='#33ffe6', alpha=1, edgecolors='black', linewidths=0.5)

#### Plot
plt.plot(yvals, line, 'r', label="Regression Line", antialiased=True)

#### Scale
plt.xlim(y.min() - ((y.max() / float(100)) * 5), y.max() + (y.max() / float(100)) * 5)
plt.ylim(X.min() - ((X.max() / float(100)) * 5), X.max() + (X.max() / float(100)) * 5)

#### Legend
plt.legend(loc="upper center", bbox_to_anchor=(0.5, 1.07), fancybox=True, shadow=True)
plt.show()
```
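The other option, swapping the columns so the names follow the usual convention, might look roughly like the sketch below. It uses synthetic data in place of the CSV, and the names `data`, `x_grid`, and `y_line` are my own, so treat it as an illustration rather than a drop-in replacement.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in for the two CSV columns (an assumption, not the real data).
rng = np.random.default_rng(1)
col0 = rng.uniform(0.0, 100.0, size=150)
col1 = 2.0 * col0 + 300.0 + rng.normal(0.0, 10.0, size=150)
data = np.column_stack([col0, col1])

# Usual convention: x is the independent variable, y the dependent one.
x = data[:, 0]
y = data[:, 1]

# Models y as a function of x.
fit = stats.linregress(x, y)

# Evaluate the line on an evenly spaced grid spanning the x data.
x_grid = np.linspace(x.min(), x.max(), 50)
y_line = fit.slope * x_grid + fit.intercept

fig, ax = plt.subplots()
ax.scatter(x, y, marker="o", color="#33ffe6", edgecolors="black", linewidths=0.5)
ax.plot(x_grid, y_line, "r", label="Regression Line")
ax.legend()
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view the result. With this naming, the variable passed first to `linregress` is the same one used on the horizontal axis and to build the line, so the mix-up is harder to make.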