Giulio Giulio - 29 days ago 14

Python matplotlib giving shorter regression line

My problem: sometimes, when I draw a scatter plot and then a regression line from the same data, the regression line calculated with

scipy.stats.linregress(y,X)
is too short, like this one:

Plot screenshot

As I said, the scatter plot and the line use the same data, and usually it works. This is the code:

### "results" comes from a MySQL query and is converted to a NumPy array
Data=np.array(results)
X=Data[:,1]
y=Data[:,0]
slope, intercept, r_value, p_value, std_err=linregress(y,X)
line = slope * X + intercept
plt.scatter(y, X, marker='o',color='#33ffe6',alpha=1,edgecolors='black',linewidths=0.5)
#### Plot
plt.plot(X, line, 'r', label="Regression Line", antialiased=True)
#### Label
plt.xlabel(labels[0])
plt.ylabel(labels[col])
#### Scale
plt.xlim(y.min()-((y.max() / float(100)) * 5), y.max() + (y.max() / float(100)) * 5)
plt.ylim(X.min()-((X.max() / float(100)) * 5), X.max() + (X.max() / float(100)) * 5)
#### Legend
plt.legend(loc="upper center", bbox_to_anchor=(0.5, 1.07), fancybox=True, shadow=True)


This is a Dropbox link to a .csv file containing the data: here. The first column is y and the second is X.

Answer

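The problem is that linregress(y, X) fits X as a function of y, not y as a function of X, so when you construct the line you should use y as the independent variable instead of X. Your code computes slope * X + intercept and plots it against X, so along the horizontal axis the line spans the range of X rather than the range of y, which is why it can look too short. Because neither X nor y is sorted, it's also a good idea to evaluate the line on a regularly spaced array instead of on the raw data. To avoid this confusion altogether, I'd suggest following the usual convention (x independent, y dependent) and simply swapping the columns you read X and y from.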

From your csv file, I believe this code produces your desired plot (minus the axis labels).

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\Users\smith\Downloads\orders.csv')
Data=df.values
X=Data[:,1]
y=Data[:,0]
slope, intercept, r_value, p_value, std_err=stats.linregress(y,X)
yvals = np.linspace(y.min(),y.max())
line = slope * yvals + intercept # This is the critical change
plt.scatter(y, X, marker='o',color='#33ffe6',alpha=1,edgecolors='black',linewidths=0.5)
#### Plot
plt.plot(yvals, line, 'r', label="Regression Line", antialiased=True)
#### Scale
plt.xlim(y.min()-((y.max() / float(100)) * 5), y.max() + (y.max() / float(100)) * 5)
plt.ylim(X.min()-((X.max() / float(100)) * 5), X.max() + (X.max() / float(100)) * 5)
#### Legend
plt.legend(loc="upper center", bbox_to_anchor=(0.5, 1.07), fancybox=True, shadow=True)
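
For the alternative mentioned above (swapping the columns so the names follow the usual convention), here is a minimal sketch. It is not run against your csv: the data below is synthetic, and the names data, x_grid, horizontal and vertical are made up for illustration. It assumes the first column is what you want on the horizontal axis and the second on the vertical axis, as in your plot.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the csv data (assumption: column 0 goes on the
# horizontal axis, column 1 on the vertical axis).
rng = np.random.default_rng(0)
horizontal = rng.uniform(0.0, 100.0, size=200)
vertical = 3.0 * horizontal + 10.0 + rng.normal(0.0, 5.0, size=200)
data = np.column_stack([horizontal, vertical])

# Conventional naming: x is the independent variable, y the dependent one.
x = data[:, 0]
y = data[:, 1]

# linregress(x, y) fits y = slope * x + intercept.
fit = stats.linregress(x, y)

# Evaluate the line on an evenly spaced grid over the range of x, so it
# spans the whole scatter plot no matter how the rows are ordered.
x_grid = np.linspace(x.min(), x.max(), 50)
line = fit.slope * x_grid + fit.intercept
```

With these names the plotting calls read naturally: plt.scatter(x, y, ...) for the points and plt.plot(x_grid, line, 'r', ...) for the regression line, with xlim computed from x and ylim from y.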

