Christian Bertram - 7 months ago 42

Python Question

I am implementing liner regression in python, using numpy. My implementation of the square cost function looks like this

`square_cost(w, datax, datay):`

ndata = datax.shape[1]

return (1/ndata)*np.sum((h(w, datax)-datay)**2)

All parameters are 2-dimensional ndarrays, but datay and the result only have height one.

What I later saw, was this implementation:

`square_cost(w, datax, datay):`

ndata = datax.shape[1]

diff = h(w, datax)-datay

return (1/ndata)*diff.dot(diff.T)

I think my first implementation is the clearest, but is the second faster, since it uses dot product?

Answer

```
import numpy as np
np.random.seed([3, 1415])
x = np.random.rand(1000000, 1)
y = np.random.rand(1000000, 1)
%%timeit
diff = x - y
diff.T.dot(diff)
```

100 loops, best of 3: 3.66 ms per loop

```
%%timeit
diff = x - y
np.square(diff).sum()
```

100 loops, best of 3: 6.85 ms per loop

I'd say yes. Dot product is faster.

For completeness and to address @Eric's concern in the comment wrt the OP's question.

In the regression at hand, the dimension of the endogenous (y) variable is n x 1. Therefore, we can safely assume the second dimension will always have a size of 1.

However, if it were to have a size greater than 1, say m, (and this is definitely possible, just not what the OP needs) then we would be looking at an exogenous (X) of dimension n x k and endogenous (Y) of dimension n x m. This implies a parameter matrix Theta of size k x m. Still totally reasonable. The kicker is that in order to calculate the sum of squared errors we'd be doing (X * Theta - Y) squared and summed. The inner product of (X * Theta - Y) is no longer appropriate as a means to calculate our cost function and its performance is irrelevant. To get something appropriate, we'd reshape (X * Theta - Y) to one dimension then take the inner product. In that case, the same analysis we've done over one dimension is still the most appropriate analysis.

All that said, I ran the following code:

```
idx = pd.Index(np.arange(1000, 501000, 1000)).sort_values()
ddd = pd.DataFrame(index=idx, columns=['dot', 'dif'])
def fill_ddd(row):
i = row.name
x = np.random.rand(i, 1)
s = pd.datetime.now()
for _ in range(100):
x.T.dot(x)
row.loc['dot'] = (pd.datetime.now() - s).total_seconds() / 100.
s = pd.datetime.now()
for _ in range(100):
np.square(x).sum()
row.loc['dif'] = (pd.datetime.now() - s).total_seconds() / 100.
return row
np.random.seed([3, 1415])
ddd.apply(fill_ddd, axis=1)
ddd.plot()
```

The dot product is the clear winner and much more consistent.