sriramn - 2 months ago 10

Python Question

What is the `numpy` or `pandas` equivalent of R's `sweep()`?

To elaborate: in R, let's say we have a coefficient vector (say `beta`, numeric type) and an array (say `data`, 20x5 numeric type). I want to superimpose the vector on each row of the array and multiply the corresponding elements, then return the resultant (20x5) array. I could achieve this using `sweep()` in `R`. Here is a smaller example with a 5x4 array:

```
beta <- c(10, 20, 30, 40)
data <- array(1:20, c(5, 4))
sweep(data, MARGIN=2, beta, `*`)
#---------------
> data
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
> beta
[1] 10 20 30 40
> sweep(data, MARGIN=2, beta, `*`)
     [,1] [,2] [,3] [,4]
[1,]   10  120  330  640
[2,]   20  140  360  680
[3,]   30  160  390  720
[4,]   40  180  420  760
[5,]   50  200  450  800
```

I have heard exciting things about `numpy` and `pandas`. How would I do there what `sweep()` does in `R`?

Answer

Pandas has an `apply` method too, similar to R's `apply`, which can be used to get the effect of `sweep`. (Note that R's `MARGIN` argument is roughly "equivalent" to the `axis` argument in many pandas functions, except that `axis` takes the values 0 and 1 rather than 1 and 2.)

```
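In [9]: import numpy as np
In [10]: import pandas as pd
In [11]: np.random.seed(1)  # call seed(); the output shown below came from an unseeded session
In [12]: beta = pd.Series(np.random.randn(5))
In [13]: data = pd.DataFrame(np.random.randn(20, 5))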
```

You can use `apply` with a function that is called on each row:

```
In [14]: data.apply(lambda row: row * beta, axis=1)
```

*Note: `axis=0` would apply the function to each column; this is the default, as data is stored column-wise and so column-wise operations are more efficient.*
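To make the `axis` behaviour concrete, here is a minimal sketch on a small frame that mirrors the 5x4 R example from the question (the frame and the names `by_row` and `col_sums` are assumptions chosen for illustration, not part of the original session):

```python
import pandas as pd

# Small frame with the same values as the R example (5 rows, 4 columns).
data = pd.DataFrame([[1, 6, 11, 16],
                     [2, 7, 12, 17],
                     [3, 8, 13, 18],
                     [4, 9, 14, 19],
                     [5, 10, 15, 20]])
beta = pd.Series([10, 20, 30, 40])

# axis=1: the function receives one row at a time (a Series indexed by
# column label), so row * beta pairs each column with its coefficient.
by_row = data.apply(lambda row: row * beta, axis=1)

# axis=0 (the default): the function receives one column at a time.
col_sums = data.apply(lambda col: col.sum(), axis=0)

print(by_row)
print(col_sums)
```

With `axis=1` the first row comes out as 10, 120, 330, 640, matching the first row of the `sweep` output in the question.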

However, in this case it's easy to make this **significantly faster** (and more readable) by vectorizing, simply multiplying row-wise:

```
In [21]: data.apply(lambda row: row * beta, axis=1).head()
Out[21]:
          0         1         2         3         4
0 -0.024827 -1.465294 -0.416155 -0.369182 -0.649587
1  0.026433  0.355915 -0.672302  0.225446 -0.520374
2  0.042254 -1.223200 -0.545957  0.103864 -0.372855
3  0.086367  0.218539 -1.033671  0.218388 -0.598549
4  0.203071 -3.402876  0.192504 -0.147548 -0.726001

In [22]: data.mul(beta, axis=1).head() # just show first few rows with head
Out[22]:
          0         1         2         3         4
0 -0.024827 -1.465294 -0.416155 -0.369182 -0.649587
1  0.026433  0.355915 -0.672302  0.225446 -0.520374
2  0.042254 -1.223200 -0.545957  0.103864 -0.372855
3  0.086367  0.218539 -1.033671  0.218388 -0.598549
4  0.203071 -3.402876  0.192504 -0.147548 -0.726001
```

*Note: `mul` is slightly more robust / allows more control than using the `*` operator.*
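One way to see the extra control is the `axis` argument of `mul`, which chooses whether the Series is matched against column labels or row labels. The following is a small sketch with made-up values (`col_coef` and `row_coef` are hypothetical names used only here):

```python
import pandas as pd

data = pd.DataFrame([[1, 6, 11, 16],
                     [2, 7, 12, 17]])
col_coef = pd.Series([10, 20, 30, 40])  # one coefficient per column
row_coef = pd.Series([2, 3])            # one coefficient per row

# The * operator matches the Series index against the column labels.
via_operator = data * col_coef

# mul with axis=1 (or axis="columns") does the same thing explicitly.
via_mul = data.mul(col_coef, axis=1)

# axis=0 (or axis="index") matches against the row labels instead,
# which is closer to sweep(..., MARGIN=1, ...) in R.
per_row = data.mul(row_coef, axis=0)

print(via_mul)
print(per_row)
```

Here `via_operator` and `via_mul` are identical, while `per_row` scales the first row by 2 and the second by 3.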

You can do the same in numpy (i.e. on `data.values` here), either by multiplying directly, which will be faster as it doesn't worry about data alignment, or by using `vectorize` rather than `apply`.
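For the direct multiplication, a minimal numpy sketch that reproduces the R example from the question could look like this (the `order="F"` reshape is an assumption made here so the array fills column by column, as R's `array(1:20, c(5, 4))` does):

```python
import numpy as np

# Same values as the R example: R fills arrays column by column,
# hence order="F" (Fortran / column-major order).
data = np.arange(1, 21).reshape((5, 4), order="F")
beta = np.array([10, 20, 30, 40])

# Broadcasting: beta has shape (4,), so it is applied across each of the
# 5 rows, multiplying column j by beta[j].
swept = data * beta

print(swept)
```

The result matches the `sweep(data, MARGIN=2, beta, "*")` output shown in the question, with first row 10, 120, 330, 640.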