piRSquared - 2 years ago 78
Python Question

How can I "sparsify" on two values?

Consider the pandas series `s`:

``````
import pandas as pd

n = 1000
s = pd.Series([0] * n + [1] * n, dtype=int)

s.memory_usage()
# 8080
``````

I can "sparsify" this by using `to_sparse`:

``````
# Note: Series.to_sparse was removed in pandas 1.0
s.to_sparse(fill_value=0).memory_usage()
# 4080
``````

But I only have 2 types of integers. I'd think I could sparsify twice. Is there a way to do this?

Since you tagged this with `scipy`, I'll show you what a `scipy.sparse` matrix is like:

``````
# Assumes: import numpy as np; from scipy import sparse
In [31]: n=100
In [32]: arr=np.array([[0]*n+[1]*n],int)
In [33]: M=sparse.csr_matrix(arr)
In [34]: M.data
Out[34]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [35]: M.indices
Out[35]:
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
       113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
       126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
       139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
       152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
       165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
       178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
       191, 192, 193, 194, 195, 196, 197, 198, 199], dtype=int32)
In [36]: M.indptr
Out[36]: array([  0, 100], dtype=int32)
``````

It has replaced the `2*n` elements of `arr` with two arrays of `n` elements each (`M.data` and `M.indices`). Even if I replace `int` with `uint8`, the `M.indices` array will still be `int32`.
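To put rough numbers on that, here is a minimal sketch that adds up the bytes held by the three CSR arrays when the data is `uint8`. The names `dense_bytes` and `sparse_bytes` are just illustrative, and the exact index dtype is whatever `scipy` picks on your install.

```python
import numpy as np
from scipy import sparse

n = 100
# One row: n zeros followed by n ones, stored as 1-byte unsigned integers.
arr = np.array([[0] * n + [1] * n], dtype=np.uint8)
M = sparse.csr_matrix(arr)

# Bytes used by the dense array versus the three CSR component arrays.
dense_bytes = arr.nbytes
sparse_bytes = M.data.nbytes + M.indices.nbytes + M.indptr.nbytes

print("dtypes:", M.data.dtype, M.indices.dtype, M.indptr.dtype)
print("dense bytes:", dense_bytes, "CSR bytes:", sparse_bytes)
```

With a 1-byte data type and only half of the entries equal to zero, the column indices alone outweigh the dense array, so CSR does not help in this case.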

The fact that your `pandas` version has half the memory usage suggests that it is just storing the indices and somehow noting that the `data` part is all 1s. But that's just a guess.
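One way to check a guess like this is to look at what a pandas sparse array actually holds. The sketch below assumes a recent pandas (1.0 or later), where `to_sparse` no longer exists, so it builds a `pd.arrays.SparseArray` directly. It requests `kind="block"` explicitly, and `stored_bytes` and `index_bytes` are names made up for illustration.

```python
import numpy as np
import pandas as pd

n = 1000
values = np.array([0] * n + [1] * n, dtype=np.int64)

# kind="block" requests a block-style sparse index for the non-fill values.
arr = pd.arrays.SparseArray(values, fill_value=0, kind="block")

# sp_values holds only the entries that differ from fill_value.
stored_bytes = arr.sp_values.nbytes
# Whatever remains of nbytes is the cost of the sparse index itself.
index_bytes = arr.nbytes - stored_bytes

print("stored points:", arr.npoints)
print("bytes for stored values:", stored_bytes)
print("bytes for the sparse index:", index_bytes)
```

If the index part comes out small compared with the stored values, the saving comes from dropping the fill value while still keeping every non-fill value.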

How much greater sparsification do you expect?
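For comparison, if "sparsify twice" means storing neither the zeros nor the ones individually, run-length encoding is one representation that does so for data made of long constant runs. This is only a sketch of that idea in plain NumPy, not something `pandas` or `scipy.sparse` does here. The helpers `run_length_encode` and `run_length_decode` are hypothetical names.

```python
import numpy as np

def run_length_encode(values):
    """Return (run_values, run_lengths) for a 1-D array."""
    values = np.asarray(values)
    if values.size == 0:
        return values[:0], np.array([], dtype=np.int64)
    # Positions where the value differs from its predecessor start a new run.
    change = np.flatnonzero(values[1:] != values[:-1]) + 1
    starts = np.concatenate(([0], change))
    lengths = np.diff(np.concatenate((starts, [values.size])))
    return values[starts], lengths

def run_length_decode(run_values, run_lengths):
    """Rebuild the original array from its runs."""
    return np.repeat(run_values, run_lengths)

n = 1000
x = np.array([0] * n + [1] * n)
run_values, run_lengths = run_length_encode(x)
print(run_values, run_lengths)
```

Two runs describe all `2*n` entries here, but the saving disappears as soon as the zeros and ones are interleaved.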

