alvas alvas - 18 days ago 7
Python Question

How to insert a mini identity matrix (row and column) to the start of Panda dataframe

Given a data frame as such:

+-----+-----+-----+------+
| X1 | xyz | opq | jkl |
+-----+-----+-----+------+
| abc | 0.9 | 0.0 | 0.5 |
+-----+-----+-----+------+
| efg | 0.3 | 0.0 | 0.0 |
+-----+-----+-----+------+
| lmn | 0.0 | 0.23| 0.0 |
+-----+-----+-----+------+


How do I add an identity matrix to the start of the dataframe to get this?:

+-----+-----+-----+------+------+
| X1 | <s> | opq | jkl | jkl |
+-----+-----+-----+------+------+
| <s> | 1.0 | 0.0 | 0.0 | 0.0 |
+-----+-----+-----+------+------+
| abc | 0.0 | 0.9 | 0.0 | 0.5 |
+-----+-----+-----+------+------+
| efg | 0.0 | 0.3 | 0.0 | 0.0 |
+-----+-----+-----+------+------+
| lmn | 0.0 | 0.0 | 0.23 | 0.0 |
+-----+-----+-----+------+------+


That's adding a 1x1 matrix but is there a way to add a nxn matrix, e.g.:

+-----+-----+-----+------+------+------+
| X1 | <s> | <e> | opq | jkl | jkl |
+-----+-----+-----+------+------+------+
| <s> | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----+-----+-----+------+------+------+
| <e> | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
+-----+-----+-----+------+------+------+
| abc | 0.0 | 0.0 | 0.9 | 0.0 | 0.5 |
+-----+-----+-----+------+------+------+
| efg | 0.0 | 0.0 | 0.3 | 0.0 | 0.0 |
+-----+-----+-----+------+------+------+
| lmn | 0.0 | 0.0 | 0.0 | 0.23 | 0.0 |
+-----+-----+-----+------+------+------+


I've tried this:

$ echo -e 'abc\txyz\t0.9\nefg\txyz\t0.3\nlmn\topq\t0.23\nabc\tjkl\t0.5\n' > test.txt
$ cat test.txt
abc xyz 0.9
efg xyz 0.3
lmn opq 0.23
abc jkl 0.5

$ python
>>> from sframe import SFrame
>>> sf = SFrame.read_csv('test.txt', header=False, delimiter='\t', column_type_hints=[unicode, unicode, float])
[INFO] sframe.cython.cy_server: SFrame v2.1 started. Logging /tmp/sframe_server_1479779504.log
>>> df = sf.to_dataframe()
>>> df
X1 X2 X3
0 abc xyz 0.90
1 efg xyz 0.30
2 lmn opq 0.23
3 abc jkl 0.50
>>> df = df.pivot(index='X1', columns='X2', values='X3')
>>> df
X2 jkl opq xyz
X1
abc 0.5 NaN 0.9
efg NaN NaN 0.3
lmn NaN 0.23 NaN
>>> df.fillna(0)
X2 jkl opq xyz
X1
abc 0.5 0.00 0.9
efg 0.0 0.00 0.3
lmn 0.0 0.23 0.0


First I added a column:

>>> numrow, numcol = df.shape
>>> df.insert(0, '<s>', [0.0] * numrow)
>>> df
X2 <s> jkl opq xyz
X1
abc 0.0 0.5 0.00 0.9
efg 0.0 0.0 0.00 0.3
lmn 0.0 0.0 0.23 0.0


Then add the row and change the value of
df['<s>']['<s>']
:

>>> numrow, numcol = df.shape
>>> df.loc['<s>'] = [0.0]*numcol
>>> df
X2 <s> jkl opq xyz
X1
abc 0.0 0.5 0.00 0.9
efg 0.0 0.0 0.00 0.3
lmn 0.0 0.0 0.23 0.0
<s> 0.0 0.0 0.00 0.0
>>> df['<s>']['<s>'] = 1.0
>>> df
X2 <s> jkl opq xyz
X1
abc 0.0 0.5 0.00 0.9
efg 0.0 0.0 0.00 0.3
lmn 0.0 0.0 0.23 0.0
<s> 1.0 0.0 0.00 0.0


But still I need to swap the last row and put it to the first and push down all other rows in the table. How do I do that?

And there must be a better / easier way to do this.

Answer

Prepending to DataFrame isn't terribly simple, especially if you're looking to prepend both rows and columns.

Here's the best solution I could come up with, hopefully it isn't too inefficient since you're likely working with large datasets given that you use SFrame.

First, let's define the size of the identity matrix you want to preprend:

# Size o
n = 2

Now, let's create a new dataframe using numpy.eye, prepending new row and column indexes to your existing ones. I could have used numpy.identity since your initial dataframe is square, but it wouldn't work if it wasn't.

df2 = pd.DataFrame(np.eye(n+df.shape[0], n+df.shape[1]), 
         index=[chr(97+x) for x in (range(n))] + df.index.tolist(),
         columns=[chr(97+x) for x in (range(n))] + df.columns.tolist())
df2


       a    b  jkl  opq  xyz
a    1.0  0.0  0.0  0.0  0.0
b    0.0  1.0  0.0  0.0  0.0
abc  0.0  0.0  1.0  0.0  0.0
efg  0.0  0.0  0.0  1.0  0.0
lmn  0.0  0.0  0.0  0.0  1.0

Now we can assign the values of df to the subset of df2

df2.ix[n:,n:] = df
df2

       a    b  jkl   opq  xyz
a    1.0  0.0  0.0  0.00  0.0
b    0.0  1.0  0.0  0.00  0.0
abc  0.0  0.0  0.5  0.00  0.9
efg  0.0  0.0  0.0  0.00  0.3
lmn  0.0  0.0  0.0  0.23  0.0

Timing: For a df of size (100000,1000)

CPU times: user 6.82 s, sys: 4.52 s, total: 11.3 s
Wall time: 13.1 s

Update: Faster

I found a faster version:

df = pd.DataFrame(np.random.rand(100000,1000))
old_cols = df.columns.tolist()
old_index = df.index.tolist()

# Prepend rows: Create n rows and append existing df to this
df1 = pd.DataFrame(np.zeros(shape=(n, df.shape[1])),
             index=[chr(97+x) for x in (range(n))],
             columns=old_cols)
df1 = df1.append(df)

# This was slower
#df2 = pd.DataFrame(np.zeros(shape=(df1.shape[0],n)),
#             index=df1.index,
#             columns=[chr(97+x) for x in (range(n))])
#df_final = pd.concat([df2,df1], axis=1)

# Create n columns and set to zeros
for i in range(n):
    df1[chr(97+i)] = 0

# Reindex coluns in correct order
df1 = df1[[chr(97+x) for x in (range(n))] + old_cols]

# Assign identity to the top left corner
df1.iloc[:n,:n] = np.identity(n)

Timing: For a df of size (100000,1000)

CPU times: user 2.26 s, sys: 1.35 s, total: 3.61 s
Wall time: 3.64 s