the M - 4 months ago
Python Question

Converting a pandas DataFrame containing row names, column names and frequencies to a term-document matrix

I have a pandas DataFrame in the following format:
Input:

Freq Document TermId
3 A 112
5 A 055
1 C 003
4 D 001
2 B 003
1 D 089


I want to convert this DataFrame to a term-document matrix (preferably another pandas DataFrame). What is the most efficient way of achieving this?

Example output:

Term/Document A B C D
001 0 0 0 4
003 0 2 1 0
055 5 0 0 0
089 0 0 0 1
112 3 0 0 0

Answer

Notice that the desired DataFrame has an index whose labels come from df['TermId'] and column labels that come from df['Document']. Whenever the index and column labels come from columns of df, think about using df.pivot, or df.pivot_table if the same index/column pair can occur more than once and the values need to be aggregated:

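import pandas as pd

# Freq is stored as integers so that the resulting matrix is numeric
df = pd.DataFrame({'Document': ['A', 'A', 'C', 'D', 'B', 'D'],
                   'Freq': [3, 5, 1, 4, 2, 1],
                   'TermId': ['112', '055', '003', '001', '003', '089']})
# pivot leaves NaN for missing (TermId, Document) pairs; fill with 0 and cast back to int
result = df.pivot(index='TermId', columns='Document', values='Freq').fillna(0).astype(int)
print(result)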

yields

Document  A  B  C  D
TermId              
001       0  0  0  4
003       0  2  1  0
055       5  0  0  0
089       0  0  0  1
112       3  0  0  0
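
If the same (TermId, Document) pair can appear in more than one row, df.pivot raises a ValueError because it cannot decide which value to keep. The following is a minimal sketch of the pivot_table variant mentioned above, assuming that duplicate frequencies should be summed. The extra seventh row (Document A, TermId 112, Freq 2) is hypothetical and is added only to show the aggregation; it is not part of the question's data.

```python
import pandas as pd

# Same data as in the question, plus one hypothetical duplicate row
# (Document 'A', TermId '112') to illustrate aggregation.
df = pd.DataFrame({'Document': ['A', 'A', 'C', 'D', 'B', 'D', 'A'],
                   'Freq': [3, 5, 1, 4, 2, 1, 2],
                   'TermId': ['112', '055', '003', '001', '003', '089', '112']})

# pivot_table sums the duplicate entries; fill_value=0 replaces
# the missing (TermId, Document) combinations with zeros.
tdm = df.pivot_table(index='TermId', columns='Document', values='Freq',
                     aggfunc='sum', fill_value=0)
print(tdm)
```

Here the two rows for (112, A) are combined, so that cell holds 5 instead of 3. A different aggfunc (for example 'max' or 'first') may be more appropriate, depending on what a repeated pair means in your data.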