ml_student ml_student - 3 months ago 19
Python Question

How to transform several lists of words to a pandas dataframe?

I have file .txt that contains a list of words like this:

5.91686268506 exclusively, catering, provides, arms, georgia, formal, purchase, choose
5.91560417296 hugh, senlis
5.91527936181 italians
5.91470429433 soil, cultivation, fertile
5.91468087491 increases, moderation
....
5.91440227412 farmers, descendants


I would like to transform such data into a pandas table that I expect to show into a html/bootstrap template as follows (*):

COL_A COL_B
5.91686268506 exclusively, catering, provides, arms, georgia, formal, purchase, choose
5.91560417296 hugh, senlis
5.91527936181 italians
5.91470429433 soil, cultivation, fertile
5.91468087491 increases, moderation
....
5.91440227412 farmers, descendants


So I tried the following with pandas:

import pandas as pd
df = pd.read_csv('file.csv',
sep = ' ', names=['weight', 'topics'])
df.head(20)


However, my table doesnt have the above desired sructure:

COL_A COL_B
6.281426 engaged, chance, makes, meeting, nations, things, believe, tries, believing, knocked, admits, awkward
6.277438 sweden NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6.271190 artificial, ammonium NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6.259790 boats, prefix NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6.230612 targets, tactical, wing, missile, squadrons NaN NaN NaN NaN NaN NaN NaN


Any idea of how to get the data as the (*) tabular format?

Answer

Because you have space between words and if you specify space as delimiter, it will naturally separate them. To get what you need, you can try to set the sep as a regular expression (?<!,), ?<! is a negative look behind syntax, which means separate on space only when it is not preceded by a comma and it should work for your case:

pd.read_csv("~/test.csv", sep = "(?<!,) ", names=['weight', 'topics'])

#     weight    topics
#0  5.916863    exclusively, catering, provides, arms, georgia...
#1  5.915604    hugh, senlis
#2  5.915279    italians
#3  5.914704    soil, cultivation, fertile
#4  5.914681    increases, moderation
#5  5.914402    farmers, descendants