Sachin_ruk - 4 months ago
Python Question

DataFrame from jagged array

I have a dataset that looks as follows:

date = ['01/01/2001','02/01/2001']
countries = [['US', 'UK', 'AU'],['CN']]


So basically the data ought to look like the result of:

import pandas as pd

def flatten(array):
    # Concatenate a list of lists into one flat list
    return sum(array, [])

pd.DataFrame({'date': flatten([[date[0]]*3, [date[1]]]), 'country': flatten(countries)})
# Which Returns:
country date
0 US 01/01/2001
1 UK 01/01/2001
2 AU 01/01/2001
3 CN 02/01/2001


The above seems like an inefficient way to implement this. The date and countries arrays have exactly the same length. Basically, each date needs to be repeated by the length of the corresponding sub-array.

Is there an elegant way of doing this in Pandas? Or do I need to preprocess the arrays similar to what I have done?

Answer

Try my 1-liner:

df = pd.DataFrame(list(chain(*[list(product([x],y)) for x, y in zip(date, countries)])), columns= ['date',"countries"])

Explanation:

Basically, itertools is a perfect choice here to tackle the problem in a more Pythonic way:

from itertools import chain, product
df = pd.DataFrame(list(chain(*[list(product([x],y)) for x, y in zip(date, countries)])), columns= ['date',"countries"])

df
Out[56]: 
         date countries
0  01/01/2001        US
1  01/01/2001        UK
2  01/01/2001        AU
3  02/01/2001        CN
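As a cross-check on the output above, here is a sketch of the same result built without itertools. It is not the approach used in this answer; it assumes pandas 0.25 or later, where DataFrame.explode is available, and the name df_alt is only illustrative.

```python
import pandas as pd

date = ['01/01/2001', '02/01/2001']
countries = [['US', 'UK', 'AU'], ['CN']]

# Build one row per date, with the list of countries kept in a single cell.
# Assumption: pandas >= 0.25, which provides DataFrame.explode.
df_alt = pd.DataFrame({'date': date, 'countries': countries})

# explode() emits one row per list element and repeats the date alongside it.
# The original index is repeated too, so reset it to get 0..n-1.
df_alt = df_alt.explode('countries').reset_index(drop=True)

print(df_alt)
```

Because explode repeats the other columns for each list element, the date never has to be repeated by hand.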

Updates:

More Detailed Explanation:

  • Step 1: Use a list comprehension, zip, and itertools.product to pair each date with every country in its sub-list:

    [list(product([x],y)) for x, y in zip(date, countries)]
    Out[59]: [[('01/01/2001', 'US'), ('01/01/2001', 'UK'), ('01/01/2001', 'AU')], [('02/01/2001', 'CN')]]

  • Step 2: Unpack/flatten the list of lists using itertools.chain():

    list(chain(*[list(product([x],y)) for x, y in zip(date, countries)]))
    Out[60]: [('01/01/2001', 'US'), ('01/01/2001', 'UK'), ('01/01/2001', 'AU'), ('02/01/2001', 'CN')]

  • Step 3: Form the DataFrame:

    df = pd.DataFrame(list(chain(*[list(product([x],y)) for x, y in zip(date, countries)])), columns= ['date',"countries"])
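The three steps can also be sketched as separate statements, which may be easier to read than the one-liner. This version swaps chain(*...) for chain.from_iterable, which flattens without unpacking the outer list into arguments; the intermediate names pairs_per_date and rows are only illustrative.

```python
from itertools import chain, product

import pandas as pd

date = ['01/01/2001', '02/01/2001']
countries = [['US', 'UK', 'AU'], ['CN']]

# Step 1: pair each date with every country in its sub-list.
pairs_per_date = [list(product([d], cs)) for d, cs in zip(date, countries)]

# Step 2: flatten the list of lists into a single list of (date, country) tuples.
rows = list(chain.from_iterable(pairs_per_date))

# Step 3: build the DataFrame from the flat list of tuples.
df = pd.DataFrame(rows, columns=['date', 'countries'])

print(df)
```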