ludofet ludofet - 5 months ago 20
Python Question

What is the most efficient way to create a pandas dataframe that combine two unrelated series?

I'm looking at creating a Dataframe that is the combination of two unrelated series.

If we take two dataframes:

A = ['a','b','c']
B = [1,2,3,4]

dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)


I'm looking for this output:

A B
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4


One way could be to have loops on the lists direclty and create the DataFrame but there must be a better way. I'm sure I'm missing something from the pandas documentation.

result = []
for i in A:
for j in B:
result.append([i,j])

result_DF = pd.DataFrame(result,columns=['A','B'])


Ultimately I'm looking at combining months and UUID, I have something working but it takes ages to compute and relies too much on the index. A generic solution would clearly be better:

from datetime import datetime

start = datetime(year=2016,month=1,day=1)
end = datetime(year=2016,month=4,day=1)
months = pd.DatetimeIndex(start=start,end=end,freq="MS")
benefit = pd.DataFrame(index=months)

A = ['a','b','c']

dfA = pd.DataFrame(A)

result = pd.DataFrame(columns=['A','month'])
for i in dfA.index:
newdf = pd.DataFrame(index=benefit.index)
newdf['A'] = dfA.iloc[i,0]
newdf['month'] = newdf.index
result = pd.concat([result,newdf])
result

Answer

One liner approach

pd.DataFrame(0, A, B).stack().index.to_series().apply(pd.Series).reset_index(drop=True)

Or:

pd.MultiIndex.from_product([A, B]).to_series().apply(pd.Series).reset_index(drop=True)

From dataframes, assuming the information is in the first column.

pd.MultiIndex.from_product([dfA.iloc[:, 0], dfB.iloc[:, 0]]).to_series().apply(pd.Series).reset_index(drop=True)

Functionalized:

def cross(df1, df2):
    s1 = df1.iloc[:, 0]
    s2 = df2.iloc[:, 0]
    midx = pd.MultiIndex.from_product([s1, s2])
    df = midx.to_series().apply(pd.Series).reset_index(drop=True)
    df.columns = [s1.name, s2.name if s1.name != s2.name else 1]
    return df

print cross(dfA, dfB)

    0  1
0   a  1
1   a  2
2   a  3
3   a  4
4   b  1
5   b  2
6   b  3
7   b  4
8   c  1
9   c  2
10  c  3
11  c  4