José José - 2 months ago 9
Python Question

Replace values in a dataframes using other dataframe with strings as keys with Pandas

I have been trying this for a while and I am stuck. Here it is the problem:

I am working with some metadata about texts that I have in CSV files. It looks like this:

enter image description here

The real table is longer and more complex, but it follows the same logic: every row is a text and every column is different aspects of the text. I have in some of the columns to much variation and I want it to remodel in a simpler one. For example changing from the narrative-perspective the values of homodiegetic and autodiegetic to non-heterodiegetic. I define this new model in another CSV file called keywords that looks like this:

enter image description here

As you can see, every column of the metadata becomes a row in the new model-keywords, where the old value is in the term_value column and the new value is in the new_model column.

So I need to map or replace this values using Pandas. This is what I have got for now:

import re
import pandas as pd

df_metadata = pd.read_csv("/metadata.csv", encoding="utf-8", sep=",")
df_keywords = pd.read_csv("/keywords.csv", encoding="utf-8", sep="\t")

for column_metadata,value_metadata in df_metadata.iteritems():

if str(column_metadata) in list(df_keywords.loc[:,"term_type"]):

df_metadata.loc[df_metadata[column_metadata] == value_metadata, column_metadata] = df_keywords.loc[df_keywords["term_value"] == value_metadata, ["new_model"]]


And Python always gives this error back:


"ValueError: Series lengths must match to compare"


I think the problem is in the value_metadata of the second part of the replace with loc, I mean here:

df_keywords.loc[df_keywords["term_value"] == value_metadata, ["new_model"]]


The thing I don't understand is why value_metadata works in the first part of this command but it doesn't in the second one...

Please, I would appreciate any help. Maybe there is a simpler way than iterate through the dataframes... I am very open to any suggestion. Best regards,
José

Answer

You can first create Multiindex in df_keywords for faster selecting by slicers and in loop map new value by old one:

df_keywords.set_index(['term_type','term_value'], inplace=True)

idx = pd.IndexSlice
#first maping in column narrative-perspective
print (df_keywords.loc[idx['narrative-perspective',:]].to_dict()['new_model'])
{'heterodiegetic': 'heterodiegetic', 'other/mixed': 'n-heterodiegetic', 
 'homodiegetic': 'n-heterodiegetic', 'autodiegetic': 'n-heterodiegetic'}

#column names for replacing    
L = ['narrative-perspective','narrator','protagonist-gender']
for col in L:
    df_metadata[col] = 
    df_metadata[col].map(df_keywords.loc[idx[col,:]].to_dict()['new_model'])

print (df_metadata)
     idno author-name narrative-perspective        narrator protagonist-gender
0  ne0001      Baroja      n-heterodiegetic    third-person               male
1  ne0002      Galdos        heterodiegetic    third-person             n-male
2  ne0003      Galdos      n-heterodiegetic    third-person               male
3  ne0004      Galdos      n-heterodiegetic    third-person             n-male
4  ne0005      Galdos        heterodiegetic    third-person             n-male
5  ne0006      Galdos        heterodiegetic    third-person               male
6  ne0007        Sawa        heterodiegetic    third-person             n-male
7  ne0008    Zamacois        heterodiegetic    third-person             n-male
8  ne0009      Galdos        heterodiegetic    third-person             n-male
9  ne0011      Galdos      n-heterodiegetic  n-third-person               male

Also to_dict can be omited and then map by Series:

df_keywords.set_index(['term_type','term_value'], inplace=True)
idx = pd.IndexSlice

#first maping in column narrative-perspective
print (df_keywords.loc[idx['narrative-perspective',:]]['new_model'])
term_value
autodiegetic      n-heterodiegetic
heterodiegetic      heterodiegetic
homodiegetic      n-heterodiegetic
other/mixed       n-heterodiegetic
Name: new_model, dtype: object

L = ['narrative-perspective','narrator','protagonist-gender']
for col in L:
    df_metadata[col] = df_metadata[col].map(df_keywords.loc[idx[col,:]]['new_model'])

print (df_metadata)
     idno author-name narrative-perspective        narrator protagonist-gender
0  ne0001      Baroja      n-heterodiegetic    third-person               male
1  ne0002      Galdos        heterodiegetic    third-person             n-male
2  ne0003      Galdos      n-heterodiegetic    third-person               male
3  ne0004      Galdos      n-heterodiegetic    third-person             n-male
4  ne0005      Galdos        heterodiegetic    third-person             n-male
5  ne0006      Galdos        heterodiegetic    third-person               male
6  ne0007        Sawa        heterodiegetic    third-person             n-male
7  ne0008    Zamacois        heterodiegetic    third-person             n-male
8  ne0009      Galdos        heterodiegetic    third-person             n-male
9  ne0011      Galdos      n-heterodiegetic  n-third-person               male