alvas alvas - 7 months ago 18
Python Question

How to extract rows in sframe where there's a joint condition and two separate conditions?

I have an

sframe
as such:

+---------+------+-------------------------------+-----------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+---------+------+-------------------------------+-----------+------------------+
| IATE-14 | ro | Agenție de aprovizionare | fullForm | 3 |
| IATE-84 | bg | компетенции на държави чле... | fullForm | 3 |
| IATE-84 | cs | příslušnost členských stát... | fullForm | 3 |
| IATE-84 | da | medlemsstatskompetence | fullForm | 3 |
| IATE-84 | de | Zuständigkeit der Mitglied... | fullForm | 3 |
| IATE-84 | el | αρμοδιότητα των κρατών μελ... | fullForm | 3 |
| IATE-84 | en | competence of the Member S... | fullForm | 3 |
| IATE-84 | es | competencias de los Estado... | fullForm | 3 |
| IATE-84 | et | liikmesriikide pädevus | fullForm | 3 |
| IATE-84 | fi | jäsenvaltioiden toimivalta | fullForm | 3 |
| IATE-84 | fr | compétence des États membres | fullForm | 3 |
| IATE-84 | ga | inniúlacht na mBallstát | fullForm | 3 |
| IATE-84 | hu | tagállami hatáskör | fullForm | 3 |
| IATE-84 | it | competenza degli Stati membri | fullForm | 3 |
| IATE-84 | lt | valstybių narių kompetencija | fullForm | 2 |
| IATE-84 | lv | dalībvalstu kompetence | fullForm | 3 |
| IATE-84 | nl | bevoegdheid van de lidstaten | fullForm | 3 |
| IATE-84 | pl | kompetencje państw członko... | fullForm | 3 |
| IATE-84 | pt | competência dos Estados-Me... | fullForm | 3 |
| IATE-84 | ro | competența statelor membre... | fullForm | 3 |
+---------+------+-------------------------------+-----------+------------------+


I need to extract all the rows where
lang == 'de' or lang == 'en'
but the rows I extract with
lang == 'en'
needs to have a corresponding
lang == 'de'
such that they share the same
term_id
.

I have been doing it as such with
graphlab
and
sframe
:

sf = gl.SFrame.read_csv('iate.csv', delimiter='\t', quote_char='\0', column_type_hints=[str,str,unicode,str,int])
de = sf[sf['lang'] == 'de']
de_termids = de['term_id']


and
de.print_rows(10)
:

+------------+------+-------------------------------+-----------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+------------+------+-------------------------------+-----------+------------------+
| IATE-84 | de | Zuständigkeit der Mitglied... | fullForm | 3 |
| IATE-290 | de | Schutz der öffentlichen Ge... | fullForm | 3 |
| IATE-662 | de | mengenmäßigen Ausfuhrbesch... | fullForm | 3 |
| IATE-801 | de | Eintragungshindernisse | fullForm | 2 |
| IATE-1326 | de | Sonderregelung für Reisebü... | fullForm | 4 |
| IATE-1702 | de | Erwerbslose | fullForm | 4 |
| IATE-2818 | de | Verwaltungsvorschriften | fullForm | 3 |
| IATE-21139 | de | frisches Obst und Gemüse | fullForm | 3 |
| IATE-21563 | de | chemische Erzeugnisse zur ... | fullForm | 3 |
| IATE-21564 | de | Mineralsäuren | fullForm | 3 |
+------------+------+-------------------------------+-----------+------------------+


And then:

en = sf[sf['lang'] == 'en']
en.print_rows(10)


[out]:

+------------+------+-------------------------------+--------------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+------------+------+-------------------------------+--------------+------------------+
| IATE-84 | en | competence of the Member S... | fullForm | 3 |
| IATE-254 | en | award of public works cont... | fullForm | 3 |
| IATE-290 | en | public health protection | fullForm | 3 |
| IATE-662 | en | quantitative restriction o... | fullForm | 3 |
| IATE-801 | en | grounds for refusal | fullForm | 2 |
| IATE-1299 | en | CEP | abbreviation | 3 |
| IATE-1326 | en | special scheme for travel ... | fullForm | 3 |
| IATE-2818 | en | regulations | fullForm | 3 |
| IATE-7128 | en | company name | fullForm | 2 |
| IATE-21139 | en | fresh fruits and vegetables | fullForm | 3 |
+------------+------+-------------------------------+--------------+------------------+


I have tried:

en_de = en[en['term_id'] in de_termids]


But I'm getting the syntax wrong, giving me this error:

---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-12-9656091794b8> in <module>()
1 en = sf[sf['lang'] == 'en']
----> 2 en_de = en[en['term_id'] in de_termids]

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __contains__(self, item)
691
692 """
--> 693 return (self == item).any()
694
695 def contains(self, item):

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __eq__(self, other)
973 return SArray(_proxy = self.__proxy__.vector_operator(other.__proxy__, '=='))
974 else:
--> 975 return SArray(_proxy = self.__proxy__.left_scalar_operator(other, '=='))
976
977

/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Array size mismatch


How should I filter the sframe such that I get rows with
en
and
de
and corresponding
term_id
?


The resulting dataframe should look something like this:

+---------+-----------------+-------------+
| term_id | term_str_en | term_str_de |
+---------+-------------------------------+
| IATE-999 | something | etwas |
...
+---------+-----------------+-------------+


How do I do the same with
pandas
?

Answer

Suppose you already have two data frames with filtered data for both languages: df_en and df_de. Then you can merge them:

new_df = pd.merge(df_en[['term_id','term_str']], df_de[['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))

The method inner takes care of skipping all unmatched rows. You can find more options to merge in pandas docs and refs

Edit

The same result without creating two data frames (df is the original data frame with all entries, possibly containing other languages too):

new_df = pd.merge(df.loc[df['lang']=='en',['term_id','term_str']], df.loc[df['lang']=='de',['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))