
Filtering out rows in PySpark when a word from another table appears in any column

I am new to PySpark and I want to write a query like

select * from table1 where column like '%word1%'

as we would write it in SQL or Hive.

I am writing the following command,

data = sqlCtx.sql('select * from table1 where column like '%word1%')


But I am getting an error:

NameError: name 'word1' is not defined


Ideally, I am thinking of a condition like

select word_name from table2;

which would give a list of words; whenever any of those words occurs in any column of table1, I want to filter out those rows and place the remaining rows in a dataframe.

Can anybody help me in doing this?

Thanks

Answer

Well, "like" function works in pyspark just fine and just like in SQL. With DataFrame API and with SQL API. Examples:

import statsmodels.api as sm

# Build a small demo DataFrame from the "Duncan" dataset shipped with statsmodels
duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
df = sqlContext.createDataFrame(duncan_prestige.data.reset_index())

        index  type  income  education  prestige
0  accountant  prof      62         86        82
1       pilot  prof      72         76        83
2   architect  prof      75         92        90
3      author  prof      55         90        76

DataFrame API:

df.filter(df['index'].like('%ilo%')).toPandas()

   index  type  income  education  prestige
0  pilot  prof      72         76        83

Or with SQL:

df.registerTempTable('df')
sqlContext.sql("select * from df d where d.index like '%ilo%' ").toPandas()

And with a join (silly, but it proves the point):

qry = """
select d1.* 
from df d1 join df d2 
    on ( d1.index  = d2.index)
where d1.index like '%ilo%' and d2.index like concat('%', d1.index , '%')
"""
sqlContext.sql(qry).toPandas()
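To do what the question actually asks, filtering out rows of table1 that contain any word from table2 in any column, one option is to collect the word list and OR together one like-condition per (column, word) pair. A rough sketch; table1, table2, and word_name come from the question, everything else is an assumption (in particular, that the word list in table2 is small enough to collect to the driver and non-empty):

from functools import reduce
from pyspark.sql import functions as F

# Collect the words from table2 (assumes a small, non-empty word list)
words = [r.word_name for r in
         sqlContext.sql('select word_name from table2').collect()]

t1 = sqlContext.table('table1')

# True if any column contains any word; note that a null column value
# yields null, so such rows are filtered out below as well
contains_any = reduce(
    lambda a, b: a | b,
    [F.col(c).like('%{}%'.format(w)) for c in t1.columns for w in words]
)

# Keep only the rows where no word matched in any column
remaining = t1.filter(~contains_any)

With many columns or a long word list this expands into a large expression; a single rlike per column with an escaped, '|'-joined pattern is a more compact alternative.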