
Filtering out rows in PySpark when a word from another table appears in any column

I am new to PySpark and want to write a query like

select * from table1 where column like '%word1%'
which is how I would write it in SQL or Hive.

So I tried the following command,

data = sqlCtx.sql('select * from table1 where column like '%word1%')

But I am getting this error:

NameError: name 'word1' is not defined

I am ideally thinking of a condition where

select word_name from table2;

gives a list of words, and whenever any of those words occurs in any column of table1, I want to filter out those rows and put the remaining rows into a dataframe.

Can anybody help me with this?



Well, the "like" function works just fine in PySpark, exactly as in SQL, with both the DataFrame API and the SQL API. The NameError in your attempt comes from the mismatched quotes: the string literal ends at the quote just before %word1%, so Python parses %word1% as a modulo expression on an undefined name word1. Switching the outer quotes to double quotes fixes it:
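data = sqlCtx.sql("select * from table1 where column like '%word1%'")

Examples on a sample dataset: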

import statsmodels.api as sm

# load a small sample dataset and turn it into a Spark DataFrame
duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
df = sqlContext.createDataFrame(duncan_prestige.data.reset_index())
df.limit(4).toPandas()

        index  type  income  education  prestige
0  accountant  prof      62         86        82
1       pilot  prof      72         76        83
2   architect  prof      75         92        90
3      author  prof      55         90        76

DataFrame API:

df.where(df['index'].like('%ilo%')).toPandas()

   index  type  income  education  prestige
0  pilot  prof      72         76        83

Or with SQL (after registering the dataframe as a temp table):

df.registerTempTable('df')
sqlContext.sql("select * from df d where d.index like '%ilo%'").toPandas()

And with a join (silly, but it proves the point):

qry = """
select d1.*
from df d1 join df d2
    on (d1.index = d2.index)
where d1.index like '%ilo%' and d2.index like concat('%', d1.index, '%')
"""
sqlContext.sql(qry).toPandas()
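
As for the actual goal in the question, filtering out rows of table1 that contain any word from table2 in any column, here is a minimal sketch. It assumes table2 has a word_name column, that the word list is small enough to collect to the driver, and that the table1 columns being checked are strings:

from functools import reduce
from pyspark.sql import functions as F

# pull the words out of table2 (assumes the list is small)
words = [r.word_name for r in sqlContext.sql("select word_name from table2").collect()]

table1 = sqlContext.table("table1")

# one condition per (column, word) pair: True when that column contains that word
conds = [F.col(c).like('%{}%'.format(w)) for c in table1.columns for w in words]

# a row matches if any word occurs in any column; keep the rows that match none
matches_any = reduce(lambda a, b: a | b, conds)
result = table1.where(~matches_any)

One caveat: like on a null column yields null, so rows with nulls in the checked columns can be dropped as well; coalesce those columns to '' first if that matters.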