Scala Question

Create a subset of a Dataframe

Suppose I have a DataFrame of email addresses like this, loaded from Hive:

email_address     user_id

test@test.com     2134
null              2133
test4@test.com    2132
test5@test.com    21
test6@test.com    213
test7@test.com    21388
null              22
null              2134

I want to create two DataFrames: one containing all the user_ids whose email is not null, and another containing all the user_ids whose email is null. Something like this:

First Dataframe:

test@test.com     2134
test4@test.com    2132
test5@test.com    21
test6@test.com    213
test7@test.com    21388

Second Dataframe:

null              22
null              2134
null              2133

I have this code below:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sparkConf = new SparkConf().setAppName("YOUR_APP_NAME").setMaster("local[10]")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
val hiveContext = new HiveContext(sc)

hiveContext.setConf("hive.metastore.uris", "METASTORE_URI_NAME_HERE")

val df = hiveContext.sql("SELECT email,user_id FROM USERS")

df.map { row =>
  // isNullAt avoids the NullPointerException that getString(0).length throws on a null email
  if (!row.isNullAt(0)) {
    // ADD INTO "First Dataframe"
    // row.getString(0) = email, row.getString(1) = user_id
  } else {
    // ADD INTO "Second Dataframe"
    // row.getString(0) = email, row.getString(1) = user_id
  }
}

I am not sure whether I need to create a whole new DataFrame, or how I would do that in the first place. Any pointers?

Answer

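Using the DataFrame's built-in column functions is probably easier than mapping over the rows in this case. Filter once on the email column being not null and once on it being null. The column name must match the one in df: email_address in the sample table above, or email if you keep the SELECT from the question as written.

import org.apache.spark.sql.functions.col

val df_no_nulls = df.where(col("email_address").isNotNull)

val df_nulls = df.where(col("email_address").isNull)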
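
To make the filtering approach concrete, here is a minimal self-contained sketch that rebuilds the sample data from the question in memory and splits it on the email column. It assumes Spark 2.x or later (it uses SparkSession rather than the HiveContext from the question) and a local master. The object name SplitByNullEmail and the helper splitByNull are hypothetical names chosen for illustration, not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SplitByNullEmail {

  // Hypothetical helper (not a Spark API): returns a pair of DataFrames,
  // first the rows where `column` is not null, then the rows where it is null.
  def splitByNull(df: DataFrame, column: String): (DataFrame, DataFrame) =
    (df.where(col(column).isNotNull), df.where(col(column).isNull))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session stands in for the Hive-backed context.
    val spark = SparkSession.builder()
      .appName("split-by-null-email")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows from the question; Option models the nullable email column.
    val rows: Seq[(Option[String], Int)] = Seq(
      (Some("test@test.com"), 2134),
      (None, 2133),
      (Some("test4@test.com"), 2132),
      (Some("test5@test.com"), 21),
      (Some("test6@test.com"), 213),
      (Some("test7@test.com"), 21388),
      (None, 22),
      (None, 2134)
    )
    val df = rows.toDF("email_address", "user_id")

    val (withEmail, withoutEmail) = splitByNull(df, "email_address")

    withEmail.show()    // rows whose email_address is set
    withoutEmail.show() // rows whose email_address is null

    spark.stop()
  }
}
```

Both results are ordinary DataFrames, so there is no need to build new ones row by row. Each filter is lazy and only describes a view over df, so if df is expensive to compute and both halves are used, calling df.cache() before splitting may avoid reading the Hive table twice.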