thinkinbee thinkinbee - 1 year ago 193
Scala Question

selecting a range of elements in an array spark sql

I use Spark-shell to do the below operations

Recently loaded a table with an array column in spark-sql .

Here is the ddl for the same:

create table test_emp_arr{
dept_id string,
dept_nm string,
emp_details Array<string>

the data looks something like this

|dept_id|dept_nm| emp_details|
| 10|Finance|[Jon, Snow, Castle, Black, Ned]|
| 20| IT| [Ned, is, no, more]|

i can query the emp_details column something like this :

sqlContext.sql("select emp_details[0] from emp_details").show


I want to query a range of elements in the collection :

Expected query to work

sqlContext.sql("select emp_details[0-2] from emp_details").show


sqlContext.sql("select emp_details[0:2] from emp_details").show

Expected output

| emp_details|
|[Jon, Snow, Castle]|
| [Ned, is, no]|

In pure scala if i have an array something as :

val emp_details = Array("Jon","Snow","Castle","Black")

i can get the elements from 0 to 2 range using


returns me

Array(Jon, Snow,Castle)

I am not able to apply the above operation of the array in spark-sql . any help ?


Answer Source

Here is a solution using a User Defined Function which has the advantage of working for any slice size you want. It simply builds a UDF function around the scala builtin slice method :

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val slice = udf((array : Seq[String], from : Int, to : Int) => array.slice(from,to))

Example with a sample of your data :

val df = sqlContext.sql("select array('Jon', 'Snow', 'Castle', 'Black', 'Ned') as emp_details")
df.withColumn("slice", slice($"emp_details", lit(0), lit(3))).show

Produces the expected output

|         emp_details|              slice|
|[Jon, Snow, Castl...|[Jon, Snow, Castle]|

You can also register the UDF in your sqlContext and use it like this

sqlContext.udf.register("slice", (array : Seq[String], from : Int, to : Int) => array.slice(from,to))
sqlContext.sql("select array('Jon','Snow','Castle','Black','Ned'),slice(array('Jon‌​','Snow','Castle','Black','Ned'),0,3)")

You won't need lit anymore with this solution

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download