flybonzai - 6 months ago
Python Question

How to pull the slice of an array in Spark SQL (Dataframes)?

I have a column full of arrays containing split HTTP request paths. I have them filtered down to one of two possibilities:

|[, courses, 27381...|
|[, courses, 27547...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, courses, 33287...|
|[, courses, 24024...|


In both array-types, from 'courses' onward is the same data and structure.

I want to take the slice of the array using a case statement: if the first element of the array is 'api', then take elements 3 -> end of the array; if it's not 'api', just keep the given value. I've tried using Python slice syntax [3:], and normal PostgreSQL syntax [3:n], where n is the length of the array.

My ideal end-result would be an array where every row shares the same structure, with courses in the first index for easier parsing from that point onwards.

Answer

It's very easy: just define a UDF. You asked a very similar question previously, so I won't post the exact answer, to let you think and learn (for your own good).

from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import ArrayType, StringType

df = sc.parallelize([(["ab", "bs", "xd"],), (["bc", "cd", ":x"],)]).toDF()

# Return type defaults to StringType, so declare ArrayType to keep an array column.
getUDF = udf(lambda x, y: x[1:] if x[y] == "ab" else x, ArrayType(StringType()))

df.select(getUDF(col("_1"), lit(0))).show()

+------------------------+
|PythonUDF#<lambda>(_1,0)|
+------------------------+
|                [bs, xd]|
|            [bc, cd, :x]|
+------------------------+
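Applied to the question's data, the same pattern could look like the sketch below. It assumes the request path was split on "/", so element 0 is an empty string and 'api' or 'courses' sits at index 1; the column name "request_parts" is hypothetical, so adjust it to your schema. Slicing [3:] on api rows and [1:] on the rest leaves 'courses' in the first index either way.

```python
# Sketch of the slicing rule on plain Python lists (the UDF body).
def normalize(parts):
    # api rows:   ['', 'api', 'v1', 'courses', ...] -> drop the first three elements
    # other rows: ['', 'courses', ...]              -> drop only the empty prefix
    return parts[3:] if parts[1] == "api" else parts[1:]

# To use it in Spark, wrap it in a UDF with an explicit array return type
# ("request_parts" is a hypothetical column name):
# from pyspark.sql.functions import udf, col
# from pyspark.sql.types import ArrayType, StringType
# normalize_udf = udf(normalize, ArrayType(StringType()))
# df.select(normalize_udf(col("request_parts"))).show()
```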