user2205916 user2205916 - 1 month ago 13
Scala Question

How do you break up long SQL query into multiple lines in Scala?

Given the following situation:

val sqlTransformer_labeled = new SQLTransformer().setStatement("""SELECT *, CAST(var1 AS DOUBLE) AS label, var2 - var3 - var4 AS newvar1, var5 - var6 AS new var2""")


How does one break up a long SQL query into multiple lines? I want something like the following for easier readability:

val sqlTransformer_labeled = new SQLTransformer().setStatement("""SELECT *,
CAST(var1 AS DOUBLE) AS label,
var2 - var3 - var4 AS newvar1,
var5 - var6 AS new var2""")


Readable code will be helpful both for myself 6 months from now, as well as for collaborators, as well as for our iterative modeling workflow.

However, attempts to run the latter, more legible version of the code above, gives strange feedback, shown below:

scala> val sqlTransformer_labeled = new SQLTransformer().setStatement("""SELECT *,
|
<init> DRIVER_IDENTIFIER LEGACY_DRIVER_IDENTIFIER Pipeline PipelineModel RDD_SCOPE_KEY RDD_SCOPE_NO_OVERRIDE_KEY
RandomForestClassifier SPARK_JOB_DESCRIPTION SPARK_JOB_GROUP_ID SPARK_JOB_INTERRUPT_ON_CANCEL SQLTransformer StringIndexer StringToColumn
VectorAssembler _sqlContext abs acos add_months approxCountDistinct array
array_contains asc ascii asin atan atan2 avg
base64 bin bitwiseNOT boolToBoolWritable booleanWritableConverter broadcast bytesToBytesWritable
bytesWritableConverter callUDF callUdf cbrt ceil classOf clearActiveContext
clone coalesce col collect_list collect_set column concat
concat_ws conv corr cos cosh count countDistinct
crc32 cumeDist cume_dist current_date current_timestamp date_add date_format
date_sub datediff dayofmonth dayofyear decode denseRank dense_rank
desc doubleRDDToDoubleRDDFunctions doubleToDoubleWritable doubleWritableConverter encode eq equals
exp explode expm1 expr factorial finalize first
floatToFloatWritable floatWritableConverter floor format_number format_string from_unixtime from_utc_timestamp
getClass getOrCreate get_json_object greatest hashCode hex hh_features
hh_split hh_timeline_pair_luid hour hypot initcap inputFileName input_file_name
instr intRddToDataFrameHolder intToIntWritable intWritableConverter isNaN isTraceEnabled isnan
isnull jarOfClass jarOfObject json_tuple kurtosis lag last
last_day lead least length levenshtein lit localSeqToDataFrameHolder
localSeqToDatasetHolder locate log log10 log1p log2 logDebug
logError logInfo logName logTrace logWarning longRddToDataFrameHolder longToLongWritable
longWritableConverter lower lpad ltrim markPartiallyConstructed max md5
mean min minute model modeling_stack monotonicallyIncreasingId monotonically_increasing_id
month months_between nanvl ne negate newBooleanEncoder newByteEncoder
newDoubleEncoder newFloatEncoder newIntEncoder newLongEncoder newProductEncoder newShortEncoder newStringEncoder
next_day not notify notifyAll ntile numDriverCores numericRDDToDoubleRDDFunctions
percentRank percent_rank pipeline pmod pow prediction quarter
rand randn rank rddToAsyncRDDActions rddToDataFrameHolder rddToDatasetHolder rddToOrderedRDDFunctions
rddToPairRDDFunctions rddToSequenceFileRDDFunctions regexp_extract regexp_replace repeat reverse rf
rint round rowNumber row_number rpad rtrim sc
second setActiveContext sha1 sha2 shiftLeft shiftRight shiftRightUnsigned
signum sin sinh size skewness sort_array soundex
sparkPartitionId spark_partition_id split sql sqlContext sqlTransformer sqlTransformer_labeled
sqrt stddev stddev_pop stddev_samp stringIndexer stringRddToDataFrameHolder stringToText
stringWritableConverter struct substring substring_index sum sumDistinct symbolToColumn
synchronized tan tanh toDegrees toRadians toString to_date
to_utc_timestamp translate trim trunc tstack udf unbase64
unhex unix_timestamp updatedConf upper var_pop var_samp variance
wait weekofyear when writableWritableConverter year

| cast(same_variable as double) as label,
|
<init> DRIVER_IDENTIFIER LEGACY_DRIVER_IDENTIFIER Pipeline PipelineModel RDD_SCOPE_KEY RDD_SCOPE_NO_OVERRIDE_KEY
RandomForestClassifier SPARK_JOB_DESCRIPTION SPARK_JOB_GROUP_ID SPARK_JOB_INTERRUPT_ON_CANCEL SQLTransformer StringIndexer StringToColumn
VectorAssembler _sqlContext abs acos add_months approxCountDistinct array
array_contains asc ascii asin atan atan2 avg
base64 bin bitwiseNOT boolToBoolWritable booleanWritableConverter broadcast bytesToBytesWritable
bytesWritableConverter callUDF callUdf cbrt ceil classOf clearActiveContext
clone coalesce col collect_list collect_set column concat
concat_ws conv corr cos cosh count countDistinct
crc32 cumeDist cume_dist current_date current_timestamp date_add date_format
date_sub datediff dayofmonth dayofyear decode denseRank dense_rank
desc doubleRDDToDoubleRDDFunctions doubleToDoubleWritable doubleWritableConverter encode eq equals
exp explode expm1 expr factorial finalize first
floatToFloatWritable floatWritableConverter floor format_number format_string from_unixtime from_utc_timestamp
getClass getOrCreate get_json_object greatest hashCode hex hh_features
hh_split hh_timeline_pair_luid hour hypot initcap inputFileName input_file_name
instr intRddToDataFrameHolder intToIntWritable intWritableConverter isNaN isTraceEnabled isnan
isnull jarOfClass jarOfObject json_tuple kurtosis lag last
last_day lead least length levenshtein lit localSeqToDataFrameHolder
localSeqToDatasetHolder locate log log10 log1p log2 logDebug
logError logInfo logName logTrace logWarning longRddToDataFrameHolder longToLongWritable
longWritableConverter lower lpad ltrim markPartiallyConstructed max md5
mean min minute model modeling_stack monotonicallyIncreasingId monotonically_increasing_id
month months_between nanvl ne negate newBooleanEncoder newByteEncoder
newDoubleEncoder newFloatEncoder newIntEncoder newLongEncoder newProductEncoder newShortEncoder newStringEncoder
next_day not notify notifyAll ntile numDriverCores numericRDDToDoubleRDDFunctions
percentRank percent_rank pipeline pmod pow prediction quarter
rand randn rank rddToAsyncRDDActions rddToDataFrameHolder rddToDatasetHolder rddToOrderedRDDFunctions
rddToPairRDDFunctions rddToSequenceFileRDDFunctions regexp_extract regexp_replace repeat reverse rf
rint round rowNumber row_number rpad rtrim sc
second setActiveContext sha1 sha2 shiftLeft shiftRight shiftRightUnsigned
signum sin sinh size skewness sort_array soundex
sparkPartitionId spark_partition_id split sql sqlContext sqlTransformer sqlTransformer_labeled
sqrt stddev stddev_pop stddev_samp stringIndexer stringRddToDataFrameHolder stringToText
stringWritableConverter struct substring substring_index sum sumDistinct symbolToColumn
synchronized tan tanh toDegrees toRadians toString to_date
to_utc_timestamp translate trim trunc tstack udf unbase64
unhex unix_timestamp updatedConf upper var_pop var_samp variance
wait weekofyear when writableWritableConverter year


etc.

Answer

The issue here is that your SQL script contains tabs, which triggers tab completion in the REPL. Remove the tabs from your script and it should work.

Alternatively, if you have no use for tab completion, you could try using this answer to disable it completely.