ML_Passion - 3 months ago
Python Question

Cast a very long string as an integer or Long Integer in PySpark

I'm working with a string column which is 38 characters long and is actually numerical.

e.g. id = '678868938393937838947477478778877.....' (38 characters long).

How do I cast it into a long integer? I have tried the cast function with IntegerType, LongType and DoubleType, and when I try to show the column it yields nulls.

The reason I want to do this is that I need to do some inner joins using this column, and doing it as a string is giving me Java heap space errors.

Any suggestions on how to cast it as a long integer?

Answer

Long story short, you simply can't. A Spark DataFrame is a JVM object which uses the following type mapping:

  • IntegerType -> Integer with Integer.MAX_VALUE equal to 2 ** 31 - 1
  • LongType -> Long with Long.MAX_VALUE equal to 2 ** 63 - 1
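The mismatch is easy to see in plain Python, which (unlike the JVM) has arbitrary-precision integers. This is a minimal sketch, not Spark code: a 38-digit value is far larger than Long.MAX_VALUE, which is why the cast comes back as null.

```python
# Plain-Python sketch (no Spark needed) of why a 38-digit id
# cannot fit in Spark's LongType, which maps to a JVM Long.
LONG_MAX = 2 ** 63 - 1          # java.lang.Long.MAX_VALUE
big_id = int("9" * 38)          # a 38-digit number, like the id column

print(len(str(LONG_MAX)))       # Long.MAX_VALUE has only 19 digits
print(big_id > LONG_MAX)        # True: the value overflows LongType
```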

You could try to use DecimalType with the maximum allowed precision (38):

from pyspark.sql.functions import col

df = sc.parallelize([("9" * 38, "9" * 39)]).toDF(["x", "y"])
df.select(col("x").cast("decimal(38, 0)")).show(1, False)

## +--------------------------------------+
## |x                                     |
## +--------------------------------------+
## |99999999999999999999999999999999999999|
## +--------------------------------------+

With larger numbers you can cast to double, but not without a loss of precision:

df.select(
    col("y").cast("decimal(38, 0)"), col("y").cast("double")).show(1, False)

## +----+------+
## |y   |y     |
## +----+------+
## |null|1.0E39|
## +----+------+
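The same loss of precision can be sketched in plain Python, whose float is the same IEEE-754 double that Spark's double type uses: doubles carry only about 15-17 significant decimal digits, so a 39-digit value is rounded.

```python
# Plain-Python sketch: a 39-digit integer exceeds decimal(38, 0)
# and gets rounded when forced into an IEEE-754 double.
y = int("9" * 39)               # 39 nines, one digit too many for decimal(38, 0)

as_double = float(y)            # lossy: rounds to roughly 1.0E39
print(as_double)
print(int(as_double) == y)      # False: precision was lost in the conversion
```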

That being said, casting to numeric types won't help you with memory errors.
