saeed talaee, 4 years ago
Scala Question

How to remove the trailing number from a string column in a DataFrame in Scala

I am reading a text file in Scala and have the following row:

05:49:56.604899 00:00:00:00:00:02 > 00:00:00:00:00:03, ethertype IPv4 (0x0800), length 10202: 10.0.0.2.54880 > 10.0.0.3.5001: Flags [.], seq 3641977583:3641987719, ack 129899328, win 58, options [nop,nop,TS val 432623 ecr 432619], length 10136


I extracted the following columns from it:

+---------------+--------------+--------------+-----+-----+
| time_stamp_0| sender_ip_1| receiver_ip_2|label|count|
+---------------+--------------+--------------+-----+-----+
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001| 1| 19|


Here is my code:

import scala.util.Try
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val customSchema = StructType(Array(
  StructField("time_stamp_0", StringType, true),
  StructField("sender_ip_1", StringType, true),
  StructField("receiver_ip_2", StringType, true),
  StructField("label", IntegerType, true)))

// Build the training DataFrame (assumes an existing SparkContext `sc` and SparkSession `session`)
val Dstream_Train = sc.textFile("/Users/saeedtkh/Desktop/sharedsaeed/Test/trace1.txt")
val Row_Dstream_Train = Dstream_Train.map(line => line.split(">")).map(array => {
  val first = Try(array(0).trim.split(" ")(0)) getOrElse ""                    // timestamp
  val second = Try(array(1).trim.split(" ")(6)) getOrElse ""                   // sender IP with port
  val third = Try(array(2).trim.split(" ")(0).replace(":", "")) getOrElse ""   // receiver IP with port
  Row.fromSeq(Seq(first, second, third, 1))
})
val Frist_Dataframe = session.createDataFrame(Row_Dstream_Train, customSchema).toDF("time_stamp_0", "sender_ip_1", "receiver_ip_2", "label")
val columns1and2 = Window.partitionBy("sender_ip_1", "receiver_ip_2") // <-- matches groupBy


My problem is I need to extract the sender_ip_1 and receiver_ip_2 columns like this:

+---------------+--------------+--------------+-----+-----+
| time_stamp_0| sender_ip_1| receiver_ip_2|label|count|
+---------------+--------------+--------------+-----+-----+
|05:49:56.604899|10.0.0.2 | 10.0.0.3 | 1| 19|


That is, I need to drop the last number from each IP address. The number is not constant; it varies from row to row.

Can you help me?

Answer Source

The easiest way to do this in your example is to strip the trailing port number inside your map lambda, for example (output from a Scala shell):

scala> val stringToTrim = "255.255.255.255.1"
stringToTrim: String = 255.255.255.255.1

scala> stringToTrim.take(stringToTrim.lastIndexOf("."))
res8: String = 255.255.255.255
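
One caveat with `take(lastIndexOf("."))`: if the string contains no dot, `lastIndexOf` returns -1 and `take` yields an empty string, and if the value is a plain IPv4 address without a port, the last octet is removed. A slightly more defensive variant could look like the sketch below. The names `PortUtil` and `dropPort` are hypothetical and not part of the original answer, and the sketch assumes the input is always in the `a.b.c.d.port` shape shown in the question.

```scala
object PortUtil {
  // Hypothetical helper: strip the fifth dot-separated component only
  // when the value really has five of them (IPv4 address plus port).
  // Anything else is returned unchanged.
  def dropPort(addr: String): String = {
    val parts = addr.split('.')
    if (parts.length == 5) parts.take(4).mkString(".") else addr
  }
}
```

With this helper, `PortUtil.dropPort("10.0.0.2.54880")` gives `10.0.0.2`, while a value that is already a bare address such as `10.0.0.3` is left alone.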

In your case, for example, you would replace second and third (the two IP columns; first holds the timestamp) in your row as follows:

val secondFixed = second.take(second.lastIndexOf("."))
val thirdFixed = third.take(third.lastIndexOf("."))
Row.fromSeq(Seq(first, secondFixed, thirdFixed, 1))
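
To check the idea without starting Spark, the parsing can be tried on the sample line from the question in plain Scala. This is only a sketch: `StripPortExample`, `dropLastSegment`, and `parse` are hypothetical names, and it assumes the sender and receiver addresses sit at the same positions as in the question's lambda. The timestamp also contains a dot, so only the two address fields are trimmed.

```scala
import scala.util.Try

object StripPortExample {
  // Sample line copied from the question.
  val line: String =
    "05:49:56.604899 00:00:00:00:00:02 > 00:00:00:00:00:03, ethertype IPv4 (0x0800), length 10202: 10.0.0.2.54880 > 10.0.0.3.5001: Flags [.], seq 3641977583:3641987719, ack 129899328, win 58, options [nop,nop,TS val 432623 ecr 432619], length 10136"

  // Drop everything from the last '.' onward; yields "" if there is no '.'.
  def dropLastSegment(s: String): String = s.take(s.lastIndexOf("."))

  // Returns (timestamp, sender IP without port, receiver IP without port).
  def parse(line: String): (String, String, String) = {
    val array  = line.split(">")
    val first  = Try(array(0).trim.split(" ")(0)).getOrElse("")
    val second = Try(array(1).trim.split(" ")(6)).getOrElse("")
    val third  = Try(array(2).trim.split(" ")(0).replace(":", "")).getOrElse("")
    (first, dropLastSegment(second), dropLastSegment(third))
  }
}
```

Calling `StripPortExample.parse(StripPortExample.line)` returns the timestamp unchanged together with `10.0.0.2` and `10.0.0.3`, which matches the table the question asks for.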