Java Question

How to process MultiLine input log file in Spark using Java

I am new to Spark and it seems very confusing to me. I have gone through the Spark documentation for the Java API but couldn't figure out a way to solve my problem.
I have to process a log file in Spark with Java and have very little time left for it. Below is a sample of the log file; it contains device records (device id, description, ip address, status) that span multiple lines.
It also contains other log information that I am not interested in.
How can I extract the device information from this huge log file?
Any help is much appreciated.

Input Log Data:

!
!

!
device AGHK75
description "Optical Line Terminal"
ip address 1.11.111.12/10
status "FAILED"
!
device AGHK78
description "Optical Line Terminal"
ip address 1.11.111.12/10
status "ACTIVE"
!

!
context local
!
no ip domain-lookup
!
interface IPA1_A2P_1_OAM
description To_A2P_1_OAM
ip address 1.11.111.12/10
propagate qos from ip class-map ip-to-pd
!
interface IPA1_OAM_loopback loopback
description SE1200_IPA-1_OAM_loopback
ip address 1.11.111.12/10
ip source-address telnet snmp ssh radius tacacs+ syslog dhcp-server tftp ftp icmp-dest-unreachable icmp-time-exceed netop flow-ip


What I have done so far is:

Java Code

JavaRDD<String> logData = sc.textFile("logFile").cache();
List<String> deviceRDD = logData.filter(new Function<String, Boolean>() {
    // Stateful flag: true while we are inside a device block.
    Boolean check = false;

    public Boolean call(String s) {
        if (s.contains("device")
                || (check && (s.contains("description") || s.contains("ip address")))) {
            check = true;
        } else if (check && s.contains("status")) {
            // End of a device block: emit this line and reset the flag.
            check = false;
            return true;
        } else {
            check = false;
        }
        return check;
    }
}).collect();


Current Output:


device AGHK75
description "Optical Line Terminal"
ip address 1.11.111.12/10
status "FAILED"
device AGHK78
description "Optical Line Terminal"
ip address 1.11.111.12/10
status "ACTIVE"


Expected Output:

AGHK75,"Optical Line Terminal",1.11.111.12/10,"FAILED"
AGHK78,"Optical Line Terminal",1.11.111.12/10,"ACTIVE"

Answer

You can use sc.wholeTextFiles("logFile") to read the data as key/value pairs, where the key is the file name and the value is the file's entire contents.

Then you can split that contents string on the "!" delimiter that marks the start and end of each record, flatMap the pieces into an RDD of single-record strings, and filter it to keep only records whose first word is device.

Finally, extract the fields from each record with a map.

Please try it and let me know whether this logic works for you.

Added code in Spark Scala:

val ipData = sc.wholeTextFiles("abc.log")
// Split each file's contents on the "!" record delimiter and
// keep only the records that describe a device.
val ipSingleLog = ipData.flatMap(x => x._2.split("!")).filter(x => x.trim.startsWith("device"))
// Pull the four fields out of each device record.
val logData = ipSingleLog.map(x => {
  val rowData = x.split("\n")
  var device = ""
  var description = ""
  var ipAddress = ""
  var status = ""
  for (data <- rowData) {
    val line = data.trim
    if (line.startsWith("device")) {
      device = line.stripPrefix("device").trim
    } else if (line.startsWith("description")) {
      description = line.stripPrefix("description").trim
    } else if (line.startsWith("ip address")) {
      ipAddress = line.stripPrefix("ip address").trim
    } else if (line.startsWith("status")) {
      status = line.stripPrefix("status").trim
    }
  }
  (device, description, ipAddress, status)
})
logData.foreach(println)
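
Since the question asks for Java, here is a minimal sketch of the same pipeline using the Java API. It is an untested translation under a couple of assumptions: Spark 2.x's Java API (where FlatMapFunction returns an Iterator; on 1.x you would return an Iterable instead) and Java 8 lambdas; the class name DeviceLogParser is just a placeholder. It also joins the fields with commas to match the expected output.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class DeviceLogParser {
    public static void main(String[] args) {
        // setMaster("local[*]") is just for local testing.
        SparkConf conf = new SparkConf().setAppName("DeviceLogParser").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // (fileName, fileContents) pairs; each file arrives as one string.
        JavaPairRDD<String, String> ipData = sc.wholeTextFiles("abc.log");

        // Split on the "!" record delimiter, keep only device records.
        JavaRDD<String> ipSingleLog = ipData
                .flatMap(pair -> Arrays.asList(pair._2().split("!")).iterator())
                .filter(rec -> rec.trim().startsWith("device"));

        // Pull out the four fields and join them in the expected CSV shape.
        JavaRDD<String> logData = ipSingleLog.map(rec -> {
            String device = "", description = "", ipAddress = "", status = "";
            for (String raw : rec.split("\n")) {
                String line = raw.trim();
                if (line.startsWith("device")) {
                    device = line.substring("device".length()).trim();
                } else if (line.startsWith("description")) {
                    description = line.substring("description".length()).trim();
                } else if (line.startsWith("ip address")) {
                    ipAddress = line.substring("ip address".length()).trim();
                } else if (line.startsWith("status")) {
                    status = line.substring("status".length()).trim();
                }
            }
            return device + "," + description + "," + ipAddress + "," + status;
        });

        // Note: on a cluster, foreach prints on the executors; collect the
        // RDD to the driver first if you want the output locally.
        logData.foreach(line -> System.out.println(line));
    }
}

With the sample input above, this should print the two expected CSV lines.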