Knows Not Much - 2 months ago
Scala Question

Scala Processing a file in batches

I have a flat file that contains several million lines like the one below:

59, 254, 2016-09-09T00:00, 1, 6, 3, 40, 18, 0


I want to process this file in batches of X rows at a time, so I wrote this code:

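import scala.io.Source

// Reads the file lazily and joins every x lines into one comma-separated string.
def func(x: Int) = {
  for {
    batches <- Source.fromFile("./foo.txt").getLines().sliding(x, x)
  } yield batches.map("(" + _ + ")").mkString(",")
}
func(2).foreach(println)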


This code produces exactly the output I want. The function walks through the entire file, taking 2 rows at a time and batching them into 1 string.

(59, 828, 2016-09-09T00:00, 0, 8, 2, 52, 0, 0),(59, 774, 2016-09-09T00:00, 0, 10, 2, 51, 0, 0)


But when I see Scala pros write code, everything happens inside the for comprehension, and they just return the last thing from the comprehension.

So, in order to be a Scala pro, I changed my code:

for {
  batches <- Source.fromFile("./foo.txt").getLines().sliding(2, 2)
  line <- batches.map("(" + _ + ")").mkString(",")
} yield line


This produces 1 character per line, not the output I expected. Why did the code's behavior change completely? On reading, at least, the two versions look the same to me.

dhg
Answer

In the generator line <- batches.map("(" + _ + ")").mkString(","), the right-hand side is a String (the result of mkString), and <- iterates over it. Iterating over a String yields its individual characters, so line is bound to one Char at a time. What you actually want is not to iterate over that string but to bind the whole string to the name line, which you can do by replacing <- with =:

line = batches.map("(" + _ + ")").mkString(",")
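For reference, here is a minimal sketch of the corrected comprehension. It assumes the lines are passed in as an Iterator[String] rather than read from ./foo.txt, so it runs without the file; the names BatchDemo and batchLines are made up for illustration and are not part of the original code.

```scala
object BatchDemo {
  // Hypothetical helper: joins every `size` lines into one string.
  // Takes an iterator of lines (an assumption, so no file is needed).
  def batchLines(lines: Iterator[String], size: Int): Iterator[String] =
    for {
      // Generator: `<-` iterates over the batches produced by sliding.
      batch <- lines.sliding(size, size)
      // Value definition: `=` binds the whole String instead of iterating over it.
      line = batch.map("(" + _ + ")").mkString(",")
    } yield line
}
```

Calling BatchDemo.batchLines(Source.fromFile("./foo.txt").getLines(), 2).foreach(println) should then print one batched string per line, as the first version of the code does.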

By the way, sliding(2,2) can be more clearly written as grouped(2).
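A small sketch of that equivalence, using an in-memory sequence instead of the file. The object and method names are hypothetical and exist only for this comparison.

```scala
object GroupedDemo {
  // Non-overlapping windows via sliding: window size n, step n.
  def viaSliding(xs: Seq[String], n: Int): List[Seq[String]] =
    xs.iterator.sliding(n, n).toList

  // The same batches via grouped, which states the intent directly.
  def viaGrouped(xs: Seq[String], n: Int): List[Seq[String]] =
    xs.iterator.grouped(n).toList
}
```

Both calls keep a shorter final batch when the number of lines is not a multiple of n.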
