Metallica Metallica - 9 days ago 6
Scala Question

Scala variable resets after for-each loop / string gets truncated

I am doing a Spark project. In the following code, I have a string that I use to collect my results in order to write them to a file later (I know this is not the correct way; I am just checking what is inside a Tuple3 returned by a method). The string gets truncated after a for-each loop. Here is the relevant part of my code:

val newLine = sys.props("line.separator") // also tried "\n". I am using OS X.

var str = s"*** ${newLine}"

for (tuple3 <- ArrayOfTuple3s) {
  for (list <- tuple3._3) {
    for (strItem <- list) {
      str += s"${strItem}, "
    }
    str += s"${newLine}"
  }
  str += s"${newLine}"
  println(str)
}

print("str=" + str)


The first println call prints the correct value of the string (the concatenated result), but when the loop ends, the value of str is *** (the same value assigned to it before the first loop).

Edit: I replaced the immutable String object str with a StringBuilder, but the result did not change:

val newLine: String = sys.props("line.separator")

var str1: StringBuilder = new StringBuilder(15000)

for (tuple3 <- ArrayOfTuple3s) {
  for (list <- tuple3._3) {
    for (str <- list) {
      str1.append(s"${str}, ")
    }
    str1.append(s"${newLine}")
  }
  str1.append(s"${newLine}")
  println(str1.toString())
}

print("resulting str1=" + str1.toString())


Edit 2: I mapped the RDD to take the Tuple3's third field directly. This field itself is an RDD of Arrays of Lists. I changed the code accordingly, but I am still getting the same result (the resulting string is empty, although inside the for loop it is not).

val rddOfArraysOfLists = getArrayOfTuple3s(mainRdd).map(_._3)

for (arrayOfLists <- rddOfArraysOfLists) {
  for (list <- arrayOfLists) {
    for (field <- list) {
      str1.append(s"${field}, ")
    }
    str1.append(" -- ")
  }
  str1.append(s"${newLine}")
  println(str1.toString())
}


Edit 4: I think the problem is not with strings at all. There seems to be a problem with all types of variables.

var count = 0

for (arrayOfLists <- myArray) {
  count = arrayOfLists.last(3).toInt
  println(s"count=$count")
}

println(s"count=$count")


The value is non-zero inside the loop, but it is 0 outside the loop. Any idea?

Edit 5: I cannot publish the whole code (due to confidentiality restrictions), but here is the major part of it. If it matters, I am running Spark on my local machine in Intellij Idea (for debugging).

System.setProperty("spark.cores.max", "8")
System.setProperty("spark.executor.memory", "15g")
val sc = new SparkContext("local", getClass.getName)
val samReg = sc.objectFile[Sample](sampleLocation, 200).distinct

val samples = samReg.filter(f => f.uuid == "dce03545e8034242").sortBy(_.time).cache()

val top3Samples = samples.take(3)
for (sample <- top3Samples) {
  print("sample: ")
  println(s"uuid=${sample.uuid}, time=${sample.time}, model=${sample.model}")
}

val firstTimeStamp = samples.first.time
val targetTime = firstTimeStamp + 2592000 // + 1 month in seconds (samples during the first month)

val rddOfArrayOfSamples = getCountsRdd(samples.filter(_.time <= targetTime)).map(_._1).cache()
// Due to confidentiality matters, I cannot reveal the code,
// but here is a description:
// I have an array of samples. Each sample has a few String fields
// and is represented by a List[String]
// The above RDD is of the type RDD[Array[List[String]]].
// It contains only a single array of samples
// (because I passed a filtered set of samples to the function),
// but it may contain more.
// The fourth field of each sample (list) is an increasing number (count)

println(s"number of arrays in the RDD: ${rddOfArrayOfSamples.count()}")

var maxCount = 0
for (arrayOfLists <- rddOfArrayOfSamples) {
  println(s"Last item of the array (a list)=${arrayOfLists.last}")
  maxCount = arrayOfLists.last(3).toInt
  println(s"maxCount=${maxCount}")
}
println(s"maxCount=${maxCount}")


The output:

sample: uuid=dce03545e8034242, time=1360037324, model=Nexus 4

sample: uuid=dce03545e8034242, time=1360037424, model=Nexus 4

sample: uuid=dce03545e8034242, time=1360037544, model=Nexus 4

number of arrays in the RDD: 1

Last item of the array (a list)=List(dce03545e8034242, Nexus 4, 1362628767, 32, 2089, 0.97, 0.15999999999999992, 0)

maxCount=32

maxCount=0

Answer

Promoting my explanation from a comment to an answer:

See this answer to a somewhat-related question:

Not to get into too many details, but when you run different transformations on an RDD (map, flatMap, filter and others), your transformation code (closure) is:

serialized on the driver node,
shipped to the appropriate nodes in the cluster,
deserialized,
and finally executed on the nodes.
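
The effect of those steps can be imitated without Spark. The following is only a sketch in plain Scala, using Java serialization on a small mutable object. The names ClosureCopySketch, Counter and roundTrip are made up for illustration and are not part of Spark, and Spark's real closure handling is more involved than this.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ClosureCopySketch {
  // Hypothetical stand-in for a variable captured by a closure.
  final class Counter(var value: Int) extends Serializable

  // Serialize to bytes and read back, imitating the trip from driver to executor.
  def roundTrip(counter: Counter): Counter = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(counter) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[Counter] finally in.close()
  }

  // Returns (value seen by the "driver", value seen by the "executor").
  def run(): (Int, Int) = {
    val driverCounter = new Counter(0)
    val executorCounter = roundTrip(driverCounter) // an independent copy
    executorCounter.value = 32                     // the "task" updates only its copy
    (driverCounter.value, executorCounter.value)
  }

  def main(args: Array[String]): Unit = {
    val (onDriver, onExecutor) = run()
    println(s"executor copy: maxCount=$onExecutor")
    println(s"driver:        maxCount=$onDriver")
  }
}
```

The copy that was deserialized ends up holding 32 while the original stays at 0, which is the same pattern as the maxCount=32 / maxCount=0 output in the question.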

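The for in your code is just syntactic sugar for a foreach call on the RDD (with yield it would be a map), so its body is a closure that is shipped in exactly this way.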
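
To make the desugaring concrete, here is a small sketch on an ordinary List (no Spark involved; ForDesugarSketch and its variable names are made up for illustration). Strictly speaking, a for without yield is rewritten to foreach, and a for with yield is rewritten to map.

```scala
import scala.collection.mutable.ListBuffer

object ForDesugarSketch {
  // Returns the results of the three equivalent ways of doubling each element.
  def run(): (List[Int], List[Int], List[Int]) = {
    val xs = List(1, 2, 3)

    // A for loop without yield...
    val viaFor = ListBuffer.empty[Int]
    for (x <- xs) viaFor += x * 2

    // ...is rewritten by the compiler to a foreach call with a closure.
    val viaForeach = ListBuffer.empty[Int]
    xs.foreach(x => viaForeach += x * 2)

    // A for with yield is rewritten to map: xs.map(x => x * 2)
    val viaYield = for (x <- xs) yield x * 2

    (viaFor.toList, viaForeach.toList, viaYield)
  }
}
```

On a local collection the closure runs in the same JVM against the same variables, so mutating viaFor works as expected. On an RDD the same syntax hands the closure to Spark, which is where the copies come from.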

Because of this, the maxCount that each task updates is a deserialized copy, not the maxCount in your invoking program. That one never changes, which is why it still prints 0 after the loop.

The lesson here is: don't use closures (blocks) that update vars defined outside the block.
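
Two common ways around this are sketched below, assuming Spark is on the classpath and the RDD has the type RDD[Array[List[String]]] described in the question. The object and method names (MaxCountSketch, maxViaCollect, maxViaAction) and the sample data are made up for illustration. Note that math.max is used here where the question simply assigns, so that several arrays are handled sensibly.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MaxCountSketch {
  // Option 1: collect() returns a local Array on the driver, so an ordinary
  // loop over it can update a local var. Only suitable for small results.
  def maxViaCollect(rdd: RDD[Array[List[String]]]): Int = {
    var maxCount = 0
    for (arrayOfLists <- rdd.collect()) {
      maxCount = math.max(maxCount, arrayOfLists.last(3).toInt)
    }
    maxCount
  }

  // Option 2: let Spark compute the value and return it from an action,
  // with no mutable state at all. max() throws on an empty RDD.
  def maxViaAction(rdd: RDD[Array[List[String]]]): Int =
    rdd.map(_.last(3).toInt).max()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("MaxCountSketch")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: one array of two "samples", count in the fourth field.
      val rdd = sc.parallelize(Seq(
        Array(List("id", "model", "t1", "7"), List("id", "model", "t2", "32"))
      ))
      println(s"maxCount via collect=${maxViaCollect(rdd)}")
      println(s"maxCount via action=${maxViaAction(rdd)}")
    } finally {
      sc.stop()
    }
  }
}
```

The second form is usually preferable, since the data stays distributed and only the single result travels back to the driver.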