TBZ92 - 1 month ago
Scala Question

What is the cause of OutOfMemoryError in Scala?

I'm only just starting to learn Scala, coming from Python. I was attempting a basic file-processing task in Scala: removing substrings of the form "[ ... ]" from data files using a regex. The script successfully processes the first few files and then throws a java.lang.OutOfMemoryError: Java heap space. The file at which the error occurs is about 70 MB, and I have 16 GB of RAM at my disposal. (The six preceding files are each under 100 KB, except the first, which is 5.5 MB.)

My question is: what causes the OutOfMemoryError, and how can I change my approach to prevent it? I don't understand why it happens, and I have little experience debugging memory errors, as Python is relatively forgiving about memory management.

Any additional comments on coding style or the methods I use are more than welcome - I am eager to learn.

Regexer.scala:

import scala.io.Source
import java.io._

object Regexer {

  def main(args: Array[String]): Unit = {

    val filenames = Source.fromFile("all_files.txt").getLines()

    for (fn <- filenames) {

      val datafile: String = Source.fromFile(fn).mkString

      val new_data: String = datafile.replaceAll(raw"\[.*?\]", "")

      val file = new File(fn)
      val bw = new BufferedWriter(new FileWriter(file))
      bw.write(new_data)
      bw.close()

    }
  }
}


all_files.txt is a file containing the paths of all files to process (as they are located in subdirectories).

Finally, the complete error message thrown upon execution:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuilder.append(StringBuilder.java:190)
at scala.collection.mutable.StringBuilder.appendAll(StringBuilder.scala:249)
at scala.io.BufferedSource.mkString(BufferedSource.scala:97)
at Regexer$$anonfun$main$1.apply(Regexer.scala:12)
at Regexer$$anonfun$main$1.apply(Regexer.scala:10)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at Regexer$.main(Regexer.scala:10)
at Regexer.main(Regexer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.reflect.internal.util.ScalaClassLoader$$anonfun$run$1.apply(ScalaClassLoader.scala:70)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:101)
at scala.reflect.internal.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:70)
at scala.reflect.internal.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:101)
at scala.tools.nsc.CommonRunner$class.run(ObjectRunner.scala:22)
at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:39)
at scala.tools.nsc.CommonRunner$class.runAndCatch(ObjectRunner.scala:29)
at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:39)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:65)
at scala.tools.nsc.MainGenericRunner.run$1(MainGenericRunner.scala:87)
at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:98)
at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:103)
at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)

Answer

You might have 16 GiB on your machine, but that doesn't mean the JVM can use all of it. Scala code (normally) runs on the Java Virtual Machine (JVM), which manages its own heap, and the default maximum heap size may be too low for your program. You can raise the limit with the -Xmx option: try something like -Xmx1024m or -Xmx2g, or however much memory you think the job needs. If the problem persists after raising the heap size, then you either have a memory leak or your algorithm needs to be optimized.
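For example, since your stack trace shows the script being run through the scala runner (MainGenericRunner), you can forward JVM flags with -J, or set them via the JAVA_OPTS environment variable. A sketch, assuming a 2 GiB heap is enough:

```shell
# Forward the heap flag to the underlying JVM via the scala runner
scala -J-Xmx2g Regexer.scala

# Equivalent: set JVM options through the environment
JAVA_OPTS=-Xmx2g scala Regexer.scala
```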

In your specific case, instead of loading each entire file into memory with mkString, consider processing it line by line (or in some other buffered chunks), so that at any time only a small portion of the file needs to be held in memory.
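Here is a minimal sketch of that line-by-line approach (the object name LineWiseRegexer and the .tmp suffix are illustrative, not from your question). It streams each input file through a temporary file, so only one line is resident in memory at a time, and it closes every Source — your original loop never closes them, which leaks file handles:

```scala
import java.io.{File, PrintWriter}
import java.nio.file.{Files, Paths, StandardCopyOption}
import scala.io.Source

object LineWiseRegexer {

  // Compile the pattern once, instead of recompiling it per call to replaceAll.
  private val Bracketed = raw"\[.*?\]".r

  // Pure per-line transformation: strip every "[ ... ]" span from one line.
  def stripBrackets(line: String): String = Bracketed.replaceAllIn(line, "")

  def main(args: Array[String]): Unit = {
    val listing = Source.fromFile("all_files.txt")
    try {
      for (fn <- listing.getLines()) {
        val in  = Source.fromFile(fn)
        val tmp = new File(fn + ".tmp") // write cleaned output to a temp file
        val out = new PrintWriter(tmp)
        try {
          // Only the current line is held in memory, never the whole file.
          in.getLines().foreach(line => out.println(stripBrackets(line)))
        } finally {
          in.close()
          out.close()
        }
        // Replace the original file with the cleaned copy.
        Files.move(tmp.toPath, Paths.get(fn), StandardCopyOption.REPLACE_EXISTING)
      }
    } finally listing.close()
  }
}
```

One caveat: a line-based pass cannot match a bracketed span that crosses a line boundary, whereas your mkString version could (though `.` does not match newlines by default, so in practice the results agree unless you change the regex flags).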
