jithinpt jithinpt - 6 days ago 5
Scala Question

sbt-assembly: Generate a minimal JAR file

I've been using sbt-assembly to generate standalone JAR file for my scala project. However, I would like to reduce the size of my JAR file (its currently around 150MB and there's defintely room for improvement there).

I used the following command to list the contents of the JAR file that's produced:

jar tf <JAR file>


This revealed that there are lots of classes in the generated JAR file that are not used in the project. I believe these classes get included as part of third-party JARs.

Questions

(a) Is there an option that I can use to instruct sbt-assembly to generate a minimal JAR file that does not include the third-party classes that are not used in my project?

(b) I could use AssemblyStrategy to manually specify which files need to be excluded. Is this a sound strategy? I'm a bit concerned that with this approach the JAR file might end up throwing unexpected ClassNotFound exceptions.

Thanks in advance.

Answer

It's not easy to say what's used in your project and what is not. If you include a dependency into a project it might bring a few other ones in. Those child dependencies might also require their own dependencies and so on.

By default if you include some dependency in your project you intend to use it. The author of a dependency usually does the same thing. Thus, there is usually not much you can throw away, it's there for a reason. There are couple cases when this is not true:

  • Dependency author includes additional dependencies that will be used only in some settings, and that does not apply to your project
  • You are using a mega-dependency when you actually need only one of its libraries/features.

There are counter examples to this as well: Scalatest does not ship pegdown for generating html test reports because you don't need it usually. But it might be needed if you try to use -h flag to generate html.

Imagine the case when you use Apache Tika for pdf parsing. It wraps PDFBox to do the parsing. You don't need a bloat of all other libraries in that case that parse MS documents. The best thing to do is not to exclude files manually via sbt exclude or sbt-assembly rules because there is a risk you get it wrong and get run time class loading exception. Instead you need to use the right dependency like PDFBox directly. Unfortunately this is a lot of manual work in many cases to figure out all dependencies that you need, so it's your choice: easy and fat JAR, or painful and lean.

There are two ways to exclude dependencies:

  1. Exclude transitive dependencies with exclude. See the docs here.
  2. Don't use the top level dependency and manually add its subdependencies as you need them.
  3. Ok, one more less fun option: use provided and make sure libraries are copied to your target environment and are on classpath. If you have many jars using the same libraries this helps to share those.

You can visualize your dependency tree with this plugin: https://github.com/jrudolph/sbt-dependency-graph. It's very helpful when trying to figure out what you are using and what you can remove. There are some tools like tattletale and loosejar that people suggest but I haven't tried them. If anyone has experience with those please share.