Java Question

What is the best way to process large CSV files?

I have a third-party system that generates a large amount of data each day (these are CSV files stored on FTP). There are 3 types of files being generated:


  • every 15 minutes (2 files). These files are pretty small (~2 MB each)

  • every day at 5 PM (~200-300 MB)

  • every midnight (this CSV file is about 1 GB)



Overall, the size of the 4 CSVs is about 1.5 GB, and we should take into account that some of the files are generated every 15 minutes. The data also needs to be aggregated (not a hard process, but it will definitely take time), and I need fast responses.
I am thinking about how to store this data and about the overall implementation.

We have a Java stack. The database is MS SQL Standard. From my measurements, MS SQL Standard shared with the other applications won't handle such a load. What comes to my mind:


  • Upgrade to MS SQL Enterprise on a separate server.

  • Use PostgreSQL on a separate server. Right now I'm working on a PoC for this approach.



What would you recommend here? Probably there are better alternatives.

Edit #1

Those large files contain new data for each day.

Answer

Okay. After spending some time with this problem (which included reading, consulting, experimenting, and doing several PoCs), I came up with the following solution.

TL;DR

Database: PostgreSQL, as it is good with CSV, free, and open source.

Tool: Apache Spark is a good fit for this type of task, with good performance.

DB

Regarding the database, it is an important thing to decide: what to pick and how it will work in the future with this amount of data. It should definitely be a separate server instance, in order not to generate additional load on the main database instance and not to block other applications.

NoSQL

I thought about using Cassandra here, but that solution would be too complex right now. Cassandra does not have ad-hoc queries, and its data storage layer is basically a key-value store. That means you must "model" your data around the queries you need, rather than around the structure of the data itself.

RDBMS

I didn't want to overengineer here, so I settled on an RDBMS.

MS SQL Server

It is a way to go, but the big downside here is pricing: it is pretty expensive. The Enterprise edition costs a lot of money, taking into account our hardware. Regarding pricing, you could read this policy document.

Another drawback was the support for CSV files, which will be our main data source here. MS SQL Server can neither import nor export CSV:

  • MS SQL Server silently truncating a text field.

  • MS SQL Server's text encoding handling going wrong.

  • MS SQL Server throwing an error message because it doesn't understand quoting or escaping.

More on that comparison can be found in the article PostgreSQL vs. MS SQL Server.

PostgreSQL

This database is a mature product and well battle-tested. I have heard a lot of positive feedback about it from others (of course, there are some tradeoffs too). It has a more classic SQL syntax, good CSV support, and, moreover, it is open source.
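As a rough illustration of that CSV support (a minimal sketch, not my production code): PostgreSQL's `COPY` can be driven straight from the Java side via the JDBC driver's `CopyManager`. The connection URL, credentials, table name `daily_measurements`, and file path below are placeholders.

```java
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class CsvCopyImport {

    public static void main(String[] args) throws Exception {
        // Placeholder connection settings -- adjust to your environment.
        String url = "jdbc:postgresql://db-host:5432/reports";
        try (Connection conn = DriverManager.getConnection(url, "app_user", "secret");
             Reader reader = Files.newBufferedReader(Paths.get("/data/ftp/daily_export.csv"))) {

            // COPY ... FROM STDIN streams the whole file into the table,
            // which is far faster than row-by-row INSERTs.
            CopyManager copyManager = new CopyManager(conn.unwrap(BaseConnection.class));
            long rows = copyManager.copyIn(
                    "COPY daily_measurements FROM STDIN WITH (FORMAT csv, HEADER true)",
                    reader);
            System.out.println("Imported rows: " + rows);
        }
    }
}
```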

It is worth mentioning that SSMS is way better than pgAdmin. SSMS has autocomplete and shows multiple result sets (when you run several queries, you get all the results at once, while in pgAdmin you only get the last one).

Anyway, right now I'm using DataGrip from JetBrains.

Processing Tool

I've looked through Spring Batch and Apache Spark. Spring Batch is a bit too low-level to use for this task, and Apache Spark provides the ability to scale more easily if that is needed in the future. In any case, Spring Batch could also do the job.

Regarding an Apache Spark example, the code can be found in the learning-spark project. My choice is Apache Spark for now.
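For a flavour of what such a job looks like, here is a minimal sketch of reading the nightly CSV, aggregating it, and writing the result into PostgreSQL over JDBC. It is not the learning-spark code; the file path, the column names (`device_id`, `event_date`, `value`), the table name, and the credentials are all assumptions for the example.

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DailyCsvAggregation {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-csv-aggregation")
                .master("local[*]") // a single machine is enough for ~1.5 GB/day
                .getOrCreate();

        // Read the nightly ~1 GB file. Schema inference does an extra pass over
        // the data, so a fixed schema would be preferable in production.
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/data/ftp/midnight_export.csv");

        // Example aggregation: totals and averages per device per day.
        Dataset<Row> aggregated = raw
                .groupBy(col("device_id"), col("event_date"))
                .agg(sum("value").alias("total_value"),
                     avg("value").alias("avg_value"));

        // Write the aggregates into PostgreSQL (the PostgreSQL JDBC driver must
        // be on the classpath).
        aggregated.write()
                .format("jdbc")
                .option("url", "jdbc:postgresql://db-host:5432/reports")
                .option("dbtable", "daily_aggregates")
                .option("user", "app_user")
                .option("password", "secret")
                .mode("append")
                .save();

        spark.stop();
    }
}
```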