I have a third party system that generates a large amount of data each day (those are
200 - 300 Mb
MS SQL Standard
MS SQL Standard
MS SQL Enterprise
Okay. After spending some time with this problem (it includes reading, consulting, experimenting, doing several PoC). I came up with the following solution.
PostgreSQL as it is good for CSV, free and open source.
Tool: Apache Spark is a good fit for such type of tasks. Good performance.
Regarding database, it is an important thing to decide. What to pick and how it will work in future with such amount of data. It is definitely should be a separate server instance in order not to generate an additional load on the main database instance and not to block other applications.
I thought about the usage of
Cassandra here, but this solution would be too complex right now.
Cassandra does not have ad-hoc queries.
Cassandra data storage layer is basically a key-value storage system. It means that you must "model" your data around the queries you need, rather than around the structure of the data itself.
I didn't want to overengineer here. And I stopped the choice here.
It is a way to go, but the big downside here is pricing. Pretty expensive. Enterprise edition costs a lot of money taking into account our hardware. Regarding pricing, you could read this policy document.
Another drawback here was the support of CSV files. This will be the main data source for us here.
MS SQL Server can neither import nor export CSV.
MS SQL Server silently truncating a text field.
MS SQL Server's text encoding handling going wrong.
MS SQL Server throwing an error message because it doesn't understand quoting or escaping. More on that comparison could be found in the article PostgreSQL vs. MS SQL Server.
This database is a mature product and well battle-tested too. I heard a lot of positive feedback on it from others (of course, there are some tradeoffs too). It has a more classic SQL syntax, good CSV support, moreover, it is open source.
It is worth to mention that SSMS is a way better than PGAdmin. SSMS has an autocomplete feature, multiple results (when you run several queries and get the several results at one, but in PGAdmin you get the last one only).
Anyway, right now I'm using DataGrip from JetBrains.
I've looked through Spring Batch and Apache Spark. Spring Batch is a bit too low-level thing to use for this task and also Apache Spark provides the ability to scale easier if it will be needed in future. Anyway, Spring Batch could also do this work too.