Shannon Holsinger - 11 months ago
SQL Question

SQL 2016 - Compression Before Or After Bulk Insert

I have approximately 90 GB of uncompressed .csv data to import into SQL Server. Depending on the ultimate requirements of the interface, I may or may not be able to partition this data, but either way these are going to be some big tables. Though large in total, the rows are relatively simple:

[Exchange] [Symbol] [Compare_To_Exchange] [Compare_To_Symbol] [Date] [Correlation]
NYSE ZTS INDEX DCXE 12/9/2013 0:00 -0.5521076

I am not familiar with compressing SQL tables and am reading up. One question I can't seem to find an answer to is whether it's better (for performance or otherwise) to enable compression while the table is empty and let the bulk insert write compressed pages, or to insert all the data and then compress.

On the face of it, I'm inclined to do it while the table is empty (much as creating a clustered index on an empty table is easier than on a full one), but the one article I read that came close to this topic seemed to indicate compression should be applied once the table is full. I'd appreciate the experience of anyone who knows the difference. Thanks for your time, and in advance for any thoughts, information, or pointers you can provide.
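For reference, both approaches the question describes can be written in T-SQL. This is only a sketch: the table name, column types, and the file path are assumptions, not from the question.

```sql
-- Option 1: declare compression up front, so the bulk load
-- writes compressed pages directly (names are illustrative).
CREATE TABLE dbo.Correlations (
    Exchange            varchar(10)  NOT NULL,
    Symbol              varchar(10)  NOT NULL,
    Compare_To_Exchange varchar(10)  NOT NULL,
    Compare_To_Symbol   varchar(10)  NOT NULL,
    [Date]              datetime2(0) NOT NULL,
    Correlation         real         NOT NULL
) WITH (DATA_COMPRESSION = PAGE);

BULK INSERT dbo.Correlations
FROM 'C:\data\correlations.csv'   -- hypothetical path
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- Option 2: load into an uncompressed table first, then
-- compress afterwards with a rebuild (a size-of-data operation).
ALTER TABLE dbo.Correlations REBUILD WITH (DATA_COMPRESSION = PAGE);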

Answer Source

It is important to remember you are making a database here, not a CSV file. Compression makes a lot of sense for a CSV file, but less so in a well-designed data model: compression pays off on unstructured data, and you have structured data.

If you normalize the data you don't need to compress at all. In your example you are showing four or five short strings plus a timestamp plus what is probably a 32-bit numeric, so about 160 bits (20 bytes) per row.

To be more concrete:

[Exchange]            -> 16-bit index into an exchange lookup table
[Symbol]              -> 16-bit index into a symbol lookup table
[Compare_To_Exchange] -> 16-bit index into the same exchange table
[Compare_To_Symbol]   -> 16-bit index into the same symbol table
[Date]                -> 64-bit timestamp
[Correlation]         -> 32-bit numeric

That is going to be much smaller than the CSV.
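That layout could be sketched in T-SQL along these lines (table and column names are assumptions, not from the question):

```sql
-- Lookup tables: each distinct exchange and symbol is stored once.
CREATE TABLE dbo.Exchange (
    ExchangeId smallint IDENTITY PRIMARY KEY,  -- 16-bit key
    Name       varchar(20) NOT NULL UNIQUE
);

CREATE TABLE dbo.Symbol (
    SymbolId smallint IDENTITY PRIMARY KEY,    -- 16-bit key
    Ticker   varchar(10) NOT NULL UNIQUE
);

-- Fact table: four 2-byte keys + 8-byte timestamp + 4-byte real
-- = 20 bytes of data per row.
CREATE TABLE dbo.Correlation (
    ExchangeId          smallint NOT NULL REFERENCES dbo.Exchange,
    SymbolId            smallint NOT NULL REFERENCES dbo.Symbol,
    CompareToExchangeId smallint NOT NULL REFERENCES dbo.Exchange,
    CompareToSymbolId   smallint NOT NULL REFERENCES dbo.Symbol,
    ObservedAt          datetime NOT NULL,  -- 64-bit timestamp
    Correlation         real     NOT NULL   -- 32-bit numeric
);
```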

You might even be able to make some of these 8-bit indexes if the lookup tables are small enough. And if that column is a date rather than a timestamp, it is smaller still (a DATE is 3 bytes in SQL Server versus 8 for DATETIME).

Or, if you are lazy, you could use 32-bit integers or even 64-bit ones; the rows are still going to be small if you use the structure of the data in your design.
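As a rough back-of-envelope check, here is the sample CSV row from the question compared with the normalized layout above (the CSV figure just measures that one sample row; real rows will vary):

```python
# Size of the sample CSV row, in bytes (separators and newline included).
csv_row_bytes = len("NYSE,ZTS,INDEX,DCXE,12/9/2013 0:00,-0.5521076\n")

# Normalized layout from the answer: four 16-bit keys,
# a 64-bit timestamp, and a 32-bit numeric.
normalized_bits = 4 * 16 + 64 + 32
normalized_row_bytes = normalized_bits // 8

print(csv_row_bytes, normalized_row_bytes)  # 46 20
```

So even before any compression, the normalized row is less than half the size of the CSV row it came from.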