I know that normalis(z)ation has been extensively discussed on Stack Overflow. I've read many of the previous discussions. I've got some additional questions though.
I'm working on a legacy system with at least 100 tables. The database has some un-normalized structure, tables that contain a variety of disparate data, and other problems. I've been given the task of trying to improve it. I can't just start again but need to modify the existing schema.
In the past I have always tried to design normalized databases. Now the questions. A senior developer has suggested that in some cases we can't normalize:
1) With temporal data. For example an invoice is created that links to a product. If a customer asks for a copy of this invoice a year later we must be able to produce an exact copy of the original. What if the product price, name or description have been updated? The senior guy suggested that the price and other product information should be copied to the invoice table. I'm thinking maybe we should have another table such as productPrice that has a date field so we can track changes in price over time. We would need the same thing for the product description and name I guess? Seems complicated. What do you think?
2) The database is an accounting system. I'm not very familiar with accounting. At the moment some summary data is derived and stored in the database. For example total sales for the year. My senior associate has said that accountants like to check things are correct by comparing this value with data that is actually calculated from invoices etc to give them confidence that the application is working correctly. He said that at the moment for example we can tell if someone deleted an invoice from last year mistakenly because the totals will not be the same. He also pointed out that it could be quite slow to calculate these totals on the fly. Of course I said that data should not be duplicated and should always be calculated when needed. I suggested that we could use SQL Reporting Services or some other solution that will generate these reports overnight and cache them. Anyway he's not convinced. Any comments on this?
Thanks very much :)
Your senior colleague is a developer, not a data modeller. You are better off starting from scratch, without them. Normalisation is complicated only to those who will not read books, and get their 'knowledge' from the amateurs at wiki. It is fair enough that he makes you think, but some of the issues are absurd.
You need to appreciate the differences between actual online data and historic data; then the difference between merely historic and archival needs. Each of them is right for its specific business requirement, and wrong for all others; there is no universal right and wrong.
ProductPrice (actually, I would call it ProductDate) is a good idea, but may not be necessary. But you are right, you need to evaluate the currency of data, in the full context of the whole database.
Products cannot be deleted, they can be marked
InvoiceItem has FKs to both
NumUnits; ProductPrice; TaxAmount; ExtendedPrice. Sure, this looks like a "denormalisation" but it is not, because prices, taxation rates, etc, are subject to change. But more important, the legal requirement is that we can reproduce the old invoice on demand.
InvoiceTotalAmount is a derived column, just SUM() of the InvoiceItems.
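To make the invoice points concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is an illustration under assumptions, not a definitive schema: I am assuming the two FKs on InvoiceItem point to Invoice and Product, that amounts are stored as integer cents, that ExtendedPrice excludes tax, and that IsObsolete is the marker used instead of deleting a product. The helper names add_item and invoice_total are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Product (
    ProductId  INTEGER PRIMARY KEY,
    Name       TEXT    NOT NULL,
    Price      INTEGER NOT NULL,           -- current price, in cents
    IsObsolete INTEGER NOT NULL DEFAULT 0  -- assumed marker; rows are never deleted
);
CREATE TABLE Invoice (
    InvoiceId   INTEGER PRIMARY KEY,
    InvoiceDate TEXT NOT NULL
);
CREATE TABLE InvoiceItem (
    InvoiceId     INTEGER NOT NULL REFERENCES Invoice(InvoiceId),
    ProductId     INTEGER NOT NULL REFERENCES Product(ProductId),
    NumUnits      INTEGER NOT NULL,
    ProductPrice  INTEGER NOT NULL,  -- price at the time of sale, in cents
    TaxAmount     INTEGER NOT NULL,
    ExtendedPrice INTEGER NOT NULL,  -- NumUnits * ProductPrice, tax excluded (assumption)
    PRIMARY KEY (InvoiceId, ProductId)
);
""")


def add_item(conn, invoice_id, product_id, num_units, tax_rate_percent):
    """Copy the product's current price onto the invoice line."""
    (price,) = conn.execute(
        "SELECT Price FROM Product WHERE ProductId = ?", (product_id,)
    ).fetchone()
    extended = price * num_units
    tax = extended * tax_rate_percent // 100
    conn.execute(
        "INSERT INTO InvoiceItem VALUES (?, ?, ?, ?, ?, ?)",
        (invoice_id, product_id, num_units, price, tax, extended),
    )


def invoice_total(conn, invoice_id):
    """Derive the invoice total from its items instead of storing it."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(ExtendedPrice + TaxAmount), 0) "
        "FROM InvoiceItem WHERE InvoiceId = ?",
        (invoice_id,),
    ).fetchone()
    return total


conn.execute("INSERT INTO Product (ProductId, Name, Price) VALUES (1, 'Widget', 1000)")
conn.execute("INSERT INTO Invoice VALUES (1, '2023-01-15')")
add_item(conn, 1, 1, num_units=3, tax_rate_percent=10)

total_before = invoice_total(conn, 1)
# A later price change touches Product only; the invoice line keeps its own price.
conn.execute("UPDATE Product SET Price = 1500 WHERE ProductId = 1")
total_after = invoice_total(conn, 1)
```

Both totals come out the same, which is the point: the old invoice can be reproduced on demand no matter what happens to the current price.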
That is rubbish. Accounting systems, and accountants do not "work" like that.
If it is a true accounting system, then it will have JournalEntries, or "double entry"; that is what a qualified accountant is required to use (by law).
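As a rough illustration of what double entry buys you, here is a small sketch, again with Python's sqlite3. The JournalLine table, the account names, and the helpers post_entry and trial_balance are all hypothetical; a real ledger has far more to it. The only idea shown is that every entry must balance, so total debits always equal total credits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE JournalLine (
    EntryId INTEGER NOT NULL,
    Account TEXT    NOT NULL,
    Debit   INTEGER NOT NULL DEFAULT 0,  -- cents
    Credit  INTEGER NOT NULL DEFAULT 0   -- cents
);
""")


def post_entry(conn, entry_id, lines):
    """Post one journal entry; lines is a list of (account, debit, credit).

    An entry whose debits and credits differ is rejected outright.
    """
    debits = sum(debit for _, debit, _ in lines)
    credits = sum(credit for _, _, credit in lines)
    if debits != credits:
        raise ValueError(f"unbalanced entry: debits={debits}, credits={credits}")
    with conn:  # commit all lines together, or none of them
        conn.executemany(
            "INSERT INTO JournalLine VALUES (?, ?, ?, ?)",
            [(entry_id, account, debit, credit) for account, debit, credit in lines],
        )


def trial_balance(conn):
    """Return (total debits, total credits) across the whole journal."""
    return conn.execute(
        "SELECT COALESCE(SUM(Debit), 0), COALESCE(SUM(Credit), 0) FROM JournalLine"
    ).fetchone()


# A sale of 3000 plus 300 tax, posted as one balanced entry.
post_entry(conn, 1, [
    ("AccountsReceivable", 3300, 0),
    ("Sales", 0, 3000),
    ("TaxPayable", 0, 300),
])
balance = trial_balance(conn)
```

With this in place the check that the books are correct is the trial balance itself, not a stored total compared against a recalculated one.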
Ensuring that Invoices are not deletable is a separate issue, to do with security, etc. If anyone is paranoid about things being deleted from their database, and their database was not secured by a qualified person, then they have more and different problems that have nothing to do with this question. Obtain a security audit, and do whatever they tell you.
There are a few people on this site who think wiki is a place that you can learn something. It isn't. It is a cesspool of "definitions" written by amateurs, and the "definitions" are constantly changed by other amateurs. There is no fixed definition that you can rely on. So don't worry about what wiki says or what people say wiki says; the moment they mention wiki, you know their "knowledge" comes from reading, not qualification, and what they are reading is an ever-changing cesspit. They will predictably argue about "definitions" because they have no actual experience; the experienced will just get on with the job.
A Normalised database is always much faster than an Unnormalised database. So it is very important to understand what Normalisation and Denormalisation are, and what they are not. The process is greatly hindered when people have fluid and amateur "definitions"; it just leads to confusion and time-wasting "discussions". When you have fixed definitions, you can avoid all that, and just get on with the job.
Summary tables are quite normal, to save the time and processing power of recalculating info that does not change, eg: YTD totals for every year but this year; MTD totals for every month in this year but not this month. "Always recalculating" data is a bit silly when (a) the info is very large and (b) does not change. Calculate for the current month only.
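A minimal sketch of that split, using Python's sqlite3 and yearly totals to keep it short. The YearlySalesSummary table and the helpers close_year and total_sales are hypothetical names, and for brevity the invoice amount is a plain column here rather than something derived from invoice items. Closed years are summarised once and read back; the current year is always calculated from the live rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Invoice (
    InvoiceId   INTEGER PRIMARY KEY,
    InvoiceDate TEXT    NOT NULL,  -- ISO date, 'YYYY-MM-DD'
    TotalAmount INTEGER NOT NULL   -- cents
);
CREATE TABLE YearlySalesSummary (
    Year       INTEGER PRIMARY KEY,
    TotalSales INTEGER NOT NULL
);
""")

YEAR_TOTAL_SQL = (
    "SELECT COALESCE(SUM(TotalAmount), 0) FROM Invoice "
    "WHERE strftime('%Y', InvoiceDate) = ?"
)


def close_year(conn, year):
    """Summarise a finished year once; its invoices no longer change."""
    (total,) = conn.execute(YEAR_TOTAL_SQL, (str(year),)).fetchone()
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO YearlySalesSummary (Year, TotalSales) VALUES (?, ?)",
            (year, total),
        )


def total_sales(conn, year, current_year):
    """Read closed years from the summary table; calculate the current year live."""
    if year < current_year:
        row = conn.execute(
            "SELECT TotalSales FROM YearlySalesSummary WHERE Year = ?", (year,)
        ).fetchone()
        if row is not None:
            return row[0]
    (total,) = conn.execute(YEAR_TOTAL_SQL, (str(year),)).fetchone()
    return total


conn.executemany(
    "INSERT INTO Invoice (InvoiceDate, TotalAmount) VALUES (?, ?)",
    [("2022-03-01", 5000), ("2022-11-20", 7000), ("2023-02-10", 4000)],
)
close_year(conn, 2022)

sales_2022 = total_sales(conn, 2022, current_year=2023)  # from the summary table
sales_2023 = total_sales(conn, 2023, current_year=2023)  # calculated on the fly
```

The same pattern applies to MTD totals: summarise each month when it closes, and calculate only the current month on demand.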
Summary tables are not a "denormalisation" (except in the eyes of those who have just learned about "normalisation" from their magical, ever-changing fluid "source"; or as non-practitioners, who apply simple black-or-white rules to everything). Again, the definition is not being argued here; it simply does not apply to Summary tables.
Summary tables do not affect data integrity (assuming of course that the data that they were sourced from was integral).
Summary tables are an addition to the database, and are not required to have the same constraints as the database. They are essentially reporting tables or data warehouse tables, as opposed to database tables.
There are no Update Anomalies (which is a strict definition) related to Summary tables. You cannot change or delete an invoice from last year. Update Anomalies apply to true Denormalised or Unnormalised current data.