Does storing aggregated data go against database n

2020-07-22 18:45发布

问题:

On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.

But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?

Thanks

回答1:

Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.



回答2:

The phrase to remember is "Normalize till it hurts, Denormalize till it works"

It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If you measure there is a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.

So, Yes. Storing aggregated data 'goes against' normalisation.

There is no 'one best way' to denormalise; it depends what you are doing with the data.

Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.



回答3:

Too much normalization will hurt performance so in the real world you have to find your balance.

I've handled a situation like this in two ways.

1) using DB2 I used a MQT (Materialized Query Table) that works like a view only it's driven by a query and you can schedule how often you want it to refresh; e.g. every 5 min. Then that table stored the count values.

2) in the software package itself I set information like that as a system variable. So in Apache you can set a system wide variable and refresh it every 5 minutes. Then it's somewhat accurate but your only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.

I used a wrapper class to do it so it's been while but I think in PHP was was as simple as: $_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);

Nonetheless, however you store that single value it saves you from running it with every request.