The Data Warehouse Development Life Cycle
Oracle Data Warehouse Design
De-normalization And Data Warehouse Sizing
Now, let’s apply these principles to the analysis we conducted on
Guttbaum’s Grocery. The first step is to start with our 3NF model
from Figure 4.1 and locate the “facts,” which are always numerical
items at the bottom of the entity hierarchy. In our example,
quantity and sale_amount in the transaction entity are the facts.
Because these appear in the TRANSACTION table in our 3NF model for
Guttbaum’s Grocery, we can presume that the transaction table will
form the basis for our fact table for our data warehouse.
However, here we return to an important point regarding data
aggregation. Our 3NF model for Guttbaum’s assumes that there will be
one row for each item on each and every transaction. Because
Guttbaum’s Grocery chain has 50 supermarkets, and an average of
3,000 customers each day for each supermarket, with an average of 10
items per transaction, we see that 1.5 million rows are added to
Guttbaum’s database each day. Assuming that each transaction row is
20 bytes, we need 30 MBs of disk space per day just to hold the
transactions, which translates into 30,000,000 X 365 days =
1,100,000,000,000 bytes, or 1.1 terabytes per year! Even with the
price of disk running at the relatively inexpensive rate of $100 per
gigabyte or $10,00 per terabyte, we may decide that the cost of disk
storage is prohibitively expensive to store this level of
transaction detail. One alternative to keeping the lowest level of
transaction detail would be to summarize the transaction data,
either by total per customer transaction or total by product
category. Of course, we may lose some information in this
summarization, but we would save a substantial amount of disk
storage expense. For details on the issues involved with data
aggregation and summarization, refer to Chapter 10, Oracle Data
Warehouse Utilities.
So, even if we can keep our transaction table rows to
20 bytes, the additional disk cost for adding another redundant data
item to the fact table is rather substantial. Each additional 20
bytes of data will cost Guttbaum’s $10,000 per year in additional
disk expense, so we must carefully consider which redundant data
items will appear in the fact table and which data items will reside
in the adjacent tables of our STAR schema design. Remember the
tradeoff--we are sacrificing disk costs for speed of data retrieval.
Each attribute added to the fact table reduces the number of table
joins that our Oracle warehouse must incur to service a transaction.
Hence, only frequently referenced data items are candidates for
inclusion in out fact table.