The Data Warehouse Development Life Cycle
Oracle Data Warehouse DesignIntroducing Redundancy Into An Entity/Relation Model
As the size of the database increases, redundancy can
become a major problem. Today, many users create very large
databases, many of which contain trillions of bytes. For databases
of this size, a single table can contain more than a billion rows,
and the introduction of a single new column to a table can represent
thousands of dollars in additional disk expense. Data redundancy is
detrimental for two reasons. First and foremost, duplicating
redundant material consumes disk storage space. Second and more
ominous, updating redundant data requires extra processing.
Redundant duplication of very large and highly volatile data items
can cause huge processing bottlenecks.
However, the overhead associated with data redundancy does not imply
that redundancy is always undesirable. Performance is still an
overriding factor in most systems. Proper control of redundant
information implies that redundant information can be introduced
into any structure as long as the performance improvements outweigh
the additional disk costs and update problems.
Since the first publication of Dr. Codd's 1993 research paper
Providing OLAP (Online Analytical Processing) to User-Analysts: An
IT Mandate, database designers have attempted to find an optimum way
of structuring tables for low data redundancy. Codd's rules of
normalization guide the designer to create a logically correct table
structure with no redundancy, but performance rules often dictate
the introduction of duplicated data to improve performance.
This is especially true for Oracle data warehouses. However, the
warehouse designer does not have free reign to introduce redundancy
anywhere in the model. Redundancy always carries a price, whether it
is the cost of the disk storage or the cost of maintaining a
parallel update scheme. Figure 4.2 shows a strategy for analyzing
the consequences of data redundancy.
Figure 4.2 A comparison of size versus volatility for redundant
data.