The Data Warehouse Development Life Cycle
LEGACY DATA ANALYSIS
Another source of external data consists of legacy data systems. It
is not uncommon for analysts to discover that one class of data
resides in different database formats. For example, sales data from
1980 through 1990 may reside on an IMS database, while transactions
from 1991 to present are stored in a DB2 database. The data analyst
must be able to deal with the problem of translating data archive
tapes in varying formats.
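
The translation problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two record layouts (a fixed-width record and a comma-delimited record), the Sale class, and the function names are all hypothetical and do not describe real IMS or DB2 unload formats. The point is that each legacy format gets its own parser, and every parser produces the same target shape.

```python
from dataclasses import dataclass
from datetime import date
import csv
import io


@dataclass(frozen=True)
class Sale:
    """Common target shape for a sale, whatever the legacy source."""
    customer_id: str
    sale_date: date
    amount_cents: int


def parse_fixed_width(record: str) -> Sale:
    # Hypothetical layout: columns 0-7 customer id, 8-15 date as YYYYMMDD,
    # 16-25 amount in cents, zero padded.
    return Sale(
        customer_id=record[0:8].strip(),
        sale_date=date(int(record[8:12]), int(record[12:14]), int(record[14:16])),
        amount_cents=int(record[16:26]),
    )


def parse_delimited(line: str) -> Sale:
    # Hypothetical layout: customer_id,YYYY-MM-DD,amount in dollars.
    customer_id, iso_date, dollars = next(csv.reader(io.StringIO(line)))
    whole, _, frac = dollars.partition(".")
    cents = int(whole) * 100 + int((frac + "00")[:2])
    return Sale(customer_id.strip(), date.fromisoformat(iso_date), cents)


# Two records in different legacy formats land in the same structure.
old = parse_fixed_width("C0001   198503140000012550")
new = parse_delimited("C0001,1994-07-02,125.50")
```

Once both sources yield the same structure, the rest of the load does not need to know which archive a record came from.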
In some cases, developers consider themselves fortunate when a data
source is stored in an OLTP Oracle database, believing that data
extraction and loading will be easier because the source and target
both reside within Oracle. However, it must be noted that the
standard Oracle utilities for data extraction and loading
(export-import) are of little use for loading an Oracle data
warehouse, because the data is transformed as a part of extraction
and loading. For example, an OLTP Oracle database may use five
normalized tables to represent a sales transaction, while the
warehouse stores that data in de-normalized form, so a
table-by-table export and import cannot produce the warehouse format.
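
The following sketch shows why a table-by-table copy falls short. It is not Oracle code: it uses Python's built-in sqlite3 module as a stand-in so that it can run anywhere, and the three tables, their columns, and the sample rows are hypothetical (the text mentions five tables; three are enough to make the point). The join inside the CREATE TABLE ... AS SELECT statement is the transformation step that an export-import of the individual tables cannot perform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical normalized OLTP tables.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, sales_date TEXT);
    CREATE TABLE line_item (sale_id INTEGER, item_name TEXT, quantity_sold INTEGER, unit_price REAL);

    INSERT INTO customer VALUES (1, 'Acme Corp');
    INSERT INTO sales VALUES (10, 1, '1995-06-15');
    INSERT INTO line_item VALUES (10, 'widget', 3, 2.50), (10, 'gadget', 1, 10.00);

    -- One wide row per line item, with customer and sale data repeated.
    CREATE TABLE fact AS
    SELECT c.customer_name,
           s.sales_date,
           li.item_name,
           li.quantity_sold,
           li.quantity_sold * li.unit_price AS total_price
    FROM customer c
    JOIN sales s ON s.customer_id = c.customer_id
    JOIN line_item li ON li.sale_id = s.sale_id;
""")

fact_rows = conn.execute(
    "SELECT customer_name, item_name, total_price FROM fact ORDER BY item_name"
).fetchall()
```

Each row of fact carries the customer name alongside the line-item detail, which is the redundancy that de-normalization deliberately introduces.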
As a general rule, data extraction involves formatting legacy data
for loading into the warehouse. For example, data could be extracted
directly from an Oracle OLTP system into the de-normalized Oracle
warehouse as shown in Listing 3.3.
Listing 3.3 One-step data extraction and load between Oracle
databases.
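Create table FACT
as
-- Oracle requires a column alias for each expression in
-- CREATE TABLE ... AS SELECT; the alias names here are illustrative.
select customer_name,
       customer_address,
       sales_date,
       to_char(sales_date,'YYYY') sales_year,
       to_char(sales_date,'MM')   sales_month,
       to_char(sales_date,'DD')   sales_day,
       sale_amount,
       quantity_sold,
       total_price
from
       customer@oltp  cust,
       sales@oltp     sale,
       line_item@oltp li,
       item@oltp      item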
SEE CODE DEPOT FOR FULL SCRIPT
Here, you can see that the data is pulled from the Oracle OLTP
database using Oracle’s SQL*Net
facility. For details on this distributed SQL technique, see Chapter
9, Distributed Oracle Data Warehouses. Also, note the dissection of
the sales_date column. Rather than store the sale date only as a
single column of the date datatype, the Oracle data warehouse
transformation also breaks it out into three columns: one for the
year, another for the month, and a third column for the day. The
reason for this date breakout will become clear later in this
chapter when we discuss data query analysis.
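
The mechanics of the date breakout can be tried out with the short sketch below. It makes several assumptions: Python's built-in sqlite3 module stands in for Oracle, SQLite's strftime function plays the role of to_char, and the table contents and the column names sales_year, sales_month, and sales_day are hypothetical. It shows only how the columns are derived and how a query can group on one of them; it does not anticipate the later discussion of data query analysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table with dates stored as ISO-8601 text.
    CREATE TABLE sales (sales_date TEXT, sale_amount REAL);
    INSERT INTO sales VALUES
        ('1994-12-31', 100.0),
        ('1995-01-15', 40.0),
        ('1995-06-30', 60.0);

    -- Break the date out into year, month, and day columns at load time.
    CREATE TABLE fact AS
    SELECT sales_date,
           strftime('%Y', sales_date) AS sales_year,
           strftime('%m', sales_date) AS sales_month,
           strftime('%d', sales_date) AS sales_day,
           sale_amount
    FROM sales;
""")

# The year column can be grouped on directly, with no date function
# applied at query time.
totals_by_year = conn.execute(
    "SELECT sales_year, SUM(sale_amount) FROM fact "
    "GROUP BY sales_year ORDER BY sales_year"
).fetchall()
```

Because the year, month, and day are computed once during the load, queries against the warehouse can filter or group on them as ordinary columns.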