Introduction To Data Warehousing and Business Intelligence
Introduction To Data Warehousing and Business Intelligence
This chapter describes some of the basic ideas in business intelligence. This chapter contains the following topics:
Introduction to Data Warehousing and Business Intelligence Overview of Extraction, Transformation, and Loading (ETL) Overview of Materialized Views for Data Warehouses Overview of Bitmap Indexes in Data Warehousing Overview of Parallel Execution Overview of Analytic SQL Overview of OLAP Capabilities Overview of Data Mining
Subject Oriented Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented. Integrated Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
1
Nonvolatile Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred. Time Variant In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant. Typically, data flows from one or more online transaction processing (OLTP) databases into a data warehouse on a monthly, weekly, or daily basis. The data is normally processed in a staging file before being added to the data warehouse. Data warehouses commonly range in size from tens of gigabytes to a few terabytes. Usually, the vast majority of the data is stored in a few very large fact tables.
A typical data warehouse query scans thousands or millions of rows.For example, "Find the total sales for all customers last month." A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer." Historical Data Data warehouses usually store many months or years of data. This is to support historical analysis. OLTP systems usually store data from only a few weeks or months. The OLTP system stores only historical data as needed to successfully meet the requirements of the current transaction.
Data Warehouse Architecture (Basic) Data Warehouse Architecture (with a Staging Area) Data Warehouse Architecture (with a Staging Area and Data Marts)
Data Warehouse Architecture (Basic) Figure 16-1 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse. Figure 16-1 Architecture of a Data Warehouse
Description of "Figure 16-1 Architecture of a Data Warehouse" In Figure 16-1, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-
compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. Summaries in Oracle are called materialized views. Data Warehouse Architecture (with a Staging Area) Figure 16-1, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 16-2 illustrates this typical architecture. Figure 16-2 Architecture of a Data Warehouse with a Staging Area
Description of "Figure 16-2 Architecture of a Data Warehouse with a Staging Area" Data Warehouse Architecture (with a Staging Area and Data Marts) Although the architecture in Figure 16-2 is quite common, you might want to customize your warehouse's architecture for different groups within your organization. Do this by adding data marts, which are systems designed for a particular line of business. Figure 16-3 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales. Figure 16-3 Architecture of a Data Warehouse with a Staging Area and Data Marts
Description of "Figure 16-3 Architecture of a Data Warehouse with a Staging Area and Data Marts" See Also: Oracle Database Data Warehousing Guide
kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary between days/hours and minutes to near real-time. Web server log files for example can easily become hundreds of megabytes in a very short period of time. After extracting data, it has to be physically transported to the target system or an intermediate system for further processing. Depending on the chosen way of transportation, some transformations can be done during this process, too. For example, a SQL statement which directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT statement. If any errors occur during loading, an error is logged and the operation can continue.
Transportable Tablespaces
Transportable tablespaces are the fastest way for moving large volumes of data between two Oracle databases. You can transport tablespaces between different computer architectures and operating systems. Previously, the most scalable data transportation mechanisms relied on moving flat files containing raw data. These mechanisms required that data be unloaded or exported into files from the source database. Then, after transportation, these files were loaded or imported into the target database. Transportable tablespaces entirely bypass the unload and reload steps. Using transportable tablespaces, Oracle data files (containing table data, indexes, and almost every other Oracle database object) can be directly transported from one database to another. Furthermore, like import and export, transportable tablespaces provide a mechanism for transporting metadata in addition to transporting data. The most common applications of transportable tablespaces in data warehouses are in moving data from a staging database to a data warehouse, or in moving data from a data warehouse to a data mart.
Table Functions
Table functions provide the support for pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java. Scenarios as mentioned earlier can be done without requiring the use of intermediate staging tables, which interrupt the data flow through various transformations steps. A table function is defined as a function that can produce a set of rows as output. Additionally, table functions can take a set of rows as input. Table functions extend database functionality by allowing:
Multiple rows to be returned from a function Results of SQL subqueries (that select multiple rows) to be passed directly to functions Functions take cursors as input Functions can be parallelized Returning result sets incrementally for further processing as soon as they are created. This is called incremental pipelining
Table functions can be defined in PL/SQL using a native PL/SQL interface, or in Java or C using the Oracle Data Cartridge Interface (ODCI).
External Tables
6
External tables let you use external data as a virtual table that can be queried and joined directly and in parallel without requiring the external data to be first loaded in the database. You can then use SQL, PL/SQL, and Java to access the external data. External tables enable the pipelining of the loading phase with the transformation phase. The transformation process can be merged with the loading process without any interruption of the data streaming. It is no longer necessary to stage the data inside the database for further processing inside the database, such as comparison or transformation. For example, the conversion functionality of a conventional load can be used for a direct-path INSERT AS SELECT statement in conjunction with the SELECT from an external table. Figure 16-4 illustrates a typical example of pipelining. Figure 16-4 Pipelined Data Transformation
Description of "Figure 16-4 Pipelined Data Transformation" The main difference between external tables and regular tables is that externally organized tables are read-only. No DML operations (UPDATE/INSERT/DELETE) are possible and no indexes can be created on them. External tables are a complement to SQL*Loader and are especially useful for environments where the complete external source has to be joined with existing database objects and transformed in a complex manner, or where the external data volume is large and used only once. SQL*Loader, on the other hand, might still be the better choice for loading of data where additional indexing of the staging table is necessary. This is true for operations where the data is used in independent complex transformations or the data is only partially used in further processing.
Table Compression
You can save disk space by compressing heap-organized tables. A typical type of heap-organized table you should consider for table compression is partitioned tables. To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Table compression can also speed up query execution. There is, however, a slight cost in CPU overhead. Table compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compression by causing some space to be wasted.
To help you select from among the many possible materialized views in your schema, Oracle provides a collection of materialized view analysis and advisor functions and procedures in the DBMS_ADVISOR package. Collectively, these functions are called the SQL Access Advisor, and they are callable from any PL/SQL program. The SQL Access Advisor recommends materialized views from a hypothetical or user-defined workload or one obtained from the SQL cache. You can run the SQL Access Advisor from Oracle Enterprise Manager or by invoking the DBMS_ADVISOR package. See Also: Oracle Database Performance Tuning Guide for information about materialized views and the SQL Access Advisor
Reduced response time for large classes of ad hoc queries Reduced storage requirements compared to other indexing techniques Dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory Efficient maintenance during parallel DML and loads
Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of space because the indexes can be several times larger than the data in the table. Bitmap indexes are typically only a fraction of the size of the indexed data in the table. An index provides pointers to the rows in a table that contain a given key value. A regular index stores a list of rowids for each key corresponding to the rows with that key value. In a bitmap index, a bitmap for each key value replaces a list of rowids. Each bit in the bitmap corresponds to a possible rowid, and if the bit is set, it means that the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so that the bitmap index provides the same functionality as a regular index. If the number of different key values is small, bitmap indexes save space. Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE clause. Rows that satisfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatically. A good candidate for a bitmap index would be a gender column due to the low number of possible values. Parallel query and parallel DML work with bitmap indexes as they do with traditional indexes. Bitmap indexing also supports parallel create indexes and concatenated indexes. See Also: Oracle Database Data Warehousing Guide
When Oracle runs SQL statements in parallel, multiple processes work together simultaneously to run a single SQL statement. By dividing the work necessary to run a statement among multiple processes, Oracle can run the statement more quickly than if only a single process ran it. This is called parallel execution or parallel processing. Parallel execution dramatically reduces response time for data-intensive operations on large databases typically associated with decision support systems (DSS) and data warehouses. Symmetric multiprocessing (SMP), clustered systems, and large-scale cluster systems gain the largest performance benefits from parallel execution because statement processing can be split up among many CPUs on a single Oracle system. You can also implement parallel execution on certain types of online transaction processing (OLTP) and hybrid systems. Parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do part of the work at the same time. An example of this is when 12 processes handle 12 different months in a year instead of one process handling all 12 months by itself. The improvement in performance can be quite high. Parallel execution helps systems scale in performance by making optimal use of hardware resources. If your system's CPUs and disk controllers are already heavily loaded, you need to alleviate the system's load or increase these hardware resources before using parallel execution to improve performance. Some tasks are not well-suited for parallel execution. For example, many OLTP operations are relatively fast, completing in mere seconds or fractions of seconds, and the overhead of utilizing parallel execution would be large, relative to the overall execution time. See Also: Oracle Database Data Warehousing Guide for specific information on tuning your parameter files and database to take full advantage of parallel execution
Description of "Figure 16-5 Serial Full Table Scan " Figure 16-6 illustrates several parallel execution servers performing a scan of the table emp. The table is divided dynamically (dynamic partitioning) into load units called granules and each granule is read by a single parallel execution server. The granules are generated by the coordinator. Each granule is a
10
range of physical blocks of the table emp. The mapping of granules to execution servers is not static, but is determined at execution time. When an execution server finishes reading the rows of the table emp corresponding to a granule, it gets another granule from the coordinator if there are any granules remaining. This continues until all granules are exhausted, in other words, until the entire table emp has been read. The parallel execution servers send results back to the parallel execution coordinator, which assembles the pieces into the desired full table scan. Figure 16-6 Parallel Full Table Scan
Description of "Figure 16-6 Parallel Full Table Scan" Given a query plan for a SQL query, the parallel execution coordinator breaks down each operator in a SQL query into parallel pieces, runs them in the right order as specified in the query, and then integrates the partial results produced by the parallel execution servers executing the operators. The number of parallel execution servers assigned to a single operation is the degree of parallelism (DOP) for an operation. Multiple operations within the same SQL statement all have the same degree of parallelism. See Also: Oracle Database Data Warehousing Guide for information on granules as well as how Oracle divides work and handles DOP in multiuser environments
Aggregate at increasing levels of aggregation, from the most detailed up to a grand total Calculate all possible combinations of aggregations with a single statement Generate the information needed in cross-tabulation reports with a single query
These extension let you specify exactly the groupings of interest in the GROUP BY clause. This allows efficient analysis across multiple dimensions without performing a CUBE operation. Computing a full cube creates a heavy processing load, so replacing cubes with grouping sets can significantly increase performance. CUBE, ROLLUP, and grouping sets produce a single result set that is equivalent to a UNION ALL of differently grouped rows. To enhance performance, these extensions can be parallelized: multiple processes can simultaneously run all of these statements. These capabilities make aggregate calculations more efficient, thereby enhancing database performance, and scalability. One of the key concepts in decision support systems is multidimensional analysis: examining the enterprise from all necessary combinations of dimensions. We use the term dimension to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as facts. The facts might be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking. Here are some examples of multidimensional requests:
Show total sales across all products at increasing aggregation levels for a geography dimension, from state to country to region, for 1999 and 2000. Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000. Include all possible subtotals. List the top 10 sales representatives in Asia according to 2000 sales revenue for automotive products, and rank their commissions.
All these requests involve multiple dimensions. Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets. See Also: Oracle Database Data Warehousing Guide
Rankings and percentiles Moving window calculations Lag/lead analysis First/last analysis Linear regression statistics
Ranking functions include cumulative distributions, percent rank, and N-tiles. Moving window calculations allow you to find moving and cumulative aggregations, such as sums and averages.
12
Lag/lead analysis enables direct inter-row references so you can calculate period-to-period changes. First/last analysis enables you to find the first or last value in an ordered group. Other features include the CASE expression. CASE expressions provide if-then logic useful in many situations. To enhance performance, analytic functions can be parallelized: multiple processes can simultaneously run all of these statements. These capabilities make calculations easier and more efficient, thereby enhancing database performance, scalability, and simplicity. See Also: Oracle Database Data Warehousing Guide
Scalability There is tremendous growth along three dimensions of analytic applications: number of users, size of data, and complexity of analyses. There are more users of analytical applications, and they need access to more data to perform more sophisticated analysis and target marketing. For example, a telephone company might want a customer dimension to include detail such as all telephone numbers as part of an application that is used to analyze customer turnover. This would require support for multi-million row dimension tables and very large volumes of fact data. Oracle can handle very large data sets using parallel execution and partitioning, as well as offering support for advanced hardware and clustering. Availability Partitioning allows management of precise subsets of tables and indexes, so that management operations affect only small pieces of these data structures. By partitioning tables and indexes, data management processing time is reduced, thus minimizing the time data is unavailable. Transportable tablespaces also support high availability. With transportable tablespaces, large data sets, including tables and indexes, can be added with almost no processing to other databases. This enables extremely rapid data loading and updates. Manageability Oracle lets you precisely control resource utilization. The Database Resource Manager, for example, provides a mechanism for allocating the resources of a data warehouse among different sets of endusers. Another resource management facility is the progress monitor, which gives end users and administrators the status of long-running operations. Oracle maintains statistics describing the percentcomplete of these operations. Oracle Enterprise Manager lets you view a bar-graph display of these operations showing what percent complete they are. Moreover, any other tool or any database administrator can also retrieve progress information directly from the Oracle data server using system views. Backup and Recovery Oracle provides a server-managed infrastructure for backup, restore, and recovery tasks that enables simpler, safer operations at terabyte scale. Some of the highlights are:
Details related to backup, restore, and recovery operations are maintained by the server in a recovery catalog and automatically used as part of these operations. Backup and recovery operations are fully integrated with partitioning. Individual partitions, when placed in their own tablespaces, can be backed up and restored independently of the other partitions of a table. Oracle includes support for incremental backup and recovery using Recovery Manager, enabling operations to be completed efficiently within times proportional to the amount of changes, rather than the overall size of the database.
14
See Also: Oracle Database Backup and Recovery Quick Start Guide Security The security features in Oracle have reached the highest levels of U.S. government certification for database trustworthiness. Oracle's fine grained access control enables cell-level security for OLAP users. Fine grained access control works with minimal burden on query processing, and it enables efficient centralized security management.
For classification, Naive Bayes, Adaptive Bayes Networks, and Support Vector Machines (SVM) For regression, Support Vector Machines For clustering, k-means and O-Cluster For feature extraction, Non-Negative Matrix Factorization (NMF) For sequence matching and annotation, BLAST
ODM also includes several feature and performance enhancements. See Also:
Oracle Data Mining Concepts Javadoc descriptions of classes for detailed information about the classes that constitute the ODM Java API
15