

CEPIS UPGRADE is the European Journal for the Informatics Professional, published bimonthly at <http://cepis.org/upgrade>

Vol. XII, issue No. 3, July 2011

Publisher
CEPIS UPGRADE is published by CEPIS (Council of European Professional Informatics Societies, <http://www.cepis.org/>), in cooperation with the Spanish CEPIS society ATI (Asociación de Técnicos de Informática, <http://www.ati.es/>) and its journal Novática. CEPIS UPGRADE monographs are published jointly with Novática, which publishes them in Spanish (full version printed; summary, abstracts and some articles online). CEPIS UPGRADE was created in October 2000 by CEPIS and was first published by Novática and INFORMATIK/INFORMATIQUE, bimonthly journal of SVI/FSI (Swiss Federation of Professional Informatics Societies).

CEPIS UPGRADE is the anchor point for UPENET (UPGRADE European NETwork), the network of CEPIS member societies' publications, which currently includes:
• inforeview, magazine from the Serbian CEPIS society JISA
• Informatica, journal from the Slovenian CEPIS society SDI
• Informatik-Spektrum, journal published by Springer Verlag on behalf of the CEPIS societies GI, Germany, and SI, Switzerland
• ITNOW, magazine published by Oxford University Press on behalf of the British CEPIS society BCS
• Mondo Digitale, digital journal from the Italian CEPIS society AICA
• Novática, journal from the Spanish CEPIS society ATI
• OCG Journal, journal from the Austrian CEPIS society OCG
• Pliroforiki, journal from the Cyprus CEPIS society CCS
• Tölvumál, journal from the Icelandic CEPIS society ISIP

Editorial Team
Chief Editor: Llorenç Pagés-Casas
Deputy Chief Editor: Rafael Fernández Calvo
Associate Editor: Fiona Fanning

Editorial Board
Prof. Vasile Baltac, CEPIS President
Prof. Wolffried Stucky, CEPIS Former President
Prof. Nello Scarabottolo, CEPIS President Elect
Luis Fernández-Sanz, ATI (Spain)
Llorenç Pagés-Casas, ATI (Spain)
François Louis Nicolet, SI (Switzerland)
Roberto Carniel, ALSI – Tecnoteca (Italy)

UPENET Advisory Board
Dubravka Dukic (inforeview, Serbia)
Matjaz Gams (Informatica, Slovenia)
Hermann Engesser (Informatik-Spektrum, Germany and Switzerland)
Brian Runciman (ITNOW, United Kingdom)
Franco Filippazzi (Mondo Digitale, Italy)
Llorenç Pagés-Casas (Novática, Spain)
Veith Risak (OCG Journal, Austria)
Panicos Masouras (Pliroforiki, Cyprus)
Thorvardur Kári Ólafsson (Tölvumál, Iceland)
Rafael Fernández Calvo (Coordination)

English Language Editors: Mike Andersson, David Cash, Arthur Cook, Tracey Darch, Laura Davies, Nick Dunn, Rodney Fennemore, Hilary Green, Roger Harris, Jim Holder, Pat Moody.

Cover page designed by Concha Arias-Pérez, "Upcoming Resolution" / © ATI 2011
Layout Design: François Louis Nicolet
Composition: Jorge Llácer-Gil de Ramales

Editorial correspondence: Llorenç Pagés-Casas <[email protected]>
Advertising correspondence: <[email protected]>

Subscriptions
If you wish to subscribe to CEPIS UPGRADE please send an email to [email protected] with 'Subscribe to UPGRADE' as the subject of the email, or follow the link 'Subscribe to UPGRADE' at <http://www.cepis.org/upgrade>

Copyright
© Novática 2011 (for the monograph)
© CEPIS 2011 (for the sections Editorial, UPENET and CEPIS News)
All rights reserved unless otherwise stated. Abstracting is permitted with credit to the source. For copying, reprint, or republication permission, contact the Editorial Team. The opinions expressed by the authors are their exclusive responsibility.

ISSN 1684-5285

Monograph of next issue (October 2011): "Green ICT"
(The full schedule of CEPIS UPGRADE is available at our website)

Monograph: Business Intelligence (published jointly with Novática*)
Guest Editors: Jorge Fernández-González and Mouhib Alnoukari

2 Presentation. Business Intelligence: Improving Decision-Making in Organizations — Jorge Fernández-González and Mouhib Alnoukari
4 Business Information Visualization — Josep-Lluís Cano-Giner
14 BI Usability: Evolution and Tendencies — R. Dario Bernabeu and Mariano A. García-Mattío
20 Towards Business Intelligence Maturity — Paul Hawking
29 Business Intelligence Solutions: Choosing the Best Solution for your Organization — Mahmoud Alnahlawi
38 Strategic Business Intelligence for NGOs — Diego Arenas-Contreras
43 Data Governance, what? how? why? — Óscar Alonso-Llombart
49 Designing Data Integration: The ETL Pattern Approach — Veit Köppen, Björn Brüggemann, and Bettina Berendt
56 Business Intelligence and Agile Methodologies for Knowledge-Based Organizations: Cross-Disciplinary Applications — Mouhib Alnoukari
60 Social Networks for Business Intelligence — Marie-Aude Aufaure and Etienne Cuvelier

UPENET (UPGRADE European NETwork)
67 From Novática (ATI, Spain) — Free Software. AVBOT: Detecting and Fixing Vandalism in Wikipedia — Emilio-José Rodríguez-Posada — Winner of the 5th Edition of the Novática Award
71 From Pliroforiki (CCS, Cyprus) — Enterprise Information Systems. Critical Success Factors for the Implementation of an Enterprise Resource Planning System — Kyriaki Georgiou and Kyriakos E. Georgiou

CEPIS News
77 Selected CEPIS News — Fiona Fanning

* This monograph will also be published in Spanish (full version printed; summary, abstracts, and some articles online) by Novática, journal of the Spanish CEPIS society ATI (Asociación de Técnicos de Informática) at <http://www.ati.es/novatica/>.
Business Intelligence

Designing Data Integration: The ETL Pattern Approach


Veit Köppen, Björn Brüggemann, and Bettina Berendt

The process of ETL (Extract-Transform-Load) is important for data warehousing. Besides data gathering from heteroge-
neous sources, quality aspects play an important role. However, tool and methodology support are often insufficient. Due
to the similarities between ETL processes and software design, a pattern approach is suitable to reduce effort and increase
understanding of these processes. We propose a general design-pattern structure for ETL, and describe three example
patterns.

Keywords: Business Intelligence, Data Integration, Data Warehousing, Design Patterns, ETL, Process.

Authors

Veit Köppen received his MSc degree in Economics from Humboldt-Universität zu Berlin, Germany, in 2003. From 2003 until 2008, he worked as a Research Assistant in the Institute of Production, Information Systems and Operation Research, Freie Universität Berlin, Germany. He received a PhD (Dr. rer. pol.) in 2008 from Freie Universität Berlin. He is now a member of the Database Group at the Otto-von-Guericke University Magdeburg, Germany. Currently, he is the coordinator of a project funded by the German Ministry of Education and Research. His research interests include Business Intelligence, data quality, interoperability aspects of embedded devices, and process management. More information at <http://wwwiti.cs.uni-magdeburg.de/~vkoeppen>. <[email protected]>

Björn Brüggemann studied Computer Science at Otto-von-Guericke-University Magdeburg, Germany, and received his Masters Degree in 2010. In his Masters Thesis, he focused on Data Warehousing and the ETL process in the context of Data Quality. Since 2010, he has been working at Capgemini, Berlin, Germany, in Business Intelligence and Data Warehouse projects. More information at <http://www.xing.com/profile/Bjoern_Brueggemann3>. <[email protected]>

Bettina Berendt is a Professor in the Artificial Intelligence and Declarative Languages Group at the Department of Computer Science of K.U. Leuven, Belgium. She obtained her PhD in Computer Science/Cognitive Science from the University of Hamburg, Germany, and her Habilitation postdoctoral degree in Information Systems from Humboldt University Berlin, Germany. Her research interests include Web and text mining, semantic technologies and information visualization and their applications, especially for information literacy and privacy. More information at <http://people.cs.kuleuven.be/~bettina.berendt>. <[email protected]>

1 Introduction

Business Intelligence (BI) methods are built on high-dimensional data, and management decisions are often based upon data warehouses. Such a system represents internal and external data from heterogeneous sources in a global schema. Sources can be operational databases, files, or information from the Web. An essential success factor for Business Data Warehousing is therefore the integration of heterogeneous data into the Data Warehouse. The process of transferring the data into the Data Warehouse is called Extract-Transform-Load (ETL).

Although the ETL process can be performed in any individually programmed application, commercial ETL tools are often used [1]. Such tools are popular because interfaces are available for most popular databases, and because visualizations, integrated tools, and documentation of ETL process steps are provided. However, a tool does not guarantee successful data integration. In fact, the ETL expert has to cope with several issues, and many of the challenges are recurrent. Therefore, we believe that support for ETL processes is possible and can reduce design effort. We propose the use of the pattern approach from software engineering, because similarities exist between the ETL process and the software design process.

Software patterns are used in object-oriented design as best practices for recurring challenges in software engineering. They are general, re-usable solutions: not finished designs that can be transformed directly into code, but descriptions of how to solve a problem. These patterns are described in templates and often included in a catalogue. Consequently, a software developer can access these templates and implement best practices easily. The idea of design patterns has been adapted to different domains including ontology design [2], user-interface design [3], and information visualization [4].

In the domain of enterprise system integration, the pattern approach is adapted by [5]. [6] develops patterns for the design of service-oriented architectures. In this paper, we present patterns for the design and implementation of ETL processes.

The paper is organized as follows: in Section 2, the ETL process is described; in Section 3 we present the ETL pattern approach with three example patterns; and in Section 4 we conclude with a brief evaluation and an outlook on future work.

2 The ETL Process

Data Warehouses (DW) are often described as an architecture where heterogeneous data sources, providing data for business analysis, are integrated into a global data schema. Besides the basis database, where data is stored at

© Novática CEPIS UPGRADE Vol. XII, No. 3, July 2011 49


a fine-grained level, data marts for domain-specific analyses are stored, containing more coarse-grained information. Furthermore, management tools such as data-warehouse managers and metadata managers are included in the architecture. A DW reference architecture is given in [7].

The process of data integration is performed in the staging area in the architecture. Here, heterogeneous data are extracted from their origins. Adapters and interfaces can be used to extract data from different sources such as operational (OLTP) databases, XML files, plain files, or the Web. This extraction is followed by transformation into the DW schema. This schema depends on the DW architecture and the domain or application scenarios. In practice, relational data warehouses are used, and star or snowflake schemas are applied as relational On-Line Analytical Processing (ROLAP) technologies; see for instance [8]. In addition, transformations according to data formats and aggregations, as well as tasks related to data quality such as the identification of duplicates, are performed during this step. Finally, the data is loaded from the staging area into the basis database within the DW. Based on this, a cube or different data marts can be built, data mining algorithms applied, reports generated, and analyses performed. In Figure 1, we present the ETL process in its generic steps.

Figure 1: The ETL Process.

A monitor observes a data source for changes. This is necessary to load updated data into the DW. The monitoring strategy is defined depending on the data source. Two main strategies exist: either all changes are propagated to the monitor and the delta of all changes can be computed, or the monitor can only identify that changes occurred. We distinguish the following mechanisms:
• Reactions are selected according to event-condition-action rules for defined situations.
• Relevant data or changes are stored in an additional data store; the data is therefore replicated.
• Logs, which are otherwise used for recovery, can be parsed and used.
• Applications that update data can be monitored via time-stamp methods or snapshots.

The extraction operation is responsible for loading data from the source into the staging area. This operation depends upon the monitoring method and the data source. For example, it is possible that the monitor identifies a change, but the extraction process happens later, at a time predefined by the extraction operation. There exist different strategies for the extraction operation:
• Periodic, where data is extracted continuously and recurrently at a given time interval. This interval depends on requirements on timeliness as well as dynamics in the source.
• Query-based, where the extraction is started when an explicit query is performed.
• Instant, where all changes are directly propagated into the DW.
• Event-based, where a time-, external- or system-relevant event starts the extraction operation.

The transformation within the staging area fulfils the tasks of data integration and data fusion. All data are integrated and transformed into the DW schema, and at the same time, data quality aspects are addressed, such as duplicate identification and data cleaning. Different transformations exist and can be categorized as follows:
• Key handling: since not all database keys can be included in the DW schema, surrogates are used.
• Data-type harmonization, where data are loaded from heterogeneous data sources.
• Conversion of encodings of the same domain attribute value to a common encoding (e.g., 0/1 and m/f for gender are mapped to m/f).
• Unification of strings, because the same objects can be represented differently (e.g., conversion to lower case).
• Unification of date formats: although databases handle different date formats, some other sources such as files can only provide a fixed date format.
• Conversion of scales and scale units, such as currency conversions.
• Combination or separation of attributes, depending on the attribute level of the heterogeneous sources and the DW.
• Computation and imputation, in the case that values can be derived but are not given in the source systems.

The loading of the extracted and transformed data into the DW (either into the basis database or into data marts) can occur in online or offline mode. If the DW is or should be accessed while the loading takes place, an online strategy is necessary. This should be used for incremental updates, where the amount of loading is small. In the first (initial) loading of a DW, the loading volume is high and the DW is run in an offline mode for the users. At this time, the loading operation has exclusive access to all DW tables. Another task for the load operation is the historicization of data: old data is not deleted in a DW but should be marked as deprecated.

The ETL process can be refined into several ETL steps, where each step consists of an initialization, a task execution, and a completion. These steps enable ETL designers to structure their work. The following steps can be necessary in an ETL process: extraction, harmonization and plausibility checks, transformations, loading into DW dimensions, loading into DW fact tables, and updating. We use this categorization for our template approach in the next section.

3 ETL Patterns

The term "pattern" was first described in the meaning used here in the domain of architecture [9]. A pattern is described as a three-part rule consisting of the relations between context, problem, and solution. A pattern is used for recurrent problems and describes the core solution of this problem. For pattern users, it is necessary to identify problem, context, and solution in an easy way. Therefore, templates should be used to structure all patterns uniformly.

We derive our pattern structure from software engineering patterns because of the similarities between software design and ETL processes. A template consists of different elements such as name and description. For examples of templates in object-oriented software design see [10], for software architecture design patterns see [11], and for the domain of data movement see [12]. They all have in common that some elements are mandatory and others are optional. Mandatory elements are the name of the pattern, context, problem description, and core solution.

We see two levels of tasks in an ETL process: elementary and composite tasks. An elementary task inside an ETL process is often represented by an operator in the tools. A decomposition is not useful, although there might exist an application that allows a decomposition. We present the Aggregator Pattern as an example pattern for solving an elementary ETL task in Section 3.1.

Elementary tasks can be used in a composite task. A composite task is a sequence of several tasks or operators and is therefore more complex. We can classify the composite tasks according to the ETL steps described in Section 2. Apart from the loading into the DW dimensions, all categories and consequently all ETL patterns are independent of the DW schema. We support the design of composite tasks in the ETL process by including composition properties. These composition properties describe categories of tasks that are executed before or after the composite task. Figure 3 depicts this composition property for the History Pattern described in Section 3.2. Before the History Task is performed, loading into the DW dimensions and transformations may be performed. After the completion of the History Task, a loading into DW fact tables or into DW dimensions is possible. Note that all elements are optional in this example.

Providing this information, a sequence structure can be defined and visualized, as we present in Figure 2. In this way, the complete design of the ETL process can be given at an abstract level, and customization of the ETL process can easily be implemented.

Figure 2: ETL Process with Patterns from Different Categories.

We structure our ETL patterns according to the template shown in Table 1.

Element             Meta-description                                                  Mandatory?
Name                Identifies the pattern in the catalogue.                          Yes
Intention           A concise description of the use at which the pattern aims.       Yes
Classification      A reference to an elementary or composite task, with an
                    optional refinement on the ETL steps.                             Yes
Context             Describes the situation in which the problem occurs.              Yes
Problem             A detailed description of the problem.                            Yes
Solution            A concise description of the solution.                            Yes
Resulting Context   Describes the outcome and the advantages and disadvantages
                    of using this pattern.                                            No
Data Quality        Which data quality issues are addressed and which data
                    quality dimension(s) is/are improved.                             No
Variants            A reference to similar and adapted patterns.                      No
Alternative Naming  Other commonly used names of the pattern.                         No
Composite Property  Only composite patterns use this description; it states the
                    composition property of the pattern.                              No
Used in             Briefly states where the pattern is applied; this helps in
                    understanding and deciding whether a pattern should be used.      No
Implementation      For various ETL tools, the solution is put into practice
                    differently; therefore, different implementations are
                    referenced here.                                                  No
Demonstration       A reference to an exemplary implementation of this pattern.       No

Table 1: ETL Pattern Structure.

In the following, we present three ETL patterns as examples. In our first example, an elementary ETL task is presented, the Aggregator Pattern. In the other two examples, we present composite ETL tasks: the History Pattern, where data is stored in the DW according to changes in DW dimensions, and the Duplicates Pattern for the detection of duplicates.
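The paper prescribes no concrete representation for its pattern catalogue; purely as a sketch, the template of Table 1 could be modeled as a small data structure with the mandatory elements as required fields and the optional elements defaulting to empty. All names in this snippet are hypothetical, not taken from the paper:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical in-memory representation of the ETL pattern template of Table 1.
# Mandatory elements are plain fields; optional elements default to None/empty.
@dataclass
class ETLPattern:
    name: str                # mandatory: identifies the pattern in the catalogue
    intention: str           # mandatory: the use at which the pattern aims
    classification: str      # mandatory: "elementary" or "composite" (+ ETL step)
    context: str             # mandatory: situation in which the problem occurs
    problem: str             # mandatory: detailed problem description
    solution: str            # mandatory: core solution
    resulting_context: Optional[str] = None
    data_quality: Optional[str] = None
    variants: list[str] = field(default_factory=list)
    alternative_naming: list[str] = field(default_factory=list)
    composite_property: Optional[str] = None       # only for composite patterns
    used_in: Optional[str] = None
    implementation: dict[str, str] = field(default_factory=dict)  # per ETL tool
    demonstration: Optional[str] = None

# Example catalogue entry, abridged from Section 3.1.
aggregator = ETLPattern(
    name="Aggregator Pattern",
    intention="Aggregate data sets within ETL processes.",
    classification="elementary",
    context="Fine-grained data from a database or file are loaded into the DW.",
    problem="The DW data model does not require data at a fine-grained level.",
    solution="An operator collects data and transforms them to the desired granularity.",
)
print(aggregator.name, "->", aggregator.classification)
```

A catalogue would then simply be a collection of such entries, queryable by classification or ETL step.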
3.1 The Aggregator Pattern

Name: Aggregator Pattern
Intention: Data sets should be aggregated via this pattern within ETL processes.
Classification: Elementary task.
Context: Data on a fine-grained level are loaded from a database or file into the DW.
Problem: The DW data model does not require data at a fine-grained level. If data from the operational system is not needed at a fine-grained level, two problems may occur: more storage is required in the DW, and performance decreases due to more data having to be processed.
Solution: An ETL operator is used that collects data from the sources and transforms them into the desired granularity.
Resulting Context: A performance increase can be obtained, in the DW system as well as in the ETL process, through the reduction of data. Furthermore, the required storage space is reduced. However, one disadvantage is that there exists no inverse operation, so the inference to the original data is not possible. If data granularity changes, information loss may result.

3.2 The History Pattern

Name: History Pattern
Intention: Data sets in the dimension tables should be marked and catalogued.
Classification: Composite task in the category of dimension loading for the star schema.
Context: Product, Location, and Time are dimensions in the DW that can change over time. Analyses in the context of master data can be done according to the dimensions.
Problem: Master data changes only occasionally, but they do sometimes change (such as the last name of a person). These changes should be taken into account in the dimension tables. However, challenges occur due to the use of domain keys that change over time; thus, they cannot be used as primary keys. This is in contrast to the modeling of dimension tables in the star schema. Another problem is the use of domain keys if redundancy is required.
Solution: An important challenge is to store old and new data in the DW system. Furthermore, a relation of fact table and dimension data is necessary. For this purpose, the dimensional table has to be extended by additional attributes. In a first step, a virtual primary key is added, together with one or more attributes storing current or up-to-date information. The attributes valid_from and valid_to are used to store the information about when the data was valid. This is described differently in the literature, for example as changes of type II dimensions [10] or as snapshot history [13]. For every data set, a decision has to be made: either it is a new data set, an updated one, or a data set that already exists in the dimension tables of the DW. For this comparison, a key should be used that is persistent in time, such as the domain key. Every source data set is mapped with this key to the dimensions. If this is not possible, a new entry is identified. If all attributes of the source data set are equal compared to a data set in the DW, an existing one is identified. Otherwise, an updated data set is detected. A new data set has to be stored in the dimension tables, and the attributes valid_from and valid_to as well as the virtual key have to be generated and timeliness set to true. For an update, the timeliness and valid_to information of the already existing data set have to be set before the source data set can be entered into the DW.
Resulting Context: All data are historicized; however, this influences performance due to the increase of the data amount in the dimension tables. The domain key has to be unique; otherwise, duplicate detection has to be performed first.

Figure 3: Composite Properties for History Pattern.
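The decision logic of the Solution above (new vs. existing vs. updated data set, with the valid_from, valid_to, and timeliness attributes) can be sketched as follows. This is a minimal in-memory illustration under assumed helper names, not the authors' implementation:

```python
from datetime import date

# Dimension rows keyed by a virtual (surrogate) primary key; each row keeps the
# time-persistent domain key, descriptive attributes, validity interval, and
# the timeliness flag described in the Solution.
dimension = {}          # virtual_key -> row dict
next_virtual_key = 1

def historize(source_row, load_date):
    """Classify a source data set as 'new', 'existing', or 'updated'."""
    global next_virtual_key
    # Map the source data set to the dimension via the time-persistent domain key.
    current = next((r for r in dimension.values()
                    if r["domain_key"] == source_row["domain_key"] and r["timeliness"]), None)
    if current is not None and current["attributes"] == source_row["attributes"]:
        return "existing"                      # all attributes equal: already in the DW
    if current is not None:
        # Updated data set: close the validity interval of the current version.
        current["timeliness"] = False
        current["valid_to"] = load_date
    # New data set (or new version of an updated one): open validity interval.
    dimension[next_virtual_key] = {
        "domain_key": source_row["domain_key"],
        "attributes": dict(source_row["attributes"]),
        "valid_from": load_date,
        "valid_to": date.max,
        "timeliness": True,
    }
    next_virtual_key += 1
    return "updated" if current is not None else "new"

print(historize({"domain_key": 7, "attributes": {"last_name": "Meier"}}, date(2011, 1, 1)))    # new
print(historize({"domain_key": 7, "attributes": {"last_name": "Meier"}}, date(2011, 2, 1)))    # existing
print(historize({"domain_key": 7, "attributes": {"last_name": "Schmidt"}}, date(2011, 3, 1)))  # updated
```

In a real DW, the same comparison would typically run as a set-based operation inside the ETL tool rather than row by row.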
Data Quality: All available information (data completeness for dimensions) is accessible for analysis with the help of the History Pattern. Data timeliness is another advantage for data quality issues, as long as the loading is performed at short, regular time intervals.
Composite: Before an ETL task from the History Pattern is performed, patterns from the categories Loading Dimension and Transformation may be applied. The History Pattern can be followed by patterns from the Loading Facts and Loading Dimension categories.

3.3 The Duplicates Pattern

Duplicate detection is a common but complex task in ETL processes. With our pattern template, we briefly describe the solution, although in practice this should be described more comprehensively; see [14][15][16] for more details.

Name: Duplicates Pattern
Intention: This pattern reduces redundancy in the DW data; in the best case, it eliminates redundancy completely.
Classification: Composite task in the category Transformation.
Context: Data from heterogeneous sources (e.g., applications, databases, files) have to be loaded into the DW.
Problem: A data hub for the integration of data is not always available; therefore, master data redundancy occurs in different business applications. A duplicate is two or more data sets that describe the same real-world object. Data in the DW should give a consolidated view and must be free of duplicates.
Solution: Duplicates have to be identified and deleted. As a first step, data have to be homogenized. This includes conversions, encodings, and separations of all comparative attributes. Partitioning of data reduces the comparison effort, but must be chosen with caution in order not to miss duplicates. The comparison is based on similarity measures that help to identify duplicates. There exist different methods and measures, based on the data context. A data fusion of identified duplicates has to be carried out. Aspects of uncertainty and inconsistency have to be considered in this context. Inconsistency means that semantically identical attributes have different values. Uncertainty occurs if only null values are available. Data conflict avoidance can be carried out via the survivor strategy [17], where a predefined source entry is favored over all others, or via set-based merge [9], where the disjunction of all values is stored. In contrast, data conflict resolution can be carried out via a decision strategy, where an entry is determined from the sources, or a mediation strategy, where new values can be computed.
Resulting Context: Duplicates are only partially detected. Due to the complexity of duplicate detection, the ETL designer has to carefully consider the data context and appropriate methods for measuring similarities or the partitioning strategy.
Data Quality: The data quality issue of non-redundancy is supported with this pattern.
Composite: The Duplicates Pattern can be preceded by patterns from the Transformation category as well as from the category Harmonization & Plausibility Check. The categories Transformation, Updating, and Loading Dimension include patterns that can be used for subsequent tasks; see Figure 4.

Figure 4: Composite Properties of the Duplicates Pattern.
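The homogenization, similarity comparison, and survivor-style fusion steps of the Duplicates Pattern can be sketched as follows. The similarity measure (a generic string ratio) and the threshold are stand-ins chosen for illustration, since the pattern deliberately leaves the choice of measure to the data context:

```python
from difflib import SequenceMatcher

# Minimal sketch of the Duplicates Pattern: homogenize comparative attributes,
# compare pairs via a similarity measure, and fuse duplicates by keeping the
# entry from the favored (earlier) source, i.e. a survivor-style strategy.

def homogenize(record):
    # Unify strings so that the same object is represented identically.
    return {k: v.strip().lower() if isinstance(v, str) else v for k, v in record.items()}

def similarity(a, b):
    # String similarity over the concatenated comparative attribute values.
    key = lambda r: " ".join(str(v) for v in sorted(r.values(), key=str))
    return SequenceMatcher(None, key(a), key(b)).ratio()

def deduplicate(records, threshold=0.9):
    """Drop records whose similarity to an already-kept record exceeds the
    threshold; among duplicates, the earlier (favored) entry survives."""
    result = []
    for rec in map(homogenize, records):
        if not any(similarity(rec, kept) >= threshold for kept in result):
            result.append(rec)
    return result

customers = [
    {"name": "Anna Schmidt ", "city": "Berlin"},
    {"name": "anna schmidt", "city": "berlin"},   # duplicate of the first entry
    {"name": "Paul Weber", "city": "Hamburg"},
]
print(len(deduplicate(customers)))  # 2
```

Pairwise comparison is quadratic in the number of records, which is exactly why the pattern recommends partitioning the data before comparison.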

4 Conclusion and Future Work

The creation of complex ETL processes is often a challenging task for ETL designers. This complexity is comparable to software engineering, where patterns are used to structure the required work and support software architects and developers. We propose ETL patterns for the support of ETL designers. This provides an adequate structure for


performing recurring tasks and allows developers to apply solutions more easily. In this paper we have presented a template for the general description of ETL patterns. Furthermore, we have presented three examples.

As future work, we plan to create an ETL pattern catalogue with descriptions of the most common ETL tasks and the corresponding challenges. This includes an evaluation of the pattern catalogue as well as the application to different ETL tools.

References

[1] R. Schütte, T. Rotthowe, and R. Holten, editors. Data Warehouse Managementhandbuch. Springer-Verlag, Berlin et al., 2001.
[2] OntologyDesignPatterns.org. <http://ontologydesignpatterns.org>.
[3] S.A. Laakso. Collection of User Interface Design Patterns. University of Helsinki, Dept. of Computer Science. <http://www.cs.helsinki.fi/u/salaakso/patterns/index.html>, 2003 [accessed July 20, 2011].
[4] J. Heer and M. Agrawala. Software Design Patterns for Information Visualization. IEEE Transactions on Visualization and Computer Graphics, 12(5):853, 2006.
[5] G. Hohpe and B. Woolf. Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley, Boston, 2004.
[6] T. Erl. SOA Design Patterns. Prentice Hall PTR, Boston, 2009.
[7] A. Bauer and H. Günzel. Data-Warehouse-Systeme. Architektur, Entwicklung, Anwendung. dpunkt Verlag, Heidelberg, 2009.
[8] E.F. Codd, S.B. Codd, and C.T. Salley. Providing OLAP to User-Analysts: An IT Mandate. Technical report, Codd & Associates, 1993.
[9] D. Apel, W. Behme, R. Eberlein, and C. Merighi. Datenqualität erfolgreich steuern. Praxislösungen für Business-Intelligence-Projekte. Carl Hanser Verlag, 2009.
[10] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional, 1995.
[11] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, and M. Stal. Pattern-Oriented Software Architecture: A System of Patterns. Volume 1. Wiley, 1996.
[12] P. Teale. Data Patterns. Patterns and Practices. Microsoft Corporation, 2003.
[13] H.-G. Kemper, W. Mehanna, and C. Unger. Business Intelligence – Grundlagen und praktische Anwendungen. Eine Einführung in die IT-basierte Managementunterstützung. Vieweg Verlag, Wiesbaden, 2006.
[14] I.P. Fellegi and A.B. Sunter. A Theory for Record Linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
[15] A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19:1–16, 2007.
[16] C. Batini and M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006.
[17] R. Hollmann and S. Helmis. Webbasierte Datenintegration. Ansätze zur Messung und Sicherung der Informationsqualität in heterogenen Datenbeständen unter Verwendung eines vollständig webbasierten Werkzeuges. Vieweg Verlag, 2009.