Title: ISQS 3358, Business Intelligence Supplemental Notes on the Term Project
1ISQS 3358, Business IntelligenceSupplemental
Notes on the Term Project
- Zhangxi Lin
- Texas Tech University
2Term project
- 3-4 students form a team to fulfill a data mart
development project. - Stage 1 (10) Project proposal. March 9
- Stage 2 (20) Data mart development. March 25
- Stage 3 (20) Data mart populating. Due April 6
- Stage 4 (20) Analysis report. Due April 20
- Stage 5 (30) The compilation of the previous
deliverables with modifications and enhancements.
Due May 1 - Detailed instructions http//zlin.ba.ttu.edu/3358
/Projects09.html
3Merits of the project outcomes
- Carefully developed project proposal
demonstrating the understanding of the business
requirements, attractive analytics themes, and
clearly defined project goal and objectives - Comprehensive data mart design, such as multiple
fact tables, with supporting analytic themes - Applications of advanced ETL model or techniques,
such as slowly changing dimensions, the use of
containers, etc. - Advanced OLAP cube design, and/or optional MDX
scripting by self-taught - Rich data analysis outcomes
- Well-presented final report
- Demonstrating the creative ideas and skillful
data warehousing ability
4More about dimensional modeling
- - How to develop a more professional data mart
5Slowly Changing Dimensions
- The attribute values in a dimension may change
over time, which are critical to understand the
dynamics of the business. The ability to track
the changes of facts over time is critical to a
DW/BI system. - Employees changed their departments
- Home moving (16.8 American moved per year) zip
code changes possible. - The dimensions that have changeable attribute
values slowly changing dimensions (SCDs) - Type 1 SCD overwrites the existing attribute
value with a new value. You dont care about
keeping track of historical values - Type 2 SCD change tracking ETL process creates
a new row in the dimension table to capture the
new values of the changed item - Type 3 SCD Similar to Type 2 SCD but only track
current state and the original state two
additional attribute SCD Start Date, SCD Initial
Value - Ask business users how they will use the
information to determine which SCDs are of type 1
and which are of type 2.
6Surrogate Key
- A surrogate key has a unique value assigned to
each row in the dimension. It becomes primary key
of the dimension table and is used to join the
dimension to the associated foreign key field in
the fact table - Benefits of surrogate keys
- Protect the DW/BI system from changes in the
source system - Allow the DW/BI system to integrate data from
multiple source system - Enable developers to add rows to dimensions that
do not exist in the source system - Provide the means for tacking changes in
dimension - Are efficient in the relational database and
analysis services - Surrogate keys are created when doing data
warehousing. They are new from the keys in
original database - They are also called meaningless keys, substitute
keys, non-natural keys, artificial keys - Specifically, surrogate keys are used in slowly
changed dimensions (SCD) management
7Aggregate Dimensions
- Situation data at different levels of
granularities - Two resolutions
- Removing a dimension
- Rolling up a dimensions hierarchy and provide a
new, shrunken dimension at the aggregate level
7
8Many-to-many or Multivalued Dimensions
- Relationship between a dimension table and fact
table is called one-to-many one row in the
dimension table may join to many rows in the fact
table. - Many-to-many or Multivalued Dimensions are
referred to as there are more than one row in a
dimension table joining to multiple rows in a
fact table - Bridge table supports many-to-many relationship
- fact-dimension
- dimension-dimension.
-
9Hierarchies
- Meaningful, standard ways to group the data
within a dimension - Variable-depth hierarchies
- Frequently changing hierarchies
- Examples of hierarchy in a dimension
- Address street, city, state, country
- Organization section, division, branch, region
- Time year, quarter, month, date
10Heterogeneous Products
- Several products with differentiated attributes
- Problem sharing one dimension or use different
dimension? - Resolutions
- One family-oriented dimension with core fact and
product tables plus specific information for each
line of product
11The Three Fact Table Types
- Transaction fact table
- Periodic snapshot fact table
- Cumulative performance over specific time
intervals valuable to combine data across
several business processes in the value chain. - Accumulating snapshot fact table
- Constantly updated over time.
12Junk Dimensions
- Also called miscellaneous or mystery dimensions
- They are miscellaneous attributes that dont
belong to any existing dimension. - Typically flags or indictors that describe or
categorize the transaction in some way. - Contents are often important
- Four alternatives for dealing with them
- Leave them in the fact table
- Create a separate dimension for each attribute
- Omit them
- Group them into a single junk dimension
13Degenerate Dimensions
- No description of its own
- No joining to an actual dimension table
- No attributes
- Example transaction ID
13
14Data Warehousing Methodology
- - Implementing data warehouse systematically
14
15Dimensional Modeling Process
- Preparation
- Identify roles and participants
- Understanding the data architecture strategy
- Setting up the modeling environment
- Establishing naming conventions
- Data profiling and research
- Data profiling and source system exploration
- Interacting with source system experts
- Identifying core business users
- Studying existing reporting systems
- Building Dimensional models
- High-level dimensional model design
- Identifying dimension and fact attributes
- Developing the detailed dimensional model
- Testing the model
- Reviewing and validating the model
16Business Dimensional Lifecycle
Business Reqts definition
Technical Arch. Design
Product Selection Installation
Growth
Dimensional Modeling
Physical Design
ETL design Development
Deployment
Project Planning
BI Appl. Specification
BI Application Development
Maintenance
Project Management
16
17ETL Methodology
- Develop a high-level map
- Build a sandbox source system (optional)
- Detailed data profiling
- Make decisions
- The source-to-target mapping
- How often loading tables
- The strategy for partitioning the relational and
Analysis Services fact table - The strategy for extracting data from each source
system - De-duplicate key data from each source system
(optional) - Develop a strategy for distributing dimension
tables across multiple database servers
(optional)
17
18Sandbox Source System
- Sandbox
- A protected, limited environment where
applications are allowed to "play" without
risking damage to the rest of the system. - A term for the RD department at many software
and computer companies. The term is
half-derisive, but reflects the truth that
research is a form of creative play. - In the DW/BI context, sandbox source system is a
subset of source database for analytic
exploration tasks - How to create
- Set up a static snapshot of the database
- By sampling
18
19Data Profiling
- Data profiling is a methodology for learning
about he characteristics of the data - It is a hierarchical process that attempt to
build an assessment of the metadata associated
with a collection of data sets. - Three levels
- Bottom characterizing the values associated
with individual attributes - Middle the assessment looking at relationships
between multiple columns within a single table. - Highest level the profile describing
relationships that exist between data attributes
across different tables. - Can run a program against the sandbox source
system to obtain the needed information.
19
20Decision Issues in ELT System Design
- Source-to-target mapping
- Load frequency
- How much history is needed
20
21Strategies for Extracting Data
- Extracting data from packaged source systems
self-contained data sources - May not be good to use their APIs
- May not be good to use their add-on analytic
system - Extracting directly from the source databases
- Strategies vary depending on the nature of the
source database - Extracting data from incremental loads
- How the source database records the changes of
the rows - Extracting historical data
21