Title: Data Warehousing: the New Knowledge Management Architecture for Humanities Research?
1Data Warehousing: the New Knowledge
Management Architecture for Humanities Research?
- Janet Delve
- University of Portsmouth, UK
- UKAIS 2004
2Introduction
- Data Warehouses everywhere
- Amazon
- WalMart
- Opodo
- DWs are used a lot in industry and scientific
research, but not in humanities research.
- Written paper covers linguistics and history.
- Talk covers history in detail and gestures
towards linguistics.
3Overview
- Introduction
- Data modelling and traditional databases
- Source-oriented data modelling
- Data Mining
- Philosophy of data warehousing
- Background of DWs
- Basic components of a data warehouse (DW)
- Advantages of DWs
- Findings: Humanities and DWs
- Humanities and DWs: some issues
- Examples of possible Humanities DWs
- Ideas for the future?
4Data Modelling
- Relational data modelling: material split into
many tables in order to gain enhanced performance,
no duplication, no updating or insertion anomalies
etc.
- Source-oriented data modelling: emphasis on
modelling data as closely as possible to the original
source, which is included in its entirety for
posterity.
- DW data modelling: nearer to the source-oriented
approach in spirit.
5Traditional databases
- ERD (Harvey and Press, p. 117)
6Traditional databases
7Historical Data
- This can be difficult to model because
- It is irregular in structure,
- It is complex
- It is erratic in terms of when it occurs
- Using a relational database can mean data from a
single source being split into many tables.
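A rough sketch of the last point, using Python's built-in sqlite3 module. The source entry is invented (a hearth-tax style record, not real data), and the table and column names are assumptions made for the example. One entry ends up in three tables, and reading it back as it stood in the source needs a join.

```python
import sqlite3

# One invented source entry (illustrative only, not real data).
entry = {"person": "John Smith", "parish": "St Mary", "year": 1674, "hearths": 3}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE parish (parish_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE assessment (
        assessment_id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES person(person_id),
        parish_id INTEGER REFERENCES parish(parish_id),
        year INTEGER,
        hearths INTEGER
    );
""")

# The single entry is scattered across three tables.
person_id = conn.execute(
    "INSERT INTO person (name) VALUES (?)", (entry["person"],)
).lastrowid
parish_id = conn.execute(
    "INSERT INTO parish (name) VALUES (?)", (entry["parish"],)
).lastrowid
conn.execute(
    "INSERT INTO assessment (person_id, parish_id, year, hearths) VALUES (?, ?, ?, ?)",
    (person_id, parish_id, entry["year"], entry["hearths"]),
)

# Reading the entry back as it appeared in the source needs a join.
reassembled = conn.execute("""
    SELECT p.name, pa.name, a.year, a.hearths
    FROM assessment a
    JOIN person p ON p.person_id = a.person_id
    JOIN parish pa ON pa.parish_id = a.parish_id
""").fetchone()
```

With irregular sources, each extra kind of detail tends to add another table and another join, which is the difficulty the slide points at.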
8Source-oriented data modelling
- "a semantic network tempered by hierarchical
considerations" (Thaller 1991, 155).
- Its flexible nature gives κλειω a "rubber band
data structures" facility (Denley 1994, 37).
- The fluid nature of creating a database with
κλειω marks it out as an organic DBMS.
9Data Mining
- The whole field is often referred to as data
mining, which is also a major component within
the field.
- Data mining (DM) is normally used on large
quantities (terabytes) of data, to find
meaningful patterns. Neural nets, statistical
modelling and decision trees are just some of the
AI methods used. SQL can be used too. Parallel data
processing is used with DM.
- In order to mine data, it must be kept in a
suitable system; a data warehouse is ideal.
10Philosophy of data warehousing
- Data warehousing is an architecture, not a
technology. There is the architecture, and there
is the underlying technology, and they are two
very different things. Unquestionably there is a
relationship between data warehousing and
database technology, but they are most certainly
not the same. Data warehousing requires the
support of many different kinds of technology.
(Inmon 2002)
11Background of DWs
- Business-oriented: DWs serve the analytical needs
of a company. The ordinary DBMS is still needed for
the day-to-day queries, and also to feed the DW.
- W.H. Inmon, father of DW. Cabinet effect, 1991
- R. Kimball, expert on dimensional modelling
- Need for single, integrated source of clean data,
particularly for multinational companies etc.
- Supporting technology from e.g. Oracle, Prism
Solutions, IBM
12Data Marts
- Data marts contain DW data but are restricted to
one department or one business process.
- The industry is divided about data marts:
- Inmon recommends building the DW first, then
siphoning off the data to data marts.
- Kimball believes you should build several data
marts first, then integrate them into a DW.
13Basic components of a Data Warehouse (DW)
- A DW is subject-oriented, integrated,
non-volatile and time-variant.
- The major subjects for an insurance company are
customer, policy, premium and claim. Previously
data was modelled around applications: car, health,
life and accident.
- Integration is the most important facet of a DW.
Previous inconsistencies are ironed out and all
data is entered unambiguously into the DW. Many
sources of data can be placed in the DW.
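A minimal sketch of what ironing out inconsistencies might look like, assuming three source applications (named after the insurance applications above) that each code the same attribute differently. The coding schemes and the `integrate` helper are invented for illustration.

```python
# Hypothetical per-source coding schemes for one attribute (sex).
SOURCE_CODINGS = {
    "car_app": {"m": "male", "f": "female"},
    "health_app": {"1": "male", "0": "female"},
    "life_app": {"male": "male", "female": "female"},
}

def integrate(source: str, raw_value: str) -> str:
    """Map a source-specific code onto the single warehouse encoding."""
    coding = SOURCE_CODINGS[source]
    key = raw_value.strip().lower()
    if key not in coding:
        # Refuse to load ambiguous values rather than guess.
        raise ValueError(f"unmapped value {raw_value!r} from {source}")
    return coding[key]

rows = [("car_app", "M"), ("health_app", "0"), ("life_app", "Female")]
integrated = [integrate(src, val) for src, val in rows]
```

Rejecting unmapped values, rather than passing them through, is one way of keeping the warehouse copy unambiguous.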
14Basic components of a Data Warehouse (DW)
- Non-volatile: data in a DW is not changed in the
way data in an operational database is; data is
loaded en masse and isn't updated. Obviates need
for normalisation.
- Time-variant: DW time horizon 5-10 years,
operational database 2-3 months. DW holds snapshots,
operational database current data. DW always has an
element of time, operational database may or may
not have. (Inmon 2002)
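The contrast between the two kinds of store can be sketched with sqlite3. This is an illustration under assumed table names and invented figures, not a description of any particular product: the operational table is overwritten in place, while the warehouse table only ever receives dated bulk loads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Operational table: holds only the current state and is updated in place.
conn.execute("CREATE TABLE op_policy (policy_id INTEGER PRIMARY KEY, premium REAL)")
# Warehouse table: every row carries a snapshot date and is never updated.
conn.execute(
    "CREATE TABLE dw_policy_snapshot (snapshot_date TEXT, policy_id INTEGER, premium REAL)"
)

def load_snapshot(snapshot_date: str) -> None:
    """Bulk-copy the current operational state into the warehouse."""
    conn.execute(
        "INSERT INTO dw_policy_snapshot SELECT ?, policy_id, premium FROM op_policy",
        (snapshot_date,),
    )

conn.execute("INSERT INTO op_policy VALUES (1, 100.0)")
load_snapshot("2003-01-01")
conn.execute("UPDATE op_policy SET premium = 120.0 WHERE policy_id = 1")  # overwrite
load_snapshot("2004-01-01")

current = conn.execute("SELECT premium FROM op_policy").fetchall()
history = conn.execute(
    "SELECT snapshot_date, premium FROM dw_policy_snapshot ORDER BY snapshot_date"
).fetchall()
```

After the update the operational table knows only the new premium, whereas the warehouse keeps both dated values.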
15Basic components of a Data Warehouse (DW)
16Typical Architecture of a Data Warehouse
17Meta Data
- Meta data is extremely important in a DW. It is
used:
- to log the extraction and loading of data into
the warehouse
- in query management, to locate the most
appropriate data source and also to help end
users to build queries
- to show how the data has been mapped when
carrying out data cleansing and transformations
- to manage all the data in the DW: recording
where data came from, when etc.
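As a sketch of the logging and tracking uses listed above, the fragment below keeps one meta data record per load. The `LoadRecord` structure, the helper names, the file names and the row counts are all hypothetical; real DW tools keep far richer meta data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class LoadRecord:
    """Meta data about one load into the warehouse (hypothetical structure)."""
    source: str            # where the data came from
    target_table: str      # where it was loaded
    rows_loaded: int
    transformations: List[str] = field(default_factory=list)  # cleansing steps applied
    loaded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

load_log: List[LoadRecord] = []

def log_load(source, target_table, rows_loaded, transformations):
    """Record one extraction-and-load run."""
    record = LoadRecord(source, target_table, rows_loaded, list(transformations))
    load_log.append(record)
    return record

def sources_for(target_table):
    """Tracking: which sources fed a given warehouse table?"""
    return sorted({r.source for r in load_log if r.target_table == target_table})

# Invented file names and row counts, for illustration only.
log_load("census_1841.csv", "person_fact", 1200,
         ["trim whitespace", "standardise parish names"])
log_load("census_1861.csv", "person_fact", 1350, ["trim whitespace"])
```

Calling `sources_for("person_fact")` then lists both source files, which is the kind of "where did this come from" question meta data is meant to answer.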
18Basic components of a Data Warehouse (DW)
- Fact Tables
- A fact table is the primary table in a
dimensional model where the numerical performance
measurements of the business are stored
- The measurement data resulting from a business
process is stored in a single data mart
- Since measurement data is overwhelmingly the
largest part of any data mart, we avoid
duplicating it in multiple places around the
enterprise (Kimball 2002)
19Basic components of a Data Warehouse (DW)
- Dimension tables
- These contain the textual descriptors of the
business. Their depth and breadth define the
usefulness of the DW.
- Contain data that doesn't change frequently
- Can have 50-100 attributes.
- Not usually normalized. (Snowflake and starflake)
- Coded values disparaged in favour of full text
descriptors (long-term view)
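A minimal sketch of a fact table joined to one dimension table, again using sqlite3. The insurance setting follows the earlier slides, but the table names, columns and figures are invented. The dimension holds spelled-out text rather than codes, the fact table holds keys plus a numeric measurement, and analysis groups the measurements by a descriptor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: textual descriptors, spelled out rather than coded.
    CREATE TABLE dim_policy (
        policy_key INTEGER PRIMARY KEY,
        policy_type TEXT,
        region TEXT
    );
    -- Fact: foreign keys to dimensions plus numeric measurements.
    CREATE TABLE fact_claim (
        policy_key INTEGER REFERENCES dim_policy(policy_key),
        claim_amount REAL
    );
    INSERT INTO dim_policy VALUES (1, 'car', 'South East'), (2, 'health', 'South East');
    INSERT INTO fact_claim VALUES (1, 250.0), (1, 100.0), (2, 400.0);
""")

# Sum the numeric measure, grouped by a textual descriptor.
totals = conn.execute("""
    SELECT d.policy_type, SUM(f.claim_amount)
    FROM fact_claim f JOIN dim_policy d USING (policy_key)
    GROUP BY d.policy_type
    ORDER BY d.policy_type
""").fetchall()
```

Because region sits directly in the dimension row instead of in its own normalised table, a query by region would need no further join.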
20Basic components of a Data Warehouse (DW)
21Basic components of a Data Warehouse (DW)
22Basic components of a Data Warehouse (DW)
23Data Warehousing Tools and Technologies
- Building a data warehouse is a complex task
because there is no vendor that provides an
end-to-end set of tools.
- Necessitates that a data warehouse is built using
multiple products from different vendors.
- Ensuring that these products work well together
and are fully integrated is a major challenge.
24Advantages of DWs
- Flexibility in modelling data.
- Time dimension: country-specific calendars and
synchronization across multiple time zones.
- Easy to add external data and summarised data.
- Built for analysis.
- Built for huge volumes of data (terabytes of data;
tera = a trillion, 10^12).
- Can cope with idiosyncrasies of geographic
location dimensions within GISs.
25Possible advantages of DWs
- Indexing facilities of DW.
- Publishing the right data: data collected from
a variety of sources and edited for quality and
consistency.
- DW seeks to collate all data so a variety of
different subsets can be analysed whenever
required.
- Easy to extend DW and add material from a new
source.
- Data cleansing techniques.
- Tracking facility afforded by meta data
26Disadvantages of DWs
- Some humanities data fits into the numerical
fact topology, some doesn't
- Technology not easy and is based on having
existing databases to extract from
- Regular snapshots: not the same, but they could
equate to data sets taken at different periods of
time (e.g. 1841 census, 1861 census)
- A lot to learn.
27Findings: Humanities and DWs
- NAGARA (National Association of Government
Archives and Records Administrators)
- Article on DWs by Mary Klauda of the Minnesota
Historical Society 1999 (archivist)
- Eastern Connecticut schools DW 2002
- Bo Wandschneider, University of Guelph, Canada:
DW and the use of census data. ICPSR
(Inter-university Consortium for Political and
Social Research)
28Findings: Humanities and DWs
- University of California DW: memo to Humanities
department
- Social Science DW: Human Resources DW project of
Human Sciences Research Council, South Africa
- GEOBASE, Israel. DW of Israel's regional
statistics, supported by National Planning
Authority in the Ministry of Interior Affairs.
29Humanities and DWs: some issues
- Scale: can cope with really large country-wide /
state-wide problems.
- Can analyse e.g. British censuses 1841-1901
(10^8).
- Can put several databases together to produce a
time run, e.g. hearth taxes, window taxes, poll
taxes, land taxes, poor rates all in one DW.
- Oracle site licenses.
30Examples of possible History DWs
31Examples of possible History DWs
HOLDING DETAILS
-----------------------------
Holding ID
King
Tenant in Chief
Manor Lord
VILL
Etc.

PROPERTY INFORMATION
-----------------------------
Property Id
Property description
Property value
Etc.

MANOR
-----------------------------
ManorId
Holding Id
Property Id
Original Owner Id
Date
Manor Value
Tax (Hides)
Cottar Population
Bordar Population
Villein Population
Sokeman Population
Priest Population
Number of Burgesses
Number of slaves
Etc.

ORIGINAL OWNER
-----------------------------
Original Owner ID
Etc.
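The design above could be declared roughly as follows in sqlite3. This is a sketch only: the column names are transliterated from the slide, only a subset of the MANOR columns is shown, the column types are guesses, and treating MANOR as the central table keyed to the other three is an interpretation of the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE holding_details (
        holding_id INTEGER PRIMARY KEY,
        king TEXT, tenant_in_chief TEXT, manor_lord TEXT, vill TEXT
    );
    CREATE TABLE property_information (
        property_id INTEGER PRIMARY KEY,
        property_description TEXT, property_value REAL
    );
    CREATE TABLE original_owner (
        original_owner_id INTEGER PRIMARY KEY
    );
    -- Central table: keys to the three surrounding tables plus numeric counts.
    CREATE TABLE manor (
        manor_id INTEGER PRIMARY KEY,
        holding_id INTEGER REFERENCES holding_details(holding_id),
        property_id INTEGER REFERENCES property_information(property_id),
        original_owner_id INTEGER REFERENCES original_owner(original_owner_id),
        date TEXT,
        manor_value REAL,
        tax_hides REAL,
        villein_population INTEGER,
        number_of_slaves INTEGER
    );
""")

# List the tables created and the tables that MANOR points at.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
manor_foreign_keys = [row[2] for row in conn.execute("PRAGMA foreign_key_list(manor)")]
```

The remaining population columns (cottar, bordar, sokeman, priest, burgesses) would be added to `manor` in the same way as the two counts shown.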
32Examples of possible History DWs
- Data from a variety of sources over time: hearth
tax, poor rates, trade directories, census,
street directories, wills and inventories, GIS
maps for a city e.g. Winchester.
- Voting data: poll book data and rate book data
up to 1870 for whole country (note: some data
missing).
- Port data: all data from portbooks for all
British ports together with yearly trade figures.
- Street directories for whole country for last 100
years.
- Taxation overview: different types / areas /
periods.
33Examples of possible History DWs
- 19th C British census data doesn't fit into the
typical DW model as it doesn't have the numerical
facts to go into a fact table.
- However, there's a recent development in DWs:
factless fact tables.
- There is real scope to be able to model
historical data using these.
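A minimal sketch of a factless fact table applied to census-style data, assuming invented table names, people and places. The fact table holds only keys, each row recording the event "this person was enumerated at this place in this census", and analysis counts rows instead of summing a measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_person (person_key INTEGER PRIMARY KEY, name TEXT, occupation TEXT);
    CREATE TABLE dim_place (place_key INTEGER PRIMARY KEY, parish TEXT);
    CREATE TABLE dim_census (census_key INTEGER PRIMARY KEY, census_year INTEGER);
    -- Factless fact table: keys only, no numeric measure.
    CREATE TABLE fact_enumeration (
        person_key INTEGER REFERENCES dim_person(person_key),
        place_key INTEGER REFERENCES dim_place(place_key),
        census_key INTEGER REFERENCES dim_census(census_key)
    );
    -- Invented rows, for illustration only.
    INSERT INTO dim_person VALUES (1, 'Ann Example', 'weaver'), (2, 'Tom Example', 'labourer');
    INSERT INTO dim_place VALUES (1, 'Parish A');
    INSERT INTO dim_census VALUES (1, 1841), (2, 1861);
    INSERT INTO fact_enumeration VALUES (1, 1, 1), (2, 1, 1), (1, 1, 2);
""")

# Analysis is done by counting rows rather than summing a measure.
counts = conn.execute("""
    SELECT c.census_year, COUNT(*)
    FROM fact_enumeration f JOIN dim_census c USING (census_key)
    GROUP BY c.census_year
    ORDER BY c.census_year
""").fetchall()
```

The same counting approach would extend to grouping by occupation or parish through the other dimension tables.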
34Examples of possible History DWs
35Examples of possible Humanities DWs
- Language DW: could contain databases of
different languages for comparison, or many
databases of same languages over larger area.
- DW of worldwide scholarly community / whole
culture
- GIS or archaeological DW by continent etc. rather
than country.
- DW of biographies.
- DW of library catalogues or archives for enhanced
public access.
36Ideas for the future?
- Instead of "me and my database" (emphasis on
smallish, individual, national projects),
- Maybe "our integrated warehouse" (emphasis on
large scale, collaborative, international projects)?