Title: Knowledge Discovery in Databases

Knowledge Discovery in Databases
University of Texas at Austin School of Information
Knowledge Management Systems
Presented April 29, 2003
By Anne Marie Donovan
  • Knowledge Discovery in Databases
  • The nontrivial process of identifying valid,
    novel, potentially useful, and ultimately
    understandable patterns in data (Fayyad,
    Piatetsky-Shapiro, and Smyth, 1996, p. 30)
  • Also known as knowledge extraction, information
    harvesting, data archeology, and information
    extraction (p. 28)

  • Information Retrieval
  • The methods and processes for searching relevant
    information out of information systems that
    contain extremely large numbers of documents
    (Rocha, 2001, 1.1)
  • The ultimate goal of IR is to produce or
    recommend relevant information to users (1.2)
  • Traditional IR does not identify users and
    classifies subjects only with unchanging keywords
    and categories (1.2)

  • Institutions that use KDD/IR systems
  • Require knowledge-based decisions
  • Have a large quantity of accessible, relevant,
    historical and current data
  • Have a high payoff for correct decisions
  • Financial: banking, investment
  • Medical: healthcare, insurance
  • Sales: marketing, customer relations
  • (Piatetsky-Shapiro, 1998, Slides 28-31)

  • Database Management Systems
  • File Systems
  • Relational Database Management Systems (RDBMS)
  • Object-Oriented Database Management Systems
  • Object-Relational Database Management Systems
  • (Devarakonda, 2001, ORDBMS)

  • Relational Database Management Systems (RDBMS)
  • Relational databases are composed of many
    relations in the form of two-dimensional tables
    of rows and columns
  • RDBMS advantages include the SQL standard
    (enables migration between database systems),
    rapid data access and large storage capacity
  • RDBMS disadvantages include an inability to
    handle complex data types and relationships
  • (Devarakonda, 2001, RDBMS)
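
The relational model described above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and rows are hypothetical and are not taken from the presentation; the point is only that data lives in two-dimensional tables and that standard SQL joins and aggregates them.

```python
import sqlite3

# In-memory database; the schema and the rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE purchase ("
    "customer_id INTEGER REFERENCES customer(id), amount REAL)"
)
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?)", [(1, 20.0), (1, 5.5), (2, 12.0)]
)


def total_spent_per_customer(connection):
    """Join the two relations and sum the purchase amounts per customer."""
    rows = connection.execute(
        "SELECT c.name, SUM(p.amount) FROM customer AS c "
        "JOIN purchase AS p ON p.customer_id = c.id "
        "GROUP BY c.id, c.name ORDER BY c.name"
    )
    return rows.fetchall()


totals = total_spent_per_customer(conn)
```

Because the query is plain SQL, the same statement should run largely unchanged on other relational systems, which is the migration advantage the slide refers to.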

  • Object-Oriented Database Management Systems
  • OODBMS use abstract data types (ADTs) in which
    the internal data structure is hidden
  • OODBMS data is managed through two sets of
    relations, one describing the interrelations of
    data items and another describing the abstract
  • OODBMS handle complex data relationships, but
    suffer from poor performance and problems of
  • (Devarakonda, 2001, OODBMS)

  • Object-Relational Database Management Systems
  • ORDBMS store all database information in tables,
    but some entries have richer data structures,
    which are also called abstract data types (ADTs)
  • ORDBMS exhibit features of both the relational
    and object models such as scalability and support
    for rich data types
  • Their main advantage is massive scalability
  • (Devarakonda, 2001, ORDBMS)

  • The KDD Process
  • Collecting and pre-processing data
  • The problem of continually increasing volumes of
    data
  • The problem of increasingly complex forms of data
  • Identifying and extracting useful knowledge from
    large data repositories
  • What knowledge is in the data set?
  • What can be observed about the data set?
  • Presenting the knowledge in usable forms
  • (Fayyad et al., 1996)

  • The KDD Process (continued)
  • Data management problems in data collection,
    storage, and retrieval
  • Translation, change detection, integration,
    duplication, summarization, aggregation,
    timeliness/datedness (Widom, 1995)
  • The impracticality of manual analysis
  • Billions of records and hundreds of fields
  • Increasing desire for on-the-fly analysis and
    more flexible presentation (Fayyad et al., p. 28)

  • The KDD Process (continued)
  • A need to automate the knowledge discovery and
    extraction processes
  • Data selection and pre-processing
  • Data transformation and mining
  • Interpretation and evaluation (p. 28)
  • Automation requires attention to
  • Data collection, storage, and retrieval
  • Statistical foundations of search and retrieval
    processes (p. 29)

  • Stages in the KDD process
  • Learning the application domain
  • Creating a target data set
  • Data cleaning and preprocessing
  • Data reduction and projection
  • Choosing the function of data mining
  • Choosing the data mining algorithm
  • Data mining
  • Interpretation
  • Using discovered knowledge (pp. 30-31)
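
The stages above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the records, field names, and the frequency-count "mining" step are hypothetical stand-ins, not the method of Fayyad et al.

```python
from collections import Counter

# Hypothetical raw transactional records (all values arrive as strings).
RAW_RECORDS = [
    {"region": "north", "product": "loan", "amount": "1200"},
    {"region": "north", "product": "loan", "amount": ""},     # missing value
    {"region": "south", "product": "card", "amount": "300"},
    {"region": "north", "product": "card", "amount": "450"},
    {"region": "south", "product": "loan", "amount": "n/a"},  # unparseable
    {"region": "north", "product": "loan", "amount": "900"},
]


def select_target(records, region):
    """Creating a target data set: keep one region only."""
    return [r for r in records if r["region"] == region]


def clean(records):
    """Data cleaning: drop records whose amount cannot be parsed."""
    cleaned = []
    for record in records:
        try:
            cleaned.append({**record, "amount": float(record["amount"])})
        except ValueError:
            continue
    return cleaned


def project(records, field):
    """Data reduction and projection: keep a single field."""
    return [r[field] for r in records]


def mine(values):
    """Data mining (here just a frequency count of the values)."""
    return Counter(values)


def interpret(counts):
    """Interpretation: report the most common value and its share."""
    value, count = counts.most_common(1)[0]
    return value, count / sum(counts.values())


target = select_target(RAW_RECORDS, "north")
top_product, share = interpret(mine(project(clean(target), "product")))
```

Each function corresponds to one stage, so a stage can be swapped out (for example, a different mining algorithm) without touching the others.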

  • Data mining
  • The application of specific algorithms to a data
    set for the purpose of extracting data patterns
    (p. 28)
  • Fitting models to or determining patterns from
    observed data (p. 31)
  • Data warehousing
  • Collecting and cleaning transactional data to
    make it available for online analysis and
    decision support (p. 30)

  • Data mining tasks
  • Classification: predicting an item class
  • Forecasting: predicting a parameter value
  • Clustering: finding groups of items
  • Description: describing a group
  • Deviation detection: finding changes
  • Link analysis: finding relationships and
  • Visualization: presenting data visually to
    facilitate human discovery (Piatetsky-Shapiro,
    1998, Slide 17)
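
Two of the tasks listed above, classification and deviation detection, can be sketched in a few lines. The nearest-neighbour rule and the standard-deviation threshold are assumptions chosen for brevity, and the data points are hypothetical; the presentation does not prescribe particular algorithms.

```python
import math
import statistics


def classify_nearest(point, labeled_points):
    """Classification: predict the class of the closest labeled example."""
    nearest = min(labeled_points, key=lambda item: math.dist(point, item[0]))
    return nearest[1]


def find_deviations(values, threshold=2.0):
    """Deviation detection: flag values more than `threshold` population
    standard deviations away from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical training examples: (features, class label).
training = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((5.0, 5.1), "high"),
    ((4.8, 5.3), "high"),
]
predicted = classify_nearest((4.5, 4.9), training)
outliers = find_deviations([10, 11, 9, 10, 12, 10, 11, 50])
```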

  • Components of data mining systems
  • Model functions: classification, regression,
    clustering, etc. (pp. 31-32)
  • Model representation: decision trees and rules,
    linear models, non-linear models, example-based
    methods, etc. (p. 32)
  • Preference criterion: quantitative criterion
    embedded in the search algorithm; implicit
    criterion embedded in the KDD process
  • Search algorithms: parameter search (given a
    model) or model search over model space
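
The distinction between parameter search and model search can be made concrete with a small sketch. It assumes a model space of just two representations (a constant and a straight line), closed-form least squares as the parameter search, and squared error plus a per-parameter penalty as the preference criterion. All of these choices, and the data, are hypothetical.

```python
def fit_constant(xs, ys):
    """Parameter search for the constant model: the mean minimizes squared error."""
    mean_y = sum(ys) / len(ys)
    return (mean_y,), lambda x: mean_y


def fit_line(xs, ys):
    """Parameter search for the linear model: ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return (intercept, slope), lambda x: intercept + slope * x


def preference(predict, params, xs, ys, penalty=1.0):
    """Preference criterion: squared error plus a penalty per parameter."""
    sse = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    return sse + penalty * len(params)


def model_search(xs, ys, candidates):
    """Model search: fit every candidate model and keep the best score."""
    scored = []
    for name, fit in candidates.items():
        params, predict = fit(xs, ys)
        scored.append((preference(predict, params, xs, ys), name, params))
    return min(scored)


xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]
score, best_name, best_params = model_search(
    xs, ys, {"constant": fit_constant, "line": fit_line}
)
```

The penalty term keeps the search from always preferring the model with more parameters; a larger penalty favours simpler models.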

  • There is NO universal search algorithm
  • Each type of search suits specific types of
    search problems
  • The searcher must be careful to properly
    formulate the question
  • The searcher must understand the search goal (p.
  • Every search can be improved by an increase in
    data or query context

  • Creating context for KDD and IR
  • Extending IR throughout the social network of an
    organization, e.g., Answer Garden (Ackerman, 1994;
    Ackerman and McDonald, 1996)
  • Providing social context for data exchange, e.g.,
    PeopleGarden (Xiong and Donath, 1999)
  • Relational database reverse engineering
    extracts a conceptual model from an existing
    relational database by analyzing data instances
    as well as metadata (Lee and Hwang, 2002,

  • KD/IR problems for Web resources
  • Collecting and pre-processing data
  • Even more continually changing data
  • Complex data: streaming multimedia
  • The problem of identifying and extracting useful
    knowledge from Web resources
  • No consistent data models; no context
  • A lack of descriptive information
  • Presenting the knowledge in usable forms
  • More and more wireless devices and
    time-sensitive, multi-media applications

  • Current methods for Web KD/IR
  • Collecting and pre-processing data
  • Web crawlers and link-based ranking
  • Human indexing and categorization
  • Identifying and extracting useful knowledge from
    Web resources
  • Keyword search on natural language text
  • Topical directories or topical Web sites
  • Presenting the knowledge in usable forms
  • Content presented in native format (plugins) or
    in HTML
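
Keyword search on natural language text, as listed above, is commonly built on an inverted index. The following is a minimal sketch under that assumption; the page identifiers and texts are hypothetical, and real engines add ranking, stemming, and link analysis on top.

```python
import re
from collections import defaultdict

# Hypothetical page texts keyed by a page identifier.
PAGES = {
    "page_a": "Data mining extracts patterns from large data sets",
    "page_b": "Information retrieval finds relevant documents",
    "page_c": "Mining the Web for relevant patterns",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every term to the set of pages that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return index


def keyword_search(index, query):
    """Return the pages that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


index = build_index(PAGES)
both_terms = keyword_search(index, "relevant patterns")
one_term = keyword_search(index, "Mining")
```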

  • Automating KD/IR for the Web
  • Semantic markup to enable machine
    understanding/processing (RDF/S, DAML+OIL) and
    inference analysis
  • Intelligent search engines and agents to exploit
    semantic statements
  • Ontologies to provide context (a data model) for
    agents (Shah et al.)
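
How semantic statements plus an ontology enable inference can be sketched with plain Python tuples standing in for RDF triples. The `ex:` resources are hypothetical, and only one RDFS-style rule is applied (an instance of a class is also an instance of its superclasses); this is not a full RDF or DAML+OIL reasoner.

```python
# (subject, predicate, object) statements; the ex: names are hypothetical.
TRIPLES = {
    ("ex:Report42", "rdf:type", "ex:TechnicalReport"),
    ("ex:TechnicalReport", "rdfs:subClassOf", "ex:Report"),
    ("ex:Report", "rdfs:subClassOf", "ex:Document"),
}


def infer_types(triples):
    """Add rdf:type statements implied by rdfs:subClassOf until nothing new
    can be derived."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [
            (s, o) for s, p, o in inferred if p == "rdfs:subClassOf"
        ]
        for s, p, o in list(inferred):
            if p != "rdf:type":
                continue
            for child, parent in subclass_pairs:
                new_triple = (s, "rdf:type", parent)
                if o == child and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


def instances_of(triples, cls):
    """List every subject stated (or inferred) to be an instance of cls."""
    return sorted(s for s, p, o in triples if p == "rdf:type" and o == cls)


stated = instances_of(TRIPLES, "ex:Document")
derived = instances_of(infer_types(TRIPLES), "ex:Document")
```

A keyword search for "Document" would miss the report, while an agent that uses the ontology finds it through the class hierarchy.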

  • Automating KD/IR for the Web (continued)
  • Automated data collection, automated context
    collection (data pre-processing)
  • Value-added services (query routing)
  • Integrated query systems/knowledge delivery
    systems (accessibility)
  • Social accounting metrics to provide context for
    humans (Smith, 2002, p. 52)

  • Enhanced presentation for the Web
  • Reformatting for presentation
  • Differentiated service
  • Variable visualization
  • Adaptive graphics, a unifying framework that
    allows visual representations of information to
    be customized and mixed together into new ones
    (Boier-Martin, 2003, pp. 6-9)
  • Previewing interactive content
  • Selective presentation: customized views

  • KDD and IR for pervasive computing
  • Achieving ubiquitous data access (Cherniack,
    Franklin, and Zdonik, 2001, slide 7)
  • Data management problems
  • Dissemination (context dependent pull/push)
  • Synchronization (multiple collectors/devices)
  • Recharging (renewing) multiple data streams
  • Profile-driven data management

  • KDD and IR for pervasive computing (continued)
  • Achieving ubiquitous data access (Cherniack,
    Franklin, and Zdonik, 2001, slide 7)
  • Location aware, mobile devices
  • Service discovery for mobile services
  • Distributed sensors/collectors (slides 8-27)

  • Next-generation KDD/IR will:
  • Focus on solving business problems, not data
    analysis problems
  • Embed knowledge discovery engines
  • Integrate access to enterprise and external data
    on the back-end
  • Integrate knowledge discovery process with
    knowledge delivery tools (Piatetsky-Shapiro,
    1998, Slide 7)

  • Next-generation KDD/IR will:
  • Manage information retrieval contextually
  • Allow contextual query/continuous query
  • Synchronize multiple data flows from disparate
    sensors/input devices
  • Enable KD in virtual networks of peer-to-peer
    databases (data clusters or cubes)
  • Interpolate or extrapolate for missing data
  • (Cherniack et al., 2001, slides 115-138)
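
Interpolating for missing data, one of the goals listed above, can be sketched for a single sensor stream. The example assumes evenly spaced readings with gaps marked as None and fills them linearly between known neighbours; it does not extrapolate, so leading or trailing gaps stay empty. The readings are hypothetical.

```python
def interpolate_missing(readings):
    """Fill None gaps that lie between two known readings by linear
    interpolation; gaps at either end are left as None."""
    filled = list(readings)
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical sensor stream with two interior gaps and one trailing gap.
stream = [10.0, None, None, 16.0, None]
repaired = interpolate_missing(stream)
```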

  • Next-generation KDD/IR will:
  • Recognize individual users
  • Characterize information resources
  • Provide a way to exchange knowledge between users
    and information resources (push and pull of
  • Adapt to the user community and enable the reuse
    and recombination of information as well as its
  • (Rocha, 2001, 1.2)

  • KDD research problems
  • Massive data sets and high dimensionality
  • User interaction and prior knowledge
  • Determining statistical significance
  • Missing data
  • Understandability of patterns
  • Management of changing data and knowledge
  • Data integration
  • Non-standard, multimedia, object-oriented data
    (Fayyad, Piatetsky-Shapiro, Smyth, 1996, pp.

  • Top Ten IR research issues
  • Integrated solutions
  • Distributed IR
  • Efficient, flexible indexing and retrieval
  • "Magic" (automatic query expansion)
  • Interfaces and browsing
  • Routing and filtering
  • Effective retrieval
  • Multimedia retrieval
  • Information extraction
  • Relevance feedback (Croft, 1995)

  • Total Information Awareness - DARPA on the
    bleeding edge...
  • New database technologies
  • Database architectures
  • Database population
  • New search algorithms and data models
  • Genisys
  • Goal is to produce technology enabling
    ultra-large, all-source information repositories
  • http//

  • Social Issues
  • Communicating context
  • Creating trust/social value
  • Inciting cooperation/collaboration
  • Privacy tradeoffs: convenience/service or

  • Ackerman, M. S. (1998, July). Augmenting the
    organizational memory: A field study of Answer
    Garden. ACM Transactions on Information Systems,
    16(3), 203-204. Retrieved March 28, 2003 from
  • Ackerman, M. S., & Malone, T. W. (1990, April).
    Answer Garden: A tool for growing organizational
    memory. ACM SIGOIS Bulletin, 11(2-3), 31-39.
    Retrieved March 28, 2003 from http//
  • Ackerman, M. S., & McDonald, D. W. (1996).
    Proceedings of the ACM Conference on
    Computer-Supported Cooperative Work 1996 (CSCW96
    Boston, MA). Retrieved March 28, 2003 from
  • Boier-Martin, I. M. (2003, January/February).
    Adaptive graphics. In T. Rhyne (Ed.),
    Visualization Viewpoints, IEEE Computer Graphics
    and Applications, 23(1), 6-10. Retrieved April 5,
    2003 from http//

  • Chakrabarti, S., Srivastava, S., Subramanyam, M.,
    & Tiwari, M. (2000). Using Memex to archive and
    mine community Web browsing experience. A paper
    presented at the 9th International World Wide Web
    Conference, Amsterdam, May 15-19, 2000. Retrieved
    April 12, 2003 from http//
  • Croft, W. B. (1995, November). What do people
    want from information retrieval? The top 10
    research issues for companies that use and sell
    IR systems. D-Lib Magazine. Retrieved April 5,
    2003 from http//
  • DARPA Information Awareness Office. (2003a).
    Genisys. Retrieved from the DARPA Information
    Awareness Office Web site at http//
  • DARPA Information Awareness Office. (2003b).
    Total Information Awareness System. Retrieved
    from the DARPA Information Awareness Office Web
    site at http//

  • Devarakonda, R. (2001, March). Object-Relational
    database systems - The road ahead. ACM Crossroads
    Student Magazine. Retrieved April 12, 2003 from
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P.
    (1996, November). The KDD process for extracting
    useful knowledge from volumes of data.
    Communications of the ACM, 39(11), 27-34.
    Retrieved March 03, 2003 from http//wwwhome.cs.ut
  • Lee, D., & Hwang, Y. (2002, March 1). Extracting
    semantic metadata and its visualization. ACM
    Crossroads Student Magazine. Retrieved March 27,
    2003 from
  • Piatetsky-Shapiro, G. (1998, December 4). Data
    mining and knowledge discovery tools: The next
    generation. Retrieved February 27, 2003 from http//

  • Rauber, A., Aschenbrenner, A., Witvoet, O.,
    Bruckner, R. M., & Kaiser, M. (2002, December).
    Uncovering information hidden in Web archives: A
    glimpse at Web analysis building on data
    warehouses. D-Lib Magazine, 8(12). Retrieved
    March 28, 2003 from http//
  • Rocha, L. M. (2001). TalkMine: A soft computing
    approach to adaptive knowledge recommendation
    [Electronic version]. In V. Loia & S. Sessa
    (Eds.), Studies in fuzziness and soft computing:
    Vol. 75. Soft computing agents: New trends for
    designing autonomous systems (pp. 89-116). New
    York: Springer. Retrieved March 28, 2003 from
  • Shah, U., Finin, T., Joshi, A., Cost, R. S., &
    Mayfield, J. (2002, November). Information
    retrieval on the Semantic Web. Paper presented at
    the ACM Conference on Information and Knowledge
    Management, November 2002. Retrieved March 28,
    2003 from http//

  • Smith, M. (2002). Tools for navigating large
    social cyberspaces. Communications of the ACM,
    45(4), 51-55. Retrieved March 28, 2003 from
  • Whitted, T. (1999, July/August). Draw on the
    Wall. IEEE Computer Graphics and Applications,
    19(4), 6-9. Retrieved April 8, 2003 from http//
  • Widom, J. (1995, November). Research problems in
    data warehousing. Proceedings of the 4th
    International Conference on Information and
    Knowledge Management (CIKM). Retrieved March 28,
    2003 from http//

  • Xiong, R., & Donath, J. (1999). PeopleGarden:
    Creating data portraits for users. CHI Letters,
    1(1), 37-44. Retrieved April 8, 2003 from