Christoph F. Eick - PowerPoint PPT Presentation

About This Presentation
Title:

Christoph F. Eick

Description:

Data Mining for the Health Sciences, Houston, Feb. 9, 2000. Christoph F. Eick ... http://ksl-web.stanford.edu/Reusable-ontol/P001.html (Richard Fikes' (Stanford ... – PowerPoint PPT presentation

Slides: 30
Provided by: eick
Learn more at: https://www2.cs.uh.edu

Transcript and Presenter's Notes



1
Data Mining for the Health Sciences
  • Christoph F. Eick
  • www.cs.uh.edu/ceick/
  • ceick_at_cs.uh.edu
  • University of Houston
  • Organization
  • 1. Health Care and Computer Science
  • 2. Promising Technologies
  • 2.1 KDD / Data Mining
  • 2.2 Agent-based Systems
  • 2.3 Shared Ontologies and Knowledge
    Brokering
  • 3. Model Generation as an Example
  • 4. Summary and Conclusion

2
1. Health Care and Computer Science
  • Not too long ago (e.g. 1989)
  • Offline data / missing data / handwritten
    reports
  • Computers that cannot talk to each other
  • Lack of standardization (Tower of Babel, too many
    languages)
  • Today: faster computers, cheaper computers,
    better computer networks, electronic scanners,
    better connectivity, the internet, ...
  • We have a lot of computerized knowledge on almost
    every aspect of human health (a well of knowledge)
  • We have much more computing power to conduct
    complex data analysis tasks
  • New Problems
  • How can we find anything?
  • How do we gather information that is distributed
    over various computer systems and represented
    using different formats?
  • If we find something, how do we know that it is
    complete?
  • How can this large amount of information be
    analyzed?
  • What information can we trust?

3
2. Promising Newer Technologies to Cope with the Information Flood
  • Knowledge Discovery and Data Mining (KDD)
  • Agent-based Technologies
  • Ontologies and Knowledge Brokering
  • Non-traditional data analysis techniques

Model Generation as an Example to Explain / Discuss Technologies
4
Knowledge Discovery in Data and Data Mining (KDD)
Let us find something interesting!
  • Definition: KDD is the non-trivial process of
    identifying valid, novel, potentially useful, and
    ultimately understandable patterns in data
    (Fayyad)
  • Frequently, the term data mining is used to refer
    to KDD.
  • Many commercial and experimental tools and tool
    suites are available (see
    http://www.kdnuggets.com/siftware.html)
  • The field is dominated more by industry than by
    research institutions

5
General KDD Steps
Figure (flow diagram):
Data sources
  -> Select/preprocess -> Selected/preprocessed data
  -> Transform -> Transformed data
  -> Data mine -> Extracted information
  -> Interpret/Evaluate/Assimilate -> Knowledge
Additional figure label: Data preparation
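The four steps named on this slide can be sketched as a chain of small functions. This is only an illustration: the function names, the record fields (`age`, `diagnosis`), the age grouping, and the toy records are all invented here and do not come from the presentation.

```python
from collections import Counter

def select_preprocess(records):
    """Select/preprocess: keep only records that have every field we need."""
    return [r for r in records if r.get("age") is not None and r.get("diagnosis")]

def transform(records):
    """Transform: map raw ages onto coarse groups so patterns can be counted."""
    return [("under_50" if r["age"] < 50 else "50_plus", r["diagnosis"])
            for r in records]

def data_mine(pairs):
    """Data mine: count how often each (age group, diagnosis) pair occurs."""
    return Counter(pairs)

def interpret(counts, min_support=2):
    """Interpret/evaluate: report only pairs seen at least min_support times."""
    return sorted(pair for pair, n in counts.items() if n >= min_support)

def kdd_pipeline(records):
    return interpret(data_mine(transform(select_preprocess(records))))

# Toy records, invented for illustration only.
records = [
    {"age": 62, "diagnosis": "flu"},
    {"age": 71, "diagnosis": "flu"},
    {"age": 35, "diagnosis": "cold"},
    {"age": None, "diagnosis": "flu"},  # missing age: dropped in preprocessing
    {"age": 44, "diagnosis": "cold"},
    {"age": 58, "diagnosis": "cold"},
]
patterns = kdd_pipeline(records)
print(patterns)
```

A real KDD process would use far richer mining and evaluation steps; the point of the sketch is only the order of the stages and the data handed from one to the next.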
6
KDD and Classical Data Analysis
  • KDD is less focused than data analysis in that it
    looks for interesting patterns in data; classical
    data analysis centers on analyzing particular
    relationships in data. The notion of
    interestingness is a key concept in KDD.
    Classical data analysis centers more on
    generating and testing pre-structured hypotheses
    with respect to a given sample set.
  • KDD is more centered on analyzing large volumes
    of data (many fields, many tuples, many tables,
    ...).
  • In a nutshell, the KDD process consists of
    preprocessing (generating a target data set),
    data mining (finding something interesting in the
    data set), and post-processing (representing the
    found patterns in understandable form and
    evaluating their usefulness in a particular
    domain); classical data analysis is less
    concerned with the preprocessing step.
  • KDD involves collaboration between multiple
    disciplines, namely statistics, AI,
    visualization, and databases.
  • KDD employs non-traditional data analysis
    techniques (neural networks, association rules,
    decision trees, fuzzy logic, evolutionary
    computing, ...).

7
3. Generating Models as an Example
  • The goal of model generation (sometimes also
    called predictive data mining) is the creation,
    evaluation, and use of models to make predictions
    and to understand the relationships between
    various variables that are described in a data
    collection. Typical example applications include:
  • generate a model that predicts a student's
    academic performance based on the applicant's data,
    such as the applicant's past grades, test scores,
    past degree, ...
  • generate a model that predicts (based on economic
    data) which stocks to sell, hold, and buy.
  • generate a model to predict if a patient suffers
    from a particular disease based on the patient's
    medical and other data.
  • Model generation centers on deriving a function
    that can predict a variable using the values of
    other variables: v = f(a1, ..., an)
  • Neural networks, decision trees, naïve Bayesian
    classifiers and networks, regression analysis and
    many other statistical techniques, fuzzy logic
    and neuro-fuzzy systems, and association rules are
    the most popular model generation tools in the
    KDD area.
  • All model generation tools and environments
    employ the basic train-evaluate-predict cycle.
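A minimal sketch of the train-evaluate-predict cycle, using a 1-nearest-neighbor model as the function v = f(a1, ..., an). The choice of technique, the function names, and the toy data are assumptions made for illustration; the slides do not prescribe any of them.

```python
import math

def train(examples):
    """Training a 1-nearest-neighbor model just stores the labeled examples."""
    return list(examples)

def predict(model, attributes):
    """Return the label v of the stored example closest to (a1, ..., an)."""
    nearest = min(model, key=lambda ex: math.dist(ex[0], attributes))
    return nearest[1]

def evaluate(model, test_examples):
    """Fraction of held-out examples whose label is predicted correctly."""
    hits = sum(predict(model, a) == v for a, v in test_examples)
    return hits / len(test_examples)

# Toy data, invented for illustration: (attributes, label).
training_set = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"),
                ((8.0, 9.0), "high"), ((9.0, 8.5), "high")]
test_set = [((2.0, 1.0), "low"), ((8.5, 9.5), "high")]

model = train(training_set)              # train
accuracy = evaluate(model, test_set)     # evaluate on unseen examples
prediction = predict(model, (7.0, 7.0))  # predict for a new case
print(accuracy, prediction)
```

Keeping the test set separate from the training set is what makes the evaluation step meaningful: accuracy measured on the training examples themselves would always be perfect for this model.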

8
Why Do We Need So Many Data Mining / Analysis Techniques?
  • No generally good technique exists.
  • Different methods make different assumptions with
    respect to the data set to be analyzed (to be
    discussed on the next transparency)
  • Cross-fertilization between different methods is
    desirable and frequently helpful in obtaining a
    deeper understanding of the analyzed dataset.

9
Example: Decision Tree Approach
10
Decision Tree Approach (2)
11
Example: Nearest Neighbor Approach
12
Characteristics and Assumptions of Popular Data Mining/Analysis Techniques
  • Distance-based approaches (assume that a distance
    function with respect to the objects in the
    dataset exists) vs. order-based approaches (just
    use the ordering of values in their decision
    making: 3 > 2 > 1 is indistinguishable from
    2.01 > 2 > 1.99)
  • Approaches that make no assumptions vs. approaches
    that assume a particular distribution of the data
    in the underlying dataset.
  • Differences in employed approximation techniques
  • Rectangular vs. other approximation
  • Linear vs. non-linear approximations
  • Sensitivity to redundant attributes (variables)
  • Sensitivity to irrelevant attributes
  • Sensitivity to attributes of different degrees of
    importance
  • Different Training Performance / Testing
    Performance
  • What does the learnt function tell us about the
    analyzed data set? How difficult is it to
    understand the learnt function?
  • Deterministic / non-deterministic approaches
  • Stability of the obtained results
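The first bullet's contrast between distance-based and order-based approaches can be illustrated with the value sequences 3, 2, 1 and 2.01, 2, 1.99. In this sketch (helper names `ranks` and `gaps` are invented here), an order-based method sees only the ranks, while a distance-based method sees the gaps between values.

```python
def ranks(values):
    """Rank of each value (0 = smallest); only the ordering matters."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result

def gaps(values):
    """Absolute differences between neighboring values: what a distance sees."""
    return [abs(b - a) for a, b in zip(values, values[1:])]

a = [3, 2, 1]
b = [2.01, 2, 1.99]

same_order = ranks(a) == ranks(b)  # order-based view: identical
same_gaps = gaps(a) == gaps(b)     # distance-based view: very different
print(same_order, same_gaps)
```

This is why order-based techniques are unaffected by rescaling an attribute, while distance-based techniques can change their answers when units or scales change.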

13
Players in the Model Generation Society
Data Analysts
Data Collection Providers
Tool Builders
End Users (Managers, Doctors, Decision Makers, Gamblers, ...)
14
(More) Problems of Model Generation
  • It is difficult to find appropriate data
    collections.
  • Sharing of models is not supported.
  • Model generation is mostly performed in a
    centralized environment, not taking advantage of
    distributed computing technology.
  • The degree of tool standardization is low, which
    makes it more difficult to use different tools for
    the same data analysis problems.
  • Evaluation of claims with respect to the
    performance of models is very difficult. Problem:
    the model itself, as well as the tools and data
    collections that were used to generate the model,
    are not accessible online.

15
Key Ideas: Agent-based Technologies
  • Agents operate independently and anticipate user
    needs (P. Maes)
  • Agents help users suffering from information
    overload (O. Etzioni) rather than mimicking human
    intelligence
  • Agents are important because they allow users to
    interoperate with modern applications such as
    electronic commerce and information retrieval.
    Most of these applications assume that components
    are added dynamically and that they will be
    autonomous (serve different users and providers
    to fulfill different goals) and heterogeneous. (M.
    Singh)
  • Essentially, agent-based architectures are
    characterized by three key features: autonomy,
    adaptation, and cooperation. Agent-based systems
    are computational systems in which several agents
    interact for their own good and for the good of
    the overall system.
  • In an agent-based architecture, services are
    provided in the context of a community of loosely
    coupled agents of various types in a distributed
    environment.
  • Agents are aware of their environment and
    capable of communicating with other agents that
    belong to the same agent community.

16
Simplified View of Agent-based Systems
  • End User Agents: agents that act on behalf of
    end users that look for services
  • Mediator Agents: agents that act as matchmakers
    between service providers and end users
  • Service Provider Agents: agents that act on
    behalf of service providers
  • Layers: Conversation Layer, Message Layer
17
Agent-based Model Generation
  • Model generation services are provided in the
    context of a community of loosely coupled agents
    of various types in a distributed environment.
  • Model generation tools are accessed using a
    unified interface.
  • Tool providers and data collection providers
    offer their services to data analysts and
    end users via the internet. New forms of
    collaboration can easily be supported in this
    environment:
  • data analysts no longer run the tools in their
    own computing environment
  • brokering techniques can be used to find
    interesting data collections, suitable tools,
    useful models, and available ontologies.
  • tool developers offer tool services on the
    internet, charging a one-time tool-use fee.
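One way to read "accessed using a unified interface" is that every model generation tool exposes the same small set of operations, so a data analyst can swap tools without changing the calling code. The sketch below assumes a hypothetical two-method interface (`train`, `predict`) and two deliberately trivial tools; none of these names come from the presentation.

```python
from collections import Counter
from typing import Protocol, Sequence

class ModelGenerationTool(Protocol):
    """Hypothetical unified interface that every tool is assumed to offer."""
    def train(self, rows: Sequence[tuple], labels: Sequence[str]) -> None: ...
    def predict(self, row: tuple) -> str: ...

class MajorityClassTool:
    """Always predicts the most frequent training label."""
    def train(self, rows, labels):
        self._label = Counter(labels).most_common(1)[0][0]
    def predict(self, row):
        return self._label

class FirstAttributeThresholdTool:
    """Splits on the mean of the first attribute; one label per side."""
    def train(self, rows, labels):
        self._cut = sum(r[0] for r in rows) / len(rows)
        below = [l for r, l in zip(rows, labels) if r[0] <= self._cut]
        above = [l for r, l in zip(rows, labels) if r[0] > self._cut]
        self._below = Counter(below).most_common(1)[0][0]
        self._above = Counter(above).most_common(1)[0][0]
    def predict(self, row):
        return self._below if row[0] <= self._cut else self._above

def run_tool(tool: ModelGenerationTool, rows, labels, query):
    """The caller only relies on the shared interface, not on the tool type."""
    tool.train(rows, labels)
    return tool.predict(query)

# Toy data, invented for illustration only.
rows = [(1,), (2,), (8,), (9,), (10,)]
labels = ["a", "a", "b", "b", "b"]
results = {type(t).__name__: run_tool(t, rows, labels, (1.5,))
           for t in (MajorityClassTool(), FirstAttributeThresholdTool())}
print(results)
```

In the agent-based setting described on this slide, the call to `run_tool` would go over the network to a tool provider's agent instead of running locally; the sketch only shows the interface idea.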

18
Model Generation Agent Communities
Figure: Agent-based Model Generation Community.
  • Participants: Data Collection Provider, Data
    Analyst, Tool Developer, End User
  • Brokers: Data Collection Broker, Model Broker,
    Tool Broker
  • Other components: Resource Generation Tool,
    Resource Agent, Model Generation Browser, Model
    Generation Tool, Tool Integration Tool
  • Resources: Data Collection, Model
19
Shared Ontologies
  • Ontologies are content theories about sorts of
    objects, properties of objects, and relationships
    between objects that are possible in a specified
    domain of knowledge (Chandrasekaran)
  • We consider ontologies to be domain theories
    that specify a domain-specific vocabulary of
    entities, classes, properties, predicates, and
    functions, and a set of relationships that
    necessarily hold among those vocabulary items
    (Fikes)
  • Shared ontologies form the basis for
    domain-specific knowledge representation languages
    (Chandrasekaran)
  • If we could develop ontologies that could be
    used as the basis of multiple systems, they would
    share a common terminology that would facilitate
    sharing and reuse (W. Swartout)
  • Ontologies play an important role for the
    standardization of terminology in medicine (e.g.
    UMLS) and other domains
  • Ontologies can serve as the glue between
    knowledge that is represented at different,
    usually heterogeneous information sources.

20
Ontologies and Brokering
  • Service providers describe their capabilities in
    terms of a domain (or task) ontology
  • Agents that seek services describe their needs in
    terms of a domain (or task) ontology
  • Broker agents serve as matchmakers between
    service providers and service seekers by finding
    suitable agents and by evaluating the extent to
    which they can provide those services, relying on
    a semantic brokering approach.
  • Various languages have been advocated in
    recent years to specify ontologies: OKBC,
    CKML/OML, ONTOLINGUA, XML, UMLS, ...
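The matchmaking role of a broker agent can be sketched as follows, under strong simplifying assumptions: an ontology is reduced to a flat set of terms, needs and capabilities are subsets of those terms, and "the extent to which they can provide those services" is measured as the fraction of needed terms a provider covers. The function name `broker`, the provider names, and the terms are hypothetical.

```python
def broker(needs, providers):
    """Rank providers by the fraction of needed ontology terms they cover.

    Returns (coverage, provider name, missing terms) tuples, best first.
    Providers that cover none of the needed terms are left out.
    """
    matches = []
    for name, capabilities in providers.items():
        covered = needs & capabilities
        if covered:
            missing = sorted(needs - capabilities)
            matches.append((len(covered) / len(needs), name, missing))
    # Best coverage first; ties broken by provider name for a stable order.
    return sorted(matches, key=lambda m: (-m[0], m[1]))

# A service seeker describes its needs in terms of the shared ontology.
needs = {"Patient", "age", "weight"}

# Service providers describe their capabilities in the same terms.
providers = {
    "CollectionA": {"Patient", "age"},
    "CollectionB": {"Patient", "age", "weight", "Hours-in-intensive-care"},
    "CollectionC": {"Invoice"},
}

ranking = broker(needs, providers)
for coverage, name, missing in ranking:
    print(name, round(coverage, 2), missing)
```

A real semantic broker would reason over class hierarchies and constraints on attribute values and would not stop at set overlap; the sketch only shows why a shared vocabulary makes the matching possible at all.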

21
Promising Technologies to Use the Flood of Data for Providing Better Health Care
Figure: The Well of Knowledge, surrounded by the technologies:
  • Agent-based Systems
  • KDD
  • Software Development Environments
  • Knowledge Acquisition Tools
  • Visualization
  • Traditional Data Analysis Techniques
  • Database Technology
  • Ontologies
  • Knowledge Brokering
22
References
  • WWW-Links
  • http://ksl-web.stanford.edu/Reusable-ontol/P001.html
    (Richard Fikes (Stanford University): Slide
    Show on Reusable Ontologies)
  • http://www.kdnuggets.com/index.html (KDD Nuggets
    Directory: Data Mining and Knowledge Discovery
    Resources)
  • http://www.mcc.com/projects/infosleuth/
    (InfoSleuth (MCC) --- an Agent-based System for
    Information Gathering)
  • http://www.cs.cmu.edu/softagents/ (CMU
    Intelligent Software Agents Page)
  • http://www.cs.uh.edu/ceick/6368.html (Homepage
    UH Graduate AI-class)
  • Papers
  • Special Issue of IEEE Intelligent Systems on
    "Coming to Terms with Ontologies", Jan./Feb. 1999.
  • Special Issue of IEEE Intelligent Systems on
    "Unmasking Intelligent Agents", March/April 1999.
  • Special Issue of IEEE Computer on Data Mining,
    October 1999.

23
End of Presentation
Transparencies that follow are very likely not to be used
24
What is KDD?
  • Definition: KDD is the non-trivial process of
    identifying valid, novel, potentially useful, and
    ultimately understandable patterns in data
    (Fayyad)
  • The identified knowledge is used to:
  • make predictions
  • classify new examples
  • summarize the content of data collections and
    documents to facilitate understanding and decision
    making, and to support search and indexing
  • support graphical visualization to aid humans in
    discovering deeper patterns
  • Example applications:
  • learn to classify brain tissue from examples
  • predict a patient's life expectancy from the
    patient's medical history
  • summarize/cluster/mine clinical trial reports

25
What are Ontologies good for?
  • As a shared conceptual model of a particular
    application domain that describes the semantics
    of the objects that are part of the domain and
    captures knowledge that is inherent to the
    particular domain --- idea: knowledge base
  • As a vocabulary for representing
    knowledge about a domain and for describing
    specific situations in a domain (tool for
    defining and describing domain-specific
    vocabularies) --- idea: language for
    communication
  • For data/knowledge translation and transformation
    (provide a solution to the translation problem
    between different terminologies), and for fusion
    and refinement of existing knowledge --- idea:
    interoperation
  • For matchmaking between users, agents, and
    information resources in agent-based systems ---
    idea: collaboration, brokering (focus of
    next slides)
  • As reusable building blocks to build systems that
    solve particular problems in the application
    domain --- idea: model reuse
  • Summary: Ontologies can be used as building
    block components of knowledge bases, object
    schemas for object-oriented systems, conceptual
    schemas for databases, structured glossaries for
    human collaborations, vocabularies for
    communication between agents, class definitions
    for conventional software systems, etc. (Fikes)

26
Figure: two approaches compared.
  • A Traditional Approach: a Search Engine stands
    between End User Agents and Service Provider
    Agents; end users specify keywords with respect to
    the documents they are looking for. Figure labels:
    Clinical Trial Report, Abstract Clinical Trial
    Report, Summary.
  • Semantic Brokering Approach: Semantic Brokering
    (matchmaking) stands between End User Agents and
    Service Provider Agents; end users specify a
    subset of the ontology. Figure labels: Clinical
    Trial Report, Subset of an Ontology, Summary.
27
Why do agent-based systems show promise for health care?
  • Scalability
  • Tasks to be solved involve the collaboration
    between different groups
  • Well suited for the world-wide web
  • Health care is a dynamically changing environment
  • Establish standards (as a by-product)

28
Example: Semantic Brokering
Result of semantic brokering:
((DataCollection1 nil ((missing slot weight) (contradictory ( age 40))
 (DataCollection2 t) (DataCollection3 t (( age 60)( weight 300)))
Figure: the Data Analyst's Information Requirement and Data Collection1,
Data Collection2, and Data Collection3, each showing Patient,
Intensive-Care-Patient, and Hours-in-intensive-care.
Attribute labels in the figure: Age40, weight, Ageage, Age60, weight, Weight300.
29
Critical Problems with Respect to Shared Ontologies
  • Scientific communities have to agree on
    ontologies; otherwise, the whole approach is
    flawed.
  • Development of ontologies for a particular domain
    is a difficult task (see the Digital Anatomist
    project at UW, development of UMLS). The
    development of user-friendly and intelligent
    knowledge acquisition tools is very important for
    the successful development of shared ontologies.
  • The expressiveness of the languages that are used
    to define ontologies limits what can be done with
    domain ontologies.
  • Reasoning capabilities are important for systems
    that use shared ontologies (we need a language to
    specify ontologies and an inference engine that
    can reason with the given ontologies), e.g. for:
  • finding inconsistencies in knowledge bases and
    finding errors at data entry
  • semantic brokering
  • more intelligent mappings between terms
  • ...