Research Problems in Data Mining

1
Research Problems in Data Mining
  • Jiawei Han
  • Database Systems Research Lab
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign,
    U.S.A.
  • http://www.cs.uiuc.edu/hanj

2
Several Research Issues in Data Mining
  • Web mining and text mining
  • Biomedical/DNA data mining
  • On-line, real-time, stream data mining
  • Cube exploration: iceberg, cube-gradient, trends,
    etc.
  • Mining max/closed long and error-tolerant
    frequent and sequential patterns
  • Intrusion detection and anomaly mining
  • Invisible data mining

3
Challenges in Web Mining
  • Web: a huge, widely distributed, highly
    heterogeneous, semi-structured, interconnected,
    evolving hypertext/hypermedia information
    repository.
  • Problems:
  • the abundance problem
  • limited coverage of the Web (hidden Web sources)
  • limited query interface: keyword-oriented search
  • limited customization to individual users
  • DBMS, DBers, and data miners will play an
    increasingly important role in the new generation
    of the Internet

4
Web Mining: Lots Can Be Done!
  • A taxonomy of Web mining
  • Web content mining
  • Web usage mining
  • Some interesting problems on Web mining
  • Mining what Web search engine finds
  • Weblog mining (usage, access, and evolution)
  • Identification of authoritative Web pages
  • Web document classification
  • Warehousing a Meta-Web: Web yellow page service
  • Intelligent query answering in Web search

5
Mine What Web Search Engine Finds
  • Current Web search engines: a convenient source
    for mining
  • keyword-based, return too many answers, low
    quality answers, still missing a lot, not
    customized, etc.
  • Data mining will help:
  • coverage: enlarge and then shrink, using
    synonyms and conceptual hierarchies
  • better search primitives: user preferences/hints
  • linkage analysis: authoritative pages and
    clusters
  • Web-based languages: XML, WebSQL, WebML
  • customization: home page, Weblog, user profiles

6
Web Log Mining
  • Weblog provides rich information about Web
    dynamics
  • Multidimensional Weblog analysis
  • disclose potential customers, users, markets,
    etc.
  • Web accessing association/sequential pattern
    analysis
  • Web caching, prefetching, swapping
  • Web linkage adjustment
  • Trend analysis
  • Dynamics of the Web: what has been changing?
  • Customized to individual users
  • Need additional information in order to discover
    truly useful patterns

7
Discovery of Authoritative Pages in WWW
  • Page-rank method (Brin and Page, 1998)
  • Rank the "importance" of Web pages, based on a
    model of a "random browser."
  • Hub/authority method (Kleinberg, 1998)
  • Prominent authorities often do not endorse one
    another directly on the Web.
  • Hub pages have a large number of links to many
    relevant authorities.
  • Thus hubs and authorities exhibit a mutually
    reinforcing relationship
  • Both the page-rank and hub/authority
    methodologies have been shown to provide
    qualitatively good search results for broad query
    topics on the WWW, e.g., Google.
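
Both ranking ideas on this slide can be tried on a toy link graph. The sketch below is a minimal illustration, not the published implementations: the damping factor 0.85, the fixed iteration count, the even spreading of rank from pages without outlinks, and the four-page graph `toy_links` are all assumptions made for the example.

```python
def all_pages(links):
    """Collect every page that appears as a source or a target."""
    return sorted(set(links) | {q for outs in links.values() for q in outs})


def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for the 'random browser' model.

    links maps each page to the list of pages it points to.
    """
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small share from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


def hits(links, iterations=50):
    """Mutually reinforcing hub and authority scores (L2-normalised)."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores pointing at it.
        auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                auth[q] += hub[p]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # A page's hub score is the sum of the authorities it points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: two hub pages link to two authorities,
# and the authorities do not link to each other.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
ranks = pagerank(toy_links)
hub_scores, authority_scores = hits(toy_links)
print(max(ranks, key=ranks.get), max(authority_scores, key=authority_scores.get))
```

In this graph the authorities never endorse one another, yet `a1` still comes out on top under both methods because both hubs point to it.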

8
Web Document Classification
  • Automatic classification of Web pages vs. human
    classification (e.g., Yahoo)
  • Training set
  • Existing, typical good classification sites,
    e.g., Yahoo!, CS term hierarchies
  • Classification methods
  • Typical methods: Naïve Bayesian, decision trees,
    etc.
  • Keyword-based classification is different from
    multi-dimensional classification
  • Association- or clustering-based classification
    is often more effective
  • Multi-level classification is important
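
As a rough illustration of keyword-based classification from a hand-labelled training set, here is a small multinomial Naïve Bayes sketch with add-one smoothing. The category names, the four training "pages", and the function names are hypothetical; a real system would train on a directory such as the one the slide mentions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs is a list of (list_of_words, class_label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(words, model):
    """Return the class with the highest log-posterior for the given words."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for w in words:
            if w not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][w] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training pages, already reduced to keywords.
training_pages = [
    ("database query index transaction".split(), "databases"),
    ("sql query join index".split(), "databases"),
    ("neural network training gradient".split(), "machine_learning"),
    ("classifier training decision tree".split(), "machine_learning"),
]
model = train_naive_bayes(training_pages)
print(classify("query index join".split(), model))
```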

9
Warehousing a High-Level Web: An MLDB Approach
  • ML-Web: a structure which summarizes the
    contents, structure, linkage, and access of the
    Web and which evolves with the Web
  • Layer0: the Web itself
  • Layer1: the lowest layer of the ML-Web
  • An entry: a Web page summary, including class,
    time, URL, contents, keywords, popularity, rank,
    links, etc.
  • Layer2 and up: summary/classification/clustering
    in various ways and distributed for various
    applications
  • ML-Web can be warehoused and incrementally
    updated
  • Querying and mining can be performed on or
    assisted by ML-Web (an ML digital library
    catalogue, yellow page).

10
A Multiple Layered Meta-Web Architecture
[Figure: stacked layers. Layer0 at the bottom, Layer1
holding generalized descriptions, and so on up to
Layern holding more generalized descriptions.]

11
Construction of Multi-Layer Meta-Web
  • XML facilitates structured and meta-information
    extraction
  • Hidden Web: DB schema extraction, other meta
    info
  • Automatic classification of Web documents
  • based on Yahoo!, etc. as training set;
    keyword-based correlation/classification analysis
    (IR/AI assistance)
  • Automatic ranking of important Web pages
  • authoritative site recognition and clustering Web
    pages
  • Generalization-based multi-layer ML-Web
    construction
  • With the assistance of clustering and
    classification analysis

12
Use of Multi-Layer Meta-Web
  • Benefits of Multi-Layer Meta-Web
  • Multi-dimensional Web info summary analysis
  • Approximate and intelligent query answering
  • Web high-level query answering (WebSQL, WebML)
  • Web content and structure mining
  • Observing the dynamics/evolution of the Web
  • Is it realistic to construct such a meta-Web?
  • Benefits even if it is partially constructed
  • Benefits may justify the cost of tool
    development, standardization and partial
    restructuring

13
Intelligent Web Query Answering
  • What is intelligent query answering?
  • Smart alternative answers, summary information,
    etc.
  • Based on users' profiles or history
  • Web queries need a more intelligent query
    answering mechanism
  • How to develop it?
  • Data warehouse and Web Yellow Page service will
    help
  • Data mining will help too!

14
Biomedical Data Mining and DNA Analysis
  • DNA sequences: 4 basic building blocks
    (nucleotides): adenine (A), cytosine (C), guanine
    (G), and thymine (T).
  • Gene: a sequence of hundreds of individual
    nucleotides arranged in a particular order
  • Humans have around 50,000 genes
  • Tremendous number of ways that the nucleotides
    can be ordered and sequenced to form distinct
    genes
  • Semantic integration of heterogeneous,
    distributed genome databases
  • Current: highly distributed, uncontrolled
    generation and use of a wide variety of DNA data
  • Data cleaning and data integration methods
    developed in data mining will help

15
Discovery and Comparison of DNA Sequences
  • Finding tandem repeats
  • Fault-tolerant sequential patterns (Is BLAST
    enough?)
  • Similarity search and comparison among DNA
    sequences
  • Compare the frequently occurring patterns of each
    class (e.g., diseased and healthy)
  • Identify gene sequence patterns that play roles
    in various diseases
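
Two of the tasks above, finding tandem repeats and fault-tolerant pattern search, can be sketched in a few lines. This is a deliberately simple illustration: it handles exact tandem repeats of one fixed unit length, and it tolerates substitutions only (Hamming distance), not insertions or deletions. The function names and the example sequences are made up for the sketch.

```python
def find_tandem_repeats(sequence, unit_length, min_copies=2):
    """Return (start, unit, copies) for runs of back-to-back copies of a unit."""
    repeats = []
    i = 0
    while i + unit_length <= len(sequence):
        unit = sequence[i:i + unit_length]
        copies = 1
        j = i + unit_length
        # Extend the run while the next block equals the unit.
        while sequence[j:j + unit_length] == unit:
            copies += 1
            j += unit_length
        if copies >= min_copies:
            repeats.append((i, unit, copies))
            i = j  # continue after the run
        else:
            i += 1
    return repeats


def approximate_matches(sequence, pattern, max_mismatches=1):
    """Return (start, window, mismatches) where the pattern matches
    with at most max_mismatches substituted nucleotides."""
    m = len(pattern)
    found = []
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            found.append((start, window, mismatches))
    return found


print(find_tandem_repeats("GGACGACGACGTT", 3))   # [(1, 'GAC', 3)]
print([s for s, _, _ in approximate_matches("ACGTACCTAGGT", "ACGT", 1)])  # [0, 4, 8]
```

Running the same approximate search over sequences from each class (e.g., diseased and healthy) and comparing the match counts is one simple way to contrast the classes.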

16
Association and Path Analysis in Bio-Medical and
DNA Data Mining
  • Association analysis: identification of
    co-occurring gene sequences
  • Most diseases are not triggered by a single gene
    but by a combination of genes acting together
  • Association analysis may help determine the kinds
    of genes that are likely to co-occur in
    target samples
  • Path analysis: linking genes to different disease
    development stages
  • Different genes may become active at different
    stages of the disease
  • Develop pharmaceutical interventions that target
    the different stages separately
  • Visualization tools and genetic data analysis

17
Stream Data and Applications
  • Prevalence of Data Streams
  • Network flow analysis and management
  • Telephone call details and fraud detection
  • On-line sensor monitoring
  • Characteristics of Streams
  • Massive and continuous amounts of data
  • O(100 GB) per day
  • Storage constraints
  • small space (e.g., main memory or cache)
  • Online continuous queries
  • i.e., agg: stream -> stream

18
On-Line Mining of Stream Data
  • Single-pass aggregates
  • Basic, simple (exact) aggregates
  • trivial: inherently single-pass
  • Approximate fancy aggregates (e.g., median)
  • Discovering correlations, associations, models,
    cause-and-effect relationships between patterns
  • Discovering trends, clusters, changes, and
    outliers in data flow; visual mining will help as
    well
  • Data stream warehousing: saving regularities?
    constraint- or goal-directed stream mining?
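
The contrast between exact and approximate single-pass aggregates can be made concrete. In the sketch below, count, mean, variance, min, and max are maintained exactly in constant space (using Welford's update, a standard technique the slides do not name), while the median is only estimated from a fixed-size reservoir sample. The class names, the reservoir capacity, and the seed are assumptions for the example.

```python
import random


class RunningStats:
    """Exact aggregates in one pass and O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.count if self.count else 0.0


class ReservoirMedian:
    """Approximate median from a uniform sample of bounded size."""

    def __init__(self, capacity=500, seed=0):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self._rng = random.Random(seed)

    def update(self, x):
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            # Keep the new item with probability capacity / seen.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = x

    def estimate(self):
        ordered = sorted(self.sample)
        return ordered[len(ordered) // 2]


stats, median_sketch = RunningStats(), ReservoirMedian()
for value in range(1, 101):  # stand-in for an unbounded stream
    stats.update(value)
    median_sketch.update(value)
print(stats.count, stats.mean, stats.variance, median_sketch.estimate())
```

The exact aggregates never need to revisit old items, which is why the slide calls them inherently single-pass; the median has no such update rule, so the sketch trades accuracy for bounded memory.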

19
Multidimensional Data and Data Cubes
  • Sales volume as a function of product, month, and
    region

Dimensions: Product, Location, Time
Hierarchical summarization paths:
  Product:  Industry > Category > Product
  Location: Region > Country > City > Office
  Time:     Year > Quarter > Month or Week > Day
[Figure: data cube with axes Region, Product, and Month]

20
Mining and Explorative Analysis of Data Cubes
  • Efficient computation of data or iceberg cubes
  • Discovery-driven data cube analysis
  • Cube-gradient analysis
  • What are the changes of the average house value
    in Silicon Valley in 2001 compared with 2000?
  • Under what conditions did the average house value
    increase 10% per year in the Chicago area in the
    1990s?
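
A small sketch may help fix the ideas of an iceberg cube and a cube gradient. It computes every group-by naively, one scan per cuboid, and keeps only cells that meet a minimum row count; efficient iceberg algorithms avoid this brute-force enumeration. The house-value rows, the dimension names, and both function names are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def iceberg_cube(rows, dimensions, measure, min_count):
    """Aggregate the measure over every subset of the dimensions and
    keep only the cells backed by at least min_count rows."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cells = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]
            for row in rows:
                key = tuple(row[d] for d in dims)
                cells[key][0] += 1
                cells[key][1] += row[measure]
            for key, (count, total) in cells.items():
                if count >= min_count:  # the iceberg condition
                    cube[(dims, key)] = {"count": count, "avg": total / count}
    return cube


def gradient(cube, dims, cell_from, cell_to):
    """Relative change of the average measure between two cells."""
    return cube[(dims, cell_to)]["avg"] / cube[(dims, cell_from)]["avg"] - 1.0


# Hypothetical house values (in thousands).
house_rows = [
    {"region": "Chicago", "year": 2000, "type": "condo", "value": 200.0},
    {"region": "Chicago", "year": 2001, "type": "condo", "value": 220.0},
    {"region": "Chicago", "year": 2000, "type": "house", "value": 300.0},
    {"region": "Chicago", "year": 2001, "type": "house", "value": 330.0},
    {"region": "Urbana", "year": 2001, "type": "house", "value": 150.0},
]
cube = iceberg_cube(house_rows, ("region", "year", "type"), "value", min_count=2)
change = gradient(cube, ("region", "year"), ("Chicago", 2000), ("Chicago", 2001))
print(round(change, 2))  # 0.1
```

Cells with a single supporting row, such as the Urbana ones here, fall below the threshold and never enter the cube, which is what keeps an iceberg cube small.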

21
What More Can Be Done in Mining Data Cubes?
  • Trend analysis in data cubes?
  • What kind of companies have similar asset
    increase trends like Microsoft?
  • Cluster customers based on their similar shopping
    behavior with regard to the change of time
  • Model-based class comparison
  • What are the critical features that distinguish
    winners and losers?
  • Association and correlation analysis in data
    cubes
  • If a company's average profit is high, what other
    features will go with it?

22
Further Development of Frequent and Sequential
Pattern Analysis
  • Efficient frequent pattern mining methods
  • Association: Apriori ('94), FP-growth ('00)
  • Sequential pattern: GSP ('96), PrefixSpan ('01)
  • Mining max patterns, closed patterns,
    approximately closed, top-n frequent patterns
  • Error-tolerant frequent and sequential patterns
    (e.g., DNA sequences)
  • Constraint-based mining of frequent and
    sequential patterns
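
To make the terms concrete, here is a compact level-wise sketch in the spirit of Apriori, followed by a filter that separates closed and max patterns. It is a teaching sketch with absolute support counts, not an optimized miner, and it is not FP-growth. The transactions and the function names are made up.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            # Keep a size-k candidate only if every (k-1)-subset is frequent.
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


def closed_and_max(frequent):
    """Closed: no frequent proper superset has the same support.
    Max: no frequent proper superset exists at all."""
    closed, maximal = {}, {}
    for itemset, support in frequent.items():
        supersets = [s for s in frequent if itemset < s]
        if all(frequent[s] < support for s in supersets):
            closed[itemset] = support
        if not supersets:
            maximal[itemset] = support
    return closed, maximal


baskets = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
frequent_sets = apriori(baskets, min_support=2)
closed_sets, max_sets = closed_and_max(frequent_sets)
print(len(frequent_sets), len(closed_sets), len(max_sets))  # 7 3 1
```

Seven frequent itemsets shrink to three closed ones and a single max pattern, which is the compression that motivates mining closed and max patterns when patterns are long.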

23
Intrusion Detection and Anomaly Mining
  • Fighting against crimes and terrorists
  • Linking and mining dynamic and huge amounts of
    data
  • Sifting irregularities from regular ones: mining
    regularities as a base for comparison and finding
    outliers
  • Classification: normal vs. alarming classes and
    models
  • Clustering and outlier analysis
  • Human-instructed conditions and condition-guided
    classification, clustering, and outlier analysis
  • Information visualization and stream data analysis
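
One simple way to read "mining regularities as a base for comparison" is to summarize normal behavior with robust statistics and flag whatever lies far from it. The sketch below uses the median and the median absolute deviation (MAD) with the commonly used modified z-score cutoff of 3.5. The slides do not prescribe this method; the function name and the sample counts are assumptions.

```python
from statistics import median


def robust_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds the threshold.

    The baseline is the median, and the spread is the median absolute
    deviation, so a few extreme values do not distort the baseline.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    return [
        (i, v)
        for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Hypothetical login attempts per hour for one account.
attempts_per_hour = [10, 12, 11, 13, 12, 11, 95, 12]
print(robust_outliers(attempts_per_hour))  # [(6, 95)]
```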

24
Invisible Data Mining
  • Embed mining functions into information services
  • Web search engines (link analysis, authoritative
    pages, user profiles), adaptive Web sites, etc.
  • Improvement of query processing: history data
  • Making service smart and efficient
  • Benefits from/to data mining research
  • Data mining research has produced many scalable,
    efficient, novel mining solutions
  • Applications feed new challenge problems to
    research

25
Conclusions
  • Data mining: a young and promising discipline
  • A confluence of multiple disciplines: database,
    data warehouse, machine learning, statistics,
    high performance computing, Web technology, etc.
  • Great progress in the last decade
  • Lots of research issues, and a few identified
    here
  • Web mining and text mining
  • Biomedical/DNA data mining
  • On-line, real-time, stream data mining
  • Cube exploration: iceberg, cube-gradient, trends,
    etc.
  • Mining max/closed long and error-tolerant
    frequent and sequential patterns
  • Intrusion detection and anomaly mining
  • Invisible data mining

26
http://www.cs.uiuc.edu/hanj
  • Thank you !!!

27
Selected Publications (2001)
  • A. K. H. Tung, J. Hou, and J. Han, "Spatial
    Clustering in the Presence of Obstacles", Proc.
    2001 Int. Conf. on Data Engineering (ICDE'01),
    Heidelberg, Germany, April 2001.
  • J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and
    M.-C. Hsu, "PrefixSpan: Mining Sequential
    Patterns Efficiently by Prefix-Projected Pattern
    Growth", Proc. 2001 Int. Conf. on Data
    Engineering (ICDE'01), Heidelberg, Germany, April
    2001.
  • J. Pei, J. Han, and L. V. S. Lakshmanan, "Mining
    Frequent Itemsets with Convertible Constraints",
    Proc. 2001 Int. Conf. on Data Engineering
    (ICDE'01), Heidelberg, Germany, April 2001.
  • A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and
    R. T. Ng, "Constraint-Based Clustering in Large
    Databases", Proc. 2001 Int. Conf. on Database
    Theory (ICDT'01), London, U.K., Jan. 2001.
  • H. Miller and J. Han (eds.), Geographic Data
    Mining and Knowledge Discovery, Taylor and
    Francis, 2001.
  • Y. Bedard, T. Merrett, and J. Han, "Fundamentals
    of Geospatial Data Warehousing for Geographic
    Knowledge Discovery", H. Miller and J. Han
    (eds.), Geographic Data Mining and Knowledge
    Discovery, Taylor and Francis, 2001.
  • J. Han, M. Kamber, and A. K. H. Tung, "Spatial
    Clustering Methods in Data Mining: A Survey", H.
    Miller and J. Han (eds.), Geographic Data Mining
    and Knowledge Discovery, Taylor and Francis,
    2001.
  • H. Lu, L. Feng, and J. Han, "Beyond
    Intra-Transaction Association Analysis: Mining
    Multi-Dimensional Inter-Transaction Association
    Rules", ACM Transactions on Information Systems,
    2001.

28
Selected Publications (2000)
  • J. Han and M. Kamber, Data Mining: Concepts and
    Techniques, Morgan Kaufmann, August 2000.
  • K. Wang, Y. He and J. Han, "Mining Frequent
    Itemsets Using Support Constraints", Proc. 2000
    Int. Conf. on Very Large Data Bases (VLDB'00),
    Cairo, Egypt, Sept. 2000, pp. 43-52.
  • E. D. Kim, J. M. W. Lam, and J. Han, "AIM:
    Approximate Intelligent Matching for Time Series
    Data", Proc. 2000 Int. Conf. on Data Warehousing
    and Knowledge Discovery (DaWaK'00), Greenwich,
    U.K., Sept. 2000.
  • J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U.
    Dayal, M.-C. Hsu, "FreeSpan: Frequent
    Pattern-Projected Sequential Pattern Mining",
    submitted to 2000 Int. Conf. on Knowledge
    Discovery and Data Mining (KDD'00), Boston, MA,
    August 2000.
  • J. Pei and J. Han, "Can We Push More Constraints
    into Frequent Pattern Mining?", submitted to 2000
    Int. Conf. on Knowledge Discovery and Data Mining
    (KDD'00), Boston, MA, August 2000.
  • J. Han, J. Pei, and Y. Yin, "Mining Frequent
    Patterns without Candidate Generation", Proc.
    2000 ACM-SIGMOD Int. Conf. on Management of Data
    (SIGMOD'00), Dallas, TX, May 2000.
  • J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient
    Algorithm of Mining Frequent Closed Itemsets for
    Association Rules", submitted to 2000 ACM-SIGMOD
    Int. Workshop on Data Mining and Knowledge
    Discovery (DMKD'00), Dallas, TX, May 2000.
  • D. Cheung, C. Hwang, A. Fu, and J. Han,
    "Efficient Rule-Based Attribute-Oriented
    Induction for Data Mining", Journal of
    Intelligent Information Systems, 15(2): 175-200,
    2000.
  • N. Stefanovic, J. Han, and K. Koperski,
    "Object-Based Selective Materialization for
    Efficient Implementation of Spatial Data Cubes,"
    IEEE Transactions on Knowledge and Data
    Engineering, 12(6), 2000.