
Title: Data mining: theory and applications


1
Data mining: theory and applications
  • Heikki Mannila
  • FDK Scientific Advisory Board Meeting, March 22,
    2004

2
Why?
  • Lots of data
  • Novel types of data
  • Computational techniques gaining in importance
  • Algorithmic and probabilistic methods are needed

3
Where is data mining?
  • Lots of research since 1993
  • Algorithms, databases, pattern recognition,
    probabilistic modeling
  • Early emphasis on scalability and applications
  • Current emphasis on applications and solid
    frameworks
  • Data analysis as a major theme in algorithms and
    database conferences
  • Multidisciplinary efforts more common

4
Approach and goals in FDK
  • Application area → concept formation →
    algorithmic question → algorithm → analysis →
    back to practice
  • Algorithmic basic research in areas where the
    results can be put into practice
  • Understanding the fundamental properties of data
    summarization and analysis

5
Data mining people
  • Loose group structure: theory and applications
  • 3 seniors (Heikki Mannila, Hannu Toivonen, Jaakko
    Hollmén)
  • 5 postdocs (Bart Goethals, Floris Geerts, Aris
    Gionis, Panayiotis Tsaparas, Marko Salmenkivi)
  • Lots of Ph.D. and M.Sc. students

6
Projects
  • CompGenome: genome structure
  • Gene expression studies (J. Hollmén)
  • MobiLife (EU 6th Framework IP)
  • April II (STREP)
  • Paleoprojects
  • Some industrial projects

7
Computational methods
  • Pattern discovery
  • Finding recurrent patterns in large data sets
  • Succinct representations of data
  • Combinatorial algorithms
  • Dynamic programming, greedy algorithms, etc.
  • Probabilistic modeling
  • Combining pattern discovery, combinatorial
    algorithms, and probabilistic modeling

8
Application areas
  • Genome structure (Aris Gionis)
  • Understanding the types of variation in genomes
    (within and between species)
  • Gene mapping (Hannu Toivonen)
  • Gene expression (Jaakko Hollmén)
  • Paleontology
  • Document data
  • Telecommunications data
  • Spatial data (onomastics, ecology)

9
Research themes: examples
  • Finding structure in large 0-1 data sets: topic
    models, dense itemsets, etc.
  • Seppänen, Geerts, Goethals, Bingham, Tatti
  • Where are the ones in a large 0-1 dataset?
  • Combining probabilistic and combinatorial methods
  • Sequence structure: segmentations, vocabularies
  • Gionis, Heimonen, Terzi, Kollin
  • Discrete algorithms on probabilistic structures

10
Research themes: examples
  • Pattern discovery
  • Goethals, Geerts, Mielikäinen
  • Discrete algorithms, some probabilistic notions
  • Foundations of data analysis
  • Seppänen, Goethals, Gionis
  • Succinct representations
  • Algorithmic aspects of mixture models
  • Combining different types of data (sequences,
    spatial data, temporal data)
  • Leino, Salmenkivi, Gionis, Geerts etc.

11
Some examples of themes
  • Orders from unordered data
  • Fragments of order
  • Random walks on databases

12
Orders from unordered data
  • Given a 0-1 matrix, can one find some information
    about the order of the rows or columns?
  • Two approaches
  • spectral ordering (global total orders)
  • fragments of order (local total orders)

13
Example: paleontological data
  • Given a matrix of occurrences of species in
    fossil sites
  • Ages of the fossil sites are not available
  • How to order the sites according to their age?
  • Background information: species arrive and vanish
  • Try to find ordering that minimizes Lazarus
    events
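  • Sites (rows, in time order) by species (columns):
      A B C
      0 0 1
      1 1 0
      1 0 1
      0 1 1
      0 1 0
  • Lazarus events: zeros lying between two ones in a
    column (here B in row 3 and C in row 2)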
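The counting behind this example can be sketched in a few lines of Python. The slide does not give a formal definition, so this sketch assumes that a Lazarus event is a 0 lying between the first and last 1 of a species column under the chosen site order. The function name `lazarus_events` and its `order` parameter are illustrative, not taken from the talk.

```python
def lazarus_events(matrix, order=None):
    """Count zeros lying between the first and last 1 of each column.

    matrix: list of 0-1 rows (sites) over columns (species).
    order:  optional list of row indices giving the site order to evaluate.
    """
    if order is None:
        order = range(len(matrix))
    rows = [matrix[i] for i in order]
    total = 0
    for column in zip(*rows):
        ones = [i for i, v in enumerate(column) if v == 1]
        if ones:
            # Span between first and last occurrence, minus the occurrences.
            total += ones[-1] - ones[0] + 1 - len(ones)
    return total


# The 5 x 3 matrix from the slide (columns A, B, C).
sites = [
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 0],
]

print(lazarus_events(sites))                   # 2 (one gap in B, one in C)
print(lazarus_events(sites, [0, 2, 3, 1, 4]))  # 1 (one gap in A)
```

Reordering the sites changes the count, which is what the ordering task on the following slides tries to exploit.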
14
Consecutive ones property
  • A matrix has consecutive ones property if rows
    can be ordered so that the ones in each column
    are consecutive
  • Consecutive ones property ⇔ no Lazarus events
  • Booth and Lueker (1976): the property can be
    decided in linear time
  • Only for the exact case

15
Computational task
  • Find an ordering of the sites that minimizes
    Lazarus events
  • Minimizes deviation from consecutive ones
    property
  • NP-hard
  • Good approximation algorithms
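
For very small matrices the task can be solved exactly by trying every ordering, which makes the objective concrete. This is a brute-force sketch only: it runs in factorial time and is neither the linear-time Booth and Lueker test nor one of the approximation algorithms mentioned above. The cost function assumes a Lazarus event is a 0 between the first and last 1 of a column, and the function names are illustrative.

```python
from itertools import permutations


def lazarus_events(rows):
    """Zeros between the first and last 1 of each column, summed."""
    total = 0
    for column in zip(*rows):
        ones = [i for i, v in enumerate(column) if v == 1]
        if ones:
            total += ones[-1] - ones[0] + 1 - len(ones)
    return total


def best_order_bruteforce(matrix):
    """Return (row order, cost) minimizing Lazarus events; O(n!) orderings."""
    best_order, best_cost = None, None
    for order in permutations(range(len(matrix))):
        cost = lazarus_events([matrix[i] for i in order])
        if best_cost is None or cost < best_cost:
            best_order, best_cost = list(order), cost
    return best_order, best_cost


sites = [
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 0],
]
order, cost = best_order_bruteforce(sites)
print(order, cost)
```

Under this cost the minimum is zero exactly when the matrix has the consecutive ones property. For the 5 x 3 example matrix the minimum is 1, so that matrix does not have the property.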

16
Spectral ordering
  • Let s(i,j) be a similarity measure between sites
  • Laplacian matrix L(i,j)
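
The formula for L(i,j) did not survive in this transcript. The sketch below therefore assumes the standard graph Laplacian, L(i,i) = sum over j of s(i,j) and L(i,j) = -s(i,j) for i ≠ j, and orders the sites by the eigenvector of the second-smallest eigenvalue (the Fiedler vector). The similarity s(i,j) is assumed to be the number of species shared by sites i and j. A plain power iteration stands in for a library eigensolver. None of these choices is confirmed by the slides.

```python
def spectral_order(matrix, iterations=500):
    """Order rows by the Fiedler vector of the similarity-graph Laplacian."""
    n = len(matrix)
    # Assumed similarity: number of species shared by sites i and j.
    s = [[0 if i == j else sum(a * b for a, b in zip(matrix[i], matrix[j]))
          for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in s]
    # Power iteration on c*I - L with the constant vector projected out.
    # Since the eigenvalues of L lie in [0, 2 * max degree], c*I - L is
    # positive definite, and for a connected similarity graph the iteration
    # converges to the eigenvector of the smallest nonzero eigenvalue of L.
    c = 2 * max(degree) + 1
    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]
        y = [(c - degree[i]) * x[i] + sum(s[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return sorted(range(n), key=lambda i: x[i])


# A matrix with the consecutive ones property, with its rows shuffled.
shuffled = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
print(spectral_order(shuffled))
```

On this toy input the returned order lines the rows up so that the ones in every column are consecutive. The direction of time is not identified, since an eigenvector is only defined up to sign.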

17
Spectral ordering
18
Spectral ordering and consecutive ones property
  • Spectral ordering does a good job of
    approximating the consecutive ones property
    (empirically)
  • If COP holds, spectral method finds it
  • Approximation guarantees: no bounds are known
  • Practice: very good results
  • Lots of nice open questions

19
Fortelius, Jernvall, Gionis, Mannila, in
preparation
21
Fragments of order
  • Can we find an order of introduction to
    attributes of a 0-1 data set?
  • Example: Citeseer answer counts for combinations
    of terms
      database system   query   selectivity estimation   Citeseer answers
      1                 1       1                          49
      1                 1       0                        1930
      0                 1       1                         221
      1                 0       1                           4
  • database system < query < selectivity
    estimation

22
Fragments of order
  • Find all orderings A1 < A2 < ... < Ak among the
    attributes such that
  • There are many rows having at least two 1s in the
    columns A1, A2, ..., Ak
  • There are few rows having the pattern
      Ai ... Aj ... Ak
      1  ... 0  ... 1
  • A levelwise algorithm works fine
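
The two conditions can be checked directly, and a levelwise search can then grow fragments one attribute at a time. The sketch below is one possible reading, not the algorithm from the talk. The thresholds `min_support` and `max_violations` are hypothetical, and the data are the Citeseer counts from the previous slide, with columns assumed to be (database system, query, selectivity estimation). Pruning uses only the violation count, because a row that violates a fragment also violates every extension of it.

```python
def fragment_stats(rows, order):
    """Return (support, violations) for the attribute order `order`.

    rows: list of (bits, count) pairs, one per distinct 0-1 row.
    support:    rows with at least two 1s among the ordered attributes.
    violations: rows with a 0 between two 1s along the order (1 ... 0 ... 1).
    """
    support = violations = 0
    for bits, count in rows:
        ones = [i for i, a in enumerate(order) if bits[a] == 1]
        if len(ones) >= 2:
            support += count
            if ones[-1] - ones[0] + 1 != len(ones):
                violations += count
    return support, violations


def levelwise_fragments(rows, n_attrs, min_support, max_violations):
    """Grow fragments by appending one attribute at a time."""
    level = [(a,) for a in range(n_attrs)]
    found = []
    while level:
        next_level = []
        for frag in level:
            for a in range(n_attrs):
                if a in frag:
                    continue
                cand = frag + (a,)
                support, bad = fragment_stats(rows, cand)
                if bad <= max_violations:  # safe to prune on violations
                    next_level.append(cand)
                    if support >= min_support:
                        found.append(cand)
        level = next_level
    return found


# Columns: 0 = database system, 1 = query, 2 = selectivity estimation.
citeseer = [((1, 1, 1), 49), ((1, 1, 0), 1930), ((0, 1, 1), 221), ((1, 0, 1), 4)]

print(fragment_stats(citeseer, (0, 1, 2)))  # (2204, 4)
print(fragment_stats(citeseer, (0, 2, 1)))  # (2204, 1930)
print(levelwise_fragments(citeseer, 3, min_support=1000, max_violations=10))
```

Both conditions are symmetric under reversal, so each fragment is reported together with its mirror image. Choosing a direction needs extra information that this sketch does not model.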

23
Random walks on databases
  • Random walks on the web: very powerful
  • What is a random walk on a database?
  • State space: partial tuples t occurring in the
    database
  • Edges E: (t,s) in E if there is a query Q from a
    given class such that s is in Q(t)
  • Ranking of query results etc.
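
One way to make this concrete is the toy sketch below. It rests on several assumptions the slide does not state. States are single attribute values, written as (column, value) pairs. The query class is "find the values that co-occur with t in some row". Ranking uses the stationary distribution of a lazy walk, which stays put with probability 1/2 so that the iteration also converges on bipartite graphs. The table contents are made up for illustration.

```python
from collections import defaultdict


def random_walk_ranking(rows, iterations=200):
    """Stationary distribution of a lazy random walk over (column, value) states."""
    # weight[t][s] = number of rows in which values t and s co-occur.
    weight = defaultdict(lambda: defaultdict(int))
    for row in rows:
        cells = list(enumerate(row))
        for t in cells:
            for s in cells:
                if t[0] != s[0]:
                    weight[t][s] += 1
    states = sorted(weight)
    prob = {t: 1.0 / len(states) for t in states}
    for _ in range(iterations):
        nxt = {t: 0.5 * prob[t] for t in states}  # lazy step: stay in place
        for t in states:
            out = sum(weight[t].values())
            for s, w in weight[t].items():
                nxt[s] += 0.5 * prob[t] * w / out
        prob = nxt
    return prob


# Toy two-column table (author, topic); the values are placeholders.
table = [("a1", "p1"), ("a1", "p2"), ("a2", "p1")]
ranking = random_walk_ranking(table)
for state, p in sorted(ranking.items(), key=lambda kv: -kv[1]):
    print(state, round(p, 3))
```

Because this co-occurrence graph is undirected, the stationary probability of a state is proportional to the number of co-occurrences it takes part in. A different query class would yield a different graph and a different ranking.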

24
Future topics
  • Theory and applications of data mining
  • Combinatorial algorithms with probabilistic
    notions, statistics
  • Foundations: algorithmic aspects of mixture
    models, condensed representations
  • Sequences as one unifying topic
  • Spatial and spatio-temporal data
  • Genome structure: what is all the DNA for?
  • Linguistic data, process data, paleontological
    data,
  • Postdocs will manage projects