Master of Science - PowerPoint PPT Presentation

Slides: 29
Provided by: dream1
Learn more at: https://s2.smu.edu

Transcript and Presenter's Notes

Title: Master of Science


1
Data Mining Research at SMU
ME
Margaret H. Dunham, DBGroup
Yu Meng, Jie Huang, Lin Lu, Donya Quick, Michael Pierce
CSE Department, Southern Methodist University
Dallas, Texas 75275
mhd@engr.smu.edu

2
Data Mining: Introductory and Advanced Topics, by
Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature
Syndicate, Inc.
3
Outline
  • What is Data Mining?
  • EMM
  • Spatio-temporal modeling
  • Rare Event Detection
  • Bioinformatics
  • TCGR: DNA/RNA visualization
  • miRNA prediction
  • Web Usage Mining

4
Data Mining Definition
  • Finding hidden information in a database
  • Fit data to a model
  • Similar terms
  • Exploratory data analysis
  • Data driven discovery
  • Deductive learning

5
Query Examples
  • Database
    • Find all credit applicants with last name of
      Smith.
    • Identify customers who have purchased more than
      $10,000 in the last month.
    • Find all customers who have purchased milk.
  • Data Mining
    • Find all credit applicants who are poor credit
      risks. (classification)
    • Identify customers with similar buying habits.
      (clustering)
    • Find all items which are frequently purchased
      with milk. (association rules)
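
The contrast between the two milk queries can be sketched in a few lines of Python. This is only an illustration: the baskets are invented, and the function name bought_with and the 0.5 confidence cutoff are assumptions, not anything taken from the slides. The database-style query is a plain filter, while the data-mining-style query estimates, for each other item, the fraction of milk baskets that also contain it (the confidence of the rule milk -> item).

```python
from collections import Counter

# Hypothetical market-basket data, invented for illustration only.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "eggs"},
    {"milk", "bread", "cereal"},
]

# Database-style query: an exact filter on stored facts.
milk_baskets = [b for b in transactions if "milk" in b]


def bought_with(item, baskets, min_confidence=0.5):
    """Return {other: confidence} for rules item -> other, where
    confidence = count(item and other) / count(item).
    The cutoff min_confidence is an assumed parameter."""
    with_item = [b for b in baskets if item in b]
    if not with_item:
        return {}
    counts = Counter(o for b in with_item for o in b if o != item)
    return {
        other: n / len(with_item)
        for other, n in counts.items()
        if n / len(with_item) >= min_confidence
    }


# Data-mining-style query: a pattern inferred from the data.
rules = bought_with("milk", transactions)
print(rules)
```

With these made-up baskets, bread and cereal pass the cutoff and eggs does not.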

6
Outline
  • What is Data Mining?
  • EMM
  • Spatio-temporal modeling
  • Rare Event Detection
  • Bioinformatics
  • TCGR: DNA/RNA visualization
  • miRNA prediction
  • Web Usage Mining

7
Spatiotemporal Environment
  • Events arriving in a stream
  • At any time, t, we can view the state of the
    problem as represented by a vector of n numeric
    values
  • $V_t = \langle S_{1t}, S_{2t}, \ldots, S_{nt} \rangle$

8
Technique
  • Spatiotemporal modeling technique based on Markov
    models.
  • However:
    • Size of MM depends on size of dataset
    • The required structure of the MM is not known at
      the model construction time.
    • As the real world being modeled by the MM
      changes, so should the structure of the MM.

9
Extensible Markov Model (EMM)
  • Time Varying Discrete First Order Markov Model
  • Nodes are clusters of real world states.
  • Learning continues during application phase.
  • Learning:
    • Transition probabilities between nodes
    • Node labels (centroid/medoid of cluster)
  • Nodes are added and removed as data arrives
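
The idea of nodes as clusters with learned transition probabilities can be sketched roughly as follows. This is not the authors' implementation: the class name SimpleEMM, the Euclidean distance, the fixed threshold of 2.0, and the running-mean centroid update are all assumptions made for illustration, and node removal is left out. The input vectors are the six shown on the EMM Learning slide.

```python
import math


class SimpleEMM:
    """Illustrative EMM-like model; an assumption-laden sketch only."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed clustering radius
        self.centroids = []         # node labels (cluster centroids)
        self.sizes = []             # vectors absorbed by each node
        self.transitions = {}       # (from_node, to_node) -> count
        self.current = None         # index of the current node

    def _nearest(self, v):
        """Return (index, distance) of the closest node, or (None, inf)."""
        best, best_d = None, math.inf
        for i, c in enumerate(self.centroids):
            d = math.dist(v, c)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def learn(self, v):
        """Place v in the nearest node or add a node, then count the transition."""
        node, d = self._nearest(v)
        if node is None or d > self.threshold:
            self.centroids.append(list(v))  # the model grows with the data
            self.sizes.append(1)
            node = len(self.centroids) - 1
        else:
            n = self.sizes[node]            # running mean of the cluster
            self.centroids[node] = [
                (c * n + x) / (n + 1) for c, x in zip(self.centroids[node], v)
            ]
            self.sizes[node] = n + 1
        if self.current is not None:
            key = (self.current, node)
            self.transitions[key] = self.transitions.get(key, 0) + 1
        self.current = node
        return node

    def transition_prob(self, a, b):
        """Relative frequency of a -> b among all transitions leaving a."""
        total = sum(n for (src, _), n in self.transitions.items() if src == a)
        return self.transitions.get((a, b), 0) / total if total else 0.0


stream = [
    (18, 10, 3, 3, 1, 0, 0), (17, 10, 2, 3, 1, 0, 0), (16, 9, 2, 3, 1, 0, 0),
    (14, 8, 2, 3, 1, 0, 0), (14, 8, 2, 3, 0, 0, 0), (18, 10, 3, 3, 1, 1, 0),
]
emm = SimpleEMM(threshold=2.0)
nodes = [emm.learn(v) for v in stream]
print(nodes, emm.transition_prob(0, 1))
```

With the assumed threshold, the stream settles into two nodes; a smaller threshold would create more nodes, which is the dataset-dependent growth the previous slide describes.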

10
EMM Learning
<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>
11
Growth of EMM
Servent Data
12
EMM Performance Growth Rate
Minnesota Traffic Data
13
EMM Water Level Prediction Ouse Data
14
Rare Event
  • Rare: anomalous, surprising
  • Out of the ordinary
  • Not outlier detection
    • Ex: Snow in upstate New York is not rare
    • Snow in upstate New York in June is rare
  • Rare events may change over time
  • Applications
    • Intrusion detection
    • Fraud
    • Flooding
    • Unusual automobile/network traffic
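
The snow example suggests that rarity depends on context rather than on the value alone. The slides do not give a detection rule, so the following is only a sketch under assumptions: the function name rare_events, the 0.05 threshold, and the weather log are all invented. It flags an event when its relative frequency within its own context is low.

```python
from collections import Counter


def rare_events(observations, threshold=0.05):
    """observations is a list of (context, event) pairs.
    Return the pairs whose relative frequency within their context
    is below threshold. The threshold is an assumed parameter."""
    context_counts = Counter(context for context, _ in observations)
    pair_counts = Counter(observations)
    return {
        pair
        for pair, n in pair_counts.items()
        if n / context_counts[pair[0]] < threshold
    }


# Hypothetical weather log, invented for illustration only.
observations = (
    [("January", "snow")] * 20
    + [("January", "clear")] * 10
    + [("June", "clear")] * 29
    + [("June", "snow")] * 1
)

rare = rare_events(observations)
print(rare)
```

Snow accounts for 21 of the 60 observations overall, so it is not unusual in itself; only snow in June is flagged.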

15
Rare Event in Cisco Data
16
Outline
  • What is Data Mining?
  • EMM
  • Spatio-temporal modeling
  • Rare Event Detection
  • Bioinformatics
  • TCGR: DNA/RNA visualization
  • miRNA prediction
  • Web Usage Mining

17
Chaos Game Representation (CGR)
  • 2D technique to visually see the distribution of
    subpatterns
  • Our technique is based on the following:
    • Generate totals for each subpattern
    • Scale totals to a [0,1] range. (Note: scaling can
      be a problem)
    • Convert range to red/blue
      • 0-0.5: White to Blue
      • 0.5-1: Blue to Red
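
The three steps above (count, scale, color) might look like this in Python. It is a sketch under assumptions: scaling divides by the largest count, the color ramp is linear in RGB, and the short sequence is invented. The slides do not say which scaling or interpolation was actually used, and the 2D placement of cells in the CGR square is not covered here.

```python
from itertools import product


def pattern_counts(seq, k, alphabet="ACGU"):
    """Count every length-k subpattern of seq, including those with zero hits."""
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in counts:
            counts[sub] += 1
    return counts


def scale(counts):
    """Scale counts to [0, 1] by dividing by the maximum (an assumed choice)."""
    top = max(counts.values())
    return {p: (c / top if top else 0.0) for p, c in counts.items()}


def to_rgb(x):
    """Map [0, 0.5] from white to blue and (0.5, 1] from blue to red."""
    if x <= 0.5:
        t = x / 0.5
        return (round(255 * (1 - t)), round(255 * (1 - t)), 255)
    t = (x - 0.5) / 0.5
    return (round(255 * t), 0, round(255 * (1 - t)))


# Hypothetical RNA fragment, invented for illustration only.
counts = pattern_counts("UUCUUCGUG", 3)
colors = {p: to_rgb(x) for p, x in scale(counts).items()}
print(counts["UUC"], colors["UUC"], colors["GUG"], colors["AAA"])
```

Dividing by the maximum means one very frequent subpattern pushes every other cell toward white, which may be the kind of scaling problem the slide notes.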

18
CGR Example
Homo sapiens, all mature miRNA: patterns of
length 3
UUC
GUG
19
Temporal CGR (TCGR)
  • Temporal version of Frequency CGR
  • In our context, temporal means the starting
    location of a window
  • 2D Array
    • Each row represents counts for a particular
      window in sequence
      • First row: first window
      • Last row: last window
      • We start successive windows at the next character
        location
    • Each column represents the counts for the
      associated pattern in that window
      • Initially we have assumed order of patterns is
        alphabetic
  • Size of TCGR depends on sequence length and
    subpattern length
  • As sequence lengths vary, we only examine
    complete windows
  • We only count patterns completely contained in
    each window.
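
The row and column layout described above can be sketched as a small function. The name tcgr, the RNA alphabet, and the sample sequence are assumptions for illustration; the window and pattern lengths are ordinary parameters. Windows slide one character at a time, only complete windows are kept, and only patterns lying fully inside a window are counted.

```python
from itertools import product


def tcgr(seq, window, k, alphabet="ACGU"):
    """Return (patterns, rows): rows[i][j] is the count of patterns[j]
    in the window that starts at position i of seq."""
    # product over a sorted alphabet yields the patterns in alphabetic order
    patterns = ["".join(p) for p in product(alphabet, repeat=k)]
    column = {p: j for j, p in enumerate(patterns)}
    rows = []
    for start in range(len(seq) - window + 1):  # complete windows only
        w = seq[start:start + window]
        row = [0] * len(patterns)
        for i in range(window - k + 1):         # patterns fully inside w
            sub = w[i:i + k]
            if sub in column:
                row[column[sub]] += 1
        rows.append(row)
    return patterns, rows


# Hypothetical sequence; window length 5 and pattern length 2.
patterns, rows = tcgr("UUCGUG", window=5, k=2)
print(len(rows), len(rows[0]))
```

The array has (sequence length - window + 1) rows and (alphabet size ** pattern length) columns, which is why its size depends on both lengths.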

20
TCGR Example
21
TCGR Mature miRNA (Window 5, Pattern 2)
22
Outline
  • What is Data Mining?
  • EMM
  • Spatio-temporal modeling
  • Rare Event Detection
  • Bioinformatics
  • TCGR: DNA/RNA visualization
  • miRNA prediction
  • Web Usage Mining

23
The BIG PICTURE
2003-10-05154920050721435700000026210000000000  
             02652026520000000002003-10-051640
49050832595900000872710001142380              
07107071070000000002003-10-0504551005076779990
0000191300000670518              
00000000000000000002003-10-0509431005078176610
0000603030000000000              
03657004690000000002003-10-0514493605081824200
00007066200000000000811a39       
09142071070000000002003-10-0521235705075903160
0000465050002794335              
11992071070000000002003-10-0511301605073051260
0000465050000195747              
1684600597corduroycoats
CAN'T SEE THE FOREST FOR THE TREES
24
  • Interests
  • Motivations

[Diagram labels: Preprocess Web Data (Cleanse, Sessionize); URL Abstraction; Markov Model per Cluster; Markov Model; User-defined beginning/ending Web pages; Significant Usage Pattern; User Preferred Navigation Trail; Cluster Web Sessions; Normalized Probability]
25
Experimental Result
  • On average, purchase sessions are longer than
    sessions without purchase
    • Review the information, compare the price,
      the quality, etc.
    • Fill out the billing and shipping
      information to commit the purchase

WebKDD05 25
26
Experimental Result
SUPs (significant usage patterns) in non-purchase cluster
Interested in gathering information about products
in different categories.
S-C1-C1-C2-C3-C4-C5-C5-I1-E
S-C1-C1-I1-C1-C2-C3-C4-C5-E
S-I1-C1-C2-C3-C4-C5-C6-C7-E
Interested in reviewing general pages (to gather
general information).
Not serious visitors (the average session length
is 3)
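
A first-order Markov model over page trails can be estimated by counting page-to-page transitions. The sketch below reuses the three trails listed on this slide purely as sample input; in practice the model would presumably be built from all sessions in a cluster. Reading S and E as the user-defined beginning and ending pages is an assumption, and the function name transition_probs is hypothetical.

```python
from collections import defaultdict

# The three trails from this slide, used here only as sample input.
trails = [
    "S-C1-C1-C2-C3-C4-C5-C5-I1-E",
    "S-C1-C1-I1-C1-C2-C3-C4-C5-E",
    "S-I1-C1-C2-C3-C4-C5-C6-C7-E",
]


def transition_probs(sessions):
    """Estimate P(next page | current page) from '-'-separated trails."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        pages = session.split("-")
        for a, b in zip(pages, pages[1:]):  # consecutive page pairs
            counts[a][b] += 1
    return {
        a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
        for a, nxt in counts.items()
    }


probs = transition_probs(trails)
print(probs["S"], probs["C1"])
```

How the deck turns such probabilities into a normalized probability for a whole trail is not stated in the transcript, so that step is not sketched here.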
27
Experimental Result
28
  • Thank You