KDDRG Research Projects

Learn more at: http://web.cs.wpi.edu

1
KDDRG Research Projects
  • Prof. Carolina Ruiz
  • ruiz_at_cs.wpi.edu
  • Department of Computer Science
  • Worcester Polytechnic Institute

2
Some Current Analytical Data Mining Research Projects at WPI
  • Mining Complex Data: Set and Sequence Mining
      • Systems performance data
      • Sleep data
      • Financial data
      • Web data
  • Data Mining for Genetic Analysis
      • Correlating genetic information with diseases
      • Predicting gene expression patterns
  • Data Mining for Electronic Commerce
      • Collaborative and Content-Based Filtering
      • Using Association Rules and using Neural Networks

3
Analyzing Sleep Data
  • Purpose
      • Find associations between sleep patterns and health/pathology
      • Obtain patterns of the different sleep stages (4 sleep + REM + Wake)
  • Data set
      • Clinical (sequential)
          • Electro-encephalogram (EEG)
          • Electro-oculogram (EOG)
          • Electro-myogram (EMG)
          • Probe measuring flow of oxygen in blood, etc.
      • Diagnostic (tabular)
          • Questionnaire responses
          • Patient's demographic info
          • Patient's medical history

(Source: http://www.blsc.com)
  • Potential Rules
      • (A) Association Rules
          • (Sleep latency < 3 min) and (hereditary disorder) => Narcolepsy [confidence = 92%, support = 13%]
      • (B) Classification Rules
          • (snoring = HEAVY) and (AHI > 30/hour: severe OSA) => (Race = Caucasian) [confidence = 70%, support = 8%]
  • AHI = Apnea-Hypopnea Index; OSA = Obstructive Sleep Apnea

WPI, UMass Medical, BC
4
Input Data
  • Each instance: tabular + set + sequential attributes

     | attr1                     | attr2            | attr3 | attr4               | attr5  | class
     | illnesses                 | heart rate       | age   | oxygen              | gender | Epworth
  P1 | depression, fatigue       |                  | 27    |                     | M      | 5
  P2 | stroke, dementia, fatigue | 97,72,67,80,...  | 73    | 90,92,96,89,86,...  | F      | 23
  P3 | arthritis                 | 102,99,87,96,... | 49    | 97,100,82,80,70,... | M      | 14
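One way to hold such a mixed instance in code is sketched below. It is only an illustration under stated assumptions: the class name `PatientInstance`, its field names, and the helper `lowest_oxygen` are hypothetical and not part of the WPI tools. The values are the ones legible in the table, and each sequence holds only the readings visible on the slide (none are visible for P1).

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class PatientInstance:
    """One instance mixing set-valued, sequential, and plain tabular attributes."""
    illnesses: FrozenSet[str]        # set attribute
    age: int                         # tabular attribute
    gender: str                      # tabular attribute
    epworth: int                     # class attribute
    heart_rate: List[int] = field(default_factory=list)  # sequential attribute
    oxygen: List[int] = field(default_factory=list)      # sequential attribute


def lowest_oxygen(instance: PatientInstance) -> Optional[int]:
    """Return the smallest oxygen reading, or None when no sequence was recorded."""
    return min(instance.oxygen) if instance.oxygen else None


# Values as legible on the slide; sequences are truncated to the visible readings.
p1 = PatientInstance(frozenset({"depression", "fatigue"}), age=27, gender="M", epworth=5)
p2 = PatientInstance(
    frozenset({"stroke", "dementia", "fatigue"}), age=73, gender="F", epworth=23,
    heart_rate=[97, 72, 67, 80], oxygen=[90, 92, 96, 89, 86],
)
p3 = PatientInstance(
    frozenset({"arthritis"}), age=49, gender="M", epworth=14,
    heart_rate=[102, 99, 87, 96], oxygen=[97, 100, 82, 80, 70],
)

print(lowest_oxygen(p2), lowest_oxygen(p1))
```

Keeping the set attribute as a `frozenset` and the sequences as lists makes the three attribute kinds explicit, so a miner can treat each kind with a suitable operator (subset tests for sets, pattern matching for sequences).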
5
Analyzing Financial Data
  • Sequential data: daily stock values
  • Normal (tabular/relational) data
      • sector (computers, agricultural, educational, ...), type of government, product releases, companies' awards, ...
  • Desired rules
      • If DELL's stock value increases and 1999 < year < 2002 => IBM's stock value decreases

6
Events: Financial Data
  • Basic events: 16 or so financial templates [Little & Rhodes 78]
  • Difficult pattern matching: alignments and time warping
  • Example templates
      • Panic Reversal
      • Head & Shoulders Reversal
      • Rounding Top Reversal
      • Descending Triangle Reversal
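The slide names time warping as part of the matching problem without giving details. As a hedged illustration, the textbook dynamic time warping (DTW) distance is sketched below; it is not necessarily the variant used in the project. The function name `dtw_distance` and both example series are hypothetical.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the best of: stretch a, stretch b, or advance both series
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical data: a template shape and the same shape stretched in time.
template = [1, 2, 3, 2, 1]
prices = [1, 1, 2, 3, 3, 2, 1]

print(dtw_distance(template, prices))
print(dtw_distance(template, [3, 2, 1, 2, 3]))
```

Because DTW lets one point align with several consecutive points of the other series, a template still matches when the same shape unfolds faster or slower in the price data, which a point-by-point comparison would miss.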
7
WPI Weka Tool for mining complex
temporal/spatial associations
8
Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)
  • SNP analysis
      • discovering correlations between sequence variations and diseases
  • Gene expression
      • discovering patterns that cause a gene to be expressed in a particular cell

9
Correlating Genetics with Diseases
  • Utilize Data Mining Techniques with Actual
    Genetic Data Sampled from Research
  • Spinal Muscular Atrophy (SMA): an inherited disease that
    results in progressive muscle degeneration and
    weakness.

10
Genomic Data Resources

Patient Gender | SMA Type (Severity) | SNP Location | C212 (Father / Mother) | AG1-CA (Father / Mother)
Female         | Severe              | Y272C        | 31 / 28 29             | 102 / 108 112
Male           | Mild                | Y272C        | 28 29 / 25             | 108 112 / 114

Wirth, B. et al. Journal of Human Molecular Genetics
11
Our System: CAGE
  • To predict gene expression based on DNA sequences.

[Figure: CAGE. Panels: Muscle Cell, Neural Cell, Seam Cells, each showing Gene 1, Gene 2, Gene 3; legend: On / Off]
12
Gene expression Analysis

Promoter | Motifs         | Promoter sequence                           | Cell type
PR1      | M1, M2, M4, M5 | CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT  | neural
PR2      | M1, M4, M5     | AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA  | neural
PR3      | M1, M4         | CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA  | muscle
PR4      | M1, M2, M5     | GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA  | neural
PR5      | M1, M4         | ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA  | muscle
PR6      | M4, M5, M3     | GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC  | neural
PR7      | M1, M2, M5, M3 | TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA  | neural
PR8      | M2, M4, M5     | ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC | neural
PR9      | M4, M3         | ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA  | muscle
13
Gene Expression
  • Transcription of DNA into RNA

TRANSCRIPTIONAL PROTEINS
PROMOTER REGION
..CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGA
MOTIFS M1, M2, M4
MUSCLE CELL
14
Promoters PR1 to PR9 with their motifs, sequences, and cell types (same data as slide 12), annotated with two rules:
  • R1: M1, M4, M5 => Neural [supp = 22%, conf = 100%]; supporting instances: PR1, PR2
  • R2: M2, M4, M5 => Neural [supp = 22%, conf = 100%]; supporting instances: PR1, PR8
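The support and confidence quoted for R1 and R2 can be checked with a few lines of code. The sketch below assumes that the nine cell-type labels pair with PR1 to PR9 in the order listed, and that support is counted over all nine promoters; the names `PROMOTERS` and `rule_stats` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Motif sets and cell types read off the slide (cell types paired with PR1..PR9 in order).
PROMOTERS: Dict[str, Tuple[Set[str], str]] = {
    "PR1": ({"M1", "M2", "M4", "M5"}, "neural"),
    "PR2": ({"M1", "M4", "M5"}, "neural"),
    "PR3": ({"M1", "M4"}, "muscle"),
    "PR4": ({"M1", "M2", "M5"}, "neural"),
    "PR5": ({"M1", "M4"}, "muscle"),
    "PR6": ({"M3", "M4", "M5"}, "neural"),
    "PR7": ({"M1", "M2", "M3", "M5"}, "neural"),
    "PR8": ({"M2", "M4", "M5"}, "neural"),
    "PR9": ({"M3", "M4"}, "muscle"),
}


def rule_stats(antecedent: Set[str], cell_type: str,
               data: Dict[str, Tuple[Set[str], str]]) -> Tuple[float, float, List[str]]:
    """Return (support, confidence, supporting instances) for antecedent => cell_type."""
    # Promoters that contain every motif of the antecedent
    matches = [name for name, (motifs, _) in data.items() if antecedent <= motifs]
    # Of those, the promoters that also carry the predicted cell type
    hits = [name for name in matches if data[name][1] == cell_type]
    support = len(hits) / len(data)
    confidence = len(hits) / len(matches) if matches else 0.0
    return support, confidence, hits


print(rule_stats({"M1", "M4", "M5"}, "neural", PROMOTERS))  # R1
print(rule_stats({"M2", "M4", "M5"}, "neural", PROMOTERS))  # R2
```

Under these assumptions both rules are supported by 2 of the 9 promoters (about 22%) and hold for every promoter that contains the antecedent motifs, which agrees with the figures on the slide.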
15
Well-clustered motifs

Motifs (in order) | Distances
M1, M2, M4, M5    | 240, 150, 100
M1, M4, M5        | 260, 210
M1, M4            | 360
M1, M2, M5        | 100, 350
M1, M4            | 190
M4, M5, M3        | 150, 120
M1, M2, M5, M3    | 210, 100, 110
M2, M4, M5        | 21, 18
M4, M3            | 60

IR1: M1, M2, M5   σ(M1,M2) = 120.1   μ(M1,M2) = 216.6   cvd(M1,M2) = σ/μ = 0.55
16
Distance-based Association Rules
Sample distance-based assoc. rule
  • Given: min-support, min-confidence, and max-cvd thresholds
  • Mine: all distance-based association rules
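A minimal sketch of how the three thresholds might be applied to one candidate rule follows. It assumes cvd is the coefficient of variation of the distances between a motif pair, computed as sample standard deviation over mean; the slides do not spell out the formula, but the quoted values 120.1, 216.6, and 0.55 are consistent with that ratio. The function names and the example distances are hypothetical; the distances were picked so the result lands near those quoted values.

```python
from statistics import mean, stdev
from typing import Dict, Sequence, Tuple


def cvd(distances: Sequence[float]) -> float:
    """Coefficient of variation of distances: sample std. deviation over mean.

    Needs at least two distances, because a sample deviation is undefined for one.
    """
    return stdev(distances) / mean(distances)


def passes_thresholds(support: float, confidence: float,
                      pair_distances: Dict[Tuple[str, str], Sequence[float]],
                      min_support: float, min_confidence: float, max_cvd: float) -> bool:
    """Keep a rule only if it is frequent, reliable, and every motif pair is tightly spaced."""
    if support < min_support or confidence < min_confidence:
        return False
    return all(cvd(d) <= max_cvd for d in pair_distances.values())


# Hypothetical M1-M2 distances across three supporting promoters.
example = [100, 210, 340]

print(round(stdev(example), 1), round(mean(example), 1), round(cvd(example), 2))
print(passes_thresholds(0.33, 1.0, {("M1", "M2"): example},
                        min_support=0.2, min_confidence=0.9, max_cvd=0.6))
```

A low cvd means the motif pair sits at a similar distance in every supporting promoter, so the max-cvd threshold filters out rules whose motifs co-occur but are not consistently spaced.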

17
Grad. & Undergrad. Students
  • Jonathan Rudolph
  • Eduardo Paredes
  • Iavor N. Trifonov.
  • Takeshi Kawato
  • Cindy Leung and Sam Holmes.
  • John Baird (BB), Jay Farmer, Rebecca Gougian
    (BB), Ken Monterio (BB), Paul Young.
  • Zachary Stoecker-Sylvia. Kristin Blitsch (BB),
    Ben Lucas, Sarah Towey(BB)
  • Wendy Kogel, Brooke LeClair, Christopher St.
    Yves.
  • Brian Murphy, David Phu (CS/BB), Ian Pushee,
    Frederick Tan (CS/BB).
  • Daniel Doyle, Jared Judecki, James Lund, Bryan
    Padovano (BB).
  • Christopher Cole.
  • Michael Ciman and John Gulbrandsen.
  • Tara Halwes
  • Christopher Martino.
  • Matthew Berube.
  • Anna Novikov.
  • Amy Kao and Dana Rock.
  • Ali Benamara.
  • Dharmesh Thakkar.
  • Senthil K Palanisamy.
  • Zachary Stoecker-Sylvia.
  • Keith A. Pray.
  • Jonathan Freyberger.
  • Maged El-Sayed.
  • Parameshvyas Laxminarayan.
  • Aleksandar Icev.
  • Wendy Kogel.
  • Michael Sao Pedro.
  • Christopher Shoemaker.
  • Weiyang Lin.