Transcript and Presenter's Notes

Title: Master of Science


1
DATA MINING APPLICATIONS
  • Margaret H. Dunham
  • Southern Methodist University
  • Dallas, Texas 75275
  • mhd_at_engr.smu.edu
  • This material is based in part upon work
    supported by the National Science Foundation
    under Grant No. 9820841
  • Some slides used by permission from Dr. Eamonn
    Keogh, University of California, Riverside
    (eamonn_at_cs.ucr.edu)

2
The 2000 ozone hole over the Antarctic, seen by
EPTOMS: http://jwocky.gsfc.nasa.gov/multi/multi.html#hole
3
OBJECTIVE
  • Explore some of the applications of data mining
    techniques.

4
Data Mining Applications Outline
  • Introduction: Data Mining Overview
      • Classification (Prediction, Forecasting)
      • Clustering
      • Association Rules (Link Analysis)
  • Applications
      • Fraud Detection and Illegal Activities
      • Facial Recognition
      • Cheating and Plagiarism
      • Bioinformatics
  • Conclusions

5
Data Mining Overview
  • Finding hidden information in a database
  • Fit data to a model
  • You must know what you are looking for
  • You must know how to look for it

6
If it looks like a duck,
walks like a duck, and quacks
like a duck, then it's a duck.
If it looks like a terrorist,
walks like a terrorist, and
quacks like a terrorist, then it's
a terrorist.
Classification Clustering Link Analysis
(Profiling) (Similarity)
7
Classification Applications
  • Teachers classify students' grades as A, B, C, D,
    or F.
  • Letter Recognition
  • Handwriting Recognition
  • Phishing: http://computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=cybercrime_hacking&articleId=9002996&taxonomyId=82
  • Pluto: http://www.npr.org/templates/story/story.php?storyId=5705254

8
Classification Example
Given a collection of annotated data (in this
case, five instances of katydids and five of
grasshoppers), decide what type of insect the
unlabeled example is.
Figure labels: Katydids, Grasshoppers
(c) Eamonn Keogh, eamonn_at_cs.ucr.edu
9
Scatter plot of Antenna Length and Abdomen Length
for Katydids and Grasshoppers
(c) Eamonn Keogh, eamonn_at_cs.ucr.edu
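
The insect example can be sketched as a nearest-neighbor classifier over the two plotted features. This is a minimal sketch, not the method used on the slide: the measurements are invented for illustration, and nearest-neighbor is just one classifier that fits the "annotated data plus one unlabeled example" setup.

```python
import math

# Hypothetical (antenna_length, abdomen_length) measurements.
# These values are invented for illustration; they are not from the slides.
TRAINING = [
    ((7.0, 2.5), "Katydid"),
    ((8.1, 3.0), "Katydid"),
    ((6.5, 4.0), "Katydid"),
    ((9.0, 3.5), "Katydid"),
    ((7.8, 2.0), "Katydid"),
    ((2.0, 5.5), "Grasshopper"),
    ((3.1, 6.0), "Grasshopper"),
    ((1.5, 7.2), "Grasshopper"),
    ((2.8, 4.8), "Grasshopper"),
    ((3.5, 8.0), "Grasshopper"),
]


def classify(sample, training=TRAINING):
    """Return the label of the annotated example closest to `sample`."""
    features, label = min(training, key=lambda item: math.dist(sample, item[0]))
    return label


if __name__ == "__main__":
    print(classify((7.5, 3.2)))
    print(classify((2.0, 6.0)))
```

With more annotated examples, a vote among the k nearest neighbors is usually more robust than relying on the single closest one.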
10
Clustering Applications
  • Targeted Marketing
  • Determining Gene Functionality
  • Identifying Species
  • Clustering vs. Classification
      • No prior knowledge
          • Number of clusters
          • Meaning of clusters
      • Unsupervised learning

11
http://149.170.199.144/multivar/ca.htm
12
What is Similarity?
(c) Eamonn Keogh, eamonn_at_cs.ucr.edu
13
Association Rules Applications
  • People who buy diapers also buy beer
  • If gene A is highly expressed in this disease
    then gene B is also expressed
  • Relationships between people
  • www.amazon.com
  • Book Stores
  • Department Stores
  • Advertising
  • Product Placement
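
A rule such as "people who buy diapers also buy beer" is usually judged by its support and confidence. The sketch below computes both for a handful of transactions. The transactions and the function names are assumptions made for illustration, not data or code from the presentation.

```python
# Hypothetical market-basket transactions, invented for illustration.
TRANSACTIONS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


if __name__ == "__main__":
    print("support(diapers, beer):", support({"diapers", "beer"}))
    print("confidence(diapers -> beer):", confidence({"diapers"}, {"beer"}))
```

Here two of five baskets contain both items and three contain diapers, so the rule has support 0.4 and confidence 2/3.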

14
Data Mining: Introductory and Advanced Topics, by
Margaret H. Dunham, Prentice Hall, 2003. DILBERT
reprinted by permission of United Feature
Syndicate, Inc.
17
Fraud Detection
  • Identify fraudulent behavior
  • Used extensively in the financial, law enforcement,
    health care, and other sectors
  • http://www.aaai.org/AITopics/html/fraud.html
  • SPSS: http://www.spss.com/predictiveclaims/fraud_detection.htm
  • Neural Technologies: http://www.neuralt.com/fraud_management.html

18
Law Enforcement
  • Identify suspect behavior and relationships
  • I2 Inc.
      • Investigative analytic/visualization software
      • http://www.i2inc.com
  • Social Network Analysis: analyze patterns of
    relationships
      • Relationships: personal, religious, operational,
        etc.

19
Jialun Qin, Jennifer J. Xu, Daning Hu,
Marc Sageman, and Hsinchun Chen, "Analyzing
Terrorist Networks: A Case Study of the Global
Salafi Jihad Network," Lecture Notes in Computer
Science, Springer-Verlag, vol. 3495, 2005, p. 287.
21
How Stuff Works, Facial Recognition,
http://computer.howstuffworks.com/facial-recognition1.htm
22
Facial Recognition
  • Based upon features in face
  • Convert face to a feature vector
  • Less invasive than other biometric techniques
  • http://www.face-rec.org
  • http://computer.howstuffworks.com/facial-recognition.htm
  • SIMS
  • http://www.casinoincidentreporting.com/Products.aspx

23
(c) Eamonn Keogh, eamonn_at_cs.ucr.edu
25
Cheating on Multiple Choice Tests
  • Similarity between tests based on number of
    common wrong answers.
      • (George O. Wesolowsky, "Detecting Excessive
        Similarity in Answers on Multiple Choice Exams,"
        Journal of Applied Statistics, vol. 27, no. 7,
        2000, pp. 909-923.)
  • The number of common correct answers is often
    ignored.
  • H-H Index (D.N. Harpp, J.J. Hogan, and J.S.
    Jennings, 1996, "Crime in the Classroom: Part
    II, an update," Journal of Chemical Education,
    vol. 73, no. 4, pp. 349-351)
      • H-H = (Number of exact answers in common) /
        (Number of different answers)
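
As a rough sketch, the H-H index can be computed for a pair of answer sheets by reading it as the ratio of the two counts named on this slide, which is an assumption about the intended formula. The optional `key` argument, also an assumption, restricts the numerator to shared wrong answers, in line with the first bullet. Consult the cited paper for the exact definition. The answer strings are invented.

```python
def hh_index(sheet_a, sheet_b, key=None):
    """Ratio of identical answers to differing answers for two answer sheets.

    If `key` (the correct answers) is given, only identical *wrong* answers
    are counted in the numerator. Both readings are assumptions based on the
    slide's wording, not a verified transcription of the published index.
    """
    if len(sheet_a) != len(sheet_b):
        raise ValueError("answer sheets must have the same length")
    if key is not None and len(key) != len(sheet_a):
        raise ValueError("answer key must match the sheet length")

    common = 0
    different = 0
    for i, (a, b) in enumerate(zip(sheet_a, sheet_b)):
        if a != b:
            different += 1
        elif key is None or a != key[i]:
            common += 1

    if different == 0:
        # Identical sheets: the ratio is unbounded whenever anything matched.
        return float("inf") if common else 0.0
    return common / different


if __name__ == "__main__":
    key = "ABCDABCDAB"      # hypothetical answer key
    student_1 = "ABDDACCDBB"
    student_2 = "ABDDACCDAA"
    print(hh_index(student_1, student_2))        # all identical answers
    print(hh_index(student_1, student_2, key))   # identical wrong answers only
```

A high value for one pair is only meaningful relative to the distribution of values over all pairs of students taking the same test.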

26
Joshua Benton and Holly K. Hacker, "At Charters,
Cheating's off the Charts," Dallas Morning News,
June 4, 2007.
27
No/Little Cheating
Joshua Benton and Holly K. Hacker, "At Charters,
Cheating's off the Charts," Dallas Morning News,
June 4, 2007.
28
Rampant Cheating
Joshua Benton and Holly K. Hacker, "At Charters,
Cheating's off the Charts," Dallas Morning News,
June 4, 2007.
30
DNA
  • Basic building blocks of organisms
  • Located in nucleus of cells
  • Composed of 4 nucleotides
  • Two strands bound together

http://www.visionlearning.com/library/module_viewer.php?mid=63
31
Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
www.bioalgorithms.info, Chapter 6: Gene
Prediction
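
The two steps on this slide can be reproduced in a few lines. This is a minimal sketch: it treats the DNA string as the coding strand, so transcription reduces to replacing T with U, and its codon table holds only the seven codons this example needs, where a full table has 64 entries. The function names are illustrative.

```python
# Partial codon table: only the codons that occur in the slide's example.
CODON_TABLE = {
    "CCU": "P",  # proline
    "GAG": "E",  # glutamic acid
    "CCA": "P",  # proline
    "ACU": "T",  # threonine
    "AUU": "I",  # isoleucine
    "GAU": "D",  # aspartic acid
    "GAA": "E",  # glutamic acid
}


def transcribe(dna):
    """DNA coding strand -> mRNA: same sequence with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """mRNA -> one-letter amino acid string, reading complete codons only."""
    usable = len(rna) - len(rna) % 3
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    # Raises KeyError for codons missing from the partial table above.
    return "".join(CODON_TABLE[codon] for codon in codons)


if __name__ == "__main__":
    dna = "CCTGAGCCAACTATTGATGAA"
    rna = transcribe(dna)
    print(rna)
    print(translate(rna))
```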
32
miRNA
  • Short (20-25 nt) sequence of noncoding RNA
  • Known since 1993, but significance not widely
    appreciated until 2001
  • Impact / prevent translation of mRNA
  • Generally reduce protein levels without impacting
    mRNA levels (animal cells)
  • Functions
      • Cause some cancers
      • Guide embryo development
      • Regulate cell differentiation
      • Associated with HIV
      • …

33
Questions
  • If each cell in an organism contains the same DNA:
      • How does each cell behave differently?
      • Why do cells behave differently during
        childhood/?
      • What causes some cells to act differently, such
        as during disease?
  • DNA contains many genes, but only a few are being
    transcribed. Why?
  • One answer: miRNA

34
  • http://www.time.com/time/magazine/article/0,9171,1541283,00.html

35
Human Genome
  • Scientists originally thought there would be
    about 100,000 genes
  • Appear to be about 20,000
  • WHY?
  • Almost identical to that of Chimps. What makes
    the difference?
  • Visualization from UCR
  • dnaQT.mov
  • Answers appear to lie in the noncoding regions of
    the DNA (formerly thought to be junk)

36
RNAi: Nobel Prize in Medicine 2006
  • Double-stranded RNA
  • Short interfering RNA (siRNA), 20-25 nt
  • RNA-induced silencing complex
  • Binds to mRNA
  • Cuts RNA
  • siRNA may be artificially added to the cell!
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html,
Advanced Information, Image 3
37
Computer Science and Bioinformatics
  • Algorithms
  • Data Structures
  • Improving efficiency
  • Data Mining
  • Biologists don't usually understand or even
    appreciate what Computer Science can do
  • Issues
      • Scalability
      • Fuzzy
  • We will look at
      • Microarray Clustering
      • TCGR

38
Affymetrix GeneChip Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
39
Microarray Data Analysis
  • Each probe location is associated with a gene
  • Measure the amount of mRNA
  • Color indicates degree of gene expression
  • Compare different samples (normal/disease)
  • Track same sample over time
  • Questions
      • Which genes are related to this disease?
      • Which genes behave in a similar manner?
      • What is the function of a gene?
  • Clustering
      • Hierarchical
      • K-means
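
To make the K-means option concrete, here is a small sketch that groups genes by the similarity of their expression profiles. The gene names and expression values are invented, and the initialization (first k points as centroids) is a simplification chosen to keep the example deterministic. It is not presented as the procedure used in the cited studies.

```python
import math

# Hypothetical expression levels for four genes across three samples.
PROFILES = {
    "geneA": (1.0, 1.2, 0.9),
    "geneB": (5.0, 5.5, 4.8),
    "geneC": (1.1, 0.8, 1.0),
    "geneD": (4.9, 5.2, 5.1),
}


def k_means(points, k, iterations=20):
    """Plain K-means; returns (cluster index per point, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


if __name__ == "__main__":
    names = list(PROFILES)
    labels, _ = k_means([PROFILES[n] for n in names], k=2)
    for name, label in zip(names, labels):
        print(name, "-> cluster", label)
```

Genes that land in the same cluster behave similarly across the samples, which is the starting point for the "which genes behave in a similar manner?" question.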

40
Microarray Data - Clustering
"Gene expression profiling identifies clinically
relevant subtypes of prostate cancer" Proc. Natl.
Acad. Sci. USA, Vol. 101, Issue 3, 811-816,
January 20, 2004
41
miRNA Research Issues
  • Predict / Find miRNA in genomic sequence
  • Predict miRNA targets
  • Identify miRNA functions

42
Temporal CGR (TCGR)
  • 2D Array
  • Each row represents counts for a particular
    window in the sequence
      • First row: first window
      • Last row: last window
      • Successive windows start at the next character
        location
  • Each column represents the counts for the
    associated pattern in that window
      • Initially, the order of patterns is assumed to
        be alphabetic
  • Size of TCGR depends on sequence length and
    subpattern length
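
The array described above can be sketched as follows. This is an illustrative reading of the bullets, not the authors' implementation: the function name, the default `ACGT` alphabet, and the choice to count overlapping sub-patterns inside each window are all assumptions.

```python
from itertools import product


def tcgr(sequence, window, pattern_len, alphabet="ACGT"):
    """Build a window-by-pattern count matrix for `sequence`.

    Rows: one per window, each window starting one character after the last.
    Columns: every sub-pattern of length `pattern_len`, in alphabetic order.
    """
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=pattern_len)]
    column = {pattern: i for i, pattern in enumerate(patterns)}

    rows = []
    for start in range(len(sequence) - window + 1):
        chunk = sequence[start:start + window]
        counts = [0] * len(patterns)
        # Count overlapping occurrences of each sub-pattern (an assumption).
        for i in range(len(chunk) - pattern_len + 1):
            counts[column[chunk[i:i + pattern_len]]] += 1
        rows.append(counts)
    return patterns, rows


if __name__ == "__main__":
    patterns, rows = tcgr("AACGT", window=3, pattern_len=1)
    print(patterns)
    for row in rows:
        print(row)
```

Under these assumptions the matrix has `len(sequence) - window + 1` rows and `len(alphabet) ** pattern_len` columns. For RNA input, pass `alphabet="ACGU"`.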

43
TCGR Example (cont'd)
  • TCGRs for Sub-patterns of length 1, 2, and 3

44
TCGR Mature miRNA (Window = 5, Pattern = 3)
45
TCGRs for Xue Training Data
  • C. Xue, F. Li, T. He, G. Liu, Y. Li, and X.
    Zhang, "Classification of Real and Pseudo
    MicroRNA Precursors using Local
    Structure-Sequence Features and Support Vector
    Machine," BMC Bioinformatics, vol. 6, no. 310.

46
TCGRs for Xue Test Data
48
Conclusions
  • Not magic
  • Doesn't work for all applications
      • Stock Market Prediction
  • Issues
      • Privacy
      • Data
  • Here are some infamous examples of failed data
    mining applications

50
Dallas Morning News October 7, 2005
51
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
52
BIG BROTHER?
  • Total Information Awareness
      • http://infowar.net/tia/www.darpa.mil/iao/index.htm
      • http://www.govtech.net/magazine/story.php?id=45918
      • http://en.wikipedia.org/wiki/Information_Awareness_Office
  • Terror Watch List
      • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
      • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
      • http://blogs.abcnews.com/theblotter/2007/06/fbi_terror_watc.html
      • http://www.thedenverchannel.com/news/9559707/detail.html
  • CAPPS
      • http://www.theregister.co.uk/2004/04/26/airport_security_failures/
      • http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
      • http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
      • http://en.wikipedia.org/wiki/CAPPS

55
  • Thank You