The Analysis of Patterns - PowerPoint PPT Presentation

Slides: 47
Provided by: stat290
1
The Analysis of Patterns
  • Nello Cristianini

2
The Value of Patterns
  • Patterns are everywhere, and people have always
    been fascinated by them.
  • Detecting patterns confers an advantage on an
    organism

Temperature and Rainfall in Lake Shasta over 5
years
3
Patterns Help Us in Many Ways
e.g., compress, predict, remove errors
4
Benefits of Detecting Patterns
5
Patterns and Intelligence
  • We care so much about pattern-finding skills
    that we even use them (partly) to quantify
    intelligence

6
The Instinct for Patterns
  • We see patterns everywhere
  • Even where there are no patterns

7
Patterns and Randomness
  • We are poorly equipped to deal with randomness
  • 3.141592653589793238462643383279502884197169399375
    10582097494459230781640628620899862803482534211706
    79...
  • In the first million digits we see Erice's ZIP
    code 11 times
  • Does it mean anything?
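
A rough way to approach the question above is to ask how many sightings chance alone would produce. The sketch below assumes the digits of pi behave like independent, uniformly random digits (a modeling assumption, not an established fact) and uses a Poisson approximation for the count; the function names are illustrative only.

```python
import math

def expected_occurrences(n_digits, pattern_len):
    """Expected count of one fixed digit string among n_digits random digits."""
    positions = n_digits - pattern_len + 1       # possible starting positions
    return positions * 10.0 ** (-pattern_len)    # each start matches with prob 10^-k

def poisson_tail(lam, k):
    """P(X >= k) for X ~ Poisson(lam); used here only as an approximation."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

lam = expected_occurrences(1_000_000, 5)   # a ZIP code has 5 digits
p_at_least_11 = poisson_tail(lam, 11)
print(f"expected occurrences by chance: {lam:.3f}")
print(f"P(at least 11 occurrences) ~ {p_at_least_11:.2f}")
```

Under these assumptions about ten occurrences are expected by chance, so seeing the code 11 times is unremarkable.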

8
Patterns and Randomness
  • ABRAHAM LINCOLN WAS ELECTED TO CONGRESS IN 1846.
  • JOHN F. KENNEDY WAS ELECTED TO CONGRESS IN
    1946.
  • ABRAHAM LINCOLN WAS ELECTED PRESIDENT IN 1860.
  • JOHN F. KENNEDY WAS ELECTED PRESIDENT IN 1960.
  • THE NAMES LINCOLN AND KENNEDY EACH CONTAIN SEVEN
    LETTERS.
  • BOTH WIVES LOST CHILDREN WHILE LIVING IN THE
    WHITE HOUSE.
  • BOTH PRESIDENTS WERE SHOT ON FRIDAY.
  • BOTH WERE SHOT IN THE HEAD.
  • BOTH SUCCESSORS WERE NAMED JOHNSON.
  • ANDREW JOHNSON, WHO SUCCEEDED LINCOLN, WAS BORN
    IN 1808.
  • LYNDON JOHNSON, WHO SUCCEEDED KENNEDY, WAS BORN
    IN 1908.
  • JOHN WILKES BOOTH REPORTEDLY ASSASSINATED
    LINCOLN.
  • LEE HARVEY OSWALD REPORTEDLY ASSASSINATED
    KENNEDY.
  • BOTH ASSASSINS WERE KNOWN BY THREE NAMES.
  • BOTH NAMES CONTAINED FIFTEEN LETTERS.
  • BOOTH AND OSWALD WERE ASSASSINATED BEFORE THEIR
    TRIALS.

Coincidences?
9
Visualizing Patterns
  • We are naturally equipped to detect CERTAIN types
    of patterns, not others

5.9400 8.6100 12.2800 11.6100 20.2800
23.8300 25.8300 27.3900 24.1700 19.0600
11.1700 8.8900 8.3300 7.8900 12.1100
15.3900 19.1100 24.6100 28.1100 25.7800
23.1100 16.8400 13.1700 8.4400 5.8900
10.8300 12.1100 15.7800 18.8300 26.5600
27.5600 25.0000 23.4400 15.5600 10.7200
7.1700 7.8300 11.1700 9.7800
14.9400 20.5000 23.3300 27.8300 29.2200
25.1100 20.6700 12.8900 11.8900 9.1700
9.8300 14.2800 18.5000 19.0000 26.3900
29.6100 26.7200 22.6700 20.3900 13.8900
8.8900
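
The point of the table above is that a regularity can be invisible to the eye yet easy to expose computationally. The sketch below assumes the 60 values are consecutive monthly readings (5 years of 12 months, which is a guess based on the count) and correlates the series with shifted copies of itself; a strong positive value at lag 12 and a strong negative value at lag 6 would indicate a yearly cycle. The helper name is hypothetical.

```python
from statistics import mean

# The 60 values from the slide, read row by row (assumed to be monthly readings).
series = [
    5.94, 8.61, 12.28, 11.61, 20.28,
    23.83, 25.83, 27.39, 24.17, 19.06,
    11.17, 8.89, 8.33, 7.89, 12.11,
    15.39, 19.11, 24.61, 28.11, 25.78,
    23.11, 16.84, 13.17, 8.44, 5.89,
    10.83, 12.11, 15.78, 18.83, 26.56,
    27.56, 25.00, 23.44, 15.56, 10.72,
    7.17, 7.83, 11.17, 9.78,
    14.94, 20.50, 23.33, 27.83, 29.22,
    25.11, 20.67, 12.89, 11.89, 9.17,
    9.83, 14.28, 18.50, 19.00, 26.39,
    29.61, 26.72, 22.67, 20.39, 13.89,
    8.89,
]

def lag_correlation(xs, lag):
    """Pearson correlation between the series and itself shifted by `lag`."""
    a, b = xs[:-lag], xs[lag:]
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

for lag in (1, 3, 6, 12):
    print(f"lag {lag:2d}: correlation {lag_correlation(series, lag):+.2f}")
```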
10
Finding Patterns
  • We are naturally interested in finding relations
    in data.
  • We are naturally ill-equipped to deal with
    randomness.
  • We have developed sophisticated technology to do
    this for us.
  • In the last decade we have taken one more step:
  • As a society we rely on pattern discovery
    technology in many ways

11
Computational Pattern Finding
  • We want to find relations
  • They need to be reliable
  • They need to be explored efficiently
  • We want to do it automatically
  • On MASSIVE amounts of data

12
The Analysis of Patterns
  • Data-driven approach to
      • Science
      • Business
      • Technology
  • Modern society relies on our capability to
    automatically detect reliable patterns in vast
    sets of data

13
The Analysis of Patterns
  • Science
      • The Genome Project
      • Surveys of the Universe
  • Business
      • Amazon automatically exploiting trends and
        relations in transactions database
      • Fraud Detection in Credit Card Companies
  • Technology
      • Voice recognition
      • Handwriting recognition

14
A Scientific Gold Rush
  • 1 GATCACAGGT CTATCACCCT ATTAACCACT
    CACGGGAGCT CTCCATGCAT TTGGTATTTT
  • 61 CGTCTGGGGG GTGTGCACGC GATAGCATTG
    CGAGACGCTG GAGCCGGAGC ACCCTATGTC
  • 121 GCAGTATCTG TCTTTGATTC CTGCCTCATT
    CTATTATTTA TCGCACCTAC GTTCAATATT
  • 181 ACAGGCGAAC ATACCTACTA AAGTGTGTTA
    ATTAATTAAT GCTTGTAGGA CATAATAATA
  • 241 ACAATTGAAT GTCTGCACAG CCGCTTTCCA
    CACAGACATC ATAACAAAAA ATTTCCACCA
  • 301 AACCCCCCCC TCCCCCCGCT TCTGGCCACA
    GCACTTAAAC ACATCTCTGC CAAACCCCAA
  • 361 AAACAAAGAA CCCTAACACC AGCCTAACCA
    GATTTCAAAT TTTATCTTTA GGCGGTATGC
  • 421 ACTTTTAACA GTCACCCCCC AACTAACACA
    TTATTTTCCC CTCCCACTCC CATACTACTA
  • 481 ATCTCATCAA TACAACCCCC GCCCATCCTA
    CCCAGCACAC ACACACCGCT GCTAACCCCA
  • 541 TACCCCGAAC CAACCAAACC CCAAAGACAC
    CCCCCACAGT TTATGTAGCT TACCTCCTCA
  • 601 AAGCAATACA CTGAAAATGT TTAGACGGGC
    TCACATCACC CCATAAACAA ATAGGTTTGG
  • 661 TCCTAGCCTT TCTATTAGCT CTTAGTAAGA
    TTACACATGC AAGCATCCCC GTTCCAGTGA
  • 721 GTTCACCCTC TAAATCACCA CGATCAAAAG
    GGACAAGCAT CAAGCACGCA GCAATGCAGC
  • 781 TCAAAACGCT TAGCCTAGCC ACACCCCCAC
    GGGAAACAGC AGTGATTAAC CTTTAGCAAT
  • 841 AAACGAAAGT TTAACTAAGC TATACTAACC
    CCAGGGTTGG TCAATTTCGT GCCAGCCACC
  • 901 GCGGTCACAC GATTAACCCA AGTCAATAGA
    AGCCGGCGTA AAGAGTGTTT TAGATCACCC
  • 961 CCTCCCCAAT AAAGCTAAAA CTCACCTGAG
    TTGTAAAAAA CTCCAGTTGA CACAAAATAG
  • 1021 ACTACGAAAG TGGCTTTAAC ATATCTGAAC
    ACACAATAGC TAAGACCCAA ACTGGGATTA
  • 1081 GATACCCCAC TATGCTTAGC CCTAAACCTC
    AACAGTTAAA TCAACAAAAC TGCTCGCCAG

15
(No Transcript)
16
Yeast protein interaction map (Barabasi)
17
Another Gold Rush
  • The World Wide Web contains billions of pages,
    with text, images, data
  • Semantic web, XML-based, provides high quality
    annotated information
  • Soon all books ever written will be in digital
    form
  • Are we ready?

18
2001
2004
2005
19
The Analysis of Patterns
  • Traditionally, the role of analyzing data belongs
    to Statistics.
  • Or does it?
  • Data analysis is performed by physicists,
    biologists, and engineers, each with their own
    set of tools.
  • Even the task of making or validating these tools
    is not just part of statistics.

20
The Analysis of Patterns
  • Signal processing
  • Data mining
  • Information retrieval
  • Pattern recognition
  • Pattern matching
  • Machine Learning

21
The Analysis of Patterns
  • Pattern Recognition
      • Syntactical / Structural
      • Statistical
      • Visual
  • Pattern Discovery vs Pattern Matching
      • In sequences
      • In graphs
      • In images

22
The Analysis of Patterns
  • Grammatical Inference
  • Mining for Association Rules
  • Patterns in Vector Data (classical multivariate
    statistics, neural networks, machine learning,
    etc.)
  • Etc., etc.

23
The Analysis of Patterns
  • Many communities working almost independently
  • Occasionally re-discovering the same things
  • A small and fairly stable set of ideas
  • Efficient search for patterns in data
  • Statistical validation issues
  • Pattern visualization
  • Often the same tools and concepts

24
Searching for Patterns
  • The search problem can be framed within
    Operational Research / Optimization.
  • (e.g., Integer Programming, Convex Programming,
    etc)
  • Many key ideas from exact optimization have
    revolutionized this field in recent years
  • Where exact solutions are theoretically
    impractical (and only then!) we can use
    approximations, then heuristic approaches.
  • Again, the same heuristics appear in many fields

25
Statistics
  • How do we know that a relation found in a finite
    set of data is reliable, or significant, or even
    interesting?
  • Many issues of hypothesis testing
  • Classical statistics vs statistical learning
    theory

26
What Are Patterns?
  • This is a rather difficult question to answer. I
    hope we will have an answer by the end of this
    meeting.
  • I encourage all speakers and participants to
    suggest some definitions.

27
Gregory Chaitin "Patterns, Randomness and
Information"
  • Information, Complexity, Patterns, Randomness and
    Compression.
  • What are regularities in data? How can they be
    defined? And quantified?
  • Predictability and Compressibility are connected.
  • Randomness can be defined in algorithmic ways.
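
The link between regularity and compressibility can be illustrated with an off-the-shelf compressor. This is only a crude sketch: `zlib` stands in for the uncomputable notion of shortest description, and the seeded pseudo-random bytes are not random in the algorithmic sense (a short program generates them), they merely look patternless to this compressor.

```python
import random
import zlib

def compression_ratio(data):
    """Compressed size divided by original size (smaller means more regular)."""
    return len(zlib.compress(data, 9)) / len(data)

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 10_000
patterned = b"ACGT" * (n // 4)                        # strongly regular sequence
noisy = bytes(rng.randrange(256) for _ in range(n))   # pseudo-random bytes

print(f"patterned: {compression_ratio(patterned):.3f}")
print(f"noisy:     {compression_ratio(noisy):.3f}")
```

The regular sequence shrinks to a small fraction of its size, while the noisy one does not shrink at all.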

28
Gregory Chaitin
  • Chaitin will explain what it means that a
    sequence has no pattern, and some far reaching
    consequences
  • Ideas can be traced back through Hermann Weyl to
    Leibniz in 1686
  • Connections with Gödel and Turing
  • The question of how math compares and contrasts
    with physics and with biology

29
Patterns in Sets of Points (Vectors)
  • Probably the most developed part of pattern
    analysis
  • Includes much multivariate stats, much
    statistical pattern recognition (e.g., Duda and
    Hart) and machine learning

30
Tijl De Bie: Patterns in Sets of Points
  • Patterns in sets of points: an overview
  • the role of optimization
  • examples of patterns
  • Dimensionality reduction, classification,
    clustering
  • Emphasize linear patterns (connect to later
    kernels talk)
  • Patterns in sets of points: the myriad virtues
    of eigenproblems
  • the eigenvalue problem.
  • principal component analysis, canonical
    correlation analysis, Fisher's discriminant,
    partial least squares, and spectral clustering.
  • More from this area will be covered in the
    Kernel Methods talk
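
As a small illustration of how a linear pattern reduces to an eigenproblem, the sketch below finds the first principal component of a point set as the leading eigenvector of its covariance matrix, using plain power iteration. The synthetic data, the function names, and the choice of power iteration are assumptions made for the example, not the speaker's method.

```python
import math
import random

def covariance_matrix(points):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d   # start vector; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

rng = random.Random(1)
# Synthetic 2-D data stretched along the direction (1, 1), plus small noise.
points = []
for _ in range(500):
    t = rng.gauss(0, 3)
    points.append([t + rng.gauss(0, 0.3), t + rng.gauss(0, 0.3)])

pc1 = leading_eigenvector(covariance_matrix(points))
print("first principal direction:", [round(x, 3) for x in pc1])
```

The recovered direction should lie close to (1, 1) normalized to unit length, which is the direction along which the synthetic data was stretched.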

31
Patterns in Sequences
  • After vectors, probably the most important type
    of data
  • DNA
  • Text (web)
  • How to find patterns within and among sequences?
  • What data structures? What statistical models?

32
Suffix Tree and Hidden Markov techniques for
pattern analysis
Esko Ukkonen
  • Efficient Pattern Discovery in sequences requires
    appropriate data structures
  • Suffix tree construction.
  • linear time array constructions
  • using suffix trees for finding motifs with gaps
  • finding cis-regulatory motifs by comparative
    genomics
  • Hidden Markov techniques for haplotyping
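
To make the role of the data structure concrete, here is a deliberately naive suffix array sketch. It sorts suffixes directly, so it is far from the linear-time constructions mentioned above, and the search materializes prefixes for clarity rather than speed; the function names are hypothetical.

```python
from bisect import bisect_left

def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """All start positions of `pattern`, located by binary search over suffixes."""
    m = len(pattern)
    # Truncated suffixes stay in sorted order because the suffixes are sorted.
    prefixes = [text[i:i + m] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hits = []
    while lo < len(prefixes) and prefixes[lo] == pattern:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

text = "GATCACAGGTCTATCACCCT"   # first 20 bases of the sequence shown earlier
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ATCAC"))
```

All occurrences of a pattern occupy one contiguous block of the sorted suffixes, which is what makes repeated substrings easy to find.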

33
Dan Gusfield: Trees, Arrays, Networks and
Optimization for Finding Patterns in Biological
Sequences
  1. The use of suffix trees and integer programming
    for finding optimal virus signatures.
  2. A current treatment of suffix-arrays and their
    uses.
  3. Algorithms for finding signatures (patterns) of
    historical recombination and gene-conversion in
    SNP (binary) sequences.

34
Raffaele Giancarlo: Patterns and Compression
  • Patterns are not just necessary for prediction,
    they are also needed for data compression.
  • Many relations between PA and Data Compression.
  • Raffaele Giancarlo (University of Palermo) - "On
    Indexing and Compression: Two Sides of the Same
    Coin"

35
Conceptual Foundations
  • Alberto Apostolico (University of Padova and
    Georgia Tech) - "Algorithmic and Combinatorial
    Foundations of Pattern Discovery"
  • Will discuss various aspects of the interplay
    between algorithmics and statistics, as well as
    the notion itself of pattern.

36
Kernel Methods
  • An idea: if we are so good at finding (linear)
    patterns in sets of points,
  • why not transform all other problems into a
    points problem?
  • Good idea
  • Kernel methods (from machine learning) can do
    this automatically

37
Bernhard Schoelkopf: Kernel Methods
  • Kernel methods combine ideas from statistics and
    optimization
  • State of the art machine learning systems
  • Operate on general types of data
  • Work by embedding data into a Euclidean space
  • The structure of the space is determined by
    choosing a special kernel function
  • KMs connect various aspects of PA
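
The embedding idea can be checked by hand in a tiny case. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs, chosen only because its feature map is small enough to write out; it shows that evaluating the kernel in input space gives the same number as an inner product in the embedded space. Names are illustrative.

```python
import math

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def poly2_features(x):
    """Explicit feature map for 2-D inputs matching poly2_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def dot(u, v):
    """Ordinary Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

x, y = [1.0, 2.0], [3.0, -1.0]
k_value = poly2_kernel(x, y)                              # computed in input space
embedded = dot(poly2_features(x), poly2_features(y))      # computed in feature space
print(k_value, embedded)
```

Any linear method written purely in terms of inner products can therefore run on the kernel values without ever constructing the embedded vectors.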

38
Patterns in Sets
  • The most classic textbook example of data mining
  • You shop at the supermarket, and the
    market-basket contents are recorded by the
    computer system at check-out
  • Discover when some items are associated, when
    it is possible to predict your next purchase,
    etc
  • (this is what Amazon does automatically)

39
Heikki Mannila: Finding frequent patterns
  • Part I: Finding frequent patterns from data
  • Discovery of frequent patterns: finding positive
    conjunctions that are true for a given fraction
    of the observations
  • this basic idea can be instantiated in many ways
  • finding frequent sets from 0/1 data (association
    mining)
  • finding frequent episodes in sequences
  • finding frequent subgraphs in graphs etc.
  • efficient algorithms exist -- the levelwise
    approach
  • theoretical analysis of the algorithms is not
    trivial (leads to connections to hypergraph
    transversals etc.)
  • Part II: How can the patterns be used?
  • sometimes interesting in themselves
  • can be used to approximate the joint distribution
  • maximum entropy approaches
  • combining information from several patterns -
    ordering patterns
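
The levelwise approach to frequent sets in 0/1 data can be sketched in a few lines. This is a minimal Apriori-style illustration under assumed names and toy data, not the speaker's implementation: candidates of size k are built only from frequent sets of size k-1, since any subset of a frequent set must itself be frequent.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Levelwise search: size-k candidates come only from frequent (k-1)-sets."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {}
    while level:
        for s in level:
            result[s] = support(s)
        frequent = set(level)
        k = len(level[0]) + 1
        # Join step: unions of two frequent sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        level = [c for c in candidates if support(c) >= min_support]
    return result

# Toy market baskets (made up for the example).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]
found = frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```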

40
When can we trust the patterns we found?
  • Statistical issues
  • Patterns can be the result of chance
  • Multiple testing increases this risk
  • Small samples, interest in weak patterns, etc
    are other factors
  • Statistical learning theory and Classical
    statistics have developed tools to deal with this
  • These criteria can also guide the search
    algorithms towards more reliable patterns
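
The effect of multiple testing can be quantified with a short calculation. The sketch below assumes independent tests, each run at the same significance level, and shows how quickly the chance of at least one spurious pattern grows; the Bonferroni correction is included as one standard remedy, and the function names are illustrative.

```python
def familywise_error(alpha, n_tests):
    """P(at least one false positive) across n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_level(alpha, n_tests):
    """Per-test level that keeps the overall error at or below alpha."""
    return alpha / n_tests

for m in (1, 10, 100):
    raw = familywise_error(0.05, m)
    corrected = familywise_error(bonferroni_level(0.05, m), m)
    print(f"{m:3d} tests: uncorrected {raw:.3f}, Bonferroni-corrected {corrected:.3f}")
```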

41
John Shawe-Taylor: Statistical Aspects of Pattern
Analysis
  • We want significant / reliable patterns
  • Reliable: gives us predictive power
  • Significant: cannot be explained by chance
  • Factors affecting pattern reliability
  • Pattern magnitude (how strong is the relation)
  • Sample size (how large is the support from data?)
  • Multiple testing (how many other patterns have
    been tested at the same time)
  • This translates into classical machine learning
    and statistics themes

42
Nicolò Cesa-Bianchi: On-line linear learning
algorithms
  • Machine learning has various ways to model the
    pattern discovery process.
  • An approach completely different from classical
    statistics: on-line learning.
  • Prediction with expert advice.
  • Learning with linear experts.
  • The Perceptron algorithm and its extensions.
  • On-line learning with kernels.
  • Mistake bounds.
  • From mistake bounds to risk bounds
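
A minimal sketch of the on-line setting: examples arrive one at a time, the Perceptron predicts, and it updates its weights only when it errs, so the natural performance measure is the number of mistakes. The synthetic data and names below are assumptions for the example. Because the points are separated from the line x1 = x2 by a margin, the classical mistake bound (R / gamma)^2 applies, which evaluates to 100 for this construction.

```python
import random

def perceptron_online(stream, dim):
    """Process (x, y) pairs one at a time, y in {-1, +1}; update only on mistakes."""
    w = [0.0] * dim
    mistakes = 0
    for x, y in stream:
        score = sum(wi * xi for wi, xi in zip(w, x))
        prediction = 1 if score > 0 else -1
        if prediction != y:
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]   # additive update
    return w, mistakes

rng = random.Random(0)
stream = []
while len(stream) < 200:
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    if abs(x[0] - x[1]) > 0.2:              # enforce a margin around the line x1 = x2
        stream.append((x, 1 if x[0] > x[1] else -1))

w, mistakes = perceptron_online(stream, dim=2)
print("final weights:", [round(v, 3) for v in w])
print("mistakes made on the stream:", mistakes)
```

No statistical assumption about how the examples are generated is needed for the bound, which is what distinguishes this analysis from the classical one.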

43
Grammatical Inference
  • Very classical theme in pattern recognition,
    based on Chomsky's theory of formal languages and
    grammars
  • Given a finite sample from a language, infer the
    grammar that generates it (with various
    constraints).
  • A child's game

44
Colin de la Higuera "Grammatical Inference, a
Tutorial"
  • The lectures will introduce the key ideas of
    grammatical inference and concentrate specially
    on the algorithmic aspects.
  • Some algorithms that will be described are:
  • The "state merging" family: Gold, RPNI, EDSM...
  • The "window" languages: local and k-testable
  • Learning with queries.
  • This class of approaches often goes under the
    name Syntactical Pattern Recognition
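
As a small illustration of inferring a language from a finite positive sample, the sketch below handles the simplest "window" case, k = 2: it records which symbols may start a word, which may end it, and which adjacent pairs may occur, then accepts exactly the words consistent with those sets. This is a simplified sketch with hypothetical names, not the algorithms presented in the tutorial.

```python
def learn_2_testable(sample):
    """Collect allowed first symbols, last symbols, and adjacent pairs."""
    firsts, lasts, pairs = set(), set(), set()
    accepts_empty = False
    for word in sample:
        if not word:
            accepts_empty = True
            continue
        firsts.add(word[0])
        lasts.add(word[-1])
        pairs.update(word[i:i + 2] for i in range(len(word) - 1))
    return firsts, lasts, pairs, accepts_empty

def accepts(model, word):
    """A word is accepted if its start, end, and every window of 2 were observed."""
    firsts, lasts, pairs, accepts_empty = model
    if not word:
        return accepts_empty
    return (word[0] in firsts and word[-1] in lasts
            and all(word[i:i + 2] in pairs for i in range(len(word) - 1)))

model = learn_2_testable(["ab", "abab", "ababab"])
for w in ("abababab", "abba", "ba"):
    print(w, accepts(model, w))
```

The learner generalizes to longer words of the same shape while rejecting words that contain an unseen window.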

45
Patterns in Graphs
  • Edwin Hancock (University of York, UK) -
    "Pattern Analysis with Graphs and Trees"
  • Spectral representations of graphs,
  • Pattern spaces from graph spectra,
  • Spectral approaches to matching,
  • Heat kernel methods
  • Probabilistic and spectral methods for graph
    matching and clustering.
  • Applications in computer vision.

46
(No Transcript)