Advanced Topics in Computer Systems: Machine Learning and Data Mining Systems, Winter 2007



Advanced Topics in Computer Systems Machine
Learning and Data Mining Systems Winter 2007
  • Stan Matwin
  • Professor
  • School of Information Technology and Engineering/
    École d'ingénierie et de technologie de l'information
  • University of Ottawa
  • Canada

Goals of this course
  • Dual seminar/tutorial structure
  • The tutorial part will teach basic concepts of
    Machine Learning (ML) and Data Mining (DM)
  • The seminar part will
  • introduce interesting areas of current and future
  • Introduce successful applications
  • Preparation to enable advanced self-study on ML/DM

Course outline
  • Machine Learning/Data Mining basic terminology.
  • Symbolic learning: Decision Trees
  • Basic Performance Evaluation
  • Introduction to the WEKA system
  • Probabilistic learning: Bayesian learning.
  • Text classification

  • Kernel-based methods: Support Vector Machines
  • Ensemble-based methods: boosting
  • Advanced Performance Evaluation: ROC curves
  • Applications in bioinformatics
  • Data mining concepts and techniques: Association
  • Feature selection and discretization

Machine Learning / Data Mining basic terminology
  • Machine Learning
  • given a certain task, and a data set that
    constitutes the task,
  • ML provides algorithms that resolve the task
    based on the data, and the solution improves with
  • Examples
  • predicting lottery numbers next Saturday
  • detecting oil spills on sea surface
  • assigning documents to a folder
  • identifying people likely to want a new credit
    card (cross selling)

  • Data Mining: extracting regularities from a VERY
    LARGE dataset/database as part of a
    business/application cycle
  • examples
  • cell phone fraud detection
  • customer churn
  • direct mail targeting/ cross sell
  • prediction of aircraft component failures

Basic ML tasks
  • Supervised learning
  • classification/concept learning
  • estimation: essentially, extrapolation
  • Unsupervised learning
  • clustering: finding groups of similar objects
  • associations: in a database, finding that some
    values of attributes go with some other

Concept learning (also known as classification)
a definition
  • the concept learning problem
  • given
  • a set $E = \{e_1, e_2, \ldots, e_n\}$ of training instances
    of concepts, each labeled with the name of a
    concept $C_1, \ldots, C_k$ to which it belongs
  • determine
  • definitions of each of $C_1, \ldots, C_k$ which correctly
    cover E. Each definition is a concept description

Dimensions of concept learning
  • representation
  • data
  • symbolic
  • numeric
  • concept description
  • attribute-value (propositional logic)
  • relational (first order logic)
  • Language of examples and hypotheses
  • Attribute-value (AV) propositional
  • Relational (ILP) first-order logic
  • method of learning
  • top-down
  • bottom-up (covering)
  • different search algorithms

2. Decision Trees
  • A decision tree as a concept representation

[Figure: example decision tree; test nodes: wage incr. 1st yr, working hrs, statutory holidays, contribution to hlth plan]
  • building a univariate (a single attribute is
    tested at each node) decision tree from a set T of training
    cases for a concept C with classes $C_1, \ldots, C_k$
  • Consider three possibilities
  • T contains 1 or more cases, all belonging to the
    same class $C_j$. The decision tree for T is a leaf
    identifying class $C_j$
  • T contains no cases. The tree is a leaf, but the
    label is assigned heuristically, e.g. the
    majority class in the parent of this node

  • T contains cases from different classes. T is
    divided into subsets that seem to lead towards
    single-class collections of cases. A test t based on a single
    attribute is chosen, and it partitions T into
    subsets $T_1, \ldots, T_n$. The decision tree consists of
    a decision node identifying the tested attribute,
    and one branch for each outcome of the test. Then,
    the same process is applied recursively to each $T_i$

Choosing the test
  • why not explore all possible trees and choose the
    simplest (Occam's razor)? Because this is an
    NP-complete problem. E.g. in the union example
    there are millions of trees consistent with the data
  • notation: S is the set of training examples;
    freq($C_i$, S) is the number of examples in S that belong
    to $C_i$
  • the information measure (in bits) of a message is
    $-\log_2$ of the probability of that message
  • idea: maximize the difference between the info
    needed to identify the class of an example in T
    and the same info after T has been
    partitioned according to a test X

  • selecting 1 case and announcing its class has
    info measure $-\log_2(\mathrm{freq}(C_i, S)/|S|)$ bits
  • to find the information pertaining to class
    membership over all classes:
    $\mathrm{info}(S) = -\sum_{i=1}^{k} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_i, S)}{|S|}$
  • after partitioning T according to the n outcomes of test X:
  • $\mathrm{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|}\, \mathrm{info}(T_i)$
  • $\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T)$ measures the gain
    from partitioning T according to X
  • We select X to maximize this gain

Data for learning the weather (play/don't play)
concept (Witten p. 10)
Info(S) = 0.940
Selecting the attribute
  • Gain(S, Outlook) = 0.246
  • Gain(S, Humidity) = 0.151
  • Gain(S, Wind) = 0.048
  • Gain(S, Temp) = 0.029
  • Choose Outlook as the top test
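The attribute selection above can be reproduced with a short sketch. The 14-row weather table below is recalled from the standard textbook version of this dataset and is assumed to match the one used on the slides; the function names `info` and `gain` are illustrative. The computed gains agree with the slide values up to rounding in the third decimal.

```python
import math
from collections import Counter

# Assumed copy of the textbook weather table.
# Columns: outlook, temp, humidity, wind, play.
WEATHER = [row.split() for row in """sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no""".splitlines()]

ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}


def info(rows):
    """Entropy (in bits) of the class label, stored in the last column."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def gain(rows, col):
    """Information gain from partitioning rows on the attribute in column col."""
    subsets = {}
    for r in rows:
        subsets.setdefault(r[col], []).append(r)
    info_x = sum(len(s) / len(rows) * info(s) for s in subsets.values())
    return info(rows) - info_x


gains = {name: gain(WEATHER, col) for name, col in ATTRS.items()}
best = max(gains, key=gains.get)  # attribute with the largest gain
for name, g in gains.items():
    print(f"Gain(S, {name}) = {g:.3f}")
```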

How does info gain work?
Gain ratio
  • info gain favours tests with many outcomes
    (patient id example)
  • consider $\mathrm{split\ info}(X) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}$
  • measures potential info generated by dividing T
    into n subsets (without considering the class)
  • $\mathrm{gain\ ratio}(X) = \mathrm{gain}(X) / \mathrm{split\ info}(X)$
  • shows the proportion of info generated by the
    split that is useful for classification; in the
    example (Witten p. 96), log(k)/log(n)
  • maximize gain ratio
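A minimal sketch of the gain ratio, assuming a test is summarised by its per-outcome class counts. The Outlook counts (yes, no per outcome) are taken from the standard weather table, and the `case_id` partition is a hypothetical identifier attribute with one outcome per case, standing in for the patient id example.

```python
import math


def entropy(counts):
    """Entropy in bits of a list of counts (zero counts are skipped)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def gain_and_split_info(partition):
    """partition: one [count_class_1, count_class_2, ...] list per test outcome."""
    sizes = [sum(p) for p in partition]
    total = sum(sizes)
    class_totals = [sum(col) for col in zip(*partition)]
    info_x = sum(n / total * entropy(p) for n, p in zip(sizes, partition))
    return entropy(class_totals) - info_x, entropy(sizes)


def gain_ratio(partition):
    # Note: split info is 0 when the test has a single outcome; not handled here.
    g, s = gain_and_split_info(partition)
    return g / s


outlook = [[2, 3], [4, 0], [3, 2]]          # sunny, overcast, rain
case_id = [[1, 0]] * 9 + [[0, 1]] * 5       # one outcome per training case

print(gain_and_split_info(outlook), gain_ratio(outlook))
print(gain_and_split_info(case_id), gain_ratio(case_id))
```

The identifier attribute reaches the maximal gain (the full class entropy), and dividing by its large split info reduces that advantage, although in this small table its ratio still exceeds that of Outlook.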

Partition of cases and corresp. tree
In fact, learning DTs with the gain ratio
heuristic is a search
continuous attrs
  • a simple trick: sort the examples on the values of
    the attribute considered and choose the midpoint
    between each two consecutive values. For m values,
    there are m-1 possible splits, but they can be
    examined linearly
  • cost?
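One way to read the "examined linearly" remark is sketched below: after sorting, a single pass keeps running class counts on each side of the candidate threshold, so the dominant cost is the sort. The function name `best_threshold` and the toy data are illustrative assumptions, and information gain is used as the selection criterion.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Entropy in bits of a Counter of class counts."""
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)


def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split of a numeric attribute."""
    pairs = sorted(zip(values, labels))          # O(m log m)
    n = len(pairs)
    left, right = Counter(), Counter(lab for _, lab in pairs)
    base = entropy(right, n)
    best_gain, best_t = -1.0, None
    for i in range(n - 1):                       # one linear pass over the splits
        value, label = pairs[i]
        left[label] += 1
        right[label] -= 1
        next_value = pairs[i + 1][0]
        if next_value == value:                  # no midpoint between equal values
            continue
        n_left = i + 1
        info_x = (n_left / n * entropy(left, n_left)
                  + (n - n_left) / n * entropy(right, n - n_left))
        if base - info_x > best_gain:
            best_gain, best_t = base - info_x, (value + next_value) / 2
    return best_t, best_gain


threshold, g = best_threshold([1, 2, 3, 10, 11, 12],
                              ["no", "no", "no", "yes", "yes", "yes"])
print(threshold, g)
```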

  • From trees to rules
  • traversing a decision tree from root to leaf
    gives a rule, with the path conditions as the
    antecedent and the leaf as the class
  • rules can then be simplified by removing
    conditions that do not contribute to discriminate
    the nominated class from other classes
  • rulesets for a whole class are simplified by
    removing rules that do not contribute to the
    accuracy of the whole set

Geometric interpretation of decision trees
[Figure: axis-parallel decision regions in the (a, b) plane and the corresponding tree with tests b > b1, a > a1, a < a2]
Decision rules can be obtained from decision trees
(1) if b > b1 then class is -
(2) if b < b1 and a > a1 then class is +
(3) if b < b1 and a < a2 then class is +
(4) if b < b1 and a2 < a < a1 then class is -
notice the inference involved in rule (3)
  • lots of datasets can be obtained from
  • ftp
  • cd pub/machine-learning-databases
  • contents are described in the file README in the
  • dir machine-learning-databases at Irvine

Empirical evaluation of accuracy in
classification tasks
  • the usual approach
  • partition the set E of all labeled examples
    (examples with their classification labels) into
    a training set and a testing set
  • use the training set for learning, obtain a
    hypothesis H, set acc = 0
  • for each element t of the testing set,
  • apply H to t; if H(t) = label(t) then acc = acc + 1
  • acc = acc / |testing set|

Testing - contd
  • Given a dataset, how do we split it between the
    training set and the test set?
  • cross-validation (n-fold)
  • partition E into n groups
  • choose n-1 groups from n, perform learning on
    their union, and test on the remaining group
  • repeat the choice n times, holding out a different group each time
  • average the n results
  • usually, n = 3, 5, or 10
  • another approach - learn on all but one example,
    test that example.
  • Leave One Out
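The evaluation loop and n-fold cross-validation described above can be sketched as follows. The learner interface (`train(xs, ys)` returning a callable hypothesis) and the majority-class learner are assumptions made for illustration; leave-one-out is the special case where the number of folds equals the number of examples.

```python
from collections import Counter


def cross_validate(examples, labels, train, n_folds):
    """Average test accuracy over n_folds folds (requires len(examples) >= n_folds)."""
    folds = [list(range(i, len(examples), n_folds)) for i in range(n_folds)]
    accuracies = []
    for held_out in folds:
        test_idx = set(held_out)
        train_x = [x for i, x in enumerate(examples) if i not in test_idx]
        train_y = [y for i, y in enumerate(labels) if i not in test_idx]
        h = train(train_x, train_y)                       # learn on n-1 groups
        correct = sum(h(examples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))        # test on the held-out group
    return sum(accuracies) / n_folds


def train_majority(xs, ys):
    """Hypothetical baseline learner: always predict the majority training class."""
    majority = Counter(ys).most_common(1)[0][0]
    return lambda x: majority


examples = list(range(10))
labels = ["a"] * 8 + ["b"] * 2
acc_5fold = cross_validate(examples, labels, train_majority, 5)
acc_loo = cross_validate(examples, labels, train_majority, len(examples))
print(acc_5fold, acc_loo)
```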

Confusion matrix
                          classifier-determined   classifier-determined
                          positive label          negative label
  true positive label     a                       b
  true negative label     c                       d

  • Accuracy = (a+d)/(a+b+c+d)
  • a: true positives
  • b: false negatives
  • c: false positives
  • d: true negatives

  • Precision: P = a/(a+c)
  • Recall: R = a/(a+b)
  • F-measure combines Recall and Precision
  • $F_\beta = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$
  • Reflects the importance of Recall versus Precision,
    e.g. $F_0 = P$
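The measures above follow directly from the four confusion-matrix cells. A small sketch, with made-up counts chosen only for illustration:

```python
def metrics(a, b, c, d, beta=1.0):
    """a: true positives, b: false negatives, c: false positives, d: true negatives."""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall = a / (a + b)
    f_beta = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return accuracy, precision, recall, f_beta


# Hypothetical counts, not taken from the slides.
acc, p, r, f1 = metrics(a=40, b=10, c=20, d=30)
_, _, _, f0 = metrics(a=40, b=10, c=20, d=30, beta=0.0)
print(acc, p, r, f1, f0)
```

With beta = 0 the F-measure reduces to precision, as stated on the slide; larger beta gives more weight to recall.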

Cost matrix
  • Is like confusion matrix, except costs of errors
    are assigned to the elements outside the diagonal
  • this may be important in applications, e.g. when
    the classifier is a diagnosis rule
  • see
  • http//
  • for a survey of learning with misclassification

Bayesian learning
  • incremental, noise-resistant method
  • can combine prior Knowledge (the K is
  • predictions are probabilistic

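  • Bayes' law of conditional probability: $P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$

results in a simple learning rule: choose the
most likely (Maximum A Posteriori) hypothesis, $h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h)$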
Example: two hypotheses: (1) the patient has cancer, (2)
the patient is healthy
Priors: 0.8% of the population has cancer
  • P(cancer) = .008
  • P(not cancer) = .992
  • P(+ | cancer) = .98
  • P(- | cancer) = .02
  • P(+ | not cancer) = .03
  • P(- | not cancer) = .97

We observe a new patient with a positive test.
How should they be diagnosed?
P(+ | cancer) P(cancer) = .98 × .008 = .0078
P(+ | not cancer) P(not cancer) = .03 × .992 = .0298
so the MAP hypothesis is "not cancer".
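The arithmetic of this example can be checked in a few lines. The variable names are illustrative; the normalised posterior is an extra step obtained by dividing by the sum of the two unnormalised scores.

```python
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_healthy = 0.03

# Unnormalised posteriors P(+ | h) P(h) for the two hypotheses.
score_cancer = p_pos_given_cancer * p_cancer
score_healthy = p_pos_given_healthy * (1 - p_cancer)

h_map = "cancer" if score_cancer > score_healthy else "not cancer"

# Normalising gives the actual posterior probability of cancer given a positive test.
posterior_cancer = score_cancer / (score_cancer + score_healthy)
print(h_map, round(score_cancer, 4), round(score_healthy, 4), round(posterior_cancer, 3))
```

Even after a positive test, the posterior probability of cancer is only about 0.21, because the prior is so small.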
Minimum Description Length
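  • revisiting the def. of $h_{MAP}$: $h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h)$
  • we can rewrite it as $h_{MAP} = \arg\max_{h \in H} \left[ \log_2 P(D \mid h) + \log_2 P(h) \right]$
  • or $h_{MAP} = \arg\min_{h \in H} \left[ -\log_2 P(D \mid h) - \log_2 P(h) \right]$
  • But the first log is the cost of coding the data
    given the theory, and the second is the cost of
    coding the theory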

  • Observe that
  • for data, we only need to code the exceptions
    the others are correctly predicted by the theory
  • The MAP principle tells us to choose the theory
    which encodes the data in the shortest manner
  • the MDL states the trade-off between the
    complexity of the hypo. and the number of errors

Bayes optimal classifier
  • so far, we were looking at the most probable
    hypothesis, given a priori probabilities. But we
    really want the most probable classification
  • this we can get by combining the predictions of
    all hypotheses, weighted by their posterior
  • this is the bayes optimal classifier BOC

Example: hypotheses h1, h2, h3 with posterior
probabilities .4, .3, .3. A new instance is
classified positive by h1 and negative by h2, h3
Bayes optimal classifier
  • V = {+, -}
  • P(h1|D) = .4, P(-|h1) = 0, P(+|h1) = 1
  • Classification is (show details!)
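The details of this example can be worked out with a short sketch: each hypothesis votes for its predicted label with a weight equal to its posterior, and the label with the largest total wins. The dictionary names are illustrative.

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(h | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}     # label each hypothesis assigns

# Sum over hypotheses of P(v | h) P(h | D); here P(v | h) is 1 for the predicted label.
votes = {"+": 0.0, "-": 0.0}
for h, p in posteriors.items():
    votes[predictions[h]] += p

boc_label = max(votes, key=votes.get)
map_label = predictions[max(posteriors, key=posteriors.get)]
print(votes, boc_label, map_label)
```

The single most probable hypothesis (h1) says "+", but the combined vote is .4 for "+" against .6 for "-", so the Bayes optimal classification is "-".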

  • Captures probability dependencies
  • each node has a probability distribution; the task
    is to determine the joint probability of the data
  • In an application, a model is designed manually and
    the forms of the probability distributions are given
  • The training set is used to fit the model to the data
  • Then probabilistic inference can be carried out, e.g.
    for prediction

First five variables are observed, and the model
is used to predict diabetes
P(A, N, M, I, G, D) = P(A) P(N) P(M | A, N) P(D | M, A,
  • how do we specify prob. distributions?
  • discretize variables and represent probability
    distributions as a table
  • Can be approximated from frequencies, e.g. table
    P(M | A, N) requires 24 parameters
  • For prediction, we want P(D | A, N, M, I, G); we
    need a large table to do that

  • no other classifier using the same hypothesis space
    and prior K can outperform the BOC
  • the BOC has mostly a theoretical interest;
    practically, we will not have the required
  • another approach, Naive Bayes Classifier (NBC)
  • under a simplifying assumption of independence of
    the attribute values given the class value

To estimate this, we need (# of possible
values) × (# of possible instances) examples
  • in NBC, the conditional probabilities are
    estimated from training data simply as normalized
    frequencies how many times a given attribute
    value is associated with a given class
  • no search!
  • example
  • m-estimate

  • Example (see the Dec. Tree sec. in these notes)
  • we are trying to predict yes or no for
    Outlook=sunny, Temperature=cool, Humidity=high,
    Wind=strong

P(yes) = 9/14, P(no) = 5/14, P(Wind=strong | yes) = 3/9,
P(Wind=strong | no) = 3/5, etc.
P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = .0053
P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = .0206
so we will predict no (compare to 1R!)
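The two products above can be recomputed from frequency counts. This is a sketch that assumes the standard 14-row textbook weather table (recalled here, not copied from the slides); `nb_score` is an illustrative name.

```python
# Assumed copy of the textbook weather table.
# Columns: outlook, temp, humidity, wind, play.
DATA = [row.split() for row in """sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no""".splitlines()]


def nb_score(instance, label):
    """P(label) times the product of P(attribute value | label), from raw frequencies."""
    rows = [r for r in DATA if r[-1] == label]
    score = len(rows) / len(DATA)
    for col, value in enumerate(instance):
        score *= sum(r[col] == value for r in rows) / len(rows)
    return score


query = ["sunny", "cool", "high", "strong"]
scores = {label: nb_score(query, label) for label in ("yes", "no")}
prediction = max(scores, key=scores.get)
p_no = scores["no"] / sum(scores.values())   # normalised probability of the decision
print(scores, prediction, round(p_no, 3))
```

Normalising the two scores gives the probability attached to the decision, here roughly 0.8 for "no".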
  • Further, we can not only have a decision, but
    also the prob. of that decision
  • we rely on $n_c / n$ for the conditional probability, where n is
    the number of training examples of the class and $n_c$ the number
    of those that have the given attribute value
  • if the conditional probability is very small, and
    n is small too, then $n_c$ may well be
    0, and a zero estimate biases the NBC too strongly.
  • So smooth the estimate; see textbook p. 85
  • Instead, we will use the estimate $\frac{n_c + m p}{n + m}$
  • where p is the prior estimate of the probability,
  • m is the equivalent sample size. If we do not know
    otherwise, p = 1/k for k values of the attribute; m
    has the effect of augmenting the number of
    samples of the class
  • a large value of m means that the priors p are
    important relative to the training data when probability
    estimates are computed; a small value, less important
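A minimal sketch of the m-estimate. The example counts assume the standard weather table, where Outlook=overcast never occurs with class "no" (n_c = 0 out of n = 5) and Outlook has k = 3 values.

```python
def m_estimate(n_c, n, p, m):
    """Smoothed estimate of P(value | class): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)


# Raw frequency would be 0/5 and would zero out the whole product.
raw = m_estimate(n_c=0, n=5, p=1 / 3, m=0)        # m = 0 recovers n_c / n
smoothed = m_estimate(n_c=0, n=5, p=1 / 3, m=3)   # uniform prior, m = k = 3
heavy = m_estimate(n_c=0, n=5, p=1 / 3, m=300)    # large m pulls toward the prior p
print(raw, smoothed, heavy)
```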

Text Categorization
  • Representations of text are very high dimensional
    (one feature for each word).
  • High-bias algorithms that prevent overfitting in
    high-dimensional space are best.
  • For most text categorization tasks, there are
    many irrelevant and many relevant features.
  • Methods that sum evidence from many or all
    features (e.g. naïve Bayes, KNN, neural-net) tend
    to work better than ones that try to isolate just
    a few relevant features (decision-tree or rule

Naïve Bayes for Text
  • Modeled as generating a bag of words for a
    document in a given category by repeatedly
    sampling with replacement from a vocabulary
    $V = \{w_1, w_2, \ldots, w_m\}$ based on the probabilities $P(w_j \mid c_i)$
  • Smooth probability estimates with Laplace
    m-estimates assuming a uniform distribution over
    all words ($p = 1/|V|$) and $m = |V|$
  • Equivalent to a virtual sample of seeing each
    word in each category exactly once.

Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)
Text Naïve Bayes Algorithm (Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category $\arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(a_i \mid c)$
where $a_i$ is the word occurring in the ith position in X
Naïve Bayes Time Complexity
  • Training Time: $O(|D| L_d + |C||V|)$,
    where $L_d$ is the average length of a document in D
  • Assumes V and all $D_i$, $n_i$, and $n_{ij}$ are pre-computed
    in $O(|D| L_d)$ time during one pass through all of
    the data.
  • Generally just $O(|D| L_d)$ since usually $|C||V| < |D| L_d$
  • Test Time: $O(|C| L_t)$,
    where $L_t$ is the average length of a test document
  • Very efficient overall, linearly proportional to
    the time needed to just read in all the data.
  • Similar to Rocchio time complexity.

Underflow Prevention
  • Multiplying lots of probabilities, which are
    between 0 and 1 by definition, can result in
    floating-point underflow.
  • Since log(xy) = log(x) + log(y), it is better to
    perform all computations by summing logs of
    probabilities rather than multiplying probabilities.
  • Class with highest final un-normalized log
    probability score is still the most probable.
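The training and test procedures for text naïve Bayes, combined with the log-sum trick, can be sketched as below. The toy documents are invented for illustration, and skipping test words that are not in the vocabulary is an assumption of this sketch rather than something the slides specify.

```python
import math
from collections import Counter


def train_text_nb(docs):
    """docs: list of (list_of_words, category). Returns (vocab, priors, cond)."""
    vocab = {w for words, _ in docs for w in words}
    priors, cond = {}, {}
    for c in {cat for _, cat in docs}:
        in_c = [words for words, cat in docs if cat == c]
        priors[c] = len(in_c) / len(docs)                  # P(ci) = |Di| / |D|
        counts = Counter(w for words in in_c for w in words)
        n_c = sum(counts.values())                         # word occurrences in Ti
        # Laplace smoothing: (nij + 1) / (ni + |V|)
        cond[c] = {w: (counts[w] + 1) / (n_c + len(vocab)) for w in vocab}
    return vocab, priors, cond


def classify_text_nb(words, model):
    """Return the category with the highest un-normalised log probability."""
    vocab, priors, cond = model
    scores = {}
    for c in priors:
        # Sum of logs instead of a product of probabilities (underflow prevention).
        scores[c] = math.log(priors[c]) + sum(
            math.log(cond[c][w]) for w in words if w in vocab)
    return max(scores, key=scores.get)


# Hypothetical toy corpus.
docs = [("cheap pills cheap".split(), "spam"),
        ("meeting agenda".split(), "ham"),
        ("cheap meeting room".split(), "ham")]
model = train_text_nb(docs)
print(classify_text_nb("cheap pills".split(), model))
print(classify_text_nb("meeting agenda".split(), model))
```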

Naïve Bayes Posterior Probabilities
  • Classification results of naïve Bayes (the class
    with maximum posterior probability) are usually
    fairly accurate.
  • However, due to the inadequacy of the conditional
    independence assumption, the actual
    posterior-probability numerical estimates are not.
  • Output probabilities are generally very close to
    0 or 1.

Textual Similarity Metrics
  • Measuring similarity of two texts is a
    well-studied problem.
  • Standard metrics are based on a bag of words
    model of a document that ignores word order and
    syntactic structure.
  • May involve removing common stop words and
    stemming to reduce words to their root form.
  • Vector-space model from Information Retrieval
    (IR) is the standard approach.
  • Other metrics (e.g. edit-distance) are also used.

The Vector-Space Model
  • Assume t distinct terms remain after
    preprocessing; call them index terms or the vocabulary.
  • These orthogonal terms form a vector space.
  • Dimension = t = |vocabulary|
  • Each term, i, in a document or query, j, is
    given a real-valued weight, $w_{ij}$.
  • Both documents and queries are expressed as
    t-dimensional vectors
  • $d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$

Graphic Representation
  • Example
  • D1 = 2T1 + 3T2 + 5T3
  • D2 = 3T1 + 7T2 + T3
  • Q = 0T1 + 0T2 + 2T3

Document Collection
  • A collection of n documents can be represented in
    the vector space model by a term-document matrix.
  • An entry in the matrix corresponds to the
    weight of a term in the document; zero means
    the term has no significance in the document or
    it simply doesn't exist in the document.

Term Weights: Term Frequency
  • More frequent terms in a document are more
    important, i.e. more indicative of the topic.
  • $f_{ij}$ = frequency of term i in document j
  • May want to normalize term frequency (tf) by
    dividing by the frequency of the most common term
    in the document
  • $tf_{ij} = f_{ij} / \max_i \{f_{ij}\}$

Term Weights: Inverse Document Frequency
  • Terms that appear in many different documents are
    less indicative of overall topic.
  • $df_i$ = document frequency of term i
  •   = number of documents containing term i
  • $idf_i$ = inverse document frequency of term i
  •   = $\log_2(N / df_i)$
  • (N: total number of documents)
  • An indication of a term's discrimination power.
  • Log used to dampen the effect relative to tf.

TF-IDF Weighting
  • A typical combined term importance indicator is
    tf-idf weighting
  • $w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \cdot \log_2(N / df_i)$
  • A term occurring frequently in the document but
    rarely in the rest of the collection is given
    high weight.
  • Many other ways of determining term weights have
    been proposed.
  • Experimentally, tf-idf has been found to work

Computing TF-IDF -- An Example
  • Given a document containing terms with given frequencies:
  • A(3), B(2), C(1)
  • Assume collection contains 10,000 documents and
  • document frequencies of these terms are
  • A(50), B(1300), C(250)
  • Then (the idf values shown correspond to the natural logarithm)
  • A: tf = 3/3; idf = ln(10000/50) = 5.3;
    tf-idf = 5.3
  • B: tf = 2/3; idf = ln(10000/1300) = 2.0;
    tf-idf = 1.3
  • C: tf = 1/3; idf = ln(10000/250) = 3.7;
    tf-idf = 1.2
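The example can be recomputed with a short sketch. The numbers on the slide match the natural logarithm rather than log base 2, so the logarithm is passed in as a parameter; `tf_idf` is an illustrative name.

```python
import math


def tf_idf(freqs, doc_freqs, n_docs, log=math.log):
    """tf normalised by the most frequent term, times idf = log(N / df)."""
    max_f = max(freqs.values())
    return {t: (f / max_f) * log(n_docs / doc_freqs[t]) for t, f in freqs.items()}


freqs = {"A": 3, "B": 2, "C": 1}
doc_freqs = {"A": 50, "B": 1300, "C": 250}

weights_ln = tf_idf(freqs, doc_freqs, 10000)                   # natural log, as on the slide
weights_log2 = tf_idf(freqs, doc_freqs, 10000, log=math.log2)  # log base 2, as in the formula
print({t: round(w, 2) for t, w in weights_ln.items()})
print({t: round(w, 2) for t, w in weights_log2.items()})
```

The choice of base only rescales all weights by a constant factor, so the ranking of terms is unchanged.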

Similarity Measure
  • A similarity measure is a function that computes
    the degree of similarity between two vectors.
  • Using a similarity measure between the query and
    each document
  • It is possible to rank the retrieved documents in
    the order of presumed relevance.
  • It is possible to enforce a certain threshold so
    that the size of the retrieved set can be

Similarity Measure - Inner Product
  • Similarity between vectors for the document $d_j$
    and query q can be computed as the vector inner product:
  • $sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} w_{iq}$
  • where $w_{ij}$ is the weight of term i in document
    j and $w_{iq}$ is the weight of term i in the query
  • For binary vectors, the inner product is the
    number of matched query terms in the document
    (size of intersection).
  • For weighted term vectors, it is the sum of the
    products of the weights of the matched terms.

Properties of Inner Product
  • The inner product is unbounded.
  • Favors long documents with a large number of
    unique terms.
  • Measures how many terms matched but not how many
    terms are not matched.

Inner Product -- Examples
  • Binary
  • D = (1, 1, 1, 0, 1, 1, 0)
  • Q = (1, 0, 1, 0, 0, 1, 1)
  • sim(D, Q) = 3

Cosine Similarity Measure
  • Cosine similarity measures the cosine of the
    angle between two vectors.
  • Inner product normalized by the vector lengths.

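$\mathrm{CosSim}(d_j, q) = \frac{d_j \cdot q}{|d_j|\,|q|} = \frac{\sum_{i=1}^{t} w_{ij} w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\,\sqrt{\sum_{i=1}^{t} w_{iq}^2}}$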
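A small sketch comparing the inner product and the cosine measure on the D1, D2, Q vectors from the Graphic Representation example and on the binary vectors from the inner-product example. The helper names are illustrative.

```python
import math


def inner(u, v):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def cos_sim(u, v):
    """Inner product normalised by the two vector lengths."""
    return inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))


D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(inner(D1, Q), inner(D2, Q))
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))

D_bin = (1, 1, 1, 0, 1, 1, 0)
Q_bin = (1, 0, 1, 0, 0, 1, 1)
print(inner(D_bin, Q_bin))   # number of matched query terms
```

Under both measures D1 ranks above D2 for this query, but the cosine values are bounded by 1 and do not grow with document length.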
K Nearest Neighbor for Text
Training:
For each training example <x, c(x)> ∈ D
    Compute the corresponding TF-IDF vector, dx, for document x
Test instance y:
Compute TF-IDF vector d for document y
For each <x, c(x)> ∈ D
    Let sx = cosSim(d, dx)
Sort examples, x, in D by decreasing value of sx
Let N be the first k examples in D (get most similar neighbors)
Return the majority class of examples in N
Illustration of 3 Nearest Neighbor for Text
3 Nearest Neighbor Comparison
  • Nearest Neighbor tends to handle polymorphic
    categories better.

Nearest Neighbor Time Complexity
  • Training Time: $O(|D| L_d)$ to compose TF-IDF vectors.
  • Testing Time: $O(L_t + |D||V_t|)$ to compare to all
    training vectors.
  • Assumes lengths of dx vectors are computed and
    stored during training, allowing cosSim(d, dx) to
    be computed in time proportional to the number
    of non-zero entries in d (i.e. $|V_t|$)
  • Testing time can be high for large training sets.

Nearest Neighbor with Inverted Index
  • Determining k nearest neighbors is the same as
    determining the k best retrievals using the test
    document as a query to a database of training documents.
  • Use standard VSR inverted index methods to find
    the k nearest neighbors.
  • Testing Time: $O(B|V_t|)$,
    where B is the average number of
    training documents in which a test-document word appears.
  • Therefore, overall classification is $O(L_t + B|V_t|)$
  • Typically $B \ll |D|$

Support Vector Machines (SVM)
  • a new classifier
  • Attractive because
  • Has sound mathematical foundations
  • Performs very well in diverse and difficult
  • See textbook (ch. 6.3) and papers by Schölkopf
    placed on the class website

Review of basic analytical geometry
  • Dot product of vectors by coordinates and with
    the angle
  • If vectors a, b are perpendicular, then $(a \cdot b) = 0$
    (e.g. $(0, c) \cdot (d, 0) = 0$)
  • A hyperplane in an n-dimensional space has the
    property $\{x : (w \cdot x) + b = 0\}$; w is the weight
    vector, b is the threshold; $x = (x_1, \ldots, x_n)$,
    $w = (w_1, \ldots, w_n)$
  • A hyperplane divides the n-dimensional space into
    two half-spaces: one is $\{x : (w \cdot x) + b > 0\}$,
    the other is complementary ($\{x : (w \cdot x) + b < 0\}$)

  • Lets revisit the general classification problem.
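  • We want to estimate an unknown function f; all we
    know about it is the training set $(x_1, y_1), \ldots, (x_n, y_n)$
  • The objective is to minimize the expected error (risk)
    $R[f] = \int l(f(x), y)\, dP(x, y)$
  • where l is a loss function, e.g. $l(f(x), y) = \Theta(-y f(x))$
  • and $\Theta(z) = 0$ for $z < 0$ and $\Theta(z) = 1$ otherwise
  • Since we do not know P, we cannot measure the risk
  • We want to approximate the true error (risk) by
    the empirical error (risk) $R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)$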

  • We know from the PAC theory that conditions can
    be given on the learning task so that the
    empirical risk converges towards the true risk
  • We also know that the difficulty of the learning
    task depends on the complexity of f (VC dimension)
  • It is known that the following relationship
    between the empirical risk and the complexity of
    the language (h denotes the VC dimension of the class
    of f)
    $R[f] \le R_{emp}[f] + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln(\delta/4)}{n}}$
  • is true with probability at least $1 - \delta$ for $n > h$

  • Structural Risk Minimization (SRM) chooses the
    class F to find a balance between the
    simplicity of f (very simple may result in a
    large empirical risk) and the empirical risk
    (small may require a class function with a large

Points lying on the margin are called support
vectors; w can be constructed efficiently by solving a
quadratic optimization problem.
Basic idea of SVM
  • Linearly separable problems are easy (quadratic),
    but of course most problems are not linearly separable.
  • Take any problem and transform it into a
    high-dimensional space, so that it becomes
    linearly separable, but
  • Calculations to obtain the separability plane can
    be done in the original input space (kernel trick)

Basic idea of SVM
  • Original data is mapped into another dot product
    space called feature space F via a non-linear map
  • Then linear separable classifier is performed in
  • Note that the only operations in F are dot
  • Consider e.g.

Let's see that Φ geometrically, and that it does
what we want it to do: transform a hard
classification problem into an easy one, albeit
in a higher dimension
  • But in general quadratic optimization in the
    feature space could be very expensive
  • Consider classifying 16 x 16 pixel pictures, and
    5th order monomials
  • Feature space dimension in this example is of
    order $10^{10}$

Here we show that the transformation from an
ellipsoidal decision surface to a linear one,
requiring a dot product in the feature space,
can be performed by a kernel function in the
input space. In general, $k(x, y) = (x \cdot y)^d$
is computed in the input space. Kernels replace
computation in FS by computation in the input
space; in fact, the transformation Φ need not
be applied when a kernel is used!
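The claim can be checked numerically for d = 2 in two dimensions. The explicit map `phi` below, with components x1², √2·x1·x2, x2², is the usual textbook choice and is an assumption here, since the slide's own formula did not survive. The last lines count the distinct 5th-order monomials over 16 x 16 = 256 pixels, which is where the figure of roughly 10^10 dimensions comes from.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Assumed explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y, d=2):
    """k(x, y) = (x . y)^d, computed entirely in the input space."""
    return dot(x, y) ** d


x, y = (1.0, 2.0), (3.0, -1.0)
in_feature_space = dot(phi(x), phi(y))   # dot product after mapping
in_input_space = poly_kernel(x, y)       # same value without ever applying phi
print(in_feature_space, in_input_space)

# Number of distinct monomials of degree 5 in 256 variables: C(256 + 5 - 1, 5).
n_monomials = math.comb(16 * 16 + 5 - 1, 5)
print(n_monomials)
```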
Some common kernels used
Using different kernels we in fact use different
classifiers in the input space gaussian,
polynomial, 3-layer neural nets,
Simplest kernel
  • Is the linear kernel $(w \cdot x) + b$
  • But this only works if the training set is
    linearly separable. This may not be the case
  • For the linear kernel, or even
  • In the feature space

The solution for the non-separable case is to
optimize not just the margin, but the margin plus
the influence of training errors $\xi_i$
Classification with SVMs
  • Convert each example x to Φ(x)
  • Perform optimal hyperplane algorithm in F; but
    since we use the kernel all we need to do is to
    compute $f(x) = \mathrm{sgn}\left(\sum_{i} y_i \alpha_i k(x, x_i) + b\right)$
  • where $x_i$, $y_i$ are training instances, $\alpha_i$ are
    computed as the solution to the quadratic
    programming problem

Examples of classifiers in the input space
Overall picture
Geometric interpretation of SVM classifier
  • Normalize the weight vector to 1 ($\|w\| = 1$) and
    set the threshold b = 0
  • The set of all w that separate training set is
  • But this is the Version Space
  • VS has a geometric centre (Bayes Optimal
    Classifier) near the gravity point
  • If VS has a shape in which SVM solution is far
    from the VS centre, SVM works poorly

  • Text classification
  • Image analysis face recognition
  • Bioinformatics gene expression
  • Can the kernel reflect domain knowledge?

SVM cont'd
  • A method of choice when examples are represented
    by vectors or matrices
  • Input space cannot be readily used as
    attribute-vector (e.g. too many attrs)
  • Kernel methods map data from input space to
    feature space (FS) and perform learning in FS,
    provided that examples are only used within dot
    products (the kernel trick)
  • SVM, but also Perceptron, PCA, NN can be done on
    that basis
  • Collectively: kernel-based methods
  • The kernel defines the classifier
  • The classifier is independent of the
    dimensionality of the FS, which can even be infinite
    (Gaussian kernel)
  • LIMITATION of SVMs: they only work for two-class problems
  • Remedy: use of ECOC
Applications: face detection IEEE INTELLIGENT
  • The task: to find a rectangle containing a face
    in an image; applicable in face recognition,
    surveillance, HCI etc. Also in medical image
    processing and structural defects
  • Difficult task: variations that are hard to
    represent explicitly (hair, moustache, glasses)
  • Cast as a classification problem: image regions
    that are faces and non-faces
  • Scanning the image in multiple scales, dividing
    it into (overlapping) frames and classifying the
    frames with an SVM

Face detection cont'd
  • SVM performing face detection: support vectors
    are faces and non-faces
  • Examples are 19x19 pixels, class +1 or -1
  • SVM: 2nd degree polynomial with slack variables
  • Representation tricks: masking out near-boundary
    area (361 -> 283 pixels), removes noise
  • illumination correction: reduction of light and
  • Discretization of pixel brightness by histogram

Face detection system architecture
  • Bootstrapping using the system on images with no
    faces and storing false positives to use as
    negative examples in later training

Performance on 2 test sets: Set A, 313 high
quality images with 313 faces; Set B, 23 images
with 155 faces. This results in >4M frames for A
and >5M frames for B. SVM achieved recall of
97% on A and 74% on B, with 4 and 20 false
positives, resp.
SVM in text classification
  • Example of classifiers (the Reuters corpus: 13K
    stories, 118 categories, time split)
  • Essential in document organization (emails!),
    indexing etc.
  • First comes from a PET, second from an SVM
  • Text representation: BOW, mapping docs to large
    vectors indicating which words occur in a doc; as
    many dimensions as words in the corpus (many more
    than in a given doc)
  • often extended to frequencies (normalized) of
    stemmed words

Text classification
  • Still a large number of features, so a stop list
    is applied, and some form of feature selection
    (e.g. based on info gain, or tf/idf) is done,
    down to 300 features
  • Then a simple, linear SVM is used (experiments
    with poly. and RBF kernels indicated they are not
    much better than linear). A one-against-all scheme
    is used
  • What is a poly (e.g. level 2) kernel representing
    in text classification?
  • Performance measured with micro-averaged break
    even point (explain)
  • SVM obtained the best results, with DT second (on
    10 cat.) and Bayes third. Other authors report
    better IB performance (findSim) than here

A ROC for the above experiments (class
grain) ROC obtained by varying the
threshold threshold is learned from values of
and discriminates between classes
How about another representation?
  • N-grams: sequences of N consecutive characters,
    e.g. the 3-grams of "support vector" are sup, upp, ppo,
    por, ..., tor
  • Language-independent, but a large number of
    features (>> number of words)
  • The more substrings in common between 2 docs, the
    more similar the 2 docs are
  • What if we make these substrings non-contiguous?
    With weight measuring non-contiguity? car
  • We will make each substring a feature, with value
    depending on how frequently and how compactly a
    substring occurs in the text

  • The latter is represented by a decay factor $\lambda$
  • Example: cat, car, bat, bar
  • Unnormalized $K(car, cat) = \lambda^4$, $K(car, car) = K(cat, cat) = 2\lambda^4 + \lambda^6$;
    normalized $K(car, cat) = \lambda^4 / (2\lambda^4 + \lambda^6) = 1/(2 + \lambda^2)$
  • Impractical (too many) for larger substrings and
    docs, but the kernel using such features can be
    calculated efficiently (substring kernel, SSK);
    it maps strings (a whole doc) to a feature vector
    indexed by all k-tuples

  • Value of the feature: sum over the occurrences
    of the k-tuple of a decay factor of the length of
    the occurrence
  • Def of SSK: $\Sigma$ is an alphabet; a string is a finite
    sequence of elements of $\Sigma$. $|s|$ is the length of s;
    $s[i:j]$ is the substring $s_i \ldots s_j$ of s. u is a subsequence of s
    if there exist indices $\mathbf{i} = (i_1, \ldots, i_{|u|})$ with
    $1 \le i_1 < \ldots < i_{|u|} \le |s|$ such that $u_j = s_{i_j}$ for $j = 1, \ldots, |u|$
    ($u = s[\mathbf{i}]$ for short).
  • Length $l(\mathbf{i})$ of the subsequence in s is
    $i_{|u|} - i_1 + 1$ (span in s)
  • Feature space mapping $\phi$ for s is defined by
    $\phi_u(s) = \sum_{\mathbf{i} : u = s[\mathbf{i}]} \lambda^{l(\mathbf{i})}$
  • for each $u \in \Sigma^n$ (set of all finite strings of
    length n); features measure the number of
    occurrences of subsequences in s weighted by their
    length ($\lambda \le 1$)
  • The kernel can be evaluated in $O(n|s||t|)$ time
    (see Lodhi paper)
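The cat/car example can be verified with a brute-force sketch that enumerates every length-k subsequence and weights it by the decay factor raised to its span. This is only an illustration of the feature definition; it is exponential in cost and is not the efficient dynamic-programming evaluation referred to above. The function names are illustrative.

```python
from itertools import combinations


def ssk_features(s, k, lam):
    """Map string s to {subsequence u of length k: sum of lam ** span over occurrences}."""
    feats = {}
    for idx in combinations(range(len(s)), k):   # increasing index tuples
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1              # l(i) = last index - first index + 1
        feats[u] = feats.get(u, 0.0) + lam ** span
    return feats


def ssk(s, t, k, lam):
    """Unnormalised kernel: inner product of the two feature vectors."""
    fs, ft = ssk_features(s, k, lam), ssk_features(t, k, lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)


def ssk_normalized(s, t, k, lam):
    return ssk(s, t, k, lam) / (ssk(s, s, k, lam) * ssk(t, t, k, lam)) ** 0.5


lam = 0.5
print(ssk("car", "cat", 2, lam))              # lam ** 4
print(ssk("car", "car", 2, lam))              # 2 * lam ** 4 + lam ** 6
print(ssk_normalized("car", "cat", 2, lam))   # 1 / (2 + lam ** 2)
```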

Experimental results with SSK
  • The method is NOT fast, so a subset of Reuters
    (n = 470/380) was used, and only 4 classes: corn,
    crude, earn, acquisition
  • Compared to the BOW representation (see earlier
    in these notes) with stop words removed, features
    weighted by tf-idf = log(1 + tf) × log(n/df)
  • F1 was used for the evaluation, C set
  • Best k is between 4 and 7
  • Performance comparable to a classifier based on
    k-grams (contiguous), and also BOW
  • $\lambda$ controls the penalty for gaps in substrings;
    best precision for high $\lambda$ = 0.7. This seems to
    result in high similarity score for docs that
    share the same but semantically different words -
  • Results on full Reuters not as good as with BOW,
    k-grams; the conjecture is that the kernel
    performs something similar to stemming, which is
    less important on large datasets where there is
    enough data to learn the sameness of different

Bioinformatics application
  • Coding sequences in DNA encode proteins.
  • DNA alphabet: A, C, G, T. Codon: a triplet of
    adjacent nucleotides that codes for one amino acid.
  • Task: identify where in the genome the coding
    starts (Translation Initiation Sites, TIS).
    Potential start codon is ATG.
  • Classification task: does a sequence window
    around the ATG indicate a TIS?
  • Each nucleotide is encoded by 5 bits, exactly one
    of which is set to 1, indicating whether the
    nucleotide is A, C, G, T, or unknown. So the
    dimension n of the input space is 1000 for a
    window size of 100 to the left and right of the
    ATG sequence.
  • Positive and negative windows are provided as the
    training set
  • This representation is typical for the kind of
    problem where SVMs do well
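
A minimal sketch of the 5-bit encoding follows. The symbol N for an unknown nucleotide and the function name encode_window are assumptions made for illustration; the slides only say that the fifth bit marks "unknown".

```python
SYMBOLS = "ACGTN"  # assumption: N marks an unknown nucleotide

def encode_window(seq):
    """One-hot encode a nucleotide window, 5 bits per position."""
    bits = []
    for ch in seq:
        pos = SYMBOLS.index(ch) if ch in SYMBOLS else SYMBOLS.index("N")
        one_hot = [0] * len(SYMBOLS)
        one_hot[pos] = 1
        bits.extend(one_hot)
    return bits

print(encode_window("AC"))
print(len(encode_window("A" * 200)))  # 100 positions left + 100 right
```

A window of 200 positions therefore gives the 1000-dimensional input mentioned above.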

  • What is a good feature space for this problem?
    How about including in the kernel some prior
    domain knowledge? E.g.
  • Dependencies between distant positions are not
    important or are known not to exist
  • Compare, at each sequence position, two sequences
    locally in a window of size 2l+1 around that
    position, with decreasing weight away from the
    centre of the window:
    win_p(x,y) = (Σ_{j=-l..+l} v_j match_{p+j}(x,y))^d1
  • where d1 is the order of importance of local
    (within the window) correlations, v_j are the
    weights, and match_{p+j}(x,y) is 1 for matching
    nucleotides at position p+j, 0 otherwise

  • Window scores are summed over the length of the
    sequence, and correlations between up to d2
    windows are taken into account
  • Also, it is known that the codon below the TIS is
    a CDS; a CDS shifted by 3 nucleotides is still a
  • Trained with 8000 patterns and tested with 3000

Further results on UCI benchmarks
Ensembles of learners
  • not a learning technique on its own, but a
    method in which a family of weakly learning
    agents (simple learners) is used for learning
  • based on the fact that multiple classifiers that
    disagree with one another can together be more
    accurate than their component classifiers
  • if there are L classifiers, each with an error
    rate < 1/2, and the errors are independent, then
    the probability that the majority vote is wrong is
    the area under the binomial distribution for more
    than L/2 hypotheses
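
The binomial argument can be evaluated directly. The sketch below assumes exactly what the bullet states: L independent classifiers, each with the same error rate p. The function name is made up for illustration.

```python
from math import comb

def majority_vote_error(L, p):
    """P(more than L/2 of L independent classifiers are wrong)."""
    return sum(comb(L, k) * p ** k * (1 - p) ** (L - k)
               for k in range(L // 2 + 1, L + 1))

for L in (1, 3, 21):
    print(L, majority_vote_error(L, 0.3))
```

With p = 0.3 the ensemble error falls as L grows, which is the motivation for voting; with correlated errors the gain is smaller.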

Boosting as ensemble of learners
  • The very idea: focus on difficult parts of the
    example space
  • Train a number of classifiers
  • Combine their decisions in a weighted manner

  • Bagging [Breiman] learns multiple hypotheses
    from different subsets of the training set, and
    then takes a majority vote. Each sample is drawn
    randomly with replacement (a bootstrap). Each
    bootstrap contains, on average, 63.2% of the
    training set
  • Boosting is a refinement of bagging, where the
    sample is drawn according to a distribution, and
    that distribution emphasizes the misclassified
    examples. Then a vote is taken.
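
The 63.2% figure is the limit of 1 - (1 - 1/n)^n, i.e. 1 - 1/e. A small sketch, with hypothetical function names, compares the closed form with one simulated bootstrap:

```python
import random

def expected_unique_fraction(n):
    """Expected fraction of distinct training examples in a bootstrap of size n."""
    return 1 - (1 - 1 / n) ** n

def bootstrap_unique_fraction(n, seed=0):
    """Draw n indices with replacement and report the fraction that is distinct."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

print(expected_unique_fraction(10 ** 6))
print(bootstrap_unique_fraction(10000))
```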

  • Let's make sure we understand the makeup of the
    final classifier

  • AdaBoost (Adaptive Boosting) uses the probability
    distribution. Either the learning algorithm uses
    it directly, or the distribution is used to
    produce the sample.
  • See
  • http//
  • for a Web-based demo.
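
To make the role of the distribution concrete, here is a sketch of one reweighting round and of the final weighted vote, in the standard AdaBoost formulation with labels +1/-1. It assumes the weighted error lies strictly between 0 and 1; the function names are illustrative, not taken from the slides.

```python
from math import exp, log

def adaboost_round(weights, y_true, y_pred):
    """One reweighting step. Returns (alpha, new normalized weights)."""
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    alpha = 0.5 * log((1 - err) / err)       # vote weight of this classifier
    new = [w * exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]

def final_classifier(alphas, predictions):
    """Weighted vote of the component classifiers for a single instance."""
    score = sum(a * h for a, h in zip(alphas, predictions))
    return 1 if score >= 0 else -1

alpha, w = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
print(alpha, w)
```

After the update the misclassified example carries half of the total weight, so the next classifier is pushed towards the difficult part of the example space.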

Intro. to bioinformatics
  • Bioinformatics: collection, archiving,
    organization and interpretation of biological
    data
  • integrated in vitro, in vivo, in silico
  • Requires understanding of basic genetics
  • Based on
  • genomics,
  • proteomics,
  • transcriptomics

What is Bioinformatics?
  • Bioinformatics is about integrating biological
    themes with the help of computer tools and
    biological databases.
  • It is a new field of science where mathematics,
    computer science and biology combine to
    study and interpret genomic and proteomic

Basic biology
  • Information in biology: DNA
  • Genotype (hereditary make-up of an organism) and
    phenotype (physical/behavioral characteristics)
    (late 19th century)
  • Biochemical structure of DNA: double helix
    (1953); nucleotides A, C, G, T
  • Progress in biology and IT made it possible to
    map entire genomes: the total genetic material of
    a species, written in DNA code
  • For a human, 3×10^9 nucleotides long
  • Same in all the cells of a person

What is a gene?
  • Interesting to see if there are genes (functional
    elements of the genome) responsible for some
    aspects of the phenotype (e.g. an illness)
  • Testing
  • Cure
  • Genes result in proteins
  • Gene → protein

RNA (transcription)
What is gene expression?
  • We say that genes code for proteins
  • In simple organisms (prokaryotes), a high
    percentage of the genome is genes (85%)
  • In eukaryotes this drops: yeast 70%, fruit fly
    25%, flowers 3%
  • Databases with gene information: GenBank/DDBJ,
  • Databases with protein information:
  • SwissProt, GenPept, TREMBL, PIR

  • Natural interest to find repetitive and/or common
    subsequences in the genome: BLAST
  • For this, it is interesting to study genetic
    expression (clustering)

Expression levels
Gene X
Gene Y
Y is activated by X
  • Activation and Inhibition

  • Microarrays give us information about the rate of
    protein production of a gene during an experiment.
    These technologies give us a lot of information,
  • Analyzing microarray data tells us how the protein
    production of a gene evolves.
  • Each data point represents log expression ratio
    of a particular gene under two different
    experimental conditions. The numerator of each
    ratio is the expression level of the gene in the
    varying condition, whereas the denominator is the
    expression level of the gene in some reference
    condition. The expression measurement is positive
    if the gene expression is induced with respect to
    the reference state and negative if it is
    repressed. We use those values as derivatives.
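
The sign convention follows directly from taking the logarithm of the ratio. The sketch below assumes base-2 logarithms, which the slide does not specify, and uses a hypothetical function name.

```python
from math import log2

def log_expression_ratio(varying, reference):
    """Log ratio of expression in the varying vs. the reference condition."""
    return log2(varying / reference)

print(log_expression_ratio(8.0, 2.0))   # induced: positive
print(log_expression_ratio(1.0, 4.0))   # repressed: negative
print(log_expression_ratio(5.0, 5.0))   # unchanged: zero
```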

Microarray technology
Scanning (contd)
Scanning (contd)
Hybridization simulation
9. Data mining
  • definition
  • basic concepts
  • applications
  • challenges

Definition - Data Mining
  • extraction of unknown patterns from data
  • combines methods from
  • databases
  • machine learning
  • visualization
  • involves large datasets

Definition - KDD
  • Knowledge Discovery in Databases
  • consists of
  • selection
  • preprocessing
  • transformation
  • Data Mining
  • Interpretation/Evaluation
  • no explicit requirement of large datasets

Model building
  • Need to normalize data
  • data labelling - replacement of the starter
  • use of other data sources to label
  • linear regression models on STA
  • contextual setting of the time interval

  • Given
  • I = {i1, ..., im}: set of items
  • D: set of transactions (a database), each
    transaction is a set of items T ∈ 2^I
  • Association rule: X ⇒ Y, X ⊂ I, Y ⊂ I, X ∩ Y = ∅
  • confidence c: ratio of the number of transactions
    that contain both X and Y to the number of all
    transactions that contain X
  • support s: ratio of the number of transactions
    that contain both X and Y to the number of
    transactions in D
  • Itemset is frequent if its support exceeds a given
    threshold

  • An association rule A ⇒ B is a conditional
    implication among itemsets A and B, where A ⊂ I,
    B ⊂ I and A ∩ B = ∅.
  • The confidence of an association rule r: A ⇒ B is
    the conditional probability that a transaction
    contains B, given that it contains A.
  • The support of rule r is defined as sup(r) =
    sup(A ∪ B). The confidence of rule r can be
    expressed as conf(r) = sup(A ∪ B)/sup(A).
  • Formally, let A ∈ 2^I; sup(A) = |{t : t ∈ D, A ⊆
    t}| / |D|; if R = A ⇒ B then sup(R) = sup(A ∪ B),
    conf(R) = sup(A ∪ B)/sup(A)
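
The two definitions translate almost literally into code. The transaction database D below is a made-up toy example, and the function names are assumptions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    """conf(A => B) = sup(A u B) / sup(A)."""
    return support(set(A) | set(B), transactions) / support(A, transactions)

# Hypothetical toy database of four transactions.
D = [{"bread", "milk"}, {"bread", "butter"},
     {"bread", "milk", "butter"}, {"milk"}]

print(support({"bread", "milk"}, D))
print(confidence({"bread"}, {"milk"}, D))
```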

Associations - mining
  • Given D, generate all assoc. rules with c, s >
    thresholds minc, mins
  • (items are ordered, e.g. by barcode)
  • Idea:
  • find all itemsets that have transaction support
    > mins: large itemsets

Associations - mining
  • to do that, start with indiv. items with large
  • in each next step k,
  • use itemsets from step k-1 to generate new
    candidate itemsets Ck,
  • count support of Ck (by counting the
    candidates which are contained in any t),
  • prune the ones that are not large

Associations - mining
Only keep those that are contained in some
Candidate generation
Ck = apriori-gen(Lk-1)
  • From large itemsets to association rules

Subset function
Subset(Ck, t) checks which itemsets in Ck are contained
in a transaction t. It is done via a tree structure
through a series of hashing steps:
  • Root: hash on every item in t; itemsets not
    containing anything from t are ignored
  • Interior node: if you got here by hashing item i
    of t, hash on all following items of t
  • Leaf (set of itemsets): check which itemsets in
    this leaf are contained in t
  • L3 = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
  • C4 = {{1 2 3 4}, {1 3 4 5}}
  • pruning deletes {1 3 4 5} because {1 4 5} is not
    in L3.
  • See http//
    lassociations for details
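
The L3 to C4 example can be reproduced with a short sketch of candidate generation: a join step followed by the prune step. It assumes a non-empty list of itemsets given as sorted tuples; the function name apriori_gen mirrors the slide, but the code is only an illustration.

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Generate candidate k-itemsets from large (k-1)-itemsets (sorted tuples)."""
    L_prev = sorted(L_prev)
    large = set(L_prev)
    k = len(L_prev[0]) + 1
    candidates = []
    # Join step: merge two itemsets that agree on all but their last item.
    for a in L_prev:
        for b in L_prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.append(a + (b[-1],))
    # Prune step: every (k-1)-subset of a candidate must itself be large.
    return [c for c in candidates
            if all(s in large for s in combinations(c, k - 1))]

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(apriori_gen(L3))
```

The join produces (1, 2, 3, 4) and (1, 3, 4, 5); the second is pruned because (1, 4, 5) is not large.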

DM result evaluation
  • Accuracy
  • ROC
  • lift curves
  • cost

Feature Selection [sec. 7.1 in Witten, Frank]
  • Attribute-vector representation: coordinates of
    the vector are referred to as attributes or
  • Curse of dimensionality: learning is search,
    and the search space increases drastically with
    the number of attributes
  • Theoretical justification: we know from PAC
    theorems that this increase is exponential
    [discuss; e.g. slide 70]
  • Practical justification: with divide-and-conquer
    algorithms the partition sizes decrease and at
    some point irrelevant attributes may be selected
  • The task: find a subset of the original attribute
    set such that the classifier will perform at
    least as well on this subset as on the original
    set of attributes

Some foundations
  • We are in the classification setting: Xi are the
    attrs and Y is the class. We can define relevance
    of features wrt the Optimal Bayes Classifier (OBC)
  • Let S be a subset of features, and X a feature
    not in S
  • X is strongly relevant if removal of X alone
    deteriorates the performance of the OBC.
  • X is weakly relevant if it is not strongly
    relevant AND performance of the OBC on S ∪ {X} is
    better than on S

Three main approaches
  • Manually: often infeasible
  • Filters: use the data alone, independent of the
    classifier that will be used on this data (aka
    scheme-independent selection)
  • Wrappers: the classifier that will be used on the
    data is wrapped in the FS process

Filters - discussion
  • Find the smallest attribute set in which all the
    instances are distinct. Problem: cost, if
    exhaustive search is used
  • But learning and FS are related: in a way, the
    classifier already includes the good
    (separating) attributes. Hence the idea:
  • Use one classifier for FS, then another on the
    results. E.g. use a DT, and pass the data on to
    NB. Or use 1R for DT.

Filters contd: RELIEF [Kira, Rendell]
  • Initialize the weight of all attrs to 0
  • Sample instances and check the similar ones.
  • Determine pairs which are in the same class (near
    hits) and in different classes (near misses).
  • For each hit, identify attributes with different
    values. Decrease their weight
  • For each miss, attributes with different values
    have their weight increased.
  • Repeat the sample selection and weighting (steps
    2-5) many times
  • Keep only the attrs with positive weight
  • Discussion: high variance unless the number of
    samples is very high
  • Deterministic RELIEF: use all instances and all
    hits and misses
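
A simplified sketch of the weighting scheme for nominal attributes is given below. It is a deterministic variant that visits every instance once and uses only its single nearest hit and nearest miss (by Hamming distance), which is an assumption made to keep the example short and repeatable; it is not the sampled procedure or the all-hits-and-misses variant described above.

```python
def relief_weights(X, y):
    """RELIEF-style attribute weights from nearest hits and nearest misses."""
    n_attrs = len(X[0])
    w = [0.0] * n_attrs

    def dist(a, b):
        return sum(u != v for u, v in zip(a, b))

    for i, (xi, yi) in enumerate(zip(X, y)):
        hits = [x for j, x in enumerate(X) if j != i and y[j] == yi]
        misses = [x for j, x in enumerate(X) if y[j] != yi]
        if not hits or not misses:
            continue
        near_hit = min(hits, key=lambda x: dist(xi, x))
        near_miss = min(misses, key=lambda x: dist(xi, x))
        for a in range(n_attrs):
            if xi[a] != near_hit[a]:
                w[a] -= 1.0   # differs within the class: less useful
            if xi[a] != near_miss[a]:
                w[a] += 1.0   # differs across classes: more useful
    return w

# Toy data: attribute 0 determines the class, attribute 1 is noise.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 1, 1]
print(relief_weights(X, y))
```

Only attribute 0 ends up with a positive weight, so it is the one that would be kept.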

A different approach
  • View attribute selection as a search in the space
    of all attributes
  • Search needs to be driven by some heuristic
    (evaluation criterion)
  • This could be some measure of the discrimination
    ability of the result of search, or
  • Cross-validation, on the part of the training set
    put aside for that purpose. This means that the
    classifier is wrapped in the FS process, hence
    the name wrapper (scheme-specific selection)

  • Greedy search example:
  • A single attribute is added (forward) or deleted
    (backward)
  • Could also be done as best-first search or beam
    search, or some randomized (e.g. genetic) search
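
Greedy forward search can be sketched independently of the evaluation criterion. In a wrapper, evaluate would be a cross-validated accuracy estimate of the classifier; here a made-up toy score stands in for it, and all names are hypothetical.

```python
def forward_selection(attributes, evaluate):
    """Add one attribute at a time while the evaluation score improves."""
    selected = []
    best_score = evaluate(selected)
    while True:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        scored = [(evaluate(selected + [a]), a) for a in candidates]
        score, attr = max(scored, key=lambda p: p[0])
        if score <= best_score:
            break               # no single addition helps: stop
        selected.append(attr)
        best_score = score
    return selected, best_score

def toy_score(subset):
    # Stand-in for cross-validation: two useful attributes, small size penalty.
    useful = {"outlook", "humidity"}
    return len(useful & set(subset)) - 0.1 * len(subset)

attrs = ["outlook", "temperature", "humidity", "windy"]
print(forward_selection(attrs, toy_score))
```

Backward elimination is the mirror image: start from all attributes and delete one at a time.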

  • Computationally expensive (k-fold xval at each
    search step)
  • backward selection often yields better accuracy
  • x-val is just an optimistic estimation that may
    stop the search prematurely
  • in backward mode attr sets will be larger than
  • Forward mode may result in better
  • Experimentally, FS does particularly well with NB
    on data on which NB does not do well
  • NB is sensitive to redundant and dependent (!)
  • Forward selection with training set performance
    does well [Langley and Sage 94]

  • Getting away from numerical attrs
  • We know it from DTs, where numerical attributes
    were sorted and splitting points between each two
    values were considered
  • Global (independent of the classifier) and local
    (different results in ea tree node) schemes exist
  • What is the result of discretization? A value of
    a nominal attribute
  • Ordering information could be used if the
    discretized attribute with k values is converted
    into k-1 binary attributes: the (i-1)th attribute
    being true represents the fact that the value is
    < i
  • Supervised and unsupervised discretization

Unsupervised discretization
  • Fixed-length intervals (equal interval binning),
    e.g. of width (max-min)/k
  • How do we know k?
  • May distribute instances unevenly
  • Variable-length intervals, each containing the same
    number of instances (equal frequency binning, or
    histogram equalization)
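
Both binning schemes fit in a few lines. The sketch assumes numeric values with max > min and returns a bin index in 0..k-1 for each value; the function names and the sample values are made up for illustration.

```python
def equal_width_bins(values, k):
    """Equal interval binning: k intervals of width (max - min) / k."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Equal frequency binning: each bin gets about the same number of instances."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

values = [1, 2, 3, 4, 100, 200]
print(equal_width_bins(values, 2))
print(equal_frequency_bins(values, 2))
```

On skewed data like this, equal-width binning puts almost everything in one bin, which is the uneven distribution the slide warns about.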

  • Supervised discretization
  • Example of the Temperature attribute in the
    play/don't play data
  • A recursive algorithm using an information measure.
    We go for the cut point with the lowest information
    (cleanest subsets)
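
One step of the recursive procedure, choosing the cut point with the lowest class information, might look as follows. The data are a made-up toy example rather than the Temperature values, the function names are assumptions, and neither the recursion nor a stopping criterion is included.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Return (cut, info) minimizing the weighted entropy of the two sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or info < best[1]:
            best = (cut, info)
    return best

vals = [1, 2, 3, 10, 11, 12]
labs = ["yes", "yes", "yes", "no", "no", "no"]
print(best_cut_point(vals, labs))
```

The procedure would then be applied again to the values below and above the chosen cut.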

Supervised discretization contd
  • What's the right stopping criterion?
  • How about MDL? Compare the info to transmit the
    label of each instance before the split with the
    info after it: the split point in log2(N-1) bits,
    plus info for points below and info for points
    above.
  • Each instance costs 1 bit before the split, and
    slightly > 0 bits after the split
  • This is the [Irani, Fayyad 93] method

Error-correcting Output Codes (ECOC)
  • Method of combining classifiers from a two-class
    problem to a k-class problem
  • Often when working with a k-class problem k
    one-against-all classifiers are learned, and then
    combined using ECOC
  • Consider a 4-class problem, and suppose that
    there are 7 classifiers, and classes are coded as
    in the table below
  • Suppose an instance ∈ a is classified as 1011111
    (mistake in the 2nd classifier).
  • But this classification is the closest to
    class a in terms of edit (Hamming) distance. Also
    note that class encodings in col. 1 are not error
class   encoding (col. 1)   encoding (col. 2)
a       1000                1111111
b       0100                0000111
c       0010                0011001
d       0001                0101010
ECOC contd
  • What makes an encoding error-correcting?
  • Depends on the distance between encodings: an
    encoding with distance d between encodings may
    correct up to ⌊(d-1)/2⌋ errors (why?)
  • In col. 1, d = 2, so this encoding may correct 0
    errors
  • In col. 2, d = 4, so single-bit errors will be
    corrected
  • This example describes row separation; there must
    also be column separation (1 in col. 2),
    otherwise two classifiers would make identical
    errors; this weakens error correction
  • For a small number of classes, exhaustive codes
    as in col. 2 are used
  • See the Dietterich paper on how to design good
    error-correcting codes
  • Gives good results in practice, e.g. with SVM (a
    2-class method), decision trees, backprop NNs
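
Decoding by nearest codeword can be sketched with the 7-bit encodings from the table. The function names are assumptions; the codebook values are the ones given in col. 2.

```python
CODEBOOK = {"a": "1111111", "b": "0000111", "c": "0011001", "d": "0101010"}
ONE_HOT = {"a": "1000", "b": "0100", "c": "0010", "d": "0001"}

def hamming(u, v):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(u, v))

def decode(bits, codebook=CODEBOOK):
    """Return the class whose codeword is closest to the classifier outputs."""
    return min(codebook, key=lambda c: hamming(bits, codebook[c]))

def min_distance(codebook):
    """Smallest pairwise Hamming distance d between codewords."""
    words = list(codebook.values())
    return min(hamming(u, v)
               for i, u in enumerate(words) for v in words[i + 1:])

print(decode("1011111"))
print(min_distance(ONE_HOT), min_distance(CODEBOOK))
```

With d = 4 a single wrong classifier still leaves the output strictly closer to the correct codeword than to any other, which answers the "why?" for one error.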

What is ILP?
  • Machine learning when instances, results and
    background knowledge are represented in First
    Order Logic, rather than attribute-value
  • Given E+, E-, BK
  • Find h such that h ⊨ E+, h ⊭ E-

  • E+:
  • boss(mary,john). boss(phil,mary). boss(phil,john).
  • E-:
  • boss(john,mary). boss(mary,phil).
  • BK:
  • employee(john,ibm). employee(mary,ibm).
  • reports_to(john,mary). reports_to(mary,phil).
  • h: boss(X,Y) :- employee(X,O),
    employee(Y,O), reports_to(Y,X).

Historical justification of the name
  • From facts and BK, induce a FOL hypothesis
    (hypothesis in Logic Programming)

Hypotheses (rules)
Background knowledge
Why ILP? - practically
  • Constraints of classical machine learning:
    attribute-value (AV) representation
  • instances are represented as rows in a single
    table, or must be combined into such a table
  • This is not the way data comes from databases

From tables to models to examples and background
  • Results of learning (in Prolog)
  • party(yes) :- participant(_J, senior, _C).
  • party(yes) :- participant(president, _S, _C).

Why ILP - theoretically
  • AV: all examples are the same length
  • no recursion
  • How could we learn the concept of reachability in
    a graph?

Expressive power of relations, impossible in AV
  • cannot really be expressed in AV representation,
    but is very easy in relational representation
  • linked-to = {<0,1>, <0,3>, <1,2>, ..., <7,8>}
  • can-reach(X,Y) :- linked-to(X,Z), can-reach(Z,Y)
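
What the recursive clause computes is the transitive closure of linked-to. The sketch below assumes the usual base case, can-reach(X,Y) whenever linked-to(X,Y), which the slide does not show, and uses a small hypothetical graph rather than the partially listed one.

```python
def can_reach(links):
    """Transitive closure of a set of directed links (pairs of nodes)."""
    reach = set(links)   # assumed base case: every link is reachable
    changed = True
    while changed:
        changed = False
        for (x, z) in links:
            for (z2, y) in list(reach):
                # can-reach(X,Y) :- linked-to(X,Z), can-reach(Z,Y)
                if z == z2 and (x, y) not in reach:
                    reach.add((x, y))
                    changed = True
    return reach

links = {(0, 1), (1, 2), (2, 3)}   # hypothetical chain 0 -> 1 -> 2 -> 3
print(sorted(can_reach(links)))
```

The number of reachable pairs depends on the graph, which is why a fixed-length attribute-value row cannot express the concept.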

Another example of recursive learning
  • E+:
  • boss(mary,john). boss(phil,mary). boss(phil,john).
  • E-:
  • boss(john,mary). boss(mary,phil).
  • BK:
  • employee(john,ibm). employee(mary,ibm).
  • reports_to_imm(john,mary). reports_to_imm(mary,phi
  • h: boss(X,Y) :- employee(X,O), employee(Y,O),
    reports_to(Y,X).
  • reports_to(X,Y) :- reports_to_imm(X,Z),
  • reports_to(X,X).

How is learning done: covering algorithm
  • Initialize the training set T
  • while the global training set contains +ex:
    find a clause
    that describes part of relationship Q
    remove the +ex covered by this clause
  • Finding a clause:
  • initialize the clause to Q(V1,...,Vk) :-
    while T contains -ex:
    find a literal L to add to the right-hand side
    of the clause
  • Finding a literal: greedy search

  • The find-a-clause loop describes search
  • Need to structure the search space
  • generality: semantic and syntactic
  • since logical generality is not decidable, a
    stronger property of θ-subsumption is used
  • then search from general to specific (refinement)

Heuristics link to he