
Advanced Topics in Computer Systems: Machine Learning and Data Mining Systems, Winter 2007

- Stan Matwin
- Professor
- School of Information Technology and Engineering / École d'ingénierie et de technologie de l'information
- University of Ottawa
- Canada

Goals of this course

- Dual seminar/tutorial structure
- The tutorial part will teach basic concepts of Machine Learning (ML) and Data Mining (DM)
- The seminar part will
  - introduce interesting areas of current and future research
  - introduce successful applications
- Preparation to enable advanced self-study on ML/DM

Course outline

- Machine Learning/Data Mining: basic terminology
- Symbolic learning: Decision Trees
- Basic Performance Evaluation
- Introduction to the WEKA system
- Probabilistic learning: Bayesian learning
- Text classification
- Kernel-based methods: Support Vector Machines
- Ensemble-based methods: boosting
- Advanced Performance Evaluation: ROC curves
- Applications in bioinformatics
- Data mining concepts and techniques: Association Rules
- Feature selection and discretization

Machine Learning / Data Mining basic terminology

- Machine Learning
  - given a certain task, and a data set that constitutes the task,
  - ML provides algorithms that resolve the task based on the data, and the solution improves with time
- Examples
  - predicting lottery numbers next Saturday
  - detecting oil spills on sea surface
  - assigning documents to a folder
  - identifying people likely to want a new credit card (cross selling)
- Data Mining: extracting regularities from a VERY LARGE dataset/database as part of a business/application cycle
- Examples
  - cell phone fraud detection
  - customer churn
  - direct mail targeting / cross sell
  - prediction of aircraft component failures

Basic ML tasks

- Supervised learning
  - classification/concept learning
  - estimation: essentially, extrapolation
- Unsupervised learning
  - clustering: finding groups of similar objects
  - associations: in a database, finding that some values of attributes go with some others

Concept learning (also known as classification): a definition

- the concept learning problem
- given
  - a set E = {e1, e2, ..., en} of training instances of concepts, each labeled with the name of the concept C1, ..., Ck to which it belongs
- determine
  - definitions of each of C1, ..., Ck which correctly cover E. Each definition is a concept description

Dimensions of concept learning

- representation
  - data
    - symbolic
    - numeric
  - concept description
    - attribute-value (propositional logic)
    - relational (first order logic)
- Language of examples and hypotheses
  - Attribute-value (AV): propositional representation
  - Relational (ILP): first-order logic representation
- method of learning
  - top-down
  - bottom-up (covering)
  - different search algorithms

2. Decision Trees

- A decision tree as a concept representation

[Figure: decision tree whose internal nodes test wage incr. 1st yr, working hrs, statutory holidays, and contribution to hlth plan, with leaves labelled good or bad]

- building a univariate (a single attribute is tested) decision tree from a set T of training cases for a concept C with classes C1, ..., Ck
- Consider three possibilities
  - T contains 1 or more cases, all belonging to the same class Cj. The decision tree for T is a leaf identifying class Cj
  - T contains no cases. The tree is a leaf, but the label is assigned heuristically, e.g. the majority class in the parent of this node
  - T contains cases from different classes. T is divided into subsets that seem to lead towards single-class collections of cases. A test t based on a single attribute is chosen, and it partitions T into subsets T1, ..., Tn. The decision tree consists of a decision node identifying the tested attribute, and one branch for each outcome of the test. Then, the same process is applied recursively to each Ti

Choosing the test

- why not explore all possible trees and choose the simplest (Occam's razor)? But this is an NP-complete problem. E.g. in the union example there are millions of trees consistent with the data
- notation: S is the set of the training examples; freq(Ci, S) is the number of examples in S that belong to Ci
- the information measure (in bits) of a message is -log2 of the probability of that message
- idea: maximize the difference between the info needed to identify the class of an example in T, and the same info after T has been partitioned in accordance with a test X
- selecting 1 case and announcing its class has info measure $-\log_2(\mathrm{freq}(C_i, S)/|S|)$ bits
- to find the information pertaining to class membership in all classes: $\mathrm{info}(S) = -\sum_{i=1}^{k} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_i, S)}{|S|}$
- after partitioning according to the outcomes of test X: $\mathrm{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|}\,\mathrm{info}(T_i)$
- $\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T)$ measures the gain from partitioning T according to X
- We select X to maximize this gain

Data for learning the weather (play/don't play) concept (Witten p. 10)

Info(S) = 0.940

Selecting the attribute

- Gain(S, Outlook) = 0.246
- Gain(S, Humidity) = 0.151
- Gain(S, Wind) = 0.048
- Gain(S, Temp) = 0.029
- Choose Outlook as the top test
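The figures above can be checked with a short sketch. The data table itself is not in this transcript, so the code assumes the widely used 14-instance weather data set (9 yes, 5 no); the attribute and value names are illustrative. The unrounded results differ from the slide in the third decimal for Outlook and Humidity (about 0.247 and 0.152), which looks like a rounding difference.

```python
from collections import Counter
from math import log2

# Assumed data: the standard 14-instance weather table.
# Columns: outlook, temp, humidity, wind, play (class).
DATA = [line.split() for line in """\
sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no""".splitlines()]
ATTRS = ["outlook", "temp", "humidity", "wind"]


def info(rows):
    """Entropy (in bits) of the class label, the last column."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())


def gain(rows, col):
    """Information gain of splitting rows on the attribute in column col."""
    parts = {}
    for r in rows:
        parts.setdefault(r[col], []).append(r)
    info_x = sum(len(p) / len(rows) * info(p) for p in parts.values())
    return info(rows) - info_x


gains = {a: gain(DATA, i) for i, a in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(f"info(S) = {info(DATA):.3f}")
for a, g in gains.items():
    print(f"gain({a}) = {g:.3f}")
print("top test:", best)
```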

How does info gain work?

Gain ratio

- info gain favours tests with many outcomes (patient id example)
- consider $\mathrm{split\ info}(X) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}$
- it measures the potential info generated by dividing T into n subsets (without considering the class info)
- gain ratio(X) = gain(X) / split info(X)
- it shows the proportion of info generated by the split that is useful for classification; in the example (Witten p. 96), log(k)/log(n)
- maximize the gain ratio

Partition of cases and corresp. tree

In fact, learning DTs with the gain ratio heuristic is a search

continuous attrs

- a simple trick: sort the examples on the values of the attribute considered and choose the midpoint between each two consecutive values. For m values, there are m-1 possible splits, but they can be examined linearly
- cost?

- From trees to rules
  - traversing a decision tree from root to leaf gives a rule, with the path conditions as the antecedent and the leaf as the class
  - rules can then be simplified by removing conditions that do not help to discriminate the nominated class from other classes
  - rulesets for a whole class are simplified by removing rules that do not contribute to the accuracy of the whole set

Geometric interpretation of decision trees

[Figure: axis-parallel areas in the (a, b) plane, delimited by the thresholds a1, a2 and b1, next to the corresponding tree with tests b > b1, a > a1 and a < a2]

Decision rules can be obtained from decision trees

- (1) if b > b1 then class is -
- (2) if b < b1 and a > a1 then class is +
- (3) if b < b1 and a < a2 then class is +
- (4) if b < b1 and a2 < a < a1 then class is -

notice the inference involved in rule (3)

[Figure: the tree with tests b > b1, a > a1 and a < a2, its leaves numbered (1) to (4) to match the rules]

1R


- lots of datasets can be obtained from
  - ftp ics.uci.edu
  - cd pub/machine-learning-databases
- contents are described in the file README in the dir machine-learning-databases at Irvine

Empirical evaluation of accuracy in classification tasks

- the usual approach
  - partition the set E of all labeled examples (examples with their classification labels) into a training set and a testing set
  - use the training set for learning, obtain a hypothesis H, set acc = 0
  - for each element t of the testing set, apply H on t; if H(t) = label(t) then acc = acc + 1
  - acc = acc / |testing set|

Testing - cont'd

- Given a dataset, how do we split it between the training set and the test set?
- cross-validation (n-fold)
  - partition E into n groups
  - choose n-1 groups from n, perform learning on their union, and test on the remaining group
  - repeat the choice n times
  - average the n results
  - usually, n = 3, 5, 10
- another approach: learn on all but one example, test that example (Leave One Out)
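The n-fold procedure can be sketched in a few lines. The learner here is a deliberately trivial majority-class predictor, and the data set E is a toy one; both are assumptions made only so that the example is self-contained. Leave One Out is the special case where n equals the number of examples.

```python
import random
from collections import Counter


def majority_learner(train):
    """Toy learner (illustrative only): always predict the most common label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def accuracy(h, test):
    """Fraction of test examples on which hypothesis h agrees with the label."""
    return sum(h(x) == y for x, y in test) / len(test)


def cross_validate(examples, n, learner, seed=0):
    """n-fold cross-validation: learn on n-1 groups, test on the held-out one."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]  # n disjoint groups
    scores = []
    for i in range(n):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        scores.append(accuracy(learner(train), folds[i]))
    return sum(scores) / n  # average of the n results


# Toy labeled set: 20 "pos" and 10 "neg" examples.
E = [(i, "pos" if i % 3 else "neg") for i in range(30)]
print("5-fold accuracy:", cross_validate(E, 5, majority_learner))
print("leave-one-out accuracy:", cross_validate(E, len(E), majority_learner))
```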

Confusion matrix

                       classifier-determined    classifier-determined
                       positive label           negative label
true positive label    a                        b
true negative label    c                        d

- Accuracy = (a+d)/(a+b+c+d)
  - a: true positives
  - b: false negatives
  - c: false positives
  - d: true negatives
- Precision P = a/(a+c)
- Recall R = a/(a+b)
- F-measure combines Recall and Precision
  - $F_\beta = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$
  - Reflects the importance of Recall versus Precision, e.g. $F_0 = P$
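As a quick check of these definitions, the sketch below computes the measures from the four cells of the confusion matrix. The counts are made up for illustration.

```python
def metrics(a, b, c, d, beta=1.0):
    """Measures from a confusion matrix with cells
    a = true positives, b = false negatives,
    c = false positives, d = true negatives."""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall = a / (a + b)
    f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f": f_beta}


# Hypothetical counts, for illustration only.
m1 = metrics(a=40, b=10, c=20, d=30)            # F1: equal weight
m0 = metrics(a=40, b=10, c=20, d=30, beta=0.0)  # F0 reduces to precision
print(m1)
print(m0["f"], m0["precision"])
```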

Cost matrix

- Is like the confusion matrix, except that costs of errors are assigned to the elements outside the diagonal (misclassifications)
- this may be important in applications, e.g. when the classifier is a diagnosis rule
- see http://ai.iit.nrc.ca/bibliographies/cost-sensitive.html for a survey of learning with misclassification costs

Bayesian learning

- incremental, noise-resistant method
- can combine prior Knowledge (the K is probabilistic)
- predictions are probabilistic
- Bayes' law of conditional probability, $P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$, results in a simple learning rule: choose the most likely (Maximum A Posteriori) hypothesis, $h_{MAP} = \arg\max_{h} P(D \mid h)\,P(h)$

Example: two hypotheses: (1) the patient has cancer, (2) the patient is healthy

Priors: 0.8% of the population has cancer

- P(cancer) = .008
- P(not cancer) = .992
- P(+ | cancer) = .98
- P(- | cancer) = .02
- P(+ | not cancer) = .03
- P(- | not cancer) = .97

We observe a new patient with a positive test. How should they be diagnosed?

- P(cancer | +) ∝ P(+ | cancer) P(cancer) = .98 × .008 = .0078
- P(not cancer | +) ∝ P(+ | not cancer) P(not cancer) = .03 × .992 = .0298

The second product is larger, so the MAP hypothesis is "not cancer".
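The two products are unnormalized. Dividing by their sum gives the actual posterior probability of cancer after a positive test. A minimal sketch, using only the numbers from the slide (variable names are illustrative):

```python
# Quantities taken from the example above.
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_healthy = 0.03

# Unnormalized posteriors: P(+ | h) * P(h) for each hypothesis.
joint_cancer = p_pos_given_cancer * p_cancer
joint_healthy = p_pos_given_healthy * (1 - p_cancer)

# Normalizing by P(+) gives the true posterior.
posterior_cancer = joint_cancer / (joint_cancer + joint_healthy)
map_hypothesis = "cancer" if joint_cancer > joint_healthy else "not cancer"

print(f"P(cancer | +) = {posterior_cancer:.3f}")
print("MAP hypothesis:", map_hypothesis)
```

Even with a positive test, the posterior probability of cancer stays at roughly 21%, because the prior is so small.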

Minimum Description Length

- revisiting the def. of $h_{MAP}$: $h_{MAP} = \arg\max_{h} P(D \mid h)\,P(h)$
- we can rewrite it as $h_{MAP} = \arg\max_{h} [\log_2 P(D \mid h) + \log_2 P(h)]$
- or $h_{MAP} = \arg\min_{h} [-\log_2 P(D \mid h) - \log_2 P(h)]$
- But the first log is the cost of coding the data given the theory, and the second is the cost of coding the theory
- Observe that for data, we only need to code the exceptions; the others are correctly predicted by the theory
- The MAP principle tells us to choose the theory which encodes the data in the shortest manner
- the MDL states the trade-off between the complexity of the hypothesis and the number of errors

Bayes optimal classifier

- so far, we were looking at the most probable hypothesis, given a priori probabilities. But we really want the most probable classification
- this we can get by combining the predictions of all hypotheses, weighted by their posterior probabilities
- this is the Bayes optimal classifier (BOC)

Example: hypotheses h1, h2, h3 with posterior probabilities .4, .3, .3. A new instance is classified pos. by h1 and neg. by h2, h3

Bayes optimal classifier

- V = {+, -}
- P(h1 | D) = .4, P(- | h1) = 0, P(+ | h1) = 1
- Classification is (show details!)
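The details can be worked out by weighting each hypothesis's vote by its posterior. The sketch below assumes, as the example states, that each hypothesis classifies deterministically, so P(v | h) is 0 or 1; the dictionary layout is just one way to write it down.

```python
# Posterior probabilities P(h | D) from the example.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# P(v | h): h1 says "+", h2 and h3 say "-".
p_value_given_h = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}


def bayes_optimal(posterior, p_value_given_h, values=("+", "-")):
    """Return the value v maximizing sum over h of P(v | h) * P(h | D)."""
    score = {v: sum(p_value_given_h[h][v] * p for h, p in posterior.items())
             for v in values}
    return max(score, key=score.get), score


label, score = bayes_optimal(posterior, p_value_given_h)
print(score, "->", label)
```

The single most probable hypothesis (h1) says "+", yet the weighted vote gives 0.4 for "+" against 0.6 for "-", so the Bayes optimal classification is "-".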

- Captures probability dependencies
- each node has a probability distribution; the task is to determine the joint probability on the data
- In an application, a model is designed manually and the forms of the probability distributions are given
- The training set is used to fit the model to the data
- Then probabilistic inference can be carried out, e.g. for prediction

The first five variables are observed, and the model is used to predict diabetes

P(A, N, M, I, G, D) = P(A) P(N) P(M | A, N) P(D | M, A, N) P(I | D) P(G | I, D)

- how do we specify prob. distributions?
- discretize variables and represent probability distributions as a table
- Can be approximated from frequencies, e.g. the table P(M | A, N) requires 24 parameters
- For prediction, we want P(D | A, N, M, I, G); we need a large table to do that


- no other classifier using the same hypothesis space and prior K can outperform the BOC
- the BOC has mostly a theoretical interest; practically, we will not have the required probabilities
- another approach: the Naive Bayes Classifier (NBC)
  - under a simplifying assumption of independence of the attribute values given the class value

To estimate this, we need (# of possible values) × (# of possible instances) examples


- in NBC, the conditional probabilities are estimated from training data simply as normalized frequencies: how many times a given attribute value is associated with a given class
- no search!
- example
- m-estimate

- Example (see the Dec. Tree sec. in these notes)
- we are trying to predict yes or no for Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong

P(yes) = 9/14, P(no) = 5/14, P(Wind=strong | yes) = 3/9, P(Wind=strong | no) = 3/5, etc.

P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = .0053

P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = .0206

so we will predict no (compare to 1R!)
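The two products can be reproduced by counting. The sketch assumes the standard 14-instance weather table (the table is not in this transcript) and estimates every probability as a plain normalized frequency, with no smoothing.

```python
from collections import Counter

# Assumed data: the standard 14-instance weather table.
# Columns: outlook, temp, humidity, wind, play (class).
DATA = [line.split() for line in """\
sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no""".splitlines()]


def nb_scores(rows, query):
    """Unnormalized naive Bayes score P(c) * prod_j P(value_j | c) per class."""
    n = len(rows)
    scores = {}
    for c, n_c in Counter(r[-1] for r in rows).items():
        score = n_c / n  # prior P(c)
        for col, value in enumerate(query):
            matches = sum(1 for r in rows if r[-1] == c and r[col] == value)
            score *= matches / n_c  # frequency estimate of P(value | c)
        scores[c] = score
    return scores


scores = nb_scores(DATA, ("sunny", "cool", "high", "strong"))
prediction = max(scores, key=scores.get)
p_no = scores["no"] / sum(scores.values())  # normalized probability of "no"
print(scores, prediction, round(p_no, 3))
```

Normalizing the two scores also gives the probability attached to the decision, here about 0.8 for "no".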

- Further, we can have not only a decision, but also the prob. of that decision
- we rely on $n_c/n$ for the conditional probability, where n is the number of training examples of the class and $n_c$ the number of those with the given attribute value
- if the conditional probability is very small, and n is small too, then $n_c$ may well be 0. But this biases the NBC too strongly.
- So smoothen: see textbook p. 85
- Instead, we will use the estimate $\frac{n_c + m p}{n + m}$
  - where p is the prior estimate of the probability,
  - m is the equivalent sample size. If we do not know otherwise, p = 1/k for k values of the attribute; m has the effect of augmenting the number of samples of the class
  - a large value of m means that the priors p are important wrt the training data when probability estimates are computed; a small value, less important
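A small sketch of the m-estimate shows the effect of m. The counts below are hypothetical: an attribute with k = 3 values, and a value never seen among the n = 5 examples of a class.

```python
def m_estimate(n_c, n, p, m):
    """Smoothed probability estimate (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)


k = 3                 # hypothetical number of attribute values
p = 1 / k             # uniform prior when nothing else is known
raw = m_estimate(0, 5, p, m=0)         # m = 0: plain frequency n_c / n
smoothed = m_estimate(0, 5, p, m=k)    # m = k: one virtual sample per value
heavy = m_estimate(0, 5, p, m=1000)    # large m: the prior dominates

print(raw, smoothed, heavy)
```

With m = 0 the unseen value gets probability 0, which would zero out the whole naive Bayes product; any m > 0 avoids that, and as m grows the estimate moves towards the prior p.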

Text Categorization

- Representations of text are very high dimensional (one feature for each word).
- High-bias algorithms that prevent overfitting in high-dimensional space are best.
- For most text categorization tasks, there are many irrelevant and many relevant features.
- Methods that sum evidence from many or all features (e.g. naïve Bayes, KNN, neural-net) tend to work better than ones that try to isolate just a few relevant features (decision-tree or rule induction).

Naïve Bayes for Text

- Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2, ..., wm} based on the probabilities P(wj | ci).
- Smooth probability estimates with Laplace m-estimates assuming a uniform distribution over all words (p = 1/|V|) and m = |V|
  - Equivalent to a virtual sample of seeing each word in each category exactly once.

Text Naïve Bayes Algorithm (Train)

Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)

Text Naïve Bayes Algorithm (Test)

Given a test document X
Let n be the number of word occurrences in X
Return the category $\arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(a_i \mid c)$
where ai is the word occurring in the ith position in X
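The train and test procedures translate almost line for line into code. The sketch below is one possible reading, not a reference implementation: the tiny corpus is invented, test words outside the vocabulary are simply skipped (an assumption, since the slides do not say what to do with them), and scores are summed as logs rather than multiplied.

```python
from collections import Counter
from math import log


def train(docs):
    """docs: list of (list_of_words, category) pairs."""
    vocab = {w for words, _ in docs for w in words}
    prior, cond = {}, {}
    for c in {cat for _, cat in docs}:
        texts = [words for words, cat in docs if cat == c]
        prior[c] = len(texts) / len(docs)                    # P(ci) = |Di| / |D|
        counts = Counter(w for words in texts for w in words)
        n_i = sum(counts.values())                           # word occurrences in Ti
        # Laplace estimate: (nij + 1) / (ni + |V|)
        cond[c] = {w: (counts[w] + 1) / (n_i + len(vocab)) for w in vocab}
    return vocab, prior, cond


def classify(model, words):
    """Return the category with the highest log P(c) + sum_i log P(a_i | c)."""
    vocab, prior, cond = model

    def score(c):
        return log(prior[c]) + sum(log(cond[c][w]) for w in words if w in vocab)

    return max(prior, key=score)


# Invented toy corpus, for illustration only.
corpus = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("monday project meeting notes".split(), "ham"),
]
model = train(corpus)
print(classify(model, "cheap watches now".split()))
print(classify(model, "project meeting".split()))
```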

Naïve Bayes Time Complexity

- Training Time: O(|D| Ld + |C||V|), where Ld is the average length of a document in D.
  - Assumes V and all Di, ni, and nij are pre-computed in O(|D| Ld) time during one pass through all of the data.
  - Generally just O(|D| Ld) since usually |C||V| < |D| Ld
- Test Time: O(|C| Lt), where Lt is the average length of a test document.
- Very efficient overall, linearly proportional to the time needed to just read in all the data.
- Similar to Rocchio time complexity.

Underflow Prevention

- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
- Class with highest final un-normalized log probability score is still the most probable.
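The effect is easy to reproduce with ordinary double-precision floats. In this sketch the 400 word probabilities of 0.01 are arbitrary illustrative values.

```python
from math import log

probs = [0.01] * 400  # illustrative per-word probabilities

# Direct multiplication: 1e-800 is far below the smallest double, so it underflows.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the score representable.
log_score = sum(log(p) for p in probs)

print(product)    # underflows to 0.0
print(log_score)  # a large negative, but finite, number
```

Since log is monotonic, comparing log scores across classes selects the same class as comparing the products would.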

Naïve Bayes Posterior Probabilities

- Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
- However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.
  - Output probabilities are generally very close to 0 or 1.

Textual Similarity Metrics

- Measuring similarity of two texts is a well-studied problem.
- Standard metrics are based on a bag of words model of a document that ignores word order and syntactic structure.
- May involve removing common stop words and stemming to reduce words to their root form.
- Vector-space model from Information Retrieval (IR) is the standard approach.
- Other metrics (e.g. edit-distance) are also used.

The Vector-Space Model

- Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
- These orthogonal terms form a vector space.
  - Dimension = t = |vocabulary|
- Each term, i, in a document or query, j, is given a real-valued weight, wij.
- Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, ..., wtj)

Graphic Representation

- Example
  - D1 = 2T1 + 3T2 + 5T3
  - D2 = 3T1 + 7T2 + T3
  - Q = 0T1 + 0T2 + 2T3

Document Collection

- A collection of n documents can be represented in the vector space model by a term-document matrix.
- An entry in the matrix corresponds to the weight of a term in the document; zero means the term has no significance in the document or it simply doesn't exist in the document.

Term Weights Term Frequency

- More frequent terms in a document are more important, i.e. more indicative of the topic.
  - fij = frequency of term i in document j
- May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
  - tfij = fij / max_i{fij}

Term Weights Inverse Document Frequency

- Terms that appear in many different documents are less indicative of overall topic.
  - df_i = document frequency of term i = number of documents containing term i
  - idf_i = inverse document frequency of term i = log2(N / df_i) (N: total number of documents)
- An indication of a term's discrimination power.
- Log used to dampen the effect relative to tf.

TF-IDF Weighting

- A typical combined term importance indicator is tf-idf weighting:
  - wij = tfij · idfi = tfij · log2(N / dfi)
- A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
- Many other ways of determining term weights have been proposed.
- Experimentally, tf-idf has been found to work well.

Computing TF-IDF -- An Example

- Given a document containing terms with given frequencies: A(3), B(2), C(1)
- Assume the collection contains 10,000 documents and the document frequencies of these terms are: A(50), B(1300), C(250)
- Then (the idf values below correspond to the natural logarithm)
  - A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
  - B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
  - C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
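The example can be recomputed directly. One caveat: the idf values on the slide (5.3, 2.0, 3.7) match the natural logarithm rather than log2, so the sketch uses the natural log to stay consistent with those figures. Term B comes out as about 1.36 without intermediate rounding; the slide's 1.3 appears to be derived from the rounded idf of 2.0.

```python
from math import log

freqs = {"A": 3, "B": 2, "C": 1}            # term frequencies in the document
doc_freqs = {"A": 50, "B": 1300, "C": 250}  # documents containing each term
N = 10_000                                  # documents in the collection

max_f = max(freqs.values())
weights = {}
for term, f in freqs.items():
    tf = f / max_f                    # normalized term frequency
    idf = log(N / doc_freqs[term])    # natural log, as the slide's numbers suggest
    weights[term] = tf * idf

for term, w in weights.items():
    print(term, round(w, 2))
```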

Similarity Measure

- A similarity measure is a function that computes the degree of similarity between two vectors.
- Using a similarity measure between the query and each document:
  - It is possible to rank the retrieved documents in the order of presumed relevance.
  - It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

Similarity Measure - Inner Product

- Similarity between vectors for the document dj and query q can be computed as the vector inner product: $\mathrm{sim}(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij}\, w_{iq}$, where wij is the weight of term i in document j and wiq is the weight of term i in the query
- For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).
- For weighted term vectors, it is the sum of the products of the weights of the matched terms.

Properties of Inner Product

- The inner product is unbounded.
- Favors long documents with a large number of unique terms.
- Measures how many terms matched but not how many terms are not matched.

Inner Product -- Examples

[Figure labels: the seven index terms architecture, management, information, computer, text, retrieval, database]

- Binary
  - D = 1, 1, 1, 0, 1, 1, 0
  - Q = 1, 0, 1, 0, 0, 1, 1
  - sim(D, Q) = 3

Cosine Similarity Measure

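- Cosine similarity measures the cosine of the angle between two vectors.
- Inner product normalized by the vector lengths.

$\mathrm{CosSim}(d_j, q) = \frac{d_j \cdot q}{|d_j|\,|q|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\,\sqrt{\sum_{i=1}^{t} w_{iq}^2}}$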
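To compare the inner product with cosine similarity, the sketch below applies both to the vectors D1 = (2, 3, 5), D2 = (3, 7, 1) and Q = (0, 0, 2) from the Graphic Representation example earlier in these notes. The function names are illustrative.

```python
from math import sqrt


def inner(u, v):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def cos_sim(u, v):
    """Inner product normalized by the two vector lengths."""
    return inner(u, v) / (sqrt(inner(u, u)) * sqrt(inner(v, v)))


D1 = (2, 3, 5)
D2 = (3, 7, 1)
Q = (0, 0, 2)

print(inner(D1, Q), inner(D2, Q))
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))
```

D1 scores 5 times higher than D2 on the inner product (10 against 2) and about 6 times higher on cosine similarity (roughly 0.81 against 0.13), since the normalization penalizes D2's large weights on terms the query does not contain.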

K Nearest Neighbor for Text

Training:
For each training example <x, c(x)> ∈ D
    Compute the corresponding TF-IDF vector, dx, for document x

Test instance y:
Compute TF-IDF vector d for document y
For each <x, c(x)> ∈ D
    Let sx = cosSim(d, dx)
Sort examples, x, in D by decreasing value of sx
Let N be the first k examples in D (get most similar neighbors)
Return the majority class of examples in N
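The test phase can be sketched as follows. To keep the example short, the training documents are given directly as sparse weight vectors (dicts from term to weight); the weights and labels are invented stand-ins for real TF-IDF vectors.

```python
from collections import Counter
from math import sqrt


def cos_sim(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def knn_classify(train, d, k):
    """Majority class among the k training vectors most similar to d."""
    ranked = sorted(train, key=lambda ex: cos_sim(d, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical pre-computed weight vectors with class labels.
train = [
    ({"svm": 2.0, "kernel": 1.5}, "ml"),
    ({"kernel": 1.0, "margin": 2.0}, "ml"),
    ({"goal": 2.0, "match": 1.0}, "sport"),
    ({"match": 2.0, "team": 1.0}, "sport"),
    ({"team": 2.0, "kernel": 0.5}, "sport"),
]

print(knn_classify(train, {"kernel": 1.0, "svm": 1.0}, k=3))
print(knn_classify(train, {"match": 1.0, "team": 1.0}, k=3))
```

Sorting all of D for every test document is the simplest reading of the pseudocode; it is also what makes testing time grow with the size of the training set.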

Illustration of 3 Nearest Neighbor for Text

3 Nearest Neighbor Comparison

- Nearest Neighbor tends to handle polymorphic

categories better.

Nearest Neighbor Time Complexity

- Training Time: O(|D| Ld) to compose TF-IDF vectors.
- Testing Time: O(Lt + |D||Vt|) to compare to all training vectors.
  - Assumes lengths of dx vectors are computed and stored during training, allowing cosSim(d, dx) to be computed in time proportional to the number of non-zero entries in d (i.e. |Vt|)
- Testing time can be high for large training sets.

Nearest Neighbor with Inverted Index

- Determining k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
- Use standard VSR inverted index methods to find the k nearest neighbors.
- Testing Time: O(B|Vt|), where B is the average number of training documents in which a test-document word appears.
- Therefore, overall classification is O(Lt + B|Vt|)
  - Typically B << |D|

Support Vector Machines (SVM)

- a new classifier
- Attractive because
  - Has sound mathematical foundations
  - Performs very well in diverse and difficult applications
- See textbook (ch. 6.3) and papers by Schölkopf placed on the class website

Review of basic analytical geometry

- Dot product of vectors by coordinates and with the angle: (a · b) = Σ ai bi = |a| |b| cos θ
- If vectors a, b are perpendicular, then (a · b) = 0 (e.g. (0, c) · (d, 0) = 0)
- A hyperplane in an n-dimensional space has the property {x | (w · x) + b = 0}; w is the weight vector, b is the threshold; x = (x1, ..., xn), w = (w1, ..., wn)
- A hyperplane divides the n-dimensional space into two half-spaces: one is {x | (w · x) + b > 0}, the other is complementary ({x | (w · x) + b < 0})

- Let's revisit the general classification problem.
- We want to estimate an unknown function f; all we know about it is the training set (x1, y1), ..., (xn, yn)
- The objective is to minimize the expected error (risk) $R[f] = \int l(f(x), y)\, dP(x, y)$
  - where l is a loss function, e.g. $l(f(x), y) = \theta(-y f(x))$
  - and θ(z) = 0 for z < 0 and θ(z) = 1 otherwise
- Since we do not know P, we cannot measure the risk
- We want to approximate the true error (risk) by the empirical error (risk) $R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)$

- We know from the PAC theory that conditions can be given on the learning task so that the empirical risk converges towards the true risk
- We also know that the difficulty of the learning task depends on the complexity of f (VC dimension)
- It is known that the following relationship between the true risk R[f], the empirical risk R_emp[f] and the complexity of the language (h denotes the VC dimension of the class of f), $R[f] \le R_{emp}[f] + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln(\delta/4)}{n}}$, is true with probability at least 1 − δ for n > h

SRM

- Structural Risk Minimization (SRM) chooses the class of F to find a balance between the simplicity of f (very simple may result in a large empirical risk) and the empirical risk (small may require a class of functions with a large h)

Points lying on the margin are called support vectors; w can be constructed efficiently: a quadratic optimization problem.

Basic idea of SVM

- Linearly separable problems are easy (quadratic), but of course most problems are not l. s.
- Take any problem and transform it into a high-dimensional space, so that it becomes linearly separable, but
- Calculations to obtain the separability plane can be done in the original input space (kernel trick)

Basic idea of SVM

- Original data is mapped into another dot product space called feature space F via a non-linear map Φ
- Then a linear separable classifier is performed in F
- Note that the only operations in F are dot products
- Consider e.g.

Let's see Φ geometrically, and that it does what we want it to do: transform a hard classification problem into an easy one, albeit in a higher dimension

- But in general quadratic optimization in the feature space could be very expensive
- Consider classifying 16 x 16 pixel pictures, and 5th order monomials
- Feature space dimension in this example is $O\left(\binom{256 + 5 - 1}{5}\right) \approx 10^{10}$

Here we show that the transformation from an ellipsoidal decision space to a linear one, requiring a dot product in the feature space, can be performed by a kernel function in the input space

- in general, k(x, y) = (x · y)^d computes in the input space
- kernels replace computation in FS by computation in the input space
- in fact, the transformation Φ need not be applied when a kernel is used!
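A numerical check makes the kernel trick concrete. The map used here, Φ(x1, x2) = (x1², √2·x1·x2, x2²), is an assumption: it is the usual textbook map for the degree-2 polynomial kernel in two dimensions, and the transcript does not show which map the slide used.

```python
from math import isclose, sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Assumed explicit feature map from R^2 to R^3 (all degree-2 monomials)."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y, d=2):
    """k(x, y) = (x . y)^d, computed entirely in the input space."""
    return dot(x, y) ** d


x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))  # dot product in the feature space
implicit = poly_kernel(x, y)    # same value without ever applying phi

print(explicit, implicit)
```

Both routes give 121 for these points, but the kernel needs only one 2-dimensional dot product and a squaring. For high degrees and many input dimensions the feature space becomes huge while the kernel's cost stays that of one input-space dot product.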

Some common kernels used

Using different kernels we in fact use different classifiers in the input space: gaussian, polynomial, 3-layer neural nets, ...

Simplest kernel

- Is the linear kernel (w · x) + b
- But this only works if the training set is linearly separable. This may not be the case
  - For the linear kernel, or even
  - In the feature space

The solution for the non-separable case is to optimize not just the margin, but the margin plus the influence of training errors ξi

Classification with SVMs

- Convert each example x to Φ(x)
- Perform the optimal hyperplane algorithm in F; but since we use the kernel, all we need to do is to compute $f(x) = \mathrm{sgn}\left(\sum_{i=1}^{n} y_i\, \alpha_i\, k(x, x_i) + b\right)$
  - where xi, yi are training instances, αi are computed as the solution to the quadratic programming problem

Examples of classifiers in the input space

Overall picture

Geometric interpretation of SVM classifier

- Normalize the weight vector to 1 (||w|| = 1) and set the threshold b = 0
- The set of all w that separate the training set is the Version Space (VS)
- VS has a geometric centre (Bayes Optimal Classifier) near the gravity point
- If VS has a shape in which the SVM solution is far from the VS centre, SVM works poorly


Applications

- Text classification
- Image analysis: face recognition
- Bioinformatics: gene expression
- Can the kernel reflect domain knowledge?

SVM contd

- A method of choice when examples are represented by vectors or matrices
- Input space cannot be readily used as attribute-vector (e.g. too many attrs)
- Kernel methods: map data from input space to feature space (FS); perform learning in FS provided that examples are only used within dot products (the kernel trick)
- SVM, but also Perceptron, PCA, NN can be done on that basis
- Collectively: kernel-based methods
- The kernel defines the classifier
- The classifier is independent of the dimensionality of the FS, which can even be infinite (gaussian kernel)
- LIMITATION of SVMs: they only work for two-class problems
- Remedy: use of ECOC

Applications: face detection (IEEE INTELLIGENT SYSTEMS)

- The task: to find a rectangle containing a face in an image; applicable in face recognition, surveillance, HCI etc. Also in medical image processing and structural defects
- Difficult task: variations that are hard to represent explicitly (hair, moustache, glasses)
- Cast as a classification problem: image regions that are faces and non-faces
- Scanning the image in multiple scales, dividing it into (overlapping) frames and classifying the frames with an SVM

Face detection cont'd

- SVM performing face detection: support vectors are faces and non-faces
- Examples are 19x19 pixels, class 1 or -1
- SVM: 2nd degree polynomial with slack variables
- Representation tricks:
  - masking out near-boundary area: 361 -> 283, removes noise
  - illumination correction: reduction of light and shadow
  - discretization of pixel brightness by histogram equalization

Face detection system architecture

- Bootstrapping: using the system on images with no faces and storing false positives to use as negative examples in later training

Performance on 2 test sets: set A, 313 high quality images with 313 faces; set B, 23 images with 155 faces. This results in > 4M frames for A and > 5M frames for B. SVM achieved recall of 97% on A and 74% on B, with 4 and 20 false positives, resp.

SVM in text classification

- Example of classifiers (the Reuters corpus: 13K stories, 118 categories, time split)
- Essential in document organization (emails!), indexing etc.
- The first comes from a PET, the second from an SVM
- Text representation: BOW, mapping docs to large vectors indicating which word occurs in a doc; as many dimensions as words in the corpus (many more than in a given doc)
- often extended to frequencies (normalized) of stemmed words

Text classification

- Still a large number of features, so a stop list is applied, and some form of feature selection (e.g. based on info gain, or tf/idf) is done, down to 300 features
- Then a simple, linear SVM is used (experiments with poly. and RBF kernels indicated they are not much better than linear). A one-against-all scheme is used
- What is a poly (e.g. level 2) kernel representing in text classification?
- Performance measured with micro-averaged break-even point (explain)
- SVM obtained the best results, with DT second (on 10 cat.) and Bayes third. Other authors report better IB performance (findSim) than here

A ROC for the above experiments (class grain). The ROC is obtained by varying the threshold; the threshold is learned from values of and discriminates between classes

How about another representation?

- N-grams: sequences of N consecutive characters, e.g. the 3-grams of "support vector" are sup, upp, ppo, por, ..., tor
- Language-independent, but a large number of features (>> words)
- The more substrings in common between 2 docs, the more similar the 2 docs are
- What if we make these substrings non-contiguous? With a weight measuring non-contiguity? car, custard
- We will make each substring a feature, with a value depending on how frequently and how compactly a substring occurs in the text

- The latter is represented by a decay factor λ
- Example: cat, car, bat, bar
- Unnormalized: K(car, cat) = λ^4, K(car, car) = K(cat, cat) = 2λ^4 + λ^6; normalized: K(car, cat) = λ^4 / (2λ^4 + λ^6) = 1 / (2 + λ^2)
- Impractical (too many) for larger substrings and docs, but the kernel using such features can be calculated efficiently (substring kernel, SSK); it maps strings (a whole doc) to a feature vector indexed by all k-tuples

- Value of the feature: the sum, over the occurrences of the k-tuple, of a decay factor raised to the length of the occurrence
- Def. of SSK: Σ is an alphabet; a string is a finite sequence of elements of Σ; |s| is the length of s; s[i:j] is the substring s_i ... s_j of s. u is a subsequence of s if there exist indices i = (i_1, ..., i_|u|) with 1 ≤ i_1 < ... < i_|u| ≤ |s| such that u_j = s_{i_j} for j = 1, ..., |u| (u = s[i] for short).
- The length l(i) of the subsequence in s is i_|u| - i_1 + 1 (its span in s)
- The feature space mapping φ for s is defined by $\phi_u(s) = \sum_{i:\, u = s[i]} \lambda^{l(i)}$ for each u ∈ Σ^n (the set of all finite strings of length n); the features measure the number of occurrences of subsequences in s, weighted by their length (λ ≤ 1)
- The kernel can be evaluated in O(n|s||t|) time (see Lodhi paper)
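For very short strings the feature map can simply be enumerated, which is enough to check the car/cat example. This brute-force sketch is exponential in the string length and is not the efficient dynamic-programming evaluation from the Lodhi paper; the function names and the choice λ = 0.5 are illustrative.

```python
from collections import defaultdict
from itertools import combinations
from math import isclose, sqrt


def features(s, k, lam):
    """Map s to {u: sum of lam**span over all occurrences of subsequence u}."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), k):  # increasing index tuples
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1             # length l(i) of the occurrence in s
        phi[u] += lam ** span
    return phi


def ssk(s, t, k, lam):
    """Unnormalized kernel: dot product of the two feature vectors."""
    fs, ft = features(s, k, lam), features(t, k, lam)
    return sum(w * ft[u] for u, w in fs.items() if u in ft)


def ssk_normalized(s, t, k, lam):
    return ssk(s, t, k, lam) / sqrt(ssk(s, s, k, lam) * ssk(t, t, k, lam))


lam = 0.5
print(ssk("car", "cat", 2, lam))             # only "ca" is shared: lam**4
print(ssk("car", "car", 2, lam))             # 2*lam**4 + lam**6
print(ssk_normalized("car", "cat", 2, lam))  # 1 / (2 + lam**2)
```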

Experimental results with SSK

- The method is NOT fast, so a subset of Reuters (n = 470/380) was used, and only 4 classes: corn, crude, earn, acquisition
- Compared to the BOW representation (see earlier in these notes) with stop words removed, features weighted by tf/idf = log(1 + tf) · log(n/df)
- F1 was used for the evaluation, C set experimentally
- Best k is between 4 and 7
- Performance comparable to a classifier based on k-grams (contiguous), and also BOW
- λ controls the penalty for gaps in substrings: best precision for high λ = 0.7. This seems to result in a high similarity score for docs that share the same but semantically different words. WHY?
- Results on full Reuters not as good as with BOW, k-grams; the conjecture is that the kernel performs something similar to stemming, which is less important on large datasets where there is enough data to learn the sameness of different inflections

Bioinformatics application

- Coding sequences in DNA encode proteins.
- DNA alphabet: A, C, G, T. Codon: triplet of adjacent nucleotides, codes for one amino acid.
- Task: identify where in the genome the coding starts (Translation Initiation Sites). Potential start codon is ATG.
- Classification task: does a sequence window around the ATG indicate a TIS?
- Each nucleotide is encoded by 5 bits, exactly one is set to 1, indicating whether the nucleotide is A, C, G, T, or unknown. So the dimension n of the input space is 1000 for window size 100 to left and right of the ATG sequence.
- Positive and negative windows are provided as the training set
- This representation is typical for the kind of problem where SVMs do well
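The 5-bit encoding above can be sketched as follows. The function name `encode_window` and the choice to map any non-ACGT symbol to the "unknown" bit are assumptions made for illustration.

```python
ALPHABET = "ACGT"  # bit positions 0-3; position 4 is reserved for "unknown"


def encode_window(seq):
    """Encode each nucleotide as 5 bits with exactly one bit set."""
    bits = []
    for nt in seq.upper():
        one_hot = [0] * 5
        idx = ALPHABET.find(nt)
        one_hot[idx if idx >= 0 else 4] = 1
        bits.extend(one_hot)
    return bits


# 100 positions to the left and 100 to the right give 200 nucleotides,
# hence 200 * 5 = 1000 input dimensions.
window = "ACGTN" * 40
vec = encode_window(window)
```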

- What is a good feature space for this problem? How about including in the kernel some prior domain knowledge? E.g.:
- Dependencies between distant positions are not important or are known not to exist
- Compare, at each sequence position, two sequences locally in a window of size $2l + 1$ around that position, with decreasing weight away from the centre of the window: $win_p(x, y) = \left( \sum_{j=-l}^{+l} w_j \, match_{p+j}(x, y) \right)^{d_1}$
- Where $d_1$ is the order of importance of local (within the window) correlations, and $match_{p+j}(x, y)$ is 1 for matching nucleotides at position $p + j$, 0 otherwise

- Window scores are summed over the length of the sequence, and correlations between up to $d_2$ windows are taken into account
- Also it is known that the codon below the TIS is a CDS: a CDS shifted by 3 nucleotides is still a CDS
- Trained with 8000 patterns and tested with 3000

Results

Further results on UCI benchmarks

Ensembles of learners

- not a learning technique on its own, but a method in which a family of weakly learning agents (simple learners) is used for learning
- based on the fact that multiple classifiers that disagree with one another can be together more accurate than their component classifiers
- if there are L classifiers, each with an error rate < 1/2, and the errors are independent, then the probability that the majority vote is wrong is the area under the binomial distribution for more than L/2 hypotheses
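The binomial argument above can be checked numerically. This is a small sketch under the slide's independence assumption; the function name `majority_error` and the values L = 21, p = 0.3 are illustrative choices.

```python
from math import comb


def majority_error(L, p):
    """Probability that more than L/2 of L independent classifiers,
    each wrong with probability p, are wrong at the same time."""
    k_min = L // 2 + 1  # smallest count that is strictly more than L/2
    return sum(comb(L, k) * p**k * (1 - p) ** (L - k)
               for k in range(k_min, L + 1))


single = 0.3
ensemble = majority_error(21, single)  # far below the individual error rate
```

With error rates above 1/2 the same formula shows the vote getting worse, which is why the < 1/2 condition matters.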

Boosting as ensemble of learners

- The very idea: focus on difficult parts of the example space
- Train a number of classifiers
- Combine their decisions in a weighted manner

- Bagging [Breiman] is to learn multiple hypotheses from different subsets of the training set, and then take a majority vote. Each sample is drawn randomly with replacement (a bootstrap). Each bootstrap contains, on average, 63.2% of the training set
- Boosting is a refinement of bagging, where the sample is drawn according to a distribution, and that distribution emphasizes the misclassified examples. Then a vote is taken.
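The 63.2% figure follows from the chance that a given instance is never drawn in n draws with replacement, $(1 - 1/n)^n \approx e^{-1}$. A short sketch, with illustrative function names and an arbitrary seed:

```python
import random


def expected_unique_fraction(n):
    """Expected fraction of distinct training instances in one bootstrap."""
    return 1 - (1 - 1 / n) ** n


def simulated_unique_fraction(n, seed=0):
    """Draw one bootstrap of size n and measure the distinct fraction."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n


analytic = expected_unique_fraction(10_000)
empirical = simulated_unique_fraction(10_000)
```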

- Let's make sure we understand the makeup of the final classifier

- AdaBoost (Adaptive Boosting) uses the probability distribution. Either the learning algorithm uses it directly, or the distribution is used to produce the sample.
- See http://www.research.att.com/yoav/adaboost/index.html for a Web-based demo.
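A compact sketch of the idea, assuming one-dimensional inputs, labels in {-1, +1} and threshold "stumps" as the weak learner. Here the distribution is used directly by the learner (the first of the two options above). All names and the toy data are illustrative, not taken from the course material.

```python
import math


def train_stump(xs, ys, w):
    """Return (weighted error, threshold, sign) of the best stump.
    A stump predicts `sign` when x <= threshold and `-sign` otherwise."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x <= thr else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n  # start from the uniform distribution
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, thr, sign))
        preds = [sign if x <= thr else -sign for x in xs]
        # Misclassified examples (y * p = -1) get their weight increased.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (s if x <= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1


xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]  # no single stump separates this
model = adaboost(xs, ys, rounds=3)
```

On this toy set a single stump misclassifies several points, while the weighted vote of three stumps fits the training data.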

Intro. to bioinformatics

- Bioinformatics: collection, archiving, organization and interpretation of biological data
- integrated in vitro, in vivo, in silico
- Requires understanding of basic genetics
- Based on
- genomics,
- proteomics,
- transcriptomics

What is Bioinformatics?

- Bioinformatics is about integrating biological themes together with the help of computer tools and biological databases.
- It is a new field of science where mathematics, computer science and biology combine to study and interpret genomic and proteomic information

Basic biology

- Information in biology: DNA
- Genotype (hereditary make-up of an organism) and phenotype (physical/behavioral characteristics) (late 19th century)
- Biochemical structure of DNA: double helix (1953); nucleotides A, C, G, T
- Progress in biology and IT made it possible to map entire genomes: the total genetic material of a species written with DNA code
- For a human, $3 \times 10^9$ long
- Same in all the cells of a person

What is a gene?

- Interesting to see if there are genes (functional elements of the genome) responsible for some aspects of the phenotype (e.g. an illness)
- Testing
- Cure
- Genes result in proteins
- Gene → RNA (transcription) → protein

What is gene expression?

- We say that genes code for proteins
- In simple organisms (prokaryotes), a high percentage of the genome are genes (85%)
- In eukaryotes this drops: yeast 70%, fruit fly 25%, flowers 3%
- Databases with gene information: GenBank/DDBJ, EMBL
- Databases with protein information
- SwissProt, GenPept, TREMBL, PIR

- Natural interest to find repetitive and/or common subsequences in the genome: BLAST
- For this, it is interesting to study genetic expression (clustering)

Expression levels

[Figure: expression levels of Gene X and Gene Y over Time, with shift deltaX; Y is activated by X]

- Activation and Inhibition

Microarrays

- Microarrays give us information about the rate of protein production of a gene during an experiment. Those technologies give us a lot of information.
- Analyzing microarray data tells us how the gene protein production evolves.
- Each data point represents the log expression ratio of a particular gene under two different experimental conditions. The numerator of each ratio is the expression level of the gene in the varying condition, whereas the denominator is the expression level of the gene in some reference condition. The expression measurement is positive if the gene expression is induced with respect to the reference state and negative if it is repressed. We use those values as derivatives.
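The sign convention above can be illustrated with a one-line helper. Base 2 is an assumption here (the slide only says "log"), and the function name and numbers are made up for illustration.

```python
import math


def log_expression_ratio(condition_level, reference_level):
    """Log ratio of expression in the varying condition to the reference.
    Positive means induced, negative means repressed, zero means unchanged."""
    return math.log2(condition_level / reference_level)


induced = log_expression_ratio(400.0, 100.0)    # four-fold increase
repressed = log_expression_ratio(50.0, 100.0)   # two-fold decrease
```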

Microarrays

Microarray technology

Scanning

Scanning (contd)

Scanning (contd)

Hybridization simulation

9. Data mining

- definition
- basic concepts
- applications
- challenges

Definition - Data Mining

- extraction of unknown patterns from data
- combines methods from
- databases
- machine learning
- visualization
- involves large datasets

Definition - KDD

- Knowledge Discovery in Databases
- consists of
- selection
- preprocessing
- transformation
- Data Mining
- Interpretation/Evaluation
- no explicit requirement of large datasets

Model building

- Need to normalize data
- data labelling - replacement of the starter
- use of other data sources to label
- linear regression models on STA
- contextual setting of the time interval

Associations

- Given
- $I = \{i_1, \dots, i_m\}$: set of items
- $D$: set of transactions (a database), each transaction is a set of items $T \in 2^I$
- Association rule $X \Rightarrow Y$, $X \subset I$, $Y \subset I$, $X \cap Y = \emptyset$
- confidence $c$: ratio of # of transactions that contain both $X$ and $Y$ to # of all transactions that contain $X$
- support $s$: ratio of # of transactions that contain both $X$ and $Y$ to # of transactions in $D$
- Itemset is frequent if its support exceeds a given threshold

- An association rule $A \Rightarrow B$ is a conditional implication among itemsets $A$ and $B$, where $A \subset I$, $B \subset I$ and $A \cap B = \emptyset$.
- The confidence of an association rule $r: A \Rightarrow B$ is the conditional probability that a transaction contains $B$, given that it contains $A$.
- The support of rule $r$ is defined as $sup(r) = sup(A \cup B)$. The confidence of rule $r$ can be expressed as $conf(r) = sup(A \cup B)/sup(A)$.
- Formally, let $A \in 2^I$; $sup(A) = |\{t : t \in D, A \subseteq t\}| / |D|$; if $R = A \Rightarrow B$ then $sup(R) = sup(A \cup B)$, $conf(R) = sup(A \cup B)/sup(A)$
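The definitions of support and confidence translate directly into code. The sketch below uses a made-up four-transaction database; the item names and function names are purely illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """sup(A u B) / sup(A): how often B appears when A does."""
    a = frozenset(antecedent)
    return support(a | frozenset(consequent), transactions) / support(a, transactions)


# Hypothetical toy database D.
D = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

s = support({"bread", "milk"}, D)        # 2 of 4 transactions
c = confidence({"bread"}, {"milk"}, D)   # 2 of the 3 bread transactions
```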

Associations - mining

- Given D, generate all association rules with c, s > thresholds minc, mins
- (items are ordered, e.g. by barcode)
- Idea:
- find all itemsets that have transaction support > mins: large itemsets

Associations - mining

- to do that: start with individual items with large support
- in each next step, k,
- use itemsets from step k-1 to generate new candidate itemsets $C_k$,
- count support of $C_k$ (by counting the candidates which are contained in any t),
- prune the ones that are not large

Associations - mining

Only keep those that are contained in some

transaction

Candidate generation

$C_k$ = apriori-gen($L_{k-1}$)

- From large itemsets to association rules

Subset function

Subset($C_k$, t) checks if an itemset in $C_k$ is in a transaction t. It is done via a tree structure through a series of hashing steps:

- Root: hash on every item in t; itemsets not containing anything from t are ignored
- Interior node: if you got here by hashing item i of t, hash on all following items of t
- Leaf (set of itemsets): check if an itemset is contained in this leaf

Example

- $L_3$ = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
- $C_4$ = {{1 2 3 4}, {1 3 4 5}}
- pruning deletes {1 3 4 5} because {1 4 5} is not in $L_3$.
- See http://www.almaden.ibm.com/u/ragrawal/pubs.html#associations for details
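The example above can be reproduced with a short sketch of candidate generation: a join of itemsets that share their first k-2 items, followed by the prune step. The function names are illustrative, and itemsets are assumed to be stored as sorted tuples.

```python
from itertools import combinations


def apriori_join(large_prev):
    """Join step: merge two (k-1)-itemsets that agree on all but the last item."""
    prev = sorted(tuple(sorted(s)) for s in large_prev)
    return [p + (q[-1],) for p in prev for q in prev
            if p[:-1] == q[:-1] and p[-1] < q[-1]]


def apriori_gen(large_prev):
    """Join, then prune candidates that have a subset which is not large."""
    prev_set = {tuple(sorted(s)) for s in large_prev}
    return [c for c in apriori_join(large_prev)
            if all(sub in prev_set for sub in combinations(c, len(c) - 1))]


L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
joined = apriori_join(L3)   # candidates before pruning
C4 = apriori_gen(L3)        # (1, 3, 4, 5) is dropped: (1, 4, 5) is not in L3
```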

DM result evaluation

- Accuracy
- ROC
- lift curves
- cost
- but also INTERPRETABILITY

Feature Selection [sec. 7.1 in Witten, Frank]

- Attribute-vector representation: coordinates of the vector are referred to as attributes or features
- "Curse of dimensionality": learning is search, and the search space increases drastically with the number of attributes
- Theoretical justification: we know from PAC theorems that this increase is exponential [discuss; e.g. slide 70]
- Practical justification: with divide-and-conquer algorithms the partition sizes decrease and at some point irrelevant attributes may be selected
- The task: find a subset of the original attribute set such that the classifier will perform at least as well on this subset as on the original set of attributes

Some foundations

- We are in the classification setting, Xi are the attributes and Y is the class. We can define relevance of features wrt the Optimal Bayes Classifier (OBC)
- Let S be a subset of features, and X a feature not in S
- X is strongly relevant if removal of X alone deteriorates the performance of the OBC.
- X is weakly relevant if it is not strongly relevant AND performance of the OBC on $S \cup \{X\}$ is better than on S

three main approaches

- Manually: often unfeasible
- Filters: use the data alone, independent of the classifier that will be used on this data (aka scheme-independent selection)
- Wrappers: the FS process is wrapped in the classifier that will be used on the data

Filters - discussion

- Find the smallest attribute set in which all the instances are distinct. Problem: cost if exhaustive search is used
- But learning and FS are related: in a way, the classifier already includes the good (separating) attributes. Hence the idea:
- Use one classifier for FS, then another on the results. E.g. use a DT, and pass the data on to NB. Or use 1R for DT.

Filters contd: RELIEF [Kira, Rendell]

- Initialize weight of all attrs to 0
- Sample instances and check the similar ones.
- Determine pairs which are in the same class (near hits) and in different classes (near misses).
- For each hit, identify attributes with different values. Decrease their weight
- For each miss, attributes with different values have their weight increased.
- Repeat the sample selection and weighing (2-5) many times
- Keep only the attrs with positive weight
- Discussion: high variance unless the # of samples is very high
- Deterministic RELIEF: use all instances and all hits and misses
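The steps above can be sketched as follows, assuming nominal attributes, a mismatch-count distance, one nearest hit and one nearest miss per sampled instance, and unit weight updates. These simplifications and all names are assumptions for illustration, not the exact published procedure.

```python
import random


def relief(X, y, n_samples, seed=0):
    """Return one weight per attribute; positive weights suggest relevance."""
    rng = random.Random(seed)
    n_attrs = len(X[0])
    w = [0.0] * n_attrs

    def dist(a, b):
        # Number of attributes on which the two instances differ.
        return sum(ai != bi for ai, bi in zip(a, b))

    for _ in range(n_samples):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        if not hits or not misses:
            continue
        near_hit = min(hits, key=lambda j: dist(X[i], X[j]))
        near_miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for a in range(n_attrs):
            if X[i][a] != X[near_hit][a]:
                w[a] -= 1  # differs within the class: less useful
            if X[i][a] != X[near_miss][a]:
                w[a] += 1  # differs across classes: more useful
    return w


# Toy data: attribute 0 determines the class, attribute 1 is noise.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 1, 1]
weights = relief(X, y, n_samples=20)
selected = [a for a, wa in enumerate(weights) if wa > 0]
```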

A different approach

- View attribute selection as a search in the space of all attribute subsets
- Search needs to be driven by some heuristic (evaluation criterion)
- This could be some measure of the discrimination ability of the result of search, or
- Cross-validation, on the part of the training set put aside for that purpose. This means that the classifier is wrapped in the FS process, hence the name wrapper (scheme-specific selection)

- Greedy search example
- A single attribute is added (forward) or deleted (backward)
- Could also be done as best-first search or beam search, or some randomized (e.g. genetic) search
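A sketch of greedy forward search, assuming the caller supplies an `evaluate` function that scores an attribute subset (in a wrapper this would be a cross-validated accuracy of the chosen classifier). The toy scoring function below is hypothetical and stands in for that estimate.

```python
def forward_selection(attributes, evaluate):
    """Add one attribute at a time while the evaluation score improves."""
    selected = []
    best_score = evaluate(selected)
    improved = True
    while improved:
        improved = False
        best_attr = None
        for a in attributes:
            if a in selected:
                continue
            score = evaluate(selected + [a])
            if score > best_score:
                best_score, best_attr, improved = score, a, True
        if improved:
            selected.append(best_attr)
    return selected, best_score


# Hypothetical per-attribute gains; "c" behaves like an irrelevant attribute.
GAINS = {"a": 0.3, "b": 0.1, "c": -0.05}


def toy_evaluate(subset):
    return 0.5 + sum(GAINS[x] for x in subset)


chosen, score = forward_selection(["a", "b", "c"], toy_evaluate)
```

Backward elimination is the mirror image: start from all attributes and delete one at a time while the score does not drop.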

Wrappers

- Computationally expensive (k-fold x-val at each search step)
- backward selection often yields better accuracy
- x-val is just an optimistic estimation that may stop the search prematurely
- in backward mode attr sets will be larger than optimal
- Forward mode may result in better comprehensibility
- Experimentally FS does particularly well with NB on data on which NB does not do well
- NB is sensitive to redundant and dependent (!) attributes
- Forward selection with training set performance does well [Langley and Sage 94]

Discretization

- Getting away from numerical attrs
- We know it from DTs, where numerical attributes were sorted and splitting points between each two values were considered
- Global (independent of the classifier) and local (different results in each tree node) schemes exist
- What is the result of discretization: a value of a nominal attribute
- Ordering information could be used if the discretized attribute with k values is converted into k-1 binary attributes: the (i-1)th attribute being true represents the fact that the value is < i
- Supervised and unsupervised discretization

Unsupervised discretization

- Fixed-length intervals (equal interval binning), e.g. (max-min)/k
- How do we know k?
- May distribute instances unevenly
- Variable-length intervals, each containing the same number of instances (equal frequency binning, or histogram equalization)
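Both unsupervised schemes fit in a few lines. The sketch below assumes that the values are not all equal and that ties may be split across bins in the equal-frequency case; the function names and data are illustrative.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of width (max - min) / k."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    # The maximum falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bins by rank so that each bin holds about len(values) / k items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]       # one outlier
by_width = equal_width_bins(data, 3)       # outlier leaves the middle bin empty
by_frequency = equal_frequency_bins(data, 3)
```

The outlier shows the uneven distribution mentioned above: equal-width binning puts eight of nine values in one bin.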

- Supervised discretization
- Example of the Temperature attribute in the play/don't play data
- A recursive algorithm using an information measure. We go for the cut point with lowest information (cleanest subset)
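One step of the recursive procedure, choosing a single cut point, might look like the sketch below. It assumes candidate cuts halfway between adjacent distinct values and uses the size-weighted class entropy of the two sides as the information measure; the names and the toy data are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Class entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def best_cut_point(values, labels):
    """Return (information, cut) for the split with the lowest information."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or info < best[0]:
            best = (info, cut)
    return best


info, cut = best_cut_point([1, 2, 3, 10, 11, 12],
                           ["no", "no", "no", "yes", "yes", "yes"])
```

Applying the same function recursively to each side, until a stopping criterion is met, gives the full discretization.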

Supervised discretization contd

- What's the right stopping criterion?
- How about MDL? Compare the info to transmit the label of each instance before the split with the info after it: the split point in $\log_2(N-1)$ bits, info for points below and info for points above.
- Each instance costs 1 bit before the split, and slightly > 0 bits after the split
- This is the [Irani, Fayyad 93] method

Error-correcting Output Codes (ECOC)

- Method of combining classifiers from a two-class problem to a k-class problem
- Often when working with a k-class problem, k one-against-all classifiers are learned, and then combined using ECOC
- Consider a 4-class problem, and suppose that there are 7 classifiers, and classes are coded as follows
- Suppose an instance ∈ a is classified as 1011111 (mistake in the 2nd classifier).
- But this classification is the closest to class a in terms of edit (Hamming) distance. Also note that class encodings in col. 1 are not error correcting

class   class encoding
        col. 1   col. 2
a       1000     1111111
b       0100     0000111
c       0010     0011001
d       0001     0101010

ECOC contd

- What makes an encoding error-correcting?
- Depends on the distance between encodings: an encoding with distance d between encodings may correct up to $\lfloor (d-1)/2 \rfloor$ errors (why?)
- In col. 1, d = 2, so this encoding may correct 0 errors
- In col. 2, d = 4, so single-bit errors will be corrected
- This example describes row separation; there must also be column separation (1 in col. 2), otherwise two classifiers would make identical errors; this weakens error correction
- For a small number of classes, exhaustive codes as in col. 2 are used
- See the Dietterich paper on how to design good error-correcting codes
- Gives good results in practice, e.g. with SVM, decision trees, backprop NNs
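Decoding by Hamming distance, and the minimum row distance d of the two encodings from the table, can be checked with a short sketch. The dictionary and function names are illustrative.

```python
from itertools import combinations

ONE_PER_CLASS = {"a": "1000", "b": "0100", "c": "0010", "d": "0001"}
ECOC = {"a": "1111111", "b": "0000111", "c": "0011001", "d": "0101010"}


def hamming(u, v):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(u, v))


def decode(bits, codebook):
    """Return the class whose code word is closest to the classifier outputs."""
    return min(codebook, key=lambda cls: hamming(bits, codebook[cls]))


def min_distance(codebook):
    """Smallest pairwise distance d between code words (row separation)."""
    return min(hamming(u, v) for u, v in combinations(codebook.values(), 2))


predicted = decode("1011111", ECOC)           # one classifier got it wrong
correctable = (min_distance(ECOC) - 1) // 2   # errors the 7-bit code can fix
```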

What is ILP?

- Machine learning when instances, results and background knowledge are represented in First Order Logic, rather than attribute-value representation
- Given E+, E-, BK
- Find h such that h ⊨ E+, h ⊭ E-

- E+:
- boss(mary,john). boss(phil,mary). boss(phil,john).
- E-:
- boss(john,mary). boss(mary,phil). boss(john,phil).
- BK:
- employee(john,ibm). employee(mary,ibm). employee(phil,ibm).
- reports_to(john,mary). reports_to(mary,phil). reports_to(john,phil).
- h: boss(X,Y) :- employee(X,O), employee(Y,O), reports_to(Y,X).
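What "h covers E+ and none of E-" means for the example above can be checked mechanically. The sketch below encodes the background facts as Python sets and the hypothesis as a function, treating boss as a two-argument relation as in the examples; it tests a given hypothesis and does not learn one.

```python
employee = {("john", "ibm"), ("mary", "ibm"), ("phil", "ibm")}
reports_to = {("john", "mary"), ("mary", "phil"), ("john", "phil")}

E_pos = {("mary", "john"), ("phil", "mary"), ("phil", "john")}
E_neg = {("john", "mary"), ("mary", "phil"), ("john", "phil")}


def boss(x, y):
    """x is the boss of y if both work for the same organisation O
    and y reports to x (the body of the hypothesis)."""
    orgs = {o for _, o in employee}
    same_org = any((x, o) in employee and (y, o) in employee for o in orgs)
    return same_org and (y, x) in reports_to


def is_consistent(hypothesis, positives, negatives):
    """True if every positive example is covered and no negative one is."""
    return (all(hypothesis(*e) for e in positives)
            and not any(hypothesis(*e) for e in negatives))


ok = is_consistent(boss, E_pos, E_neg)
```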

Historical justification of the name

- From facts and BK, induce a FOL hypothesis (hypothesis in Logic Programming)

[Figure: PROLOG, with Examples, Background knowledge and Hypotheses (rules)]

Why ILP? - practically

- Constraints of classical machine learning: attribute-value (AV) representation
- instances are represented as rows in a single table, or must be combined into such a table
- This is not the way data comes from databases

From tables to models to examples and background knowledge

- Results of learning (in Prolog)
- party(yes) :- participant(_J, senior, _C).
- party(yes) :- participant(president, _S, _C).

Why ILP - theoretically

- AV: all examples are the same length
- no recursion
- How could we learn the concept of reachability in a graph?

Expressive power of relations, impossible in AV

[Figure: graph with nodes 0 to 8]

- cannot really be expressed in AV representation, but is very easy in relational representation
- linked-to = {<0,1>, <0,3>, <1,2>, ..., <7,8>}
- can-reach(X,Y) :- linked-to(X,Z), can-reach(Z,Y).

Another example of recursive learning

- E+:
- boss(mary,john). boss(phil,mary). boss(phil,john).
- E-:
- boss(john,mary). boss(mary,phil). boss(john,phil).
- BK:
- employee(john,ibm). employee(mary,ibm). employee(phil,ibm).
- reports_to_imm(john,mary). reports_to_imm(mary,phil).
- h: boss(X,Y) :- employee(X,O), employee(Y,O), reports_to(Y,X).
- reports_to(X,Y) :- reports_to_imm(X,Z), reports_to(Z,Y).
- reports_to(X,X).

How is learning done: covering algorithm

- Initialize the training set T
- while the global training set contains positive examples:
  find a clause that describes part of relationship Q;
  remove the examples covered by this clause
- Finding a clause:
  initialize the clause to Q(V1,...,Vk) :-
  while T contains negative examples:
  find a literal L to add to the right-hand side of the clause

- Finding a literal: greedy search

- "Find a clause" loop describes search
- Need to structure the search space
- generality: semantic and syntactic
- since logical generality is not decidable, a stronger property of θ-subsumption is used
- then search from general to specific (refinement)

Refinement

Heuristics link to he