
Learning on the Test Data: Leveraging Unseen Features

- Ben Taskar, Ming Fai Wong, Daphne Koller

Introduction

- Most statistical learning models make the assumption that data instances are IID samples from some fixed distribution.
- In many cases, however, the data are collected from different sources, at different times and locations, and under different circumstances.
- We usually build a statistical model of features under the assumption that future data will exhibit the same regularities as the training data.
- In many data sets, however, there are scope-limited features whose predictive power is only applicable to a certain subset of the data.

Examples

- 1. Classifying news articles chronologically
  - Suppose the task is to classify news articles chronologically. New events, people and places appear (and disappear) in bursts over time.
  - The training data might consist of articles taken over some time period; these are only somewhat representative of the future articles.
  - The test data may contain some features that were never observed in the training data.
- 2. Classifying customers into categories
  - Our training data might be collected from one geographical region, which may not represent the distribution in other regions.

We could sidestep this difficulty by mixing all the examples and selecting the training and test sets randomly. But this homogeneity cannot be ensured in real-world tasks, where only the non-representative training data is actually available for training. The test data may contain many features that were never or only rarely observed in the training data, yet these features may be useful for classification. For example, in the news article task these local features might include the names of places or people currently in the news. In the customer example, these local features might include purchases of products that are specific to a region.

Scoped Learning

- Suppose we want to classify news articles chronologically. The phrase "XXX said today" might appear in many places in the data, for different values of XXX.
- Such features are called scope-limited features, or local features.
- Another example
  - Suppose there are two labels, grain and trade. Words like corn or wheat often appear in phrases such as "tons of wheat". So we can learn that if a word appears in the context "tons of xxx", it is likely to be associated with the label grain. If we then find a phrase like "tons of rye" in the test data, we can infer that rye has some positive interaction with the label grain.
- Scoped learning is a probabilistic framework that combines the traditional IID (global) features with scope-limited (local) features.
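
As a toy illustration of how such a context pattern can be detected, here is a minimal sketch (my own code, not from the paper) that pulls out words occurring after "tons of":

  import re

  # Toy sketch: words appearing in the context "tons of <word>", which the
  # slide suggests is predictive of the label 'grain'.
  def tons_of_words(text):
      return re.findall(r"tons of (\w+)", text.lower())

  print(tons_of_words("Exports included 500 tons of rye and 200 tons of wheat."))
  # -> ['rye', 'wheat']; both share the context meta-feature 'follows "tons of"'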

The intuitive procedure for using the local features is to use the information from the global (IID) features to infer the rules that govern the local information for a particular subset of the data. When data exhibit scope, the authors found significant gains in performance over traditional models that use only IID features. All the data instances within a particular scope exhibit some structural regularity, and we assume that all the future data within that scope will exhibit the same structural regularity.

General Framework

- Notion of scope
  - We assume that data instances are sampled from some set of scopes, each of which is associated with some data distribution.
  - Different distributions share a probabilistic model for some set of global features, but can contain a different probabilistic model for a scope-specific set of local features.
  - These local features may be rarely or never seen in the scopes comprising the training data.

- Let X denote the global features, Z the local features, and Y the class variable. For each global feature Xi, there is a parameter θi. Additionally, for each scope S and each local feature Zi, there is a parameter θiS.
- Then the distribution of Y given all the features and weights is the logistic model

  P(y | x, z, θ, θS) ∝ exp( Σi θi y xi + Σi θiS y zi )
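
A minimal sketch of this conditional (variable names are mine; assuming y ∈ {-1, +1}):

  import numpy as np

  # P(y | x, z) prop. exp( y * (theta . x + thetaS . z) ); the two-label
  # normalizer e^s + e^-s gives the logistic form below.
  def p_label(y, x, z, theta, thetaS):
      s = theta @ x + thetaS @ z
      return 1.0 / (1.0 + np.exp(-2.0 * y * s))

  print(p_label(+1, np.array([1.0, 0.0]), np.array([1.0]),
                np.array([0.7, -0.2]), np.array([0.4])))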

Probabilistic model

- We assume that the global weights can be learned from the training data, so their values are fixed when we encounter a new scope. The local feature weights are unknown and are treated as hidden variables in the graphical model.
- Idea
  - The evidence from the global features about the labels of some of the instances lets us modify our beliefs about the role of the local features present in those instances, so as to be consistent with the labels. By learning about the roles of these features, we can then propagate this information to improve accuracy on instances that are harder to classify using global features alone.
- To implement this idea, we define a joint distribution over θS and y1, . . . , ym.
- Why use Markov random fields?
  - Here the associations between the variables are correlational rather than causal. Markov random fields are used to model spatial interactions or interacting features.

Markov Network

- Let V = (Vd, Vc) denote a set of random variables, where Vd are discrete and Vc are continuous variables, respectively.
- A Markov network over V defines a joint distribution over V, assigning a density over Vc for each possible assignment vd to Vd.
- A Markov network M is an undirected graph whose nodes correspond to V.
- It is parameterized by a set of potential functions f1(C1), . . . , fl(Cl) such that each Ci ⊆ V is a fully connected subgraph, or clique, in M, i.e., each Vi, Vj ∈ Ci are connected by an edge in M.
- Here we assume that each f(C) is a log-quadratic function.
- The Markov network then represents the distribution

  P(v) ∝ f1(c1) · · · fl(cl)

- In our case the log-quadratic model consists of three types of potentials
  - 1) f(θi, Yj, Xij) = exp(θi Yj Xij)
    - relates each global feature Xij in instance j to its weight θi and the class variable Yj of that instance.
  - 2) f(θiS, Yj, Zij) = exp(θiS Yj Zij)
    - relates the local feature Zij to its weight θiS and the label Yj.
  - 3) Finally, as the local feature weights are assumed to be hidden, we introduce a prior over their values, of the form f(θiS) ∝ exp(−(θiS − µi)²/2σ²), a Gaussian with mean µi.
- Overall, our model specifies a joint distribution as follows

  P(θS, y1, . . . , ym | x, z) ∝ Πi f(θiS) · Πj Πi f(θi, Yj, Xij) · Πj Πi f(θiS, Yj, Zij)
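
A small sketch of this joint on the log scale (my reading of the formula; variable names and shapes are my assumptions):

  import numpy as np

  # Unnormalized log-joint over local weights thetaS and labels y, following
  # the three potential types above. x: (m, n_glob), z: (m, n_loc),
  # y: (m,) in {-1, +1}; theta are the fixed global weights.
  def log_joint(thetaS, y, x, z, theta, mu, sigma2):
      prior = -0.5 * ((thetaS - mu) ** 2).sum() / sigma2  # Gaussian prior potentials
      glob = (y * (x @ theta)).sum()                      # global potentials
      loc = (y * (z @ thetaS)).sum()                      # local potentials
      return prior + glob + loc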

Markov network for two instances, two global features and three local features

- The graph can be simplified further when we account for variables whose values are fixed.
- The global feature weights are learned from the training data, so their values are fixed, and we also know all the feature values.
- The resulting Markov network is shown below, assuming that the instance (x1, z1, y1) contains the local features Z1 and Z2, and the instance (x2, z2, y2) contains the local features Z2 and Z3.

  [Figure: Markov network over the labels Y1, Y2 and the local feature weights θ1S, θ2S, θ3S]

- This can be reduced further: when Zij = 0, there is no interaction between Yj and the weight variable θiS.
- In this case we can simply omit the edge between θiS and Yj.
- The resulting Markov network is shown below.

  [Figure: reduced Markov network, in which Y1 is connected to θ1S and θ2S, and Y2 to θ2S and θ3S]

- In this model, the labels of all of the instances are correlated with the local feature weights of the features they contain, and thereby with each other. Thus, for example, if we obtain evidence (from global features) about the label Y1, it changes our posterior beliefs about the local feature weight θ2S, which in turn changes our beliefs about the label Y2. Thus, by running probabilistic inference over this graphical model, we obtain updated beliefs both about the local feature weights and about the instance labels.
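
To make this propagation concrete, here is a toy sketch (my own code; the paper uses Expectation Propagation, while this integrates the Gaussian local weights analytically and enumerates the two labels exactly):

  import itertools
  import numpy as np

  # Exact inference on a tiny scoped-learning network: enumerate label
  # assignments and integrate out each Gaussian local weight, using
  # int N(th; mu, s2) exp(th*a) dth = exp(mu*a + s2*a^2/2).
  theta = np.array([1.0, -0.5])        # fixed global weights (toy values)
  x = np.array([[1.0, 0.0],            # instance 1: clear global evidence
                [0.2, 0.1]])           # instance 2: ambiguous globally
  z = np.array([[1, 1, 0],             # instance 1 contains Z1, Z2
                [0, 1, 1]])            # instance 2 contains Z2, Z3
  mu, sigma2 = np.zeros(3), 1.0        # prior over the local weights

  def log_score(y):                    # unnormalized log P(y), thetaS integrated out
      a = z.T @ y                      # a_i = sum_j y_j Z_ij
      return float(y @ (x @ theta) + mu @ a + 0.5 * sigma2 * (a ** 2).sum())

  labelings = [np.array(v) for v in itertools.product([-1, 1], repeat=2)]
  logw = np.array([log_score(y) for y in labelings])
  p = np.exp(logw - logw.max()); p /= p.sum()
  for y, prob in zip(labelings, p):
      print(y, round(float(prob), 3))
  # The shared feature Z2 couples Y1 and Y2: strong global evidence for Y1
  # shifts the posterior over theta2S, which shifts the belief about Y2.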

Learning the Model

- Learning Global Feature Weights
  - We simply learn these parameters from the training data, using standard logistic regression: maximum-likelihood (ML) estimation finds the weights θ that maximize the conditional likelihood of the labels given the global features.
- Learning Local Feature Distributions
  - We can exploit patterns in how local features behave by learning a model that predicts the prior over the local feature weights using meta-features ("features of features"). More precisely, we learn a model that predicts the prior mean µi for θiS from some set of meta-features mi. As our predictive model for the mean µi we use a linear regression model, setting µi = w · mi. (A sketch of both learning steps follows this list.)
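
A hedged sketch of the two learning steps (library choice, data shapes, and the use of learned training-feature weights as regression targets are my assumptions):

  import numpy as np
  from sklearn.linear_model import LogisticRegression, LinearRegression

  rng = np.random.default_rng(0)
  X_train = rng.random((100, 20))           # global feature matrix (toy data)
  y_train = rng.integers(0, 2, 100)         # labels

  # 1) Global feature weights via standard ML logistic regression.
  glob = LogisticRegression().fit(X_train, y_train)
  theta = glob.coef_.ravel()

  # 2) Prior means from meta-features: fit mu_i = w . m_i by linear
  #    regression, trained to predict the learned weights of the training
  #    features from their meta-feature vectors M (one row per feature).
  M = rng.random((20, 5))
  prior = LinearRegression().fit(M, theta)
  mu_unseen = prior.predict(rng.random((3, 5)))  # prior means for unseen features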

Using the model

- Step 1
  - Given a training set, we first learn the model. In the training set, local and global features are treated identically. When applying the model to the test set, however, our first decision is to determine the set of local and global features (one plausible splitting rule is sketched after this list).
- Step 2
  - Our next step is to generate the Markov network for the test set. Probabilistic inference over this model infers the effect of the local features.
- Step 3
  - We use Expectation Propagation (EP) for inference. It maintains approximate beliefs (marginals) over the nodes of the Markov network and iteratively adjusts them to achieve local consistency.
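
Step 1 leaves open how to split the features; one plausible rule (my assumption, not the paper's exact procedure) is by training-set frequency:

  from collections import Counter

  # Features seen often in training become 'global'; features that are rare
  # or unseen in training but occur in the test scope are treated as 'local'.
  def split_features(train_docs, test_docs, min_train_count=5):
      train_counts = Counter(w for doc in train_docs for w in set(doc))
      test_vocab = {w for doc in test_docs for w in set(doc)}
      global_feats = {w for w, c in train_counts.items() if c >= min_train_count}
      local_feats = test_vocab - global_feats
      return global_feats, local_feats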

Experimental Results

- Reuters
  - The Reuters news articles data set contains a substantial number of documents hand-labeled into the categories grain, crude, trade, and money-fx.
  - Using this data set, six experimental setups are created, one for each possible pairing of the four chosen categories.
  - The resulting sequence is divided into nine time segments with roughly the same number of documents in each segment.
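
For concreteness, the six setups are simply all unordered pairs of the four categories:

  from itertools import combinations

  categories = ["grain", "crude", "trade", "money-fx"]
  setups = list(combinations(categories, 2))   # 6 binary classification setups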


- WebKB2
  - This data set consists of hand-labeled web pages from the Computer Science department web sites of four schools (Berkeley, CMU, MIT and Stanford), categorized into faculty, student, course and organization.
  - Six experimental setups are created, one for each possible pairing of the four categories.