Slide 1: Learning to Predict Combinatorial Structures
Shankar Vembu
Joint work with Thomas Gärtner
Slide 2: Learning Structured Prediction Models
- Exact vs. approximate inference
Slides 3-4: Learning Structured Prediction Models
- Is it possible to learn structured prediction models without using any inference algorithm?
- Yes, for combinatorial structures.
Slide 5: Combinatorial Structures
- Partially ordered sets
- Permutations (label ranking)
- Directed cycles
- Graphs
- Multiclass, multilabel, ordinal, and hierarchical classification
Slide 6: Outline
- Structured prediction and limitations of existing models
- Training combinatorial structures
- Constructing combinatorial structures
- Application settings
Slide 7: Structured Prediction
- Input space $\mathcal{X}$ and output space $\mathcal{Y}$
- Training data $\{(x_i, y_i)\}_{i=1}^m \subseteq \mathcal{X} \times \mathcal{Y}$
- Joint scoring function $h: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$
- Prediction: $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} h(x, y)$
Slide 8: Discriminative Structured Prediction
- Discriminative training asks for $h(x_i, y_i) > h(x_i, z)$ for all $z \in \mathcal{Y} \setminus \{y_i\}$: an exponential number of constraints. How can these be handled?
Slide 9: Assumptions
- Decoding (polynomial time): given $h, x$, find $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} h(x, y)$
- Separation (polynomial time): given $h, x, y$, find any $z \in \mathcal{Y}$ with $h(x, z) > h(x, y)$, or prove that none exists
- Optimality (in NP): given $h, x, y$, decide whether $h(x, y) \ge h(x, z)$ for all $z \in \mathcal{Y}$
- Decoding is the strongest assumption and optimality is the weakest (a brute-force illustration follows).
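For intuition, here is a minimal brute-force sketch of the three problems on a toy, explicitly enumerable output space. The names OUTPUTS and score, and the linear form of h, are illustrative assumptions; for genuinely combinatorial spaces these exhaustive loops are exactly what is infeasible:

```python
from itertools import product

OUTPUTS = list(product([0, 1], repeat=4))  # toy output space Y

def score(h, x, y):
    """Joint scoring function h(x, y); here h is just a weight vector."""
    return sum(hi * xi * yi for hi, xi, yi in zip(h, x, y))

def decode(h, x):
    """Decoding: find argmax over y in Y of h(x, y)."""
    return max(OUTPUTS, key=lambda y: score(h, x, y))

def separate(h, x, y):
    """Separation: return any z with h(x, z) > h(x, y), else None."""
    target = score(h, x, y)
    return next((z for z in OUTPUTS if score(h, x, z) > target), None)

def optimal(h, x, y):
    """Optimality: decide whether no z scores strictly higher than y."""
    return separate(h, x, y) is None

h, x = [1.0, -2.0, 0.5, 0.3], [1.0, 1.0, 1.0, 1.0]
y_hat = decode(h, x)        # (1, 0, 1, 1): keep positive contributions
assert optimal(h, x, y_hat)
```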
Slide 10: Optimality vs. Non-optimality
- Optimality: given $h, x, y$, decide whether $h(x, y) \ge h(x, z)$ for all $z \in \mathcal{Y}$
- Non-optimality: given $h, x, y$, decide whether there exists $z \in \mathcal{Y}$ with $h(x, z) > h(x, y)$
Slides 11-12: Dicycles - Optimality vs. Non-optimality
- Optimality (is there no longer cycle?): given $h, x, y$, decide whether no directed cycle $z$ satisfies $h(x, z) > h(x, y)$
- Non-optimality (is there any longer cycle?): given $h, x, y$, decide whether some directed cycle $z$ satisfies $h(x, z) > h(x, y)$
- Proposition: The non-optimality problem for directed cycles is NP-complete.
Slide 13: Optimality vs. Non-optimality
- Optimality is the complement of non-optimality; if non-optimality is NP-complete, then optimality is coNP-complete.
- We are interested mostly in problems where non-optimality is NP-complete.
Slide 14: Training Combinatorial Structures
Slides 15-17: Loss Functions
- AUC-loss: $\ell_{\mathrm{auc}}(h, x, y) = \sum_{z \in \mathcal{Y} \setminus \{y\}} \mathbb{1}\left[h(x, z) - h(x, y) \ge 0\right]$, the number of incorrect structures scored at least as high as the correct one
- Exponential loss: $\ell_{\exp}(h, x, y) = \sum_{z \in \mathcal{Y} \setminus \{y\}} \exp\left(h(x, z) - h(x, y)\right)$, a convex upper bound on the AUC-loss
- 2nd-order Taylor expansion at 0: $\exp(a) \approx 1 + a + a^2/2$, turning the loss into a quadratic in the scores (see the numeric sketch below)
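A small numeric sketch of the surrogate (the function names and the toy scores are mine): the quadratic expansion stays close to the exponential loss when score differences are near zero, while being a polynomial in the scores:

```python
import math

def exp_loss(scores, correct):
    """Exponential upper bound on the AUC-loss: sum over wrong outputs z
    of exp(h(x, z) - h(x, y))."""
    return sum(math.exp(s - scores[correct])
               for z, s in enumerate(scores) if z != correct)

def quad_loss(scores, correct):
    """The same sum with exp(a) replaced by its 2nd-order Taylor
    expansion at 0: 1 + a + a^2 / 2. Quadratic in the scores."""
    total = 0.0
    for z, s in enumerate(scores):
        if z == correct:
            continue
        a = s - scores[correct]
        total += 1.0 + a + a * a / 2.0
    return total

scores = [0.3, -0.1, 0.2, 0.1]          # h(x, z) for four toy outputs
print(exp_loss(scores, correct=0))      # ~2.39
print(quad_loss(scores, correct=0))     # ~2.41, close to the exact loss
```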
Slide 18: Regularised Risk Minimisation
- Assumption 1: the hypothesis space is a tensor product of Hilbert spaces, $\mathcal{H} = \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}$, so that a representer theorem applies
- Assumption 2: $\mathcal{H}_{\mathcal{Y}}$ has a finite basis, i.e., there is a finite-dimensional output embedding $\psi: \mathcal{Y} \to \mathbb{R}^d$
Slide 19: Regularised Risk Minimisation
- Minimise the regularised risk $\lambda \lVert h \rVert^2_{\mathcal{H}} + \sum_{i=1}^m \ell(h, x_i, y_i)$ over $h \in \mathcal{H}$
Slide 20: Optimisation with Finite Output Embedding
- Let $\{e_1, \ldots, e_d\}$ be the canonical orthonormal basis of $\mathbb{R}^d$. Writing $h(x, y) = \sum_{l=1}^d \langle \psi(y), e_l \rangle \, f_l(x)$ with $f_l \in \mathcal{H}_{\mathcal{X}}$, optimise the regularised risk over $f_1, \ldots, f_d$.
Slides 21-24: Recipe for Training Combinatorial Structures
- Finite-dimensional output embedding $\psi: \mathcal{Y} \to \mathbb{R}^d$
- Polynomial-time computation of the sums $\sum_{z \in \mathcal{Y}} \psi(z)$ and $\sum_{z \in \mathcal{Y}} \psi(z)\psi(z)^\top$ over the exponentially large output space (a multilabel instance is sketched below)
- Polynomially-sized unconstrained quadratic program
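To make the second ingredient concrete, here is a sketch for the multilabel case, under the assumption that $\psi$ is the 0/1 label-indicator embedding and $\mathcal{Y}$ contains all $2^d$ label subsets: both sums then have closed forms, verified below against brute-force enumeration for a small d:

```python
import numpy as np
from itertools import product

def multilabel_sums(d):
    """Closed-form sums over all 2^d label subsets with the 0/1
    indicator embedding psi: each label appears in 2^(d-1) subsets,
    and each pair of labels co-occurs in 2^(d-2) subsets."""
    first = np.full(d, 2 ** (d - 1), dtype=float)
    second = np.full((d, d), 2 ** (d - 2), dtype=float)
    np.fill_diagonal(second, 2 ** (d - 1))
    return first, second

# Brute-force check for a small d -- this enumeration is the
# exponential sum the closed form avoids.
d = 4
psi = np.array(list(product([0, 1], repeat=d)), dtype=float)
assert np.allclose(psi.sum(axis=0), multilabel_sums(d)[0])
assert np.allclose(psi.T @ psi, multilabel_sums(d)[1])
```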
Slide 25: Constructing Combinatorial Structures
Slide 26: Approximation Algorithms
- Decoding: given $h, x$, find $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} h(x, y)$
- Approximate decoding: given $h, x$, find any $\hat{y} \in \mathcal{Y}$ whose score is provably close to $\max_{y \in \mathcal{Y}} h(x, y)$
Slide 27: Approximation Measure
- Maximisation problem with approximation factor 0.65: the algorithm returns a solution $s$ with $c(s) \ge 0.65 \cdot c(s_{\max})$.
[Figure: a number line from $c(s_{\min})$ at 0 to $c(s_{\max})$ at 1, with the returned solution's value $c(s)$ marked at 0.65 and a reference mark at 0.5.]
Slide 28: z-approximation
- A z-approximation algorithm returns a solution $s$ with $c(s_{\max}) - c(s) \le z \cdot (c(s_{\max}) - c(s_{\min}))$, i.e., quality is measured relative to the full range of solution values.
- z-approximation is better suited when solutions with negative values are possible. It is invariant to (i) constant offsets, (ii) changing from maximisation to minimisation, and (iii) using the complement of binary variables (sketched below).
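A hedged sketch of this measure (the function names are mine, and the formulas follow the differential-approximation reading above): 0 means optimal, 1 means worst possible, and the asserted identities illustrate invariances (i) and (ii):

```python
def z_quality_max(c_s, c_worst, c_best):
    """Differential quality of a solution for a maximisation problem:
    distance from the optimum, relative to the full value range."""
    return (c_best - c_s) / (c_best - c_worst)

def z_quality_min(c_s, c_best, c_worst):
    """The same measure for a minimisation problem."""
    return (c_s - c_best) / (c_worst - c_best)

# (i) Invariance to constant offsets: shifting all values changes nothing.
assert z_quality_max(65, 0, 100) == z_quality_max(165, 100, 200)
# (ii) Invariance to max <-> min: negating the objective swaps the roles
# of best and worst but preserves the measure.
assert z_quality_max(65, 0, 100) == z_quality_min(-65, -100, 0)
```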
Slides 29-30: Decoding Sibling Systems
- Consider a set system $(\Sigma, \mathcal{Y})$
- with a sibling function $r: \mathcal{Y} \to \mathcal{Y}$ and
- an output map $\psi$ such that $\psi(y) + \psi(r(y))$ is the same constant vector for all $y \in \mathcal{Y}$
- Theorem: There is a 1/2-factor z-approximation algorithm for decoding sibling systems (illustrated below).
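The following sketch shows how such a guarantee can arise, under my reading of the sibling property (all 0/1 vectors as $\mathcal{Y}$, complementation as the sibling map, the identity as $\psi$); it is not presented as the authors' algorithm:

```python
import numpy as np

def decode_sibling(w, y):
    """Return the better of y and its sibling r(y) = complement of y.
    Since psi(y) + psi(r(y)) is the all-ones vector for every y,
    h(x, y) + h(x, r(y)) equals the same constant c for every y; the
    better of the two therefore scores at least c / 2, which lies at
    least halfway between the minimum and the maximum score: a
    1/2-factor z-approximation."""
    sib = 1 - y
    return max(y, sib, key=lambda z: float(w @ z))

w = np.array([0.5, -1.0, 2.0])
print(decode_sibling(w, np.array([1, 1, 0])))  # picks the complement [0 0 1]
```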
Slides 31-32: Decoding Independence Systems
- Consider a set system $(\Sigma, \mathcal{Y})$
- with $\mathcal{Y}$ closed under taking subsets: $z' \subseteq z \in \mathcal{Y} \Rightarrow z' \in \mathcal{Y}$
- Theorem: There is a [...]-factor z-approximation algorithm for decoding independence systems.
Slide 33: Application Settings
Slide 34: Multiclass
- Example: $\psi(y) = e_y$, the canonical basis vector of the predicted class
- Decoding is trivial: enumerate the classes and pick the highest-scoring one
Slide 35: Multilabel
- Output embedding: $\psi(y) \in \{0, 1\}^d$ indicates which of the $d$ labels are present
- Decoding decomposes over labels: keep every label with a positive score (see the sketch below)
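A minimal sketch of this decoder, assuming per-label scores obtained from the 0/1 indicator embedding:

```python
import numpy as np

def decode_multilabel(scores):
    """With the 0/1 embedding, the score of a label subset is the sum
    of its per-label scores, so the argmax over all 2^d subsets simply
    keeps every label whose score is positive."""
    return (scores > 0).astype(int)

print(decode_multilabel(np.array([0.7, -0.2, 1.3, -0.5])))  # [1 0 1 0]
```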
Slides 36-37: Dicycles
- Digraphs
- $(-1, 0, 1)$ adjacency matrix as the output embedding (one reading is sketched below)
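A hedged sketch of one way to realise that embedding (the function name and the vertex-tour representation are mine):

```python
import numpy as np

def dicycle_embedding(tour, n):
    """One reading of the (-1, 0, 1) adjacency matrix for a directed
    cycle given as a vertex tour: +1 where the cycle traverses the edge
    (u, v), -1 for the reverse edge (v, u), 0 elsewhere. The last vertex
    closes the cycle back to the first."""
    A = np.zeros((n, n))
    for u, v in zip(tour, tour[1:] + tour[:1]):
        A[u, v], A[v, u] = 1.0, -1.0
    return A

print(dicycle_embedding([0, 2, 1], n=3))
```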
Slide 38: And Others
- Ordinal regression
- Hierarchical classification
- Partially ordered sets
- Permutations
- Graphs
Slide 39: Experiments
Slide 40: Multilabel Classification
- Yeast dataset: 1500 training examples, 917 test examples, 14 labels
- Comparison with the multi-label SVM
Slide 41: Hierarchical Classification
- WIPO-alpha dataset: 1352 training examples, 358 test examples
- Number of nodes: 188; maximum depth: 3

Loss    | SVM   | H-SVM | H-RLS | H-M3 (Hamming) | H-M3 (Tree) | CSOP
0-1     | 87.2  | 76.2  | 72.1  | 70.9           | 65          | 51.1
Hamming | 1.84  | 1.74  | 1.69  | 1.67           | 1.73        | 1.84
Tree    | 0.053 | 0.051 | 0.05  | 0.05           | 0.048       | 0.046
Slide 42: Dicycle Policy Estimation
- Artificial setting: predicting the cyclic tour taken by different people
- Hidden policy for each person: s/he takes the route that maximises his/her reward
- Goal: estimate the hidden policy
Slide 43: Dicycle Policy Estimation
- Comparison with SVM-Struct using approximate inference
Slide 44: Ongoing Work
- Probabilistic models
  - Negative log-likelihood as the loss function
- Sampling techniques for combinatorial structures
  - Markov chain Monte Carlo methods (a minimal sketch follows below)
  - Provable guarantees for mixing time
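To fix ideas, here is a minimal Metropolis sketch for sampling label subsets (an assumption-laden toy, not the authors' sampler):

```python
import math
import random

def sample_subset(scores, steps=10000):
    """Sample a label subset y with probability proportional to
    exp(sum of the scores of included labels), proposing one coordinate
    flip per step and accepting with the Metropolis rule. Provable
    mixing-time guarantees are the open question on the slide; none are
    claimed here."""
    d = len(scores)
    y = [0] * d
    for _ in range(steps):
        i = random.randrange(d)
        delta = scores[i] if y[i] == 0 else -scores[i]  # score change of the flip
        if delta >= 0 or random.random() < math.exp(delta):
            y[i] ^= 1  # accept the proposed flip
    return y

print(sample_subset([0.7, -0.2, 1.3, -0.5]))
```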