Title: Regression trees and regression graphs: Efficient estimators for Generalized Additive Models
Slide 1: Regression Trees and Regression Graphs: Efficient Estimators for Generalized Additive Models
- Adam Tauman Kalai
- TTI-Chicago
Slide 2: Outline
- Generalized Additive Models (GAM)
- Computationally efficient regression model [Valiant; Kearns-Schapire]
- Thm: the regression graph algorithm efficiently learns GAMs [new]
- Regression tree algorithm
- Regression graph algorithm [Mansour-McAllester]
- Correlation boosting [new]
Slide 3: Generalized Additive Models [Hastie-Tibshirani]
- Distribution D over X × Y, X ⊆ R^d, Y ⊆ R, with
  f(x) = E_D[y|x] = u(f_1(x^(1)) + f_2(x^(2)) + ... + f_d(x^(d))),
  for monotonic u: R → R and arbitrary f_i: R → R
- e.g., generalized linear models: u(w·x) for monotonic u
  - linear/logistic models
- e.g., f(x) = e^{||x||^2} = e^{x^(1)^2 + x^(2)^2 + ... + x^(d)^2}
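As a concrete illustration (my own sketch, not from the talk), here is a minimal Python evaluation of a GAM of this form; `gam_predict`, the weights `w`, and the sigmoid link are all hypothetical choices, exhibiting logistic regression as the special case u = sigmoid, f_i(z) = w_i·z:

```python
import numpy as np

def gam_predict(x, fs, u):
    """Evaluate a generalized additive model u(sum_i f_i(x^(i))).

    x  : 1-D array of features (x^(1), ..., x^(d))
    fs : list of d univariate functions f_i
    u  : monotonic link function u: R -> R
    """
    return u(sum(f(xi) for f, xi in zip(fs, x)))

# Special case: logistic regression is the GAM with
# f_i(z) = w_i * z and u = sigmoid (hypothetical weights).
w = np.array([0.5, -1.0, 2.0])
fs = [lambda z, wi=wi: wi * z for wi in w]   # default arg avoids late binding
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

x = np.array([1.0, 0.5, -0.25])
print(gam_predict(x, fs, sigmoid))           # equals sigmoid(w @ x)
```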
Slide 4: Non-Hodgkin's Lymphoma International Prognostic Index [NEJM '93]
- Risk factors: age > 60, # of sites > 1, performance status > 1, LDH > normal, stage > 2
Slide 5: Setup
- X ⊆ R^d, Y = [0,1]
- training sample (x_1,y_1), ..., (x_n,y_n)
- true error ε(h) = E_D[(h(x)-y)^2]
Slide 6: Computationally Efficient Regression [Kearns-Schapire]
- F: a family of target functions
- Definition: learning algorithm A efficiently learns F if, for some c > 0: for every distribution D with f(x) = E_D[y|x] ∈ F, given n examples, A outputs h such that, with probability ≥ 1-δ,
  true error ε(h) = E_D[(h(x)-y)^2] ≤ E_D[(f(x)-y)^2] + poly(|f|, 1/δ)/n^c
- A's runtime must be poly(n, |f|)
Slide 7: Properties of M.S.E.
- E[(h(x)-y)^2] = E[((h(x)-f(x)) + (f(x)-y))^2]
  = E[(h(x)-f(x))^2] + E[(f(x)-y)^2] + 2·E[(h(x)-f(x))(f(x)-y)],
  and the cross term vanishes because f(x) = E[y|x]
- ⇒ h = f minimizes E[(h(x)-y)^2]
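The slide asserts the cross term drops out; here is a worked version of that step (the standard argument, conditioning on x):

```latex
\begin{align*}
\mathbb{E}\big[(h(x)-f(x))(f(x)-y)\big]
  &= \mathbb{E}_x\Big[(h(x)-f(x))\,\mathbb{E}\big[f(x)-y \mid x\big]\Big]\\
  &= \mathbb{E}_x\Big[(h(x)-f(x))\big(f(x)-\mathbb{E}[y \mid x]\big)\Big] = 0,
\end{align*}
```

since f(x) = E[y|x] by definition. Hence E[(h(x)-y)^2] = E[(h(x)-f(x))^2] + E[(f(x)-y)^2], which is minimized exactly at h = f.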
Slide 8: Outline
- Generalized Additive Models (GAM)
- Computationally efficient regression model [Valiant; Kearns-Schapire]
- Thm: the regression graph algorithm efficiently learns GAMs [new]
- Regression tree algorithm
- Regression graph algorithm [Mansour-McAllester]
- Correlation boosting [new]
Slide 9: Results for GAMs [new]
[Figure: a training sample of n examples in X × [0,1], X ⊆ R^d, is fed to the Regression Graph Learner, which outputs a hypothesis h: R^d → [0,1].]
- Thm: the regression graph learner efficiently learns GAMs
  - ∀ dist. D over X × Y with E_D[y|x] = f(x) ∈ GAM, ∀ δ, with probability ≥ 1-δ,
    E_D[(h(x)-y)^2] ≤ E_D[(f(x)-y)^2] + O(LV·log(dn/δ)/n^{1/7})
  - runtime poly(n, d)
Slide 10: Results for GAMs [new]
- f(x) = u(Σ_i f_i(x^(i)))
- u: R → R monotonic and L-Lipschitz (L = max_z |u'(z)|)
- f_i: R → R of bounded total variation: V = Σ_i ∫ |f_i'(z)| dz
- Thm: the regression graph learner efficiently learns GAMs
  - ∀ dist. D over X × Y with E_D[y|x] = f(x) ∈ GAM, with probability ≥ 1-δ,
    E_D[(h(x)-y)^2] ≤ E_D[(f(x)-y)^2] + O(LV·log(dn/δ)/n^{1/7})
  - runtime poly(n, d)
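To make L and V tangible, a small numerical sketch (my own, not from the talk) that estimates the Lipschitz constant of a link u and the total variation of a component f_i on a fine grid:

```python
import numpy as np

def total_variation(f, lo, hi, m=10_000):
    """Numerical total variation of f on [lo, hi]: sum of |increments|."""
    z = np.linspace(lo, hi, m)
    return np.abs(np.diff(f(z))).sum()

def lipschitz_const(u, lo, hi, m=10_000):
    """Numerical Lipschitz constant of u: max slope over a fine grid."""
    z = np.linspace(lo, hi, m)
    return np.abs(np.diff(u(z)) / np.diff(z)).max()

# e.g. u = sigmoid and f_i(z) = z^2 (hypothetical choices):
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
print(lipschitz_const(sigmoid, -5, 5))           # ~ 0.25 = max sigmoid'
print(total_variation(lambda z: z ** 2, -1, 1))  # ~ 2: down 1, then up 1
```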
Slide 11: Results for GAMs [new]
[Figure: the same training sample is fed to the Regression Tree Learner, which outputs a hypothesis h: R^d → [0,1].]
- Thm: the regression tree learner inefficiently learns GAMs
  - ∀ dist. D over X × Y with E_D[y|x] = f(x) ∈ GAM, with probability ≥ 1-δ,
    E_D[(h(x)-y)^2] ≤ E_D[(f(x)-y)^2] + O(LV·(log(d)/log(n))^{1/4})
  - runtime poly(n, d)
- ("Inefficiently" because the excess error decays only polylogarithmically in n: driving it below a target ε requires exponentially many samples, even though each run takes poly(n, d) time.)
Slide 12: Regression Tree Algorithm
- Regression tree RT: R^d → [0,1]
- Training sample (x_1,y_1), (x_2,y_2), ..., (x_n,y_n) ∈ R^d × [0,1]
[Figure: a single leaf containing all of (x_1,y_1), (x_2,y_2), ..., predicting avg(y_1, y_2, ..., y_n).]
Slide 13: Regression Tree Algorithm
- Regression tree RT: R^d → [0,1]
- Training sample (x_1,y_1), (x_2,y_2), ..., (x_n,y_n) ∈ R^d × [0,1]
[Figure: the root tests "x^(j) ≥ θ?"; the left leaf holds {(x_i,y_i) : x_i^(j) < θ} and predicts avg(y_i : x_i^(j) < θ), the right leaf holds {(x_i,y_i) : x_i^(j) ≥ θ} and predicts avg(y_i : x_i^(j) ≥ θ).]
Slide 14: Regression Tree Algorithm
- Regression tree RT: R^d → [0,1]
- Training sample (x_1,y_1), (x_2,y_2), ..., (x_n,y_n) ∈ R^d × [0,1]
[Figure: the right leaf is split again on a second test "x^(j') ≥ θ'?", giving three leaves: {x^(j) < θ} predicting avg(y_i : x_i^(j) < θ), plus {x^(j) ≥ θ ∧ x^(j') < θ'} and {x^(j) ≥ θ ∧ x^(j') ≥ θ'}, each predicting the average y over its cell.]
Slide 15: Regression Tree Algorithm
- n = amount of training data
- Put all data into one leaf
- Repeat until size(RT) ≥ n/log^2(n):
  - Greedily choose a leaf and a split "x^(j) ≥ θ" to minimize ε(RT, train) = Σ_i (RT(x_i)-y_i)^2 / n
  - Divide the data in the split node into the two new leaves
- (Minimizing the training squared error this way is equivalent to the Gini splitting criterion.)
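A minimal Python sketch of this greedy loop (my own illustration, with an exhaustive threshold search and no attention to efficiency):

```python
import numpy as np

def sse(idx, y):
    """Squared error if the points in idx share one prediction (their mean)."""
    return ((y[idx] - y[idx].mean()) ** 2).sum() if len(idx) else 0.0

def grow_tree(X, y):
    """Greedily split the (leaf, feature, threshold) triple that most reduces
    training squared error, until the tree has >= n / log^2(n) leaves."""
    n, d = X.shape
    leaves = [np.arange(n)]  # each leaf = indices of the training points it holds
    while len(leaves) < n / np.log(n) ** 2:
        best = None  # (error reduction, leaf position, left indices, right indices)
        for pos, idx in enumerate(leaves):
            for j in range(d):
                for theta in np.unique(X[idx, j])[1:]:  # candidate "x^(j) >= theta"
                    left = idx[X[idx, j] < theta]
                    right = idx[X[idx, j] >= theta]
                    gain = sse(idx, y) - sse(left, y) - sse(right, y)
                    if best is None or gain > best[0]:
                        best = (gain, pos, left, right)
        if best is None or best[0] <= 0:
            break  # no split improves the training error
        _, pos, left, right = best
        leaves[pos:pos + 1] = [left, right]  # replace the leaf by its two children
    return leaves  # each leaf predicts y[leaf].mean()
```

Each leaf predicts the mean label of its cell, so minimizing summed squared error per split is exactly the slide's ε(RT, train) criterion.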
Slide 16: Regression Graph Algorithm [Mansour-McAllester]
- Regression graph RG: R^d → [0,1]
- Training sample (x_1,y_1), (x_2,y_2), ..., (x_n,y_n) ∈ R^d × [0,1]
[Figure: a depth-two structure with root test "x^(j) ≥ θ?" followed by two more tests, giving four leaves, one per outcome pair; each leaf predicts the average y over its cell, e.g. avg(y_i : x^(j) ≥ θ ∧ x^(j') ≥ θ').]
Slide 17: Regression Graph Algorithm [Mansour-McAllester]
- Regression graph RG: R^d → [0,1]
- Training sample (x_1,y_1), (x_2,y_2), ..., (x_n,y_n) ∈ R^d × [0,1]
[Figure: the two middle leaves of the previous slide are merged into a single node holding {(x_i,y_i) : exactly one of the two tests passes}, predicting avg(y_i : (x^(j) < θ ∧ x^(j') ≥ θ') ∨ (x^(j) ≥ θ ∧ x^(j') < θ')); the structure is now a DAG rather than a tree.]
Slide 18: Regression Graph Algorithm [Mansour-McAllester]
- Put all n training data into one leaf
- Repeat until size(RG) ≥ n^{3/7}:
  - Split: greedily choose a leaf and a split "x^(j) ≥ θ" to minimize ε(RG, train) = Σ_i (RG(x_i)-y_i)^2 / n; divide the data in the split node into the two new leaves
  - Let Δ be the decrease in ε(RG, train) from this split
  - Merge(s): greedily choose two leaves whose merger increases ε(RG, train) as little as possible; repeat merging while the total increase in ε(RG, train) from merges is ≤ Δ/2
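A minimal Python sketch of this split-then-merge loop (again my own illustration; `best_split` is the same exhaustive search as in the tree sketch above):

```python
import numpy as np

def sse(idx, y):
    """Squared error if the points in idx share one prediction (their mean)."""
    return ((y[idx] - y[idx].mean()) ** 2).sum() if len(idx) else 0.0

def best_split(leaves, X, y):
    """Best greedy split over all (leaf, feature, threshold) triples."""
    best = None
    for pos, idx in enumerate(leaves):
        for j in range(X.shape[1]):
            for theta in np.unique(X[idx, j])[1:]:
                left, right = idx[X[idx, j] < theta], idx[X[idx, j] >= theta]
                gain = sse(idx, y) - sse(left, y) - sse(right, y)
                if best is None or gain > best[0]:
                    best = (gain, pos, left, right)
    return best

def grow_graph(X, y):
    """Split-then-merge loop in the style of the slide's algorithm."""
    n = len(y)
    leaves = [np.arange(n)]
    while len(leaves) < n ** (3 / 7):
        split = best_split(leaves, X, y)
        if split is None or split[0] <= 0:
            break
        gain, pos, left, right = split
        leaves[pos:pos + 1] = [left, right]  # split step; Delta = gain
        budget = gain / 2                    # merges may cost at most Delta/2 total
        while len(leaves) > 2:
            # cheapest pairwise merge: increase in SSE from pooling two leaves
            cost, a, b = min(
                (sse(np.concatenate([leaves[i], leaves[k]]), y)
                 - sse(leaves[i], y) - sse(leaves[k], y), i, k)
                for i in range(len(leaves)) for k in range(i + 1, len(leaves))
            )
            if cost > budget:
                break
            leaves[a] = np.concatenate([leaves[a], leaves[b]])  # merge b into a
            del leaves[b]
            budget -= cost
    return leaves  # merged leaves are what make the structure a DAG, not a tree
```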
Slide 19: Two Useful Lemmas
- Uniform generalization bound: for any n, with high probability over training sets (x_1,y_1), ..., (x_n,y_n), it holds simultaneously for every regression graph R
- Existence of a correlated split: there always exists a split I(x^(i) ≥ θ) s.t. ...
Slide 20: Motivating Natural Example
- X = {0,1}^d, f(x) = (x^(1) + x^(2) + ... + x^(d))/d, uniform D
- Size(RT) ≈ exp(Size(RG)^c); e.g., d = 4:
[Figure: for d = 4, a regression tree must test x^(1) > 1/2, x^(2) > 1/2, x^(3) > 1/2, x^(4) > 1/2 along every path, a complete tree with 16 leaves labeled 0, .25, .5, .75, or 1 (the fraction of coordinates exceeding 1/2). A regression graph can instead merge all nodes that agree on the running count of coordinates exceeding 1/2, needing only one node per (depth, count) pair.]
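To make the gap concrete, a small count (my own, extrapolating the d = 4 picture to general d): the complete tree needs 2^d leaves, while a layered graph that only tracks the running count of ones needs one node per (depth, count) pair:

```python
def tree_size(d):
    """Leaves of the complete tree that tests x^(1), ..., x^(d) on every path."""
    return 2 ** d

def graph_size(d):
    """Nodes of the layered DAG: after reading i coordinates, only the running
    count of ones (0..i) matters for f(x) = sum(x)/d, so level i has i+1 nodes."""
    return sum(i + 1 for i in range(d + 1))  # = (d+1)(d+2)/2

for d in (4, 10, 20):
    print(d, tree_size(d), graph_size(d))    # d=20: 1,048,576 leaves vs 231 nodes
```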
Slide 21: Regression Boosting
- Incremental learning: if you find anything with positive correlation with y, then regression graphs make progress
- Weak regression implies strong regression, i.e., small correlations can efficiently be combined to achieve correlation near 1 (error near 0)
- Generalizes binary classification boosting [Kearns-Valiant; Schapire; Mansour-McAllester; ...]
Slide 22: Conclusions
- Generalized additive models are very general
- Regression graphs, i.e., regression trees with merging, provably estimate GAMs using polynomial data and runtime
- Regression boosting generalizes binary classification boosting
- Future work:
  - Improve the algorithm/analysis
  - Room for interesting work in statistics ∩ computational learning theory