Title: Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers
1. Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers
Victor Sheng, Foster Provost, Panos Ipeirotis
- Stern School of Business, New York University
2. Outsourcing KDD preprocessing
- Traditionally, data mining teams have invested substantial internal resources in data formulation, information extraction, cleaning, and other preprocessing
  - Raghu, from his Innovation Lecture: "the best you can expect are noisy labels"
- Now we can outsource preprocessing tasks, such as labeling, feature extraction, verifying information extraction, etc.
  - using Mechanical Turk, Rent-a-Coder, etc.
  - quality may be lower than expert labeling (much?)
  - but low costs can allow massive scale
- The ideas may also apply to focusing user-generated tagging, crowdsourcing, etc.
3. ESP Game (by Luis von Ahn)
4. Other free labeling schemes
- Open Mind initiative (www.openmind.org)
- Other GWAP (games with a purpose)
  - Tag a Tune
  - Verbosity (tag words)
  - Matchin (image ranking)
- Web 2.0 systems?
  - Can/should tagging be directed?
5. Noisy labels can be problematic
- Many tasks rely on high-quality labels for objects
  - learning predictive models
  - searching for relevant information
  - finding duplicate database records
  - image recognition/labeling
  - song categorization
- Noisy labels can lead to degraded task performance
6. Quality and Classification Performance
Here, labels are values for the target variable.
- Labeling quality increases → classification quality increases
[Figure: learning curves (classification accuracy vs. number of training examples) for labeler quality P = 1.0, 0.8, 0.6, 0.5]
7. Summary of results
- Repeated labeling can improve data quality and model quality (but not always)
- When labels are noisy, repeated labeling can be preferable to single labeling even when labels aren't particularly cheap
- When labels are relatively cheap, repeated labeling can do much better (omitted)
- Round-robin repeated labeling does well
- Selective repeated labeling improves substantially
8. Our Focus: Labeling Using Multiple Noisy Labelers
- Repeated labeling and data quality
- Repeated labeling and classification quality
- Selective repeated labeling
9. Majority Voting and Label Quality
- Ask multiple labelers; keep the majority label as the "true" label
- Quality is the probability of being correct
[Figure: majority-vote label quality vs. number of labelers, for individual labeler quality P = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0; P is the probability of an individual labeler being correct]
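The curves on this slide follow directly from the binomial distribution: with an odd number n of independent labelers, each correct with probability p, the majority label is correct exactly when more than half of them are. A minimal sketch (the function name is mine):

```python
from math import comb

def majority_quality(p: float, n: int) -> float:
    """Probability that a majority vote of n independent labelers,
    each correct with probability p, yields the correct label.
    Requires odd n so that ties cannot occur."""
    assert n % 2 == 1, "use an odd number of labelers to avoid ties"
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(n // 2 + 1, n + 1))
```

This reproduces the qualitative shape of the figure: quality rises with more labelers when p > 0.5 (e.g., majority_quality(0.8, 5) ≈ 0.942 > 0.8), stays flat at p = 0.5, and falls when p < 0.5.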
10. Tradeoffs for Modeling
- Get more labels → improve label quality → improve classification
- Get more examples → improve classification
[Figure: learning curves for labeler quality P = 1.0, 0.8, 0.6, 0.5]
11. Basic Labeling Strategies
- Single Labeling (SL)
  - get as many data points as possible, one label each
- Round-robin Repeated Labeling
  - Fixed Round Robin (FRR): keep labeling the same set of points
  - Generalized Round Robin (GRR): repeatedly label data points, giving the next label to the point with the fewest labels so far
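GRR as described above can be implemented with a min-heap keyed on the number of labels each point has so far; all names here are illustrative:

```python
import heapq

def generalized_round_robin(examples, acquire_label, budget):
    """Generalized Round Robin (GRR): spend `budget` label acquisitions,
    always giving the next label to the example with the fewest so far.
    `acquire_label(x)` asks one (noisy) labeler for a label of x."""
    # heap entries: (num_labels_so_far, tie_breaker, example_index)
    heap = [(0, i, i) for i in range(len(examples))]
    heapq.heapify(heap)
    labels = {i: [] for i in range(len(examples))}
    for _ in range(budget):
        count, tie, i = heapq.heappop(heap)
        labels[i].append(acquire_label(examples[i]))
        heapq.heappush(heap, (count + 1, tie, i))
    return labels
```

With `acquire_label` simulating a labeler of quality p (e.g., returning the true label with probability 0.6), this yields label multisets whose sizes never differ by more than one across examples.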
12. Fixed Round Robin vs. Single Labeling
[Figure: learning curves, FRR (100 examples) vs. SL]
p = 0.6 (labeling quality), 100 examples
With high noise, repeated labeling is better than single labeling.
13. Fixed Round Robin vs. Single Labeling
[Figure: learning curves, single labeling vs. FRR (50 examples)]
p = 0.8 (labeling quality), 50 examples
With low noise, more (single-labeled) examples are better.
14. Gen. Round Robin vs. Single Labeling
P = labeling quality, k = number of labels per example; here P = 0.6, k = 5
[Figure: learning curves, GRR vs. SL]
Repeated labeling is better than single labeling.
15. Tradeoffs for Modeling
- Get more labels → improve label quality → improve classification
- Get more examples → improve classification
[Figure: learning curves for labeler quality P = 1.0, 0.8, 0.6, 0.5]
16. Selective Repeated-Labeling
- We have seen:
  - with enough examples and noisy labels, getting multiple labels is better than single labeling
  - when we consider costly preprocessing, the benefit is magnified (omitted -- see paper)
- Can we do better than the basic strategies?
- Key observation: we have additional information to guide the selection of data for repeated labeling -- the current multiset of labels
  - Example: {+,-,+,+,-,+} vs. {+,+,+,+}
17. Natural Candidate: Entropy
- Entropy is a natural measure of label uncertainty
  - E(+,+,+,+,+,+) = 0
  - E(+,-,+,-,+,-) = 1
- Strategy: get more labels for examples with high-entropy label multisets
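The two example values above come straight from Shannon entropy of the empirical label distribution (a sketch; the function name is mine):

```python
from math import log2

def label_entropy(pos: int, neg: int) -> float:
    """Shannon entropy (in bits) of the empirical distribution
    of a binary label multiset with `pos` positives and `neg` negatives."""
    n = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:  # 0 * log(0) is taken as 0
            p = count / n
            h -= p * log2(p)
    return h

# label_entropy(6, 0) = 0.0  (all labels agree)
# label_entropy(3, 3) = 1.0  (maximum disagreement)
```

It also demonstrates the scale invariance criticized on the next slides: (3+, 2-) and (600+, 400-) have identical entropy despite very different evidence.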
18. What Not to Do: Use Entropy
Improves at first, but hurts in the long run.
19. Why Not Entropy?
- In the presence of noise, entropy will be high even with many labels
- Entropy is scale invariant
  - (3+, 2-) has the same entropy as (600+, 400-)
20. Estimating Label Uncertainty (LU)
- Observe +s and -s and compute Pr(+|observations) and Pr(-|observations)
- Label uncertainty = tail of the beta distribution below 0.5
[Figure: beta probability density function over the label proportion (0.0 to 1.0); S_LU is the shaded tail below 0.5]
21. Label Uncertainty
- p = 0.7
- 5 labels: (3+, 2-)
- Entropy ≈ 0.97
- CDF_b = 0.34
22. Label Uncertainty
- p = 0.7
- 10 labels: (7+, 3-)
- Entropy ≈ 0.88
- CDF_b = 0.11
23. Label Uncertainty
- p = 0.7
- 20 labels: (14+, 6-)
- Entropy ≈ 0.88
- CDF_b = 0.04
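The CDF values on slides 21-23 can be reproduced by evaluating a Beta(pos + 1, neg + 1) posterior (i.e., a uniform Beta(1,1) prior, which matches the slide numbers) at 0.5. For integer parameters the regularized incomplete beta function reduces to a binomial tail sum, so no special-function library is needed:

```python
from math import comb

def label_uncertainty(pos: int, neg: int) -> float:
    """S_LU: posterior probability that the true P(+) is below 0.5,
    under a Beta(pos + 1, neg + 1) posterior (uniform prior).
    Uses the identity, for integer a and b:
      I_x(a, b) = sum_{j=a}^{a+b-1} C(a+b-1, j) * x^j * (1-x)^(a+b-1-j),
    which at x = 0.5 simplifies to C(n, j) * 0.5^n terms."""
    a, b = pos + 1, neg + 1
    n = a + b - 1
    return sum(comb(n, j) * 0.5**n for j in range(a, n + 1))

# Reproduces the slide values:
# (3+, 2-)  -> 0.34   (5 labels)
# (7+, 3-)  -> 0.11   (10 labels)
# (14+, 6-) -> 0.04   (20 labels)
```

Unlike entropy, this score shrinks as consistent evidence accumulates, which is exactly the behavior slides 21-23 illustrate.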
24. Label Uncertainty vs. Round Robin
(Similar results across a dozen data sets.)
25. Recall: Gen. Round Robin vs. Single Labeling
P = labeling quality, k = number of labels per example; here P = 0.6, k = 5
[Figure: learning curves, GRR vs. SL]
Multi-labeling is better than single labeling.
26. Label Uncertainty vs. Round Robin
similar results across a dozen data sets
27. Another Strategy: Model Uncertainty (MU)
- Learning a model of the data provides an alternative source of information about label certainty
- Model uncertainty: get more labels for instances that cannot be modeled well
- Intuition?
  - for data quality: low-certainty regions may be due to incorrect labeling of the corresponding instances
  - for modeling: why improve training-data quality where the model is already certain?
28. Yet Another Strategy: Label & Model Uncertainty (LMU)
- Combine label and model uncertainty (LMU): avoid examples where either strategy is certain
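One simple way to realize "avoid examples where either strategy is certain" is to combine the two scores with a geometric mean, which is driven toward zero whenever either component is small. (Treat this particular combination rule as an assumption of the sketch; the names are mine.)

```python
from math import sqrt

def lmu_score(s_lu: float, s_mu: float) -> float:
    """Combine a label-uncertainty score and a model-uncertainty score.
    The geometric mean is small when EITHER score is small, matching
    the slide's rule: skip examples where either strategy is certain.
    (Assumed combination rule, not confirmed by the slides.)"""
    return sqrt(s_lu * s_mu)

# e.g., labels look uncertain but the model is already confident
# -> low priority for another label:
# lmu_score(0.9, 0.01) ≈ 0.095
```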
29. Comparison
Model Uncertainty alone also improves quality.
[Figure: label quality vs. number of labels for Label & Model Uncertainty, Label Uncertainty, and GRR]
30. Comparison: Model Quality
Across 12 domains, LMU is always better than GRR. LMU is statistically significantly better than LU and MU.
[Figure: model-quality learning curves; Label & Model Uncertainty highlighted]
31. Summary of results
- Micro-task outsourcing (e.g., MTurk, Rent-a-Coder, the ESP game) has changed the landscape for data formulation
- Repeated labeling can improve data quality and model quality (but not always)
- When labels are noisy, repeated labeling can be preferable to single labeling even when labels aren't particularly cheap
- When labels are relatively cheap, repeated labeling can do much better (omitted)
- Round-robin repeated labeling can do well
- Selective repeated labeling improves substantially
32. Opens up many new directions
- Strategies using the learning-curve gradient
- Estimating the quality of each labeler
- Example-conditional quality
- Increased compensation vs. labeler quality
- Multiple real labels
- Truly soft labels
- Selective repeated tagging
33. Thanks! Q & A?
34. What if different labelers have different qualities?
- (Sometimes) the quality of multiple noisy labelers is better than the quality of the best labeler in the set
  - here, 3 labelers with qualities p-d, p, p+d
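The three-labeler claim is easy to check by enumeration: the majority is correct whenever at least two of the three labelers are. A sketch (the function name is mine):

```python
def majority3(p1: float, p2: float, p3: float) -> float:
    """Probability that a majority vote of three independent labelers
    (correct with probabilities p1, p2, p3) is correct, i.e., that
    at least two of the three labelers are correct."""
    return (p1 * p2 * p3
            + p1 * p2 * (1 - p3)
            + p1 * (1 - p2) * p3
            + (1 - p1) * p2 * p3)

# With qualities p - d, p, p + d for p = 0.8, d = 0.1:
# majority3(0.7, 0.8, 0.9) = 0.902 > 0.9, the quality of the best labeler
```

So for these values the vote of three unequal labelers beats even the best individual among them.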
35. Mechanical Turk Example
36. Estimating Labeler Quality
- (Dawid & Skene, 1979): multiple diagnoses
  - Initially assume equal qualities
  - Estimate true labels for the examples
  - Estimate the qualities of the labelers given the true labels
  - Repeat until convergence