Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers
Transcript and Presenter's Notes
1
Get Another Label? Improving Data Quality and
Data Mining Using Multiple, Noisy Labelers
Victor Sheng, Foster Provost, Panos Ipeirotis
  • New York University
  • Stern School

2
Outsourcing KDD preprocessing
  • Traditionally, data mining teams have invested
    substantial internal resources in data
    formulation, information extraction, cleaning,
    and other preprocessing
  • Raghu, from his Innovation Lecture: "the best
    you can expect are noisy labels"
  • Now, we can outsource preprocessing tasks, such
    as labeling, feature extraction, verifying
    information extraction, etc.
  • using Mechanical Turk, Rent-a-Coder, etc.
  • quality may be lower than expert labeling (much?)
  • but low costs can allow massive scale
  • The ideas may apply also to focusing
    user-generated tagging, crowdsourcing, etc.

3
ESP Game (by Luis von Ahn)
4
Other free labeling schemes
  • Open Mind initiative (www.openmind.org)
  • Other GWAP (games with a purpose)
  • Tag a Tune
  • Verbosity (tag words)
  • Matchin (image ranking)
  • Web 2.0 systems?
  • Can/should tagging be directed?

5
Noisy labels can be problematic
  • Many tasks rely on high-quality labels for
    objects
  • learning predictive models
  • searching for relevant information
  • finding duplicate database records
  • image recognition/labeling
  • song categorization
  • Noisy labels can lead to degraded task
    performance

6
Quality and Classification Performance
Here, labels are values for the target variable
  • Labeling quality increases → classification
    quality increases

[Figure: classification quality vs. number of training examples, for labeling quality P = 1.0, 0.8, 0.6, 0.5]
7
Summary of results
  • Repeated labeling can improve data quality and
    model quality (but not always)
  • When labels are noisy, repeated labeling can be
    preferable to single labeling even when labels
    aren't particularly cheap
  • When labels are relatively cheap, repeated
    labeling can do much better (omitted)
  • Round-robin repeated labeling does well
  • Selective repeated labeling improves substantially

8
Our Focus: Labeling Using Multiple, Noisy Labelers
  • Repeated labeling and data quality
  • Repeated labeling and classification quality
  • Selective repeated labeling

9
Majority Voting and Label Quality
  • Ask multiple labelers; keep the majority label
    as the true label
  • Quality is the probability of being correct

[Figure: integrated quality of the majority-voted label vs. number of labelers, for individual labeler quality P = 0.4 to 1.0; P is the probability of an individual labeler being correct]
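The integrated quality in this figure follows from the binomial distribution: with n independent labelers, each correct with probability P, the majority vote is correct whenever more than half of them are. A minimal sketch (the function name is illustrative, not from the paper):

```python
from math import comb

def majority_quality(p, n):
    """Probability that the majority vote of n independent labelers,
    each correct with probability p, yields the correct binary label.
    Assumes an odd n, so there are no ties."""
    assert n % 2 == 1, "use an odd number of labelers to avoid ties"
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_quality(0.7, 1))   # 0.7: a single labeler
print(majority_quality(0.7, 11))  # ~0.92: majority voting improves quality
print(majority_quality(0.4, 11))  # ~0.25: below P = 0.5, more labelers hurt
```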
10
Tradeoffs for Modeling
  • Get more labels → improve label quality → improve
    classification
  • Get more examples → improve classification

[Figure: learning curves for labeling quality P = 1.0, 0.8, 0.6, 0.5]
11
Basic Labeling Strategies
  • Single Labeling
  • Get as many data points as possible
  • one label each
  • Round-robin Repeated Labeling
  • Fixed Round Robin (FRR)
  • keep labeling the same set of points
  • Generalized Round Robin (GRR)
  • repeatedly label data points, giving the next
    label to the point with the fewest labels so far
    (a minimal sketch follows)

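A minimal sketch of GRR; `get_label` stands in for a call to a noisy labeler and is an assumed interface, not one from the paper:

```python
import heapq

def generalized_round_robin(example_ids, get_label, budget):
    """Spend a labeling budget by always giving the next label to the
    example with the fewest labels so far (Generalized Round Robin)."""
    labels = {x: [] for x in example_ids}
    heap = [(0, x) for x in example_ids]     # (label count, example id)
    heapq.heapify(heap)
    for _ in range(budget):
        count, x = heapq.heappop(heap)
        labels[x].append(get_label(x))       # one more noisy label for x
        heapq.heappush(heap, (count + 1, x))
    return labels
```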
12
Fixed Round Robin vs. Single Labeling
[Figure: FRR (100 examples) vs. SL; p = 0.6 labeling quality, 100 examples]
With high noise, repeated labeling is better than
single labeling.
13
Fixed Round Robin vs. Single Labeling
[Figure: single labeling vs. FRR (50 examples); p = 0.8 labeling quality, 50 examples]
With low noise, more (single-labeled) examples are
better.
14
Gen. Round Robin vs. Single Labeling
[Figure: GRR vs. SL; P = labeling quality, k = number of labels per example; P = 0.6, k = 5]
Repeated labeling is better than single labeling.
15
Tradeoffs for Modeling
  • Get more labels → improve label quality → improve
    classification
  • Get more examples → improve classification

[Figure: learning curves for labeling quality P = 1.0, 0.8, 0.6, 0.5]
16
Selective Repeated-Labeling
  • We have seen:
  • With enough examples and noisy labels, getting
    multiple labels is better than single-labeling
  • When we consider costly preprocessing, the
    benefit is magnified (omitted -- see paper)
  • Can we do better than the basic strategies?
  • Key observation: we have additional information
    to guide selection of data for repeated labeling
  • the current multiset of labels
  • Example: {+, -, +, +, -, +} vs. {+, +, +, +}

17
Natural Candidate: Entropy
  • Entropy is a natural measure of label
    uncertainty
  • E({+, +, +, +, +, +}) = 0
  • E({+, -, +, -, +, -}) = 1
  • Strategy: get more labels for examples with
    high-entropy label multisets

18
What Not to Do: Use Entropy
Improves at first, but hurts in the long run.
19
Why Not Entropy?
  • In the presence of noise, entropy will be high
    even with many labels
  • Entropy is scale invariant
  • (3+, 2-) has the same entropy as (600+, 400-),
    as the sketch below shows
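A minimal sketch of the entropy score, reproducing the examples from slide 17 and the scale-invariance problem above:

```python
from math import log2

def label_entropy(pos, neg):
    """Shannon entropy (in bits) of the empirical label distribution."""
    n = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c > 0:
            h -= (c / n) * log2(c / n)
    return h

print(label_entropy(6, 0))      # 0.0: E({+,+,+,+,+,+})
print(label_entropy(3, 3))      # 1.0: E({+,-,+,-,+,-})
print(label_entropy(3, 2))      # ~0.971
print(label_entropy(600, 400))  # ~0.971: identical, despite 200x the evidence
```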

20
Estimating Label Uncertainty (LU)
  • Observe +s and -s and compute Pr(+ | observations)
    and Pr(- | observations)
  • Label uncertainty = tail of the Beta distribution

[Figure: Beta probability density function; S_LU is the tail mass below 0.5 on the [0.0, 1.0] axis]
21
Label Uncertainty
  • p = 0.7
  • 5 labels (3+, 2-)
  • Entropy ≈ 0.97
  • Beta CDF = 0.34

22
Label Uncertainty
  • p = 0.7
  • 10 labels (7+, 3-)
  • Entropy ≈ 0.88
  • Beta CDF = 0.11

23
Label Uncertainty
  • p = 0.7
  • 20 labels (14+, 6-)
  • Entropy ≈ 0.88
  • Beta CDF = 0.04

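These CDF values follow from a Beta posterior over the probability of the positive label: with pos positives, neg negatives, and a uniform prior, the tail below 0.5 is the Beta(pos + 1, neg + 1) CDF at 0.5. A sketch using scipy that reproduces the numbers on slides 21-23:

```python
from scipy.stats import beta

def label_uncertainty(pos, neg):
    """Posterior probability that the positive-label rate is below 0.5,
    i.e., that the observed majority label is actually wrong."""
    return beta.cdf(0.5, pos + 1, neg + 1)

print(label_uncertainty(3, 2))   # ~0.34 (slide 21)
print(label_uncertainty(7, 3))   # ~0.11 (slide 22)
print(label_uncertainty(14, 6))  # ~0.04 (slide 23)
```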
24
Label Uncertainty vs. Round Robin
Similar results across a dozen data sets.
25
Recall: Gen. Round Robin vs. Single Labeling
[Figure: GRR vs. SL; P = labeling quality, k = number of labels per example; P = 0.6, k = 5]
Multi-labeling is better than single labeling.
26
Label Uncertainty vs. Round Robin
Similar results across a dozen data sets.
27
Another strategy: Model Uncertainty (MU)
  • Learning a model of the data provides an
    alternative source of information about label
    certainty
  • Model uncertainty: get more labels for instances
    that cannot be modeled well (a sketch follows)
  • Intuition:
  • for data quality, low-certainty regions may be
    due to incorrect labeling of the corresponding
    instances
  • for modeling, why improve training-data quality
    where the model is already certain?

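One way to realize model uncertainty; a sketch assuming a scikit-learn-style classifier (the learner and the exact uncertainty score here are illustrative, not the paper's setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def model_uncertainty(X_train, y_noisy, X_pool):
    """Score each pool example by how uncertain a model trained on the
    current (noisy) labels is about it: 1.0 at p = 0.5, 0.0 at p = 0 or 1."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)
    p = model.predict_proba(X_pool)[:, 1]   # predicted P(positive)
    return 1.0 - 2.0 * np.abs(p - 0.5)
```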
28
Yet another strategy: Label Model Uncertainty (LMU)
  • Label and model uncertainty (LMU): avoid examples
    where either strategy is certain (a sketch of one
    natural combination follows)

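One natural combination, matching "avoid examples where either strategy is certain", is the geometric mean of the two scores (an assumption for illustration; the helper names refer to the sketches above):

```python
import numpy as np

def lmu_score(s_lu, s_mu):
    """Label-and-model uncertainty score: the geometric mean of the
    label-uncertainty and model-uncertainty scores. Near zero whenever
    EITHER source is already certain (either score is near zero)."""
    return np.sqrt(s_lu * s_mu)
```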
29
Comparison
Model Uncertainty alone also improves quality.
[Figure: data quality vs. number of labels for Label Model Uncertainty (LMU), Label Uncertainty (LU), and GRR]
30
Comparison: Model Quality
Across 12 domains, LMU is always better than
GRR. LMU is statistically significantly better
than LU and MU.
[Figure: model quality curves for Label Model Uncertainty (LMU)]
31
Summary of results
  • Micro-task outsourcing (e.g., MTurk, Rent-a-Coder,
    the ESP game) has changed the landscape for data
    formulation
  • Repeated labeling can improve data quality and
    model quality (but not always)
  • When labels are noisy, repeated labeling can be
    preferable to single labeling even when labels
    aren't particularly cheap
  • When labels are relatively cheap, repeated
    labeling can do much better (omitted)
  • Round-robin repeated labeling can do well
  • Selective repeated labeling improves substantially

32
Opens up many new directions
  • Strategies using learning-curve gradient
  • Estimating the quality of each labeler
  • Example-conditional quality
  • Increased compensation vs. labeler quality
  • Multiple real labels
  • Truly soft labels
  • Selective repeated tagging

33
Thanks! Q & A?
34
What if different labelers have different
qualities?
  • (Sometimes) the quality of multiple noisy labelers
    is better than the quality of the best labeler in
    the set
  • here, 3 labelers with qualities p - d, p, p + d
    (see the sketch below)

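The "(sometimes)" can be checked directly: the majority of three independent labelers is correct when at least two of them are. A quick sketch with qualities p - d, p, p + d:

```python
def majority3(p1, p2, p3):
    """Probability that the majority of three independent labelers
    is correct, given their individual qualities."""
    return (p1 * p2 * p3 + p1 * p2 * (1 - p3)
            + p1 * (1 - p2) * p3 + (1 - p1) * p2 * p3)

p = 0.8
print(majority3(p - 0.1, p, p + 0.1))  # ~0.902: beats the best labeler (0.9)
print(majority3(p - 0.2, p, p + 0.2))  # ~0.92: loses to the best labeler (1.0)
```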
35
Mechanical Turk Example
36
Estimating Labeler Quality
  • (Dawid & Skene, 1979): multiple diagnoses
  • Start by assuming equal labeler qualities
  • Estimate true labels for the examples
  • Estimate the quality of each labeler given the
    true labels
  • Repeat until convergence (a compact sketch
    follows)
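A compact sketch of that loop for the simplest case: binary labels, one symmetric accuracy per labeler, and a complete vote matrix (the original Dawid & Skene model handles full confusion matrices and missing labels):

```python
import numpy as np

def dawid_skene_binary(votes, n_iter=50):
    """EM for labeler quality. votes: (n_examples, n_labelers) array in {0, 1}.
    Returns posterior P(true label = 1) per example and per-labeler accuracy."""
    mu = votes.mean(axis=1)  # init: soft majority vote (equal qualities)
    for _ in range(n_iter):
        # M-step: a labeler's accuracy is its expected agreement with
        # the current soft label estimates
        acc = (mu[:, None] * votes + (1 - mu[:, None]) * (1 - votes)).mean(axis=0)
        acc = np.clip(acc, 1e-6, 1 - 1e-6)
        # E-step: posterior over each true label given votes and accuracies
        log1 = np.where(votes == 1, np.log(acc), np.log1p(-acc)).sum(axis=1)
        log0 = np.where(votes == 0, np.log(acc), np.log1p(-acc)).sum(axis=1)
        mu = 1.0 / (1.0 + np.exp(log0 - log1))
    return mu, acc
```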