CoCQA: Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity

1
CoCQA: Co-Training Over Questions and Answers,
with an Application to Predicting
Question Subjectivity Orientation
  • Baoli Li, Yandong Liu, and Eugene Agichtein
  • Emory University

2
Community Question Answering
  • An effective way of seeking information from
    other users
  • Archives of resolved questions can be searched

3
Community Question Answering (CQA)
  • Yahoo! Answers
  • Users
      • Askers post questions
      • Answerers post answers
      • Voters vote for existing answers
  • Questions
      • Subject
      • Detail
  • Answers
      • Answer text
      • Votes
  • Archives millions of questions and answers

4
Lifecycle of a Question in CQA
  • Choose a category
  • Compose the question
  • Open question
  • Examine the answers as they arrive
  • Find the answer?
      • Yes: close question, choose best answers, give ratings
      • No: question is closed by system; best answer is chosen by voters
5
Problem Statement
  • How can we exploit the structure of CQA to improve
    question classification?
  • Case study: question subjectivity prediction
      • Subjective questions seek answers containing
        private states such as personal opinion,
        judgment, and experience
      • Objective questions are expected to be answered
        with reliable or authoritative information

6
Example Questions
  • Subjective
      • Has anyone got one of those home blood pressure
        monitors? and if so what make is it and do you
        think they are worth getting?
  • Objective
      • What is the difference between chemotherapy and
        radiation treatments?

7
Motivation
  • Guiding the CQA engine to process questions more
    intelligently
  • Some applications
      • Ranking/filtering answers
      • Improving question archive search
      • Evaluating answers provided by users
      • Inferring user intent

8
Challenges
  • Some challenges in analyzing real online questions
      • Typically complex and subjective
      • Can be ill-phrased and vague
      • Not enough annotated data

9
Key Observations
  • Can we utilize the inherent structure of the CQA
    interactions, and use the unlimited amounts of
    unlabeled data to improve classification
    performance?

10
Natural Approach: Co-Training
  • Introduced in "Combining labeled and unlabeled data with
    co-training" (Blum and Mitchell, 1998)
  • Two views of the data
      • E.g. content and hyperlinks in web pages
      • Provide complementary information for each other
  • Iteratively construct additional labeled data
  • Can often significantly improve accuracy

11
Questions and Answers: Two Views
  • Example
      • Q: Has anyone got one of those home blood
        pressure monitors? and if so what make is it and
        do you think they are worth getting?
      • A: My mom has one as she is diabetic so its
        important for her to monitor it she finds it
        useful.
  • Answers usually match/fit the question
      • E.g. "My mom ...", "she finds ..."
  • Askers can usually identify matching answers by
    selecting the best answer

12
CoCQA: A Co-Training Framework over Questions and Answers
[Diagram: labeled data (Q and A views) trains a question classifier CQ and an answer classifier CA; both classify the unlabeled data; validation on holdout training data decides when to stop]
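
The loop in this framework can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: a small log-odds scorer stands in for LibSVM, the absolute score stands in for the SVM margin as the confidence value, and the stopping rule (stop once holdout accuracy of the question classifier drops) is one guess at how the holdout validation might be used. The names `train`, `score`, `accuracy`, and `co_train` are hypothetical.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (tokens, label) with label +1 or -1.
    Returns per-term log-odds weights (a stand-in for LibSVM)."""
    counts = {1: Counter(), -1: Counter()}
    for tokens, label in examples:
        counts[label].update(tokens)
    vocab = set(counts[1]) | set(counts[-1])
    # Add-one smoothing over the shared vocabulary.
    totals = {y: sum(counts[y].values()) + len(vocab) for y in counts}
    return {t: math.log((counts[1][t] + 1) / totals[1])
               - math.log((counts[-1][t] + 1) / totals[-1])
            for t in vocab}


def score(weights, tokens):
    """Signed score: the sign is the label, the magnitude is the confidence."""
    return sum(weights.get(t, 0.0) for t in tokens)


def accuracy(weights, examples):
    hits = sum((score(weights, x) >= 0) == (y == 1) for x, y in examples)
    return hits / len(examples)


def co_train(labeled, unlabeled, holdout, k=1, max_iter=10):
    """labeled, holdout: lists of (q_tokens, a_tokens, label).
    unlabeled: list of (q_tokens, a_tokens).
    Returns the (question classifier, answer classifier) pair."""
    labeled, pool = list(labeled), list(unlabeled)
    best, best_acc = None, -1.0
    for _ in range(max_iter):
        cq = train([(q, y) for q, _a, y in labeled])   # view 1: questions
        ca = train([(a, y) for _q, a, y in labeled])   # view 2: answers
        acc = accuracy(cq, [(q, y) for q, _a, y in holdout])
        if acc < best_acc:        # assumed stopping rule
            break
        best, best_acc = (cq, ca), acc
        if not pool:
            break
        # Each view labels its k most confident unlabeled examples.
        picks = {}
        for view, clf in ((0, cq), (1, ca)):
            ranked = sorted(range(len(pool)),
                            key=lambda i: -abs(score(clf, pool[i][view])))
            for i in ranked[:k]:
                label = 1 if score(clf, pool[i][view]) >= 0 else -1
                picks.setdefault(i, label)
        labeled += [(pool[i][0], pool[i][1], y) for i, y in picks.items()]
        pool = [ex for i, ex in enumerate(pool) if i not in picks]
    return best
```

Here +1 stands for subjective and -1 for objective; the newly labeled examples carry both views, so each classifier also learns from the other's confident predictions.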
13
Details of CoCQA implementation
  • Base classifier: LibSVM
  • Term weight: term frequency
      • Also tried binary and TF-IDF weights
  • Select the top K examples with the highest confidence
      • Confidence is the margin value in the SVM
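
The three term-weighting options named above can be illustrated with a short sketch. The function name `term_weights` and the plain `log(N / df)` form of IDF are assumptions made for illustration; the slides do not say which TF-IDF variant was tried.

```python
import math
from collections import Counter


def term_weights(docs, scheme="tf"):
    """docs: list of token lists. Returns one {term: weight} dict per doc.
    scheme is "tf" (raw term frequency), "binary", or "tfidf"."""
    n_docs = len(docs)
    # Document frequency: number of docs containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "tf":
            vec = dict(tf)
        elif scheme == "binary":
            vec = {t: 1 for t in tf}
        elif scheme == "tfidf":
            # Assumed variant: tf * log(N / df), no smoothing.
            vec = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vectors
```

For example, `term_weights([["she", "finds", "it", "it"], ["it", "works"]], "tf")` gives the term "it" a weight of 2 in the first document, while the binary scheme gives it 1.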

14
Feature Set
  • Character 3-grams
      • has, any, nyo, yon, one
  • Words
      • Has, anyone, got, mom, she, finds
  • Words with character 3-grams
  • Word n-grams (n ≤ 3, i.e. W_i, W_i W_{i+1}, W_i W_{i+1} W_{i+2})
      • Has anyone got, anyone got one, she finds it
  • Word and POS n-grams (n ≤ 3, i.e. W_i, W_i W_{i+1}, W_i POS_{i+1}, POS_i W_{i+1}, POS_i POS_{i+1}, etc.)
      • NP VBP, She PRP, VBP finds
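
A minimal sketch of how these feature types might be extracted is shown below. It assumes character 3-grams are taken inside each lowercased word (which reproduces the "has, any, nyo, yon, one" example) and that POS tags are supplied by the caller, since no tagger is named in the slides. The function names are hypothetical.

```python
import re
from itertools import product


def char_trigrams(text):
    """Character 3-grams taken inside each lowercased word."""
    grams = []
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        grams.extend(word[i:i + 3] for i in range(len(word) - 2))
    return grams


def word_ngrams(tokens, max_n=3):
    """All contiguous word n-grams with n <= max_n."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def word_pos_ngrams(tagged, max_n=3):
    """Mixed word/POS n-grams: each position in an n-gram window is
    filled with either the word or its POS tag.
    tagged: list of (word, pos) pairs from any tagger (assumed given)."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            grams.extend(" ".join(choice) for choice in product(*window))
    return grams
```

For instance, `word_ngrams("Has anyone got one".split())` contains both "Has anyone got" and "anyone got one", matching the trigram examples above.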

15
Overview of Experimental Setup
  • Datasets
      • From Yahoo! Answers
      • Manually labeled via Amazon Mechanical Turk
  • Metrics
  • Compare CoCQA to a state-of-the-art semi-supervised method

16
Dataset
  • 1,000 labeled questions from Yahoo! Answers
      • 5 categories (Arts, Education, Science, Health,
        Sports)
      • 200 questions from each category
  • 10,000 unlabeled questions from Yahoo! Answers
      • 2,000 questions from each category
  • Data available at http://ir.mathcs.emory.edu/shared

17
Manual Labeling
  • Annotated using Amazon's Mechanical Turk service
  • Each question was judged by 5 Mechanical Turk
    workers
  • 25 questions included in each HIT task
  • Workers need to pass a qualification test
  • Majority vote to derive the gold standard
  • Discarded a small fraction (22 out of 1,000) of
    nonsensical questions, such as "Upward Soccer
    Shorts?" and "11?fdgdgdfg", by manual inspection
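
The majority-vote step can be sketched in a few lines. The function name `majority_label` and the label strings are illustrative assumptions; with 5 workers and two labels a strict majority always exists, and the `None` return only covers inputs where it does not.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of workers,
    or None if no label has more than half of the votes."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count * 2 > len(votes) else None
```

For example, three "subjective" votes and two "objective" votes yield "subjective" as the gold label.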

18
Example HIT task
19
Subjectivity Statistics by Category
[Chart: objective vs. subjective questions by category]
20
Evaluation Metric
  • Macro-averaged F1
      • Prediction performance on subjective questions
        and on objective questions is equally important
      • F1 is computed per class, then averaged over the
        subjective and objective classes
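
As an illustration of the metric, the sketch below computes per-class F1 with the standard definition F1 = 2PR / (P + R), where P is precision and R is recall, and then takes the unweighted mean over the two classes. The function names and label strings are assumptions for the example.

```python
def f1_for_class(gold, pred, cls):
    """F1 of one class from parallel lists of gold and predicted labels."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, classes=("subjective", "objective")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)
```

Because every class counts equally, a classifier that does well only on the more frequent class is penalized, which fits the stated goal of treating both classes as equally important.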

21
Experimental Settings
  • 5-fold cross-validation
  • Methods compared
      • Supervised: LibSVM (Chang and Lin, 2001)
      • Generalized Expectation (GE) (Mann and McCallum,
        2007)
      • CoCQA: our method
          • Base classifier: LibSVM
          • View 1: question text; View 2: answer text

22
F1 for Supervised Learning
F1 with different sets of features
23
Semi-Supervised Learning: Adding Unlabeled Data
Comparison between Supervised, GE and CoCQA
24
CoCQA with varying K (number of new examples added in
each iteration)
25
CoCQA for varying iterations
26
CoCQA for varying amount of labeled data
27
Conclusions and Future Work
  • Problem: non-topical text classification in CQA
  • CoCQA: a co-training framework that can exploit
    information from both questions and answers
  • Case study: subjectivity classification for real
    questions in CQA
  • We plan to explore
      • more sophisticated features
      • related variants of semi-supervised learning
      • other applications (sentiment classification)

28
Thank you!
  • Baoli Li: csblli_at_gmail.com
  • Yandong Liu: yandong.liu_at_emory.edu
  • Eugene Agichtein: eugene_at_mathcs.emory.edu
29
Performance of Subjective vs. Objective classes
  • Subjective class: 80
  • Objective class: 60

30
Related work
  • Some related work
      • Question classification: (Zhang and Lee, 2003),
        (Tri et al., 2006)
      • Sentiment analysis: (Pang and Lee, 2004),
        (Yu and Hatzivassiloglou, 2003),
        (Somasundaran et al., 2007)

31
Important words for Subjective, Objective classes
by Information Gain
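
For readers unfamiliar with the ranking criterion, here is a minimal sketch of information gain for a single term, using the standard definition IG(t) = H(C) - P(t) H(C | t) - P(not t) H(C | not t) over term presence. The function names and the toy data layout (each document as a set of terms) are assumptions; the slides do not specify how the scores were computed.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs, labels, term):
    """docs: list of term sets; labels: parallel list of class labels."""
    classes = sorted(set(labels))

    def class_entropy(ys):
        return entropy([ys.count(c) for c in classes])

    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    return (class_entropy(list(labels))
            - len(with_term) / n * class_entropy(with_term)
            - len(without) / n * class_entropy(without))
```

A term that appears only in one class gets the maximum gain, while a term that appears in every document gets zero, since its presence says nothing about the class.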