1
JAVELIN Project Briefing
  • February 16, 2007
  • Language Technologies Institute, Carnegie Mellon
    University

2
MLQA Architecture
[Architecture diagram: Question Analyzer (QA) →
Retrieval Strategist (RS) → Information Extractor
(IX) → Answer Generator (AG), with a Keyword
Translator and separate Chinese and Japanese
corpora and IX components]
3

MLQA Architecture
How much did the Japan Bank for International
Cooperation decide to loan to the Taiwan
High-Speed Corporation?
[Architecture diagram as above; the question enters
the pipeline]
4
MLQA Architecture
[Architecture diagram as above]
5
MLQA Architecture
Answer Type: MONEY   Keyword: _____________
[Architecture diagram as above]
6
MLQA Architecture
DocID JY-20010705J1TYMCC1300010, Confidence 44.01
DocID JY-20011116J1TYMCB1300010, Confidence 42.95
[Architecture diagram as above]
7
MLQA Architecture
Answer candidate (Confidence 0.0718) with its
passage
[Architecture diagram as above]
8
MLQA Architecture
Cluster and Re-rank answer candidates.
[Architecture diagram as above]
9
MLQA Architecture
[Architecture diagram as above]
10
Question Analyzer
  • Primary Subtasks
  • Question Classification
  • Key Term Identification
  • Semantic Analysis

11
Question Classification
  • Hybrid Approach
  • machine learning + rule-based (same features)
  • Features
  • Lexical
  • unigrams, bigrams
  • Syntactic
  • focus adjective, main verb, wh-word, determiner
    status of wh-word
  • Semantic
  • focus word type

12
Question Classification: Focus Words
  • Examples
  • Which town hosted the 2002 Winter Olympics?
  • How long is the Golden Gate Bridge?
  • Determining the semantic type of focus nouns
    (see the sketch below)
  • Look up in WordNet
  • town → town-8135936
  • Use a manually-created mapping
  • town-8135936 → CITY
  • city-metropolis-urban_center-8005407 → CITY
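
A minimal sketch of this lookup in Python, assuming NLTK's WordNet interface; the offset-to-type map is illustrative, and synset offsets differ across WordNet versions (the numbers below are the WordNet 2.1 offsets from the slide).

from nltk.corpus import wordnet as wn

# Hand-made offset -> answer-type map (illustrative; these WordNet 2.1
# offsets will not match other WordNet versions)
OFFSET_TO_TYPE = {
    8135936: "CITY",   # town
    8005407: "CITY",   # city, metropolis, urban_center
}

def focus_type(noun):
    # Look up the focus noun and walk its hypernym closure,
    # returning the first mapped semantic type found
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for s in [synset] + list(synset.closure(lambda x: x.hypernyms())):
            if s.offset() in OFFSET_TO_TYPE:
                return OFFSET_TO_TYPE[s.offset()]
    return None

print(focus_type("town"))  # "CITY" under a matching WordNet version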

13
Question Classification: Algorithms
  • Machine Learning
  • Hierarchical classifier
  • E-C: MAX_ENT, MAX_ENT
  • E-J: MAX_ENT, ADABOOST_OVA
  • Rule-based
  • Example
  • MONEY ← WH_WORD=how_much,
    FOCUS_ADJ=expensive,
    FOCUS_TYPE=money
  • Hybrid Approach
  • Try both ML and rule-based (see the sketch below)
  • If the rule-based classification succeeds, use it
  • Else use the ML-based classification
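
A minimal sketch of the hybrid fallback logic; classify_by_rules and classify_by_ml are hypothetical stand-ins for the rule engine and the trained MaxEnt/AdaBoost model.

def classify_by_rules(f):
    # One illustrative rule, mirroring the MONEY example above
    if f.get("wh_word") == "how_much" or f.get("focus_type") == "money":
        return "MONEY"
    return None  # no rule fired

def classify_by_ml(f):
    # Placeholder for a trained MaxEnt/AdaBoost model's prediction
    return "OTHER"

def classify_question(f):
    # Hybrid approach: prefer the rule-based label when a rule fires
    return classify_by_rules(f) or classify_by_ml(f)

print(classify_question({"wh_word": "how_much"}))  # MONEY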

14
Key Term Identification
  • Sources of evidence
  • Syntactic category (POS): NN, JJ, VB, CD
  • Common phrases in dictionary
  • Named entity tags
  • Quoted text
  • Unification procedure based on the priority of the
    evidence source (sketched below)
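
A minimal sketch of such a unification procedure; the evidence sources and their priority order here are illustrative.

# Illustrative priority: quoted text > named entities > dictionary phrases > POS
PRIORITY = {"quote": 0, "ne": 1, "phrase": 2, "pos": 3}

def unify_key_terms(candidates):
    # candidates: (start, end, source, text) spans proposed by each source;
    # keep the highest-priority span for any overlapping region
    chosen = []
    for start, end, source, text in sorted(candidates, key=lambda c: PRIORITY[c[2]]):
        if all(end <= s or start >= e for s, e, _, _ in chosen):
            chosen.append((start, end, source, text))
    return sorted(chosen)

print(unify_key_terms([
    (0, 2, "pos", "Golden Gate"),
    (0, 3, "ne", "Golden Gate Bridge"),   # NE evidence wins the overlap
]))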

15
Semantic Analysis
  • Semantic Role Labeling
  • ASSERT v0.1
  • Back-up: KANTOO (for "be", "have", etc.)
  • more on SRL later
  • Semantic Predicate Structures
  • Produced from SRL annotations and key terms
  • Focus argument is identified
  • Semantic Predicate Expansion
  • Using small, manually-created ontology
  • Relations: is-a, inverse, implies, reflexive

16
Plans for Future Development
  • Question Classification
  • Replace manually-created knowledge sources and
    heuristics with learners
  • Re-architect to place learned components as
    supporting agents under rule-based control
  • Semantic Role Labeling
  • Nominalizations?
  • Predicate Expansion
  • Learn the expansion ontology automatically from
    labeled corpora

17
Retrieval Strategist for NTCIR 6
  • Sentence and block retrieval
  • Blocks are overlapping windows, each containing
    three sentences
  • Annotated Corpora
  • Chinese
  • Sentence and block boundaries
  • Named Entity Types: www, phone, cardinal, time,
    percent, person, quoted, money, booktitle, date,
    ordinal, email, location, duration, organization,
    measure
  • Japanese (CLQA and QAC)
  • Sentence segmentation, blocks
  • Named Entity Types: time, date, optional,
    location, demo, organization, artifact, made,
    misc, money, any, person, numex
  • Named Entity Subtypes: misc, people, cardinal,
    age, weight, length, speed, information, area
  • Question focus terms: person_bio, reason, method,
    definition
  • Japanese case markers: mo, totomoni, ka, no, tte,
    nado, wo, ni, ga, toshite, e, wa, yori, nitsuite,
    dake, kara, to, ya
  • Propbank-style semantic roles: target, arg0-4,
    argx

18
Query Formulation for NTCIR 6
  • Retrieve, rank and score blocks
  • One weighted synonym inner clause for each
    keyterm, containing alternate forms, weighted by
    confidence

#weight[block]( weight1 #wsyn( 1.0 term1 0.85
alt1a 0.60 alt1b ) weight2 #wsyn( 1.0 term2
0.75 alt2a ) )
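
A minimal sketch of assembling such a query string; the function name and input format are illustrative, not JAVELIN's actual API.

def build_block_query(keyterms):
    # keyterms: list of (clause_weight, [(confidence, form), ...])
    clauses = []
    for weight, forms in keyterms:
        inner = " ".join(f"{conf} {form}" for conf, form in forms)
        clauses.append(f"{weight} #wsyn( {inner} )")
    return "#weight[block]( " + " ".join(clauses) + " )"

print(build_block_query([
    (1.0, [(1.0, "term1"), (0.85, "alt1a"), (0.60, "alt1b")]),
    (0.8, [(1.0, "term2"), (0.75, "alt2a")]),
]))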
19
Translation Module Overview
20
Outline
  • What TM does
  • Then (TM at NTCIR-5)
  • Now (TM at NTCIR-6)
  • Going Forward

21
Translation Module
  • Responsible for all translation-related tasks
    within Javelin
  • Currently, TM's main task is to translate
    keyterms (given by the Question Analyzer) from
    the source language (language of the user input
    question) to the target languages (languages of
    the data collections where answers may be found)
    so that the answer can be located and extracted
    based on the translated keyterms

22
Then (NTCIR-5)
  • Goal: Produce a high-quality translation for each
    keyterm based on question context
  • View: A translation problem
  • Evaluation: Based on gold-standard translations
  • How:
  • Use multiple translation resources
  • Dictionaries
  • MT systems
  • Web-mining techniques
  • Use web co-occurrence statistics to select the
    best combination of translated keyterms for a
    given question

23
Then (NTCIR-5)
[TM architecture diagram: source-language keyterms
→ Translation Gathering (dictionaries, MT systems,
web mining) → Translation Selection (using the
World Wide Web) → target-language keyterms]
24
Then (NTCIR-5)
  • Problems
  • A correct translation may not be useful in
    document retrieval and answer extraction
  • "Bill Clinton", "William J. Clinton", "President
    Clinton" (alternate forms)
  • "Took over", "invaded", "attacked", "occupied"
    (near-synonyms)
  • Which one is correct? Which one(s) are good for
    retrieval and extraction in a QA system?
  • Needs better translation of named entities
  • Accessing the web for gathering statistics could
    be slow

25
Now (NTCIR-6)
  • Goal: For each keyterm, produce a SET of
    translations useful for retrieval and extraction
  • View: A Cross-Lingual Information Retrieval
    (CLIR) problem
  • Evaluation: No direct evaluation, but based on
    retrieval results
  • How:
  • Use multiple translation resources
  • Dictionaries
  • MT systems
  • Better web-mining techniques
  • Wikipedia
  • Named entity lists
  • More of everything
  • Use multiple translation candidates
  • Better for retrieval and extraction recall
  • But need to minimize noise while retaining recall
  • Rank translation candidates (see the sketch below)
  • Use simpler web statistics
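
A minimal sketch of pooling and ranking translation candidates; the resource lookups and hit counts below are toy stand-ins for the dictionaries, MT systems, and web statistics.

def rank_translations(keyterm, resources, hit_count):
    # Pool candidates from every resource, then order them by a
    # simple web statistic such as a hit or co-occurrence count
    candidates = set()
    for lookup in resources:
        candidates.update(lookup(keyterm))
    return sorted(candidates, key=hit_count, reverse=True)

dictionary = lambda k: {"translation_a", "translation_b"}  # toy resources
mt_system = lambda k: {"translation_a"}
counts = {"translation_a": 120, "translation_b": 45}       # toy statistics
print(rank_translations("keyterm", [dictionary, mt_system], counts.get))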

26
Now (NTCIR-6)
[TM architecture diagram: source-language keyterms
→ Translation Gathering (named entity lists,
dictionaries, MT systems, web mining, Wikipedia) →
Translation Alternatives Scoring (using the World
Wide Web) → sets of target-language keyterms]
27
Now (NTCIR-6)
Retrieval results show that good translation
coverage is important.
Gold: Manually created gold-standard translation,
one per keyterm.
TM: Automatically generated translations using TM,
multiple translations per keyterm.
Note: The unit of retrieval is a 3-sentence block,
not a document, and relevance judgment is based on
the answer pattern.
28
Now (NTCIR-6)
(Same text as the previous slide.)
29
Now (NTCIR-6)
(Same text as the previous slide.)
30
Going Forward
  • Improving Translation Coverage
  • Improve web-mining translators
  • Improve keyterm extraction
  • May need alternate forms of the source language
    keyterm
  • May need to segment/transform extracted keyterms
  • Japanese translation is poor
  • Named entities not translated properly
  • Keyterm segmentation problems
  • Improving CLIR
  • Use corpus statistics for ranking translation
    candidates
  • Use established data sets for CLIR experiments
    (TREC, NTCIR)

31
Chinese Answer Extraction Module
  • Outline
  • Review answer extraction in NTCIR5 through an
    example
  • Explain the new techniques we developed for
    NTCIR6

32
NTCIR5 Chinese Answer Extractor Module (we will use
an English example as our running example; the
techniques in this module are language-independent)
Q: What percent of the nation's cheese does
Wisconsin produce?
T: In Wisconsin, where farmers produce roughly 28
percent of the nation's cheese ...
33
NTCIR5 Chinese Answer Extractor Module
1. Identify named entities
Q: What percent of the nation's cheese does
[Wisconsin]Location produce?
T: In [Wisconsin]Location, where farmers produce
roughly [28 percent]Percent of the nation's cheese ...
34
NTCIR5 Chinese Answer Extractor Module
2. Identify the expected answer type
Answer Type: PERCENT
Q: What percent of the nation's cheese does
[Wisconsin]Location produce?
T: In [Wisconsin]Location, where farmers produce
roughly [28 percent]Percent of the nation's cheese ...
35
NTCIR5 Chinese Answer Extractor Module
3. Extract answer candidates whose named-entity
type matches the expected answer type
Answer Type: PERCENT
Q: What percent of the nation's cheese does
[Wisconsin]Location produce?
T: In [Wisconsin]Location, where farmers produce
roughly [28 percent]Percent of the nation's cheese ...
36
NTCIR5 Chinese Answer Extractor Module
Score each answer candidate based on its surface
distance to the key terms, and select the candidate
closest to all key terms (a sketch follows). In the
example, the candidate "28 percent" is 5 word
tokens apart from the key term "Wisconsin".
Answer Type: PERCENT
Q: What percent of the nation's cheese does
[Wisconsin]Location produce?
T: In [Wisconsin]Location, where farmers produce
roughly [28 percent]Percent of the nation's cheese ...
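
A minimal sketch of the surface-distance scoring over token positions; the data layout is illustrative.

def distance_score(candidate_pos, keyterm_positions):
    # Total token distance from the candidate to every key term (lower is better)
    return sum(abs(candidate_pos - p) for p in keyterm_positions)

def best_candidate(candidates, keyterm_positions):
    # candidates: list of (text, token_position) pairs
    return min(candidates, key=lambda c: distance_score(c[1], keyterm_positions))

# e.g. candidate "28 percent" at token 7, key term "Wisconsin" at token 1
print(best_candidate([("28 percent", 7)], [1]))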
37
NTCIR6 Chinese Answer Extractor Module
Q: What percent of the nation's cheese does
Wisconsin produce?
A: In Wisconsin, where farmers produce roughly 28
percent of the nation's cheese ...
38
NTCIR6 Chinese Answer Extractor Module
Find the best alignment of key terms using a
max-flow dynamic programming algorithm.
Q: What percent of the nation's cheese does
Wisconsin produce?
A: In Wisconsin, where farmers produce roughly 28
percent of the nation's cheese ...
39
NTCIR6 Chinese Answer Extractor Module
Using the max-flow algorithm, we take partial
matching of terms and synonym expansion into
account by assigning different scores to these
types of matching: e.g. Q "percent" matches A
"28 percent" only partially (score 0.8), while Q
"make" matches A "produce" via synonym expansion
(score 0.9). A sketch follows.
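
A minimal sketch of term matching with exact, synonym, and partial levels; the weights and synonym list are illustrative, and a greedy pairing stands in for the max-flow/DP alignment.

SYNONYMS = {("make", "produce")}

def match_score(q_term, a_term):
    # Exact match > synonym expansion > partial (substring) match
    if q_term == a_term:
        return 1.0
    if (q_term, a_term) in SYNONYMS or (a_term, q_term) in SYNONYMS:
        return 0.9
    if q_term in a_term or a_term in q_term:
        return 0.8
    return 0.0

def align(q_terms, a_terms):
    # Greedy one-to-one alignment, best-scoring pairs first
    pairs = sorted(((match_score(q, a), q, a) for q in q_terms for a in a_terms),
                   reverse=True)
    used_q, used_a, alignment = set(), set(), []
    for score, q, a in pairs:
        if score > 0 and q not in used_q and a not in used_a:
            alignment.append((q, a, score))
            used_q.add(q)
            used_a.add(a)
    return alignment

print(align(["percent", "make"], ["28 percent", "produce"]))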
40
NTCIR6 Chinese Answer Extractor Module
Identify the answer type and select answer
candidates that have a matching NE type.
Answer Type: PERCENT
Q: What percent of the nation's cheese does
Wisconsin produce?
T: In Wisconsin, where farmers produce roughly
[28 percent]Percent of the nation's cheese ...
41
NTCIR6 Chinese Answer Extractor Module
Produce dependency parse trees.
[Dependency parse trees of Q and T, with relation
labels such as whn, head, pcomp-n, subj, gen, prep,
det, root, mod, i, obj]
42
NTCIR6 Chinese Answer Extractor Module
Extract relation triples among the matching terms.
[Dependency parses of Q and T restricted to the
matching terms ("percent", "of", "the nation's",
"cheese", "Wisconsin", "produce" / "28 percent"),
with the same relation labels]
43
NTCIR6 Chinese Answer Extractor Module
Extract relation triples (cont.)
[Relation triples extracted from Q (over "percent",
"of", "Wisconsin", "produce", "cheese") and the
corresponding triples from T (over "28 percent",
"of", "Wisconsin", "produce", "cheese"), labeled
with dependency relations such as prep, subj,
pcomp-n, obj, mod]
44
NTCIR6 Chinese Answer Extractor Module
Combine multiple sources of information using a
maximum-entropy model (a sketch follows the list):
1. atype-NE matching (Answer Type PERCENT vs. the
   Percent NE tag on "28 percent")
2. Dependency path matching (e.g. Q "percent of"
   vs. T "28 percent of", with mod/prep/pcomp-n
   relations)
3. Term alignment score
4. Sentence term occurrence
5. Passage term occurrence
6. etc.
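
A minimal sketch of combining these scores with a maximum-entropy (logistic regression) learner; scikit-learn is an assumed stand-in for the actual learner, and the tiny training set is a toy fabricated purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature order: [atype_ne_match, dep_path_match, alignment_score,
#                 sentence_term_occurrence, passage_term_occurrence]
X_train = np.array([[1, 0.8, 0.9, 3, 5],   # toy positive example
                    [0, 0.1, 0.2, 1, 2]])  # toy negative example
y_train = np.array([1, 0])

model = LogisticRegression()               # a maximum-entropy classifier
model.fit(X_train, y_train)
candidate = np.array([[1, 0.7, 0.8, 2, 4]])
print(model.predict_proba(candidate)[0, 1])  # P(correct | features)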
45
Future work for Chinese IX
  • Currently building a more powerful and expressive
    model to learn the syntactic and semantic
    transformations from question to answer.
  • Plug more external resources, such as a paraphrase
    database and semantic resources (gazetteers,
    WordNet, thesauri), into the new model.

46
Japanese Answer Extraction
47
NTCIR6 CLQA EJ/JJ task Answer Extraction
  • Given retrieved documents, we want to pick Named
    Entities that belong to the expected answer type
    or another relevant type.
  • Named Entity tagging
  • Used CaboCha for 9 NE classes.
  • A pattern-based NE tagger is also used for NUMEX,
    DATE, TIME, PERCENT classes and for more
    fine-grained NEs (e.g. ORGANIZATION.UNIVERSITY)
  • NE family assumption
  • Families
  • LOCATION, PERSON, ORGANIZATION, ARTIFACT
  • NUMEX, PERCENT
  • DATE, TIME
  • If the answer type is LOCATION, pick the other
    members of the family into the answer candidate
    pool too, because the NE tagger may have
    mistakenly tagged a LOCATION as a PERSON
  • The MaxEnt learner learns different weights for
    LOCATION-LOCATION and LOCATION-PERSON
  • Then, we want to estimate the probability of each
    Named Entity being an answer.
  • Used a Maximum Entropy model in which we can
    easily incorporate customizable features
  • We can model both proximity (used in JAVELIN II's
    LIGHT IX) and patterns (used in JAVELIN II's FST
    IX)

48
NTCIR6 CLQA EJ/JJ task Answer Extraction
  • Numeric features
  • (Q denotes the question sentence, and A denotes the
    answer candidate sentence.)
  • KEYTERM: # of key terms from Q found in A
  • ALIAS: # of aliases (obtained from Wikipedia and
    Eijiro) from Q found in A
  • RELATED_TERM: # of related terms (obtained from
    web mining) from Q found in A
  • KEYTERM_DIST: Closest sentence-level distance of
    a key term from Q
  • ALIAS_DIST: Closest sentence-level distance of an
    alias from Q
  • RELATED_TERM_DIST: Closest sentence-level
    distance of a related term from Q
  • PREDICATE_ARGUMENT: To what degree the
    predicate-argument structures of Q and A are
    similar
  • Binary features
  • ATYPE: pairs of answer types in Q and A
  • KEYTERM_ATTACHMENT: -NO, -NI, -WA, -GA, -MO, -WO,
    -KANJI, -PAREN
  • Fires if a certain word occurs directly after the
    key term in A (a feature-extraction sketch follows)
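
A minimal sketch of the count and distance features above; the tokenization and the alias/related-term lookups are simplified stand-ins.

def count_found(terms, answer_tokens):
    # KEYTERM / ALIAS / RELATED_TERM: how many of the terms from Q
    # appear in the answer candidate sentence A
    return sum(1 for t in terms if t in answer_tokens)

def closest_distance(terms, sentences, answer_index):
    # *_DIST: sentence-level distance from the answer sentence to the
    # nearest sentence containing one of the terms (None if absent)
    hits = [abs(i - answer_index)
            for i, sentence in enumerate(sentences)
            if any(t in sentence for t in terms)]
    return min(hits) if hits else None

doc = [["Taro", "visited", "Kyoto"], ["He", "bought", "souvenirs"]]
print(count_found(["Kyoto"], doc[0]))       # 1
print(closest_distance(["Kyoto"], doc, 1))  # 1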

49
NTCIR6 QAC Overview
  • Japanese-to-Japanese complex (non-factoid) QA
    task.
  • Answer unit is larger than factoid QA task
  • Phrases, sentences, multiple sentences,
    summarized text
  • In reality, it was a pilot task
  • The kinds of questions were not predefined
  • Small training data
  • Evaluation: currently, human judgment only
  • N, the number of top answer candidates to
    evaluate, was unknown
  • Data
  • Corpus: Mainichi newspaper, 1998-2001
  • Training: 30 questions
  • Formal run: 100 questions

50
NTCIR6 QAC Complex Questions
  • Example questions by expected answer (translated)
  • Relationship, difference
  • What is the difference between skeleton and luge?
  • Reason, cause
  • Why was it easy to predict the eruption of Mt.
    Usu?
  • What is the background of the rise of Islamic
    fundamentalism?
  • Definition, description, (person bio)
  • What is the NPO law?
  • What are the problems of the aging Mir space station?
  • Effect, result
  • How does dioxin affect the human body?
  • Method, process
  • Degree
  • Opinion
  • What did Ryoko Tamura comment after winning the
    gold medal?

51
NTCIR6 QAC Our Approach
  • Question Analysis
  • Keyterm extraction, dictionary-based keyterm
    expansion, answer type analysis
  • Document Retrieval
  • Same as factoid QA. Block (3-sentence level)
    retrieval with Indri
  • Answer Extraction
  • Machine learning (Maximum Entropy model) using
    keyterms, answer types, and patterns as features
  • Answer Selection
  • Duplicate answer merging

52
NTCIR6 QAC Answer types
  • A-type categories we defined
  • How to make use of answer type?
  • As a feature in the answer extraction phase

[Table of A-type categories and their cue phrases
(the Japanese cues are garbled in this transcript):
METHOD: How do you choose ...?
PROCESS: In what process, how ...?
REASON: Why ...?, What is the reason for ...?
RESULT: (Japanese cues)
CONDITION: In what condition ...?
DEFINITION: What is ...?, What is the advantage of ...?
PERSON_BIO: Who is ...?
DEGREE: How much damage ...?]
53
NTCIR6 QAC Keyterm expansion
  • Based on observations, we found some vocabulary
    mismatches between questions and answers
  • Created a synonym/alias dictionary from
    Wikipedia and Eijiro (an English-to-Japanese
    dictionary)
  • From Wikipedia
  • Use redirection information, from which aliases
    can be extracted
  • E.g. "Carnegie Mellon" and "CMU"
  • From Eijiro
  • Assume target words are synonyms of each other
  • Risk of treating a financial "bank" and a river
    "bank" as synonyms

54
NTCIR6 QAC Answer Extraction
  • Used a Maximum Entropy model in which we can
    easily incorporate customizable features
  • One-sentence assumption
  • Finding answer boundaries is difficult, because
    non-factoid QA requires more text understanding
  • So, we assumed one sentence is an appropriate
    span to start with
  • The answer extraction problem then becomes more
    like an answer selection problem
  • Return the top N answer candidates,
  • as long as the score given to each answer
    candidate is over the threshold

55
NTCIR6 QAC Answer Extraction Features
  • Numeric features (Q denotes the question sentence,
    and A denotes the answer candidate sentence.)
  • KEYTERM: # of key terms from Q found in A
  • ALIAS: # of aliases (obtained from Wikipedia and
    Eijiro) from Q found in A
  • RELATED_TERM: # of related terms (obtained from
    web mining) from Q found in A
  • KEYTERM_DIST: Closest sentence-level distance of
    a key term from Q
  • ALIAS_DIST: Closest sentence-level distance of an
    alias from Q
  • RELATED_TERM_DIST: Closest sentence-level
    distance of a related term from Q
  • SENTENCE_LENGTH: length of the A sentence under a
    normal distribution

56
NTCIR6 QAC Answer Extraction Features
  • Binary features
  • (Q denotes the question sentence, and A denotes the
    answer candidate sentence.)
  • PATTERN_CUE: If there is a hand-crafted cue in A
  • HAS_SUBJ: If there is a subject in A
  • HAS_PRON: If there is a pronoun in A
  • ATYPE: Answer type analyzed from Q
  • LIST_QUE: If list cues  "(1)","???","?","?"  are
    found in A
  • PAREN: If  "?","?" or "(",")"  are found in A
  • PARAGRAPH_HEAD: If A is the beginning of a
    paragraph
  • KEYTERM_ATTACHMENT: -NO, -NI, -WA, -GA, -MO, -WO,
    -KANJI, -PAREN
  • Fires if a certain word occurs directly after the
    key term in A
  • Observation: -NO and -KANJI are strong features

57
NTCIR6 QAC Human judgment result
  • Manual judgment was done, for all 100 questions,
    by a person outside of NTCIR.
  • The top 4 answer candidates were evaluated
  • Answer candidates are labeled as
  • A: the candidate contains the answer
  • B: the candidate contains the answer but the main
    topic of the candidate is not the answer
  • C: the candidate contains a part of the answer
  • D: the candidate does not contain the answer
  • Judged result
  • A: 24, B: 30, C: 13, D: 310, out of 377 answer
    candidates
  • Our interpretation of the result
  • Precision is 18% ((A+B+C)/(A+B+C+D)) for loose
    evaluation.
  • For 42% of the questions, we were able to return
    at least one candidate with an A, B, or C label

58
Future plans
  • In answer type analysis, classify multiple binary
    features of the question, instead of picking out
    only one A-type category.
  • Instead of introducing the one-sentence
    assumption, treat answer extraction as an answer
    segmentation problem
  • Automatic evaluation metrics from
    text-segmentation task, such as COAP
    (Co-Occurrence Agreement Probability), will be
    available even if factoid and non-factoid
    questions are mixed together. (cf. Basic Element
    approach)

59
NTCIR6 CLQA EJ/JJ task Future Work
  • Use or develop a more accurate NE tagger, trading
    off against speed
  • True annotation
  • <PERSON>??</PERSON>????????<PERSON>???</PERSON>????????
  • Output from CaboCha
  • <LOCATION>?</LOCATION>?????????<PERSON>??</PERSON><ORGANIZATION>?</ORGANIZATION>????????
  • Output from Bar
  • <PERSON>??</PERSON>????????<PERSON>??</PERSON><ORGANIZATION>?</ORGANIZATION>????????
  • Try other classifier learning algorithms
  • SVM, AdaBoost, Decision Tree, Voted Perceptron,
    etc.
  • Feature engineering and beyond
  • "We put in such-and-such features and got the best
    accuracy." So what?
  • We want to interpret the result by answering the
    question: how much did feature A contribute to
    extracting the answer?

60
Answer Generator
  • NTCIR5
  • Cluster similar or redundant answers
  • For a cluster containing K answers whose
    extraction confidence scores are S1, S2, ..., SK,
    the cluster confidence is computed as a
    combination of these scores [formula not
    preserved in this transcript; a hedged sketch
    follows this list]
  • NTCIR6
  • Apply an answer ranking model to estimate a
    probability of an answer given multiple answer
    relevance and similarity features
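
The combination formula did not survive this transcript, so the noisy-OR combination below is purely an assumed stand-in, not JAVELIN's actual formula; it illustrates one way a cluster can score higher than any single member.

def cluster_confidence(scores):
    # Noisy-OR style combination (ASSUMPTION, not the original formula):
    # the cluster is confident unless every member is wrong
    confidence = 1.0
    for s in scores:
        confidence *= (1.0 - s)
    return 1.0 - confidence

print(cluster_confidence([0.4, 0.3, 0.2]))  # 0.664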

61
Answer Ranking Model
  • Two subtasks for answer ranking
  • Identify relevant answer candidates: estimate
    P(correct(Ai) | Ai, Q)
  • Exploit answer redundancy: estimate
    P(correct(Ai) | Ai, Aj)
  • Goal: Estimate P(correct(Ai) | Q, A1, ..., An)
  • Use logistic regression to estimate answer
    probability given the degree of answer relevance
    and the amount of supporting evidence provided in
    the set of answer candidates

62
Answer Ranking Model (2)
A logistic-regression model of the following form
(reconstructed from the definitions below; the
rendered equation was a slide image):

P(correct(Ai) | Q, A1, ..., AN)
  = σ( α0 + Σ_{k=1..K1} βk · relk(Ai)
        + Σ_{k=1..K2} γk · (1/(N-1)) Σ_{j≠i} simk(Ai, Aj) )

where simk(Ai, Aj) is a scoring function used to
calculate the answer similarity between Ai and Aj;
relk(Ai) is a feature function used to produce an
answer relevance score for an answer Ai; K1 and
K2 are the number of feature functions for answer
validity and answer similarity scores,
respectively; N is the number of answer
candidates; and α0, βk, γk are weights learned
from training data
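
A minimal sketch of scoring one candidate under a model of this shape, with illustrative weights.

import math

def answer_probability(rel_scores, sim_scores, alpha0, beta, gamma):
    # rel_scores: [rel_k(Ai)] for k = 1..K1
    # sim_scores: per similarity feature, the list of sim_k(Ai, Aj)
    #             over the other N-1 candidates
    n_others = len(sim_scores[0])
    z = alpha0
    z += sum(b * r for b, r in zip(beta, rel_scores))
    z += sum(g * sum(sims) / n_others for g, sims in zip(gamma, sim_scores))
    return 1.0 / (1.0 + math.exp(-z))  # logistic function

# one relevance feature, one similarity feature, two other candidates
print(answer_probability([0.7], [[0.9, 0.1]], alpha0=-1.0, beta=[2.0], gamma=[1.5]))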
63
Feature Representation
  • Answer Relevance Features
  • Knowledge-based Features
  • Data-driven Features
  • Answer Similarity Features
  • String distance metrics
  • Synonyms
  • Each feature produces an answer relevance or
    answer similarity score

64
Knowledge-based Feature: Gazetteers
  • Electronic gazetteers provide geographic
    information
  • English
  • Use Tipster Gazetteer, CIA World Factbook,
    Information about the US states
    (www.50states.com)
  • Japanese
  • Extract Japanese location information from Yahoo
  • Use Gengo GoiTaikei location names
  • Chinese
  • Extract location names from the Web and HowNet
  • Translated names
  • Translate country names provided by the CIA World
    Factbook and the Tipster gazetteers into Chinese
    and Japanese names
  • Top 3 translations were used

65
Relevance Score (Gazetteers)
66
Knowledge-based Feature: Ontologies
  • Ontologies such as WordNet contain information
    about relationships between words and general
    meaning types (synsets, semantic categories,
    etc.)
  • English
  • WordNet: WordNet 2.1 contains 155,327 words,
    117,597 synsets and 207,016 word-sense pairs
  • Japanese
  • Gengo GoiTaikei contains 300,000 Japanese words
    with their associated 3,000 semantic classes
  • Chinese
  • HowNet contains 65,000 Chinese concepts and
    75,000 corresponding English equivalents

67
Relevance Score (WordNet)
68
Data-driven Feature: Google
  • Use Google for English, Japanese and Chinese
  • For each answer candidate Ai:
  • 1. Initialize the Google score gs(Ai) = 0
  • 2. Create a query
  • 3. Retrieve the top 10 snippets from Google
  • 4. For each snippet s:
  • 4.1. Initialize the co-occurrence score cs(s) = 1
  • 4.2. For each keyterm translation k in s:
  • 4.2.1. Compute distance d, the minimum number of
    words between k and the answer candidate
  • 4.2.2. Update the snippet co-occurrence score
    [update formula not preserved in this transcript]
  • 4.3. gs(Ai) = gs(Ai) + cs(s)
  • Example question: What is the prefectural capital
    city whose name is written in hiragana?
  • Keyterms and their translations with confidence
    weights (the Japanese strings are garbled in this
    transcript): prefectural (0.75, 0.25), capital
    city (0.78, 0.11, 0.11), written (0.6), hiragana
    (0.5, 0.3, 0.11, 0.1)
  • Query: the answer candidate AND-ed with OR-groups
    of the weighted keyterm translations
  • A sketch of the scoring loop follows.
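
A minimal sketch of the scoring loop over already-retrieved snippets; since the 4.2.2 update rule is not preserved above, the distance-decay multiplier below is an assumption.

def google_score(candidate, snippets, keyterm_translations):
    gs = 0.0  # gs(Ai)
    for snippet in snippets:
        tokens = snippet.split()
        if candidate not in tokens:
            continue
        cs = 1.0  # cs(s)
        for k in keyterm_translations:
            if k in tokens:
                # d: minimum number of words between k and the candidate
                d = min(abs(i - j)
                        for i, t in enumerate(tokens) if t == k
                        for j, u in enumerate(tokens) if u == candidate)
                cs *= 1.0 + 1.0 / (1.0 + d)  # ASSUMED update, not the original rule
        gs += cs
    return gs

print(google_score("saitama",
                   ["the capital saitama is written in hiragana"],
                   ["capital", "hiragana"]))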

69
Data-driven Feature: Wikipedia
  • Use Wikipedia for English, Japanese and Chinese
  • Algorithm

70
Similarity Features
  • String Distance
  • Levenshtein, Cosine, Jaro and Jaro-Winkler
  • Synonyms
  • Binary similarity score for synonyms
  • English WordNet synonyms, Wikipedia redirection,
    CIA World Factbook
  • Japanese Wikipedia redirection, EIJIRO
    dictionary
  • Chinese Wikipedia redirection

sim(Ai, Aj) = 1, if Ai is a synonym of Aj;
              0, otherwise
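
A minimal sketch of two similarity features: a normalized string ratio (Python's difflib standing in for the Levenshtein/Jaro metrics) and the binary synonym score above.

from difflib import SequenceMatcher

def string_similarity(a1, a2):
    # Normalized string similarity in [0, 1]
    return SequenceMatcher(None, a1, a2).ratio()

def synonym_similarity(a1, a2, synonym_pairs):
    # Binary score: 1 if the two answers are listed as synonyms
    return 1.0 if (a1, a2) in synonym_pairs or (a2, a1) in synonym_pairs else 0.0

pairs = {("Bill Clinton", "William J. Clinton")}
print(string_similarity("Bill Clinton", "President Clinton"))
print(synonym_similarity("Bill Clinton", "William J. Clinton", pairs))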
71
Answer Similarity using Canonical forms
  • Type-specific conversion rules

72
AG Results (E-J and E-C)
[Results tables not preserved in this transcript]
73
Breakdown by Answer Type
[Breakdown charts for E-J and E-C not preserved in
this transcript]
74
Effects of Keyterm Translation on Answer Ranking
75
Future Work
  • Continue to analyze the effects of keyterm
    translation on answer ranking
  • Improve Web validation
  • Query relaxation when there are no matching Web
    documents
  • e.g. In which city in Japan is the "Ramen Museum"
    located?
  • "Ramen Museum" is translated into "ramen ???" and
    there are no matching Web documents
  • Change the query to (ramen AND ???) or
    incorporate the English keyterm (Ramen Museum)
  • Extend our joint prediction model to CLQA
  • Apply a probabilistic graphical model to estimate
    the joint probability of all answer candidates,
    from which the probability of an answer can be
    inferred

76
NTCIR Evaluation Results
77
Evaluation Metrics
  • Datasets
  • Monthly Evaluation
  • Current Performance
  • Performance History
  • Periodic Analysis
  • Error Analysis (on Development Set)
  • Plans for Future Development

78
Evaluation Metrics Report
  • End-to-end and modular evaluation
  • Evaluation of speed and accuracy
  • Summary of internal HTML evaluation reports
  • http://durazno.lti.cs.cmu.edu/wiki/moin.cgi/Javelin_Project/Multilingual/Evaluation

79
Plans for Future Development
  • Question Classification
  • Replace manually-created knowledge sources and
    heuristics with learners
  • Re-architect to place learned components as
    supporting agents under rule-based control
  • Semantic Role Labeling
  • Nominalizations
  • Semantic Predicate Expansion
  • Automatic ontology acquisition

80
Structured Retrieval for Question Answering
81
Standard Approach to Retrieval for QA
Output Answers
Input Question
  • Question Analysis
  • Determines what linguistic and semantic
    constraints must hold for a document to contain a
    valid answer
  • Formulates a bag-of-words query using question
    keywords and a named entity placeholder
    representing the expected answer type
  • Document Retrieval
  • Corpus indexed for keywords and named entity
    annotations [13]
  • Provides best-match documents containing keywords
    and NE
  • Answer Extraction and Post Processing
  • Checks constraints and extracts NE answer

82
Issues with Standard Approach
  • Why is the standard approach sometimes
    sub-optimal for QA?
  • May not scale to large collections
  • When question keywords are frequent and co-occur
    frequently, many documents that do not answer the
    question may be matched, e.g. "What country is
    Berlin in?"
  • Named entities can help narrow the search space
    but still match sentences such as "Berlin is near
    Poland."
  • May be slow
  • If answer extraction or constraint checking is
    not a cheap operation, current approach may
    retrieve large numbers of irrelevant documents
    that need to be checked.
  • May be ineffective for non-factoid (e.g.
    relationship) questions
  • Can we reduce the number of documents we need to
    process in order to find an answer more quickly,
    with more relevant documents more highly ranked?

83
Alternative Structured Retrieval
  • Using higher-order information can distinguish
    relevant vs. non-relevant results which look the
    same to a bag-of-words retrieval model
  • Linguistic and semantic analyses are stored as
    annotations and indexed as fields.
  • Constraint checking at retrieval time can improve
    document ranking based on matching constraints,
    thereby reducing post-processing burden.

84
The Role of Retrieval in QA
Output Answers
Input Question
  • Coarse, first-pass filter to narrow search space
    for answers
  • Finding actual answers requires checking
    linguistic and semantic constraints
  • Bag-of-words retrieval does not support such
    constraint checking at retrieval time
  • May need to process large number of irrelevant
    documents to find best answers.
  • Want to improve document ranking based on
    constraints

85
Retrieval Approaches for QA
Output Answers
Input Question
  • System A
  • Query composed of question keywords and a named
    entity placeholder
  • Bag-of-words retrieval
  • Constraint checking using ASSERT, answer
    extraction
  • System B
  • Likely answer-bearing structures posited; one
    query per structure
  • Structured retrieval with constraint checking
  • Answer extraction

86
Research Questions
  • How can we compare Systems A and B?
  • Experiment Answer-Bearing Sentence Retrieval
  • How does the effectiveness of the structured
    approach compare to bag-of-words?
  • Does structured retrieval effectiveness vary with
    question complexity?
  • Experiment The Effect of Annotation Quality
  • To what degree is the effectiveness of structured
    retrieval dependent on the quality of the
    annotations?

87
Experiment: Answer-Bearing Sentence Retrieval
  • Hypothesis: Structured retrieval retrieves more
    relevant documents, more highly ranked, compared
    to bag-of-words retrieval
  • AQUAINT Corpus (LDC2002T31)
  • Sentence Segmentation by MXTerminator [14]
  • Named Entity Recognition by BBN Identifinder [1]
  • Semantic Role Labels by ASSERT [12]
  • 109 TREC 2002 Factoid Questions
  • Exhaustive document-level judgments over AQUAINT
    [2, 8]
  • Training (55) and test (54) sets, with similar
    answer type distribution
  • Answer-bearing sentences manually identified
  • Must completely contain the answer without
    requiring inference or aggregation of information
    across multiple sentences
  • Gold-standard question analysis/query formulation

88
Example Answer Bearing Sentence
Q1402: What year did Wilt Chamberlain score 100
points?
A: At the time of his 100-point game with the
Philadelphia Warriors in 1962, Chamberlain was
renting an apartment in New York.
[SRL annotation: TARGET = renting]
89
Question-Structure Mapping
Q1402: What year did Wilt Chamberlain score 100
points?
[Diagram: an answer-bearing structure with TARGET,
ARGM-TMP, and "100 points"]
A structured query that retrieves instances of this
structure:

#combine[sentence]( #max( #combine[target](
  #max( #combine[./argm-tmp]( 100 point #any:date ) )
  #max( #combine[./arg0](
    #max( #combine[person]( chamberlain ) ) ) ) ) ) )
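
A minimal sketch of composing a query of this shape from role constraints; the helper names are illustrative and only mimic the operator nesting in the example above.

def combine(field, inner):
    return f"#combine[{field}]( {inner} )"

def maxed(inner):
    return f"#max( {inner} )"

def build_structured_query(role_constraints):
    # role_constraints: (role, inner_query) pairs for the target's arguments
    parts = " ".join(maxed(combine(role, inner)) for role, inner in role_constraints)
    return combine("sentence", maxed(combine("target", parts)))

print(build_structured_query([
    ("./argm-tmp", "100 point #any:date"),
    ("./arg0", maxed(combine("person", "chamberlain"))),
]))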
90
Answer-Bearing Sentence Retrieval
  • Two experimental conditions
  • single structure: one structured query
  • Only answer-bearing sentences matching a single
    structure are considered relevant
  • every structure: many queries, round robin
  • Any answer-bearing sentence is considered relevant
  • Most QA systems are somewhere in between, querying
    for several, but not all, structures
  • Keyword + Named Entity Baseline

91
Results: Training Topics
[Results chart; callout values 12.8 and 96.9]
Optimal smoothing parameters: Jelinek-Mercer [19],
with the collection language model weighted 0.2
and the document language model weighted 0.2
92
Results: Test Topics
[Results chart; callout values 11.4 and 46.6]
93
Results
[Charts for training topics (96.9) and test topics
(46.6)]
Optimal smoothing parameters: Jelinek-Mercer [19],
with the collection language model weighted 0.2
and the document language model weighted 0.2
94
Structure Complexity
  • Results show that, on average, structured
    retrieval has superior recall of answer-bearing
    sentences.
  • For what types of queries is structured retrieval
    most helpful?
  • Analyze recall at rank 200 for queries of
    different levels of complexity.
  • Complexity of a structure is estimated by counting
    the number of #combine operators, not including
    the outermost.

95
The more complex the structure sought, the more
useful knowledge of that structure is in ranking
answer-bearing sentences.
96
In the test set, there are fewer queries in total,
and fewer highly complex queries. This widens
confidence intervals, but there is still a range
where the 95% confidence intervals do not overlap
much or at all.
97
The Effect of Annotation Quality
  • Penn Treebank WSJ [9] corpus
  • WSJ_GOLD: Gold-standard Propbank [6] annotations
  • WSJ_DEGRADED: Semantic role labeling by ASSERT
    (88.8% accurate)
  • All questions answerable over the corpus
  • Exhaustively generated sentence-level relevance
    judgments
  • 10,690 questions having more than one answer

98
Question and Judgment Generation
  • Each sentence that contains a Propbank annotation
    can answer at least one question
  • Dow Jones publishes The Wall Street Journal,
    Barron's magazine, other periodicals and
    community newspapers.
  • What does Dow Jones publish?
  • Who publishes The Wall Street Journal, Barron's
    ...?
  • Does Dow Jones publish The Wall Street Journal,
    Barron's ... ?

[SRL annotation: TARGET = publishes]
99
What does Dow Jones publish?
  • The group of sentences that answer this question:
  • WSJ_0427: Dow Jones publishes The Wall Street
    Journal, Barron's magazine, other periodicals and
    community newspapers and operates electronic
    business information services.
  • WSJ_0152: Dow Jones publishes The Wall Street
    Journal, Barron's magazine, and community
    newspapers and operates financial news services
    and computer data bases.
  • WSJ_1551: Dow Jones also publishes Barron's
    magazine, other periodicals and community
    newspapers and operates electronic business
    information services.

100
Judgments for WSJ_DEGRADED
  • Sentences relevant for WSJ_GOLD are not relevant
    for WSJ_DEGRADED if ASSERT omits or mislabels an
    argument
  • This models the reality of a QA system that
    cannot determine relevance if annotations are
    missing or incorrect, or if the sentence cannot
    be analyzed
  • Constraint checking and answer extraction both
    depend on the analysis

101
Annotation Quality Results
Structured retrieval is robust to degraded
annotation quality.
102
Structured Retrieval Recall
  • Structured retrieval ranks sentences that satisfy
    the constraints highly.
  • Structured retrieval outperforms the bag-of-words
    approach in terms of recall of relevant
    sentences.
  • Structured retrieval performs best when query
    structures anticipate answer-bearing structures,
    and when these structures are complex.

103
Structured Retrieval Precision
  • For questions with keywords that frequently
    co-locate in the corpus, structured retrieval
    should offer a sizable precision advantage, e.g.
    "What country is Berlin in?"
  • Querying on Berlin alone matches over 6,000
    documents in the AQUAINT collection, most of
    which do not answer the question.
  • Questions such as this were intentionally
    excluded during construction of the test
    collection to ease the human assessment burden.

104
Structured Retrieval Efficiency
  • Structured queries are slower to evaluate, but
    retrieve more relevant results more highly
    ranked, compared to bag-of-words queries.
  • A QA system seeking to achieve a certain recall
    threshold will have to process fewer documents
  • Processing fewer results can improve end-to-end
    system runtime, even for systems in which answer
    extraction cost is low.
  • The structured retrieval approach requires that
    the corpus be pre-processed off-line.
  • Using the bag-of-words approach, a QA system is
    free to run analysis tools on-the-fly, but this
    could negatively impact the latency of an
    interactive system.

105
Structured Retrieval Robustness
  • Although accuracy degrades when the annotation
    quality degrades, the relative performance edge
    that structured retrieval enjoys over
    bag-of-words is maintained. (details in the
    paper)

106
Exploring the Problem Space
Corpus-based view vs. query-based view
  • Domain (keyword distribution over the corpus):
    Newswire, WMD, Medical
  • Language: EN, JP, CH
  • Annotations (annotation or structure distribution
    over the corpus): NE, SRL, NomSRL, Syntax,
    special-purpose event frames
There is a hypothesized sub-space in which
structured retrieval consistently outperforms.
We may be able to determine the boundaries
experimentally and then generalize.
107
Conclusions
  • Structured retrieval retrieves more relevant
    documents, more highly ranked, compared to
    bag-of-words retrieval.
  • The better ranking requires the QA system to
    process fewer documents to achieve a certain
    level of recall of answer-bearing sentences.
  • Although accuracy degrades when the annotation
    quality degrades, the relative performance edge
    that structured retrieval enjoys over
    bag-of-words is maintained.
  • Details are in the paper (submitted to SIGIR)

108
Future Work
  • Question Analysis for Structured Retrieval
  • Map question structures into likely
    answer-bearing structures
  • Mitigate computational burden of corpus
    annotation
  • How to merge results from different structured
    queries in the event that more than one structure
    is considered relevant?

109
Experimental Plan
110
References
  • [1] Bikel, Schwartz and Weischedel. An algorithm
    that learns what's in a name. Machine Learning,
    34(1-3):211-231, 1999.
  • [2] Bilotti, Katz and Lin. What works better for
    question answering: stemming or morphological
    query expansion? In Proc. of the IR4QA Workshop
    at SIGIR'04, 2004.
  • [6] Kingsbury, Palmer and Marcus. Adding
    semantic annotation to the Penn Treebank. In
    Proc. of HLT'02, 2002.
  • [8] Lin and Katz. Building a reusable test
    collection for question answering. JASIST,
    57(7):851-861, 2006.
  • [9] Marcus, Marcinkiewicz and Santorini.
    Building a large annotated corpus of English:
    the Penn Treebank. Computational Linguistics,
    19(2):313-330, 1993.
  • [12] Pradhan, Ward, Hacioglu, Martin and
    Jurafsky. Shallow semantic parsing using support
    vector machines. In Proc. of HLT/NAACL'04, 2004.
  • [13] Prager, Brown, Coden and Radev.
    Question-answering by predictive annotation. In
    Proc. of SIGIR'00, 2000.
  • [14] Reynar and Ratnaparkhi. A maximum entropy
    approach to identifying sentence boundaries. In
    Proc. of ANLP'97, 1997.
  • [19] Zhai and Lafferty. A study of smoothing
    methods for language models applied to ad hoc
    information retrieval. In Proc. of SIGIR'01,
    2001.

111
Semantic Role Labeling for JAVELIN
  • Introduction to Semantic Role Labeling
  • Extracting information such as WHO did WHAT to
    WHOM, WHEN and HOW, from the sentence
  • Predicate describes the action/event, its
    arguments give information about who, what, whom,
    when etc
  • Useful for Information Extraction, Question
    Answering and Summarization. E.g.
    "bromocriptine-induced activation of p38 MAP
    kinase" contains information used to answer
    questions such as "What activates p38 MAP
    kinase?" or "What induces the activation of p38
    MAP kinase?"

112
Semantic Role Labeling for JAVELIN
  • English SRL: ASSERT from U. Colorado
  • ASSERT performance (F-measure)
  • Hand-corrected parses: 89.4
  • Automatic parsing: 79.4
  • Upgrade to ASSERT 0.14b
  • Current ASSERT is slow: the model is loaded once
    per document
  • Explore using the remote/client service options
    from new ASSERT

113
Semantic Role Labeling for JAVELIN
  • Example from ASSERT
  • I mean, the line was out the door, when I first
    got there.
  • [ARG0 I] [TARGET mean] [ARG1 the line was out
    the door, when I first got there]
  • I mean, the line was out the door, [ARGM-TMP
    when] [ARG0 I] [ARGM-TMP first] [TARGET got]
    [ARGM-LOC there]
  • ASSERT misses the "be" and "have" verbs. KANTOO's
    rule-based system is used to handle these cases;
    "be" and "have" occur frequently in questions.
  • PROPBANK doesn't have any examples of the
    predicate "be" in the training corpus
  • Plan for our own future work on SRL for questions

114
Semantic Role Labeling for JAVELIN
  • SRL for Chinese
  • C-ASSERT: Chinese ASSERT
  • F-score: 82.02
  • A Chinese extension of the English ASSERT
  • Example
  • ????????????????????
  • [ARG0 ??? ??] [ARGM-ADV ?] ??? [ARG0 ? ?? ???]
    [TARGET ??] [ARG1 ?? ?]

115
Semantic Role Labeling for JAVELIN
  • SRL for Japanese: not much work has been done.
    Develop an SRL system in-house, starting as a
    class project
  • Recently released: the NAIST Text Corpus v1.2
    beta includes verbal and nominative predicates
    with labeled arguments
  • The Kyoto Text Corpus v4.0 includes POS,
    non-projective dependency parses for 40,000
    sentences and case roles, anaphora, ellipsis and
    co-reference for 5,000 sentences
  • Use CRF and Tree-CRF for the learning task

116
Questions?