1
Relation Extraction for Academic Collaboration
10-709 Project Presentation
  • Justin Betteridge, Matthew Bilotti, Simon Fung,
    Sophie Wang
  • February 16, 2006

2
Academic Collaboration
  • When two academic researchers work together...
  • on a proposal
  • by co-authoring a paper
  • by co-chairing a committee
  • in the same project or research group
  • This is evidence of Academic Collaboration
  • Binary, symmetric relation
  • Arguments are of type <person>

3
Motivation
  • Why might we be interested in extracting Academic
    Collaboration relations?
  • Social Networking
  • Explore the transitivity of the relation
  • Proof-of-concept for extending relation
    extraction machinery to other types of relations

4
Architectural Overview
[Architecture diagram: Query Formulator, Pattern Bank, Relation Bank, IR engine, Relation Extractor, Pattern Extractor; a skeleton sketch of these components follows]
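A minimal skeleton of how these components might fit together; all class and function names here (Pattern, Relation, Banks, formulate_query) are illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Pattern:
    context: str              # e.g. a context string such as "my advisor is"
    confidence: float = 0.0

@dataclass
class Relation:
    arg1: str                 # person taken from the page title
    arg2: str                 # person following the context string
    confidence: float = 0.0

@dataclass
class Banks:
    patterns: list = field(default_factory=list)    # Pattern Bank
    relations: list = field(default_factory=list)   # Relation Bank

def formulate_query(item):
    """Query Formulator: turn a pattern or a relation into an IR query string."""
    if isinstance(item, Pattern):
        return f'"{item.context}"'
    return f'"{item.arg1}" "{item.arg2}"'
```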
5
Co-training Algorithm
  • Do until the termination condition is reached
    • For each pattern in the pattern bank
      • Generate an IR query and send it to the IR
        engine, getting back a set of documents
      • For each document in the set, extract relations
    • Score all relations (new and old)
    • Remove relations below threshold (this pass is
      sketched in code below)
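A sketch of this pattern-to-relation pass, assuming the hypothetical names above and treating the IR engine, relation extractor, and scoring function as injected callables (ir_search, extract_relations, score_relation):

```python
def relation_extraction_pass(pattern_bank, relation_bank, formulate_query,
                             ir_search, extract_relations, score_relation,
                             threshold):
    """For each pattern: query the IR engine, extract candidate relations
    from the returned documents, then rescore and prune the relation bank."""
    for pattern in pattern_bank:
        documents = ir_search(formulate_query(pattern))
        for doc in documents:
            relation_bank.extend(extract_relations(pattern, doc))
    for relation in relation_bank:           # score new and old relations
        relation.confidence = score_relation(relation)
    relation_bank[:] = [r for r in relation_bank
                        if r.confidence >= threshold]
    return relation_bank
```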

6
Co-training Algorithm II
  • For each relation in the relation bank
    • Generate an IR query and send it to the IR
      engine, getting back a set of documents
    • For each document in the set, extract context
      strings as candidate patterns
  • Score all patterns (new and old)
  • Remove patterns below threshold
  • Loop (this pass and the outer loop are sketched
    in code below)
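A matching sketch of the relation-to-pattern pass and the outer co-training loop, again with hypothetical callables (extract_contexts, score_pattern, terminated):

```python
def pattern_extraction_pass(pattern_bank, relation_bank, formulate_query,
                            ir_search, extract_contexts, score_pattern,
                            threshold):
    """For each relation: query the IR engine, harvest context strings as
    candidate patterns, then rescore and prune the pattern bank."""
    for relation in relation_bank:
        documents = ir_search(formulate_query(relation))
        for doc in documents:
            pattern_bank.extend(extract_contexts(relation, doc))
    for pattern in pattern_bank:             # score new and old patterns
        pattern.confidence = score_pattern(pattern)
    pattern_bank[:] = [p for p in pattern_bank if p.confidence >= threshold]
    return pattern_bank

def co_train(relation_pass, pattern_pass, terminated):
    """Outer loop: alternate the two passes until the termination condition."""
    while not terminated():
        relation_pass()
        pattern_pass()
```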

7
Extraction Pattern Formalism
  • From the proposal
    • left <arg1> between <arg2> right
    • Arguments extracted with respect to context
  • Current status quo
    • <context string> <arg2>
    • <arg1> extracted from the page title
  • Extracts the relation
    • CollaboratesWith(<arg1>, <arg2>) (a worked
      example follows below)
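A worked example of the current pattern form, as a sketch only: a hypothetical regex extractor takes <arg1> from the page title and <arg2> from the text following the context string. The context string and regex are assumptions; the example pair appears on the Extracted Relations slide.

```python
import re

# Illustrative only: not the project's actual extractor.
CONTEXT = "my advisor is"
AFTER_CONTEXT = re.compile(
    re.escape(CONTEXT) + r"\s+((?:(?:Prof\.|Dr\.|[A-Z][\w.-]*)\s?){1,4})"
)

def extract_collaborates_with(page_title, page_text):
    """<arg1> comes from the page title, <arg2> from the text that
    follows the context string in the page body."""
    match = AFTER_CONTEXT.search(page_text)
    if match is None:
        return None
    return ("CollaboratesWith", page_title.strip(), match.group(1).strip())

# Example:
# extract_collaborates_with("Miroslav Dudik",
#                           "... my advisor is Rob Schapire in the CS department")
#   -> ("CollaboratesWith", "Miroslav Dudik", "Rob Schapire")
```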

8
Detecting Argument Types
  • CollaboratesWith(<arg1>, <arg2>)
  • <arg1> and <arg2> must be of type <person>
  • Essential to weed out low-quality relations
    produced by noisy patterns such as "in
    collaboration with"
  • Heuristics currently encoded as regular
    expressions (an illustrative sketch follows below)
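An illustrative sketch of the kind of regular-expression heuristic described here; the exact regex and stopword list are assumptions, not the project's actual rules.

```python
import re

# Hypothetical person-name check: an optional title, a capitalized first
# name, an optional middle initial, and up to two more capitalized tokens.
PERSON = re.compile(
    r"^(?:(?:Prof\.?|Professor|Dr\.?)\s+)?"      # optional title
    r"[A-Z][a-z]+"                               # first name
    r"(?:\s+[A-Z]\.?)?"                          # optional middle initial
    r"(?:\s+[A-Z][a-z]+){0,2}$"                  # up to two more name tokens
)
# Reject common page-title words that pass the shape test but are not names.
STOPWORDS = {"Personal", "Research", "Home", "Publications"}

def looks_like_person(arg: str) -> bool:
    arg = arg.strip()
    return arg not in STOPWORDS and bool(PERSON.match(arg))
```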

9
Measuring Confidence with Coverage
  • Confidence for an Extraction Pattern
    • Intuitively, relations vote for patterns
    • Query with each relation and try to extract the pattern
    • Score is the proportion of successful relations
  • Confidence for a Relation
    • Query with each pattern and try to extract the relation
    • Score is the proportion of successful patterns
      (see the scoring sketch below)
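A sketch of coverage scoring as described on this slide; the yields_pattern / yields_relation predicates are hypothetical stand-ins for "query and try to extract".

```python
def pattern_confidence(pattern, relations, yields_pattern) -> float:
    """Relations vote for a pattern: query with each known relation and
    check whether the pattern can be extracted from the returned pages."""
    if not relations:
        return 0.0
    votes = sum(1 for r in relations if yields_pattern(r, pattern))
    return votes / len(relations)

def relation_confidence(relation, patterns, yields_relation) -> float:
    """Patterns vote for a relation: query with each known pattern and
    check whether the relation can be extracted from the returned pages."""
    if not patterns:
        return 0.0
    votes = sum(1 for p in patterns if yields_relation(p, relation))
    return votes / len(patterns)
```

With three patterns in the bank, a relation re-extracted by two of them scores 2/3 ≈ 0.667, consistent with the scores on the Extracted Relations slide.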

10
Issues with Coverage as Confidence
  • Seed relations and patterns must co-occur
  • Very little tolerance for new information
  • It is difficult for a new pattern that broadens
    the scope of the relations extracted to gain
    enough confidence to surpass the threshold
  • Scores tend to zero as pools grow
  • However, ad-hoc combination of confidence scores
    from one iteration to the next introduces a new
    problem: there is no way to oust bad relations or
    patterns once extracted

11
Example Seed Data for Co-Training
  • Extraction Patterns
  • in collaboration with
  • my advisor is
  • Relations
  • CollaboratesWith( Tom, Roni )
  • CollaboratesWith( William, Ken )

12
Extraction Pattern Examples
Query: "my advisor is" site:cs.cmu.edu
13
Extracted Relations
  • "Miroslav Dudik""Rob Schapire"
    0.3333333333333333
  • "Personal""Prof. Sanjeev" 0.3333333333333333
  • "Research""Professors Jonathan"
    0.3333333333333333
  • "Sharon Whiteman""Mary Vernon"
    0.3333333333333333
  • "Sudhakar""Prof. Edward" 0.3333333333333333
  • "Ting""Professor Andrew" 0.3333333333333333
  • "Adriana Karagiozova""Moses Charikar"
    0.6666666666666666
  • "Akash Lal""Tom Reps" 0.6666666666666666
  • "Amy Karlson""Benjamin B. Bederson"
    0.6666666666666666
  • "Aravind Kalaiah""Dr. Amitabh"
    0.6666666666666666
  • "Chi Zhang""Randolph Y. Wang"
    0.6666666666666666
  • "Gaurav Shah""Matt Blaze" 0.6666666666666666
  • "Jennifer Beckmann""Jeff Naughton"
    0.6666666666666666
  • "Lucja Kot""Dexter Kozen" 0.6666666666666666
  • "Mark Sandler""Jon Kleinberg"
    0.6666666666666666
  • "Nina""Prof. Avrim" 0.6666666666666666
  • "Patrick Ng""Uri Keich" 0.6666666666666666
  • "Pavlos Papageorgiou""Prof. Michael"
    0.6666666666666666
  • "Pratyusa Manadhata""Jeannette M. Wing"
    0.6666666666666666

14
Learned Patterns
  • My advisor is 0.6
  • Near misses (hard to assess confidence)
  • I work with 0.4
  • Together with 0.0667
  • Languages Research under 0.0333
  • Computer Science advisor 0.0333
  • Languages under Prof 0.0
  • Study under Prof 0.0
  • currently working with 0.0
  • user studies with 0.0

15
Bad Patterns
  • From citations
  • Amit Agarwal and, etc. (other authors)
  • L1 Norm with (part of a title)
  • From professional titles
  • Professor, Professor of Mathematics, etc.
  • From course web pages
  • courses cs686 2003sp
  • Other
  • be addressed to

16
Software and Datasets Used
  • Indri retrieval engine
  • Locally crawled collection of pages from CS
    departments of universities
  • Using a local collection greatly improved the
    development experience by shortening the
    debugging cycle and freeing us from the Google
    API query quota
  • No Indri features that Google does not support
    were used, so that Google could be substituted
    for Indri in the future

17
Future Work
  • Different methods of combining confidence scores
  • including weighting of votes during scoring
  • Different confidence metrics, e.g., PMI
  • Additional useful sources of information
bibliographies, anchor text and link structure,
    advisor-advisee cross-references, department or
    lab organization
  • Better argument type checking
  • Tuning of the threshold
  • Termination condition
  • Integration with citations group
  • Integrate with Google
  • Make code run faster