1
Relation Extraction for Academic Collaboration
10-709 Project Presentation
  • Justin Betteridge, Matthew Bilotti, Simon Fung,
    Sophie Wang
  • February 16, 2006

2
Academic Collaboration
  • When two academic researchers work together...
  • on a proposal
  • by co-authoring a paper
  • by co-chairing a committee
  • in the same project or research group
  • This is evidence of Academic Collaboration
  • Binary, symmetric relation
  • Arguments are of type <person>

3
Motivation
  • Why might we be interested in extracting Academic
    Collaboration relations?
  • Social Networking
  • Explore the transitivity of the relation
  • Proof-of-concept for extending relation
    extraction machinery to other types of relations

4
Architectural Overview
[Architecture diagram: Query Formulator, Pattern Bank, Relation Bank, IR engine, Relation Extractor, Pattern Extractor; a skeleton sketch of these components follows]
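A minimal skeleton of how these components might fit together; all class and function names here (Pattern, Relation, Banks, formulate_query) are illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Pattern:
    context: str              # e.g. a context string such as "my advisor is"
    confidence: float = 0.0

@dataclass
class Relation:
    arg1: str                 # person taken from the page title
    arg2: str                 # person following the context string
    confidence: float = 0.0

@dataclass
class Banks:
    patterns: list = field(default_factory=list)    # Pattern Bank
    relations: list = field(default_factory=list)   # Relation Bank

def formulate_query(item):
    """Query Formulator: turn a pattern or a relation into an IR query string."""
    if isinstance(item, Pattern):
        return f'"{item.context}"'
    return f'"{item.arg1}" "{item.arg2}"'
```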
5
Co-training Algorithm
  • Do until the termination condition is reached
    • For each pattern in the pattern bank
      • Generate an IR query and send it to the IR
        engine, getting back a set of documents
      • For each document in the set, extract relations
    • Score all relations (new and old)
    • Remove relations below threshold (this pass is
      sketched in code below)
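A sketch of this pattern-to-relation pass, assuming the hypothetical names above and treating the IR engine, relation extractor, and scoring function as injected callables (ir_search, extract_relations, score_relation):

```python
def relation_extraction_pass(pattern_bank, relation_bank, formulate_query,
                             ir_search, extract_relations, score_relation,
                             threshold):
    """For each pattern: query the IR engine, extract candidate relations
    from the returned documents, then rescore and prune the relation bank."""
    for pattern in pattern_bank:
        documents = ir_search(formulate_query(pattern))
        for doc in documents:
            relation_bank.extend(extract_relations(pattern, doc))
    for relation in relation_bank:           # score new and old relations
        relation.confidence = score_relation(relation)
    relation_bank[:] = [r for r in relation_bank
                        if r.confidence >= threshold]
    return relation_bank
```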

6
Co-training Algorithm II
  • For each relation in the relation bank
    • Generate an IR query and send it to the IR
      engine, getting back a set of documents
    • For each document in the set, extract context
      strings as candidate patterns
  • Score all patterns (new and old)
  • Remove patterns below threshold
  • Loop (this pass and the outer loop are sketched
    in code below)
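A matching sketch of the relation-to-pattern pass and the outer co-training loop, again with hypothetical callables (extract_contexts, score_pattern, terminated):

```python
def pattern_extraction_pass(pattern_bank, relation_bank, formulate_query,
                            ir_search, extract_contexts, score_pattern,
                            threshold):
    """For each relation: query the IR engine, harvest context strings as
    candidate patterns, then rescore and prune the pattern bank."""
    for relation in relation_bank:
        documents = ir_search(formulate_query(relation))
        for doc in documents:
            pattern_bank.extend(extract_contexts(relation, doc))
    for pattern in pattern_bank:             # score new and old patterns
        pattern.confidence = score_pattern(pattern)
    pattern_bank[:] = [p for p in pattern_bank if p.confidence >= threshold]
    return pattern_bank

def co_train(relation_pass, pattern_pass, terminated):
    """Outer loop: alternate the two passes until the termination condition."""
    while not terminated():
        relation_pass()
        pattern_pass()
```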

7
Extraction Pattern Formalism
  • From the proposal
    • left <arg1> between <arg2> right
    • Arguments extracted with respect to context
  • Current status quo
    • <context string> <arg2>
    • <arg1> extracted from the page title
  • Extracts the relation
    • CollaboratesWith(<arg1>, <arg2>) (a worked
      example follows below)
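A worked example of the current pattern form, as a sketch only: a hypothetical regex extractor takes <arg1> from the page title and <arg2> from the text following the context string. The context string and regex are assumptions; the example pair appears on the Extracted Relations slide.

```python
import re

# Illustrative only: not the project's actual extractor.
CONTEXT = "my advisor is"
AFTER_CONTEXT = re.compile(
    re.escape(CONTEXT) + r"\s+((?:(?:Prof\.|Dr\.|[A-Z][\w.-]*)\s?){1,4})"
)

def extract_collaborates_with(page_title, page_text):
    """<arg1> comes from the page title, <arg2> from the text that
    follows the context string in the page body."""
    match = AFTER_CONTEXT.search(page_text)
    if match is None:
        return None
    return ("CollaboratesWith", page_title.strip(), match.group(1).strip())

# Example:
# extract_collaborates_with("Miroslav Dudik",
#                           "... my advisor is Rob Schapire in the CS department")
#   -> ("CollaboratesWith", "Miroslav Dudik", "Rob Schapire")
```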

8
Detecting Argument Types
  • CollaboratesWith(<arg1>, <arg2>)
  • <arg1> and <arg2> must be of type <person>
  • Essential to weed out low-quality relations
    produced by noisy patterns such as "in
    collaboration with"
  • Heuristics currently encoded as regular
    expressions (an illustrative sketch follows below)
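An illustrative sketch of the kind of regular-expression heuristic described here; the exact regex and stopword list are assumptions, not the project's actual rules.

```python
import re

# Hypothetical person-name check: an optional title, a capitalized first
# name, an optional middle initial, and up to two more capitalized tokens.
PERSON = re.compile(
    r"^(?:(?:Prof\.?|Professor|Dr\.?)\s+)?"      # optional title
    r"[A-Z][a-z]+"                               # first name
    r"(?:\s+[A-Z]\.?)?"                          # optional middle initial
    r"(?:\s+[A-Z][a-z]+){0,2}$"                  # up to two more name tokens
)
# Reject common page-title words that pass the shape test but are not names.
STOPWORDS = {"Personal", "Research", "Home", "Publications"}

def looks_like_person(arg: str) -> bool:
    arg = arg.strip()
    return arg not in STOPWORDS and bool(PERSON.match(arg))
```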

9
Measuring Confidence with Coverage
  • Confidence for an Extraction Pattern
    • Intuitively, relations vote for patterns
    • Query with each relation and try to extract the pattern
    • Score is the proportion of successful relations
  • Confidence for a Relation
    • Query with each pattern and try to extract the relation
    • Score is the proportion of successful patterns
      (see the scoring sketch below)
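A sketch of coverage scoring as described on this slide; the yields_pattern / yields_relation predicates are hypothetical stand-ins for "query and try to extract".

```python
def pattern_confidence(pattern, relations, yields_pattern) -> float:
    """Relations vote for a pattern: query with each known relation and
    check whether the pattern can be extracted from the returned pages."""
    if not relations:
        return 0.0
    votes = sum(1 for r in relations if yields_pattern(r, pattern))
    return votes / len(relations)

def relation_confidence(relation, patterns, yields_relation) -> float:
    """Patterns vote for a relation: query with each known pattern and
    check whether the relation can be extracted from the returned pages."""
    if not patterns:
        return 0.0
    votes = sum(1 for p in patterns if yields_relation(p, relation))
    return votes / len(patterns)
```

With three patterns in the bank, a relation re-extracted by two of them scores 2/3 ≈ 0.667, consistent with the scores on the Extracted Relations slide.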

10
Issues with Coverage as Confidence
  • Seed relations and patterns must co-occur
  • Very little tolerance for new information
  • It is difficult for a new pattern that broadens
    the scope of the relations extracted to gain
    enough confidence to surpass the threshold
  • Scores tend to zero as pools grow
  • However, ad-hoc combination of confidence scores
    from one iteration to the next introduces a new
    problem: there is no way to oust bad relations or
    patterns once extracted

11
Example Seed Data for Co-Training
  • Extraction Patterns
  • in collaboration with
  • my advisor is
  • Relations
  • CollaboratesWith( Tom, Roni )
  • CollaboratesWith( William, Ken )

12
Extraction Pattern Examples
Query: "my advisor is" site:cs.cmu.edu
13
Extracted Relations
  • "Miroslav Dudik""Rob Schapire"
    0.3333333333333333
  • "Personal""Prof. Sanjeev" 0.3333333333333333
  • "Research""Professors Jonathan"
    0.3333333333333333
  • "Sharon Whiteman""Mary Vernon"
    0.3333333333333333
  • "Sudhakar""Prof. Edward" 0.3333333333333333
  • "Ting""Professor Andrew" 0.3333333333333333
  • "Adriana Karagiozova""Moses Charikar"
    0.6666666666666666
  • "Akash Lal""Tom Reps" 0.6666666666666666
  • "Amy Karlson""Benjamin B. Bederson"
    0.6666666666666666
  • "Aravind Kalaiah""Dr. Amitabh"
    0.6666666666666666
  • "Chi Zhang""Randolph Y. Wang"
    0.6666666666666666
  • "Gaurav Shah""Matt Blaze" 0.6666666666666666
  • "Jennifer Beckmann""Jeff Naughton"
    0.6666666666666666
  • "Lucja Kot""Dexter Kozen" 0.6666666666666666
  • "Mark Sandler""Jon Kleinberg"
    0.6666666666666666
  • "Nina""Prof. Avrim" 0.6666666666666666
  • "Patrick Ng""Uri Keich" 0.6666666666666666
  • "Pavlos Papageorgiou""Prof. Michael"
    0.6666666666666666
  • "Pratyusa Manadhata""Jeannette M. Wing"
    0.6666666666666666

14
Learned Patterns
  • My advisor is 0.6
  • Near misses (hard to assess confidence)
  • I work with 0.4
  • Together with 0.0667
  • Languages Research under 0.0333
  • Computer Science advisor 0.0333
  • Languages under Prof 0.0
  • Study under Prof 0.0
  • currently working with 0.0
  • user studies with 0.0

15
Bad Patterns
  • From citations
  • Amit Agarwal and, etc. (other authors)
  • L1 Norm with (part of a title)
  • From professional titles
  • Professor, Professor of Mathematics, etc.
  • From course web pages
  • courses cs686 2003sp
  • Other
  • be addressed to

16
Software and Datasets Used
  • Indri retrieval engine
  • Locally crawled collection of pages from CS
    departments of universities
  • Using a local collection greatly improved the
    development experience by shortening the
    debugging cycle and freeing us from the Google
    API query quota
  • No Indri features that Google does not support
    were used, so that Google could be substituted
    for Indri in the future

17
Future Work
  • Different methods of combining confidence scores
  • including weighting of votes during scoring
  • Different confidence metrics, e.g., PMI
  • Additional useful sources of information
bibliographies, anchor text and link structure,
    advisor-advisee cross-references, department or
    lab organization
  • Better argument type checking
  • Tuning of the threshold
  • Termination condition
  • Integration with citations group
  • Integrate with Google
  • Make code run faster