Extracting Keyphrases from Books using Language Modeling Approaches

About This Presentation
Title:

Extracting Keyphrases from Books using Language Modeling Approaches

Description:

Automatic identification of keyphrases from chapters of books. Language independent ... KPSpotter: a flexible information gain-based keyphrase extraction system. ...

Number of Views:55
Avg rating:3.0/5.0
Slides: 22
Provided by: researc88
Transcript and Presenter's Notes

Title: Extracting Keyphrases from Books using Language Modeling Approaches


1
Extracting Keyphrases from Books using Language
Modeling Approaches
  • Rohini U
  • AOL India RD,
  • Bangalore India
  • Rohini.uppuluri_at_corp.aol.com
  • Vamshi Ambati
  • Language Technologies Institute
  • Carnegie Mellon University
  • Pittsburgh, USA
  • vamshi_at_cs.cmu.edu

2
Agenda
  • Keyphrase Extraction
  • Value addition to Digital Libraries
  • Methods of Keyphrase Extraction
  • Related Work
  • Our Solution

3
What are Keyphrases?
  • Keyphrases
  • (Give example)
  • Where used?
  • Cataloguing in Libraries for IR purposes
  • Quick Summarization of documents

4
Why important to ULIB?
  • Vast growth in digital content
  • More than a Million books!
  • Short metadata descriptions are useful to the user
    while reading
  • Support further processing of books, such as
    summarization and IR

5
How do we extract KPs?
  • Manual entry
  • Reliable, high quality outcome
  • But, time-consuming, expensive
  • Automatic
  • Fast extraction but less reliable
  • No expense at all

6
Automatic techniques for KPE
  • Rule based methods
  • Heuristics (paragraph beginning, headline etc)
  • Krulwich and Burkey, etc.
  • Using Linguistic tools
  • Statistical techniques
  • Term counts and weighting based Methods
  • Learn model from training data
  • Turney [5], KEA [6], KPSpotter [3], etc.

7
Requirements for a KPE for ULIB
  • Automatic Identification of Keyphrases from
    chapters of books
  • Language independent
  • Easily adaptable for different domains
  • No training data to learn from
  • Most books in ULIB do not have keywords as part
    of the metadata

8
Solution Outline
  • Language Modeling based
  • Given n-grams
  • Measure Informativeness, Phraseness
  • Score n-grams based on the above measures
  • Pick top K phrases as Keyphrases

9
Extracting Keyphrases from Books
  • Input: Text
  • Cleaning and Initialization
  • Candidate Keyphrase Extraction
  • Scoring
  • Pruning
  • Output: Extracted Keyphrases

10
Extracting Keyphrases from Books
  • Input text: Topics are also used to construct user
    profiles via explicit specification of interests or
    automatic analysis of Web pages visited
  • After cleaning: topics construct user profiles explicit
    specification interests automatic analysis web pages
    visited

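
The cleaning step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `clean` function and the tiny `STOPWORDS` set are assumptions chosen so that the slide's example sentence reduces to the token sequence shown above.

```python
import re

# Hypothetical stopword list; a real system would use a fuller,
# language-specific list. Chosen here only to reproduce the slide example.
STOPWORDS = frozenset({
    "a", "also", "and", "are", "in", "is", "of", "or",
    "the", "to", "used", "via",
})

def clean(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

sentence = ("Topics are also used to construct user profiles via explicit "
            "specification of interests or automatic analysis of Web pages "
            "visited")
cleaned = clean(sentence)
```

Here `cleaned` holds the twelve remaining tokens, starting with `topics` and ending with `visited`.
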
11
Extracting Keyphrases from Books
  • Input text: Topics are also used to construct user
    profiles via explicit specification of interests or
    automatic analysis of Web pages visited
  • After cleaning: topics construct user profiles explicit
    specification interests automatic analysis web pages
    visited
  • Candidate keyphrases (trigrams): topics construct user,
    construct user profiles, user profiles explicit,
    profiles explicit specification, explicit specification
    interests, specification interests automatic, interests
    automatic analysis, automatic analysis web, analysis web
    pages, web pages visited

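
Candidate extraction on this slide amounts to sliding a fixed-size window over the cleaned tokens. A minimal sketch, assuming trigrams as in the example; the function name `ngrams` is illustrative and not taken from the presentation.

```python
def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ("topics construct user profiles explicit specification interests "
          "automatic analysis web pages visited").split()
candidates = ngrams(tokens, n=3)
```

With twelve cleaned tokens the window yields ten trigram candidates, from `topics construct user` to `web pages visited`.
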
12
Extracting Keyphrases from Books
  • Input text: Topics are also used to construct user
    profiles via explicit specification of interests or
    automatic analysis of Web pages visited
  • After cleaning: topics construct user profiles explicit
    specification interests automatic analysis web pages
    visited
  • Scored candidates:
      profiles explicit specification      0.0281
      explicit specification interests     0.0281
      specification interests automatic    0.0272
      user profiles explicit               0.0260
      construct user profiles              0.0260
      interests automatic analysis         0.0255
      topics construct user                0.0243
      automatic analysis web               0.0227
      web pages visited                    0.0226
      analysis web pages                   0.0217

13
Scoring
  • Phraseness
  • Measures the degree to which a given n-gram can be
    considered a phrase
  • Based on co-occurrence of words
  • Example..
  • Informativeness
  • Measures how informative a given n-gram is
  • Uninformative examples: "there is a", "a lot of", etc.
  • Compares co-occurrence in a general corpus vs. the
    given text (book)
  • Total Score
  • Phraseness-Score + Informativeness-Score
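
The final ranking step can be sketched as below. The operator between the two scores is missing from the transcript; this sketch assumes they are added, which is how Tomokiyo and Hurst combine them. The function name and the toy score values are hypothetical and serve only as illustration.

```python
def top_k_keyphrases(phraseness, informativeness, k=10):
    """Sum the two scores per candidate and return the k best candidates."""
    total = {c: phraseness[c] + informativeness[c] for c in phraseness}
    ranked = sorted(total.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Toy scores (made up): "there is" is phrase-like but not informative.
phr = {"user profiles": 0.5, "there is": 0.4, "web pages": 0.2}
inf = {"user profiles": 0.3, "there is": -0.3, "web pages": 0.25}
best = [phrase for phrase, _ in top_k_keyphrases(phr, inf, k=2)]
```

In this toy example the low informativeness of `there is` pushes it below the other two candidates.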

14
Scoring - Phraseness
  • Computed by measuring the distance between the unigram
    model and the N-gram model
  • Pointwise KL-divergence (Tomokiyo and Hurst, 2003)
  • $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$
  • Phraseness measure
  • $\delta_w(LM_{fg}^{N} \,\|\, LM_{fg}^{1})$
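
A minimal sketch of the phraseness measure, under stated assumptions: both models are plain maximum-likelihood estimates from the same text with no smoothing, and the unigram model scores an n-gram as the product of its word probabilities. The helper names are illustrative and not taken from the presentation.

```python
import math
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens, stored as tuples."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def pointwise_kl(p, q):
    """Pointwise KL-divergence contribution: p * log(p / q)."""
    return p * math.log(p / q)

def phraseness(ngram, ngram_counts, unigram_counts):
    """Compare the n-gram model's probability of `ngram` with the
    probability the unigram model gives it (product of word probabilities).
    Assumes `ngram` occurs in the text, so both probabilities are nonzero."""
    p = ngram_counts[ngram] / sum(ngram_counts.values())
    total_unigrams = sum(unigram_counts.values())
    q = 1.0
    for word in ngram:
        q *= unigram_counts[word] / total_unigrams
    return pointwise_kl(p, q)

tokens = "a b a b c".split()
score = phraseness(("a", "b"), count_ngrams(tokens, 2), Counter(tokens))
```

A positive score means the words occur together more often than their individual frequencies would predict.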

15
Scoring - Informativeness
  • Computed by measuring the distance between the n-gram
    model from the given data and the n-gram model from
    general data
  • Pointwise KL-divergence (Tomokiyo and Hurst, 2003)
  • $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$
  • Informativeness measure
  • $\delta_w(LM_{fg}^{1} \,\|\, LM_{bg}^{1})$
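
A minimal sketch of the informativeness measure as written on the slide, comparing a foreground (book) unigram model with a background (general corpus) unigram model. The add-one smoothing is an assumption introduced here so that words unseen in one corpus do not produce a zero probability; the slides do not say how this case is handled. The function names are illustrative.

```python
import math
from collections import Counter

def unigram_prob(ngram, counts, vocab_size):
    """Probability of an n-gram under an add-one-smoothed unigram model."""
    total = sum(counts.values())
    prob = 1.0
    for word in ngram:
        prob *= (counts.get(word, 0) + 1) / (total + vocab_size)
    return prob

def informativeness(ngram, fg_counts, bg_counts):
    """Pointwise KL-divergence between foreground and background models."""
    vocab_size = len(set(fg_counts) | set(bg_counts))
    p = unigram_prob(ngram, fg_counts, vocab_size)
    q = unigram_prob(ngram, bg_counts, vocab_size)
    return p * math.log(p / q)

# Toy corpora (made up): the book talks about keyphrases, the background
# corpus is dominated by function words.
fg = Counter("keyphrase extraction keyphrase extraction the".split())
bg = Counter("the the the the of of keyphrase".split())
topical = informativeness(("keyphrase", "extraction"), fg, bg)
generic = informativeness(("the",), fg, bg)
```

In this toy setup the topical bigram scores above zero and the function word scores below zero.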

16
Extracting Keyphrases from Books
  • Input text: Topics are also used to construct user
    profiles via explicit specification of interests or
    automatic analysis of Web pages visited
  • After cleaning: topics construct user profiles explicit
    specification interests automatic analysis web pages
    visited
  • Scored candidates:
      profiles explicit specification      0.0281
      explicit specification interests     0.0281
      specification interests automatic    0.0272
      user profiles explicit               0.0260
      construct user profiles              0.0260
      interests automatic analysis         0.0255
      topics construct user                0.0243
      automatic analysis web               0.0227
      web pages visited                    0.0226
      analysis web pages                   0.0217

17
Extracting Keyphrases from Books
  • Input text: Topics are also used to construct user
    profiles via explicit specification of interests or
    automatic analysis of Web pages visited
  • After cleaning: topics construct user profiles explicit
    specification interests automatic analysis web pages
    visited
  • Extracted keyphrases (in rank order): profiles explicit
    specification, explicit specification interests,
    specification interests automatic, user profiles
    explicit, construct user profiles, interests automatic
    analysis, topics construct user, automatic analysis web,
    web pages visited, analysis web pages

18
(No Transcript)
19
Conclusions and Future Work
  • Discussed benefits of Keyphrases in ULIB context
  • Demonstrated the building of a KPE that works for
    books
  • Robust evaluation
  • Building a test set from books in ULIB for
    generic robust evaluation of KPE tools
  • Are chapters really independent in a book?
  • Revisit the assumption

20
  • Thank you

21
References
  • [1] Fred J. Damerau. Generating and evaluating
    domain-oriented multi-word terms from texts.
    Information Processing and Management,
    29(4):433-447, 1993.
  • [2] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami.
    Inductive learning algorithms and representations
    for text categorization. In Proceedings of the
    7th International Conference on Information and
    Knowledge Management, pages 148-155. ACM Press,
    1998.
  • [3] Min Song, Il-Yeol Song, and Xiaohua Hu.
    KPSpotter: a flexible information gain-based
    keyphrase extraction system. In WIDM '03:
    Proceedings of the 5th ACM International Workshop
    on Web Information and Data Management, pages
    50-53, New York, NY, USA, 2003. ACM Press.
  • [4] Takashi Tomokiyo and Matthew Hurst. A language
    modeling approach to keyphrase extraction. In
    Proceedings of the ACL 2003 Workshop on Multiword
    Expressions, pages 33-40, Morristown, NJ, USA,
    2003. Association for Computational Linguistics.
  • [5] P. D. Turney. Learning algorithms for keyphrase
    extraction. Information Retrieval, 2(4):303-336,
    2000.
  • [6] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin,
    and C. G. Nevill-Manning. KEA: Practical automatic
    keyphrase extraction. In E. A. Fox and N. Rowe,
    editors, Proceedings of Digital Libraries '99: The
    Fourth ACM Conference on Digital Libraries, pages
    254-255. ACM Press, 1999.
  • [7] Mikio Yamamoto and Kenneth W. Church. Using
    suffix arrays to compute term frequency and
    document frequency for all substrings in a
    corpus. Computational Linguistics, 27(1):1-30,
    2001.