Naive Algorithms for Key-phrase Extraction and Text Summarization from a Single Document inspired by the Protein Biosynthesis Process - PowerPoint PPT Presentation

1
Naive Algorithms for Key-phrase Extraction and
Text Summarization from a Single Document
inspired by the Protein Biosynthesis Process
  • Daniel Gayo Avello
  • (University of Oviedo)

2
What's the problem?
  • Reading documents is a time-consuming task.
  • Many common documents (e.g., e-mail, newsgroup
    posts, web pages) lack an abstract or keywords.
  • But they are electronic, so we can work on
    them in some way.

3
What's the problem? (cont.)
  • Many techniques exist to perform useful Natural
    Language Processing (NLP) tasks:
      • Language identification.
      • Document categorization and clustering.
      • Keyword extraction.
      • Text summarization.
  • They are quite different:
      • With/without human supervision.
      • With/without training.
      • With/without complex linguistic data.
      • With/without document corpora.

4
Any suggestions?
  • It would be great to use only one technique to
    carry out several of those tasks.
  • Desirable goals:
      • Simple (only free text, no linguistic data).
      • Fully automatic (neither supervision nor ad hoc
        heuristics).
      • Scalable (from one web page to several web sites).
  • Could it be a bio-inspired solution?

5
Our (bio-inspired) hypothesis
  • Living beings are defined by their genome.
  • A document in a corpus is like an individual in a
    population.
  • So?
  • Let's imagine a document genome:
      • Similar documents (similar language/topic) have
        similar genomes.
      • More interesting: translation from the document
        genome to significance proteins (i.e.,
        keyphrases and summaries).

6
Our biological inspiration
  • The protein biosynthesis process
    [Figure labels: transcription; initiation, elongation,
    termination; polypeptide chain]
  • Could we mimic this to distill keyphrases and summaries
    from a single document?
7
The ingredients
  Biological element    Computational counterpart
  tRNA                  Spliced document genome
  mRNA                  Document's plain text
  Ribosome              Algorithm
  Polypeptide chain     Document chunks with significance weights
  Protein               Keyphrases
8
A DNA for Natural Language?
  • n-grams (slices of n adjoining characters).
  • Frequency is not the most relevant weight for each
    n-gram.
  • Different measures exist to quantify the association
    between the two elements of a bigram:
      • Mutual information.
      • Dice coefficient.
      • Log-likelihood.
  • They cannot be applied directly to longer n-grams.
  • But they can be generalized (Ferreira and
    Pereira, 1999).
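
As an illustration of the bigram measures named on this slide, here is a minimal Python sketch that slices a text into character n-grams and scores each character bigram with mutual information and the Dice coefficient. The function names (`char_ngrams`, `bigram_association`) and the probability estimates (relative frequencies within the single document) are assumptions made for this example, not taken from the presentation, and the generalization to longer n-grams cited on the slide is not implemented here.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return every slice of n adjoining characters, in document order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bigram_association(text):
    """Map each character bigram 'xy' to (mutual information, Dice).

    Probabilities are plain relative frequencies inside this one text
    (an assumption of this sketch, not the presentation's definition).
    """
    unigrams = Counter(text)
    bigrams = Counter(char_ngrams(text, 2))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for bigram, count in bigrams.items():
        x, y = bigram[0], bigram[1]
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        # Pointwise mutual information: log of observed vs. expected co-occurrence.
        mi = math.log2(p_xy / (p_x * p_y))
        # Dice coefficient: twice the joint count over the sum of the single counts.
        dice = 2 * count / (unigrams[x] + unigrams[y])
        scores[bigram] = (mi, dice)
    return scores
```

Calling `bigram_association("The rain in Spain stays mainly in the plain.")` returns one `(mi, dice)` pair per distinct character bigram; either value could serve as the weight of that bigram instead of its raw frequency.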

9
A DNA for Natural Language? (cont.)
10
Document genome translation
  • So:
      • Document genome spliced into pseudo-tRNA.
      • Document used as pseudo-mRNA.
      • We attach pseudo-tRNA molecules (those with maximum
        weight) to the document while the average
        significance per character keeps growing.
  • Result: the document is split into chunks with
    maximum average significance.
  • Example pseudo-mRNA: "The rain in Spain stays mainly in the plain."
      • Values shown as characters are attached: 20 "The",
        49 "The r", 73 "The ra", etc.
      • Resulting chunks: "The" / "rain" / "in" / "Spain" /
        "stays mainly in" / "the plain"
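
The splitting step described on this slide can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not the author's algorithm: `weights` is a hypothetical dictionary mapping n-grams to significance values, a chunk's significance is taken to be the sum of the weights of the n-grams it contains, and chunks grow one character at a time rather than by attaching whole maximum-weight pseudo-tRNA molecules.

```python
def chunk_significance(chunk, weights, n):
    """Sum the weights of every n-gram inside chunk (unknown n-grams count as 0)."""
    return sum(weights.get(chunk[i:i + n], 0.0)
               for i in range(len(chunk) - n + 1))


def split_into_chunks(text, weights, n=3):
    """Greedily grow each chunk while its average significance per character rises.

    `weights` and the scoring rule are assumptions of this sketch.
    """
    chunks = []
    start = 0
    while start < len(text):
        # Every chunk starts with (up to) one full n-gram.
        end = min(start + n, len(text))
        best_avg = chunk_significance(text[start:end], weights, n) / (end - start)
        while end < len(text):
            candidate = text[start:end + 1]
            avg = chunk_significance(candidate, weights, n) / len(candidate)
            if avg <= best_avg:
                break  # the average stopped growing: close this chunk
            best_avg = avg
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks
```

The chunks always concatenate back to the original text, so the function only decides where the cut points fall; with real weights derived from the document genome, the cut points would be expected to land near word or phrase boundaries, as in the slide's example.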
11
Folding the protein / summarization
  • To obtain keyphrases, the protein (text chunks)
    must be folded.
  • At the moment we are studying different
    alternatives:
      • Mutual reinforcement?
      • Treat chunks as documents and apply classical IR
        techniques?
      • Others?
  • Automatic text summarization:
      • Simple but useful approach.
      • Use the shortest paragraphs with the most
        significant keyphrases.
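
The summarization heuristic above (short paragraphs carrying the most significant keyphrases) could be approximated as follows. This is only a sketch: the scoring formula (total weight of the keyphrases found in a paragraph, divided by the paragraph's length in characters), the `keyphrases` dictionary of phrase weights, and the parameter `k` are assumptions introduced for the example, since the slide does not specify how the two criteria are combined.

```python
def summarize(paragraphs, keyphrases, k=2):
    """Return the k paragraphs with the highest keyphrase weight per character.

    `keyphrases` maps each phrase to a significance weight (hypothetical input);
    the selected paragraphs are returned in their original order.
    """
    def density(paragraph):
        lowered = paragraph.lower()
        total = sum(weight for phrase, weight in keyphrases.items()
                    if phrase.lower() in lowered)
        # Dividing by length favours short paragraphs, as the slide suggests.
        return total / max(len(paragraph), 1)

    ranked = sorted(range(len(paragraphs)),
                    key=lambda i: density(paragraphs[i]),
                    reverse=True)
    chosen = sorted(ranked[:k])
    return [paragraphs[i] for i in chosen]
```

Keeping the chosen paragraphs in document order is a design choice of this sketch, so the summary reads in the same sequence as the source.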

12
  • To test the feasibility of these ideas, a prototype
    was developed.
  • blindLight: http://www.purl.org/NET/blindLight
  • It receives a user-provided URL and produces:
      • A blindlighted version of the page at that URL.
      • A list of keyphrases.
      • An automatic summary.

13
Conclusions
  • Proof-of-concept tests have been performed:
      • Details in the paper.
      • Results can be improved.
      • Thorough study and analysis are needed.
      • Really promising!
  • Summary of the proposal:
      • Free text from just one document.
      • Language independent (currently only Western
        languages).
      • Bio-inspired.
      • Extremely simple to implement.

14
Merci beaucoup!
Muchas gracias!
  • Thank you!