Naive Algorithms for Key-phrase Extraction and Text Summarization from a Single Document inspired by the Protein Biosynthesis Process

About This Presentation


Slides: 15

Provided by: Dan5233

Transcript and Presenter's Notes

1
Naive Algorithms for Key-phrase Extraction and
Text Summarization from a Single Document
inspired by the Protein Biosynthesis Process

Daniel Gayo Avello
(University of Oviedo)

2
What's the problem?

Document reading is a time-consuming task.
Many common documents (e.g., e-mail, newsgroup
posts, web pages) lack an abstract or keywords.
But they are electronic, so we can work on
them in some way.

3
What's the problem? (cont.)

Many techniques exist to perform useful Natural
Language Processing (NLP) tasks:
Language identification.
Document categorization and clustering.
Keyword extraction.
Text summarization.
They are quite different:
With/without human supervision.
With/without training.
With/without complex linguistic data.
With/without document corpora.

4
Any suggestion?

It would be great to use only one technique to
carry out several of those tasks.
Desirable goals:
Simple (only free text, not linguistic data)
Fully automatic (neither supervision nor ad hoc
heuristics)
Scalable (from one web page to several web sites)
Could it be a bio-inspired solution?

5
Our (bio-inspired) hypothesis

Living beings are defined by their genome.
A document from a corpus is like an individual
from a population.
So?
Let's imagine a document genome:
Similar documents (similar language/topic) would
have similar genomes.
More interesting: translation from the document
genome to "significance proteins" (i.e.,
keyphrases and summaries).

6
Our biological inspiration

The protein biosynthesis process
[Figure: protein biosynthesis diagram. Labels: Transcription, Initiation, Elongation, Termination, Polypeptide chain.]
Could we mimic this to distill keyphrases and
summaries from a single document?
7
The ingredients
Biological element   Computational counterpart
tRNA                 Spliced document genome
mRNA                 Document's plain text
Ribosome             Algorithm
Polypeptide chain    Document chunks with significance weights
Protein              Keyphrases
8
A DNA for Natural Language?

n-grams (slices of n adjoining characters)
Frequency is not the most relevant weight for each
n-gram.
There are several measures of the association
between the two elements of a bigram:
Mutual information.
Dice coefficient.
Loglike (log-likelihood).
They cannot be applied directly to n-grams.
But they can be generalized (Ferreira and
Pereira, 1999).
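
A minimal sketch of how a bigram measure might be generalized to longer character n-grams. It assumes the generalization averages over every way of splitting the n-gram into a left part and a right part; the slides do not give the formula, so this averaging rule and all function names are illustrative assumptions, not the method of the cited work.

```python
import math
from collections import Counter


def char_ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in the text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def probability(gram, counts, text_len):
    """Relative frequency of gram among all n-grams of the same length."""
    positions = text_len - len(gram) + 1
    return counts[gram] / positions if positions > 0 else 0.0


def generalized_mi(gram, counts, text_len):
    """Mutual information of an n-gram (n >= 2).

    Assumption: the product p(x) * p(y) used for bigrams is replaced by
    the average of p(left) * p(right) over all n-1 split points.
    """
    n = len(gram)
    if n < 2:
        raise ValueError("need at least two characters")
    avg_split = sum(
        probability(gram[:i], counts, text_len)
        * probability(gram[i:], counts, text_len)
        for i in range(1, n)
    ) / (n - 1)
    p = probability(gram, counts, text_len)
    if p == 0.0 or avg_split == 0.0:
        return float("-inf")  # unseen n-gram: no evidence of association
    return math.log2(p / avg_split)
```

For a bigram there is only one split point, so the sketch reduces to ordinary pointwise mutual information; longer n-grams such as "ain" are scored against all of their splits.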

9
A DNA for Natural Language? (cont.)
10
Document genome translation

So:
The document genome is spliced into pseudo-tRNA.
The document is used as pseudo-mRNA.
We attach pseudo-tRNA molecules (those with
maximum weight) to the document while the average
significance per character keeps growing.
Result: the document is spliced into chunks with
maximum average significance.
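[Figure: worked example.]
pseudo-mRNA: The rain in Spain stays mainly in the plain.
pseudo-tRNA candidates with their weights: 20 "The", 49 "The r", 73 "The ra", etc.
Resulting chunks: The / rain / in / Spain / stays mainly in / the plain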
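
One possible reading of the translation step, sketched as a greedy loop. It assumes the genome is a plain dictionary from n-gram to weight, that the heaviest known n-gram is attached at each position, and that a chunk is closed as soon as attaching the next n-gram would stop its average weight per character from growing. These details and the function names are assumptions; the slides only outline the idea.

```python
def heaviest_ngram(text, pos, weights, max_n):
    """Return (ngram, weight) for the heaviest known n-gram starting at pos.

    Falls back to the single character at pos with weight 0.0 when no
    n-gram starting there appears in the weights dictionary.
    """
    best = (text[pos], 0.0)
    for n in range(1, max_n + 1):
        gram = text[pos:pos + n]
        if len(gram) < n:  # ran past the end of the text
            break
        weight = weights.get(gram, 0.0)
        if weight > best[1]:
            best = (gram, weight)
    return best


def split_into_chunks(text, weights, max_n):
    """Split text into (chunk, average weight per character) pairs."""
    chunks = []
    chunk, total = "", 0.0
    pos = 0
    while pos < len(text):
        gram, weight = heaviest_ngram(text, pos, weights, max_n)
        current_avg = total / len(chunk) if chunk else float("-inf")
        new_avg = (total + weight) / (len(chunk) + len(gram))
        if new_avg > current_avg:
            # Average significance per character keeps growing: attach.
            chunk, total = chunk + gram, total + weight
        else:
            # Growth stopped: close the chunk and start a new one.
            chunks.append((chunk, current_avg))
            chunk, total = gram, weight
        pos += len(gram)
    if chunk:
        chunks.append((chunk, total / len(chunk)))
    return chunks
```

With toy weights `{"ab": 4.0, "cd": 10.0, "e": 0.5}` and `max_n=2`, the text `"abcde"` is split into `"abcd"` (average 3.5) and `"e"` (average 0.5), because attaching `"e"` would lower the first chunk's average.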
11
Folding the protein / summarization

To obtain keyphrases, the protein (text chunks)
must be folded.
At the moment we are studying different
alternatives:
Mutual reinforcement?
Treat chunks as documents and apply classical IR
techniques?
Others?
Automatic text summarization
Simple but useful approach.
Use the shortest paragraphs with the most
significant keyphrases.
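
The summarization rule above can be sketched as a paragraph ranking. This sketch assumes keyphrases arrive as a dictionary of phrase to weight and scores each paragraph by the total weight of the keyphrases it contains divided by its length, so that short, keyphrase-dense paragraphs win. The scoring formula and the name `summarize` are assumptions made for illustration.

```python
def summarize(paragraphs, keyphrase_weights, max_paragraphs=2):
    """Pick the paragraphs with the highest keyphrase weight per character.

    The selected paragraphs are returned in their original order.
    """
    def score(paragraph):
        lowered = paragraph.lower()
        total = sum(
            weight
            for phrase, weight in keyphrase_weights.items()
            if phrase.lower() in lowered
        )
        # Dividing by length favours the shortest paragraphs.
        return total / len(paragraph) if paragraph else 0.0

    ranked = sorted(
        range(len(paragraphs)),
        key=lambda i: score(paragraphs[i]),
        reverse=True,
    )
    chosen = sorted(ranked[:max_paragraphs])
    return [paragraphs[i] for i in chosen]
```

A paragraph that mentions a keyphrase once in a long run of text scores lower than a short paragraph built around the same keyphrase.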

12

To test the feasibility of these ideas, a prototype
was developed.
blindLight: http://www.purl.org/NET/blindLight
It receives a user-provided URL and produces:
A "blindlighted" version of the original URL.
A list of keyphrases.
An automatic summary.

13
Conclusions

Proof-of-concept tests have been performed:
Details are in the paper.
Results can be improved.
Thorough study and analysis are needed.
Really promising!
Summary of the proposal:
Free text from just one document.
Language independent (currently only Western
languages).
Bio-inspired.
Extremely simple to implement.
