Title: One Size Fits All? A Simple Technique to Perform Several NLP Tasks
1. One Size Fits All? A Simple Technique to Perform Several NLP Tasks
- Daniel Gayo Avello
- (University of Oviedo)
2. Introduction
- blindLight is a modified vector model with applications to several NLP tasks:
  - automatic summarization,
  - categorization,
  - clustering, and
  - information retrieval.
3. Vector Model vs. blindLight Model
- blindLight Model
  - Different documents → different-length vectors.
  - No collection → no D → no vector space!
  - Terms → just character n-grams.
  - Term weights → in-document n-gram significance (a function of document term frequency alone).
  - Similarity measure → asymmetric (in fact, two association measures); a kind of lightweight pairwise alignment (A vs. B ≠ B vs. A).
  - Advantages
    - Document vectors are unique document representations.
    - Suitable for ever-growing document sets.
    - Bilingual IR is trivial.
    - Highly tunable by linearly combining the two association measures.
  - Issues
    - Not tuned yet, so
    - poor performance with broad topics.
- Vector Model
  - Document → D-dimensional vector of terms.
  - D → number of distinct terms within the whole collection of documents.
  - Terms → words / stems / character n-grams.
  - Term weights → function of in-document term frequency, in-collection term frequency, and document length.
  - Association measures → symmetric: Dice, Jaccard, Cosine, ...
  - Issues
    - Document vectors are document representations,
    - but representations with regard to the whole collection.
    - Curse of dimensionality → feature reduction.
    - Feature reduction when using n-grams as terms → ad hoc thresholds.
4. What's n-gram significance?
- Can we know how important an n-gram is within just one document, without regard to any external collection?
- A similar problem: extracting multiword items from text (e.g. "European Union", "Mickey Mouse", "Cross Language Evaluation Forum").
- Solution by Ferreira da Silva and Pereira Lopes:
  - Several statistical measures generalized to be applied to arbitrary-length word n-grams.
  - A new measure, Symmetrical Conditional Probability (SCP), which outperforms the others.
- So, our proposal to answer the first question:
  - If SCP finds the most significant multiword items within just one document, it can be applied to rank the character n-grams of a document according to their significance.
5. What's n-gram significance? (cont.)
- Equations for SCP, for an n-gram (w1...wn):

  SCP((w1...wn)) = p((w1...wn))^2 / Avp
  Avp = (1 / (n-1)) * Σ_{i=1..n-1} p((w1...wi)) * p((w_{i+1}...wn))

- Let's suppose we use quad-grams and let's take (igni) from the text "What's n-gram significance". The three prefix/suffix splits are:
  - (w1...w1) / (w2...w4): (i) / (gni)
  - (w1...w2) / (w3...w4): (ig) / (ni)
  - (w1...w3) / (w4...w4): (ign) / (i)
- For instance, in the first split, p((w1...w1)) = p((i)) would be computed from the relative frequency of appearance within the document of quad-grams starting with i (e.g. (igni), (ific), or (ican)).
- In the last split, p((w4...w4)) = p((i)) would be computed from the relative frequency of appearance within the document of quad-grams ending with i (e.g. (m_si), (igni), or (nifi)).
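As a concrete illustration, the per-document SCP described above can be sketched in a few lines of Python. This is a minimal sketch: the function names are mine, and probabilities are plain relative frequencies over the document's quad-grams.

```python
from collections import Counter

def quadgrams(text):
    # Slide a 4-character window; spaces become '_' as in the
    # slide's examples (e.g. (m_si), (_all)).
    t = text.replace(" ", "_")
    return [t[i:i + 4] for i in range(len(t) - 3)]

def scp(gram, grams):
    # Symmetrical Conditional Probability of one quad-gram, estimated
    # within a single document: SCP = p(gram)^2 / Avp, where Avp averages
    # p(prefix) * p(suffix) over the three possible splits. As described
    # above, p(prefix) comes from quad-grams *starting* with the prefix
    # and p(suffix) from quad-grams *ending* with the suffix.
    n = len(grams)
    counts = Counter(grams)
    p_gram = counts[gram] / n
    avp = 0.0
    for i in (1, 2, 3):  # the splits (w1..wi) / (w(i+1)..w4)
        prefix, suffix = gram[:i], gram[i:]
        p_pre = sum(c for g, c in counts.items() if g.startswith(prefix)) / n
        p_suf = sum(c for g, c in counts.items() if g.endswith(suffix)) / n
        avp += p_pre * p_suf
    avp /= 3
    return p_gram ** 2 / avp

grams = quadgrams("What's n-gram significance")
igni_scp = scp("igni", grams)
```

Since every quad-gram is one of the grams starting with its own prefix (and ending with its own suffix), Avp ≥ p(gram)^2 and the score always falls in (0, 1].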
6. What's n-gram significance? (cont.)
- The current implementation of blindLight uses quad-grams because:
  - They provide better results than tri-grams.
  - Their significances are computed faster than those of n≥5 n-grams.
- How would it work mixing different-length n-grams within the same document vector? An interesting question to solve in the future.
- Two example blindLight document vectors:
  - Q document (Spanish): "Cuando despertó, el dinosaurio todavía estaba allí." ("When he awoke, the dinosaur was still there.")
  - T document (Portuguese): "Quando acordou, o dinossauro ainda estava lá."
  - Q vector (45 elements): (Cuan, 2.49), (l_di, 2.39), (stab, 2.39), ..., (saur, 2.31), (desp, 2.31), ..., (ando, 2.01), (avía, 1.95), (_all, 1.92)
  - T vector (39 elements): (va_l, 2.55), (rdou, 2.32), (stav, 2.32), ..., (saur, 2.24), (noss, 2.18), ..., (auro, 1.91), (ando, 1.88), (do_a, 1.77)
- How can such vectors be numerically compared?
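A minimal sketch of building such a document vector, using plain SCP as the significance. The slide's values (2.49, 2.39, ...) are on a different, apparently log-like scale, and its exact tokenization is not shown, so sizes and scores will not match the 45- and 39-element vectors above exactly.

```python
from collections import Counter

def quadgrams(text):
    # Spaces become '_' as in the slide's n-gram examples.
    t = text.replace(" ", "_")
    return [t[i:i + 4] for i in range(len(t) - 3)]

def blindlight_vector(text):
    # One vector per document, no collection involved: every distinct
    # quad-gram paired with its in-document significance (plain SCP here).
    grams = quadgrams(text)
    n = len(grams)
    counts = Counter(grams)

    def scp(gram):
        # Average p(prefix) * p(suffix) over the three splits; prefixes
        # match quad-grams starting with them, suffixes those ending with them.
        avp = sum(
            (sum(c for g, c in counts.items() if g.startswith(gram[:i])) / n)
            * (sum(c for g, c in counts.items() if g.endswith(gram[i:])) / n)
            for i in (1, 2, 3)) / 3
        return (counts[gram] / n) ** 2 / avp

    return {g: scp(g) for g in counts}

q_vec = blindlight_vector("Cuando despertó, el dinosaurio todavía estaba allí.")
t_vec = blindlight_vector("Quando acordou, o dinossauro ainda estava lá.")
```

Note that (saur, ...) appears in both vectors, just as on the slide: the two languages share it.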
7. Comparing blindLight doc vectors
[Diagram: the two document vectors are O-intersected; Pi and Rho, the asymmetric similarity measures, relate the total significance of the intersected document vector to the total significance of each original document vector.]
8. Comparing blindLight doc vectors (cont.)
- Worked example (gloss: "The dinosaur is still here"):
  - Q doc vector: S_Q = 97.52
  - T doc vector: S_T = 81.92
  - Q∩T (O-intersected) vector: S_Q∩T = 20.48
  - Pi = S_Q∩T / S_Q = 20.48 / 97.52 = 0.21
  - Rho = S_Q∩T / S_T = 20.48 / 81.92 = 0.25
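The Pi and Rho computation above can be sketched as follows. One assumption: since the slide only shows the totals, each shared n-gram contributes the smaller of its two significances to the intersected vector.

```python
def pi_rho(q_vec, t_vec):
    # q_vec, t_vec: dicts mapping quad-grams to in-document significances.
    # Pi = S_Q∩T / S_Q and Rho = S_Q∩T / S_T, where S_* is a vector's total
    # significance and S_Q∩T that of the O-intersected vector. Taking the
    # smaller of the two significances per shared n-gram is an assumption.
    s_q = sum(q_vec.values())
    s_t = sum(t_vec.values())
    s_qt = sum(min(q_vec[g], t_vec[g]) for g in q_vec.keys() & t_vec.keys())
    return s_qt / s_q, s_qt / s_t

# Tiny excerpts of the Q and T vectors from the earlier slide:
q = {"Cuan": 2.49, "l_di": 2.39, "saur": 2.31, "ando": 2.01}
t = {"va_l": 2.55, "rdou": 2.32, "saur": 2.24, "ando": 1.88}
pi, rho = pi_rho(q, t)
```

Because the two totals differ, Pi and Rho differ too: the measure is asymmetric by construction.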
9. Clustering case study: Genetic classification of languages
- The relation between ancestor and descendant languages is usually called a genetic relationship.
- Such relationships are displayed as a tree of families of languages.
- The comparative method looks for regular (i.e. systematic) correspondences in the lexicon and thus allows linguists to propose hypotheses about genetic relationships.
- Languages are subject not only to systematic changes but also to random ones, so the comparative method is sensitive to noise, especially when studying languages that diverged more than 10,000 years ago.
- Joseph H. Greenberg developed the so-called mass lexical comparison method, which compares large samples of equivalent words for two languages.
- Our experiment is quite similar to this mass comparison method and to the work done by Stephen Huffman using the Acquaintance technique.
10. Clustering case study: Genetic classification of languages (cont.)
- Two different kinds of linguistic data:
  - Orthographic version of the first three chapters of the Book of Genesis.
  - Phonetic transcriptions of "The North Wind and the Sun".
- The similarity measure used to compare document vectors was 0.5·Π + 0.5·Ρ (an equally weighted combination of the Pi and Rho measures).
- The clustering algorithm is similar to Jarvis-Patrick.
- Both resultant trees are coherent with each other and consistent with linguistic theories.
11. Categorization case study: Language identification
- Categorization using blindLight is straightforward:
  - Each category vector is compared with the document;
  - the greater the similarity, the more likely the membership.
- Using the previous experiment's results, the category vectors on the right were built to develop a language identifier. Many of them are "artificial", obtained by intersecting several language vectors.
- The language identifier's operation is simple. Let's suppose an English sample of text:
  - It is compared against Basque, Finnish, Italic, northGermanic, and westGermanic.
  - The most likely category is westGermanic, so
  - it is compared against Dutch-German and English.
  - The most likely is English, which is a final category.
- Category hierarchy (leaves are final categories): Basque; Finnish; Italic (Catalan-French (Catalan, French), Italian, Portuguese-Spanish (Portuguese, Spanish)); northGermanic (Danish-Swedish (Danish, Swedish), Faroese, Norwegian); westGermanic (Dutch-German (Dutch, German), English).
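The hierarchical walk can be sketched like this. The `similarity` argument stands in for the real comparison against blindLight category vectors; the scores below are made up to reproduce the English walk-through.

```python
# Category hierarchy from the slide: internal nodes (some of them
# "artificial" intersections of language vectors) lead to final
# language categories at the leaves.
TREE = {
    "ROOT": ["Basque", "Finnish", "Italic", "northGermanic", "westGermanic"],
    "Italic": ["Catalan-French", "Italian", "Portuguese-Spanish"],
    "Catalan-French": ["Catalan", "French"],
    "Portuguese-Spanish": ["Portuguese", "Spanish"],
    "northGermanic": ["Danish-Swedish", "Faroese", "Norwegian"],
    "Danish-Swedish": ["Danish", "Swedish"],
    "westGermanic": ["Dutch-German", "English"],
    "Dutch-German": ["Dutch", "German"],
}

def identify(sample, similarity):
    # Walk down the hierarchy, at each level keeping the most similar
    # category, until a final (leaf) category is reached.
    node = "ROOT"
    while node in TREE:
        node = max(TREE[node], key=lambda cat: similarity(sample, cat))
    return node

# Stand-in similarity scores mimicking the slide's example:
scores = {"westGermanic": 0.9, "Dutch-German": 0.3, "English": 0.8}
result = identify("It is an English sample", lambda s, c: scores.get(c, 0.1))
```

With these scores the walk visits westGermanic and then stops at English, exactly as in the slide's example.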
12. Categorization case study: Language identification (cont.)
- Preliminary results using 1,500 posts from:
  - soc.culture.basque
  - soc.culture.catalan
  - soc.culture.french
  - soc.culture.galiza (Galician is not known by the identifier).
  - soc.culture.german
- Posts were submitted in raw form, including the whole header, to check noise tolerance.
- It was found that actual samples of around 200 characters can be identified in spite of lengthy headers (500 to 900 characters).
- Results for Galician:
  - As with the rest of the groups, plenty of spam (i.e. English posts).
  - Most of the posts were written in Spanish.
  - Posts actually written in Galician: 63% identified as Portuguese, 37% as Spanish. Graceful degradation?
- Results for other languages: [table not reproduced]
13. Information Retrieval using blindLight
- Π (Pi) and Ρ (Rho) can be linearly combined into different association measures to perform IR.
- Just two tested up to now: Pi and a combined "PiRho" measure (which performs slightly better).
- IR with blindLight is pretty easy:
  - For each document within the dataset, a 4-gram vector is computed and stored.
  - When a query is submitted to the system:
    - A 4-gram vector (Q) is computed for the query text.
    - For each doc vector (T):
      - Q and T are O-intersected, obtaining Pi and Rho values.
      - Pi and Rho are combined into a unique association measure (e.g. PiRho).
    - A reverse-ordered list of documents is built and returned to answer the query.
- Features and issues:
  - No indexing phase: documents can be added at any moment. (pro)
  - Comparing the query with every document is not really feasible with big data sets. (con)
- Rho values, and thus PiRho values, are negligible when compared to Pi; a norm function scales PiRho values into the range of Pi values.
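A minimal sketch of this retrieval loop. The `alpha`-weighted sum is just one possible Pi-Rho combination, the `norm` rescaling mentioned above is omitted, and the min-based intersection is an assumption of the sketch.

```python
def pi_rho(q_vec, t_vec):
    # Pi = S_Q∩T / S_Q, Rho = S_Q∩T / S_T; the intersected significance is
    # taken as the smaller of the two (an assumption, not spelled out here).
    s_qt = sum(min(q_vec[g], t_vec[g]) for g in q_vec.keys() & t_vec.keys())
    return s_qt / sum(q_vec.values()), s_qt / sum(t_vec.values())

def search(query_vec, doc_vecs, alpha=0.5):
    # No indexing phase: the stored 4-gram vector of every document is
    # compared against the query and the list is returned best-first.
    scored = []
    for doc_id, t_vec in doc_vecs.items():
        pi, rho = pi_rho(query_vec, t_vec)
        scored.append((alpha * pi + (1 - alpha) * rho, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored]

# Toy query and document vectors (made-up significances):
query = {"fina": 1.0, "nanc": 1.0}
docs = {
    "d1": {"fina": 1.0, "nanc": 1.0, "ials": 1.0},  # shares both query n-grams
    "d2": {"fina": 1.0, "xxxx": 1.0},               # shares one
    "d3": {"zzzz": 1.0},                            # shares none
}
ranking = search(query, docs)
```

The document sharing the most query n-grams comes back first, which also illustrates the scalability issue noted above: every query touches every stored vector.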
14. Bilingual IR with blindLight
- We have compared n-gram vectors for pseudo-translations with vectors for actual translations (Source: Spanish, Target: English):
  - 38.59% of the n-grams within pseudo-translated vectors are also within actual-translation vectors.
  - 28.31% of the n-grams within actual-translation vectors are present in pseudo-translated ones.
- A promising technique, but thorough work is required.
- INGREDIENTS: Two aligned parallel corpora (for instance, EuroParl). Languages S(ource) and T(arget).
- METHOD:
  - Take the original query written in natural language S (queryS).
  - Chop the original query into chunks of 1, 2, ..., L words.
  - Find in the S corpus sentences containing each of these chunks. Start with the longest ones and, once you've found sentences for one chunk, delete its subchunks.
  - Replace each of these S sentences by its T sentence equivalent.
  - Compute an n-gram vector for every T sentence; O-intersect all the vectors for each chunk.
  - Mix all the O-intersected n-gram vectors into a unique query vector (queryT).
  - Voilà! You have obtained a vector for a hypothetical queryT without having translated queryS.
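The METHOD above can be sketched as follows, with significances dropped (plain quad-gram sets) for brevity, on a toy two-sentence "corpus":

```python
def quadgrams(text):
    # Spaces become '_' as in the slides' n-gram examples.
    t = text.replace(" ", "_")
    return {t[i:i + 4] for i in range(len(t) - 3)}

def pseudo_translate(query_s, corpus):
    # corpus: list of aligned (S sentence, T sentence) pairs.
    words = query_s.split()
    covered = set()      # word positions already matched by a longer chunk
    query_t = set()
    for size in range(len(words), 0, -1):            # longest chunks first
        for start in range(len(words) - size + 1):
            span = set(range(start, start + size))
            if span <= covered:
                continue     # subchunk of an already matched chunk: deleted
            chunk = " ".join(words[start:start + size])
            hits = [t for s, t in corpus if chunk in s]
            if not hits:
                continue
            covered |= span
            # O-intersect the quad-gram vectors of every matched T sentence...
            common = set.intersection(*(quadgrams(t) for t in hits))
            # ...and mix the result into the unique queryT vector.
            query_t |= common
    return query_t

corpus = [
    ("mantiene excelentes relaciones con las instituciones financieras internacionales",
     "has excellent relationships with the international financial institutions"),
    ("el fortalecimiento de las instituciones financieras internacionales",
     "strengthening international financial institutions"),
]
query_t = pseudo_translate("instituciones financieras", corpus)
```

The queryT vector ends up containing English n-grams such as (inst), (fina), and (tion), without any actual translation step.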
- Worked example. Source query (Spanish): "Encontrar documentos en los que se habla de las discusiones sobre la reforma de instituciones financieras y, en particular, del Banco Mundial y del FMI durante la cumbre de los G7 que se celebró en Halifax en 1995." ("Find documents discussing the debates on the reform of financial institutions and, in particular, of the World Bank and the IMF during the G7 summit held in Halifax in 1995.")
  - Chunks: "Encontrar", "Encontrar documentos", "Encontrar documentos en", ..., "instituciones", "instituciones financieras", "instituciones financieras y", ...
  - S sentences found in the (EuroParl) corpus for "instituciones financieras": (1315) "mantiene excelentes relaciones con las instituciones financieras internacionales", (5865) "el fortalecimiento de las instituciones financieras internacionales", (6145) "La Comisión deberá estudiar un mecanismo transparente para que las instituciones financieras europeas..."
  - Their aligned T sentences: (1315) "has excellent relationships with the international financial institutions", (5865) "strengthening international financial institutions", (6145) "The Commission will have to look at a transparent mechanism so that the European financial institutions..."
  - O-intersected n-grams for "instituciones financieras": al_i, anci, atio, cial, _fin, fina, ial_, inan, _ins, inst, ions, itut, l_in, nanc, ncia, nsti, stit, tion, titu, tuti, utio.
  - The resulting queryT vector thus contains nicely translated n-grams, some harmlessly untranslated ones, some not-really-nice untranslated n-grams, and some definitely-not-nice noise n-grams.
15. Information Retrieval Results
- Experiments with small collections:
  - CACM (3,204 docs and 64 queries).
  - CISI (1,460 docs and 112 queries).
  - Results similar to those achieved by several systems, but not as good as those reached by SMART, for instance.
- CLEF 2004 results:
  - Monolingual IR within Russian documents: 72 documents found out of 123 relevant ones, average precision 0.14.
  - Bilingual IR using Spanish to query English docs: 145 documents found out of 375 relevant ones, average precision 0.06.
- However, blindLight does not apply:
  - Stop-word removal.
  - Stemming.
  - Query term weighting.
- Problems especially with broad topics.
16. Conclusions
- Genetic classification of languages (clustering) using blindLight:
  - Coherent results for both orthographic and phonetic input.
  - Results are also consistent with linguistic theories.
  - Results useful to develop language identifiers.
- Language identification (categorization) using blindLight:
  - Accuracy higher than 97%.
  - Information-to-noise ratio around 2/7.
- Information retrieval performance must be improved; however, it is:
  - Language independent.
  - Straightforward to extend to bilingual IR.
- To sum up, blindLight is an extremely simple technique which appears to be flexible enough to be applied to a wide range of NLP tasks, showing adequate performance in all of them.
17. One Size Fits All? A Simple Technique to Perform Several NLP Tasks
- Daniel Gayo Avello (University of Oviedo)

Thank you!