Title: One Size Fits All? A Simple Technique to Perform Several NLP Tasks
1. One Size Fits All? A Simple Technique to Perform Several NLP Tasks
- Daniel Gayo Avello
- (University of Oviedo)
2. Introduction
- blindLight is a modified vector model with applications to several NLP tasks:
  - automatic summarization,
  - categorization,
  - clustering, and
  - information retrieval.
3. Vector Model vs. blindLight Model
- blindLight Model
  - Different documents → different-length vectors.
  - No collection → no D → no vector space!
  - Terms → just character n-grams.
  - Term weights → in-document n-gram significance (a function of document term frequency alone).
  - Similarity measure → asymmetric (in fact, two association measures); a kind of lightweight pairwise alignment (A vs. B ≠ B vs. A).
  - Advantages
    - Document vectors are unique document representations.
    - Suitable for ever-growing document sets.
    - Bilingual IR is trivial.
    - Highly tunable by linearly combining the two association measures.
  - Issues
    - Not tuned yet, so
    - poor performance with broad topics.
- Vector Model
  - Document → D-dimensional vector of terms.
  - D → number of distinct terms within the whole collection of documents.
  - Terms → words / stems / character n-grams.
  - Term weights → function of in-document term frequency, in-collection term frequency, and document length.
  - Association measures → symmetric: Dice, Jaccard, Cosine, ...
  - Issues
    - Document vectors are document representations,
    - but representations with regard to the whole collection.
    - Curse of dimensionality → feature reduction.
    - Feature reduction when using n-grams as terms → ad hoc thresholds.
4. What's n-gram significance?
- Can we know how important an n-gram is within just one document, without regard to any external collection?
- A similar problem: extracting multiword items from text (e.g. "European Union", "Mickey Mouse", "Cross Language Evaluation Forum").
- Solution by Ferreira da Silva and Pereira Lopes:
  - Several statistical measures generalized to be applied to arbitrary-length word n-grams.
  - A new measure, Symmetrical Conditional Probability (SCP), which outperforms the others.
- So, our proposal to answer the first question:
  - If SCP finds the most significant multiword items within just one document, it can be applied to rank the character n-grams of a document according to their significance.
5. What's n-gram significance? (cont.)
- Equations for SCP, for an n-gram (w1...wn):

  SCP((w1...wn)) = p((w1...wn))^2 / Avp
  Avp = (1 / (n-1)) * Σ_{i=1..n-1} p((w1...wi)) * p((w_{i+1}...wn))

- Let's suppose we use quad-grams and let's take (igni) from the text "What's n-gram significance". The three prefix/suffix splits are:
  - (w1...w1) / (w2...w4): (i) / (gni)
  - (w1...w2) / (w3...w4): (ig) / (ni)
  - (w1...w3) / (w4...w4): (ign) / (i)
- For instance, in the first split, p((w1...w1)) = p((i)) would be computed from the relative frequency of appearance within the document of quad-grams starting with i (e.g. (igni), (ific), or (ican)).
- In the last split, p((w4...w4)) = p((i)) would be computed from the relative frequency of appearance within the document of quad-grams ending with i (e.g. (m_si), (igni), or (nifi)).
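As a concrete illustration, the per-document SCP described above can be sketched in a few lines of Python. This is a minimal sketch: the function names are mine, and probabilities are plain relative frequencies over the document's quad-grams.

```python
from collections import Counter

def quadgrams(text):
    # Slide a 4-character window; spaces become '_' as in the
    # slide's examples (e.g. (m_si), (_all)).
    t = text.replace(" ", "_")
    return [t[i:i + 4] for i in range(len(t) - 3)]

def scp(gram, grams):
    # Symmetrical Conditional Probability of one quad-gram, estimated
    # within a single document: SCP = p(gram)^2 / Avp, where Avp averages
    # p(prefix) * p(suffix) over the three possible splits. As described
    # above, p(prefix) comes from quad-grams *starting* with the prefix
    # and p(suffix) from quad-grams *ending* with the suffix.
    n = len(grams)
    counts = Counter(grams)
    p_gram = counts[gram] / n
    avp = 0.0
    for i in (1, 2, 3):  # the splits (w1..wi) / (w(i+1)..w4)
        prefix, suffix = gram[:i], gram[i:]
        p_pre = sum(c for g, c in counts.items() if g.startswith(prefix)) / n
        p_suf = sum(c for g, c in counts.items() if g.endswith(suffix)) / n
        avp += p_pre * p_suf
    avp /= 3
    return p_gram ** 2 / avp

grams = quadgrams("What's n-gram significance")
igni_scp = scp("igni", grams)
```

Since every quad-gram is one of the grams starting with its own prefix (and ending with its own suffix), Avp ≥ p(gram)^2 and the score always falls in (0, 1].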
6. What's n-gram significance? (cont.)
- The current implementation of blindLight uses quad-grams because:
  - They provide better results than tri-grams.
  - Their significances are computed faster than those of n≥5 n-grams.
- How would it work mixing different-length n-grams within the same document vector? An interesting question to solve in the future.
- Two example blindLight document vectors:
  - Q document (Spanish): "Cuando despertó, el dinosaurio todavía estaba allí." ("When he awoke, the dinosaur was still there.")
  - T document (Portuguese): "Quando acordou, o dinossauro ainda estava lá."
  - Q vector (45 elements): (Cuan, 2.49), (l_di, 2.39), (stab, 2.39), ..., (saur, 2.31), (desp, 2.31), ..., (ando, 2.01), (avía, 1.95), (_all, 1.92)
  - T vector (39 elements): (va_l, 2.55), (rdou, 2.32), (stav, 2.32), ..., (saur, 2.24), (noss, 2.18), ..., (auro, 1.91), (ando, 1.88), (do_a, 1.77)
- How can such vectors be numerically compared?
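A minimal sketch of building such a document vector, using plain SCP as the significance. The slide's values (2.49, 2.39, ...) are on a different, apparently log-like scale, and its exact tokenization is not shown, so sizes and scores will not match the 45- and 39-element vectors above exactly.

```python
from collections import Counter

def quadgrams(text):
    # Spaces become '_' as in the slide's n-gram examples.
    t = text.replace(" ", "_")
    return [t[i:i + 4] for i in range(len(t) - 3)]

def blindlight_vector(text):
    # One vector per document, no collection involved: every distinct
    # quad-gram paired with its in-document significance (plain SCP here).
    grams = quadgrams(text)
    n = len(grams)
    counts = Counter(grams)

    def scp(gram):
        # Average p(prefix) * p(suffix) over the three splits; prefixes
        # match quad-grams starting with them, suffixes those ending with them.
        avp = sum(
            (sum(c for g, c in counts.items() if g.startswith(gram[:i])) / n)
            * (sum(c for g, c in counts.items() if g.endswith(gram[i:])) / n)
            for i in (1, 2, 3)) / 3
        return (counts[gram] / n) ** 2 / avp

    return {g: scp(g) for g in counts}

q_vec = blindlight_vector("Cuando despertó, el dinosaurio todavía estaba allí.")
t_vec = blindlight_vector("Quando acordou, o dinossauro ainda estava lá.")
```

Note that (saur, ...) appears in both vectors, just as on the slide: the two languages share it.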
7. Comparing blindLight doc vectors
[Diagram: the two document vectors are O-intersected; Pi and Rho, the asymmetric similarity measures, relate the total significance of the intersected document vector to the total significance of each original document vector.]
8. Comparing blindLight doc vectors (cont.)
- Worked example (gloss: "The dinosaur is still here"):
  - Q doc vector: S_Q = 97.52
  - T doc vector: S_T = 81.92
  - Q∩T (O-intersected) vector: S_Q∩T = 20.48
  - Pi = S_Q∩T / S_Q = 20.48 / 97.52 = 0.21
  - Rho = S_Q∩T / S_T = 20.48 / 81.92 = 0.25
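The Pi and Rho computation above can be sketched as follows. One assumption: since the slide only shows the totals, each shared n-gram contributes the smaller of its two significances to the intersected vector.

```python
def pi_rho(q_vec, t_vec):
    # q_vec, t_vec: dicts mapping quad-grams to in-document significances.
    # Pi = S_Q∩T / S_Q and Rho = S_Q∩T / S_T, where S_* is a vector's total
    # significance and S_Q∩T that of the O-intersected vector. Taking the
    # smaller of the two significances per shared n-gram is an assumption.
    s_q = sum(q_vec.values())
    s_t = sum(t_vec.values())
    s_qt = sum(min(q_vec[g], t_vec[g]) for g in q_vec.keys() & t_vec.keys())
    return s_qt / s_q, s_qt / s_t

# Tiny excerpts of the Q and T vectors from the earlier slide:
q = {"Cuan": 2.49, "l_di": 2.39, "saur": 2.31, "ando": 2.01}
t = {"va_l": 2.55, "rdou": 2.32, "saur": 2.24, "ando": 1.88}
pi, rho = pi_rho(q, t)
```

Because the two totals differ, Pi and Rho differ too: the measure is asymmetric by construction.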
9. Clustering case study: Genetic classification of languages
- The relation between ancestor and descendant languages is usually called a genetic relationship.
- Such relationships are displayed as a tree of families of languages.
- The comparative method looks for regular (i.e. systematic) correspondences in the lexicon and thus allows linguists to propose hypotheses about genetic relationships.
- Languages are subject not only to systematic changes but also to random ones, so the comparative method is sensitive to noise, especially when studying languages that diverged more than 10,000 years ago.
- Joseph H. Greenberg developed the so-called mass lexical comparison method, which compares large samples of equivalent words for two languages.
- Our experiment is quite similar to this mass comparison method and to the work done by Stephen Huffman using the Acquaintance technique.
10. Clustering case study: Genetic classification of languages (cont.)
- Two different kinds of linguistic data:
  - Orthographic version of the first three chapters of the Book of Genesis.
  - Phonetic transcriptions of "The North Wind and the Sun".
- The similarity measure used to compare document vectors was 0.5·Π + 0.5·Ρ (an equally weighted combination of the Pi and Rho measures).
- The clustering algorithm is similar to Jarvis-Patrick.
- Both resultant trees are coherent with each other and consistent with linguistic theories.
11. Categorization case study: Language identification
- Categorization using blindLight is straightforward:
  - Each category vector is compared with the document;
  - the greater the similarity, the more likely the membership.
- Using the previous experiment's results, the category vectors on the right were built to develop a language identifier. Many of them are "artificial", obtained by intersecting several language vectors.
- The language identifier's operation is simple. Let's suppose an English sample of text:
  - It is compared against Basque, Finnish, Italic, northGermanic, and westGermanic.
  - The most likely category is westGermanic, so
  - it is compared against Dutch-German and English.
  - The most likely is English, which is a final category.
- Category hierarchy (leaves are final categories): Basque; Finnish; Italic (Catalan-French (Catalan, French), Italian, Portuguese-Spanish (Portuguese, Spanish)); northGermanic (Danish-Swedish (Danish, Swedish), Faroese, Norwegian); westGermanic (Dutch-German (Dutch, German), English).
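The hierarchical walk can be sketched like this. The `similarity` argument stands in for the real comparison against blindLight category vectors; the scores below are made up to reproduce the English walk-through.

```python
# Category hierarchy from the slide: internal nodes (some of them
# "artificial" intersections of language vectors) lead to final
# language categories at the leaves.
TREE = {
    "ROOT": ["Basque", "Finnish", "Italic", "northGermanic", "westGermanic"],
    "Italic": ["Catalan-French", "Italian", "Portuguese-Spanish"],
    "Catalan-French": ["Catalan", "French"],
    "Portuguese-Spanish": ["Portuguese", "Spanish"],
    "northGermanic": ["Danish-Swedish", "Faroese", "Norwegian"],
    "Danish-Swedish": ["Danish", "Swedish"],
    "westGermanic": ["Dutch-German", "English"],
    "Dutch-German": ["Dutch", "German"],
}

def identify(sample, similarity):
    # Walk down the hierarchy, at each level keeping the most similar
    # category, until a final (leaf) category is reached.
    node = "ROOT"
    while node in TREE:
        node = max(TREE[node], key=lambda cat: similarity(sample, cat))
    return node

# Stand-in similarity scores mimicking the slide's example:
scores = {"westGermanic": 0.9, "Dutch-German": 0.3, "English": 0.8}
result = identify("It is an English sample", lambda s, c: scores.get(c, 0.1))
```

With these scores the walk visits westGermanic and then stops at English, exactly as in the slide's example.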
12. Categorization case study: Language identification (cont.)
- Preliminary results using 1,500 posts from:
  - soc.culture.basque
  - soc.culture.catalan
  - soc.culture.french
  - soc.culture.galiza (Galician is not known by the identifier).
  - soc.culture.german
- Posts were submitted in raw form, including the whole header, to check noise tolerance.
- It was found that actual samples of around 200 characters can be identified in spite of lengthy headers (500 to 900 characters).
- Results for Galician:
  - As with the rest of the groups, plenty of spam (i.e. English posts).
  - Most of the posts were written in Spanish.
  - Posts actually written in Galician: 63% identified as Portuguese, 37% as Spanish. Graceful degradation?
- Results for other languages: [table not reproduced]
13. Information Retrieval using blindLight
- Π (Pi) and Ρ (Rho) can be linearly combined into different association measures to perform IR.
- Just two tested up to now: Pi and a combined "PiRho" measure (which performs slightly better).
- IR with blindLight is pretty easy:
  - For each document within the dataset, a 4-gram vector is computed and stored.
  - When a query is submitted to the system:
    - A 4-gram vector (Q) is computed for the query text.
    - For each doc vector (T):
      - Q and T are O-intersected, obtaining Pi and Rho values.
      - Pi and Rho are combined into a unique association measure (e.g. PiRho).
    - A reverse-ordered list of documents is built and returned to answer the query.
- Features and issues:
  - No indexing phase: documents can be added at any moment. (pro)
  - Comparing the query with every document is not really feasible with big data sets. (con)
- Rho values, and thus PiRho values, are negligible when compared to Pi; a norm function scales PiRho values into the range of Pi values.
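A minimal sketch of this retrieval loop. The `alpha`-weighted sum is just one possible Pi-Rho combination, the `norm` rescaling mentioned above is omitted, and the min-based intersection is an assumption of the sketch.

```python
def pi_rho(q_vec, t_vec):
    # Pi = S_Q∩T / S_Q, Rho = S_Q∩T / S_T; the intersected significance is
    # taken as the smaller of the two (an assumption, not spelled out here).
    s_qt = sum(min(q_vec[g], t_vec[g]) for g in q_vec.keys() & t_vec.keys())
    return s_qt / sum(q_vec.values()), s_qt / sum(t_vec.values())

def search(query_vec, doc_vecs, alpha=0.5):
    # No indexing phase: the stored 4-gram vector of every document is
    # compared against the query and the list is returned best-first.
    scored = []
    for doc_id, t_vec in doc_vecs.items():
        pi, rho = pi_rho(query_vec, t_vec)
        scored.append((alpha * pi + (1 - alpha) * rho, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored]

# Toy query and document vectors (made-up significances):
query = {"fina": 1.0, "nanc": 1.0}
docs = {
    "d1": {"fina": 1.0, "nanc": 1.0, "ials": 1.0},  # shares both query n-grams
    "d2": {"fina": 1.0, "xxxx": 1.0},               # shares one
    "d3": {"zzzz": 1.0},                            # shares none
}
ranking = search(query, docs)
```

The document sharing the most query n-grams comes back first, which also illustrates the scalability issue noted above: every query touches every stored vector.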
14. Bilingual IR with blindLight
- We have compared n-gram vectors for pseudo-translations with vectors for actual translations (Source: Spanish, Target: English):
  - 38.59% of the n-grams within pseudo-translated vectors are also within actual-translation vectors.
  - 28.31% of the n-grams within actual-translation vectors are present in pseudo-translated ones.
- A promising technique, but thorough work is required.
- INGREDIENTS: Two aligned parallel corpora (for instance, EuroParl). Languages S(ource) and T(arget).
- METHOD:
  - Take the original query written in natural language S (queryS).
  - Chop the original query into chunks of 1, 2, ..., L words.
  - Find in the S corpus sentences containing each of these chunks. Start with the longest ones and, once you've found sentences for one chunk, delete its subchunks.
  - Replace each of these S sentences by its T sentence equivalent.
  - Compute an n-gram vector for every T sentence; O-intersect all the vectors for each chunk.
  - Mix all the O-intersected n-gram vectors into a unique query vector (queryT).
  - Voilà! You have obtained a vector for a hypothetical queryT without having translated queryS.
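The METHOD above can be sketched as follows, with significances dropped (plain quad-gram sets) for brevity, on a toy two-sentence "corpus":

```python
def quadgrams(text):
    # Spaces become '_' as in the slides' n-gram examples.
    t = text.replace(" ", "_")
    return {t[i:i + 4] for i in range(len(t) - 3)}

def pseudo_translate(query_s, corpus):
    # corpus: list of aligned (S sentence, T sentence) pairs.
    words = query_s.split()
    covered = set()      # word positions already matched by a longer chunk
    query_t = set()
    for size in range(len(words), 0, -1):            # longest chunks first
        for start in range(len(words) - size + 1):
            span = set(range(start, start + size))
            if span <= covered:
                continue     # subchunk of an already matched chunk: deleted
            chunk = " ".join(words[start:start + size])
            hits = [t for s, t in corpus if chunk in s]
            if not hits:
                continue
            covered |= span
            # O-intersect the quad-gram vectors of every matched T sentence...
            common = set.intersection(*(quadgrams(t) for t in hits))
            # ...and mix the result into the unique queryT vector.
            query_t |= common
    return query_t

corpus = [
    ("mantiene excelentes relaciones con las instituciones financieras internacionales",
     "has excellent relationships with the international financial institutions"),
    ("el fortalecimiento de las instituciones financieras internacionales",
     "strengthening international financial institutions"),
]
query_t = pseudo_translate("instituciones financieras", corpus)
```

The queryT vector ends up containing English n-grams such as (inst), (fina), and (tion), without any actual translation step.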
- Worked example. Source query (Spanish): "Encontrar documentos en los que se habla de las discusiones sobre la reforma de instituciones financieras y, en particular, del Banco Mundial y del FMI durante la cumbre de los G7 que se celebró en Halifax en 1995." ("Find documents discussing the debates on the reform of financial institutions and, in particular, of the World Bank and the IMF during the G7 summit held in Halifax in 1995.")
  - Chunks: "Encontrar", "Encontrar documentos", "Encontrar documentos en", ..., "instituciones", "instituciones financieras", "instituciones financieras y", ...
  - S sentences found in the (EuroParl) corpus for "instituciones financieras": (1315) "mantiene excelentes relaciones con las instituciones financieras internacionales", (5865) "el fortalecimiento de las instituciones financieras internacionales", (6145) "La Comisión deberá estudiar un mecanismo transparente para que las instituciones financieras europeas..."
  - Their aligned T sentences: (1315) "has excellent relationships with the international financial institutions", (5865) "strengthening international financial institutions", (6145) "The Commission will have to look at a transparent mechanism so that the European financial institutions..."
  - O-intersected n-grams for "instituciones financieras": al_i, anci, atio, cial, _fin, fina, ial_, inan, _ins, inst, ions, itut, l_in, nanc, ncia, nsti, stit, tion, titu, tuti, utio.
  - The resulting queryT vector thus contains nicely translated n-grams, some harmlessly untranslated ones, some not-really-nice untranslated n-grams, and some definitely-not-nice noise n-grams.
15. Information Retrieval Results
- Experiments with small collections:
  - CACM (3,204 docs and 64 queries).
  - CISI (1,460 docs and 112 queries).
  - Results similar to those achieved by several systems, but not as good as those reached by SMART, for instance.
- CLEF 2004 results:
  - Monolingual IR within Russian documents: 72 documents found out of 123 relevant ones, average precision 0.14.
  - Bilingual IR using Spanish to query English docs: 145 documents found out of 375 relevant ones, average precision 0.06.
- However, blindLight does not apply:
  - Stop-word removal.
  - Stemming.
  - Query term weighting.
- Problems especially with broad topics.
16. Conclusions
- Genetic classification of languages (clustering) using blindLight:
  - Coherent results for both orthographic and phonetic input.
  - Results are also consistent with linguistic theories.
  - Results useful to develop language identifiers.
- Language identification (categorization) using blindLight:
  - Accuracy higher than 97%.
  - Information-to-noise ratio around 2/7.
- Information retrieval performance must be improved; however, it is:
  - Language independent.
  - Straightforward to extend to bilingual IR.
- To sum up, blindLight is an extremely simple technique which appears to be flexible enough to be applied to a wide range of NLP tasks, showing adequate performance in all of them.
17. One Size Fits All? A Simple Technique to Perform Several NLP Tasks
- Daniel Gayo Avello (University of Oviedo)

Thank you!