One Size Fits All? A Simple Technique to Perform Several NLP Tasks

Transcript and Presenter's Notes

1
One Size Fits All? A Simple Technique to Perform
Several NLP Tasks
  • Daniel Gayo Avello
  • (University of Oviedo)

2
Introduction
  • blindLight is a modified vector model with
    applications to several NLP tasks:
  • automatic summarization,
  • categorization,
  • clustering, and
  • information retrieval.

3
Vector Model vs. blindLight Model
  • blindLight Model
  • Different documents → different-length vectors.
  • No collection → no D → no vector space!
  • Terms → just character n-grams.
  • Term weights → in-document n-gram significance
    (a function of document term frequency alone).
  • Similarity measure → asymmetric (in fact, two
    association measures). A kind of light pairwise
    alignment (A vs. B ≠ B vs. A).
  • Advantages
  • Document vectors are unique document
    representations.
  • Suitable for ever-growing document sets.
  • Bilingual IR is trivial.
  • Highly tunable by linearly combining the two
    association measures.
  • Issues
  • Not tuned yet, so
  • poor performance with broad topics.
  • Vector Model
  • Document → D-dimensional vector of terms.
  • D = number of distinct terms within the whole
    collection of documents.
  • Terms → words/stems/character n-grams.
  • Term weights → function of in-document term
    frequency, in-collection term frequency, and
    document length.
  • Association measures → symmetric: Dice, Jaccard,
    Cosine, ...
  • Issues
  • Document vectors are document representations,
  • but representations with regard to the whole
    collection.
  • Curse of dimensionality → feature reduction.
  • Feature reduction when using n-grams as terms →
    ad hoc thresholds.

4
What's n-gram significance?
  • Can we know how important an n-gram is within
    just one document, without regard to any external
    collection?
  • A similar problem: extracting multiword items from
    text (e.g. "European Union", "Mickey Mouse", "Cross
    Language Evaluation Forum").
  • Solution by Ferreira da Silva and Pereira Lopes:
  • several statistical measures generalized so they
    can be applied to arbitrary-length word n-grams;
  • a new measure, Symmetrical Conditional Probability
    (SCP), which outperforms the others.
  • So, our proposal to answer the first question:
  • if SCP reveals the most significant multiword items
    within just one document, it can also be applied to
    rank the character n-grams of a document according
    to their significance.

5
What's n-gram significance? (cont.)
  • Equations for SCP:
  • (w1...wn) is an n-gram. Let's suppose we use
    quad-grams and take (igni) from the text
    "What's n-gram significance". Its dispersion
    points are
  • (w1) / (w2...w4) = (i) / (gni),
  • (w1w2) / (w3w4) = (ig) / (ni),
  • (w1...w3) / (w4) = (ign) / (i).
  • For instance, p((w1)) = p((i)) would be
    computed from the relative frequency of
    appearance, within the document, of n-grams
    starting with i (e.g. (igni), (ific), or (ican)).
  • Likewise, p((w4)) = p((i)) would be computed from
    the relative frequency of appearance, within the
    document, of n-grams ending with i (e.g. (m_si),
    (igni), or (nifi)).
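
The SCP equations themselves appeared as an image in the
original slide. A plausible reconstruction, following
Ferreira da Silva and Pereira Lopes' fair dispersion-point
normalization with the single-document probability
estimates described above (the exact formulation used by
blindLight is assumed here, not confirmed by this
transcript):

\[
\mathrm{SCP}_f\bigl((w_1 \dots w_n)\bigr) =
\frac{p\bigl((w_1 \dots w_n)\bigr)^2}{\mathrm{Avp}},
\qquad
\mathrm{Avp} =
\frac{1}{n-1}\sum_{i=1}^{n-1}
p\bigl((w_1 \dots w_i)\bigr)\,p\bigl((w_{i+1} \dots w_n)\bigr)
\]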

6
What's n-gram significance? (cont.)
  • The current implementation of blindLight uses
    quad-grams because
  • they provide better results than tri-grams, and
  • their significances are computed faster than those
    of n-grams with n ≥ 5.
  • How would it work when mixing different-length
    n-grams within the same document vector? An
    interesting question to address in the future.
  • Two example blindLight document vectors (a sketch
    of their computation follows below):
  • Q document: "Cuando despertó, el dinosaurio
    todavía estaba allí."
  • T document: "Quando acordou, o dinossauro ainda
    estava lá."
  • Q vector (45 elements): (Cuan, 2.49), (l_di,
    2.39), (stab, 2.39), ..., (saur, 2.31), (desp,
    2.31), ..., (ando, 2.01), (avía, 1.95), (_all,
    1.92)
  • T vector (39 elements): (va_l, 2.55), (rdou,
    2.32), (stav, 2.32), ..., (saur, 2.24), (noss,
    2.18), ..., (auro, 1.91), (ando, 1.88), (do_a,
    1.77)
  • How can such vectors be numerically compared?
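
A minimal Python sketch of how such a vector might be
built: overlapping quad-grams are extracted (with spaces
mapped to '_', as n-grams such as l_di and _all suggest)
and weighted with a single-document SCP. The weighting is
an assumption; the published vectors may apply a further
transformation, so the numbers will not match those shown
above.

from collections import Counter

def quad_grams(text):
    # Overlapping character quad-grams; spaces become '_' (cf. 'l_di', '_all').
    text = text.replace(" ", "_")
    return [text[i:i + 4] for i in range(len(text) - 3)]

def scp_significance(text):
    # blindLight-style document vector {quad-gram: significance}, where the
    # significance is the SCP of each quad-gram estimated from prefix/suffix
    # frequencies inside this single document only (hypothetical weighting).
    grams = quad_grams(text)
    total = len(grams)
    gram_freq = Counter(grams)
    prefix_freq = Counter()  # quad-grams starting with a given 1-3 char prefix
    suffix_freq = Counter()  # quad-grams ending with a given 1-3 char suffix
    for g in grams:
        for i in range(1, 4):
            prefix_freq[g[:i]] += 1
            suffix_freq[g[i:]] += 1
    vector = {}
    for g, freq in gram_freq.items():
        p_gram = freq / total
        avp = sum((prefix_freq[g[:i]] / total) * (suffix_freq[g[i:]] / total)
                  for i in range(1, 4)) / 3
        vector[g] = p_gram ** 2 / avp
    return vector

q_vector = scp_significance("Cuando despertó, el dinosaurio todavía estaba allí.")
t_vector = scp_significance("Quando acordou, o dinossauro ainda estava lá.")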

7
Comparing blindLight doc vectors
  • Some equations (shown as a figure in the original
    slide) define, for two document vectors Q and T:
  • the document total significances,
  • the intersected document vector and its total
    significance, and
  • the Π (Pi) and Ρ (Rho) asymmetric similarity
    measures.
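
A plausible reconstruction of those equations, based on
the worked example on the next slide. How the significance
s(g) of a shared n-gram is chosen for the intersected
vector is not stated in this transcript, so it is left
abstract here:

\[
S_Q = \sum_{g \in Q} s_Q(g), \qquad
S_T = \sum_{g \in T} s_T(g), \qquad
S_{Q \cap T} = \sum_{g \in Q \cap T} s(g)
\]
\[
\Pi = \frac{S_{Q \cap T}}{S_Q}, \qquad
\rho = \frac{S_{Q \cap T}}{S_T}
\]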
8
Comparing blindLight doc vectors (cont.)
Pi = S_Q∩T / S_Q = 20.48 / 97.52 ≈ 0.21
Rho = S_Q∩T / S_T = 20.48 / 81.92 ≈ 0.25
  • "The dinosaur is still here" (the Q and T example
    documents from the previous slide)

Q doc vector: S_Q = 97.52
T doc vector: S_T = 81.92
Q∩T vector: S_Q∩T = 20.48
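
A minimal sketch of this comparison in Python, reusing
scp_significance() from the earlier sketch. Taking the
smaller of the two significances for each shared quad-gram
is an assumption; the slides only show the resulting
totals.

def compare(q_vector, t_vector):
    # Asymmetric blindLight similarities between two significance vectors.
    s_q = sum(q_vector.values())
    s_t = sum(t_vector.values())
    s_qt = sum(min(q_vector[g], t_vector[g])  # assumed weights for Q∩T
               for g in q_vector if g in t_vector)
    return s_qt / s_q, s_qt / s_t  # (Pi, Rho)

pi, rho = compare(q_vector, t_vector)
# The slides report Pi = 20.48/97.52 ≈ 0.21 and Rho = 20.48/81.92 ≈ 0.25
# for the Q and T example documents.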
9
Clustering case study: Genetic classification
of languages
  • The relation between ancestor and descendant
    languages is usually called a genetic relationship.
  • Such relationships are displayed as a tree of
    language families.
  • The comparative method looks for regular (i.e.
    systematic) correspondences in the lexicon and thus
    allows linguists to propose hypotheses about
    genetic relationships.
  • Languages are subject not only to systematic
    changes but also to random ones, so the comparative
    method is sensitive to noise, especially when
    studying languages that diverged more than 10,000
    years ago.
  • Joseph H. Greenberg developed the so-called mass
    lexical comparison method, which compares large
    samples of equivalent words for two languages.
  • Our experiment is quite similar to this mass
    comparison method and to the work done by Stephen
    Huffman using the Acquaintance technique.

10
Clustering case study: Genetic classification
of languages (cont.)
  • Two different kinds of linguistic data:
  • an orthographic version of the first three chapters
    of the Book of Genesis;
  • phonetic transcriptions of "The North Wind and
    the Sun".
  • The similarity measure used to compare document
    vectors was 0.5Π + 0.5Ρ.
  • The clustering algorithm is similar to
    Jarvis-Patrick (see the sketch below).
  • Both resulting trees are coherent with each other
    and consistent with linguistic theories.
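
A minimal sketch of a Jarvis-Patrick-style grouping over
blindLight similarities, reusing compare() from above. The
neighbourhood size k, the shared-neighbour threshold kmin
and the 0.5Π + 0.5Ρ score are assumptions; the slides give
no parameters.

def cluster_languages(vectors, k=4, kmin=2):
    # vectors: {language_name: significance_vector}. Two samples are merged
    # when each lists the other among its k nearest neighbours and they
    # share at least kmin of those neighbours (Jarvis-Patrick condition).
    names = list(vectors)
    def sim(a, b):
        pi, rho = compare(vectors[a], vectors[b])
        return 0.5 * pi + 0.5 * rho
    neighbours = {
        a: set(sorted((b for b in names if b != a),
                      key=lambda b: sim(a, b), reverse=True)[:k])
        for a in names
    }
    parent = {a: a for a in names}  # union-find over the merge decisions
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a in names:
        for b in neighbours[a]:
            if a in neighbours[b] and len(neighbours[a] & neighbours[b]) >= kmin:
                parent[find(a)] = find(b)
    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())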

11
Categorization case study: Language identification
  • Categorization using blindLight is
    straightforward:
  • each category vector is compared with the
    document;
  • the greater the similarity, the more likely the
    membership.
  • Using the previous experiment's results, the
    category vectors on the right were built to develop
    a language identifier. Many of them are
    "artificial", obtained by intersecting several
    language vectors.
  • The language identifier's operation is simple.
    Let's suppose an English sample of text:
  • it is compared against Basque, Finnish, Italic,
    northGermanic and westGermanic;
  • the most likely category is westGermanic, so
  • it is compared against Dutch-German and English;
  • the most likely is English, which is a final
    category.

Category hierarchy:
  Basque
  Finnish
  Italic
    Catalan-French
      Catalan
      French
    Italian
    Portuguese-Spanish
      Portuguese
      Spanish
  northGermanic
    Danish-Swedish
      Danish
      Swedish
    Faroese
    Norwegian
  westGermanic
    Dutch-German
      Dutch
      German
    English
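
A minimal sketch of the cascaded identification, assuming
each category node carries a blindLight vector and reusing
scp_significance() and compare() from the earlier sketches.
The tree encoding and the 0.5Π + 0.5Ρ score are
assumptions.

def identify(sample_text, tree):
    # tree: {category_name: (category_vector, children_dict)}; a node with
    # an empty children_dict is a final category such as "English".
    sample_vector = scp_significance(sample_text)
    while tree:
        name, (vector, children) = max(
            tree.items(),
            key=lambda item: combined(sample_vector, item[1][0]))
        if not children:
            return name   # final category reached
        tree = children   # descend, e.g. westGermanic -> Dutch-German/English
    return None

def combined(q_vector, c_vector):
    pi, rho = compare(q_vector, c_vector)
    return 0.5 * pi + 0.5 * rho  # assumed linear combination of Pi and Rho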
12
Categorization case study: Language
identification (cont.)
  • Preliminary results using 1,500 posts from
  • soc.culture.basque
  • soc.culture.catalan
  • soc.culture.french
  • soc.culture.galiza (Galician is not known by
    the identifier).
  • soc.culture.german
  • Posts were submitted in raw form, including the
    whole header, to check noise tolerance.
  • It was found that actual samples of around 200
    characters can be identified in spite of lengthy
    headers (500 to 900 characters).
  • Results for Galician:
  • as with the rest of the groups, plenty of spam
    (i.e. English posts);
  • most of the posts were written in Spanish;
  • posts actually written in Galician: 63%
    identified as Portuguese, 37% as Spanish.
    Graceful degradation?
  • Results for other languages

13
Information Retrieval using blindLight
  • Π (Pi) and Ρ (Rho) can be linearly combined into
    different association measures to perform IR.
  • Just two have been tested up to now: Π alone, and Π
    combined with a normalized ΠΡ value ("piro"), which
    performs slightly better.
  • IR with blindLight is pretty easy:
  • for each document within the dataset a quad-gram
    vector is computed and stored;
  • when a query is submitted to the system:
  • a quad-gram vector (Q) is computed for the query
    text;
  • for each doc vector (T):
  • Q and T are ∩-intersected, obtaining the Π and Ρ
    values;
  • Π and Ρ are combined into a unique association
    measure (e.g. piro);
  • a reverse-ordered list of documents is built and
    returned to answer the query (see the sketch
    below).
  • Features and issues
  • No indexing phase; documents can be added at any
    moment. (+)
  • Comparing each query with every document is not
    really feasible with big data sets. (-)

Note: Rho, and thus PiRho, values are negligible when
compared to Pi; a norm function scales PiRho values
into the range of Pi values.
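
A minimal sketch of that retrieval loop, reusing the
earlier scp_significance() and compare() sketches. The
linear weight alpha is an assumption; the slides only say
that Π and Ρ are linearly combined and that any ΠΡ term
must be rescaled into Π's range.

def retrieve(query_text, doc_vectors, alpha=0.75):
    # doc_vectors: {doc_id: significance_vector}, computed once per document.
    # Returns documents in decreasing order of the combined association measure.
    q_vector = scp_significance(query_text)
    scored = []
    for doc_id, t_vector in doc_vectors.items():
        pi, rho = compare(q_vector, t_vector)
        scored.append((alpha * pi + (1 - alpha) * rho, doc_id))
    scored.sort(reverse=True)
    return scored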
14
Bilingual IR with blindLight
We have compared n-gram vectors for
pseudo-translations with vectors for actual
translations (source: Spanish, target: English).
38.59% of the n-grams within pseudo-translated
vectors are also within the actual translations'
vectors, and 28.31% of the n-grams within the actual
translations' vectors are present in the
pseudo-translated ones. A promising technique, but
thorough work is required.
  • INGREDIENTS: two aligned parallel corpora,
    languages S(ource) and T(arget).
  • METHOD (see the sketch after the example below):
  • Take the original query written in natural
    language S (queryS).
  • Chop the original query into chunks of 1, 2,
    ..., L words.
  • Find in the S corpus the sentences containing each
    of these chunks. Start with the longest chunks and,
    once you have found sentences for one chunk, delete
    its subchunks.
  • Replace each of these S sentences with its T
    sentence equivalent.
  • Compute an n-gram vector for every T sentence and
    ∩-intersect all the vectors for each chunk.
  • Merge all the ∩-intersected n-gram vectors into a
    unique query vector (queryT).
  • Voilà! You have obtained a vector for a
    hypothetical queryT without having translated
    queryS.

Example (for instance, with EuroParl as the parallel
corpus):

Original query (queryS): "Encontrar documentos en los que
se habla de las discusiones sobre la reforma de
instituciones financieras y, en particular, del Banco
Mundial y del FMI durante la cumbre de los G7 que se
celebró en Halifax en 1995." [Find documents discussing
the debates on the reform of financial institutions and,
in particular, of the World Bank and the IMF during the
G7 summit held in Halifax in 1995.]

Chunks: Encontrar / Encontrar documentos / Encontrar
documentos en / ... / instituciones / instituciones
financieras / instituciones financieras y / ...

S sentences matching "instituciones financieras":
(1315) mantiene excelentes relaciones con las
instituciones financieras internacionales.
(5865) el fortalecimiento de las instituciones
financieras internacionales
(6145) La Comisión deberá estudiar un mecanismo
transparente para que las instituciones financieras
europeas

Their T equivalents:
(1315) has excellent relationships with the
international financial institutions
(5865) strengthening international financial
institutions
(6145) The Commission will have to look at a transparent
mechanism so that the European financial institutions

Intersected n-grams for "instituciones financieras":
al_i, anci, atio, cial, _fin, fina, ial_, inan_, _ins,
inst, ions, itut, l_in, nanc, ncia, nsti, stit, tion,
titu, tuti, utio

The original figure distinguished, among these, nice
translated n-grams, nice un-translated n-grams,
not-really-nice un-translated n-grams and
definitely-not-nice noise n-grams.
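
A minimal sketch of the pseudo-translation recipe above,
reusing scp_significance(). Here s_sentences and
t_sentences stand for an aligned parallel corpus (same
index, same sentence); the sub-chunk pruning and the way
intersected vectors are merged are simplifications, not
confirmed details of the method.

def pseudo_translate(query_s, s_sentences, t_sentences, max_len=3):
    # Build a target-language query vector (queryT) without translating queryS.
    words = query_s.split()
    chunks = [" ".join(words[i:i + n])             # longest chunks first, so
              for n in range(max_len, 0, -1)       # sub-chunks of a matched
              for i in range(len(words) - n + 1)]  # chunk can be skipped later
    matched_chunks = []
    query_t = {}
    for chunk in chunks:
        if any(chunk in longer for longer in matched_chunks):
            continue               # a longer matched chunk already covers it
        t_hits = [t_sentences[i]
                  for i, s in enumerate(s_sentences) if chunk in s]
        if not t_hits:
            continue
        matched_chunks.append(chunk)
        # n-gram vector for every matching T sentence, then intersect them
        vectors = [scp_significance(t) for t in t_hits]
        shared = set(vectors[0]).intersection(*(set(v) for v in vectors[1:]))
        for g in shared:           # merge into the single queryT vector
            query_t[g] = max(query_t.get(g, 0.0), min(v[g] for v in vectors))
    return query_t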
15
Information Retrieval Results
  • Experiments with small collections:
  • CACM (3,204 docs and 64 queries);
  • CISI (1,460 docs and 112 queries).
  • Results are similar to those achieved by several
    systems, but not as good as those reached by
    SMART, for instance.
  • CLEF 2004 results:
  • Monolingual IR within Russian documents: 72
    documents found out of 123 relevant ones, average
    precision 0.14.
  • Bilingual IR using Spanish to query English docs:
    145 documents found out of 375 relevant ones,
    average precision 0.06.
  • However, blindLight does not apply
  • stop-word removal,
  • stemming, or
  • query term weighting.
  • Problems arise especially with broad topics.

16
Conclusions
  • Genetic classification of languages (clustering)
    using blindLight:
  • coherent results for both orthographic and
    phonetic input;
  • results are also consistent with linguistic
    theories;
  • results useful to develop language identifiers.
  • Language identification (categorization) using
    blindLight:
  • accuracy higher than 97%;
  • information-to-noise ratio around 2/7.
  • Information retrieval performance must be
    improved; however, the approach
  • is language independent, and
  • makes bilingual IR straightforward.
  • To sum up, blindLight is an extremely simple
    technique which appears to be flexible enough to
    be applied to a wide range of NLP tasks, showing
    adequate performance in all of them.

17
One Size Fits All? A Simple Technique to Perform
Several NLP Tasks
Daniel Gayo Avello (University of Oviedo)
Thank you!