
Natural Language Processing Applications
  • Lecture 7
  • Fabienne Venant
  • Université Nancy2 / Loria

Information Retrieval
What is Information Retrieval?
  • Information retrieval (IR) is finding material
    (usually documents) of an unstructured nature
    (usually text) that satisfies an information need
    from within large collections (usually stored on
    computers)
  • Applications
  • Many universities and public libraries use IR
    systems to provide access to books, journals and
    other documents.
  • Web search
  • Large volumes of unstable, unstructured data
  • Speed is important
  • Cross-language IR
  • Finding documents written in another language
  • Touches on Machine translation
  • ....

  • The set of texts can be very large, hence
    efficiency is a concern
  • Textual data is noisy, incomplete and
    untrustworthy, hence robustness is a concern
  • Information may be hidden
  • Need to derive information from raw data
  • Need to derive information from vaguely expressed

IR Basic concepts
  • Information needs: queries and relevance
  • Indexing: helps speed up retrieval
  • Retrieval models: describe how to search and
    recover relevant documents
  • Evaluation: IR systems are large and convincing
    evaluation is tricky

Information needs
  • INFORMATION NEED: the topic about which the
    user desires to know more
  • QUERY: what the user conveys to the computer in
    an attempt to communicate the information need
  • RELEVANCE: a document is relevant if it is one
    that the user perceives as containing information
    of value with respect to their personal
    information need
  • Example
  • topic: pipeline leaks
  • relevant documents: it doesn't matter whether they
    use those words or express the concept with other
    words, such as "pipeline rupture".

Capturing information needs
  • Information needs can be hard to capture
  • One possibility: use natural language
  • Advantage: expressive enough to allow all needs
    to be described
  • Drawbacks
  • Semantic analysis of arbitrary NL is very hard
  • Users may not want to type full-blown sentences
    into a search engine

  • Information needs are typically expressed as a
    query
  • Where shall I go on holiday? → holiday
  • Two main types of possible queries
  • How much blood does the human heart pump in one
    minute?
  • Boolean queries
  • → heart AND blood AND minutes
  • Web-type queries
  • → human biology

  • A query
  • is usually quite short and incomplete
  • may contain misspelled or poorly selected words
  • may contain too many or too few words
  • The information need
  • may be difficult to describe precisely, especially
    when the user isn't familiar with the topic
  • Precise understanding of the document content is

Persistent vs one-off Queries
  • Queries may or may not evolve over time
  • Persistent queries
  • predefined and routinely performed
  • Top ten performing shares today
  • Continuous queries: persistent queries that
    allow users to receive new results when they
    become available
  • typical of Information extraction and News
    Routing systems
  • One-off (or ad-hoc) queries
  • created to obtain information as the need arises
  • typical of Web searching

  • Relevance is subjective
  • "python": ambiguous, but not for the user
  • Topicality vs. utility: a document is relevant
    with respect to a specific goal
  • → A document is relevant if it addresses the
    stated information need, not because it just
    happens to contain all the words in the query.
  • Relevance is a gradual concept (a document is not
    just relevant or not; it is more or less relevant
    to a query)
  • IR systems usually rank retrieved documents by
  • But many algorithms use a binary decision of

The big picture
  • An IR system looks for data matching some
    criteria defined by the users in their queries.
  • The language used to ask a question is called the
    query language.
  • These queries use keywords (atomic items
    characterizing some data).
  • The basic unit of data is a document (can be a
    file, an article, a paragraph, etc.).
  • A document corresponds to free text (may be
  • All the documents are gathered into a collection
    (or corpus).

Searching for a given word in a document
  • One way to do that is to start at the beginning
    and to read through all the text
  • Pattern matching (regular expressions) + speed of
    modern computers → grepping through text can be
    very effective
  • Enough for simple querying of modest collections
    (millions of words)
  • But for many purposes, you do need more
  • To process large document collections (billions
    or trillions of words) quickly.
  • To allow more flexible matching operations. For
    example, it is impractical to perform the query
    Romans NEAR countrymen with grep, where NEAR
    might be defined as "within 5 words" or "within
    the same sentence".
  • To allow ranked retrieval: in many cases you
    want the best answer to an information need among
    many documents that contain certain words
  • → You need an index

Motivation for Indexing
  • Extremely large dataset
  • Only a tiny fraction of the dataset is relevant
    to a given query
  • Speed is essential (0.25 second for web
  • Indexing helps speed up retrieval

Indexing documents
  • How to relate the user's information need with
    some documents' content?
  • Idea: using an index to refer to documents
  • Usually an index is a list of terms that appear
    in a document; it can be represented
    mathematically as
  • $\mathrm{index} : doc_i \mapsto \bigcup_j \{keyword_j\}$
  • Here, the kind of index we use maps keywords to
    the list of documents they appear in
  • $\mathrm{index}' : keyword_j \mapsto \bigcup_i \{doc_i\}$
  • We call this an inverted index.

Indexing documents
  • The set of keywords is usually called the
    dictionary (or vocabulary)
  • A document identifier appearing in the list
    associated with a keyword is called a posting
  • The list of document identifiers associated with
    a given keyword is called a posting list

Inverted files
  • The most common indexing technique
  • Source file collection organised by documents
  • Inverted file collection organised by terms

Inverted Index
  • Given a dictionary of terms (also called
    vocabulary or lexicon)
  • For each term, record in a list which documents
    the term occurs in
  • Each item in the list
  • records that a term appeared in a document
  • and, later, often, the positions in the document
  • is conventionally called a posting
  • The list is then called a postings list (or
    inverted list).

Inverted Index
From "An Introduction to Information Retrieval",
C.D. Manning, P. Raghavan and H. Schütze
  • Draw the inverted index that would be built for
    the following document collection
  • Doc 1 breakthrough drug for schizophrenia
  • Doc 2 new schizophrenia drug
  • Doc 3 new approach for treatment of schizophrenia
  • Doc 4 new hopes for schizophrenia patients
  • For this document collection, what are the
    returned results for these queries
  • schizophrenia AND drug
  • schizophrenia AND NOT(drug OR approach)
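
One way to check an answer to this exercise is to build the index in a few lines of Python. This is only a minimal sketch: the function name `build_inverted_index` and the helper `postings` are illustrative, tokenization is a naive lowercase whitespace split, and Python sets stand in for the postings lists.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization (assumption)
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# The four documents of the exercise.
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}
index = build_inverted_index(docs)
all_ids = set(docs)

def postings(term):
    """Postings of a term as a set (empty if the term is not in the dictionary)."""
    return set(index.get(term, []))

# schizophrenia AND drug
q1 = sorted(postings("schizophrenia") & postings("drug"))
# schizophrenia AND NOT (drug OR approach)
q2 = sorted(postings("schizophrenia") & (all_ids - (postings("drug") | postings("approach"))))

print(index["new"])  # [2, 3, 4]
print(q1, q2)        # [1, 2] [4]
```

Using sets keeps the sketch short; a real system would keep sorted postings lists and merge them.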

Indexing documents
  • Open questions: how to build an index
    automatically? What are the relevant keywords?
  • Some additional desiderata
  • fast processing of large collections of
  • having flexible matching operations (robust
  • having the possibility to rank the retrieved
    documents in terms of relevance
  • To ensure these requirements (especially fast
    processing) are fulfilled, the indexes are
    computed in advance
  • Note that the format of the index has a huge
    impact on the performance of the system

Indexing documents
  • NB: an index is built in 4 steps
  • Gathering of the collection (each document is
    given a unique identifier)
  • Segmentation of each document into a list of
    atomic tokens → tokenization
  • Linguistic processing of the tokens in order to
    normalize them → lemmatization
  • Indexing the documents by computing the
    dictionary and lists of postings

Manual indexing
  • Advantages
  • Human judgements are the most reliable
  • Retrieval is better
  • Drawbacks
  • Time consuming
  • Not always consistent
  • different people build different indexes for the
    same document.

Automatic indexing
  • Using NLU?
  • Not fast enough in real world settings (e.g., web
  • Not robust enough (low coverage)
  • Difficulty: what to include and what to exclude.
  • Indexes should not contain headings for topics
    for which there is no information in the document
  • Can a machine parse full sentences of ideas and
    recognize the core ideas, the important terms,
    and the relationships between related concepts
    throughout the entire text?

Building the vocabulary
Stop list
  • Some extremely common words, which would appear to
    be of little value in helping select documents
    matching a user need, are excluded from the
    vocabulary entirely.
  • These words are called STOP WORDS; they form the
    stop list, the members of which are discarded
    during indexing
  • Collection strategy
  • Sort the terms by collection frequency (the total
    number of times each term appears in the document
    collection)
  • Take the most frequent terms
  • often hand-filtered for their semantic content
    relative to the domain of the documents being
    indexed
  • What counts as a stop word depends on the
    collection
  • in a collection of legal articles, "law" can be
    considered a stop word
  • Example
  • a an and are as at be by for from has he in is it
    its of on that the to was were will with
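
The collection strategy above can be sketched as follows. The function names and the cut-off `k` are illustrative assumptions, and the hand-filtering step is left to a human; the sketch only proposes candidates by collection frequency.

```python
from collections import Counter

def stop_list_candidates(docs, k):
    """Return the k terms with the highest collection frequency.

    Collection frequency = total number of occurrences over all documents.
    Tokenization is a naive lowercase whitespace split (assumption).
    """
    freq = Counter()
    for text in docs:
        freq.update(text.lower().split())
    return [term for term, _ in freq.most_common(k)]

def remove_stop_words(tokens, stop_words):
    """Drop every token that belongs to the stop list."""
    return [t for t in tokens if t not in stop_words]

toy_docs = ["the cat sat on the mat", "the dog and the cat"]
print(stop_list_candidates(toy_docs, 2))                     # ['the', 'cat']
print(remove_stop_words("flights to london".split(), {"to"}))  # ['flights', 'london']
```

The second call shows why the candidate list still needs hand-filtering: a frequent content word such as "cat" can rank high in a small or specialised collection.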

Why eliminate stop words?
  • Efficiency
  • Eliminating stop words reduces the size of the
    index considerably
  • Eliminating stop words reduces retrieval time
  • Quality of results
  • Most of the time not indexing stop words does
    little harm
  • keyword searches with terms like "the" and "by"
    don't seem very useful
  • BUT, this is not true for phrase searches.
  • The phrase query "President of the United States"
    is more precise than President AND "United
    States"
  • The meaning of "flights to London" is likely to
    be lost if the word "to" is stopped out.
  • .....

Building the vocabulary
  • Processing a stream of characters to extract
  • First task: tokenization. Main difficulties:
  • token delimiters (e.g. Chinese)
  • apostrophes (e.g. O'Neill, Finland's capital)
  • hyphens (e.g. Hewlett-Packard, state-of-the-art)
  • segmented compound nouns (e.g. Los Angeles)
  • unsegmented compound nouns (icecream, breadknife)
  • numerical data (dates, IP addresses)
  • word order (e.g. Arabic with respect to nouns and
    numbers)

Solutions for tokenization issues
  • Using a pre-defined dictionary with largest
    matches and heuristics for unknown words
  • Using learning algorithms trained over
    hand-segmented words

Choosing keywords
  • Selecting the words that are most likely to
    appear in a query
  • These words characterize the documents they
    appear in
  • Which are they?

The bag of words approach
  • Extreme interpretation of the principle of
    compositional semantics
  • The meaning of documents resides solely in the
    words that are contained within them
  • The exact ordering of the terms in a document is
    ignored but the number of occurrences of each
    term is material

  • "Not the same thing a bit!" said the Hatter.
  • "You might just as well say that 'I see what I
    eat' is the same thing as 'I eat what I see'!"
  • "You might just as well say," added the March
    Hare, "that 'I like what I get' is the same thing
    as 'I get what I like'!"
  • "You might just as well say," added the Dormouse,
    who seemed to be talking in its sleep, "that 'I
    breathe when I sleep' is the same thing as 'I
    sleep when I breathe'!"

Bags of words
  • Nevertheless, it seems intuitive that two
    documents with similar bag-of-words
    representations are similar in content.
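
A short sketch makes the Hatter's objection concrete: under a bag-of-words representation the two sentences he contrasts become identical. The helper name `bag_of_words` is illustrative, and tokenization is again a plain lowercase whitespace split.

```python
from collections import Counter

def bag_of_words(text):
    """Term -> number of occurrences; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("I see what I eat")
b = bag_of_words("I eat what I see")

print(a)       # counts only: 'i' appears twice, the other words once
print(a == b)  # True: the two sentences have the same bag of words
```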

What's in a bag of words?
  • Are all words in a document equally important?
  • stop words do not contribute in any way to
    retrieval and scoring
  • A BoW contains terms
  • What should count as a term?
  • Words
  • Phrases (e.g., president of the US)

Morphological normalization
  • Should index terms be word forms, lemmas or
  • Matching morphological variants increases recall
  • Examples of morphological variants
  • anticipate, anticipating, anticipated, ...
  • company/companies, sell/sold
  • USA vs U.S.A.
  • 22/10/2007 vs 10/22/2007 vs 2007/10/22
  • university vs University
  • Idea: using equivalence classes of terms
  • e.g. Opel, OPEL, opel → opel
  • Two techniques
  • Stemming refers to a crude heuristic process
    that chops off the ends of words in the hope of
    achieving this goal correctly most of the time
  • Lemmatisation refers to doing things properly
    with the use of a vocabulary and morphological
    analysis of words, normally aiming to remove
    inflectional endings only and to return a
    dictionary form of a word, which is known as the
    lemma
  • NB: documents and queries have to be processed
    using the same tokenization process!

Stemming and Lemmatization
  • Role: reducing inflectional forms to a common base
    form
  • Example
  • car, cars, car's, cars' → car
  • am, are, is → be
  • Stemming removes suffixes (surface markers) to
    produce root forms
  • Lemmatization reduces a word to a canonical form
    (using a dictionary and a morphological analyser)
  • Illustration of the difficulty
  • plurals (woman/women, crisis/crises)
  • derivational morphology (automatize/automate)
  • English → Porter stemming algorithm (University
    of Cambridge, UK, 1980)

Porter stemmer
  • Algorithm based on a set of context-sensitive
    rewriting rules
  • http//
  • http//
  • Rules are composed of a pattern (left-hand side)
    and a string (right-hand side), for example
  • (.*)sses → \1ss: sses → ss, caresses → caress
  • (.*[aeiou].*)ies → \1i: ies → i, ponies → poni,
    ties → ti
  • (.*[aeiou].*)ss → \1ss: ss → ss, caress → caress
  • Rules may be constrained by conditions on the
    word's measure, for example
  • (m > 1) (.*)ement → \1: replacement → replac
    but not cement → c
  • (m > 0) (.*)eed → \1ee: feed → feed but agreed
    → agree
  • (*v*) ed → \1: plastered → plaster but bled
    → bled
  • (*v*) ing → \1: motoring → motor but sing →
    sing
Porter Stemmer Word measure
  • Assume that a list of consonants is denoted by
    C, and a list of vowels by V
  • Any word, or part of a word, has one of the four
    forms
  • CVCV ... C
  • CVCV ... V
  • VCVC ... C
  • VCVC ... V
  • These may all be represented by the single form
  • [C]VCVC ... [V] where the square brackets denote
    arbitrary presence of their contents.
  • Using (VC)^m to denote VC repeated m times, this
    may again be written as
  • [C](VC)^m[V].
  • m will be called the measure of any word or word
    part when represented in this form.
  • Here are some examples
  • m = 0: TR, EE, TREE, Y, BY
  • m = 1: TROUBLE, OATS, TREES, IVY
  • What is the Porter measure of the following words
    (give your computation)?
  • crepuscular
  • rigorous
  • placement
  • cr ep usc ul ar
  • C VC VC VC VC
  • m = 4
  • r ig or ous
  • C VC VC VC
  • m = 3
  • pl ac em ent
  • C VC VC VC
  • m = 3
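
The measure can also be computed mechanically. The sketch below follows Porter's definition of a consonant (any letter other than a, e, i, o, u, and other than y preceded by a consonant); the function names are illustrative, and only plain ASCII lowercase words are assumed.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Porter measure m: the number of VC blocks in the form [C](VC)^m[V]."""
    word = word.lower()
    pattern = []
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        if not pattern or pattern[-1] != kind:  # collapse runs such as 'cr' into one C
            pattern.append(kind)
    return "".join(pattern).count("VC")

for w in ["tree", "by", "trouble", "ivy", "crepuscular", "rigorous", "placement"]:
    print(w, measure(w))
```

On the three exercise words this gives 4, 3 and 3, in agreement with the hand computation.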

  • Most stemmers also remove suffixes such as ed,
    ing, ational, ation, able, ism...
  • Relational → relate
  • Most stemmers don't use lexical look-up
  • There are shortcomings
  • Stemming can result in non-words
  • Organization → Organ
  • Doing → doe
  • Unrelated words can be reduced to the same stem
  • police, policy → polic

  • Popular stemmers
  • Porters
  • Lovins
  • Iterated Lovins
  • Kstem

  • Exceptions need to be handled
  • sought → seek, sheep → sheep, feet → foot
  • Computationally more expensive than stemming as
    it looks up words in a dictionary
  • Lemmatizer for French
  • http//
  • FLEMM (F. Namer)
  • POS taggers with lemmatization TreeTagger, LT-POS

What is actually used?
  • Most retrieval systems use stemming/lemmatising
    and stop word lists
  • Stemming increases recall while harming precision
  • Most web search engines do use stop word lists
    but not stemming/lemmatising because
  • the text collection is extremely large so that
    the chance of matching morphological variants is
  • recall is not an issue
  • stemming is imperfect and the size and diversity
    of the web increase the chance of a mismatch
  • stemming/tokenising tools are available for few

Example Text Representations
  • Scientists have found compelling new evidence of
    possible ancient microscopic life on Mars, derived
    from magnetic crystals in a meteorite that fell to
    Earth from the red planet, NASA announced on
    Monday.
  • Web search: scientists, found, compelling, new,
    evidence, possible, ancient, microscopic, life,
    mars, derived, magnetic, crystals, meteorite,
    fell, earth, red, planet, NASA, announced, Monday
  • Information service or library search: scientist,
    find, compelling, new, evidence, possible,
    ancient, microscopic, life, mars, derived,
    magnetic, crystal, meteorite, fall, earth, red,
    planet, NASA, announce, Monday

  • Document unit
  • An index can map terms
  • ... to documents
  • ... to paragraphs in documents
  • ... to sentences in documents
  • ... to positions in documents
  • An IR system should be designed to offer choices
    of granularity.
  • From now on, we will assume that a
    suitable size document unit has been chosen,
    together with an appropriate way of dividing or
    aggregating files, if needed.

Index Content
  • The index usually stores some or all of the
    following information
  • For each term
  • Document count. How many documents the term
    occurs in.
  • Total frequency count. How many times the term
    occurs across all documents → popularity
  • For each term and for each document
  • Frequency. How often the term occurs in that
    document.
  • Position. The offsets at which the term occurs in
    that document.

Retrieval model
What is a retrieval model
  • A model is an abstraction of a process here
  • Conclusions derived by the model are good if the
    model provides a good approximation of the
    retrieval process
  • IR model variables: queries, documents, terms,
    relevance, users, information needs
  • Existing types of retrieval models
  • Boolean models
  • Vector space models
  • Probabilistic models
  • Models based on Belief nets
  • Models based on language models

Retrieval models: the general intuition
  • Documents and user information needs are
    represented using index terms
  • Index terms serve as links to documents
  • Queries consist of index terms
  • Relevance can be measured in terms of a match
    between queries and document index

Exact vs. Best Match
  • Exact Match
  • A query specifies precise retrieval criteria
  • Each document either matches or fails to match
    the query
  • The result is a set of documents (no ranking)
  • Best match
  • A query describes good or best matching documents
  • The result is a ranked list of documents

Statistical retrieval
Statistical Models
  • A document is typically represented by a bag of
    words (unordered words with frequencies)
  • User specifies a set of desired terms with
    optional weights
  • Weighted query terms
  • Q = < database 0.5, text 0.8, information 0.2 >
  • Unweighted query terms
  • Q = < database, text, information >
  • No Boolean conditions specified in the query.

Statistical Retrieval
  • Retrieval based on similarity between query and
  • Output documents are ranked according to
    similarity to query
  • Similarity based on occurrence frequencies of
    keywords in query and document
  • Automatic relevance feedback can be supported
  • The user issues a (short, simple) query.
  • The system returns an initial set of retrieval
  • The user marks some returned documents as
    relevant or nonrelevant.
  • The system computes a better representation of the
    information need based on the user feedback.
  • The system displays a revised set of retrieval

Boolean model
The boolean model
  • Most common exact-match model
  • Basic assumptions
  • An index term is either present or absent in a
    document
  • All index terms provide equal evidence with
    respect to information needs
  • Queries are boolean combinations of index terms
  • x AND y: documents that contain both x and y
    (intersection of addresses)
  • x OR y: documents that contain x, y or both (union
    of addresses)
  • NOT x: documents that do not contain x (complement
    set of addresses)
  • Additionally,
  • proximity operator
  • simple regular expressions
  • spelling variants

Boolean queries: example
  • User information need
  • → interested in learning about vitamins that are
    antioxidants
  • User boolean query
  • → antioxidant AND vitamin

The boolean model
  • Example of input collection (Shakespeare's
  • Doc1: I did enact Julius Caesar. I was killed in
    the Capitol. Brutus killed me.
  • Doc2: So let it be with Caesar. The noble Brutus
    hath told you Caesar was ambitious.

The boolean model: index construction
  • First we build the list of pairs (keyword,

The boolean model: index construction
  • Then the lists are sorted by keywords, frequency
    information is added

The boolean model: index construction
  • Multiple occurrences of keywords are then merged
    to create a dictionary file and a postings file

Processing Boolean queries
  • User boolean query: Brutus AND Calpurnia
  • over the inverted index
  • Locate Brutus in the Dictionary
  • Retrieve its postings
  • Locate Calpurnia in the Dictionary
  • Retrieve its postings
  • Intersect the two postings lists
  • The intersection operation is the crucial one. It
    has to be efficient so as to be able to
    quickly find documents that contain both terms.
  • sometimes referred to as merging postings lists
    because it uses a merge algorithm
  • Merge algorithm: general family of algorithms
    that combine multiple sorted lists by interleaved
    advancing of pointers through each list
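
The intersection step can be sketched as a classic two-pointer merge over sorted lists of document ids. The function name `intersect` and the two postings lists below are illustrative values chosen for the example, not data taken from the slides.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by increasing document id.

    Each pointer only moves forward, so the cost is linear in len(p1) + len(p2).
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:      # document contains both terms
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:     # advance the pointer on the smaller id
            i += 1
        else:
            j += 1
    return answer

# Illustrative postings lists (assumed values).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

The algorithm silently returns wrong results on unsorted input, which is why the postings lists must be kept sorted.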

Extended boolean queries
  • Merging algorithm (from Manning et al., 07)

NB: the postings lists HAVE to be sorted.
Extended boolean queries
  • Generalisation of the merging process
  • Imagine more than 2 keywords appear in the
  • (Brutus AND Caesar) AND NOT (Capitol)
  • Brutus AND Caesar AND Capitol
  • (Brutus OR Caesar) AND (Capitol
  • ...
  • Ideas
  • consider keywords with shorter posting lists
    first (to reduce the number of operations).
  • use the frequency information stored in the
  • → See Manning et al., 07 for the algorithm
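
The first idea, processing the shortest postings lists first, can be sketched as below for a purely conjunctive query. This is a simplified illustration rather than the algorithm from the cited book: the names `intersect` and `intersect_many` and the postings values are assumptions, and the list length stands in for the document frequency stored in the dictionary.

```python
def intersect(p1, p2):
    """Two-pointer intersection of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(postings_lists):
    """AND of several terms, starting with the rarest ones.

    Intermediate results can only shrink, so starting from the shortest
    list keeps every later merge as cheap as possible.
    """
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)  # shortest list first
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:                         # nothing left to match
            break
        result = intersect(result, postings)
    return result

# Illustrative postings lists (assumed values).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
calpurnia = [2, 31, 54, 101]
print(intersect_many([brutus, caesar, calpurnia]))  # [2]
```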

Extended boolean queries
retrieved docs D7, D5, D2
  • How would you process the following queries (main
  • Brutus AND NOT Caesar
  • Try your algorithm on

  • How would you process the following query (main
  • Brutus OR NOT Caesar
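
One possible way to approach these two exercises is sketched below. `and_not` is a variant of the merge that keeps the ids of the first list that are missing from the second. `or_not` needs the list of all document ids in the collection, because NOT Caesar is a complement; it uses the identity x OR NOT y = NOT (y AND NOT x). The function names and the sample postings are assumptions made for the illustration.

```python
def and_not(p1, p2):
    """Ids of sorted list p1 that do not appear in sorted list p2."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])   # p1[i] cannot occur later in p2
            i += 1
        elif p1[i] == p2[j]:       # present in both: excluded
            i += 1
            j += 1
        else:
            j += 1
    return answer

def or_not(p1, p2, all_doc_ids):
    """x OR NOT y, computed as the complement of (y AND NOT x)."""
    return and_not(all_doc_ids, and_not(p2, p1))

# Illustrative data (assumed values): a collection of 12 documents.
all_doc_ids = list(range(1, 13))
brutus = [1, 2, 4, 11]
caesar = [1, 2, 5, 6]
print(and_not(brutus, caesar))              # [4, 11]
print(or_not(brutus, caesar, all_doc_ids))  # [1, 2, 3, 4, 7, 8, 9, 10, 11, 12]
```

The second result is almost the whole collection, which illustrates why queries with an unrestricted NOT are expensive and often of little use.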

Remarks on the boolean model
  • The boolean model allows precise queries to be
    expressed (you know what you get, BUT you have
    no flexibility → exact matches)
  • Boolean queries can be processed efficiently
    (time complexity of the merge algorithm is linear
    in the sum of the lengths of the lists to be
    merged)
  • Has been a reference model in IR for a long time

Advantages of exact-match retrieval
  • Predictable, easy to explain
  • Structured queries
  • Works well when information need is clear and

Drawbacks of exact-match retrieval
  • Unintuitive for non-experts: adequate query
    formulation is difficult for most users
  • no ranking of retrieved documents
  • exact matching may lead to too few or too many
    retrieved documents
  • too few if not using synonyms
  • difficulty increases with collection size
  • large result sets need to be compensated by
    interactive query refinement
  • No notion of partial relevance (useful if query
    is overrestrictive)
  • All terms have equal importance (no term
  • Ranking models are consistently better

Boolean model: the story so far
  • An inverted index associates keywords with posting
    lists
  • The postings lists contain document identifiers
    (and other useful information, such as total
    frequencies, number of documents, etc.)
  • Boolean queries are processed by merging posting
    lists in order to find the documents satisfying
    the query
  • The cost of this list merging is linear in
    the total number of document ids: $O(m + n)$
  • Question: how to process phrase queries (i.e.
    taking the words' context into account)?

Dealing with phrase queries
  • Many complex or technical concepts and many
    organization and product names are multiword
    compounds or phrases.
  • Stanford University
  • Graph Theory
  • Natural Language Processing
  • ...
  • The user wants documents where the whole phrase
    appears, and not only some parts of it (i.e. "The
    inventor Stanford Ovshinsky never went to
    university" is not a match)
  • About 10% of web queries are phrase queries
    (song names, institutions...)
  • Such queries need either more complex dictionary
    terms, or a more complex index (critical parameter:
    size of the index)

Biword indexes
  • Use key-phrases of length 2, for example
  • Text: Natural Language Processing
  • Dictionary
  • Natural Language
  • Language Processing
  • The dictionary is made of biwords (notion of
  • Query: Information retrieval in Natural Language
  • (Information retrieval) AND (retrieval Natural)
    AND (Natural Language) AND (Language Processing)
  • It might seem a better query to omit the middle
  • Better results can be obtained by using more
    precise part-of-speech patterns that define which
    extended biwords should be indexed
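
Generating the biwords of a text, and turning a phrase into a conjunction of biwords, can be sketched as follows. The function names `biwords` and `biword_query` are illustrative, and no stop-word removal or part-of-speech filtering is attempted.

```python
def biwords(text):
    """All pairs of consecutive tokens (lowercased, whitespace tokenization)."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def biword_query(phrase):
    """Rewrite a phrase query as a conjunction of its biwords."""
    return " AND ".join(f"({a} {b})" for a, b in biwords(phrase))

print(biwords("Natural Language Processing"))
# [('natural', 'language'), ('language', 'processing')]
print(biword_query("Natural Language Processing"))
# (natural language) AND (language processing)
```

A document matching every biword is only a candidate: the pairs may occur in different places, so false positives remain possible for phrases longer than two words.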

Positional indexes
  • Store positions in the inverted indexes, for example
  • termID
  • doc1: position1, position2, ...
  • doc2: position1, position2, ...
  • ....
  • Processing then corresponds to an extension of
    the merging algorithm (additional checks while
    traversing the lists)
  • NB: such indexes can be used to process proximity
    queries (i.e. using constraints on proximity
    between words)

  • Positional indexes need an entry per occurrence
    (NB: classic inverted indexes need an entry per
    document id)
  • The size of such indexes grows with the length of
    the documents, since every occurrence of a term
    adds an entry
  • The size of a positional index depends on the
    language being indexed and the type of document
    (books, articles, etc)
  • On average, a positional index is 2-4 times
    bigger than an inverted index; it can reach 35 to
    50% of the size of the original text (for
  • Positional indexes can be used in combination
    with classic indexes to save time and space (see
    Williams et al, 2005).

  • Which documents can contain the sentence "to be
    or not to be", considering the following
    (incomplete) indexes?
  • be
  • 1: 7, 18, 33, 72, 86, 231
  • 2: 3, 149
  • 4: 17, 191, 291, 430, 434
  • 5: 363, 367
  • to
  • 2: 1, 17, 74, 222, 551
  • 4: 8, 16, 190, 429, 433
  • 7: 13, 23, 191

  • Given the following positional indexes, give the
    document ids corresponding to the query "world
    wide web"
  • world
  • 1: 7, 18, 33, 70, 85, 131
  • 2: 3, 149
  • 4: 17, 190, 291, 430, 434
  • wide
  • 1: 12, 19, 40, 72, 86, 231
  • 2: 2, 17, 74, 150, 551
  • 3: 8, 16, 191, 429, 435
  • web
  • 1: 20, 22, 41, 75, 87, 200
  • 2: 18, 32, 45, 56, 77, 151
  • 4: 25, 192, 300, 332, 440

  • The postings lists to access are: to, be, or,
    not
  • We will examine intersecting the postings lists
    for to and be.
  • We first look for documents that contain both
    terms
  • Then, we look for places in the lists where there
    is an occurrence of be with a token index one
    higher than a position of to
  • and then we look for another occurrence of each
    word with token index 4 higher than the first
    occurrence
  • In the above lists, the pattern of occurrences
    that is a possible match is
  • to: <...; 4: <..., 429, 433>; ...>
  • be: <...; 4: <..., 430, 434>; ...>
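
The check described above can be sketched with a small helper that takes positional postings and a pattern of (term, offset) pairs. The name `phrase_candidates` and the dictionary layout (term → document id → sorted positions) are assumptions made for the illustration; the data are the incomplete "to" and "be" postings of the exercise, so the result is only a set of possible matches, since "or" and "not" cannot be checked.

```python
def phrase_candidates(postings, pattern):
    """Find documents and start positions compatible with a phrase pattern.

    postings: term -> {doc_id: sorted list of positions}
    pattern:  list of (term, offset) pairs, offset counted from the phrase start
    Returns {doc_id: [start positions]} for documents where every listed
    term occurs at start + offset.
    """
    terms = {term for term, _ in pattern}
    # Only documents containing all the terms can match.
    common = set.intersection(*(set(postings[t]) for t in terms))
    hits = {}
    first_term, first_offset = pattern[0]
    for doc in sorted(common):
        positions = {t: set(postings[t][doc]) for t in terms}
        starts = [
            p - first_offset
            for p in postings[first_term][doc]
            if all(p - first_offset + off in positions[t] for t, off in pattern)
        ]
        if starts:
            hits[doc] = starts
    return hits

postings = {
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
           4: [17, 191, 291, 430, 434], 5: [363, 367]},
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433],
           7: [13, 23, 191]},
}
# "to be or not to be": to at offsets 0 and 4, be at offsets 1 and 5.
pattern = [("to", 0), ("be", 1), ("to", 4), ("be", 5)]
print(phrase_candidates(postings, pattern))  # {4: [429]}
```

The same helper applies to a fully indexed phrase by listing every word with its offset, for instance `[("world", 0), ("wide", 1), ("web", 2)]`.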

  • Consider the following index
  • Language: <d1,12> <d2,23-32-43> <d3,53> <d5,36-42-48>
  • Loria: <d1,25> <d2,34-40> <d5,38-51>
  • where dI refers to document I, the other
    numbers being positions.
  • The infix operator NEAR/x refers to the proximity
    x between two terms
  • Give the solutions to the query language NEAR/2
    Loria
  • Give the pairs (x, docids) for each x such that
    language NEAR/x Loria has at least one solution
  • Propose an algorithm for retrieving matching
    documents for this operator
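
One possible algorithm for the last question is sketched below, under the assumption that NEAR/x means the two terms occur at positions whose absolute difference is at most x, in either order. The names `min_distance` and `near` and the dictionary layout (document id → sorted positions) are illustrative.

```python
def min_distance(pos1, pos2):
    """Smallest |a - b| with a in pos1 and b in pos2 (both sorted, non-empty)."""
    i = j = 0
    best = None
    while i < len(pos1) and j < len(pos2):
        d = abs(pos1[i] - pos2[j])
        if best is None or d < best:
            best = d
        # Advance the pointer on the smaller position: moving the other
        # one could only increase the gap.
        if pos1[i] < pos2[j]:
            i += 1
        else:
            j += 1
    return best

def near(postings1, postings2, x):
    """Sorted ids of documents where the two terms occur within x positions."""
    common = postings1.keys() & postings2.keys()
    return sorted(doc for doc in common
                  if min_distance(postings1[doc], postings2[doc]) <= x)

language = {1: [12], 2: [23, 32, 43], 3: [53], 5: [36, 42, 48]}
loria = {1: [25], 2: [34, 40], 5: [38, 51]}

print(near(language, loria, 2))   # [2, 5]
print(near(language, loria, 13))  # [1, 2, 5]
```

Like the intersection merge, `min_distance` moves each pointer forward only, so it is linear in the number of positions examined.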

  • Large commercial system that has served the legal
    and professional market since 1974
  • legal materials (court opinions, statutes,
    regulations, ...)
  • news (newspapers, magazines, journals, ...)
  • financial (stock quotes, financial analyses,
  • Total collection size: 5-7 terabytes
  • 700 000 users (they claim 56% of legal searchers
    as of 2002)
  • Best match added in 1992

WESTLAW query language features
  • Boolean and proximity operators
  • Phrases: West Publishing
  • Word proximity: West /5 Publishing
  • Same sentence: Massachusetts /s technology
  • Same paragraph: information retrieval /p
  • Restrictions: DATE(AFTER 1992 BEFORE 1995)
  • Term expansion
  • wildcard (THOMSON), truncation (THOM!),
    automatic expansion of plurals, possessive
  • Document structure (fields)

WESTLAW query example
  • Information need: information on the legal
    theories involved in preventing the disclosure of
    trade secrets by employees formerly employed by a
  • → Query: "trade secret" /s disclos! /s prevent /s
  • Information need: requirements for disabled
    people to be able to access a workplace.
  • → Query: disab! /p access! /s work-site
    work-place (employment /3 place)
  • Information need: cases about a host's
    responsibility for drunk guests.
  • → Query: host! /p (responsib! liab!) /p
    (intoxicat! drunk!) /p guest

Boolean query languages are not dead
  • Exact match still prevalent in the commercial
    market (but then includes some type of ranking)
  • Many users prefer Boolean
  • For some queries/collections, boolean may work
  • Boolean and free text queries find different
  • → Need retrieval models that support both

The Vector Space Model
Best-Match retrieval
  • Boolean retrieval is the archetypal example of
    exact-match retrieval
  • Best-match or ranking models are now more common
  • Advantages
  • easier to use
  • similar efficiency
  • provides ranking
  • best match generally has better retrieval
  • most relevant documents appear at the top of the
  • But comparing best-match and exact-match retrieval
    is difficult

  • Boolean model: all documents matching the query
    are retrieved
  • The matching is binary: yes or no
  • Extreme cases: the list of retrieved documents
    can be empty, or huge
  • A ranking of the documents matching a query is
  • A score is computed for each pair (query,

Vector-space Retrieval
  • By far the most common retrieval systems
  • Key idea: everything (documents, queries) is a
    vector in a high-dimensional space
  • Vector coefficients for an object (document,
    query, term) represent the degree to which this
    object embodies each of the basic dimensions
  • Relevance is measured using vector similarity: a
    document is relevant to a query if their
    representing vectors are similar

Vector-space Representation
  • Documents are vectors of terms
  • Terms are vectors of documents
  • A query is a vector of terms

Graphic Representation
  • Example
  • D1 = 2T1 + 3T2 + 5T3
  • D2 = 3T1 + 7T2 + T3
  • Q = 0T1 + 0T2 + 2T3

Similarity in the Vector-space
  • Vectors can contain binary terms or weighted terms
  • Binary term vector: 1 → term present, 0 → term
    absent
  • Weighted term vector: indicates the relative
    importance of terms in a document
  • Vector similarity can be measured in several
    ways:
  • Inner product (measure of overlap)
  • Cosine coefficient
  • Jaccard coefficient
  • Dice coefficient
  • Minkowski metric (dissimilarity)
  • Euclidean distance (dissimilarity)
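
A minimal Python sketch of the measures listed above. The slides give no formulas for them, so the definitions below are the usual textbook ones and should be read as assumptions; in particular, the Jaccard and Dice coefficients are computed here on term presence only (binary view), and all function names are illustrative.

```python
import math

def inner_product(x, y):
    """Overlap between two vectors of equal length."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Inner product divided by the product of the two vector lengths."""
    norm = math.sqrt(inner_product(x, x)) * math.sqrt(inner_product(y, y))
    return inner_product(x, y) / norm if norm else 0.0

def jaccard(x, y):
    """Shared terms divided by terms present in at least one vector."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0

def dice(x, y):
    """Twice the shared terms divided by the total number of terms present."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    total = sum(1 for a in x if a) + sum(1 for b in y if b)
    return 2 * both / total if total else 0.0

def minkowski_distance(x, y, p):
    """Dissimilarity; p = 2 gives the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean_distance(x, y):
    return minkowski_distance(x, y, 2)

# Two binary term vectors sharing two of their three terms.
x = [1, 1, 0, 1]
y = [1, 0, 1, 1]
print(inner_product(x, y), cosine(x, y), jaccard(x, y), dice(x, y))
print(euclidean_distance(x, y))
```

The first four functions grow with similarity, while the two distances grow with dissimilarity, so a ranking based on a distance has to sort in ascending order.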

Using the inner product similarity measure
  • Given a query vector q and a document vector d,
    both of length n,
  • the similarity between q and d is defined by the
    inner product of q and d:
    $q \cdot d = \sum_{i=1}^{n} q_i d_i$
  • where $q_i$ ($d_i$) is the value of the i-th
    position of q (d)
  • With binary values this amounts to counting the
    matching terms between q and d
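
As a worked example, the inner product can be applied to the vectors of the graphic representation slide, reading D1, D2 and Q as coefficient vectors over (T1, T2, T3). The Python below is a small sketch; the binary vectors at the end are made up for illustration.

```python
def inner_product(q, d):
    """Sum of q_i * d_i over all positions i."""
    return sum(qi * di for qi, di in zip(q, d))

# Coefficients over (T1, T2, T3) from the example slide.
documents = {"D1": [2, 3, 5], "D2": [3, 7, 1]}
query = [0, 0, 2]

scores = {name: inner_product(query, vec) for name, vec in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)   # D1: 2*0 + 3*0 + 5*2, D2: 3*0 + 7*0 + 1*2
print(ranking)

# With binary vectors the inner product counts the matching terms.
binary_q = [1, 0, 1, 1]
binary_d = [1, 1, 0, 1]
matches = inner_product(binary_q, binary_d)
```

D1 is ranked first because the only query term, T3, carries more weight in D1 than in D2.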

Similarity an example in the Vector-space
The effect of varying document lengths
  • Problem:
  • Longer documents will be represented with longer
    vectors, but that does not mean they are more
    relevant
  • If two documents have the same score, the shorter
    one should be preferred
  • Solution: the length of a document must be taken
    into account when computing the similarity score

Document length normalization
  • The length of a document: its Euclidean length
  • If $d = (x_1, x_2, \ldots, x_n)$ then
    $\|d\| = \sqrt{\sum_{i=1}^{n} x_i^2}$
  • To normalize a document, we divide it by its own
    length: $d / \|d\|$
  • Similarity is given by the cosine measure between
    normalized vectors:
  • $q \cdot (d / \|d\|)$
  • One problem is solved: shorter, more focused
    documents receive a higher score than longer
    documents with the same matching terms
  • But now shorter documents are generally preferred
    over longer ones!
  • More sophisticated weighting schemes are
    generally used
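
The effect of length normalization can be checked on a small example. The Python sketch below scores a query against a short and a long document that contain the same matching terms; the vectors and function names are assumptions chosen for illustration.

```python
import math

def euclidean_length(d):
    """Square root of the sum of squared coefficients."""
    return math.sqrt(sum(x * x for x in d))

def normalize(d):
    """Divide a vector by its own Euclidean length (zero vectors are kept)."""
    length = euclidean_length(d)
    return [x / length for x in d] if length else list(d)

def raw_score(q, d):
    return sum(qi * di for qi, di in zip(q, d))

def normalized_score(q, d):
    """Inner product of the query with the length-normalized document."""
    return raw_score(q, normalize(d))

query = [1, 1, 0, 0]
short_doc = [1, 1, 0, 0]   # only the two query terms
long_doc = [1, 1, 1, 1]    # same two query terms plus two others

print(raw_score(query, short_doc), raw_score(query, long_doc))
print(normalized_score(query, short_doc), normalized_score(query, long_doc))
```

Without normalization both documents receive the same score; with it, the shorter document wins. Dividing by the length of the query as well would give the full cosine, but it would not change the ranking, since the query length is the same for every document.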

Term weights
  • $q_i$ is the weight of term i in q
  • Up to now, we only considered binary term weights
  • 0: term absent
  • 1: term present
  • Two shortcomings:
  • Does not reflect how often a term occurs
  • All terms are equally important (president vs.
  • Remedy: use non-binary term weights
  • tf-score: store the frequency of a term in the
    vector (e.g., 4 if the term occurs 4 times in the
    document)
  • idf-score: to distinguish meaningful terms, i.e.
    terms that occur in only a few documents

Term frequency
  • A document is treated as a set of words
  • Each word characterizes that document to some
    extent
  • When we have eliminated stop words, the most
    frequent words tend to be what the document is
    about
  • Therefore $f_{kd}$ (number of occurrences of word k
    in document d) will be an important measure.
  • → Also called the term frequency (tf)
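
A minimal Python sketch of term frequency counting after stop word removal. The stop word list is a tiny hypothetical one and the tokenization is a plain lowercase split; a real system would use a fuller list and a proper tokenizer.

```python
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "of", "is", "in", "and", "to"}

def term_frequencies(text):
    """Count occurrences of each non-stop word in a text."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in STOP_WORDS)

tf = term_frequencies("The retrieval of documents is the goal of retrieval")
print(tf.most_common(1))
```

Once the stop words are gone, the most frequent remaining word is a reasonable guess at what the text is about.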

Document frequency
  • What makes this document distinct from others in
    the corpus?
  • The terms which discriminate best are not those
    which occur with high document frequency!
  • Therefore $d_k$ (number of documents in which
    word k occurs) will also be an important measure.
  • → Also called the document frequency (df); the
    idf is derived from its inverse

  • This can all be summarized as:
  • Words are the best discriminators when
  • they occur often in this document (high term
    frequency)
  • they do not occur in a lot of documents (low
    document frequency)
  • One very common measure of the importance of a
    word to a document is
  • TF.IDF: term frequency × inverse document
    frequency
  • There are multiple formulas for actually
    computing this. The underlying concept is the
    same in all of them.

Term weights
  • tf-score: $\mathrm{tf}_{i,j}$ = frequency of term i
    in document j
  • idf-score: $\mathrm{idf}_i$ = inverse document
    frequency of term i
  • $\mathrm{idf}_i = \log(N / n_i)$ with
  • N, the size of the document collection (number of
    documents)
  • $n_i$, the number of documents in which term i
    occurs
  • $n_i / N$: proportion of the document collection
    in which term i occurs
  • Term weight of term i in document j (TF-IDF):
  • $\mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$
  • $\mathrm{idf}_i$ reflects the rarity of a term in
    the document collection
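
The weighting scheme above, tf times log(N / n_i), can be sketched in Python as follows. The three toy documents are invented for the example, and the natural logarithm is an assumption, since the slides do not fix the base of the log.

```python
import math
from collections import Counter

# Hypothetical toy collection, already tokenized.
docs = {
    "d1": "president visits paris".split(),
    "d2": "president signs law".split(),
    "d3": "president law debate law".split(),
}
N = len(docs)  # size of the document collection

# n_i: number of documents in which each term occurs at least once.
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

def idf(term):
    """log(N / n_i); assumes the term occurs in at least one document."""
    return math.log(N / doc_freq[term])

def tf_idf(term, doc_name):
    """Frequency of the term in the document times its idf."""
    tf = docs[doc_name].count(term)
    return tf * idf(term)

print(idf("president"))       # occurs in every document
print(tf_idf("paris", "d1"))  # occurs in a single document
print(tf_idf("law", "d3"))    # occurs twice in d3, in two documents overall
```

A term that occurs in every document, like "president" here, gets an idf of zero and so contributes nothing to the score, however often it appears.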

Boolean retrieval vs. Vector Space Retrieval
  • Boolean retrieval
  • Documents are not ranked
  • Boolean queries are not easy to manipulate
  • Vector space retrieval
  • Documents can be ranked
  • Issue 1: choice of comparison function. Usually
    cosine comparison.
  • Issue 2: choice of weighting scheme. Usually
    variations on $\mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$

  • Issues
  • User-based evaluation
  • System-based evaluation
  • TREC
  • Precision and recall

Evaluation methods
  • Two types of evaluation methods:
  • User-based: measures user satisfaction
  • System-based: focuses on how well the system
    ranks the documents

User based evaluation
  • More direct
  • Expensive
  • Difficult to do correctly
  • Need a sufficiently large, representative sample
    of users
  • The compared systems must be equally well
    developed (complete with fully functional user
    interfaces)
  • Each user must be trained, to control for learning
    effects
  • Information, information needs and relevance are
    intangible concepts

System based evaluation
  • Good system performance = good document rankings
  • Allows for fair comparative testing
  • Less expensive: test collections can be reused
  • Test collection: topics, documents, relevance
    judgements
  • System-based evaluation goes back to the Cranfield
    experiments (1960)
  • Rate the relevance of retrieved bibliographic
    references on a scale from 1 to 4

Recall and Precision
  • Three important performance metrics
  • Precision: proportion of retrieved documents
    that are relevant
  • → No penalty for selecting too few items
  • Recall: proportion of relevant documents that
    have been retrieved
  • → No penalty for selecting too many items (e.g.,
    the whole collection)
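
Both measures follow directly from the set of retrieved documents and the set of relevant documents. The Python sketch below uses invented document identifiers and also shows the two extreme cases mentioned above.

```python
def precision(retrieved, relevant):
    """Proportion of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Proportion of relevant documents that have been retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# Hypothetical collection of ten documents, four of them relevant.
collection = [f"d{i}" for i in range(1, 11)]
relevant = {"d1", "d2", "d3", "d4"}

some = ["d1", "d2", "d5"]
print(precision(some, relevant), recall(some, relevant))

# Retrieving a single relevant document: perfect precision, poor recall.
print(precision(["d1"], relevant), recall(["d1"], relevant))

# Retrieving everything: perfect recall, poor precision.
print(precision(collection, relevant), recall(collection, relevant))
```

Since each measure can be maximized trivially on its own, the two are normally reported together.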

Standard Text Collections
  • Relevant documents must be identified
  • Given a document collection D and a set of
    queries Q, REL_q is the set of documents relevant
    to q
  • Whether a document d is relevant to a query q is
    decided by human judgement

Standard Text Collections
  • CACM (computer science): 3204 abstracts, 64
    queries
  • CF (medicine): 1239 abstracts, 100 queries
  • CISI (library science): 1460 abstracts, 112
    queries
  • CRANFIELD (aeronautics): 1400 abstracts, 225
    queries
  • LISA (library science): 6004 abstracts, 35
    queries
  • TIME (newspaper): 423 abstracts, 83 queries
  • Ohsumed (medicine): 348,566 abstracts, 106 queries

Building Test Collections
  • How to identify relevant documents?
  • How to assess relevance? (binary or graded)
  • One vs. several judges

  • Text REtrieval Conference
  • Proceedings at http//
  • Established in 1991 to evaluate large scale IR
  • Retrieving documents from a gigabyte collection
  • Organised by NIST and run continuously since 1991
  • Best known IR evaluation setting
  • 25 participants in 1992
  • 109 participants from 4 continents in 2004
  • European (CLEF) and Asian (NTCIR) counterparts

TREC Format
  • Several IR research tracks
  • ad-hoc retrieval
  • routing/filtering
  • cross-language
  • scanned document
  • spoken document
  • Video
  • Web
  • question answering
  • ...

TREC notion of relevance
  • If you were writing a report on the subject of
    the topic and would use the information contained
    in the document in the report, then the document
    is relevant
  • Pooling is used for identifying relevant
    documents
  • A set of possibly relevant documents is created
    automatically for each information need
  • The top 100 documents returned by each system are
    kept and inspected by judges who determine which
    documents are relevant
  • Inter-judge agreement is about 80%
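
The pooling step can be sketched in a few lines of Python. The run names and document identifiers are invented, and the function is only an illustration of taking the union of the top-ranked documents of each system, not TREC's actual tooling.

```python
def build_pool(runs, depth=100):
    """Union of the top `depth` documents returned by each system."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

# Hypothetical ranked results of two systems for one information need.
runs = {
    "system_A": ["d3", "d1", "d7", "d2"],
    "system_B": ["d1", "d9", "d3", "d4"],
}

pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Only the documents in the pool are shown to the judges; documents that no system ranked near the top are never judged.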

Improving Recall and Precision
  • The two big problems with short queries are:
  • Synonymy: poor recall results from missing
    documents that contain synonyms of search terms,
    but not the terms themselves
  • Polysemy/Homonymy: poor precision results from
    search terms that have multiple meanings, leading
    to the retrieval of non-relevant documents.

Query Expansion
  • Find a way to expand a user's query to
    automatically include relevant terms (that they
    should have included themselves), in an effort to
    improve recall
  • Use a dictionary/thesaurus
  • Use relevance feedback

  • A thesaurus contains information about words
    (e.g., violin) such as
  • Synonyms: similar words, e.g., fiddle
  • Hyperonyms: more general words, e.g., instrument
  • Hyponyms: more specific words, e.g., Stradivari
  • Meronyms: parts, e.g., strings
  • A very popular machine readable thesaurus is

Problems of Thesauri
  • Language dependent
  • Available only for a couple of languages

Co-occurrence models
  • Semantically or syntactically related terms
  • Co-occurrence vs. thesauri
  • Easy to adapt to other languages/domains
  • Also covers relations not expressed in thesauri
  • Not as reliable as manually edited thesauri
  • Can introduce considerable noise
  • Selection criteria: mutual information, expected
    mutual information
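
One way to score co-occurring terms is pointwise mutual information computed over documents. The slides name mutual information as a selection criterion without giving a formula, so the definition below, log of P(x, y) / (P(x) P(y)) with probabilities estimated as document proportions, is an assumption, as are the toy documents.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical collection: each document is reduced to its set of terms.
docs = [
    {"violin", "fiddle", "music"},
    {"violin", "fiddle", "concert"},
    {"violin", "music"},
    {"bank", "money"},
]
N = len(docs)

# Number of documents containing each term, and each unordered pair of terms.
term_count = Counter(t for d in docs for t in d)
pair_count = Counter(
    frozenset(pair) for d in docs for pair in combinations(sorted(d), 2)
)

def pmi(x, y):
    """Pointwise mutual information of two distinct terms."""
    joint = pair_count[frozenset((x, y))] / N
    if joint == 0:
        return float("-inf")  # the terms never co-occur
    return math.log(joint / ((term_count[x] / N) * (term_count[y] / N)))

print(pmi("violin", "fiddle"))
print(pmi("fiddle", "concert"))
print(pmi("violin", "bank"))
```

On such a small sample the scores are noisy: "concert" scores higher with "fiddle" than "violin" does, simply because it occurs only once, which illustrates the noise problem mentioned above.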

Relevance feedback
  • Ask user to identify a few documents which appear
    to be related to their information need
  • Extract terms from those documents and add them
    to the original query.
  • Run the new query and present those results to
    the user.
  • Typically converges quickly
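
The feedback loop can be sketched as follows in Python. The slides do not say how terms are extracted from the selected documents, so this sketch assumes the simplest choice, adding the k most frequent terms not already in the query; the function name, the parameter k and the sample documents are all hypothetical.

```python
from collections import Counter

def expand_query(query_terms, selected_docs, k=2):
    """Add the k most frequent new terms of the selected documents to the query."""
    counts = Counter(
        term
        for doc in selected_docs
        for term in doc
        if term not in query_terms
    )
    new_terms = [term for term, _ in counts.most_common(k)]
    return list(query_terms) + new_terms

# Documents the user marked as related to the information need.
selected = [
    ["violin", "fiddle", "bow"],
    ["fiddle", "strings", "bow", "fiddle"],
]

expanded = expand_query(["violin"], selected, k=2)
print(expanded)
```

The expanded query is then run again and the new results are shown to the user, who can repeat the step if needed.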

Blind feedback
  • Assume that first few documents returned are most
    relevant rather than having users identify them
  • Proceed as for relevance feedback
  • Tends to improve recall at the expense of
    precision

Post-Hoc Analysis
  • When a set of documents has been returned, they
    can be analyzed to improve usefulness in
    addressing information need
  • Grouped by meaning for polysemic queries (using
    N-Gram-type approaches)
  • Grouped by extracted information (Named entities,
    for instance)
  • Grouped into an existing hierarchy if structured
    fields are available
  • Filtering (e.g., eliminate spam)
