Challenges in Information Retrieval and Language Modeling
  • Report of a Workshop held at the Center for
    Intelligent Information Retrieval, University of
    Massachusetts Amherst, September 2002

What is Information Retrieval?
  • Salton (circa the 1960s) defined it as a field
    concerned with the structure, analysis,
    organization, storage, searching, and retrieval
    of information.
  • Throughout the 1970s and 80s, nearly all of the
    research in this field was focused on document
    (textual) retrieval.
  • In the last two decades, due to the huge increase
    in the types of content found online, the field
    has expanded into many different topics.

Some of these include
  • Question Answering
  • Topic Detection and Tracking
  • Summarization
  • Multimedia Retrieval (image and music)
  • Software Engineering
  • Chemical and Biological Informatics
  • Text structuring
  • Text mining
  • Genomics

Similarity to other fields?
  • The text notes the similarities some might see
    between IR and database systems research. The
    main factor distinguishing the two fields is
    that IR focuses solely on deriving information
    from unstructured data sources. Recently,
    however, this boundary has become blurred when
    it comes to classifying marked-up text (like
    HTML or XML), which can fall in either category.
Google is quite advanced, so why continue research
into IR?
  • Web search and IR are not equivalent; web search
    is a subset of IR.
  • Web queries do not represent all information
    needs.
  • Web search engines are effective only for some
    types of queries in some contexts.

  • Long-term challenges
  • Cross-lingual retrieval
  • Web search
  • Summarization
  • Question Answering
  • Multimedia Retrieval
  • Information Extraction
  • Testbeds

Long Term Challenges
  • Global Information Access: develop massively
    distributed, multi-lingual retrieval systems that
    would take as input an information need, encoded
    in any language, and return relevant results,
    encoded in any language.
  • Contextual Retrieval: combine search
    technologies and knowledge about query and user
    context into a single framework in order to
    provide the most appropriate answer for a user's
    information needs (example from paper).

Topic Discussions
  • The following are topical discussions regarding
    subsets of the IR field. These were thought (by
    the committee) to be the most important areas to
    pursue.
  • Note that for many of these areas, one of the
    prime needs for further research in the field is
    a good set of test data. Many cite specific
    public-domain test sets that can be used to
    verify new technologies and prove correctness.
    Sometimes these are just raw data, such as
    newspaper archives. However, fields such as
    summarization that seek to test themselves in
    genres outside of the news domain sometimes need
    to compile test sets of data that they know will
    produce results that scale to the real world
    (see Testbeds section).

Cross-Lingual Information Retrieval (CLIR)
  • Purpose is to support queries in one language
    against a collection in other languages.
  • Recently achieved milestone: cross-lingual
    document retrieval performs essentially as
    accurately as monolingual retrieval.
  • Challenges: effective user functionality, new
    and more complex applications, and languages with
    sparse resources.

Web Search
  • Has moved beyond just searching for specific web
    sites. Real Estate, Cars, Music, and Movies are
    just a few examples of other types of media
    people search for.
  • Challenges: develop a formal Web structure,
    crawling and indexing (keeping cached search
    results fresh and up to date), and searching
    (develop more efficient methods that exploit the
    newly defined structure).

Summarization
  • Shares some basic techniques with indexing, since
    both are concerned with identifying the essence
    of a document.
  • Brings together techniques from different areas.
  • Challenges: define clearly specified
    summarization task(s) in an IR setting, move to
    summarization in new genres/contexts
    (newswire/newspapers are trivial), and integrate
    users' prior knowledge into models.

Question Answering
  • QA systems take as input a natural language
    question and a source collection. They produce a
    targeted, contextualized natural language answer,
    gathering relevant data, summary statistics, and
    relations from the source collection to build it.
  • Challenges: improve the performance of factoid QA
    so the public would find it reliable, create
    systems that provide richer answers and leverage
    richer data sources, and develop better
    interaction (UI) with the human user.

Multimedia Retrieval
  • Large problem space because of the huge variance
    in the types of objects to be retrieved and their
    distinguishing factors: for instance, the
    underlying representation for most of these
    content types is binary, yet movies, music, and
    PDF files compare very differently.
  • Challenges: given a non-text media object, the
    following options are available. Text may be
    associated with the object (captions, etc.); part
    of the object might be converted to text (via
    speech recognition or OCR); metadata might be
    assigned manually; or media-specific features
    might be extractable.

Information Extraction
  • IE fills slots in an ontology or database by
    selecting and normalizing sub-segments of human
    readable text. One example is finding names of
    entities and relationships between them in a
    source text.
  • Meant to be more of a sub-system, used by many of
    the other IR systems like QA, cross-lingual
    retrieval, and summarization.
  • Challenges: improve accuracy so IE can be more
    easily used by other systems, the ability to
    extract literal meaning from text, large-scale
    reference matching, and cross-lingual information
    extraction.

Testbeds
  • Over the previous decade, the IR research
    community has benefited from a set of annual US
    government sponsored TREC (Text Retrieval
    Conference) conferences that provided a level
    field for evaluating algorithms and systems for
    IR. In addition to the evaluation exercise, these
    conferences created a number of significant data
    sets that fueled further research in IR. However,
    these data sets are too small (by about a
    thousand-fold) to be representative of the real
    world.
  • Challenge: a community-based effort is needed to
    create a more realistic (in scale and function)
    common data set to fuel further research and
    increase the relevance of IR.

  • Near-exponential growth in technology over the
    past twenty years, combined with the increase in
    raw data and the different representations it can
    have (beyond the purely textual data available
    when the IR field was developed), requires the
    expansion and growth of the IR field to
    accurately perform its task: information
    retrieval.

A Neural Network Approach to Topic Spotting
  • Applying nonlinear neural networks to topic
    spotting.

What is Topic Spotting?
  • Topic Spotting is the problem of identifying
    which of a set of predefined topics are present
    in a natural language document. More formally,
    given a set of n topics and a document, the task
    is to output for each topic the probability that
    the topic is present.
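The task above can be sketched as a simple interface: a document goes in, and a score in [0, 1] comes out for each predefined topic. This is only an illustration; the keyword heuristic below stands in for the neural network discussed later, and the topics and keywords are made up.

```python
from typing import Dict

# Hypothetical topics with hand-picked keyword sets (illustrative only).
TOPIC_KEYWORDS = {
    "grain": {"wheat", "corn", "harvest"},
    "metals": {"gold", "copper", "ore"},
}

def spot_topics(document: str) -> Dict[str, float]:
    """Return, for each predefined topic, a crude presence score in [0, 1]
    (a real topic spotter would output model-based probabilities)."""
    words = set(document.lower().split())
    scores = {}
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits = len(words & keywords)          # keywords present in the doc
        scores[topic] = hits / len(keywords)  # fraction matched
    return scores

print(spot_topics("Gold and copper prices rose while wheat fell"))
```

The point is only the shape of the problem: one score per topic, so a document may belong to several topics at once.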

Neural Network
  • The neural network we employ is essentially a
    non-linear regression model for fitting
    high-order interactions in some feature space to
    binary topic assignments. We have to limit the
    number of input variables to the neural network
    (so it can operate correctly), so we reduce the
    dimensionality of the input space in two
    different ways: term selection, which picks a
    subset of the original terms to use as features,
    and Latent Semantic Indexing (LSI), which
    constructs new features from combinations of a
    large number of the original terms.
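The LSI step can be sketched with a truncated SVD of the term-by-document matrix; this is a minimal illustration with a made-up 4x4 matrix, not Reuters data.

```python
import numpy as np

# Toy term-by-document matrix (rows = terms, columns = documents).
# The frequencies are illustrative only.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 3.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

# Truncated SVD: keep the k largest singular values. Each document is
# then represented by a dense k-dimensional vector built from weighted
# combinations of all the original terms, rather than a sparse vector
# over the raw vocabulary.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_features = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim row per document

print(doc_features.shape)  # (4, 2): four documents, k = 2 LSI features
```

These k-dimensional document vectors are what would feed the neural network in place of the full term profiles.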

The Corpus
  • Reuters-22173 corpus of Reuters newswire stories
    from 1987.
  • 21,450 stories in the full collection.
  • Only used stories that had at least one topic
    assigned, narrowing it down to 9,610 stories for
    training and 3,662 for testing.
  • Stories have mean length 90.0 words; the standard
    deviation is 91.6.

  • Term-by-document matrix containing word frequency
    information. The entries of each document
    vector, called a document profile, are computed
    as follows:
  • P_dk = √f_dk / √( Σ_i (√f_di)² )
  • where f_dk is the frequency of word k in
    document d.
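The profile computation amounts to L2-normalizing the square-root word frequencies, so every document profile is a unit vector; a minimal sketch, with a made-up frequency vector:

```python
import math

def document_profile(freqs):
    """Document profile: P_dk = sqrt(f_dk) / sqrt(sum_i (sqrt(f_di))**2),
    i.e. square-root frequencies scaled to unit Euclidean length."""
    roots = [math.sqrt(f) for f in freqs]
    norm = math.sqrt(sum(r * r for r in roots))  # = sqrt(sum of f_di)
    return [r / norm for r in roots]

# Hypothetical frequency vector over a 4-term vocabulary.
profile = document_profile([4, 1, 0, 4])
print(profile)  # [2/3, 1/3, 0.0, 2/3]: squares sum to 1
```

The square root dampens the influence of very frequent words before normalization.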

Term Selection
  • Find the subset of the original terms which seems
    the most useful for the classification task.
  • Divide the problem into 92 independent
    classification tasks and search for the set of
    terms for each topic which best discriminates
    between documents with that topic and those
    without. This suits the neural network, as it is
    nearly impossible to select a set of terms that
    can adequately discriminate between 92 classes of
    documents while at the same time being small
    enough to serve as the feature set for a neural
    network.
Term Selection (contd)
  • We score all of the terms according to how well
    they serve as individual predictors of the topic,
    then pick the top-scoring terms. This score is
    called the relevancy score. It is related to the
    relevancy weight, which measures how unbalanced
    the term is across documents with and without the
    topic.
  • R_k = log( (w_tk/d_t + 1/6) / (w_t'k/d_t' + 1/6) )
  • where w_tk is the number of documents with the
    topic that contain the term, d_t is the total
    number of documents with the topic, and w_t'k and
    d_t' are the corresponding counts over documents
    without the topic.
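A sketch of the relevancy score, assuming the standard form in which the denominator uses the corresponding fraction over documents *without* the topic; the document counts below are hypothetical.

```python
import math

def relevancy_score(w_topic, d_topic, w_nontopic, d_nontopic):
    """R_k = log((w_tk/d_t + 1/6) / (w_t'k/d_t' + 1/6)).
    w_topic/d_topic: docs containing the term / all docs, with the topic;
    w_nontopic/d_nontopic: the same counts over docs without the topic.
    The 1/6 smoothing keeps the ratio finite for rare terms."""
    return math.log((w_topic / d_topic + 1 / 6) /
                    (w_nontopic / d_nontopic + 1 / 6))

# A term in 40 of 50 topic documents but only 30 of 1000 non-topic
# documents is a strong positive predictor (score well above zero).
print(relevancy_score(40, 50, 30, 1000))
```

Swapping the two fractions flips the sign, so terms that are *rarer* inside the topic score negatively.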

Term Selection (contd)
  • Twenty terms were found to yield, on average, the
    best classification performance.
  • Performance falls off beyond 20 terms due to
    overfitting. Overfitting occurs when the network
    starts to memorize the training patterns, i.e.
    when it starts fitting the peculiarities of the
    training data, thus decreasing its performance on
    out-of-sample data.

  • Basic framework is a regression model relating
    the input variables (features) to the output
    variables (binary topic assignments), which can
    be fit using training data.
  • p = 1 / (1 + e^(-η))
  • where η = βᵀx is a linear combination of the
    input features.
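The single-unit case of this regression model is ordinary logistic regression; a minimal sketch, with a made-up weight vector β:

```python
import math

def logistic_output(beta, x):
    """p = 1 / (1 + e^(-eta)), with eta = beta^T x: the estimated
    probability that the topic is present given feature vector x."""
    eta = sum(b * xi for b, xi in zip(beta, x))  # linear combination
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical weights over three features (e.g. selected-term profile
# values); eta = 0 gives the indifference point p = 0.5.
beta = [2.0, -1.0, 0.5]
print(logistic_output(beta, [0.0, 0.0, 0.0]))  # 0.5
```

A neural network generalizes this by passing the inputs through hidden units, letting it fit the high-order feature interactions the slide mentions.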

Neural Network Classifiers
  • Three major components of a neural network model:
    architecture, cost function, and search
    procedure.
  • Architecture defines the functional forms
    relating the inputs to the outputs (in terms of
    network topology, unit connectivity, and
    activation functions).
  • The search in weight space for a set of weights
    which minimizes the cost function is the training
    procedure.

Neural Networks for Topic Spotting
  • Network outputs are estimates of the probability
    of topic presence given the feature vector for a
    document.
  • An advantage neural networks have over other
    techniques is that they can predict multiple
    topics simultaneously using a single model.
  • We use two different network architectures, flat
    and modular.

Flat Architecture
  • Use the entire training set to train a separate
    network for each topic.
  • To combat overfitting, we introduce a simple
    regularization scheme based on weight elimination,
    in which we add a term penalizing network
    complexity to the cross-entropy cost function.
    This term is given by
  • λ Σ_ij w_ij² / (1 + w_ij²)
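The weight-elimination penalty can be sketched as follows; the weight values and regularization strength λ below are made up.

```python
def weight_elimination_penalty(weights, lam):
    """Complexity penalty added to the cross-entropy cost:
    lam * sum_ij w_ij**2 / (1 + w_ij**2).
    Each large weight contributes close to lam (it 'counts' as one
    active weight); each small weight contributes roughly lam * w**2,
    so it is pushed toward zero."""
    return lam * sum(w * w / (1.0 + w * w) for w in weights)

# Hypothetical flattened weight vector and regularization strength.
penalty = weight_elimination_penalty([0.1, -2.0, 3.0], 0.01)
print(penalty)
```

Because each term saturates at λ, the penalty counts effective weights rather than penalizing magnitude without bound, which is what makes it a weight *elimination* scheme.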

Modular Architecture
  • Decomposes the learning problem into a set of
    smaller problems.
  • Outputs of meta-topic network are multiplied by
    outputs of individual topic networks to get final
    topic predictions.
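The combination step is just a product of probabilities; a minimal sketch with hypothetical network outputs:

```python
def modular_prediction(meta_prob, topic_prob):
    """Final modular-architecture prediction for one topic:
    P(topic) = P(topic group | document) * P(topic | group network)."""
    return meta_prob * topic_prob

# Hypothetical outputs: the meta-topic network is 90% sure the document
# belongs to this topic's group; the group's own network says 80%.
print(modular_prediction(0.9, 0.8))  # ~0.72
```

The meta-topic network thus acts as a gate: topics in groups the document likely does not belong to are suppressed regardless of their individual network's output.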

(Diagram: meta-topic network feeding per-group topic
networks; "Metal Group" is one group label in the
figure.)
Results

Model            All    High   Medium  Low
Terms nonlinear  .804   .834   .796    .784
LSI nonlinear    .771   .862   .786    .664
REL/LSI linear   .785   .850   .773    .735
TD/LSI linear    .811   .856   .780    .798

Shows precision for four flat networks,
macro-averaged over four topic frequency ranges:
topics 1-54 (all), 1-18 (high frequency), 19-36
(medium), and 37-54 (low).
Results (contd)

Model                     All    High   Medium  Low
Modular terms nonlinear   .836   .882   .838    .813
Modular LSI nonlinear     .823   .872   .791    .818
Modular hybrid nonlinear  .835   .887   .807    .829
Terms nonlinear           .808   .860   .823    .774
LSI nonlinear             .752   .864   .812    .661
REL/LSI linear            .782   .860   .797    .736
TD/LSI linear             .811   .862   .785    .803
CD/LSI linear             .796   .857   .800    .764

Average precision for three modular networks, as
well as several flat networks for comparison.
  • Experiments show the LSI representation is able
    to equal or exceed the performance of selected-
    term representations for high-frequency topics,
    but performs relatively poorly for low-frequency
    topics.
  • However, task-directed LSI representations
    (relevancy weighting and local LSI) can improve
    performance.