Information Retrieval and Extraction - PowerPoint PPT Presentation

About This Presentation
Title:

Information Retrieval and Extraction

Description:

To retrieve information which might be useful or relevant to the user ... Q: What was the monetary value of the Nobel Peace Prize in 1989? A: $469,000. Hsin-Hsi Chen ... – PowerPoint PPT presentation

Slides: 65
Provided by: hsinhs
Transcript and Presenter's Notes

Title: Information Retrieval and Extraction


1
Information Retrieval and Extraction
  • Hsin-Hsi Chen
  • Department of Computer Science and Information
    Engineering
  • National Taiwan University

2
Chapter 1 Introduction
  • Hsin-Hsi Chen

3
Motivation
  • Information retrieval
  • To retrieve information which might be useful or
    relevant to the user
  • Representation, Storage, Organization, Access
  • Information need (vs. query)
  • Find all the pages containing information on
    college tennis teams which
  • (1) are maintained by a university in the USA
    and
  • (2) participate in the NCAA tennis tournament.
  • To be relevant, the page must include information
    on the national ranking of the team in the last
    three years and the email or phone number of the
    team coach.

4
Information Need
(No Transcript)

5
(No Transcript)
6
(No Transcript)
7
(No Transcript)
8
(No Transcript)

9
Information versus Data Retrieval
  • Data retrieval
  • Determine which documents of a collection contain
    the keywords in the user query
  • Retrieve all objects which satisfy clearly
    defined conditions, such as those in a regular
    expression or a relational algebra expression
  • Data has a well defined structure and semantics
  • Solution to the user of a database system
  • Information retrieval

10
Database Management
  • A specified set of attributes is used to
    characterize each item, e.g., EMPLOYEE(NAME, SSN,
    BDATE, ADDR, SEX, SALARY, DNO)
  • Exact match between the attributes used in query
    formulations and those attached to the record,
    e.g., SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME =
    'John Smith'
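
The exact-match lookup on this slide can be sketched with Python's built-in sqlite3 module. Only the EMPLOYEE schema and the query come from the slide; the in-memory database and the two sample rows are invented for illustration.

```python
import sqlite3

# In-memory database; the rows below are made-up illustration data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (NAME TEXT, SSN TEXT, BDATE TEXT, "
    "ADDR TEXT, SEX TEXT, SALARY REAL, DNO INTEGER)"
)
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John Smith", "000000001", "1965-01-09", "1 Example St", "M", 30000.0, 5),
        ("Joyce English", "000000002", "1972-07-31", "2 Sample Ave", "F", 25000.0, 5),
    ],
)

# Exact match: a record either satisfies the condition or it does not.
rows = conn.execute(
    "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", ("John Smith",)
).fetchall()

# A near miss in the query returns nothing; there is no notion of "close".
no_rows = conn.execute(
    "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", ("Jon Smith",)
).fetchall()

print(rows, no_rows)
```

The second query illustrates the contrast with information retrieval: a misspelled name yields an empty result instead of a ranked list of likely matches.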

11
Basic Concepts
  • Content identifiers (keywords, index terms,
    descriptors) characterize the stored texts.
  • Retrieval depends on the degree of coincidence
    between the sets of identifiers attached to
    queries and documents.

Diagram labels: logical view of the documents, user task, query formulation, content analysis
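
One way to make "degree of coincidence" concrete is a set-overlap score. The slide does not name a measure; the Jaccard coefficient below is an assumption chosen for simplicity, and the function name and sample identifiers are hypothetical.

```python
def coincidence(query_terms, doc_terms):
    """Jaccard coefficient: shared identifiers divided by all identifiers."""
    q, d = set(query_terms), set(doc_terms)
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)


# Hypothetical identifier sets for one query and two documents.
query = ["tennis", "ncaa"]
doc_a = ["tennis", "ncaa", "ranking", "coach"]
doc_b = ["football", "ranking"]

score_a = coincidence(query, doc_a)
score_b = coincidence(query, doc_b)
print(score_a, score_b)
```

Unlike the exact match of a database query, the score is graded, so documents can be ordered by how well they coincide with the query.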
12
The User Task
  • Convey the semantics of information need
  • Retrieval and browsing

Diagram labels: Retrieval, Browsing, Database
13
Logical View of Documents
  • Full text representation
  • A set of index terms
  • Elimination of stop-words
  • The use of stemming
  • The identification of noun groups

14
From full text to a set of index terms
Diagram labels: document, text, structure, structure recognition, accents and spacing, stopwords, noun groups, stemming, automatic or manual indexing, full text, index terms
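
The text operations named on these two slides can be sketched as a small pipeline. This is an illustration under stated assumptions: the stop-word list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer; noun-group identification is omitted.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "which"}


def normalize(text):
    """Strip accents and lowercase the text."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def naive_stem(word):
    """Crude suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Full text -> tokens -> stop-word removal -> stemming -> index terms."""
    tokens = re.findall(r"[a-z0-9]+", normalize(text))
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


terms = index_terms("The teams are ranked in the NCAA tournaments")
print(terms)
```

Each step discards some detail of the full text in exchange for a smaller, more uniform set of index terms.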
15
Indexing
  • Indexing: assign identifiers to text items.
  • Assignment: manual vs. automatic indexing
  • Identifiers
  • Objective vs. nonobjective text identifiers:
    cataloging rules define, e.g., author names,
    publisher names, dates of publications, ...
  • Controlled vs. uncontrolled vocabularies:
    instruction manuals, terminological schedules, ...
  • Single-term vs. term phrase

16
The retrieval process
Diagram components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index, Text Database
Diagram flows: text, user need, logical view, user feedback, query, retrieved documents, ranked documents
17
Information Retrieval
  • Generic information retrieval system: select and
    return to the user desired documents from a large
    set of documents in accordance with criteria
    specified by the user
  • Functions
  • Document search: the selection of documents from
    an existing collection of documents
  • Document routing: the dissemination of incoming
    documents to appropriate users on the basis of
    user interest profiles

18
Detection Need
  • Definition: a set of criteria specified by the
    user which describes the kind of information
    desired.
  • Queries in the document search task
  • Profiles in the routing task
  • Forms
  • keywords
  • keywords with Boolean operators
  • free text
  • example documents
  • ...

19
Example
<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document
Management
<des> Description: Document must identify a company who
has the capability to produce document management
system by obtaining a turnkey system or by obtaining
and integrating the basic components.
<narr> Narrative: To be relevant, the document must
identify a turnkey document management system or
components which could be integrated to form a
document management system and the name of
either the company developing the system or the
company using the system. These components are
a computer, image scanner or optical character
recognition system, and an information
retrieval or text management system.
20
Example (Continued)
<con> Concepts:
1. document management, document processing, office automation, electronic imaging
2. image scanner, optical character recognition (OCR)
3. text management, text retrieval, text database
4. optical disk
<fac> Factors:
<def> Definitions:
Document Management - The creation, storage and retrieval of
documents containing, text, images, and graphics.
Image Scanner - A device that converts a printed image into a
video image, without recognizing the actual content of the
text or pictures.
Optical Disk - A disk that is written and read by light, and
are sometimes associated with the storage of digital images
because of their high storage capacity.
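
A tagged topic description like the one above can be split into fields before it is turned into a query. The sketch below assumes angle-bracket tags with no closing tags, as in the example; the regular expression and the shortened sample string are illustrative, not part of any TIPSTER tooling.

```python
import re

# Shortened sample in the style of the topic description above.
topic = """<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document
Management
"""

# Each field runs from its tag to the next "<"; whitespace is collapsed.
fields = {
    tag: " ".join(body.split())
    for tag, body in re.findall(r"<(\w+)>([^<]*)", topic)
}

for tag, body in fields.items():
    print(f"{tag}: {body}")
```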
21
search vs. routing
  • The search process matches a single Detection
    Need against the stored corpus to return a subset
    of documents.
  • Routing matches a single document against a group
    of Profiles to determine which users are
    interested in the document.
  • Profiles are long-term expressions of user
    needs.
  • Search queries are ad hoc in nature.
  • A generic detection architecture can be used for
    both search and routing.

22
Search
  • retrieval of desired documents from an existing
    corpus
  • Retrospective search is frequently interactive.
  • Methods
  • indexing the corpus by keyword, stem and/or
    phrase
  • apply statistical and/or learning techniques to
    better understand the content of the corpus
  • analyze free text Detection Needs to compare with
    the indexed corpus or a single document
  • ...

23
Document Detection: Search
24
Document Detection: Search (Continued)
  • Document Corpus
  • the content of the corpus may have a significant
    effect on performance in some applications
  • Preprocessing of Document Corpus
  • stemming
  • a list of stop words
  • phrases, multi-term items
  • ...

25
Document Detection: Search (Continued)
  • Building Index from Stems
  • key place for optimizing run-time performance
  • cost to build the index for a large corpus
  • Document Index
  • a list of terms, stems, phrases, etc.
  • frequency of terms in the document and corpus
  • frequency of the co-occurrence of terms within
    the corpus
  • index may be as large as the original document
    corpus
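
The document index described on this slide can be sketched as an inverted index that records term frequency per document and document frequency across the corpus. The three-document corpus is invented, tokenization is plain whitespace splitting, and co-occurrence counts are left out to keep the sketch short.

```python
from collections import Counter, defaultdict


def build_index(corpus):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return dict(index)


# Made-up corpus for illustration.
corpus = {
    "d1": "document management system",
    "d2": "image scanner and document scanner",
    "d3": "text retrieval system",
}

index = build_index(corpus)

# Document frequency: in how many documents each term occurs.
doc_freq = {term: len(postings) for term, postings in index.items()}

print(index["scanner"], doc_freq["document"])
```

Because every distinct term gets a postings entry, the index can approach the size of the corpus itself, which is the cost the slide warns about.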

26
Document Detection: Search (Continued)
  • Detection Need
  • the user's criteria for a relevant document
  • Convert Detection Need to System Specific Query
  • first transformed into a detection query, and
    then a retrieval query.
  • Detection query: specific to the retrieval
    engine, but independent of the corpus
  • Retrieval query: specific to the retrieval
    engine, and to the corpus

27
Document Detection: Search (Continued)
  • Compare Query with Index
  • Resultant Rank Ordered List of Documents
  • Return the top N documents
  • Rank the list of relevant documents from the most
    relevant to the query to the least relevant
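
Comparing a query with the index and returning the top N documents might look like the sketch below. The slides do not specify a scoring function; the weight tf * log(1 + N/df) is one common choice and is an assumption here, as are the function name and the sample corpus.

```python
import math
from collections import Counter


def rank(query, corpus, top_n=2):
    """Score documents by summed tf * idf over query terms; return the top_n."""
    doc_terms = {d: Counter(t.lower().split()) for d, t in corpus.items()}
    n_docs = len(corpus)
    scores = {}
    for doc_id, counts in doc_terms.items():
        score = 0.0
        for term in set(query.lower().split()):
            # Document frequency of the term across the corpus.
            df = sum(1 for c in doc_terms.values() if term in c)
            if df and term in counts:
                score += counts[term] * math.log(1 + n_docs / df)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Made-up corpus for illustration.
corpus = {
    "d1": "document management system",
    "d2": "image scanner and document scanner",
    "d3": "text retrieval system",
}

ranked = rank("document scanner", corpus)
print(ranked)
```

Documents that share no term with the query are left out of the list entirely, and the remainder are ordered from most to least relevant under this scoring assumption.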

28
Routing
29
Routing (Continued)
  • Profile of Multiple Detection Needs
  • A Profile is a group of individual Detection
    Needs that describes a user's areas of interest.
  • All Profiles will be compared to each incoming
    document (via the Profile index).
  • If a document matches a Profile, the user is
    notified about the existence of a relevant
    document.

30
Routing (Continued)
  • Convert Detection Need to System Specific Query
  • Building Index from Queries
  • similar to building the corpus index for searching
  • the quantity of source data (Profiles) is usually
    much less than a document corpus
  • Profiles may have more specific, structured data
    in the form of SGML tagged fields

31
Routing (Continued)
  • Routing Profile Index
  • The index will be system specific and will make
    use of all the preprocessing techniques employed
    by a particular detection system.
  • Document to be routed
  • A stream of incoming documents is handled one at
    a time to determine where each should be
    directed.
  • Routing implementation may handle multiple
    document streams and multiple Profiles.

32
Routing (Continued)
  • Preprocessing of Document
  • A document is preprocessed in the same manner
    that a query would be set up in a search
  • The document and query roles are reversed
    compared with the search process
  • Compare Document with Index
  • Identify which Profiles are relevant to the
    document
  • Given a document, which of the indexed profiles
    match it?

33
Routing (Continued)
  • Resultant List of Profiles
  • The list of Profiles identifies which users should
    receive the document
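
The routing steps can be sketched by indexing the Profiles instead of the documents and then looking up each incoming document in that index. The user names, profile texts, and the `min_matches` threshold are hypothetical; a real system would apply the same preprocessing to profiles and documents.

```python
from collections import defaultdict


def build_profile_index(profiles):
    """Map each term to the set of users whose Profile contains it."""
    index = defaultdict(set)
    for user, detection_needs in profiles.items():
        for need in detection_needs:
            for term in need.lower().split():
                index[term].add(user)
    return index


def route(document, profile_index, min_matches=2):
    """Return users whose Profiles share at least min_matches terms with the document."""
    hits = defaultdict(int)
    for term in set(document.lower().split()):
        for user in profile_index.get(term, ()):
            hits[user] += 1
    return sorted(user for user, n in hits.items() if n >= min_matches)


# Hypothetical Profiles, each a group of Detection Needs.
profiles = {
    "alice": ["document management", "optical character recognition"],
    "bob": ["tennis ranking"],
}
profile_index = build_profile_index(profiles)

recipients = route("new optical character recognition system announced", profile_index)
print(recipients)
```

Here the roles are reversed relative to search: the document plays the part of the query, and the result is a list of Profiles (users) and not a list of documents.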

34
Summary
  • Generate a representation of the meaning or
    content of each object based on its description.
  • Generate a representation of the meaning of the
    information need.
  • Compare these two representations to select those
    objects that are most likely to match the
    information need.
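
The three summary steps can be illustrated with bag-of-words representations compared by cosine similarity. Both choices are assumptions made for the sketch; the slides leave the representation and the comparison function open. The object texts are invented.

```python
import math
from collections import Counter


def represent(text):
    """Represent a text as a bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Made-up objects and information need.
objects = {
    "o1": "college tennis team ranking",
    "o2": "optical disk storage capacity",
}
need = represent("tennis team coach")

scores = {oid: cosine(represent(text), need) for oid, text in objects.items()}
best = max(scores, key=scores.get)
print(best, scores)
```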

35
Basic Architecture of an Information Retrieval
System
Diagram labels: Documents, Queries, Document Representation, Query Representation, Comparison
36
Research Issues
  • Given a set of descriptions for objects in the
    collection and a description of an information
    need, we must consider
  • Issue 1
  • What makes a good document representation?
  • What are retrievable units and how are they
    organized?
  • How can a representation be generated from a
    description of the document?

37
Research Issues (Continued)
  • Issue 2: How can we represent the information need
    and how can we acquire this representation either
    from a description of the information need or
    through interaction with the user?
  • Issue 3: How can we compare representations to
    judge the likelihood that a document matches an
    information need?

38
Research Issues (Continued)
  • Issue 4: How can we evaluate the effectiveness of
    the retrieval process?
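
The slide does not name specific effectiveness measures. As an illustration only, the sketch below computes precision and recall, two widely used ones, from a retrieved set and a set of documents judged relevant; the document ids are made up.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved that is relevant.
    Recall: fraction of relevant that was retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical system output and relevance judgments.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```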

39
Text Data Mining Tasks
  • Information extraction -- facts, fill database
  • Summarization
  • Categorization
  • Clustering
  • Associations
  • Temporal analysis of document collection

40
Information Extraction Beyond Document Retrieval
  • Question and Answering
  • Q: Who is the author of the book, "The Iron Lady:
    A Biography of Margaret Thatcher"? A: Hugo Young
  • Q: What was the monetary value of the Nobel Peace
    Prize in 1989? A: $469,000

41
Information Extraction
  • Generic Information Extraction System: An
    information extraction system is a cascade of
    transducers or modules that at each step add
    structure and often lose information, hopefully
    irrelevant, by applying rules that are acquired
    manually and/or automatically.

42
Information Extraction (Continued)
  • What are the transducers or modules?
  • What are their input and output?
  • What structure is added?
  • What information is lost?
  • What is the form of the rules?
  • How are the rules applied?
  • How are the rules acquired?

43
Example: Parser
  • transducer: parser
  • input: the sequence of words or lexical items
  • output: a parse tree
  • information added: predicate-argument and
    modification relations
  • information lost: none
  • rule form: unification grammars
  • application method: chart parser
  • acquisition method: manually

44
Modules
  • Text Zoner: turn a text into a set of text
    segments
  • Preprocessor: turn a text or text segment into a
    sequence of sentences, each of which is a
    sequence of lexical items, where a lexical item
    is a word together with its lexical attributes
  • Filter: turn a set of sentences into a smaller set
    of sentences by filtering out the irrelevant ones
  • Preparser: take a sequence of lexical items and
    try to identify various reliably determinable,
    small-scale structures

45
Modules (Continued)
  • Parser: input a sequence of lexical items and
    perhaps small-scale structures (phrases) and
    output a set of parse tree fragments, possibly
    complete
  • Fragment Combiner: turn a set of parse tree or
    logical form fragments into a parse tree or
    logical form for the whole sentence
  • Semantic Interpreter: generate a semantic
    structure or logical form from a parse tree or
    from parse tree fragments

46
Modules (Continued)
  • Lexical Disambiguation: turn a semantic structure
    with general or ambiguous predicates into a
    semantic structure with specific, unambiguous
    predicates
  • Co-reference Resolution, or Discourse
    Processing: turn a tree-like structure into a
    network-like structure by identifying different
    descriptions of the same entity in different
    parts of the text
  • Template Generator: derive the templates from the
    semantic structures
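
The cascade idea can be sketched with the first three modules only. The implementations below are deliberately simplistic stand-ins (paragraph splitting, punctuation-based sentence splitting, keyword filtering); the function names, the single lexical attribute, and the sample text are assumptions for illustration.

```python
import re


def text_zoner(text):
    """Turn a text into segments (here: paragraphs separated by blank lines)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]


def preprocessor(segment):
    """Turn a segment into sentences, each a list of (word, attributes) items."""
    sentences = re.split(r"(?<=[.!?])\s+", segment)
    return [
        [(w, {"capitalized": w[0].isupper()}) for w in re.findall(r"[A-Za-z0-9]+", s)]
        for s in sentences
        if s
    ]


def sentence_filter(sentences, keywords):
    """Keep only sentences that mention at least one keyword."""
    return [s for s in sentences if any(w.lower() in keywords for w, _ in s)]


def extract(text, keywords):
    """Run the cascade: zoner -> preprocessor -> filter."""
    kept = []
    for segment in text_zoner(text):
        kept.extend(sentence_filter(preprocessor(segment), keywords))
    return kept


# Made-up sample text.
sample = (
    "The rocket was launched on Thursday. Weather was fine.\n\n"
    "Officials gave no comment."
)
kept = extract(sample, {"launched"})
words = [w for w, _ in kept[0]]
print(words)
```

Each stage adds structure (segments, sentences, lexical attributes) while the filter discards sentences judged irrelevant, which matches the "add structure, lose information" description of a transducer cascade.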

47
<DOC>
<DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID>
<DATASET> Air Vehicle Launch </DATASET>
<DD> 1997/06/12 </DD>
<DOCTYPE> ???? </DOCTYPE>
<DOCSRC> ???? </DOCSRC>
<TEXT>
(text not preserved in transcript)
48
(text not preserved in transcript)

49
(text not preserved in transcript)
</TEXT>
</DOC>
50
<ID="3">??? <ID="4" REF="3">?? <ID="5" REF="3">???????????? ???????
<ID="63">??????? <ID="66" REF="63">????????????? ?????? ?????
<ID="65" REF="63">????????????? <ID="70" REF="65">?? <ID="69" REF="65">?? <ID="64" REF="63">?????????
51
The Advanced Research and Development Activity
(ARDA)
  • a joint activity of the Intelligence Community
    (IC) and the Department of Defense (DOD) in late
    November 1998
  • Intelligence Community's (IC) center for
    conducting advanced research and development
    related to extracting intelligence from, and
    providing security for, information stored,
    transmitted, or manipulated by electronic means

52
(No Transcript)
53
ARDA R&D Programs
  • Information Exploitation
  • Pulling Information
  • Pushing Information
  • Visualizing and Navigating Information
  • Quantum Information Science Photonics
  • Digital Network Intelligence

54
Pulling Information
  • Providing answers to complex, multifaceted
    questions that analysts pose
  • The analyst seeks to "pull" the answer out of
    multiple, very large, heterogeneous data sources
    that may physically reside in diverse locations

55
Pulling Information (Continued)
  • Accepting complex questions in a form natural to
    the analyst.
  • Questions may include judgment terms and an
    acceptable answer may need to be based upon
    conclusions and decisions reached by the system
    and may require the summarization, fusion, and
    synthesis of information drawn from multiple
    sources.
  • Translating analytic questions into multiple
    queries appropriate to the various data sets to
    be searched.
  • Finding relevant information in distributed,
    multimedia, multilingual, multi-agency data sets.
  • Analyzing, fusing and summarizing information
    into a coherent answer.
  • Providing the answer to the analyst in the form
    that he/she wants

56
Pushing Information
  • Providing information that analysts did not ask
    for, drawn from multiple, very large,
    heterogeneous data sources
  • The system discovers information in some
    profiling, clustering, pattern recognition, data
    mining, or other fashion and "pushes" this
    information to analysts that the system
    determines might have an interest.

57
Pushing Information (Continued)
  • Profiling and blind clustering of new data.
  • Detecting anomalies, patterns and changes in
    large volumes of data.
  • Analyzing the nature and description of the
    anomalies, patterns, and changes.
  • Alerting the appropriate analyst(s) of the newly
    discovered information.

58
Topics
  • Introduction to Information Retrieval and
    Extraction
  • Modeling
  • Retrieval Evaluation
  • Query Languages
  • Query Operations
  • Text and Multimedia Languages and Properties
  • Text Operations
  • Indexing and Searching

59
Topics (Continued)
  • User Interfaces and Visualization
  • Multimedia IR Models and Languages
  • Multimedia IR Indexing and Searching
  • Searching the Web
  • Digital Libraries
  • Information Extraction (IJCAI 1999)
  • Text Data Mining (ACL 1999)
  • Text Web Mining (AIRS 2004)

60
Text IR
Applications for IR
Human-Computer Interaction for IR
Retrieval Models and Evaluation
Bibliographic Systems
Interfaces & Visualization
Improvements On Retrieval
The Web
Multimedia IR
Multimedia Modeling & Searching
Digital Libraries
Efficient Processing
61
Information Sources
  • http://www-csli.stanford.edu/schuetze/information-retrieval.html
  • Books
  • Ricardo Baeza-Yates and Berthier Ribeiro-Neto
    (1999) Modern Information Retrieval,
    Addison-Wesley. (03)5720317
  • Salton, G. (1989) Automatic Text Processing: The
    Transformation, Analysis and Retrieval of
    Information by Computer. Reading, MA:
    Addison-Wesley.
  • Frakes, W.B. and Baeza-Yates, R. (Eds.) (1992)
    Information Retrieval: Data Structures and
    Algorithms. Englewood Cliffs, NJ: Prentice Hall.
  • Cheong, F. (1996) Internet Agents: Spiders,
    Wanderers, Brokers, and Bots. Indianapolis, IN:
    New Riders.

62
Information Sources
  • Karen Sparck Jones and Peter Willett (1997)
    Readings in Information Retrieval, CA: Morgan
    Kaufmann Publishers.
  • Christopher D. Manning, et al. (2007) Introduction
    to Information Retrieval, Cambridge University
    Press.
    http://www-csli.stanford.edu/schuetze/information-retrieval-book.html
  • Sholom M. Weiss, Nitin Indurkhya, Tong Zhang,
    Fred J. Damerau (2005) Text Mining: Predictive
    Methods for Analyzing Unstructured Information,
    Springer.

63
Information Sources
  • Conference Proceedings
  • ACM SIGIR Annual International Conference on
    Research and Development in Information Retrieval
    (1978-)
  • ACM International Conference on Digital Libraries
  • ACM Conference on Information and Knowledge
    Management
  • Text Retrieval Conference

64
Information Sources (Continued)
  • Journals
  • ACM Transactions on Information Systems
  • Information Processing and Management
  • Journal of the American Society for Information
    Science and Technology
  • Journal of Documentation
  • Information Systems
  • Information Retrieval
  • Knowledge and Information Systems