Information Retrieval from Complex Unstructured Data Sources

1
Information Retrieval from Complex Unstructured
Data Sources
  • David Eichmann
  • Director, School of Library and Information
    Science
  • (and Dept. of Computer Science)
  • david-eichmann_at_uiowa.edu

2
The Basic Problem
  • Data living in a database are structured,
    controlled, retrievable
  • Text is complicated, redundant, ambiguous
  • The vast majority of information lives in text,
    rather than databases

3
Extraction from a PDF File
  • GRAMMARS HAVE EXCEPTIONS
  • Valter Crescenzi 1 and Giansalvatore Mecca 2
  • 1 Dipartimento di Informatica e Automazione
    Università di Roma Tre Via della Vasca Navale,
    84 - 00146 Roma tel 39 06 5517 3219, fax 39
    06 557 3030 crescenz_at_dia.uniroma3.it
  • 2 D.I.F.A. Università della Basilicata via
    della Tecnica, 3 85100 Potenza, Italy tel 39
    0971 474 638, fax 39 0971 56537
    mecca_at_dia.uniroma3.it
    http://www.difa.unibas.it/users/gmecca
  • Abstract Extending database-like techniques to
    semi-structured and Web data sources is becoming
    a prominent research field. These data sources
    are essentially collections of textual documents.

4
PDF Example, cont.
  • References
  • 11 S. Abiteboul, D. Quass, J. McHugh, J. Widom,
    and J. Wiener. The Lorel query language for
    semistructured data. Journal of Digital
    Libraries, 1(1):68-88 (1997).
  • 12 A. V. Aho, R. Sethi, and J. D. Ullman.
    Compilers: Principles, Techniques and Tools.
    Addison Wesley Publ. Co., Reading, Massachusetts
    (1985).
  • 13 G. O. Arocena and A. O. Mendelzon. WebOQL:
    Restructuring documents, databases and Webs. In
    Fourteenth IEEE International Conference on Data
    Engineering (ICDE'98), Orlando, Florida (1998).

5
Semi-Structured Data
  • Research in the area of semi-structured data views
    the preceding text as not truly structured (in
    the database sense), but not completely
    unstructured, either.
  • There are repeating patterns:
  • Macro level - scholarly documents have titles,
    author(s), introduction, ..., references
  • Micro level - references have (e.g.) author(s),
    titles, journal, volume, issue, page(s), date.

6
Semi-Structured Data
  • Retrieval in this domain has prerequisites
  • Format recognition, translation (e.g., Word, PDF,
    etc.)
  • Heuristic assessment of document class
  • For a given class of document, heuristic
    retrieval of conjectured structure
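
The first prerequisite, format recognition, can be sketched by inspecting the leading bytes of a file. This is a minimal illustration, not the method used in the work described here: the function name and the label strings are assumptions, while the byte signatures are the standard ones for these formats.

```python
from typing import Optional

# Leading byte signatures for a few common document formats.
# The labels on the right are illustrative names chosen for this sketch.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"%!PS", "postscript"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy Word .doc container
    (b"PK\x03\x04", "zip"),                          # e.g. .docx container
]


def recognize_format(head: bytes) -> Optional[str]:
    """Return a format label based on the first bytes of a file, or None."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None
```

Only a few bytes need to be read, e.g. `recognize_format(open(path, "rb").read(8))`. Document class and conjectured structure would then be assessed on the translated text.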

7
PDF Extraction Example (Journal Paper)
  • Abiteboul, S.; Quass, D.; McHugh, J.; Widom,
    J.; Wiener, J.
  • The Lorel query language for semistructured data
  • Journal of Digital Libraries.
  • vol,issue: 1, 1
  • pages: 68, 88
  • location:
  • 1997

8
PDF Extraction Example (Book)
  • Aho, A. V.; Sethi, R.; Ullman, J. D.
  • Compilers: Principles, Techniques and Tools
  • Addison Wesley Publ. Co., Reading,
    Massachusetts.
  • vol,issue: ,
  • pages: -1, -1
  • location:
  • 1985

9
PDF Extraction Example (Conference Paper)
  • Arocena, G. O.; Mendelzon, A. O.
  • WebOQL: Restructuring documents, databases and
    Webs
  • Fourteenth IEEE International Conference on Data
    Engineering (ICDE'98), Orlando, Florida.
  • vol,issue: ,
  • pages: -1, -1
  • location:
  • 1998

10
Citation Recognition Issues
  • Correct classification of these citations
    involves a number of heuristics, e.g.:
  • If there is a volume number and issue number,
    assume a journal paper
  • If there is a location or a date involving a
    specific day or range of days (e.g., Sept.
    23-25), assume a conference paper
  • Note that for these examples, we're missing the
    dates and flubbed the location for the conference
    paper
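
The two heuristics above can be sketched as a small rule-based classifier. The `Citation` record, the regular expression for "specific day" dates, and the `"unknown"` fallback are assumptions made for illustration; the slides only state the two rules.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    """Hypothetical record of fields extracted from one reference."""
    volume: Optional[str] = None
    issue: Optional[str] = None
    location: Optional[str] = None
    date: Optional[str] = None


# A month name followed by a day or a range of days, e.g. "Sept. 23-25".
# A bare year such as "1998" or "May 1998" does not match.
DAY_PATTERN = re.compile(
    r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
    r"\s+\d{1,2}(\s*-\s*\d{1,2})?\b"
)


def classify(c: Citation) -> str:
    # Rule 1: volume and issue both present -> journal paper.
    if c.volume and c.issue:
        return "journal"
    # Rule 2: a location, or a date naming specific days -> conference paper.
    if c.location or (c.date and DAY_PATTERN.search(c.date)):
        return "conference"
    # Neither rule fired; the slides do not say what to assume here.
    return "unknown"
```

With the fields as extracted on the preceding slides (empty location for the conference paper), the third example would fall through to `"unknown"`, which is the failure the last bullet points out.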

11
Even So. . .
  • Scanning through a small set of PDFs retrieved
    for a search on semistructured data yields:
  • Arocena, G. O.
      • WebOQL: Restructuring documents, databases and
        Webs
  • Abiteboul, S.
      • Querying documents in object databases
      • Querying and updating the file
      • The Lorel query language for semistructured data
  • Mendelzon, A. O.
      • WebOQL: Restructuring documents, databases and
        Webs
      • Querying the World Wide Web
      • Querying the World Wide Web
      • Formal models of Web queries

12
A Functioning Example
  • This technology forms the core of CiteSeer, a
    research project and search engine operated by
    NEC's New Jersey research lab
  • http://citeseer.nj.nec.com
  • PostScript and PDF content is discovered with a
    standard Web crawler
  • Coverage is currently predominantly Computer
    Science, primarily because that's what's out
    there
  • Performance is largely on a par with
    Science Citations

13
A PubMed Example
  • Document 89316080 - Multiple and repetitive uses
    of the extended hamstring V-Y myocutaneous flap.
  • An extended hamstring V-Y myocutaneous
    advancement flap is described that may be used to
    cover unusually large defects in the ischial
    region. Technical points that allow a large
    amount of flap advancement are discussed. Because
    of its large size, the flap can be raised and
    used on repeated occasions to repair defects from
    recurrent ischial pressure sores. Two patients
    are presented in whom the same flap was used
    repeatedly on multiple occasions, demonstrating
    the potential for preservation of future options
    in such patients when this flap is used.

14
A PubMed Example, cont.
  • Classification systems such as MeSH provide for
    organization of such data, but populating the
    data is human-intensive

15
MeSH Terms for PubMed Ex.
  • Case Report
  • Decubitus Ulcer/SU
      • C17.800.893.289
  • Human
  • Male
  • Methods
      • E05.581
      • H01.770.370
  • Middle Age
      • M01.060.116.630

16
MeSH Terms, cont.
  • Reoperation
      • E04.690
  • Surgical Flaps/
      • A10.850.710
      • E07.862.710
  • Thigh
      • A01.378.592.867

17
Extracting Classification Terms
  • For this type of data, the additional challenge
    is to recognize and extract, from the abstract or
    full paper, text that could serve as automatically
    generated classification terms
  • Phrase recognition and matching against the
    classification hierarchy (MeSH)
  • From the MeSH terms themselves, or
  • From noun phrases generated with a part-of-speech
    tagger
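
Matching noun phrases against the classification hierarchy can be sketched as a lookup of contiguous word runs in a table of entry terms. The `ENTRY_TERMS` table and the crude plural stripping below are hypothetical stand-ins for the real MeSH vocabulary and a real stemmer, and the function names are chosen for illustration only.

```python
from collections import Counter

# Hypothetical, tiny stand-in for the MeSH entry-term vocabulary:
# a normalized phrase mapped to the heading it belongs to.
ENTRY_TERMS = {
    "surgical flap": "Surgical Flaps",
    "flap": "Surgical Flaps",
    "decubitus ulcer": "Decubitus Ulcer",
    "pressure sore": "Decubitus Ulcer",
}


def normalize(word):
    """Lowercase and strip a trailing plural 's' (a deliberately crude stemmer)."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w


def match_phrase(phrase):
    """Return headings whose entry terms occur as contiguous word runs in the phrase."""
    words = [normalize(w) for w in phrase.split()]
    found = set()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            heading = ENTRY_TERMS.get(" ".join(words[i:j]))
            if heading:
                found.add(heading)
    return found


def classify(noun_phrases):
    """Count, per heading, how many noun phrases matched one of its entry terms."""
    counts = Counter()
    for phrase in noun_phrases:
        counts.update(match_phrase(phrase))
    return counts
```

For example, `classify(["hamstring V-Y myocutaneous flap", "ischial pressure sores"])` counts one phrase each for the two headings in the table. Phrases that match nothing can be kept separately as candidate free-text terms.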

18
Noun Phrases from the Example
  • Multiplerepetitive, uses
  • hamstring, V-Y, myocutaneous, flap
  • hamstring, V-Y, myocutaneous, advancement, flap
  • large, defects
  • ischial, region
  • Technical, points
  • large, amount
  • flap, advancement
  • large, size
  • . . .

19
Extraction Results
  • MeSH Terms (occurrences in parentheses, direct
    matches with human classifier in green)
  • Surgical Flaps (6)
      • A10.850.710
      • E07.862.710
  • Decubitus Ulcer (1)
      • C17.800.893.289
  • Patients (2)
      • M01.643
  • Forecasting (1)
      • I01.320

20
Extraction Results, cont.
  • Other Phrases
  • flap advancement (1)
  • future options (1)
  • hamstring V-Y myocutaneous advancement flap (1)
  • hamstring V-Y myocutaneous flap (1)
  • ischial pressure sores (1)
  • ischial region (1)
  • repetitive uses (1)

21
Broadening the Scope
  • So far, we've limited ourselves to extraction
    from sources with a fair amount of predictable
    structure, or at least a very specialized
    vocabulary
  • Expanding our extraction capabilities to more
    general categories of information requires
    additional tools. . .

22
Named Entity Extraction
  • Virtually all text and speech is rich with
    references to entities
  • Some categories of entities involve reasonably
    unique naming of the members of the category
  • Dave Eichmann
  • School of Library and Information Science
  • The University of Iowa
  • Iowa City, Iowa (actually two entities)

23
Named Entity Extraction work at the University of
Iowa
  • We have five categories currently being
    recognized
  • Persons
  • Organizations
  • Locations
  • Events (preliminary)
  • MeSH (medical terminology)
  • Plus generic noun phrases (e.g., health care)

24
Named Entity Recognition
  • All categories are driven through examination of
    noun phrases recognized by a part-of-speech
    tagger (with special handling of certain glue
    words such as and, of, the, etc.)
  • Named entity vectors are maintained separately
    from the regular word vector, weighted by their
    length and the frequency of the constituent terms
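
The noun-phrase step with glue-word handling can be sketched as a chunker over already tagged tokens. This is one plausible reading of the first bullet, not the actual implementation: the input is assumed to be `(word, tag)` pairs with Penn Treebank tags from some tagger, and the tag set, glue list, and function name are assumptions.

```python
# Tags treated as part of a noun phrase (Penn Treebank adjective and noun tags).
NOUNISH = {"JJ", "NN", "NNS", "NNP", "NNPS"}
# Glue words allowed inside a phrase but not at its start or end.
GLUE = {"and", "of", "the"}


def noun_phrases(tagged):
    """Group (word, tag) pairs into noun phrases, bridging over glue words."""
    phrases, current = [], []

    def flush():
        # Drop glue words left dangling at the end of the phrase.
        while current and current[-1].lower() in GLUE:
            current.pop()
        if current:
            phrases.append(" ".join(current))
        current.clear()

    for word, tag in tagged:
        if tag in NOUNISH:
            current.append(word)
        elif word.lower() in GLUE and current:
            # A glue word may continue a phrase that has already started.
            current.append(word)
        else:
            flush()
    flush()
    return phrases
```

Bridging over "of" and "and" is what keeps "School of Library and Information Science" together as one candidate entity; the same rule will also join two separate names linked by "and", so the glue list is a trade-off.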

25
Person Recognition Resources
  • Various Web lists of cultural names
  • Anglo, Chinese, Arab, Hebrew, Hindi, Indian,
    Japanese, Latino, Muslim, Russian
  • World leaders
  • This is enriched with a set of pattern
    expressions for other instances
  • President
  • III

26
Organization Recognition Resources
  • International political organizations (from CIA
    Fact Book)
  • Fortune 500 company list
  • Global 500 company list
  • This is enriched with a set of pattern
    expressions for other instances
  • Incorporated
  • Sons

27
Location Recognition Resources
  • We mine the text of the CIA Fact Book for
    variants of country names, administrative
    divisions, capitals, harbors, etc.
  • Various Web lists of
  • World cities
  • U.S. Cities
  • Rivers
  • Lakes
  • This is enriched with a set of pattern
    expressions for other instances
  • Street
  • Mount
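
The combination of mined lists and pattern expressions can be sketched as a gazetteer lookup with a regular-expression fallback. The gazetteer contents below are a tiny illustrative stand-in for the mined lists, and the two patterns are guesses at what "Street" and "Mount" expressions might look like.

```python
import re

# Illustrative stand-in for the lists mined from the CIA Fact Book and the Web.
GAZETTEER = {"iowa city", "iowa", "washington", "slovakia"}

# Hypothetical pattern expressions for locations not present in any list.
PATTERNS = [
    re.compile(r"^(?:[A-Z][a-z]+ )+Street$"),   # e.g. "Wall Street"
    re.compile(r"^Mount(?: [A-Z][a-z]+)+$"),    # e.g. "Mount Everest"
]


def is_location(phrase):
    """True if the phrase is a listed location or fits one of the patterns."""
    if phrase.lower() in GAZETTEER:
        return True
    return any(p.match(phrase) for p in PATTERNS)
```

The person and organization recognizers described on the two preceding slides could follow the same shape, with their own lists and patterns such as "President" or "Incorporated".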

28
Example Document Sources
  • Newswires
  • Associated Press
  • Wall Street Journal
  • Financial Times of London
  • Los Angeles Times
  • Reuters
  • Broadcast news transcripts
  • CNN Headline News
  • Voice of America
  • Medical Literature
  • PubMed
  • The Web

29
Newswire Entity Recognition Sample 1
  • Persons
  • Bill Clinton (3)
  • Jonathan Pollard (8)
  • Moshe Fogel (2)
  • Benjamin Netanyahu (2)
  • Esther (1)
  • Israeli Embassy (1)
  • Organizations
  • Cabinet (1)
  • Places
  • Israel (16)
  • United States (5)
  • Washington (2)

30
Newswire Entity Recognition Sample 2
  • Persons
  • Vladimir Meciar (8)
  • Jozef Moravcik (2)
  • God (1)
  • Kalman Petocz (2)
  • Organizations
  • Slovak Democratic Coalition (2)
  • United States and Germany (1)
  • NATO (1)
  • European Union (1)
  • Hungarian Coalition Party (1)
  • Places
  • Slovakia (4)
  • Europe (1)

31
Some Performance Data
  • The chart on the next slide shows a set of topics
    (generated by information analysts) plotted by
  • Proportion of returns that were false alarms (X axis)
  • Proportion of good matches that were missed (Y axis)
  • The retrieval decision was based on the level of
    entity matching between the topic and a given
    document
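
The two plotted quantities can be sketched as follows, taking the wording above at face value: false alarms as a share of returned documents, misses as a share of relevant documents. The overlap measure and the threshold used for the retrieval decision are assumptions; the slides do not specify how the level of entity matching was scored.

```python
def entity_overlap(topic_entities, doc_entities):
    """Fraction of the topic's entities that also occur in the document."""
    topic = set(topic_entities)
    if not topic:
        return 0.0
    return len(topic & set(doc_entities)) / len(topic)


def evaluate(topic_entities, docs, relevant_ids, threshold=0.5):
    """Return (false_alarm_rate, miss_rate) for one topic.

    docs maps a document id to the set of entities found in that document;
    relevant_ids is the set of ids judged to be good matches for the topic.
    """
    returned = {
        doc_id
        for doc_id, entities in docs.items()
        if entity_overlap(topic_entities, entities) >= threshold
    }
    false_alarms = returned - relevant_ids
    missed = relevant_ids - returned
    fa_rate = len(false_alarms) / len(returned) if returned else 0.0
    miss_rate = len(missed) / len(relevant_ids) if relevant_ids else 0.0
    return fa_rate, miss_rate
```

Raising the threshold lowers the false alarm rate at the cost of more misses, so each topic lands at a different point on the chart depending on how well its entities characterize it.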

32
Some Performance Data
  • [Chart: topics plotted by false alarm rate (X axis) against miss rate (Y axis)]

33
Comments on Performance
  • Note that the false alarm rate is very low
  • The scale on the X-axis is 0.00 - 0.05
  • The extremely broad spread on miss rates (from
    missing everything to missing nothing) correlates
    roughly with the nature of the topic
  • Those involving actions of an individual or group
    (e.g., World Trade Organization talks) are quite
    good
  • Those involving concepts (e.g., Federal monetary
    policy) are quite poor

34
Conclusions / Observations
  • Given the rate of increase in the generation of
    structured, semistructured and unstructured data,
    some form of automated extraction, analysis and
    retrieval technology can be quite valuable
  • Current technology performs well for some, but
    not all, categories of information request

35
The Cutting Edge
  • Moving beyond this involves even more complicated
    techniques
  • The current hot topic is question answering -
    given a question, provide not a document, but the
    actual answer
  • What is Colin Powell famous for?
  • How many cases of West Nile have been detected in
    Iowa this year?

36
The Cutting Edge
  • QA systems use many of the preceding approaches,
    adding significantly in the areas of natural
    language parsing and question classification.
  • Current state of the art
  • How much folic acid should an expectant mother
    get daily?
  • 400 micrograms
  • Answers are factoids, and anything in the corpus
    (real or not) is scored as correct