The New Bill of Rights of Information Society

Transcript and Presenter's Notes
1
The New Bill of Rights of Information Society
  • Raj Reddy and Jaime Carbonell
  • Carnegie Mellon University
  • March 23, 2006
  • Talk at Google

2
New Bill of Rights
  • Get the right information
  • e.g. search engines
  • To the right people
  • e.g. categorizing, routing
  • At the right time
  • e.g. Just-in-Time (task modeling, planning)
  • In the right language
  • e.g. machine translation
  • With the right level of detail
  • e.g. summarization
  • In the right medium
  • e.g. access to information in non-textual media

3
Relevant Technologies
  • search engines → right information
  • classification, routing → right people
  • anticipatory analysis → right time
  • machine translation → right language
  • summarization → right level of detail
  • speech input and output → right medium

4
Right Information: Search Engines

5
The Right Information
  • Right Information from future Search Engines
  • How to go beyond just relevance to the query and
    popularity alone
  • Eliminate massive redundancy, e.g. for web-based
    email
  • Should not result in
  • multiple links to different Yahoo sites promoting
    their email, or non-Yahoo sites discussing
    just Yahoo email
  • Should result in
  • a link to Yahoo email, one to MSN email, one to
    Gmail, one that compares them, etc.
  • First show trusted info sources and
    user-community-vetted sources (a re-ranking
    sketch follows this list)
  • At least for important info (medical, financial,
    educational, ...), I want to trust what I read,
    e.g.,
  • For new medical treatments
  • First, info from hospitals, medical schools, the
    AMA, medical publications, etc., and
  • NOT from Joe Shmo's quack-practice page or from
    the National Enquirer
  • Maximum Marginal Relevance
  • Novelty Detection
  • Named Entity Extraction
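A minimal sketch of the "trusted sources first" re-ranking idea above,
assuming search results arrive as (url, domain, relevance_score) tuples; the
trusted-domain list and the boost weight are illustrative assumptions, not
anything specified in the talk.

    TRUSTED_MEDICAL_DOMAINS = {"nih.gov", "ama-assn.org", "mayoclinic.org", "nejm.org"}

    def rerank_with_trust(results, trusted_domains=TRUSTED_MEDICAL_DOMAINS, boost=0.5):
        """Re-rank (url, domain, relevance_score) results so that trusted
        sources surface first; within each group, relevance still decides."""
        def adjusted(item):
            _url, domain, relevance = item
            return relevance + (boost if domain in trusted_domains else 0.0)
        return sorted(results, key=adjusted, reverse=True)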

6
Beyond Pure Relevance in IR
  • Current Information Retrieval Technology Only
    Maximizes Relevance to Query
  • What about information novelty, timeliness,
    appropriateness, validity, comprehensibility,
    density, medium,...??
  • Novelty is approximated by non-redundancy!
  • We really want to maximize relevance to the
    query, given the user profile and interaction
    history:
  • P(U(f_i, ..., f_n) | Q, C, U, H)
  • where Q = query, C = collection set,
  • U = user profile, H = interaction history
  • ...but we don't yet know how. Darn.

7
Maximal Marginal Relevance vs. Standard
Information Retrieval
[Diagram: given a query and a set of documents, standard IR picks the
documents most similar to the query, while MMR also penalizes documents
that are similar to ones already selected]
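A minimal sketch of the MMR re-ranking idea in the diagram, assuming the
query and documents are already represented as sparse term-weight vectors
(dicts); the trade-off parameter lam and the helper names are illustrative
choices, not anything specified in the talk.

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse vectors (term -> weight dicts)."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def mmr_rank(query_vec, doc_vecs, k=10, lam=0.7):
        """Greedy Maximal Marginal Relevance: trade off relevance to the
        query against redundancy with documents already selected."""
        selected = []
        candidates = dict(doc_vecs)          # doc_id -> vector
        while candidates and len(selected) < k:
            def mmr_score(doc_id):
                rel = cosine(query_vec, candidates[doc_id])
                red = max((cosine(candidates[doc_id], doc_vecs[s]) for s in selected),
                          default=0.0)
                return lam * rel - (1.0 - lam) * red
            best = max(candidates, key=mmr_score)
            selected.append(best)
            del candidates[best]
        return selected

With lam set to 1.0 this reduces to standard relevance ranking; lowering lam
pushes redundant documents (e.g. near-duplicate Yahoo-email pages) down the
list.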
8
Novelty Detection
  • Find the first report of a new event
  • (Unconditional) Dissimilarity with Past
  • Decision threshold on most-similar story
  • (Linear) temporal decay
  • Length-filter (for teasers)
  • Cosine similarity with standard weights
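A minimal sketch of this recipe: cosine similarity against past stories, a
linear temporal decay, and a decision threshold on the most-similar story.
The threshold and decay rate are illustrative, the length filter for teasers
is omitted, and the cosine() helper from the MMR sketch above is assumed.

    def is_first_story(new_vec, new_time, history, threshold=0.25, decay=0.01):
        """Flag a story as a 'first story' if its decayed similarity to every
        earlier story stays below the decision threshold.
        history: list of (term-weight vector, timestamp) pairs for past stories."""
        best = 0.0
        for old_vec, old_time in history:
            weight = max(0.0, 1.0 - decay * (new_time - old_time))  # linear temporal decay
            best = max(best, weight * cosine(new_vec, old_vec))     # cosine() as defined above
        return best < threshold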

9
New First Story Detection Directions
  • Topic-conditional models
  • e.g. airplane, investigation, FAA, FBI,
    casualties → topic, not event
  • TWA 800, March 12, 1997 → event
  • First categorize into topic, then use
    maximally-discriminative terms within topic
  • Rely on situated named entities
  • e.g. Arcan as victim, Sharon as peacemaker

10
Link Detection in Texts
  • Find texts (e.g. news stories) that mention the
    same underlying events
  • Could be combined with novelty detection (e.g.
    something new about an interesting event)
  • Techniques: text similarity, NEs, situated NEs,
    relations, topic-conditioned models, ...

11
Named-Entity identification
  • Purpose: to answer questions such as
  • Who is mentioned in these 100 Society articles?
  • What locations are listed in these 2000 web
    pages?
  • What companies are mentioned in these patent
    applications?
  • What products were evaluated by Consumer Reports
    this year?

12
Named Entity Identification
  • President Clinton decided to send special trade
    envoy Mickey Kantor to the special Asian economic
    meeting in Singapore this week. Ms. Xuemei Peng,
    trade minister from China, and Mr. Hideto Suzuki
    from Japan's Ministry of Trade and Industry will
    also attend. Singapore, who is hosting the
    meeting, will probably be represented by its
    foreign and economic ministers. The Australian
    representative, Mr. Langford, will not attend,
    though no reason has been given. The parties hope
    to reach a framework for currency stabilization.

13
Methods for NE Extraction
  • Finite-State Transducers w/variables
  • Example output:
  • [FNAME Bill] [LNAME Clinton] [TITLE President]
  • FSTs learned from labeled data
  • Statistical learning (also from labeled data)
  • Hidden Markov Models (HMMs)
  • Exponential (maximum-entropy) models
  • Conditional Random Fields (Lafferty et al.)
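In the spirit of the finite-state-transducer-with-variables approach, a toy
pattern matcher that emits the slotted output shown above (TITLE / FNAME /
LNAME); real systems learn FSTs, HMMs, or CRFs from labeled data, so this
hand-written pattern is only an illustration.

    import re

    # Toy finite-state-style pattern: an honorific or title followed by a
    # capitalized first and last name.
    NAME_PATTERN = re.compile(
        r"\b(?P<TITLE>President|Mr\.|Ms\.|Dr\.)\s+"
        r"(?P<FNAME>[A-Z][a-z]+)\s+(?P<LNAME>[A-Z][a-z]+)"
    )

    def extract_person_entities(text):
        """Return a list of {'TITLE': ..., 'FNAME': ..., 'LNAME': ...} dicts."""
        return [m.groupdict() for m in NAME_PATTERN.finditer(text)]

    print(extract_person_entities("President Bill Clinton decided to send Mickey Kantor."))
    # [{'TITLE': 'President', 'FNAME': 'Bill', 'LNAME': 'Clinton'}]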

14
Named Entity Identification
  • Extracted Named Entities (NEs)
  • People: President Clinton, Mickey Kantor,
    Ms. Xuemei Peng, Mr. Hideto Suzuki, Mr. Langford
  • Places: Singapore, Japan, China, Australia

15
Role-Situated NEs
  • Motivation: It is useful to know the roles of NEs
  • Who participated in the economic meeting?
  • Who hosted the economic meeting?
  • Who was discussed in the economic meeting?
  • Who was absent from the economic meeting?

16
Emerging Methods for Extracting Relations
  • Link Parsers at Clause Level
  • Based on dependency grammars
  • Probabilistic enhancements (Lafferty, Venable)
  • Island-Driven Parsers
  • GLR (Lavie), Chart (Nyberg, Placeway), LC-Flex
    (Rose)
  • Tree-bank-trained probabilistic CF parsers (IBM,
    Collins)
  • Herald the return of deep(er) NLP techniques.
  • Relevant to new Q/A from free-text initiative.
  • Too complex for inductive learning (today).

17
Relational NE Extraction
  • Example (Who does What to Whom)
  • "John Snell reporting for Wall Street. Today
    Flexicon Inc. announced a tender offer for
    Supplyhouse Ltd. for $30 per share, representing
    a 30% premium over Friday's closing price.
    Flexicon expects to acquire Supplyhouse by Q4
    2001 without problems from federal regulators"

18
Fact Extraction Application
  • Useful for relational DB filling, to prepare data
    for standard DM/machine-learning methods
  • Acquirer    Acquiree      Share price ($)   Year
  • ___________________________________________________
  • Flexicon    Logi-truck    18                 1999
  • Flexicon    Supplyhouse   30                 2001
  • buy.com     reel.com      10                 2000
  • ...         ...           ...                ...
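A toy sketch of how rows like these might be filled from tender-offer
sentences such as the Flexicon example on the previous slide; the single
hand-written pattern and the fixed year argument are illustrative stand-ins
for the parser- and learning-based extraction techniques listed earlier.

    import re

    TENDER_OFFER = re.compile(
        r"(?P<acquirer>[A-Z][\w.-]*(?:\s+(?:Inc|Ltd|Corp)\.?)?)\s+"
        r"announced a tender offer for\s+"
        r"(?P<acquiree>[A-Z][\w.-]*(?:\s+(?:Inc|Ltd|Corp)\.?)?)\s+"
        r"for \$?(?P<price>\d+) per share"
    )

    def extract_acquisitions(text, year=None):
        """Return (acquirer, acquiree, share_price, year) rows for a fact table."""
        return [(m.group("acquirer"), m.group("acquiree"), int(m.group("price")), year)
                for m in TENDER_OFFER.finditer(text)]

    text = ("Today Flexicon Inc. announced a tender offer for Supplyhouse Ltd. "
            "for $30 per share.")
    print(extract_acquisitions(text, year=2001))
    # [('Flexicon Inc.', 'Supplyhouse Ltd.', 30, 2001)]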

19
Right People: Text Categorization

20
The Right People
  • User-focused search is key
  • If a 7-year-old is working on a school project on
    taking good care of one's heart and types in
    "heart care", she will want links to pages like
  • "You and your friendly heart",
  • "Tips for taking good care of your heart",
  • "Intro to how the heart works", etc.
  • NOT the latest New England Journal of Medicine
    article on "Cardiological implications of
    immuno-active proteases."
  • If a cardiologist issues the query, exactly the
    opposite is desired
  • Search engines must know their users better, and
    the users' tasks
  • Social affiliation groups for search and for
    automatically categorizing, prioritizing and
    routing incoming info or search results. New
    machine learning technology allows for scalable
    high-accuracy hierarchical categorization.
  • Family group
  • Organization group
  • Country group
  • Disaster affected group
  • Stockholder group

21
Text Categorization
  • Assign labels to each document or web-page
  • Labels may be topics such as Yahoo-categories
  • finance, sports, News→World→Asia→Business
  • Labels may be genres
  • editorials, movie-reviews, news
  • Labels may be routing codes
  • send to marketing, send to customer service

22
Text Categorization
Methods
  • Manual assignment
  • as in Yahoo
  • Hand-coded rules
  • as in Reuters
  • Machine Learning (dominant paradigm)
  • Words in text become predictors
  • Category labels become the classes to be predicted
  • Predictor-feature reduction (SVD, χ², ...)
  • Apply any inductive method: kNN, NB, DT, ...
    (a minimal pipeline sketch follows this list)
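A minimal sketch of this machine-learning recipe using scikit-learn (an
assumed toolkit; the talk does not name one): words become TF-IDF predictors,
a chi-squared filter reduces the feature set, and a Naive Bayes learner
predicts the label. The toy documents and labels are made up for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    docs = ["stock markets fell sharply today",
            "the team won the championship game",
            "quarterly earnings beat analyst forecasts",
            "the striker scored twice in the final"]
    labels = ["finance", "sports", "finance", "sports"]

    categorizer = Pipeline([
        ("tfidf", TfidfVectorizer()),       # words in text become predictors
        ("chi2", SelectKBest(chi2, k=10)),  # predictor-feature reduction
        ("nb", MultinomialNB()),            # any inductive method: NB, kNN, DT, ...
    ])
    categorizer.fit(docs, labels)
    print(categorizer.predict(["shares dropped after the earnings report"]))

Any other inductive learner (kNN, decision trees, SVMs) drops into the last
pipeline stage unchanged.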

23
Multi-tier Event Classification
24
Right Time: Just-in-Time, no sooner or later

25
Just in Time Information
  • Get the information to the user exactly when it is
    needed
  • Immediately, when the information is requested
  • Pre-positioned, if it requires time to fetch or
    download (e.g. HDTV video)
  • requires anticipatory analysis and pre-fetching
  • How about push technology, e.g. for stock
    alerts, reminders, breaking news?
  • Depends on user activity (decision rules are
    sketched after this list)
  • Sleeping, Do Not Disturb, or In Meeting → wait
    your chance
  • Reading email → now if info is urgent, later
    otherwise
  • Group info before delivering (e.g. show 3 stock
    alerts together)
  • Info directly relevant to user's current task →
    immediately
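The delivery rules above can be sketched as a small decision function; the
activity names, the urgency flag, and the returned strings are illustrative
stand-ins for a real task model.

    def delivery_decision(user_activity, info_is_urgent, relevant_to_current_task):
        """Decide when to deliver an incoming item: interrupt for task-relevant
        info, hold while the user is unavailable, and batch everything else."""
        if relevant_to_current_task:
            return "deliver immediately"
        if user_activity in ("sleeping", "do not disturb", "in meeting"):
            return "wait for a better moment"
        if user_activity == "reading email":
            return "deliver now" if info_is_urgent else "deliver later, batched"
        return "deliver later, batched"   # default: group with other pending alerts

    print(delivery_decision("in meeting", info_is_urgent=True,
                            relevant_to_current_task=False))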

26
Right Language: Translation

27
Access to Multilingual Information
  • Language Identification (from text, speech,
    handwriting)
  • Trans-lingual retrieval (query in 1 language,
    results in multiple languages)
  • Requires more than query-word out-of-context
    translation (see Carbonell et al 1997 IJCAI
    paper) to do it well
  • Full translation (e.g. of web pages, of
    search-result snippets, ...)
  • General reading quality (as targeted now)
  • Focused on getting entities right (who, what,
    where, when mentioned)
  • Partial on-demand translation
  • Reading assistant: translation in context while
    reading an original document, by highlighting
    unfamiliar words, phrases, or passages
  • On-demand Text to Speech
  • Transliteration

28
in the Right Language
  • Knowledge-Engineered MT
  • Transfer rule MT (commercial systems)
  • High-Accuracy Interlingual MT (domain focused)
  • Parallel Corpus-Trainable MT
  • Statistical MT (noisy channel, exponential
    models)
  • Example-Based MT (generalized G-EBMT)
  • Transfer-rule learning MT (corpus informants)
  • Multi-Engine MT
  • Omnivorous approach: combines the above to
    maximize coverage and minimize errors

29
Types of Machine Translation
[Diagram (MT pyramid): the direct/EBMT route maps the source (Arabic)
straight to the target (English); the transfer route goes through syntactic
parsing, transfer rules, and text generation; the interlingua route goes
through semantic analysis, an interlingua, and sentence planning]
30
EBMT example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.

English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.

English: I would like to meet the tallest man.
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
31
Multi-Engine Machine Translation
  • MT Systems have different strengths
  • Rapidly adaptable: statistical, example-based MT
  • Good grammar: rule-based (linguistic) MT
  • High precision in narrow domains: KBMT
  • Minority-language MT: learnable from an informant
  • Combine results of parallel-invoked MT engines
  • Select best of multiple translations (selection
    sketched after this list)
  • Selection based on optimizing combination of
  • Target language joint-exponential model
  • Confidence scores of individual MT engines
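A sketch of the selection step: each engine contributes a hypothesis with its
own confidence, and the winner maximizes a weighted combination of that
confidence and a target-language score. The toy unigram language model and
the weight alpha are illustrative stand-ins for the joint-exponential model
and tuned weights mentioned above.

    import math

    def unigram_lm_score(sentence, unigram_probs, floor=1e-6):
        """Toy target-language model: average log-probability of the words."""
        words = sentence.lower().split()
        return sum(math.log(unigram_probs.get(w, floor)) for w in words) / max(len(words), 1)

    def select_best_translation(hypotheses, unigram_probs, alpha=0.5):
        """hypotheses: list of (engine_name, translation, confidence in [0, 1]).
        Score = alpha * engine confidence + (1 - alpha) * LM fluency."""
        def score(hyp):
            _engine, text, conf = hyp
            fluency = math.exp(unigram_lm_score(text, unigram_probs))  # back to [0, 1]
            return alpha * conf + (1 - alpha) * fluency
        return max(hypotheses, key=score)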

32
Illustration of Multi-Engine MT
33
State of the Art in MEMT for New Hot Languages
  • We can do now
  • Gisting MT for any new language in 2-3 weeks
    (given parallel text)
  • Medium quality MT in 6 months (given more
    parallel text, informant, bi-lingual dictionary)
  • Improve-as-you-go MT
  • Field MT systems on PCs
  • We cannot do yet
  • High-accuracy MT for open domains
  • Cope with spoken-only languages
  • Reliable speech-speech MT (but BABYLON is coming)
  • MT on your wristwatch

34
Right Level of Detail: Summarization

35
Right Level of Detail
  • Automate summarization with hyperlinked one-click
    drilldown on user-selected section(s)
  • Purpose-driven: summaries are in service of an
    information need, not one-size-fits-all (as in
    Shaom's outline and the DUC/NIST evaluations)
  • EXAMPLE: A summary of a 650-page clinical study
    can focus on
  • effectiveness of the new drug for the target disease
  • methodology of the study (control group,
    statistical rigor, ...)
  • deleterious side effects, if any
  • target population of the study (e.g. acne-suffering
    teens, not eczema-suffering adults), depending on
    the user's task or information query

36
Information Structuring and Summarization
  • Hierarchical multi-level pre-computed summary
    structure, or on-the-fly drilldown expansion of
    info (a structure sketch follows this list)
  • Headline
  • Abstract: 1% or 1 page
  • Summary: 5-10% or 10 pages
  • Document: 100%
  • Scope of Summary
  • Single big document (e.g. big clinical study)
  • Tight cluster of search results (e.g. Vivisimo)
  • Related set of clusters (e.g. conflicting
    opinions on how to cope with Iran's nuclear
    capabilities)
  • Focused area of knowledge (e.g. What's known
    about Pluto? Lycos has a good project on this via
    Hotbot)
  • Specific kinds of commonly asked information (e.g.
    synthesize a bio on person X from any
    web-accessible info)
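One way to picture the pre-computed hierarchy described above is a small tree
whose levels correspond to headline, abstract, summary, and full document,
with one-click drilldown as a step down the tree. The class name, fields, and
sample texts are invented for illustration.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class SummaryNode:
        """One level of a pre-computed summary hierarchy:
        headline -> abstract (~1%) -> summary (~5-10%) -> document (100%)."""
        level: str        # "headline", "abstract", "summary", or "document"
        text: str
        children: List["SummaryNode"] = field(default_factory=list)

        def drill_down(self, index: int = 0) -> Optional["SummaryNode"]:
            """One-click drilldown into the selected child section, if any."""
            return self.children[index] if self.children else None

    doc = SummaryNode("headline", "New drug halves relapse rate in trial", [
        SummaryNode("abstract", "One-page abstract ...", [
            SummaryNode("summary", "Ten-page summary ...", [
                SummaryNode("document", "Full 650-page clinical study ...")])])])
    print(doc.drill_down().level)   # -> "abstract"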

37
Document Summarization
  • Types of Summaries

38
Right Medium: Finding Information in Non-textual Media

39
Indexing and Searching Non-textual (Analog)
Content
  • Speech → text (speech recognition)
  • Text → speech
  • TTS: FESTVOX is by far the most popular
    high-quality system
  • Handwriting → text (handwriting recognition)
  • Printed text → electronic text (OCR)
  • Picture → caption keywords (automatically) for
    indexing and searching
  • Diagrams, tables, graphs, maps → caption keywords
    (automatically); a converter-routing sketch
    follows this list
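A small routing sketch of the architecture this list implies: each
non-textual item goes through a medium-specific converter that produces text
or keywords for the search index. The converter names and placeholder outputs
are illustrative; a real system would plug in speech recognition, OCR,
handwriting recognition, and image captioning here.

    def index_non_textual(item_type, payload, converters):
        """Route an item through the registered media-to-text converter and
        return the text/keywords that the search index will store."""
        if item_type not in converters:
            raise ValueError(f"no converter registered for {item_type!r}")
        return converters[item_type](payload)

    # Placeholder converters standing in for ASR, OCR, handwriting recognition,
    # and automatic captioning; each returns indexable text.
    converters = {
        "speech": lambda audio: "transcript produced by a speech recognizer",
        "scan": lambda image: "text produced by OCR",
        "handwriting": lambda image: "text produced by handwriting recognition",
        "picture": lambda image: "caption keywords produced by an image captioner",
    }
    print(index_non_textual("speech", b"raw audio bytes", converters))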

40
Conclusion

41
What is Text Mining?
  • Search documents, web, news
  • Categorize by topic, taxonomy
  • Enables filtering, routing, multi-text summaries, ...
  • Extract names, relations, ...
  • Summarize text, rules, trends, ...
  • Detect redundancy, novelty, anomalies, ...
  • Predict outcomes, behaviors, trends, ...

Who did what to whom and where?
42
Data Mining vs. Text Mining
  • Data: relational tables
  • DM universe: huge
  • DM tasks
  • DB cleanup
  • Taxonomic classification
  • Supervised learning with predictive classifiers
  • Unsupervised learning: clustering, anomaly
    detection
  • Visualization of results
  • Text: HTML, free form
  • TM universe: roughly 10³ x the DM universe
  • TM tasks
  • All the DM tasks, plus
  • Extraction of roles, relations and facts
  • Machine translation for multi-lingual sources
  • Parse NL-query (vs. SQL)
  • NL-generation of results