Natural Language Processing using Wikipedia

1
Natural Language Processing using Wikipedia
  • Rada Mihalcea
  • University of North Texas

2
Text Wikification
  • Finding key terms in documents and linking them
    to relevant encyclopedic information.

3
Text Wikification (continued)
  • Motivation
  • Help Wikipedia contributors
  • NLP applications (summarization, text
    categorization, metadata annotation, text
    similarity)
  • Enrich educational materials
  • Annotating web pages (semantic web)
  • Combined problem
  • Finding the important concepts
  • Keyword extraction
  • Finding the correct article
  • Word sense disambiguation

4
Wikification pipeline
5
Keyword Extraction
  • Finding important words/phrases in raw text
  • Two-stage process
  • Candidate extraction
  • Typical methods: n-grams, noun phrases
  • Candidate ranking
  • Rank the candidates by importance
  • Typical methods
  • Unsupervised: information theoretic
  • Supervised: machine learning using positional and
    linguistic features

6
Keyword Extraction using Wikipedia
  • 1. Candidate extraction
  • Semi-controlled vocabulary
  • Wikipedia article titles and anchor texts
    (surface forms).
  • E.g., USA, U.S., United States of America
  • More than 2,000,000 terms/phrases
  • Vocabulary is broad (e.g., the, a are included)
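
The candidate extraction step above can be sketched as matching n-grams against a controlled vocabulary. This is a minimal illustration, not the Wikify! implementation: the function name, the whitespace tokenizer, the 4-token limit, and the toy vocabulary are all assumptions.

```python
def extract_candidates(text, vocabulary, max_len=4):
    """Return every n-gram (up to max_len tokens) found in the vocabulary."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocabulary:
                candidates.append(phrase)
    return candidates


# Toy vocabulary standing in for Wikipedia titles and anchor texts.
vocabulary = {"united states of america", "united states", "federal district", "the"}
found = extract_candidates(
    "The United States of America has a federal district", vocabulary
)
```

Because the vocabulary is broad, function words such as "the" survive this step; the ranking step is what filters them out.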

7
Keyword Extraction using Wikipedia
  • 2. Candidate ranking
  • tf-idf
  • Wikipedia articles as document collection
  • Chi-squared: independence of phrase and text
  • The degree to which it appeared more times than
    expected by chance
  • Keyphraseness
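
The slide only names keyphraseness. A common reading is that it estimates the probability that a term is selected as a link, given that it appears in a document. The sketch below assumes that reading and uses hypothetical names, toy data, and crude substring matching.

```python
def keyphraseness(term, documents):
    """Share of documents containing `term` in which it is also linked.

    documents: list of (plain_text, set_of_linked_terms) pairs.
    """
    # Substring matching is a simplification; real text needs tokenization.
    appears = sum(1 for text, _ in documents if term in text.lower())
    linked = sum(1 for _, links in documents if term in links)
    return linked / appears if appears else 0.0


# Toy collection standing in for Wikipedia articles and their links.
docs = [
    ("the cold war shaped the united states", {"cold war", "united states"}),
    ("the united states borders canada", {"canada"}),
    ("the cold war ended in 1991", {"cold war"}),
]
scores = {t: keyphraseness(t, docs) for t in ("cold war", "united states", "the")}
```

In this toy collection, a term that is linked wherever it appears scores high, while a function word that is never linked scores zero.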

8
Evaluations
  • Gold standard
  • 85 documents containing 7,286 links
  • Links selected by Wikipedia users
  • Have undergone the continuous editorial process
    of Wikipedia
  • Extract N keywords from the ranking
  • N = 6% of number of words
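
A minimal sketch of this evaluation step. It assumes the top-N candidates are scored against the gold-standard links with precision, recall, and F-measure; the slide does not name the metrics, and the function name and toy data are hypothetical.

```python
def evaluate_top_n(ranked_candidates, gold_links, num_words, ratio=0.06):
    """Keep the top N = ratio * num_words candidates and score them."""
    n = max(1, round(num_words * ratio))
    selected = set(ranked_candidates[:n])
    correct = len(selected & gold_links)
    precision = correct / len(selected) if selected else 0.0
    recall = correct / len(gold_links) if gold_links else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy ranking and gold standard for a 50-word document (N = 3).
ranked = ["cold war", "united states", "canada", "the"]
gold = {"cold war", "canada", "communism"}
precision, recall, f_measure = evaluate_top_n(ranked, gold, num_words=50)
```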

9
Results
10
Example Keyword Extraction
Automatically extracted Wikipedia annotations
The United States of America is a federal constitutional republic comprising fifty states and a federal district. The country is situated almost entirely in the western hemisphere: its forty-eight contiguous states and Washington, D.C., the capital district, lie in central North America between the Pacific and Atlantic Oceans, bordered by Canada to the north and Mexico to the south; the state of Alaska is in the northwest of the continent with Canada to its east, and the state of Hawaii is in the mid-Pacific.
11
Wikification Pipeline
12
Word Sense Disambiguation
13
  • Aida (café): In most shops a quick coffee while
    standing up at the bar is possible.

14
Wikipedia as a Sense Tagged Corpus
Wikipedia links as sense annotations
  • In most shops a quick coffee while standing up at
    the [[bar (counter)|bar]] is possible.
  • A channel is also the natural or man-made deeper
    course through a reef, [[bar (landform)|bar]],
    bay, or any shallow body of water.
  • Each [[bar (music)|bar]] has a 2-beat unit, a
    5-beat unit, and a 3-beat unit, with a stress at
    the beginning of each unit.

15
Sense Inventory
  • Alternative 1: disambiguation webpages
  • Does not include all possible annotations
  • [[measure (music)|bar]]: measure (music) not
    listed
  • Inconsistent
  • identifier of disambiguation page: paper
    (disambiguation) vs. paper
  • Alternative 2: extract all link annotations
  • bar (counter), bar (music), bar (landform)
  • map them to WordNet senses

16
Building a Sense Tagged Corpus
  • Given ambiguous word W
  • Extract all the paragraphs in Wikipedia
    containing the ambiguous word W inside a link
  • Collect all the possible Wikipedia labels:
    leftmost component of each link
  • Map the Wikipedia labels to WordNet senses
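
The first two steps above can be sketched with a regular expression over wiki markup of the form [[label|anchor]]. This is an illustration under assumptions: the function name is hypothetical, the paragraphs are toy data, and matching the anchor text exactly against W is a simplification. The mapping to WordNet senses is not covered.

```python
import re
from collections import defaultdict

# Matches [[label]] and [[label|anchor]]; the label is the leftmost component.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def collect_sense_examples(paragraphs, word):
    """Group paragraphs by the Wikipedia label of links whose anchor is `word`."""
    examples = defaultdict(list)
    for paragraph in paragraphs:
        for match in LINK.finditer(paragraph):
            label = match.group(1).strip()
            anchor = (match.group(2) or label).strip()
            if anchor.lower() == word:
                examples[label].append(paragraph)
    return dict(examples)


paragraphs = [
    "In most shops a quick coffee at the [[bar (counter)|bar]] is possible.",
    "Each [[bar (music)|bar]] has a 2-beat unit and a 3-beat unit.",
    "The [[reef]] lies beyond the [[bar (landform)|bar]].",
]
corpus = collect_sense_examples(paragraphs, "bar")
```

Each key of `corpus` is a candidate sense label, and its value is the list of paragraphs that serve as tagged examples for that sense.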

17
An Example
  • Given ambiguous word W = BAR
  • Extract all the paragraphs in Wikipedia
    containing the ambiguous word W inside a link
  • 1,217 paragraphs
  • remove examples with [[bar]] (ambiguous): 1,108
    examples
  • Collect all the possible Wikipedia labels:
    leftmost component of each link
  • 40 Wikipedia labels
  • bar (music), measure music, musical notation
  • Map the Wikipedia labels to WordNet senses
  • 9 WordNet senses

18
Word sense | Wikipedia label | Wikipedia definition | WordNet definition
bar (counter) | bar_(counter) | The counter from which drinks are dispensed | A counter where you can obtain food or drink
bar (music) | bar_(music), measure_music, musical_notation | A period of music | Musical notation for a repeating pattern of musical beats
bar (landform) | bar_(landform) | A type of beach behind which lies a lagoon | A submerged (or partly submerged) ridge in a river or along a shore
19
Supervised Word Sense Disambiguation
  • Local and topical features in a Naïve Bayes
    classifier
  • Good performance on Senseval-2 and Senseval-3
    data
  • Local features
  • Current word and part-of-speech
  • Surrounding context of three words
  • Collocational features
  • Topical features
  • Five keywords per sense, occurring at least three
    times
  • (Ng & Lee, 1996), (Lee & Ng, 2002)
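
A minimal sketch of a Naïve Bayes sense classifier over local context features. It covers only the surrounding-word features; part-of-speech, collocational, and topical features are omitted. The class name, add-one smoothing, and toy training sentences are assumptions, not the system described in the cited papers.

```python
import math
from collections import Counter, defaultdict


def local_features(tokens, index, window=3):
    """Words within `window` positions of the target, tagged with their offset."""
    feats = []
    for offset in range(-window, window + 1):
        pos = index + offset
        if offset != 0 and 0 <= pos < len(tokens):
            feats.append(f"w{offset:+d}={tokens[pos]}")
    return feats


class NaiveBayesWSD:
    def __init__(self):
        self.sense_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.feature_vocab = set()

    def train(self, examples):
        """examples: iterable of (feature_list, sense_label) pairs."""
        for feats, sense in examples:
            self.sense_counts[sense] += 1
            for feat in feats:
                self.feature_counts[sense][feat] += 1
                self.feature_vocab.add(feat)

    def predict(self, feats):
        total = sum(self.sense_counts.values())
        vocab_size = len(self.feature_vocab)
        best_sense, best_score = None, -math.inf
        for sense, count in self.sense_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feature_counts[sense].values()) + vocab_size
            for feat in feats:  # add-one smoothed log likelihoods
                score += math.log((self.feature_counts[sense][feat] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


def featurize(sentence, target="bar"):
    tokens = sentence.lower().split()
    return local_features(tokens, tokens.index(target))


# Toy sense-tagged examples (hypothetical, not drawn from Wikipedia).
train = [
    ("a quick coffee at the bar is possible", "bar (counter)"),
    ("drinks are served at the bar every night", "bar (counter)"),
    ("each bar has a two beat unit", "bar (music)"),
    ("the second bar repeats the beat pattern", "bar (music)"),
]
clf = NaiveBayesWSD()
clf.train((featurize(s), sense) for s, sense in train)
prediction = clf.predict(featurize("we had coffee at the bar"))
```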

20
Experiments on Senseval-2 / Senseval-3
  • Lexical sample WSD
  • 49 ambiguous nouns from Senseval-2 (29),
    Senseval-3 (20)
  • Remove the words with one Wikipedia sense
  • detention
  • Remove the words with all Wikipedia senses mapped
    to one WordNet sense
  • Roman church, Catholic church → Catholic church
  • Final set: 30 nouns with Wikipedia labels mapped
    to at least two WordNet senses

21
  • Ten-fold cross validations
  • WSD: Supervised word sense disambiguation on
    Wikipedia sense tagged corpora
  • MFS: Most frequent sense, choose the most
    frequent sense by default
  • Similarity: Similarity between current example
    and training data available for each sense
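
A minimal sketch of the MFS baseline under ten-fold cross validation. The interleaved, unshuffled fold assignment and the toy label list are assumptions made for illustration.

```python
from collections import Counter


def most_frequent_sense(train_labels):
    """Return the label that occurs most often in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validated_mfs_accuracy(labels, folds=10):
    """Hold out every `folds`-th example in turn and always guess the MFS."""
    correct = 0
    for k in range(folds):
        test = labels[k::folds]
        train = [lab for i, lab in enumerate(labels) if i % folds != k]
        guess = most_frequent_sense(train)
        correct += sum(1 for lab in test if lab == guess)
    return correct / len(labels)


# Toy, skewed sense distribution (hypothetical counts).
labels = ["financial institution"] * 18 + ["river bank"] * 2
accuracy = cross_validated_mfs_accuracy(labels)
```

With a skewed distribution like this one, the baseline is already strong, which is why a supervised system has to be compared against it.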

22
Results on Senseval-2 / Senseval-3
23
Some Notes
  • Words with no improvement
  • Small number of examples in Wikipedia
  • restraint (9), shelter (17)
  • Skewed sense distributions
  • bank: 1,044 occurrences as financial
    institution, 30 occurrences as river bank
  • Different granularity
  • Coarser grained senses in Wikipedia
  • Missing senses: atmosphere (ambiance)
  • Coarse distinctions (grasp): act of grasping (1),
    hold (2)
  • Exceptions: dance performance, theatre
    performance

24
Experiments on Wikipedia
  • All-words WSD
  • Link disambiguation
  • Find the link assigned by the Wikipedia
    annotators
  • Data set
  • The same data set used in keyword evaluation
  • 85 documents containing 7,286 links
  • Three methods
  • Supervised
  • Similarity
  • Unsupervised: measure similarity of context and
    candidate article
  • Combined: voting
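
A minimal sketch of the similarity method and of a voting rule. Word overlap stands in for the similarity measure. The rule that an annotation is kept only when two methods agree is an assumption, since the slide gives no detail on how the vote works. Function names and article texts are hypothetical.

```python
def overlap_disambiguate(context, candidate_articles):
    """Pick the candidate whose article text shares the most words with the context."""
    context_words = set(context.lower().split())

    def overlap(title):
        article_words = set(candidate_articles[title].lower().split())
        return len(context_words & article_words)

    return max(candidate_articles, key=overlap)


def vote(label_a, label_b):
    """Keep a label only when both methods agree; otherwise abstain (None)."""
    return label_a if label_a == label_b else None


# Toy candidate articles (hypothetical one-line summaries).
candidates = {
    "bar (counter)": "a counter where drinks and coffee are served in a shop",
    "bar (music)": "a segment of music holding a number of beats",
}
context = "in most shops a quick coffee while standing up at the bar is possible"
similarity_choice = overlap_disambiguate(context, candidates)
```

Abstaining on disagreement trades recall for precision, since fewer links are produced but each has support from both methods.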

25
Results
26
Wikification
27
Wikify! system (http://lit.csci.unt.edu/wikify/
or www.wikifyer.com)
28
Overall System Evaluation
  • Turing-like test
  • Annotation of educational materials

29
Turing-like Test
  • Given a Wikipedia article, decide if it was
    annotated by humans or our automated system

Automatically extracted Wikipedia annotations
The United States of America is a federal constitutional republic comprising fifty states and a federal district. The country is situated almost entirely in the western hemisphere: its forty-eight contiguous states and Washington, D.C., the capital district, lie in central North America between the Pacific and Atlantic Oceans, bordered by Canada to the north and Mexico to the south; the state of Alaska is in the northwest of the continent with Canada to its east, and the state of Hawaii is in the mid-Pacific.
30
Turing-like Test
  • 20 test subjects (mixed background)
  • 10 document pairs for each subject (side by side)
  • Average accuracy: 57%
  • Ideal case: 50% success rate (total confusion)

31
Likert Scale
32
Annotation of Educational Materials
  • Studies in cognitive science
  • An important part of the learning process is the
    ability to connect the learning material to the
    prior knowledge of the learner (Walter Kintsch,
    1998)
  • Amount of required background material
  • Depends on the level of explicitness of the text
  • Knowledge of the learner
  • Low-knowledge vs. high-knowledge learners
  • Use the text wikifier to facilitate access to
    background knowledge

33
A History Test
  • A test consisting of 14 questions from a quiz
    from an online history course at UNT
  • Multiple-choice questions
  • Half the questions linked to Wikipedia, half left
    in their original format
  • 60 students taking the test
  • Randomly either the first or the last 7 questions
    were wikified
  • Students were instructed
  • they were allowed to use any information they
    wanted to answer the questions
  • they were not required to use the Wikipedia links

35
Results
(p < 0.1)
(p < 0.05)
36
Lessons Learned
  • Wikipedia can be used as a source of evidence for
    text processing tasks
  • Keyword extraction
  • Word sense disambiguation
  • Text wikification: linking documents to
    encyclopedic knowledge
  • Enrich educational materials
  • Annotation of web pages (semantic web)
  • NLP applications
  • summarization, information retrieval, text
    categorization
  • text adaptation, topic identification,
    multilingual semantic networks

37
Ongoing Work: Text Adaptation
  • Planning for a Long Trip (Magellan's Stories)
  • Serrao's letters helped build in my mind the
    location of the Spice Islands, which later became
    the destination for my great voyage. I asked the
    King of Portugal to support my journey, but he
    refused. After that, I begged the King of Spain.
    He was interested in my plan since Spain was
    looking for a better sea route to Asia than the
    Portuguese route around the southern tip of
    Africa. It was going to be hard to find sailors,
    though. None of the Spanish sailors wanted to
    sail with me because I was Portuguese.

Def: long trip with a specific objective, esp. by sea or air. En: trip, journey. Es: travesía, viaje.
Def: to travel by boat. En: navigate. Es: salir, navegar.
Funded by the National Science Foundation under
CAREER IIS-0747340, 2008-2013. Collaboration with
Educational Testing Service (ETS).
38
Ongoing Work: Topic Identification
  • Automatic identification of the topic/category of
    a text (e.g., computer science, psychology)
  • Books
  • Learning objects

Scored articles and categories (from the slide's graph):
  • Vietnam War 0.0023
  • Category: Wars Involving the United States 0.00779
  • United States 0.3793
  • World War I 0.0023
  • Ronald Reagan 0.0027
  • Communism 0.0027
  • Category: Global Conflicts 0.00779
  • Cold War 0.3111
  • Mikhail Gorbachev 0.0023
Example text: The United States was involved in the Cold War.
Funded by the Texas Higher Education Coord.
Board, Google 2008-2010
39
Ongoing Work: Multilingual Semantic Networks
Concept nodes (from the slide's graph):
  • MUSICIAN: En musician, Fr musicien, De Musiker
  • ORCHESTRA: En orchestra, Fr orchestre, De orchester
  • COMPOSER: En composer, Fr compositeur, De komponist
  • PIANIST: En pianist, Fr pianiste, De pianist
  • CONDUCTOR: En conductor, Fr chef d'orchestre, De Dirigent
  • BOSTON POPS ORCHESTRA: En Boston Pops Orchestra, Fr Orchestre Boston Pops
  • JOHN WILLIAMS: En John Williams, Williams; Fr John Williams; De John Williams
  • CONDUCTOR OF THE BOSTON POPS ORCHESTRA: En conductor of the Boston Pops Orchestra; Fr chef d'orchestre de l'Orchestre Boston Pops; De Dirigent de Boston Pops Orchestra
Edge labels: isA, instanceOf, partOf
Example text: John Williams served as the principal conductor
of the Boston Pops Orchestra
Funded by the National Science Foundation under
IIS-1018613, 2010-2013
40
Thank You! Questions?
41
Wikipedia for Natural Language Processing
  • Word similarity
  • (Strube & Ponzetto, 2006)
  • (Gabrilovich & Markovitch, 2007)
  • Text categorization
  • (Gabrilovich & Markovitch, 2006)
  • Named entity disambiguation
  • (Bunescu & Pasca, 2006)

42
Wikipedia vs. WordNet (Senseval)
  • Different granularity
  • Coarser grained senses in Wikipedia
  • Missing senses: atmosphere (ambiance)
  • Coarse distinctions (grasp): act of grasping (1),
    hold (2)
  • Exceptions: dance performance, theatre
    performance
  • Wikipedia vs. Senseval: different sense
    distribution
  • Low sense distribution correlation: r = 0.51
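
A minimal sketch of how a correlation between two sense distributions could be computed. It assumes r denotes the Pearson coefficient, which the slide does not state. The two distributions are made-up values, not the Wikipedia or Senseval figures.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical relative frequencies of three senses in two corpora.
wikipedia_distribution = [0.7, 0.2, 0.1]
senseval_distribution = [0.3, 0.5, 0.2]
r = pearson_r(wikipedia_distribution, senseval_distribution)
```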

43
Sense Disambiguation Learning Curve
  • Disambiguation accuracy using 10%, 20%, ..., 100% of
    the data

48
Lexical Semantics
  • Find the meaning of all words in unrestricted
    text
  • Required for automatic machine translation,
    information retrieval, text understanding
  • SenseLearner: minimally supervised learning
  • Senseval-2, Senseval-3, Semeval (Semeval @ ACL
    2007)
  • Publicly available: http://lit.csci.unt.edu/senselearner
  • GWSD: unsupervised graph-based algorithms
  • Random walks on text structures
  • Find the most central meanings in a text
  • http://lit.csci.unt.edu/index.php/Downloads
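
A minimal sketch of random-walk centrality over a graph of candidate senses. It uses PageRank as one common random-walk measure, which is an assumption: the slide does not name the algorithm GWSD uses. The graph, node names, and parameters are hypothetical.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping each node to the list of nodes it points to.

    Every node needs at least one outgoing edge in this sketch.
    """
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Sum the rank flowing in from every node that points here.
            incoming = sum(
                rank[other] / len(graph[other])
                for other in nodes
                if node in graph[other]
            )
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Toy sense graph: edges connect senses that are related in the text.
graph = {
    "bar (counter)": ["coffee (drink)", "shop (retail)"],
    "coffee (drink)": ["bar (counter)", "shop (retail)"],
    "shop (retail)": ["bar (counter)", "coffee (drink)"],
    "bar (music)": ["coffee (drink)"],
}
ranks = pagerank(graph)
best_sense_of_bar = max(("bar (counter)", "bar (music)"), key=ranks.get)
```

The sense of "bar" that other senses point to accumulates more rank, so it is treated as the more central meaning in this toy text.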

Funded by the National Science Foundation
49
Lexical Semantics
  • Lexical substitution: SubFinder
  • Find semantically-equivalent substitutes for a
    target word in a given context
  • Combine corpus-based and knowledge-based
    approaches
  • Combine monolingual and multilingual resources
  • WordNet, Encarta, bilingual dictionaries, large
    corpora
  • Fared well in the Semeval 2007 lexical
    substitution task
  • TransFinder
  • Find the translation of a target word in a given
    context
  • Assist Hispanic students with the understanding
    of English texts
  • Task at Semeval 2010

Funded by the National Science Foundation
50
Lexical Semantics
  • Text-to-text semantic similarity
  • Find if two pieces of text contain the same
    information
  • Useful for information retrieval (search
    engines), text summarization
  • Focus on automatic student answer grading
  • Given the instructor answer and the student
    answer, assign a grade and identify potential
    misunderstandings and areas that need
    clarifications
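
A minimal sketch of one simple text-to-text similarity measure: cosine similarity over bag-of-words counts. It is a baseline illustration only, with a hypothetical function name and toy answers. The slides do not say which measures the grading work uses.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy instructor and student answers.
instructor = "a stack removes the most recently added element first"
student = "the stack removes the element added most recently"
off_topic = "queues serve requests in arrival order"
close = cosine_similarity(instructor, student)
far = cosine_similarity(instructor, off_topic)
```

Word overlap ignores synonyms and word order, which is one reason semantic similarity measures are needed for grading.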

Funded by the National Science Foundation
51
Metadata Annotation for Learning Object
Repositories
  • Learning object repositories support sharing and
    reuse of educational materials
  • Identify keywords and related concepts for the
    automatic annotation of learning object
    repositories
  • Keyword extraction using
  • Graph-based algorithms
  • Knowledge drawn from Wikipedia

Funded by the Texas Higher Education Coordinating
Board (THECB)
52
Sentiment and Subjectivity
  • Add subjectivity and sentiment labels to word
    senses
  • Important for automatic analysis of political
    opinions, product reviews, market research
  • Collaboration with Jan Wiebe, U. Pittsburgh
  • Automatic assignment of subjectivity to word
    senses
  • Projection of subjectivity annotations and
    resources to other languages
  • Via parallel texts / bilingual dictionaries
  • Via machine translation
  • Bootstrapping of subjectivity / sentiment seeds
    using propagation on graphs and word similarity

Funded by the National Science Foundation
53
Sentiment and Subjectivity
  • Affective text
  • Automatic annotation of emotions in text
  • Anger, disgust, fear, joy, sadness, surprise
  • Collaboration with Carlo Strapparava, IRST
  • Large data sets constructed
  • Computational humour
  • Learning to recognize humour
  • Identification of connections with other
    linguistic properties: affect, valence, semantic
    classes

54
Text-to-image Synthesis
  • Language learning
  • Children
  • Second (foreign) language
  • People with language disorders
  • International language-independent knowledge base
  • Pictures are transparent to languages
  • Applications
  • Pictorial translations (Letters to my cousin)
  • Bridge the gap between research in image and text
    processing
  • Image retrieval/classification, natural language

55
pictorial representations
  • Typical entry in a dictionary
  • pipe, tobacco pipe
  • a tube with a small bowl at one end used for
    smoking tobacco
  • pipe, pipage, piping
  • a long tube made of metal or plastic that is used
    to carry water or oil or gas etc.
  • pipe, tabor pipe
  • a tubular wind instrument