Title: Natural Language Processing using Wikipedia
1Natural Language Processing using Wikipedia
- Rada Mihalcea
- University of North Texas
2Text Wikification
- Finding key terms in documents and linking them
to relevant encyclopedic information.
3Text Wikification (continued)
- Motivation
- Help Wikipedia contributors
- NLP applications (summarization, text categorization, metadata annotation, text similarity)
- Enrich educational materials
- Annotating web pages (semantic web)
- Combined problem
- Finding the important concepts
- Keyword extraction
- Finding the correct article
- Word sense disambiguation
4Wikification pipeline
5Keyword Extraction
- Finding important words/phrases in raw text
- Two-stage process
- Candidate extraction
- Typical methods: n-grams, noun phrases
- Candidate ranking
- Rank the candidates by importance
- Typical methods
- Unsupervised: information theoretic
- Supervised: machine learning using positional and linguistic features
6Keyword Extraction using Wikipedia
- 1. Candidate extraction
- Semi-controlled vocabulary
- Wikipedia article titles and anchor texts (surface forms)
- E.g., USA, U.S., United States of America
- More than 2,000,000 terms/phrases
- Vocabulary is broad (e.g., "the" and "a" are included)
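The candidate extraction step can be sketched as matching every word n-gram of the input against the controlled vocabulary. This is a minimal sketch, not the Wikify! implementation: the function name `extract_candidates`, the `max_len` cutoff, the naive tokenizer, and the toy vocabulary are all assumptions made for illustration.

```python
def extract_candidates(text, vocabulary, max_len=4):
    """Return every word n-gram (up to max_len words) that is in the vocabulary."""
    # Naive whitespace tokenization with surrounding punctuation stripped;
    # a real system would use a proper tokenizer.
    tokens = [t.strip(".,;:!?()\"'") for t in text.split()]
    tokens = [t for t in tokens if t]
    vocab = {v.lower() for v in vocabulary}
    candidates = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            ngram = " ".join(tokens[start:start + length])
            if ngram.lower() in vocab:
                candidates.append(ngram)
    return candidates


# Hypothetical toy vocabulary standing in for article titles and anchor texts.
VOCAB = {"United States of America", "United States", "USA", "republic", "the"}
found = extract_candidates("The United States of America is a republic.", VOCAB)
print(found)
```

Because the vocabulary is broad, function words such as "The" are returned as candidates too, which is why a separate ranking step is needed.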
7Keyword Extraction using Wikipedia
- 2. Candidate ranking
- tf-idf
- Wikipedia articles as document collection
- Chi-squared independence of phrase and text
- The degree to which it appeared more times than expected by chance
- Keyphraseness
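The slide does not define keyphraseness. A minimal sketch follows, assuming it is estimated as the fraction of documents containing a phrase in which that phrase is also used as a link. The document representation (a dict with `text` and `links`) and the toy collection are hypothetical.

```python
import re


def keyphraseness(term, documents):
    """Share of documents containing `term` in which `term` is also a link anchor."""
    pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
    appears = linked = 0
    for doc in documents:
        if pattern.search(doc["text"].lower()):
            appears += 1
            if term.lower() in {anchor.lower() for anchor in doc["links"]}:
                linked += 1
    return linked / appears if appears else 0.0


# Hypothetical toy collection; "links" holds the anchor texts of each document.
DOCS = [
    {"text": "The Cold War shaped the century.", "links": ["Cold War"]},
    {"text": "The Cold War ended in 1991.", "links": ["Cold War"]},
    {"text": "The cold war between the rivals continued.", "links": []},
    {"text": "The weather was mild.", "links": []},
]
ranked = sorted(["the", "Cold War"], key=lambda t: keyphraseness(t, DOCS), reverse=True)
print(ranked)
```

Under this estimate a frequent but never-linked word such as "the" scores zero, so it drops to the bottom of the ranking even though it is in the vocabulary.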
8Evaluations
- Gold standard
- 85 documents containing 7,286 links
- Links selected by Wikipedia users
- Have undergone the continuous editorial process of Wikipedia
- Extract N keywords from the ranking
- N = 6% of the number of words
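The selection and scoring step could look like the sketch below. It assumes N is 6% of the document's word count and that the extracted keywords are scored against the gold-standard links with precision, recall, and F-measure. The slides do not name the metrics, so that choice and all identifiers here are assumptions.

```python
def select_keywords(ranked_candidates, num_words, ratio=0.06):
    """Keep the top N candidates, with N a fixed share of the document length."""
    n = max(1, round(num_words * ratio))
    return ranked_candidates[:n]


def precision_recall_f(predicted, gold):
    """Set-based precision, recall, and F-measure against the gold links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical ranking for a 50-word document and its gold-standard links.
selected = select_keywords(["a", "b", "c", "d"], num_words=50)
p, r, f = precision_recall_f(selected, gold={"a", "c", "e", "f"})
print(selected, p, r, f)
```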
9Results
10Example Keyword Extraction
Automatically extracted Wikipedia annotations
The United States of America is a federal constitutional republic comprising fifty states and a federal district. The country is situated almost entirely in the western hemisphere: its forty-eight contiguous states and Washington, D.C., the capital district, lie in central North America between the Pacific and Atlantic Oceans, bordered by Canada to the north and Mexico to the south; the state of Alaska is in the northwest of the continent with Canada to its east, and the state of Hawaii is in the mid-Pacific.
11Wikification Pipeline
12Word Sense Disambiguation
13
- Aida (café): In most shops a quick coffee while standing up at the bar is possible.
14Wikipedia as a Sense Tagged Corpus
Wikipedia links as sense annotations
- In most shops a quick coffee while standing up at the [[bar (counter)|bar]] is possible.
- A channel is also the natural or man-made deeper course through a reef, [[bar (landform)|bar]], bay, or any shallow body of water.
- Each [[bar (music)|bar]] has a 2-beat unit, a 5-beat unit, and a 3-beat unit, with a stress at the beginning of each unit.
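Reading links as sense annotations can be sketched with a regular expression over wiki markup, assuming the piped link syntax `[[article|anchor text]]`. The helper name `extract_annotations` is hypothetical. The sketch returns the plain sentence together with (anchor, label) pairs.

```python
import re

# [[label|anchor]] or [[label]]; nested links and templates are not handled.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_annotations(paragraph):
    """Return (plain_text, [(anchor, label), ...]) for one paragraph of wiki markup."""
    annotations = []

    def replace(match):
        label = match.group(1).strip()
        anchor = (match.group(2) or label).strip()
        annotations.append((anchor, label))
        return anchor  # keep only the visible text in the plain sentence

    plain = LINK.sub(replace, paragraph)
    return plain, annotations


markup = ("In most shops a quick coffee while standing up at the "
          "[[bar (counter)|bar]] is possible.")
plain, annotations = extract_annotations(markup)
print(plain)
print(annotations)
```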
15Sense Inventory
- Alternative 1: disambiguation webpages
- Does not include all possible annotations
- [[measure (music)|bar]]: measure (music) not listed
- Inconsistent
- identifier of disambiguation page: paper (disambiguation) vs. paper
- Alternative 2: extract all link annotations
- bar (counter), bar (music), bar (landform)
- map them to WordNet senses
16Building a Sense Tagged Corpus
- Given ambiguous word W
- Extract all the paragraphs in Wikipedia containing the ambiguous word W inside a link
- Collect all the possible Wikipedia labels: the leftmost component of each link
- Map the Wikipedia labels to WordNet senses
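The three steps above can be sketched as a grouping pass over already-extracted link annotations. The input format (anchor, label, paragraph), the sense identifiers, and the hand-written label-to-sense table are all hypothetical; the slides describe the mapping to WordNet only at a high level.

```python
from collections import defaultdict

# Hypothetical manual mapping from Wikipedia labels to sense identifiers.
# The identifiers are illustrative and are not real WordNet sense keys.
LABEL_TO_SENSE = {
    "bar (counter)": "bar#counter",
    "bar (music)": "bar#music",
    "measure (music)": "bar#music",
    "bar (landform)": "bar#landform",
}


def build_sense_corpus(annotated, word, label_to_sense):
    """Group paragraphs by sense for one ambiguous word.

    annotated: iterable of (anchor, label, paragraph) tuples.
    """
    corpus = defaultdict(list)
    for anchor, label, paragraph in annotated:
        if anchor.lower() != word:
            continue  # the ambiguous word must be the linked text
        sense = label_to_sense.get(label.lower())
        if sense is None:
            continue  # unmapped label, e.g. a link to the ambiguous page itself
        corpus[sense].append(paragraph)
    return dict(corpus)


ANNOTATED = [
    ("bar", "bar (counter)", "A quick coffee while standing up at the bar is possible."),
    ("bar", "bar (music)", "Each bar has a 2-beat unit."),
    ("bar", "measure (music)", "The final bar is repeated."),
    ("bar", "bar", "A paragraph whose link stays ambiguous."),
]
corpus = build_sense_corpus(ANNOTATED, "bar", LABEL_TO_SENSE)
print({sense: len(examples) for sense, examples in corpus.items()})
```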
17An Example
- Given ambiguous word W = BAR
- Extract all the paragraphs in Wikipedia containing the ambiguous word W inside a link
- 1,217 paragraphs
- remove examples with [[bar]] (ambiguous): 1,108 examples
- Collect all the possible Wikipedia labels: the leftmost component of each link
- 40 Wikipedia labels
- bar (music), measure music, musical notation
- Map the Wikipedia labels to WordNet senses
- 9 WordNet senses
18
Word sense | Wikipedia label | Wikipedia definition | WordNet definition
bar (counter) | bar_(counter) | The counter from which drinks are dispensed | A counter where you can obtain food or drink
bar (music) | bar_(music), measure_music, musical_notation | A period of music | Musical notation for a repeating pattern of musical beats
bar (landform) | bar_(landform) | A type of beach behind which lies a lagoon | A submerged (or partly submerged) ridge in a river or along a shore
19Supervised Word Sense Disambiguation
- Local and topical features in a Naïve Bayes classifier
- Good performance on Senseval-2 and Senseval-3 data
- Local features
- Current word and part-of-speech
- Surrounding context of three words
- Collocational features
- Topical features
- Five keywords per sense, occurring at least three times
- (Ng & Lee, 1996), (Lee & Ng, 2002)
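A rough sketch of such a classifier follows, assuming a multinomial Naïve Bayes with add-one smoothing over string-valued features. It covers the current word, a three-word context window, and topical keywords. Part-of-speech and collocational features are omitted because they would need a tagger, and every name and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def local_features(tokens, index, window=3):
    """Current word plus position-tagged words within `window` of it."""
    feats = ["word=" + tokens[index].lower()]
    for offset in range(-window, window + 1):
        pos = index + offset
        if offset != 0 and 0 <= pos < len(tokens):
            feats.append(f"ctx[{offset}]={tokens[pos].lower()}")
    return feats


def topical_keywords(examples, per_sense=5, min_count=3):
    """Up to `per_sense` most frequent context words per sense, seen >= min_count times."""
    counts = defaultdict(Counter)
    for tokens, index, sense in examples:
        counts[sense].update(t.lower() for i, t in enumerate(tokens) if i != index)
    keywords = set()
    for counter in counts.values():
        frequent = [w for w, c in counter.most_common() if c >= min_count]
        keywords.update(frequent[:per_sense])
    return keywords


def features(tokens, index, keywords):
    present = keywords & {t.lower() for t in tokens}
    return local_features(tokens, index) + ["topic=" + w for w in sorted(present)]


class NaiveBayes:
    def fit(self, feature_lists, labels):
        self.label_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(feature_lists, labels):
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for f in feats:  # add-one smoothing for unseen features
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Hypothetical toy training set: (tokens, index of the ambiguous word, sense).
TRAIN = [
    ("he ordered a drink at the bar".split(), 6, "counter"),
    ("the bar served coffee and beer".split(), 1, "counter"),
    ("each bar has four beats".split(), 1, "music"),
    ("the melody repeats every bar of music".split(), 4, "music"),
]
KEYWORDS = topical_keywords(TRAIN, min_count=1)  # threshold lowered for the toy data
model = NaiveBayes().fit([features(t, i, KEYWORDS) for t, i, _ in TRAIN],
                         [s for _, _, s in TRAIN])
prediction = model.predict(features("she drank coffee at the bar".split(), 5, KEYWORDS))
print(prediction)
```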
20Experiments on Senseval-2 / Senseval-3
- Lexical sample WSD
- 49 ambiguous nouns from Senseval-2 (29), Senseval-3 (20)
- Remove the words with one Wikipedia sense
- detention
- Remove the words with all Wikipedia senses mapped to one WordNet sense
- Roman church, Catholic church → Catholic church
- Final set: 30 nouns with Wikipedia labels mapped to at least two WordNet senses
21
- Ten-fold cross validations
- WSD: Supervised word sense disambiguation on Wikipedia sense tagged corpora
- MFS: Most frequent sense; choose the most frequent sense by default
- Similarity: Similarity between current example and training data available for each sense
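The most-frequent-sense baseline under k-fold cross validation can be sketched as follows. The interleaved fold assignment and the toy labels are assumptions; the slides do not say how the folds were built.

```python
from collections import Counter


def k_folds(n_items, k=10):
    """Yield (train_indices, test_indices) for k interleaved folds."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


def mfs_accuracy(labels, k=10):
    """Accuracy of always predicting the most frequent sense of the training fold."""
    correct = 0
    for train, test in k_folds(len(labels), k):
        most_frequent = Counter(labels[i] for i in train).most_common(1)[0][0]
        correct += sum(labels[i] == most_frequent for i in test)
    return correct / len(labels)


# Hypothetical sense labels for ten examples of one ambiguous word.
LABELS = ["a"] * 8 + ["b"] * 2
accuracy = mfs_accuracy(LABELS, k=5)
print(accuracy)
```

A supervised system is only useful when it beats this number, which is hard when the sense distribution is heavily skewed.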
22Results on Senseval-2 / Senseval-3
23Some Notes
- Words with no improvement
- Small number of examples in Wikipedia
- restraint (9), shelter (17)
- Skewed sense distributions
- bank: 1044 occurrences as financial institution, 30 occurrences as river bank
- Different granularity
- Coarser grained senses in Wikipedia
- Missing senses: atmosphere (ambiance)
- Coarse distinctions: grasp: act of grasping (1), hold (2)
- Exceptions: dance performance, theatre performance
24Experiments on Wikipedia
- All-words WSD
- Link disambiguation
- Find the link assigned by the Wikipedia annotators
- Data set
- The same data set used in keyword evaluation
- 85 documents containing 7,286 links
- Three methods
- Supervised
- Similarity
- Unsupervised: measure similarity of context and candidate article
- Combined: voting
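The similarity method and the combination step might be sketched as below. The sketch assumes similarity is plain content-word overlap between the context and each candidate article, and that "voting" means keeping a label only when both methods agree. Both are assumptions, and the stopword list and article snippets are hypothetical.

```python
STOPWORDS = {"the", "a", "an", "of", "in", "at", "is", "and", "or", "to"}


def content_words(text):
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def similarity_choice(context, candidates):
    """Pick the candidate label whose article shares the most words with the context.

    candidates: dict mapping a Wikipedia label to (a snippet of) its article text.
    """
    ctx = content_words(context)
    return max(candidates, key=lambda label: len(ctx & content_words(candidates[label])))


def combined_vote(supervised_label, similarity_label):
    """Return the label when both methods agree, otherwise None (no annotation)."""
    return supervised_label if supervised_label == similarity_label else None


# Hypothetical article snippets for two candidate senses of "bar".
CANDIDATES = {
    "bar (counter)": "The counter from which drinks are dispensed, such as coffee in shops",
    "bar (music)": "A period of music with a fixed number of beats",
}
CONTEXT = "In most shops a quick coffee while standing up at the bar is possible."
choice = similarity_choice(CONTEXT, CANDIDATES)
print(choice, combined_vote("bar (counter)", choice))
```

Requiring agreement trades recall for precision: fewer links are annotated, but those that are come with two independent sources of evidence.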
25Results
26Wikification
27Wikify! system (http://lit.csci.unt.edu/wikify/ or www.wikifyer.com)
28Overall System Evaluation
- Turing-like test
- Annotation of educational materials
29Turing-like Test
- Given a Wikipedia article, decide if it was
annotated by humans or our automated system
Automatically extracted Wikipedia annotations
The United States of America is a federal constitutional republic comprising fifty states and a federal district. The country is situated almost entirely in the western hemisphere: its forty-eight contiguous states and Washington, D.C., the capital district, lie in central North America between the Pacific and Atlantic Oceans, bordered by Canada to the north and Mexico to the south; the state of Alaska is in the northwest of the continent with Canada to its east, and the state of Hawaii is in the mid-Pacific.
30Turing-like Test
- 20 test subjects (mixed background)
- 10 document pairs for each subject (side by side)
- Average accuracy: 57%
- Ideal case: 50% success rate (total confusion)
31Likert Scale
32Annotation of Educational Materials
- Studies in cognitive science
- An important part of the learning process is the ability to connect the learning material to the prior knowledge of the learner (Walter Kintsch, 1998)
- Amount of required background material
- Depends on the level of explicitness of the text
- Knowledge of the learner
- Low-knowledge vs. high-knowledge learners
- Use the text wikifier to facilitate access to background knowledge
33A History Test
- A test consisting of 14 questions from a quiz from an online history course at UNT
- Multiple-choice questions
- Half the questions linked to Wikipedia, half left in their original format
- 60 students taking the test
- Randomly either the first or the last 7 questions were wikified
- Students were instructed
- they were allowed to use any information they wanted to answer the questions
- they were not required to use the Wikipedia links
35Results
(p < 0.1)
(p < 0.05)
36Lessons Learned
- Wikipedia can be used as a source of evidence for text processing tasks
- Keyword extraction
- Word sense disambiguation
- Text wikification: linking documents to encyclopedic knowledge
- Enrich educational materials
- Annotation of web pages (semantic web)
- NLP applications
- summarization, information retrieval, text categorization
- text adaptation, topic identification, multilingual semantic networks
37Ongoing Work: Text Adaptation
- Planning for a Long Trip (Magellan's Stories)
- Serrao's letters helped build in my mind the location of the Spice Islands, which later became the destination for my great voyage. I asked the King of Portugal to support my journey, but he refused. After that, I begged the King of Spain. He was interested in my plan since Spain was looking for a better sea route to Asia than the Portuguese route around the southern tip of Africa. It was going to be hard to find sailors, though. None of the Spanish sailors wanted to sail with me because I was Portuguese.
Def: long trip with a specific objective, esp. by sea or air; En: trip, journey; Es: travesía, viaje
Def: to travel by boat; En: navigate; Es: salir, navegar
Funded by the National Science Foundation under
CAREER IIS-0747340, 2008-2013 Collaboration with
Educational Testing Service (ETS)
38Ongoing Work: Topic Identification
- Automatic identification of the topic/category of a text (e.g., computer science, psychology)
- Books
- Learning objects
Vietnam War 0.0023
Cat: Wars Involving the United States 0.00779
United States 0.3793
World War I 0.0023
Ronald Reagan 0.0027
Communism 0.0027
Cat: Global Conflicts 0.00779
Cold War 0.3111
Mikhail Gorbachev 0.0023
The United States was involved in the Cold War.
Funded by the Texas Higher Education Coord.
Board, Google 2008-2010
39Ongoing Work: Multilingual Semantic Networks
[Figure: network of the concepts below, linked by isA, instanceOf, and partOf relations]
MUSICIAN: En musician; Fr musicien; De Musiker
ORCHESTRA: En orchestra; Fr orchestre; De orchester
COMPOSER: En composer; Fr compositeur; De komponist
PIANIST: En pianist; Fr pianiste; De pianist
CONDUCTOR: En conductor; Fr chef d'orchestre; De Dirigent
BOSTON POPS ORCHESTRA: En Boston Pops Orchestra; Fr Orchestre Boston Pops
JOHN WILLIAMS: En John Williams, Williams; Fr John Williams; De John Williams
CONDUCTOR OF THE BOSTON POPS ORCHESTRA: En conductor of the Boston Pops Orchestra; Fr chef d'orchestre de l'Orchestre Boston Pops; De Dirigent de Boston Pops Orchestra
John Williams served as the principal conductor of the Boston Pops Orchestra
Funded by the National Science Foundation under
IIS-1018613, 2010-2013
40Thank You! Questions?
41Wikipedia for Natural Language Processing
- Word similarity
- (Strube & Ponzetto, 2006)
- (Gabrilovich & Markovitch, 2007)
- Text categorization
- (Gabrilovich & Markovitch, 2006)
- Named entity disambiguation
- (Bunescu & Pasca, 2006)
42Wikipedia vs. WordNet (Senseval)
- Different granularity
- Coarser grained senses in Wikipedia
- Missing senses: atmosphere (ambiance)
- Coarse distinctions: grasp: act of grasping (1), hold (2)
- Exceptions: dance performance, theatre performance
- Wikipedia vs. Senseval: different sense distribution
- Low sense distribution correlation: r = 0.51
43Sense Disambiguation Learning Curve
- Disambiguation accuracy using 10%, 20%, ..., 100% of the data
48Lexical Semantics
- Find the meaning of all words in unrestricted text
- Required for automatic machine translation, information retrieval, text understanding
- SenseLearner: minimally supervised learning
- Senseval-2, Senseval-3, Semeval (Semeval @ ACL 2007)
- Publicly available: http://lit.csci.unt.edu/senselearner
- GWSD: unsupervised graph-based algorithms
- Random walks on text structures
- Find the most central meanings in a text
- http://lit.csci.unt.edu/index.php/Downloads
Funded by the National Science Foundation
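How a random walk can surface the "most central meanings" may be illustrated with PageRank over a small graph of candidate senses. This is a sketch under assumptions: the slides do not say how GWSD builds or weights its graph, so PageRank stands in as one common random-walk centrality measure, and the toy graph and sense names are hypothetical.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank.

    graph: dict mapping each node to its list of neighbours. Undirected edges
    must be listed in both directions, and every node needs at least one neighbour.
    """
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            incoming = sum(score[m] / len(graph[m]) for m in nodes if n in graph[m])
            new_score[n] = (1 - damping) / len(nodes) + damping * incoming
        score = new_score
    return score


# Hypothetical sense graph: an edge means two senses are semantically related.
GRAPH = {
    "bar#counter": ["coffee#drink", "shop#store"],
    "bar#music": ["coffee#drink"],
    "coffee#drink": ["bar#counter", "bar#music", "shop#store"],
    "shop#store": ["bar#counter", "coffee#drink"],
}
scores = pagerank(GRAPH)
best_bar_sense = max(["bar#counter", "bar#music"], key=scores.get)
print(best_bar_sense)
```

For each ambiguous word, the sense with the highest score among its candidates is chosen, so senses that are well connected to the rest of the text win.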
49Lexical Semantics
- Lexical substitution: SubFinder
- Find semantically-equivalent substitutes for a target word in a given context
- Combine corpus-based and knowledge-based approaches
- Combine monolingual and multilingual resources
- WordNet, Encarta, bilingual dictionaries, large corpora
- Fared well in the Semeval 2007 lexical substitution task
- TransFinder
- Find the translation of a target word in a given context
- Assist Hispanic students with the understanding of English texts
- Task at Semeval 2010
Funded by the National Science Foundation
50Lexical Semantics
- Text-to-text semantic similarity
- Find if two pieces of text contain the same information
- Useful for information retrieval (search engines), text summarization
- Focus on automatic student answer grading
- Given the instructor answer and the student answer, assign a grade and identify potential misunderstandings and areas that need clarifications
Funded by the National Science Foundation
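As a point of reference for answer grading, a purely lexical baseline is sketched below: cosine similarity between bag-of-words vectors of the instructor and student answers. This is not the similarity measure used in the work described; it is the simplest baseline one might compare against, and the example answers are hypothetical.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"


def bag_of_words(text):
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two word-count vectors (0.0 to 1.0)."""
    va, vb = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical instructor answer and two student answers.
INSTRUCTOR = "A stack is a last in first out data structure"
good = cosine_similarity(INSTRUCTOR, "a stack is last in first out")
poor = cosine_similarity(INSTRUCTOR, "a queue stores items")
print(good > poor)
```

Word overlap cannot recognise paraphrases or synonyms, which is the gap that semantic similarity measures are meant to close.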
51Metadata Annotation for Learning Object Repositories
- Learning object repositories support sharing and reuse of educational materials
- Identify keywords and related concepts for the automatic annotation of learning object repositories
- Keyword extraction using
- Graph-based algorithms
- Knowledge drawn from Wikipedia
Funded by the Texas Higher Education Coordinating
Board (THECB)
52Sentiment and Subjectivity
- Add subjectivity and sentiment labels to word senses
- Important for automatic analysis of political opinions, product reviews, market research
- Collaboration with Jan Wiebe, U. Pittsburgh
- Automatic assignment of subjectivity to word senses
- Projection of subjectivity annotations and resources to other languages
- Via parallel texts / bilingual dictionaries
- Via machine translation
- Bootstrapping of subjectivity / sentiment seeds using propagation on graphs and word similarity
Funded by the National Science Foundation
53Sentiment and Subjectivity
- Affective text
- Automatic annotation of emotions in text
- Anger, disgust, fear, joy, sadness, surprise
- Collaboration with Carlo Strapparava, IRST
- Large data sets constructed
- Computational humour
- Learning to recognize humour
- Identification of connections with other linguistic properties: affect, valence, semantic classes
54Text-to-image Synthesis
- Language learning
- Children
- Second (foreign) language
- People with language disorders
- International language-independent knowledge base
- Pictures are transparent to languages
- Applications
- Pictorial translations (Letters to my cousin)
- Bridge the gap between research in image and text processing
- Image retrieval/classification, natural language
55 pictorial representations
- Typical entry in a dictionary
- pipe, tobacco pipe
- a tube with a small bowl at one end used for smoking tobacco
- pipe, pipage, piping
- a long tube made of metal or plastic that is used to carry water or oil or gas etc.
- pipe, tabor pipe
- a tubular wind instrument