1
Knowledge Discovery
ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY
  • Marko Grobelnik, Dunja Mladenic
  • J.Stefan Institute
  • Slovenia

2
Contents
  • Knowledge Discovery
  • Large Scale Topic Ontology population
  • Extraction of Semantic Networks from Text
  • Active Learning for efficient use of human
    interventions
  • Methods Addressing Different Aspects of Ontology
    Construction
  • Final Remarks

3
Why is Knowledge Discovery appropriate for
Semantic Web?
  • Idea: let a computer search for knowledge, while
    humans give only broad directions about where
    and how to search
  • Knowledge discovery (KD) can be defined as a
    research area with several subfields: Machine
    Learning, Data Mining and Databases (Mitchell,
    1997; Fayyad et al., 1996; Witten and Frank,
    1999; Hand et al., 2001)
  • KD techniques
      • are mainly about discovering structure in the data
      • can serve as one of the key mechanisms for
        structuring knowledge into an ontological
        structure that is then used in the knowledge
        management process
  • Data and the corresponding semantic structures change
    over time
      • a sub-field of KD called stream mining deals with
        these kinds of problems
  • The Semantic Web is ultimately concerned with
    real-life data on the web, which grows
    exponentially
      • scalability is one of the central issues in KD

4
Machine Learning view of Ontology Generation
5
Knowledge Discovery Techniques
  • Knowledge discovery technologies can be used to
    support different phases and scenarios of
    ontology generation
  • Observations
      • Completely automatic construction of ontologies
        is in general not possible, for
          • theoretical reasons (e.g., the information
            bottleneck) and
          • practical reasons (e.g., the soft nature of the
            knowledge being conceptualized)
      • Human interventions are necessary but costly in
        terms of resources
          • therefore the technology should help make
            efficient use of human interventions
      • Document databases are the most common data type
        conceptualized in the form of ontologies

6
What is Ontology?
  • In most ML contexts we can refer to an ontology
    as a graph/network structure consisting of
      • a set of concepts (vertices in a graph)
          • each concept Ci is described by a
            membership function ci(x)
      • a set of relations connecting concepts (directed
        edges in a graph)
          • each relation Ri is described by a
            membership function ri(Ci, Cj)
      • a set of instances (data records assigned to
        concepts or relations)
          • each instance Ii is described by a set of
            features Fi,j

7
We have 7 concepts (C1 to C7) and 3 relations
(R1 to R3); each concept and relation is
populated by a number of instances (data records)
[Figure: graph with concept nodes C1 to C7 connected by edges labeled R1, R2 and R3]
8
Ontology Definition
  • An ontology is defined as a tuple of 5 sets of
    objects
      • Ontology = <Classes, Relations, Instances,
        Class-Definitions, Relation-Definitions>
      • in short, O = <C, R, I, CD, RD>
  • where
      • Classes: set of labels Ci
      • Relations: set of labels Ri
      • Instances: set of instance feature vectors Ii
      • Class-Definitions: set of class membership
        functions CDi
      • Relation-Definitions: set of relation membership
        functions RDi
  • the idea is to describe ontology learning
    tasks in the above terms
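
The five-part ontology tuple can be written down as a small data structure. The sketch below is only an illustration: the class name `Ontology`, the `members` helper, the 0.5 threshold and the toy data are assumptions, not part of the original slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Instance = Sequence[float]  # one instance = one feature vector


@dataclass
class Ontology:
    classes: List[str]                                      # C: class labels
    relations: List[str]                                    # R: relation labels
    instances: List[Instance]                               # I: instance feature vectors
    class_defs: Dict[str, Callable[[Instance], float]]      # CD: class membership functions
    relation_defs: Dict[str, Callable[[str, str], float]]   # RD: relation membership functions

    def members(self, label: str, threshold: float = 0.5) -> List[Instance]:
        """Instances whose membership in class `label` reaches the threshold."""
        cd = self.class_defs[label]
        return [x for x in self.instances if cd(x) >= threshold]


# Toy ontology (hypothetical data): two classes over 1-dimensional instances.
toy = Ontology(
    classes=["small", "large"],
    relations=["smaller-than"],
    instances=[(1.0,), (2.0,), (9.0,)],
    class_defs={
        "small": lambda x: 1.0 if x[0] < 5 else 0.0,
        "large": lambda x: 1.0 if x[0] >= 5 else 0.0,
    },
    relation_defs={
        "smaller-than": lambda a, b: 1.0 if (a, b) == ("small", "large") else 0.0,
    },
)
print(toy.members("small"))
```

Keeping the membership functions as plain callables leaves the representation language open, so the same structure could hold linear models, rules or any other learned function.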

9
Ontology Learning
  • Ontology learning is a set of tasks based on the
    previous ontology definition
  • We define ontology learning tasks in terms of
    mappings between ontology components, where some
    of the components are given, some are missing,
    and we want to induce the missing ones
  • Some typical scenarios
      • Inducing classes / clustering of instances
          • <C, CD> = f(I)
      • Ontology population
          • <CD, RD> = f(C, R, I)
      • Ontology generation
          • <C, R, CD, RD> = f(I) (hardest task)
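
As an illustration of the first scenario (inducing classes from instances), the sketch below maps a set of instances I to class labels C and class definitions CD using plain k-means. The choice of k-means, the function name `induce_classes`, the naive initialization and the toy data are all assumptions made for this example; the slides do not prescribe a clustering method.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def induce_classes(
    instances: List[Vector], k: int, iters: int = 20
) -> Tuple[List[str], Dict[str, Callable[[Vector], float]]]:
    """Induce class labels C and class definitions CD from instances I."""
    centroids = [list(x) for x in instances[:k]]  # naive deterministic init
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for x in instances:
            nearest = min(range(k), key=lambda j: sq_dist(x, centroids[j]))
            groups[nearest].append(x)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]

    labels = [f"C{j + 1}" for j in range(k)]

    def make_cd(j: int) -> Callable[[Vector], float]:
        # Crisp membership: 1.0 if centroid j is the nearest one, else 0.0.
        def cd(x: Vector) -> float:
            nearest = min(range(k), key=lambda i: sq_dist(x, centroids[i]))
            return 1.0 if nearest == j else 0.0
        return cd

    return labels, {labels[j]: make_cd(j) for j in range(k)}


instances = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
C, CD = induce_classes(instances, k=2)
print(C, CD["C1"]((0.1, 0.0)), CD["C2"]((4.8, 5.0)))
```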

10
Representational language
  • When learning a function f, we need
    to select a language for representing the
    membership function f
  • Examples
      • Linear functions (Support Vector Machines, ...)
      • Propositional logic (decision trees, rules, ...)
      • First-order logic (Inductive Logic Programming)
  • By selecting different representation languages
    we decide on
      • the power of the descriptions
      • the complexity of computation

11
Ontology Quality
  • For the same set of instances I we can have
    multiple ontologies O_I = <C, R, I, CD, RD>
  • We need a function q for measuring the quality of
    a given ontology O_I
      • function q returns a numerical value
      • the best ontology is the one with the highest
        quality
  • Possible evaluation measures
  • (1) analysis of statistical properties of
    structured data,
  • (2) agreement to the properties derived from
    manually built ontologies,
  • (3) optimization of efficiency of the user's
    behaviour when using an ontology,
  • (4) using background knowledge, and
  • (5) building hybrid measures (combination of
    various approaches).

12
Search for optimal Ontology
  • Given a set of instances I, we develop a series of
    ontologies
      • O1, O2, O3, ...
      • where we have a set of transformation operators
        (refinement operators) going from Oi to Oi+1
  • A good search procedure would select
    transformations that lead efficiently
    towards the highest quality q(Oi)
      • this formulation is in line with machine
        learning with structured output
      • we could put a human in the loop by using active
        learning techniques
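
One simple way to read this search formulation is as greedy hill climbing: apply every refinement operator, keep the candidate with the highest quality q, and stop when nothing improves. The sketch below assumes exactly that strategy. The function name `greedy_search` and the toy setting, where an "ontology" is reduced to a single integer standing for its number of classes, are hypothetical stand-ins.

```python
from typing import Callable, List, Sequence, TypeVar

O = TypeVar("O")  # whatever type represents an ontology


def greedy_search(
    start: O,
    operators: Sequence[Callable[[O], O]],
    q: Callable[[O], float],
    max_steps: int = 100,
) -> List[O]:
    """Return the series O1, O2, ... produced by greedy refinement."""
    current, history = start, [start]
    for _ in range(max_steps):
        candidates = [op(current) for op in operators]
        if not candidates:
            break
        best = max(candidates, key=q)
        if q(best) <= q(current):
            break  # no operator improves the quality
        current = best
        history.append(current)
    return history


# Toy stand-in: the "ontology" is just its number of classes, and
# quality peaks at 3 classes.
operators = [lambda k: k + 1, lambda k: max(1, k - 1)]
quality = lambda k: -(k - 3) ** 2
series = greedy_search(1, operators, quality)
print(series)
```

A human in the loop could be modelled here by letting a person pick among the top few candidates instead of always taking the maximum.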

14
Large Scale Topic Ontology population
15
Text categorization into large topic ontology
  • Categorization of documents into a large topic
    ontology is one of the problems in text mining
  • needs to be scalable
      • e.g. being able to handle DMoz's 600K categories
        and 4M docs
  • needs to be accurate
      • with accuracy at the level of inter-human
        agreement (60-80%)
  • needs to be robust
      • taking into account the nature of web pages
        (typically mixed-quality content and often
        high-quality context)

16
Approaches for handling hierarchy of categories
  • There are several topic ontologies (taxonomies)
    of textual documents
      • Yahoo, DMoz, Medline, ...
  • Different people use different approaches
      • a series of hierarchically organized classifiers
      • a set of independent classifiers just for leaves
      • a set of independent classifiers for all nodes

17
Yahoo! topic ontology (taxonomy)
  • human constructed hierarchy of Web-documents
  • exists in several languages
  • easy to access and regularly updated
  • captures most of the Web topics
  • English version includes over 2M pages
    categorized into 50,000 categories
  • contains about 250Mb of HTML-files

18
Document to categorize: CFP for CoNLL-2000
19
Some predicted categories
20
System architecture
[Diagram: labeled documents (from the Yahoo! hierarchy) on the Web go through feature construction (vectors of n-grams), subproblem definition, feature selection and classifier construction; the resulting document classifier assigns a category (label) to an unlabeled document]
21
Content categories
  • For each content category, generate a separate
    classifier that predicts the probability that a new
    document belongs to that category

22
Summary of experimental results on Yahoo!
23
DMoz / ODP is the largest topic ontology on the
web: 4M sites, 68k editors, 600k concepts
24
Categorization into DMoz
  • As input we take the DMoz RDF taxonomy data
      • from http://rdf.dmoz.org/
      • we preprocess it into an efficient binary structure
      • next, we build a classification model consisting
        of models for individual categories
      • we take the hierarchical nature into account
  • Using the classification model, we classify new
    documents into the taxonomy
  • As output we get, for a given document text and
    URL
      • a set of the most relevant categories from DMoz
      • a set of the most relevant keywords calculated from
        DMoz category names (segments of the path names)
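
The keyword output can be pictured as follows: split the path of each predicted category into segments and accumulate the category scores per segment. This is only a guess at the idea. The scoring scheme, the function name `keywords_from_categories`, and the example paths and scores are assumptions, not the system's actual method.

```python
from collections import Counter
from typing import List, Tuple


def keywords_from_categories(
    ranked: List[Tuple[str, float]], top_n: int = 5
) -> List[str]:
    """Rank path segments by the summed score of the categories they occur in."""
    weights: Counter = Counter()
    for path, score in ranked:
        for segment in path.split("/"):
            if segment and segment != "Top":  # skip the root of the taxonomy
                weights[segment.replace("_", " ")] += score
    return [keyword for keyword, _ in weights.most_common(top_n)]


# Hypothetical classifier output: (category path, relevance score).
ranked = [
    ("Top/Science/Astronomy/Telescopes", 0.9),
    ("Top/Science/Technology/Space", 0.6),
    ("Top/Science/Astronomy", 0.4),
]
print(keywords_from_categories(ranked, top_n=3))
```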

25
What is used for learning?
  • Currently the system uses hierarchical nearest
    neighbor
      • in the past we experimented with Naïve Bayes
        for the Yahoo taxonomy
        (http://kt.ijs.si/Dunja/yplanet.html)
          • heavy feature selection was needed
      • we plan to experiment with Support Vector
        Machine (SVM) algorithms
      • we plan to use this for the ACM KDD Cup 2005
        Challenge
  • Scalability is a problem for learning and
    classification when dealing with 600K classes and
    4M documents
  • The approaches still need to be properly evaluated
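
The slides do not say how the hierarchical nearest neighbor works. One plausible reading, sketched below purely as an assumption, is a top-down descent through the taxonomy: each category has a centroid vector, and at every level the document moves to the child whose centroid is most similar under cosine similarity. The names `Node` and `classify` and the toy taxonomy are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

Vec = Dict[str, float]  # sparse bag-of-words vector


def cosine(a: Vec, b: Vec) -> float:
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Node:
    name: str
    centroid: Vec
    children: List["Node"] = field(default_factory=list)


def classify(root: Node, doc: Vec) -> str:
    """Descend from the root, always following the most similar child."""
    path, node = [root.name], root
    while node.children:
        node = max(node.children, key=lambda child: cosine(doc, child.centroid))
        path.append(node.name)
    return "/".join(path)


# Hypothetical toy taxonomy with hand-made centroids.
taxonomy = Node("Top", {}, [
    Node("Science", {"telescope": 1.0, "cell": 1.0}, [
        Node("Astronomy", {"telescope": 1.0, "star": 1.0}),
        Node("Biology", {"cell": 1.0}),
    ]),
    Node("Arts", {"music": 1.0, "painting": 1.0}),
])
print(classify(taxonomy, {"telescope": 2.0, "star": 1.0}))
```

Descending level by level compares the document only with the children along one path rather than with every category, which is one way such a scheme could help with scalability.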

26
Performance issues
  • Preprocessing of DMoz (from RDF to classification
    model) takes approx. 1h
  • For classification into the whole DMoz we need
    Win64 with at least 6Gb memory
  • subsets of DMoz run on Win32 with 2Gb
  • Classification into DMoz is fast
  • 20 document classifications per second
  • e.g. whole Wikipedia was classified into DMoz in
    several hours

27
Demos
  • Demo software for classification into
    http://dmoz.org/Science/ is available at
    http://agava.ijs.si/marko/DMozClassifyDemo.zip
    (40Mb)
      • includes an AVI file with a demo movie
      • the demo runs at http://alchemist.ijs.si:11111/
  • Demo for classification into the whole DMoz (all
    600K classes) runs at http://alchemist.ijs.si:22222/

28
Example classification of URL of a web page
[Screenshot: classification of the Hubble telescope web page, showing the resulting keywords and categories]
29
Example classification of URL text of a web page
31
Extracting Semantic Graph from text
32
Summarization with semantic graph (Leskovec,
Grobelnik, Milic-Frayling 2005)
  • Idea: extract a semantic network from text
    documents and identify the relevant parts of the
    semantic network to represent the summary
  • Semantic graph representation is used for the
    summarization task (DUC Challenge)
  • The main research result is the finding that the
    topology of the extracted semantic graph helps in
    determining the importance of the content triples
    (which Subject-Predicate-Object triples are
    relevant)
  • joint collaboration with Microsoft Research,
    Cambridge

33
Approach Description
  • Approach
      • Learn a machine learning model for selecting
        sentences
      • Use information about the semantic structure of the
        document (concepts and relations among concepts)
  • Results are promising
      • achieved 70% recall and 25% precision on
        extracted Subject-Predicate-Object triples on DUC
        (Document Understanding Conference) data

34
Summarization
Human built document summary
Original Document
  • Cracks Appear in U.N. Trade Embargo Against
    Iraq.
  • Cracks appeared Tuesday in the U.N. trade
    embargo against Iraq as Saddam Hussein sought to
    circumvent the economic noose around his country.
    Japan, meanwhile, announced it would increase its
    aid to countries hardest hit by enforcing the
    sanctions. Hoping to defuse criticism that it is
    not doing its share to oppose Baghdad, Japan said
    up to 2 billion in aid may be sent to nations
    most affected by the U.N. embargo on Iraq.
    President Bush on Tuesday night promised a joint
    session of Congress and a nationwide radio and
    television audience that Saddam Hussein will
    fail'' to make his conquest of Kuwait permanent.
    America must stand up to aggression, and we
    will,'' said Bush, who added that the U.S.
    military may remain in the Saudi Arabian desert
    indefinitely. I cannot predict just how long it
    will take to convince Iraq to withdraw from
    Kuwait,'' Bush said. More than 150,000 U.S.
    troops have been sent to the Persian Gulf region
    to deter a possible Iraqi invasion of Saudi
    Arabia. Bush's aides said the president would
    follow his address to Congress with a televised
    message for the Iraqi people, declaring the world
    is united against their government's invasion of
    Kuwait. Saddam had offered Bush time on Iraqi TV.
    The Philippines and Namibia, the first of the
    developing nations to respond to an offer Monday
    by Saddam of free oil _ in exchange for sending
    their own tankers to get it _ said no to the
    Iraqi leader. Saddam's offer was seen as a
    none-too-subtle attempt to bypass the U.N.
    embargo, in effect since four days after Iraq's
    Aug. 2 invasion of Kuwait, by getting poor
    countries to dock their tankers in Iraq. But
    according to a State Department survey, Cuba and
    Romania have struck oil deals with Iraq and
    companies elsewhere are trying to continue trade
    with Baghdad, all in defiance of U.N. sanctions.
    Romania denies the allegation. The report, made
    available to The Associated Press, said some
    Eastern European countries also are trying to
    maintain their military sales to Iraq. A
    well-informed source in Tehran told The
    Associated Press that Iran has agreed to an Iraqi
    request to exchange food and medicine for up to
    200,000 barrels of refined oil a day and cash
    payments. There was no official comment from
    Tehran or Baghdad on the reported food-for-oil
    deal. But the source, who requested anonymity,
    said the deal was struck during Iraqi Foreign
    Minister Tariq Aziz's visit Sunday to Tehran, the
    first by a senior Iraqi official since the
    1980-88 gulf war. After the visit, the two
    countries announced they would resume diplomatic
    relations. Well-informed oil industry sources in
    the region, contacted by The AP, said that
    although Iran is a major oil exporter itself, it
    currently has to import about 150,000 barrels of
    refined oil a day for domestic use because of
    damages to refineries in the gulf war. Along
    similar lines, ABC News reported that following
    Aziz's visit, Iraq is apparently prepared to give
    Iran all the oil it wants to make up for the
    damage Iraq inflicted on Iran during their
    conflict. Secretary of State James A. Baker III,
    meanwhile, met in Moscow with Soviet Foreign
    Minister Eduard Shevardnadze, two days after the
    U.S.-Soviet summit that produced a joint demand
    that Iraq withdraw from Kuwait. During the
    summit, Bush encouraged Mikhail Gorbachev to
    withdraw 190 Soviet military specialists from
    Iraq, where they remain to fulfill contracts.
    Shevardnadze told the Soviet parliament Tuesday
    the specialists had not reneged on those
    contracts for fear it would jeopardize the 5,800
    Soviet citizens in Iraq. In his speech, Bush said
    his heart went out to the families of the
    hundreds of Americans held hostage by Iraq, but
    he declared, Our policy cannot change, and it
    will not change. America and the world will not
    be blackmailed.'' The president added Vital
    issues of principle are at stake. Saddam Hussein
    is literally trying to wipe a country off the
    face of the Earth.'' In other developments _A
    U.S. diplomat in Baghdad said Tuesday up to 800
    Americans and Britons will fly out of
    Iraqi-occupied Kuwait this week, most of them
    women and children leaving their husbands behind.
    Saddam has said he is keeping foreign men as
    human shields against attack. On Monday, a
    planeload of 164 Westerners arrived in Baltimore
    from Iraq. Evacuees spoke of food shortages in
    Kuwait, nighttime gunfire and Iraqi roundups of
    young people suspected of involvement in the
    resistance. There is no law and order,'' said
    Thuraya, 19, who would not give her last name.
    A soldier can rape a father's daughter in front
    of him and he can't do anything about it.'' _The
    State Department said Iraq had told U.S.
    officials that American males residing in Iraq
    and Kuwait who were born in Arab countries will
    be allowed to leave. Iraq generally has not let
    American males leave. It was not known how many
    men the Iraqi move could affect. _A Pentagon
    spokesman said some increase in military
    activity'' had been detected inside Iraq near its
    borders with Turkey and Syria. He said there was
    little indication hostilities are imminent.
    Defense Secretary Dick Cheney said the cost of
    the U.S. military buildup in the Middle East was
    rising above the 1 billion-a-month estimate
    generally used by government officials. He said
    the total cost _ if no shooting war breaks out _
    could total 15 billion in the next fiscal year
    beginning Oct. 1. Cheney promised disgruntled
    lawmakers a significant increase'' in help from
    Arab nations and other U.S. allies for Operation
    Desert Shield. Japan, which has been accused of
    responding too slowly to the crisis in the gulf,
    said Tuesday it may give 2 billion to Egypt,
    Jordan and Turkey, hit hardest by the U.N.
    prohibition on trade with Iraq. The pressure
    from abroad is getting so strong,'' said Hiroyasu
    Horio, an official with the Ministry of
    International Trade and Industry. Local news
    reports said the aid would be extended through
    the World Bank and International Monetary Fund,
    and 600 million would be sent as early as
    mid-September. On Friday, Treasury Secretary
    Nicholas Brady visited Tokyo on a world tour
    seeking 10.5 billion to help Egypt, Jordan and
    Turkey. Japan has already promised a 1 billion
    aid package for multinational peacekeeping forces
    in Saudi Arabia, including food, water, vehicles
    and prefabricated housing for non-military uses.
    But critics in the United States have said Japan
    should do more because its economy depends
    heavily on oil from the Middle East. Japan
    imports 99 percent of its oil. Japan's
    constitution bans the use of force in settling
    international disputes and Japanese law restricts
    the military to Japanese territory, except for
    ceremonial occasions. On Monday, Saddam offered
    developing nations free oil if they would send
    their tankers to pick it up. The first two
    countries to respond Tuesday _ the Philippines
    and Namibia _ said no. Manila said it had already
    fulfilled its oil requirements, and Namibia said
    it would not sell its sovereignty'' for Iraqi
    oil. Venezuelan President Carlos Andres Perez
    dismissed Saddam's offer of free oil as a
    propaganda ploy.'' Venezuela, an OPEC member,
    has led a drive among oil-producing nations to
    boost production to make up for the shortfall
    caused by the loss of Iraqi and Kuwaiti oil from
    the world market. Their oil makes up 20 percent
    of the world's oil reserves. Only Saudi Arabia
    has higher reserves. But according to the State
    Department, Cuba, which faces an oil deficit
    because of reduced Soviet deliveries, has
    received a shipment of Iraqi petroleum since U.N.
    sanctions were imposed five weeks ago. And
    Romania, it said, expects to receive oil
    indirectly from Iraq. Romania's ambassador to the
    United States, Virgil Constantinescu, denied that
    claim Tuesday, calling it absolutely false and
    without foundation.''.

Human built document summary
  • Cracks appeared in the U.N. trade embargo against
    Iraq. The State Department reports that Cuba and
    Romania have struck oil deals with Iraq as others
    attempt to trade with Baghdad in defiance of the
    sanctions. Iran has agreed to exchange food and
    medicine for Iraqi oil. Saddam has offered
    developing nations free oil if they send their
    tankers to pick it up. Thus far, none has
    accepted. Japan, accused of responding too slowly
    to the Gulf crisis, has promised 2 billion in
    aid to countries hit hardest by the Iraqi trade
    embargo. President Bush has promised that
    Saddam's aggression will not succeed.

[Diagram labels: Manual summarization; Creation of semantic network; Semantic net of Subj-Pred-Obj triples (one for the document, one for the summary); Mapping between graphs learned with ML methods; Automatic summarization by selecting relevant triples (70% recall, 40% precision of selected triples according to human generated summaries); Nat. Lang. Generation; Automatically built document summary (not done by us), shown with the same text as the human built summary]
35
Detailed Summarization Procedure
  • Linguistic analysis of the text
      • Deep parsing of sentences
  • Refinement of the text parse
      • Named-entity consolidation: determine that
        George Bush = Bush = U.S. president
      • Anaphora resolution: link pronouns with
        named entities
  • Extract Subject-Predicate-Object triples
  • Compose a graph from triples
  • Describe each triple with a set of features for learning
  • Learn a model to classify triples into the summary
  • Generate a summary graph
  • Use the summary graph to generate a textual document
    summary

Example
  • Text: Tom Sawyer went to town. He met a friend. Tom was
    happy.
  • After refinement: Tom Sawyer went to town. He [Tom Sawyer]
    met a friend. Tom [Tom Sawyer] was happy.
  • Triples: Tom -> go -> town; Tom -> meet -> friend;
    Tom -> is -> happy
36
Named entities consolidation
  • Consolidating different surface forms that refer
    to the same entity; only for names of people,
    places, companies, etc.
  • Example
      • Hillary Rodham Clinton, Hillary Clinton, Hillary
        Rodham, Mrs. Clinton -> Hillary Clinton
  • Heuristic based on the overlap in the surface
    forms of name variants
  • Accuracy on a subset of the data set: 90%
37
Pronominal anaphora resolution
  • Link pronouns with their referents
      • Mary likes Paul. She went to buy him a
        present.
      • -> Mary likes Paul. She [Mary] went to buy him
        [Paul] a present.
  • Method
      • restrict to 5 pronouns: she, he, who, I, they
      • from the pronoun, traverse the text searching for
        candidate referents and assign each a score
      • the score is based on the distance from the
        pronoun and on semantic information
      • assume that pronouns refer only to named entities
        found in the document
  • Problem
      • One passenger in King's car said they had been
        drinking liquor.
  • Average accuracy on 1,500 hand-labeled pronouns:
    81.2%
38
Anaphora resolution evaluation
Pronoun   Frequency   Frequency (%)   Accuracy (%)
He        681         45.22           86.9
They      244         16.20           67.2
It        204         13.55           -
I         64          4.25            82.8
You       50          3.32            -
We        44          2.92            -
That      44          2.92            -
What      27          1.79            -
She       24          1.59            62.5
This      22          1.46            -
Who       11          0.73            63.6

Total     1506        100             81.2

Accuracy on the 5 selected pronouns: 81.2% (55.2% if counting
all pronouns)
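
The two overall accuracy figures can be approximately recovered from the per-pronoun rows as frequency-weighted averages. The short check below uses only the numbers in the table; because the per-pronoun accuracies are rounded, it gives about 81.1% for the five selected pronouns (reported as 81.2%) and about 55.2% over all 1506 pronouns.

```python
# (frequency, accuracy in %) for the five pronouns the method handles
rows = {
    "he": (681, 86.9),
    "they": (244, 67.2),
    "i": (64, 82.8),
    "she": (24, 62.5),
    "who": (11, 63.6),
}
total_pronouns = 1506  # all pronouns, including those the method skips

correct = sum(count * acc / 100 for count, acc in rows.values())
selected = sum(count for count, _ in rows.values())

acc_selected = 100 * correct / selected    # accuracy on the 5 selected pronouns
acc_all = 100 * correct / total_pronouns   # unresolved pronouns count as errors

print(selected, round(acc_selected, 1), round(acc_all, 1))
```
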
39
Extracting triples
  • The enhanced parse tree is traversed to identify
    Subject-Predicate-Object triples
  • Example
      • Conservatives embraced the nomination while
        liberals were cautious or hostile
  • Resulting triples
      • conservative -> embrace -> nomination
      • liberal -> is -> cautious
      • liberal -> is -> hostile

41
Training of summarization model
  • Model ranks Subject-Predicate-Object triples
    according to their importance

[Figure: document semantic network and summary semantic network]
42
Composing a graph
  • The graph consists of nodes, referred to as concepts,
    which can be subjects or objects, and edges, which
    are predicates and capture relations among
    concepts.
  • Use WordNet to identify and merge synonym
    nodes, as they correspond to the same concept.
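
Composing the graph from triples can be sketched as below. A plain dictionary of synonyms stands in for the WordNet lookup, and the node degree and size of the weakly connected component are computed as examples of graph attributes. The function names, the synonym dictionary and the extra triple with "pal" are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def compose_graph(triples: List[Triple],
                  synonyms: Optional[Dict[str, str]] = None) -> List[Triple]:
    """Merge synonym nodes; the dict stands in for a WordNet lookup."""
    synonyms = synonyms or {}
    return [(synonyms.get(s, s), p, synonyms.get(o, o)) for s, p, o in triples]


def degrees(edges: List[Triple]) -> Tuple[Dict[str, int], Dict[str, int]]:
    out_deg: Dict[str, int] = defaultdict(int)
    in_deg: Dict[str, int] = defaultdict(int)
    for s, _, o in edges:
        out_deg[s] += 1
        in_deg[o] += 1
    return dict(out_deg), dict(in_deg)


def component_sizes(edges: List[Triple]) -> Dict[str, int]:
    """Size of the weakly connected component each node belongs to."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for s, _, o in edges:
        neighbours[s].add(o)
        neighbours[o].add(s)  # ignore edge direction
    sizes: Dict[str, int] = {}
    for start in neighbours:
        if start in sizes:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        for node in component:
            sizes[node] = len(component)
    return sizes


triples = [
    ("Tom", "go", "town"), ("Tom", "meet", "friend"), ("Tom", "is", "happy"),
    ("Tom", "greet", "pal"),  # hypothetical triple to show synonym merging
    ("liberal", "is", "cautious"), ("liberal", "is", "hostile"),
]
edges = compose_graph(triples, synonyms={"pal": "friend"})
out_deg, in_deg = degrees(edges)
sizes = component_sizes(edges)
print(out_deg["Tom"], in_deg["friend"], sizes["Tom"], sizes["liberal"])
```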

43
Feature construction
  • The learning process uses triples described by
    the following attributes
      • Positional information
          • of the sentence from which the triple was derived,
            relative to the document text
          • of the triple, relative to the beginning of the
            sentence
      • Linguistic attributes of the nodes in the triple
        (NLP)
          • 18 syntactic attributes
          • 100 semantic attributes
      • 14 graph attributes: PageRank, in/out degree,
        reachable neighbours, etc.
  • For this dataset this yields
      • a total of 466 attributes
      • on average 72 non-zero attributes per triple

44
Experiments
  • Machine learning with Linear SVM to classify
    triples into relevant or not-relevant for the
    summary
  • Positive examples are triples from the sentences
    which were marked as summary sentences by experts
  • Negative examples are all other triples
  • Data
  • 147 documents from the DUC 2002 for which we had
    extracted summaries.
  • Evaluation
  • Report microaveraged values of precision, recall
    and F1 for the extracted triples using 10-fold
    cross validation.
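
Micro-averaging pools the counts of true positives, false positives and false negatives over all folds before computing precision, recall and F1, rather than averaging per-fold scores. A minimal sketch follows; the function name `micro_prf` and the per-fold counts are made up for illustration.

```python
from typing import Iterable, Tuple


def micro_prf(folds: Iterable[Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """folds: (true positives, false positives, false negatives) per fold."""
    folds = list(folds)
    tp = sum(fold[0] for fold in folds)
    fp = sum(fold[1] for fold in folds)
    fn = sum(fold[2] for fold in folds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for two folds.
precision, recall, f1 = micro_prf([(30, 60, 20), (25, 55, 15)])
print(round(precision, 4), round(recall, 4), round(f1, 4))
```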

45
Performance for various attribute sets
Attribute set                        Training set             Test set
                                     Prec.   Recall  F1       Prec.   Recall  F1
Sentence Position + Terms            65.87   92.48   76.94    28.87   37.08   32.46
only Position (triple + sentence)    31.21   52.49   39.15    31.05   52.58   39.05
only Graph                           27.78   57.46   37.46    27.25   56.90   36.85
only Linguistic                      29.77   61.79   40.18    22.29   47.52   30.29
Position + Linguistic                31.16   67.00   42.54    28.67   62.57   39.33
Position + Graph                     33.51   63.85   43.95    32.71   63.02   43.07
Position + Graph + Linguistic        35.82   72.69   47.99    31.41   64.88   42.33
46
Performance for various attribute sets
  • Baseline performance (sentence position +
    selected terms from the sentence), F1 = 32.46, is
    lower than in any of the other runs, except for
    only linguistic attributes (F1 = 30.29).
  • The only-linguistic run includes only generic syntactic
    and semantic labels, which are not expected to be good
    discriminators on their own.
47
Performance for various attribute sets
  • Adding generic linguistic attributes reduces
    precision but consistently increases recall
      • Position of triples and sentences -> P = 31.05
      • Adding linguistic attributes -> P = 28.67
48
Performance for various attribute sets
  • Information about the graph structure helps
      • Position of triples and sentences -> F1 = 39.05
      • Adding structure information -> F1 = 43.07
49
Insights
  • Most highly ranked features in the SVM normal
  • We determine the median and quartiles of the
    ranks across 10 runs.

Attribute                                     1st quartile   Median   3rd quartile
Object: Authority weight                      1              1        2
Object: size of weakly connected component    2              2.5      3
Object: degree of a node                      2              3        3
Object: is name of a country                  4              5        5
Subject: size of weakly connected component   6              7        9
Subject: degree of a node                     6              10.5     12
Object: PageRank weight                       6              11       12
Object: is name of a geographical location    8              13       16
Subject: Authority weight                     13             18.5     23
50
Example of summarization
  • Cracks Appear in U.N. Trade Embargo Against
    Iraq.
  • Cracks appeared Tuesday in the U.N. trade
    embargo against Iraq as Saddam Hussein sought to
    circumvent the economic noose around his country.
    Japan, meanwhile, announced it would increase its
    aid to countries hardest hit by enforcing the
    sanctions. Hoping to defuse criticism that it is
    not doing its share to oppose Baghdad, Japan said
    up to 2 billion in aid may be sent to nations
    most affected by the U.N. embargo on Iraq.
    President Bush on Tuesday night promised a joint
    session of Congress and a nationwide radio and
    television audience that Saddam Hussein will
    fail'' to make his conquest of Kuwait permanent.
    America must stand up to aggression, and we
    will,'' said Bush, who added that the U.S.
    military may remain in the Saudi Arabian desert
    indefinitely. I cannot predict just how long it
    will take to convince Iraq to withdraw from
    Kuwait,'' Bush said. More than 150,000 U.S.
    troops have been sent to the Persian Gulf region
    to deter a possible Iraqi invasion of Saudi
    Arabia. Bush's aides said the president would
    follow his address to Congress with a televised
    message for the Iraqi people, declaring the world
    is united against their government's invasion of
    Kuwait. Saddam had offered Bush time on Iraqi TV.
    The Philippines and Namibia, the first of the
    developing nations to respond to an offer Monday
    by Saddam of free oil _ in exchange for sending
    their own tankers to get it _ said no to the
    Iraqi leader. Saddam's offer was seen as a
    none-too-subtle attempt to bypass the U.N.
    embargo, in effect since four days after Iraq's
    Aug. 2 invasion of Kuwait, by getting poor
    countries to dock their tankers in Iraq. But
    according to a State Department survey, Cuba and
    Romania have struck oil deals with Iraq and
    companies elsewhere are trying to continue trade
    with Baghdad, all in defiance of U.N. sanctions.
    Romania denies the allegation. The report, made
    available to The Associated Press, said some
    Eastern European countries also are trying to
    maintain their military sales to Iraq. A
    well-informed source in Tehran told The
    Associated Press that Iran has agreed to an Iraqi
    request to exchange food and medicine for up to
    200,000 barrels of refined oil a day and cash
    payments. There was no official comment from
    Tehran or Baghdad on the reported food-for-oil
    deal. But the source, who requested anonymity,
    said the deal was struck during Iraqi Foreign
    Minister Tariq Aziz's visit Sunday to Tehran, the
    first by a senior Iraqi official since the
    1980-88 gulf war. After the visit, the two
    countries announced they would resume diplomatic
    relations. Well-informed oil industry sources in
    the region, contacted by The AP, said that
    although Iran is a major oil exporter itself, it
    currently has to import about 150,000 barrels of
    refined oil a day for domestic use because of
    damages to refineries in the gulf war. Along
    similar lines, ABC News reported that following
    Aziz's visit, Iraq is apparently prepared to give
    Iran all the oil it wants to make up for the
    damage Iraq inflicted on Iran during their
    conflict. Secretary of State James A. Baker III,
    meanwhile, met in Moscow with Soviet Foreign
    Minister Eduard Shevardnadze, two days after the
    U.S.-Soviet summit that produced a joint demand
    that Iraq withdraw from Kuwait. During the
    summit, Bush encouraged Mikhail Gorbachev to
    withdraw 190 Soviet military specialists from
    Iraq, where they remain to fulfill contracts.
    Shevardnadze told the Soviet parliament Tuesday
    the specialists had not reneged on those
    contracts for fear it would jeopardize the 5,800
    Soviet citizens in Iraq. In his speech, Bush said
    his heart went out to the families of the
    hundreds of Americans held hostage by Iraq, but
    he declared, ``Our policy cannot change, and it
    will not change. America and the world will not
    be blackmailed.'' The president added: ``Vital
    issues of principle are at stake. Saddam Hussein
    is literally trying to wipe a country off the
    face of the Earth.'' In other developments: _A
    U.S. diplomat in Baghdad said Tuesday up to 800
    Americans and Britons will fly out of
    Iraqi-occupied Kuwait this week, most of them
    women and children leaving their husbands behind.
    Saddam has said he is keeping foreign men as
    human shields against attack. On Monday, a
    planeload of 164 Westerners arrived in Baltimore
    from Iraq. Evacuees spoke of food shortages in
    Kuwait, nighttime gunfire and Iraqi roundups of
    young people suspected of involvement in the
    resistance. ``There is no law and order,'' said
    Thuraya, 19, who would not give her last name.
    ``A soldier can rape a father's daughter in front
    of him and he can't do anything about it.'' _The
    State Department said Iraq had told U.S.
    officials that American males residing in Iraq
    and Kuwait who were born in Arab countries will
    be allowed to leave. Iraq generally has not let
    American males leave. It was not known how many
    men the Iraqi move could affect. _A Pentagon
    spokesman said ``some increase in military
    activity'' had been detected inside Iraq near its
    borders with Turkey and Syria. He said there was
    little indication hostilities are imminent.
    Defense Secretary Dick Cheney said the cost of
    the U.S. military buildup in the Middle East was
    rising above the $1 billion-a-month estimate
    generally used by government officials. He said
    the total cost _ if no shooting war breaks out _
    could total $15 billion in the next fiscal year
    beginning Oct. 1. Cheney promised disgruntled
    lawmakers ``a significant increase'' in help from
    Arab nations and other U.S. allies for Operation
    Desert Shield. Japan, which has been accused of
    responding too slowly to the crisis in the gulf,
    said Tuesday it may give $2 billion to Egypt,
    Jordan and Turkey, hit hardest by the U.N.
    prohibition on trade with Iraq. ``The pressure
    from abroad is getting so strong,'' said Hiroyasu
    Horio, an official with the Ministry of
    International Trade and Industry. Local news
    reports said the aid would be extended through
    the World Bank and International Monetary Fund,
    and $600 million would be sent as early as
    mid-September. On Friday, Treasury Secretary
    Nicholas Brady visited Tokyo on a world tour
    seeking $10.5 billion to help Egypt, Jordan and
    Turkey. Japan has already promised a $1 billion
    aid package for multinational peacekeeping forces
    in Saudi Arabia, including food, water, vehicles
    and prefabricated housing for non-military uses.
    But critics in the United States have said Japan
    should do more because its economy depends
    heavily on oil from the Middle East. Japan
    imports 99 percent of its oil. Japan's
    constitution bans the use of force in settling
    international disputes and Japanese law restricts
    the military to Japanese territory, except for
    ceremonial occasions. On Monday, Saddam offered
    developing nations free oil if they would send
    their tankers to pick it up. The first two
    countries to respond Tuesday _ the Philippines
    and Namibia _ said no. Manila said it had already
    fulfilled its oil requirements, and Namibia said
    it would not ``sell its sovereignty'' for Iraqi
    oil. Venezuelan President Carlos Andres Perez
    dismissed Saddam's offer of free oil as a
    ``propaganda ploy.'' Venezuela, an OPEC member,
    has led a drive among oil-producing nations to
    boost production to make up for the shortfall
    caused by the loss of Iraqi and Kuwaiti oil from
    the world market. Their oil makes up 20 percent
    of the world's oil reserves. Only Saudi Arabia
    has higher reserves. But according to the State
    Department, Cuba, which faces an oil deficit
    because of reduced Soviet deliveries, has
    received a shipment of Iraqi petroleum since U.N.
    sanctions were imposed five weeks ago. And
    Romania, it said, expects to receive oil
    indirectly from Iraq. Romania's ambassador to the
    United States, Virgil Constantinescu, denied that
    claim Tuesday, calling it ``absolutely false and
    without foundation.''

Human written summary
Cracks appeared in the U.N. trade embargo against
Iraq. The State Department reports that Cuba and
Romania have struck oil deals with Iraq as others
attempt to trade with Baghdad in defiance of the
sanctions. Iran has agreed to exchange food and
medicine for Iraqi oil. Saddam has offered
developing nations free oil if they send their
tankers to pick it up. Thus far, none has
accepted. Japan, accused of responding too slowly
to the Gulf crisis, has promised $2 billion in
aid to countries hit hardest by the Iraqi trade
embargo. President Bush has promised that
Saddam's aggression will not succeed.
7800 chars, 1300 words
51
Full document semantic graph
52
Automatically generated summary graph
53
Findings on summarization with semantic graphs
  • Experiments show that attributes that
    characterize the document semantic graph improve
    selection of triples for summarization
  • These results need to be verified on additional
    data sets
  • Need to perform comparison with additional
    summarization methods
  • Explore various strategies for extracting and
    generating summaries based on extracted triples.
  • No combination of features that was examined led
    to good separation of positive and negative
    triples in the feature space
  • Opportunity for further investigations and
    improvements.

54
Contents
  • Knowledge Discovery
  • Large Scale Topic Ontology population
  • Extraction of Semantic Networks from Text
  • Active Learning for efficient use of human
    interventions
  • Methods Addressing Different Aspects of Ontology
    Construction
  • Final Remarks

55
Active Learning / Dealing with unlabeled data
56
The idea of Active Learning
  • The idea of Active Learning: a student who asks
    smart questions arrives at the required model of
    knowledge faster than one who asks random questions
  • The goal is to use Active Learning algorithms for
    semiautomatic
  • construction of models for labeling data and
  • for ontology learning

57
Quick Intro to Active Learning
(Diagram: a passive student receives data and labels from the teacher)
  • We use these methods whenever hand-labeled data
    are rare or expensive to obtain
  • Interactive method
  • Requests labels only for interesting objects
  • Much less human work is needed for the same result
    compared to labeling arbitrarily chosen examples

(Diagram: an active student sends a query to the teacher and receives a label)
(Plot: performance against number of questions, for an active student asking smart questions and a passive student asking random questions)
58
Algorithms tested
  • Uncertainty sampling (efficient)
  • select example closest to the decision hyperplane
    (or the one with classification probability
    closest to P = 0.5) (Tong & Koller 2000, Stanford)
  • Maximum margin ratio change
  • select example with the largest predicted impact
    on the margin size if selected (Tong & Koller
    2000, Stanford)
  • Monte Carlo Estimation of Error Reduction
  • select example that reinforces our current
    beliefs (Roy & McCallum 2001, CMU)
  • Random sampling as baseline
  • Experimental evaluation (using F1-measure) of the
    four listed approaches shown on three categories
    from Reuters-2000 dataset
  • average over 10 random samples of 5000 training
    (out of 500k) and 10k testing (out of
    300k) examples
  • the last two methods are rather time consuming,
    thus we run them only until the first 50
    unlabeled examples are included
  • experiments show that active learning is
    especially useful for unbalanced data

59
Category with balanced class distribution having
47% of positive examples: limited advantage over
random sampling
60
Category with fairly unbalanced class
distribution having 20% of positive examples: best
performance with Uncertainty and MarginRatio;
Uncertainty is simpler and much more efficient
61
Category with very unbalanced class distribution
having 2.7% of positive examples: Uncertainty
seems to outperform MarginRatio
62
Illustration of Active learning
  • starting with one labeled example from each class
    (red and blue)
  • select one example for labeling (green circle)
  • request the label, add the example, and
    re-generate the model using the extended labeled
    data
  • Illustration of linear SVM model using
  • arbitrary selection of unlabeled examples
    (random)
  • active learning selecting the most uncertain
    examples (closest to the decision hyperplane)
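
The loop illustrated on this slide can be sketched on a toy problem. The sketch below is an assumption-laden stand-in, not the authors' setup: the data are one-dimensional, the "model" is a threshold placed midway between the closest labeled points of the two classes (a 1-D analogue of a maximum-margin linear classifier), and `oracle` is a hypothetical function playing the human teacher.

```python
def fit_threshold(labeled):
    """1-D stand-in for a linear SVM: put the boundary midway between
    the largest negative and the smallest positive labeled point.
    Assumes negatives lie below positives and both classes are present."""
    largest_negative = max(x for x, y in labeled if y == 0)
    smallest_positive = min(x for x, y in labeled if y == 1)
    return (largest_negative + smallest_positive) / 2.0


def active_learning(pool, oracle, labeled, n_queries):
    """Repeatedly query the pool point closest to the current boundary,
    add its label, and re-fit the model."""
    pool = list(pool)
    labeled = list(labeled)
    for _ in range(n_queries):
        if not pool:
            break
        boundary = fit_threshold(labeled)
        query = min(pool, key=lambda x: abs(x - boundary))  # most uncertain
        pool.remove(query)
        labeled.append((query, oracle(query)))              # ask the teacher
    return fit_threshold(labeled)


# Toy run: the true class boundary (unknown to the learner) is at 0.37.
pool = [i / 100 for i in range(1, 100)]
oracle = lambda x: 1 if x >= 0.37 else 0
start = [(0.0, 0), (1.0, 1)]  # one labeled example from each class
print(active_learning(pool, oracle, start, 12))
```

On this toy pool the uncertain queries behave like a bisection search, so a dozen labels are enough to pin the boundary between the two nearest pool points; randomly chosen queries would typically need many more.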

63
Uncertainty sampling of unlabeled example
64
(No Transcript)
65
Contents
  • Knowledge Discovery
  • Large Scale Topic Ontology population
  • Extraction of Semantic Networks from Text
  • Active Learning for efficient use of human
    interventions
  • Methods Addressing Different Aspects of Ontology
    Construction
  • Final Remarks

66
Methods Addressing Different Aspects of Ontology
Construction
67
Methods addressing different aspects of ontology
construction
  • Collecting data
  • focused crawling with Google and DMoz in the loop
  • Dealing with different natural languages
  • map the documents into a language-independent
    semantic-space
  • Going directly from the data
  • semi-automatic creation of an ontology directly
    from the data under predefined conditions/scenarios
  • Annotation of text

68
Focused Crawler
  • A focused crawler finds, in a relatively short
    time, web pages related to a given web page
  • The solution uses the DMoz topic ontology to get
    content context, and Google to get web linkage
    context
  • the main idea is to browse the web graph as a
    bi-directional graph using the link: query in
    Google
  • Algorithm
  • To obtain an initial set of candidate pages
    efficiently, we use Google and DMoz
  • Starting from the initial set, pages are crawled
    in breadth-first fashion
  • priority in the crawler-queue is given to more
    similar pages
  • after some stopping condition is met, the
    crawler returns the list of candidate web pages
  • Usage: serves as a technique for collecting the
    data for the next stages of data processing such
    as building and populating ontologies for the
    Semantic Web, improved knowledge access
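
The crawling steps above can be sketched as a priority-queue crawl. This is only an illustration under stated assumptions: `fetch` is a hypothetical callback returning a page's text and its neighbouring URLs (in-links and out-links), similarity is plain bag-of-words cosine, an unvisited page inherits the similarity of the page that led to it, and the stopping condition is a page budget. The real system obtains its seeds and links from Google and DMoz, which is not modelled here.

```python
import heapq
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def focused_crawl(seed_urls, target_text, fetch, max_pages=10):
    """Expand queued pages in order of priority and return the visited
    pages ranked by similarity to the target text."""
    target = Counter(target_text.lower().split())
    queue = [(-1.0, url) for url in seed_urls]  # seeds get top priority
    heapq.heapify(queue)
    seen = set(seed_urls)
    scored = []
    while queue and len(scored) < max_pages:    # stopping condition: page budget
        _, url = heapq.heappop(queue)
        text, neighbours = fetch(url)
        similarity = cosine(target, Counter(text.lower().split()))
        scored.append((similarity, url))
        for neighbour in neighbours:
            if neighbour not in seen:
                seen.add(neighbour)
                # Neighbours inherit the similarity of the page linking to them.
                heapq.heappush(queue, (-similarity, neighbour))
    return [url for _, url in sorted(scored, key=lambda pair: -pair[0])]


# Toy in-memory "web" standing in for real pages (all names are made up).
TOY_WEB = {
    "a": ("telecom broadband network", ["b", "c"]),
    "b": ("broadband network provider", ["d"]),
    "c": ("cooking recipes pasta", ["e"]),
    "d": ("telecom network operator", []),
    "e": ("pasta sauce recipes", []),
}
print(focused_crawl(["a"], "telecom broadband network", TOY_WEB.__getitem__, max_pages=4))
```

With the page budget of four, the off-topic page "e" is never fetched because the page that links to it scored zero.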

69
Example Focused Crawl
  • Focused crawl for the BT home page
    (http://www.bt.com)
  • 1. www.bt.co.uk/ - BT
  • 2. www.yell.com/ucs/HomePageAction.do - UK's
    local search engine
  • 3. www.att.com/ - AT&T The World's Networking
    Company
  • 4. www.cisco.com/ - Cisco Systems, Inc
  • 5. www.microsoft.com/ - Microsoft Corporation
  • 6. www.bbc.co.uk/ - BBC
  • 7. www.hp.com/ - HP United States
  • 8. www.ntl.com/ - Broadband cable internet access
  • 9. www.telekom.de/ - Deutsche Telekom
  • 10. www.epsrc.ac.uk/ - EPSRC
  • 11. www.cw.com/ - Cable & Wireless
  • 12. www.royalmail.com/ - Royal Mail
  • 13. www.ericsson.com/ - Ericsson
  • 14. www.bp.com/home.do?categoryId1 - BP Global
  • 15. www.telewest.co.uk/ - Telewest Broadband PLC
  • 16. www.verizon.com/ - Verizon
  • 17. www.nokia.com/ - Nokia
  • 18. www.bt.com/at_home.jsp - BT.com At Home

70
Language-independent document representation
  • From aligned corpora we learn mappings from
    documents into a language-independent
    representation using the Kernel Canonical
    Correlation Analysis (KCCA) method
  • such a representation could be used for
    multilingual classification, multilingual IR, ...
  • On-going work on learning mappings between all
    European languages using the CELEX corpus of
    European legislation in 21 languages

71
Two views of the same data: find the direction
with maximal correlation
  • View 1
  • View 2
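
For reference, "the direction with maximal correlation" can be written out as below. This is the textbook formulation of CCA and its kernel form, given as background rather than taken from the slides. The symbols are introduced here for illustration: x and y are the two views of a document (for example its English and German versions), C_xx, C_yy and C_xy are the within-view and cross-view covariance matrices, K_x and K_y are the kernel matrices of the two views, and alpha, beta are the dual coefficients.

```latex
% Linear CCA: directions w_x, w_y whose projections are maximally correlated
\rho = \max_{w_x,\, w_y}
  \frac{w_x^{\top} C_{xy}\, w_y}
       {\sqrt{w_x^{\top} C_{xx}\, w_x}\; \sqrt{w_y^{\top} C_{yy}\, w_y}}

% Kernel CCA: the same objective in the dual, with w_x = X^{\top}\alpha and w_y = Y^{\top}\beta
\rho = \max_{\alpha,\, \beta}
  \frac{\alpha^{\top} K_x K_y\, \beta}
       {\sqrt{\alpha^{\top} K_x^{2}\, \alpha}\; \sqrt{\beta^{\top} K_y^{2}\, \beta}}
```

In practice the kernel form is regularized, since without regularization it can reach a correlation of 1 trivially on the training data.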

72
Correlation 0.17
  • View 1

View 2
73
Correlation 0.44
  • View 1

View 2
74
Correlation 0.97
  • View 1

View 2
75
Correlated directions found with KCCA when
applied to financial news articles
  • Direction 1, German: ZENTRALBANK, BP, MILLIARDE,
    DOLLAR, VERLUST, EINKOMMEN, FIRMA, VIERTEL
  • Direction 1, English: BANK, BP, CENTRAL, DOLLAR,
    LOSS, INCOME, COMPANY, QUARTER
  • Direction 2, German: GESCHICHTEN, MILLION, SAGT,
    BORSEN, ZAHLUNG, VOLLE, GEWERKSCHAFT,
    VERHANDLUNGSRUNDE
  • Direction 2, English: STORIES, MILLION, SAYS,
    EXCHANGES, WAGE, PAYMENT, NEGOTIATIONS, UNION
76
Modelling directly from the data: getting
semantic classes with LSI
  • Words shown on the slide, in extraction order:
    CELLS, GENE, CANCER, GENOMIC, MOLECULAR, SERVICES,
    GRID, USER, MOBILIZATION, CONTENT, CELLS,
    STEM_CELLS, STEM, VACCINES, WEB, CONTENT, MEDIA,
    GRID, MULTIMEDIA, DIGITAL, ENERGY, OPTICS, WASTE,
    FUEL, NUCLEAR, SECURITY, ROBOT, EMBEDDED,
    BIOMETRICS, VECTOR, WEB, WEB_SERVICES, SEMANTIC,
    CONTENT, MEDIA, ROBOT, LEARNING, COGNITIVE, HUMAN,
    INTERACTIVE
77
Visualization of 6FP IST project (English)
78
Modeling relationships between companies from the
news
79
Annotation of text
  • Annotation based on examples
  • Annotation using clustering
  • Annotation based on thesaurus

80
Annotate text based on examples
  • Problem: annotation of text by assigning
    predefined labels to text fragments
  • Given: examples of annotated text fragments
  • learn annotation rules from already annotated
    documents (.xml, ...) similar to learning IE
  • learn to classify sentences into semantic roles

81
Annotate text using clustering
  • Problem: annotation of text by finding labels and
    assigning them to text fragments
  • Given: text to annotate
  • split documents into sentences, represent each
    sentence as word-vector
  • cluster sentences and label them by the most
    characteristic words from the sentences
  • e.g., using local frequency of words, clustering
    with SOM and using neural network weights of
    words
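
The pipeline above (split into sentences, build word-vectors, cluster, label clusters by characteristic words) can be sketched as follows. Note the substitutions: the slide mentions SOM clustering and neural network weights, whereas this sketch uses a plain k-means with cosine similarity and labels each cluster by its most frequent words. The stopword list, function names and sample text are all illustrative assumptions.

```python
import re
from collections import Counter
from math import sqrt

STOPWORDS = {"the", "a", "of", "to", "in", "and", "is", "for", "on", "with"}


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_vector(sentence):
    """Bag-of-words vector (word -> local frequency), stopwords removed."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_sentences(vectors, k, iterations=10):
    """Plain k-means with cosine similarity; centroids start at the first k vectors."""
    centroids = [Counter(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = sum(members, Counter())  # summed word counts
    return assignment, centroids


def label_clusters(centroids, top_n=2):
    """Label each cluster by its most frequent words."""
    return [[w for w, _ in centroid.most_common(top_n)] for centroid in centroids]


text = ("Iraq invaded Kuwait. Japan promised aid to Egypt. "
        "Iraq offered oil to Kuwait neighbours. Japan sent aid to Turkey.")
vectors = [word_vector(s) for s in split_sentences(text)]
assignment, centroids = cluster_sentences(vectors, k=2)
print(assignment, label_clusters(centroids))
```

Each sentence would then be annotated with the label words of the cluster it was assigned to.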

82
Annotate text based on thesaurus
  • Problem: annotation of text by finding labels and
    assigning them to text fragments
  • Given: text to annotate, thesaurus
  • a) apply NLP on text to find noun-groups and map
    them onto concepts of a (medical) thesaurus
  • b) split document into sentences, cluster them
    and map clusters onto concepts of a general
    thesaurus (WordNet)
  • the concepts are used as semantic labels (XML
    tags) for annotating documents

83
Ontology evaluation directions
  • Analysis of information-theoretic properties of
    structured data instances
  • Measure of the agreement to the characteristics
    derived from manually built ontologies
  • Optimization of efficiency of the user's
    behaviour when using an ontology (e.g.,
    minimizing the number of user clicks)

84
Contents
  • Knowledge Discovery
  • Large Scale Topic Ontology population
  • Extraction of Semantic Networks from Text
  • Active Learning for efficient using of human
    interventions
  • Methods Addressing Different Aspects of Ontology
    Construction
  • Final Remarks

85
Ontology Learning Challenge
  • Academic challenge on DMoz data (Science part)
    for 3 tasks
  • Taxonomy Population
  • Given taxonomy with documents, the task is to
    classify new documents into taxonomic categories
  • Naming Categories
  • Given taxonomic categories with documents, the
    task is to (semi)automatically propose names for
    categories
  • Constructing Taxonomy from Documents
  • Given a set of documents, the task is to
    (semi)automatically propose taxonomic structure
  • The goal is to model human skills when dealing
    with large amounts of data
  • Data
  • DMoz/Science (10k concepts, 100k instances)
  • Tourist ontology (from KU) (70 concepts, 1000
    instances)
  • The challenge will be funded through PASCAL
    Network of Excellence European project
    (http://www.pascal-network.org/)

86
Ideas / Future plans (1)
  • DMoz categories as standard web meta-data
    dictionary
  • the idea is to use DMoz categories/keywords as a
    standardized dictionary for meta-data labeling of
    general Web pages
  • because of dynamic and adaptive nature of DMoz
    categorization (reflecting all major topics on
    the web) this could be interesting as a baseline
    for semantic web style annotation
  • e.g. could be deployed as a tool for
    (semi)automatic generation of <META> tags for web
    pages

87
Ideas / Future plans (2)
  • DMoz classifier as an annotation tool
  • the idea is to use DMoz-classifier tool for
    meta-data (keyword) generation
  • some other popular databases (e.g. Wikipedia)
    could have attached automatically generated DMoz
    categories
  • could be accessible as a web service (e.g. SOAP
    interface)

88
Ideas / Future plans (3)
  • DMoz Visualizer
  • the idea is to create a tool for visualization
    and browsing through DMoz structure
  • browsing tools could combine other public and
    commercial sources (such as Wikipedia, Google,
    Amazon, eBay, ...)
  • could appear as e.g. web-browser toolbar

89
Ideas / Future plans (4)
  • Analysis of DMoz Dynamics
  • Future research plan is to model the dynamics of
    the DMoz taxonomy based on data from the DMoz
    Archive (http://rdf.dmoz.org/rdf/archive/)
  • the idea is to model the decision process: when
    and how the editors decide to split the category
    nodes
  • currently the repository includes 120 snapshots
    of DMoz from year 2000 on

90
Ideas / Future plans (5)
  • Focused crawling for DMoz
  • the idea is to use focused crawler for proposing
    new web sites for particular categories (as
    editorial tool)
  • at JSI we developed a focused crawler for fast
    and efficient crawling of focused content; it can
    be further extended
  • to use Google and DMoz in the loop
  • to use user-hints (positive and negative examples
    of content pages)
  • based on Corpus-Builder project at CMU
  • http://www-2.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/corpusbuilder/

91
Ideas / Future plans (6)
  • Classification of non-English documents
  • we use string kernels for avoiding problems with
    morphology
  • paper submitted to ECML/PKDD 2005 (Fortuna &
    Mladenic) for classification into major Slovenian
    and Croatian taxonomies
  • we plan to use Canonical Correlation
    Analysis (CCA) for efficient identification of
    similar content written in different languages
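
To illustrate why string kernels help with morphology, here is a minimal sketch of one simple kind of string kernel, the character n-gram (spectrum) kernel. The slides do not say which string kernel was used, so this particular choice, the value n = 3 and the function names are assumptions. Because similarity is computed over shared character sequences, inflected forms of the same word remain similar without any stemming or lemmatization.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, n=3):
    """Count all contiguous character n-grams in the lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(a, b, n=3):
    """Inner product of the two n-gram count vectors."""
    counts_a, counts_b = ngram_counts(a, n), ngram_counts(b, n)
    return sum(count * counts_b[gram] for gram, count in counts_a.items())


def normalized_kernel(a, b, n=3):
    """Spectrum kernel scaled to [0, 1] so that k(a, a) = 1."""
    denominator = sqrt(spectrum_kernel(a, a, n) * spectrum_kernel(b, b, n))
    return spectrum_kernel(a, b, n) / denominator if denominator else 0.0


# Two inflected forms of the same Slovenian word versus an unrelated word.
print(normalized_kernel("klasifikacija", "klasifikacije"))
print(normalized_kernel("klasifikacija", "dokument"))
```

The two inflected forms share most of their trigrams and score high, while the unrelated word shares none and scores zero. Such a kernel can be plugged into any kernel-based classifier, for example an SVM.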

92
Text-Garden software library (in development over
the last 5 years)
93
Text-Garden data
  • Set of C++ classes for industrial-strength text
    mining problem solving
  • Currently organized in 50 command line utilities
    covering
  • Machine learning/Data mining on text
  • Web related functionality
  • Profiling, Visualization, ...
  • Currently works on Windows, to be ported to Linux

94
Text Garden: architecture of clustering,
visualization, classification
95
Text Garden Web site: www.textmining.net
About This Presentation
Title:

Knowledge Discovery

Description:

Title: Data mining and decision support Author: Opteron Last modified by: Tina Anzic Created Date: 4/15/2005 11:41:49 AM Document presentation format – PowerPoint PPT presentation

Slides: 96
Provided by: Opteron
Learn more at: http://translectures.videolectures.net
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Knowledge Discovery


1
Knowledge Discovery
ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY
  • Marko Grobelnik, Dunja Mladenic
  • J.Stefan Institute
  • Slovenia

2
Contents
  • Knowledge Discovery
  • Large Scale Topic Ontology population
  • Extraction of Semantic Networks from Text
  • Active Learning for efficient using of human
    interventions
  • Methods Addressing Different Aspects of Ontology
    Construction
  • Final Remarks

3
Why is Knowledge Discovery appropriate for
Semantic Web?
  • Idea let a computer search for knowledge whereas
    the humans give just broad directions about where
    and how to search
  • Knowledge discovery (KD) could be defined as a
    research area with several subfields Machine
    Learning, Data Mining and Data bases (Mitchell,
    1997 Fayyad et al., 1996 Witten and Frank,
    1999 Hand et al., 2001)
  • KD techniques
  • mainly about discovering structure in the data
  • can serve as one of the key mechanisms for
    structuring knowledge into an ontological
    structure being further used in Knowledge
    management process
  • Data and corresponding semantic structures change
    in time
  • sub-field of KD called stream mining deals with
    these kinds of problems
  • Semantic Web is ultimately concerned with
    real-life data on the web which have exponential
    growth
  • scalability is one of the central issues in KD

4
Machine Learning view to Ontology Generation
5
Knowledge Discovery Techniques
  • Knowledge discovery technologies can be used to
    support different phases and scenarios for
    ontology generation
  • Observations
  • Completely automatic construction of ontologies
    is in general not possible for
  • theoretical reasons (e.g., information
    bottleneck) and
  • practical reasons (e.g., the soft nature of the
    knowledge being conceptualized).
  • Human interventions are necessary but costly in
    terms of resources
  • therefore the technology should help in
    efficient utilization of human interventions.
  • Document databases are the most common data type
    conceptualized in the form of ontologies

6
What is Ontology?
  • In most ML contexts we can refer to an ontology
    as being a graph/network structure consisting
    from
  • a set of concepts (vertices in a graph)
  • each concept Ci is described by a
    membership-function ci(x)
  • a set of relations connecting concepts (directed
    edges in a graph)
  • each relation Ri is described by a
    membership-function ri(Ci, Cj)
  • a set of instances (data records assigned to
    concepts or relations)
  • each instance Ii is described by a set of
    features Fi,j

7
We have 7 concepts (C1C7), and 3 relations
(R1R3) each of the concept and relation is
populated by a number of instances (data records)
R1
C2
C1
R3
C4
C3
R3
R2
R1
R3
R2
C5
C7
R1
C6
8
Ontology Definition
  • Ontology is defined as a tuple with 5 sets of
    objects
  • OntologyltClasses, Relations, Instances,
    Class-Definitions, Relation-Definitionsgt
  • in short OltC, R, I, CD, RDgt
  • where
  • Classes set of labels Ci
  • Relations set of labels Ri
  • Instances set of instance feature vectors Ii
  • Class-Definitions set of class membership
    functions CDi
  • Relation-Definitions set of relation membership
    functions RDi
  • the idea is to describe ontology learning
    tasks in above terms

9
Ontology Learning
  • Ontology learning is a set of tasks based on the
    previous ontology definition
  • We define ontology learning tasks in terms of
    mappings between ontology components where some
    of the components are given and some are missing
    and we want to induce the missing ones
  • Some typical scenarios
  • Inducing classes/Clustering of instances
  • C, CDf(I)
  • Ontology population
  • CD, RDf(C, R, I)
  • Ontology generation
  • C, R, CD, RDf(I) (hardest task)

10
Representational language
  • When performing learning of function f, we need
    to select language for representation of
    membership function f
  • Examples
  • Linear functions (Support-Vector-Machines, )
  • Propositional logic (decision trees, rules, )
  • First order logic (Inductive Logic programming)
  • by selecting different representation languages
    we decide about
  • the power of the descriptions
  • complexity of computation

11
Ontology Quality
  • For the same set of instances I we can have
    multiple ontologies OltC, R, I, CD, RDgtI
  • We need a function q for measuring the quality of
    a given ontology OI
  • function q returns numerical value
  • the best ontology is the one with the highest
    quality
  • Possible evaluation measures
  • (1) analysis of statistical properties of
    structured data,
  • (2) agreement to the properties derived from
    manually built ontologies,
  • (3) optimization of efficiency of the user's
    behaviour when using an ontology,
  • (4) using background knowledge, and
  • (5) building hybrid measures (combination of
    various approaches).

12
Search for optimal Ontology
  • Given set of instances I, we develop a series of
    ontologies
  • O1, O2, O3,
  • where we have set of transformation operators
    (refinement operators) going from Oi to Oi1
  • Good search procedure would select such
    transformations which would lead efficiently
    towards the highest quality q(Oi)
  • this formulation is in line with machine
    learning with structured output
  • we could use human in the loop by using active
    learning techniques

13
Contents
  • Knowledge Discovery
  • Large Scale Topic Ontology population
  • Extraction of Semantic Networks from Text
  • Active Learning for efficient using of human
    interventions
  • Methods Addressing Different Aspects of Ontology
    Construction
  • Final Remarks

14
Large Scale Topic Ontology population
15
Text categorization into large topic ontology
  • Categorization of documents into large topic
    ontology is one of the problems in text mining
  • needs to be scalable
  • e.g. being able to handle DMozs 600K categories
    and 4M docs.
  • needs to be accurate
  • having accuracy on the level of inter-human
    agreement (60-80)
  • needs to be robust
  • taking into account nature of web pages
    (typically mixed quality content and often high
    quality context)

16
Approaches for handling hierarchy of categories
  • There are several topic ontologies (taxonomies)
    of textual documents
  • Yahoo, DMoz, Medline,
  • Different people use different approaches
  • series of hierarchically organized classifiers
  • set of independent classifiers just for leaves
  • set of independent classifiers for all nodes

17
Yahoo! topic ontology (taxonomy)
  • human constructed hierarchy of Web-documents
  • exists in several languages
  • easy to access and regularly updated
  • captures most of the Web topics
  • English version includes over 2M pages
    categorized into 50,000 categories
  • contains about 250Mb of HTML-files

18
Document to categorize CFP for CoNLL-2000
19
Some predicted categories
20
System architecture
Feature construction
Web
vectors of n-grams
Subproblem definition Feature selection Classifier
construction
labeled documents (from Yahoo! hierarchy)
unlabeled document
category (label)
??
Document Classifier
21
Content categories
  • For each content category generate a separate
    classifier that predicts probability for a new
    document to belong to its category

22
Summary of experimental results on Yahoo!
23
DMoz / ODP is largest topic ontology on the
web 4M sites 68k editors 600k concepts
24
Categorization into DMoz
  • On input we take DMoz RDF taxonomy data
  • from http//rdf.dmoz.org/
  • we preprocess it into efficient binary structure
  • next, we build a classification model consisting
    from models for individual categories
  • We take hierarchical nature into account
  • Using classification model we classify new
    documents into taxonomy
  • On output we get for a given document text and
    URL
  • Set of most relevant categories from DMoz
  • Set of most relevant keywords calculated from
    DMoz category names (segments from the path names)

25
What is used for learning?
  • Currently the system uses hierarchical nearest
    neighbor
  • in the past we experimented with Naïve Bayes
    for Yahoo taxonomy (http//kt.ijs.si/Dunja/yplanet
    .html)
  • heavy feature selection was needed
  • we plan to experiment with Support Vector
    Machine (SVM) algorithms
  • we plan to use this for ACM KDD Cup 2005
    Challenge
  • Scalability is a problem for learning and
    classification when dealing with 600K classes and
    4M documents
  • Approaches still needs to be properly evaluated

26
Performance issues
  • Preprocessing of DMoz (from RDF to classification
    model) takes approx. 1h
  • For classification into the whole DMoz we need
    Win64 with at least 6Gb memory
  • subsets of DMoz run on Win32 with 2Gb
  • Classification into DMoz is fast
  • 20 document classifications per second
  • e.g. whole Wikipedia was classified into DMoz in
    several hours

27
Demos
  • Demo software for classification into
    http//dmoz.org/Science/ available at
    http//agava.ijs.si/marko/DMozClassifyDemo.zip
    (40Mb)
  • includes AVI file with demo movie
  • demo runs at http//alchemist.ijs.si11111/
  • Demo for classification into the whole DMoz (all
    600K classes) runs at http//alchemist.ijs.si2222
    2/

28
Example classification of URL of a web page
keywords
categories
classification of Hubble telescope web page
29
Example classification of URL text of a web page
30
Contents
  • Knowledge Discovery
  • Large Scale Topic Ontology population
  • Extraction of Semantic Networks from Text
  • Active Learning for efficient using of human
    interventions
  • Methods Addressing Different Aspects of Ontology
    Construction
  • Final Remarks

31
Extracting Semantic Graph from text
32
Summarization with semantic graph (Leskovec,
Grobelnik, Milic-Frayling 2005)
  • Idea extract semantic network from text
    documents and identify relevant parts of the
    semantic network to represent summary
  • Semantic graph representation is used for
    summarization task (DUC Challenge)
  • The main research result is the finding that
    topology of extracted semantic graph helps in
    determining importance of the content triples
    (which Subject-Predicate-Object triple is
    relevant)
  • joint collaboration with Microsoft Research,
    Cambridge

33
Approach Description
  • Approach
  • Learn a machine learning model for selecting
    sentences
  • Use information about semantic structure of the
    document (concepts and relations among concepts)
  • Results are promising
  • achieved 70 recall of and 25 precision on
    extracted Subject-Predicate-Object triples on DUC
    (Document understanding conference) data

34
Summarization
Human built document summary
Original Document
  • Cracks Appear in U.N. Trade Embargo Against
    Iraq.
  • Cracks appeared Tuesday in the U.N. trade
    embargo against Iraq as Saddam Hussein sought to
    circumvent the economic noose around his country.
    Japan, meanwhile, announced it would increase its
    aid to countries hardest hit by enforcing the
    sanctions. Hoping to defuse criticism that it is
    not doing its share to oppose Baghdad, Japan said
    up to 2 billion in aid may be sent to nations
    most affected by the U.N. embargo on Iraq.
    President Bush on Tuesday night promised a joint
    session of Congress and a nationwide radio and
    television audience that "Saddam Hussein will
    fail" to make his conquest of Kuwait permanent.
    "America must stand up to aggression, and we
    will," said Bush, who added that the U.S.
    military may remain in the Saudi Arabian desert
    indefinitely. "I cannot predict just how long it
    will take to convince Iraq to withdraw from
    Kuwait," Bush said. More than 150,000 U.S.
    troops have been sent to the Persian Gulf region
    to deter a possible Iraqi invasion of Saudi
    Arabia. Bush's aides said the president would
    follow his address to Congress with a televised
    message for the Iraqi people, declaring the world
    is united against their government's invasion of
    Kuwait. Saddam had offered Bush time on Iraqi TV.
    The Philippines and Namibia, the first of the
    developing nations to respond to an offer Monday
    by Saddam of free oil _ in exchange for sending
    their own tankers to get it _ said no to the
    Iraqi leader. Saddam's offer was seen as a
    none-too-subtle attempt to bypass the U.N.
    embargo, in effect since four days after Iraq's
    Aug. 2 invasion of Kuwait, by getting poor
    countries to dock their tankers in Iraq. But
    according to a State Department survey, Cuba and
    Romania have struck oil deals with Iraq and
    companies elsewhere are trying to continue trade
    with Baghdad, all in defiance of U.N. sanctions.
    Romania denies the allegation. The report, made
    available to The Associated Press, said some
    Eastern European countries also are trying to
    maintain their military sales to Iraq. A
    well-informed source in Tehran told The
    Associated Press that Iran has agreed to an Iraqi
    request to exchange food and medicine for up to
    200,000 barrels of refined oil a day and cash
    payments. There was no official comment from
    Tehran or Baghdad on the reported food-for-oil
    deal. But the source, who requested anonymity,
    said the deal was struck during Iraqi Foreign
    Minister Tariq Aziz's visit Sunday to Tehran, the
    first by a senior Iraqi official since the
    1980-88 gulf war. After the visit, the two
    countries announced they would resume diplomatic
    relations. Well-informed oil industry sources in
    the region, contacted by The AP, said that
    although Iran is a major oil exporter itself, it
    currently has to import about 150,000 barrels of
    refined oil a day for domestic use because of
    damages to refineries in the gulf war. Along
    similar lines, ABC News reported that following
    Aziz's visit, Iraq is apparently prepared to give
    Iran all the oil it wants to make up for the
    damage Iraq inflicted on Iran during their
    conflict. Secretary of State James A. Baker III,
    meanwhile, met in Moscow with Soviet Foreign
    Minister Eduard Shevardnadze, two days after the
    U.S.-Soviet summit that produced a joint demand
    that Iraq withdraw from Kuwait. During the
    summit, Bush encouraged Mikhail Gorbachev to
    withdraw 190 Soviet military specialists from
    Iraq, where they remain to fulfill contracts.
    Shevardnadze told the Soviet parliament Tuesday
    the specialists had not reneged on those
    contracts for fear it would jeopardize the 5,800
    Soviet citizens in Iraq. In his speech, Bush said
    his heart went out to the families of the
    hundreds of Americans held hostage by Iraq, but
    he declared, "Our policy cannot change, and it
    will not change. America and the world will not
    be blackmailed." The president added: "Vital
    issues of principle are at stake. Saddam Hussein
    is literally trying to wipe a country off the
    face of the Earth." In other developments: _A
    U.S. diplomat in Baghdad said Tuesday up to 800
    Americans and Britons will fly out of
    Iraqi-occupied Kuwait this week, most of them
    women and children leaving their husbands behind.
    Saddam has said he is keeping foreign men as
    human shields against attack. On Monday, a
    planeload of 164 Westerners arrived in Baltimore
    from Iraq. Evacuees spoke of food shortages in
    Kuwait, nighttime gunfire and Iraqi roundups of
    young people suspected of involvement in the
    resistance. "There is no law and order," said
    Thuraya, 19, who would not give her last name.
    "A soldier can rape a father's daughter in front
    of him and he can't do anything about it." _The
    State Department said Iraq had told U.S.
    officials that American males residing in Iraq
    and Kuwait who were born in Arab countries will
    be allowed to leave. Iraq generally has not let
    American males leave. It was not known how many
    men the Iraqi move could affect. _A Pentagon
    spokesman said "some increase in military
    activity" had been detected inside Iraq near its
    borders with Turkey and Syria. He said there was
    little indication hostilities are imminent.
    Defense Secretary Dick Cheney said the cost of
    the U.S. military buildup in the Middle East was
    rising above the $1 billion-a-month estimate
    generally used by government officials. He said
    the total cost _ if no shooting war breaks out _
    could total $15 billion in the next fiscal year
    beginning Oct. 1. Cheney promised disgruntled
    lawmakers "a significant increase" in help from
    Arab nations and other U.S. allies for Operation
    Desert Shield. Japan, which has been accused of
    responding too slowly to the crisis in the gulf,
    said Tuesday it may give $2 billion to Egypt,
    Jordan and Turkey, hit hardest by the U.N.
    prohibition on trade with Iraq. "The pressure
    from abroad is getting so strong," said Hiroyasu
    Horio, an official with the Ministry of
    International Trade and Industry. Local news
    reports said the aid would be extended through
    the World Bank and International Monetary Fund,
    and $600 million would be sent as early as
    mid-September. On Friday, Treasury Secretary
    Nicholas Brady visited Tokyo on a world tour
    seeking $10.5 billion to help Egypt, Jordan and
    Turkey. Japan has already promised a $1 billion
    aid package for multinational peacekeeping forces
    in Saudi Arabia, including food, water, vehicles
    and prefabricated housing for non-military uses.
    But critics in the United States have said Japan
    should do more because its economy depends
    heavily on oil from the Middle East. Japan
    imports 99 percent of its oil. Japan's
    constitution bans the use of force in settling
    international disputes and Japanese law restricts
    the military to Japanese territory, except for
    ceremonial occasions. On Monday, Saddam offered
    developing nations free oil if they would send
    their tankers to pick it up. The first two
    countries to respond Tuesday _ the Philippines
    and Namibia _ said no. Manila said it had already
    fulfilled its oil requirements, and Namibia said
    it would not "sell its sovereignty" for Iraqi
    oil. Venezuelan President Carlos Andres Perez
    dismissed Saddam's offer of free oil as "a
    propaganda ploy." Venezuela, an OPEC member,
    has led a drive among oil-producing nations to
    boost production to make up for the shortfall
    caused by the loss of Iraqi and Kuwaiti oil from
    the world market. Their oil makes up 20 percent
    of the world's oil reserves. Only Saudi Arabia
    has higher reserves. But according to the State
    Department, Cuba, which faces an oil deficit
    because of reduced Soviet deliveries, has
    received a shipment of Iraqi petroleum since U.N.
    sanctions were imposed five weeks ago. And
    Romania, it said, expects to receive oil
    indirectly from Iraq. Romania's ambassador to the
    United States, Virgil Constantinescu, denied that
    claim Tuesday, calling it "absolutely false and
    without foundation."

Cracks appeared in the U.N. trade embargo against
Iraq. The State Department reports that Cuba and
Romania have struck oil deals with Iraq as others
attempt to trade with Baghdad in defiance of the
sanctions. Iran has agreed to exchange food and
medicine for Iraqi oil. Saddam has offered
developing nations free oil if they send their
tankers to pick it up. Thus far, none has
accepted. Japan, accused of responding too slowly
to the Gulf crisis, has promised $2 billion in
aid to countries hit hardest by the Iraqi trade
embargo. President Bush has promised that
Saddam's aggression will not succeed.
Manual summarization
Creation of semantic network
Semantic net of Subj-Pred-Obj triples
Automatically built document summary (not done
by us)
70% recall, 40% precision of selected triples
according to human-generated summaries
Automatic summarization by selecting relevant
triples
Cracks appeared in the U.N. trade embargo against
Iraq. The State Department reports that Cuba and
Romania have struck oil deals with Iraq as others
attempt to trade with Baghdad in defiance of the
sanctions. Iran has agreed to exchange food and
medicine for Iraqi oil. Saddam has offered
developing nations free oil if they send their
tankers to pick it up. Thus far, none has
accepted. Japan, accused of responding too slowly
to the Gulf crisis, has promised $2 billion in
aid to countries hit hardest by the Iraqi trade
embargo. President Bush has promised that
Saddam's aggression will not succeed.
Nat. Lang. Generation
Mapping between graphs learned with ML methods
Semantic net of Subj-Pred-Obj triples
35
Detailed Summarization Procedure
  • Linguistic analysis of the text
  • - Deep parsing of sentences
  • Refinement of the text parse
  • - Named-entity consolidation: determine that
    George Bush = Bush = U.S. president
  • - Anaphora resolution: link pronouns with named
    entities
  • Extract Subject-Predicate-Object triples

Tom Sawyer went to town. He met a friend. Tom was
happy.
Tom Sawyer went to town. He [Tom Sawyer] met a
friend. Tom [Tom Sawyer] was happy.
Tom → go → town; Tom → meet → friend; Tom → is →
happy
  • Compose a graph from triples
  • Describe each triple with a set of features for
    learning
  • Learn a model to classify triples into the
    summary
  • Generate a summary graph
  • Use the summary graph to generate a textual
    document summary
36
Named entities consolidation
  • Consolidating different surface forms that refer
    to the same entity; only for names of people,
    places, companies, etc.
  • Example
  • Hillary Rodham Clinton, Hillary Clinton, Hillary
    Rodham, Mrs. Clinton → Hillary Clinton
  • Heuristic based on the overlap in the surface
    forms of name variants
  • Accuracy on a subset of the data set: 90%.
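
The overlap heuristic can be illustrated with a minimal sketch. The slides do not give the exact rule, so everything here is an assumption: surface forms are grouped when they share at least one name token, honorifics (the hypothetical TITLES set) are ignored, and no canonical form is chosen. A rule this loose would also merge, say, two different people with the same surname.

```python
TITLES = {"mr", "mrs", "ms", "dr"}  # hypothetical list of ignored honorifics


def tokens(name):
    """Lower-cased name tokens without honorifics and trailing dots."""
    words = (w.strip(".").lower() for w in name.split())
    return {w for w in words if w and w not in TITLES}


def consolidate(names):
    """Group surface forms whose token sets overlap (transitively)."""
    groups = []  # list of (token_set, [surface forms]); token sets stay disjoint
    for name in names:
        toks = tokens(name)
        merged_toks, merged_names = set(toks), [name]
        remaining = []
        for g_toks, g_names in groups:
            if g_toks & toks:
                # The new name links this group to the merged one.
                merged_toks |= g_toks
                merged_names = g_names + merged_names
            else:
                remaining.append((g_toks, g_names))
        groups = remaining + [(merged_toks, merged_names)]
    return [g_names for _, g_names in groups]


names = ["Hillary Rodham Clinton", "Hillary Clinton",
         "Hillary Rodham", "Mrs. Clinton", "Tom Sawyer"]
for group in consolidate(names):
    print(group)
```

On the example from the slide, the four Clinton variants end up in one group and "Tom Sawyer" in another.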

37
Pronominal anaphora resolution
  • Link pronouns with their referents
  • Mary likes Paul. She went to buy him a
    present.
  • → Mary likes Paul. She [Mary] went to buy him
    [Paul] a present.
  • Method
  • restrict to 5 pronouns: she, he, who, I, they
  • from the pronoun, traverse the text searching for
    candidate referents and assign each a score
  • the score is based on the distance from the
    pronoun and on semantic information
  • assume that pronouns refer only to named entities
    found in the document
  • Problem
  • One passenger in King's car said they had been
    drinking liquor.
  • Average accuracy on 1,500 hand-labeled pronouns:
    81.2%
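
A rough sketch of distance-plus-semantics scoring follows. The slides do not specify the scoring function, so the agreement table COMPATIBLE, the mismatch penalty, the entity types and the backward-only search are all assumptions made for illustration.

```python
# Hypothetical agreement table: entity types each pronoun may refer to.
COMPATIBLE = {
    "she": {"female"},
    "he": {"male"},
    "they": {"group", "organization"},
    "who": {"female", "male", "group"},
    "i": {"female", "male"},
}


def resolve(pronoun, pronoun_pos, entities, mismatch_penalty=100):
    """Return the best-scoring preceding named entity for a pronoun.

    entities: list of (name, token_position, entity_type).
    Score = -(distance in tokens), minus a penalty if the types disagree.
    """
    allowed = COMPATIBLE[pronoun.lower()]
    best_name, best_score = None, float("-inf")
    for name, pos, etype in entities:
        if pos >= pronoun_pos:
            continue  # assumption: only look backwards in the text
        score = -(pronoun_pos - pos)
        if etype not in allowed:
            score -= mismatch_penalty
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# "Mary likes Paul. She went to buy him a present."
entities = [("Mary", 0, "female"), ("Paul", 2, "male")]
print(resolve("She", 3, entities))
```

Here "Paul" is closer to the pronoun, but the type mismatch outweighs the distance, so "She" is linked to "Mary".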

38
Anaphora resolution evaluation
Pronoun   Frequency (count)   Frequency (%)   Accuracy (%)
He               681              45.22           86.9
They             244              16.20           67.2
It               204              13.55             -
I                 64               4.25           82.8
You               50               3.32             -
We                44               2.92             -
That              44               2.92             -
What              27               1.79             -
She               24               1.59           62.5
This              22               1.46             -
Who               11               0.73           63.6

Total           1506             100              81.2
Accuracy on the 5 selected pronouns: 81.2% (55.2% if
counting all pronouns)
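
The two summary figures can be recomputed from the per-pronoun counts and accuracies listed above. The sketch assumes that every pronoun outside the five selected ones counts as an error when all pronouns are considered.

```python
# (count, accuracy in %) for the five resolved pronouns, copied from the table
resolved = {"He": (681, 86.9), "They": (244, 67.2), "I": (64, 82.8),
            "She": (24, 62.5), "Who": (11, 63.6)}
TOTAL_PRONOUNS = 1506

correct = sum(n * acc / 100 for n, acc in resolved.values())
selected = sum(n for n, _ in resolved.values())

acc_selected = 100 * correct / selected   # accuracy on the 5 selected pronouns
acc_all = 100 * correct / TOTAL_PRONOUNS  # unresolved pronouns count as errors
print(f"{acc_selected:.1f} {acc_all:.1f}")
```

Both values agree with the slide up to rounding of the per-pronoun accuracies.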
39
Extracting triples
  • The enhanced parse tree is traversed to identify
    Subject-Predicate-Object triples
  • Example
  • Conservatives embraced the nomination while
    liberals were cautious or hostile
  • Resulting triples
  • conservative → embrace → nomination
  • liberal → is → cautious
  • liberal → is → hostile

40
Detailed Summarization Procedure
  • Linguistic analysis of the text
  • - Deep parsing of sentences
  • Refinement of the text parse
  • - Named-entity consolidation: determine that
    George Bush = Bush = U.S. president
  • - Anaphora resolution: link pronouns with named
    entities
  • Extract Subject-Predicate-Object triples

Tom Sawyer went to town. He met a friend. Tom was
happy.
Tom Sawyer went to town. He [Tom Sawyer] met a
friend. Tom [Tom Sawyer] was happy.
Tom → go → town; Tom → meet → friend; Tom → is →
happy
  • Compose a graph from triples
  • Describe each triple with a set of features for
    learning
  • Learn a model to classify triples into the
    summary
  • Generate a summary graph
  • Use the summary graph to generate a textual
    document summary
41
Training of summarization model
  • Model ranks Subject-Predicate-Object triples
    according to their importance

Document Semantic network
Summary semantic network
42
Composing a graph
  • The graph consists of nodes, referred to as
    concepts, which can be subjects or objects, and
    edges, which are predicates and capture relations
    among concepts.
  • Use WordNet to identify and merge synonym nodes,
    as they correspond to the same concept.

43
Feature construction
  • Each triple is described for the learning process
    by the following attributes
  • Positional information
  • Of the sentence from which the triple was derived
    relative to the document text
  • Of the triple relative to the beginning of the
    sentence
  • Linguistic attributes of the nodes in the triple
    (NLP)
  • 18 syntactic attributes
  • 100 semantic attributes
  • 14 graph attributes: PageRank, in/out degree,
    reachable neighbours, etc.
  • For the dataset this yields
  • a total of 466 attributes
  • on average 72 non-zero attributes per triple.
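
Some of the graph attributes named above can be sketched in plain Python. This is an illustration under assumptions, not the feature extractor used in the experiments: the function name graph_features, the damping factor and the iteration count are all hypothetical choices, and only four of the 14 graph attributes are shown.

```python
from collections import defaultdict


def graph_features(triples, damping=0.85, iterations=50):
    """Per-node in/out degree, weak-component size and PageRank."""
    out_edges, in_edges, nodes = defaultdict(set), defaultdict(set), set()
    for subj, _pred, obj in triples:
        out_edges[subj].add(obj)
        in_edges[obj].add(subj)
        nodes.update((subj, obj))

    # Size of the weakly connected component of each node (direction ignored).
    comp_size, seen = {}, set()
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], {start}
        while stack:
            n = stack.pop()
            for m in out_edges[n] | in_edges[n]:
                if m not in comp:
                    comp.add(m)
                    stack.append(m)
        seen |= comp
        for n in comp:
            comp_size[n] = len(comp)

    # PageRank by power iteration; dangling nodes spread their rank uniformly.
    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out_edges[n])
        new = {n: (1 - damping) / count + damping * dangling / count
               for n in nodes}
        for n in nodes:
            for m in out_edges[n]:
                new[m] += damping * rank[n] / len(out_edges[n])
        rank = new

    return {n: {"in": len(in_edges[n]), "out": len(out_edges[n]),
                "component": comp_size[n], "pagerank": rank[n]}
            for n in nodes}


triples = [("Tom", "go", "town"), ("Tom", "meet", "friend"),
           ("Tom", "is", "happy"), ("Mary", "like", "Paul")]
features = graph_features(triples)
print(features["Tom"])
```

Each subject or object node gets its own attribute dictionary; a triple could then be described by the attributes of its subject and its object.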

44
Experiments
  • Machine learning with Linear SVM to classify
    triples into relevant or not-relevant for the
    summary
  • Positive examples are triples from the sentences
    which were marked as summary sentences by experts
  • Negative examples are all other triples
  • Data
  • 147 documents from the DUC 2002 for which we had
    extracted summaries.
  • Evaluation
  • Report microaveraged values of precision, recall
    and F1 for the extracted triples using 10-fold
    cross validation.
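
For reference, micro-averaging pools the counts over all folds before computing the ratios. A minimal sketch follows; the function name micro_prf and the counts in the usage example are made up for illustration and are not taken from the experiments.

```python
def micro_prf(folds):
    """Micro-averaged precision, recall and F1 over cross-validation folds.

    folds: list of (true_positives, false_positives, false_negatives),
    one tuple per fold. Counts are pooled before the ratios are taken.
    """
    tp = sum(f[0] for f in folds)
    fp = sum(f[1] for f in folds)
    fn = sum(f[2] for f in folds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Two hypothetical folds: (TP, FP, FN)
print(micro_prf([(30, 70, 20), (10, 10, 40)]))
```

Because the counts are pooled, folds with more triples weigh more than they would in a macro-average over per-fold scores.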

45
Performance for various attribute sets
                                    Training set              Test set
Attribute set                       Precision Recall  F1      Precision Recall  F1
Sentence position + terms             65.87   92.48  76.94      28.87   37.08  32.46
only Position (triple + sentence)     31.21   52.49  39.15      31.05   52.58  39.05
only Graph                            27.78   57.46  37.46      27.25   56.90  36.85
only Linguistic                       29.77   61.79  40.18      22.29   47.52  30.29
Position + Linguistic                 31.16   67.00  42.54      28.67   62.57  39.33
Position + Graph                      33.51   63.85  43.95      32.71   63.02  43.07
Position + Graph + Linguistic         35.82   72.69  47.99      31.41   64.88  42.33
46
Performance for various attribute sets
Baseline performance (sentence position +
selected terms from the sentence), F1 = 32.46, is
lower than in any of the other runs, except for
the run with only linguistic attributes
(F1 = 30.29). The only-linguistic run includes
only generic syntactic and semantic labels, which
are not expected to be good discriminators on
their own.
47
Performance for various attribute sets
Adding generic linguistic attributes reduces
precision (position of triples and sentences →
P = 31.05; adding linguistic attributes →
P = 28.67) but consistently increases recall.
48
Performance for various attribute sets
Information about the graph structure helps:
position of triples and sentences → F1 = 39.05;
adding structure information → F1 = 43.07.
49
Insights
We determine the median and quartiles of the
ranks across 10 runs.
  • Most highly ranked features in the SVM normal

Attribute                                   1st quartile  Median  3rd quartile
Object Authority weight                          1          1         2
Object size of weakly connected component        2          2.5       3
Object degree of a node                          2          3         3
Object is name of a country                      4          5         5
Subject size of weakly connected component       6          7         9
Subject degree of a node                         6         10.5      12
Object PageRank weight                           6         11        12
Object is name of a geographical location        8         13        16
Subject Authority weight                        13         18.5      23
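
One way to obtain such a rank summary is sketched below. It assumes that features are ranked by the absolute value of their weight in the SVM normal and that quartiles are interpolated with the "inclusive" method; the slides state neither, and the weights in the usage example are invented.

```python
import statistics


def rank_summary(weight_runs):
    """1st quartile, median and 3rd quartile of each feature's rank.

    weight_runs: list of dicts {feature_name: weight}, one per
    cross-validation run. Rank 1 is the largest absolute weight.
    """
    ranks = {}
    for weights in weight_runs:
        ordered = sorted(weights, key=lambda f: abs(weights[f]), reverse=True)
        for position, feature in enumerate(ordered, start=1):
            ranks.setdefault(feature, []).append(position)
    summary = {}
    for feature, r in ranks.items():
        q1, median, q3 = statistics.quantiles(r, n=4, method="inclusive")
        summary[feature] = (q1, median, q3)
    return summary


# Four hypothetical runs with three features each
runs = [{"a": 3, "b": -2, "c": 1}, {"a": 3, "b": -2, "c": 1},
        {"a": -2, "b": 3, "c": 1}, {"a": 3, "b": 2, "c": -1}]
print(rank_summary(runs))
```

A feature that is ranked consistently across runs shows a narrow gap between its first and third quartile, as the top rows of the table do.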
50
Example of summarization
  • Cracks Appear in U.N. Trade Embargo Against
    Iraq.

Human written summary
Cracks appeared in the U.N. trade embargo against
Iraq. The State Department reports that Cuba and
Romania have struck oil deals with Iraq as others
attempt to trade with Baghdad in defiance of the
sanctions. Iran has agreed to exchange food and
medicine for Iraqi oil. Saddam has offered
developing nations free oil if they send their
tankers to pick it up. Thus far, none has
accepted. Japan, accused of responding too slowly
to the Gulf crisis, has promised $2 billion in
aid to countries hit hardest by the Iraqi trade
embargo. President Bush has promised that
Saddam's aggression will not succeed.
7800 chars, 1300 words
51
Full document semantic graph
52
Automatically generated summary graph
53
Findings on summarization with semantic graphs
  • Experiments show that attributes that
    characterize the document semantic graph improve
    the selection of triples for summarization
  • These results need to be verified on additional
    data sets
  • Need to perform comparisons with additional
    summarization methods
  • Explore various strategies for extracting and
    generating summaries based on extracted triples.
  • No combination of features that was examined led
    to good separation of positive and negative
    triples in the feature space
  • Opportunity for further investigations and
    improvements.

54
Contents
  • Knowledge Discovery
  • Large Scale Topic Ontology population
  • Extraction of Semantic Networks from Text
  • Active Learning for efficient using of human
    interventions
  • Methods Addressing Different Aspects of Ontology
    Construction
  • Final Remarks

55
Active Learning /Dealing with unlabeled data
56
The idea of Active Learning
  • The idea of Active Learning: if a student asks
    smart questions, the student arrives at the
    required model of knowledge faster than by asking
    random questions
  • The goal is to use Active Learning algorithms for
    semi-automatic
  • construction of models for labeling data and
  • ontology learning

57
Quick Intro to Active Learning
[Figure: passive student; the teacher supplies data
and labels]
  • We use these methods whenever hand-labeled data
    are rare or expensive to obtain
  • Interactive method
  • Requests labeling only of interesting objects
  • Much less human work is needed for the same
    result compared to labeling arbitrary examples

[Figure: active student; the student sends a query
and the teacher returns a label]
[Plot: performance vs. number of questions, with
curves for the active student asking smart questions
and the passive student asking random questions]
58
Algorithms tested
  • Uncertainty sampling (efficient)
  • select the example closest to the decision
    hyperplane (or the one with classification
    probability closest to P = 0.5) (Tong & Koller
    2000, Stanford)
  • Maximum margin ratio change
  • select the example with the largest predicted
    impact on the margin size if selected (Tong &
    Koller 2000, Stanford)
  • Monte Carlo Estimation of Error Reduction
  • select the example that reinforces our current
    beliefs (Roy & McCallum 2001, CMU)
  • Random sampling as baseline
  • Experimental evaluation (using the F1-measure) of
    the four listed approaches, shown on three
    categories from the Reuters-2000 dataset
  • average over 10 random samples of 5000 training
    (out of 500k) and 10k testing (out of 300k)
    examples
  • the last two methods are rather time-consuming,
    so we run them only for the first 50 unlabeled
    examples
  • experiments show that active learning is
    especially useful for unbalanced data
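
Uncertainty sampling for a linear model reduces to picking the unlabeled example with the smallest absolute decision value. A minimal sketch, assuming the model is given as a weight vector w and bias b (the function name most_uncertain is hypothetical):

```python
def most_uncertain(unlabeled, w, b):
    """Index of the unlabeled example closest to the hyperplane w.x + b = 0."""
    def margin(x):
        # |w.x + b| is proportional to the distance from the hyperplane,
        # so dividing by the norm of w would not change the argmin.
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return min(range(len(unlabeled)), key=lambda i: margin(unlabeled[i]))


pool = [(2.0, 0.0), (0.1, 0.0), (-3.0, 1.0)]
print(most_uncertain(pool, w=(1.0, 0.0), b=0.0))
```

The cost is one dot product per candidate, which is why this strategy is cheap compared with methods that retrain the model for each candidate.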

59
Category with balanced class distribution having
47% positive examples: limited advantage over
random sampling
60
Category with fairly unbalanced class
distribution having 20% positive examples: best
performance with Uncertainty and MarginRatio;
Uncertainty is simpler and much more efficient
61
Category with very unbalanced class distribution
having 2.7% positive examples: Uncertainty
seems to outperform MarginRatio
62
Illustration of Active learning
  • starting with one labeled example from each class
    (red and blue)
  • select one example for labeling (green circle)
  • request its label and re-generate the model using
    the extended labeled data
  • Illustration of a linear SVM model using
  • arbitrary selection of unlabeled examples
    (random)
  • active learning selecting the most uncertain
    examples (closest to the decision hyperplane)
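
The loop described above can be sketched as follows. To stay free of dependencies, the sketch swaps the linear SVM for a much simpler stand-in (a hyperplane halfway between the two class centroids); the function names and the oracle callback playing the teacher are hypothetical.

```python
def train_centroid(labeled):
    """Stand-in linear model: hyperplane halfway between class centroids.

    labeled must contain at least one example of each class (+1 and -1).
    """
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == -1]
    dim = len(labeled[0][0])
    mu_p = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_n = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]
    w = [p - n for p, n in zip(mu_p, mu_n)]
    b = -sum(wi * (p + n) / 2 for wi, p, n in zip(w, mu_p, mu_n))
    return w, b


def score(x, w, b):
    """Signed decision value of x under the linear model (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def active_learning(labeled, pool, oracle, queries):
    """Repeatedly query the label of the most uncertain pool example."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(queries):
        w, b = train_centroid(labeled)
        x = min(pool, key=lambda p: abs(score(p, w, b)))  # closest to hyperplane
        pool.remove(x)
        labeled.append((x, oracle(x)))  # ask the teacher for the label
    return train_centroid(labeled), labeled


# One labeled example per class, true boundary at x = 5
start = [((0.0,), -1), ((10.0,), 1)]
pool = [(1.0,), (4.0,), (6.0,), (9.0,)]
model, labeled = active_learning(start, pool, lambda x: 1 if x[0] > 5 else -1, 2)
print(model, [x for x, _ in labeled[2:]])
```

In this toy run the two queried points are the ones nearest the true boundary, which is the behaviour the illustration contrasts with random selection.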

63
Uncertainty sampling of unlabeled example
65
Contents
  • Knowledge Discovery
  • Large Scale Topic Ontology population
  • Extraction of Semantic Networks from Text
  • Active Learning for efficient using of human
    interventions
  • Methods Addressing Different Aspects of Ontology
    Construction
  • Final Remarks

66
Methods Addressing Different Aspects of Ontology
Construction
67
Methods addressing different aspects of ontology
construction
  • Collecting data
  • focused crawling with Google and DMoz in the loop
  • Dealing with different natural languages
  • map the documents into a language-independent
    semantic space
  • Going directly from the data
  • semi-automatic creation of an ontology directly
    from the data under predefined
    conditions/scenarios
  • Annotation of text

68
Focused Crawler
  • A focused crawler finds, in a relatively short
    time, web pages related to a given web page
  • The solution uses the DMoz topic ontology to get
    content context, and Google to get web linkage
    context
  • the main idea is to browse the web graph as a
    bi-directional graph using the link: query in
    Google
  • Algorithm
  • For an efficient initial set of candidate pages we
    use Google and DMoz
  • From the initial set, pages are crawled in
    breadth-first fashion
  • priority in the crawler queue is given to more
    similar pages
  • after some stopping condition is met, the
    crawler returns the list of candidate web pages
  • Usage: serves as a technique for collecting the
    data for the next stages of data processing, such
    as building and populating ontologies for the
    Semantic Web and improved knowledge access
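
The crawl loop above can be sketched on a toy in-memory web graph. This is an illustrative sketch under stated assumptions, not the JSI crawler: page texts and links come from hypothetical dictionaries (`pages`, `links`) instead of the live web, back-links are computed from the toy graph rather than obtained from a search engine, similarity is plain bag-of-words cosine, and the stopping condition is simply a page budget (`max_pages`).

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(n * b[t] for t, n in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def focused_crawl(pages, links, seed, max_pages):
    """Best-first crawl around `seed`, following links in both directions."""
    target = Counter(pages[seed].split())

    # Invert the link structure so the graph can be browsed bi-directionally.
    backlinks = {}
    for src, outs in links.items():
        for dst in outs:
            backlinks.setdefault(dst, []).append(src)

    def neighbours(url):
        return links.get(url, []) + backlinks.get(url, [])

    seen = {seed}
    frontier = []  # min-heap of (-similarity, url): most similar page first

    def enqueue(url):
        if url not in seen and url in pages:
            seen.add(url)
            sim = cosine(target, Counter(pages[url].split()))
            heapq.heappush(frontier, (-sim, url))

    for url in neighbours(seed):
        enqueue(url)

    results = []
    while frontier and len(results) < max_pages:  # stopping condition
        neg_sim, url = heapq.heappop(frontier)
        results.append((url, -neg_sim))
        for nxt in neighbours(url):
            enqueue(nxt)
    return results


# Toy data (hypothetical): page text and outgoing links.
pages = {
    "bt": "telecom broadband phone network",
    "att": "telecom network phone company",
    "bbc": "news television radio",
    "ntl": "broadband cable internet telecom",
    "zoo": "animals park tickets",
}
links = {"bt": ["att", "bbc"], "ntl": ["bt"], "bbc": ["zoo"], "att": []}
print(focused_crawl(pages, links, "bt", max_pages=3))
```

The priority queue is what turns a plain breadth-first crawl into a focused one: pages more similar to the seed are expanded first.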

69
Example: Focused Crawl
  • Focused crawl for the BT home page
    (http://www.bt.com)
  • 1. www.bt.co.uk/ - BT
  • 2. www.yell.com/ucs/HomePageAction.do - UK's
    local search engine
  • 3. www.att.com/ - AT&T: The World's Networking
    Company
  • 4. www.cisco.com/ - Cisco Systems, Inc
  • 5. www.microsoft.com/ - Microsoft Corporation
  • 6. www.bbc.co.uk/ - BBC
  • 7. www.hp.com/ - HP United States
  • 8. www.ntl.com/ - Broadband cable internet access
  • 9. www.telekom.de/ - Deutsche Telekom
  • 10. www.epsrc.ac.uk/ - EPSRC
  • 11. www.cw.com/ - Cable & Wireless
  • 12. www.royalmail.com/ - Royal Mail
  • 13. www.ericsson.com/ - Ericsson
  • 14. www.bp.com/home.do?categoryId=1 - BP Global
  • 15. www.telewest.co.uk/ - Telewest Broadband PLC
  • 16. www.verizon.com/ - Verizon
  • 17. www.nokia.com/ - Nokia
  • 18. www.bt.com/at_home.jsp - BT.com At Home

70
Language-independent document representation
  • From aligned corpora we learn mappings from
    documents into a language-independent
    representation using the Kernel Canonical
    Correlation Analysis (KCCA) method
  • such a representation could be used, e.g., for
    multilingual classification or multilingual IR
  • On-going work on learning mappings between all
    European languages using the CELEX corpus of
    European legislation in 21 languages

71
Two views of the same data: find the direction
with maximal correlation
  • View 1
  • View 2
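
The idea of searching both views for maximally correlated directions can be sketched on toy 2-D data. This is only an illustration: canonical correlation analysis is normally solved as an eigenproblem (and KCCA does so in a kernel-induced feature space), whereas this sketch uses a brute-force search over projection angles. The data and all function names are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def project(points, angle):
    """Project 2-D points onto the unit direction given by `angle`."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * x + s * y for x, y in points]


def best_directions(view1, view2, steps=36):
    """Grid search for the pair of directions with maximal |correlation|."""
    best = (-1.0, 0.0, 0.0)  # (correlation, angle in view 1, angle in view 2)
    for i in range(steps):
        a = math.pi * i / steps
        p1 = project(view1, a)
        for j in range(steps):
            b = math.pi * j / steps
            r = abs(pearson(p1, project(view2, b)))
            if r > best[0]:
                best = (r, a, b)
    return best


# Toy data: both views share a hidden signal z but store it on different axes.
z = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise1 = [1.0, -1.0, 0.0, 1.0, -1.0, 0.0]
noise2 = [0.0, 1.0, -1.0, 0.0, -1.0, 1.0]
view1 = list(zip(z, noise1))                            # signal on the x axis
view2 = [(n, 2.0 * s) for n, s in zip(noise2, z)]       # signal on the y axis

r, angle1, angle2 = best_directions(view1, view2)
print(round(r, 3), round(angle1, 3), round(angle2, 3))
```

Poorly chosen directions give a low correlation between the projections, while the best pair recovers the shared signal, which is the progression the following correlation slides illustrate.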

72
Correlation: 0.17
  • View 1
  • View 2

73
Correlation: 0.44
  • View 1
  • View 2

74
Correlation: 0.97
  • View 1
  • View 2

75
Correlated directions found with KCCA when
applied to financial news articles
  • Direction 1 (German): ZENTRALBANK, BP, MILLIARDE,
    DOLLAR, VERLUST, EINKOMMEN, FIRMA, VIERTEL
  • Direction 1 (English): BANK, BP, CENTRAL, DOLLAR,
    LOSS, INCOME, COMPANY, QUARTER
  • Direction 2 (German): GESCHICHTEN, MILLION, SAGT,
    BORSEN, ZAHLUNG, VOLLE, GEWERKSCHAFT,
    VERHANDLUNGSRUNDE
  • Direction 2 (English): STORIES, MILLION, SAYS,
    EXCHANGES, WAGE, PAYMENT, NEGOTIATIONS, UNION

76
Modelling directly from the data: getting
semantic classes with LSI
  • CELLS, GENE, CANCER, GENOMIC, MOLECULAR
  • SERVICES, GRID, USER, MOBILIZATION, CONTENT
  • CELLS, STEM_CELLS, STEM, VACCINES, WEB
  • CONTENT, MEDIA, GRID, MULTIMEDIA, DIGITAL
  • ENERGY, OPTICS, WASTE, FUEL, NUCLEAR
  • SECURITY, ROBOT, EMBEDDED, BIOMETRICS, VECTOR
  • WEB, WEB_SERVICES, SEMANTIC, CONTENT, MEDIA
  • ROBOT, LEARNING, COGNITIVE, HUMAN, INTERACTIVE

77
Visualization of 6FP IST project (English)
78
Modeling relationships between companies from the
news
79
Annotation of text
  • Annotation based on examples
  • Annotation using clustering
  • Annotation based on thesaurus

80
Annotate text based on examples
  • Problem: annotation of text by assigning
    predefined labels to text fragments
  • Given: examples of annotated text fragments
  • learn annotation rules from already annotated
    documents (.xml, ...), similar to learning for
    information extraction (IE)
  • learn to classify sentences into semantic roles
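
Learning to label text fragments from annotated examples can be sketched with a small classifier. The slides do not name a specific learner, so this sketch assumes a multinomial Naive Bayes model with add-one smoothing; the labels, training fragments and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect per-label word counts from (fragment, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def annotate(model, fragment):
    """Return the most probable label for a new text fragment."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in fragment.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical annotated fragments.
examples = [
    ("the meeting is on monday", "DATE"),
    ("we meet next friday", "DATE"),
    ("the office is in ljubljana", "LOCATION"),
    ("the lab is located in london", "LOCATION"),
]
model = train(examples)
print(annotate(model, "the workshop is next monday"))
print(annotate(model, "the lab in london"))
```

With realistic training data the same interface applies; only the feature extraction and the learner would need to be stronger.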

81
Annotate text using clustering
  • Problem: annotation of text by finding labels and
    assigning them to text fragments
  • Given: text to annotate
  • split documents into sentences, represent each
    sentence as a word-vector
  • cluster sentences and label them by the most
    characteristic words from the sentences
  • e.g., using local frequency of words, clustering
    with SOM and using neural network weights of
    words
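
The steps above (split into sentences, build word-vectors, cluster, label clusters by characteristic words) can be sketched as follows. This sketch makes several simplifying assumptions: it uses a small k-means with cosine similarity in place of a SOM, initializes the clusters deterministically with the first `k` sentences, and takes the locally most frequent words as cluster labels. All names are hypothetical.

```python
import math
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break on full stops."""
    return [s.strip().lower() for s in text.split(".") if s.strip()]


def cosine(a, b):
    """Cosine similarity between two sparse word-vectors (dict-like)."""
    dot = sum(n * b.get(t, 0) for t, n in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(vectors, k, iterations=10):
    """k-means with cosine similarity; first k vectors are the initial centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    assignment = []
    for _ in range(iterations):
        assignment = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = {w: n / len(members) for w, n in total.items()}
    return assignment


def annotate_by_clustering(text, k, top=3):
    """Cluster sentences and label each cluster by its most frequent words."""
    sentences = split_sentences(text)
    vectors = [Counter(s.split()) for s in sentences]
    assignment = cluster(vectors, k)
    labels = {}
    for c in range(k):
        counts = Counter()
        for v, a in zip(vectors, assignment):
            if a == c:
                counts.update(v)
        # Sort by local frequency, breaking ties alphabetically.
        labels[c] = sorted(counts, key=lambda w: (-counts[w], w))[:top]
    return assignment, labels


doc = ("Gene cancer cells. Grid services web. "
       "Cancer cells gene therapy. Web services grid computing.")
print(annotate_by_clustering(doc, k=2))
```

Each sentence can then be annotated with the label words of the cluster it was assigned to.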

82
Annotate text based on thesaurus
  • Problem: annotation of text by finding labels and
    assigning them to text fragments
  • Given: text to annotate, thesaurus
  • a) apply NLP on the text to find noun-groups and
    map them onto concepts of a (medical) thesaurus
  • b) split the document into sentences, cluster them
    and map the clusters onto concepts of a general
    thesaurus (WordNet)
  • the concepts are used as semantic labels (XML
    tags) for annotating documents
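
Variant a) can be sketched with a toy thesaurus. This is a simplified illustration: instead of NLP-based noun-group detection it uses a longest-match lookup over word n-grams, and the thesaurus entries, concept names and the function `annotate` are hypothetical.

```python
def annotate(text, thesaurus, max_len=3):
    """Wrap every phrase found in the thesaurus in an XML tag named after its concept."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            concept = thesaurus.get(phrase.lower())
            if concept is not None:
                out.append("<%s>%s</%s>" % (concept, phrase, concept))
                i += n
                break
        else:
            # No thesaurus entry starts at this word: copy it unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)


# Hypothetical two-entry medical thesaurus: phrase -> concept.
thesaurus = {"heart attack": "Disease", "aspirin": "Drug"}
print(annotate("Patients with heart attack received aspirin", thesaurus))
```

A real system would normalize morphology and punctuation before the lookup; the sketch only shows how concepts become semantic labels in the output.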

83
Ontology evaluation directions
  • Analysis of information-theoretic properties of
    structured data instances
  • Measure of the agreement to the characteristics
    derived from manually built ontologies
  • Optimization of efficiency of the user's
    behaviour when using an ontology (e.g.,
    minimizing the number of user clicks)

84
Contents
  • Knowledge Discovery
  • Large Scale Topic Ontology population
  • Extraction of Semantic Networks from Text
  • Active Learning for efficient use of human
    interventions
  • Methods Addressing Different Aspects of Ontology
    Construction
  • Final Remarks

85
Ontology Learning Challenge
  • Academic challenge on DMoz data (Science part)
    for 3 tasks
  • Taxonomy Population
  • Given taxonomy with documents, the task is to
    classify new documents into taxonomic categories
  • Naming Categories
  • Given taxonomic categories with documents, the
    task is to (semi)automatically propose names for
    categories
  • Constructing Taxonomy from Documents
  • Given a set of documents, the task is to
    (semi)automatically propose taxonomic structure
  • The goal is to model human skills when dealing
    with large amounts of data
  • Data
  • DMoz/Science (10k concepts, 100k instances)
  • Tourist ontology (from KU) (70 concepts, 1000
    instances)
  • The challenge will be funded through the PASCAL
    Network of Excellence European project
    (http://www.pascal-network.org/)

86
Ideas / Future plans (1)
  • DMoz categories as standard web meta-data
    dictionary
  • the idea is to use DMoz categories/keywords as a
    standardized dictionary for meta-data labeling of
    general Web pages
  • because of the dynamic and adaptive nature of DMoz
    categorization (reflecting all major topics on
    the web) this could be interesting as a baseline
    for Semantic Web style annotation
  • e.g., could be deployed as a tool for
    (semi)automatic generation of <META> tags for web
    pages

87
Ideas / Future plans (2)
  • DMoz classifier as an annotation tool
  • the idea is to use DMoz-classifier tool for
    meta-data (keyword) generation
  • some other popular databases (e.g. Wikipedia)
    could have attached automatically generated DMoz
    categories
  • could be accessible as a web service (e.g. SOAP
    interface)

88
Ideas / Future plans (3)
  • DMoz Visualizer
  • the idea is to create a tool for visualization
    and browsing through DMoz structure
  • browsing tools could combine other public and
    commercial sources (such as Wikipedia, Google,
    Amazon, eBay)
  • could appear as e.g. web-browser toolbar

89
Ideas / Future plans (4)
  • Analysis of DMoz Dynamics
  • The future research plan is to model the dynamics
    of the DMoz taxonomy based on data from the DMoz
    Archive (http://rdf.dmoz.org/rdf/archive/)
  • the idea is to model the decision process of when
    and how the editors decide to split category
    nodes
  • currently the repository includes 120 snapshots
    of DMoz from year 2000 on

90
Ideas / Future plans (5)
  • Focused crawling for DMoz
  • the idea is to use a focused crawler for proposing
    new web sites for particular categories (as an
    editorial tool)
  • at JSI we developed a focused crawler for fast and
    efficient crawling for focused content; it can be
    further extended
  • to use Google and DMoz in the loop
  • to use user hints (positive and negative examples
    of content pages)
  • based on the Corpus-Builder project at CMU
  • http://www-2.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/corpusbuilder/

91
Ideas / Future plans (6)
  • Classification of non-English documents
  • we use string kernels to avoid problems with
    morphology
  • paper submitted to ECML/PKDD 2005 (Fortuna &
    Mladenic) on classification into major Slovenian
    and Croatian taxonomies
  • we plan to use Canonical Correlation
    Analysis (CCA) for efficient identification of
    similar content written in different languages
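
Why string kernels help with morphology can be shown with a small example. The slides do not say which string kernel was used; this sketch assumes the simple p-spectrum kernel, which counts shared character substrings of length `p`. The function names are hypothetical.

```python
import math
from collections import Counter


def spectrum(text, p):
    """Counts of all character substrings of length p."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(s, t, p=3):
    """Inner product of the p-spectrum feature vectors of s and t."""
    a, b = spectrum(s, p), spectrum(t, p)
    return sum(n * b[g] for g, n in a.items())


def normalized_kernel(s, t, p=3):
    """Spectrum kernel scaled to [0, 1], so that k(s, s) = 1 for len(s) >= p."""
    k = spectrum_kernel(s, t, p)
    return k / math.sqrt(spectrum_kernel(s, s, p) * spectrum_kernel(t, t, p)) if k else 0.0


# Two inflected forms of the Slovenian word for "book".
print(spectrum_kernel("knjiga", "knjige"))    # shared 3-grams: knj, nji, jig
print(normalized_kernel("knjiga", "knjige"))
```

A word-level bag-of-words representation treats the two forms as unrelated features, while the character-level kernel still gives them a high similarity without any stemming or lemmatization.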

92
Text-Garden software library (in development over
the last 5 years)
93
Text-Garden data
  • Set of C++ classes for industrial-strength text
    mining problem solving
  • Currently organized in 50 command-line utilities
    covering
  • Machine learning/Data mining on text
  • Web related functionality
  • Profiling, Visualization
  • Currently works on Windows, to be ported to Linux

94
Text Garden: architecture of clustering,
visualization, classification
95
Text Garden Web site: www.textmining.net