Data Mining on Symbolic Knowledge Extracted from the Web - PowerPoint PPT Presentation

About This Presentation
Title:

Data Mining on Symbolic Knowledge Extracted from the Web

Description:

Data Mining on Symbolic Knowledge Extracted from the Web Changho Choi Source: http://www.cs.cmu.edu/~dunja/WshKDD2000.html Carnegie Mellon University, J.Stefan Institute – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Data Mining on Symbolic Knowledge Extracted from the Web


1
Data Mining on Symbolic Knowledge Extracted from
the Web
  • Changho Choi
  • Source: http://www.cs.cmu.edu/~dunja/WshKDD2000.html
  • Carnegie Mellon University, J.Stefan Institute

2
Abstract
  • This paper gives a case study of combining
    information
  • Unstructured Information
  • an errorful source of large amounts of
    potentially useful information
  • Structured Information
  • less up-to-date, but reliable as facts
  • Using information from two kinds of sources
  • Improves the reliability of data-mined rules

3
Introduction (1/2)
  • Challenge
  • not only gather and represent knowledge existing
    on the Web,
  • but also use that knowledge for planning, acting,
    and creating new knowledge

4
Introduction (2/2)
  • First stage
  • integrating three types of information gathering
  • Extracting propositional knowledge from
    highly-structured automatically-generated web
    pages
  • Extracting propositional knowledge from
    free-form, unstructured data sources
  • Extracting relational knowledge existing on the
    Web through a combination of web pages and their
    hyperlink structure
  • Aim
  • identify patterns of knowledge that were not
    explicitly represented as facts on the Web

5
Data sources and features
  • Extracted features
  • come directly from crawling the company Web sites
  • e.g. performs-activity, links-to, officers,
    sector, location, ...
  • Wrapper features from secondary sources
  • rely on a mostly regular format
  • e.g. hoovers-sector, hoovers-industry,
    hoovers-type, address, ...
  • Abstracted features
  • describe relationships between companies
  • discretize our continuous features
  • e.g. same-state, same-city, share-officers,
    mentions-same, ...

6
Process of acquiring potentially interesting
information about companies from the Web
[Figure: pipeline from the Web to new knowledge. Labels: 4312 web sites, 50 pages on each site (www.3com.com); Extracting from corp. Web sites; Wrapping from corp. info. (company information from www.hoovers.com); Abstracting features; KB; Data Mining; New knowledge.]
7
Extracted Features
Feature | Values | Description | Extracting Method
performs-activity | 8 | The types of activity this company engages in. | Looking for keywords associated with each type of activity.
links-to | - | Companies whose web sites are pointed to by this company. | Simple text search on all the web pages.
mentions | - | Companies whose names occur on this company's Web site. | Same as above.
officers | - | Officers of this company. | On the pages containing "officer" or "director".
sector | 200 | Naïve Bayes predicted economic sector of company. | Text classification by a naïve Bayes model.
coarse-sector | 12 | Naïve Bayes predicted coarse-grained economic sector. | Same as above.
locations | - | Derived from a naïve Bayes classifier on small regions of text surrounding country names, and AutoSlog-based rules. | Advanced information extraction technique.
url-country | 39 | Inferred from the URL domain name where applicable. | Country domain of the URL.
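The two simplest extraction methods in the table, keyword lookup for performs-activity and reading url-country off the URL's domain, might be sketched as follows. The keyword lists, the country-code map, and the function names are illustrative assumptions; the slides do not give the resources actually used.

```python
from urllib.parse import urlparse

# Hypothetical keyword lists; the slides do not list the real ones.
ACTIVITY_KEYWORDS = {
    "sell": ["buy now", "order online", "our store"],
    "research": ["laboratory", "research and development"],
}

# Small illustrative subset of country-code top-level domains.
COUNTRY_TLDS = {"jp": "japan", "de": "germany", "uk": "united-kingdom"}


def performs_activity(page_text):
    """Return the set of activities whose keywords occur in the page text."""
    text = page_text.lower()
    return {
        activity
        for activity, keywords in ACTIVITY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


def url_country(url):
    """Infer a country from the URL's country-code TLD, or None if not applicable."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return COUNTRY_TLDS.get(tld)
```

A generic domain such as www.3com.com carries no country code, so `url_country` returns `None` for it, which matches the "where applicable" qualifier in the table.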
8
Wrapped Features
Feature | Values | Description
hoovers-sector | 28 | Sector listed on the company's Hoovers page.
hoovers-industry | 298 | Industry listed on the company's Hoovers page.
hoovers-type | 18 | Public, private, school etc.
address | - | Address as listed on Hoovers.
city, state | - | Extracted from address.
competitor | - | Companies that compete with this company.
subsidiary | - | Companies listed as subsidiaries of this company.
products | 4648 | Product categories extracted from the products page.
officers | - | Officers listed on the Hoovers page.
auditors | 266 | Company auditors.
revenue | - | Revenue data for up to the last 10 years.
net-income | - | Net income data for up to the last 10 years.
net-profit | - | Net profit data for up to the last 10 years.
employees | - | Number of employees each year for up to the last 10 years.
9
Abstracted Features
Feature | Values | Description
same-state | - | Companies in the same state as this company.
same-city | - | Companies in the same city as this company.
share-officers | - | Companies that have officers in common with this company.
mentions-same | - | Companies that mention some company also mentioned by this company.
links-to-same | - | Companies that link to some company also linked to by this company.
reciprocally-mentions | - | Companies mentioned by this company, who link to this company.
reciprocally-links | - | Companies linked to by this company, who link to this company.
reciprocally-competes | - | Companies listed as a competitor of this company, who list this company as a competitor.
revenue-binned | 10 | Revenues for each of up to 10 years binned into 10 equal sized bins.
net-profit-binned | 10 | Net profits similarly binned.
net-income-binned | 10 | Net income similarly binned.
employees-binned | 10 | Employees similarly binned.
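A minimal sketch of how three kinds of abstracted feature could be derived from already-extracted facts: a shared-value relation (as in same-state), a reciprocal relation (as in reciprocally-links), and binning of a continuous value. The dictionary layout and function names are assumptions made for illustration, and "equal sized bins" is read here as equal-width bins, which the slides do not confirm.

```python
def same_value(kb, attribute):
    """Pairs of companies that share a (non-missing) value for the attribute."""
    pairs = set()
    names = sorted(kb)
    for i, a in enumerate(names):
        value = kb[a].get(attribute)
        if value is None:
            continue
        for b in names[i + 1:]:
            if kb[b].get(attribute) == value:
                pairs.add((a, b))
    return pairs


def reciprocally_links(links_to):
    """Pairs (a, b), a < b, where a links to b and b links back to a."""
    return {
        (a, b)
        for a, targets in links_to.items()
        for b in targets
        if a < b and a in links_to.get(b, set())
    }


def equal_width_bin(value, low, high, bins=10):
    """Map a value in [low, high] to a bin index in 0..bins-1 (equal-width bins)."""
    if high <= low:
        raise ValueError("high must exceed low")
    index = int((value - low) / (high - low) * bins)
    # Clamp so that value == high falls in the last bin and outliers stay in range.
    return min(max(index, 0), bins - 1)
```

For example, `same_value(kb, "state")` yields the same-state pairs, and `equal_width_bin(revenue, low, high)` would be applied per year to obtain a revenue-binned value.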
10
Data mining algorithms
  • Discovering associations
  • by applying the Apriori algorithm
  • Learning propositional rules
  • by using the C5.0 algorithm, which generates a
    decision tree for the given dataset
  • Learning relational rules
  • by using Quinlan's FOIL system, which can use
    patterns in the relationships between companies
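As a rough illustration of the first of these algorithms, the level-wise search at the heart of Apriori can be sketched in a few lines. This is a generic textbook-style version, not the implementation used in the study; the function name and the representation of transactions as Python sets are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support.

    transactions: list of sets of items; min_support: fraction in [0, 1].
    """
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent
```

In the company setting, each transaction would be the set of feature-value pairs known for one company, so a frequent itemset is a combination of feature values shared by many companies.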

11
Experimental results
  • Apriori Experiments
  • discover associations in the data using
    association rules
  • Decision Trees
  • generate propositional rules using Decision trees
  • FOIL Experiments
  • generate first order rules using the first order
    rule learning system

12
Result: Apriori Experiments (1/2)
  • Threshold
  • minimal support = 10%, minimal confidence = 80%
  • Some Examples
  • Highest-confidence rules => can be understood
    intuitively
  • performs-activity = sell <- locations =
    united-states, links-to =
    adobe-systems-incorporated (10.8%, 93.0%)
  • performs-activity = sell <- performs-activity =
    technical-assistance, links-to =
    adobe-systems-incorporated (11.8%, 91.1%)
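The two numbers quoted with each rule are its support and confidence. A minimal sketch of how they can be computed for a rule of the form head <- body is given below. It assumes each company is represented as a set of (feature, value) pairs and uses the common definition of support as the share of records matching both body and head; the slides do not say which definition the authors' tool used, and the function names are hypothetical.

```python
def rule_stats(records, body, head):
    """Return (support %, confidence %) of the rule head <- body.

    Assumed definitions: support = share of records containing body and head;
    confidence = share of body-matching records that also contain the head.
    """
    body, head = set(body), set(head)
    matches_body = [r for r in records if body <= r]
    matches_both = [r for r in matches_body if head <= r]
    support = 100.0 * len(matches_both) / len(records)
    confidence = 100.0 * len(matches_both) / len(matches_body) if matches_body else 0.0
    return support, confidence


def passes(records, body, head, min_support=10.0, min_confidence=80.0):
    """Check a rule against thresholds (defaults follow the 10 and 80 quoted on the slide)."""
    support, confidence = rule_stats(records, body, head)
    return support >= min_support and confidence >= min_confidence
```

With a body of `{("locations", "japan")}` and a head of `{("performs-activity", "sell")}`, `rule_stats` returns the pair that would be printed in parentheses after the rule.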

13
Result: Apriori Experiments (2/2)
  • Some Examples
  • Normal rules
  • performs-activity = sell <- locations = japan
    (14.5%, 90.8%)
  • performs-activity = research <- locations = japan
    (14.5%, 90.8%)
  • Rules with lower support or confidence
  • performs-activity = research <- locations =
    united-states (26.9%, 72.5%)
  • hoovers-sector = food-beverage--tobacco <-
    competitor = conagra-inc (1.0%, 89.8%)
  • hoovers-sector = retail <- competitor =
    kmart-corporation (1.0%, 75.0%)
  • hoovers-sector = energy <- competitor =
    bp-amoco-p.l.c. (1.1%, 73.0%)

Meaningful?

14
Result: Decision Trees
  • Example: predict the economic sector
  • city = atlanta
      revenue1996 < 0.1 => Diversified Services (28, 0.179)
      revenue1996 > 0.1 => Computer Software & Services (20, 0.2)
  • city = Houston
      coarse-sector = basic-materials, capital-goods,
        transportation => Manufacturing (10, 0.3)
      coarse-sector = financial, healthcare,
        technology => Computer Software & Services (21, 0.238)
      coarse-sector = conglomerates, consumer-cyclical,
        consumer-non-cyclical, energy, services,
        utilities => Energy (49, 0.49)
  • city = Dallas
      net_income1999 < 19 => Health Products & Services (25, 0.2)
      net_income1999 > 19 => Leisure (25, 0.2)
  • ...

Based on naïve Bayes classification
For cities, different features
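Read as code, a decision tree is just nested conditionals, with a different feature tested under each city. The sketch below hard-codes only the Atlanta and Dallas branches of the example. The function name and dictionary keys are assumptions, sector labels are written with "&" on the assumption that the ampersand was lost in the transcript, and the treatment of values exactly on a threshold is a guess, since the comparison operators are garbled in the source.

```python
def predict_sector(company):
    """Apply the Atlanta and Dallas branches of the example tree.

    Returns None for companies the two reproduced branches do not cover.
    """
    city = str(company.get("city", "")).lower()
    if city == "atlanta":
        # Boundary handling at exactly 0.1 is an assumption.
        if company["revenue1996"] < 0.1:
            return "Diversified Services"
        return "Computer Software & Services"
    if city == "dallas":
        # Boundary handling at exactly 19 is an assumption.
        if company["net_income1999"] < 19:
            return "Health Products & Services"
        return "Leisure"
    return None  # other branches of the tree are not reproduced here
```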
15
Result: FOIL Experiments (First Order Inductive
Learner)
  • Example
  • computer-software--services(A) :-
    hq-city(A,B), B <> fremont, competitor(A,C),
    hq-city(C, Islandia), not(employees_binned(A,?,?)).
  • It means that companies headquartered somewhere
    other than Fremont that compete with Computer
    Associates International are in the computer
    software & services sector. (Computer Associates
    International is the only company in our
    knowledge base headquartered in Islandia.)
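To make the literals of the example rule concrete, its body can be checked against a toy knowledge base as below. This only evaluates a rule that has already been learned; it is not FOIL itself. The dictionary-based representation, the function name, and the assumption that each company has a single headquarters city are illustrative choices.

```python
def satisfies_rule(a, hq_city, competitor, employees_binned):
    """True if company a satisfies the body of the example rule.

    hq_city: company -> city; competitor: company -> set of competitors;
    employees_binned: set of companies that have any employees_binned fact.
    """
    b = hq_city.get(a)
    if b is None or b == "fremont":
        return False  # needs hq-city(A,B) with B <> fremont
    if a in employees_binned:
        return False  # not(employees_binned(A,?,?))
    # competitor(A,C), hq-city(C, Islandia) for some company C
    return any(hq_city.get(c) == "islandia" for c in competitor.get(a, ()))
```

Because Computer Associates International is the only company headquartered in Islandia, the last test amounts to "a competes with Computer Associates International", as the slide explains.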

16
Discussion
  • Difficulties
  • data cleaning
  • errorful nature of our facts
  • feature selection
  • Pleasing result
  • the interaction between the symbolic features and
    the statistically derived (naïve Bayes) features

17
Further Work
  • This paper suggests
  • a number of research directions, impacting each
    of information extraction, machine learning, and
    data mining from text
  • Further work
  • Extracting information from wrapped web-sites as
    a source of training data
  • Automatic data-cleaning of extracted features
  • Extending the information extraction

18
Reference (1/2)
  • FOIL
  • Three companions for first order data mining
  • http://www.cs.kuleuven.ac.be/ml/Doc/Tutorial_Summer/tutorial_summer.html

19
Reference (2/2)
Feature | Sample URL
hoovers-sector | http://www.hoovers.com/sector/
hoovers-industry | http://www.hoovers.com/industry/list/
hoovers-type | http://www.hoovers.com/company/dir/0,2116,15694,00.html
address | http://www.hoovers.com/co/capsule/5/0,2163,12475,00.html
city, state | same
competitor | same
subsidiary | http://www.hoovers.com/premium/profile/5/0,2147,12475,00.html
products | same
officers | same
auditors | same
revenue | http://www.hoovers.com/hoov/join/sample_historical.html
net-income | same
net-profit | same
employees | same