Data Mining on Symbolic Knowledge Extracted from the Web - PowerPoint PPT Presentation


1
Data Mining on Symbolic Knowledge Extracted from
the Web
  • R. Ghani, R. Jones, D. Mladenic, K. Nigam,
    S. Slattery
  • Carnegie Mellon University and J. Stefan Institute

2
[Diagram labels: The Web; Gather info.; Wrappers, Info. Extraction; KB; Data Mining]

3
Gathering info. from the Web
  • Wrappers - extract from highly structured,
    automatically generated Web pages
  • e.g., movie theaters and restaurants from
    Web-based entertainment guides combined with a
    map system
  • Information Extraction - extract from free-form,
    unstructured data
  • hand-constructed extraction rules
  • learned rules
  • to identify the name of a person given a home page
  • to identify relations suggested by hyperlinks

4
Data sources
  • corporations around the world
  • propositional and relational facts on 4312
    companies collected by crawling the Web
  • data sources
  • Hoovers Online Web resource - selected info.
    about companies including Web site URL
  • the first 50 Web pages from the company site -
    extracting predefined features
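
The "first 50 Web pages" step can be pictured as a capped walk over a company site. The sketch below is only an illustration: the slides do not say in which order pages were visited, so breadth-first order, the same-host restriction, and the toy link table standing in for a real page fetcher are all assumptions.

```python
from collections import deque
from urllib.parse import urlparse

def first_pages(start_url, get_links, limit=50):
    """Breadth-first walk from start_url, staying on the same host, capped at `limit` pages."""
    host = urlparse(start_url).netloc
    seen = [start_url]
    queue = deque([start_url])
    while queue and len(seen) < limit:
        page = queue.popleft()
        for link in get_links(page):
            if urlparse(link).netloc == host and link not in seen:
                seen.append(link)
                queue.append(link)
                if len(seen) == limit:
                    break
    return seen

# Hypothetical toy site: maps each page to the links found on it.
TOY_SITE = {
    "http://example.com/": [
        "http://example.com/about",
        "http://other.org/",
        "http://example.com/products",
    ],
    "http://example.com/about": ["http://example.com/", "http://example.com/officers"],
    "http://example.com/products": [],
    "http://example.com/officers": [],
}

pages = first_pages("http://example.com/", lambda url: TOY_SITE.get(url, []), limit=3)
```

With a limit of 3, the off-site link is skipped and the walk stops before reaching the officers page.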

5
[Diagram labels: The Web; Wrapping from corp. info.; Extracting from corp. Web sites; KB; Abstracting features; Data Mining; New knowledge]

6
Features describing a company
  • Extracted: types of activity, companies it points
    to/mentions, officers, sector predicted from the
    text using Naïve Bayes, locations derived using
    Naïve Bayes and AutoSlog-based rules, country
    derived from URL
  • Wrapped: listed sector and sub-sector, type,
    address including city and state, competitors,
    subsidiaries, products, officers, auditors,
    revenue, net-income, net-profit, employees
  • Abstracted: companies in the same city/state,
    sharing officers, linking/mentioning the same
    companies, reciprocally
    linking/mentioning/competing, discretized revenue,
    net-income, net-profit, employees
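
As a rough illustration of the abstracted features, the sketch below derives same-city pairs, shared-officer pairs, and a discretized revenue level from a few records. The company names, field names, and the two cut points are made up for the example; the slides do not give the actual discretization thresholds.

```python
from itertools import combinations

# Hypothetical company records; names and values are invented for illustration.
companies = {
    "acme": {"city": "atlanta", "officers": {"j-doe", "a-lee"}, "revenue": 120.0},
    "bolt": {"city": "atlanta", "officers": {"a-lee"}, "revenue": 15.0},
    "core": {"city": "houston", "officers": {"m-roe"}, "revenue": 900.0},
}

def discretize(value, low_cut, high_cut):
    """Map a number to 'low', 'medium' or 'high' using two assumed cut points."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"

def abstract_features(companies, low_cut=50.0, high_cut=500.0):
    """Derive pairwise relations and discretized revenue from per-company records."""
    same_city, share_officers = set(), set()
    for a, b in combinations(sorted(companies), 2):
        if companies[a]["city"] == companies[b]["city"]:
            same_city.add((a, b))
        if companies[a]["officers"] & companies[b]["officers"]:
            share_officers.add((a, b))
    revenue_level = {
        name: discretize(record["revenue"], low_cut, high_cut)
        for name, record in companies.items()
    }
    return same_city, share_officers, revenue_level

same_city, share_officers, revenue_level = abstract_features(companies)
```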

7
Data Mining Algorithms
  • Finding regularities
  • Association rules - Apriori algorithm
  • features first mapped to Boolean features,
    companies represented with sparse vectors
  • Y <- X (support = P(X,Y), confidence = P(Y|X))
  • Describing a target concept
  • Propositional rules - C5.0 decision tree/rules
  • set-valued features (locations, officers, ...) and
    relational features (same-city, mentions, ...)
    excluded
  • Relational rules - FOIL
  • Y <- X (covered positive, covered negative)
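
The two association-rule measures can be sketched directly from their definitions. This is not Apriori itself, only the support and confidence computation for one candidate rule; the `feature=value` item naming and the four toy records are assumptions made for the example.

```python
def to_boolean(record):
    """Map a record of (possibly set-valued) features to a set of 'feature=value' items."""
    items = set()
    for feature, values in record.items():
        if not isinstance(values, (set, list, tuple)):
            values = [values]
        for value in values:
            items.add(f"{feature}={value}")
    return items

def support_confidence(transactions, body, head):
    """Support P(X,Y) and confidence P(Y|X) for the rule head <- body."""
    n_body = sum(1 for t in transactions if body <= t)
    n_both = sum(1 for t in transactions if (body | head) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence

# Hypothetical records, one per company.
records = [
    {"activity": {"sell", "research"}, "locations": {"japan"}},
    {"activity": {"sell"}, "locations": {"japan", "united-states"}},
    {"activity": {"supply"}, "locations": {"united-states"}},
    {"activity": {"research"}, "locations": {"japan"}},
]
transactions = [to_boolean(r) for r in records]
support, confidence = support_confidence(
    transactions, body={"locations=japan"}, head={"activity=sell"}
)
```

On this toy data, two of the four companies are in Japan and sell (support 0.5), out of three located in Japan (confidence 2/3).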

8
Apriori experiments
  • 2658 rules found using default support and
    confidence (10%, 80%) and all the features (about
    26 000)
  • the most frequent rules pointed out the need for
    data cleaning (errors in extraction)
  • 254 rules after removing rules containing wrongly
    extracted features
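
The cleaning step, dropping every rule that mentions a wrongly extracted feature, might look like the sketch below. The rule representation (a head item plus a tuple of body items) and the suspect item are hypothetical; the slides only say that such rules were removed after the frequent rules were inspected.

```python
def drop_suspect_rules(rules, suspect_items):
    """Keep rules whose head and body contain none of the suspect items."""
    kept = []
    for head, body in rules:
        if suspect_items.isdisjoint({head, *body}):
            kept.append((head, body))
    return kept

# Hypothetical mined rules, written as (head, body).
rules = [
    ("activity=sell", ("locations=japan",)),
    ("activity=sell", ("locations=home-page",)),
    ("locations=home-page", ("activity=supply",)),
]
# Item assumed to have been flagged by hand as an extraction error.
suspect_items = {"locations=home-page"}

clean_rules = drop_suspect_rules(rules, suspect_items)
```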

9
Regularities found I
  • Companies having documentation on their sites,
    that are located in USA or provide technical
    assistance, are involved in selling.
  • activity=sell <- locations=united-states,
    links-to=adobe-systems-incorporate (10.8%, 93%)
  • activity=sell <- activity=technical-assistance,
    links-to=adobe-systems-incorporate (11.9%, 91.1%)
  • Most companies located in Japan either sell or
    perform research, while the companies located in
    USA either sell or supply.
  • activity=sell <- locations=japan (14.5%, 90.8%)
  • activity=research <- locations=japan (13.2%,
    82.2%)

10
Regularities found II
  • Companies mentioning software on their Web pages
    are mostly located in USA.
  • locations=united-states <- activity=supply,
    activity=expertise, mentions=software (5.3%, 64%)
  • Companies in our dataset are stable in their
    finances.
  • revenue-1997=high <- revenue-1996=high,
    revenue-1995=high, revenue-1994=high,
    revenue-1993=high (5%, 99.5%)

11
Regularities found III
  • Using support 1%, confidence 10% and 4 features:
    url-country, sector, competitor, auditors (mapped
    to 3532 Boolean features)
  • The most frequent auditors in our data:
    Price-Waterhouse Coopers LLP (13.4%), Ernst &
    Young LLP (11%), Arthur Andersen LLP (10.2%)
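
Figures like these auditor shares are single-item frequencies. A minimal sketch of how they could be tabulated is shown below; the auditor names and the threshold used in the call are placeholders, not values from the study.

```python
from collections import Counter

def frequent_values(values, min_support=0.01):
    """Return (value, fraction) pairs whose share of all records meets min_support."""
    total = len(values)
    counts = Counter(v for v in values if v is not None)
    return [(v, n / total) for v, n in counts.most_common() if n / total >= min_support]

# Hypothetical auditor column: one entry per company, None when no auditor is listed.
auditors = ["auditor-a"] * 4 + ["auditor-b"] * 3 + ["auditor-c"] * 2 + [None]

top = frequent_values(auditors, min_support=0.25)
```

Companies without a listed auditor still count toward the total, so the shares are fractions of all companies rather than of companies with a known auditor.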

12
Regularities found IV
  • Companies in computer-software--services have
    Price Waterhouse Coopers (20.9%) or Ernst & Young
    (14.3%) as their auditor.
  • Companies in diversified-services have
    Price-Waterhouse Coopers (15.7%) or Arthur
    Andersen (13.9%) as their auditor.
  • Companies in drugs have Ernst & Young (26.8%) as
    their auditor.

13
Regularities found V
  • About half of the companies that compete with
    microsoft-corporation are in
    computer-software--services, and about a quarter
    of the companies in computer-software--services
    compete with microsoft-corporation.
  • hoovers-sector=computer-software--services <-
    competitor=microsoft-corporation (2.1%, 54.9%)
  • competitor=microsoft-corporation <-
    hoovers-sector=computer-software--services
    (2.1%, 25.7%)
  • Competitors as predictors for
    computer-software--services companies: Computer
    Sciences Corporation, Associates International
    Inc., SAP Aktiengesellschaft
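
The two Microsoft rules share one support value but have different confidences, which is what one would expect when the two conditions have different base rates. As a worked check, assuming the reported figures are percentages and writing S for "is in computer-software--services" and M for "competes with microsoft-corporation" (both symbols introduced here), the implied base rates are:

```latex
\begin{align*}
P(S, M) &= 2.1\% \\
P(S \mid M) &= \frac{P(S, M)}{P(M)} = 54.9\%
  \;\Rightarrow\; P(M) = \frac{2.1\%}{0.549} \approx 3.8\% \\
P(M \mid S) &= \frac{P(S, M)}{P(S)} = 25.7\%
  \;\Rightarrow\; P(S) = \frac{2.1\%}{0.257} \approx 8.2\%
\end{align*}
```

So roughly twice as many companies fall in the sector as compete with Microsoft, which accounts for the asymmetry between the two confidences.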

14
Regularities found VI
  • Telecommunications companies also have some
    outstanding competitors: Bellsouth Corporation,
    MCI Worldcom Inc., Bell Atlantic Corporation,
    Lucent Technologies Inc.
  • Most companies competing with Conagra Inc., KMart
    Corporation and BP Amoco p.l.c. are in
    food-beverage--tobacco, retail and energy,
    respectively.

15
Learning target concepts
  • Selected target concepts: hoovers-sector,
    hoovers-type, auditors, competitors,
    share-officers, country, and state.
  • Decision trees for learning propositional rules
  • FOIL for learning first-order rules for unary
    and binary relations

16
Propositional rules found I
  • Depending on the city the company is located in,
    different features are used to predict the
    hoovers-sector
  • for Atlanta, computer companies have a higher
    revenue than diversified services companies (same
    for Chicago).
  • for Houston, depending on the Naive Bayes
    classification (based on the company web-pages),
    we predict either Manufacturing, Computer
    Software Services, or Energy.
  • for Dallas, most Health companies are non-profit
    and thus have a lower income than leisure
    companies.

17
Propositional rules found II
  • When the city is excluded from the feature set
  • (improvements of Web-based sector classifier)
  • Telecommunications has more employees than Energy
    (employees can help weed out incorrect
    classifications in the coarse-sector prediction
    for Energy).
  • Where the Naive Bayes classifier predicts
    communications-services, income can be used to
    distinguish between Media (lower-income) and
    Telecommunications (high).
  • Where the Naive Bayes classifier predicts
    investment-services, employees can be used to
    distinguish between Financial Services (lower) and
    Banking (high).
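
The last two rules amount to a second-stage decision on top of the Naive Bayes output. The sketch below mirrors that structure; the function name and both cut points are hypothetical, since the slides report only the direction of each split (lower versus high), not the thresholds learned by the decision tree.

```python
def refine_sector(nb_prediction, employees, income, employee_cut=5000, income_cut=100.0):
    """Refine a coarse Naive Bayes sector using symbolic features (assumed cut points)."""
    if nb_prediction == "communications-services":
        # Lower income suggests Media, higher income Telecommunications.
        return "telecommunications" if income >= income_cut else "media"
    if nb_prediction == "investment-services":
        # Fewer employees suggests Financial Services, more employees Banking.
        return "banking" if employees >= employee_cut else "financial-services"
    return nb_prediction

media_case = refine_sector("communications-services", employees=800, income=20.0)
bank_case = refine_sector("investment-services", employees=12000, income=300.0)
```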

18
First order rules found I
  • Predicting hoovers-sector
  • Companies headquartered somewhere other than
    Fremont competing with Computer Associates
    International are in the computer software
    services sector (51 pos., 0 neg.).
  • Companies headquartered in New York, that are not
    in natural-gas-industry nor technology-sector,
    are in the media industry (8 pos., 0 neg.).
  • media(A) <- hq-city(A,new-york), sector(A,B),
    B<>natural-gas-industry, coarse-sector(A,C),
    C<>technology-sector, competitor(?,A),
    performs-activity(A,?), not(products(A,?)),
    not(locations(A,?)).
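
To make the media rule concrete, the sketch below checks each of its literals against a small fact base. The companies and facts are invented, and the `has` helper is a hypothetical stand-in for the matching a first-order learner performs; only the sequence of conditions follows the rule on the slide.

```python
# Hypothetical fact base: each relation is a set of (first, second) tuples.
facts = {
    "hq_city": {("newsco", "new-york"), ("gasco", "new-york"), ("webco", "fremont")},
    "sector": {
        ("newsco", "publishing"),
        ("gasco", "natural-gas-industry"),
        ("webco", "internet"),
    },
    "coarse_sector": {
        ("newsco", "media-sector"),
        ("gasco", "energy-sector"),
        ("webco", "technology-sector"),
    },
    "competitor": {("rivalco", "newsco"), ("rivalco", "gasco")},
    "performs_activity": {("newsco", "sell"), ("gasco", "supply")},
    "products": {("webco", "browser")},
    "locations": set(),
}

def has(relation, first=None, second=None, exclude_second=None):
    """True if some tuple in the relation matches the given (optional) arguments."""
    for x, y in facts[relation]:
        if first is not None and x != first:
            continue
        if second is not None and y != second:
            continue
        if exclude_second is not None and y == exclude_second:
            continue
        return True
    return False

def media(a):
    """Evaluate the body of the media(A) rule for company `a`."""
    return (
        has("hq_city", a, "new-york")
        and has("sector", a, exclude_second="natural-gas-industry")
        and has("coarse_sector", a, exclude_second="technology-sector")
        and has("competitor", second=a)
        and has("performs_activity", a)
        and not has("products", a)
        and not has("locations", a)
    )

covered = sorted(c for c in ("newsco", "gasco", "webco") if media(c))
```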

19
First order rules found II
  • Predicting auditors
  • Companies headquartered in Madrid having listed
    historical financial information use Arthur
    Andersen as their auditor (4 pos, 0 neg.).
  • arthur-andersen(A) <- hq-city(A,madrid),
    net_profit(A,?,?).
  • Predicting binary competitor from hq-city,
    url-country, links-to, hoovers-sector
  • Two companies in the same sector are competitors
    (11407 pos., 0 neg.).
  • competitor(A,B) <- A<>B, hoovers-sector(A,C),
    hoovers-sector(B,C).
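
The positive and negative counts attached to a rule such as the same-sector competitor rule can be tallied as in the sketch below. Treating every same-sector pair that is not a listed competitor as a negative is a closed-world assumption made here for illustration; the slides do not spell out how negatives were defined, and the toy data is invented.

```python
from itertools import permutations

def rule_coverage(sector_of, known_competitors):
    """Count ordered pairs (A, B), A != B, in the same sector: listed competitors vs. not."""
    pos = neg = 0
    for a, b in permutations(sector_of, 2):
        if sector_of[a] == sector_of[b]:
            if (a, b) in known_competitors:
                pos += 1
            else:
                neg += 1
    return pos, neg

# Hypothetical sectors and competitor facts.
sector_of = {"a1": "drugs", "a2": "drugs", "a3": "drugs", "e1": "energy"}
known_competitors = {("a1", "a2"), ("a2", "a1"), ("a1", "a3")}

pos, neg = rule_coverage(sector_of, known_competitors)
```

On this toy data the rule covers six ordered same-sector pairs, three of which are listed competitors.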

20
Discussion
  • Data cleaning needed - additional source of noise
    from imperfect feature extractors
  • very frequent regularities checked manually for
    obvious extractor errors
  • Feature selection needed, especially for
    relational learning
  • learn simple, unary target relation to suggest
    useful features for learning binary relation
  • Interaction observed between symbolic features
    and Naïve Bayesian classifier prediction
    (decision tree improved Naïve Bayes prediction
    using symbolic features).

21
Future work
  • Information from wrapped Web sites can be used to
    train extractors.
  • Extractors extended to use text and symbolic
    features (look at Web pages and use background
    knowledge).
  • Use data mining to identify potential errors in
    extractors.
  • Combine unsupervised search for associations with
    rule learning in an incremental, iterative
    approach supporting data cleaning and feature
    selection.