A knowledge map approach to the discovery of business intelligence on the Web

1
A knowledge map approach to the discovery of
business intelligence on the Web
  • Wingyan Chung
  • 20 September 2002

2
Outline
  • Introduction
  • Review on Browsing
  • Review on Web Mining
  • Review on Document Visualization
  • Research Questions
  • The Knowledge Map Approach
  • Evaluation Methodology
  • Experiment Results and Discussion
  • Conclusion

3
Introduction
4
Information Overload
  • Information overload on the World Wide Web is a
    serious problem
  • The world produces between 635,000 and 2.12
    million terabytes of unique information per year,
    mostly stored on hard drives or servers (Lyman &
    Varian, 2000)
  • In the business world, most knowledge management
    systems (KMS) store only companies' internal data
    and do not capture external competitive
    information (McGonagle & Vella, 2002)
  • They support only lower-level understanding (data
    and information), not higher-level understanding
    (knowledge and wisdom) (Nunamaker et al., 2001)

5
Web Search Engines
  • Commonly used to locate publicly available
    information
  • Usually retrieve a large number of Web pages on
    simple query search
  • Example: a search for "knowledge management"
  • Lycos: 14,948,890 results; Google: 3,860,000
    results
  • AltaVista: 4,690,123 results; Teoma: 2,837,000
    results
  • Community search engines
  • e.g., Open Directory, Zeal, Hotrate
  • Wide coverage of Web communities
  • But not scalable, because they rely on manual
    construction of the Web directory

6
Automatic Discovery of Business Intelligence
  • Intelligence: the acquisition, interpretation,
    collation, assessment, and exploitation of
    information (Davies, 2002)
  • Business analysts ask: "What is the landscape of
    knowledge management on the Web?"
  • But information overload often prevents the
    discovery of business intelligence on the Web
  • Need automatic techniques to extract knowledge
    from the Web
  • Need new browsing methods that let users
    visualize the landscape of results (not just
    textual result lists!)

7
Review on Browsing
8
Browsing
  • Dictionary meaning: casual reading
  • Exploratory information-seeking strategy
    (Marchionini and Shneiderman, 1988)
  • Registration of content into a mental model
    (Spence, 1999)
  • Strategies: scan, review, search (Carmel et al.,
    1992)
  • Our definition: an exploratory information-
    seeking process characterized by the absence of
    planning, with a view to forming a mental model
    of the content being browsed
  • Understanding browsing is useful for developing
    browsing methods

9
Hypertext and Browsing
  • Hypertext (Nelson, 1965)
  • Non-sequential writing: nodes = pages, edges =
    links
  • Provides free navigation on the Web
  • Leads to user disorientation (Nielsen, 1989)
  • The problem is more serious in textual displays
    of Web pages
  • Limited amount of information shown on screen
  • Users need to click many times to browse through
    the whole set of Web pages related to their tasks
  • → Experience of information overload!

10
Visual Display of Textual Information
  • Task by data type taxonomy of information
    visualizations (Shneiderman, 1996)
  • Data types: 1D, 2D, 3D, temporal,
    multidimensional, tree, network
  • Tasks: overview, zoom, filter, details-on-demand,
    extract, history, relate
  • Result list: 1D (allows only limited browsing)
  • 2D, tree, and network data types support human
    visual capabilities more effectively
  • Hierarchical display: shown to be an effective
    information access tool, particularly for
    browsing (Lin, 1997; Cutting, 1992)
  • Map display: a "semantic road map" for viewing
    the entire collection at a distance (Doyle, 1961;
    Lin, 1997)

11
Review on Web Mining
12
Web Mining
  • The use of data mining techniques to
    automatically discover and extract information
    from Web documents and services (Etzioni, 1996)
  • resource discovery, information extraction,
    uncovering general patterns
  • Web content mining
  • Web structure mining
  • Web usage mining
  • Combination of Web content and structure mining
  • Clustering (He et al., 2001); searching (Bharat &
    Henzinger, 1998); compiling topic taxonomies
    (Chakrabarti et al., 1999)
  • Web communities → clusters

A large amount of information is stored in the
form of documents
13
Resource Discovery on the Web
  • A challenge to researchers and practitioners
  • Exponential growth of the Web
  • Commercial search engines exhibit bias in their
    search results (Mowshowitz & Kawaguchi, 2002)
  • Bias: deviation from the norm (obtained by
    pooling the results of a basket of search
    engines)
  • No search engine could return more than 45% of
    relevant results (Selberg, 1995)
  • Any single search engine on the Web could only
    cover about 16% of the entire Web (Lawrence &
    Giles, 1999)

14
Meta Searching
  • A highly effective method of resource discovery
    and collection on the Web
  • Integrating meta-searching with textual
    clustering tools achieved high precision in
    searching the Web (Chen et al., 2001)
  • The only realistic way to counter the adverse
    effects of search engine bias is to use more
    search engines (i.e., meta searching) (Mowshowitz
    & Kawaguchi, 2002)
  • MetaCrawler: analysis of relevance rankings
    (Selberg & Etzioni, 1997)
  • Vivisimo: automatic clustering (Palmer et al.,
    2001)
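
A minimal sketch of the pooling step behind meta searching, assuming each engine has already returned a ranked list of URLs. The engine names, the example URLs, and the "count how many engines returned it" ranking are illustrative assumptions, not the behaviour of MetaCrawler or Vivisimo.

```python
from urllib.parse import urlsplit

def merge_results(results_by_engine: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Pool URL lists from several engines and rank pages by engine votes."""
    votes: dict[str, int] = {}
    for urls in results_by_engine.values():
        seen = set()  # count each page at most once per engine
        for url in urls:
            parts = urlsplit(url.lower())  # simplification: also lowercases the path
            host = parts.netloc.removeprefix("www.")
            key = host + parts.path.rstrip("/")
            if key not in seen:
                seen.add(key)
                votes[key] = votes.get(key, 0) + 1
    # Most-voted pages first; ties broken alphabetically for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical result lists from two hypothetical engines.
merged = merge_results({
    "engine_a": ["http://www.example.com/km/", "http://example.org/crm"],
    "engine_b": ["http://example.com/km", "http://example.net/erp"],
})
```

Pages returned by several engines rise to the top, which is one simple way to dampen the bias of any single engine.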

15
Review on Document Visualization
16
Document Visualization
  • Getting insight into information obtained from
    one or more documents, but without reading those
    documents (Wise et al., 1995)
  • Involves three stages (Spence, 2001)
  • Analysis: extract useful attributes from the
    documents
  • Algorithm: cluster similar documents and reduce
    the dimensionality of the original representation
  • Visualization: display the encoded data in a
    visual format

17
Analysis
  • Based on automatic text processing techniques
    (e.g., the vector space model)
  • A document is represented by a vector
  • Term discrimination values (Salton et al., 1975)
  • The similarity between every pair of documents
    can be computed; examples of such measures:
  • Simple matching coefficient, Dice's coefficient,
    Jaccard's coefficient, cosine coefficient, and
    overlap coefficient (van Rijsbergen, 1979)
  • Asymmetric cluster function (Chen & Lynch, 1992)
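
The symmetric coefficients listed above can be sketched for two documents represented as sets of index terms (the binary form of the vector space model). The example terms are made up, and both sets are assumed to be non-empty.

```python
import math

def similarity_coefficients(x: set[str], y: set[str]) -> dict[str, float]:
    """Set-based similarity coefficients between two documents' term sets."""
    common = len(x & y)
    return {
        "simple_matching": float(common),               # |X ∩ Y|
        "dice": 2 * common / (len(x) + len(y)),         # 2|X ∩ Y| / (|X| + |Y|)
        "jaccard": common / len(x | y),                 # |X ∩ Y| / |X ∪ Y|
        "cosine": common / math.sqrt(len(x) * len(y)),  # |X ∩ Y| / sqrt(|X||Y|)
        "overlap": common / min(len(x), len(y)),        # |X ∩ Y| / min(|X|, |Y|)
    }

doc_a = {"knowledge", "management", "portal", "software"}
doc_b = {"knowledge", "management", "consulting"}
scores = similarity_coefficients(doc_a, doc_b)
```

All five share the same numerator and differ only in how they normalize for document length.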

18
Algorithms
  • Cluster algorithms and multidimensional scaling
    (MDS) algorithms are frequently used in
    visualization (Spence, 2001)
  • Cluster algorithms classify objects into
    meaningful disjoint subsets or partitions (Jain &
    Dubes, 1988)
  • Hierarchical clustering: bottom-up approach
  • Partitional clustering: top-down approach
  • MDS algorithms transform similarity matrices
    into coordinates of lower dimensions

19
Hierarchical Clustering
  • A procedure for transforming a proximity matrix
    into a sequence of nested partitions (Grabmeier &
    Rudolph, 2002)
  • Starting with the n one-element clusters, the
    method merges a pair of clusters into one
    cluster; the process repeats until one cluster
    remains
  • Variations: single-link method, complete-link
    method, average-link method, centroid method,
    weighted average method, unweighted centroid
    method, weighted centroid method, and Ward's
    method
  • Strengths: visual impact (dendrogram), efficiency
  • Weaknesses: adverse chaining effect, a
    hierarchical structure that can change
    dramatically, vulnerability to ties
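
The merge loop described above can be sketched for the single-link variation. This is an unoptimized illustration on a made-up 4 × 4 distance matrix, not the implementation used in the study.

```python
def single_link(dist: list[list[float]]) -> list[tuple[frozenset, frozenset, float]]:
    """Agglomerative single-link clustering; returns the merge history."""
    clusters = [frozenset([i]) for i in range(len(dist))]  # n one-element clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: cluster distance is that of the closest member pair.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
history = single_link(dist)
```

The merge history is exactly what a dendrogram draws: each entry records which two clusters were joined and at what distance.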

20
Partitional Clustering
  • Assigns objects into groups such that objects in
    a cluster are more similar to each other than to
    objects in different clusters
  • Repeatedly assign objects to the closest cluster
    centers until convergence, using an optimality
    criterion to guide the partitioning process
  • The clustering criterion guides the search for
    optimal partition points at each iteration
  • Square-error clustering criterion (Gordon &
    Henderson, 1977)
  • Normalized cut (Shi & Malik, 2000; He et al.,
    2001)
  • But finding the optimal graph partitioning has
    been shown to be NP-complete (Garey & Johnson,
    1979), thus search heuristics are required
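
A minimal sketch of the assign-and-recompute loop under the square-error criterion. It follows the familiar k-means scheme, which is an assumption here: the slides do not name a specific partitional algorithm. The points and starting centers are made up.

```python
def kmeans(points, centers, max_iter=100):
    """Assign points to the closest center, recompute centers, repeat."""
    def sq(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = list(centers)  # do not mutate the caller's list
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda k: sq(p, centers[k])) for p in points
        ]
        if new_assignment == assignment:  # converged: no point changed cluster
            break
        assignment = new_assignment
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    # Square-error criterion: total squared distance of points to their centers.
    error = sum(sq(p, centers[a]) for p, a in zip(points, assignment))
    return assignment, centers, error

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
assignment, centers, error = kmeans(points, [(0, 0), (10, 10)])
```

The result depends on the starting centers, which is one reason the search heuristics on the following slides are needed for harder criteria such as the normalized cut.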

21
Search Heuristics
  • Many to choose from
  • Genetic algorithms: parallel hill-climbing
    technique that performs global searching for the
    optimal value (Holland, 1975)
  • Tabu search: memorizes modifications to
    solutions to avoid visiting the same solutions
    twice (Glover, 1986)
  • Scatter search: the population evolves through
    selection, linear combination, integer vector
    transformation, and culling (Glover, 1977)
  • Simulated annealing: hill climbing, but allows
    the search to take some downhill steps to escape
    a local maximum (van Laarhoven, 1988)
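
As an illustration of one of these heuristics, here is a minimal simulated annealing sketch for a maximization problem. The cooling schedule, step count, and the toy objective are illustrative assumptions, not values from the slides.

```python
import math
import random

def simulated_annealing(score, neighbor, start, temp=1.0, cooling=0.95,
                        steps=500, seed=0):
    """Hill climbing that sometimes accepts downhill moves while 'hot'."""
    rng = random.Random(seed)
    current = best = start
    for _ in range(steps):
        candidate = neighbor(current, rng)
        delta = score(candidate) - score(current)
        # Uphill moves are always accepted; downhill moves are accepted
        # with probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = candidate
        if score(current) > score(best):
            best = current  # remember the best solution seen so far
        temp = max(temp * cooling, 1e-9)
    return best

# Toy objective with its maximum at x = 3; neighbors are x - 1 and x + 1.
def score(x):
    return -(x - 3) ** 2

def neighbor(x, rng):
    return x + rng.choice((-1, 1))

best = simulated_annealing(score, neighbor, start=-10)
```

On this single-peaked toy problem plain hill climbing would do equally well; the downhill acceptance only matters when the landscape has several local maxima.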

22
Search Heuristics
  • Can be grouped under the same roof called
    adaptive memory programming
  • Implementations of these general solving methods
    are increasingly similar (Taillard et al., 2001)
  • Applied to generalized assignment-type goal
    programming problems: quadratic assignment,
    vehicle routing, graph coloring, nurse
    scheduling, timetabling
  • Among them, GAs perform best when the search
    space is very large (e.g., the Web graph)
  • Web searching, spidering, query optimization,
    graph partitioning

23
Comparing Cluster Methods
  • No theory exists to select the best clustering
    method for a particular application
  • Clustering is just exploratory data analysis;
    other issues are still important (Jain & Dubes,
    1988)
  • General considerations: computational
    efficiency, quality of clusters formed, and
    visual impact
  • Hierarchical: provides a visual dendrogram, high
    efficiency, satisfactory initial clustering
    (Grabmeier & Rudolph, 2002)
  • Partitional: tries to achieve optimal clustering
    quality, at high computational cost

24
Multidimensional Scaling
  • A family of techniques that portray the data's
    structure in a spatial fashion
  • P = 1000: a picture is worth a thousand words
  • History of MDS
  • The first systematic MDS procedure for metric
    solutions (Torgerson, 1952)
  • First nonmetric MDS (Kruskal, 1964)
  • Consolidation: allows for either metric or
    nonmetric analysis using either a weighted or an
    unweighted Euclidean model (Takane et al., 1977)
  • ALSCAL in SPSS

25
MDS Applications
  • Many applications in visualization
  • Displaying author cluster maps in author
    co-citation analysis (He & Hui, 2002; Eom &
    Farris, 1996)
  • Group memory visualization (McQuaid et al., 1999)
  • Visualizing user preferences (Kanai and Hakozaki,
    2000)
  • Studying the change in the knowledge map of
    groups over time (Kealy, 2001)
  • Surprisingly, none of these studies applies MDS
    to the discovery of business intelligence on the
    Web
  • No existing search engine applies MDS to
    facilitate Web browsing

26
Visualization
  • The process of displaying the encoded data in a
    visual format
  • Output often takes the form of a knowledge map
  • A knowledge representation that reveals the
    underlying relationships of the knowledge sources
  • e.g. Web page content, newsgroup messages,
    business market trends, newspaper articles, and
    other textual and numerical information

27
Knowledge Map
  • Early work: manual drawing of blocks and
    connecting lines, e.g., Concept Map (Novak, 1984)
    and Mind Map (Buzan, 1993)
  • Automatically created maps
  • Galaxy of News system (Rennison, 1994)
  • Themescape (Wise et al., 1995); Cartia's NewsMap
    showing financial articles
  • Kohonen's self-organizing map (Lin, 1991; Chen et
    al., 1996): hierarchically clustered regions of
    documents
  • Fisheye and fractal views (Yang et al., 2002)
  • Kartoo: interconnected nodes representing Web
    sites

28
Research Gaps
  • Hierarchical and map displays were shown to be
    effective information access and browse tools,
    but have not been widely applied to Web browsing
  • Past research on Web mining has used either
    content information or structure information to
    cluster Web pages → how can Web communities be
    identified based on both?
  • None of the existing search engines allows users
    to visualize the relationships among results in
    terms of relative closeness

29
Research Questions
  • Based on content and structure information, how
    can Web communities be identified among a set of
    business Web sites?
  • How can a knowledge map be created to represent
    the relationships among Web sites?
  • How effective, efficient, and usable are the Web
    community and the knowledge map for Web browsing,
    compared with the result list and the Kartoo map?

30
The Knowledge Map Approach
31

Figure: A knowledge map approach to the discovery of
business intelligence on the Web. Labels shown:
business intelligence articles from INSPEC
(1969-2002); queries (KM, DB, DW, ERP, CRM); meta
searching over AltaVista, AlltheWeb, Yahoo, Teoma,
MSN, LookSmart, and Wisenut; assemble Web sites;
automatic indexing; compute similarity; analysis; SQL
database; identify Web community; display Web sites on
a map.
32
Research Testbed
  • Identify business terms
  • A search for "business intelligence" on INSPEC
    returned 281 article abstracts published between
    1969 and 2002
  • 9 key terms/topics were manually identified from
    these abstracts (based on their importance in the
    abstracts)
  • knowledge management, database technology, CRM,
    ERP, etc.
  • Assemble business Web sites by meta-searching
  • A total of 700 business Web sites were collected
    from 7 major search engines with the 9 key terms
    as queries
  • After removing duplicates and filtering, 3,149
    pages from 2,860 business Web sites were
    collected
  • Non-website pages were filtered out

33
Three browsing methods (result list, Web
community, and knowledge map) are provided.
Users can choose a business intelligence topic
here to browse.
34
Automatic Parsing and Indexing
  • Automatically extract key words and hyperlinks
    from the Web pages
  • Remove stop words and identify term types:
  • title, heading, content text, and image alternate
    text
  • Used the Arizona Noun Phraser to automatically
    extract and index all the noun phrases from each
    Web page (Tolle & Chen, 2000)
  • Term frequency measures how often a particular
    term occurs in a Web page
  • Inverse Web page frequency indicates the
    specificity of the term
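
A minimal sketch of the two weights named above, combined in the common tf × log(N / page frequency) form. The exact weighting used in the study (for example, how term types are factored in) is not given in the slides, so this formula and the tiny example collection are assumptions.

```python
import math
from collections import Counter

def tf_idf(pages: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by term frequency times inverse Web page frequency."""
    n = len(pages)
    # Number of pages in which each term appears at least once.
    page_freq = Counter(term for page in pages for term in set(page))
    weights = []
    for page in pages:
        tf = Counter(page)  # how often each term occurs in this page
        weights.append({t: tf[t] * math.log(n / page_freq[t]) for t in tf})
    return weights

# Hypothetical pages, already reduced to extracted terms.
pages = [
    ["knowledge", "management", "knowledge"],
    ["knowledge", "portal"],
    ["customer", "relationship", "management"],
]
weights = tf_idf(pages)
```

A term that appears in every page gets weight 0, reflecting that it says nothing specific about any one page.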

35
Co-occurrence Analysis (1)
  • The similarity between every pair of Web sites
    (site i and site j) combines content and
    structural (connectivity) information. We modify
    the algorithm of He et al. (2001) to compute the
    similarity,
  • where A, S, C are the matrices with entries
    $A_{ij}$, $S_{ij}$, $C_{ij}$ respectively;
    $\alpha$ and $\beta$ are parameters between 0 and
    1, with $0 \le \alpha + \beta \le 1$
  • $A_{ij} = 1$ if site i has a hyperlink to site j,
    $A_{ij} = 0$ otherwise

36
Co-occurrence Analysis (2)
  • $S_{ij}$: asymmetric similarity score between
    site i and site j (Chen & Lynch, 1992)
  • $C_{ij}$: number of pages pointing to both site i
    and site j (co-citation matrix)
  • Overall similarity
  • where A, S, C are the matrices with entries
    $A_{ij}$, $S_{ij}$, $C_{ij}$ respectively;
    $\alpha$ and $\beta$ are parameters between 0 and
    1, with $0 \le \alpha + \beta \le 1$
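
The overall similarity formula itself is not preserved in this transcript. One form consistent with the stated parameters is a weighted sum of the three normalized matrices, W = α·A/‖A‖ + β·S/‖S‖ + (1 − α − β)·C/‖C‖. The sketch below assumes that form, the parameter names alpha and beta, and the Frobenius norm; all three are assumptions rather than the authors' confirmed definition.

```python
import math

def frobenius(m):
    """Frobenius norm; falls back to 1.0 for an all-zero matrix."""
    return math.sqrt(sum(v * v for row in m for v in row)) or 1.0

def combine(A, S, C, alpha, beta):
    """Weighted sum of hyperlink (A), content (S), and co-citation (C) matrices."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("need alpha, beta >= 0 and alpha + beta <= 1")
    gamma = 1 - alpha - beta  # weight left over for co-citation
    na, ns, nc = frobenius(A), frobenius(S), frobenius(C)
    n = len(A)
    return [
        [alpha * A[i][j] / na + beta * S[i][j] / ns + gamma * C[i][j] / nc
         for j in range(n)]
        for i in range(n)
    ]

# Two hypothetical sites: site 0 links to site 1 but not the reverse.
A = [[0, 1], [0, 0]]
S = [[0, 0.5], [0.5, 0]]
C = [[0, 2], [2, 0]]
W = combine(A, S, C, alpha=0.2, beta=0.6)
```

Because A is directional, W is generally asymmetric; a symmetric matrix is needed before graph partitioning or MDS, for example by averaging W with its transpose.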

37
Identifying Web Communities
  • Compute Web communities by GA graph partitioning
    and the normalized cut measure (Shi & Malik, 2000)
  • Model the Web as a graph consisting of nodes (Web
    pages) and edges (similarities)
  • A cut on a graph $G = (V, E)$ is the removal of a
    set of edges such that the graph is split into
    two disconnected sub-graphs

38
Identifying Web Communities
  • A normalized cut measures the disassociation
    between the nodes in the two sub-graphs (Shi &
    Malik, 2000). Define
  • The association value of all the nodes in a
    sub-graph $G'$ (the partition) to all the nodes
    in the entire graph $G$
  • Minimize the normalized cut value (or maximize
    the normalized association value)
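
A minimal sketch of the normalized cut value in the form given by Shi and Malik, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), for a symmetric weight matrix. The four-node graph is a made-up example.

```python
def normalized_cut(weights, part_a):
    """Normalized cut of the bipartition (part_a, rest) of a weighted graph."""
    nodes = range(len(weights))
    a = set(part_a)
    b = set(nodes) - a
    # cut(A, B): total weight of the edges removed by the cut.
    cut = sum(weights[i][j] for i in a for j in b)
    # assoc(X, V): total weight from nodes in X to all nodes in the graph.
    assoc_a = sum(weights[i][j] for i in a for j in nodes)
    assoc_b = sum(weights[i][j] for i in b for j in nodes)
    return cut / assoc_a + cut / assoc_b

# Two tightly linked pairs (0-1 and 2-3) joined by one weak edge (1-2).
weights = [
    [0.0, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.8],
    [0.0, 0.0, 0.8, 0.0],
]
good = normalized_cut(weights, {0, 1})  # cuts only the weak edge
bad = normalized_cut(weights, {0})      # isolates a single node
```

Dividing by the association values is what penalizes cuts that merely split off an isolated node, which a plain minimum cut would favor.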

Figure: a graph cut into two sub-graphs, A and B
39
Identifying Web Communities
  • Recursively apply the GA to bipartition the
    sub-graphs to obtain a hierarchical clustering
  • Label each community by its top phrases and by
    human identification
  • Hierarchical + partitional clustering
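
A minimal sketch of recursive bipartitioning with a genetic algorithm that minimizes the normalized cut. The chromosome encoding (one bit per node), tournament selection, one-point crossover, elitism, and every parameter value are illustrative assumptions; the slides do not describe the GA's operators.

```python
import random

def ncut(weights, nodes, side_a):
    """Normalized cut of (side_a, rest) within the sub-graph on `nodes`."""
    a, b = set(side_a), set(nodes) - set(side_a)
    if not a or not b:
        return float("inf")  # not a real bipartition
    cut = sum(weights[i][j] for i in a for j in b)
    assoc_a = sum(weights[i][j] for i in a for j in nodes)
    assoc_b = sum(weights[i][j] for i in b for j in nodes)
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")
    return cut / assoc_a + cut / assoc_b

def ga_bipartition(weights, nodes, rng, pop_size=30, generations=60, mutation=0.05):
    """Evolve bit strings (1 = side A, 0 = side B) toward a low normalized cut."""
    nodes = list(nodes)
    def cost(bits):
        return ncut(weights, nodes, [n for n, bit in zip(nodes, bits) if bit])
    population = [[rng.randint(0, 1) for _ in nodes] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        next_gen = population[:2]  # elitism: keep the two best unchanged
        while len(next_gen) < pop_size:
            p1 = min(rng.sample(population, 3), key=cost)  # tournament selection
            p2 = min(rng.sample(population, 3), key=cost)
            point = rng.randrange(1, len(nodes))           # one-point crossover
            child = p1[:point] + p2[point:]
            child = [1 - bit if rng.random() < mutation else bit for bit in child]
            next_gen.append(child)
        population = next_gen
    best = min(population, key=cost)
    side_a = [n for n, bit in zip(nodes, best) if bit]
    side_b = [n for n, bit in zip(nodes, best) if not bit]
    return side_a, side_b

def communities(weights, nodes, rng, min_size=2):
    """Recursively bipartition; returns a nested list (a binary hierarchy)."""
    nodes = list(nodes)
    if len(nodes) <= max(min_size, 1):
        return nodes
    a, b = ga_bipartition(weights, nodes, rng)
    if not a or not b:  # no valid split found: treat as one community
        return nodes
    return [communities(weights, a, rng, min_size),
            communities(weights, b, rng, min_size)]

def leaves(tree):
    """Flatten the hierarchy into its leaf communities."""
    if all(isinstance(x, int) for x in tree):
        return [tree]
    return [leaf for sub in tree for leaf in leaves(sub)]

# Hypothetical graph: two triangles (0-1-2, 3-4-5) joined by a weak edge 2-3.
weights = [[0.0] * 6 for _ in range(6)]
for i, j, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1),
                (4, 5, 1), (2, 3, 0.1)]:
    weights[i][j] = weights[j][i] = float(w)
tree = communities(weights, range(6), random.Random(0), min_size=3)
```

Each recursive call is a partitional step, and the nesting of the calls yields the hierarchy, which is how the two clustering styles combine.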

Figure: sub-graphs A, B, C, and D
40
Creating Knowledge Maps
  • Used multidimensional scaling (MDS) to transform
    a high-dimensional similarity matrix into a
    2-dimensional representation of points
    (Torgerson, 1952)
  • Convert the similarity matrix into a
    dissimilarity matrix
  • Calculate matrix B, the matrix of scalar
    products, by using the cosine law. Each element
    in B is given by
  • Perform a singular value decomposition on B and
    use the following formulae to find the
    coordinates of the points: $B = U V U^{\top}$,
    $X = U V^{1/2}$, so that $B = X X^{\top}$
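
A minimal sketch of these steps in pure Python, taking a dissimilarity matrix as input. Two parts are assumptions: the scalar products use the standard double-centering form $b_{ij} = -\frac{1}{2}(d_{ij}^2 - \bar{d}_{i\cdot}^2 - \bar{d}_{\cdot j}^2 + \bar{d}_{\cdot\cdot}^2)$, since the slide's own expression is not preserved, and power iteration with deflation stands in for a library decomposition routine.

```python
import math
import random

def classical_mds(d, dims=2, iters=500, seed=0):
    """Classical (Torgerson) MDS: coordinates X with B approximately X X^T."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]  # row means (equal to column means)
    total = sum(row) / n            # grand mean
    # Double centering turns squared distances into scalar products.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.random() for _ in range(n)]  # random start vector
        for _ in range(iters):                # power iteration
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam <= 0:
            break  # no further positive dimension to extract
        for i in range(n):
            coords[i][k] = v[i] * math.sqrt(lam)  # column k of X = U V^(1/2)
        # Deflate so the next pass finds the next-largest eigenvalue.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Distances between the corners of a 3-by-4 rectangle.
d = [[0, 3, 5, 4], [3, 0, 4, 5], [5, 4, 0, 3], [4, 5, 3, 0]]
coords = classical_mds(d)
```

For distances that really come from points in a plane, as here, the 2-dimensional coordinates reproduce them up to rotation and reflection; for similarity-derived dissimilarities the map is only an approximation.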

41
Summary of the Approach
42
Evaluation Methodology
43
Objectives
  • To understand the effectiveness, efficiency, and
    usability of the two proposed browsing methods
  • Web community (WC), Knowledge map (KM)
  • To compare them with existing browsing methods
  • Result list display (RL)
  • Kartoo.com map display (KT)

44
Browsing
  • Browsing: the focus of this user study
  • "The true purpose of hypertext is to provide an
    open, exploratory browsing information space to
    the user" (J. Nielsen, 1990)
  • Browsing "connotes an informal search process
    characterized by the absence of planning" (G.
    Marchionini, 1987)
  • The human user interface affects the
    effectiveness and efficiency of browsing (G.
    Marchionini and B. Shneiderman, 1988)
  • Display of search results → browsing

45
Result List
46
Web Community
Groups of Web sites organized in hierarchical
communities
Clicking on any node immediately below the root
opens that sub-tree
By clicking this button, users can open the Web
site they have selected.
The Back button allows users to traverse upward in
the tree.
Panel showing details on demand (labels, title,
summary, URL)
47
Knowledge Map
The closeness of any two points reflects their
similarity
Details of this Web site are shown in the bottom
panel
Users can control the number of Web sites to be
displayed
Panel showing details (title, URL, summary)
Navigation buttons allow browsing in four
directions
Zooming buttons allow zooming in or out
48
Kartoo Map
49
Experiment Tasks
  • Designed according to TREC tasks (Voorhees &
    Harman, 1997)
  • For each of the 4 browsing tools:
  • Task 1: find Web site information for two
    companies stated in the question (2 questions, 4
    minutes in total)
  • Close-ended task; requires specific matching
  • e.g., "Find out the URL and the major business
    areas of Gensym Corporation"
  • Task 2: find Web site information relevant to a
    topic stated in the question (1 question, 8
    minutes in total)
  • Open-ended task; requires finding similar or
    relevant results
  • e.g., "Find out the titles and URLs of the Web
    sites that are related to CRM benchmarking"

50
Experiment design
  • Comparisons
  • Web community vs. Result list
  • Knowledge map vs. Web community
  • Knowledge map vs. Kartoo map
  • One-factor repeated-measures design
  • The content of the tasks differed across
    browsing methods, but their nature was the same
    (i.e., close-ended for task 1 and open-ended for
    task 2)
  • Subjects used each method to perform 2 tasks
    (2 × 4 = 8 tasks in total, in one hour)

51
Participants
  • 30 subjects
  • Students of the University of Arizona
  • Profile

Age
52
Hypotheses on Effectiveness
  • Consists of accuracy, precision, and recall
  • H1: WC is more effective than RL
  • Rationale: clustering and WC's labels help users
    understand major topics more easily
  • H2: KM is more effective than WC
  • Rationale: relative distances on the map provide
    a more intuitive way than WC's labels to find
    relevant results
  • H3: KM is more effective than KT
  • Rationale: relative closeness of Web sites on KM
    enables users to find precise and relevant
    results

53
Hypotheses on Efficiency
  • The amount of time users need to spend using a
    browsing method to finish the tasks
  • H4: WC is more efficient than RL
  • Rationale: clustering and WC's labels help users
    browse and search more quickly
  • H5: KM is more efficient than WC
  • Rationale: the map display allows users to see
    the titles quickly and does not require clicking
    on nodes
  • H6: KM is more efficient than KT
  • Rationale: KM's placement of Web sites enables
    users to find similar Web sites more quickly

54
Hypotheses on Usability
  • How satisfied users are when they use a browsing
    method
  • H7: WC obtains higher user ratings than RL
  • Rationale: hierarchically organized and colored
    clusters are more attractive to users
  • H8: KM obtains higher user ratings than WC
  • Rationale: the map is more flexible, as it allows
    users to adjust the number of results to show
  • H9: KM's placement of Web sites is more
    meaningful than KT's
  • Rationale: the closeness of any two points in KM
    reflects the similarity of the pair of Web sites

55
Performance Measurement
  • Effectiveness
  • Efficiency: total time spent on the two tasks
  • Usability: users' ratings of the browsing tool
    and its main feature

Applied to task 1
Applied to task 2
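
The effectiveness formulas are not preserved in this transcript. The sketch below uses the standard information-retrieval definitions of precision, recall, and F value, which is an assumption about how the study computed them; the site identifiers are made up.

```python
def precision_recall_f(retrieved: set[str], relevant: set[str]) -> tuple[float, float, float]:
    """Standard precision, recall, and balanced F value for one answer set."""
    hits = len(retrieved & relevant)  # relevant results the subject found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F value: harmonic mean of precision and recall.
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

# Hypothetical answers for an open-ended task.
found = {"site_a", "site_b", "site_c", "site_d"}
expected = {"site_b", "site_c", "site_e"}
precision, recall, f_value = precision_recall_f(found, expected)
```

Precision and recall suit the open-ended task, where several answers may be relevant; the close-ended task has a single correct answer and is scored by accuracy instead.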
56
Experiment Results and Discussion
57
Result Summary
Best precision, F value, accuracy, and overall
rating
Best recall, accuracy and total time
58
Results on Effectiveness
  • The following table shows the p-values
  • not significant at the alpha = 5% level

59
Results on Efficiency
  • The following table shows the p-values
  • not significant at the alpha = 5% level

60
Results on Usability
  • The following table shows the p-values
  • not significant at the alpha = 5% level

61
Result List vs. Web Community
  • Hypotheses H1, H4, and H7 (all confirmed)
  • WC performed significantly better than RL
  • In terms of effectiveness, efficiency, and
    usability
  • Main reasons: the advantages of clustering and
    visualization effects
  • WC grouped similar Web sites together but RL did
    not → higher effectiveness
  • WC's hierarchical structure allowed subjects to
    visualize the landscape of the entire collection
    → higher efficiency
  • Subjects: "Once I spot the label, I can move to
    the relevant topics very easily"; "visualization
    helps to navigate faster and easier"; "too much
    reading in RL"

62
Web Community vs. Knowledge Map
  • Hypotheses H2, H5, and H8 (none confirmed!)
  • Surprisingly, we found that KM performed very
    similarly to WC in terms of effectiveness and
    efficiency
  • Both display results in a 2D format while
    providing details on demand
  • Both employ the concept of similarity, but in
    different ways
  • Both allow a quick scan of results
  • The opposite of H8 was confirmed!
  • KM's inadequate zooming function

63
Knowledge Map vs. Kartoo Map
  • Hypotheses H3, H6, and H9 (all confirmed), and
    subjects' verbal comments
  • KM performed significantly better than KT
  • In terms of effectiveness, efficiency, and users'
    ratings on the meaning of point placement
  • Reasons: KM's intuitive meaning of point
    placement, and providing details on demand
  • Subject 9: "This is an intelligent tool and has
    features superior to any other search as it gives
    a visual picture of the topic and all topics
    closely related to the one under search. The map
    is intuitive and helps steer the user to the
    right topics or the ones that are close."

64
KM vs. KT (1)
65
KM vs. KT (2)
66
KM vs. KT (3)
67
KM vs. KT (4)
68
Knowledge Map vs. Kartoo Map
  • Subjects' verbal comments
  • Kartoo has a better GUI, while KM's clean
    interface provides good user friendliness and
    quality of results
  • KM allows users to quickly locate answers
  • Kartoo's many functions may be confusing to
    some subjects
  • flashing links and labels, URL labels not
    meaningful, titles and summaries cannot be copied
    and disappear quickly

69
Conclusion
70
Contributions
  • Our approach can alleviate the information
    overload problem and help discover business
    intelligence on the Web
  • The WC and KM browsing methods are suitable for
    discovering the landscape of a large number of
    Web sites
  • Findings from our user study provide practical
    implications for search engine developers and HCI
    researchers

71
Future Directions
  • Other algorithms can be applied in the different
    steps of document visualization
  • Co-occurrence analysis, clustering, MDS
  • New visualization metaphors can be developed for
    Web browsing
  • Other domain areas can be explored to create
    knowledge maps and discover communities
  • e.g., organizations with scattered and voluminous
    data

72
Thank you very much!
  • Comments and suggestions?