A knowledge map approach to the discovery of business intelligence on the Web

Slides: 73

Provided by: wingyan

1
A knowledge map approach to the discovery of
business intelligence on the Web

Wingyan Chung
20 September 2002

2
Outline

Introduction
Review on Browsing
Review on Web Mining
Review on Document Visualization
Research Questions
The Knowledge Map Approach
Evaluation Methodology
Experiment Results and Discussion
Conclusion

3
Introduction
4
Information Overload

Information overload on the World Wide Web is a serious problem
The world produces between 635,000 and 2.12 million terabytes of unique information per year, mostly stored on hard drives or servers (Lyman & Varian, 2000)
In the business world, most KMS store only companies' internal data and do not capture external competitive information (McGonagle & Vella, 2002)
They enable only lower-level understanding (data and information), not higher-level understanding (knowledge and wisdom) (Nunamaker et al., 2001)

5
Web Search Engines

Commonly used to locate publicly available
information
Usually retrieve a large number of Web pages for a simple query
A search for "knowledge management":
Lycos: 14,948,890 results; Google: 3,860,000 results
Alta Vista: 4,690,123 results; Teoma: 2,837,000 results
Community search engines
e.g. Open Directory, Zeal, Hotrate
Wide coverage of Web communities
But not scalable, because they rely on manual construction of Web directories

6
Automatic Discovery of Business Intelligence

Intelligence: the acquisition, interpretation, collation, assessment, and exploitation of information (Davies, 2002)
Business analysts ask: "What is the landscape of knowledge management on the Web?"
But information overload often prevents the discovery of business intelligence on the Web
Need automatic techniques to extract knowledge from the Web
Need new browsing methods that let users visualize the landscape of results (not just textual result lists)

7
Review on Browsing
8
Browsing

Dictionary meaning: casual reading
Exploratory information-seeking strategy (Marchionini and Shneiderman, 1988)
Registration of content into a mental model (Spence, 1999)
Strategies: scan, review, search (Carmel et al., 1992)
Our definition: an exploratory information-seeking process characterized by the absence of planning, with a view to forming a mental model of the content being browsed
Understanding browsing is useful for developing browsing methods

9
Hypertext and Browsing

Hypertext (Nelson, 1965)
Non-sequential writing: nodes = pages, edges = links
Provides free navigation on the Web
Leads to user disorientation (Nielsen, 1989)
The problem is more serious in textual displays of Web pages
Limited amount of information shown on screen
Users need to click many times to browse through the whole set of Web pages related to their tasks
→ Experience of information overload

10
Visual Display of Textual Information

Task by data type taxonomy of information visualizations (Shneiderman, 1996)
Data types: 1D, 2D, 3D, temporal, multidimensional, tree, network
Tasks: overview, zoom, filter, details-on-demand, extract, history, relate
Result list: 1D (allows only limited browsing)
2D, tree, and network data types support human visual capabilities more effectively
Hierarchical display: shown to be an effective information access tool, particularly for browsing (Lin, 1997; Cutting, 1992)
Map display: a "semantic road map" for viewing the entire collection at a distance (Doyle, 1961; Lin, 1997)

11
Review on Web Mining
12
Web Mining

The use of data mining techniques to
automatically discover and extract information
from Web documents and services (Etzioni, 1996)
Resource discovery, information extraction, uncovering general patterns
Web content mining
Web structure mining
Web usage mining
Combination of Web content and structure mining
Clustering (He et al., 2001); searching (Bharat & Henzinger, 1998); compiling topic taxonomies (Chakrabarti et al., 1999)
Web communities ↔ clusters

A large amount of information is stored in the
form of documents
13
Resource Discovery on the Web

A challenge to researchers and practitioners
Exponential growth of the Web
Commercial search engines exhibit bias in their search results (Mowshowitz & Kawaguchi, 2002)
Bias: deviation from the norm (obtained by pooling the results of a basket of search engines)
No search engine could return more than 45% of relevant results (Selberg, 1995)
Any single search engine on the Web could cover only about 16% of the entire Web (Lawrence & Giles, 1999)

14
Meta Searching

A highly effective method of resource discovery
and collection on the Web
Integrating meta-searching with textual
clustering tools achieved high precision in
searching the Web (Chen et al., 2001)
The only realistic way to counter the adverse effects of search engine bias is to use more SEs (i.e. meta searching) (Mowshowitz & Kawaguchi, 2002)
MetaCrawler: analysis of relevance rankings (Selberg & Etzioni, 1997)
Vivisimo: automatic clustering (Palmer et al., 2001)

15
Review on Document Visualization
16
Document Visualization

Getting insight into information obtained from
one or more documents, but without reading those
documents (Wise et al., 1995)
Involves three stages (Spence, 2001)

Analysis: extract useful attributes from documents
Algorithm: cluster similar documents and reduce the dimensionality of the original representation
Visualization: display the encoded data in a visual format
17
Analysis

Based on automatic text processing techniques (e.g. the vector space model)
A document is represented by a vector
Term discrimination values (Salton et al., 1975)
The similarity between every pair of documents can be computed; examples of such measures:
Simple matching coefficient, Dice's coefficient, Jaccard's coefficient, cosine coefficient, and overlap coefficient (van Rijsbergen, 1979)
Asymmetric cluster function (Chen & Lynch, 1992)
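
The coefficients named on this slide can be sketched for two small term vectors. This is a minimal illustration, assuming binary term-presence vectors of equal length and non-empty documents; the function name and the sample vectors are made up for the example and are not from the presentation.

```python
import math

def similarity_coefficients(x, y):
    """Return several similarity coefficients for two term vectors.

    Assumes x and y have the same length and neither is all zeros.
    For binary vectors, sum(a * a) is simply the number of terms present.
    """
    dot = sum(a * b for a, b in zip(x, y))   # shared terms
    sx = sum(a * a for a in x)               # terms in document x
    sy = sum(b * b for b in y)               # terms in document y
    return {
        "dice": 2 * dot / (sx + sy),
        "jaccard": dot / (sx + sy - dot),
        "cosine": dot / math.sqrt(sx * sy),
        "overlap": dot / min(sx, sy),
    }

# Hypothetical documents over a four-term vocabulary.
doc_a = [1, 1, 0, 1]
doc_b = [1, 0, 0, 1]
scores = similarity_coefficients(doc_a, doc_b)
print(scores)
```

All four measures are symmetric in the two documents, which is one reason an asymmetric function is listed separately on the slide.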

18
Algorithms

Cluster algorithms and multidimensional scaling
(MDS) algorithms are frequently used in
visualization (Spence, 2001)
Cluster algorithms classify objects into meaningful disjoint subsets or partitions (Jain & Dubes, 1988)
Hierarchical clustering: bottom-up approach
Partitional clustering: top-down approach
MDS algorithms transform similarity matrices into coordinates of lower dimensions

19
Hierarchical Clustering

A procedure for transforming a proximity matrix into a sequence of nested partitions (Grabmeier & Rudolph, 2002)
Starting with the n one-element clusters, the method combines a pair of clusters into one cluster; the process repeats until one cluster remains
Variations: single-link method, complete-link method, average-link method, centroid method, weighted average method, unweighted centroid method, weighted centroid method, and Ward's method
Strengths: visual impact (dendrogram), efficiency
Weaknesses: adverse chaining effect, hierarchical structure can change dramatically, vulnerability to ties
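
The merging procedure described above can be sketched with the single-link variation. This is a minimal illustration, not the presenter's implementation; the distance matrix and the labels w to z are invented for the example.

```python
def single_link(dist, labels):
    """Agglomerative single-link clustering.

    dist is a symmetric matrix of pairwise distances. Returns the list of
    merges as (cluster_a, cluster_b, distance), which is the information a
    dendrogram displays.
    """
    clusters = [frozenset([i]) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(labels[i] for i in clusters[a]),
                       sorted(labels[i] for i in clusters[b]), d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical distances among four items.
dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
merges = single_link(dist, ["w", "x", "y", "z"])
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

When two candidate merges are tied, this sketch simply takes the first one found, which is one way the vulnerability to ties noted above shows up in practice.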

20
Partitional Clustering

Assigns objects into groups such that objects in
a cluster are more similar to each other than to
objects in different clusters
Repeatedly assigns objects to the closest cluster centers until convergence, using an optimality criterion to guide the partitioning process
Clustering criteria guide the search for optimal partition points at each iteration
Square-error clustering criterion (Gordon & Henderson, 1977)
Normalized cut (Shi & Malik, 2000; He et al., 2001)
But finding the optimal graph partitioning has been shown to be NP-complete (Garey & Johnson, 1979), so search heuristics are required
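
The assign-until-convergence loop with a square-error criterion can be sketched as a plain k-means. This is an illustration under stated assumptions: 2-D points, two hand-picked starting centers, and k-means as a stand-in for whichever partitional method a given application would actually use.

```python
def kmeans(points, centers, max_iter=100):
    """Plain k-means on 2-D points.

    Returns (assignments, centers, squared_error). The starting centers are
    supplied by the caller; the result depends on that choice.
    """
    centers = list(centers)

    def sq(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    assign = None
    for _ in range(max_iter):
        # Assign every point to its closest center.
        new_assign = [min(range(len(centers)), key=lambda k: sq(p, centers[k]))
                      for p in points]
        if new_assign == assign:
            break  # converged: no point changed cluster
        assign = new_assign
        # Move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    error = sum(sq(p, centers[a]) for p, a in zip(points, assign))
    return assign, centers, error

# Hypothetical data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 5), (6, 5)]
assign, centers, error = kmeans(points, [(0, 0), (5, 5)])
print(assign, centers, error)
```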

21
Search Heuristics

Many to choose from
Genetic algorithms: a parallel hill-climbing technique that performs a global search for the optimal value (Holland, 1975)
Tabu search: memorizes modifications to solutions to avoid visiting the same solution twice (Glover, 1986)
Scatter search: the population evolves through selection, linear combination, integer vector transformation, and culling (Glover, 1977)
Simulated annealing: hill climbing that allows the search to take some downhill steps to escape a local maximum (van Laarhoven, 1988)
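
As one concrete case, the simulated annealing idea (accept some downhill steps so the search can leave a local maximum) might look like the sketch below. The cooling schedule, step count, seed, and the toy landscape are all assumptions chosen for illustration.

```python
import math
import random

def simulated_annealing(score, start, neighbors, steps=2000,
                        t0=1.0, cooling=0.995, seed=0):
    """Maximize score() by annealing; returns the best state seen."""
    rng = random.Random(seed)
    current = best = start
    t = t0
    for _ in range(steps):
        candidate = rng.choice(neighbors(current))
        delta = score(candidate) - score(current)
        # Always accept uphill moves; accept downhill moves with
        # probability exp(delta / t), which shrinks as t cools.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current = candidate
            if score(current) > score(best):
                best = current
        t *= cooling
    return best

# Hypothetical 1-D landscape: index 1 is a local maximum, index 6 the global one.
landscape = [0, 2, 1, 0, 1, 3, 5, 3]

def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]

best = simulated_annealing(lambda i: landscape[i], 1, neighbors)
print(best, landscape[best])
```

Because the best state seen is tracked separately from the current state, the returned answer is never worse than the starting point, even though the walk itself may go downhill.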

22
Search Heuristics

Can be grouped under one roof called adaptive memory programming
Implementations of these general solving methods are increasingly similar (Taillard et al., 2001)
Applied to generalized assignment-type goal programming problems: quadratic assignment, vehicle routing, graph coloring, nurse scheduling, timetabling
Among them, GAs perform best when the search space is very large (e.g. the Web graph)
Web searching, spidering, query optimization, graph partitioning

23
Comparing Cluster Methods

No theory exists to select the best clustering
method for a particular application
Clustering is just exploratory data analysis; other issues are still important (Jain & Dubes, 1988)
General considerations: computational efficiency, quality of clusters formed, and visual impact
Hierarchical: provides a visual dendrogram, high efficiency, satisfactory initial clustering (Grabmeier & Rudolph, 2002)
Partitional: tries to achieve optimal clustering quality, high computational intensity

24
Multidimensional Scaling

A family of techniques that portray the data's structure in a spatial fashion
P1000: a picture is worth a thousand words
History of MDS
The first systematic MDS procedure for metric solutions (Torgerson, 1952)
First nonmetric MDS (Kruskal, 1964)
Consolidation: allows for either metric or nonmetric analysis using either a weighted or an unweighted Euclidean model (Takane et al., 1977)
ALSCAL in SPSS

25
MDS Applications

Many applications in visualization
Displaying author cluster maps in author co-citation analysis (He & Hui, 2002; Eom & Farris, 1996)
Group memory visualization (McQuaid et al., 1999)
Visualizing user preferences (Kanai and Hakozaki, 2000)
Studying the change in the knowledge map of groups over time (Kealy, 2001)
Surprisingly, none of them applies MDS to the discovery of business intelligence on the Web
No existing search engine applies MDS to
facilitate Web browsing

26
Visualization

The process of displaying the encoded data in a
visual format
Output often takes the form of a knowledge map
A knowledge representation that reveals the
underlying relationships of the knowledge sources
e.g. Web page content, newsgroup messages,
business market trends, newspaper articles, and
other textual and numerical information

27
Knowledge Map

Early work: manual drawing of blocks and connecting lines; Concept Map (Novak, 1984) and Mind Map (Buzan, 1993)
Automatically created maps
Galaxy of News system (Rennison, 1994)
Themescape (Wise et al., 1995); Cartia's NewsMap showing financial articles
Kohonen's self-organizing map (Lin, 1991; Chen et al., 1996): hierarchically clustered regions of documents
Fisheye and fractal views (Yang et al., 2002)
Kartoo: interconnected nodes representing Web sites

28
Research Gaps

Hierarchical and map displays have been shown to be effective information access and browsing tools, but have not been widely applied to Web browsing
Past research on Web mining has attempted to use either content information or structure information to cluster Web pages → how can Web communities be identified based on both?
None of the existing search engines allows users to visualize the relationships in terms of relative closeness

29
Research Questions

Based on content and structure information, how
can Web communities be identified among a set of
business Web sites?
How can a knowledge map be created to represent
the relationships among Web sites?
What are the effectiveness, efficiency, and
usability of Web community and knowledge map in
Web browsing, compared with result list and
Kartoo map?

30
The Knowledge Map Approach
31

A knowledge map approach to the discovery of business intelligence on the Web (architecture diagram)
Sources: INSPEC (1969-2002) business intelligence articles; SQL database
Queries: KM, DB, DW, ERP, CRM
Search engines: AltaVista, AlltheWeb, Yahoo, Teoma, MSN, LookSmart, Wisenut
Process labels: Meta Searching, Assemble Web Sites, Automatic Indexing, Analysis, Compute Similarity, Identify Web Community, Display Web Sites on a Map
32
Research Testbed

Identify business terms
A search for "business intelligence" on INSPEC returned 281 article abstracts published between 1969 and 2002
9 key terms/topics were manually identified from these abstracts (based on their importance in the abstracts)
Knowledge management, database technology, CRM, ERP, etc.
Assemble business Web sites by meta-searching
A total of 700 business Web sites were collected from 7 major search engines with the 9 key terms as queries
After removing duplicates and filtering, 3,149 pages from 2,860 business Web sites were collected
Non-Web-site pages were filtered out
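
The merge-and-deduplicate step of meta-searching can be sketched as follows. This is a rough illustration only: the engine names and URLs are placeholders, and treating the host name (minus a leading "www.") as the identity of a Web site is an assumption, not the filtering rule used in the study.

```python
from urllib.parse import urlparse

def merge_results(results_by_engine):
    """Merge per-engine URL lists, keeping one entry per Web site (host)."""
    sites = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            host = urlparse(url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]
            if not host:
                continue  # skip entries that are not absolute URLs
            entry = sites.setdefault(host, {"url": url, "engines": []})
            if engine not in entry["engines"]:
                entry["engines"].append(engine)
    return sites

# Hypothetical result lists from two engines for the same query.
results = {
    "engine_a": ["http://www.example.com/km", "http://example.org/"],
    "engine_b": ["http://example.com/index.html"],
}
sites = merge_results(results)
for host, info in sites.items():
    print(host, info["engines"])
```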

33
Three browsing methods (result list, Web community, and knowledge map) are provided.
Users can choose a business intelligence topic
here to browse.
34
Automatic Parsing and Indexing

Automatically extract key words and hyperlinks
from the Web pages
Remove stop words and identify term types: title, heading, content text, and image alternate text
Used the Arizona Noun Phraser to automatically extract and index all the noun phrases from each Web page (Tolle & Chen, 2000)
Term frequency: measures how often a particular term occurs in a Web page
Inverse Web page frequency: indicates the specificity of the term
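
Term frequency and inverse Web page frequency are commonly combined into a single weight. The sketch below uses the textbook product tf × log(N / page frequency); the exact weighting used in the study (for example, how term types such as title or heading are factored in) is not given in the transcript, so this form and the sample pages are assumptions.

```python
import math
from collections import Counter

def tf_idf(pages):
    """pages: dict of page id -> list of terms.

    Returns page id -> {term: weight}, where weight is the term frequency
    times the log of the inverse Web page frequency.
    """
    n = len(pages)
    page_freq = Counter()
    for terms in pages.values():
        page_freq.update(set(terms))  # count each term once per page
    weights = {}
    for page_id, terms in pages.items():
        tf = Counter(terms)
        weights[page_id] = {t: tf[t] * math.log(n / page_freq[t]) for t in tf}
    return weights

# Hypothetical indexed pages.
pages = {
    "p1": ["knowledge", "management", "knowledge"],
    "p2": ["knowledge", "portal"],
}
weights = tf_idf(pages)
print(weights)
```

A term that appears on every page, like "knowledge" here, gets weight zero: it is frequent but not specific.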

35
Co-occurrence Analysis (1)

The similarity between every pair of Web sites
(site i and site j) contains the content and
structural (connectivity) information. We modify
the algorithm used in (He et al., 2001) to find
the similarity.
where A, S, and C are the matrices of Aij, Sij, and Cij respectively; α and β are parameters between 0 and 1, with 0 ≤ α + β ≤ 1
Aij = 1 if site i has a hyperlink to site j; Aij = 0 otherwise

36
Co-occurrence Analysis (2)

Sij = asymmetric similarity score between site i and site j (Chen & Lynch, 1992)
Cij = number of pages pointing to both site i and site j (co-citation matrix)
Overall similarity
where A, S, and C are the matrices of Aij, Sij, and Cij respectively; α and β are parameters between 0 and 1, with 0 ≤ α + β ≤ 1
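
The overall similarity equation did not survive in this transcript. One plausible reading, given three matrices and two weighting parameters, is a weighted sum of the normalized link, content, and co-citation matrices. The sketch below assumes exactly that: weights alpha, beta, and (1 - alpha - beta), and normalization by the Frobenius norm. Both the form and the choice of norm are assumptions for illustration, not the presenter's stated formula.

```python
import math

def frobenius(m):
    """Frobenius norm of a matrix; falls back to 1.0 for an all-zero matrix."""
    return math.sqrt(sum(v * v for row in m for v in row)) or 1.0

def combine(A, S, C, alpha, beta):
    """Weighted sum of normalized link (A), content (S), and co-citation (C) matrices."""
    if not (0 <= alpha <= 1 and 0 <= beta <= 1 and alpha + beta <= 1):
        raise ValueError("alpha and beta must lie in [0, 1] with alpha + beta <= 1")
    na, ns, nc = frobenius(A), frobenius(S), frobenius(C)
    gamma = 1 - alpha - beta
    n = len(A)
    return [[alpha * A[i][j] / na + beta * S[i][j] / ns + gamma * C[i][j] / nc
             for j in range(n)] for i in range(n)]

# Hypothetical two-site example: site 0 links to site 1, not the reverse.
A = [[0, 1], [0, 0]]
S = [[0, 0.5], [0.5, 0]]
C = [[0, 2], [2, 0]]
W = combine(A, S, C, alpha=0.5, beta=0.3)
print(W)
```

In this toy case the only asymmetry comes from the hyperlink matrix, so the two off-diagonal entries differ by exactly alpha.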

37
Identifying Web Communities

Compute Web communities by GA graph partitioning and the normalized cut measure (Shi & Malik, 2000)
Model the Web as a graph consisting of nodes (Web pages) and edges (similarities)
A cut on a graph G = (V, E) is the removal of a set of edges such that the graph is split into two disconnected sub-graphs

38
Identifying Web Communities

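A normalized cut measures the disassociation between the nodes in the two sub-graphs (Shi & Malik, 2000). Define
$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)}$
where $\mathrm{assoc}(A, V)$ is the association value of all the nodes in a sub-graph A (the partition) to all the nodes in the entire graph G
Minimize the normalized cut value (or maximize the normalized association value)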
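
The normalized cut and normalized association values can be computed directly from a similarity matrix. The sketch follows the standard Shi and Malik definitions; the four-node matrix is invented for the example, and the helper names are illustrative.

```python
def assoc(W, group, universe):
    """Total edge weight from the nodes in group to the nodes in universe."""
    return sum(W[i][j] for i in group for j in universe)

def normalized_cut(W, part_a, part_b):
    nodes = list(part_a) + list(part_b)
    cut = assoc(W, part_a, part_b)  # weight of the edges removed by the cut
    return cut / assoc(W, part_a, nodes) + cut / assoc(W, part_b, nodes)

def normalized_assoc(W, part_a, part_b):
    nodes = list(part_a) + list(part_b)
    return (assoc(W, part_a, part_a) / assoc(W, part_a, nodes)
            + assoc(W, part_b, part_b) / assoc(W, part_b, nodes))

# Hypothetical similarities: nodes 0-1 and 2-3 are tightly linked pairs.
W = [
    [0, 5, 1, 0],
    [5, 0, 0, 1],
    [1, 0, 0, 5],
    [0, 1, 5, 0],
]
good = normalized_cut(W, [0, 1], [2, 3])
bad = normalized_cut(W, [0, 2], [1, 3])
print(good, bad, normalized_assoc(W, [0, 1], [2, 3]))
```

For any bipartition the two measures sum to 2, which is why minimizing one is the same as maximizing the other.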

(Figure: graph cut into sub-graphs A and B)
39
Identifying Web Communities

Recursively apply GA to bipartition the sub-graphs to obtain a hierarchical clustering
Label each community by its top phrases and by human identification
Hierarchical Partitional clustering
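
Recursive bipartitioning can be sketched on a toy graph. Note the substitution: instead of a genetic algorithm, this example searches all bipartitions exhaustively, which is only feasible for a handful of nodes. The stopping rule (stop at two nodes or fewer) and the six-node similarity matrix are assumptions made for the illustration.

```python
from itertools import combinations

def ncut(W, a, b):
    """Normalized cut of the bipartition (a, b), measured within a + b."""
    nodes = a + b
    cut = sum(W[i][j] for i in a for j in b)
    vol_a = sum(W[i][j] for i in a for j in nodes)
    vol_b = sum(W[i][j] for i in b for j in nodes)
    if vol_a == 0 or vol_b == 0:
        return float("inf")
    return cut / vol_a + cut / vol_b

def best_bipartition(W, nodes):
    """Exhaustive search (stand-in for the GA) for the lowest normalized cut."""
    best = None
    first, rest = nodes[0], nodes[1:]
    for size in range(len(rest)):  # side a always holds the first node
        for extra in combinations(rest, size):
            a = [first, *extra]
            b = [n for n in nodes if n not in a]
            value = ncut(W, a, b)
            if best is None or value < best[0]:
                best = (value, a, b)
    return best

def communities(W, nodes, max_size=2):
    """Split recursively until each community has at most max_size nodes."""
    if len(nodes) <= max_size:
        return nodes
    _, a, b = best_bipartition(W, nodes)
    return [communities(W, a, max_size), communities(W, b, max_size)]

# Hypothetical chain of sites: strong pairs (0,1), (2,3), (4,5), weak bridges.
W = [
    [0, 5, 0, 0, 0, 0],
    [5, 0, 2, 0, 0, 0],
    [0, 2, 0, 5, 0, 0],
    [0, 0, 5, 0, 0.5, 0],
    [0, 0, 0, 0.5, 0, 5],
    [0, 0, 0, 0, 5, 0],
]
tree = communities(W, list(range(6)))
print(tree)
```

The nested lists mirror the hierarchy: the weakest bridge is cut first, then the remaining sub-graph is split again.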

(Figure: graph partitioned into sub-graphs A, B, C, and D)
40
Creating Knowledge Maps

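Used multidimensional scaling (MDS) to transform a high-dimensional similarity matrix into a 2-dimensional representation of points (Torgerson, 1952)
Convert the similarity matrix into a dissimilarity matrix
Calculate matrix B, the matrix of scalar products, by using the cosine law. Each element in B is given by
$b_{ij} = -\frac{1}{2}\left(d_{ij}^2 - \frac{1}{n}\sum_{k=1}^{n} d_{ik}^2 - \frac{1}{n}\sum_{k=1}^{n} d_{kj}^2 + \frac{1}{n^2}\sum_{g=1}^{n}\sum_{h=1}^{n} d_{gh}^2\right)$
where $d_{ij}$ is the dissimilarity between points i and j, and n is the number of points
Perform a singular value decomposition on B and use the following formulae to find the coordinates of the points: $B = U V U^{\top}$, $X = U V^{1/2}$, $B = X X^{\top}$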
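
The classical (Torgerson) scaling steps can be sketched in pure Python. Two substitutions are made to keep the example dependency-free: the scalar-product matrix is decomposed with power iteration and deflation rather than a library SVD (for a symmetric positive semi-definite B the two agree), and the input is assumed to be a dissimilarity matrix that behaves like Euclidean distances. The four-point example is invented.

```python
import math

def double_center(D):
    """Scalar-product matrix B from a symmetric dissimilarity matrix D."""
    n = len(D)
    sq = [[d * d for d in row] for row in D]
    row_mean = [sum(r) / n for r in sq]  # equals the column means (symmetry)
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

def top_eigenpair(B, iters=500):
    """Largest eigenvalue and unit eigenvector of B by power iteration."""
    n = len(B)
    v = [1.0 / (i + 1) for i in range(n)]  # fixed, non-symmetric start vector
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    val = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
    return val, v

def classical_mds(D, dims=2):
    """Coordinates X such that pairwise distances approximate D."""
    B = double_center(D)
    n = len(B)
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        val, vec = top_eigenpair(B)
        scale = math.sqrt(max(val, 0.0))
        for i in range(n):
            coords[i].append(vec[i] * scale)
        # Deflate: remove the component just extracted.
        B = [[B[i][j] - val * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return coords

# Hypothetical dissimilarities: corners of a 3-by-4 rectangle.
D = [
    [0, 3, 5, 4],
    [3, 0, 4, 5],
    [5, 4, 0, 3],
    [4, 5, 3, 0],
]
coords = classical_mds(D)
for point in coords:
    print([round(c, 3) for c in point])
```

The recovered map is only determined up to rotation and reflection; what is preserved is the set of pairwise distances, which is what makes closeness on the map meaningful.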

41
Summary of the Approach
42
Evaluation Methodology
43
Objectives

To understand the effectiveness, efficiency and
usability of the two browsing methods
Web community (WC), Knowledge map (KM)
To compare the knowledge map with existing browsing methods
Result list display (RL)
Kartoo.com map display (KT)

44
Browsing

Browsing: the focus of this user study
"The true purpose of hypertext is to provide an open, exploratory browsing information space to the user" (J. Nielsen, 1990)
"... connotes an informal search process characterized by the absence of planning" (G. Marchionini, 1987)
The human user interface affects the effectiveness and efficiency of browsing (G. Marchionini and B. Shneiderman, 1988)
Display of search results → browsing

45
Result List
46
Web Community
Groups of Web sites organized in hierarchical
communities
Clicking any node immediately below the root opens that sub-tree
Clicking this button opens the Web site the user has selected.
The Back button allows users to traverse upward in the tree.
Panel showing details on demand (labels, title,
summary, URL)
47
Knowledge Map
The closeness of any two points reflects their similarity
Details of this Web site are shown on the bottom panel
Users can control the number of Web sites to be displayed
Panel showing details (title, URL, summary)
Navigation buttons allow browsing in four directions
Zooming buttons allow zooming in or out
48
Kartoo Map
49
Experiment Tasks

Designed according to TREC tasks (Voorhees & Harman, 1997)
For each of the 4 browsing tools:
Task 1: Find Web site information for two companies stated in the question (2 questions, 4 minutes in total)
Close-ended task, requires specific matching
e.g. "Find out the URL and the major business areas of Gensym Corporation"
Task 2: Find Web site information relevant to a topic stated in the question (1 question, 8 minutes in total)
Open-ended task, requires finding similar or relevant results
e.g. "Find out the titles and URLs of the Web sites that are related to CRM benchmarking"

50
Experiment design

Comparisons
Web community vs. Result list
Knowledge map vs. Web community
Knowledge map vs. Kartoo map
One-factor repeated-measures design
The content of the tasks was different for different browsing methods, but their nature was the same (i.e. close-ended for task 1 and open-ended for task 2)
Subjects used each method to perform 2 tasks (2 × 4 = 8 tasks in total, in one hour)

51
Participants

30 subjects
Students of the University of Arizona
Profile (chart): age
52
Hypotheses on Effectiveness

Consists of accuracy, precision, and recall
H1: WC is more effective than RL
Rationale: clustering and WC's labels help users understand major topics more easily
H2: KM is more effective than WC
Rationale: relative distances on the map provide a more intuitive way than WC's labels to find relevant results
H3: KM is more effective than KT
Rationale: relative closeness of Web sites on KM enables users to find precise and relevant results

53
Hypotheses on Efficiency

The amount of time users need to spend using a browsing method to finish the tasks
H4: WC is more efficient than RL
Rationale: clustering and WC's labels help users browse and search more quickly
H5: KM is more efficient than WC
Rationale: map display allows users to see the titles quickly and does not require clicking on nodes
H6: KM is more efficient than KT
Rationale: KM's placement of Web sites enables users to find similar Web sites more quickly

54
Hypotheses on Usability

How satisfied users are when they use a browsing method
H7: WC obtains higher user ratings than RL
Rationale: hierarchically organized and colored clusters are more attractive to users
H8: KM obtains higher user ratings than WC
Rationale: the map is more flexible, as it allows users to adjust the number of results to show
H9: KM's placement of Web sites is more meaningful than KT's
Rationale: the closeness of any two points in KM reflects the similarity of the pair of Web sites

55
Performance Measurement

Effectiveness
Efficiency: total time spent on the two tasks
Usability: users' ratings of the browsing tool and its main feature

Applied to task 1
Applied to task 2
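
The formulas on this slide did not survive in the transcript. For reference, precision, recall, and the F value (which the result summary also reports) are sketched below using their standard information retrieval definitions; how the study judged a result as relevant is not stated here, so the sample sets are purely hypothetical.

```python
def precision_recall_f(retrieved, relevant):
    """Standard precision, recall, and balanced F value for one task."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_value = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_value

# Hypothetical answer: a subject lists 4 sites, 3 of which are among
# the 5 sites judged relevant.
found = ["site1", "site2", "site3", "site9"]
relevant = ["site1", "site2", "site3", "site4", "site5"]
precision, recall, f_value = precision_recall_f(found, relevant)
print(precision, recall, round(f_value, 3))
```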
56
Experiment Results and Discussion
57
Result Summary
Best precision, F value, accuracy, and overall
rating
Best recall, accuracy and total time
58
Results on Effectiveness

The following table shows the p-values
not significant at the α = 5% level

59
Results on Efficiency

The following table shows the p-values
not significant at the α = 5% level

60
Results on Usability

The following table shows the p-values
not significant at the α = 5% level

61
Result List vs. Web Community

Hypotheses H1, H4, and H7 (all confirmed)
WC performed significantly better than RL in terms of effectiveness, efficiency, and usability
Main reasons: advantages of clustering and visualization effects
WC grouped similar Web sites together but RL did not → higher effectiveness
WC's hierarchical structure allowed subjects to visualize the landscape of the entire collection → higher efficiency
Subjects: "Once I spot the label, I can move to the relevant topics very easily", "visualization helps to navigate faster and easier", "too much reading in RL"

62
Web Community vs. Knowledge Map

Hypotheses H2, H5, and H8 (none confirmed)
Surprisingly, we found that KM performed very similarly to WC in terms of effectiveness and efficiency
Both display results in a 2D format while providing details on demand
Both employ the concept of similarity, but in different ways
Both allow a quick scan of results
The opposite of H8 was confirmed
KM's inadequate zooming function

63
Knowledge Map vs. Kartoo Map

Hypotheses H3, H6, and H9 (all confirmed), and subjects' verbal comments
KM performed significantly better than KT in terms of effectiveness, efficiency, and users' ratings on the meaning of point placement
Reasons: KM's intuitive meaning of point placement, and its provision of details on demand
Subject 9: "This is an intelligent tool and has features superior to any other search as it gives a visual picture of the topic and all topics closely related to the one under search. The map is intuitive and helps steer the user to the right topics or the ones that are close."

64
KM vs. KT (1)
65
KM vs. KT (2)
66
KM vs. KT (3)
67
KM vs. KT (4)
68
Knowledge Map vs. Kartoo Map

Subjects' verbal comments
Kartoo has a better GUI, while KM's clean interface provides good user-friendliness and quality of results
KM allows users to quickly locate answers
Kartoo's many functions may be confusing to some subjects
Flashing links and labels; URL labels not meaningful; titles and summaries cannot be copied and disappear quickly

69
Conclusion
70
Contributions

Our approach can alleviate the information overload problem and discover business intelligence on the Web
WC and KM browsing methods are suitable for discovering the landscape of a large number of Web sites
Findings from our user study provide practical implications for SE developers and HCI researchers

71
Future Directions

Other algorithms can be applied in the different
steps of document visualization
Co-occurrence analysis, clustering, MDS
New visualization metaphors can be developed for
Web browsing
Other domain areas can be explored to create knowledge maps and discover communities
Organizations with scattered and voluminous data