Title: A knowledge map approach to the discovery of business intelligence on the Web
1A knowledge map approach to the discovery of
business intelligence on the Web
- Wingyan Chung
- 20 September 2002
2Outline
- Introduction
- Review on Browsing
- Review on Web Mining
- Review on Document Visualization
- Research Questions
- The Knowledge Map Approach
- Evaluation Methodology
- Experiment Results and Discussion
- Conclusion
3Introduction
4Information Overload
- Information overload on the World Wide Web is a serious problem
- The world produces between 635,000 and 2.12 million terabytes of unique information per year, mostly stored on hard drives or servers (Lyman & Varian, 2000)
- In the business world, most knowledge management systems (KMS) store only companies' internal data and do not capture external competitive information (McGonagle & Vella, 2002)
- They enable only the lower-level understandings (data and information), not the higher-level understandings (knowledge and wisdom) (Nunamaker et al., 2001)
5Web Search Engines
- Commonly used to locate publicly available information
- Usually retrieve a large number of Web pages for a simple query
- A search for "knowledge management":
- Lycos: 14,948,890 results; Google: 3,860,000 results
- Alta Vista: 4,690,123 results; Teoma: 2,837,000 results
- Community search engines
- e.g. Open Directory, Zeal, Hotrate
- Wide coverage of Web communities
- But not scalable, because they rely on manual construction of a Web directory
6Automatic Discovery of Business Intelligence
- Intelligence: the acquisition, interpretation, collation, assessment, and exploitation of information (Davies, 2002)
- Business analysts: "What is the landscape of knowledge management on the Web?"
- But information overload often prevents the discovery of business intelligence on the Web
- Need automatic techniques to extract knowledge from the Web
- Need new browsing methods that let users visualize the landscape of results (not just textual result lists)
7Review on Browsing
8Browsing
- Dictionary meaning: casual reading
- An exploratory information-seeking strategy (Marchionini and Shneiderman, 1988)
- Registration of content into a mental model (Spence, 1999)
- Strategies: scan, review, search (Carmel et al., 1992)
- Our definition: an exploratory information-seeking process, characterized by the absence of planning, with a view to forming a mental model of the content being browsed
- Understanding browsing is useful for developing browsing methods
9Hypertext and Browsing
- Hypertext (Nelson, 1965)
- Defined as non-sequential writing: nodes = pages, edges = links
- Provides free navigation on the Web
- Leads to user disorientation (Nielsen, 1989)
- The problem is more serious in textual displays of Web pages
- Only a limited amount of information is shown on screen
- Users need to click many times to browse through the whole set of Web pages related to their tasks
- → Experience of information overload
10Visual Display of Textual Information
- Task by data type taxonomy of information visualizations (Shneiderman, 1996)
- Data types: 1D, 2D, 3D, temporal, multidimensional, tree, network
- Tasks: overview, zoom, filter, details-on-demand, extract, history, relate
- Result list: 1D (only limited browsing allowed)
- 2D, tree, and network data types support human visual capabilities more effectively
- Hierarchical display has been shown to be an effective information access tool, particularly for browsing (Lin, 1997; Cutting, 1992)
- Map display: a "semantic road map" for viewing the entire collection at a distance (Doyle, 1961; Lin, 1997)
11Review on Web Mining
12Web Mining
- The use of data mining techniques to automatically discover and extract information from Web documents and services (Etzioni, 1996)
- Resource discovery, information extraction, uncovering general patterns
- Web content mining
- Web structure mining
- Web usage mining
- Combination of Web content and structure mining
- Clustering (He et al., 2001); searching (Bharat & Henzinger, 1998); compiling topic taxonomies (Chakrabarti et al., 1999)
- Web communities → clusters
A large amount of information is stored in the
form of documents
13Resource Discovery on the Web
- A challenge to researchers and practitioners
- Exponential growth of the Web
- Commercial search engines exhibit bias in their search results (Mowshowitz & Kawaguchi, 2002)
- Bias: deviation from the norm (obtained by pooling the results of a basket of search engines)
- No search engine could return more than 45% of relevant results (Selberg, 1995)
- Any single search engine on the Web could cover only about 16% of the entire Web (Lawrence & Giles, 1999)
14Meta Searching
- A highly effective method of resource discovery and collection on the Web
- Integrating meta-searching with textual clustering tools achieved high precision in searching the Web (Chen et al., 2001)
- The only realistic way to counter the adverse effects of search engine bias is to use more search engines (SEs), i.e. meta-searching (Mowshowitz & Kawaguchi, 2002)
- MetaCrawler: analysis of relevance rankings (Selberg & Etzioni, 1997)
- Vivisimo: automatic clustering (Palmer et al., 2001)
15Review on Document Visualization
16Document Visualization
- Getting insight into information obtained from one or more documents, but without reading those documents (Wise et al., 1995)
- Involves three stages (Spence, 2001):
- Analysis: extract useful attributes from documents
- Algorithm: cluster similar documents and reduce the dimensionality of the original representation
- Visualization: display the encoded data in a visual format
17Analysis
- Based on automatic text processing techniques (e.g. the vector space model)
- A document is represented by a vector
- Term discrimination values (Salton et al., 1975)
- The similarity between every pair of documents can be computed; examples of such measures:
- Simple matching coefficient, Dice's coefficient, Jaccard's coefficient, cosine coefficient, and overlap coefficient (van Rijsbergen, 1979)
- Asymmetric cluster function (Chen & Lynch, 1992)
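The coefficients listed above can be sketched for documents represented as sets of index terms. This is a minimal illustration using the standard set-based definitions; the function name and the sample term sets are hypothetical, and both sets are assumed to be non-empty.

```python
import math

def similarity_coefficients(x, y):
    """Return five set-based similarity coefficients for term sets x and y."""
    x, y = set(x), set(y)
    common = len(x & y)  # number of shared terms
    return {
        "simple_matching": common,
        "dice": 2 * common / (len(x) + len(y)),
        "jaccard": common / len(x | y),
        "cosine": common / math.sqrt(len(x) * len(y)),
        "overlap": common / min(len(x), len(y)),
    }

# Hypothetical index terms for two documents.
doc_a = {"knowledge", "management", "portal", "software"}
doc_b = {"knowledge", "management", "consulting"}
scores = similarity_coefficients(doc_a, doc_b)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

All five measures are symmetric in the two documents, which is the contrast with the asymmetric cluster function mentioned in the last bullet.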
18Algorithms
- Cluster algorithms and multidimensional scaling (MDS) algorithms are frequently used in visualization (Spence, 2001)
- Cluster algorithms: classify objects into meaningful disjoint subsets or partitions (Jain & Dubes, 1988)
- Hierarchical clustering: bottom-up approach
- Partitional clustering: top-down approach
- MDS algorithms: transform similarity matrices into coordinates of lower dimensions
19Hierarchical Clustering
- A procedure for transforming a proximity matrix into a sequence of nested partitions (Grabmeier & Rudolph, 2002)
- Starting with the n one-element clusters, the method repeatedly combines a pair of clusters into one cluster until a single cluster remains
- Variations: single-link method, complete-link method, average-link method, centroid method, weighted average method, unweighted centroid method, weighted centroid method, and Ward's method
- Strengths: visual impact (dendrogram), efficiency
- Weaknesses: adverse chaining effect, tendency of the hierarchical structure to change dramatically, vulnerability to ties
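The bottom-up procedure described above can be sketched with the single-link variation. This is a small illustrative implementation, not the code used in the study; the distance matrix is made up, and ties are broken by taking the first closest pair found.

```python
def single_link(dist):
    """Agglomerative single-link clustering on a symmetric distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    i.e. the sequence of nested partitions from n clusters down to one.
    """
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical distances between four objects.
dist = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
history = single_link(dist)
for left, right, d in history:
    print(sorted(left), "+", sorted(right), "at distance", d)
```

The merge distances in the history are the heights at which branches join in a dendrogram.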
20Partitional Clustering
- Assigns objects to groups such that objects in a cluster are more similar to each other than to objects in different clusters
- Repeatedly assigns objects to the closest cluster centers until convergence, using an optimality criterion to guide the partitioning process
- Clustering criterion: guides the search for optimal partition points at each iteration
- Square-error clustering criterion (Gordon & Henderson, 1977)
- Normalized cut (Shi & Malik, 2000; He et al., 2001)
- But finding an optimal graph partitioning has been shown to be NP-complete (Garey & Johnson, 1979), so search heuristics are required
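As a rough sketch of the "assign to closest center until convergence" loop with a square-error criterion, the following uses a k-means-style iteration. It is an illustration under assumptions (Euclidean points, hand-picked starting centers, hypothetical function names), not the partitioning method used later in the deck.

```python
def kmeans(points, centers, max_iter=100):
    """Assign points to the closest center, recompute centers, repeat."""
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
            )
            groups[nearest].append(p)
        # New center = mean of the group; an empty group keeps its old center.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return groups, centers

def square_error(groups, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))
        for g, c in zip(groups, centers)
        for p in g
    )

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
groups, centers = kmeans(points, centers=[(0, 0), (10, 0)])
sse = square_error(groups, centers)
print(centers, sse)
```

The loop only finds a local optimum that depends on the starting centers, which is one reason search heuristics come into play.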
21Search Heuristics
- Many to choose from
- Genetic algorithms: a parallel hill-climbing technique that performs a global search for the optimal value (Holland, 1975)
- Taboo search: memorizes modifications to solutions to avoid visiting the same solution twice (Glover, 1986)
- Scatter search: the population evolves through selection, linear combination, integer vector transformation, and culling (Glover, 1977)
- Simulated annealing: hill climbing that allows the search to take some downhill steps to escape a local maximum (van Laarhoven, 1988)
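To make the last heuristic concrete, here is a minimal simulated annealing sketch for a maximization problem. The cooling schedule, parameter values, and the toy one-dimensional landscape are all assumptions chosen for illustration.

```python
import math
import random

def simulated_annealing(score, start, neighbors, temp=10.0, cooling=0.95,
                        steps=500, seed=0):
    """Maximize score(); downhill moves are accepted with decaying probability."""
    rng = random.Random(seed)
    current = best = start
    for _ in range(steps):
        candidate = rng.choice(neighbors(current))
        delta = score(candidate) - score(current)
        # Always accept uphill moves; accept downhill moves with
        # probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = candidate
        if score(current) > score(best):
            best = current
        temp = max(temp * cooling, 1e-9)
    return best

# Toy landscape: a local maximum at index 1 and the global maximum at index 7.
heights = [1, 3, 2, 1, 0, 2, 5, 9, 4]

def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(heights)]

best = simulated_annealing(lambda i: heights[i], start=1, neighbors=neighbors)
print("best index:", best, "height:", heights[best])
```

Plain hill climbing started at index 1 would stay there; the occasional downhill step is what gives the search a chance to cross the valley.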
22Search Heuristics
- Can be grouped under the same roof, called adaptive memory programming
- Implementations of these general solving methods are increasingly similar (Taillard et al., 2001)
- Applied in generalized assignment-type goal programming problems: quadratic assignment, vehicle routing, graph coloring, nurse scheduling, timetabling
- Among them, GAs perform best when the search space is very large (e.g. the Web graph)
- Web searching, spidering, query optimization, graph partitioning
23Comparing Cluster Methods
- No theory exists for selecting the best clustering method for a particular application
- Clustering is just exploratory data analysis; other issues are still important (Jain & Dubes, 1988)
- General considerations: computational efficiency, quality of clusters formed, and visual impact
- Hierarchical: provides a visual dendrogram, high efficiency, and a satisfactory initial clustering (Grabmeier & Rudolph, 2001)
- Partitional: tries to achieve optimal clustering quality, at high computational intensity
24Multidimensional Scaling
- A family of techniques that portray the data's structure in a spatial fashion
- P1000: a picture is worth a thousand words
- History of MDS
- The first systematic MDS procedure for metric solutions (Torgerson, 1952)
- First nonmetric MDS (Kruskal, 1964)
- Consolidation: allows for either metric or nonmetric analysis using either a weighted or an unweighted Euclidean model (Takane et al., 1977)
- ALSCAL in SPSS
25MDS Applications
- Many applications in visualization
- Displaying author cluster maps in author co-citation analysis (He & Hui, 2002; Eom & Farris, 1996)
- Group memory visualization (McQuaid et al., 1999)
- Visualizing user preferences (Kanai and Hakozaki, 2000)
- Studying the change in the knowledge map of groups over time (Kealy, 2001)
- Surprisingly, none of them was found to apply MDS to the discovery of business intelligence on the Web
- No existing search engine applies MDS to facilitate Web browsing
26Visualization
- The process of displaying the encoded data in a visual format
- Output often takes the form of a knowledge map
- A knowledge representation that reveals the underlying relationships of the knowledge sources
- e.g. Web page content, newsgroup messages, business market trends, newspaper articles, and other textual and numerical information
27Knowledge Map
- Early work: manual drawing of blocks and connecting lines, e.g. Concept Map (Novak, 1984) and Mind Map (Buzan, 1993)
- Automatically created maps
- Galaxy of News system (Rennison, 1994)
- Themescape (Wise et al., 1995)
- Cartia's NewsMap: showing financial articles
- Kohonen's self-organizing map (Lin, 1991; Chen et al., 1996): hierarchically clustered regions of documents
- Fisheye and fractal views (Yang et al., 2002)
- Kartoo: interconnected nodes representing Web sites
28Research Gaps
- Hierarchical and map displays have been shown to be effective information access and browsing tools, but have not been widely applied to Web browsing
- Past research on Web mining attempted to use either content information or structure information to cluster Web pages → how can Web communities be identified based on both?
- None of the existing search engines allows users to visualize the relationships among results in terms of relative closeness
29Research Questions
- Based on content and structure information, how can Web communities be identified among a set of business Web sites?
- How can a knowledge map be created to represent the relationships among Web sites?
- What are the effectiveness, efficiency, and usability of the Web community and knowledge map methods in Web browsing, compared with the result list and the Kartoo map?
30The Knowledge Map Approach
31[Figure: system architecture of the knowledge map approach to the discovery of business intelligence on the Web. Labels recovered from the diagram: business intelligence articles from INSPEC (1969-2002); queries (KM, DB, DW, ERP, CRM); Meta Searching over AltaVista, AlltheWeb, Yahoo, Teoma, MSN, LookSmart, and Wisenut; Assemble Web Sites; Automatic Indexing; SQL Database; Analysis; Compute Similarity; Identify Web Community; Display Web Sites on a Map]
32Research Testbed
- Identify business terms
- A search for "business intelligence" on INSPEC returned 281 article abstracts published between 1969 and 2002
- 9 key terms/topics were manually identified from these abstracts, based on their importance in the abstracts
- e.g. knowledge management, database technology, CRM, ERP
- Assemble business Web sites by meta-searching
- A total of 700 business Web sites were collected from 7 major search engines with the 9 key terms as queries
- After removing duplicates and filtering, 3,149 pages from 2,860 business Web sites were collected
- Non-website pages were filtered out
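The pooling and duplicate-removal step can be sketched as follows. This is a simplified illustration: the engine names and URLs are placeholders, a site is approximated by its host name, and the filtering rules actually used in the study are not reproduced here.

```python
from urllib.parse import urlsplit

def site_of(url):
    """Reduce a URL to its host name, dropping a leading 'www.'."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def merge_results(results_by_engine):
    """Pool URLs from several engines; keep unique pages grouped by site."""
    pages_by_site = {}
    for urls in results_by_engine.values():
        for url in urls:
            pages = pages_by_site.setdefault(site_of(url), [])
            if url not in pages:  # drop exact duplicate pages
                pages.append(url)
    return pages_by_site

# Placeholder result lists from two hypothetical engines.
results = {
    "engine_a": ["http://www.example.com/km", "http://example.org/"],
    "engine_b": ["http://example.com/crm", "http://www.example.com/km"],
}
sites = merge_results(results)
total_pages = sum(len(pages) for pages in sites.values())
print(len(sites), "sites,", total_pages, "pages")
```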
33Three browsing methods (Result list, Web community, and Knowledge map) are provided.
Users can choose a business intelligence topic here to browse.
34Automatic Parsing and Indexing
- Automatically extract key words and hyperlinks from the Web pages
- Remove stop words and identify term types
- Term types: title, heading, content text, and image alternate text
- Used the Arizona Noun Phraser to automatically extract and index all the noun phrases from each Web page (Tolle & Chen, 2000)
- Term frequency: measures how often a particular term occurs in a Web page
- Inverse Web page frequency: indicates the specificity of the term
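Term frequency and inverse Web page frequency are commonly combined into a single term weight. The sketch below uses the common tf × log(N / page frequency) form; the exact weighting formula used in the study is not given on the slide, so this form, the function name, and the sample pages are assumptions.

```python
import math
from collections import Counter

def tf_idf(pages):
    """pages: dict of page id -> list of terms. Returns page id -> {term: weight}."""
    n = len(pages)
    # Page frequency: number of pages in which each term occurs at least once.
    page_freq = Counter(term for terms in pages.values() for term in set(terms))
    weights = {}
    for page, terms in pages.items():
        tf = Counter(terms)  # term frequency within this page
        weights[page] = {t: tf[t] * math.log(n / page_freq[t]) for t in tf}
    return weights

# Hypothetical noun phrases extracted from two pages.
pages = {
    "p1": ["knowledge", "management", "knowledge"],
    "p2": ["knowledge", "portal"],
}
weights = tf_idf(pages)
print(weights["p1"])
```

A term that appears on every page ("knowledge" here) gets weight zero, reflecting its low specificity.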
35Co-occurrence Analysis (1)
- The similarity between every pair of Web sites (site i and site j) combines content and structural (connectivity) information. We modify the algorithm used in He et al. (2001) to compute the similarity: [equation not recovered]
- where $A$, $S$, and $C$ are the matrices for $A_{ij}$, $S_{ij}$, and $C_{ij}$ respectively; $\alpha$ and $\beta$ are parameters between 0 and 1, with $0 \le \alpha + \beta \le 1$
- $A_{ij} = 1$ if site i has a hyperlink to site j; $A_{ij} = 0$ otherwise
36Co-occurrence Analysis (2)
- $S_{ij}$: asymmetric similarity score between site i and site j (Chen & Lynch, 1992): [equation not recovered]
- $C_{ij}$: number of pages pointing to both site i and site j (co-citation matrix)
- Overall similarity: [equation not recovered]
- where $A$, $S$, and $C$ are the matrices for $A_{ij}$, $S_{ij}$, and $C_{ij}$ respectively; $\alpha$ and $\beta$ are parameters between 0 and 1, with $0 \le \alpha + \beta \le 1$
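The overall-similarity equation did not survive extraction, so the following is only a sketch of one plausible form: a weighted sum of the three matrices, each scaled by its norm, with weights alpha, beta, and 1 - alpha - beta. The weighted-sum form, the use of the Frobenius norm, and the names `combine`, `alpha`, and `beta` are assumptions, not the author's confirmed formula.

```python
import math

def frobenius(m):
    """Frobenius norm of a matrix; returns 1.0 for an all-zero matrix."""
    return math.sqrt(sum(v * v for row in m for v in row)) or 1.0

def combine(A, S, C, alpha, beta):
    """Assumed form: W = alpha*A/|A| + beta*S/|S| + (1 - alpha - beta)*C/|C|."""
    if not (0 <= alpha <= 1 and 0 <= beta <= 1 and alpha + beta <= 1):
        raise ValueError("alpha and beta must lie in [0, 1] with alpha + beta <= 1")
    na, ns, nc = frobenius(A), frobenius(S), frobenius(C)
    gamma = 1 - alpha - beta
    n = len(A)
    return [
        [alpha * A[i][j] / na + beta * S[i][j] / ns + gamma * C[i][j] / nc
         for j in range(n)]
        for i in range(n)
    ]

# Two hypothetical sites: site 0 links to site 1 but not the other way round.
A = [[0, 1], [0, 0]]          # hyperlink matrix
S = [[0, 0.5], [0.5, 0]]      # content similarity scores
C = [[0, 2], [2, 0]]          # co-citation counts
W = combine(A, S, C, alpha=0.2, beta=0.6)
print(W)
```

Because the hyperlink matrix is directional, the combined similarity is generally not symmetric: here W[0][1] exceeds W[1][0] by exactly the link contribution.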
37Identifying Web Communities
- Compute Web communities by GA graph partitioning and the normalized cut measure (Shi, 2000)
- Model the Web as a graph consisting of nodes (Web pages) and edges (similarities)
- A cut on a graph G = (V, E) is the removal of a set of edges such that the graph is split into two disconnected sub-graphs
38Identifying Web Communities
- A normalized cut measures the disassociation between the nodes in the two sub-graphs (Shi, 2000). Define: [equation not recovered]
- The association value of all the nodes in a sub-graph G′ (the partition) to all the nodes in the entire graph G
- Minimize the normalized cut value (or maximize the normalized association value)
[Figure: a graph cut into two sub-graphs, A and B]
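The normalized cut of a bipartition (A, B) is commonly written as cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V), where assoc(X, V) sums the edge weights from nodes in X to all nodes. The sketch below computes it for a small symmetric weight matrix; the matrix and function names are illustrative.

```python
def assoc(w, xs, ys):
    """Total edge weight from nodes in xs to nodes in ys."""
    return sum(w[i][j] for i in xs for j in ys)

def normalized_cut(w, part_a):
    """Normalized cut of the bipartition (part_a, remaining nodes)."""
    nodes = set(range(len(w)))
    a = set(part_a)
    b = nodes - a
    cut = assoc(w, a, b)
    return cut / assoc(w, a, nodes) + cut / assoc(w, b, nodes)

# Hypothetical similarity graph: strong ties 0-1 and 2-3, a weak tie 1-2.
w = [
    [0, 1, 0, 0],
    [1, 0, 0.1, 0],
    [0, 0.1, 0, 1],
    [0, 0, 1, 0],
]
good = normalized_cut(w, {0, 1})   # cuts only the weak tie
poor = normalized_cut(w, {0})      # isolates a single node
print(round(good, 4), round(poor, 4))
```

Unlike a plain minimum cut, the normalization penalizes splitting off a single weakly connected node, so the balanced split scores lower.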
39Identifying Web Communities
- Recursively apply the GA to bipartition the sub-graphs to obtain a hierarchical clustering
- Label each community by its top phrases and by human identification
- Combines hierarchical and partitional clustering
[Figure: a graph recursively partitioned into sub-graphs labeled A, B, C, and D]
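Recursive bipartitioning can be sketched as below. To keep the example short and deterministic, an exhaustive search over all splits stands in for the genetic algorithm; that substitution is only feasible for tiny graphs and is an assumption of this sketch, as are the function names and the sample weights.

```python
from itertools import combinations

def ncut(w, a, b):
    """Normalized cut of parts a and b within the sub-graph a | b."""
    cut = sum(w[i][j] for i in a for j in b)
    assoc_a = sum(w[i][j] for i in a for j in a | b)
    assoc_b = sum(w[i][j] for i in b for j in a | b)
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")  # a part with no edges is never a good split
    return cut / assoc_a + cut / assoc_b

def best_bipartition(w, nodes):
    """Exhaustively find the split with the smallest normalized cut."""
    nodes = sorted(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    # Fix the first node in part A so each split is tried only once.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            a = frozenset((first,) + extra)
            b = frozenset(nodes) - a
            score = ncut(w, a, b)
            if best is None or score < best[0]:
                best = (score, a, b)
    return best[1], best[2]

def hierarchy(w, nodes, min_size=2):
    """Recursively bipartition until communities have at most min_size nodes."""
    nodes = frozenset(nodes)
    if len(nodes) <= min_size:
        return sorted(nodes)
    a, b = best_bipartition(w, nodes)
    return [hierarchy(w, a, min_size), hierarchy(w, b, min_size)]

w = [
    [0, 1, 0, 0],
    [1, 0, 0.1, 0],
    [0, 0.1, 0, 1],
    [0, 0, 1, 0],
]
tree = hierarchy(w, range(4))
print(tree)
```

The nested lists mirror the community hierarchy: each level of nesting is one bipartition.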
40Creating Knowledge Maps
- Used multidimensional scaling (MDS) to transform a high-dimensional similarity matrix into a 2-dimensional representation of points (Torgerson, 1952)
- Convert the similarity matrix into a dissimilarity matrix
- Calculate matrix $B$, the matrix of scalar products, by using the cosine law. Each element in $B$ is given by: [equation not recovered]
- Perform a singular value decomposition on $B$ and use the following formulae to find the coordinates of the points: $B = U V U^{\top}$, $X = U V^{1/2}$, $B = X X^{\top}$
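The MDS steps can be sketched in pure Python. Two assumptions are made: the scalar-product matrix is built with the standard double-centering formula b_ij = -(d_ij^2 - mean_i - mean_j + mean) / 2 (the slide's own element formula was not recovered), and power iteration with deflation stands in for a library SVD routine. The input is a symmetric dissimilarity matrix that is assumed to be close to Euclidean.

```python
import math
import random

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Map a symmetric dissimilarity matrix to coordinates in `dims` dimensions."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # Double centering: scalar products relative to the centroid.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.random() for _ in range(n)]
        eigenvalue = 0.0
        for _ in range(iters):  # power iteration for the dominant eigenpair
            bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in bv))
            if norm < 1e-12:  # nothing left to extract in this dimension
                eigenvalue = 0.0
                break
            v = [x / norm for x in bv]
            eigenvalue = norm
        # Coordinates: eigenvector scaled by the square root of its eigenvalue.
        for i in range(n):
            coords[i][k] = v[i] * math.sqrt(eigenvalue)
        # Deflate so the next pass finds the next-largest eigenvalue.
        b = [[b[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords

# Distances between the corners of a 3-by-1 rectangle.
r = math.sqrt(10)
dist = [
    [0, 3, r, 1],
    [3, 0, 1, r],
    [r, 1, 0, 3],
    [1, r, 3, 0],
]
coords = classical_mds(dist)
for point in coords:
    print([round(c, 3) for c in point])
```

The recovered layout matches the original only up to rotation and reflection, which is why a knowledge map conveys relative closeness rather than absolute position.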
41Summary of the Approach
42Evaluation Methodology
43Objectives
- To understand the effectiveness, efficiency, and usability of the two browsing methods
- Web community (WC), Knowledge map (KM)
- To compare the knowledge map with existing browsing methods
- Result list display (RL)
- Kartoo.com map display (KT)
44Browsing
- Browsing: the focus of this user study
- "The true purpose of hypertext is to provide an open, exploratory browsing information space to the user" (J. Nielsen, 1990)
- Browsing "connotes an informal search process characterized by the absence of planning" (G. Marchionini, 1987)
- The human user interface affects the effectiveness and efficiency of browsing (G. Marchionini and B. Shneiderman, 1988)
- Display of search results → browsing
45Result List
46Web Community
- Groups of Web sites are organized in hierarchical communities
- Clicking on any node immediately below the root opens that sub-tree
- Clicking this button opens the Web site the user has selected
- The Back button allows users to traverse upward in the tree
- Panel showing details on demand (labels, title, summary, URL)
47Knowledge Map
- The closeness of any two points reflects their similarity
- Details of this Web site are shown on the bottom panel
- Users can control the number of Web sites to be displayed
- Panel showing details (title, URL, summary)
- Navigation buttons allow browsing in four directions
- Zooming buttons allow zooming in or out
48Kartoo Map
49Experiment Tasks
- Designed according to TREC tasks (Voorhees & Harman, 1997)
- For each of the 4 browsing tools:
- Task 1: Find Web site information for two companies stated in the question (2 questions, 4 minutes in total)
- Close-ended task; requires specific matching
- e.g. "Find out the URL and the major business areas of Gensym Corporation"
- Task 2: Find Web site information relevant to a topic stated in the question (1 question, 8 minutes in total)
- Open-ended task; requires finding similar or relevant results
- e.g. "Find out the titles and URLs of the Web sites that are related to CRM benchmarking"
50Experiment design
- Comparisons
- Web community vs. Result list
- Knowledge map vs. Web community
- Knowledge map vs. Kartoo map
- One-factor repeated-measures design
- The content of the tasks was different for different browsing methods, but their nature was the same (i.e. close-ended for task 1 and open-ended for task 2)
- Subjects used each method to perform 2 tasks (total 2 × 4 = 8 tasks in one hour)
51Participants
- 30 subjects
- Students of the University of Arizona
- Profile [Figure: age distribution of the subjects]
52Hypotheses on Effectiveness
- Consists of accuracy, precision, and recall
- H1: WC is more effective than RL
- Rationale: Clustering and WC's labels help users understand major topics more easily
- H2: KM is more effective than WC
- Rationale: Relative distances on the map provide a more intuitive way than WC's labels to find relevant results
- H3: KM is more effective than KT
- Rationale: Relative closeness of Web sites on KM enables users to find precise and relevant results
53Hypotheses on Efficiency
- The amount of time users need to spend using a browsing method to finish the tasks
- H4: WC is more efficient than RL
- Rationale: Clustering and WC's labels help users browse and search more quickly
- H5: KM is more efficient than WC
- Rationale: The map display allows users to see the titles quickly and does not require clicking on nodes
- H6: KM is more efficient than KT
- Rationale: KM's placement of Web sites enables users to find similar Web sites more quickly
54Hypotheses on Usability
- How satisfied users are when they use a browsing method
- H7: WC obtains higher users' ratings than RL
- Rationale: Hierarchically organized and colored clusters are more attractive to users
- H8: KM obtains higher users' ratings than WC
- Rationale: The map is more flexible, as it allows users to adjust the number of results to show
- H9: KM's placement of Web sites is more meaningful than KT's
- Rationale: The closeness of any two points in KM reflects the similarity of the pair of Web sites
55Performance Measurement
- Effectiveness: [formulas not recovered; slide callouts: "Applied to task 1", "Applied to task 2"]
- Efficiency: total time spent on the two tasks
- Usability: users' ratings of the browsing tool and its main feature
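The effectiveness formulas did not survive extraction. As a reference point, the standard information-retrieval definitions of precision, recall, and F value can be sketched as follows; whether the study used exactly these forms is an assumption, and the sample answer sets are made up.

```python
def precision_recall_f(retrieved, relevant):
    """Standard precision, recall, and balanced F value for two result sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_value

# Hypothetical: a subject lists four Web sites, two of which are relevant,
# out of three relevant sites identified by expert judges.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```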
56Experiment Results and Discussion
57Result Summary
Best precision, F value, accuracy, and overall
rating
Best recall, accuracy and total time
58Results on Effectiveness
- The following table shows the p-values [table not recovered]
- Marked values are not significant at the alpha = 5% level
59Results on Efficiency
- The following table shows the p-values [table not recovered]
- Marked values are not significant at the alpha = 5% level
60Results on Usability
- The following table shows the p-values [table not recovered]
- Marked values are not significant at the alpha = 5% level
61Result List vs. Web Community
- Hypotheses H1, H4, and H7 (all confirmed)
- WC performed significantly better than RL in terms of effectiveness, efficiency, and usability
- Main reasons: advantages of clustering and visualization effects
- WC grouped similar Web sites together but RL did not → higher effectiveness
- WC's hierarchical structure allowed subjects to visualize the landscape of the entire collection → higher efficiency
- Subjects: "Once I spot the label, I can move to the relevant topics very easily"; "visualization helps to navigate faster and easier"; "too much reading in RL"
62Web Community vs. Knowledge Map
- Hypotheses H2, H5, and H8 (all not confirmed)
- Surprisingly, we found that KM performed very similarly to WC in terms of effectiveness and efficiency
- Both display results in a 2D format while providing details on demand
- Both employ the concept of similarity, but in different ways
- Both allow a quick scan of results
- The opposite of H8 was confirmed (i.e. WC obtained higher users' ratings than KM)
- KM's inadequate zooming function
63Knowledge Map vs. Kartoo Map
- Hypotheses H3, H6, and H9 (all confirmed), and subjects' verbal comments
- KM performed significantly better than KT in terms of effectiveness, efficiency, and users' ratings on the meaning of point placement
- Reasons: KM's intuitive meaning of point placement, and its provision of details on demand
- Subject 9: "This is an intelligent tool and has features superior to any other search as it gives a visual picture of the topic and all topics closely related to the one under search. The map is intuitive and helps steer the user to the right topics or the ones that are close."
64KM vs. KT (1)
65KM vs. KT (2)
66KM vs. KT (3)
67KM vs. KT (4)
68Knowledge Map vs. Kartoo Map
- Subjects' verbal comments
- Kartoo has a better GUI, while KM's clean interface provides good user friendliness and quality of results
- KM allows users to quickly locate answers
- Kartoo's many functions may be confusing to some subjects
- Flashing links and labels, URL labels not meaningful, titles and summaries cannot be copied and disappear quickly
69Conclusion
70Contributions
- Our approach can alleviate the information overload problem and discover business intelligence on the Web
- The WC and KM browsing methods are suitable for discovering the landscape of a large number of Web sites
- Findings from our user study provide practical implications for search engine developers and HCI researchers
71Future Directions
- Other algorithms can be applied in the different steps of document visualization
- Co-occurrence analysis, clustering, MDS
- New visualization metaphors can be developed for Web browsing
- Other domain areas can be explored to create knowledge maps and discover communities
- Organizations with scattered and voluminous data
72Thank you very much!
- Comments and Suggestions ?