Scalable Web Usage Mining and Soft Computing Approaches for High Performance Intelligent Web Recommender Systems

1
Scalable Web Usage Mining and Soft Computing
Approaches for High Performance Intelligent Web
Recommender Systems
  • Olfa Nasraoui
  • Research Assistants: Cesar Cardona, Carlos Rojas,
    Fabio Gonzalez, Elizabeth Leon, Chris Petenes,
    Mrudula Pavuluri
  • Dept. of Electrical and Computer Engineering
  • The University of Memphis
  • E-mail: onasraou_at_memphis.edu

Research sponsored by National Science Foundation
CAREER Award NSF-IIS 0133948
2
Outline
  • Web Usage Mining
  • Background: Evolutionary Computation
  • Mining Web Profiles with Hierarchical
    Unsupervised Niche Clustering (H-UNC)
  • H-UNC Web Mining: Experimental Results
  • Need for Scalability; Background: Artificial
    Immune Systems
  • Scalable Web Usage Mining based on the Immune
    System Metaphor
  • Scalable Web Usage Mining: Preliminary Results
  • Tying it All Together: An Automated Real Time
    Recommendation System
  • Conclusions and Future Prospects

3
Web Personalization
  • WWW Personalization: Tailor users' interaction
    with the Web information space based on
    information about the user
  • e.g., recommend items/links based on prior
    ratings/visits
  • Manually entered profiles are subjective, static,
    not always available, and raise privacy concerns
  • Alternative: Extract profiles based on all users'
    access patterns: mass profiling
  • → anonymous profiles

4
Web Personalization System
[Diagram components: Web Usage Mining, Profiles, Recommendation Engine]
5
Web Mining: Data Types and Challenges
  • Mining the Web: Data
  • Content: Web pages
  • Usage: Web access log files
  • Structure: Link structure of Web pages
  • Challenges
  • Huge, semi-structured/unstructured, highly dynamic
    data
  • Data corrupted with noise (not all info is
    relevant)

6
Web Usage Mining Framework
7
Knowledge Discovery Process For Web Usage Mining
  • Source of Data: Web Clickstreams → Web Log Files
  • Goal: Extract interesting user profiles by
    categorizing user sessions into groups or
    clusters
  • Complete KDD process
  • Preprocessing: selecting and cleaning data
  • Data Mining (Learning Phase)
  • A model for session data (bag of URLs visited)
  • Similarity assessment
  • Clustering algorithm to categorize sessions
  • Derivation and Interpretation of results
  • Computing profiles
  • Evaluating results
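
A minimal sketch of the session model and similarity assessment steps above, assuming a session is reduced to the set of URLs it visited and that similarity is the cosine between the corresponding binary vectors. The function names and sample URLs are illustrative and not taken from the presentation.

```python
import math

def session_to_bag(urls):
    """Model a session as the set of visited URLs; repeat visits collapse."""
    return frozenset(urls)

def cosine_similarity(a, b):
    """Cosine similarity between two sessions viewed as binary URL vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

s1 = session_to_bag(["/", "/courses", "/courses/cs101"])
s2 = session_to_bag(["/", "/courses", "/people"])
similarity = cosine_similarity(s1, s2)  # 2 shared URLs / sqrt(3 * 3)
```

Any other session similarity could be plugged in at this point; the clustering step only needs pairwise similarity values.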

8
Different Ways to Mine Web User Profiles
Possible Solutions and Problems
  • Mining Web Usage Data using Clustering
  • K-Means (Shahabi et al. 1997) (problems w/
    Euclidean distance, feature vector
    representation, sparsity, noise, not knowing the
    number of clusters; assumes user profiles do not
    overlap; sensitive to initialization and local
    optima, etc.)
  • Relational Clustering (same as above, except that
    it uses a distance/relation matrix instead of a
    vector representation; huge memory and
    computation requirements, etc.)
  • Fuzzy Relational Clustering (Nasraoui et al.
    1999) (same as above, but allows overlapping
    clusters, etc.)
  • Robust relational clustering (Nasraoui et al.
    1999) (same as above, but can handle noise in
    data!)

9
Different Ways to Mine Web User Profiles
Possible Solutions and Problems
  • Evolutionary techniques can avoid the feature
    vector representation dilemma (but appropriate
    coding is required); flexible: allow any
    similarity measure and any subjective fitness
    criterion; better global search
    (population based)
  • Hierarchical Unsupervised Niche Clustering, or
    HUNC (Nasraoui & Krishnapuram, 2001): based on
    the Darwinian evolution metaphor and niche
    speciation; handles noise and an unknown number
    of clusters; reliable w.r.t. initialization
  • Immune Based Clustering (Nasraoui et al., 2002):
    based on the immune system metaphor, a microcosm
    of evolution
  • Need a Scalable Immune Based Clustering: linear
    complexity even for huge clickstream data that
    cannot fit in main memory; dynamic online
    learning of evolving user profiles (Nasraoui et
    al., 2003)

10
Different Ways to Mine Web User Profiles
Possible Solutions and Problems
  • Other Approaches (Web Session = Transaction)
  • Frequent Itemset and Association Rule Mining
  • Very sensitive to required parameters: support
    and confidence thresholds
  • Either too few profiles or too many, including
    spurious profiles
  • Exorbitant computational complexity for low
    support thresholds
  • Association Rule Mining, followed by Hypergraph
    Partition based Clustering
  • All of the above, plus the drawbacks of
    hypergraph partition clustering (huge complexity)

12
Background: Genetic Algorithms (Inspired by
Nature: Darwinian Evolution)
  • Genetic Algorithms (GAs): Evolve a population of
    individuals/solutions using selection, crossover,
    and mutation, as in nature
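
The three operators named above can be sketched in a few lines. This is a toy illustration under assumptions of my own: binary tournament selection, one-point crossover, bit-flip mutation, and a "OneMax" fitness (count of 1 bits) are common textbook choices, not operators specified in the presentation.

```python
import random

def evolve(fitness, n_bits=20, pop_size=30, generations=40, p_mut=0.02, seed=0):
    """Tiny generational GA over bit strings (illustrative operator choices)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def select(population):
        # Selection: binary tournament, the fitter of two random individuals wins.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = select(pop), select(pop)
            cut = rng.randrange(1, n_bits)               # crossover point
            child = p1[:cut] + p2[cut:]                  # one-point crossover
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # mutation
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best = evolve(sum)  # toy "OneMax" fitness: number of 1 bits
```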

14
Mining Web Profiles with Hierarchical
Unsupervised Niche Clustering, Step 1: Data
Preprocessing
  • Access log: Record of all URLs accessed by users
    on a Web site
  • Log entry: access time, IP address, URL viewed,
    etc.
  • ___________________________________________
  • 171148 141.225.195.29 GET /graphics/griffin.jpg
    200
  • 171148 141.225.195.29 GET
    /people/faculty/nasraoui.html 200
  • ___________________________________________
  • Map the N_U URLs on the site to indices
  • User session vector s(i): temporally compact
    sequence of Web accesses by a user
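
A rough sketch of this preprocessing step, using the two log entries shown above. Several simplifications are assumptions rather than the authors' procedure: requests are grouped by IP address only (no time-gap rule), image requests are dropped as a cleaning step, and each session is encoded as a binary vector over the URL indices. All function names are hypothetical.

```python
from collections import defaultdict

LOG = """\
171148 141.225.195.29 GET /graphics/griffin.jpg 200
171148 141.225.195.29 GET /people/faculty/nasraoui.html 200
"""

def parse(line):
    """Split one entry into (time, ip, url, status); the method is ignored."""
    time, ip, _method, url, status = line.split()
    return time, ip, url, int(status)

def build_sessions(lines, skip_suffixes=(".jpg", ".gif", ".png")):
    """Group successful page requests by IP; image requests are dropped."""
    by_ip = defaultdict(list)
    for line in lines:
        _, ip, url, status = parse(line)
        if status == 200 and not url.endswith(skip_suffixes):
            by_ip[ip].append(url)
    return by_ip

def to_vectors(sessions):
    """Map each distinct URL to an index and encode sessions as 0/1 vectors."""
    all_urls = sorted({u for urls in sessions.values() for u in urls})
    url_index = {u: k for k, u in enumerate(all_urls)}
    vectors = {}
    for ip, urls in sessions.items():
        v = [0] * len(url_index)
        for u in urls:
            v[url_index[u]] = 1
        vectors[ip] = v
    return url_index, vectors

sessions = build_sessions(LOG.splitlines())
url_index, vectors = to_vectors(sessions)
```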

15
Step 2: Clustering Sessions
  • Clustering: Dividing unlabeled data into groups
  • Unsupervised clustering: when the number of
    categories is unknown
  • Robust clustering: when data contains noise and
    outliers
  • Genetic Clustering: Can deal with
    non-differentiable objective functions/similarity
    measures.

16
Unsupervised Niche Clustering (UNC)
  • Representation: binary chromosome strings (one
    substring per feature)
  • Deterministic Crowding Selection: Children
    replace the closest parent if they have better
    fitness.
  • Density fitness measure
  • Robust weight
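
A hedged sketch of one deterministic crowding generation on binary chromosomes. The fitness function is passed in as a placeholder, since the density fitness and robust weight formulas did not survive in this transcript; Hamming distance, one-point crossover, and the mutation rate are assumptions chosen for illustration.

```python
import random

def hamming(a, b):
    """Number of positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def crowding_step(pop, fitness, rng, p_mut=0.01):
    """One generation of deterministic crowding, updating pop in place."""
    order = list(range(len(pop)))
    rng.shuffle(order)
    for i, j in zip(order[::2], order[1::2]):
        p1, p2 = pop[i], pop[j]
        cut = rng.randrange(1, len(p1))
        c1 = [b ^ (rng.random() < p_mut) for b in p1[:cut] + p2[cut:]]
        c2 = [b ^ (rng.random() < p_mut) for b in p2[:cut] + p1[cut:]]
        # Pair each child with the parent it resembles most.
        if hamming(p1, c1) + hamming(p2, c2) > hamming(p1, c2) + hamming(p2, c1):
            c1, c2 = c2, c1
        # A child replaces its closest parent only if it has better fitness.
        if fitness(c1) > fitness(p1):
            pop[i] = c1
        if fitness(c2) > fitness(p2):
            pop[j] = c2
    return pop
```

Because a child only ever competes with its most similar parent, distinct niches (here, distinct clusters) can survive side by side instead of the population collapsing onto a single optimum.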

17
Evolution Example
18
Adaptation to Web Mining: Hierarchical UNC (HUNC)
  • Encode binary session vectors
  • Perform UNC in hierarchical mode (HUNC)
  • → Fast multi-resolution profiling: vary the no. of
    levels (L)
  • Start by applying UNC to the entire data set w/ a
    small pop. size (L = 1)
  • Focus on each cluster recursively: reapply UNC on
    the data subset assigned to each cluster to
    extract more clusters at higher resolution (L > 1)
  • Repeat until cluster size or scale becomes too
    small
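
The recursion described above can be sketched independently of the clustering step itself. In this illustration, `cluster_fn` is a hypothetical stand-in for UNC (any function that splits a data subset into groups), and the stopping rule uses only a level limit and a minimum subset size; the scale-based stopping test mentioned on the slide is not modeled.

```python
def hierarchical_cluster(data, cluster_fn, level=1, max_levels=3, min_size=4):
    """Recursively re-cluster each subset; returns a nested tree of clusters."""
    node = {"level": level, "size": len(data), "children": []}
    if level > max_levels or len(data) < min_size:
        return node
    groups = cluster_fn(data)
    if len(groups) < 2:  # nothing left to split at this resolution
        return node
    for subset in groups:
        node["children"].append(
            hierarchical_cluster(subset, cluster_fn, level + 1, max_levels, min_size)
        )
    return node

def split_in_half(d):
    """Toy stand-in for UNC: cut the subset into two halves."""
    return [d[: len(d) // 2], d[len(d) // 2:]]

tree = hierarchical_cluster(list(range(16)), split_in_half)
```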

20
Web Mining Experimental Results with HUNC
22
Level 2 examples
  • General outside visitors: Profiles 1 and 3
  • Prospective students: Profiles 2 and 4
  • Insiders (students): Profiles 6, 7, etc.

23
Main Site of Univ. of Missouri
Profile 16: Example of discovering associations
w/out any prior knowledge of content
25
Towards Scalability: Dynamic Web Mining
  • Typically, data mining has to be completely
    re-applied periodically and offline on newly
    generated Web server logs in order to keep the
    discovered knowledge up to date.
  • An intelligent Web mining system should be able
    to continuously learn evolving usage trends
    without ungraceful stoppages, reconfigurations,
    or restarting from scratch
  • Need to make scalable to handle huge data sets
    given limited amounts of main memory
  • We may view the problem as clustering data
    streams: a huge flux of data with very limited
    memory to store it, so single-pass learning is
    needed.
  • Applicable both to usage and content (text) data

26
Another Evolutionary System in Nature: The
Immune System
  • Immune system: a parallel and distributed adaptive
    system w/ tremendous potential in many
    intelligent computing applications.
  • Protects our bodies from foreign pathogens
    (viruses/bacteria)
  • Innate Immune System (initial, limited; e.g. skin,
    tears, etc.)
  • Acquired Immune System (Learns how to respond to
    NEW threats adaptively through an evolutionary
    process)
  • Primary immune response
  • First response to invading pathogens
  • Secondary immune response
  • Encountering similar pathogen a second time
  • Remember past encounters
  • Faster and stronger response than primary response

27
Points of Strength of The Immune System
  • Recognition (Anomaly detection, Noise tolerance)
  • Robustness (Noise tolerance)
  • Feature extraction
  • Diversity (can face an entire repertoire of
    foreign invaders)
  • Reinforcement learning
  • Memory (remembers past encounters: basis for
    vaccine)
  • Distributed Detection (no single central system)
  • Multi-layered (defense mechanisms at multiple
    levels)
  • Adaptive (Self-regulated)

28
Learning in the Immune System
  • Main purpose of the immune system: recognize all
    cells (or molecules) within the body and
    categorize those cells as self or non-self.
  • Non-self cells are further categorized in order
    to stimulate an appropriate type of defensive
    mechanism.
  • The immune system learns through micro-scale
    evolution to distinguish between foreign antigens
    (e.g., bacteria, viruses, etc.) and the body's
    own cells or molecules.

29
B-Cells
  • Through a process of recognition and stimulation,
    B-Cells will clone and mutate to produce a
    diverse set of antibodies adapted to different
    antigens
  • B-Cells secrete antibodies that can bind to
    specific antigens and destroy their host invading
    agent through a KILL, SUICIDE, or INGEST signal.
  • A B-Cell's antibodies can also bind to antibodies
    on other B-Cells, hence sending a STIMULATE or
    SUPPRESS signal → Network!

30
Immune Recognition
  • Immune recognition is based on complementarity
    between the binding region of the receptor and a
    portion of the antigen called the epitope.
  • B-cell antibodies present a single type of
    receptor, whereas antigens might present several
    epitopes.
  • This means that different antibodies can
    recognize a single antigen.
  • Binding between B-cells and antigens is NOT exact
    (soft error-tolerant binding)

31
Artificial Immune Systems (AIS): Recent History
  • Based on the Immune Network theory (Jerne, 1973)
  • The system consists of a network of B-Cells
  • Antigens represent data
  • B-Cells represent clusters
  • B-cells form a Network by interacting with each
    other through stimulation and suppression (to
    form memory of past antigens)
  • Exponential explosion in B-Cell population!!!
  • Huge immune network: bottleneck against
    scalability!
  • Unscalable (time and memory!)

33
General Architecture of Proposed Approach
  • Information in the immune network:
  • Stimulation (competition and memory)
  • Age (old vs. new)
  • Co-stimulation / suppression → network
    interactions
  • Outliers (based on activation)

1-Pass Adaptive Immune Learning:
Evolving data → Evolving Immune Network (compressed into
subnetworks)
34
Model for Artificial Immune Cell
  • Antigens: data
  • B-Cells: clusters or patterns to be
    learned/extracted
  • Dynamic environment: antigens are presented to
    the immune network one at a time, with the
    stimulation and scale measures updated after
    each presentation.
  • Antigen index j increases monotonically with
    time: antigens are presented in chronological
    order x1, x2, ..., xN
  • Dynamic Weighted B-cell (D-W-B-Cell)
  • Represents a neighborhood modeled by an activation
    function: a robust weight/membership function

35
Model for Artificial Immune Cell
  • Each D-W-B-Cell is allowed to have its own zone of
    influence with size/scale σi
  • D-W-B-Cells dynamically adapt their influence
    zones, and hence their stimulation levels, in a
    struggle for survival.
  • The activation (weight) function dynamically
    adapts to evolving data (time decay)
  • Outliers are easily detected through weak
    activations
  • Flexible for different attribute types
    (numerical, categorical, etc.)
  • D-W-B-cells are cloned in proportion to their
    stimulation levels.
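
The slide's equations are not preserved in this transcript, so the following is only an assumed form of a robust activation weight with time decay: a Gaussian-like falloff with distance, scaled by the cell's influence zone, multiplied by an exponential decay in the antigen's age. The parameter `tau` and the function names are hypothetical; the 0.001 threshold matches the outlier cutoff quoted later in the presentation.

```python
import math

def activation(dist_sq, scale_sq, age, tau):
    """Assumed robust weight: decays with distance from the cell and with age."""
    return math.exp(-(dist_sq / (2.0 * scale_sq) + age / tau))

def is_outlier(weights, threshold=0.001):
    """An antigen is flagged if it activates no cell above the threshold."""
    return all(w < threshold for w in weights)

fresh_and_close = activation(dist_sq=0.0, scale_sq=1.0, age=0, tau=10.0)
old_but_close = activation(dist_sq=0.0, scale_sq=1.0, age=10, tau=10.0)
far_away = activation(dist_sq=100.0, scale_sq=1.0, age=0, tau=10.0)
```

With this shape, a point that is far from every cell relative to each cell's scale yields uniformly weak activations, which is what makes outlier detection cheap.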

36
Incremental Update Equations and Network Interactions
  • Stimulation (fitness)
  • Influence zone (scale)
  • Stimulation and suppression from neighboring
    B-cells
  • Positive suppression (competition), but no
    stimulation: good population control and no
    redundancy, but no memory; the immune network
    will forget past encounters.
  • Positive stimulation, but no suppression: good
    memory but no competition; proliferation of the
    D-W-B-cell population and maximum redundancy.
  • Natural tradeoff between redundancy/memory and
    competition/reduced costs.

37
Divide and Conquer: Compress Immune Network into
K Subnetworks
  • Assume that the network is divided into K
    roughly equal-sized subnetworks (e.g. w/ 2-3
    iterations of K-Means).
  • Then the number of internal interactions in an
    immune network of $N_B$ D-W-B-cells can drop from
    $N_B^2$ in the uncompressed network to $(N_B/K)^2$
    intra-subnetwork interactions and $(K-1)$
    inter-subnetwork interactions in the compressed
    immune network.
  • Can approach linear complexity as $K \to \sqrt{N_B}$
  • Significant savings in computation
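
A quick arithmetic check of the counts on this slide, taking the stated expressions at face value. The population size of 10,000 cells is an arbitrary example, not a figure from the experiments.

```python
import math

def interactions(n_cells, k=1):
    """Interaction count per the slide: (N_B / K)^2 intra plus (K - 1) inter."""
    return (n_cells / k) ** 2 + (k - 1)

n_b = 10_000
k = math.isqrt(n_b)                    # K = sqrt(N_B) = 100
uncompressed = interactions(n_b)       # N_B^2
compressed = interactions(n_b, k)      # N_B + sqrt(N_B) - 1
```

With K near the square root of N_B, the intra-subnetwork term equals N_B and the inter-subnetwork term is only about the square root of N_B, so the total grows roughly linearly in the number of cells.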

38
Internal and External Immune Interactions:
Before and After Compression
39
[Flowchart of the learning loop. Recoverable labels: Start/Reset;
"Activates subNet?" (Yes/No); "Outlier?" (Yes/No);
"B-cells > MaxLimit?" (Yes/No); Secondary storage;
ImmuNet Stats and Visualization; side inputs: Memory Constraints,
Domain Knowledge Constraints]
40
Immune Based Learning of Web profiles
  • The Web server plays the role of the human body,
    and the incoming requests play the role of
    antigens that need to be detected
  • The input data is similar to web log data (a
    record of all files/URLs accessed by users on a
    Web site)
  • The data is pre-processed to produce session
    lists
  • A session list Si for user i is a list of URLs
    visited by same user
  • In discovery mode, a session is fed to the
    learning system as soon as it is available
  • B-cell i: the i-th candidate profile
  • List of URLs
  • Each profile has its own influence zone defined
    by σi

44
Single-Pass Results (location and scale) on a Noisy
dataset presented one at a time in the same order
as clusters
[Figure panels: 350 samples, 1125 samples, 1925 samples,
all 3200 samples]
45
Ability to distinguish between core and outlier
points (w_ij < 0.001) for the noisy data set
presented in different orders
[Figure panels: cluster 1 to cluster 5, cluster 5 to cluster 1,
random order]
  • Blank areas: first pts used to start the network
  • Most recent noise pts (depending on order) are
    shown in black: their fate is still uncertain
    until future data confirms it

47
Simulation Scenarios for Tracking Evolving Web
Usage Trends
  • Scenario 1: We used 20 profiles previously
    discovered using Hierarchical Unsupervised Niche
    Clustering (HUNC) to partition the Web sessions
    into 20 distinct sets of sessions, each one
    assigned to the closest profile. Then we
    presented these sessions to the immune clustering
    algorithm one usage trend at a time, from trend 0
    to 19. That is, we first present the sessions
    assigned to trend 0, then the sessions assigned
    to trend 1, etc.
  • Scenario 2: We used the same pre-partitioned
    session data set as in the previous scenario, but
    presented the trends in reverse order, from
    trend 19 to 0. That is, we first present the
    sessions assigned to trend 19, then the sessions
    assigned to trend 18, etc., ending with
    sessions from trend 0.
  • Scenario 3: Natural chronological order, exactly
    as the sessions were received in real time by the
    web server.

48
Distribution of input sessions
  • Trends 5, 9, 13, 14, 15, and 19 appear to be
    weaker and noisier.
  • Also trends 6 and 7 emerge late in the 12-day
    access log, while trend 0 weakens in the last days

49
Noise sessions
50
Evaluation Method
  • In each scenario, we track the actual composition
    of the B cells in the immune network, i.e., the
    URLs present in the B cell (profile).
  • We also track the number of B cells that succeed
    in learning each of the 20 ground truth profiles
    after each session is presented,
  • by computing an evolving number of hits per
    expected usage trend: the number of B-cells within
    a radius of 0.4 of the ground truth profile.
  • Distance is computed as (1 - cosine similarity)^2.
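
The hit count above can be sketched as follows, assuming B-cell profiles and ground-truth profiles are both treated as plain sets of URLs and that the 0.4 radius applies directly to the squared distance. The sample profiles are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two URL sets (binary vectors)."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def distance(a, b):
    """Distance as stated on the slide: (1 - cosine similarity) squared."""
    return (1.0 - cosine(a, b)) ** 2

def count_hits(b_cells, truth_profile, radius=0.4):
    """Number of B-cells lying within the radius of a ground-truth profile."""
    return sum(1 for cell in b_cells if distance(cell, truth_profile) <= radius)

truth = {"/a", "/b", "/c", "/d"}
cells = [{"/a", "/b", "/c"}, {"/x", "/y"}, {"/a", "/x"}]
hits = count_hits(cells, truth)  # only the first cell is close enough
```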

51
Hits per usage trend vs. time
52
Hits per usage trend vs. time: scenario 1 (from
trend 0 to 19)
53
Hits per usage trend vs. time: scenario 2
(reverse order)
54
Hits per usage trend vs. time: scenario 3
(chronological)
55
If centroids of compressed sub-networks are
allowed to clone
56
Hits Based on High Precision and Coverage
B-cells form a faithful synopsis of input usage
data in 1 pass.
Compare to Input Data (below)
57
Suitability for Real Time Web Mining
  • Single pass over all 1704 Web user sessions
    (non-optimized Java code): < 7 seconds on a 2 GHz
    Pentium IV PC running Linux.
  • Ability to learn an unknown number of evolving
    profiles in real time: an average of 4 milliseconds
    per user session → suitable for real time
    personalization.
  • Old profiles can be handled in a variety of ways:
    they may be discarded, moved to secondary
    storage, or cached for possible re-emergence.
  • Even if discarded, older profiles that re-emerge
    later would be re-learned from scratch.
  • Logistics of maintaining old profiles are not
    crucial.
  • Used the same technique successfully to track and
    learn evolving topic categories/clusters in text
    data

59
Recommender Systems: Motivations
  • The move from traditional commerce → e-commerce
  • Space-limited physical inventory → Huge virtual
    inventory (Information Overload)
  • Recommender systems
  • Help in navigation (adaptive websites)
  • Enhance e-commerce sales (cross-sell, customer
    loyalty, turning browsers into buyers, etc.)
  • Web request prediction: prefetching, caching,
    load balancing (for single websites and ISPs)
  • E-commerce sites simply cannot survive without
    recommender systems!
  • Same for huge web portals (Yahoo!)
  • Same for huge digital libraries and Web
    information systems
  • On search engines: context awareness to modify
    the query or the order/rank of search results

60
Web Personalization System
[Diagram components: Web Usage Mining, Profiles, Recommendation Engine]
61
Recommender Systems
  • K-Nearest Neighbor based Collaborative
    Filtering: Recommend items preferred by the K
    nearest neighbors (usually top N items)
  • Association Rule based
  • Discover association rules
  • items_L → items_R (support, confidence)
  • At recommendation time: find all rules supported
    by customer (rated items included in items_L)
  • Sort rules by confidence
  • Recommend first N ranked products
  • Challenges: Scalability, sparsity, low coverage
    or precision
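
A small sketch of the association-rule recommendation steps listed above. It assumes a rule "is supported by the customer" when every item on its left-hand side is among the customer's items, which is one common reading; the rule tuples and item names are made up for the example.

```python
def recommend(rules, customer_items, n=3):
    """rules: list of (lhs, rhs, confidence), with lhs and rhs as frozensets."""
    # Keep rules whose left-hand side is fully contained in the customer's items.
    fired = [rule for rule in rules if rule[0] <= customer_items]
    # Sort the fired rules by confidence, highest first.
    fired.sort(key=lambda rule: rule[2], reverse=True)
    ranked = []
    for _lhs, rhs, _conf in fired:
        for item in sorted(rhs):
            if item not in customer_items and item not in ranked:
                ranked.append(item)
    return ranked[:n]  # first N ranked items

rules = [
    (frozenset({"a"}), frozenset({"b"}), 0.6),
    (frozenset({"a", "c"}), frozenset({"d"}), 0.9),
    (frozenset({"z"}), frozenset({"q"}), 0.99),
]
top = recommend(rules, {"a", "c"})
```

The third rule has the highest confidence but never fires, because its left-hand side is not covered by the customer's items. This also hints at the low-coverage challenge: a customer with few rated items fires few rules.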

62
Fuzzy Inference (IF x is A THEN y is B) via fuzzy
relation matrix
  • $R: X \times Y \to [0,1]$ encodes the strength of
    the relationship between x and y
  • $(x, y) \mapsto \mu_R(x, y)$
  • $R: P \times U \to [0,1]$ encodes the strength of
    the relationship between Profile i and URL j
  • $(P_i, url_j) \mapsto \mu_R(P_i, url_j) = P_{ij}$

[Diagram labels. General case: A, $\mu_A(x)$; R; B, $\mu_B(y)$.
Recommendation case: s (session), $\mu_s(i)$; $P_i$ (profiles);
similarity; input membership computation; fuzzy inference;
R (profiles); r (recommendations), $\mu_r(j)$]
63
Fuzzy Recommendation Engine
  • Given the current session s, infer recommendation
    r. Hence the following implication:
  • $s \Rightarrow r$
  • Given the Web user profiles discovered by mining
    the Web logs, the relation R is defined as
    follows:
  • $R_{ik} = p_{ik}$
  • The input fuzzy set is derived from the current
    session x = s by computing the similarity value
    between s, defined in (1), and each profile $p_i$,
    as follows: $\mu_{s_i} = \mu_s(i) = \mathrm{sim}(s, p_i)$
  • The inference procedure concludes the
    recommendation r as the possibility of URL
    relevance via the following composition:
  • $\mu_{r_k} = \mu_r(URL_k) = \mu_s(i) \circ R$

t-norm (intersection/AND, e.g. min)
t-conorm (union/OR, e.g. max)
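
A minimal sketch of this inference with the max-min composition, i.e. min as the t-norm and max as the t-conorm. It assumes profiles are stored as URL-to-weight dictionaries and that the input membership is a cosine similarity between the session's URL set and each profile; the profile contents are invented.

```python
import math

def cosine(session, profile):
    """Cosine similarity between a URL set and a weighted profile {url: weight}."""
    dot = sum(w for url, w in profile.items() if url in session)
    norm = math.sqrt(len(session)) * math.sqrt(sum(w * w for w in profile.values()))
    return dot / norm if norm else 0.0

def fuzzy_recommend(session, profiles):
    """Max-min composition: mu_r(k) = max over i of min(mu_s(i), R[i][k])."""
    mu_s = [cosine(session, p) for p in profiles]       # input membership
    urls = {u for p in profiles for u in p}
    return {
        u: max(min(m, p.get(u, 0.0)) for m, p in zip(mu_s, profiles))
        for u in urls
    }

profiles = [{"/a": 1.0, "/b": 0.5}, {"/c": 1.0}]
scores = fuzzy_recommend({"/a"}, profiles)
```

Swapping `min` or `max` for another t-norm or t-conorm (for example a bounded sum) changes only the dictionary comprehension in `fuzzy_recommend`.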
65
Simulation Experiments
  • Given the prediscovered profiles and a set of
    Web sessions extracted from the same Web log
    file, we treat every complete session as a
    ground-truth session.
  • For each such ground-truth session, all possible
    subsets of this session consisting of between 1
    and 9 URLs are considered as current test
    subsessions,
  • recommendations are generated for each test
    subsession, and
  • coverage and precision measures are averaged for
    each subsession size,
  • roughly 380,000 separate recommendation tests,
    each tested using 16 different recommendation
    scenarios, corresponding to different
    combinations of input membership generation,
    composition, etc.
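
Enumerating the test subsessions can be sketched with `itertools.combinations`. This assumes subsessions are unordered subsets of the distinct URLs in a session; the slide does not say whether the full session itself is excluded, so this sketch keeps every size from 1 up to 9.

```python
from itertools import combinations

def subsessions(session, min_size=1, max_size=9):
    """Yield every subset of the session's distinct URLs within the size range."""
    urls = sorted(set(session))
    for size in range(min_size, min(max_size, len(urls)) + 1):
        for combo in combinations(urls, size):
            yield frozenset(combo)

tests = list(subsessions(["/a", "/b", "/c"]))
```

The number of subsets grows combinatorially with session length, which is consistent with a few thousand sessions producing hundreds of thousands of tests.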

66
Tested Recommendation Scenarios
  • 2 types of similarities: cosine or Web session
    (denoted as Cosine and WS respectively in the
    plots).
  • 2 different compositions were tested: Max-Min and
    Bounded Sum-Min (denoted as MM and BS in the
    plots).
  • 2 different types of profiles were tested:
  • raw profiles, which generate real similarities
    (denoted as Real Cosine and Real WS in plots), and
  • crisp α-cuts with α = 0.2 (denoted as Binary
    Cosine - .2 Thresholded Profile or Binary WS - .2
    Thresholded Profile in plots; only tested
    for max-min composition).
  • Either raw recommendations or crisp α-cuts with
    α = 0.001, 0.2, and 0.3 (denoted Type of Input
    Similarity - α Thresholded To Bin in plots).

67
Evaluation Measures
  • The actual completed session, $s_T$, is treated
    as ground truth,
  • a subset of this session is treated as the
    incomplete current sub-session, $s_j$,
  • $r_j$: fuzzy recommendations
  • Then $r_j^* = r_j - s_j$: recommendations obtained
    after omitting all URLs that are part of the
    current subsession.
  • Also, $s_j^* = s_T - s_j$: ground truth URLs, not
    including the ones in the current subsession
    being processed for recommendations.
  • Precision is given by
  • Coverage is given by
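
The precision and coverage formulas are missing from this transcript. The sketch below assumes the usual set-based definitions: both measures count the recommended URLs that appear in the remaining ground truth, divided by the number of recommendations (precision) or by the number of remaining ground-truth URLs (coverage). F1 is their harmonic mean. The function name and sample URLs are illustrative.

```python
def evaluate(recommended, subsession, full_session):
    """Return (precision, coverage, f1) for one test subsession."""
    r = set(recommended) - set(subsession)      # drop URLs already visited
    s = set(full_session) - set(subsession)     # ground truth still to come
    hits = len(r & s)
    precision = hits / len(r) if r else 0.0
    coverage = hits / len(s) if s else 0.0
    f1 = 2 * precision * coverage / (precision + coverage) if hits else 0.0
    return precision, coverage, f1

precision, coverage, f1 = evaluate(
    recommended={"/a", "/b", "/c", "/d"},
    subsession={"/a"},
    full_session={"/a", "/b", "/e"},
)
```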

68
Coverage Comparison: Fuzzy vs. Nearest-Profile vs.
K-NN
69
Precision: Fuzzy (better for longer sessions) vs.
Nearest Profile (middle performance) vs. K-NN
70
F1 Measure: Fuzzy (better for longer sessions)
vs. Nearest Profile vs. K-NN
71
Discussion of Results: Comparison with Nearest
Profile and K-NN
  • Fuzzy Recommendations have better Precision at
    larger session sizes (starting at 3 to 4 URLs)
  • K-NN has the highest Precision at small session
    sizes (< 4)
  • However, Nearest-Profile performs midway between
    K-NN and Fuzzy at small session sizes (< 4)
  • Fuzzy Recommendations have superior Coverage
    regardless of session length
  • Overall F1 measure: Fuzzy Recommendations are
    better when the session size is larger, and
    Nearest-Profile performs midway between K-NN and
    Fuzzy at small session sizes (< 4)
  • Best to combine Nearest-Profile and Fuzzy, and
    alternate between Max-Min and Bounded-Sum
    depending on session length!

72
Time and Memory Complexity
  • Offline training takes longer with the profile
    based approach.
  • However this can be done on a back end computer
    and not on the server, and is therefore an
    offline process that does not affect the
    operation of the web server.
  • On the other hand, both fuzzy and nearest profile
    based approaches are extremely fast at
    recommendation time and require a minimal amount
    of main memory to function (a mere summary of the
    previous usage history instead of the entire
    history, as in collaborative filtering).
  • In our simulations, fuzzy recommendations with
    non-optimized Perl script (non-compiled) code
    running on a 2 GHz Pentium 4 Linux PC operated at
    an average of 48 recommendations per second.
  • The far more computationally complex K-Nearest
    Neighbor generated recommendations at a leisurely
    2 recommendations per second.

73
Two-Step Recommendation Process based on a
Committee of Profile-Specific URL-Predictor
Neural Networks
74
Average precision, coverage, F1, cosine for the
Two-Step Profile-Specific URL-Predictor
Recommender model
76
Conclusion And Future Prospects
  • Ill-Posed Feature Representation problems,
    Subjective dissimilarity between Web sessions and
    subjective multi-modal fitness functions handled
    well using Evolutionary computation
  • Why Evolutionary computation?
  • Deals w/ ill-Posed Feature Representation
    problems,
  • Handles subjective dissimilarity between Web
    sessions
  • Handles subjective multi-modal fitness functions
  • Insensitive to initialization
  • HUNC: first evolutionary based technique for Web
    usage mining
  • Why NOT Evolutionary computation?
  • Scalability problems
  • Immune System Clustering: scalable and flexible
    in the face of huge dynamic web usage access
    trends

77
Conclusion And Future Prospects
  • Immune System Clustering: also used successfully
    to extract topics/cluster text documents in 1
    pass.
  • Also used successfully to cluster and perform
    anomaly detection on kdd-cup99 network data in 1
    pass (unsupervised learning using only the normal
    data with no labels)
  • Results in both attack detection and false alarms
    superior to the best results (kdd-cup winner:
    supervised learning trained with data labeled as
    normal and all different attacks)
  • A completely automated real-time web
    personalization system is feasible
  • Soft computing techniques for intelligent
    recommender systems
  • Web usage mining + Fuzzy Inference Based: better
    for longer sessions (increased uncertainty and
    noise)
  • Web usage mining + Two Stage Neural Network:
    unprecedented performance, but slow in training

78
Impact on WWW
  • Scalable and adaptive Recommendation engine for
    personalization
  • Improve search results by taking profile/context
    into account
  • Improve design of dynamic Web sites
  • Facilitate navigation

79
Some Related Publications
  • O. Nasraoui, R. Krishnapuram, and A. Joshi.
    Mining Web Access Logs Using a Relational
    Clustering Algorithm Based on a Robust Estimator,
    8th International World Wide Web Conference,
    Toronto, pp. 40-41, 1999.
  • O. Nasraoui, and R. Krishnapuram, A Novel
    Approach to Unsupervised Robust Clustering using
    Genetic Niching, Proc. of the 9th IEEE
    International Conf. on Fuzzy Systems, San
    Antonio, TX, May 2000, pp. 170-175.
  • O. Nasraoui and R. Krishnapuram. A New
    Evolutionary Approach to Web Usage and Context
    Sensitive Associations Mining, International
    Journal on Computational Intelligence and
    Applications - Special Issue on Internet
    Intelligent Systems, Vol. 2, No. 3, pp. 339-348,
    Sep. 2002.
  • Nasraoui O., Cardona C., Rojas C., and Gonzalez
    F., "TECNO-STREAMS: Tracking Evolving Clusters in
    Noisy Data Streams with a Scalable Immune System
    Learning Model", in Proc. of Third IEEE
    International Conference on Data Mining
    (ICDM'03), Melbourne, FL, November 2003.
  • Nasraoui O., Petenes C., "Combining Web Usage
    Mining and Fuzzy Inference for Website
    Personalization", in Proc. of WebKDD 2003 KDD
    Workshop on Web mining as a Premise to Effective
    and Intelligent Web Applications, Washington DC,
    August 2003, p. 37.
  • Nasraoui O., Gonzalez F., Cardona C., Rojas C.,
    and Dasgupta D., "A Scalable Artificial Immune
    System Model for Dynamic Unsupervised Learning",
    Proc. of the Genetic and Evolutionary Computation
    Conference (GECCO), Chicago, IL, July 2003, p.
    219.
  • Nasraoui O., Cardona C., Rojas C., and Gonzalez
    F., "Mining Evolving User Profiles in Noisy Web
    Clickstream Data with a Scalable Immune System
    Clustering Algorithm", in Proc. of WebKDD 2003
    KDD Workshop on Web mining as a Premise to
    Effective and Intelligent Web Applications,
    Washington DC, August 2003, p. 71.