VIRTUAL PRESENCE - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
VIRTUAL PRESENCE
Authors
Voislav Galic, vgalic_at_bitsyu.net
Dušan Zečević, zdusan_at_softhome.net
Đorđe Đurđević, madcat_at_tesla.rcub.bg.ac.yu
Veljko Milutinovic, vm_at_etf.bg.ac.yu
http://galeb.etf.bg.ac.yu/~vm/tutorial
2
DEFINITION
Virtual presence is a term with various shades
of meaning in different industries, but its
essence remains constant: it is a new tool that
enables some form of telecommunication in which
the individual may substitute their physical
presence with an alternate, typically
electronic, presence
3
SUMMARY
  • Introduction to Virtual Presence
  • Data Mining for Virtual Presence
  • A New Software Paradigm
  • Selected Case Studies
4
INTRODUCTION TO VP
  • Definitions
  • VP applications
  • Psychological aspects

5
DATA MINING FOR VP
  • Why Data Mining?
  • What can Data Mining do?
  • Growing popularity of Data Mining
  • Algorithms

6
SOFTWARE AGENTS
  • A new software paradigm
  • Standardization
  • FIPA specifications
  • Agent management
  • Agent Communication Language

7
GoodNews (CMU)
  • Categorization of financial news articles
  • Co-located phrases
  • Domain Experts
  • Implementation and results

Carnegie Mellon University, Pittsburgh, USA
8
iMatch (MIT)
  • The idea
  • associate MIT students and staff in order to
    ease their cooperation
  • help students find resources they need
  • Implementation
  • advanced, agent-based system architecture
  • Tomorrow?

Massachusetts Institute of Technology, USA
9
Tourist city (ETF)
  • A qualitative step forward in the domain of
    maximization of customer satisfaction
  • Technologies
  • Data Mining
  • Software Agents (mobile)

Faculty of Electrical Engineering, University
of Belgrade, Serbia and Montenegro
10
CONCLUSION
  • This tutorial will attempt to familiarize you
    with
  • The concept of VP (Virtual Presence) as a
    new technological challenge
  • The new paradigms and technologies that will
    bring the VP to everyday life
  • Data Mining
  • Software Agents

11
INTRODUCTION
  • Virtual presence will arguably be one of the
    most important aspects of personal communication
    in the twenty-first century

12
Essence of VP
  • The usefulness and reliability of virtual
    presence
  • The ability to conduct everyday tasks by being
    virtually or electronically present

13
How to Accomplish it?
  • The presence is accomplished through the
    Internet, video, or other communications,
    perhaps even psychically one day
  • Technological advances will make virtual
    presence more sophisticated, altering the very
    meaning of the word "presence"

14
VP Applications
  • VP in government
  • Sunshine laws
  • Voting

15
VP Applications
  • VP in business
  • Online board meetings
  • Shareholder voting online

16
VP Applications
  • VP in education
  • interactive lectures and courses

17
VP Applications
  • VP in medicine
  • Telemedicine
  • Diagnostics
  • Remote surgery
  • Risks
  • Privacy

18
VP Applications
  • VP in everyday life
  • Telecommuting/Telework
  • Software agents as our virtual shadows

19
Psychological Aspects
  • Cyberspace and Mind
  • Presence in Virtual Space
  • Communal Mind and Virtual Community

20
DATA MINING
  • Knowledge discovery is a non-trivial process of
    identifying valid, novel, potentially useful, and
    ultimately understandable patterns in data

21
Many Definitions
  • Data mining is also called data or knowledge
    discovery
  • It is a process of inferring knowledge from
    large oceans of data
  • Search for valuable information in large volumes
    of data
  • Analyzing data from different perspectives and
    summarizing it into useful information

22
Why Data Mining?
  • DM allows you to extract knowledge from
    historical data and predict outcomes of future
    situations
  • Optimize business decisions and improve
    customer satisfaction with your services
  • Analyze data from many different angles,
    categorize it, and summarize the relationships
    identified
  • Reveal knowledge hidden in data and turn this
    knowledge into a crucial competitive advantage

23
What Can Data Mining Do?
  • Identify your best prospects and then retain
    them as customers
  • Predict cross-sell opportunities and make
    recommendations
  • Learn parameters influencing trends in sales and
    margins
  • Segment markets and personalize communications
  • etc.

24
The Power of Data Mining
  • Having a database is one thing, making sense of
    it is quite another
  • It does not rely on narrow human queries to
    produce results, but instead uses AI related
    technology and algorithms
  • Inductive reasoning
  • Using more than one type of algorithm to search
    for patterns in data
  • Data mining usually produces more general (more
    powerful) results than those obtained by
    traditional techniques
  • Relational DB storage and management technology
    is adequate for data mining applications smaller
    than 50 gigabytes

25
Reasons for the Growing Popularity of Data Mining
  • Growing Data Volume
  • Low Cost of Machine Learning
  • Limitations of Human Analysis

26
Tasks Solved by Data Mining
  • Predicting
  • Classification
  • Detection of relations
  • Explicit modeling
  • Clustering
  • Market basket analysis
  • Deviation detection

27
Algorithms
  • Generally, their complexity is around n·log(n)
    (n is the number of records)
  • Data mining includes three major components,
    with corresponding algorithms
  • Clustering (Classification)
  • Association Rules
  • Sequential Analysis

28
Classification Algorithms
  • The aim is to develop a description or model for
    each class in a database, based on the features
    present in a set of class-labeled training
    data
  • Data Classification Methods
  • Statistical algorithms
  • Neural networks
  • Genetic algorithms
  • Nearest neighbor method
  • Rule induction
  • Data visualization

29
Classification-rule Learning
  • Data abstraction
  • Classification-rule learning: finding rules or
    decision trees that partition given data into
    predefined classes
  • Hunt's method
  • Decision tree building algorithms
  • ID3 / C4.5 algorithm
  • SLIQ / SPRINT algorithm (IBM)
  • Other algorithms

30
Parallel Algorithms
  • Basic idea: N training data items are randomly
    distributed to P processors. All the processors
    cooperate to expand the root node of the
    decision tree
  • There are two approaches for future progress
    (the remaining nodes)
  • Synchronous approach
  • Partitioned approach

31
Association Rule Algorithms
  • An association rule implies a certain association
    relationship among a set of objects in a
    database
  • These objects occur together, or one implies
    the other
  • Formally: X ⇒ Y, where X and Y are sets of items
    (itemsets)
  • Key terms
  • Confidence
  • Support
  • The goal: find all association rules that
    satisfy user-specified minimum support and
    minimum confidence constraints.
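The support and confidence terms above can be sketched directly in Python (a minimal sketch; the market-basket transactions below are invented for illustration):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X => Y) = support(X union Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

# hypothetical market-basket data
D = [frozenset(t) for t in ({'bread', 'milk'}, {'bread', 'butter'},
                            {'bread', 'milk', 'butter'}, {'milk'})]
X, Y = frozenset({'bread'}), frozenset({'milk'})
print(support(X | Y, D))      # → 0.5
print(confidence(X, Y, D))
```

A rule X ⇒ Y is kept only when both values exceed the user-specified minimums.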

32
Association Rule Algorithms
  • Apriori algorithm and its variations
  • AprioriTid
  • AprioriHybrid
  • FT (Fault-tolerant) Apriori
  • Distributed / Parallel algorithms (FDM, ...)

33
Sequential Analysis
  • Sequential Patterns
  • The problem: finding all sequential patterns
    with user-specified minimum support
  • Elements of a sequential pattern need not be
  • consecutive
  • simple items
  • Algorithms for finding sequential patterns
  • count-all algorithms
  • count-some algorithms
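A count-all style support check for the simplest case, where pattern elements are single items and need not be consecutive, can be sketched as follows (the event sequences are invented for illustration):

```python
def is_subsequence(pattern, sequence):
    """True if pattern occurs in order, though not necessarily
    consecutively, within the sequence."""
    it = iter(sequence)
    return all(x in it for x in pattern)  # each `in` consumes the iterator

def support(pattern, sequences):
    """Count-all support: fraction of sequences containing the pattern."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)

# hypothetical customer event sequences
seqs = [['login', 'browse', 'buy'], ['login', 'buy'], ['browse', 'logout']]
print(support(['login', 'buy'], seqs))  # 2 of the 3 sequences contain it
```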

34
Conclusion
  • Drawbacks of existing algorithms
  • Data size
  • Data noise
  • There are two critical technological drivers
  • Size of the database
  • Query complexity
  • The infrastructure has to be significantly
    enhanced to support larger applications
  • Solutions
  • Adding extensive indexing capabilities
  • Using new HW architectures to achieve
    improvements in query time

35
THE NEW SOFTWARE PARADIGM
  • All software agents are programs, but not all
    programs are agents

36
Many Definitions
  • Computational systems that inhabit some dynamic
    environment, sense and act autonomously and
    realize a set of goals or tasks for which they
    are designed
  • Hardware or (more usually) software-based
    computer system that enjoys the following
    properties

- Reactive (sensing and acting)
- Autonomous
- Goal-oriented (pro-active, purposeful)
- Temporally continuous
- Communicative (socially able)
- Learning (adaptive)
- Mobile
- Flexible
- Character
37
Interesting Topic of Study
  • They draw on and integrate many diverse
    disciplines of computer science and other areas
  • objects and distributed object architectures
  • adaptive learning systems
  • artificial intelligence and expert systems
  • collaborative online social environments
  • security
  • knowledge based systems, databases
  • communications networks
  • cognitive science and psychology

38
What Problems do Agents Solve ?
  • Client/server network bandwidth problem
  • In the design of a client/server architecture
  • The problems created by intermittent or
    unreliable network connections
  • Attempts to get computers to do real thinking for
    us

39
The New Software Paradigm
  • Unless special care has been taken in the design
    of the code, two software programs cannot
    interoperate
  • The promise of agent technology is to move the
    burden of interoperability from software
    programmers to programs themselves
  • This can happen if two conditions are met
  • A common language (Agent Communication Language
    ACL)
  • An appropriate architecture

40
The Need for Standards
  • Anywhere, anytime consumer access to the
    Universal bouquet of information and services is
    the new goal of the information revolution
  • The scope of Internet standards makes the scope
    of choices extreme
  • The Foundation for Intelligent Physical Agents
    (FIPA), established in 1996 in Geneva
  • international non-profit association of companies
    and organizations
  • specifications of generic agent technologies.

41
FIPA Specifications
  • Agent Management
  • Agent Communication Language
  • Agent/Software Integration
  • Agent Management Support for Mobility
  • Human-Agent Interaction
  • Agent Security Management
  • Agent Naming
  • FIPA Architecture
  • Agent Message Transport
  • etc.

42
Agent Management
  • Provides the normative framework within which
    FIPA agents exist and operate
  • Establishes the logical reference model for the
    creation, registration, location, communication,
    migration and retirement of agents
  • The entities contained in the reference model are
    logical capability sets and do not imply any
    physical configuration
  • Additionally, the implementation details of
    individual APs and agents are the design choices
    of the individual agent system developers

43
Components of the Model
  • Agent
  • computational process
  • fundamental actor on an AP
  • as a physical software process, has a life
    cycle that has to be managed by the AP
  • Directory Facilitator
  • yellow pages to other agents
  • supported functions are
  • register
  • deregister
  • modify
  • search
  • Agent Management System
  • white pages services to other agents
  • maintains a directory of AIDs, which contain
    transport addresses
  • supported functions are
  • register
  • deregister
  • modify
  • search
  • get-description
  • operations for the underlying AP
  • Message Transport Service
  • communication method between agents
  • Agent Platform
  • physical infrastructure in which agents can be
    deployed
  • Software
  • all non-agent, executable collections of
    instructions accessible through an agent
44
Agent Life Cycle
  • FIPA agents exist physically on an AP and utilize
    the facilities offered by the AP for realising
    their functionalities
  • In this context, an agent, as a physical software
    process, has a physical life cycle that has to
    be managed by the AP

The state transitions of agents can be described
as: create, invoke, destroy, quit, suspend,
resume, wait, wake up, move, execute
45
Agent Communication Language
  • The specification consists of a set of message
    types and the description of their meanings
  • Requirements
  • Implementing a subset of the pre-defined message
    types and protocols
  • Sending and receiving the not-understood message
  • Correct implementation of communicative acts
    defined in the specification
  • Freedom to use communicative acts with other
    names, not defined in the specification
  • Obligation of correctly generating messages in
    the transport form
  • Language must be able to express propositions,
    objects and actions
  • The use of Agent Management Content Language and
    ontology

46
ACL Syntax Elements
  • Pre-defined message parameters

:sender :receiver :content :reply-with
:in-reply-to :envelope :language :ontology
:reply-by :protocol :conversation-id
  • Communicative acts

accept-proposal agree cancel cfp confirm
disconfirm failure inform inform-if inform-ref
not-understood propose query-if query-ref refuse
reject-proposal request request-when
request-whenever subscribe
47
Communication Examples
  • Agent i confirms to agent j that it is,
    in fact, true that it is snowing today
    (confirm
      :sender i
      :receiver j
      :content "weather(today, snowing)"
      :language Prolog)
  • Agent i asks agent j if j is registered with
    domain server d1
    (query-if
      :sender i
      :receiver j
      :content (registered (server d1) (agent j))
      :reply-with r09)
    ...
    (inform
      :sender j
      :receiver i
      :content (not (registered (server d1) (agent j)))
      :in-reply-to r09)
  • Agent j replies that it can reserve trains,
    planes and automobiles
    (inform
      :sender j
      :receiver i
      :content
        ((iota ?x (available-services j ?x))
         ((reserve-ticket train)
          (reserve-ticket plane)
          (reserve automobile))))
  • Agent i, believing that agent j thinks that a
    shark is a mammal, attempts to change j's belief
    (disconfirm
      :sender i
      :receiver j
      :content (mammal shark))
  • Agent j refuses to reserve a ticket for agent i,
    since there are insufficient funds in i's account
    (refuse
      :sender j
      :receiver i
      :content
        ((action j (reserve-ticket LHR, MUC, 27-sept-97))
         (insufficient-funds ac12345))
      :language sl)
  • Auction bid
    (inform
      :sender agent_X
      :receiver auction_server_Y
      :content (price (bid good02) 150)
      :in-reply-to round-4
      :reply-with bid04
      :language sl
      :ontology auction)
  • Agent i did not understand a query-if message
    because it did not recognize the ontology
    (not-understood
      :sender i
      :receiver j
      :content ((query-if :sender j :receiver i ...)
                (unknown (ontology www)))
      :language sl)
  • Agent i asks agent j for its available
    services
    (query-ref
      :sender i
      :receiver j
      :content (iota ?x (available-services j ?x)))
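Messages of this shape can be assembled programmatically; the `acl` helper below is hypothetical (not part of any FIPA library), and simply renders keyword parameters in the s-expression form shown above:

```python
def acl(performative, **params):
    """Render a FIPA-ACL-style message as an s-expression string.
    Parameter names follow the pre-defined list (:sender, :receiver, ...);
    underscores map to hyphens (reply_with -> :reply-with)."""
    body = ' '.join(f':{k.replace("_", "-")} {v}' for k, v in params.items())
    return f'({performative} {body})'

msg = acl('confirm', sender='i', receiver='j',
          content='"weather(today, snowing)"', language='Prolog')
print(msg)
# → (confirm :sender i :receiver j :content "weather(today, snowing)" :language Prolog)
```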
48
Agent/Software Integration
  • Integration of services provided by non-agent
    software into a multi-agent community
  • Definition of the relationship between agents
    and software systems
  • Allowing agents to describe, broker and negotiate
    over software systems
  • Allowing new software services to be dynamically
    introduced into an agent community
  • Defining how software resources can be described,
    shared and dynamically controlled in an agent
    community

49
New Agent Roles
  • To support this specification, two new agent
    roles have been identified
  • Agent Resource Broker (ARB)
  • WRAPPER Agent

50
GoodNews
  • A system that automatically categorizes news
    reports that reflect positively or negatively on
    a company's financial outlook

51
Introduction
  • Correlation between news reports on a company's
    financial outlook and its attractiveness as an
    investment
  • Volume of such reports is huge
  • A new text classification algorithm: Domain
    Experts with self-confident sampling
    technique
  • Two types of data
  • (Human-)labeled
  • Unlabeled
  • The algorithm classifies financial news into
    five predefined categories
  • (good), (good, uncertain), (neutral), (bad,
    uncertain), (bad)

52
Introduction
  • Text categorization task
  • FCP (Frequently Co-located Phrase): the building
    element for the categorization algorithm
  • Text categorization: a very difficult domain for
    the use of machine learning
  • Very large number of input features
  • High level of attribute and class noise
  • Large percentage of irrelevant features
  • Very expensive labeled data, while unlabeled
    data are cheaply available

53
Categorization
  • The algorithm categorizes each given news article
    into the predefined categories in terms of the
    referred company's financial well-being
  • GOOD: strong and explicit evidence of the
    company's financial status
  • "shares of ABC company rose 2 percent to
    24-15/16"
  • GOOD, UNCERTAIN: predictions and forecasts of
    future profitability
  • "ABC company predicts fourth-quarter earnings
    will be high"

54
Categorization
  • NEUTRAL: nothing is mentioned about the
    financial well-being of the company
  • "ABC announced plans to focus on products based
    on recycled materials"
  • BAD, UNCERTAIN: predictions of future losses
  • "ABC announced today that fourth-quarter results
    could fall short of expectations"
  • BAD: explicitly bad evidence
  • "shares of ABC fell 0.57 to 44.65 in early NY
    trading"
  • Problems with construction of the training (i.e.
    labeled) data set: inter-indexer inconsistency

55
Co-located Phrase
  • The proposed algorithm labels the unlabeled
    news articles through a voting process among
    experts that are FCPs
  • Definition: a co-located phrase is a sequence of
    nearby, but not necessarily consecutive, words
  • "shares of ABC rose 8.5" → (shares, rose): GOOD
  • "ABC presented its new product" → (present,
    product): NEUTRAL
  • Contextual information
  • The use of heuristics to cope with the enormous
    phrase space (amount of possible phrases)
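A toy sketch of the voting idea: hypothetical FCP experts each predict a label, and the article gets the majority vote among the experts that fire. The phrases and labels below are invented for illustration; in GoodNews the experts are learned from labeled data:

```python
from collections import Counter

# hypothetical "domain experts": an FCP and the label it predicts
experts = {
    ('shares', 'rose'): 'GOOD',
    ('shares', 'fell'): 'BAD',
    ('predicts', 'earnings'): 'GOOD_UNCERTAIN',
    ('present', 'product'): 'NEUTRAL',
}

def classify(article):
    """Majority vote among experts whose phrase occurs, in order but
    not necessarily consecutively, in the article text."""
    words = article.lower().split()
    votes = Counter()
    for phrase, label in experts.items():
        it = iter(words)
        if all(w in it for w in phrase):  # in-order subsequence match
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else 'NEUTRAL'

print(classify("Shares of ABC rose 8.5 percent today"))  # → GOOD
```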

56
Naive-Bayes vs. Domain Experts
  • Naive-Bayes with EM (Expectation Maximization)
  • Problems with small sets of labeled (training)
    data
  • EM (Expectation Maximization): a class of
    iterative algorithms for maximum likelihood
    estimation in problems with incomplete data
  • The Domain Experts algorithm is able to deal with
    inconsistent hypotheses
  • Iterative building of the training set

57
Implementation and Results
  • The experiment focused on two performance
    criteria
  • Using unlabeled data for improving categorization
    accuracy
  • The categorization itself
  • The accuracy is around 75% (total of 2,000 news
    articles)
  • Comparison of a few different methods (picture)

58
Conclusions
  • Domain Experts with SC sampling outperform Naive
    Bayes with EM
  • The collocation property and vote entropy are
    appropriate to such a domain
  • The accuracy of around 75% is the limit with the
    techniques used
  • Better performance could be achieved by using
    some natural language processing techniques
  • Such techniques are rather rudimentary today

59
iMatch
  • The vision: each MIT student
  • having a personal software agent,
  • which helps to manage its owner's academic life

60
Introduction
  • The aim: bring together MIT students and staff
    who may usefully collaborate with each other
  • This collaboration can have several goals
  • completing final projects
  • studying for exams
  • tutoring one another
  • iMATCH agents are supposed to facilitate students
    and faculty matching for
  • Research
  • Teaching
  • Internship
  • opportunities within and across campuses

61
iMatch Agent Architecture
  • iMatch agents are situated within an environment
  • Sensors of the agent convert environmental
    inputs into representations that can be
    manipulated within the agent
  • Effectors translate actions planned by the
    agent into executable statements for the
    environment
  • The action planner selects the action with the
    highest utility according to the owner's
    preference specification

62
Impacts and Benefits
  • MIT
  • Benefit MIT students by matching them to
    appropriate resources
  • Aid the recruitment of student researchers
  • Help students manage their lives
  • Use iMATCH in Medical Computing
  • GLOBAL
  • Facilitate Cross Community Collaboration

63
Research Topics
  • Knowledge representation
  • preference specification
  • Multi-agents systems
  • reputation management system
  • static interest matching
  • dynamic interest matching
  • Infrastructure
  • distributed security infrastructure

64
Ceteris Paribus Preference
  • Ceteris paribus relations express a preference
    over sets of possible outcomes
  • All possible outcomes are considered to be
    describable by some (large) set of binary
    features (true or false)
  • The specified features are instantiated to either
    true or false
  • Other features are ignored
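A minimal sketch of evaluating one ceteris-paribus rule over outcomes described by binary features. The rule format, feature names, and outcomes below are invented for illustration; the slides do not specify iMatch's actual representation:

```python
def prefers(rule, a, b):
    """True if outcome a is preferred to b under the ceteris-paribus
    rule: the specified features take their desired values in a but
    not in b, and all ignored features are equal between a and b.
    rule maps feature -> desired bool; a, b map feature -> bool."""
    others = set(a) - set(rule)
    if any(a[f] != b[f] for f in others):   # not "all else equal"
        return False
    return (all(a[f] == v for f, v in rule.items())
            and any(b[f] != v for f, v in rule.items()))

# hypothetical scheduling preference: morning meetings, ceteris paribus
rule = {'morning': True}
a = {'morning': True,  'on_campus': True}
b = {'morning': False, 'on_campus': True}
print(prefers(rule, a, b))  # → True
```

If the two outcomes differ on an ignored feature, the rule simply does not apply; a full preference order combines many such rules.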

65
CPP Agent Configuration
  • Specify a domain for preference
  • Agent methods of communication and notification
  • Different security settings of different servers
  • Preference statements themselves
  • How to get users to easily adjust C.P. rules
    (graphical interface)
  • Pose hypothetical preference questions to user to
    help complete the preferences of an ambivalent
    user
  • People will only put down their true profile if
    they know that the system is secure

66
Static Interest Matching
  • Group together similar users for specific context
  • This enables viewing a human user as a
    resource for dynamic resource discovery
  • (locate experts, enthusiasts,...)
  • The approach
  • Keyword matching
  • Ontological matching using Kullback-Leibler (KL)
    distance
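The slide names Kullback-Leibler distance as the ontological-matching measure; a minimal sketch of KL divergence over discrete interest profiles (the profiles below are invented for illustration):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) between two discrete
    distributions given as aligned probability lists; terms with
    p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# identical interest profiles diverge by 0; different ones by more
same = kl_divergence([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
diff = kl_divergence([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
print(round(same, 4), round(diff, 4))
```

KL divergence is asymmetric, so a matcher would typically symmetrize it or fix which profile plays the role of P.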

67
Dynamic Interest Matching
  • Location and/or temporal specific resource
    matching
  • As students and their agents move from one
    physical location to another, iMatch services for
    matching the closest resources can be offered
  • The idea: anything worthwhile is locatable
  • The approach
  • Intentional naming scheme
  • Reputation based resource discovery

68
Technology
  • Components
  • Distributed Multi-Agent Infrastructures
  • Ceteris Paribus preference-based Interest
    Matching
  • Reputation Management Infrastructure
  • Technology
  • Microsoft.Net
  • Bluetooth
  • IEEE 802.11
  • Smartcards (PC/SC)
  • INS (Intentional Naming System)

69
Conclusion
  • Benefit MIT students by matching them to
    appropriate resources
  • Static interest matching
  • Group together similar users for specific context
  • This enables viewing a human user as a resource
    for dynamic resource discovery (locate experts,
    enthusiasts,...)
  • Dynamic interest matching
  • Location and/or temporal specific resource
    matching. As students and their agents move from
    one physical location to another, iMatch
    services for matching the closest resources can
    be offered
  • Help students manage their lives

70
The near future
  • The focus of the research is on e-tourism after
    the year 2005, but the applications of the
    proposed infrastructure are multifold

71
Introduction
  • The assumptions
  • after the year 2005, each tourist in Europe will
    be equipped with a cell phone with power equal to
    or better than a Pentium IV
  • whenever a tourism-based service or product is
    purchased, a mobile agent is assigned to that
    cell phone PC, to monitor the behaviour of the
    customer
  • all tourist cell phone PCs create an ad-hoc
    network around the points of touristic
    attraction, and link to a data mine that
    collects all information of interest

72
How to accomplish it?
  • The information of interest is not collected by
    asking the customer to fill out the forms, but by
    monitoring the behaviour of the customer
  • The collected information, stored in the data
    mine, is made available to other tourists, as an
    on-line, owner-independent source of information
    about the given services and/or products

73
What can be done
  • If a tourist would like to know, at that very
    moment, what restaurant has good food/atmosphere
    and happy customers, he/she can access the data
    mine (via the Internet) and obtain the
    information that is linked to that very moment,
    and is not created by the owner of the business,
    but by the customers themselves
  • Accessing the given restaurant's website has two
    drawbacks
  • the information is not fresh (only periodically
    updated)
  • the information is made by the owner of the
    restaurant, and therefore not completely objective

74
Conclusion
  • Consequently, the proposed approach works much
    better, and represents a qualitative step
    forward in the domain of maximization of
    customer satisfaction
  • This may mean that the privacy of the person is
    jeopardized; however, if the monitored behaviour
    is non-personalized, and if the customer obtains
    a discount based on the fact that mobile agents
    are welcome, privacy ceases to be an issue,
    and people will sign up voluntarily

75
Appendix
  • A Survey of the Data Mining Algorithms

76
Apriori Algorithm
  • The task: mining association rules by finding
    large itemsets and translating them to the
    corresponding association rules
  • A ⇒ B, or A1 ∧ A2 ∧ … ∧ Am ⇒ B1 ∧ B2 ∧ … ∧ Bn,
    where A ∩ B = ∅
  • The terminology
  • Confidence
  • Support
  • k-itemset: a set of k items
  • Large itemsets: the large itemset {A, B}
    corresponds to the following rules
    (implications): A ⇒ B and B ⇒ A

77
Apriori Algorithm
  • The ⊗ operator definition
  • n = 1: S2 = S1 ⊗ S1; {A, B, C} ⊗ {A, B, C} =
    {AB, AC, BC}
  • n = k: Sk+1 = Sk ⊗ Sk = { X ∪ Y | X, Y ∈ Sk,
    |X ∩ Y| = k-1 }
  • X and Y must have the same number of elements,
    and must have exactly k-1 identical elements
  • Every k-element subset of any resulting set
    element (an element is actually a (k+1)-element
    set) has to belong to the original set of
    itemsets
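The join-and-prune step described above can be sketched in Python (a minimal sketch; `join` is a hypothetical name for the ⊗ operator with the subset-pruning rule applied):

```python
from itertools import combinations

def join(Lk):
    """The slide's ⊗ operator: combine pairs of k-itemsets sharing
    k-1 items into (k+1)-itemset candidates, then prune candidates
    that have a k-element subset missing from Lk."""
    Lk = {frozenset(s) for s in Lk}
    k = len(next(iter(Lk)))
    candidates = set()
    for X in Lk:
        for Y in Lk:
            if X != Y and len(X & Y) == k - 1:
                candidates.add(X | Y)
    # prune: every k-element subset of a candidate must belong to Lk
    return {c for c in candidates
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L1 = [{'A'}, {'B'}, {'C'}]
print(sorted(''.join(sorted(s)) for s in join(L1)))  # → ['AB', 'AC', 'BC']
```

Applied to L2 = {AC, BC, BE, CE}, the pruning step discards ABC and ACE (their subsets AB and AE are not in L2), leaving only BCE, as in the worked example that follows.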

78
Apriori Algorithm
  • Example

79
Apriori Algorithm
  • Step 1: generate a candidate set of 1-itemsets,
    C1
  • Every possible 1-element set from the database is
    potentially a large itemset, because we don't
    know the number of its appearances in the
    database in advance (a priori)
  • The task adds up to identifying (counting) all
    the different elements in the database; every
    such element forms a 1-element candidate set
  • C1 = {{A}, {B}, {C}, {D}, {E}}
  • Now, we are going to scan the entire database, to
    count the number of appearances for each one of
    these elements (i.e. one-element sets)

80
Apriori Algorithm
  • Now, we are going to scan the entire database, to
    count the number of appearances for each one of
    these elements (i.e. one-element sets)

81
Apriori Algorithm
  • Step 2: generate a set of large 1-itemsets, L1
  • Each element in C1 with support that exceeds some
    adopted minimum support (for example 50%) becomes
    a member of L1
  • L1 = {{A}, {B}, {C}, {E}}, and we can omit D in
    further steps (if D doesn't have enough support
    alone, there is no way it could satisfy the
    requested support in a combination with some
    other element(s))

82
Apriori Algorithm
  • Step 3: generate a candidate set of large
    2-itemsets, C2
  • C2 = L1 ⊗ L1 = {AB, AC, AE, BC, BE, CE}
  • Count the corresponding appearances
  • Step 4: generate a set of large 2-itemsets, L2
  • Eliminate the candidates without minimum
    support
  • L2 = {AC, BC, BE, CE}

83
Apriori Algorithm
  • Step 5 (C3)
  • C3 = L2 ⊗ L2 = {BCE}
  • Why not ABC and ACE? Because their 2-element
    subsets AB and AE are not elements of the
    large 2-itemset set L2 (the calculation is made
    according to the ⊗ operator definition)
  • Step 6 (L3)
  • L3 = {BCE}, since BCE satisfies the required
    support of 50% (two appearances)
  • There can be no further steps in this particular
    case, because L3 ⊗ L3 = ∅
  • Answer: L1 ∪ L2 ∪ L3

84
Apriori Algorithm
  • L1 = {large 1-itemsets}
  • for (k = 2; Lk-1 ≠ ∅; k++)
  • Ck = apriori-gen(Lk-1)
  • forall transactions t ∈ D do begin
  • Ct = subset(Ck, t)
  • forall candidates c ∈ Ct do
  • c.count++
  • end
  • Lk = {c ∈ Ck | c.count ≥ minsup}
  • end
  • Answer = ∪k Lk
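The loop above can be sketched as runnable Python. The four-transaction database below is an assumption (the slides' own example database appears only in a picture); it is chosen because, with 50% minimum support (2 transactions), it reproduces the L1, L2 and L3 sets derived on the surrounding slides:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal sketch of the Apriori loop from the pseudocode above."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1 = {large 1-itemsets}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= minsup}
    answer = set(Lk)
    k = 1
    while Lk:
        # Ck+1 = apriori-gen(Lk): join sets sharing k-1 items, then
        # prune candidates with a k-element subset missing from Lk
        Ck = set()
        for X in Lk:
            for Y in Lk:
                if X != Y and len(X & Y) == k - 1:
                    c = X | Y
                    if all(frozenset(s) in Lk for s in combinations(c, k)):
                        Ck.add(c)
        # database scan: count each candidate, keep those with minsup
        Lk = {c for c in Ck
              if sum(c <= t for t in transactions) >= minsup}
        answer |= Lk
        k += 1
    return answer

# assumed example database (50% minimum support = 2 transactions)
D = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
large = apriori(D, 2)
print(sorted(''.join(sorted(s)) for s in large))
# → ['A', 'AC', 'B', 'BC', 'BCE', 'BE', 'C', 'CE', 'E']
```

The result is exactly L1 ∪ L2 ∪ L3 from the worked example: D is dropped at step 2, and BCE is the only large 3-itemset.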

85
Apriori Algorithm
  • Enhancements to the basic algorithm
  • Scan-reduction
  • The most time-consuming operation in the Apriori
    algorithm is the database scan; it is originally
    performed after each candidate set generation, to
    determine the frequency of each candidate in the
    database
  • Scan number reduction: counting candidates of
    multiple sizes in one pass
  • Rather than counting only candidates of size k in
    the kth pass, we can also calculate the
    candidates C'k+1, where C'k+1 is generated from
    Ck (instead of Lk), using the ⊗ operator

86
Apriori Algorithm
  • Compare C'k+1 = Ck ⊗ Ck with Ck+1 = Lk ⊗ Lk
  • Note that Ck+1 ⊆ C'k+1
  • This variation can pay off in later passes, when
    the cost of counting and keeping in memory the
    additional C'k+1 - Ck+1 candidates becomes less
    than the cost of scanning the database
  • There has to be enough space in main memory for
    both Ck and C'k+1
  • Following this idea, we can make a further scan
    reduction
  • C'k+1 is calculated from C'k for k > 1
  • There must be enough memory space for all C'k (k
    > 1)
  • Consequently, only two database scans need to be
    performed (the first to determine L1, and the
    second to determine all the other Lk's)

87
Apriori Algorithm
  • Abstraction levels
  • Higher level associations are stronger (more
    powerful), but also less certain
  • A good practice would be adopting different
    thresholds for different abstraction levels
    (higher thresholds for higher levels of
    abstraction)

88
DHP Algorithm
  • DHP (Direct Hashing and Pruning): another
    algorithm for mining association rules
  • Based on the Apriori algorithm (Ck/Lk generation
    in the kth step)
  • Empirical analysis of the Apriori algorithm shows
    that candidate sets (Ck) are much larger than the
    corresponding sets of large itemsets (Lk),
    especially in the first few iterations
  • DHP introduces a more efficient candidate set
    generation method
  • The idea is to insert into Ck only those
    candidate sets that are likely to become large
    itemsets

89
DHP Algorithm
  • Additional improvement is accomplished through
    two-dimensional search base reduction:
    length (number of records in the search base)
    and width (number of relevant attributes in a
    record)
  • Large itemset characteristics
  • Every non-empty subset of a large itemset is a
    large itemset as well; for example, BCD ∈ L3 ⇒
    BC, CD, BD ∈ L2
  • It implies that a record is relevant for
    discovering large (k+1)-itemsets only if it
    contains at least k+1 large k-itemsets

90
DHP Algorithm
  • During the Ck → Lk phase we might count large
    k-itemsets in each record; if their number in a
    particular record is less than k+1, we omit that
    record during the Ck+1 generation
  • Similarly, if a record contains one or more large
    (k+1)-itemsets, each element (item) of these
    itemsets appears in, at least, k candidates from
    Ck
  • Hashing
  • Hashing boosts the performance of the DHP
    algorithm
  • The algorithm does not specify any hash function
    in particular; it depends on the application
  • Likewise, it does not specify the size of the
    hash table (number of groups/addresses)

91
DHP Algorithm
  • Application example

92
DHP Algorithm
  • Step 1: generate a candidate set of 1-itemsets,
    C1
  • C1 = {{A}, {B}, {C}, {D}, {E}}
  • Simultaneously with counting each element's
    support, a hash tree is generated that contains
    all the elements from the database, in order to
    improve the counting performance
  • For each new element, DHP checks whether the
    element is already in the tree or not
  • If yes, DHP increments the current number of
    appearances for that element; otherwise, the
    element is added to the hash tree, and the number
    of its appearances is set to 1

93
DHP Algorithm
  • Having counted each C1 element's appearances, all
    possible 2-element subsets are generated and
    inserted into the H2 hash table
  • The address of a particular subset could be
    calculated with respect to the position of its
    elements in the C1 candidate set, using a chosen
    hash function h(x, y)

94
DHP Algorithm
  • For example, let's adopt the following hash
    function: h(x, y) = (posC1(x)·10 + posC1(y))
    mod 7
  • The corresponding H2 hash table is shown below
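Building H2 with this hash function can be sketched directly. The transaction database below is an assumption (the slides' example database appears only in a picture); it is the same four-transaction example that yields the C2 = {AC, BC, BE, CE} shown on the next slide:

```python
from itertools import combinations
from collections import defaultdict

# 1-based positions of the items in C1, per the slide's example
pos = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}

def h(x, y):
    """The slide's hash function: h(x, y) = (pos(x)*10 + pos(y)) mod 7."""
    return (pos[x] * 10 + pos[y]) % 7

# assumed example database; hash every 2-element subset of every
# transaction and bump the weight of the bucket it maps to
D = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
H2 = defaultdict(int)
for t in D:
    for x, y in combinations(sorted(t), 2):
        H2[h(x, y)] += 1
print(dict(H2))  # bucket address -> weight
```

Only pairs landing in buckets whose weight meets the minimum support survive into C2, which is how DHP shrinks the candidate set before any database rescan.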

95
DHP Algorithm
  • Whenever a new element is added to the hash
    table, the weight of the particular address is
    increased by one
  • C2 is generated out of L1 (just like in the
    Apriori case)
  • Besides that, only those elements that map to
    addresses whose weight is greater than or equal
    to the specified minimum support (let the minimum
    support be 50%) will be taken into consideration
    during the C2 generation
  • C2 = {AC, BC, BE, CE}
  • It contains two elements fewer (!) than the C2
    set generated by the Apriori algorithm for the
    same example database

96
DHP Algorithm
  • In general, the Hk hash table is used for the Ck
    candidate set generation in the kth step of the
    algorithm; Hk is created in the previous (k-1)th
    step
  • Each address of the Hk hash table contains a
    number of k-element subsets as elements; its
    weight denotes the number of elements
  • The fact that an address doesn't satisfy the
    minimum support requirement means that no
    element (set) mapped to the address can satisfy
    the requirement alone ⇒ all the elements (sets)
    at such Hk addresses are omitted from the Ck
    generation
  • During the kth step, Ck is generated starting
    from Lk-1, with the restrictions described above

97
DHP Algorithm
  • Conclusions
  • DHP outperforms Apriori for the same input data
  • The time spent on hash table generation
    (especially H2) is outweighed by the greatly
    reduced candidate sets (C2, ...)
  • The same improvements applied to Apriori may as
    well be applied here (scan reduction, abstraction
    levels, ...)

98
References
  • http://www.marconi.com
  • http://www.blueyed.com
  • http://www.fipa.org
  • http://www.rpi.edu
  • http://research.microsoft.com
  • http://imatch.lcs.mit.edu

99
THE END
  • Quatenus nobis denegatum diu vivere, relinquamus
    aliquid, quo nos vixisse testemur
  • (Since we are denied a long life, let us leave
    something behind as testimony that we have lived)

Authors
Voislav Galic, vgalic_at_bitsyu.net
Dušan Zečević, zdusan_at_softhome.net
Đorđe Đurđević, madcat_at_tesla.rcub.bg.ac.yu
Veljko Milutinovic, vm_at_etf.bg.ac.yu
http://galeb.etf.bg.ac.yu/~vm/tutorial