Web Community Mining and Web Log Mining: Commodity Cluster Based Execution - PowerPoint PPT Presentation

Slides: 22
Provided by: rom860

Transcript and Presenter's Notes

1
Web Community Mining and Web Log Mining: Commodity
Cluster Based Execution
  • Romeo Zitarosa

2
Overview
  • Introduction
  • Web Community Mining
  • Web log mining on MIS
  • Parallel Data Mining on PC Cluster
  • Performance Evaluation
  • Conclusion

3
Introduction
  • Two proposed applications of web mining
  • 1) Extract web communities
  • 2) Understand the behaviour of mobile
    Internet users (usage mining)

4
Web Community Mining
  • Web Community
  • Def. A web community is a collection of web
    pages created by individuals or associations that
    have common interests in a specific topic.

5
Proposed technique
  • Starts from a set of seed pages
  • Based on RPA
  • Creates a community chart

6
Authorities and Hubs
  • Authority: a page with good content on a topic,
    linked to by many good hub pages.
  • Hub: a page with a list of hyperlinks to valuable
    pages on a topic, that is, one that points to good
    authorities.
  • Community core: authorities and hubs
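
The mutual reinforcement between hubs and authorities described above can be sketched as follows. This is a generic textbook-style HITS iteration, not the presenter's implementation; the `links` dictionary, the page names, and the iteration count are assumptions made for illustration.

```python
def hits(links, iterations=50):
    """Return (hub, auth) score dicts for a small link graph.

    links: dict mapping a page to the pages it points to (hypothetical input).
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: three hub-like pages pointing at two content pages.
links = {"h1": ["a", "b"], "h2": ["a"], "h3": ["a"]}
hub, auth = hits(links)
```

In this toy graph, page "a" ends up with the highest authority score because all three hubs point to it, and "h1" ends up as the best hub because it points to both content pages.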

7
Web Community Mining
  • Algorithm
  • 1. Start from a seed set
  • 2. Apply RPA to each seed
  • Build a web subgraph and extract
    (using HITS) hubs and authorities.
  • 3. Investigate how each seed derives other seeds
    as related pages.

8
Example
  • 1. Suppose that s derives t as a related
    page and vice versa.
  • Then s and t are pointed to by
    similar sets of hubs.
  • 2. Suppose that s derives t as a related
    page but t does not derive s.
  • Then t is pointed to by many different
    hubs, so t derives a different set of
    related pages.

9
Observation
  • In this way we define a symmetric derivation
    relationship to identify communities.
  • Def. Community: a set of pages strongly connected
    by symmetric derivation relationships.
  • Two Communities are related if a member of one
    community derives a member of the other
    community.
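
A minimal sketch of these definitions, under a simplifying assumption: "strongly connected by symmetric derivation relationships" is approximated here by connected components of the mutual-derivation graph, which may be looser than the criterion the presenter had in mind. The `derives` mapping and the page names are hypothetical.

```python
def communities(derives):
    """Group pages linked by symmetric derivation relationships.

    derives: dict mapping a page to the set of pages it derives as related.
    """
    # Keep only mutual derivations: s derives t and t derives s.
    sym = {s: {t for t in ts if s in derives.get(t, set())}
           for s, ts in derives.items()}
    seen, result = set(), []
    for start in sorted(sym):
        if start in seen:
            continue
        # Depth-first walk over the symmetric edges.
        component, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(sym.get(page, ()))
        seen |= component
        result.append(component)
    return result


def are_related(c1, c2, derives):
    """Two communities are related if a member of one derives a member of the other."""
    return (any(derives.get(a, set()) & c2 for a in c1)
            or any(derives.get(b, set()) & c1 for b in c2))


# Hypothetical derivations: s and t derive each other, u and v derive each other,
# and s additionally derives u without being derived back.
derives = {"s": {"t", "u"}, "t": {"s"}, "u": {"v"}, "v": {"u"}}
found = communities(derives)
```

Here s and t form one community and u and v another; the one-way derivation from s to u does not merge them, but it does make the two communities related.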

10
Web Community Chart
  • Def. A graph that consists of communities as
    nodes and weighted edges between nodes.
  • The weight of an edge represents the relevance
    between the communities it connects.
  • We need a tool to browse Communities

11
Web Community Chart(2)
  • Label: assigned manually
  • Box: list of URLs sorted by connectivity score.
  • Def. Connectivity score:
  • number of derivation relationships from the
    node to other nodes of the community.
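
The connectivity score defined above is a simple count, so the URL ordering inside a chart box can be sketched in a few lines. The function names, the `derives` mapping, and the alphabetical tie-break are assumptions for illustration.

```python
def connectivity_scores(community, derives):
    """Count, for each page, its derivations to other members of the community."""
    return {page: len((derives.get(page, set()) & community) - {page})
            for page in community}


def sorted_urls(community, derives):
    """List the community's pages by descending connectivity score."""
    scores = connectivity_scores(community, derives)
    # Ties are broken alphabetically (an arbitrary choice for this sketch).
    return sorted(community, key=lambda page: (-scores[page], page))


# Hypothetical community: "x" is outside it, so a -> x is not counted.
community = {"a", "b", "c"}
derives = {"a": {"b", "c", "x"}, "b": {"a"}, "c": set()}
ordering = sorted_urls(community, derives)
```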

12
Example

13
Mobile Info Search (MIS)
  • NTT Laboratories
  • Goal: provide location-aware information from the
    Internet by collecting, structuring, filtering and
    organizing it.
  • www.kokono.net

14
kokono
  • There is a database-type resource between the user
    and the information sources (online maps, yellow pages,
    etc.)

15
MIS Functionalities
  • User Location Acquisition
  • - GPS, PHS, postal number
  • Location Oriented Robot-Based Search (kokono)
  • - search documents close to a location
  • - display documents in order of distance between the
    location written in the doc and the user position
  • Location Oriented Meta Search
  • - backbone database accessed by CGI
    programs.

16
Association Rule Mining
  • Support, confidence
  • Hierarchy -> Taxonomy
  • A hierarchy allows finding not only rules specific
    to a location but also rules for the wider area that
    covers that location.
  • Identify access patterns of MIS users.
  • Prefetch information.
  • Reduce access time.
  • Spatial information gives valuable information to
    mobile users.
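
To make support, confidence, and the role of the hierarchy concrete, here is a small sketch. It extends each transaction with the ancestors of its items, a common approach to rules over taxonomies, which is assumed here rather than taken from the slides. The taxonomy, the location names, and the log entries are all hypothetical.

```python
# Hypothetical location taxonomy: child -> parent.
PARENT = {"district_a": "city_x", "district_b": "city_x"}


def extend(transaction):
    """Add every ancestor of each item, so rules can match wider areas."""
    items = set(transaction)
    for item in transaction:
        while item in PARENT:
            item = PARENT[item]
            items.add(item)
    return items


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(lhs and rhs) / support(lhs); assumes lhs occurs at least once."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


# Hypothetical access log: (location, requested category) pairs.
logs = [{"district_a", "restaurant"}, {"district_b", "restaurant"},
        {"district_a", "hotel"}, {"city_y", "restaurant"}]
extended = [extend(t) for t in logs]

narrow = support({"district_a", "restaurant"}, extended)
wide = support({"city_x", "restaurant"}, extended)
rule = confidence({"city_x"}, {"restaurant"}, extended)
```

The rule on the wider area (`city_x`) gathers support from both districts, which is the effect the hierarchy bullet describes: a pattern too rare in any single location can still be frequent one level up.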

17
Sequential Rule Mining
  • Sequential Patterns
  • Derive how different services are used together.
  • Example:
  • Define the plan after checking the weather
  • Submit_weather=Weather Forecast →
  • submit_shop=Shop Info, shop_web=townpage
    →
  • Submit_kokono=KOKONOSearch → Submit_map=MAP
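
A sequential pattern such as the weather-then-map example above is supported by a session if its steps occur in the same order, possibly with other requests in between. The sketch below only counts that support; it is not a full sequential pattern miner, and the session contents are hypothetical.

```python
def contains_in_order(session, pattern):
    """True if the pattern's steps appear in the session in order (gaps allowed)."""
    remaining = iter(session)
    # Each membership test consumes the iterator up to the matched step,
    # so the next step must be found later in the session.
    return all(step in remaining for step in pattern)


def sequence_support(pattern, sessions):
    """Fraction of sessions that contain the pattern as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sessions) / len(sessions)


# Hypothetical sessions, each an ordered list of requested services.
sessions = [
    ["weather", "shop", "map"],
    ["shop", "weather"],
    ["weather", "search", "map"],
]
weather_then_map = sequence_support(["weather", "map"], sessions)
weather_then_shop = sequence_support(["weather", "shop"], sessions)
```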

18
Parallel DM and PC Cluster
  • Parallel Apriori
  • - nodes keep all candidate itemsets
  • - scan the dataset independently
  • - communicate only at the end of the phase
  • Problem: too much memory used!
  • Solution (partial): Hash Partitioned Apriori
    (HPA).
  • - candidates are partitioned using a hash
    function
  • - each node builds candidate itemsets
  • - a lot of disk I/O when support is small
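
The hash-partitioning idea can be sketched in a single process, with each "node" represented by its own counter. This is a simplification under stated assumptions: it counts every k-item subset of each transaction instead of only the candidates generated from the previous pass, message passing is replaced by a direct update of the owner's counter, and `zlib.crc32` is an arbitrary choice of deterministic hash.

```python
import zlib
from collections import Counter
from itertools import combinations


def owner(itemset, n_nodes):
    """Pick the node responsible for an itemset using a deterministic hash."""
    key = ",".join(itemset).encode()
    return zlib.crc32(key) % n_nodes


def hpa_count(partitions, k, n_nodes):
    """Count k-itemsets, keeping each itemset's count on exactly one node.

    partitions: one list of transactions (sets of items) per node.
    """
    counters = [Counter() for _ in range(n_nodes)]
    for local_transactions in partitions:      # each node scans its own data
        for transaction in local_transactions:
            for itemset in combinations(sorted(transaction), k):
                # "Send" the itemset to its owner, which increments the count.
                counters[owner(itemset, n_nodes)][itemset] += 1
    return counters


# Hypothetical data split across two nodes.
partitions = [
    [{"a", "b"}, {"a", "b", "c"}],
    [{"a", "b"}, {"b", "c"}],
]
counters = hpa_count(partitions, k=2, n_nodes=2)
total = sum(counters, Counter())
```

Because every itemset has a single owner, no node has to hold the full candidate table, which is the memory saving over the plain parallel Apriori described first. The price is that itemsets must be shipped between nodes during the scan.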

19
Parallel Algorithm for Association Rule Mining
  • Non partitioned generalized (NPGM)
  • Hash Partitioned (HPGM)
  • - reduce communications
  • Hierarchical HPGM (H-HPGM)
  • - candidates whose roots are identical are allocated
    to the same node
  • H-HPGM with Fine Grain Duplicates
  • (H-HPGM-FGD)
  • - use remaining free space
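
The allocation rule for H-HPGM (candidates with identical roots go to the same node) can be illustrated by hashing the roots of an itemset rather than its items. The taxonomy, item names, and hash function below are assumptions for illustration, and the fine-grain duplication step of H-HPGM-FGD is not modelled.

```python
import zlib

# Hypothetical taxonomy: child -> parent.
PARENT = {
    "district_a": "city_x",
    "district_b": "city_x",
    "hotel": "lodging",
}


def root(item):
    """Follow parent links up to the top of the item's hierarchy."""
    while item in PARENT:
        item = PARENT[item]
    return item


def node_for(itemset, n_nodes):
    """Allocate an itemset by the roots of its items, not the items themselves."""
    roots = sorted(root(item) for item in itemset)
    return zlib.crc32(",".join(roots).encode()) % n_nodes


# An itemset and its generalizations share roots, so they share a node.
specific = node_for({"district_a", "hotel"}, n_nodes=4)
general = node_for({"city_x", "lodging"}, n_nodes=4)
sibling = node_for({"district_b", "hotel"}, n_nodes=4)
```

Since an itemset and all of its ancestor itemsets land on one node, that node can update the counts along the hierarchy locally, which is consistent with the reduced-communication goal stated for the hash-partitioned variants.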

20
Performance evaluation
  • Obs. Execution time increases as the support becomes smaller

21
Conclusion
  • Real web mining applications need high-performance
    computing systems
  • A PC cluster, with its scalable performance (and
    high costs), is a promising platform