1
Data Clustering with Application to Relational
Data
  • Adam Anthony
  • Ph.D. Candidate
  • University of Maryland Baltimore County
  • Adviser: Marie desJardins

2
Overview
  • Clustering Tutorial
  • Clustering discussion
  • K-means Clustering
  • Single/Complete-Link Clustering
  • Probabilistic Clustering
  • My work: Relational Data Clustering
  • Relational Data Examples
  • Sources of Information in Relational Data
    Clustering
  • Fast approximate Relational Data Clustering
  • Relation Selection
  • Constraining Solutions in Relational Data
    Clustering
  • Conclusion

3
What Is Data Clustering?
  • Clustering: grouping objects into categories
    without outside input
  • Quality of a clustering depends on an objective
  • Which clustering is better?
  • By rank
  • By suit
  • By color
  • Combinations

4
Clustering: An Intelligence Perspective
  • Why is clustering considered an intelligent
    activity?
  • What are the categories?
  • Squirrel, Marlin, Salmon, Mouse, Tuna, Bat
  • How many faces?
  • But there's more to it
  • aardvark, addax, alligator, alpaca, anteater,
    antelope, aoudad, ape, argali, armadillo, ass,
    baboon, badger, basilisk, bat, bear, beaver,
    bighorn, bison, boar, budgerigar, buffalo, bull,
    bunny, burro, camel, canary, capybara cat,
    chameleon, chamois, cheetah, chimpanzee,
    chinchilla, chipmunk, civet, coati, colt, cony,
    cougar, cow, coyote, crocodile, crow, deer,
    dingo, doe, dog, donkey, dormouse, dromedary,
    duckbill, dugong, eland, elephant, elk, ermine,
    ewe, fawn, ferret, finch, fish, fox, frog,
    gazelle, gemsbok, gila_monster, giraffe, gnu,
    goat, gopher, gorilla, grizzly_bear, ground_hog,
    guanaco, guinea_pig, hamster, hare, hartebeest,
    hedgehog, hippopotamus, hog, horse, hyena, ibex,
    iguana, impala, jackal, jaguar, jerboa, kangaroo,
    kid, kinkajou, kitten, koala, koodoo, lamb,
    lemur, leopard, lion, lizard, llama, lovebird,
    lynx, mandrill, mare, marmoset, marten, mink,
    mole, mongoose, monkey, moose, mountain_goat,
    mouse, mule, musk_deer, musk_ox, muskrat,
    mustang, mynah_bird, newt, ocelot, okapi,
    opossum, orangutan, oryx, otter, ox, panda,
    panther, parakeet, parrot, peccary, pig,
    platypus, polar_bear, pony, porcupine, porpoise,
    prairie_dog, pronghorn, puma, puppy, quagga,
    rabbit

5
Clustering: An Agent's Perspective
  • An agent has three short- and long-range binary
    sensors
  • Light (high/low)
  • Heat (high/low)
  • Damaged (yes/no)
  • Clustering can be used to predict unknown values
  • Recharge station (with fluorescent lightbulb)
  • Candle (causes damage)
  • How can clustering help this agent?
  • Agent can predict and avoid damage using
    clustering
  • Clustering can also filter out irrelevant
    information
  • Add a noise sensor, but noise never causes damage

6
Formal Data Clustering
  • Data clustering is
  • Dividing a set of data objects into groups such
    that there is a clear pattern (e.g. similarity to
    each other) for why objects are in the same
    cluster
  • A clustering algorithm requires
  • A data set D
  • A clustering description C
  • A clustering objective Obj(C)
  • An optimization method Opt(D) → C
  • Obj measures the goodness of the best clustering
    C that Opt(D) can find

7
K-Means Clustering
  • D: numeric d-dimensional data
  • C: a partitioning of data points into k clusters
  • Obj(C): Root Mean Squared Error (RMSE)
  • Average distance between each object and its
    cluster's mean value
  • Optimization Method
  • Select k random objects as the initial means
  • While RMSE improves (RMSE_new < RMSE_old)
  • Move each object to the cluster with the closest
    mean
  • Recompute each cluster's mean
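The loop above can be sketched in a few lines (a minimal illustration assuming Euclidean distance and NumPy, not the presenter's implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array of numeric data."""
    rng = np.random.default_rng(seed)
    # Select k random objects as the initial means
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Move each object to the cluster with the closest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster's mean (keep the old mean if a cluster empties)
        new_means = np.array([X[labels == j].mean(axis=0)
                              if (labels == j).any() else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):  # RMSE no longer improves
            break
        means = new_means
    return labels, means
```

Because the objective is non-convex, the result depends on the random initial means; the slide's "select k random objects" step is one common initialization.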

8
K-Means Demo
9
Single/Complete-Link Clustering
  • Initialize each object in its own cluster
  • Compute the cluster distance matrix M by the
    selected criterion (below)
  • While there is more than one cluster
  • Join the clusters with the shortest distance
  • Update M by the selected criterion
  • Criterion for single/complete-link clustering
  • Single-link: use the distance of the closest
    objects between the two clusters
  • Complete-link: use the distance of the most
    distant objects between the two clusters
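A minimal sketch of this agglomerative procedure, with the linkage criterion as a parameter (illustrative only; practical implementations update the distance matrix incrementally instead of rescanning all cluster pairs):

```python
import numpy as np

def agglomerative(X, k, linkage="single"):
    """Merge the two closest clusters until only k remain.
    linkage: 'single' (closest pair) or 'complete' (most distant pair)."""
    clusters = [[i] for i in range(len(X))]              # each object starts alone
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise object distances
    agg = np.min if linkage == "single" else np.max
    while len(clusters) > k:
        # Find the pair of clusters with the shortest distance under the criterion
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = agg(D[np.ix_(clusters[a], clusters[b])])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a].extend(clusters[b])                  # join the two clusters
        del clusters[b]
    return clusters
```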

10
Single/Complete-Link Demo
  • How can we measure the distance between these
    clusters?
  • What is best for
  • Spherical data (above)?
  • Chain-like data?

(Figure: complete-link vs. single-link distance between two clusters)
11
Probabilistic Clustering
  • There are many ways to optimize such a
    clustering, including Expectation Maximization
    and Simulated Annealing
  • P(C | Θ) is called the prior on C and lets us
    control the kinds of clusterings that are found
  • Balanced-size clusters, lots of little clusters,
    a few big clusters, etc.
  • P(D | C, Θ) is where the interesting
    application-specific work is performed

12
Probabilistic Clustering with Simulated Annealing
  • Use Maximum Likelihood Estimators for the
    parameters Θ
  • Use simulated annealing to find an optimal C
  • Start with a random C_0 and temperature T_0, then
    iterate
  • Perturb a small portion of C_i, store as C_i+1
  • Re-estimate MLE(Θ), given C_i+1
  • Compute ΔL = Obj_prob(C_i+1) − Obj_prob(C_i)
  • If ΔL > 0, or with probability e^(ΔL/T_i), keep
    solution C_i+1
  • Else, revert to solution C_i
  • T_i+1 = T_i / t_s  // t_s is a number slightly
    greater than 1
  • Stop when there is little or no change between
    iterations
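The annealing loop can be sketched generically; here `obj` and `perturb` are placeholders for Obj_prob and the partition-perturbation step, so this is a hedged illustration of the acceptance rule rather than the talk's actual optimizer:

```python
import math
import random

def anneal(obj, perturb, c0, t0=1.0, t_s=1.01, iters=2000, seed=0):
    """Simulated-annealing sketch for maximizing an objective obj(C).
    perturb(C) returns a slightly modified copy of C; t_s > 1 cools
    the temperature as on the slide (T_i+1 = T_i / t_s)."""
    rng = random.Random(seed)
    c, score, t = c0, obj(c0), t0
    for _ in range(iters):
        cand = perturb(c)                 # perturb a small portion of C_i
        delta = obj(cand) - score         # ΔL = Obj(C_i+1) - Obj(C_i)
        # Keep the candidate if it improves, or with probability e^(ΔL/T)
        if delta > 0 or rng.random() < math.exp(delta / t):
            c, score = cand, score + delta
        t /= t_s                          # cool the temperature
    return c, score
```

Worsening moves are accepted occasionally while the temperature is high, which is what lets the search escape local optima before it "freezes".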

13
My Research: Clustering Relational Data
14
Relational Data
  • Formally
  • A set of Object Domains
  • Sets of instances from those domains
  • Sets of relational tuples between instances
  • Simplifications
  • Attribute Vectors
  • (Attributed) graphs, when compatible
  • In Practice
  • Relational Data refers only to data that
    requires the use of tuples

(Example attribute vectors: Fred, M; Sally, F; Joe, M)
15
Some Relational Data Examples
  • Domains: People, demographic attributes
  • Relations: Friendship, Group
  • Domains: Documents, words
  • Relations: Directed cross-document references
  • (Internet Movie Database)
  • Domains: Actors, Directors, Movies, demographic
    attributes
  • Relations: Worked-Together, Directed, Acted-In

16
Observation Attributes and Relations Encode
Unique Information
  • Internet Movie Database subset: 508 actors
  • 7 binary features: has_award, act_drama,
    act_comedy, experienced, gender, popularity,
    many_movies
  • Ground-truth clustering:
  • currently active actors and (semi-)retired actors
  • Adjusted Rand Index for partition comparison
    (closeness of partition A to partition B, not a
    percentage)
  • ARI between Features Only and Graph Only: 0.51
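For reference, the Adjusted Rand Index can be computed from the pair counts of the two partitions; this is the standard permutation-model formula, not code from the talk:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(a, b):
    """ARI between two partitions given as label lists.
    1.0 = identical partitions, ~0.0 = chance agreement; can be negative."""
    n = len(a)
    # Pairs of objects that land in the same cluster in both / in each partition
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)   # expected index under random labeling
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that the score is invariant to relabeling the clusters, which is why it compares partitions rather than label values.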

17
Types of Similarity in Relational Data
  • Attribute Similarity
  • Can we sort Facebook users into categories based
    on demographic similarity?
  • Structural (Relational) Similarity
  • Can we categorize an actress based on the people
    she has worked with?
  • Correlation Similarity (contribution)
  • Given two blogs that are connected by reciprocal
    URLs, how likely are they to cover
    similar/different topics?

18
Modeling Different Similarities
  • Hypothesis: A model that uses one or a
    combination of attribute, structural, and/or
    correlation similarity will be able to find
    non-trivial clusterings that contrast with what
    other models may find.

19
The Probabilistic Relational Clustering Framework
(PRCF)
20
Probabilistic Relational Clustering Framework
(Cont.)
  • Prior probability of observing C
  • Attribute Similarity Probability
  • Structural and/or correlation similarity
    probability
  • b_CiCj: specifies the block of edges between
    clusters i and j

21
Improving Link Prediction with a novel PRCF model
  • For the Edge model, most researchers choose
  • I proposed a new edge model
  • Experiment:
  • Withhold a fraction of edges from an artificial
    graph as a test set
  • Remaining edges are the training set
  • Learn several models with more and more training
    set edges observed
  • AUC: Area under the ROC curve, a measure of
    classification performance
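AUC has a simple probabilistic reading: the chance that a randomly chosen true edge is scored above a randomly chosen non-edge. A minimal sketch of that reading (illustrative; `scores_pos` and `scores_neg` are hypothetical names, and this O(n·m) form is for clarity, not speed):

```python
def auc(scores_pos, scores_neg):
    """AUC = P(random positive scores above a random negative); ties count half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```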

22
Current Work: Block Modularity (joint work with
Michael Lombardi '10)
  • Block b_ij: the set of edges falling into the
    block between clusters i and j
  • If some are dense, and the rest are sparse, we
    can generate a summary graph, and state that
    objects in the same cluster have high structural
    similarity

23
Optimizing Block Modularity
  • Starting with a random partition C,
  • Iterate until convergence
  • Compute a new partition C′ by assigning each
    object to the cluster that would increase the
    block modularity objective the most
  • Let C = C′
  • Preliminary results
  • Block modularity: 3-10 iterations on a graph with
    100 vertices
  • Simulated annealing with PRCF: 10,000 iterations
  • Block modularity algorithm is significantly
    easier to program than any PRCF model and finds
    the same solution
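The reassignment loop above can be sketched with the objective left pluggable, since the transcript does not give the block-modularity formula itself (`objective` below is a placeholder for it):

```python
def greedy_partition(objects, k, objective, max_rounds=50):
    """Local search: repeatedly move each object to the cluster that most
    increases objective(labels), until no move helps."""
    labels = {o: i % k for i, o in enumerate(objects)}  # arbitrary start
    for _ in range(max_rounds):
        improved = False
        for o in objects:
            current = labels[o]
            best_c, best_s = current, objective(labels)
            for c in range(k):                 # try every other cluster for o
                if c == current:
                    continue
                labels[o] = c
                s = objective(labels)
                if s > best_s:
                    best_c, best_s = c, s
            if best_c != current:
                improved = True
            labels[o] = best_c                 # keep the best assignment
        if not improved:                       # converged: no move helps
            break
    return labels
```

With a cheap objective this converges in a handful of sweeps, which matches the slide's contrast with the thousands of iterations simulated annealing needs.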

24
Relation Selection
  • New Application of Block Modularity
  • Generate an initial clustering
  • Use a randomized single-link clustering where the
    distance is measured by the fraction of common
    neighbors
  • Measure the block modularity score for this
    clustering
  • Average over several runs
  • Experiment: take a graph with obvious clusters,
    and rewire some edges
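One common reading of "fraction of common neighbors" is the Jaccard overlap of two vertices' neighbor sets; the slide does not define it precisely, so treat this as an assumption:

```python
def common_neighbor_fraction(adj, u, v):
    """Similarity of u and v as |N(u) ∩ N(v)| / |N(u) ∪ N(v)|,
    where adj maps each vertex to its list of neighbors."""
    nu, nv = set(adj[u]), set(adj[v])
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0
```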

25
Observation Sometimes Relations Are Ambiguous
  • Assume attributes can't identify CS/CHEM
  • We know that they belong apart
  • Solution: Constrained Clustering

(Figure: graph of CS, CHEM, and Bio nodes)
26
Constrained Relational Clustering (joint work
with Paul Guseman '09)
  • Add constraints: Must-Link and Cannot-Link
  • Constrain the original algorithm (e.g. PRCF) so
    that no (or very few) constraints are violated
  • Constrain Obj(C): penalize the score for broken
    constraints
  • Constrain Opt(D): avoid solutions with broken
    constraints
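Constraining Obj(C) can be sketched as a wrapper that subtracts a penalty per broken constraint; this illustrates the idea, not the actual PRCF modification:

```python
def constrained_objective(base_obj, must_link, cannot_link, penalty=10.0):
    """Wrap a clustering objective so broken must-link / cannot-link
    pairs each subtract `penalty` from the score."""
    def obj(labels):
        broken = sum(labels[a] != labels[b] for a, b in must_link)
        broken += sum(labels[a] == labels[b] for a, b in cannot_link)
        return base_obj(labels) - penalty * broken
    return obj
```

A large penalty approximates hard constraints (the "constrain Opt(D)" route), while a small one merely discourages violations.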

27
Future Work
  • Relation Extraction/Adjustment
  • Denser relations dominate solutions over sparser
    relations
  • Degree Prediction
  • Given document topic and proposed outlinks, what
    is the expected number of references to my blog?
  • Abstract Relation Type Discovery
  • Given a set of unlabeled edges, can they be split
    into distinct relation types?

28
Conclusion
  • Combining attribute similarity and structural
    similarity boosts performance compared to using
    either individually
  • New source of information: correlation similarity
  • Improves link prediction performance
  • Block Modularity: a fast (and simple) algorithm
    for optimizing block models
  • Constrained Clustering will help to avoid
    ambiguous clustering scenarios
  • Relation Selection: block modularity can help to
    quickly decide if a relation has high-quality
    structure