Mining di dati web - PowerPoint PPT Presentation
Transcript and Presenter's Notes

Title: Mining di dati web


1
Mining di dati web
  • Clustering of Web Documents
  • Similarity metrics
  • Academic Year 2006/2007

2
Learning: What Does It Mean?
  • Definition from Wikipedia
  • Machine learning is an area of artificial
    intelligence concerned with the development of
    techniques which allow computers to "learn".
  • More specifically, machine learning is a method
    for creating computer programs by the analysis of
    data sets.
  • Machine learning overlaps heavily with
    statistics, since both fields study the analysis
    of data, but unlike statistics, machine learning
    is concerned with the algorithmic complexity of
    computational implementations.

3
And What about Clustering?
  • Again from Wikipedia
  • Data clustering is a common technique for
    statistical data analysis, which is used in many
    fields, including machine learning, data mining,
    pattern recognition, image analysis and
    bioinformatics. Clustering is the classification
    of similar objects into different groups, or more
    precisely, the partitioning of a data set into
    subsets (clusters), so that the data in each
    subset (ideally) share some common trait - often
    proximity according to some defined distance
    measure.
  • Machine learning typically regards data
    clustering as a form of unsupervised learning.

4
What is clustering?
5
Clustering
  • Example where Euclidean distance is the distance
    metric
  • Hierarchical clustering dendrogram

6
Clustering: K-Means
  • Randomly generate k clusters and determine the
    cluster centers, or directly generate k seed
    points as cluster centers.
  • Assign each point to the nearest cluster center.
  • Recompute the new cluster centers.
  • Repeat until some convergence criterion is met
    (usually that the assignment hasn't changed).

7
K-means Clustering
  • Each point is assigned to the cluster with the
    closest center point.
  • The number K must be specified.
  • Basic algorithm:
  • 1. Select K points as the initial center points.
  • Repeat:
  • 2. Form K clusters by assigning all points to the
    closest center point.
  • 3. Recompute the center point of each cluster.
  • Until the center points don't change.
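
A minimal Python sketch of the basic algorithm above (illustrative only; the function name and the stopping rule simply follow the listed steps):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial center points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # 2. Form K clusters by assigning each point to the closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when the assignment (hence the centers) no longer changes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. Recompute each center as the mean of its assigned points
        #    (an emptied cluster keeps its previous center).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Example: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans(data, k=2)
```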

8
Example 1: Data Points
Imagine the k-means clustering result.
9
K-means clustering result
10
Importance of Choosing Initial Guesses (1)
11
Importance of Choosing Initial Guesses (2)
12
Local Optima of K-means
Supplement to K-means
  • The local optima of K-means:
  • K-means minimizes the sum of point-centroid
    distances; this optimization problem is hard.
  • After each iteration of K-means the MSE (mean
    squared error) decreases, but K-means may converge
    to a local optimum. So K-means is sensitive to
    initial guesses.

13
Example 2: Data Points
14
Imagine the clustering result.
15
Example 2: K-means clustering result
16
Limitations of K-means
  • K-means has problems when clusters are of
    differing:
  • Sizes
  • Densities
  • Non-globular shapes
  • K-means has problems when the data contains
    outliers.
  • One solution is to use many clusters:
  • Find parts of clusters, which then need to be put
    together.

17
Clustering: K-Means
  • The main advantages of this algorithm are its
    simplicity and speed, which allow it to run on
    large datasets. Its disadvantage is that it does
    not yield the same result with each run, since
    the resulting clusters depend on the initial
    random assignments.
  • It minimizes intra-cluster variance (equivalently,
    maximizes inter-cluster variance), but does not
    ensure that the result is a global minimum of
    variance.

18
d_intra or d_inter? That is the question!
  • d_intra is the distance among elements (points or
    objects, or whatever) of the same cluster.
  • d_inter is the distance among clusters.
  • Questions
  • Should we use distance or similarity?
  • Should we care about inter-cluster distance?
  • Should we care about cluster shape?
  • Should we care about clustering?

19
Distance Functions
  • Informal definition
  • The distance between two points is the length of
    a straight line segment between them.
  • A more formal definition
  • A distance between two points P and Q in a metric
    space is d(P,Q), where d is the distance function
    that defines the given metric space.
  • We can also define the distance between two sets
    A and B in a metric space as being the minimum
    (or infimum) of distances between any two points
    P in A and Q in B.

20
Distance or Similarity?
  • In a very straightforward way we can define the
    similarity function sim: S×S → [0,1] as
    sim(o1, o2) = 1 - d(o1, o2), where o1 and o2 are
    elements of the space S and d is a distance
    normalized to [0,1].

21
What does similar (or distant) really mean?
  • Learning (either supervised or unsupervised) is
    impossible without ASSUMPTIONS
  • Watanabe's Ugly Duckling theorem
  • Wolpert's No Free Lunch theorem
  • Learning is impossible without some sort of bias.

22
The Ugly Duckling theorems
The theorem gets its fanciful name from the
following counter-intuitive statement: assuming
similarity is based on the number of shared
predicates, an ugly duckling is as similar to a
beautiful swan A as a beautiful swan B is to A,
given that A and B differ at all. It was proposed
and proved by Satosi Watanabe in 1969.
23
Satosi's Theorem
  • Let n be the cardinality of the universal set S.
  • We have to classify its objects without prior
    knowledge of the essence of categories.
  • The number of different classes, i.e. the
    different ways to group the objects into clusters,
    is given by the cardinality of the power set of S:
  • |Pow(S)| = 2^n
  • Without any prior information, the most natural
    way to measure the similarity between two distinct
    objects is to count the number of classes they
    share.
  • Ooops! They share exactly the same number of
    classes, namely 2^(n-2).

24
The ugly duckling and 3 beautiful swans
  • S = {o1, o2, o3, o4}
  • Pow(S) = {∅, {o1}, {o2}, {o3}, {o4}, {o1,o2},
    {o1,o3}, {o1,o4}, {o2,o3}, {o2,o4}, {o3,o4},
    {o1,o2,o3}, {o1,o2,o4}, {o1,o3,o4}, {o2,o3,o4},
    {o1,o2,o3,o4}}
  • How many classes do oi and oj (i ≠ j) have in
    common?
  • o1 and o3:
  • 4
  • o1 and o4:
  • 4

25
The ugly duckling and 3 beautiful swans
  • In binary: 0000, 0001, 0010, 0100, 1000, ...
  • Choose two objects.
  • Reorder the bits so that the chosen objects are
    represented by the first two bits.
  • How many strings share the first two bits set to
    1?
  • 2^(n-2)
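
The counting argument is easy to check by brute force; here is a small illustrative Python sketch over a hypothetical 4-object universe (the object names are invented):

```python
from itertools import chain, combinations

# Hypothetical universe: one ugly duckling and three beautiful swans.
objects = ["duckling", "swan_A", "swan_B", "swan_C"]

# Every class is a subset of the universe, so there are 2^n of them.
power_set = list(chain.from_iterable(
    combinations(objects, r) for r in range(len(objects) + 1)))

def shared_classes(a, b):
    # Number of classes (subsets) containing both objects.
    return sum(1 for cls in power_set if a in cls and b in cls)

# Every pair shares exactly 2^(n-2) = 4 classes, duckling or swan.
for a, b in combinations(objects, 2):
    print(a, b, shared_classes(a, b))  # prints 4 for every pair
```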

26
Wolpert's No Free Lunch Theorem
  • For any two algorithms, A and B, there exist
    datasets for which algorithm A outperforms
    algorithm B in prediction accuracy on unseen
    instances.
  • Proof: take any Boolean concept. If A outperforms
    B on unseen instances, reverse the labels and B
    will outperform A.

27
So Let's Get Back to Distances
  • In a metric space a distance is a function
    d: S×S → R such that, if a, b, c are elements of S:
  • d(a,b) ≥ 0
  • d(a,b) = 0 iff a = b
  • d(a,b) = d(b,a)
  • d(a,c) ≤ d(a,b) + d(b,c)
  • The fourth property (the triangle inequality)
    holds only if we are in a metric space.

28
Minkowski Distance
  • Let's consider two elements of a set S described
    by their feature vectors:
  • x = (x1, x2, ..., xn)
  • y = (y1, y2, ..., yn)
  • The Minkowski distance is parametric in p ≥ 1:
  • d_p(x, y) = (Σi |xi - yi|^p)^(1/p)

29
p = 1: Manhattan Distance
  • If p = 1 the distance is called the Manhattan
    distance:
  • d_1(x, y) = Σi |xi - yi|
  • It is also called the taxicab distance because it
    is the distance a car would drive in a city laid
    out in square blocks (if there are no one-way
    streets).

30
p = 2: Euclidean Distance
  • If p = 2 the distance is the well-known Euclidean
    distance:
  • d_2(x, y) = (Σi (xi - yi)^2)^(1/2)

31
p → ∞: Chebyshev Distance
  • If p → ∞ then we must take the limit:
  • d_∞(x, y) = lim_{p→∞} (Σi |xi - yi|^p)^(1/p)
    = maxi |xi - yi|
  • It is also called the chessboard distance.

32
2D Cosine Similarity
  • It's easy to explain in 2D.
  • Let's consider a = (x1, y1) and b = (x2, y2).

[Figure: vectors a and b separated by the angle θ]
33
Cosine Similarity
  • Let's consider two points x, y in R^n:
  • sim(x, y) = (x · y) / (||x|| ||y||) = cos θ

34
Jaccard Distance
  • Another commonly used distance is the Jaccard
    distance. For two sets A and B:
  • d_J(A, B) = 1 - |A ∩ B| / |A ∪ B|

35
Binary Jaccard Distance
  • In the case of binary feature vectors the Jaccard
    distance simplifies to:
  • d_J(x, y) = (M01 + M10) / (M01 + M10 + M11),
    where Mab is the number of positions in which x
    has bit a and y has bit b.
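
The measures of the last few slides are easy to state in code; a small illustrative Python sketch (not tied to any particular library):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def chebyshev(x, y):
    """The p -> infinity limit of the Minkowski distance."""
    return np.max(np.abs(x - y))

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_distance(a, b):
    """Jaccard distance between two sets a and b."""
    return 1.0 - len(a & b) / len(a | b)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 1), minkowski(x, y, 2), chebyshev(x, y))
print(cosine_similarity(x, y))
print(jaccard_distance({"apple", "banana"}, {"apple", "coffee"}))
```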

36
Edit Distance
  • The Levenshtein distance, or edit distance,
    between two strings is given by the minimum number
    of operations needed to transform one string into
    the other, where an operation is an insertion,
    deletion, or substitution of a single character.
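
The standard dynamic-programming computation of this distance, as an illustrative Python sketch:

```python
def levenshtein(s, t):
    """Edit distance: dp[i][j] is the minimum number of insertions,
    deletions, and substitutions turning s[:i] into t[:j]."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j              # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```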

37
Binary Edit Distance
  • The binary edit distance, d(x, y), from a binary
    vector x to a binary vector y is the minimum
    number of single-bit flips required to transform
    one vector into the other.

x = (0,1,0,0,1,1), y = (1,1,0,1,0,1), d(x,y) = 3
The binary edit distance is equivalent to the
Manhattan distance (Minkowski with p = 1) for binary
feature vectors.
38
The Curse of High Dimensionality
  • Dimensionality is one of the main problems to
    face when clustering data.
  • Roughly speaking, the higher the dimensionality,
    the lower the power of recognizing similar
    objects.

39
Volume of the Unit-Radius Sphere
40
Sphere/Cube Volume Ratio
  • Unit-radius sphere vs. cube whose edge length is 2.
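
The plots for these two slides are not reproduced here, but the quantities they show are standard. In LaTeX notation, the volume of the unit-radius ball in n dimensions and its ratio to the enclosing cube of edge 2 are:

```latex
V_n = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}+1\right)},
\qquad
\frac{V_n}{2^n} \to 0 \quad \text{as } n \to \infty .
```

So in high dimensions almost all of the cube's volume lies outside the inscribed sphere, i.e. in the "corners".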

41
Sphere/Sphere Volume Ratio
  • Two embedded spheres, with radii 1 and 0.9: the
    volume ratio 0.9^n vanishes as n grows, so almost
    all the volume lies in the thin outer shell.

42
Concentration of the Norm Phenomenon
  • Gaussian distributions of points (std. dev. 1),
    in dimensions 1, 2, 3, 5, 10, and 20.
  • Probability density function of the distance r
    from the center of the distribution for a point
    drawn according to that Gaussian distribution.
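
This concentration effect is easy to reproduce numerically; a short illustrative numpy sketch (sample size chosen arbitrarily, dimensions matching the slide):

```python
import numpy as np

# Draw points from a standard Gaussian in several dimensions: the
# distance from the center concentrates, with mean norm growing like
# sqrt(d) while the spread stays roughly constant.
rng = np.random.default_rng(0)
for d in (1, 2, 3, 5, 10, 20):
    r = np.linalg.norm(rng.normal(size=(100_000, d)), axis=1)
    print(f"d={d:2d}  mean norm={r.mean():5.2f}  std={r.std():4.2f}")
```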

43
Web Document Representation
  • The Web can be characterized in three different
    ways
  • Content.
  • Structure.
  • Usage.
  • We are concerned with Web Content information.

44
Bag-of-Words vs. Vector-Space
  • Let C be a collection of N documents d1, d2, ...,
    dN.
  • Each document is composed of terms drawn from a
    term-set of dimension T.
  • Each document can be represented in two different
    ways
  • Bag-of-Words.
  • Vector-Space.

45
Bag-of-Words
  • In the bag-of-words model the document is
    represented as d = {Apple, Banana, Coffee, Peach}.
  • Each term is represented.
  • No information on frequency.
  • Binary encoding: T-dimensional bit vector.

Apple Peach Apple Banana Apple Banana Coffee Apple
Coffee
d = [1, 0, 1, 1, 0, 0, 0, 1]
46
Vector-Space
  • In the vector-space model the document is
    represented as d = {<Apple,4>, <Banana,2>,
    <Coffee,2>, <Peach,1>}.
  • Information about frequency is recorded.
  • T-dimensional vectors.

Apple Peach Apple Banana Apple Banana Coffee Apple
Coffee
d = [4, 0, 2, 2, 0, 0, 0, 1]
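
Both representations are straightforward to build; an illustrative Python sketch (the 8-term vocabulary is a guess, chosen only to reproduce the 8-dimensional vectors on these slides):

```python
from collections import Counter

terms = "Apple Peach Apple Banana Apple Banana Coffee Apple Coffee".split()
# Hypothetical vocabulary matching the slides' 8-dimensional vectors.
vocabulary = ["Apple", "Avocado", "Banana", "Coffee",
              "Grape", "Lemon", "Mango", "Peach"]

counts = Counter(terms)
bag_of_words = [1 if counts[t] > 0 else 0 for t in vocabulary]
vector_space = [counts[t] for t in vocabulary]
print(bag_of_words)   # [1, 0, 1, 1, 0, 0, 0, 1]
print(vector_space)   # [4, 0, 2, 2, 0, 0, 0, 1]
```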
47
Typical Web Collection Dimensions
  • No. of documents: approx. 20 billion.
  • No. of terms: approx. 150,000,000!
  • Very high dimensionality.
  • Each document contains from 100 to 1,000 terms.
  • Classical clustering algorithms cannot be used.
  • Each document is very similar to the others, due
    to the geometric properties just seen.

48
We Need to Cope with High Dimensionality
  • Possible solutions?
  • JUST ONE: Reduce dimensions!

49
Dimensionality Reduction
  • Dimensionality reduction approaches can be divided
    into two categories:
  • Feature selection approaches try to find a subset
    of the original features. Optimal feature
    selection for supervised learning problems
    requires an exhaustive search of all possible
    subsets of features of the chosen cardinality.
  • Feature extraction applies a mapping of the
    multidimensional space into a space of fewer
    dimensions. This means that the original feature
    space is transformed, e.g. by applying a linear
    transformation via a principal component analysis.

50
PCA - Principal Component Analysis
  • In statistics, principal components analysis
    (PCA) is a technique that can be used to simplify
    a dataset.
  • More formally it is a linear transformation that
    chooses a new coordinate system for the data set
    such that the greatest variance by any projection
    of the data set comes to lie on the first axis
    (then called the first principal component), the
    second greatest variance on the second axis, and
    so on.
  • PCA can be used for reducing dimensionality in a
    dataset while retaining those characteristics of
    the dataset that contribute most to its variance
    by eliminating the later principal components (by
    a more or less heuristic decision).

51
The Method
  • Suppose you have a random vector population x:
  • x = (x1, x2, ..., xn)^T
  • with the mean
  • μx = E{x}
  • and the covariance matrix
  • Cx = E{(x - μx)(x - μx)^T}

52
The Method
  • The components of Cx, denoted by cij, represent
    the covariances between the random variable
    components xi and xj. The component cii is the
    variance of the component xi. The variance of a
    component indicates the spread of the component
    values around its mean value. If two components
    xi and xj of the data are uncorrelated, their
    covariance is zero (cij = cji = 0).
  • The covariance matrix is, by definition, always
    symmetric.
  • Taking a sample of vectors x1, x2, ..., xM, we can
    calculate the sample mean and the sample
    covariance matrix as estimates of the true mean
    and covariance matrix.

53
The Method
  • From a symmetric matrix such as the covariance
    matrix, we can calculate an orthogonal basis by
    finding its eigenvalues and eigenvectors. The
    eigenvectors ei and the corresponding eigenvalues
    λi are the solutions of the equation
  • Cx ei = λi ei, i.e. the λi are the roots of
    |Cx - λI| = 0.
  • By ordering the eigenvectors in the order of
    descending eigenvalues (largest first), one can
    create an ordered orthogonal basis with the first
    eigenvector having the direction of largest
    variance of the data. In this way, we can find
    directions in which the data set has the most
    significant amounts of energy.

54
The Method
  • To reduce the dimensionality from n to k, keep
    only the first k eigenvectors.
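
A compact numpy sketch of the whole method (sample mean, covariance, eigen-decomposition, projection); illustrative only, not the course's reference implementation:

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (M samples x n features) onto the
    first k principal components."""
    # Sample mean and sample covariance matrix.
    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False)
    # Orthogonal basis from the symmetric covariance matrix;
    # eigh returns eigenvalues in ascending order, so reverse.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    E = eigvecs[:, order[:k]]      # first k eigenvectors as columns
    return (X - mu) @ E            # n-dimensional data reduced to k

# Example: correlated 2-D data reduced to its first component.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.3]])
Y = pca(X, k=1)
print(Y.shape)  # (200, 1)
```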

55
A Graphical Example
56
PCA Eigenvectors Projection
  • Project the data onto the selected eigenvectors

57
Another Example
58
Singular Value Decomposition
  • SVD is a technique used for reducing
    dimensionality, based on some properties of
    symmetric matrices.
  • It will be the subject of a talk given by one of
    you!

59
Locality-Sensitive Hashing
  • The key idea of this approach is to create a
    small signature for each document, ensuring that
    similar documents have similar signatures.
  • There exists a family H of hash functions such
    that for each pair of pages u, v we have
    Pr[mh(u) = mh(v)] = sim(u, v), where the hash
    function mh is chosen at random from the family H.
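
For Jaccard similarity, such a family H is realized by min-wise hashing; a toy Python sketch (the salting scheme below is a stand-in for a proper random hash family):

```python
import random

def minhash_signature(shingles, num_hashes=200, seed=0):
    """Min-hash signature of a document given as a set of shingles.
    Each salted hash simulates one function drawn from the family H."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    # Python's string hashing is only stable within one process, so
    # signatures must be built and compared in the same run.
    return [min(hash((salt, s)) for s in shingles) for salt in salts]

def estimated_similarity(sig_u, sig_v):
    """Fraction of agreeing positions; estimates Jaccard sim(u, v)."""
    return sum(a == b for a, b in zip(sig_u, sig_v)) / len(sig_u)

u = set("the quick brown fox jumps over the lazy dog".split())
v = set("the quick brown fox sleeps under the lazy dog".split())
print(estimated_similarity(minhash_signature(u), minhash_signature(v)))
# Exact Jaccard similarity of the two word sets is 6/10 = 0.6.
```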

60
Locality-Sensitive Hashing
  • It will be the subject of a talk given by one of
    you!