Data_clustering Data_clustering

Data clustering - Definition and Overview

Data clustering is a common technique for data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering consists of partitioning a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often similarity or proximity for some defined distance measure.

Data clustering algorithms can be hierarchical or partitional, and hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down).

Contents

Hierarchical clustering

Introduction

Hierarchical clustering builds a hierarchy of clusters from individual elements. The traditional representation of this hierarchy is a tree, with individual elements at one end and a single cluster with every element at the other.

Traditional representation

Cutting the tree at a given height will give a clustering at a selected precision. In the above example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, but with fewer clusters.

Agglomerative hierarchical clustering

This method builds the hierarchy from the individual elements by progressively merging clusters. Again, we have six elements {a} {b} {c} {d} {e} and {f}. the first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, therefore we must define a distance <math>d(element_1,element_2)<math> between elements.

Suppose we have merged the two closest elements {b} and {c}, we now have the following clusters {a} {b c} {d} {e} and {f}, and want to merge them further. But to do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters <math>\mathcal{A}<math> and <math>\mathcal{B}<math> is one of the following:

  • the maximum distance between elements of each cluster (also called complete linkage clustering) <math> max_{x \in \mathcal{A}, y \in \mathcal{B}} d(x,y) <math>
  • the minimum distance between elements of each cluster (also called single linkage clustering) <math> min_{x \in \mathcal{A}, y \in \mathcal{B}} d(x,y) <math>
  • the mean distance between elements of each cluster (also called average linkage clustering) <math> {1 \over {card(\mathcal{A})card(\mathcal{B})}}\sum_{x \in \mathcal{A}}\sum_{ y \in \mathcal{B}} d(x,y) <math>
  • the sum of all intra cluster variance
  • the increase in variance for the cluster being merged (Ward's criterion)

Each of the agglomeration occurs at a certain distance between clusters and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).

K-means and derivatives

k-means clustering

The k-means algorithm assigns each point to the cluster which center (or centroid) is nearest, the centroid being defined as the point which is the average of all the points in the cluster.

As one can see, this is not a complete definition, because the clusters depend on the centroid and the centroid depend on the cluster, so the following algorithm is used:

  • Choose a number of clusters
  • Assign each point randomly to a cluster
  • Repeat until the algorithm has converged (that is, no more points are changing cluster during one iteration) :
    • Compute the centroid for each cluster, that is the point which coordinates is the mean of all the points in the cluster
    • For each point, assign it to the cluster which centroid is nearest

The main advantages of this algorithm are its simplicity and speed, which allows it to run on large datasets. Yet it does not systematically yield the same result with each run of the algorithm. Rather, the resulting clusters depend on the initial assignments. The k-means algorithm maximizes inter-cluster (or minimizes intra-cluster) variance, but does not ensure that the solution given is not a local minimum of variance.

fuzzy c-means clustering

One of the problems of the k-means algorithm is that it gives a hard partitioning of the data, that is to say that each point is attributed to one and only one cluster. But points on the edge of the cluster, or near another cluster may not be as much in the cluster as point in the center of cluster.

Therefore, in fuzzy clustering, each point does not pertain to a given cluster, but has a degree of belonging to a certain cluster, as in fuzzy logic. For each point x we have a coefficient giving the degree of being in the k-th cluster <math>u_k(x)<math>. Usually, the sum of those coefficients has to be one, so that <math>u_k(x)<math> denotes a probability of belonging to a certain cluster. <math> \forall x \sum_{k=1}^{num. clusters} u_k(x) \ =1<math>

With fuzzy c-means, the centroid of a cluster is computed as being the mean of all points, weighted by their degree of belonging to the cluster, that is <math>center_k = {{\sum_x u_k(x) x} \over {\sum_x u_k(x)}}<math>.

The degree of being in a certain cluster is the inverse of the distance to the cluster <math>u_k(x) = {1 \over d(center_k,x)}<math>, then the coefficients are normalized so that their sum is 1.

The fuzzy c-means algorithm is greatly similar to the k-means algorithm:

  • Choose a number of clusters
  • Assign randomly to each point coefficients for being in the clusters
  • Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than <math>\epsilon<math>, the given sensitivity threshold) :
    • Compute the centroid for each cluster, using the formula above
    • For each point, compute its coefficients of being in the clusters, using the formula above

The fuzzy c-means algorithm minimizes intra-cluster variance as well, but has the same problems as k-means, the minimum is local minimum, and the results depend on the initial choice of weights.

Applications

In biology has two main applications in the fields of computational biology and bioinformatics.

References

  • Jain, Murty and Flynn: Data Clustering: A Review, ACM Comp. Surv, 1999. Available here (http://citeseer.ist.psu.edu/jain99data.html)
  • for another presentation of hierarchical, k-means and fuzzy c-means see this introduction to clustering (http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html). Also has an explanation on mixture of gaussians.

See also

k-means, ANN, Self-organizing map

Copyright 2009 WordIQ.com - Privacy Policy  :: Terms of Use  :: Contact Us  :: About Us
This article is licensed under the GNU Free Documentation License. It uses material from the this Wikipedia article.