聚类算法5——Hierarchy 层次聚类(算法步骤及matlab代码)
看了西关书的聚类算法,算法原理很容易明白,接下来就是整理成自己的理解思路,然后一步一步来实现算法,那么就来做吧。
hierarchyClustering算法
- 思想
不同层次对数据集进行划分,AGENS是一种自底而上的聚合策略。首先将数据集中的每个样本看作初始聚类簇,算法运行中找出距离最近的两个聚类簇进行合并,该过程不断重复直到预设的聚类簇个数。
- 算法步骤
输入:样本集D(m,n),聚类簇个数k
输出:聚类簇集C
Step1 初始化聚类簇集合,距离矩阵 init_hierarchy()
输入:样本集D; 输出:聚类簇集,距离矩阵
Step1.1 聚类簇集: 直接将原始数据集赋值给聚类簇集合。(聚类簇类型:结构体,包括样本集合,索引)
Step1.2计算距离矩阵(m,m-1),第一行表示第一个样本与其他样本的距离值(子函数:聚类簇距离度量函数d, get_dist(C1_data,C2_data))
Step2 聚类, hierarchy_clustering()
输入:聚类簇个数k,聚类簇集C,距离矩阵;输出:k个聚类簇集
While 当前聚类集中个数q>k
Step2.1 根据距离函数寻找最近的两个聚类簇Ci, Cj,合并Ci = Ci∪Cj(set_)
Stepg2.2更新聚类簇集合:删除第j个聚类簇,j:end的聚类簇索引依次-1
Step2.3更新距离矩阵M:删除M的第j行第j列,更新i行j(1:q-1)列,j(1:q-1)行i列
q=q-1
end while
ok啦,接下来不废话上代码(Matlab发布形式)
function Main() clc;clear;close all melon_data = load('melon4.0.txt'); melon_data(:,1)=[];k=4; %step1 [dist_matrix,cluster_set]= init_hierarchy(melon_data); %step2 cluster_ret = hierarchy_clustering(dist_matrix,cluster_set,k); show(cluster_ret); show_cluster_tree(melon_data); end
subfunction
%step1 function [dist_matrix,cluster_set]= init_hierarchy(melon_data) cluster_set=[]; [m,~] =size(melon_data); dist_matrix = zeros(m,m); for i=1:m cluster.ind = i; cluster.data = melon_data(i,:); cluster_set = [cluster_set;cluster]; end for i = 1:m for j = i+1:m dist_matrix(i,j) =get_dist(cluster_set(i),cluster_set(j)); dist_matrix(j,i) =dist_matrix(i,j); end end end %get two cluster dis function dist =get_dist(cluster1,cluster2) [m1,~]=size(cluster1.data);%2x2 dist = 10000; for i =1:m1 tm_dist = pdist2(cluster2.data,cluster1.data(i,:)); [min_value,~]=min(tm_dist); if min_value < dist dist = min_value; end end end %step2 function cluster = hierarchy_clustering(dist_matrix,cluster_set,k) while cluster_set(end).ind >k [Ci,Cj]= find_closest_cluster(dist_matrix); %Ci<Cj cluster_set = update_cluster_set(Ci,Cj,cluster_set); dist_matrix = update_dist_matrix(Ci,Cj,cluster_set,dist_matrix); end cluster = cluster_set; end function [Ci,Cj]= find_closest_cluster(dist_matrix) [m,n] = size(dist_matrix); min_dist = 10000; start=2;Ci=0;Cj=0; for i = 1:m-1 [tem_dist,min_ind] = min(dist_matrix(i,start:n)); if tem_dist<min_dist min_dist=tem_dist; Ci =i; Cj = min_ind+start-1; end start = start+1; end if Ci>Cj tem_c = Cj; Cj =Ci; Ci =tem_c; end end function cluster = update_cluster_set(Ci,Cj,cluster_set) for i =Cj+1:length(cluster_set) cluster_set(i).ind = cluster_set(i).ind -1;%update end % 合并聚类簇,序号小的 合并 序号大的 Ci < Cj cluster_set(Ci).data = [cluster_set(Ci).data;cluster_set(Cj).data]; cluster_set(Cj)=[]; %delete cluster = cluster_set; end function dist_m = update_dist_matrix(Ci,Cj,cluster_set,dist_matrix) %step2.3 dist_matrix(:,Cj)=[]; dist_matrix(Cj,:)=[]; %updat Ci行,1:cluster_set(end).ind 列 for j = 1:cluster_set(end).ind dist_matrix(Ci,j) = get_dist(cluster_set(Ci),cluster_set(j)); dist_matrix(j,Ci) = dist_matrix(Ci,j); end dist_m = dist_matrix; end function show(cluster_ret) plot(cluster_ret(1).data(:,1),cluster_ret(1).data(:,2),'or');hold on plot(cluster_ret(2).data(:,1),cluster_ret(2).data(:,2),'og'); plot(cluster_ret(3).data(:,1),cluster_ret(3).data(:,2),'sr'); plot(cluster_ret(4).data(:,1),cluster_ret(4).data(:,2),'sg'); xlabel('density');ylabel('sugar rate'); end function show_cluster_tree(melon_data) Y=pdist(melon_data); Z = linkage(Y,'single'); dendrogram(Z) xlabel('sample number');ylabel('cluster distance'); end