Python: How to place the initial centroids on specific data points in k-means?

Question:

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 

df = pd.DataFrame() 
df['x'] = [3, 2, 4, 3, 4, 6, 8, 7, 8, 9] 
df['y'] = [3, 2, 3, 4, 5, 6, 5, 4, 4, 3] 
df['val'] = [1, 10, 1, 1, 1, 8, 1, 1, 1, 1] 

k = 2 
centroids = {i + 1: [np.random.randint(0, 10), np.random.randint(0, 10)] for i in range(k)} 

plt.scatter(df['x'], df['y'], color='blue') 
for i in centroids.keys(): 
    plt.scatter(*centroids[i], color='red', marker='^') 
plt.show()

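I want the initial centroids to be placed on the data points with the highest val. In this case, the centroids should sit on the data points at coordinates (2, 2) and (6, 6).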

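Are you using scikit-learn's KMeans estimator? If so, you can pass an array to give the initial centers. See the `init` parameter [here](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). Or are you asking how to construct that array? –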
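For readers who do use scikit-learn, the suggestion in this comment could look roughly like the sketch below. It is an illustration only, not part of the original thread: it assumes scikit-learn is installed, and the variable names `X`, `init_centers` and `model` are made up for the example. `n_init=1` is set because there is a single, fixed starting configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same x/y coordinates as in the question
X = np.array([[3, 3], [2, 2], [4, 3], [3, 4], [4, 5],
              [6, 6], [8, 5], [7, 4], [8, 4], [9, 3]], dtype=float)

# Hand-picked starting centers: the points with the highest 'val'
init_centers = np.array([[2, 2], [6, 6]], dtype=float)

# An array passed as `init` makes KMeans start from exactly these centers
model = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(X)
labels = model.labels_
print(model.cluster_centers_)
```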

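@MarkDickinson Yes, I am asking how to write Python code that places the centroids on the points with the highest values, because I am not using scikit-learn here. I wrote my own code for k-means. – arizamoona

Answer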

You can sort the DataFrame by the val column so that the rows with the top k values come first, and then slice the DataFrame with df.iloc.

Sort in descending order:

df = df.sort_values('val', ascending=False)
print(df)

   x  y  val
1  2  2   10
5  6  6    8
0  3  3    1
2  4  3    1
3  3  4    1
4  4  5    1
6  8  5    1
7  7  4    1
8  8  4    1
9  9  3    1

Slice the DataFrame:

k = 2  # Number of centroids
highest_points_as_centroids = df.iloc[0:k, [0, 1]]

print(highest_points_as_centroids)

   x  y
1  2  2
5  6  6

You can get the x, y values as a NumPy array with highest_points_as_centroids.values:

array([[2, 2],
       [6, 6]], dtype=int64)

EDIT1:

More concise (as suggested by @sharatpc):

df.nlargest(2, 'val')[['x','y']].values 
array([[2, 2], 
    [6, 6]], dtype=int64)
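One thing worth knowing about `nlargest` is how it handles ties, since most rows in this data share val == 1. The sketch below uses the `keep` parameter of `DataFrame.nlargest`; choosing k = 3 is an assumption made purely to trigger a tie and does not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'x':   [3, 2, 4, 3, 4, 6, 8, 7, 8, 9],
    'y':   [3, 2, 3, 4, 5, 6, 5, 4, 4, 3],
    'val': [1, 10, 1, 1, 1, 8, 1, 1, 1, 1],
})

# With k = 3 the third place is a tie between every row with val == 1.
# keep='first' (the default) picks the first tied row in the original order.
first = df.nlargest(3, 'val', keep='first')[['x', 'y']]

# keep='all' returns every tied row, so more than k rows can come back.
all_ties = df.nlargest(3, 'val', keep='all')[['x', 'y']]

print(len(first), len(all_ties))
```

For a hand-written k-means that expects exactly k centroids, the default `keep='first'` is probably the safer choice.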

EDIT2:

Since the OP commented that they want the centroids in a dictionary:

centroids = highest_points_as_centroids.reset_index(drop=True).T.to_dict('list') 
print(centroids) 
{0: [2L, 2L], 1: [6L, 6L]}

If the dictionary keys strictly need to start from 1:

highest_points_as_centroids.reset_index(drop=True, inplace=True) 
highest_points_as_centroids.index +=1 
centroids = highest_points_as_centroids.T.to_dict('list') 
print(centroids) 
{1: [2L, 2L], 2: [6L, 6L]}
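As an alternative sketch, the same 1-based dictionary can be built without touching the DataFrame index, using `enumerate` with `start=1`. This mirrors the `{i + 1: [...]}` shape of the dictionary in the question; the name `top` is only illustrative. On Python 3 the values print as plain ints, without the Python 2 `L` suffix seen in the output above.

```python
import pandas as pd

df = pd.DataFrame({
    'x':   [3, 2, 4, 3, 4, 6, 8, 7, 8, 9],
    'y':   [3, 2, 3, 4, 5, 6, 5, 4, 4, 3],
    'val': [1, 10, 1, 1, 1, 8, 1, 1, 1, 1],
})

k = 2
top = df.nlargest(k, 'val')[['x', 'y']]

# enumerate(..., start=1) numbers the centroids 1..k
centroids = {i: xy for i, xy in enumerate(top.values.tolist(), start=1)}
print(centroids)
```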

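You don't need to slice the DataFrame. Just use nlargest to get the top 2: `df.nlargest(2, 'val')`; or `df.sort_values('val', ascending=False).head(2)` – skrubber

If you only want x and y in the output, then: `df.nlargest(k, 'val')[['x', 'y']]` or `df.sort_values('val', ascending=False)[['x', 'y']].head(k)` – skrubber

Thanks! I didn't know about `nlargest`. I have added it to the answer. – akilat90

Answer

Just answering @arzamoona's other question in the same place: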

import pandas as pd 
import random 
import matplotlib.pyplot as plt 

df = pd.DataFrame() 
df['x'] = [3, 2, 4, 3, 4, 6, 8, 7, 8, 9] 
df['y'] = [3, 2, 3, 4, 5, 6, 5, 4, 4, 3] 
df['val'] = [1, 10, 1, 1, 1, 8, 1, 1, 1, 1] 

k = 2 
centroids=df.nlargest(k, 'val')[['x','y']] 

plt.scatter(df['x'], df['y'], color='blue') 
plt.scatter(centroids.x, centroids.y, color='red', marker='^') 
plt.show()

Then, to add the centroid values to a dictionary:

{i:v for i,v in enumerate(centroids.values.tolist())} 
{0: [2, 2], 1: [6, 6]}
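Since the OP uses a hand-written k-means, here is a rough sketch of how centroids chosen this way could feed into one assignment/update iteration. This is not the OP's code (which is not shown in the thread); it is a plain NumPy illustration, and `points`, `dists`, `labels` and `new_centroids` are names invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x':   [3, 2, 4, 3, 4, 6, 8, 7, 8, 9],
    'y':   [3, 2, 3, 4, 5, 6, 5, 4, 4, 3],
    'val': [1, 10, 1, 1, 1, 8, 1, 1, 1, 1],
})

k = 2
# Initial centroids: coordinates of the k rows with the highest 'val'
centroids = df.nlargest(k, 'val')[['x', 'y']].to_numpy(dtype=float)
points = df[['x', 'y']].to_numpy(dtype=float)

# Assignment step: squared distance from every point to every centroid
dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = dists.argmin(axis=1)

# Update step: each centroid moves to the mean of its assigned points
new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
print(labels, new_centroids)
```

A full implementation would repeat these two steps until the labels stop changing.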

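You can use `to_dict` to convert the centroids to a dictionary without a for loop. – akilat90

But that produces something different: `{'x': {1: 2, 5: 6}, 'y': {1: 2, 5: 6}}` – skrubber

You have to change the `orient` parameter. Check my answer – akilat90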
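The difference discussed in these comments can be reproduced with the sketch below, which compares the default `to_dict()` with the transposed `orient='list'` form used in the first answer. The names `by_column` and `by_row` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'x':   [3, 2, 4, 3, 4, 6, 8, 7, 8, 9],
    'y':   [3, 2, 3, 4, 5, 6, 5, 4, 4, 3],
    'val': [1, 10, 1, 1, 1, 8, 1, 1, 1, 1],
})

centroids = df.nlargest(2, 'val')[['x', 'y']]

# Default orient='dict': keyed by column name, then by the original row index
by_column = centroids.to_dict()

# Reset the index, transpose, and use orient='list': one [x, y] list per centroid
by_row = centroids.reset_index(drop=True).T.to_dict(orient='list')

print(by_column)
print(by_row)
```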
