How to cluster people who live close (but not too close) to each other? [closed]

Question:

Closed last month.

What I have:

I have a pandas DataFrame with the columns latitude and longitude, which represent the spatial coordinates of people's homes.

This could be an example:

import pandas as pd

data = {
"latitude": [49.5659508, 49.568089, 49.5686342, 49.5687609, 49.5695834, 49.5706579, 49.5711228, 49.5716422, 49.5717749, 49.5619579, 49.5619579, 49.5628938, 49.5628938, 49.5630028, 49.5633175, 49.56397639999999, 49.566359, 49.56643220000001, 49.56643220000001, 49.5672061, 49.567729, 49.5677449, 49.5679685, 49.5679685, 49.5688543, 49.5690616, 49.5713705],
"longitude": [10.9873409, 10.9894035, 10.9896749, 10.9887881, 10.9851579, 10.9853273, 10.9912959, 10.9910182, 10.9867083, 10.9995758, 10.9995758, 11.000319, 11.000319, 10.9990996, 10.9993819, 11.004145, 11.0003023, 10.9999593, 10.9999593, 10.9935709, 11.0011213, 10.9954016, 10.9982288, 10.9982288, 10.9975928, 10.9931367, 10.9939141],
}
df = pd.DataFrame(data)

df.head(11)

    latitude    longitude
0   49.565951   10.987341 
1   49.568089   10.989403 
2   49.568634   10.989675
3   49.568761   10.988788
4   49.569583   10.985158
5   49.570658   10.985327 
6   49.571123   10.991296
7   49.571642   10.991018
8   49.571775   10.986708
9   49.561958   10.999576
10  49.561958   10.999576

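What I need:

I need to group the people into clusters of size 9, so that I get clusters of neighbors. However, I do not want people with exactly the same spatial coordinates to be in the same cluster. Since my dataset contains more than 3,000 people, there are many people (a few hundred) who share exactly the same spatial coordinates.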
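
To gauge how large the shared-address problem is before clustering, the people who share exact coordinates can be counted first. This is a minimal sketch on a five-row excerpt of the sample data; the helper name `count_shared_locations` is hypothetical and not part of the original post.

```python
import pandas as pd

def count_shared_locations(df, cols=("latitude", "longitude")):
    """Return (people at shared locations, number of shared locations, largest group)."""
    # Size of every group of rows with identical coordinates
    sizes = df.groupby(list(cols)).size()
    # Keep only locations occupied by more than one person
    shared = sizes[sizes > 1]
    return int(shared.sum()), int(len(shared)), int(sizes.max())

# Excerpt of the sample data (rows 9 to 13)
sample = pd.DataFrame({
    "latitude": [49.5619579, 49.5619579, 49.5628938, 49.5628938, 49.5630028],
    "longitude": [10.9995758, 10.9995758, 11.000319, 11.000319, 10.9990996],
})
people, locations, largest = count_shared_locations(sample)
print(people, locations, largest)
```

The largest group size is the interesting number here: it is a lower bound on how many different clusters the residents of that address have to be spread over.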

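How do I cluster the people? A good algorithm for the clustering job is k-means-constrained. As described in this article, the algorithm allows the cluster size to be set to 9. It took me only a few lines of code to cluster the people.

The problem:

People who live in the same building (with identical spatial coordinates) always end up in the same cluster, because the goal is to cluster people who live close to each other. I therefore need an automatic way to put these people into different clusters. Not just any different cluster, though, but one that contains people who still live relatively close by (see the image below).

The image below summarizes my problem:

Background information:

This is how I cluster the people: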

import numpy as np
from k_means_constrained import KMeansConstrained

coordinates = np.column_stack((df["latitude"], df["longitude"]))

# Define the number of clusters and the number of points per cluster
n_clusters = len(df) // 9
n_points_per_cluster = 9

# Perform k-means-constrained clustering
kmc = KMeansConstrained(n_clusters=n_clusters, size_min=n_points_per_cluster, size_max=n_points_per_cluster, random_state=0)
kmc.fit(coordinates)

# Get cluster assignments
df["cluster"] = kmc.labels_

# Print the clusters
for cluster_num in range(n_clusters):
    # Select both columns with a list (double brackets); a bare tuple raises a KeyError
    cluster_data = df[df["cluster"] == cluster_num][["latitude", "longitude"]]
    print(f"Cluster {cluster_num + 1}:")
    print(cluster_data)
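
One caveat about the snippet above: with `n_clusters = len(df) // 9` and `size_min = size_max = 9`, every row can only be assigned when `len(df)` is an exact multiple of 9, because otherwise `n_clusters * size_max` is smaller than the number of rows. The sketch below shows one way to pick bounds that always fit; the helper name `cluster_size_bounds` is hypothetical, and rounding the number of clusters up (so that some clusters are smaller than 9) is an assumption, not something the original post specifies.

```python
import math

def cluster_size_bounds(n_samples, target_size=9):
    """Return (n_clusters, size_min, size_max) such that
    n_clusters * size_min <= n_samples <= n_clusters * size_max."""
    if n_samples < 1 or target_size < 1:
        raise ValueError("n_samples and target_size must be positive")
    # Round up so that clusters of at most target_size can hold everyone
    n_clusters = math.ceil(n_samples / target_size)
    # Smallest size a cluster may need to have when the rows are spread evenly
    size_min = n_samples // n_clusters
    return n_clusters, size_min, target_size

print(cluster_size_bounds(27))    # the 27-row sample
print(cluster_size_bounds(3005))  # a made-up row count that is not a multiple of 9
```

The three returned values could then be passed as `n_clusters`, `size_min` and `size_max` to `KMeansConstrained`.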

Tags: python, pandas, cluster-analysis, k-means, nearest-neighbor

Answer:

# add a new feature
df['feature'] = df.groupby(['latitude', 'longitude']).cumcount()
# just for visually checking prints (can remove)
df['IsDuplicate'] = df.groupby(['latitude', 'longitude'])['feature'].transform('count') > 1
coordinates = np.column_stack((df["latitude"], df["longitude"], df['feature']))

So when you run your function and print all columns, you can see that the duplicates have been assigned to a different cluster:

Cluster 1:
    latitude  longitude  feature  IsDuplicate  cluster
0  49.565951  10.987341        0        False        0
1  49.568089  10.989403        0        False        0
2  49.568634  10.989675        0        False        0
3  49.568761  10.988788        0        False        0
4  49.569583  10.985158        0        False        0
5  49.570658  10.985327        0        False        0
6  49.571123  10.991296        0        False        0
7  49.571642  10.991018        0        False        0
8  49.571775  10.986708        0        False        0
Cluster 2:
     latitude  longitude  feature  IsDuplicate  cluster
10  49.561958  10.999576        1         True        1
12  49.562894  11.000319        1         True        1
18  49.566432  10.999959        1         True        1
19  49.567206  10.993571        0        False        1
21  49.567745  10.995402        0        False        1
23  49.567968  10.998229        1         True        1
24  49.568854  10.997593        0        False        1
25  49.569062  10.993137        0        False        1
26  49.571371  10.993914        0        False        1
Cluster 3:
     latitude  longitude  feature  IsDuplicate  cluster
9   49.561958  10.999576        0         True        2
11  49.562894  11.000319        0         True        2
13  49.563003  10.999100        0        False        2
14  49.563317  10.999382        0        False        2
15  49.563976  11.004145        0        False        2
16  49.566359  11.000302        0        False        2
17  49.566432  10.999959        0         True        2
20  49.567729  11.001121        0        False        2
22  49.567968  10.998229        0         True        2
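
On a dataset with thousands of rows, reading the printed clusters by eye does not scale, so the result can be checked programmatically. This is a minimal sketch that reports, for each cluster, the largest number of members with identical coordinates; the helper name `max_shared_per_cluster` is hypothetical, and the small frame reuses rows 9 to 13 of the output above.

```python
import pandas as pd

def max_shared_per_cluster(df, cluster_col="cluster", cols=("latitude", "longitude")):
    """For each cluster, the largest number of members sharing one location."""
    # Count the rows for every (cluster, latitude, longitude) combination
    counts = df.groupby([cluster_col, *cols]).size()
    # Keep the worst case within each cluster
    return counts.groupby(level=cluster_col).max()

# Rows 9 to 13 of the printed result, with their assigned cluster labels
check = pd.DataFrame({
    "latitude": [49.561958, 49.561958, 49.562894, 49.562894, 49.563003],
    "longitude": [10.999576, 10.999576, 11.000319, 11.000319, 10.999100],
    "cluster": [2, 1, 2, 1, 2],
})
worst = max_shared_per_cluster(check)
print(worst)
```

A value of 1 for every cluster means that no two people from the same address ended up together.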

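Applying your approach returns a useful solution. Thank you very much! I noticed that not all clusters are clean, because my dataset contains a lot of student dormitories (and therefore many people with the same location). That is why I sometimes get clusters in which up to 6 people share the same location. I wonder whether there is a way to focus on these large student dormitories and reduce the number of clusters that contain more than 2 duplicates. In other words: I would be fine with a cluster containing 2 duplicates, but no more. How can I prevent people from the same student dormitory from being clustered together?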
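
One possible direction for the dormitory case, offered as an untested heuristic rather than a guaranteed fix: number the residents of each address with `cumcount()` as in the answer, but integer-divide the counter by the number of same-address people tolerated per cluster (2 here). Pairs then share a level, while larger groups are spread over several levels. Because the levels differ by whole numbers and the coordinates in the sample differ by only about 0.01 degrees, the level dominates the Euclidean distance used by k-means. The helper name `level_feature` and the `max_shared` parameter are hypothetical, and the dormitory occupancy below is made up for illustration.

```python
import pandas as pd

def level_feature(df, max_shared=2, cols=("latitude", "longitude")):
    """Extra clustering feature: at most max_shared people per address share a level."""
    if max_shared < 1:
        raise ValueError("max_shared must be at least 1")
    # cumcount() numbers the people at each address 0, 1, 2, ...
    return df.groupby(list(cols)).cumcount() // max_shared

# Made-up example: five people at one address, one person at another
dorm = pd.DataFrame({
    "latitude": [49.5679685] * 5 + [49.5688543],
    "longitude": [10.9982288] * 5 + [10.9975928],
})
dorm["feature"] = level_feature(dorm)
print(dorm["feature"].tolist())  # [0, 0, 1, 1, 2, 0]
```

The resulting column could replace `df['feature']` in the `np.column_stack` call from the answer. The size constraints can still force more than two residents of a large dormitory into one cluster, so the outcome should be verified after fitting.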
