Asked by: PParker · Asked: 10/19/2023 · Last edited by: desertnaut · Updated: 10/21/2023 · Views: 67
Cluster people based on spatial coordinates with constraints
Q:
I have a pandas DataFrame `df`. The columns `latitude` and `longitude` represent the spatial coordinates of people.
import pandas as pd
data = {
"latitude": [49.5619579, 49.5619579, 49.56643220000001, 49.5719721, 49.5748542, 49.5757358, 49.5757358, 49.5757358, 49.57586389999999, 49.57182530000001, 49.5719721, 49.572026, 49.5727859, 49.5740071, 49.57500899999999, 49.5751017, 49.5751468, 49.5757358, 49.5659508, 49.56611359999999, 49.5680586, 49.568089, 49.5687609, 49.5699217, 49.572154, 49.5724688, 49.5733994, 49.5678048, 49.5702381, 49.5707702, 49.5710414, 49.5711228, 49.5713705, 49.5723685, 49.5725714, 49.5746149, 49.5631496, 49.5677449, 49.572268, 49.5724273, 49.5726773, 49.5739391, 49.5748542, 49.5758151, 49.57586389999999, 49.5729483, 49.57321150000001, 49.5733375, 49.5745175, 49.574758, 49.5748055, 49.5748103, 49.5751023, 49.57586389999999, 49.56643220000001, 49.5678048, 49.5679685, 49.568089, 49.57182530000001, 49.5719721, 49.5724688, 49.5740071, 49.5757358, 49.5748542, 49.5758151, 49.5758151, 49.5758151, 49.5758151, 49.5758151, 49.5758151, 49.5758151, 49.5758151, 49.5619579, 49.5628938, 49.5630028, 49.5633175, 49.56397639999999, 49.5642962, 49.56643220000001, 49.5679685, 49.570056, 49.5619579, 49.5724688, 49.5745175, 49.5748055, 49.5748055, 49.5748542, 49.5748542, 49.5751023, 49.5751023],
"longitude": [10.9995758, 10.9995758, 10.9999593, 10.9910787, 11.0172739, 10.9920322, 10.9920322, 10.9920322, 11.0244747, 10.9910398, 10.9910787, 10.9907713, 10.9885155, 10.9873742, 10.9861229, 10.9879312, 10.9872357, 10.9920322, 10.9873409, 10.9894231, 10.9882496, 10.9894035, 10.9887881, 10.984756, 10.9911384, 10.9850981, 10.9852771, 10.9954673, 10.9993329, 10.9965937, 10.9949475, 10.9912959, 10.9939141, 10.9916605, 10.9983124, 10.992722, 11.0056254, 10.9954016, 11.017472, 11.0180908, 11.0181911, 11.0175466, 11.0172739, 11.0249866, 11.0244747, 11.0200454, 11.019251, 11.0203055, 11.0183162, 11.0252416, 11.0260046, 11.0228523, 11.0243391, 11.0244747, 10.9999593, 10.9954673, 10.9982288, 10.9894035, 10.9910398, 10.9910787, 10.9850981, 10.9873742, 10.9920322, 11.0172739, 11.0249866, 11.0249866, 11.0249866, 11.0249866, 11.0249866, 11.0249866, 11.0249866, 11.0249866, 10.9995758, 11.000319, 10.9990996, 10.9993819, 11.004145, 11.0039476, 10.9999593, 10.9982288, 10.9993409, 10.9995758, 10.9850981, 11.0183162, 11.0260046, 11.0260046, 11.0172739, 11.0172739, 11.0243391, 11.0243391]
}
df = pd.DataFrame(data)
I want to cluster people based on their spatial coordinates. Each cluster must contain exactly 9 people. However, I want to avoid people with identical spatial coordinates ending up in the same cluster. This can happen because the dataset contains some positions with exactly the same coordinates, which are therefore automatically assigned to the same cluster. So the goal is to prevent this during clustering; in a subsequent step it may be necessary to automatically move people to a neighboring cluster.
To cluster the people I used `k-means-constrained`:
!pip install k-means-constrained
import numpy as np
from k_means_constrained import KMeansConstrained
coordinates = np.column_stack((df["latitude"], df["longitude"]))
# Define the number of clusters and the number of points per cluster
n_clusters = len(df) // 9
n_points_per_cluster = 9
# Perform k-means-constrained clustering
kmc = KMeansConstrained(n_clusters=n_clusters, size_min=n_points_per_cluster, size_max=n_points_per_cluster, random_state=42)
kmc.fit(coordinates)
# Get cluster assignments
df["cluster"] = kmc.labels_
To verify the result, I checked how many people were assigned to the same cluster even though they share identical spatial coordinates:
duplicate_rows = df[df.duplicated(subset=["cluster", "latitude", "longitude"], keep=False)]
duplicate_indices = duplicate_rows.index.tolist()
# Group by specified columns and count occurrences
count_occurrences = df.iloc[duplicate_indices].groupby(['latitude', 'longitude', 'cluster']).size().reset_index(name='count')
print("Number of rows with identical values in specified columns:")
print(count_occurrences)
For example, the print statement produces the following:
Number of rows with identical values in specified columns:
latitude longitude cluster count
0 49.5619579000000030 10.9995758000000006 0 2
1 49.5748054999999965 11.0260046000000003 9 2
2 49.5748541999999972 11.0172738999999993 9 2
3 49.5751022999999975 11.0243391000000006 9 2
4 49.5757357999999968 10.9920322000000006 0 3
5 49.5758150999999998 11.0249866000000001 7 8
In total, (8+3+2+2+2+2) = 19 people are clustered together with neighbors from the same building. I want to minimize this number. `count = 2` or less is fine for me; it is not perfect, but I can deal with it. However, `count > 2` (e.g. index 5) is a no-go: too many people with identical spatial coordinates end up in the same cluster.
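That acceptance rule (a collision group of 2 is tolerable, more than 2 is not) can be reduced to two summary numbers. A sketch on a small toy frame, assuming the clustered DataFrame already carries its `cluster` column as above:

```python
import pandas as pd

# Toy stand-in for the clustered df from the question
toy = pd.DataFrame({
    "latitude":  [49.57, 49.57, 49.57, 49.58],
    "longitude": [10.99, 10.99, 10.99, 11.00],
    "cluster":   [0, 0, 1, 1],
})

# Size of every (coordinates, cluster) collision group
counts = toy.groupby(["latitude", "longitude", "cluster"]).size()
violations = counts[counts > 1]
total_collided = int(violations.sum())  # people sharing coordinates AND a cluster
worst = int(violations.max())           # largest collision group (should stay <= 2)
```

Minimizing `total_collided` while keeping `worst <= 2` is then the objective.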
A:
I think you may be overcomplicating the solution here.
First, if duplicate points are a problem for you, you should decide how to handle them. There is not necessarily one "right" answer here, because it depends on what you are doing and what you want.
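One option, if slightly perturbed coordinates are acceptable for your use case, is to jitter exact duplicates before clustering so that distance-based methods can tell them apart. A sketch on a toy frame (1e-6 degrees of latitude is roughly 0.1 m, small enough not to change which neighbours are "close"):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({"latitude":  [49.57, 49.57, 49.58],
                    "longitude": [10.99, 10.99, 11.00]})

# Perturb only the rows whose coordinates are exact duplicates
dup = toy.duplicated(subset=["latitude", "longitude"], keep=False)
toy.loc[dup, ["latitude", "longitude"]] += rng.normal(0.0, 1e-6, size=(dup.sum(), 2))
```

After this, no two rows share identical coordinates, so no clustering method can group "the same point" twice.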
Second, using clustering with exactly fixed cluster sizes may not be appropriate, because it can lead to strange clusters: you force a point to become part of a distant cluster simply because a closer cluster is "full". I think you may need to refine what you are trying to do to solve this (my solution below has the same problem).
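As a pragmatic middle ground, a post-hoc repair pass can swap colliding duplicates into other clusters while keeping every cluster size unchanged. The helper below is hypothetical (not from the question), and the greedy swap can itself create a new collision in the target cluster, so it may need to be iterated:

```python
import pandas as pd

def repair_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Swap surplus same-coordinate members of a cluster with the nearest
    point from another cluster. Cluster sizes stay unchanged. Greedy: the
    swap can create a new collision elsewhere, so iterate if needed."""
    out = df.copy()
    for _, grp in out.groupby(["latitude", "longitude", "cluster"]):
        for i in grp.index[1:]:  # everyone beyond the first is surplus
            # candidate swap partners: all points currently in other clusters
            others = out[out["cluster"] != out.at[i, "cluster"]]
            d = ((others[["latitude", "longitude"]]
                  - out.loc[i, ["latitude", "longitude"]].values) ** 2).sum(axis=1)
            j = d.idxmin()
            out.at[i, "cluster"], out.at[j, "cluster"] = (
                out.at[j, "cluster"], out.at[i, "cluster"])
    return out

# Toy check: two identical points in cluster 0, one other point in cluster 1
toy = pd.DataFrame({"latitude":  [0.0, 0.0, 1.0],
                    "longitude": [0.0, 0.0, 0.0],
                    "cluster":   [0, 0, 1]})
fixed = repair_duplicates(toy)
```

On the toy frame, one of the two identical points is swapped into cluster 1, so the collision disappears and both cluster sizes are preserved.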
In terms of getting unique clusters of exactly 9 points each, you can do something like the following:
import numpy as np
import pandas as pd
data = {
"latitude": [49.5619579, 49.5619579, 49.56643220000001, 49.5719721, 49.5748542, 49.5757358, 49.5757358, 49.5757358, 49.57586389999999, 49.57182530000001, 49.5719721, 49.572026, 49.5727859, 49.5740071, 49.57500899999999, 49.5751017, 49.5751468, 49.5757358, 49.5659508, 49.56611359999999, 49.5680586, 49.568089, 49.5687609, 49.5699217, 49.572154, 49.5724688, 49.5733994, 49.5678048, 49.5702381, 49.5707702, 49.5710414, 49.5711228, 49.5713705, 49.5723685, 49.5725714, 49.5746149, 49.5631496, 49.5677449, 49.572268, 49.5724273, 49.5726773, 49.5739391, 49.5748542, 49.5758151, 49.57586389999999, 49.5729483, 49.57321150000001, 49.5733375, 49.5745175, 49.574758, 49.5748055, 49.5748103, 49.5751023, 49.57586389999999, 49.56643220000001, 49.5678048, 49.5679685, 49.568089, 49.57182530000001, 49.5719721, 49.5724688, 49.5740071, 49.5757358, 49.5748542, 49.5758151, 49.5758151, 49.5758151, 49.5758151, 49.5758151, 49.5758151, 49.5758151, 49.5758151, 49.5619579, 49.5628938, 49.5630028, 49.5633175, 49.56397639999999, 49.5642962, 49.56643220000001, 49.5679685, 49.570056, 49.5619579, 49.5724688, 49.5745175, 49.5748055, 49.5748055, 49.5748542, 49.5748542, 49.5751023, 49.5751023],
"longitude": [10.9995758, 10.9995758, 10.9999593, 10.9910787, 11.0172739, 10.9920322, 10.9920322, 10.9920322, 11.0244747, 10.9910398, 10.9910787, 10.9907713, 10.9885155, 10.9873742, 10.9861229, 10.9879312, 10.9872357, 10.9920322, 10.9873409, 10.9894231, 10.9882496, 10.9894035, 10.9887881, 10.984756, 10.9911384, 10.9850981, 10.9852771, 10.9954673, 10.9993329, 10.9965937, 10.9949475, 10.9912959, 10.9939141, 10.9916605, 10.9983124, 10.992722, 11.0056254, 10.9954016, 11.017472, 11.0180908, 11.0181911, 11.0175466, 11.0172739, 11.0249866, 11.0244747, 11.0200454, 11.019251, 11.0203055, 11.0183162, 11.0252416, 11.0260046, 11.0228523, 11.0243391, 11.0244747, 10.9999593, 10.9954673, 10.9982288, 10.9894035, 10.9910398, 10.9910787, 10.9850981, 10.9873742, 10.9920322, 11.0172739, 11.0249866, 11.0249866, 11.0249866, 11.0249866, 11.0249866, 11.0249866, 11.0249866, 11.0249866, 10.9995758, 11.000319, 10.9990996, 10.9993819, 11.004145, 11.0039476, 10.9999593, 10.9982288, 10.9993409, 10.9995758, 10.9850981, 11.0183162, 11.0260046, 11.0260046, 11.0172739, 11.0172739, 11.0243391, 11.0243391]
}
df = pd.DataFrame(data)
def grouping(lat_lon: np.ndarray, df: pd.DataFrame, results: list):
    # First calculate the distance from the given point to all the rest
    distances = ((df - lat_lon)**2).sum(axis=1)**0.5
    # Sort these values so that they are closest -> furthest
    distances = distances.sort_values()
    # Take the top 9 (this will include the point itself)
    top_9 = distances.index.values[:9]
    # Store the results in a list
    results.append(top_9)
    # Remove the values that are now in a cluster from
    # the dataframe, so that you don't add points to
    # multiple clusters
    df = df.drop(index=top_9)
    # Then, while there is still data left to cluster
    if len(df) != 0:
        # recurse this function with the next lat-lon point
        return grouping(df.iloc[0].values, df, results)
    # Otherwise, if there is no data left
    else:
        return results
tmp_df = df.copy(deep=True)
clusters = grouping(tmp_df.iloc[0].values, tmp_df, [])
print(clusters)
Output:
[array([ 0, 81, 72, 1, 74, 73, 75, 78, 2]),
array([ 3, 59, 10, 58, 9, 24, 11, 33, 31]),
array([ 4, 87, 86, 63, 42, 41, 83, 48, 40]),
array([ 5, 6, 7, 17, 62, 35, 15, 12, 32]),
array([ 8, 53, 44, 71, 70, 69, 68, 67, 66]),
array([13, 61, 16, 14, 26, 60, 25, 82, 23]),
array([18, 19, 20, 21, 57, 22, 37, 55, 27]),
array([28, 80, 79, 56, 34, 29, 54, 30, 77]),
array([36, 76, 38, 39, 46, 45, 47, 51, 52]),
array([43, 64, 65, 88, 89, 49, 50, 84, 85])]
These results should be organized with the initial point in position 0 of each array, unless there are duplicates, in which case there may be some issues (although depending on how the initial points are chosen, it may be fine).
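One caveat with the recursive form: Python's default recursion limit caps how many clusters it can build (irrelevant at 90 points, but a real limit on large data). An equivalent iterative loop, shown here on a toy frame, avoids that:

```python
import numpy as np
import pandas as pd

def grouping_iter(df: pd.DataFrame, size: int = 9) -> list:
    """Same nearest-9 strategy as grouping(), but with a loop instead of
    recursion, so very large inputs cannot hit the recursion limit."""
    remaining = df.copy()
    results = []
    while len(remaining) > 0:
        seed = remaining.iloc[0].values
        # Euclidean distance from the seed to every remaining point
        distances = ((remaining - seed) ** 2).sum(axis=1) ** 0.5
        top = distances.sort_values().index.values[:size]
        results.append(top)
        remaining = remaining.drop(index=top)
    return results

# Toy check: 18 points on a line fall into two groups of 9
toy = pd.DataFrame({"latitude": np.arange(18.0), "longitude": np.zeros(18)})
clusters = grouping_iter(toy)
```

The output format matches the recursive version: a list of index arrays, one per cluster.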
To demonstrate the problem with fixing 9 points per group:
import matplotlib.pyplot as plt

for idx, vals in enumerate(clusters):
    yx = df.iloc[vals].values
    plt.scatter(yx[:, 1], yx[:, 0], c=f'C{idx}')
plt.show()
Notice how the red cluster is split around the orange cluster, and how the yellow/green clusters are split across a large spatial distance.
These problems can be mitigated by changing the order in which points are clustered (e.g. shuffle them and re-run until you get something that looks good), but with a lot of data this can become laborious. You could automate it by creating a way to detect the problem. For example, you could build a geopandas GeoDataFrame from the points, form a convex hull for each cluster, and check for overlaps; if there are any, shuffle and repeat the clustering. It will not be particularly efficient, though.
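A minimal sketch of that overlap test, using shapely directly rather than a full geopandas frame, and treating lat/lon as planar coordinates (a reasonable approximation at this spatial scale):

```python
import pandas as pd
from shapely.geometry import MultiPoint

def clusters_overlap(df: pd.DataFrame, clusters) -> bool:
    # One convex hull per cluster (lon as x, lat as y), then test all pairs;
    # intersection area > 0 ignores hulls that merely touch at a point/edge
    hulls = [MultiPoint(df.iloc[idx][["longitude", "latitude"]]
                        .values.tolist()).convex_hull
             for idx in clusters]
    for a in range(len(hulls)):
        for b in range(a + 1, len(hulls)):
            if hulls[a].intersection(hulls[b]).area > 0:
                return True
    return False

# Toy check: two unit squares, the second offset by 0.5 (so they overlap)
toy = pd.DataFrame({
    "latitude":  [0, 0, 1, 1, 0.5, 0.5, 1.5, 1.5],
    "longitude": [0, 1, 0, 1, 0.5, 1.5, 0.5, 1.5],
})
overlapping = clusters_overlap(toy, [[0, 1, 2, 3], [4, 5, 6, 7]])

# Move the second square far away: no overlap
far = toy.copy()
far.loc[[4, 5, 6, 7], ["latitude", "longitude"]] += 10
separated = clusters_overlap(far, [[0, 1, 2, 3], [4, 5, 6, 7]])
```

A shuffle-and-repeat loop would then re-run the clustering while `clusters_overlap(...)` returns True.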