Handling outliers in Python

Question:

I want to do some data analysis on the Superstore Sales CSV dataset from Kaggle (link below):

https://www.kaggle.com/datasets/laibaanwer/superstore-sales-dataset

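The first thing I need to do is clean the data by handling outliers and missing values. The outliers are concentrated mainly in the 'Sales' column. I'd like to know whether there is a better way in Python to filter outliers out of the dataset.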

I used the matplotlib.pyplot library to draw a boxplot of the 'Sales' column in the raw data, and found that there are a lot of outliers (screenshot attached).

I then used the following code to filter out the outliers, which removed 1145 of the 9800 entries:

# cleaning the data

import matplotlib.pyplot as plt

# salesData is assumed to be a pandas DataFrame already loaded from the CSV

# finding the lower and upper quartiles

quantile1 = salesData['Sales'].quantile(0.25)
quantile3 = salesData['Sales'].quantile(0.75)

#finding the IQR
IQR = quantile3-quantile1

#finding out the lower and upper bounds
lower_value = quantile1 - 1.5 * IQR
upper_value = quantile3 + 1.5 * IQR


#filtering out the 'salesData' after removing the outliers
#storing it in a new dataset named 'cleanData'
cleanData = salesData[(salesData['Sales'] >= lower_value) & (salesData['Sales'] <= upper_value)]

plt.boxplot(cleanData['Sales'], vert = False)
plt.show()

#print the number of rows and columns after removing outliers
print(cleanData.shape)
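
For readers who want to reuse the step above, here is a minimal sketch that wraps the same 1.5 * IQR fences in a function and flags rows instead of dropping them straight away. The function name `iqr_bounds`, the variable names, and the tiny `demo` frame are made up for illustration; they are not part of the original dataset or code.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up stand-in for the real 'Sales' column (assumption, not real data)
demo = pd.DataFrame({"Sales": [10.0, 12.0, 14.0, 15.0, 16.0, 18.0, 20.0, 500.0]})

lower, upper = iqr_bounds(demo["Sales"])

# between() is inclusive on both ends by default, matching the >= / <= filter above.
# It returns False for NaN, so rows with a missing Sales value are flagged too.
is_outlier = ~demo["Sales"].between(lower, upper)

print(f"fences: [{lower}, {upper}]")
print(f"flagged {is_outlier.sum()} of {len(demo)} rows")

# Drop the flagged rows only once you have looked at them
clean_demo = demo[~is_outlier]
```

Keeping the boolean mask around makes it easy to inspect the flagged rows before deciding whether they are errors or just large but genuine sales.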

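I can now see a better boxplot for the cleaned dataset, but there are still outliers. Is this the right way to remove outliers? Should I repeat this cleaning process until no outliers remain? Are the steps above enough to clean the data so that it can be used for further analysis?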
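
One likely reason the second boxplot still shows outliers is that the boxplot recomputes the quartiles on the already-filtered data, so the fences move inward and new points fall outside them. The sketch below illustrates this on synthetic right-skewed values; the lognormal data, the seed, and the helper name `iqr_filter` are assumptions for illustration only, not the real 'Sales' column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for 'Sales' (assumption, not real data)
sales = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=9800))


def iqr_filter(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep only the values inside the k * IQR fences computed from `values`."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[values.between(q1 - k * iqr, q3 + k * iqr)]


# Apply the same filter three times and record how many rows survive each pass
remaining = sales
sizes = [len(remaining)]
for _ in range(3):
    remaining = iqr_filter(remaining)
    sizes.append(len(remaining))

print(sizes)  # row counts before filtering and after each pass
```

On skewed data like this, every pass removes more rows, so repeating the filter until the boxplot is clean tends to cut into ordinary values rather than only extreme ones.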

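I'm open to using third-party libraries to clean the data, but I don't want to depend on them too heavily. What I'm hoping for is an effective way to clean the dataset that relies on built-in libraries.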
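
As a rough sketch of what a built-in-only version could look like, the example below uses just the standard library (`csv`, `io`, `statistics`). The inline CSV text and the 'Order ID' values are made up so the snippet runs on its own; with the real file you would open it and pass the file object to `csv.DictReader` instead. On this small sample, `statistics.quantiles` with `method="inclusive"` gives the same quartiles as the pandas default, but treat that as an observation about this example rather than a guarantee.

```python
import csv
import io
import statistics

# Tiny inline CSV standing in for the real file (assumption, not real data).
# Row A3 has an empty Sales cell to show the missing-value step.
raw = io.StringIO(
    "Order ID,Sales\n"
    "A1,10.0\nA2,12.0\nA3,\nA4,14.0\nA5,15.0\n"
    "A6,16.0\nA7,18.0\nA8,20.0\nA9,500.0\n"
)
rows = list(csv.DictReader(raw))

# Step 1: drop rows whose Sales cell is empty
rows_with_sales = [r for r in rows if r["Sales"].strip()]
sales = [float(r["Sales"]) for r in rows_with_sales]

# Step 2: quartiles and 1.5 * IQR fences, as in the pandas code
q1, _median, q3 = statistics.quantiles(sales, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Step 3: keep only rows whose Sales value lies inside the fences
clean_rows = [r for r in rows_with_sales if lower <= float(r["Sales"]) <= upper]

print(len(rows), len(rows_with_sales), len(clean_rows))
```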

Boxplot of the 'Sales' column in the original data

Boxplot of the 'Sales' column in the cleaned data

Tags: python, data cleaning, missing data, outliers, data engineering
