提问人:ChibiPhoenix 提问时间:1/17/2022 最后编辑:ChibiPhoenix 更新时间:1/19/2022 访问量:657
如何比较 2 个相似但不同的数据集
How to compare 2 similar but different sets of data
问:
编辑: 我解决了这个问题,并在下面添加了答案以供参考。
我很难找到有关我的问题的信息,因为我似乎找不到合适的术语。
我正在创建一个 C# WPF 应用程序来比较和查找 2 个 .csv 文件之间的差异。 我已经可以比较数据并找到匹配和差异,假设两者中的数据采用相同的格式。
但是,我还需要支持不同但相似的集合。
例如。(我需要支持的文件,理想情况下这也适用于其他场景)也需要正确比较。
问题在于,其中一个集合具有不同的数据格式化方式,因为它们由不同的人维护。另一组始终使用相同的格式。
对所有内容使用相同格式的集合以“31612345678”或“3197123456789”格式显示数字。
具有不同数据的集合可以以不同的方式格式化此数据,例如。 "0612345678" "612345678" "31612345678" "06 12 34 56 78" "097 123456789" "09 71 23 45 67 89" 等。
是否有标准化的步骤来“清理”像这样凌乱的数据,并最终得到使用相同格式的 2 组数据?
答:
0赞
ChibiPhoenix
1/17/2022
#1
我的解决方案涉及使用 DataTable 类,因为它们在使用 WPF DataGrid 时可以很好地显示 .csv 数据。
修复给定列的数据需要发生什么:
- 删除所有空条目
- 删除任何重复项
- 将无效条目替换为有效条目
- 重复步骤 2 和 3,直到数据采用您想要的格式。
我最终为 DataTable 创建了一个扩展类来处理上述步骤:
public static class MyDataTableExtensions
{
/// <summary>
/// Replaces the values in all rows in columnName using RegEx, can optionally remove any duplicate values as well.
/// </summary>
/// <param name="dt"></param>
/// <param name="columnName"></param>
/// <param name="pattern"></param>
/// <param name="replacement"></param>
/// <param name="removeDuplicates"></param>
/// <returns></returns>
public static DataTable ReplaceDataTableRows(this DataTable dt, string columnName, string pattern, string replacement, bool removeDuplicates = false)
{
if (dt is null || dt.AsEnumerable().Any() is false) throw new ArgumentException($"The DataTable '{ nameof(dt) }' does not contain any elements. Please make sure that the provided DataTable is not null or empty.");
if (string.IsNullOrEmpty(columnName)) throw new ArgumentNullException(nameof(columnName));
if (string.IsNullOrEmpty(pattern)) throw new ArgumentNullException(nameof(pattern));
DataTable output = dt.Copy();
Regex regex = new Regex(pattern);
string item;
foreach (DataRow row in output.Rows)
{
foreach (DataColumn column in output.Columns)
{
if (column.ColumnName.ToLower().Equals(columnName.ToLower()))
{
if (column.ReadOnly) column.ReadOnly = false;
item = regex.Replace(row.Field<object>(column).ToString(), replacement);
if (string.IsNullOrEmpty(item))
{
row.Delete();
continue;
}
row.SetField(column, item);
}
}
}
output.AcceptChanges();
return removeDuplicates ? RemoveDuplicatesFromDataTableColumn(output, columnName) : output;
}
/// <summary>
/// Removes any empty values found in the rows in the given column
/// </summary>
/// <param name="dt"></param>
/// <param name="columnName"></param>
/// <returns></returns>
public static DataTable RemoveEmptyValuesFromDataTableColumn(this DataTable dt, string columnName)
{
DataTable output = dt.Copy();
foreach (DataRow row in output.Rows)
{
foreach (DataColumn column in output.Columns)
{
if (column.ColumnName.Equals(columnName))
{
if (column.ReadOnly) column.ReadOnly = false;
if (string.IsNullOrEmpty(row.Field<object>(column).ToString()))
{
row.Delete();
}
}
}
}
output.AcceptChanges();
return output;
}
/// <summary>
/// Removes any duplicate rows found in the column with 'columnName'
/// </summary>
/// <param name="dt"></param>
/// <param name="columnName"></param>
/// <returns></returns>
public static DataTable RemoveDuplicatesFromDataTableColumn(this DataTable dt, string columnName)
{
if (dt is null || dt.AsEnumerable().Any() is false) throw new ArgumentException($"The DataTable '{ nameof(dt) }' does not contain any elements. Please make sure that the provided DataTable is not null or empty.");
if (string.IsNullOrEmpty(columnName)) throw new ArgumentNullException(nameof(columnName));
return dt.AsEnumerable().GroupBy(x => x.Field<object>(columnName))
.Select(x => x.First()).CopyToDataTable();
}
/// <summary>
/// Prepends the valueToPrepend to the row in columnName if the value in the row does NOT match the given pattern
/// </summary>
/// <param name="dt"></param>
/// <param name="columnName"></param>
/// <param name="pattern"></param>
/// <param name="valueToPrepend"></param>
/// <returns></returns>
public static DataTable PrependValueToDataTableColumn(this DataTable dt, string columnName, string pattern, string valueToPrepend)
{
if (dt is null || dt.AsEnumerable().Any() is false) throw new ArgumentException($"The DataTable '{ nameof(dt) }' does not contain any elements. Please make sure that the provided DataTable is not null or empty.");
if (string.IsNullOrEmpty(columnName)) throw new ArgumentNullException(nameof(columnName));
if (string.IsNullOrEmpty(pattern)) throw new ArgumentNullException(nameof(pattern));
if (string.IsNullOrEmpty(valueToPrepend)) throw new ArgumentNullException(nameof(valueToPrepend));
DataTable output = dt.Copy();
Regex regex = new Regex(pattern);
string item;
foreach (DataRow row in output.Rows)
{
foreach (DataColumn column in output.Columns)
{
if (column.ColumnName.ToLower().Equals(columnName.ToLower()))
{
if (column.ReadOnly) column.ReadOnly = false;
item = row.Field<object>(column).ToString();
if (regex.IsMatch(item) == false)
{
row.SetField(column, $"{ valueToPrepend }{ item }");
}
}
}
}
output.AcceptChanges();
return output;
}
}
之后,对于我的数据集,解决方案实际上相对简单:
private void Button_Click(object sender, RoutedEventArgs e)
{
// eventually the column will be selectable through the use of a ComboBox
var selection = cb.SelectedItem;
string columnName = selection is not null ? cb.SelectedItem.ToString() : "Objectgebondennummer";
if (!string.IsNullOrEmpty(columnName))
{
// solution:
var resultDt = dt.Copy();
resultDt = resultDt.RemoveEmptyValuesFromDataTableColumn(columnName) // step 1
.RemoveDuplicatesFromDataTableColumn(columnName) // step 2
.ReplaceDataTableRows(columnName, "[^0-9]", string.Empty, true) // at this point we just have rows with mixed values, remove any non-numeric characters from the rows
.ReplaceDataTableRows(columnName, "^0", "31", true) // now we just have numbers (yay), first I replaced all the values starting with '0' to start with '31' instead
.PrependValueToDataTableColumn(columnName, "^31", "31") // my data was luckily pretty clean at this point, all that remained were numbers in the format "612345678" and "97123456789", so I just had to prepend 31 to them
.RemoveDuplicatesFromDataTableColumn(columnName); // make sure we don't have any new duplicates
dataGrid.ItemsSource = resultDt.AsDataView(); // finally, update the datagrid with the fixed data!
}
}
评论