How can I compare lists/maps of custom objects field by field, in a generic way, to create a mismatch report for very large data sets using Java 8 or later?

Asked by: Pooja · Asked: 1/8/2023 · Updated: 1/9/2023 · Views: 234

Q:

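I have been working on a data comparison between 2 different database sources in Java. Because of some other constraints, I cannot do the comparison directly in the database.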

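  • I have 50 tables to compare.
  • Row counts range from 10k to 500k per table (so an efficient algorithm is needed).
  • The number of columns and the field names differ from table to table (of course).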

I wrote the code below using for loops, which has the following limitations:

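  1. Some tables can be very large, so the nested for-loop solution is not efficient.
  2. The number of columns differs per table, so the logic I wrote does not work for every table and I have to repeat it for each one. That is a lot of boilerplate code.
  3. If a new column is added to a table, the comparison logic has to be updated as well.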
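The cost in point 1 comes from scanning the whole second list for every row of the first list. A minimal sketch of one common alternative is to index one list by id in a hash map and probe it once per row. `KeyedJoinSketch`, `Pair` and `joinByKey` are hypothetical names introduced here, and the sketch assumes ids are unique within each list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper (not from the original post): pairs rows from two lists
// by key with one hash lookup per row instead of a nested loop.
final class KeyedJoinSketch {

    /** Two rows, one from each source, that share the same key. */
    static final class Pair<A, B> {
        final A left;
        final B right;

        Pair(A left, B right) {
            this.left = left;
            this.right = right;
        }
    }

    static <A, B, K> List<Pair<A, B>> joinByKey(List<A> left, Function<A, K> leftKey,
                                                List<B> right, Function<B, K> rightKey) {
        // Index the second list once, O(m). Assumes keys are unique per list;
        // if a key repeats, the last row with that key wins.
        Map<K, B> index = new HashMap<>();
        for (B row : right) {
            index.put(rightKey.apply(row), row);
        }
        // Probe the index once per row of the first list, O(n).
        // Rows without a partner are skipped here; a real report would likely list them.
        List<Pair<A, B>> pairs = new ArrayList<>();
        for (A row : left) {
            B match = index.get(leftKey.apply(row));
            if (match != null) {
                pairs.add(new Pair<>(row, match));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Rows are {id, colour}; the second list is deliberately in a different order.
        List<String[]> db1 = Arrays.asList(new String[] {"1", "Blue"}, new String[] {"2", "Red"});
        List<String[]> db2 = Arrays.asList(new String[] {"2", "Red1"}, new String[] {"1", "Blue"});
        for (Pair<String[], String[]> p : joinByKey(db1, r -> r[0], db2, r -> r[0])) {
            if (!p.left[1].equals(p.right[1])) {
                System.out.println("id " + p.left[0] + ": " + p.left[1] + " -> " + p.right[1]);
            }
        }
    }
}
```

With n and m rows this does roughly n + m hash operations instead of n × m comparisons, at the price of holding one side in memory. For 500k rows that is usually acceptable, but it is worth measuring against the available heap.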

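My requirements:

  1. I want to write efficient code that produces a field-by-field mismatch report for the given lists of custom objects.
  2. The comparison code should be able to compare lists of any type of custom object. (I don't know how to do this.)
  3. It should be possible to create the table object POJOs by referring to a properties file that contains the column list for every table.
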
public void loadDummyTableObjects() {
        table1DataList =
                Arrays.asList(new TestTable1("1","1","One","Blue"),
                        new TestTable1("2","2","Two","Red"),
                        new TestTable1("3","3","Three","Black"),
                        new TestTable1("4","4","Four","Green"),
                        new TestTable1("5","5","Five","White"));

        table2DataList =
                Arrays.asList(new TestTable2("1","1","One","Blue"),
                        new TestTable2("2","2","Two","Red1"),
                        new TestTable2("3","3","Three","Black"),
                        new TestTable2("4","4","Four","Green"),
                        new TestTable2("5","5","Two","White"));
    }

   public void compareDataWithForLoop() {
        loadDummyTableObjects();
        List<MismatchReport> mismatchReport = new ArrayList<>();
        for (TestTable1 t1Row: table1DataList) {
            for (TestTable2 t2Row: table2DataList) {
                if (t1Row.getId().equals(t2Row.getId())) {
                    if (!(t1Row.getColumn1().equals(t2Row.getColumn1()))) {
                        MismatchReport result = getMismatchReport("Table1", "Column1", t1Row.getColumn1(), t2Row.getColumn1());
                        mismatchReport.add(result);
                    }
                    if (!(t1Row.getColumn2().equals(t2Row.getColumn2()))) {
                        MismatchReport result = getMismatchReport("Table1", "Column2", t1Row.getColumn2(), t2Row.getColumn2());
                        mismatchReport.add(result);
                    }
                    if (!(t1Row.getColumn3().equals(t2Row.getColumn3()))) {
                        MismatchReport result = getMismatchReport("Table1", "Column3", t1Row.getColumn3(), t2Row.getColumn3());
                        mismatchReport.add(result);
                    }
                }
            }
        }
        System.out.println(mismatchReport);
    }

    private static MismatchReport getMismatchReport(String tableNme, String Db1Table1Column1, String t1Row, String t2Row) {
        MismatchReport result = new MismatchReport();
        result.setTableNme(tableNme);
        result.setColumnNme(Db1Table1Column1);
        result.setDb1Value(t1Row);
        result.setDb2Value(t2Row);
        return result;
    }

    public static void main(String[] args) {
        DataComparatorService service = new DataComparatorService();
        service.compareDataWithForLoop();
    }

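The output format should be the same for every table comparison. Each result should contain the fields TableName, ColumnName, Db1Value and Db2Value, so that it shows which column differs and what the mismatched values are. The output of the code above is: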


[MismatchReport{tableNme='Table1', columnNme='Column3', db1Value='Red', db2Value='Red1'}, 
MismatchReport{tableNme='Table1', columnNme='Column2', db1Value='Five', db2Value='Two'}]
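The `MismatchReport` class itself is not shown in the question. The sketch below assumes a shape for it that is reconstructed from the printed output, and uses reflection to compare any two objects of the same class without naming their columns, which is one possible way to approach requirement 2. `ReflectiveDiffSketch` and `diff` are hypothetical names, and the sketch assumes both rows are instances of one class (the question uses two classes, `TestTable1` and `TestTable2`).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch: field-by-field comparison of two objects of the same class.
final class ReflectiveDiffSketch {

    // Assumed shape of MismatchReport, inferred from the output in the question.
    static final class MismatchReport {
        final String tableNme;
        final String columnNme;
        final String db1Value;
        final String db2Value;

        MismatchReport(String tableNme, String columnNme, String db1Value, String db2Value) {
            this.tableNme = tableNme;
            this.columnNme = columnNme;
            this.db1Value = db1Value;
            this.db2Value = db2Value;
        }

        @Override
        public String toString() {
            return "MismatchReport{tableNme='" + tableNme + "', columnNme='" + columnNme
                    + "', db1Value='" + db1Value + "', db2Value='" + db2Value + "'}";
        }
    }

    // Stand-in for the question's row class, which is not shown there either.
    static final class TestTable1 {
        final String id;
        final String column1;
        final String column2;
        final String column3;

        TestTable1(String id, String column1, String column2, String column3) {
            this.id = id;
            this.column1 = column1;
            this.column2 = column2;
            this.column3 = column3;
        }
    }

    static <T> List<MismatchReport> diff(String tableName, T db1Row, T db2Row) {
        List<MismatchReport> out = new ArrayList<>();
        try {
            // Every declared instance field is treated as a column, so a newly
            // added field is compared without touching this method.
            for (Field f : db1Row.getClass().getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
                    continue;
                }
                f.setAccessible(true);
                Object v1 = f.get(db1Row);
                Object v2 = f.get(db2Row);
                if (!Objects.equals(v1, v2)) { // null-safe, unlike v1.equals(v2)
                    out.add(new MismatchReport(tableName, f.getName(),
                            String.valueOf(v1), String.valueOf(v2)));
                }
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException("Cannot read field via reflection", e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(diff("Table1",
                new TestTable1("2", "2", "Two", "Red"),
                new TestTable1("2", "2", "Two", "Red1")));
        // prints [MismatchReport{tableNme='Table1', columnNme='column3', db1Value='Red', db2Value='Red1'}]
    }
}
```

The column name reported here is the Java field name (`column3`), not the capitalised label used in the question's output. Looking up the fields for every row is wasteful on 500k rows, so caching the `Field[]` per class would be a natural next step.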

Any leads on how to achieve the above requirements would be very helpful.

java performance generics collections comparison

A:

1 upvote · Eritrean · 1/8/2023 · #1

If I were you, I would not reinvent the wheel; I would use a third-party library such as JaVers.

JaVers documentation

JaVers GitHub

JaVers Maven

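It is a powerful yet lightweight library. It can do much more, but you can also use it purely as an object diff tool. As a starting point, I took some of your sample input to show how it could be applied to your use case.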

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

import org.javers.core.Javers;
import org.javers.core.JaversBuilder;
import org.javers.core.diff.Diff;

import lombok.AllArgsConstructor;
import lombok.Getter;

public final class Example {

    public static void main(String[] args) {

        //Just copied your sample input but used only one custom class as the second is not really needed
        List<TestTable1> dataDB1 = Arrays.asList(new TestTable1("1","1","One","Blue"),
                      new TestTable1("2","2","Two","Red"),
                      new TestTable1("3","3","Three","Black"),
                      new TestTable1("4","4","Four","Green"),
                      new TestTable1("5","5","Five","White"));

        List<TestTable1> dataDB2 = Arrays.asList(new TestTable1("1","1","One","Blue"),
                      new TestTable1("2","2","Two","Red1"),
                      new TestTable1("3","3","Three","Black"),
                      new TestTable1("4","4","Four","Green"),
                      new TestTable1("5","5","Two","White"));

        //create a map from your input for a faster access of objects by id
        Map<String, TestTable1> db1Map = dataDB1.stream()
                                           .collect(Collectors.toMap(TestTable1::getId, Function.identity()));
        Map<String, TestTable1> db2Map = dataDB2.stream()
                                           .collect(Collectors.toMap(TestTable1::getId, Function.identity()));

        // do your comparison using JaVers
        Javers javers = JaversBuilder.javers().build();

        db1Map.keySet().forEach(key -> {
            Diff diff = javers.compare(db1Map.get(key), db2Map.get(key));
            if (diff.hasChanges()){
                System.out.println("Changes for id: " + key);
                System.out.println(diff.prettyPrint());
                System.out.println("********************************************************");
                System.out.println();
            }
        });
    }

    // a simple POJO for your data
    @AllArgsConstructor
    @Getter
    public static class TestTable1 {
        String id;
        String column1;
        String column2;
        String column3;
    }
}

Output:

Changes for id: 2
Diff:
* changes on com.mycompany.Example$TestTable1/ :
  - 'column3' changed: 'Red' -> 'Red1'

********************************************************

Changes for id: 5
Diff:
* changes on com.mycompany.Example$TestTable1/ :
  - 'column2' changed: 'Five' -> 'Two'

********************************************************

I just used prettyPrint to get a standard output, but you can configure it to suit your needs.
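If adding a library is not an option, requirement 3 from the question (column lists kept in a properties file) can be approximated in plain Java by treating each row as a map from column name to value. The following is only a sketch under stated assumptions: the `<table>.columns` key format, the `ColumnListDiffSketch` class and its methods are all made up for illustration, and the properties are read from an inline string rather than a real file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Properties;

// Hypothetical sketch: compare rows held as column-name -> value maps,
// with the column list for each table coming from a properties source.
final class ColumnListDiffSketch {

    // Reads "<table>.columns=a,b,c"; this key format is an assumption of the sketch.
    static List<String> columnsFor(Properties props, String table) {
        String raw = props.getProperty(table + ".columns");
        if (raw == null) {
            throw new IllegalArgumentException("No column list configured for " + table);
        }
        List<String> columns = new ArrayList<>();
        for (String column : raw.split(",")) {
            String trimmed = column.trim();
            if (!trimmed.isEmpty()) {
                columns.add(trimmed);
            }
        }
        return columns;
    }

    // Returns one "column: db1 -> db2" entry per differing column, in column-list order.
    static List<String> diff(List<String> columns, Map<String, String> db1Row,
                             Map<String, String> db2Row) {
        List<String> mismatches = new ArrayList<>();
        for (String column : columns) {
            String v1 = db1Row.get(column);
            String v2 = db2Row.get(column);
            if (!Objects.equals(v1, v2)) {
                mismatches.add(column + ": " + v1 + " -> " + v2);
            }
        }
        return mismatches;
    }

    // Builds a row by zipping the configured columns with the given values.
    static Map<String, String> row(List<String> columns, String... values) {
        Map<String, String> row = new HashMap<>();
        for (int i = 0; i < columns.size() && i < values.length; i++) {
            row.put(columns.get(i), values[i]);
        }
        return row;
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // In practice this would be loaded from a file on disk or the classpath.
        props.load(new StringReader("Table1.columns=id,column1,column2,column3\n"));

        List<String> columns = columnsFor(props, "Table1");
        Map<String, String> db1Row = row(columns, "2", "2", "Two", "Red");
        Map<String, String> db2Row = row(columns, "2", "2", "Two", "Red1");
        System.out.println(diff(columns, db1Row, db2Row)); // prints [column3: Red -> Red1]
    }
}
```

Because the column list is data rather than code, adding a column to a table only means editing the properties entry. The trade-off is that every value is handled as a string, so type-aware comparison (for example numeric or date tolerance) would need extra work.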