覆盖初始自动映射

Overwrite Initial automated mapping

提问人:Johan VS 提问时间:10/24/2023 最后编辑:Jason AllerJohan VS 更新时间:10/25/2023 访问量:16

问:

我正在尝试将数据集中的列名映射到标准中的名称,标准名称位于 yaml 文件中,因此您可以根据需要更改它们。我能够进行初始自动映射,但我还需要用户能够选择他们想要手动映射的列,以防映射不正确。因此,我能够在初始映射后进行输入,让用户决定手动映射哪些列以及他们为所选列提供哪些选项。

问题在于,用户的选择不会覆盖初始映射,因此列不会更改为用户希望的内容。

所以我使用 fuzzywuzzy 进行映射,一切正常,唯一的问题是用户输入不会更改初始映射。

import pandas as pd
import fuzzywuzzy.process as fuzz
import re
import yaml

try:
    df = pd.read_excel('Mock Data_typos (2).xlsx')
except Exception as e:
    print(f"Error loading data: {str(e)}")
    df = None

if df is not None:
    last_name_pattern = re.compile(r'\b[lL][aA][sS][tT]\s?[nN][aA][mM][eE]\b|\bLST\s?NM\b')

    # Function to find and match columns
    def match_columns(df, reference_columns, initial_score_threshold=70):
        matched_columns = {}  # Initialize the matched_columns dictionary

        input_columns = df.columns.tolist()

        for column in input_columns:
            matches = fuzz.extractBests(column, reference_columns, score_cutoff=initial_score_threshold)
            if matches:
                matched_columns[column] = matches

        return matched_columns

    with open('standard_columns.yml', 'r') as yaml_file:
        try:
            reference_columns = yaml.safe_load(yaml_file)
        except Exception as e:
            print(f"Error loading reference columns: {str(e)}")
            reference_columns = []

    if reference_columns:
        matched_columns = match_columns(df, reference_columns)

        # Initial automated mapping
        for index, (column, matches) in enumerate(matched_columns.items()):
            if len(matches) > 0:
                best_match = matches[0][0]  # Get the full matched column name
                df.rename(columns={column: best_match}, inplace=True)
                print(f"{index}. Column '{column}' mapped to '{best_match}'")

        print("Initial mapping finished.")

        # Remove columns that are not in reference_columns
        columns_to_remove = [col for col in df.columns if col not in reference_columns]
        df.drop(columns=columns_to_remove, inplace=True)

        # Check if "Last Name" doesn't exist and create it
        if "Last Name" not in df.columns:
            df["Last Name"] = ""

        # Allow the user to specify columns for modification
        change_columns_input = input("Enter a list of columns to modify (e.g., '0, 5, 7') or 'none' to skip: ")

        if change_columns_input.lower() != 'none':
            change_columns_list = [int(col.strip()) for col in change_columns_input.split(',')]

            for column_index in change_columns_list:
                if 0 <= column_index < len(matched_columns):
                    selected_column = list(matched_columns.keys())[column_index]

                    print(f"Mapping options for column {column_index}: '{selected_column}':")
                    for j, (match, score) in enumerate(matched_columns[selected_column]):
                        print(f"  {j}. Map to '{match}' (Score: {score})")  # Display the full match

                    while True:
                        match_choice = input("Enter the number for the mapping, or 'skip' to keep as is: ")
                        if match_choice.lower() == 'skip':
                            break
                        elif match_choice.isdigit():
                            match_index = int(match_choice)
                            if 0 <= match_index < len(matched_columns[selected_column]):
                                chosen_mapping = matched_columns[selected_column][match_index][0]
                                df.rename(columns={selected_column: chosen_mapping}, inplace=True)
                                print(f"Column {column_index}: '{selected_column}' has been mapped to '{chosen_mapping}'.")
                                break
                            else:
                                print("Invalid input. Please enter a valid number.")
                        else:
                            print("Invalid input. Please enter a valid number or 'skip'.")

        print("Mapping and cleanup finished. Updated DataFrame:")
        print(df)

    else:
        print("No reference columns loaded. Please check the reference columns file.")
输入 映射 覆盖 fuzzywuzzy

评论


答: 暂无答案