提问人:Johan VS 提问时间:10/24/2023 最后编辑:Jason AllerJohan VS 更新时间:10/25/2023 访问量:16
覆盖初始自动映射
Overwrite Initial automated mapping
问:
我正在尝试将数据集中的列名映射到标准中的名称,标准名称位于 yaml 文件中,因此您可以根据需要更改它们。我能够进行初始自动映射,但我还需要用户能够选择他们想要手动映射的列,以防映射不正确。因此,我能够在初始映射后进行输入,让用户决定手动映射哪些列以及他们为所选列提供哪些选项。
问题在于,用户的选择不会覆盖初始映射,因此列不会更改为用户希望的内容。
所以我使用 fuzzywuzzy 进行映射,一切正常,唯一的问题是用户输入不会更改初始映射。
import pandas as pd
import fuzzywuzzy.process as fuzz
import re
import yaml
try:
df = pd.read_excel('Mock Data_typos (2).xlsx')
except Exception as e:
print(f"Error loading data: {str(e)}")
df = None
if df is not None:
last_name_pattern = re.compile(r'\b[lL][aA][sS][tT]\s?[nN][aA][mM][eE]\b|\bLST\s?NM\b')
# Function to find and match columns
def match_columns(df, reference_columns, initial_score_threshold=70):
matched_columns = {} # Initialize the matched_columns dictionary
input_columns = df.columns.tolist()
for column in input_columns:
matches = fuzz.extractBests(column, reference_columns, score_cutoff=initial_score_threshold)
if matches:
matched_columns[column] = matches
return matched_columns
with open('standard_columns.yml', 'r') as yaml_file:
try:
reference_columns = yaml.safe_load(yaml_file)
except Exception as e:
print(f"Error loading reference columns: {str(e)}")
reference_columns = []
if reference_columns:
matched_columns = match_columns(df, reference_columns)
# Initial automated mapping
for index, (column, matches) in enumerate(matched_columns.items()):
if len(matches) > 0:
best_match = matches[0][0] # Get the full matched column name
df.rename(columns={column: best_match}, inplace=True)
print(f"{index}. Column '{column}' mapped to '{best_match}'")
print("Initial mapping finished.")
# Remove columns that are not in reference_columns
columns_to_remove = [col for col in df.columns if col not in reference_columns]
df.drop(columns=columns_to_remove, inplace=True)
# Check if "Last Name" doesn't exist and create it
if "Last Name" not in df.columns:
df["Last Name"] = ""
# Allow the user to specify columns for modification
change_columns_input = input("Enter a list of columns to modify (e.g., '0, 5, 7') or 'none' to skip: ")
if change_columns_input.lower() != 'none':
change_columns_list = [int(col.strip()) for col in change_columns_input.split(',')]
for column_index in change_columns_list:
if 0 <= column_index < len(matched_columns):
selected_column = list(matched_columns.keys())[column_index]
print(f"Mapping options for column {column_index}: '{selected_column}':")
for j, (match, score) in enumerate(matched_columns[selected_column]):
print(f" {j}. Map to '{match}' (Score: {score})") # Display the full match
while True:
match_choice = input("Enter the number for the mapping, or 'skip' to keep as is: ")
if match_choice.lower() == 'skip':
break
elif match_choice.isdigit():
match_index = int(match_choice)
if 0 <= match_index < len(matched_columns[selected_column]):
chosen_mapping = matched_columns[selected_column][match_index][0]
df.rename(columns={selected_column: chosen_mapping}, inplace=True)
print(f"Column {column_index}: '{selected_column}' has been mapped to '{chosen_mapping}'.")
break
else:
print("Invalid input. Please enter a valid number.")
else:
print("Invalid input. Please enter a valid number or 'skip'.")
print("Mapping and cleanup finished. Updated DataFrame:")
print(df)
else:
print("No reference columns loaded. Please check the reference columns file.")
答: 暂无答案
评论