Getting an exact match for a list of column names from files

Question:

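I wrote a script that reads a list of column names that need to be deleted, then scans a given list of directories for SQL and Python files that reference those names. I used the Python "in" operator for the check, but it returns substring matches rather than exact matches. I need some help doing this with a regular expression instead.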

In the SQL and Python files, a column name can be preceded and followed by a space, a single quote, or a comma. Thanks in advance.
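To make the problem concrete, here is a small sketch of why the "in" check over-reports. The SQL string and the column names are made up for illustration and are not taken from the real files. Because "in" is a plain substring test, a short name is reported whenever it appears inside a longer name.

```python
# Hypothetical data, for illustration only.
contents = "SELECT first_and_last_name, 'city' FROM customers"
columns = ["LAST_NAME", "CITY"]

# Substring test, as in the script: LAST_NAME is reported even though
# only first_and_last_name appears in the text.
substring_hits = [c for c in columns if c in contents.upper()]
print(substring_hits)  # ['LAST_NAME', 'CITY']
```

Only CITY is a genuine reference here; LAST_NAME is the kind of false positive the question is about.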

Here is my code. I have marked the two lines that need to change to use a regular expression:

    import glob

    # read column_fields_to_delete.txt into a list, one column name per line
    columns_to_delete_file = 'column_fields_to_delete.txt'

    with open(columns_to_delete_file) as f:
        columns_file = [line.strip() for line in f]

    delete_columns_list = [column for column in columns_file]

    # specify the glob patterns for the sql and python files to scan
    directories = ['/users/xx/sql/**/*.sql',
                   '/users/xx/Python/**/*.py']

    output_lines = list()

    for directory in directories:
        for file in glob.glob(directory, recursive=True):
            try:
                with open(file, 'r') as f:
                    contents = f.read()
                exception_columns = list()
                for column_name in delete_columns_list:
                    if column_name in contents.upper():         #--------- this needs to be changed to re.findall()
                        exception_columns.append(column_name)   #--------- this may need to be modified as well

                if exception_columns:
                    print(f"{file} file contains exception columns {exception_columns}\n\n")

            except:
                # skip files that cannot be read
                pass

The expected output is to print every SQL or Python file that references any column in delete_columns_list, followed by the actual columns that matched exactly.

Tags: regex, python-3.7

Answer:

    import glob
    import re

    # read column_fields_to_delete.txt into a list, one column name per line
    columns_to_delete_file = 'column_fields_to_delete.txt'

    with open(columns_to_delete_file) as f:
        columns_file = [line.strip() for line in f]

    delete_columns_list = [column for column in columns_file]

    # specify the glob pattern for the sql files to scan
    directory = '/users/xx/sql/**/*.sql'

    output_lines = list()

    for file in glob.glob(directory, recursive=True):
        try:
            with open(file, 'r') as f:
                contents = f.read()
            exception_columns = list()
            for column_name in delete_columns_list:
                # the name must have a delimiter character on each side
                matches = re.findall(r'[\s,\'"]' + column_name + r'[\s,\'"]', contents, re.IGNORECASE)
                if matches:
                    exception_columns.append(column_name)

            if exception_columns:
                print(f"{file} file contains exception columns {exception_columns}\n\n")
        except:
            # skip files that cannot be read
            pass

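Explanation of the changes to the code above:

The regular expression pattern I used matches any column name that is enclosed by whitespace, a comma, a single quote, or a double quote.
The re.findall() function returns a list of all matches in the file contents.
I also used the re.IGNORECASE flag to make the search case-insensitive. You can change this as needed.
If there are any matches:
- the column name is appended to the exception_columns list, and the file name and the list of exception columns are then printed.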
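One thing to be aware of: the pattern above requires a delimiter character on both sides, so it would not match a name at the very start or end of a file, or one followed by a character such as = or ). A possible variation is sketched below. It assumes that "exact match" means "not part of a longer identifier", which is broader than the space, quote, and comma delimiters named in the question. The helper name find_exception_columns and the sample string are hypothetical.

```python
import re

def find_exception_columns(contents, column_names):
    """Return the column names that occur in contents as whole identifiers."""
    found = []
    for name in column_names:
        # (?<!\w) and (?!\w): no letter, digit, or underscore may sit
        # directly before or after the name.
        pattern = r'(?<!\w)' + re.escape(name) + r'(?!\w)'
        if re.search(pattern, contents, re.IGNORECASE):
            found.append(name)
    return found

sample = "select first_and_last_name, 'city' from customers where zip=1"
print(find_exception_columns(sample, ["LAST_NAME", "CITY", "ZIP"]))
# ['CITY', 'ZIP']
```

Since only a yes/no answer per column is needed, re.search is used here instead of re.findall; either works for this check.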

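Thanks a lot! I will try it out in the morning. I really like your explanation, and the use of re.findall() is very helpful. Regarding your question: the column names we use contain only one special character, the underscore (for example, first_and_last_name), which is pretty much standard in most DBMSs. The re.IGNORECASE flag is also a very nice touch. Kudos!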

1 upvote mandy8055 10/14/2023

_ is generally not counted as a special character. However, if the names contain any other special characters, you can use re.escape.
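As a rough illustration of that point, the sketch below uses a made-up column name containing a dot, which is a regex metacharacter. The name and the sample text are assumptions for demonstration only.

```python
import re

name = "price.usd"                 # hypothetical column name with a metacharacter
text = "select price_usd from t"   # hypothetical file contents

# Unescaped: "." matches any character, so price_usd is a false positive.
print(bool(re.search(name, text)))             # True
# Escaped: "." is matched literally, so there is no match.
print(bool(re.search(re.escape(name), text)))  # False
# Underscores are left untouched by re.escape.
print(re.escape("first_and_last_name"))        # first_and_last_name
```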

1 upvote punsoca 10/15/2023

Thanks! I have bookmarked my own post to keep this information handy. I don't use regular expressions often, but they are certainly very useful.
