Splitting a text csv file into another csv with the variables the text represents with Python

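Question:

I have a csv file like the one shown below. Each row is just text, and I want to split each row into the three variables it actually represents. The file shows the comments customers made about their bank on a given date, together with their identification number. So I want to convert this csv file into another csv file that has three variables (customer_id, date, comments). The first row shows the names of the three columns that the final version should have: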


"customer_id date comments",,,
"216604 2022-08-22 Overall, this bank is satisfactory",,,
"259276 2022-11-23 Easy to find zhe bank's branches",,,
"58770 2022-03-13 ",,,
"318031 2022-08-08 ",,,
"380865 2022-11-20 Considering another bank..",,,

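I am an absolute beginner in Python; I started just a month ago. So this may be a simple task, but I just cannot find a way to convert the file above into a three-column file like this: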

customer_id	date	comments
216604	2022-08-22	Overall, this bank is satisfactory
259276	2022-11-23	Easy to find zhe bank's branches
58770	2022-03-13	
318031	2022-08-08	
380865	2022-11-20	Considering another bank..

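Or, in other words, I have to split the original text into three fields: an ID, a date, and a text field holding the comment corpus.

Any suggestions are very welcome.

Thanks.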

python file text split

import re
import pandas as pd


def split_line(line):
    # Split the text on the date, the one element with a common
    # structure (YYYY-MM-DD) in every entry, using a regex.
    date_pattern = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"

    # Everything before the date is `customer_id` and everything after
    # it is `comments`. maxsplit=1 keeps any later date inside the comment.
    customer_id, comments = re.split(date_pattern, line, maxsplit=1)

    # Extract the date itself with a regex search.
    date = re.search(date_pattern, line).group(0)

    return {
        "customer_id": customer_id.strip(),
        "date": date.strip(),
        "comments": comments.strip()
    }
    


if __name__ == "__main__":

    # If you have the text as a Python string variable
    text = """customer_id date comments
    216604 2022-08-22 Overall, this bank is satisfactory,
    259276 2022-11-23 Easy to find zhe bank ' s branches
    380865 2022-11-20 Seriously considering switching to a rival bank
    """
    all_lines = text.split("\n")[1:]

    # If you have the text as a .txt file
    # with open("path/to/txt/file", "r") as f:
    #     all_lines = f.readlines()[1:]

    # Note that we slice the lines with [1:] to drop the header

    all_parsed_lines = []
    for line in all_lines:

        # Strip the surrounding whitespace and check the length
        # to make sure the line is not empty.
        if len(line.strip()) > 0:
            extracted_fields = split_line(line)
            all_parsed_lines.append(extracted_fields)

    # Convert the list of dictionaries into an ordered, readable
    # DataFrame using the pandas module.
    df = pd.DataFrame(all_parsed_lines)
    print(df)

This returns the following output:

  customer_id        date                                        comments
0      216604  2022-08-22              Overall, this bank is satisfactory,
1      259276  2022-11-23              Easy to find zhe bank ' s branches
2      380865  2022-11-20  Seriously considering switching to a rival bank
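
The sample in the question wraps each record in double quotes and pads it with trailing commas, so the raw lines may need some cleaning before they can be split. The following is a minimal sketch using only the standard library, under the assumption that every data line looks like `"<digits> <YYYY-MM-DD> <optional text>",,,`. The names `LINE_PATTERN`, `parse_raw_line`, and `convert` are hypothetical, and the in-memory `io.StringIO` objects stand in for real files.

```python
import csv
import io
import re

# Assumed line layout: <digits> <YYYY-MM-DD> <optional free text>
LINE_PATTERN = re.compile(r"^(\d+)\s+(\d{4}-\d{2}-\d{2})\s*(.*)$")


def parse_raw_line(raw_line):
    """Return (customer_id, date, comments), or None if the line does not match."""
    # Drop the newline, the trailing empty-cell commas, and the wrapping quotes.
    cleaned = raw_line.strip().rstrip(",").strip('"').strip()
    match = LINE_PATTERN.match(cleaned)
    if match is None:
        return None
    customer_id, date, comments = match.groups()
    return customer_id, date, comments.strip()


def convert(source, target):
    """Read raw lines from `source` and write a three-column csv to `target`."""
    writer = csv.writer(target)
    writer.writerow(["customer_id", "date", "comments"])
    for raw_line in source:
        parsed = parse_raw_line(raw_line)
        # The original header and blank lines do not match and are skipped.
        if parsed is not None:
            writer.writerow(parsed)


sample = io.StringIO(
    '"customer_id date comments",,,\n'
    '"216604 2022-08-22 Overall, this bank is satisfactory",,,\n'
    '"58770 2022-03-13 ",,,\n'
)
output = io.StringIO()
convert(sample, output)
print(output.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` handles; the `csv` documentation recommends opening the output file with `newline=""`. Because `csv.writer` quotes any comment that contains a comma, the result still reads back as three columns.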

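data = []
filename = "yourfile.txt"
with open(filename) as f:
    # The first line holds the column names.
    header = f.readline().rstrip("\n").split(" ")
    data.append(header)
    for line in f:
        # Remove only the newline, so a last line without one is not truncated.
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(" ")
        v1 = fields[0]             # customer_id
        v2 = fields[1]             # date
        v3 = " ".join(fields[2:])  # comments, with their spaces restored
        data.append([v1, v2, v3])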

The second block saves the file using a tab as the delimiter. This can also be changed to a semicolon.

filename = "output.csv"
with open(filename, "w") as f:
    for line in data:
        # Join the fields with tabs so no trailing tab is left on the row.
        f.write("\t".join(line))
        f.write("\n")
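
The same write step can also be sketched with the standard `csv` module, which makes the switch between tab and semicolon a single argument and quotes a field when it happens to contain the delimiter. This is only an illustration: `write_rows` is a hypothetical helper, the rows are made-up sample values in the shape of the `data` list above, and an in-memory buffer is used in place of `output.csv`.

```python
import csv
import io


def write_rows(rows, target, delimiter="\t"):
    """Write each row to `target`, separating the fields with `delimiter`."""
    writer = csv.writer(target, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)


# Sample rows in the same shape as the `data` list: a header, then records.
rows = [
    ["customer_id", "date", "comments"],
    ["216604", "2022-08-22", "Overall, this bank is satisfactory"],
    ["58770", "2022-03-13", ""],
]

tab_buffer = io.StringIO()
write_rows(rows, tab_buffer)                     # tab-separated (default)

semicolon_buffer = io.StringIO()
write_rows(rows, semicolon_buffer, delimiter=";")  # semicolon-separated

print(semicolon_buffer.getvalue())
```

To write to disk instead, pass a handle opened with `open("output.csv", "w", newline="")` as `target`.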
