FTP Tab Delimited Text file unable to save as utf-8 encoded

Asked by Shenanigator · Asked 8/6/2023 · Modified 8/6/2023 · Viewed 79 times

Q:

First off - no, I can't change the FTP settings. They're completely locked down, because the server is embedded in a device and is very old. It doesn't support PASV, TLS, or anything more modern. The software is WS_FTP 4.8 (as far as I can tell); I can't even find a date for that version, but it's probably around 20 years old. I've contacted the company that owns these devices to see if they'll do the right thing and issue a firmware update that puts a better FTP server in their hardware, but I haven't heard back.

Issuing a PASV gets a 502, so I'm sure it isn't supported; I'm not even sure it's related to this problem anyway. I think it comes down to the underlying operating system on the device.
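
For context, Python's ftplib defaults to passive mode; what works against this server is forcing active (PORT) transfers, which is what you'll see in the FTP logs below. A minimal sketch (the host and credentials are placeholders):

import ftplib

ftp = ftplib.FTP()
ftp.connect('192.0.2.10')      # placeholder device IP
ftp.login('user', 'password')  # placeholder credentials
ftp.set_pasv(False)            # force active (PORT) transfers, since PASV returns 502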

The FTP login messages:

Connection established, waiting for welcome message...
Status: Insecure server, it does not support FTP over TLS.
Status: Server does not support non-ASCII characters.

I'll post the different things I tried in order to work around this:

        with open(local_temp_file, 'w', encoding='UTF-8', errors='replace') as local_file:
            conn.retrbinary('RETR ' + filename_convention
                            + yesterday + '.txt', local_file.write)

The FTP log:

*resp* '200 Type set to I.'
*resp* '200 PORT command successful.'
*cmd* 'RETR Data Log Trend_Ops_Data_Log_230804.txt'
*resp* '150 Opening BINARY mode data connection for Data Log Trend_Ops_Data_Log_230804.txt.'

The traceback:

{'TypeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 123, in get_log_data
    conn.retrbinary('RETR ' + filename_convention
  File "D:\Anaconda\Lib\ftplib.py", line 441, in retrbinary
    callback(data)
TypeError: write() argument must be str, not bytes

OK, cool - str, not bytes:

        with open(local_temp_file, 'w', encoding='UTF-8', errors='replace') as local_file:
            conn.retrlines('RETR ' + filename_convention
                           + yesterday + '.txt', local_file.write)

The FTP log:

*resp* '200 Type set to A.'
*resp* '200 PORT command successful.'
*cmd* 'RETR Data Log Trend_Ops_Data_Log_230804.txt'
*resp* '150 Opening ASCII mode data connection for Data Log Trend_Ops_Data_Log_230804.txt.'

The traceback:

{'UnicodeDecodeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 123, in get_log_data
    conn.retrlines('RETR ' + filename_convention
  File "D:\Anaconda\Lib\ftplib.py", line 465, in retrlines
    line = fp.readline(self.maxline + 1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
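
A side note on that 0xff: it can never appear as the first byte of valid UTF-8, but UTF-16 little-endian files usually begin with the byte-order mark FF FE, so the decode error itself is a clue about the real encoding. A minimal sketch for sniffing a BOM before choosing a decoder (sniff_bom is just an illustrative helper, not part of my script):

import codecs

def sniff_bom(path):
    # Compare the first few bytes against the standard byte-order marks
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    return None  # no BOM; fall back to a detector like chardet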

Now, I have an FTP client that I've used to successfully download these files, and the code I use to send to S3 already sends them successfully. The problem is that I don't know what the encoding actually is.

Opening it with OpenOffice Calc just shows Unicode.

I tried:

def detect_encoding(file):
    detector = chardet.universaldetector.UniversalDetector()
    with open(file, "rb") as f:
        # Feed the detector line by line until it's confident enough
        for line in f:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
    return detector.result

Then, with f = open(local_temp_file, 'wb'):

        conn.retrbinary('RETR ' + filename_convention
                        + yesterday + '.txt', f.write)
        f.close()
        f.encode('utf-8')
        print(detect_encoding(f))

The traceback:

{'AttributeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 139, in get_log_data
    f.encode('utf-8')
    ^^^^^^^^
AttributeError: '_io.BufferedWriter' object has no attribute 'encode'

I also tried the above function with:

        f = open(local_temp_file, 'wb')

        conn.retrbinary('RETR ' + filename_convention
                        + yesterday + '.txt', f.write)
        f.close()

The traceback:

{'TypeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 140, in get_log_data
    print(detect_encoding(f))
          ^^^^^^^^^^^^^^^^^^
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 69, in detect_encoding
    with open(file, "rb") as f:
         ^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not BufferedWriter
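
For the record, these last two tracebacks boil down to the same mistake: I handed detect_encoding the closed BufferedWriter instead of a path. File objects have no .encode, and open() wants a str, bytes, or os.PathLike. The call I was aiming for, reusing detect_encoding from above, is presumably:

with open(local_temp_file, 'wb') as f:
    conn.retrbinary('RETR ' + filename_convention
                    + yesterday + '.txt', f.write)

# detect_encoding() opens the file itself, so pass it the path, not the handle
print(detect_encoding(local_temp_file))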

At the end of the day, here's why this matters: these tab-delimited text files get put into S3 and crawled by a Glue crawler, and right now the crawler can't read the columns correctly - it only sees one column. When the file is opened it shows ?? in diamonds (the U+FFFD replacement character), which tells me it's encoded in some way, and the crawler can't recognize it as CSV with the '\t' delimiter (yes, I set up a classifier for that). I also need to append 2 fields (columns) to the tab-delimited file, to provide the site name and a per-row timestamp, and I can't do that on raw bytes.

I'm probably missing something simple, but I've googled my way through a lot of SO posts and can't seem to find a solution.

Python Unicode UTF-8 FTP

A:

0 votes · Shenanigator · 8/6/2023 · #1

Huge thanks to Martin, TDelaney, and Ed_ for getting me to think about things differently. First, I downloaded the file and opened it in Notepad on Windows, at which point I remembered that Notepad shows you the encoding in the bottom-right corner. TDelaney wins: it's UTF-16. Ed_ pointed out that I wasn't actually passing the file, just a string holding the file path. Martin kept me from fixating on FTP as the problem. Cheers!

Here's what their nudging produced:

def convert_to_utf8(check_file, output_file):

    with open(check_file, 'rb') as of:
        chardet_data = chardet.detect(of.read())
        fileencoding = (chardet_data['encoding'])
        print('fileencoding', fileencoding)

    if fileencoding in ['utf-8', 'ascii']:
        return {'re-encoded': False, 'encoding': fileencoding}

    else:
        with open(check_file, 'r', encoding=fileencoding) as of, \
                open(output_file, 'w', encoding='utf-8') as cf:
            cf.write(of.read())

        return {'re-encoded': True, 'encoding': fileencoding}
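
A quick usage sketch - the file name is the one from my FTP logs, and the result line assumes the UTF-16 detection described above:

result = convert_to_utf8('Data Log Trend_Ops_Data_Log_230804.txt',
                         'temp_file.tsv')
print(result)   # e.g. {'re-encoded': True, 'encoding': 'UTF-16'}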

.....

        # Create the temporary file paths for later use
        temp_dir = tempfile.gettempdir()

        if not os.path.exists(temp_dir):
            tempfile.mkdtemp(os.path.abspath(__file__) + os.sep + 'tmp')
            temp_dir = os.path.abspath(__file__) + os.sep + 'tmp'

        inp_file = os.path.join(
            temp_dir, filename_convention + yesterday + '.txt')

        temp_file = os.path.join(temp_dir, 'temp_file.tsv')

        op_file = os.path.join(
            temp_dir, filename_convention + yesterday + '.tsv')

        # Step 4: Connect to the current FTP, Download the File,
        # Write to s3 and clean up
        # Connect to the FTP site
        conn = ftp_connection(ftp_ip, ftp_username, ftp_password)

        # Change the directory
        conn.cwd(file_directory)
        conn.sendcmd('SITE CHMOD 777 ' + filename_convention
                     + yesterday + '.txt')

        # Download the file to the local temporary directory
        with open(inp_file, 'wb') as f:
            conn.retrbinary('RETR ' + filename_convention
                            + yesterday + '.txt', f.write)

        s3_file = convert_to_utf8(inp_file,
                                  temp_file)

        print(s3_file['re-encoded'], s3_file['encoding'])

        with open(temp_file, 'r') \
            as tsv_inp, open(op_file, 'w',
                             newline='\n') as tsv_op:

            csv_reader = csv.reader(tsv_inp, delimiter='\t')
            csv_writer = csv.writer(tsv_op, delimiter='\t')

            # Read the first row
            headers = next(csv_reader, None)
            csv_writer.writerow(headers+['SITE_NAME', 'SNAPSHOT'])

            for row in csv_reader:
                row.append('{}'.format(ftp_site_name))
                row.append('{}'.format(timestamp))
                csv_writer.writerow(row)

I've confirmed the output is now UTF-8. And yes, I know the CSV append doesn't work yet; one likely culprit is sketched below.
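
Two things I suspect, though I haven't verified them yet: convert_to_utf8 only writes output_file when it actually re-encodes, so temp_file never exists for files that were already UTF-8/ASCII, and the readback should pin encoding='utf-8' rather than relying on the platform default. A sketch of the guard, reusing the variables from the snippet above:

# temp_file is only written on the re-encode path, so fall back to the
# original download when no re-encoding happened
src_file = temp_file if s3_file['re-encoded'] else inp_file

with open(src_file, 'r', encoding='utf-8', newline='') as tsv_inp, \
        open(op_file, 'w', encoding='utf-8', newline='') as tsv_op:
    csv_reader = csv.reader(tsv_inp, delimiter='\t')
    csv_writer = csv.writer(tsv_op, delimiter='\t')
    headers = next(csv_reader, None)
    csv_writer.writerow(headers + ['SITE_NAME', 'SNAPSHOT'])
    for row in csv_reader:
        csv_writer.writerow(row + [ftp_site_name, timestamp])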

If you wanted, you could make the converter more flexible: just call it convert_encoding and pass in the desired encoding. Something like:

def convert_encoding(check_file, output_file, desiredencoding):

    with open(check_file, 'rb') as of:
        chardet_data = chardet.detect(of.read())
        fileencoding = (chardet_data['encoding'])
        print('fileencoding', fileencoding)

    # chardet reports encoding names in varying case, so compare case-insensitively
    if fileencoding and fileencoding.lower() == desiredencoding.lower():
        return {'re-encoded': False, 'encoding': fileencoding}

    else:
        with open(check_file, 'r', encoding=fileencoding) as of, \
                open(output_file, 'w', encoding=desiredencoding) as cf:
            cf.write(of.read())

        return {'re-encoded': True, 'encoding': fileencoding}
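
Usage would then look like this (placeholder file names):

print(convert_encoding('download.txt', 'converted.tsv', 'utf-8'))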

Off now to add file cleanup and logging instead of prints. In case anyone's interested, here's the code with some anonymization (yes, it needs more work, but I'm self-taught and still trying to get my head around main and classes):

from datetime import datetime as dt
from datetime import timedelta
import base64
import boto3
import botocore
import chardet
import csv
import datetime
import ftplib
import os
import tempfile
import traceback


def assume_role_with_iam_user(access_key, secret_key, role_arn,
                              session_name='AssumedSession'):
"""
Assumes an IAM role using IAM User's access and secret keys.

Parameters:
    access_key (str): IAM User's access key.
    secret_key (str): IAM User's secret key.
    role_arn (str): ARN of the IAM role you want to assume.
    session_name (str): Name of the assumed role session (optional).

Returns:
    boto3.Session: A session with the assumed role credentials.
"""
sts_client = boto3.client('sts',
                          aws_access_key_id=base64.b64decode(
                              access_key).decode('utf-8'),
                          aws_secret_access_key=base64.b64decode(
                              secret_key).decode('utf-8'),
                          region_name='us-west-1')

# Assume the role
assumed_role = sts_client.assume_role(
    RoleArn=role_arn,
    RoleSessionName=session_name
)

# Create a new session with the assumed role credentials
assumed_credentials = assumed_role['Credentials']
session = boto3.Session(
    aws_access_key_id=assumed_credentials['AccessKeyId'],
    aws_secret_access_key=assumed_credentials['SecretAccessKey'],
    aws_session_token=assumed_credentials['SessionToken']
)

return session


def ftp_connection(HOST, USER, PASS):

    try:
        ftp = ftplib.FTP(source_address=())
        ftp.connect(HOST)
        ftp.login(USER, PASS)
        ftp.set_pasv(False)
        ftp.set_debuglevel(4)

    except ftplib.all_errors as ex:
        print(str(ex))
        raise

    return ftp


def convert_to_utf8(check_file, output_file):

    with open(check_file, 'rb') as of:
        chardet_data = chardet.detect(of.read())
        fileencoding = (chardet_data['encoding'])
        print('fileencoding', fileencoding)

    if fileencoding in ['utf-8', 'ascii']:
        return {'re-encoded': False, 'encoding': fileencoding}

    else:
        with open(check_file, 'r', encoding=fileencoding) as of, \
                open(output_file, 'w', encoding='utf-8') as cf:
            cf.write(of.read())

        return {'re-encoded': True, 'encoding': fileencoding}


def get_log_data():

    # Define variables for later use:
    yesterday = dt.strftime(dt.today() - timedelta(days=1), '%y%m%d')
    # Create snapshot time
    timestamp = dt.utcnow().isoformat()
    # Where we pick up the ftp config file and its name
    config_s3_bucket_name = 'ftp-config-file'
    config_file_key = 'ftp_config.csv'
    # Where we want the data to go
    data_s3_bucket_name = 'company-data'
    # This is the name of the crawler we need to run
    crawler_name = 'HMILogs'

    # These are our AWS keys, Base64 encoded; need to improve security later
    ak = 'NoKeyForYou'
    sk = 'StillNoKeyForYou'
    role = 'arn:aws:iam::123456789:role/service-role/DataRetrieve-role-lh22tofx'

    # Step 1: Assume role to get creds
    aws_session = assume_role_with_iam_user(ak, sk, role)

    try:
        # Step 2: Connect to S3 and download the config file
        s3 = aws_session.client('s3',
                                config=boto3.session
                                .Config(signature_version='s3v4'))
        config_file_obj = s3.get_object(Bucket=config_s3_bucket_name,
                                        Key=config_file_key)
        config_file_data = config_file_obj['Body'] \
            .read().decode('utf-8').splitlines()
        config_reader = csv.DictReader(config_file_data)

        # Step 3: Loop through each row in the config file
        for row in config_reader:
            ftp_site_name = row['ftp_site_name']
            ftp_ip = row['ftp_ip_address']
            ftp_username = row['ftp_username']
            ftp_password = row['ftp_password']
            file_directory = row['ftp_log_directory']
            filename_convention = row['filename_convention']

            # Create the temporary file paths for later use
            temp_dir = tempfile.gettempdir()

            if not os.path.exists(temp_dir):
                tempfile.mkdtemp(os.path.abspath(__file__) + os.sep + 'tmp')
                temp_dir = os.path.abspath(__file__) + os.sep + 'tmp'

            inp_file = os.path.join(
                temp_dir, filename_convention + yesterday + '.txt')

            temp_file = os.path.join(temp_dir, 'temp_file.tsv')

            op_file = os.path.join(
                temp_dir, filename_convention + yesterday + '.tsv')

            # Step 4: Connect to the current FTP, Download the File,
            # Write to s3 and clean up
            # Connect to the FTP site
            conn = ftp_connection(ftp_ip, ftp_username, ftp_password)

            # Change the directory
            conn.cwd(file_directory)
            conn.sendcmd('SITE CHMOD 777 ' + filename_convention
                         + yesterday + '.txt')

            # Download the file to the local temporary directory
            with open(inp_file, 'wb') as f:
                conn.retrbinary('RETR ' + filename_convention
                                + yesterday + '.txt', f.write)

            s3_file = convert_to_utf8(inp_file,
                                      temp_file)

            print(s3_file['re-encoded'], s3_file['encoding'])

            with open(temp_file, 'r') \
                as tsv_inp, open(op_file, 'w',
                                 newline='\n') as tsv_op:

                csv_reader = csv.reader(tsv_inp, delimiter='\t')
                csv_writer = csv.writer(tsv_op, delimiter='\t')

                # Read the first row
                headers = next(csv_reader, None)
                csv_writer.writerow(headers+['SITE_NAME', 'SNAPSHOT'])

                for row in csv_reader:
                    row.append('{}'.format(ftp_site_name))
                    row.append('{}'.format(timestamp))
                    csv_writer.writerow(row)

            # Upload the file from the local temporary directory to S3
            s3_key = '{}/dt={}/{}.csv'.format(ftp_site_name, yesterday,
                                              filename_convention + '.tsv')
            # s3.upload_file(local_temp_file_op, Bucket=data_s3_bucket_name,
            # Key=s3_key)

            try:
                s3.head_object(Bucket=data_s3_bucket_name,
                               Key=s3_key)

            except botocore.exceptions.ClientError as error:
                # botocore reports error codes as strings, hence '404'
                if error.response['Error']['Code'] == '404':
                    print("Object does not exist!")

            print('ENABLE DELETE COMMAND IN THE FUTURE')
            # conn.sendcmd('DELETE ' + filename_convention
            # + yesterday + '.txt')

            # Close connection, Close local File, and remove
            conn.close()
            # os.remove(local_temp_directory + '\\' + local_file)

        # Step 5: Crawl the new data into the Table
        glue = aws_session.client('glue', region_name='us-west-1')

        if not glue.get_crawler(Name=crawler_name)['Crawler']['State'] \
                == 'RUNNING':
            glue.start_crawler(Name=crawler_name)
            print('Crawler Started')

    except Exception as ex:
        print(str({type(ex).__name__}))
        traceback.print_exception(ex)

    return 'Function executed successfully.'


get_log_data()