How to Parse Hierarchical Data and Format it into a TSV File in Python?

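Question:

I have a dataset containing hierarchical information and KO numbers, and I want to format it in Python into a TSV (tab-separated values) file. The first column should contain the KO number, the second column the description, and the third column the hierarchy, starting from the nearest "A" section in the input data. The hierarchy should include the "A", "B" and "C" elements, down to the nearest "C" section. In addition, if the same KO number appears more than once, its hierarchies should be separated by | on the same row.

Input data (file.keg format):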

A09100 Metabolism
B
B  09101 Carbohydrate metabolism
C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D      K00844  HK; hexokinase [EC:2.7.1.1]
D      K12407  GCK; glucokinase [EC:2.7.1.2]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B  09103 Lipid metabolism
C    00071 Fatty acid degradation [PATH:ko00071]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B  09121 Transcription
C    03020 RNA polymerase [PATH:ko03020]
D      K03043  rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D      K13797  rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]

Expected output:

KO      metadata_KEGG_Description        metadata_KEGG_Pathways
K00844  HK; hexokinase [EC:2.7.1.1]     Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K12407  GCK; glucokinase [EC:2.7.1.2]   Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]    Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis|Metabolism, Lipid metabolism, Fatty acid degradation
K03043  rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]  Genetic Information Processing, Transcription, RNA polymerase
K13797  rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]  Genetic Information Processing, Transcription, RNA polymerase

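I would appreciate any help or guidance on how to correctly process this data into the desired TSV file based on the hierarchical information provided. Thank you for your help!

Here is my code: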

data = """A09100 Metabolism
B
B  09101 Carbohydrate metabolism
C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D      K00844  HK; hexokinase [EC:2.7.1.1]
D      K12407  GCK; glucokinase [EC:2.7.1.2]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B  09103 Lipid metabolism
C    00071 Fatty acid degradation [PATH:ko00071]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B  09121 Transcription
C    03020 RNA polymerase [PATH:ko03020]
D      K03043  rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D      K13797  rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]"""


lines = data.split('\n')

result = []

ko = None
description = None
hierarchy_names = []

for line in lines:
    parts = line.strip().split()
    if parts:
        if parts[0].startswith('A'):
            # Reset hierarchy for a new 'A' section
            hierarchy_names = [" ".join(parts[1:])]
        elif parts[0] == 'K':
            ko = parts[0]
            description = " ".join(parts[1:])
        elif parts[0] == 'D' and len(parts) >= 3:
            ko = parts[1]
            description = " ".join(parts[2:])
        else:
            hierarchy_names.append(" ".join(parts[1:]))

    if ko and description:
        hierarchy_str = ", ".join(hierarchy_names)
        result.append([ko, description, hierarchy_str])

# Add the header row
result.insert(0, ["KO", "metadata_KEGG_Description", "metadata_KEGG_Pathways"])

# Specify the filename for the TSV file
tsv_filename = "output_data.tsv"

with open(tsv_filename, 'w') as tsv_file:
    for row in result:
        tsv_file.write("\t".join(row) + "\n")

print(f"Data saved to {tsv_filename}")

python for-loop while-loop pandas networkx

import re

data = """A09100 Metabolism
B
B  09101 Carbohydrate metabolism
C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D      K00844  HK; hexokinase [EC:2.7.1.1]
D      K12407  GCK; glucokinase [EC:2.7.1.2]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B  09103 Lipid metabolism
C    00071 Fatty acid degradation [PATH:ko00071]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B  09121 Transcription
C    03020 RNA polymerase [PATH:ko03020]
D      K03043  rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D      K13797  rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]"""

lines = data.split('\n')

# KO -> {"description": ..., "pathways": [...]}, kept in first-seen order
unique_ko_hierarchy = {}

# Most recent name seen at each of the A, B and C levels
level_a = level_b = level_c = None


def clean_name(text):
    """Drop the leading numeric ID and any trailing "[PATH:...]" tag."""
    text = text.split('[PATH:')[0].strip()
    return re.sub(r'^\d+\s*', '', text)


for line in lines:
    parts = line.strip().split()
    if not parts:
        continue
    if parts[0].startswith('A'):
        # "A09100 Metabolism": the ID is glued to the level letter
        level_a = " ".join(parts[1:])
        level_b = level_c = None
    elif parts[0] == 'B' and len(parts) >= 3:
        # Bare "B" separator lines are skipped by the length check
        level_b = clean_name(" ".join(parts[1:]))
        level_c = None
    elif parts[0] == 'C' and len(parts) >= 3:
        # A new C replaces the previous one instead of being appended to it
        level_c = clean_name(" ".join(parts[1:]))
    elif parts[0] == 'D' and len(parts) >= 3:
        ko = parts[1]
        description = " ".join(parts[2:])
        pathway = ", ".join(filter(None, [level_a, level_b, level_c]))
        entry = unique_ko_hierarchy.setdefault(
            ko, {"description": description, "pathways": []})
        entry["pathways"].append(pathway)

# Header row followed by one row per unique KO
result = [["KO", "metadata_KEGG_Description", "metadata_KEGG_Pathways"]]
for ko, info in unique_ko_hierarchy.items():
    # Hierarchies of a repeated KO are joined with "|" on the same row
    result.append([ko, info["description"], "|".join(info["pathways"])])

# Specify the filename for the TSV file
tsv_filename = "output_data.tsv"

with open(tsv_filename, 'w') as tsv_file:
    for row in result:
        tsv_file.write("\t".join(row) + "\n")

print(f"Data saved to {tsv_filename}")
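
Writing each row with "\t".join(row) is fine as long as no field contains a tab or a newline. As a minimal sketch, assuming the same three columns, the standard-library csv module can write the rows and read them back. The sample rows are copied from the expected output in the question, and the in-memory buffer only stands in for a real file; both are illustrative assumptions.

```python
import csv
import io

# Sample rows copied from the expected output in the question (illustrative)
rows = [
    ["KO", "metadata_KEGG_Description", "metadata_KEGG_Pathways"],
    ["K00844", "HK; hexokinase [EC:2.7.1.1]",
     "Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis"],
    ["K00001", "E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]",
     "Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis"
     "|Metabolism, Lipid metabolism, Fatty acid degradation"],
]

# The buffer stands in for open("output_data.tsv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)

# Read the TSV back and split the "|"-separated hierarchies of each KO
buffer.seek(0)
pathways_by_ko = {
    record["KO"]: record["metadata_KEGG_Pathways"].split("|")
    for record in csv.DictReader(buffer, delimiter="\t")
}
print(pathways_by_ko["K00001"])
```

Reading the file back this way gives one list of hierarchies per KO, which is a quick way to confirm that repeated KO numbers were merged onto a single row.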

2 upvotes Timeless 11/6/2023 #2

I would suggest checking DBGETEntryParser from Orange Bioinformatics, which handles KEGG files. Otherwise, if you want to use pandas with the help of some regex, you can try this:

import re

with open("file.keg") as f:
    pat = r"^([A-D]) *(\S+)\s*(.+?)\s*(\[.+\])?(?=$)"
    data = re.findall(pat, f.read(), flags=re.MULTILINE)

import pandas as pd

tmp = pd.DataFrame(data, columns=["section", "name", "attribute", "path"])
mA = tmp["section"].eq("A"); mD = tmp["section"].eq("D")
df = tmp.assign(entry= tmp["attribute"].where(mA).ffill()).loc[~mA]

parents = (df["entry"].str.cat(df["attribute"].groupby(
        mD.ne(mD.shift()).cumsum(), sort=False)
            .transform(", ".join).where(~mD).ffill(), sep=", ")
            .rename("parents"))

edges = df[["name"]].join(parents).loc[mD, ["parents", "name"]]

out = (df.join(parents).loc[mD].assign(metadata_KEGG_Pathways=
        lambda x: x["attribute"].str.cat(x["path"], sep=" "))
           .groupby("name", sort=False, as_index=False).agg(
               metadata_KEGG_Description=("metadata_KEGG_Pathways", "first"),
               metadata_KEGG_Pathways=("parents", "|".join)))

# out.to_csv("file.tsv", sep="\t", index=False) # uncomment to make a `.tsv`
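
To see what the regular expression captures before pandas gets involved, here is a small standalone check using only `re`. The pattern is the one from the code above; the `sample` string is a shortened copy of the question's input and is only illustrative.

```python
import re

# Same pattern as in the pandas answer above
pat = r"^([A-D]) *(\S+)\s*(.+?)\s*(\[.+\])?(?=$)"

# Shortened copy of the question's input (illustrative)
sample = """A09100 Metabolism
B
B  09101 Carbohydrate metabolism
C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D      K00844  HK; hexokinase [EC:2.7.1.1]"""

# Each match is a (section, name, attribute, path) tuple
matches = re.findall(pat, sample, flags=re.MULTILINE)
for section, name, attribute, path in matches:
    print(section, name, attribute, path, sep=" | ")
```

The bare "B" separator line produces no match, so only four tuples come back. The trailing bracketed part ("[PATH:...]" or "[EC:...]") lands in the fourth group, which is an empty string when a line has no bracket.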

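Output (in tabular format):

name	metadata_KEGG_Description	metadata_KEGG_Pathways
K00844	HK; hexokinase [EC:2.7.1.1]	Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K12407	GCK; glucokinase [EC:2.7.1.2]	Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K00001	E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]	Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis|Metabolism, Lipid metabolism, Fatty acid degradation
K03043	rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]	Genetic Information Processing, Transcription, RNA polymerase
K13797	rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]	Genetic Information Processing, Transcription, RNA polymerase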

Graph visualization with networkx:

import networkx as nx
from itertools import chain, pairwise

G = nx.from_edgelist(
    chain.from_iterable(
        [pairwise(vals) for vals in edges.agg(
            ",".join, axis=1).str.split(",").to_numpy()]),
    create_using=nx.DiGraph
)
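
The graph is built from consecutive (parent, child) pairs along each hierarchy path. Here is a minimal standalone sketch of that pairing step, using `zip` rather than `itertools.pairwise` (which needs Python 3.10 or later). The `paths` list and the `edge_list` name are illustrative assumptions, with values copied from the question.

```python
# Illustrative hierarchy paths (A, B, C, then the KO), copied from the question
paths = [
    ["Metabolism", "Carbohydrate metabolism", "Glycolysis / Gluconeogenesis", "K00001"],
    ["Metabolism", "Lipid metabolism", "Fatty acid degradation", "K00001"],
]

edge_list = []
for path in paths:
    # zip(path, path[1:]) yields the same consecutive pairs as itertools.pairwise
    for parent, child in zip(path, path[1:]):
        if (parent, child) not in edge_list:  # keep each edge once, in first-seen order
            edge_list.append((parent, child))

for parent, child in edge_list:
    print(f"{parent} -> {child}")
```

A list of pairs like this can be passed to `nx.from_edgelist(edge_list, create_using=nx.DiGraph)` to obtain a directed graph in which a KO that belongs to several pathways has several incoming edges.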
