How to Parse Hierarchical Data and Format it into a TSV File in Python?

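Asked by: Umar, asked: 11/5/2023, last edited by: Umar, updated: 11/7/2023, views: 79

Q:

I have a dataset containing hierarchical information and KO numbers, and I would like to format it in Python as a TSV (tab-separated values) file. The first column should contain the KO number, the second column the description, and the third column the hierarchy, based on the nearest "A" section in the input data. The hierarchy should include the elements of the "A", "B" and "C" levels, down to the nearest "C" section. In addition, if the same KO number occurs more than once, its hierarchies should be joined with | on the same row. The input data is in file.keg format. Input data: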

A09100 Metabolism
B
B  09101 Carbohydrate metabolism
C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D      K00844  HK; hexokinase [EC:2.7.1.1]
D      K12407  GCK; glucokinase [EC:2.7.1.2]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B  09103 Lipid metabolism
C    00071 Fatty acid degradation [PATH:ko00071]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B  09121 Transcription
C    03020 RNA polymerase [PATH:ko03020]
D      K03043  rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D      K13797  rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]

Expected output:

    KO      metadata_KEGG_Description        metadata_KEGG_Pathways
K00844  HK; hexokinase [EC:2.7.1.1]     Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K12407  GCK; glucokinase [EC:2.7.1.2]   Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]    Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis|Metabolism, Lipid metabolism, Fatty acid degradation
K03043  rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]  Genetic Information Processing, Transcription, RNA polymerase
K13797  rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]  Genetic Information Processing, Transcription, RNA polymerase

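I would appreciate any help or guidance on how to process this data correctly into the desired TSV file based on the hierarchical information provided. Thanks for your help!

Here is my code: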

data = """A09100 Metabolism
B
B  09101 Carbohydrate metabolism
C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D      K00844  HK; hexokinase [EC:2.7.1.1]
D      K12407  GCK; glucokinase [EC:2.7.1.2]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B  09103 Lipid metabolism
C    00071 Fatty acid degradation [PATH:ko00071]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B  09121 Transcription
C    03020 RNA polymerase [PATH:ko03020]
D      K03043  rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D      K13797  rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]"""


lines = data.split('\n')

result = []

ko = None
description = None
hierarchy_names = []

for line in lines:
    parts = line.strip().split()
    if parts:
        if parts[0].startswith('A'):
            # Reset hierarchy for a new 'A' section
            hierarchy_names = [" ".join(parts[1:])]
        elif parts[0] == 'K':
            ko = parts[0]
            description = " ".join(parts[1:])
        elif parts[0] == 'D' and len(parts) >= 3:
            ko = parts[1]
            description = " ".join(parts[2:])
        else:
            hierarchy_names.append(" ".join(parts[1:]))

    if ko and description:
        hierarchy_str = ", ".join(hierarchy_names)
        result.append([ko, description, hierarchy_str])

# Add the header row
result.insert(0, ["KO", "metadata_KEGG_Description", "metadata_KEGG_Pathways"])

# Specify the filename for the TSV file
tsv_filename = "output_data.tsv"

with open(tsv_filename, 'w') as tsv_file:
    for row in result:
        tsv_file.write("\t".join(row) + "\n")

print(f"Data saved to {tsv_filename}")

Tags: python for-loop while-loop pandas networkx

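Comments

0 votes, Umar, 11/5/2023
Do I have to post the code I tried?
1 vote, Timeless, 11/5/2023
@Umar, can you show some effort and format your input and output properly? By the way, what is the type of the input? Also, can you explain why K03043 should have "Genetic Information Processing, Transcription, RNA polymerase|Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis" in metadata_KEGG_Pathways?
0 votes, Umar, 11/5/2023
Thanks for pointing that out, I have corrected it. The input data was in hierarchical form when I posted it.
0 votes, Timeless, 11/5/2023
Thanks for the edit, but what type is your input? A .txt file or something else?
0 votes, Umar, 11/5/2023
It is a .keg file, which is similar to .txt, yes.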

A:

0 votes, Umar, 11/5/2023, #1
import re

data = """A09100 Metabolism
B
B  09101 Carbohydrate metabolism
C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D      K00844  HK; hexokinase [EC:2.7.1.1]
D      K12407  GCK; glucokinase [EC:2.7.1.2]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B  09103 Lipid metabolism
C    00071 Fatty acid degradation [PATH:ko00071]
D      K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B  09121 Transcription
C    03020 RNA polymerase [PATH:ko03020]
D      K03043  rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D      K13797  rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]"""

lines = data.split('\n')

result = []

# Maps each KO number to its description and the pathways it appears under
unique_ko_hierarchy = {}

# Current name at each of the A, B and C levels
current = {"A": None, "B": None, "C": None}

for line in lines:
    level, rest = line[:1], line[1:].strip()
    if not rest:
        continue  # skip blank lines and the bare "B" separator lines
    if level in current:
        # Drop the leading numeric ID and any trailing "[PATH:...]" tag
        name = rest.split(maxsplit=1)[1]
        current[level] = re.sub(r"\s*\[PATH:[^\]]*\]$", "", name)
        # A new section invalidates the deeper levels, so each C replaces
        # the previous one instead of piling up under the same B
        for deeper in "ABC"["ABC".index(level) + 1:]:
            current[deeper] = None
    elif level == "D":
        ko, description = rest.split(maxsplit=1)
        pathway = ", ".join(name for name in current.values() if name)
        entry = unique_ko_hierarchy.setdefault(
            ko,
            {"metadata_KEGG_Description": description, "metadata_KEGG_Pathways": []},
        )
        entry["metadata_KEGG_Pathways"].append(pathway)

# Add the header row
result.append(["KO", "metadata_KEGG_Description", "metadata_KEGG_Pathways"])

for ko, entry in unique_ko_hierarchy.items():
    # Several pathways for the same KO are joined with "|" on one row
    result.append([ko, entry["metadata_KEGG_Description"], "|".join(entry["metadata_KEGG_Pathways"])])

# Specify the filename for the TSV file
tsv_filename = "output_data.tsv"

with open(tsv_filename, 'w') as tsv_file:
    for row in result:
        tsv_file.write("\t".join(row) + "\n")

print(f"Data saved to {tsv_filename}")
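
Since the comments establish that the real input is a .keg file rather than an inline string, the same idea can be wrapped in a function that accepts any iterable of lines, with the csv module handling the tab-separated output. This is only a minimal sketch: the names parse_keg and write_tsv are made up for illustration, and it assumes every A, B and C line carries an ID followed by a name, as in the sample above.

```python
import csv
import re


def parse_keg(lines):
    """Return {ko: (description, [pathway, ...])} from KEGG-style lines."""
    current = {"A": None, "B": None, "C": None}
    records = {}
    for line in lines:
        level, rest = line[:1], line[1:].strip()
        if not rest:
            continue  # blank lines and bare "B" separator lines
        parts = rest.split(maxsplit=1)
        if level in current:
            # parts[0] is the numeric ID; keep only the name, minus "[PATH:...]"
            name = parts[1] if len(parts) > 1 else ""
            current[level] = re.sub(r"\s*\[PATH:[^\]]*\]$", "", name)
            # entering a new A or B section clears the deeper levels
            for deeper in "ABC"["ABC".index(level) + 1:]:
                current[deeper] = None
        elif level == "D":
            ko = parts[0]
            description = parts[1] if len(parts) > 1 else ""
            pathway = ", ".join(n for n in current.values() if n)
            _, pathways = records.setdefault(ko, (description, []))
            if pathway not in pathways:
                pathways.append(pathway)
    return records


def write_tsv(records, path):
    """Write one row per KO; multiple pathways are joined with '|'."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["KO", "metadata_KEGG_Description", "metadata_KEGG_Pathways"])
        for ko, (description, pathways) in records.items():
            writer.writerow([ko, description, "|".join(pathways)])
```

With these two helpers, opening file.keg and passing the file object to parse_keg, then handing the result to write_tsv together with an output filename, should produce the same table as the script above. Iterating over the file object avoids loading the whole file into a string first.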
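2 votes, Timeless, 11/6/2023, #2

I suggest you check out DBGETEntryParser from Orange Bioinformatics, which deals with KEGG files. Otherwise, if you want to use pandas with the help of some regex, you can try this: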

import re

with open("file.keg") as f:
    pat = r"^([A-D]) *(\S+)\s*(.+?)\s*(\[.+\])?(?=$)"
    data = re.findall(pat, f.read(), flags=re.MULTILINE)

regex101-demo

import pandas as pd

tmp = pd.DataFrame(data, columns=["section", "name", "attribute", "path"])
mA = tmp["section"].eq("A"); mD = tmp["section"].eq("D")
df = tmp.assign(entry= tmp["attribute"].where(mA).ffill()).loc[~mA]

parents = (df["entry"].str.cat(df["attribute"].groupby(
        mD.ne(mD.shift()).cumsum(), sort=False)
            .transform(", ".join).where(~mD).ffill(), sep=", ")
            .rename("parents"))

edges = df[["name"]].join(parents).loc[mD, ["parents", "name"]]

out = (df.join(parents).loc[mD].assign(metadata_KEGG_Pathways=
        lambda x: x["attribute"].str.cat(x["path"], sep=" "))
           .groupby("name", sort=False, as_index=False).agg(
               metadata_KEGG_Description=("metadata_KEGG_Pathways", "first"),
               metadata_KEGG_Pathways=("parents", "|".join)))

# out.to_csv("file.tsv", sep="\t", index=False) # uncomment to make a `.tsv`

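Output (tabular format):

name    metadata_KEGG_Description                                          metadata_KEGG_Pathways
K00844  HK; hexokinase [EC:2.7.1.1]                                        Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K12407  GCK; glucokinase [EC:2.7.1.2]                                      Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K00001  E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]                  Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis|Metabolism, Lipid metabolism, Fatty acid degradation
K03043  rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]        Genetic Information Processing, Transcription, RNA polymerase
K13797  rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]  Genetic Information Processing, Transcription, RNA polymerase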
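
To see what the regular expression in this answer captures before pandas gets involved, it can be run on a few sample lines in isolation. The sketch below uses an inline string in place of file.keg (an assumption made only to keep the example self-contained). Each match is a (section, name, attribute, path) tuple, and the bare B separator line yields no match because the pattern requires a non-whitespace token after the section letter.

```python
import re

# Same pattern as in the answer: section letter, ID, free text, optional [..] tag
pat = r"^([A-D]) *(\S+)\s*(.+?)\s*(\[.+\])?(?=$)"

sample = """A09100 Metabolism
B
B  09101 Carbohydrate metabolism
C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D      K00844  HK; hexokinase [EC:2.7.1.1]"""

rows = re.findall(pat, sample, flags=re.MULTILINE)
for row in rows:
    print(row)
```

Note that the trailing bracketed tag lands in the fourth group for both C lines ([PATH:...]) and D lines ([EC:...]), which is why the answer concatenates attribute and path again to rebuild the description of a D row.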

Graph visualization:

import networkx as nx
from itertools import chain, pairwise

G = nx.from_edgelist(
    chain.from_iterable(
        [pairwise(vals) for vals in edges.agg(
            ",".join, axis=1).str.split(",").to_numpy()]),
    create_using=nx.DiGraph
)
