Asked by: Umar Asked on: 11/5/2023 Last edited by: Umar Updated: 11/7/2023 Views: 79
How to Parse Hierarchical Data and Format it into a TSV File in Python?
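Q:
I have a dataset containing hierarchical information and KO numbers. I want to format this data into a TSV (tab-separated values) file in Python, where the first column contains the KO number, the second column contains the description, and the third column contains the hierarchy, starting from the nearest preceding "A" section in the input data. The hierarchy should list the "A", "B", and "C" elements, down to the nearest "C" section. In addition, if the same KO number occurs more than once, its hierarchies should be joined with | on the same row. The input data is in file.keg format.
Input data: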
A09100 Metabolism
B
B 09101 Carbohydrate metabolism
C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D K00844 HK; hexokinase [EC:2.7.1.1]
D K12407 GCK; glucokinase [EC:2.7.1.2]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B 09103 Lipid metabolism
C 00071 Fatty acid degradation [PATH:ko00071]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B 09121 Transcription
C 03020 RNA polymerase [PATH:ko03020]
D K03043 rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D K13797 rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]
Expected output:
KO metadata_KEGG_Description metadata_KEGG_Pathways
K00844 HK; hexokinase [EC:2.7.1.1] Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K12407 GCK; glucokinase [EC:2.7.1.2] Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1] Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis|Metabolism, Lipid metabolism, Fatty acid degradation
K03043 rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6] Genetic Information Processing, Transcription, RNA polymerase
K13797 rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6] Genetic Information Processing, Transcription, RNA polymerase
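I would appreciate any help or guidance on how to correctly process this data into the desired TSV file based on the hierarchical information provided. Thank you for your help!
Here is my code: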
data = """A09100 Metabolism
B
B 09101 Carbohydrate metabolism
C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D K00844 HK; hexokinase [EC:2.7.1.1]
D K12407 GCK; glucokinase [EC:2.7.1.2]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B 09103 Lipid metabolism
C 00071 Fatty acid degradation [PATH:ko00071]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B 09121 Transcription
C 03020 RNA polymerase [PATH:ko03020]
D K03043 rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D K13797 rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]"""
lines = data.split('\n')
result = []
ko = None
description = None
hierarchy_names = []
for line in lines:
    parts = line.strip().split()
    if parts:
        if parts[0].startswith('A'):
            # Reset hierarchy for a new 'A' section
            hierarchy_names = [" ".join(parts[1:])]
        elif parts[0] == 'K':
            ko = parts[0]
            description = " ".join(parts[1:])
        elif parts[0] == 'D' and len(parts) >= 3:
            ko = parts[1]
            description = " ".join(parts[2:])
        else:
            hierarchy_names.append(" ".join(parts[1:]))
        if ko and description:
            hierarchy_str = ", ".join(hierarchy_names)
            result.append([ko, description, hierarchy_str])
# Add the header row
result.insert(0, ["KO", "metadata_KEGG_Description", "metadata_KEGG_Pathways"])
# Specify the filename for the TSV file
tsv_filename = "output_data.tsv"
with open(tsv_filename, 'w') as tsv_file:
    for row in result:
        tsv_file.write("\t".join(row) + "\n")
print(f"Data saved to {tsv_filename}")
A:
0 votes
Umar
11/5/2023
#1
import re
data = """A09100 Metabolism
B
B 09101 Carbohydrate metabolism
C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D K00844 HK; hexokinase [EC:2.7.1.1]
D K12407 GCK; glucokinase [EC:2.7.1.2]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B 09103 Lipid metabolism
C 00071 Fatty acid degradation [PATH:ko00071]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B 09121 Transcription
C 03020 RNA polymerase [PATH:ko03020]
D K03043 rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D K13797 rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]"""
lines = data.split('\n')
result = []
ko = None
unique_ko_hierarchy = {}
current_hierarchy = []
current_group = None
for line in lines:
    parts = line.strip().split()
    if parts:
        if parts[0].startswith('A'):
            current_group = " ".join(parts[1:])
        elif parts[0].startswith('B'):
            current_hierarchy = [current_group, " ".join(parts[1:])]
        elif parts[0] == 'K':
            ko = parts[1]
            description = " ".join(parts[2:])
            if ko in unique_ko_hierarchy:
                unique_ko_hierarchy[ko]["metadata_KEGG_Pathways"].append(",".join(current_hierarchy))
            else:
                unique_ko_hierarchy[ko] = {
                    "metadata_KEGG_Description": description,
                    "metadata_KEGG_Pathways": [",".join(current_hierarchy)],
                }
        elif parts[0] == 'D' and len(parts) >= 3:
            ko = parts[1]
            description = " ".join(parts[2:])
            if ko in unique_ko_hierarchy:
                unique_ko_hierarchy[ko]["metadata_KEGG_Pathways"].append(",".join(current_hierarchy))
            else:
                unique_ko_hierarchy[ko] = {
                    "metadata_KEGG_Description": description,
                    "metadata_KEGG_Pathways": [",".join(current_hierarchy)],
                }
        else:
            current_hierarchy.append(" ".join(parts[1:]))
# Add the header row
result.append(["KO", "metadata_KEGG_Description", "metadata_KEGG_Pathways"])
for ko, data in unique_ko_hierarchy.items():
    # Remove numeric prefixes and "[PATH:...]" patterns from the pathway string
    pathways = [re.sub(r'\d+\s*', '', pathway.split('[PATH:')[0].strip()) for pathway in data["metadata_KEGG_Pathways"]]
    result.append([ko, data["metadata_KEGG_Description"], "|".join(pathways)])
# Specify the filename for the TSV file
tsv_filename = "output_data.tsv"
with open(tsv_filename, 'w') as tsv_file:
    for row in result:
        tsv_file.write("\t".join(row) + "\n")
print(f"Data saved to {tsv_filename}")
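For comparison, here is a minimal standard-library sketch of the same idea. It tracks the A, B, and C levels separately, so a new C section replaces the previous one rather than being appended to it, and it joins levels with ", " as in the expected output. The names `parse_keg`, `to_tsv`, and `SAMPLE` are hypothetical, and the sketch assumes every data line starts with one of the letters A to D followed by an identifier and a label, as in the sample from the question; real .keg files may need extra handling.

```python
import csv
import io
import re

SAMPLE = """\
A09100 Metabolism
B
B 09101 Carbohydrate metabolism
C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D K00844 HK; hexokinase [EC:2.7.1.1]
D K12407 GCK; glucokinase [EC:2.7.1.2]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B 09103 Lipid metabolism
C 00071 Fatty acid degradation [PATH:ko00071]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B 09121 Transcription
C 03020 RNA polymerase [PATH:ko03020]
D K03043 rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D K13797 rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]
"""

LEVELS = "ABC"  # hierarchy levels, outermost first


def parse_keg(text):
    """Return {ko: (description, [pathway, ...])} in first-seen order."""
    current = dict.fromkeys(LEVELS, "")
    records = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] not in "ABCD":
            continue  # skip blank lines and anything that is not a data line
        tag, rest = line[0], line[1:].strip()
        if not rest:
            continue  # bare "B" separator lines carry no label
        code, _, label = rest.partition(" ")
        label = label.strip()
        if tag == "D":
            path = ", ".join(current[k] for k in LEVELS if current[k])
            _, paths = records.setdefault(code, (label, []))
            if path not in paths:
                paths.append(path)
        else:
            # Drop a trailing bracketed annotation such as "[PATH:ko00010]".
            current[tag] = re.sub(r"\s*\[[^\]]*\]$", "", label)
            # Entering a new A or B section invalidates the deeper levels.
            for deeper in LEVELS[LEVELS.index(tag) + 1:]:
                current[deeper] = ""
    return records


def to_tsv(records):
    """Render the records as TSV text, joining repeated pathways with '|'."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["KO", "metadata_KEGG_Description", "metadata_KEGG_Pathways"])
    for ko, (description, paths) in records.items():
        writer.writerow([ko, description, "|".join(paths)])
    return buf.getvalue()


records = parse_keg(SAMPLE)
print(to_tsv(records), end="")
```

To write a file instead of printing, the returned text can be saved with `open("output_data.tsv", "w", newline="")`.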
2 votes
Timeless
11/6/2023
#2
I suggest you check out DBGETEntryParser from Orange Bioinformatics, which handles KEGG files. Otherwise, if you want to use pandas with the help of some regex, you can try this:
import re
with open("file.keg") as f:
    pat = r"^([A-D]) *(\S+)\s*(.+?)\s*(\[.+\])?(?=$)"
    data = re.findall(pat, f.read(), flags=re.MULTILINE)
regex101-demo
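To see what this pattern captures, here is a small sketch that applies it to an in-memory string instead of a file. The `sample` text is just the first few lines from the question; the four groups correspond to the section letter, the identifier, the label, and an optional trailing bracketed annotation.

```python
import re

# Same pattern as in the answer above.
pat = r"^([A-D]) *(\S+)\s*(.+?)\s*(\[.+\])?(?=$)"

sample = "\n".join([
    "A09100 Metabolism",
    "B",
    "B 09101 Carbohydrate metabolism",
    "C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]",
    "D K00844 HK; hexokinase [EC:2.7.1.1]",
])

rows = re.findall(pat, sample, flags=re.MULTILINE)
for row in rows:
    print(row)
# ('A', '09100', 'Metabolism', '')
# ('B', '09101', 'Carbohydrate metabolism', '')
# ('C', '00010', 'Glycolysis / Gluconeogenesis', '[PATH:ko00010]')
# ('D', 'K00844', 'HK; hexokinase', '[EC:2.7.1.1]')
```

The bare "B" line produces no row because the pattern requires an identifier after the section letter, and an absent annotation shows up as an empty string.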
import pandas as pd
tmp = pd.DataFrame(data, columns=["section", "name", "attribute", "path"])
mA = tmp["section"].eq("A")
mD = tmp["section"].eq("D")
df = tmp.assign(entry=tmp["attribute"].where(mA).ffill()).loc[~mA]
parents = (
    df["entry"].str.cat(
        df["attribute"]
        .groupby(mD.ne(mD.shift()).cumsum(), sort=False)
        .transform(", ".join)
        .where(~mD)
        .ffill(),
        sep=", ",
    )
    .rename("parents")
)
edges = df[["name"]].join(parents).loc[mD, ["parents", "name"]]
out = (
    df.join(parents)
    .loc[mD]
    .assign(metadata_KEGG_Pathways=lambda x: x["attribute"].str.cat(x["path"], sep=" "))
    .groupby("name", sort=False, as_index=False)
    .agg(
        metadata_KEGG_Description=("metadata_KEGG_Pathways", "first"),
        metadata_KEGG_Pathways=("parents", "|".join),
    )
)
# out.to_csv("file.tsv", sep="\t", index=False) # uncomment to make a `.tsv`
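Output (table format):
name | metadata_KEGG_Description | metadata_KEGG_Pathways |
---|---|---|
K00844 | HK; hexokinase [EC:2.7.1.1] | Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis |
K12407 | GCK; glucokinase [EC:2.7.1.2] | Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis |
K00001 | E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1] | Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis|Metabolism, Lipid metabolism, Fatty acid degradation |
K03043 | rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6] | Genetic Information Processing, Transcription, RNA polymerase |
K13797 | rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6] | Genetic Information Processing, Transcription, RNA polymerase |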
Graph visualization with networkx:
import networkx as nx
from itertools import chain, pairwise
G = nx.from_edgelist(
    chain.from_iterable(
        [pairwise(vals) for vals in edges.agg(
            ",".join, axis=1).str.split(",").to_numpy()]),
    create_using=nx.DiGraph
)
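The edge list above comes from pairing each node with the one that follows it along a hierarchy path. A minimal standard-library illustration of that pairing, using a hypothetical helper `path_edges` and `zip` in place of `itertools.pairwise` (which needs Python 3.10 or later), might look like this:

```python
def path_edges(path, leaf, sep=", "):
    """Return (parent, child) pairs along one hierarchy path ending at `leaf`."""
    nodes = path.split(sep) + [leaf]
    # zip(nodes, nodes[1:]) pairs each node with its successor,
    # which is what itertools.pairwise does.
    return list(zip(nodes, nodes[1:]))


demo_edges = path_edges(
    "Metabolism, Lipid metabolism, Fatty acid degradation", "K00001"
)
print(demo_edges)
# [('Metabolism', 'Lipid metabolism'),
#  ('Lipid metabolism', 'Fatty acid degradation'),
#  ('Fatty acid degradation', 'K00001')]
```

One caveat with any approach that rebuilds nodes by splitting a joined string: a label that itself contains the separator would be split into several nodes, so keeping the levels in a list may be safer for such data.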
Comments
K03043
Genetic Information Processing, Transcription, RNA polymerase|Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
metadata_KEGG_Pathways
.txt