提问人:Shafaque Zahra 提问时间:10/19/2023 最后编辑:markp-fusoShafaque Zahra 更新时间:10/19/2023 访问量:109
根据公共列合并多个行不相等的文件,形成计数矩阵文件
Merging Multiple files with unequal rows based on the common column to form a count matrix file
问:
我有与 https://superuser.com/questions/1245094/merging-multiple-files-based-on-the-common-column 类似的合并多个文件的问题。我非常接近解决方案,但我是 python 的新手。我需要帮助调整代码以连接多个文件。 单个文件的 ID 和列如下所示:
文件1.txt
id SRR1071717
chr1:15039:-::chr1:15795:- 2
chr1:15948:-::chr1:16606:- 6
文件2.txt
id SRR1079830
chr1:11672:+::chr1:12009:+ 10
chr1:11845:+::chr1:12009:+ 7
chrY:9756574:+::chrY:9757796:+ 0
我想要的输出
id SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0
我的代码:Matrix.py
import sys
columns = []
data = {}
ids = set()
for filename in sys.argv[1:]:
with open(filename, 'rU') as f:
key = next(f).strip().split()[1]
columns.append(key)
data[key] = {}
for line in f:
if line.strip():
id, value = line.strip().split()
try:
data[key][int(id)] = value
except ValueError as exc:
raise ValueError(
"Problem in line: '{}' '{}' '{}'".format(
id, value, line.rstrip()))
ids.add(int(id))
print('\t'.join(['ID'] + columns))
for id in sorted(ids):
line = []
for column in columns:
line.append(data[column].get(id, '0'))
print('\t'.join([str(id)] + line))
我运行了如图所示的 python 代码,但它无法正常工作(不熟悉 python)。电流输出(只有两行!
python3 matrix.py File\*.txt
电流输出
id SRR1071717 SRR1079830
chrY:9756574:+::chrY:9757796:+ 0 0
答:
1赞
Ömer Sezer
10/19/2023
#1
Pandas Dataframe 的另一种方式:
import sys
import glob
import pandas as pd
file_pattern = sys.argv[1]
file_list = glob.glob(file_pattern)
merged_df = None
for filename in file_list:
column = filename.split('.')[0]
df = pd.read_csv(filename, delim_whitespace=True, header=None, names=['id', column])
df.set_index('id', inplace=True)
if merged_df is None:
merged_df = df
else:
merged_df = merged_df.join(df, how='outer')
merged_df.fillna(0, inplace=True)
merged_df = merged_df[~merged_df.index.str.startswith('id')]
print(merged_df.to_string(na_rep='0'))
跑:
python matrix.py "File*.txt"
输出:
编辑:
如果文件间距为 tab (\t),请使用以下命令:
df = pd.read_csv(filename, sep='\t', header=None, names=['id', column])
而不是这一行:
df = pd.read_csv(filename, delim_whitespace=True, header=None, names=['id', column])
我检查了一下,使用制表符间距,它也可以使用。
评论
0赞
Shafaque Zahra
10/19/2023
它不起作用,我想因为我的熟食是标签。
0赞
Ömer Sezer
10/19/2023
@ShafaqueZahra如果您的文件中有 \t(制表符)间距,请更新pd.read_csv函数参数。我在帖子(编辑部分)中提到了这一点。
0赞
Shafaque Zahra
10/19/2023
是的。谢谢奥马尔。更新!它的工作!
3赞
markp-fuso
10/19/2023
#2
既然你已经标记了之前(和类似的)问题,我将假设一个解决方案是可以接受的......awk
awk
为两个输入文件添加一个公共 ():ID
chr8:77777:-
$ head File*.txt
==> File1.txt <==
id SRR1071717
chr1:15039:-::chr1:15795:- 2
chr1:15948:-::chr1:16606:- 6
chr8:77777:-::chr1:16606:- 6
==> File2.txt <==
id SRR1079830
chr1:11672:+::chr1:12009:+ 10
chr1:11845:+::chr1:12009:+ 7
chrY:9756574:+::chrY:9757796:+ 0
chr8:77777:-::chr1:16606:- 17
一个想法:awk
awk '
BEGIN { hdr = "id" }
FNR==1 { hdr = hdr OFS $2
fcnt++ # keep track of number of files; will serve as index of 2nd dimension of array
next }
{ values[$1][fcnt] = $2 } # populate 2-dimensional array
END { print hdr
for (id in values) { # loop through id values
printf "%s%s", id, OFS
for (i=1; i<=fcnt; i++) # loop through 2nd dimension of array
printf "%s%s", (i in values[id] ? values[id][i] : 0), (i<fcnt ? OFS : ORS)
}
}
' File*.txt
笔记:
- 需要支持多维数组
GNU awk
(values[id][i] ? values[id][i] : 0)
- 如果数组条目已填充,则打印它,否则打印默认值0
(i<fcnt ? OFS : ORS)
- 打印输出字段分隔符 (),但最后一次通过循环 () 除外,在这种情况下,打印输出记录分隔符 (OFS
i==fcnt
ORS
)
这将产生:
id SRR1071717 SRR1079830
chrY:9756574:+::chrY:9757796:+ 0 0
chr1:11845:+::chr1:12009:+ 0 7
chr8:77777:-::chr1:16606:- 6 17
chr1:11672:+::chr1:12009:+ 0 10
chr1:15948:-::chr1:16606:- 6 0
chr1:15039:-::chr1:15795:- 2 0
添加数组,以便我们可以按照读取 ID 的相同顺序生成输出......idorder[]
awk '
BEGIN { hdr = "id" }
FNR==1 { hdr = hdr OFS $2
fcnt++
next }
{ if (! ($1 in values))
idorder[++idcnt] = $1
values[$1][fcnt] = $2
}
END { print hdr
for (i=1; i<=idcnt; i++) {
id = idorder[i]
printf "%s%s", id, OFS
for (j=1; j<=fcnt; j++)
printf "%s%s", (j in values[id] ? values[id][j] : 0), (j<fcnt ? OFS : ORS)
}
}
' File*.txt
这将产生:
id SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr8:77777:-::chr1:16606:- 6 17
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0
评论
0赞
Shafaque Zahra
10/19/2023
非常感谢您的帮助。非常感谢。它工作得很完美!!!!!!!!!!!!!!!
1赞
Ed Morton
10/19/2023
#3
使用任何 awk:
$ cat tst.awk
FNR == 1 { ++numCols }
{
if ( !($1 in ids2rows) ) {
rows2ids[++numRows] = $1
ids2rows[$1] = numRows
}
rowNr = ids2rows[$1]
vals[rowNr,numCols] = $2
}
END {
for ( rowNr=1; rowNr<=numRows; rowNr++ ) {
id = rows2ids[rowNr]
printf "%s", id
for ( colNr=1; colNr<=numCols; colNr++ ) {
val = ( (rowNr,colNr) in vals ? vals[rowNr,colNr] : 0 )
printf "%s%s", OFS, val
}
print ""
}
}
$ awk -f tst.awk File1.txt File2.txt
id SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0
评论
1赞
Shafaque Zahra
10/19/2023
非常感谢莫顿。此脚本功能齐全。我为我的问题准备了两个写得很好的脚本。超级开心。
0赞
Freeman
10/19/2023
#4
还要在 Python 中检查一下:
#reading the contents of File1.txt and File2.txt
with open('File1.txt', 'r') as file1, open('File2.txt', 'r') as file2:
lines1 = file1.readlines()
lines2 = file2.readlines()
#extract the IDs from the first line of each file
ids1 = lines1[0].split()[1:]
ids2 = lines2[0].split()[1:]
#make a dictionary to store the values for each ID
data = {}
#process the lines of File1.txt
for line in lines1[1:]:
columns = line.split()
data[columns[0]] = [columns[1]] + ['0'] * len(ids2)
#process the lines of File2.txt
for line in lines2[1:]:
columns = line.split()
id = columns[0]
if id in data:
data[id][1] = columns[1]
else:
data[id] = ['0'] * len(ids1) + [columns[1]]
#printing the header
print('id', *ids1, *ids2)
#printing the data
for id, values in data.items():
print(id, *values)
输出
id SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0
评论