根据公共列合并多个行不相等的文件,形成计数矩阵文件

Merging Multiple files with unequal rows based on the common column to form a count matrix file

提问人:Shafaque Zahra 提问时间:10/19/2023 最后编辑:markp-fusoShafaque Zahra 更新时间:10/19/2023 访问量:109

问:

我有与 https://superuser.com/questions/1245094/merging-multiple-files-based-on-the-common-column 类似的合并多个文件的问题。我非常接近解决方案,但我是 python 的新手。我需要帮助调整代码以连接多个文件。 单个文件的 ID 和列如下所示:

文件1.txt

id  SRR1071717
chr1:15039:-::chr1:15795:-  2
chr1:15948:-::chr1:16606:-  6

文件2.txt

id  SRR1079830
chr1:11672:+::chr1:12009:+  10
chr1:11845:+::chr1:12009:+  7
chrY:9756574:+::chrY:9757796:+  0

我想要的输出

id  SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0

我的代码:Matrix.py

import sys

columns = []
data = {}
ids = set()
for filename in sys.argv[1:]:
    with open(filename, 'rU') as f:
        key = next(f).strip().split()[1]
        columns.append(key)
        data[key] = {}
        for line in f:
            if line.strip():
                id, value = line.strip().split()
                try:
                    data[key][int(id)] = value
                except ValueError as exc:
                    raise ValueError(
                        "Problem in line: '{}' '{}' '{}'".format(
                            id, value, line.rstrip()))

                ids.add(int(id))

print('\t'.join(['ID'] + columns))

for id in sorted(ids):
    line = []
    for column in columns:
        line.append(data[column].get(id, '0'))
    print('\t'.join([str(id)] + line))

我运行了如图所示的 python 代码,但它无法正常工作(不熟悉 python)。电流输出(只有两行!

python3 matrix.py File\*.txt

电流输出

id SRR1071717 SRR1079830
chrY:9756574:+::chrY:9757796:+ 0 0
python linux bash awk

评论

0赞 Ed Morton 10/19/2023
编辑您的问题以描述将您从示例输入到预期输出的过程(file1 中的此字段与 file2 中的该字段匹配,某些值 foo 必须在某个范围内等),不要让我们阅读非工作代码来试图弄清楚它应该做什么。
0赞 Diego Torres Milano 10/19/2023
你为什么要标记 bash?
0赞 Ed Morton 10/19/2023
您是否只对 python 解决方案或使用其他 Unix 工具的解决方案感兴趣?
0赞 Shafaque Zahra 10/19/2023
Bash 解决方案总是很酷。我需要的任何解决方案,无论是bash还是python,或两者兼而有之
0赞 Shafaque Zahra 10/19/2023
chyY 在那里,请向右滚动。.我猜一些格式问题

答:

1赞 Ömer Sezer 10/19/2023 #1

Pandas Dataframe 的另一种方式:

import sys
import glob
import pandas as pd

file_pattern = sys.argv[1]
file_list = glob.glob(file_pattern)
merged_df = None

for filename in file_list:
    column = filename.split('.')[0] 
    df = pd.read_csv(filename, delim_whitespace=True, header=None, names=['id', column])
    df.set_index('id', inplace=True)
    if merged_df is None:
        merged_df = df
    else:
        merged_df = merged_df.join(df, how='outer')

merged_df.fillna(0, inplace=True)
merged_df = merged_df[~merged_df.index.str.startswith('id')]

print(merged_df.to_string(na_rep='0'))

跑:

python matrix.py "File*.txt"

输出:

enter image description here

编辑:

如果文件间距为 tab (\t),请使用以下命令:

df = pd.read_csv(filename, sep='\t', header=None, names=['id', column])

而不是这一行:

df = pd.read_csv(filename, delim_whitespace=True, header=None, names=['id', column])

我检查了一下,使用制表符间距,它也可以使用。

评论

0赞 Shafaque Zahra 10/19/2023
它不起作用,我想因为我的熟食是标签。
0赞 Ömer Sezer 10/19/2023
@ShafaqueZahra如果您的文件中有 \t(制表符)间距,请更新pd.read_csv函数参数。我在帖子(编辑部分)中提到了这一点。
0赞 Shafaque Zahra 10/19/2023
是的。谢谢奥马尔。更新!它的工作!
3赞 markp-fuso 10/19/2023 #2

既然你已经标记了之前(和类似的)问题,我将假设一个解决方案是可以接受的......awkawk


为两个输入文件添加一个公共 ():IDchr8:77777:-

$ head File*.txt
==> File1.txt <==
id  SRR1071717
chr1:15039:-::chr1:15795:-  2
chr1:15948:-::chr1:16606:-  6
chr8:77777:-::chr1:16606:-  6

==> File2.txt <==
id  SRR1079830
chr1:11672:+::chr1:12009:+  10
chr1:11845:+::chr1:12009:+  7
chrY:9756574:+::chrY:9757796:+  0
chr8:77777:-::chr1:16606:-  17

一个想法:awk

awk '
BEGIN  { hdr = "id" }
FNR==1 { hdr = hdr OFS $2
         fcnt++                                     # keep track of number of files; will serve as index of 2nd dimension of array 
         next }
       { values[$1][fcnt] = $2 }                    # populate 2-dimensional array

END    { print hdr
         for (id in values) {                       # loop through id values
             printf "%s%s", id, OFS
             for (i=1; i<=fcnt; i++)                # loop through 2nd dimension of array
                 printf "%s%s", (i in values[id] ? values[id][i] : 0), (i<fcnt ? OFS : ORS)
         }
       }
' File*.txt

笔记:

  • 需要支持多维数组GNU awk
  • (values[id][i] ? values[id][i] : 0)- 如果数组条目已填充,则打印它,否则打印默认值0
  • (i<fcnt ? OFS : ORS)- 打印输出字段分隔符 (),但最后一次通过循环 () 除外,在这种情况下,打印输出记录分隔符 (OFSi==fcntORS)

这将产生:

id SRR1071717 SRR1079830
chrY:9756574:+::chrY:9757796:+ 0 0
chr1:11845:+::chr1:12009:+ 0 7
chr8:77777:-::chr1:16606:- 6 17
chr1:11672:+::chr1:12009:+ 0 10
chr1:15948:-::chr1:16606:- 6 0
chr1:15039:-::chr1:15795:- 2 0

添加数组,以便我们可以按照读取 ID 的相同顺序生成输出......idorder[]

awk '
BEGIN  { hdr = "id" }
FNR==1 { hdr = hdr OFS $2
         fcnt++
         next }
       { if (! ($1 in values))
            idorder[++idcnt] = $1
         values[$1][fcnt] = $2
       }

END    { print hdr
         for (i=1; i<=idcnt; i++) {
             id = idorder[i]
             printf "%s%s", id, OFS
             for (j=1; j<=fcnt; j++)
                 printf "%s%s", (j in values[id] ? values[id][j] : 0), (j<fcnt ? OFS : ORS)
         }
       }
' File*.txt

这将产生:

id SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr8:77777:-::chr1:16606:- 6 17
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0

评论

0赞 Shafaque Zahra 10/19/2023
非常感谢您的帮助。非常感谢。它工作得很完美!!!!!!!!!!!!!!!
1赞 Ed Morton 10/19/2023 #3

使用任何 awk:

$ cat tst.awk
FNR == 1 { ++numCols }
{
    if ( !($1 in ids2rows) ) {
        rows2ids[++numRows] = $1
        ids2rows[$1] = numRows
    }

    rowNr = ids2rows[$1]
    vals[rowNr,numCols] = $2
}
END {
    for ( rowNr=1; rowNr<=numRows; rowNr++ ) {
        id = rows2ids[rowNr]
        printf "%s", id
        for ( colNr=1; colNr<=numCols; colNr++ ) {
            val = ( (rowNr,colNr) in vals ? vals[rowNr,colNr] : 0 )
            printf "%s%s", OFS, val
        }
        print ""
    }
}

$ awk -f tst.awk File1.txt File2.txt
id SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0

评论

1赞 Shafaque Zahra 10/19/2023
非常感谢莫顿。此脚本功能齐全。我为我的问题准备了两个写得很好的脚本。超级开心。
0赞 Freeman 10/19/2023 #4

还要在 Python 中检查一下:

#reading the contents of File1.txt and File2.txt
with open('File1.txt', 'r') as file1, open('File2.txt', 'r') as file2:
    lines1 = file1.readlines()
    lines2 = file2.readlines()

#extract the IDs from the first line of each file
ids1 = lines1[0].split()[1:]
ids2 = lines2[0].split()[1:]

#make a dictionary to store the values for each ID
data = {}

#process the lines of File1.txt
for line in lines1[1:]:
    columns = line.split()
    data[columns[0]] = [columns[1]] + ['0'] * len(ids2)

#process the lines of File2.txt
for line in lines2[1:]:
    columns = line.split()
    id = columns[0]
    if id in data:
        data[id][1] = columns[1]
    else:
        data[id] = ['0'] * len(ids1) + [columns[1]]

#printing the header
print('id', *ids1, *ids2)

#printing the data
for id, values in data.items():
    print(id, *values)

输出

id  SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0