Merging multiple files with unequal rows based on a common column to form a count matrix file

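Question:

I have a problem merging multiple files that is similar to https://superuser.com/questions/1245094/merging-multiple-files-based-on-the-common-column. I am very close to a solution, but I am new to Python and need help adjusting the code to join multiple files. The IDs and columns of the individual files look like this: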

File1.txt

id  SRR1071717
chr1:15039:-::chr1:15795:-  2
chr1:15948:-::chr1:16606:-  6

File2.txt

id  SRR1079830
chr1:11672:+::chr1:12009:+  10
chr1:11845:+::chr1:12009:+  7
chrY:9756574:+::chrY:9757796:+  0

Desired output

id  SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0

My code: Matrix.py

import sys

columns = []
data = {}
ids = set()
for filename in sys.argv[1:]:
    with open(filename, 'rU') as f:
        key = next(f).strip().split()[1]
        columns.append(key)
        data[key] = {}
        for line in f:
            if line.strip():
                id, value = line.strip().split()
                try:
                    data[key][int(id)] = value
                except ValueError as exc:
                    raise ValueError(
                        "Problem in line: '{}' '{}' '{}'".format(
                            id, value, line.rstrip()))

                ids.add(int(id))

print('\t'.join(['ID'] + columns))

for id in sorted(ids):
    line = []
    for column in columns:
        line.append(data[column].get(id, '0'))
    print('\t'.join([str(id)] + line))

I ran the Python code as shown below, but it does not work properly (I am not familiar with Python); it prints only two lines:

python3 matrix.py File*.txt

Current output

id SRR1071717 SRR1079830
chrY:9756574:+::chrY:9757796:+ 0 0

python linux bash awk

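import sys
import glob
import pandas as pd

# Quote the glob pattern on the command line so that Python,
# not the shell, expands it.
file_pattern = sys.argv[1]
file_list = sorted(glob.glob(file_pattern))  # sorted for a stable column order
merged_df = None

for filename in file_list:
    # Name the count column after the file, e.g. "File1" for "File1.txt".
    column = filename.split('.')[0]
    # header=None reads the "id <sample>" header as a data row; it is dropped below.
    # Newer pandas versions deprecate delim_whitespace in favour of sep=r'\s+'.
    df = pd.read_csv(filename, delim_whitespace=True, header=None, names=['id', column])
    df.set_index('id', inplace=True)
    if merged_df is None:
        merged_df = df
    else:
        # An outer join keeps ids that appear in only some of the files.
        merged_df = merged_df.join(df, how='outer')

# Ids missing from a file become NaN after the join; count them as 0.
merged_df.fillna(0, inplace=True)
# Drop the header row that was read in as data.
merged_df = merged_df[~merged_df.index.str.startswith('id')]

print(merged_df.to_string(na_rep='0'))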

Run:

python matrix.py "File*.txt"

Output:

Edit:

If the files are tab-separated (\t), use the following line:

df = pd.read_csv(filename, sep='\t', header=None, names=['id', column])

instead of this line:

df = pd.read_csv(filename, delim_whitespace=True, header=None, names=['id', column])

I checked, and it also works with tab-separated files.
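
A variation on the pandas approach above, offered as a sketch rather than a drop-in replacement: it takes each column name from the file's header row (the sample name, e.g. SRR1071717) instead of from the filename, which is what the desired output in the question shows. The function name `merge_counts` is illustrative, and the code assumes whitespace-separated two-column files, ids that are unique within each file, and integer counts.

```python
import pandas as pd


def merge_counts(paths):
    """Outer-join two-column count files on their id column.

    Assumptions (not from the original answer): each file starts with an
    "id <sample>" header row, ids are unique within a file, and counts
    are integers.
    """
    frames = []
    for path in paths:
        # header=0 consumes the "id <sample>" row, so the count column is
        # named after the sample and parsed as numbers rather than strings.
        frames.append(pd.read_csv(path, sep=r"\s+", header=0, index_col=0))
    # axis=1 places the files side by side and aligns rows on the id index;
    # ids missing from a file show up as NaN.
    merged = pd.concat(frames, axis=1)
    # Treat missing ids as a count of 0 and restore integer dtype.
    return merged.fillna(0).astype(int)
```

It could be called as `print(merge_counts(sorted(glob.glob("File*.txt"))).to_string())` after `import glob`. Because the header row is consumed as a header, there is no need to filter out an 'id' row afterwards.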

awk '
BEGIN  { hdr = "id" }
FNR==1 { hdr = hdr OFS $2
         fcnt++                                     # keep track of number of files; will serve as index of 2nd dimension of array 
         next }
       { values[$1][fcnt] = $2 }                    # populate 2-dimensional array

END    { print hdr
         for (id in values) {                       # loop through id values
             printf "%s%s", id, OFS
             for (i=1; i<=fcnt; i++)                # loop through 2nd dimension of array
                 printf "%s%s", (i in values[id] ? values[id][i] : 0), (i<fcnt ? OFS : ORS)
         }
       }
' File*.txt

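Notes:

Requires GNU awk for multidimensional array (array of arrays) support
(i in values[id] ? values[id][i] : 0) - if the array entry has been populated, print it, otherwise print the default value 0
(i<fcnt ? OFS : ORS) - print the output field separator (OFS), except on the last pass through the loop (i==fcnt), in which case print the output record separator (ORS)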

This generates:

id SRR1071717 SRR1079830
chrY:9756574:+::chrY:9757796:+ 0 0
chr1:11845:+::chr1:12009:+ 0 7
chr8:77777:-::chr1:16606:- 6 17
chr1:11672:+::chr1:12009:+ 0 10
chr1:15948:-::chr1:16606:- 6 0
chr1:15039:-::chr1:15795:- 2 0

Adding an array, idorder[], so that we can generate the output in the same order in which the ids were read:

awk '
BEGIN  { hdr = "id" }
FNR==1 { hdr = hdr OFS $2
         fcnt++
         next }
       { if (! ($1 in values))
            idorder[++idcnt] = $1
         values[$1][fcnt] = $2
       }

END    { print hdr
         for (i=1; i<=idcnt; i++) {
             id = idorder[i]
             printf "%s%s", id, OFS
             for (j=1; j<=fcnt; j++)
                 printf "%s%s", (j in values[id] ? values[id][j] : 0), (j<fcnt ? OFS : ORS)
         }
       }
' File*.txt

This generates:

id SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr8:77777:-::chr1:16606:- 6 17
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0

$ cat tst.awk
FNR == 1 { ++numCols }
{
    if ( !($1 in ids2rows) ) {
        rows2ids[++numRows] = $1
        ids2rows[$1] = numRows
    }

    rowNr = ids2rows[$1]
    vals[rowNr,numCols] = $2
}
END {
    for ( rowNr=1; rowNr<=numRows; rowNr++ ) {
        id = rows2ids[rowNr]
        printf "%s", id
        for ( colNr=1; colNr<=numCols; colNr++ ) {
            val = ( (rowNr,colNr) in vals ? vals[rowNr,colNr] : 0 )
            printf "%s%s", OFS, val
        }
        print ""
    }
}

$ awk -f tst.awk File1.txt File2.txt
id SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0

#reading the contents of File1.txt and File2.txt
with open('File1.txt', 'r') as file1, open('File2.txt', 'r') as file2:
    lines1 = file1.readlines()
    lines2 = file2.readlines()

#extract the IDs from the first line of each file
ids1 = lines1[0].split()[1:]
ids2 = lines2[0].split()[1:]

#make a dictionary to store the values for each ID
data = {}

#process the lines of File1.txt
for line in lines1[1:]:
    columns = line.split()
    data[columns[0]] = [columns[1]] + ['0'] * len(ids2)

#process the lines of File2.txt
for line in lines2[1:]:
    columns = line.split()
    id = columns[0]
    if id in data:
        data[id][1] = columns[1]
    else:
        data[id] = ['0'] * len(ids1) + [columns[1]]

#printing the header
print('id', *ids1, *ids2)

#printing the data
for id, values in data.items():
    print(id, *values)

Output

id  SRR1071717 SRR1079830
chr1:15039:-::chr1:15795:- 2 0
chr1:15948:-::chr1:16606:- 6 0
chr1:11672:+::chr1:12009:+ 0 10
chr1:11845:+::chr1:12009:+ 0 7
chrY:9756574:+::chrY:9757796:+ 0 0
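
The two-file script above can be generalized to any number of files. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function name `build_matrix` is hypothetical, each file is assumed to be non-empty with an "id <sample>" header row, and sample names are assumed to differ between files. It keeps ids as strings and relies on Python dicts preserving insertion order (Python 3.7+), so rows come out in first-seen order.

```python
def build_matrix(paths):
    """Return the count matrix as a list of rows (lists of strings)."""
    samples = []  # column names, in the order the files are given
    counts = {}   # id -> {sample: count}; insertion order is first-seen order
    for path in paths:
        with open(path) as handle:
            # The header row is "id <sample>"; keep only the sample name.
            sample = next(handle).split()[1]
            samples.append(sample)
            for line in handle:
                fields = line.split()
                if len(fields) != 2:
                    continue  # skip blank or malformed lines
                row_id, value = fields
                counts.setdefault(row_id, {})[sample] = value
    rows = [["id"] + samples]
    for row_id, by_sample in counts.items():
        # An id absent from a file gets the default count "0".
        rows.append([row_id] + [by_sample.get(s, "0") for s in samples])
    return rows
```

With `import sys`, the rows could be printed with `for row in build_matrix(sys.argv[1:]): print(*row)`, passing the filenames as command-line arguments.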
