How can I match rows of 2 files with specific rules using Pandas

Question:

There are 2 .csv files. I need to take each row from file A and go through every row of file B to find the best match. Here is a snippet of how it works now:

for row2 in file2:
        final_k = 0
        row3 = []
        for row1 in file1:
            m1 = m2 = m3 = m4 = m5 = 0
            if row2[3] and (row1[2] and (
                    row2[3].upper().startswith(row1[2].upper()) or row1[2].upper().startswith(row2[3].upper()))):
                m1 = 100
            if row2[3] and row1[2].upper() == row2[3].upper():
                m1 = 110
            if row2[4] and (row1[4] and row2[4].upper().startswith(row1[4].upper())):
                m2 = 30
            if row2[4] and row1[4].upper() == row2[4].upper():
                m2 = 35
            if row2[5] and (
                    row1[3].upper() == row2[5].upper() or (row1[3] and row2[5].upper().startswith(row1[3].upper()))):
                m3 = 20
            if row2[5] and row1[3].upper() == row2[5].upper():
                m3 = 25
            if row2[6] and row1[6] == row2[6]:
                m4 = 500
            if row2[7] and row1[7] == row2[7]:
                m5 = 100
            k = m1 + m2 + m3 + m4 + m5
            if k > final_k:
                final_k = k
                row3.clear()
                row3.extend(row2)
                row3.append(row1[0])
                row3.append(row1[2])
                row3.append(row1[4])
                row3.append(row1[3])
                row3.append(row1[6])
                row3.append(row1[7])
                row3.append(m1)
                row3.append(m2)
                row3.append(m3)
                row3.append(m4)
                row3.append(m5)
        if final_k == 0:
            row2.append('Unmatched')
            lines.append(row2)
        else:
            lines.append(row3)
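
The scoring rules in the loop above can be pulled out into one small function, which makes them easier to read and gives a reference to check any faster version against. This is only a sketch: the names prefix_score and score_pair are hypothetical, the column positions are copied from the snippet, the title/artist/album labels are inferred from the SQL further down, and missing values are assumed to be empty strings. Note that the loop tests the title prefix in both directions but the artist and album prefixes in one direction only, and the sketch keeps that behaviour.

```python
def prefix_score(a, b, exact, prefix, both_directions=False):
    """Case-insensitive score for two strings: exact match, prefix match, or 0."""
    a, b = a.upper(), b.upper()
    if not a or not b:
        return 0  # an empty field never counts as a match
    if a == b:
        return exact
    if b.startswith(a) or (both_directions and a.startswith(b)):
        return prefix
    return 0


def score_pair(row1, row2):
    """Return (m1, m2, m3, m4, m5) for one file1 row and one file2 row."""
    m1 = prefix_score(row1[2], row2[3], 110, 100, both_directions=True)  # title
    m2 = prefix_score(row1[4], row2[4], 35, 30)  # artist
    m3 = prefix_score(row1[3], row2[5], 25, 20)  # album
    m4 = 500 if row2[6] and row1[6] == row2[6] else 0  # ISRC
    m5 = 100 if row2[7] and row1[7] == row2[7] else 0  # UPC
    return m1, m2, m3, m4, m5


# Made-up rows, only to show the call.
row1 = ["1", "", "Hello", "Album", "Artist", "", "ISRC1", "111"]
row2 = ["9", "", "", "hello world", "ARTIST", "Album X", "ISRC1", ""]
print(score_pair(row1, row2))  # (100, 35, 20, 500, 0)
```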

I need to rewrite this logic using Pandas, because looping over a large amount of data takes a long time. I don't know much about Python, NumPy, or Pandas.

I considered using apply or iterrows, but neither gave a significant improvement over the loops. I know that the merge, groupby, and map functions could be used.

I have written an SQL script to show more clearly what I need. But what I need is a Python script with pandas (or perhaps some faster way).

select b_id, iff(total_score = 0, 'Unmatched', sc_id::text) as matched, total_score
from (with b as (select id            as b_id,
                        upper(title)  as b_title,
                        upper(artist) as b_artist,
                        upper(album)  as b_album,
                        isrc          as b_isrc,
                        upc           as b_upc,
                        covered_rows
                 from blocklist
                 where company_id = 18),
           sc as (select id                  as sc_id,
                         upper(track_title)  as sc_title,
                         upper(track_artist) as sc_artist,
                         upper(album_title)  as sc_album,
                         isrc                as sc_isrc,
                         upc                 as sc_upc
                  from source_catalog
                  where company_id = 18
                    and ownership != 'No')
      select *,
-- track
             case
                 when b.b_title = sc.sc_title and b.b_title != '' then 110
                 when (startswith(b.b_title, sc.sc_title) or startswith(sc.sc_title, b.b_title)) and b.b_title != '' and
                      sc.sc_title != ''
                     then 100
                 else 0 end                                                            as titles_score,
-- artist
             case
                 when b.b_artist = sc.sc_artist and b.b_artist != '' then 35
                 when (startswith(b.b_artist, sc.sc_artist) or startswith(sc.sc_artist, b.b_artist)) and
                      b.b_artist != '' and
                      sc.sc_artist != '' then 30
                 else 0 end                                                            as artists_score,
-- album
             case
                 when b.b_album = sc.sc_album and b.b_album != '' then 25
                 when (startswith(b.b_album, sc.sc_album) or startswith(sc.sc_album, b.b_album)) and b.b_album != '' and
                      sc.sc_album != ''
                     then 20
                 else 0 end                                                            as albums_score,
-- isrc
             IFF(b.b_isrc = sc.sc_isrc and b.b_isrc != '', 500, 0)                     as isrcs_score,
             IFF(b.b_upc = sc.sc_upc and b.b_upc is not null and b.b_upc != 0, 100, 0) as upcs_score,
             titles_score + artists_score + albums_score + isrcs_score + upcs_score    as total_score
      from sc
               cross join b)
qualify ROW_NUMBER() OVER (PARTITION BY b_id
    ORDER BY b_id, total_score desc) = 1;
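
For comparison, the SQL above could be translated to pandas roughly as follows. This is a sketch under assumptions, not a tested solution for the real data: the function names text_score, prepare, and best_matches are hypothetical, both frames are assumed to be read with dtype=str and to already have the columns id, title, artist, album, isrc, upc (so the source_catalog columns would need renaming first), and the company_id and ownership filters are assumed to be applied beforehand. It relies on merge(how="cross"), which needs pandas 1.2 or later.

```python
import numpy as np
import pandas as pd

TEXT_COLS = ["title", "artist", "album"]


def text_score(a, b, exact, prefix):
    """Score two aligned Series of upper-cased strings: exact, prefix, or 0."""
    both = (a != "") & (b != "")
    # Series.str.startswith takes one fixed pattern, so pairs are tested in a comprehension.
    starts = np.array(
        [x.startswith(y) or y.startswith(x) for x, y in zip(a, b)], dtype=bool
    )
    return np.select([both & (a == b), both & starts], [exact, prefix], default=0)


def prepare(df, prefix):
    """Normalise one frame once, before the cross join multiplies the rows."""
    out = df[["id"] + TEXT_COLS + ["isrc", "upc"]].copy()
    for col in TEXT_COLS:
        out[col] = out[col].fillna("").astype(str).str.upper()
    for col in ["isrc", "upc"]:
        out[col] = out[col].fillna("").astype(str)
    return out.add_prefix(prefix)


def best_matches(b, sc):
    """Cross join blocklist and catalog rows, score each pair, keep the best per b_id."""
    pairs = prepare(b, "b_").merge(prepare(sc, "sc_"), how="cross")
    pairs["titles_score"] = text_score(pairs["b_title"], pairs["sc_title"], 110, 100)
    pairs["artists_score"] = text_score(pairs["b_artist"], pairs["sc_artist"], 35, 30)
    pairs["albums_score"] = text_score(pairs["b_album"], pairs["sc_album"], 25, 20)
    same_isrc = (pairs["b_isrc"] != "") & (pairs["b_isrc"] == pairs["sc_isrc"])
    pairs["isrcs_score"] = np.where(same_isrc, 500, 0)
    # Mirrors "b_upc is not null and b_upc != 0", with UPCs held as strings.
    same_upc = ~pairs["b_upc"].isin(["", "0"]) & (pairs["b_upc"] == pairs["sc_upc"])
    pairs["upcs_score"] = np.where(same_upc, 100, 0)
    score_cols = [c for c in pairs.columns if c.endswith("_score")]
    pairs["total_score"] = pairs[score_cols].sum(axis=1)
    # idxmax returns the first row with the highest score inside each b_id group.
    best = pairs.loc[pairs.groupby("b_id")["total_score"].idxmax()].copy()
    best["matched"] = np.where(
        best["total_score"] == 0, "Unmatched", best["sc_id"].astype(str)
    )
    return best[["b_id", "matched", "total_score"]].reset_index(drop=True)
```

It could be called as best_matches(pd.read_csv("blocklist.csv", dtype=str), pd.read_csv("catalog.csv", dtype=str)), where both file names are placeholders. The cross join holds len(b) * len(sc) rows in memory at once, so for large files it would probably have to be run on the blocklist in chunks.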

Could you tell me which algorithms would be most useful for solving this problem?
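
One direction that might reduce the work, offered as a sketch rather than a definitive answer: an ISRC match is worth 500, while all the other scores together can reach at most 110 + 35 + 25 + 100 = 270. So if any catalog row shares a blocklist row's ISRC, the best match for that row must be one of those catalog rows, and they can be found with an ordinary equality join instead of comparing against every row. The function name split_by_isrc and the column names id and isrc are assumptions for illustration.

```python
import pandas as pd


def split_by_isrc(b, sc):
    """Return (candidate pairs joined on ISRC, blocklist rows still needing a full comparison)."""
    has_isrc = b["isrc"].fillna("") != ""
    # Equality join: only pairs that share a non-empty ISRC are produced.
    candidates = b[has_isrc].merge(sc, on="isrc", suffixes=("_b", "_sc"))
    # Rows without any ISRC partner fall back to the slower all-pairs scoring.
    rest = b[~b["id"].isin(candidates["id_b"])]
    return candidates, rest
```

The candidate pairs would still need to be scored on title, artist, album, and UPC to break ties between catalog rows with the same ISRC, but that set is usually far smaller than the full cross product.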

python pandas numpy csv matching
