How can I match rows of two files according to specific rules with Pandas?

Asked by: Dmitry Shulga Asked: 10/24/2023 Last edited by: Dmitry Shulga Updated: 10/24/2023 Views: 75

Q:

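There are two .csv files. I need to take each row from file A and then go through every row of file B, looking for the best match. Here is a code snippet showing how it works now: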

for row2 in file2:
        final_k = 0
        row3 = []
        for row1 in file1:
            m1 = m2 = m3 = m4 = m5 = 0
            if row2[3] and (row1[2] and (
                    row2[3].upper().startswith(row1[2].upper()) or row1[2].upper().startswith(row2[3].upper()))):
                m1 = 100
            if row2[3] and row1[2].upper() == row2[3].upper():
                m1 = 110
            if row2[4] and (row1[4] and row2[4].upper().startswith(row1[4].upper())):
                m2 = 30
            if row2[4] and row1[4].upper() == row2[4].upper():
                m2 = 35
            if row2[5] and (
                    row1[3].upper() == row2[5].upper() or (row1[3] and row2[5].upper().startswith(row1[3].upper()))):
                m3 = 20
            if row2[5] and row1[3].upper() == row2[5].upper():
                m3 = 25
            if row2[6] and row1[6] == row2[6]:
                m4 = 500
            if row2[7] and row1[7] == row2[7]:
                m5 = 100
            k = m1 + m2 + m3 + m4 + m5
            if k > final_k:
                final_k = k
                row3.clear()
                row3.extend(row2)
                row3.append(row1[0])
                row3.append(row1[2])
                row3.append(row1[4])
                row3.append(row1[3])
                row3.append(row1[6])
                row3.append(row1[7])
                row3.append(m1)
                row3.append(m2)
                row3.append(m3)
                row3.append(m4)
                row3.append(m5)
        if final_k == 0:
            row2.append('Unmatched')
            lines.append(row2)
        else:
            lines.append(row3)

I need to rewrite this logic using Pandas, because the nested loops take a long time on large amounts of data. I don't know much about Python, NumPy, or Pandas.

I considered using apply or iterrows, but neither gave a significant improvement over plain loops. I know that the merge, groupby, and map functions are also available.

I have written an SQL script to make it clearer what I need. However, I need a Python script that uses pandas (or perhaps some faster approach).

select b_id, iff(total_score = 0, 'Unmatched', sc_id::text) as matched, total_score
from (with b as (select id            as b_id,
                        upper(title)  as b_title,
                        upper(artist) as b_artist,
                        upper(album)  as b_album,
                        isrc          as b_isrc,
                        upc           as b_upc,
                        covered_rows
                 from blocklist
                 where company_id = 18),
           sc as (select id                  as sc_id,
                         upper(track_title)  as sc_title,
                         upper(track_artist) as sc_artist,
                         upper(album_title)  as sc_album,
                         isrc                as sc_isrc,
                         upc                 as sc_upc
                  from source_catalog
                  where company_id = 18
                    and ownership != 'No')
      select *,
-- track
             case
                 when b.b_title = sc.sc_title and b.b_title != '' then 110
                 when (startswith(b.b_title, sc.sc_title) or startswith(sc.sc_title, b.b_title)) and b.b_title != '' and
                      sc.sc_title != ''
                     then 100
                 else 0 end                                                            as titles_score,
-- artist
             case
                 when b.b_artist = sc.sc_artist and b.b_artist != '' then 35
                 when (startswith(b.b_artist, sc.sc_artist) or startswith(sc.sc_artist, b.b_artist)) and
                      b.b_artist != '' and
                      sc.sc_artist != '' then 30
                 else 0 end                                                            as artists_score,
-- album
             case
                 when b.b_album = sc.sc_album and b.b_album != '' then 25
                 when (startswith(b.b_album, sc.sc_album) or startswith(sc.sc_album, b.b_album)) and b.b_album != '' and
                      sc.sc_album != ''
                     then 20
                 else 0 end                                                            as albums_score,
-- isrc
             IFF(b.b_isrc = sc.sc_isrc and b.b_isrc != '', 500, 0)                     as isrcs_score,
             IFF(b.b_upc = sc.sc_upc and b.b_upc is not null and b.b_upc != 0, 100, 0) as upcs_score,
             titles_score + artists_score + albums_score + isrcs_score + upcs_score    as total_score
      from sc
               cross join b)
qualify ROW_NUMBER() OVER (PARTITION BY b_id
    ORDER BY b_id, total_score desc) = 1;
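For comparison, the SQL above could be mirrored in pandas roughly as follows. This is only a sketch under assumptions: both frames are assumed to have string columns named id, title, artist, album, isrc, and upc (hypothetical names; the real CSV headers are not shown), missing values are assumed to be empty strings, and the sample rows are made up. It follows the symmetric prefix rules of the SQL, not the slightly asymmetric checks in the loop version.

```python
import numpy as np
import pandas as pd


def field_score(left, right, exact, prefix):
    """`exact` if the upper-cased values are equal, `prefix` if one starts
    with the other, else 0. Empty strings never score."""
    a = left.fillna("").astype(str).str.upper().to_numpy().astype(str)
    b = right.fillna("").astype(str).str.upper().to_numpy().astype(str)
    both = (a != "") & (b != "")
    is_prefix = np.char.startswith(a, b) | np.char.startswith(b, a)
    return np.where(both & (a == b), exact, np.where(both & is_prefix, prefix, 0))


def exact_score(left, right, points):
    """`points` if both values are non-empty and equal, else 0."""
    a = left.fillna("").astype(str).to_numpy().astype(str)
    b = right.fillna("").astype(str).to_numpy().astype(str)
    return np.where((a != "") & (a == b), points, 0)


def best_matches(b, sc):
    # Cross join: one row per (blocklist row, catalog row) pair.
    pairs = b.add_prefix("b_").merge(sc.add_prefix("sc_"), how="cross")
    pairs["titles_score"] = field_score(pairs["b_title"], pairs["sc_title"], 110, 100)
    pairs["artists_score"] = field_score(pairs["b_artist"], pairs["sc_artist"], 35, 30)
    pairs["albums_score"] = field_score(pairs["b_album"], pairs["sc_album"], 25, 20)
    pairs["isrcs_score"] = exact_score(pairs["b_isrc"], pairs["sc_isrc"], 500)
    pairs["upcs_score"] = exact_score(pairs["b_upc"], pairs["sc_upc"], 100)
    score_cols = ["titles_score", "artists_score", "albums_score",
                  "isrcs_score", "upcs_score"]
    pairs["total_score"] = pairs[score_cols].sum(axis=1)
    # Keep the highest-scoring pair per blocklist row (first one wins on ties).
    best = pairs.loc[pairs.groupby("b_id")["total_score"].idxmax()].reset_index(drop=True)
    best["matched"] = np.where(best["total_score"] == 0, "Unmatched",
                               best["sc_id"].astype(str))
    return best[["b_id", "matched", "total_score"]]


# Made-up sample data, for illustration only.
b = pd.DataFrame({
    "id": ["b1", "b2", "b3"],
    "title": ["Hello", "Yesterday", "Nothing"],
    "artist": ["Adele", "The Beatles", ""],
    "album": ["25", "Help!", ""],
    "isrc": ["", "ISRC002", ""],
    "upc": ["", "", ""],
})
sc = pd.DataFrame({
    "id": ["s1", "s2"],
    "title": ["Hello World", "Yesterday"],
    "artist": ["Adele", "Beatles"],
    "album": ["25", "Help!"],
    "isrc": ["ISRC001", "ISRC002"],
    "upc": ["", ""],
})
result = best_matches(b, sc)
print(result)
```

Note that the cross join materializes len(b) * len(sc) rows, so for large files it would probably have to be run over chunks of the blocklist to keep memory bounded.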

Can you tell me which algorithms would be most useful for solving this problem?
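One property of the scoring rules may help here: an ISRC match is worth 500, while every other rule combined is worth at most 110 + 35 + 25 + 100 = 270, so a pair with a matching ISRC always outranks a pair without one. That suggests finding ISRC matches with an ordinary hash join (merge) first, and running the expensive pairwise comparison only for the rows that are left. The sketch below illustrates the split; the column names id, title, and isrc and the sample rows are assumptions, not the real data.

```python
import pandas as pd


def split_by_isrc(b, sc):
    """Split blocklist rows into (pairs sharing an ISRC, rows with no ISRC hit)."""
    has_isrc = b["isrc"].fillna("") != ""
    # Catalog keys with a usable ISRC; rename id so it does not clash with b's id.
    sc_keys = sc.loc[sc["isrc"].fillna("") != "", ["id", "isrc"]]
    sc_keys = sc_keys.rename(columns={"id": "sc_id"})
    # Hash join on ISRC: cost grows with the number of matches, not with len(b) * len(sc).
    candidates = b[has_isrc].merge(sc_keys, on="isrc", how="inner")
    # Rows without any ISRC hit still need the full pairwise scoring.
    rest = b[~b["id"].isin(candidates["id"])]
    return candidates, rest


# Made-up sample data, for illustration only.
b = pd.DataFrame({
    "id": ["b1", "b2", "b3"],
    "title": ["Hello", "Yesterday", "Nothing"],
    "isrc": ["", "ISRC002", ""],
})
sc = pd.DataFrame({
    "id": ["s1", "s2"],
    "title": ["Hello World", "Yesterday"],
    "isrc": ["ISRC001", "ISRC002"],
})
candidates, rest = split_by_isrc(b, sc)
print(candidates)
print(rest)
```

If several catalog rows share the same ISRC, the candidates frame contains one pair per catalog row, and those pairs would still need to be scored on the remaining fields to pick the best one.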

python pandas numpy csv matching

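Comments

2 upvotes not_speshal 10/24/2023
What do you mean by "best match"? Can you provide samples of the csv files and the expected output? For exact matches, you may only need a merge.
0 upvotes Dmitry Shulga 10/24/2023
@not_speshal I mean that for each row there is some row that gets the highest score. I gave a code example with loops. There, each row of file A is compared with each row of file B. The score is determined by the values in the cells, and the best match is the one with the highest score.
0 upvotes not_speshal 10/24/2023
The rest of my comment still stands.
0 upvotes Dmitry Shulga 10/24/2023
@not_speshal link to the files
0 upvotes Community 10/24/2023
Please clarify your specific problem or provide additional details to highlight exactly what you need. As it is currently written, it is hard to tell exactly what you are asking.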

A: No answers yet