Asked by: Dmitry Shulga Asked: 10/24/2023 Last edited by: Dmitry Shulga Updated: 10/24/2023 Views: 75
How can I match the rows of two files according to specific rules using pandas?
Q:
I have two .csv files. For each row of file A, I need to go through every row of file B and find the best match. Here is a snippet showing how it works now:
for row2 in file2:
    final_k = 0  # best total score found so far for this row2
    row3 = []    # output row for the best match
    for row1 in file1:
        m1 = m2 = m3 = m4 = m5 = 0
        # title: 100 for a prefix match in either direction, 110 for an exact match
        if row2[3] and (row1[2] and (
                row2[3].upper().startswith(row1[2].upper()) or row1[2].upper().startswith(row2[3].upper()))):
            m1 = 100
        if row2[3] and row1[2].upper() == row2[3].upper():
            m1 = 110
        # artist: 30 for a prefix match, 35 for an exact match
        if row2[4] and (row1[4] and row2[4].upper().startswith(row1[4].upper())):
            m2 = 30
        if row2[4] and row1[4].upper() == row2[4].upper():
            m2 = 35
        # album: 20 for a prefix match, 25 for an exact match
        if row2[5] and (
                row1[3].upper() == row2[5].upper() or (row1[3] and row2[5].upper().startswith(row1[3].upper()))):
            m3 = 20
        if row2[5] and row1[3].upper() == row2[5].upper():
            m3 = 25
        # isrc and upc: exact matches only
        if row2[6] and row1[6] == row2[6]:
            m4 = 500
        if row2[7] and row1[7] == row2[7]:
            m5 = 100
        k = m1 + m2 + m3 + m4 + m5
        # keep the first row1 that reaches the highest total score
        if k > final_k:
            final_k = k
            row3.clear()
            row3.extend(row2)
            row3.append(row1[0])
            row3.append(row1[2])
            row3.append(row1[4])
            row3.append(row1[3])
            row3.append(row1[6])
            row3.append(row1[7])
            row3.append(m1)
            row3.append(m2)
            row3.append(m3)
            row3.append(m4)
            row3.append(m5)
    if final_k == 0:
        row2.append('Unmatched')
        lines.append(row2)
    else:
        lines.append(row3)
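The per-field rule used in the loop above (a higher score for a case-insensitive exact match, a lower one for a prefix match, zero otherwise) could be written once and reused. This is only a minimal sketch: the name text_score and its symmetric flag are hypothetical and do not appear in the original code. The loop checks the title prefix in both directions but the artist and album prefixes in one direction only, while the SQL script further down checks all three in both directions, so the flag makes that choice explicit.

```python
def text_score(a, b, exact, prefix, symmetric=True):
    """Score two text fields, ignoring case.

    Returns `exact` when both are equal, `prefix` when `a` starts with `b`
    (or, if `symmetric` is true, when `b` starts with `a`), and 0 otherwise.
    Empty or missing values never match.
    """
    a = (a or "").upper()
    b = (b or "").upper()
    if not a or not b:
        return 0
    if a == b:
        return exact
    if a.startswith(b) or (symmetric and b.startswith(a)):
        return prefix
    return 0
```

With such a helper, the artist rule of the loop would read m2 = text_score(row2[4], row1[4], 35, 30, symmetric=False).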
I need to rewrite this logic using pandas, because looping over a large amount of data takes a long time. I don't know much about Python, NumPy, or pandas.
I considered using apply or iterrows, but compared with the loop they did not give a significant improvement. I know that the merge, groupby, and map functions could be used.
I have written an SQL script to make it clearer what I need. However, I need a Python script that uses pandas (or perhaps some faster approach).
select b_id, iff(total_score = 0, 'Unmatched', sc_id::text) as matched, total_score
from (with b as (select id as b_id,
                        upper(title) as b_title,
                        upper(artist) as b_artist,
                        upper(album) as b_album,
                        isrc as b_isrc,
                        upc as b_upc,
                        covered_rows
                 from blocklist
                 where company_id = 18),
           sc as (select id as sc_id,
                         upper(track_title) as sc_title,
                         upper(track_artist) as sc_artist,
                         upper(album_title) as sc_album,
                         isrc as sc_isrc,
                         upc as sc_upc
                  from source_catalog
                  where company_id = 18
                    and ownership != 'No')
      select *,
             -- track
             case
                 when b.b_title = sc.sc_title and b.b_title != '' then 110
                 when (startswith(b.b_title, sc.sc_title) or startswith(sc.sc_title, b.b_title)) and b.b_title != '' and
                      sc.sc_title != ''
                     then 100
                 else 0 end as titles_score,
             -- artist
             case
                 when b.b_artist = sc.sc_artist and b.b_artist != '' then 35
                 when (startswith(b.b_artist, sc.sc_artist) or startswith(sc.sc_artist, b.b_artist)) and
                      b.b_artist != '' and
                      sc.sc_artist != '' then 30
                 else 0 end as artists_score,
             -- album
             case
                 when b.b_album = sc.sc_album and b.b_album != '' then 25
                 when (startswith(b.b_album, sc.sc_album) or startswith(sc.sc_album, b.b_album)) and b.b_album != '' and
                      sc.sc_album != ''
                     then 20
                 else 0 end as albums_score,
             -- isrc
             IFF(b.b_isrc = sc.sc_isrc and b.b_isrc != '', 500, 0) as isrcs_score,
             IFF(b.b_upc = sc.sc_upc and b.b_upc is not null and b.b_upc != 0, 100, 0) as upcs_score,
             titles_score + artists_score + albums_score + isrcs_score + upcs_score as total_score
      from sc
               cross join b)
    qualify ROW_NUMBER() OVER (PARTITION BY b_id
        ORDER BY b_id, total_score desc) = 1;
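For comparison, the SQL above could be mirrored in pandas roughly as follows. This is a sketch under assumptions, not a tested solution for the real data: it assumes two DataFrames b and sc whose columns are already named like the CTE aliases (b_id, b_title, ..., sc_id, sc_title, ...), and the function names text_score_column and best_matches are made up for illustration. It relies on merge(how="cross"), which needs pandas 1.2 or later, and it still builds every pair in memory, so it only removes the Python-level loop, not the quadratic work.

```python
import numpy as np
import pandas as pd


def text_score_column(pairs, left, right, exact, prefix):
    """Vectorised CASE: `exact` for equal values, `prefix` when one starts with the other."""
    a = pairs[left].fillna("").astype(str).str.upper()
    b = pairs[right].fillna("").astype(str).str.upper()
    both = ((a != "") & (b != "")).to_numpy(dtype=bool)
    equal = (a == b).to_numpy(dtype=bool)
    # pandas has no column-vs-column startswith, so this step compares element by element
    starts = np.array([x.startswith(y) or y.startswith(x) for x, y in zip(a, b)], dtype=bool)
    return np.select([both & equal, both & starts], [exact, prefix], default=0)


def best_matches(b, sc):
    """Score every (blocklist, catalog) pair and keep the best catalog row per b_id."""
    pairs = b.merge(sc, how="cross")
    pairs["titles_score"] = text_score_column(pairs, "b_title", "sc_title", 110, 100)
    pairs["artists_score"] = text_score_column(pairs, "b_artist", "sc_artist", 35, 30)
    pairs["albums_score"] = text_score_column(pairs, "b_album", "sc_album", 25, 20)
    isrc_ok = pairs["b_isrc"].notna() & (pairs["b_isrc"] != "") & (pairs["b_isrc"] == pairs["sc_isrc"])
    upc_ok = pairs["b_upc"].notna() & (pairs["b_upc"] != 0) & (pairs["b_upc"] == pairs["sc_upc"])
    pairs["isrcs_score"] = np.where(isrc_ok, 500, 0)
    pairs["upcs_score"] = np.where(upc_ok, 100, 0)
    score_cols = ["titles_score", "artists_score", "albums_score", "isrcs_score", "upcs_score"]
    pairs["total_score"] = pairs[score_cols].sum(axis=1)
    # Counterpart of QUALIFY ROW_NUMBER() ... = 1: idxmax returns the first row with
    # the highest score in each b_id group, so ties go to the earliest catalog row.
    best = pairs.loc[pairs.groupby("b_id")["total_score"].idxmax()]
    matched = ["Unmatched" if score == 0 else str(sc_id)
               for sc_id, score in zip(best["sc_id"], best["total_score"])]
    return pd.DataFrame({
        "b_id": best["b_id"].to_numpy(),
        "matched": matched,
        "total_score": best["total_score"].to_numpy(),
    })
```

Calling best_matches(b, sc) returns one row per b_id with the columns b_id, matched and total_score, like the outer select.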
Could you tell me which algorithms would be most useful for solving this problem?
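One algorithmic idea that follows from the weights in the question: an ISRC match is worth 500, while all other rules together give at most 110 + 35 + 25 + 100 = 270. So if a blocklist row has any catalog row with the same ISRC, its best match must be among those rows, and they can be found with an ordinary equality join instead of a cross join. The sketch below assumes the same hypothetical b and sc DataFrames with b_* and sc_* column names as in the SQL aliases; candidate_pairs is an invented name, and the shortcut only holds while the weights stay as given.

```python
import pandas as pd


def candidate_pairs(b, sc):
    """Return the (blocklist, catalog) pairs that still need to be scored."""
    has_isrc = b["b_isrc"].notna() & (b["b_isrc"] != "")
    # Equality (hash) join: only catalog rows with an identical ISRC are paired up.
    by_isrc = b[has_isrc].merge(sc, left_on="b_isrc", right_on="sc_isrc", how="inner")
    # Blocklist rows without any ISRC hit fall back to the full cross join.
    rest = b[~b["b_id"].isin(by_isrc["b_id"])]
    return pd.concat([by_isrc, rest.merge(sc, how="cross")], ignore_index=True)
```

The resulting pairs can then be scored with the same rules as before; the fewer blocklist rows lack an ISRC hit, the smaller the remaining cross join becomes.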
A: No answers yet