String data matching within the same column - R

Asked by: maldini1990 Asked: 8/10/2023 Last edited by: maldini1990 Updated: 8/14/2023 Views: 45

Q:

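I have a dataset of individual jobs along with some wage information for certain occupations, and I am trying to create a subset that standardizes job titles through fuzzy matching. Specifically, a job titled "Cost Accountant" with a monthly wage of $4000 and a "Financial Accountant" with $5000 should be matched under a new column as "Accountant", and the average wage should be computed across jobs with similar titles.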

Here is my code so far:

# load packages
library(stringr)
library(dplyr)
# Print data example with specific columns
dput(job_posts[1:20,c(4,27)])

Output:

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless", "Accountant", 
"Calls Admin", "Financial Accountant", "Sales Representative", 
"Procurement Assistant", "Water Quality Analyst", "Resident Engineer", 
"Cost Accountant", "Product Specilaist-2", "Operations Coordinator"
), monthly_income = c(NA, 8500, NA, 20000, 15000, NA, 3500, NA, 
NA, 4000, NA, 500, NA, 5000, NA, 8500, 20000, 9000, 4100, 4500)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

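I followed the instructions here, which gave me a good start because they flag the other rows/observations that matched, but I have not been able to standardize the job titles the way I described in the example above.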

# fuzzy matching for job titles, so that similar jobs are stored in one df
job_posts$matched <- sapply(job_posts$jobtitle,agrep,job_posts$jobtitle)
# Print data example with specific columns
dput(job_posts[1:10,c(4,27,28)])

Output:

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless"), monthly_income = c(NA, 
8500, NA, 20000, 15000, NA, NA, NA, NA, NA), matched = list(`PE Teacher` = c(1L, 
1111L), `Accountant` = 2L, 
    `Dewatering Supervisor` = 3L, `sales account manager` = c(4L, 
    1242L, 1309L, 1524L, 1783L), `Sales Lead` = c(5L, 1984L), 
    `Assistant Housekeeping Manager` = 6L, `Quality Manager` = c(7L, 
    196L, 650L, 1856L, 2330L), `Approval Officer` = 8L, Logistics = c(9L, 
    71L, 129L, 176L, 362L, 444L, 446L, 587L, 655L, 935L, 1413L, 
    1508L, 1835L, 2176L, 2300L, 2370L, 2657L, 2685L, 2770L), 
    `Systems Engineer - Networking/Wireless` = 10L)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))
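
For readers unfamiliar with the `matched` column above: for each pattern, `agrep` returns the indices of the elements it approximately matches, which is why every title at least matches its own row. A minimal sketch on a made-up vector (the `titles` values, including the deliberate typo, are invented for illustration and are not taken from the dataset):

```r
# Made-up titles for illustration; "Acountant" is a deliberate typo
titles <- c("Cost Accountant", "Financial Acountant", "PE Teacher")

# Indices of titles that approximately contain "Accountant".
# With the default max.distance = 0.1, a 10-character pattern tolerates one edit.
hits <- agrep("Accountant", titles)
hits

# Logical version of the same match, convenient for flagging or filtering rows
is_accountant <- agrepl("Accountant", titles)
titles[is_accountant]
```

Matching each title against a short list of standard job groups, rather than against every other title, is what turns these hits into a usable label.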

The current df looks like this:

jobtitle                 avg_wage
Financial Accountant     $5000   
Cost Accountant          $4000
Retail Accountant        $4000

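The desired result is shown below, where the average wage is the mean of all accountant wages rather than a separate figure for "Cost Accountant" or "Financial Accountant"; every accounting job is treated as "Accountant":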

jobtitle       avg_wage
Accountant     $4333  

Tags: dataframe, machine-learning, dplyr, pattern-matching

Comments

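1 upvote Mark 8/11/2023
Hi Nesta! Could you show me what your desired output is?
1 upvote Mark 8/11/2023
Thanks! One other thing: none of the data frames you included contains a "Cost Accountant" with a monthly wage of $4000 or a "Financial Accountant" with a monthly wage of $5000.
0 upvotes maldini1990 8/11/2023
Correct, @Mark. I only used them as examples of similar-sounding jobs in my dataset, hoping to clarify the output I want.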

A:

1 upvote Mark 8/11/2023 #1

I think this is what you're after? Though I'm not entirely sure:

library(tidyverse)

# the same as the smallest example dataframe you gave, with an extra irrelevant row for demonstration
data <- data.frame(
  jobtitle = c("Financial Accountant", "Cost Accountant", "Retail Accountant", "Instagram Influencer"),
  avg_wage = c("$5000", "$4000", "$4000", "$1000")
)

# example job groups to match against, likewise just for demonstration
job_groups <- c("Accountant", "Butcher", "Baker", "Candlestick Maker")

# For each job title: look for every job group inside the title, drop the NA
# (non-matching) results, then return NA if no group was found, else the match.
# map_chr() needs exactly one string per title, so a title that matches several
# groups would raise an error.
mutate(data, grp = map_chr(jobtitle, function(title) {
  hits <- str_extract(title, job_groups)  # one result per job group, NA where absent
  hits <- hits[!is.na(hits)]
  if (length(hits) == 0) NA_character_ else hits
}))

Output:

              jobtitle avg_wage        grp
1 Financial Accountant    $5000 Accountant
2      Cost Accountant    $4000 Accountant
3    Retail Accountant    $4000 Accountant
4 Instagram Influencer    $1000       <NA>
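
The output above assigns each title a group but stops short of the average wage the question asks for. A minimal sketch of that last step, assuming the `grp` column has already been added as above and that `avg_wage` holds strings like "$5000" (the names `grouped`, `wage` and `result` are illustrative, not from the original code):

```r
library(dplyr)

# The mutate() result shown above, rebuilt by hand so this snippet runs on its own
grouped <- data.frame(
  jobtitle = c("Financial Accountant", "Cost Accountant",
               "Retail Accountant", "Instagram Influencer"),
  avg_wage = c("$5000", "$4000", "$4000", "$1000"),
  grp = c("Accountant", "Accountant", "Accountant", NA)
)

result <- grouped %>%
  filter(!is.na(grp)) %>%                                    # drop titles with no job group
  mutate(wage = as.numeric(gsub("[$,]", "", avg_wage))) %>%  # "$5000" -> 5000
  group_by(grp) %>%
  summarise(avg_wage = mean(wage), .groups = "drop")         # one row per job group

result
```

On this toy data the result is a single Accountant row with a mean of about 4333, in line with the desired output in the question.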

Comments

1 upvote maldini1990 8/11/2023
This is exactly what I was after, thank you very much!