String data matching within the same column

Q:

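I have a dataset of individual job postings, along with wage information for some occupations, and I'm trying to create a subset that standardizes job titles through fuzzy matching. Specifically, a position titled "Cost Accountant" paying $4000 a month and a "Financial Accountant" paying $5000 should both be matched under a new column called "Accountant", which holds the average wage of the jobs with similar names.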

Here is my code so far:

# load packages
library(stringr)
library(dplyr)

# Print data example with specific columns
dput(job_posts[1:20,c(4,27)])

Output:

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless", "Accountant", 
"Calls Admin", "Financial Accountant", "Sales Representative", 
"Procurement Assistant", "Water Quality Analyst", "Resident Engineer", 
"Cost Accountant", "Product Specilaist-2", "Operations Coordinator"
), monthly_income = c(NA, 8500, NA, 20000, 15000, NA, 3500, NA, 
NA, 4000, NA, 500, NA, 5000, NA, 8500, 20000, 9000, 4100, 4500)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

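I followed the instructions here, which gave me a good start because they flag the other rows/observations that were matched, but I can't standardize the job titles the way I described in the example above.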

# fuzzy matching for job titles, so that similar jobs are stored in one df
job_posts$matched <- sapply(job_posts$jobtitle,agrep,job_posts$jobtitle)

# Print data example with specific columns
dput(job_posts[1:10,c(4,27,28)])

Output:

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless"), monthly_income = c(NA, 
8500, NA, 20000, 15000, NA, NA, NA, NA, NA), matched = list(`PE Teacher` = c(1L, 
1111L), `Accountant` = 2L, 
    `Dewatering Supervisor` = 3L, `sales account manager` = c(4L, 
    1242L, 1309L, 1524L, 1783L), `Sales Lead` = c(5L, 1984L), 
    `Assistant Housekeeping Manager` = 6L, `Quality Manager` = c(7L, 
    196L, 650L, 1856L, 2330L), `Approval Officer` = 8L, Logistics = c(9L, 
    71L, 129L, 176L, 362L, 444L, 446L, 587L, 655L, 935L, 1413L, 
    1508L, 1835L, 2176L, 2300L, 2370L, 2657L, 2685L, 2770L), 
    `Systems Engineer - Networking/Wireless` = 10L)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

The current df looks like this:

jobtitle                 avg_wage
Financial Accountant     $5000   
Cost Accountant          $4000
Retail Accountant        $4000

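The desired result is shown below, where the average wage is the mean of all accountant wages rather than a separate figure for "Cost Accountant" or "Financial Accountant", since all accounting jobs are treated as "Accountant":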

jobtitle       avg_wage
Accountant     $4333

Tags: dataframe, machine-learning, dplyr, pattern-matching

A:

library(tidyverse)

# the same as the smallest example dataframe you gave, with an extra irrelevant row for demonstration
data <- data.frame(
  jobtitle = c("Financial Accountant", "Cost Accountant", "Retail Accountant", "Instagram Influencer"),
  avg_wage = c("$5000", "$4000", "$4000", "$1000")
)

# the job groups to look for, again with extra irrelevant entries for demonstration
job_groups <- c("Accountant", "Butcher", "Baker", "Candlestick Maker")

# For each job title, look for every job group inside it and drop the NA
# (non-matching) results. If no job group appears in the title, return NA;
# otherwise return the matching job group. Note that map_chr() will error
# if a single title matches more than one job group.
mutate(data, grp = map_chr(jobtitle, ~ str_extract(.x, job_groups) %>% {.[!is.na(.)]} %>% if (length(.) == 0) NA_character_ else .))

Output:

              jobtitle avg_wage        grp
1 Financial Accountant    $5000 Accountant
2      Cost Accountant    $4000 Accountant
3    Retail Accountant    $4000 Accountant
4 Instagram Influencer    $1000       <NA>
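
The grouping above stops short of the average the question asks for. One way to finish the job, as a minimal sketch rather than a definitive implementation, is to convert the wage strings to numbers and summarise by group. The helper `first_group()` and the columns `wage` and `n_titles` are hypothetical names introduced here for illustration. The sketch also assumes the wages are plain dollar strings like "$5000", and that taking the first matching group is acceptable when a title matches several.

```r
library(dplyr)
library(stringr)
library(purrr)

# Same demonstration data as in the answer above
data <- data.frame(
  jobtitle = c("Financial Accountant", "Cost Accountant", "Retail Accountant", "Instagram Influencer"),
  avg_wage = c("$5000", "$4000", "$4000", "$1000")
)
job_groups <- c("Accountant", "Butcher", "Baker", "Candlestick Maker")

# Hypothetical helper: return the first job group found in a title, or NA if none is found
first_group <- function(title, groups) {
  hits <- groups[str_detect(title, fixed(groups))]
  if (length(hits) == 0) NA_character_ else hits[1]
}

result <- data %>%
  mutate(
    grp  = map_chr(jobtitle, first_group, groups = job_groups),
    # assumption: wages look like "$5000" or "$5,000"; strip the symbols before converting
    wage = as.numeric(str_remove_all(avg_wage, "[$,]"))
  ) %>%
  filter(!is.na(grp)) %>%          # drop titles that match no job group
  group_by(grp) %>%
  summarise(avg_wage = mean(wage), n_titles = n())

result
```

For the three accountant rows this gives (5000 + 4000 + 4000) / 3, roughly 4333, which matches the desired result in the question. Because the match is an exact substring search, misspelled titles will not be grouped; base R's `agrepl()` could be swapped in for `str_detect()` if approximate matching is needed.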
