从字符串列创建类似于 JSON 的新列

Create new columns from String Columns that looks like JSON

提问人:KaraiKare 提问时间:11/6/2023 最后编辑:Guru StronKaraiKare 更新时间:11/8/2023 访问量:79

问:

我得到了一个包含 1 列的表,它是字符串类型,但内部看起来像 json 类型。

该值如下所示

带值的“old_id”

[{"name":"Entitas Penugasan","id":"6415","value":"HIJRA"},
 {"name":"Function","id":"10594","value":"People & Culture"},
 {"name":"Unit","id":"10595","value":"Organization Development"},
 {"name":"Tribe","id":"10602","value":"Shared Service"}
]

带值的“new_id”

[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
 {"name":"Function","id":"10594","value":"Finance"},
 {"name":"Unit","id":"10595","value":"Finance Operations"},
 {"name":"Tribe","id":"10602","value":"Commercial"}
]

我需要 sql athena 查询才能从这些 json 列中制作列old_name、old_id、old_value、new_name、new_id new_value

我试过使用

REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value

但它只生成 1 行,即使在列中,它也显示 4 个“数组”

查询应生成 4 行,如下所示enter image description here

编号 old_name new_name old_id new_id old_value new_value
一个 恩蒂塔斯·佩努加桑 恩蒂塔斯·佩努加桑 6415 6415 希吉拉 AFS的
一个 功能 功能 10594 10594 人与文化 金融
一个 单位 单位 10595 10595 组织发展 财务运营
一个 部落 部落 10602 10602 共享服务 商业

在 SQL Athena 中有什么方法可以做到这一点吗?


编辑:我在下面的查询中取得了一些进展

with raw_data as(
select id, user_id, old_custom_fields, new_custom_fields
    from my_table
    where 
    -- new_custom_fields <> '' and new_custom_fields<> 'None' and new_custom_fields is not null and
    id in (A)
),
splitted_data as (
    SELECT id, user_id,
    split(old_custom_fields, '},{') AS old_custom_field_id,
    split(new_custom_fields, '},{') AS new_custom_field_id
  FROM my_table
),
old_custom_field_id_unnest as (
SELECT
  *
 from splitted_data
CROSS JOIN UNNEST(old_custom_field_id) AS t (_old_custom_fields)
),
new_custom_field_id_unnest as (
SELECT
  *
 from splitted_data
CROSS JOIN UNNEST(new_custom_field_id) AS t (_new_custom_fields)
),
old_custom_field_cleaned as (
    select id, old_custom_field_id,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value
    from old_custom_field_id_unnest
),
new_custom_field_cleaned as (
    select id, new_custom_field_id,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value
    from new_custom_field_id_unnest
)
select oc.id, old_name, new_name,
    old_id,new_id,
    old_value,new_value
    from old_custom_field_cleaned oc 
    join new_custom_field_cleaned nc on oc.id = nc.id

但这会导致重复的行,现在由于连接,我有 16 行,仍然需要帮助删除不需要的行enter image description here

sql json amazon-athena presto trino

评论

0赞 Tim Mylott 11/7/2023
old_id 和 new_id 的元素数量是否始终相同?我们是否假设您正在将 old_id[0] 与 new_id[0]、old_id[1] 与 new_id[1] 进行比较......如果是这样,是否有用于连接这些字段的字段,或者只是基于数据的顺序?
0赞 KaraiKare 11/7/2023
@TimMylott是的,它们总是具有相同数量的元素。是的,它是将 old_id[0] 与 new_id[0] 进行比较,依此类推。在这种情况下,它们属于同一unique_id A。现在我正在 join 子句中添加一些逻辑。但是这个条款仍然需要测试,因为我仍在检查old_id和new_id是否始终相同,只有old_value和new_value是可更改的on oc.unique_id = nc.unique_id and old_id = new_id

答:

1赞 Guru Stron 11/7/2023 #1

您的数据不仅看起来像 JSON,示例数据也是 JSON,因此请将其作为一个整体进行处理。根据所使用的 Presto/Trino 版本,实际处理可能会有所不同,但共同的部分始终是 - 解析 JSON 并将其转换为某种类型,然后取消嵌套。例如,您可以使用(替代方案可以是 just 或 或,具体取决于具体情况):arrayROW(name varchar, id varchar, value varchar)JSONMAP(varchar, varchar)MAP(varchar, JSON)

-- sample data
WITH dataset(old_id, new_id) AS (
    VALUES
        ('[{"id":"6415","value":"HIJRA", "name":"Entitas Penugasan"},
             {"name":"Function","id":"10594","value":"People & Culture"},
             {"name":"Unit","id":"10595","value":"Organization Development"},
             {"name":"Tribe","id":"10602","value":"Shared Service"}
          ]',
          '[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
             {"name":"Function","id":"10594","value":"Finance"},
             {"name":"Unit","id":"10595","value":"Finance Operations"},
             {"name":"Tribe","id":"10602","value":"Commercial"}
            ]'
         )
)

-- query
SELECT t.*
FROM dataset,
unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar))),
    cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(old_name, old_id, old_value, new_name, new_id, new_value); -- maybe as t(old, new) depending on engine and select t.old.name as old_name, ...

输出:

old_name old_id old_value new_name new_id new_value
恩蒂塔斯·佩努加桑 6415 希吉拉 恩蒂塔斯·佩努加桑 6415 希吉拉
功能 10594 人与文化 功能 10594 人与文化
单位 10595 组织发展 单位 10595 组织发展
部落 10602 共享服务 部落 10602 共享服务

上面的代码来自数组以正确顺序具有“相同”数据的假设。就我个人而言,我会考虑根据 id 加入:

-- sample data
WITH dataset(old_id, new_id) AS (
   -- ..
),

-- query 
old_values as (
    SELECT t.*
    FROM dataset,
    unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
),
new_values as (
    SELECT t.*
    FROM dataset,
    unnest(cast(json_parse(new_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
)

SELECT o.id,
       o.name old_name,
       o.value old_value,
       n.name new_name,
       n.value new_value
FROM old_values as o
full outer join new_values as n on o.id = n.id;

输出:

编号 old_name old_value new_name new_value
6415 恩蒂塔斯·佩努加桑 希吉拉 恩蒂塔斯·佩努加桑 AFS的
10594 功能 人与文化 功能 金融
10595 单位 组织发展 单位 财务运营
10602 部落 共享服务 部落 商业