提问人:KaraiKare 提问时间:11/6/2023 最后编辑:Guru StronKaraiKare 更新时间:11/8/2023 访问量:79
从字符串列创建类似于 JSON 的新列
Create new columns from String Columns that looks like JSON
问:
我得到了一个包含 1 列的表,它是字符串类型,但内部看起来像 json 类型。
该值如下所示
带值的“old_id”列
[{"name":"Entitas Penugasan","id":"6415","value":"HIJRA"},
{"name":"Function","id":"10594","value":"People & Culture"},
{"name":"Unit","id":"10595","value":"Organization Development"},
{"name":"Tribe","id":"10602","value":"Shared Service"}
]
带值的“new_id”列
[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
{"name":"Function","id":"10594","value":"Finance"},
{"name":"Unit","id":"10595","value":"Finance Operations"},
{"name":"Tribe","id":"10602","value":"Commercial"}
]
我需要 sql athena 查询才能从这些 json 列中制作列old_name、old_id、old_value、new_name、new_id new_value
我试过使用
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value
但它只生成 1 行,即使在列中,它也显示 4 个“数组”
编号 | old_name | new_name | old_id | new_id | old_value | new_value |
---|---|---|---|---|---|---|
一个 | 恩蒂塔斯·佩努加桑 | 恩蒂塔斯·佩努加桑 | 6415 | 6415 | 希吉拉 | AFS的 |
一个 | 功能 | 功能 | 10594 | 10594 | 人与文化 | 金融 |
一个 | 单位 | 单位 | 10595 | 10595 | 组织发展 | 财务运营 |
一个 | 部落 | 部落 | 10602 | 10602 | 共享服务 | 商业 |
在 SQL Athena 中有什么方法可以做到这一点吗?
编辑:我在下面的查询中取得了一些进展
with raw_data as(
select id, user_id, old_custom_fields, new_custom_fields
from my_table
where
-- new_custom_fields <> '' and new_custom_fields<> 'None' and new_custom_fields is not null and
id in (A)
),
splitted_data as (
SELECT id, user_id,
split(old_custom_fields, '},{') AS old_custom_field_id,
split(new_custom_fields, '},{') AS new_custom_field_id
FROM my_table
),
old_custom_field_id_unnest as (
SELECT
*
from splitted_data
CROSS JOIN UNNEST(old_custom_field_id) AS t (_old_custom_fields)
),
new_custom_field_id_unnest as (
SELECT
*
from splitted_data
CROSS JOIN UNNEST(new_custom_field_id) AS t (_new_custom_fields)
),
old_custom_field_cleaned as (
select id, old_custom_field_id,
REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value
from old_custom_field_id_unnest
),
new_custom_field_cleaned as (
select id, new_custom_field_id,
REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value
from new_custom_field_id_unnest
)
select oc.id, old_name, new_name,
old_id,new_id,
old_value,new_value
from old_custom_field_cleaned oc
join new_custom_field_cleaned nc on oc.id = nc.id
答:
1赞
Guru Stron
11/7/2023
#1
您的数据不仅看起来像 JSON,示例数据也是 JSON,因此请将其作为一个整体进行处理。根据所使用的 Presto/Trino 版本,实际处理可能会有所不同,但共同的部分始终是 - 解析 JSON 并将其转换为某种类型,然后取消嵌套。例如,您可以使用(替代方案可以是 just 或 或,具体取决于具体情况):array
ROW(name varchar, id varchar, value varchar)
JSON
MAP(varchar, varchar)
MAP(varchar, JSON)
-- sample data
WITH dataset(old_id, new_id) AS (
VALUES
('[{"id":"6415","value":"HIJRA", "name":"Entitas Penugasan"},
{"name":"Function","id":"10594","value":"People & Culture"},
{"name":"Unit","id":"10595","value":"Organization Development"},
{"name":"Tribe","id":"10602","value":"Shared Service"}
]',
'[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
{"name":"Function","id":"10594","value":"Finance"},
{"name":"Unit","id":"10595","value":"Finance Operations"},
{"name":"Tribe","id":"10602","value":"Commercial"}
]'
)
)
-- query
SELECT t.*
FROM dataset,
unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar))),
cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(old_name, old_id, old_value, new_name, new_id, new_value); -- maybe as t(old, new) depending on engine and select t.old.name as old_name, ...
输出:
old_name | old_id | old_value | new_name | new_id | new_value |
---|---|---|---|---|---|
恩蒂塔斯·佩努加桑 | 6415 | 希吉拉 | 恩蒂塔斯·佩努加桑 | 6415 | 希吉拉 |
功能 | 10594 | 人与文化 | 功能 | 10594 | 人与文化 |
单位 | 10595 | 组织发展 | 单位 | 10595 | 组织发展 |
部落 | 10602 | 共享服务 | 部落 | 10602 | 共享服务 |
上面的代码来自数组以正确顺序具有“相同”数据的假设。就我个人而言,我会考虑根据 id 加入:
-- sample data
WITH dataset(old_id, new_id) AS (
-- ..
),
-- query
old_values as (
SELECT t.*
FROM dataset,
unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
),
new_values as (
SELECT t.*
FROM dataset,
unnest(cast(json_parse(new_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
)
SELECT o.id,
o.name old_name,
o.value old_value,
n.name new_name,
n.value new_value
FROM old_values as o
full outer join new_values as n on o.id = n.id;
输出:
编号 | old_name | old_value | new_name | new_value |
---|---|---|---|---|
6415 | 恩蒂塔斯·佩努加桑 | 希吉拉 | 恩蒂塔斯·佩努加桑 | AFS的 |
10594 | 功能 | 人与文化 | 功能 | 金融 |
10595 | 单位 | 组织发展 | 单位 | 财务运营 |
10602 | 部落 | 共享服务 | 部落 | 商业 |
评论
on oc.unique_id = nc.unique_id and old_id = new_id