如何解析Snowflake中XML VARIANT列中的数组?

How to parse an array in an XML VARIANT column in Snowflake?

提问人:Johnny Stevens 提问时间:9/29/2023 最后编辑:Johnny Stevens 更新时间:10/5/2023 访问量:45

问:

我有一个带有 VARIANT 列的表,其中包含来自 Snowpipe 的 XML。对于我的大多数字段,我能够使用 XMLGET 和 GET 的不稳定组合来解析 XML。但是,有一个字段是一个数组,我只能提取标签值数组中的第一个值。每条记录在标签数组中可以有 0 - 100 个值,因此我需要足够动态的东西来涵盖所有这些场景。我花了几天时间试图解决这个问题,但我被难住了。

以下是一些命令,可帮助复制我的问题以及具有一些(但不是全部)值的查询。请注意,此查询将在 dbt 中用于定期将数据从 XML 转换为表格,并输入数百万行,因此我还想在进行此解析时牢记性能......我很好奇是否有人有比嵌套 XMLGET 命令更好的方法。我将在帖子底部添加所需的输出。

提前感谢任何可以帮助我的人!

create temp table if not exists temp_cf_final (RECORD_CONTENT variant);
insert into temp_cf_final
select parse_xml('<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<?opentarget version=\"1.1\"?>\n<opentarget xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">\n<txn id=\"20339559598\" msgIdx=\"76\" msgTot=\"86\" commitTime=\"2023-09-28T10:54:44\" userId=\"349\" oracleTxnId=\"122.17.359225\" />\n<tbl name=\"MAROON_DEV.CF_FINAL\">\n<cmd ops=\"upd\">\n<row id=\"AAIjTAAA7AAHuvTAAZ\">\n<col name=\"DATA\" type=\"varray\" />\n<varray>\n<value>1 street</value>\n<value>    </value>\n<value></value>\n<value>Nashville</value>\n<value>TN</value>\n<value>37211</value>\n<value></value>\n</varray>\n<lkup>\n<col name=\"CF_FINAL_ID\">14773</col>\n</lkup>\n</row>\n</cmd>\n</tbl>\n</opentarget>\n');
insert into temp_cf_final 
select parse_xml('<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<?opentarget version=\"1.1\"?>\n<opentarget xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">\n<txn id=\"20339559598\" msgIdx=\"77\" msgTot=\"86\" commitTime=\"2023-09-28T10:54:44\" userId=\"349\" oracleTxnId=\"122.17.359225\" />\n<tbl name=\"MAROON_DEV.CF_FINAL\">\n<cmd ops=\"upd\">\n<row id=\"AAIjTAAA7AAHuvTAAa\">\n<col name=\"DATA\" type=\"varray\" />\n<varray>\n<value>2486</value>\n</varray>\n<lkup>\n<col name=\"CF_FINAL_ID\">14774</col>\n</lkup>\n</row>\n</cmd>\n</tbl>\n</opentarget>\n');
select record_content
,TO_TIMESTAMP(GET(XMLGET(PARSE_XML(RECORD_CONTENT),'txn'),'@commitTime')) as last_commit_time
,GET(XMLGET(XMLGET(PARSE_XML(RECORD_CONTENT),'tbl'),'cmd'),'@ops')::string as last_operation
,CASE 
    WHEN LAST_OPERATION='ins' THEN GET(XMLGET(XMLGET(XMLGET(XMLGET(PARSE_XML(RECORD_CONTENT),'tbl'),'cmd'),'row'),'col',0),'$')
    WHEN LAST_OPERATION='upd' THEN GET(XMLGET(XMLGET(XMLGET(XMLGET(XMLGET(PARSE_XML(RECORD_CONTENT),'tbl'),'cmd'),'row'),'lkup'),'col',0),'$') 
    END::number AS CF_FINAL_ID
,SUBSTR(GET(XMLGET(PARSE_XML(RECORD_CONTENT),'tbl'),'@name'),1,CHARINDEX('.',GET(XMLGET(PARSE_XML(RECORD_CONTENT),'tbl'),'@name'))-1)::string AS CUSTOMER_SCHEMA
,GET(XMLGET(XMLGET(XMLGET(XMLGET(PARSE_XML(RECORD_CONTENT),'tbl'),'cmd'),'row'),'varray'),'$') as DATA_UPD --Works to get me either the single value of "value" or the array of values
,regexp_substr_all(GET(XMLGET(XMLGET(XMLGET(XMLGET(PARSE_XML(RECORD_CONTENT),'tbl'),'cmd'),'row'),'varray'),'$'),$$"\$":\w+$$,1,1) as DU_values --Works on the array of values, but only pulls the last in the array
,GET(XMLGET(XMLGET(XMLGET(XMLGET(XMLGET(PARSE_XML(RECORD_CONTENT),'tbl'),'cmd'),'row'),'varray'),'value'),'$')::VARCHAR as DU_first_line --Works great to get the single value, but returns only the first line of an array of values
from temp_cf_final;

从上面可以看出...

DATA_UPD列为我提供了奇异值和值数组。 DU_FIRST_LINE更进一步,生成了我想要的单数值的 VARCHAR,但只给了我值数组的第一行。 DU_VALUES是我尝试正则表达式从数组中获取值,但经过几个小时的尝试,这不会为我做。

我想看到的 DATA 是 string_agg 和逗号分隔的值数组(一旦我/我们/您弄清楚数组解析,我就可以这样做了)?

CF_FINAL_ID |数据

14773 |1街,,纳什维尔,田纳西州,37211, 14774 |2486

====更新====

我想提供我的最新进展。我收到了有关 StackOverflow 和 Snowflake 支持的反馈。考虑到我的业务需求(以及向利益相关者指出的性能问题),我决定使用部分扁平化和 PATH 变量来获取我需要的数据。

我还发现,对于具有相同时间戳的 INS(具有空 DATA)和 UPD(没有所有键,但包含 DATA)按顺序传递了多个 msg 值。我创建了一个operation_order值,该值将要求使用 DEL、UPD 和 INS 执行最近的操作,并赋予优先级。

最后,我添加了一些 CTE 来过滤结果,并在operation_order上添加了 QUALIFY 语句。

with cf_final_xml as (select record_metadata,parse_xml(record_content) as record_content from CARELOGIC_DEV.RAW_CL.cf_final where GET(XMLGET(XMLGET(PARSE_XML(RECORD_CONTENT),'tbl'),'cmd'),'@ops') in ('ins','upd','del'))
-- pre-limiting the CF_FINAL records to ins/upd/del reduces the number of rows included in the rest of the query logic
, cf_final_flat as (select * from cf_final_xml,lateral flatten(input => record_content['$'][1]['$']['$']['$'], recursive => false))
-- flattening the "value" node instead of the full table reduces the # of rows in the results, not a full cartesian
select SUBSTR(GET(XMLGET(RECORD_CONTENT,'tbl'),'@name'),1,CHARINDEX('.',GET(XMLGET((RECORD_CONTENT),'tbl'),'@name'))-1)::string AS CUSTOMER_SCHEMA
,RECORD_CONTENT['$'][1]['$']['@ops']::string as last_operation
,TO_TIMESTAMP(RECORD_CONTENT['$'][0]['@commitTime']) AS LAST_COMMIT_TIME
,RECORD_CONTENT['$'][0]['@msgIdx']::NUMBER AS msg_id
,CASE
    WHEN LAST_OPERATION='ins' THEN record_content['$'][1]['$']['$']['$'][1]['$']::NUMBER
END AS CF_EDIT_ID_INS
,CASE
    WHEN LAST_OPERATION='ins' THEN record_content['$'][1]['$']['$']['$'][2]['$']::NUMBER
END AS CF_FORM_DTL_ID_INS
,CASE 
    WHEN LAST_OPERATION='ins' then 0
    WHEN LAST_OPERATION='upd' THEN 1
    WHEN LAST_OPERATION='del' THEN 2
END AS OPERATION_ORDER
-- Using OPERATION_ORDER in QUALIFY statement to limit the results to the most recent edit
-- For example, INS and UPD come in with the same LAST_COMMIT_TIME. The OpOrder DESC helps define the last operation
,CASE 
    WHEN LAST_OPERATION='ins' THEN record_content['$'][1]['$']['$']['$'][0]['$']::NUMBER
    WHEN LAST_OPERATION='upd' THEN record_content['$'][1]['$']['$']['$'][2]['$']['$']::NUMBER
    WHEN LAST_OPERATION='del' THEN record_content['$'][1]['$']['$']['$']['$']['$']::NUMBER
END AS CF_FINAL_ID
, MAX(CF_EDIT_ID_INS) OVER (PARTITION BY CUSTOMER_SCHEMA,CF_FINAL_ID) AS CF_EDIT_ID
, MAX(CF_FORM_DTL_ID_INS) OVER (PARTITION BY CUSTOMER_SCHEMA,CF_FINAL_ID) AS CF_FORM_DTL_ID
-- Window functions to attribute the CF_EDIT_ID and CF_FORM_DTL_ID to UPD records
,CASE
    WHEN LAST_OPERATION='upd' THEN record_content['$'][1]['$']['$']['$'][1]::STRING
END AS VAL_ARRAY
,replace(replace(left(right(VAL_ARRAY,length(VAL_ARRAY)-8),length(VAL_ARRAY)-25),'<value>',''),'</value>',',')::STRING as "DATA"
-- ,RECORD_CONTENT
-- ,*
from cf_final_flat
QUALIFY ROW_NUMBER() OVER (PARTITION BY CUSTOMER_SCHEMA,CF_FINAL_ID ORDER BY LAST_COMMIT_TIME,OPERATION_ORDER DESC)=1
-- Only take the most recent operation by time/op across Customers and CF_FINAL_ID values
ORDER BY CUSTOMER_SCHEMA,CF_FINAL_ID;
数组 xml 解析 snowflake-cloud-data-platform

评论


答:

0赞 Felipe Hoffa 10/3/2023 #1

您可以将最后一个数组视为字符串,然后:regexp_substr_all

select xmlget(record_content, 'tbl') tbl
 , xmlget(tbl, 'cmd') cmd
 , xmlget(cmd, 'row') row1
 , xmlget(row1, 'varray') varray
 , regexp_substr_all(varray::string, '<value>([^<]+)', 1, 1, 'e', 1) arr
from temp_cf_final;

enter image description here