How to retrieve the schema (types and columns) from a SQL table and enforce it on a table in a Synapse notebook?

Asked by: Herwini · Asked: 11/7/2023 · Last edited by: Herwini · Updated: 11/7/2023 · Views: 63

Q:

I have several tables in a SQL database. All of these tables are empty but have the correct schema. I also have the corresponding tables in an Azure data lake; those contain the data but have the wrong schema.

I want to create a table that combines the data from the data-lake table with the schema of the SQL database table.

What I tried:

I have a Synapse notebook with PySpark.

  • I load the schema of the SQL table.
  • I pull the matching table from the data lake.
  • I try to enforce the SQL table's schema on the table fetched from the data lake.
  • I write the data to the SQL database.

In code:

import pandas as pd

container = "..."
storage_account = "..."
ls_name = "..."
adls_path = f"abfs://{container}@{storage_account}.dfs.core.windows.net/test_table.parquet"
# Set the variables used to connect to the database
SERVER = ...
DATABASE= ...
USERNAME= ...
PASSWORD= ...

# Read the (empty) SQL table to get its correct schema
sqldf = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://{SERVER};databaseName={DATABASE}") \
    .option("dbtable", 'test_table') \
    .option("user", USERNAME) \
    .option("password", PASSWORD) \
    .load()

# Read in data lake table
pandasdf = pd.read_parquet(adls_path, storage_options = {'linked_service' : ls_name})
pandasdf = pandasdf[sqldf.columns] # to make the columns match
sparkdf = spark.createDataFrame(pandasdf)

# Create spark dataframe with correct data and correct schema
final = spark.createDataFrame(sparkdf.collect(), sqldf.schema, verifySchema=True)

# Write the corrected dataframe back to the SQL database
final.write.format("jdbc") \
    .option("url", f"jdbc:sqlserver://{SERVER};databaseName={DATABASE}") \
    .option("dbtable", 'test_table1') \
    .option("user", USERNAME) \
    .option("password", PASSWORD) \
    .save()

The table 'test_table1' does get created. However, when I look at the SQL database and run the following query:

SELECT * FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'test_table1'

I see the following: [screenshot of the INFORMATION_SCHEMA.COLUMNS result for test_table1]

And when I run this query:

SELECT * FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'test_table'

I get this result:

[screenshot of the INFORMATION_SCHEMA.COLUMNS result for test_table]

Everything looks the same, except that CHARACTER_MAXIMUM_LENGTH and CHARACTER_OCTET_LENGTH are far apart: -1 versus some fixed number. (I believe -1 indicates the maximum possible length.)

So the schemas are not exactly the same. I need the schema of 'test_table1' to be identical to that of 'test_table'.

How can I do this?

(

  • I have multiple tables with a large number of columns, and I don't want to specify the schema characteristics individually for each of them.
  • In some of these tables, the type in the data-lake table is DoubleType while in SQL it should be int or float. That usually raises an error saying the column cannot be converted. However, if I do this manually by casting the types, it works; a sketch of that generic cast follows below. )
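A minimal sketch of what that generic cast could look like, reusing sparkdf and sqldf from the code above. The loop over sqldf.schema is only an assumption about how the manual cast might be generalised, not something tested against these tables:

from pyspark.sql import functions as F

# Cast every data-lake column to the Spark type that the JDBC read
# reported for the corresponding SQL column (assumes column names already match).
casted = sparkdf.select(
    [F.col(field.name).cast(field.dataType).alias(field.name)
     for field in sqldf.schema.fields]
)
casted.printSchema()  # should now mirror sqldf.schema
# casted can then be written with the same JDBC write call as `final` above.

Note that even with matching Spark types, the -1 in CHARACTER_MAXIMUM_LENGTH comes from the JDBC writer, which by default appears to create string columns as NVARCHAR(MAX) when it creates the target table; as far as I know, the writer's createTableColumnTypes option can supply explicit SQL column types instead, though the type string would have to be assembled per table.
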
sql-server dataframe azure pyspark

Comments

0 votes JayashankarGS 11/7/2023
It would help if you added sqldf.schema to the question.

Answers:

0 votes ChandraSaiKarthik 11/7/2023 #1

Try this code; it should give you a similar result:

SELECT TABLE_SCHEMA AS "SchemaName",
       TABLE_NAME AS "ObjectName",
       ORDINAL_POSITION AS "ColOrd",
       COLUMN_NAME AS "ColumnName",
       DOMAIN_NAME AS "UserDataType",
       DATA_TYPE AS "SystemDataType",
       CASE WHEN DATA_TYPE IN ('xml', 'hierarchyid', 'geography', 'sql_variant', 'image', 'text', 'ntext') THEN NULL ELSE CHARACTER_MAXIMUM_LENGTH END AS "Length",
       CASE WHEN DATA_TYPE IN ('decimal', 'numeric') THEN NUMERIC_PRECISION ELSE NULL END AS "Precision",
       CASE WHEN DATA_TYPE IN ('decimal', 'numeric') THEN NUMERIC_SCALE ELSE NULL END AS "Scale",
       DATA_TYPE + CASE WHEN DATA_TYPE IN ('decimal', 'numeric') THEN CONCAT('(',NUMERIC_PRECISION,',',NUMERIC_SCALE,')')
                        WHEN DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'binary', 'varbinary') THEN CASE WHEN CHARACTER_MAXIMUM_LENGTH = -1 THEN '(max)' ELSE CONCAT('(',CHARACTER_MAXIMUM_LENGTH,')') END
                    ELSE '' END AS "ConfiguredDataType",
       CASE IS_NULLABLE WHEN 'YES' THEN 1 ELSE 0 END AS "isNull",
       SUBSTRING(COLUMN_DEFAULT, 2, LEN(COLUMN_DEFAULT)-2) AS "DefaultValue"
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'categories'

Query to get the same result from sys.columns:

SELECT SCHEMA_NAME(O.schema_id) AS "SchemaName",
       OBJECT_NAME(O.object_id) AS "ObjectName",
       CASE WHEN O.type = 'U' THEN 'UserTable' WHEN O.type = 'V' THEN 'View' WHEN O.type = 'TF' THEN 'TableValuedFn' ELSE '' END AS "ObjectType",
       C.column_id AS "ColID",
       COLUMNPROPERTY(C.object_id, C.name, 'ordinal') AS "ColOrd",
       C.name AS "ColumnName",
       CASE WHEN C.user_type_id <> C.system_type_id THEN TYPE_NAME(C.user_type_id) ELSE '' END AS "UserDataType",
       ISNULL(TYPE_NAME(C.system_type_id), TYPE_NAME(C.user_type_id)) AS "SystemDataType",
       ISNULL(CAST(COLUMNPROPERTY(C.object_id, C.name, 'charmaxlen') AS NVARCHAR(10)), '') AS "Length",
       CASE WHEN TYPE_NAME(C.system_type_id) IN ('decimal', 'numeric') THEN CAST(C.precision AS NVARCHAR(4)) ELSE '' END AS "Precision",
       CASE WHEN TYPE_NAME(C.system_type_id) IN ('decimal', 'numeric') THEN CAST(C.scale AS NVARCHAR(4)) ELSE '' END AS "Scale",
       ISNULL(TYPE_NAME(C.system_type_id), TYPE_NAME(C.user_type_id))
            + CASE
                 WHEN TYPE_NAME(C.system_type_id) IN ('decimal', 'numeric') THEN '(' + CAST(C.precision AS VARCHAR(4)) + ', ' + CAST(C.scale AS VARCHAR(4)) + ')'
                 WHEN C.max_length = -1 AND TYPE_NAME(C.system_type_id) NOT IN ('xml') THEN '(max)'
                 WHEN ISNULL(TYPE_NAME(C.system_type_id), TYPE_NAME(C.user_type_id)) IN ('xml', 'hierarchyid', 'geography', 'sql_variant', 'image', 'text', 'ntext') THEN ''
                 ELSE ISNULL('('+CAST(COLUMNPROPERTY(C.object_id, C.name, 'charmaxlen') AS NVARCHAR(10))+')', '')
              END AS "ConfiguredSystemType",
       ISNULL(C.is_nullable, 0) AS "isNull",
       ISNULL(SUBSTRING(OBJECT_DEFINITION(C.default_object_id), 2, LEN(OBJECT_DEFINITION(C.default_object_id)) - 2), '') AS "DefaultValue",
       C.is_identity AS "isIdentity",
       C.is_computed AS "isComputed"
FROM sys.columns AS C
INNER JOIN sys.objects AS O ON O.object_id = C.object_id
WHERE SCHEMA_NAME(O.schema_id) = 'dbo' AND O.name = 'categories'

Comments

0 votes Herwini 11/7/2023
This shows the same difference I already pointed out in the question (-1 versus some fixed number).
1 vote ChandraSaiKarthik 11/7/2023
"-1" means (MAX), except for XML.
0 votes Thom A 11/7/2023
@Herwini, it seems the real question is why you are defining all of your columns as nvarchar that may hold values of more than 8,000 bytes (4,000 characters)?
0 votes Thom A 11/7/2023
As an aside, though, you should really be using the sys objects, @ChandraSaiKarthik, rather than INFORMATION_SCHEMA; the latter exists for compatibility, so it lacks some (a lot of) information and has (at least historically) at times been documented incorrectly.
0 votes ChandraSaiKarthik 11/7/2023
@ThomA, why consider CHARACTER_OCTET_LENGTH for data storage? I could refer to sys.columns and sys.types for the actual details, but I think that adds more confusion to the analysis: getting both the system data type and the user-defined data type needs extra joins, which is a bit awkward. That is why I use INFORMATION_SCHEMA.COLUMNS with DOMAIN_NAME AS "UserDataType" and DATA_TYPE AS "SystemDataType".