提问人:Herwini 提问时间:11/7/2023 最后编辑:Herwini 更新时间:11/7/2023 访问量:63
如何从 sql 表中检索架构(类型和列)并在突触笔记本中的表上强制实施它们?
How to retrieve the schema (type and columns) from a sql table and enforce them on a table in a synapse notebook?
问:
我在 sql 数据库中有几个表。所有这些表都是空的,但具有正确的架构。现在,我在 Azure 数据湖中有相应的表,其中包含数据但架构错误。
我想创建一个表,将数据湖表的数据与 sql 数据库表的架构相结合。
我尝试做什么:
我有一个带有pyspark的突触笔记本。
- 我加载了 sql 表的架构。
- 我从数据湖中提取匹配的表
- 我尝试在从数据湖获取的表上强制执行 sql 表的架构
- 我将数据写入 sql 数据库
在代码中:
container = "..."
storage_account = "..."
ls_name = "..."
adls_path = f"abfs://{container}@{storage_account}.dfs.core.windows.net/test_table.parquet"
#set variable to be used to connect the database
SERVER = ...
DATABASE= ...
USERNAME= ...
PASSWORD= ...
#read table data for sql database
sqldf = spark.read.format("jdbc") \
.option("url", f"jdbc:sqlserver://{SERVER};databaseName={DATABASE}") \
.option("dbtable", 'test_table') \
.option("user", USERNAME) \
.option("password", PASSWORD) \
.load()
# Read in data lake table
pandasdf = pd.read_parquet(adls_path, storage_options = {'linked_service' : ls_name})
pandasdf = pandasdf[sqldf.columns] # to make the columns match
sparkdf = spark.createDataFrame(pandasdf)
# Create spark dataframe with correct data and correct schema
final = spark.createDataFrame(sparkdf.collect(), sqldf.schema, verifySchema=True)
#Write table data into a spark dataframe
final.write.format("jdbc") \
.option("url", f"jdbc:sqlserver://{SERVER};databaseName={DATABASE}") \
.option("dbtable", 'test_table1') \
.option("user", USERNAME) \
.option("password", PASSWORD) \
.save()
确实创建了表“test_table1”。但是,当我查看sql数据库并运行以下查询时:
SELECT * FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'test_table1'
当我运行以下查询时:
SELECT * FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'test_table'
我得到这个结果:
一切看起来都一样,但“CHARACTER_MAXIMU_LENGTH”和“CHARACTER_OCET_LENGTH”相差甚远。 -1 与一些固定数字。(-1 我认为表示可能的最大数量)
因此,这意味着架构并不完全相同。我需要“test_table1”的模式与“test_table”相同。
我该怎么做?
(
- 我有多个包含大量列的表,我不想为所有这些表单独指定架构特征应该是什么。
- 在其中一些表中,数据湖表中的类型是 doubletype,在 sql 中应为 int 或 float。这通常会引发无法转换列的错误。但是,如果我通过强制转换类型来手动执行此操作,它就会起作用 )
答:
0赞
ChandraSaiKarthik
11/7/2023
#1
试试这段代码,它应该给你类似的结果。
SELECT TABLE_SCHEMA AS "SchemaName",
TABLE_NAME AS "ObjectName",
ORDINAL_POSITION AS "ColOrd",
COLUMN_NAME AS "ColumnName",
DOMAIN_NAME AS "UserDataType",
DATA_TYPE AS "SystemDataType",
CASE WHEN DATA_TYPE IN ('xml', 'hierarchyid', 'geography', 'sql_variant', 'image', 'text', 'ntext') THEN NULL ELSE CHARACTER_MAXIMUM_LENGTH END AS "Length",
CASE WHEN DATA_TYPE IN ('decimal', 'numeric') THEN NUMERIC_PRECISION ELSE NULL END AS "Precision",
CASE WHEN DATA_TYPE IN ('decimal', 'numeric') THEN NUMERIC_SCALE ELSE NULL END AS "Scale",
DATA_TYPE + CASE WHEN DATA_TYPE IN ('decimal', 'numeric') THEN CONCAT('(',NUMERIC_PRECISION,',',NUMERIC_SCALE,')')
WHEN DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'binary', 'varbinary') THEN CASE WHEN CHARACTER_MAXIMUM_LENGTH = -1 THEN '(max)' ELSE CONCAT('(',CHARACTER_MAXIMUM_LENGTH,')') END
ELSE '' END AS "ConfiguredDataType",
CASE IS_NULLABLE WHEN 'YES' THEN 1 ELSE 0 END AS "isNull",
SUBSTRING(COLUMN_DEFAULT, 2, LEN(COLUMN_DEFAULT)-2) AS "DefaultValue"
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'categories'
查询以从 sys.columns 获取相同的结果
SELECT SCHEMA_NAME(O.schema_id) AS "SchemaName",
OBJECT_NAME(O.object_id) AS "ObjectName",
CASE WHEN O.type = 'U' THEN 'UserTable' WHEN O.type = 'V' THEN 'View' WHEN O.type = 'TF' THEN 'TableValuedFn' ELSE '' END AS "ObjectType",
C.column_id AS "ColID",
COLUMNPROPERTY(C.object_id, C.name, 'ordinal') AS "ColOrd",
C.name AS "ColumnName",
CASE WHEN C.user_type_id <> C.system_type_id THEN TYPE_NAME(C.user_type_id) ELSE '' END AS "UserDataType",
ISNULL(TYPE_NAME(C.system_type_id), TYPE_NAME(C.user_type_id)) AS "SystemDataType",
ISNULL(CAST(COLUMNPROPERTY(C.object_id, C.name, 'charmaxlen') AS NVARCHAR(10)), '') AS "Length",
CASE WHEN TYPE_NAME(C.system_type_id) IN ('decimal', 'numeric') THEN CAST(C.precision AS NVARCHAR(4)) ELSE '' END AS "Precision",
CASE WHEN TYPE_NAME(C.system_type_id) IN ('decimal', 'numeric') THEN CAST(C.scale AS NVARCHAR(4)) ELSE '' END AS "Scale",
ISNULL(TYPE_NAME(C.system_type_id), TYPE_NAME(C.user_type_id))
+ CASE
WHEN TYPE_NAME(C.system_type_id) IN ('decimal', 'numeric') THEN '(' + CAST(C.precision AS VARCHAR(4)) + ', ' + CAST(C.scale AS VARCHAR(4)) + ')'
WHEN C.max_length = -1 AND TYPE_NAME(C.system_type_id) NOT IN ('xml') THEN '(max)'
WHEN ISNULL(TYPE_NAME(C.system_type_id), TYPE_NAME(C.user_type_id)) IN ('xml', 'hierarchyid', 'geography', 'sql_variant', 'image', 'text', 'ntext') THEN ''
ELSE ISNULL('('+CAST(COLUMNPROPERTY(C.object_id, C.name, 'charmaxlen') AS NVARCHAR(10))+')', '')
END AS "ConfiguredSystemType",
ISNULL(C.is_nullable, 0) AS "isNull",
ISNULL(SUBSTRING(OBJECT_DEFINITION(C.default_object_id), 2, LEN(OBJECT_DEFINITION(C.default_object_id)) - 2), '') AS "DefaultValue",
C.is_identity AS "isIdentity",
C.is_computed AS "isComputed"
FROM sys.columns AS C
INNER JOIN sys.objects AS O ON O.object_id = C.object_id
WHERE SCHEMA_NAME(O.schema_id) = 'dbo' AND O.name = 'categories'
评论
0赞
Herwini
11/7/2023
这显示了我在问题中已经显示的相同差异(-1 与某个设定的数字)。
1赞
ChandraSaiKarthik
11/7/2023
“-1”表示 (MAX),XML 除外
0赞
Thom A
11/7/2023
@Herwini,似乎真正的问题是,为什么您将所有列定义为可能包含超过 8,000 个字节(4,000 个字符)的值?nvarchar
0赞
Thom A
11/7/2023
然而,顺便说一句,你真的应该使用对象,@ChandraSaiKarthik,而不是对象;后者是为了兼容性,因此缺乏一些(很多)信息,并且(至少在历史上)有时被记录为不正确。sys
INFORMATION_SCHEMA
0赞
ChandraSaiKarthik
11/7/2023
为什么考虑“CHARACTER_OCET_LENGTH”来表示数据存储?@ThomA,我可以参考 sys.columns 和 sys.types 来获取实际数据。我认为这会给分析带来更多的混乱。要获得系统数据类型和用户定义的数据类型,它更多的是连接,这有点奇怪。所以我使用 information_schema.columns DOMAIN_NAME作为 AS “UserDataType”,DATA_TYPE AS “SystemDataType”
评论
sqldf.schema