Asked by: markfickett  Asked: 8/19/2023  Updated: 8/20/2023  Views: 63
Is there a Python parsing library that can parse a TOML-like format that specifies nested fields with [ParentHeader_ChildSection]?
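Q:
I would like to parse an externally defined (and undocumented) file format in Python. It looks somewhat similar to TOML, but the text styling is different and there is no quoting. For example: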
[Schedule_Step122]
m_nMaxCurrent=0
m_szAddIn=Relay OFF
m_szLabel=06 - End Charge
m_uLimitNum=2
[Schedule_Step122_Limit0]
Equation0_szCompareSign=>=
Equation0_szRight=F_05_Charge_Capacity
Equation0_szLeft=PV_CHAN_Charge_Capacity
m_bStepLimit=1
m_szGotoStep=End Test
[Schedule_Step122_Limit1]
Equation0_szCompareSign=>=
Equation0_szLeft=PV_CHAN_Voltage
Equation0_szRight=3
m_bStepLimit=1
m_szGotoStep=End Test
(This is Arbin's test schedule format.)
The structure I'm hoping to get from parsing is something like this:
"steps": [
  {
    "max_current": 0,
    "add_in": RELAY_OFF,
    "label": "06 - End Charge",
    "limits": [
      {
        "equations": [
          {
            "left": PV_CHAN_CHARGE_CAPACITY,
            "compare_sign": ">=",
            "right": F_05_CHARGE_CAPACITY
          }
        ],
        "step_limit": 1,
        "goto_step": END_TEST
      },
      {
        "equations": [
          {
            "left": PV_CHAN_VOLTAGE,
            "compare_sign": ">=",
            "right": 3
          }
        ],
        "step_limit": 1,
        "goto_step": END_TEST
      }
    ]
  }
]
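Superficially the format looks similar to TOML, including some of the nesting, but the string handling is different. I'd also like to capture certain values as named constants.
I have also looked into defining a context-free grammar and using a lexer/parser such as ANTLR, PLY, pyparsing, or Lark. I'm familiar with reading grammars in documentation, but I have never written one or used a parser before. However, I don't know how to represent the nested structure (for example, Schedule_Step122_Limit0 becoming a member of Schedule_Step122) or the lack of guaranteed order among related keys (such as Equation0_szCompareSign, Equation0_szLeft, etc.).
Is there a general-purpose parsing tool that I can write a definition for, and that will give me parsed/structured output? Or is writing custom parsing logic the best approach here?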
A:
0 votes
Michael Dyck
8/20/2023
#1
Tools like ANTLR, PLY, pyparsing, or Lark would give you almost no help here. configparser might help, but I suspect it would be more trouble than it's worth.
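To illustrate the configparser point, here is a minimal sketch of what the standard-library module gives you for this kind of input. The sample text is abridged from the question, and the helper name read_flat_sections is made up for this example. It assumes the real files contain nothing configparser rejects (for instance duplicate keys within a section). The sections come back as a flat mapping keyed by the full header string, so the nesting implied by names like Schedule_Step122_Limit0 still has to be built by hand.

```python
import configparser

# Abridged sample taken from the question.
SAMPLE = """\
[Schedule_Step122]
m_nMaxCurrent=0
m_szAddIn=Relay OFF
[Schedule_Step122_Limit0]
Equation0_szCompareSign=>=
Equation0_szLeft=PV_CHAN_Charge_Capacity
m_szGotoStep=End Test
"""


def read_flat_sections(text):
    """Hypothetical helper: return {section_name: {key: value_string}}."""
    # interpolation=None keeps any '%' in values literal;
    # delimiters=('=',) stops ':' from being treated as a separator.
    parser = configparser.ConfigParser(interpolation=None, delimiters=('=',))
    # By default configparser lowercases keys; keep them as written.
    parser.optionxform = str
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


flat = read_flat_sections(SAMPLE)
for section, fields in flat.items():
    print(section, fields)
```

Every value is still a string, and the header and key names still need to be split on '_' and interpreted, which is most of the work.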
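The following code comes close to what you want. You will need to adjust it based on what you discover about the input format and on what you need from the output structure.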
import json
import re


def main():
    obj = parse('input.txt')
    print(json.dumps(obj, indent=2))


def parse(filename):
    root_object = {}
    current_object = None
    with open(filename) as f:
        for line in f:
            # trim trailing whitespace:
            line = line.rstrip()

            if line == '':
                # blank line
                pass

            elif mo := re.fullmatch(r'\[(\w+)\]', line):
                # header line
                # This identifies, via a 'path' from the root object,
                # the object that subsequent name-value lines are talking about.
                header_path = mo.group(1)
                header_pieces = header_path.split('_')
                current_object = get_nested_object(root_object, header_pieces)

            elif mo := re.fullmatch(r'([^=]+)=(.*)', line):
                # name-value line
                (name_part, value_str) = mo.groups()
                # The {name_part} identifies a field in {current_object}
                # or some object nested within {current_object}.
                # The {value_str} encodes the value to be assigned to that field.

                name_pieces = name_part.split('_')
                prefix_pieces = name_pieces[:-1]
                field_name_piece = name_pieces[-1]

                if prefix_pieces == ['m']:
                    # This is an 'immediate' field of {current_object}
                    obj_w_field = current_object
                else:
                    # This is a field of some object nested within {current_object}
                    obj_w_field = get_nested_object(current_object, prefix_pieces)

                mo = re.fullmatch(r'([a-z]+)([A-Z][a-zA-Z]*)', field_name_piece)
                (type_indicator, field_name_pc) = mo.groups()
                field_name = to_snake_case(field_name_pc)

                field_value = value_str

                obj_w_field[field_name] = field_value

            else:
                assert 0, line

    return root_object


def get_nested_object(base_object, header_pieces):
    if header_pieces == []:
        return base_object
    else:
        prefix_pieces = header_pieces[:-1]
        last_piece = header_pieces[-1]

        obj = get_nested_object(base_object, prefix_pieces)

        if mo := re.fullmatch(r'[A-Za-z]+', last_piece):
            # e.g. "Schedule"
            # This identifies a field/property/member of {obj}
            field_name = to_snake_case(last_piece)
            # That field might or might not exist already.
            if field_name not in obj:
                # It doesn't exist yet.
                # We assume that the value of the field is an object
                obj[field_name] = {}
            return obj[field_name]

        elif mo := re.fullmatch(r'([A-Za-z]+)(\d+)', last_piece):
            # e.g., "Step122", "Limit0"
            # This identifies an element of an array that is a field of {obj}
            # e.g., "Step122" implies that {obj} has a field named "steps",
            # whose value is an array,
            # and this identifies the element at index 122 in that array.
            (array_field_name_pc, index_str) = mo.groups()
            array_field_name = to_snake_case(array_field_name_pc) + 's'
            index = int(index_str)
            if array_field_name not in obj:
                obj[array_field_name] = {}
                # In practice, you might want to make this a list.
            array = obj[array_field_name]
            if index not in array:
                array[index] = {}
            return array[index]

        else:
            assert 0, last_piece

        assert 0  # not reached


# "_pc" suffix denotes a Pascal-cased name, e.g. "MaxCurrent"
def to_snake_case(name_pc):
    assert '_' not in name_pc

    def replfunc(mo):
        cap_letter = mo.group(0)
        low_letter = cap_letter.lower()
        if mo.start() == 0:
            return low_letter
        else:
            return '_' + low_letter

    return re.sub(r'[A-Z]', replfunc, name_pc)


if __name__ == '__main__':
    main()
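The code above captures type_indicator but stores every value as a string. As a hedged sketch, the prefix could drive a conversion step. The helper convert_value and the meanings of the prefixes are assumptions based only on the sample ('n', 'u' and 'b' look numeric, 'sz' looks like a string); the format is undocumented, so check them against real files.

```python
# Assumed meanings of the Hungarian-style prefixes seen in the sample:
# 'n' and 'u' look like integers, 'b' looks like a 0/1 flag.
INTEGER_PREFIXES = {'n', 'u', 'b'}


def convert_value(type_indicator, value_str):
    """Hypothetical helper: convert a raw value using its (assumed) type prefix."""
    if type_indicator in INTEGER_PREFIXES:
        try:
            return int(value_str)
        except ValueError:
            # The assumption did not hold for this value; keep the raw string.
            return value_str
    # 'sz' and any unrecognised prefix stay as strings.
    return value_str


print(convert_value('n', '0'))
print(convert_value('sz', 'Relay OFF'))
```

To use it, the line field_value = value_str in parse() would become field_value = convert_value(type_indicator, value_str). Values under an 'sz' prefix, such as Equation0_szRight=3, would still come out as strings.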
For the sample input, running main() prints:
{
  "schedule": {
    "steps": {
      "122": {
        "max_current": "0",
        "add_in": "Relay OFF",
        "label": "06 - End Charge",
        "limit_num": "2",
        "limits": {
          "0": {
            "equations": {
              "0": {
                "compare_sign": ">=",
                "right": "F_05_Charge_Capacity",
                "left": "PV_CHAN_Charge_Capacity"
              }
            },
            "step_limit": "1",
            "goto_step": "End Test"
          },
          "1": {
            "equations": {
              "0": {
                "compare_sign": ">=",
                "left": "PV_CHAN_Voltage",
                "right": "3"
              }
            },
            "step_limit": "1",
            "goto_step": "End Test"
          }
        }
      }
    }
  }
}
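In this output the "arrays" are objects keyed by index, whereas the question asks for real lists. One possible post-processing step, sketched below, converts them. The function listify is a made-up name. It relies on get_nested_object() storing array indices as int keys, and it has to run on the result of parse() before json.dumps, because json.dumps turns those int keys into strings.

```python
def listify(node):
    """Recursively replace dicts whose keys are all ints with lists ordered by index."""
    if not isinstance(node, dict):
        return node
    converted = {key: listify(value) for key, value in node.items()}
    if converted and all(isinstance(key, int) for key in converted):
        # Gaps in the indices are dropped, so list position != original index.
        return [converted[index] for index in sorted(converted)]
    return converted


# Small hand-built example in the shape that parse() produces.
parsed = {
    'schedule': {
        'steps': {
            122: {
                'label': '06 - End Charge',
                'limits': {1: {'goto_step': 'End Test'}, 0: {'step_limit': '1'}},
            },
        },
    },
}
result = listify(parsed)
print(result)
```

Because Step122 becomes element 0 of the list, you may want to store the original index as a field of each element if the step number matters.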
Comments
0 votes
markfickett
9/7/2023
Thanks for confirming that existing parsers aren't useful here.