Parsing multi-line table cells from space-aligned table data

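Question:

I have a somewhat messy generated file that dumps everything into an HTML <pre> tag and splits the column headers across 2 lines. I am a Python and regex novice, and I am having a hard time finding a way to combine these 2 lines into one so that the column headers sit on a single line and match up. The ultimate goal is to parse the entire file into fields.

Here is an example of what it looks like on the web:

What I would like to do is match the fields up on a single line. For instance, if I just strip out the extra spaces, "CLOCK" ends up matched with FINISHER rather than TIME. What I want is:

ID# | PLACE | CLASS PLACE | FINISHER | CLOCK TIME | NET TIME | PACE

Here is the actual HTML: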

</B>             CLASS                                            CLOCK       NET    
  ID#  PLACE PLACE         FINISHER                          TIME       TIME     PACE

python html parsing

# read in the 2 lines:
line1 = '             CLASS                                            CLOCK       NET    '
line2 = '  ID#  PLACE PLACE         FINISHER                          TIME       TIME     PACE  '

# pad the shorter among the lines, so that both are equally long:
linediff = len(line1) - len(line2)
if linediff > 0:
    line2 += ' ' * linediff
else:
    line1 += ' ' * (-linediff)
length = len(line1)

# go through both lines character-by-character:
top, bottom = [], []
i = 0
while i < length:
    # skip indices where both lines have a space:
    if line1[i] == ' ' and line2[i] == ' ':
        i += 1
    else:
        # find the first j to the right of i for which
        # both lines have a space:
        j = i
        while (j < length) and (line1[j] != ' ' or line2[j] != ' '):
            j += 1
        # copy the lines from position i (inclusive)
        # to j (exclusive) into top and bottom:
        top.append(line1[i:j])
        bottom.append(line2[i:j])
        # we are done with one heading and advance i:
        i = j

# top:
# ['   ', '     ', 'CLASS', '        ', ' CLOCK', '  NET', '    ']
# bottom:
# ['ID#', 'PLACE', 'PLACE', 'FINISHER', 'TIME  ', 'TIME ', 'PACE']

headers = []
for str1, str2 in zip(top, bottom):
    # remove leading/trailing spaces from each partial heading:
    s1, s2 = str1.strip(), str2.strip()
    # merge partial headings
    # (strip is needed because one of the two might be empty):
    headers.append((s1 + ' ' + s2).strip())

# headers:
# ['ID#', 'PLACE', 'CLASS PLACE', 'FINISHER', 'CLOCK TIME', 'NET TIME', 'PACE']

Note that this problem has nothing to do with HTML as such, so no special HTML handling is needed.
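To get from the merged headers to the question's stated goal of parsing the file into fields, one possible next step is to remember the (i, j) spans found during the scan and cut each data row at the header start positions. The following is only a sketch under assumptions: the names column_spans and split_row, the shortened header, and the sample row are all made up for illustration, and the approach assumes every value starts at or after its header's first column and ends before the next header starts.

```python
def column_spans(line1, line2):
    """Return (start, end) pairs for each run of columns where either line has text."""
    length = max(len(line1), len(line2))
    line1, line2 = line1.ljust(length), line2.ljust(length)
    spans = []
    i = 0
    while i < length:
        if line1[i] == ' ' and line2[i] == ' ':
            i += 1
            continue
        j = i
        while j < length and (line1[j] != ' ' or line2[j] != ' '):
            j += 1
        spans.append((i, j))
        i = j
    return spans


def split_row(row, spans):
    """Cut a data row at the header start positions and strip each field.

    Assumption: values are left-aligned under their headers, so a field
    runs from its header's start up to the next header's start.
    """
    starts = [0] + [start for start, _ in spans[1:]]
    ends = starts[1:] + [len(row)]
    return [row[a:b].strip() for a, b in zip(starts, ends)]


# Hypothetical header lines and data row, built with ljust so the
# alignment is explicit; none of this comes from the original file.
head1 = ''.ljust(5) + 'CLASS'.ljust(16) + 'CLOCK'
head2 = 'ID#'.ljust(5) + 'PLACE'.ljust(7) + 'FINISHER'.ljust(9) + 'TIME'
row = '17'.ljust(5) + '3'.ljust(7) + 'Jane Doe'.ljust(9) + '0:42:10'

spans = column_spans(head1, head2)
headers = [(head1[i:j].strip() + ' ' + head2[i:j].strip()).strip()
           for i, j in spans]
record = dict(zip(headers, split_row(row, spans)))
print(record)
```

If the real data is right-aligned or spills past the next header's start, the cut positions would need adjusting, for example by scanning a few data rows for columns that are blank in all of them.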

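You are awesome. Thank you so much! What kinds of topics would help me get a better grasp of these concepts? I'm guessing math; I've been out of high school/college for a while :). Also, FYI, it seems that length is reserved as a variable name in Python, so I changed it. You might want to edit it for anyone who copies and pastes this in the future.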

1 upvote Lover of Structure 3/19/2023

@Steve If I enter length in a fresh Python session, I get "NameError: name 'length' is not defined", so I suppose that in your case some other initialization code on your system also defines length.
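A quick way to check whether a name is reserved, sketched with the standard keyword and builtins modules (the variable names here are only illustrative):

```python
import builtins
import keyword

# A name can clash with Python in two ways: it can be a reserved keyword
# (unusable as a variable) or a built-in (usable, but shadowing it is risky).
is_keyword = keyword.iskeyword('length')
is_builtin = hasattr(builtins, 'length')
print(is_keyword, is_builtin)    # False False

# For contrast, len is a built-in function and for is a keyword.
print(hasattr(builtins, 'len'), keyword.iskeyword('for'))    # True True
```

Since length is neither, a clash would have to come from something outside the language itself, such as startup code or an editor's own namespace.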

0 upvotes Lover of Structure 3/19/2023

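@Steve In the code above, my algorithm is very un-Pythonic. I think one trick is to try a naive solution first (here: one that uses neither Python idioms nor regular expressions). By the way, note how I append the finished strings line1[i:j] and line2[i:j] to top and bottom (rather than building them up character by character, which would be slow because it constructs many immutable string objects).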
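For comparison, here is one way the same idea might look with more Python idioms. This is a sketch of a possible rewrite, not the code from the answer: merge_headers is a made-up name, and it uses itertools.groupby to find runs of columns where at least one of the two lines has a non-space character.

```python
from itertools import groupby, zip_longest


def merge_headers(line1, line2):
    """Merge two space-aligned header lines into one list of column titles."""
    # Pair up characters column by column, padding the shorter line with spaces.
    pairs = list(zip_longest(line1, line2, fillvalue=' '))
    headers = []
    # Group consecutive columns by whether either line has text there.
    for has_text, run in groupby(pairs, key=lambda p: p != (' ', ' ')):
        if has_text:
            top, bottom = zip(*run)
            parts = (''.join(top).strip(), ''.join(bottom).strip())
            # Skip the empty half when only one line has text in this run.
            headers.append(' '.join(part for part in parts if part))
    return headers


result = merge_headers(' ' * 4 + 'NET', 'ID# TIME')
print(result)    # ['ID#', 'NET TIME']
```

The strings are assembled with a single join per run, which avoids the repeated concatenation mentioned above.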
