An algorithm to fix a subtitle.srt file using a ground-truth paragraph

Asked by: thedebugger  Asked: 8/9/2023  Updated: 8/9/2023  Views: 51

Q:

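I have a subtitle.srt file, but its content is not as accurate as it should be. I also have a set of paragraphs that are accurate but are not synchronized with the timing.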

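The inaccuracies can have several causes, including:

  • capitalization mismatches,
  • extra words or characters,
  • missing words or characters,
  • missing punctuation,
  • etc.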

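What approach can I use to fix the srt file with the ground-truth text? Algorithm suggestions in any programming language are welcome.

I would greatly appreciate any help.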

Example:

subtitle.srt

1
00:00:00,000 --> 00:00:04,320
Heat wave is expect to continue for the next a few

2
00:00:04,320 --> 00:00:07,920
days, and the government`s warning people to take precautions. the heat wave is a reminder of the dangers of climating

3
00:00:07,920 --> 00:00:13,760
change, the need to take action to reduce greenhouse gas emission.

Ground-truth text:

The heat wave is expected to continue for the next few days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate change, and the need to take action to reduce greenhouse gas emissions.

Expected result: subtitle_corrected.srt

1
00:00:00,000 --> 00:00:04,320
The heat wave is expected to continue for the next few

2
00:00:04,320 --> 00:00:07,920
days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate

3
00:00:07,920 --> 00:00:13,760
change, and the need to take action to reduce greenhouse gas emissions.

Tags: string-comparison, similarity, subtitle, error-correction


A:

0 votes  Marijn  8/9/2023  #1

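This task is called alignment, and it is a common task in fields such as biology (comparing two DNA sequences) and natural language processing (for example, the present case with two parallel subtitle sources).

The task has been studied for many years (dating back to the 1970s), and many algorithms have been developed for it. These algorithms have been implemented in all major programming languages.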
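To make the idea concrete, here is a minimal pure-Python sketch of Needleman-Wunsch global alignment. The function name needleman_wunsch and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not taken from any particular library, and the quadratic score table makes it suitable only for short texts.

```python
def needleman_wunsch(query, target, match=1, mismatch=-1, gap=-1):
    """Globally align two strings.

    Returns a list of (query_idx, target_idx) pairs; None marks a gap
    on that side. Scoring values are illustrative defaults.
    """
    n, m = len(query), len(target)
    # score[i][j] = best score for aligning query[:i] with target[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two characters
                score[i - 1][j] + gap,       # query character has no partner
                score[i][j - 1] + gap,       # target character has no partner
            )

    # trace back from the bottom-right corner to recover the alignment
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if query[i - 1] == target[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                pairs.append((i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


# correct text as the query, inaccurate text as the target
query, target = "next few", "next a few"
pairs = needleman_wunsch(query, target)
for q, t in pairs:
    print(q, t)
```

Pairs with None on the query side mark target characters that have no counterpart in the correct text. For real subtitle files a tested library implementation is usually the better choice.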

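For example, the Python library text_alignment_tool implements the dynamic programming algorithms Smith-Waterman and Needleman-Wunsch. The code below shows how to use the Smith-Waterman algorithm (called LocalAlignmentAlgorithm in the library) on the subtitles. The algorithm takes the correct text as the query and the inaccurate text as the target, and produces an alignment of the following kind: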

syntax: position, character in query > position, character in target
119 T > 114 t
120 h > 115 h
121 e > 116 e
122   > 117  
123 h > 118 h
124 e > 119 e
125 a > 120 a
126 t > 121 t
127   > 122  
128 w > 123 w
129 a > 124 a
130 v > 125 v
131 e > 126 e
[...]
162 o > 157 o
163 f > 158 f
164   > 159  
165 c > 160 c
166 l > 161 l
167 i > 162 i
168 m > 163 m
169 a > 164 a
170 t > 165 t
171 e > 168 g

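Most of the code is bookkeeping: it builds a plain-text version of the inaccurate subtitles while keeping track of character positions and timestamps, and then reconstructs the subtitle format.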

# Import the tool and necessary classes
from text_alignment_tool import (
    TextAlignmentTool,
    StringTextLoader,
    LocalAlignmentAlgorithm,
)

subtitle_srt = """1
00:00:00,000 --> 00:00:04,320
Heat wave is expect to continue for the next a few

2
00:00:04,320 --> 00:00:07,920
days, and the government`s warning people to take precautions. the heat wave is a reminder of the dangers of climating

3
00:00:07,920 --> 00:00:13,760
change, the need to take action to reduce greenhouse gas emission."""

correct_text = """The heat wave is expected to continue for the next few days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate change, and the need to take action to reduce greenhouse gas emissions."""

# list with information about each subtitle fragment
fragments_info = []
# keep track of character positions for each sentence in the original subtitles 
current_pos = 0
# collect a list of just the text without the number and timestamp
all_lines = list()

# split on two newlines to get each block of nr+timestamp_sentence
for fragment in subtitle_srt.split("\n\n"):
    # split each block into number, timestamp and sentence
    (fragment_nr, timestamp, fragment_txt) = fragment.splitlines()
    # add the sentence to the list of sentences
    all_lines.append(fragment_txt)
    # keep track of new position: old position plus length of current sentence
    newpos = current_pos + len(fragment_txt)
    # add number, timestamp and position to the list with information about fragments
    fragments_info.append({"number": fragment_nr, "timestamp": timestamp, "end_position": newpos})
    # update position variable to use in next iteration
    current_pos = newpos + 1

# create a multi-line string with only the sentences to use for alignment
target_text = "\n".join(all_lines)

print(target_text)
print("---------------------------")
print(fragments_info)
print("---------------------------")

# load the two text strings for use in the alignment library
query_1 = StringTextLoader(correct_text)
target_1 = StringTextLoader(target_text)
# initialize the alignment for the two texts
aligner_1 = TextAlignmentTool(query_1, target_1)
# select an alignment algorithm
local_alignment_algorithm = LocalAlignmentAlgorithm()
# perform the actual alignment
aligner_1.align_text(local_alignment_algorithm)

# extract character-level alignment positions
alm = aligner_1.collect_all_alignments()
alm_idxs = alm[0][0]

# reconstruct the subtitles using the alignment

# keep track of the fragment number and the position in the correct text
fragment_nr = 0
start_pos = 0
# loop over each aligned character pair
for x in alm_idxs.query_to_target_mapping.alignments:
    # stop once every fragment has been written
    if fragment_nr >= len(fragments_info):
        break
    # if the position in the original subtitle (=target) is the end of a fragment
    # then write a subtitle line using the position in the correct text (=query)
    if x.target_idx >= fragments_info[fragment_nr]["end_position"]-1:
        print(fragments_info[fragment_nr]["number"])
        print(fragments_info[fragment_nr]["timestamp"])
        print(correct_text[start_pos:x.query_idx+1])
        # blank line that separates fragments in the SRT format
        print()
        # update the start position (skipping the separating space) and
        # the fragment number for the next fragment
        start_pos = x.query_idx + 2
        fragment_nr += 1

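Output of the code, showing the plain-text version of the inaccurate subtitles, the list with information about each fragment, and the reconstructed subtitles: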

Heat wave is expect to continue for the next a few
days, and the government`s warning people to take precautions. the heat wave is a reminder of the dangers of climating
change, the need to take action to reduce greenhouse gas emission.
---------------------------
[{'number': '1', 'timestamp': '00:00:00,000 --> 00:00:04,320', 'end_position': 50}, {'number': '2', 'timestamp': '00:00:04,320 --> 00:00:07,920', 'end_position': 169}, {'number': '3', 'timestamp': '00:00:07,920 --> 00:00:13,760', 'end_position': 236}]
---------------------------
1
00:00:00,000 --> 00:00:04,320
The heat wave is expected to continue for the next few

2
00:00:04,320 --> 00:00:07,920
days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate

3
00:00:07,920 --> 00:00:13,760
change, and the need to take action to reduce greenhouse gas emissions.

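This code is written in Python, and the text_alignment_tool library has some quirks and is not well documented (disclaimer: I have no affiliation with the library). The code serves as a proof of concept, but it may not be the best solution for every case.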
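One case the proof of concept does not cover is a fragment whose text spans more than one line: unpacking fragment.splitlines() into exactly three names raises a ValueError there. A more tolerant parsing step might look like the sketch below. parse_srt is a hypothetical helper name, and joining caption lines with a single space is an assumption about what the alignment step should see.

```python
import re


def parse_srt(srt_text):
    """Split SRT text into a list of dicts with number, timestamp and text."""
    fragments = []
    # fragments are separated by one or more blank lines
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks that carry no caption text
        number, timestamp, *text_lines = lines
        fragments.append({
            "number": number.strip(),
            "timestamp": timestamp.strip(),
            # join multi-line captions so the aligner sees one string
            "text": " ".join(line.strip() for line in text_lines),
        })
    return fragments


sample = (
    "1\n00:00:00,000 --> 00:00:04,320\nHeat wave is expect\nto continue\n\n"
    "2\n00:00:04,320 --> 00:00:07,920\ndays, and"
)
parsed = parse_srt(sample)
for fragment in parsed:
    print(fragment)
```

The character positions needed for the alignment can then be accumulated from the text field of each fragment, in the same way the loop above does with fragment_txt.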

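However, as mentioned above, these algorithms are widely available in libraries for many different programming languages, so with the right search terms (alignment, Needleman-Wunsch) you should be able to write something similar that fits your needs.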
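As an illustration of writing something similar with only the Python standard library, the sketch below aligns at word level with difflib.SequenceMatcher instead of Smith-Waterman. SequenceMatcher is a longest-matching-block heuristic rather than an optimal alignment, so it may place boundaries differently on harder inputs. The helper names normalize, map_boundary and resegment are assumptions made for this example.

```python
import difflib
import re


def normalize(word):
    """Lowercase a word and drop punctuation so 'Days,' and 'days' compare equal."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


def map_boundary(opcodes, boundary):
    """Translate a word index in the noisy text to a word index in the correct text."""
    for tag, i1, i2, j1, j2 in opcodes:
        if i1 < boundary <= i2:
            if tag == "equal" or boundary < i2:
                # step the same distance into the correct side, clamped to the span
                return min(j1 + (boundary - i1), j2)
            return j2  # boundary closes a replaced or deleted span: take all of it
    return 0


def resegment(fragment_texts, correct_text):
    """Return one corrected string per noisy fragment."""
    noisy_words, boundaries = [], []
    for text in fragment_texts:
        noisy_words.extend(text.split())
        boundaries.append(len(noisy_words))  # exclusive end of this fragment
    correct_words = correct_text.split()

    matcher = difflib.SequenceMatcher(
        None,
        [normalize(w) for w in noisy_words],
        [normalize(w) for w in correct_words],
        autojunk=False,
    )
    opcodes = matcher.get_opcodes()

    corrected, start = [], 0
    for n, boundary in enumerate(boundaries):
        is_last = n == len(boundaries) - 1
        # the last fragment takes whatever is left of the correct text
        end = len(correct_words) if is_last else map_boundary(opcodes, boundary)
        end = max(end, start)
        corrected.append(" ".join(correct_words[start:end]))
        start = end
    return corrected


fragments = [
    "Heat wave is expect to continue for the next a few",
    "days, and the government`s warning people to take precautions. "
    "the heat wave is a reminder of the dangers of climating",
    "change, the need to take action to reduce greenhouse gas emission.",
]
correct_text = (
    "The heat wave is expected to continue for the next few days, and the "
    "government is warning people to take precautions. The heat wave is a "
    "reminder of the dangers of climate change, and the need to take action "
    "to reduce greenhouse gas emissions."
)
corrected = resegment(fragments, correct_text)
for line in corrected:
    print(line)
```

Each returned string can be written back under the number and timestamp of the fragment it came from. Working on words rather than characters keeps the sequences short, at the cost of never splitting a fragment in the middle of a word.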