简化 ngram 循环以压缩给定一组固定的 ngram 字符串-解网

问：

在字符列表和字符元组列表中给出，即list('Hello▁world▁')

[（'l'， 'l'），（'ell'， 'o '），（'地狱'， 'o'），（'w'， 'or'），（'o'， 'r'），（'e'， 'l'），（'el'， 'l'），（'H'， 'ell'），（'H'， 'e'），（'他'， 'll'），（'worl'， 'd '），（'wor'， 'l'），（'l'， 'd '），（'d'， ' '），（'wor'， 'ld '），（'H'， 'el'），（'o'， ' '），（'w'， 'o'），（'l'， 'o '），（'l'， 'o'）]

目标是遍历元组，如果字符匹配，则折叠字符列表。我已经尝试过这个，它有效：

import copy

def matcher(s, ngram):
  while s:
    window = tuple(s[:2]) # Since the string tuples are in pairs.
    if window == ngram:
      yield "".join(window)
      s = s[2:]
    else:
      yield s[0]
      s = s[1:]

def combine_ngrams(s, vocab):
  prev = copy.copy(s)
  while True:
    for v in vocab:
      s = list(matcher(s, v))
    if s == prev:
      break
    else:
      prev = s
  return s


vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), 
         ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]

s = list('Hello▁world▁')

combine_ngrams(s, vocab)

[输出]：

['Hello▁', 'world▁']

但是外部函数和内部函数中的多个 while 循环看起来可以很容易地简化。combine_ngrams()matcher()

或者，也许操作不需要遍历元组，也许一些迭代应用词汇替换的正则表达式方法会起作用。有没有办法在函数中简单地使用嵌套的 while 循环？combine_ngrams

下面是更多输入/输出示例：

[在]：

s = list('abcde'); vocab = [('a', 'b'), ('b', 'c'), ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]

s = list('abcde'); vocab = [('a', 'b'), ('ab', 'c'), ('b', 'c'),  ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]

s = list('aaab'); vocab = [('a', 'a'), ('a', 'aa'), ('aaa', 'b')]

s = list('Hello▁ポケモンセンター▁world▁'); vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]

[输出]：

['ab', 'c', 'd', 'e']

['abcde']

['aa', 'a', 'b']

['Hello▁', 'ポ', 'ケ', 'モ', 'ン', 'セ', 'ン', 'タ', 'ー', '▁', 'world▁']

P/S：对于任何感兴趣的人，这与字节对编码算法有关，如果有一种更算法而不是 pythonic 循环的方法来解决这个问题，请提出建议。

python 字符串 while-loop 字节对编码

def combine_ngrams2(s, vocab):
    sep = "\uE000"
    ss = "".join(s)
    while sep in ss:
        sep += chr(0xE000 + len(sep))
    assert sep not in ss and len(set(sep)) == len(sep)

    prev = None
    current = f"{sep}{sep.join(s)}{sep}"
    while prev != current:
        prev = current
        for v1, v2 in vocab:
            current = current.replace(f"{sep}{v1}{sep}{v2}{sep}", f"{sep}{v1}{v2}{sep}")

    return current[len(sep) : -len(sep)].split(sep)

这个解决方案可能看起来像一个愚蠢的实现，但经过了很好的优化，很难找到更好的替代方案。str.replace

0赞 Sandesh-Gowda0094 5/5/2023 #2

是的，有一种方法可以简化 combine_ngrams（）函数中的嵌套 while 循环。一种方法是使用正则表达式在输入字符串中查找词汇表中每个元组的最长匹配项。这将消除对 matcher（）函数和内部 while 循环的需求。

import re

def combine_ngrams(s, vocab):
    pattern = '|'.join(['(?:' + re.escape(a) + ')' for a, b in vocab]) # create regex pattern
    while True:
        prev = ''.join(s)
        s = re.sub(pattern, lambda m: vocab[m.lastindex - 1][1], prev) # apply substitution
        if s == prev:
            break
        s = list(s)
    return s

以下是将输入输入到函数时发生的情况

s = list('Hello▁world▁')
vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'),          ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]
print(combine_ngrams(s, vocab)) # output: ['Hello▁', 'world▁']

s = list('abcde')
vocab = [('a', 'b'), ('b', 'c'), ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]
print(combine_ngrams(s, vocab)) # output: ['e']

s = list('abcde')
vocab = [('a', 'b'), ('ab', 'c'), ('b', 'c'),  ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]
print(combine_ngrams(s, vocab)) # output: ['e']

s = list('aaab')
vocab = [('a', 'a'), ('a', 'aa'), ('aaa', 'b')]
print(combine_ngrams(s, vocab)) # output: ['b']

s = list('Hello▁ポケモンセンター▁world▁')
vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]
print(combine_ngrams(s, vocab)) # output: ['Hello▁', 'ポ', 'ケ', 'モ', 'ン', 'セ', 'ン', 'タ', 'ー', '▁world▁']

希望这能帮助您使用正则表达式模式完成此操作......

上一个：使用方法反转用户输入字符串

下一个：如何结束在 python 3.11 中提供输入的无限循环

简化 ngram 循环以压缩给定一组固定的 ngram 字符串

Simplifying ngram loops to compress the string given a fix set of ngrams

评论