Reliably detecting a substring pattern in string building

Asked by: questionaire Asked: 11/13/2023 Last edited by: questionaire Updated: 11/13/2023 Views: 99

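Q:

I am trying to solve a programming task, but I cannot meet the performance requirements. Here is its full description:

A language model takes an n-letter word S and a parameter k (an integer with 1 ≤ k < n) as input, and then generates a continuation of the word according to the rules specified below.

Suppose we already have a word S′ that is an extension of S by some letters. Adding one new letter works as follows (see also the example below): we consider the k-letter suffix R of the word S′ and look at all earlier occurrences of R in S′ (as a contiguous substring). Then, for each letter of the alphabet, we count how many times it occurs in S′ directly after R. Let l be the letter that occurs most often. Ties are resolved in favor of the letter that comes earlier in the alphabet, and if R does not occur anywhere else in S′, we take l = 'a'. Finally, we extend S′ by appending the letter l to its end.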

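For example, let S = "abaaabababa" and k = 3. We have S′ = S and R = aba. R occurs earlier in S′, and the occurrences together with the letter that follows each are abaa, abab, abab. The most frequent next letter is b, so we append b to S′. Now S′ = "abaaabababab" and R = bab. R occurs earlier, followed each time by a (baba, baba), so we append 'a' to S′.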
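The extension rule above can be written down almost literally as a brute-force function. This is only a sketch for checking small cases by hand: the name `nextLetter` is made up for illustration, the input is assumed to contain only lowercase letters, and each call costs O(|S′| · k), so it is far too slow for the real limits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Brute-force version of one extension step, written straight from the rule.
// Hypothetical helper; assumes word contains only the letters 'a'..'z'.
char nextLetter(std::string_view word, std::size_t k)
{
    assert(k >= 1 && k < word.size());
    constexpr std::string_view alphabet{ "abcdefghijklmnopqrstuvwxyz" };

    const std::string_view suffix{ word.substr(word.size() - k) };
    std::array<std::size_t, 26> counts{};

    // Visit every earlier occurrence of the suffix that has a letter after it.
    for (std::size_t i{ 0 }; i + k < word.size(); ++i)
        if (word.substr(i, k) == suffix)
            ++counts[alphabet.find(word[i + k])];

    // Strict '>' keeps the earlier letter on ties; all-zero counts give 'a'.
    std::size_t best{ 0 };
    for (std::size_t c{ 1 }; c < counts.size(); ++c)
        if (counts[c] > counts[best])
            best = c;

    return alphabet[best];
}
```

Called on "abaaabababa" with k = 3 it returns 'b', and after appending that letter it returns 'a', which matches the example above.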

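Input: The first line of input contains four integers n, k, a and b (2 ≤ n ≤ 10^6, 1 ≤ k < n, n < a < b < 10^18, b + 1 − a ≤ 10^6). The second line of input contains an n-letter string of lowercase English letters ('a' – 'z'), representing the word S.

Output: The output should be a sequence of b + 1 − a characters: the letters at positions a through b (inclusive) of the extended word S′. In other words, assuming that b − n letters have been appended to the initial word S, you should print the last b + 1 − a of those appended letters.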

Example: For the input

11 3 12 13
abaaabababa

the correct result is:

ba

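I managed to write a program that builds the string correctly, but it fails to do so in reasonable time when (b - n), the number of letters we assume have been added, is large. b can be as big as 10^18 and n can be only as big as 10^6, leaving 10^18 - 10^6 characters to generate. On the other hand, we only care about b - a + 1 characters, which is at most a million, and I can handle that many in time.

I believe my idea of how to approach this problem is wrong. I am fairly sure we are not supposed to generate the whole string, because every test I ran seems to produce a pattern that starts sooner or later. Maybe we should generate only the required number of characters and rotate (roll) the string?

Any kind of help is appreciated. If you know of resources that could help me, I would be glad to hear about them; I am having a hard time searching for a suitable solution. The task is about a fictional chatbot, so maybe some machine learning algorithm would solve it?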

Here is my simplified code (in the original I use a radix trie with a custom hash, but I believe that does not matter; both programs produce correct results).

#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

[[nodiscard]] static std::string Model(
    const size_t n,
    const size_t k,
    const size_t a,
    const size_t b,
    std::string& word) noexcept(true)
{
    /* k-lettered substring -> character after the substring -> count */
    std::unordered_map<std::string, std::unordered_map<char, size_t>> occurences;

    for (size_t i{ 0U }; i < n - k; ++i)
        /* Extract a k-lettered substring at position i -> extract the letter after it -> increment count */
        ++occurences[word.substr(i, k)][word[i + k]];

    std::string lettersToPrint{};
    for (size_t i{ 0U }; i < (b - n) /* Do we really have to generate all of them? */; ++i)
    {
        /* k-lettered substring at the end of the "word". */
        std::string suffix(word.substr(word.length() - k, k));

        /* Detect the most popular character occurring after each suffix substring. */
        size_t mostPopularSuffixAppendLetterCount{ 0U };
        char mostPopularSuffixAppendLetter{ 'a' };

        for (char j{ 0 }; j < 26; ++j)
        {
            // careful! works with ASCII!!!
            const char currentCharacter{ char('z' - j) };
            const auto currentCharacterCount{ occurences[suffix][currentCharacter] };

            if (currentCharacterCount >= mostPopularSuffixAppendLetterCount)
            {
                mostPopularSuffixAppendLetterCount = currentCharacterCount;
                mostPopularSuffixAppendLetter = currentCharacter;
            }
        }

        word += mostPopularSuffixAppendLetter;

        if ((b - n) - i <= b + 1 - a)
            lettersToPrint += mostPopularSuffixAppendLetter;
    }

    return lettersToPrint;
}

int main()
{
    size_t n{ 65 }, k{ 2 }, a{ 67 }, b{ 783 };
    std::string word{ "ffdedffddfdefedddfdfddfddeedfdededddffeefffdeedfeddeddddfeefdffee" };

    assert(Model(n, k, a, b, word) == "dfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddf");

    return 0;
}
C++ string algorithm data-structures loops

Comments

1 upvote Some programmer dude 11/13/2023
Unrelated, but please note that your code only works with certain character encodings (like ASCII). The C++ standard only specifies that digit characters are encoded contiguously. There are encodings still in active use where something like 'z' - j will not work.
1 upvote questionaire 11/13/2023
I added a note in the code.
0 upvotes PaulMcKenzie 11/13/2023
b can be as big as 10^18 and n can be only as big as 10^6, leaving 10^18 - 10^6 characters to generate -- Well, I know it's too late, but you shouldn't have written a single line of code knowing these are the constraints, and instead solely concentrate on the proper data structure(s) to use. Anytime you see a number in the billions, quadrillions, etc., then writing loops that iterate that many times -- you know will not work and may take hours, days, and sometimes years to actually complete the run.
0 upvotes PaulMcKenzie 11/13/2023
++occurences[word.substr(i, k)][word[i + k]]; -- std::string suffix(word.substr(word.length() - k, k)); -- Each time you call substr, a brand new std::string has to be constructed. If you are going to stick with the code similar to this, reduce or eliminate the need to call substr, so that the overhead of constructing a std::string isn't done. Maybe use std::string_view?
0 upvotes questionaire 11/13/2023
The code I provided is simplified because the original is long. I don't use any STL there besides a smart pointer. I coded the solution hoping I would come up with a better idea.
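The std::string_view suggestion from the comments could look roughly like the sketch below. The names `FollowerTable`, `countFollowers` and `mostFrequentFollower` are invented for illustration and are not part of the original code. The table keys are views into the initial word, so no std::string is allocated per substring; hashing a key still costs O(k), so this removes the allocations, not the asymptotic cost. An explicit alphabet string is used instead of 'z' - j, in line with the encoding remark above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <unordered_map>

// Explicit alphabet, so nothing depends on 'a'..'z' being contiguous.
constexpr std::string_view kAlphabet{ "abcdefghijklmnopqrstuvwxyz" };

using FollowerCounts = std::array<std::uint32_t, 26>;
using FollowerTable = std::unordered_map<std::string_view, FollowerCounts>;

// For every k-letter substring of s, count the letters that follow it.
// The keys are views into s: s must outlive the table and stay unmodified.
FollowerTable countFollowers(std::string_view s, std::size_t k)
{
    assert(k >= 1 && k < s.size());
    FollowerTable table;
    for (std::size_t i{ 0 }; i + k < s.size(); ++i)
        ++table[s.substr(i, k)][kAlphabet.find(s[i + k])];
    return table;
}

// Most frequent follower of suffix. Ties go to the earlier letter and an
// unseen suffix yields 'a'. find() avoids inserting empty entries on lookup.
char mostFrequentFollower(const FollowerTable& table, std::string_view suffix)
{
    const auto it = table.find(suffix);
    if (it == table.end())
        return kAlphabet[0];

    std::size_t best{ 0 };
    for (std::size_t c{ 1 }; c < it->second.size(); ++c)
        if (it->second[c] > it->second[best])
            best = c;
    return kAlphabet[best];
}
```

Because equality and hashing of std::string_view go by content, the lookup key may point into a different buffer than the table keys, for example into the growing word. Only the buffer the keys point into has to stay alive and unchanged.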

A:

0 upvotes Alois Christen 11/13/2023 #1

As this is homework, I won't give you the full answer, so that you can work it out yourself.

Here are some pointers:

  1. Is there a repetition that appears in the string continuation?
  2. Can you prove it?
  3. If there is a repetition, what is its maximum length?
  4. Can you detect where the repetition starts without generating the whole continuation?
  5. Can you use this information to find the continuation in the last (b + 1 - a) characters?
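One possible way to act on these pointers is sketched below; it is an illustration, not the answerer's method. It relies on an observation that follows from the rule in the question: appending the winning letter after a suffix R only raises that letter's own count, so the letter chosen for a given R never changes and the counts from the initial word are enough. The k-letter suffix therefore behaves like a state with exactly one successor, and once a state repeats, everything after it repeats too. The function name `lettersInRange` is hypothetical, and the sketch assumes a repeat shows up after a manageable number of steps; it does not prove a bound (pointers 2 and 3) and it copies a k-letter key on every step, which is wasteful for large k.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical helper: letters at 1-based positions a..b of the extended word.
std::string lettersInRange(std::string word, std::size_t k,
                           std::uint64_t a, std::uint64_t b)
{
    constexpr std::string_view alphabet{ "abcdefghijklmnopqrstuvwxyz" };
    const std::size_t n{ word.size() };
    assert(k >= 1 && k < n && n < a && a <= b);

    // Follower counts from the initial word only (see the lead-in for why).
    std::unordered_map<std::string, std::array<std::uint32_t, 26>> counts;
    for (std::size_t i{ 0 }; i + k < n; ++i)
        ++counts[word.substr(i, k)][alphabet.find(word[i + k])];

    const auto nextLetter = [&](const std::string& state) -> char {
        const auto it = counts.find(state);
        if (it == counts.end())
            return alphabet[0];                    // unseen suffix -> 'a'
        std::size_t best{ 0 };
        for (std::size_t c{ 1 }; c < it->second.size(); ++c)
            if (it->second[c] > it->second[best])  // ties keep earlier letter
                best = c;
        return alphabet[best];
    };

    // Word length at the moment each suffix first became the current state.
    std::unordered_map<std::string, std::size_t> firstSeen;
    std::size_t cycleStart{ 0 };
    std::size_t period{ 0 };

    while (word.size() < b)
    {
        const std::string state{ word.substr(word.size() - k) };
        const auto [it, inserted] = firstSeen.emplace(state, word.size());
        if (!inserted)
        {
            // Same state as before: letters from index cycleStart repeat.
            cycleStart = it->second;
            period = word.size() - cycleStart;
            break;
        }
        word += nextLetter(state);
    }

    std::string result;
    for (std::uint64_t pos{ a }; pos <= b; ++pos)
    {
        std::uint64_t idx{ pos - 1 };              // 0-based index
        if (idx >= word.size())                    // only after a cycle was found
            idx = cycleStart + (idx - cycleStart) % period;
        result += word[static_cast<std::size_t>(idx)];
    }
    return result;
}
```

On the example from the question (`lettersInRange("abaaabababa", 3, 12, 13)`) this yields "ba". On the 65-letter test word from the question with k = 2 the states enter a cycle of length 5 almost immediately, so positions up to 10^18 cost no more than small ones.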