在字符串构建中可靠地检测子字符串模式-解网

问：

我正在尝试执行编程任务，但我无法满足性能要求。以下是它的完整描述：

语言模型将一个 n 个字母的单词 S 和一个参数 k（一个具有 1 ≤ k < n 的整数）作为输入，然后根据下面指定的规则生成该单词的延续。

假设我们已经有一个单词 S′，它是 S 通过某些字母的扩展，添加一个新字母的工作原理如下（另请参阅下面的示例）：我们考虑单词 S′ 的 k 字母后缀 R，并查看单词 S′ 中之前出现的所有 R（作为连续的子字符串）。然后，对于字母表中的每个字母，我们计算它在单词 S′ 中直接出现在 R 之后的次数。设 l 为出现频率最高的字母。关系的解决有利于字母表中较早出现的字母，如果 R 没有出现在单词 S′ 中的其他任何地方，那么我们假设 l = 'a'。最后，我们通过将字母 l 附加到其末尾来扩展单词 S′。

例如，设 S = “abaaabababa” 和 k = 3。我们有 S′ = S，R = aba，R 出现得更早，下一个字母是 abaa、abab、abab。最常见的字母是 b，因此我们将 b 附加到 S′。现在 S′ = “abaaabababab”， R = bab，R 出现得更早，下一个字母是 baba， baba，所以我们将 'a' 附加到 S′。

输入：第一行输入包含四个整数 n、k、a 和 b（2 ≤ n ≤ 10^6,1 ≤ k < n， n < a < b < 10^ 18，b + 1 − a ≤ 10^6）。第二行输入包含一个由小写英文字母字符（'a' – 'z'）组成的 n 个字母字符串，表示单词 S。

输出：输出应为 b + 1 − a 字符的序列，表示扩展单词 S′ 中从第 a 个到第 b 个（含）位置的字母。换句话说，假设 b − n 个字母已添加到初始单词 S 中，您希望打印这些添加字母的最后一个 b + 1 − a。

例：对于输入数据：

11 3 12 13 阿巴巴

正确的结果是：

八

我设法创建了一个正确构建字符串的程序，但在（b - n） - 我们假设添加的字母数 - 很大时未能在合理的时间内这样做。b 可以大到 10^18，n 只能大到 10^6，剩下 10^18 - 10^6 个字符要生成。另一方面，我们只关心 b - a + 1 个字符，最多只有一百万个字符，我适合及时这样做。我相信我在如何处理这个问题上的想法是错误的。我敢肯定，我们实际上不应该生成整个字符串，因为我运行的所有测试似乎都会生成一个迟早开始的模式。也许我们应该只生成所需数量的字符并旋转字符串（滚动它）？任何形式的帮助都是值得赞赏的。如果你相信你知道可以帮助我的资源，我会很高兴。我很难寻找合适的解决方案。这个任务是关于一个虚构的聊天机器人，也许一些机器学习算法可以解决问题？

这是我的简化代码（在原始代码中，我使用带有自定义哈希的基数 trie，但我相信这并不重要，两个程序都会生成正确的结果）。

[[nodiscard]] static std::string Model(
    const size_t n,
    const size_t k,
    const size_t a,
    const size_t b,
    std::string& word) noexcept(true)
{
    /* k-lettered substring -> character after the substring -> count */
    std::unordered_map<std::string, std::unordered_map<char, size_t>> occurences;

    for (size_t i{ 0U }; i < n - k; ++i)
        /* Extract a k-lettered substring at position i -> extract the letter after it -> increate count */
        ++occurences[word.substr(i, k)][word[i + k]];

    std::string lettersToPrint{};
    for (size_t i{ 0U }; i < (b - n) /* Do we really have to generate all of them? */; ++i)
    {
        /* k-lettered substring at the end of the "word". */
        std::string suffix(word.substr(word.length() - k, k));

        /* Detect the most popular character occurring after each suffix substring. */
        size_t mostPopularSuffixAppendLetterCount{ 0U };
        char mostPopularSuffixAppendLetter{ 'a' };

        for (char j{ 0 }; j < 26; ++j)
        {
            // careful! works with ASCII!!!
            const char currentCharacter{ char('z' - j) };
            const auto currentCharacterCount{ occurences[suffix][currentCharacter] };

            if (currentCharacterCount >= mostPopularSuffixAppendLetterCount)
            {
                mostPopularSuffixAppendLetterCount = currentCharacterCount;
                mostPopularSuffixAppendLetter = currentCharacter;
            }
        }

        word += mostPopularSuffixAppendLetter;

        if ((b - n) - i <= b + 1 - a)
            lettersToPrint += mostPopularSuffixAppendLetter;
    }

    return lettersToPrint;
}

int main()
{
    size_t n{ 65 }, k{ 2 }, a{ 67 }, b{ 783 };
    std::string word{ "ffdedffddfdefedddfdfddfddeedfdededddffeefffdeedfeddeddddfeefdffee" };

    assert(Model(n, k, a, b, word) == "dfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddf");

    return 0;
}

C++ 字符串算法数据结构循环

b can be as big as 10^18 and n can be only as big as 10^6, leaving 10^18 - 10^6 characters to generate -- Well, I know it's too late, but you shouldn't have written a single line of code knowing these are the constraints, and instead solely concentrate on the proper data structure(s) to use. Anytime you see a number in the billions, quadrillions, etc., then writing loops that iterate that many times -- you know will not work and may take hours, days, and sometimes years to actually complete the run.

0赞 PaulMcKenzie 11/13/2023

++occurences[word.substr(i, k)][word[i + k]]; -- std::string suffix(word.substr(word.length() - k, k)); -- Each time you call , a brand new has to be constructed. If you are going to stick with the code similar to this, reduce or eliminate the need to call , so that the overhead of constructing a isn't done. Maybe use ?substrstd::stringsubstrstd::stringstd::string_view

0赞 questionaire 11/13/2023

The code i provided is simplified because the original is long. I don't use any stl there besides a smart pointer. I coded the solution hoping I would come up with a better idea.

答：

0赞 Alois Christen 11/13/2023 #1

As this is for homeworks, I won't give you the full answer, to let you understand by yourself.

Here are some pointers :

Is there a repetition that appears in the string continuation ?
Can you prove it ?
If there is a repetition, what it's maximum length ?
Can you detect where the repetition starts without generating the whole continuation ?
Can you use these informations to find the continuation in the last (b + 1 - a) characters ?

上一个：如何生成一个没有死胡同的迷宫？[关闭]

下一个：路径总和 - 我哪里出错了？

在字符串构建中可靠地检测子字符串模式

Reliably detecting a substring pattern in string building

评论