如何加快字符串操作并避免缓慢循环-解网

问：

我正在编写一个代码，它使许多组合（组合在这里可能不是正确的词，字符串序列按照它们在字符串中实际存在的顺序排列）已经存在于字符串中。循环开始向 a 添加组合，但不幸的是，我的循环在处理任何超过 200 字节的文件时需要花费大量时间。我希望能够在这里处理数百 MB。List<string>

让我用最简单的方式解释我真正想要的是什么。假设我有一个字符串是“Afnan is awesome”（->主字符串），我想要的是一个字符串列表，其中包含主字符串的不同子字符串序列。例如-> A，F，N，A，N，I，S，，A，W，E，S，O，M，E。现在，这只是循环的第一次迭代。每次迭代时，我的子字符串长度都会增加，从而在第二次迭代中产生以下结果 -> Af，fn，na，n ， i，is，s ， a，aw，we，es，so，om，me。第三次迭代如下所示：Afn，fna，nan，an ，n i， is，is ，s a， aw， awe， wes， eso， som， ome。这将一直持续到我的子弦长度达到主弦长度的一半。

我的代码如下：

string data = File.ReadAllText("MyFilePath");
//Creating my dictionary
List<string> dictionary = new List<string>();
int stringLengthIncrementer = 1;
for (int v = 0; v < (data.Length / 2); v++)
{
    for (int x = 0; x < data.Length; x++)
    {
        if ((x + stringLengthIncrementer) > data.Length) break; //So index does not go out of bounds

        if (dictionary.Contains(data.Substring(x, stringLengthIncrementer)) == false) //So no repetition takes place
        { 
            dictionary.Add(data.Substring(x, stringLengthIncrementer)); //To add the substring to my List<string> -> dictionary
        }
    }
        stringLengthIncrementer++; //To increase substring length with each iteration
}

我之所以使用，是因为我最多只需要整个字符串长度的一半。请注意，我搜索整个字符串的组合，而不是其中的一半。data.Length / 2

为了进一步简化我试图做的事情 -> 假设我有一个输入字符串 =

“ABCD”

输出将是 =

a， b， c， d， ab， bc， cd，这个其余部分将被剪掉，因为它的长度超过我的主字符串长度的一半 -> //abc， bcd， abcd

我希望一些正则表达式方法可以帮助我实现这一目标。任何不包含循环的东西。还有什么比这更快的吗？一些简单、复杂度更低的代码，哪个更有效率？

更新当我使用而不是用于字典时，我没有遇到任何性能变化，并且还得到了一个 OutOfMemoryException：HashsetList<string>

C# 正则表达式字符串

@AsbahQadeer上次更新：您需要的子集是（n 是字符串长度）。所以它是一个 n**2 算法。对于 100 字节的字符串，您有 5050 个字符串靠近您的注释。但是对于 4k 字符串，您将获得 8M 结果。这是不可能的。（例如，对于 1M 字符串 => 个 500G 的唯一字符串，除了内存成本外，您还应该添加计算时间来生成这些 500G 字符串）n*(n+1)/2I want to be able to work with hundreds of MBs here.

1赞 mjwills 9/4/2017

but unfortunately, my loop takes a lot of time when dealing with any file over 200 bytes需要多长时间？你希望它需要多长时间？

答：

0赞 Sergey L 9/4/2017 #1

一般改进，您可以在代码中执行以提高性能（我不考虑是否有其他更优化的解决方案）。

计算数据。Substring（x， stringLengthIncrementer）仅一次
当您进行搜索时，使用，它会更快。SortedList
使用计算出的项目数初始化（或或或其他）。像 new List（CalucatedCapacity）。ListSortedList

或者，您可以尝试编写一种算法，在不检查重复项的情况下生成组合。

0赞 mjwills 9/4/2017 #2

可以结合使用的功能（在 NuGet 上可用）来稍微简化代码。HashSetMoreLINQBatch

public static void Main()
{
    string data = File.ReadAllText("MyFilePath");
    //string data = "Afnan is awesome";

    var dictionary = new HashSet<string>();
    for (var stringLengthIncrementer = 1; stringLengthIncrementer <= (data.Length / 2); stringLengthIncrementer++)
    {
        foreach (var skipper in Enumerable.Range(0, stringLengthIncrementer))
        {
            var batched = data.Skip(skipper).Batch(stringLengthIncrementer);

            foreach (var batch in batched)
            {
                dictionary.Add(new string(batch.ToArray()));
            }
        }
    }

    Console.WriteLine(dictionary);

    dictionary.ForEach(z => Console.WriteLine(z));
    Console.ReadLine();
}

对于此输入：

"Afnan is awesome askdjkhaksjhd askjdhaksjsdhkajd asjsdhkajshdkjahsd asksdhkajshdkjashd aksjdhkajsshd98987ad asdhkajsshd98xcx98asdjaksjsd askjdakjshcc98z98asdsad"

性能大约比当前代码快 10 倍。

1赞 Dave M 9/4/2017 #3

您可以使用 linq 来简化代码并非常容易地并行化它，但它不会快几个数量级，因为您需要在 100 MB 的文件上运行它（这很可能是不可能的）。

    var data = File.ReadAllText("MyFilePath");
    var result = Enumerable.Range(1, data.Length / 2)
        .AsParallel()
        .Select(len => new HashSet<string>(
            Enumerable.Range(0, data.Length - len + 1) //Adding the +1 here made it work perfectly
                .Select(x => data.Substring(x, len))))
        .SelectMany(t=>t)
        .ToList();

如何加快字符串操作并避免缓慢循环

How to speed up string operations and avoid slow loops

评论

评论