Running many Selenium tasks concurrently and preventing system crashes

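Asked by: BenyaminDev  Asked: 9/25/2023  Last edited by: Theodor Zoulias, BenyaminDev  Updated: 9/27/2023  Views: 81

Q:

I'm trying to get the population rank of each country from a website using Selenium and HtmlAgilityPack (in C#, .NET 7). This code works for 10 countries, but when I request all countries, the large number of tasks makes the system slow down and crash. What is the right way to do this?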

static async void GetData()
{
    string website = "......";
    List<string> Countries = new List<string>()
    {
        // 195 Countries
    };

    List<Task<JObject>> Tasks = new List<Task<JObject>>();
    foreach (string countryName in Countries)
    {
        Tasks.Add(FetchData(website + "/" + countryName));
    }
    await Task.WhenAll(Tasks);
    foreach (JObject populationRank in Tasks.Select(task => task.Result))
    {
        WriteLine(populationRank);
    }
}

static Task<JObject> FetchData(string URL)
{
    return Task.Run(async () =>
    {
        ChromeDriver myDriver = CustomizedDriver();
        myDriver.Navigate().GoToUrl(URL);
        HtmlDocument Document = new HtmlDocument();
        Document.LoadHtml(await myDriver.GetPageSourceAsync());
        JObject Object = new JObject()
        {
            ["PopulationRank"] = Document.DocumentNode.SelectSingleNode("//div[@id='popRank']").InnerText
        };
        myDriver.Quit();
        myDriver.Dispose();
        return Object;
    });
}
static Task<string> GetPageSourceAsync(this IWebDriver driver)
{
    return Task.Run(() =>
    {
        while (true)
        {
            string PageState = (string)( (IJavaScriptExecutor)driver ).ExecuteScript("return document.readyState");
            if (PageState == "interactive" || PageState == "complete")
                return driver.PageSource;
        }
    });
}
static ChromeDriver CustomizedDriver()
{
    ChromeDriverService chromeService = ChromeDriverService.CreateDefaultService();
    chromeService.HideCommandPromptWindow = true;
    ChromeOptions options = new ChromeOptions();
    options.PageLoadStrategy = PageLoadStrategy.None;
    options.AddArgument("--headless --disable-cookies --blink-settings=imagesEnabled=false");
    return new ChromeDriver(chromeService, options);
}
C# selenium-webdriver asynchronous async-await task

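Comments

0 votes Theodor Zoulias 9/26/2023
Could you include the call site of the GetData method in the question?
0 votes BenyaminDev 9/27/2023
@TheodorZoulias I apologize for the delay. Yes, I'm using .NET 7. The code I posted is simplified; the real data is different. The website I'm using is countrymeters.info/en. You can also get the list of countries from its drop-down list.
0 votes Theodor Zoulias 9/27/2023
Since you are targeting .NET 7, have you considered using the Parallel.ForEachAsync API?
0 votes BenyaminDev 9/27/2023
@TheodorZoulias As I said in my reply to @sadbuttrue: that approach is responsive, but unfortunately it takes 10 minutes and the system lags a bit. Can the work be split across several threads, so that it somehow runs on all CPU cores?
0 votes Theodor Zoulias 9/27/2023
That is what Parallel.ForEach / Parallel.ForEachAsync is supposed to do. If you can't improve their performance by experimenting with MaxDegreeOfParallelism, then whatever you are doing is inherently not parallelizable.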
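
The experiment suggested in the last comment could be run roughly like this. This is only a sketch: `ParallelismExperiment`, `MeasureAsync` and `SimulatedFetchAsync` are hypothetical names, and the fetch is simulated with `Task.Delay` rather than a real browser, so the numbers only show how the cap behaves, not how Chrome would perform.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelismExperiment
{
    // Shared counters, kept in a class so they can be updated with Interlocked.
    private sealed class Counters
    {
        public int Running;
        public int Peak;
    }

    // Hypothetical stand-in for FetchData: waits instead of driving a browser.
    private static Task SimulatedFetchAsync(string country, CancellationToken token) =>
        Task.Delay(50, token);

    // Raises `peak` to `candidate` if `candidate` is larger (lock-free).
    private static void RecordPeak(ref int peak, int candidate)
    {
        int seen;
        while (candidate > (seen = Volatile.Read(ref peak)) &&
               Interlocked.CompareExchange(ref peak, candidate, seen) != seen)
        {
            // Another worker changed the peak first; retry against the new value.
        }
    }

    // Processes all items under the given cap and reports the elapsed time
    // and the highest number of fetches that were in flight at once.
    public static async Task<(TimeSpan Elapsed, int PeakConcurrency)> MeasureAsync(
        IReadOnlyList<string> items, int maxDegreeOfParallelism)
    {
        var counters = new Counters();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };
        var stopwatch = Stopwatch.StartNew();

        await Parallel.ForEachAsync(items, options, async (item, token) =>
        {
            int now = Interlocked.Increment(ref counters.Running);
            RecordPeak(ref counters.Peak, now);
            try
            {
                await SimulatedFetchAsync(item, token);
            }
            finally
            {
                Interlocked.Decrement(ref counters.Running);
            }
        });

        stopwatch.Stop();
        return (stopwatch.Elapsed, counters.Peak);
    }
}
```

Calling `MeasureAsync` with, say, 1, 5, 10 and 20 and comparing the elapsed times shows where extra parallelism stops helping. With real browser instances the curve would presumably flatten much earlier, once CPU, memory or the network is saturated.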

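A:

0 votes sadbuttrue 9/25/2023 #1

You should not create a large number of tasks all at once. Instead, use Parallel.ForEachAsync, which limits how many run at the same time, as in the following example: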

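// Requires: using System.Collections.Concurrent; using System.Threading.Tasks;
static async Task GetData() // returns Task (not void) so callers can await it and observe exceptions
{
    string website = "......";
    List<string> Countries = new List<string>()
    {
        // 195 Countries
    };

    // Thread-safe collection, because several workers add results at once (order is not preserved).
    var results = new ConcurrentBag<JObject>();

    var parallelOptions = new ParallelOptions()
    {
        MaxDegreeOfParallelism = 10 // Here you control how many countries in parallel to process.
    };

    // Await the loop so that all results are in before they are printed.
    await Parallel.ForEachAsync(Countries, parallelOptions, async (country, token) =>
    {
        var result = await FetchData(website + "/" + country);
        results.Add(result);
    });

    foreach (JObject populationRank in results)
    {
        WriteLine(populationRank);
    }
}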

static async Task<JObject> FetchData(string URL)
{
    // CustomizedDriver is a method, so it is called without `new`.
    ChromeDriver myDriver = CustomizedDriver();
    try
    {
        myDriver.Navigate().GoToUrl(URL);
        HtmlDocument Document = new HtmlDocument();
        Document.LoadHtml(await myDriver.GetPageSourceAsync());
        JObject Object = new JObject()
        {
            ["PopulationRank"] = Document.DocumentNode.SelectSingleNode("//div[@id='popRank']").InnerText
        };
        return Object;
    }
    finally
    {
        // Always shut the browser down, even if scraping throws, so Chrome processes are not leaked.
        myDriver.Quit();
        myDriver.Dispose();
    }
}
static Task<string> GetPageSourceAsync(this IWebDriver driver)
{
    return Task.Run(() =>
    {
        while (true)
        {
            string PageState = (string)( (IJavaScriptExecutor)driver ).ExecuteScript("return document.readyState");
            if (PageState == "interactive" || PageState == "complete")
                return driver.PageSource;
        }
    });
}
static ChromeDriver CustomizedDriver()
{
    ChromeDriverService chromeService = ChromeDriverService.CreateDefaultService();
    chromeService.HideCommandPromptWindow = true;
    ChromeOptions options = new ChromeOptions();
    options.PageLoadStrategy = PageLoadStrategy.None;
    // Pass each switch as its own argument rather than one space-separated string.
    options.AddArguments("--headless", "--disable-cookies", "--blink-settings=imagesEnabled=false");
    return new ChromeDriver(chromeService, options);
}
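
One thing the code above still does is start and quit a Chrome instance for every country, and launching a browser is expensive. A possible variation is to keep a small set of drivers alive and reuse them across countries. The sketch below is an assumption, not part of the answer: `PooledWorkers` and `RunAsync` are hypothetical names, and the code is written against any `IDisposable` resource so it can be tried without Selenium.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class PooledWorkers
{
    // Processes every item while keeping at most `workerCount` resources alive.
    // Resources are created on demand, reused across items, and disposed at the end.
    public static async Task<IReadOnlyCollection<TResult>> RunAsync<TResource, TItem, TResult>(
        IEnumerable<TItem> items,
        int workerCount,
        Func<TResource> createResource,
        Func<TResource, TItem, TResult> process)
        where TResource : IDisposable
    {
        var idle = new ConcurrentBag<TResource>();   // resources that are not in use right now
        var results = new ConcurrentBag<TResult>();  // unordered, thread-safe result store
        var options = new ParallelOptions { MaxDegreeOfParallelism = workerCount };

        try
        {
            await Parallel.ForEachAsync(items, options, (item, token) =>
            {
                // Reuse an idle resource if one exists; otherwise create a new one.
                if (!idle.TryTake(out var resource))
                    resource = createResource();
                try
                {
                    results.Add(process(resource, item));
                }
                finally
                {
                    // Hand the resource back so the next item can reuse it.
                    idle.Add(resource);
                }
                return ValueTask.CompletedTask;
            });
        }
        finally
        {
            // Dispose every pooled resource, whether or not processing failed.
            while (idle.TryTake(out var resource))
                resource.Dispose();
        }

        return results;
    }
}
```

In the scenario from the question, `createResource` would be `CustomizedDriver` and `process` would navigate to the country's URL and read the `popRank` node, so that roughly `workerCount` browsers serve all 195 countries. Whether this shortens the 10 minutes depends on how much of that time is spent starting Chrome, which would need measuring.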

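Comments

0 votes BenyaminDev 9/27/2023
Your approach is responsive, but unfortunately it takes 10 minutes and the system lags a bit. Can the work be split across several threads, so that it somehow runs on all CPU cores?
0 votes sadbuttrue 9/27/2023
You can increase the number of threads by raising the MaxDegreeOfParallelism value.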
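
Besides the thread count, the lag might also come from GetPageSourceAsync, which calls ExecuteScript in a loop with no pause between checks and no way to give up. A minimal sketch of polling with a delay and a timeout follows; `Polling` and `WaitUntilAsync` are hypothetical names, and the condition is a plain delegate so the helper does not depend on Selenium.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class Polling
{
    // Checks `condition` every `interval` until it returns true or `timeout` elapses.
    // Returns true if the condition was met, false if the timeout was reached first.
    public static async Task<bool> WaitUntilAsync(
        Func<bool> condition,
        TimeSpan interval,
        TimeSpan timeout,
        CancellationToken token = default)
    {
        var stopwatch = Stopwatch.StartNew();
        while (!condition())
        {
            if (stopwatch.Elapsed >= timeout)
                return false;

            // Release the thread between checks instead of spinning.
            await Task.Delay(interval, token);
        }
        return true;
    }
}
```

Inside GetPageSourceAsync the condition would be the existing `document.readyState` check, with an interval of perhaps 100 ms; that figure is a guess and would need tuning. Selenium's own WebDriverWait class from the support package covers a similar need and may be the simpler choice.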