Is it possible to configure HttpClient in C# to consistently retrieve website metadata without being blocked?

Asked by: Alex Cooper  Asked: 11/17/2023  Last edited by: TylerH, Alex Cooper  Updated: 11/18/2023  Views: 34

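Q:

I have set up an HttpClient to retrieve metadata from websites so that I can build a preview when a URL is posted in a user's message on my site. This works for links to most websites, but a significant number of sites do not return the expected data.

For example, https://www.opendemocracy.net/en/uk/ - Facebook and Twitter both retrieve the metadata and display a preview, but the HTML returned to the C# HttpClient does not contain the required metadata and has the title "Attention Required":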

<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />

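I have been checking with HeyMeta, but it cannot retrieve the meta either - https://www.heymeta.com/url/www.opendemocracy.net/en/url/www.opendemocracy.net/en/uk/

This is my code, which retrieves the full HTML so that the meta can be extracted from the head: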

HttpResponseMessage response = await httpClient.GetAsync(uri).ConfigureAwait(false);
HttpContent content = response.Content;
var html = content.ReadAsStringAsync().Result;

Example of meta that is returned successfully:

<title>Forge Bridge Cottage | Coniston</title>
<meta property="og:title" content="Forge Bridge Cottage | Coniston">
<meta property="og:description" content="Coppermines Cottages | Forge Bridge Cottage | Coniston  Lake District Cottages">

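Result from HeyMeta: https://www.heymeta.com/url/www.coppermines.co.uk/accommodation/forge-bridge-cottage-coniston

How can I adjust my code to consistently retrieve metadata the way sites like Facebook and Twitter do, without being flagged as an obvious scraping bot? Is there a way to indicate in the request that I am only interested in metadata that should be publicly available?

c# .net-core metadata dotnet-httpclient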

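0 votes Charlieface 11/17/2023

Side note: you should use await rather than .Result, as in var html = await content.ReadAsStringAsync(); Also, response needs a using.

0 votes Charlieface 11/17/2023

Try adding a reasonable-looking user agent. Ultimately, the Cloudflare page is trying to block bots, which is what you are.

0 votes Alex Cooper 11/17/2023

Thanks, I have added the await and put the response in a using. The key was adding the UserAgent, as described here stackoverflow.com/questions/44076962/... (using httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (compatible; AcmeInc/1.0)") or DefaultRequestHeaders.Add("User-Agent", "C# App");)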
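
For reference, here is a minimal sketch of the per-request variant of the same idea, plus a rough check for the block page shown in the question. The names PreviewRequest, Create and LooksBlocked are hypothetical, and the title check is only a heuristic based on the "Attention Required" page above; a User-Agent is no guarantee that a site will let the request through.

```csharp
using System;
using System.Net.Http;

// Hypothetical helper; only the HttpRequestMessage and header calls are real framework APIs.
public static class PreviewRequest
{
    // Assumed identity string taken from the comment above; substitute your own app name.
    public const string UserAgent = "Mozilla/5.0 (compatible; AcmeInc/1.0)";

    // Builds a GET request that carries the User-Agent on this request only,
    // leaving the client's DefaultRequestHeaders untouched.
    public static HttpRequestMessage Create(Uri uri)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, uri);
        request.Headers.UserAgent.ParseAdd(UserAgent);
        return request;
    }

    // Heuristic only: the block page in the question contains this text in its title.
    public static bool LooksBlocked(string html)
    {
        return html.IndexOf("Attention Required", StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

The request would then be sent with httpClient.SendAsync(PreviewRequest.Create(uri)), and the preview skipped whenever LooksBlocked returns true for the returned HTML.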

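A:

0 votes Alex Cooper 11/17/2023 #1

Thanks to @Charlieface's comment, the answer was to add a UserAgent (for how, see How do I set a default User Agent on an HttpClient?)

I am a friendly bot that just wants to provide a link in a post back to the site the user referenced. My API only needs the metadata from the head, but this can be extracted from the full HTML that is returned.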

        using (HttpClient httpClient = new HttpClient())
        {
            // Identify the client; without a User-Agent the request got the Cloudflare block page
            httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("C# App");
            httpClient.Timeout = TimeSpan.FromSeconds(20);

            // Dispose the response once the body has been read
            using (HttpResponseMessage response = await httpClient.GetAsync(uri).ConfigureAwait(false))
            {
                HttpContent content = response.Content;
                var html = await content.ReadAsStringAsync().ConfigureAwait(false);

                return html;
            }
        }
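
To illustrate the extraction step mentioned above, here is a rough sketch that pulls the title and og: tags out of the returned HTML. MetaExtractor and Extract are hypothetical names, and the regular expressions assume the attribute order seen in the sample meta from the question (property first, then content, both double-quoted); a real HTML parser would be more robust for pages that deviate from that.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class MetaExtractor
{
    // Assumes <meta property="og:..." content="..."> with attributes in this order.
    private static readonly Regex OgTag = new Regex(
        "<meta\\s+property=\"(og:[^\"]+)\"\\s+content=\"([^\"]*)\"",
        RegexOptions.IgnoreCase);

    private static readonly Regex TitleTag = new Regex(
        "<title[^>]*>(.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the page title under "title" and each og: property under its own name.
    public static Dictionary<string, string> Extract(string html)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        Match title = TitleTag.Match(html);
        if (title.Success)
        {
            result["title"] = WebUtility.HtmlDecode(title.Groups[1].Value.Trim());
        }

        foreach (Match m in OgTag.Matches(html))
        {
            // Later duplicates overwrite earlier ones; values are HTML-decoded.
            result[m.Groups[1].Value] = WebUtility.HtmlDecode(m.Groups[2].Value);
        }

        return result;
    }
}
```

Calling MetaExtractor.Extract(html) on the string returned above gives a dictionary that can be looked up by "title", "og:title" or "og:description" when building the preview.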

Example link for OpenDemocracy.net created from the extracted metadata:
