How to get the description of a download file link too?

Asked by: Rojer Brief | Asked: 10/20/2022 | Last edited by: Lance U. Matthews, Rojer Brief | Updated: 10/20/2022 | Views: 242

Q:

Example of a link:

<img src="https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg" alt="This is the description i want to get too" >

And the method I'm using to parse the links from the downloaded HTML source files:

public List<string> GetLinks(string message)
{
    List<string> list = new List<string>();
    string txt = message;
    foreach (Match item in Regex.Matches(txt, @"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?"))
    {
        if (item.Value.Contains("thumbs"))
        {
            int index1 = item.Value.IndexOf("mp4");

            string news = ReplaceLastOccurrence(item.Value, "thumbs", "videos");

            if (index1 != -1)
            {
                string result = news.Substring(0, index1 + 3);
                if (!list.Contains(result))
                {
                    list.Add(result);
                }
            }
        }
    }

    return list;
}

But this gives only the links. I also want to get the link description, which in this example is:

This is a test

Then I use it like this:

string[] files = Directory.GetFiles(@"D:\Videos\");
foreach (string file in files)
{
    foreach (string text in GetLinks(File.ReadAllText(file)))
    {
        if (!videosLinks.Contains(text))
        {
            videosLinks.Add(text);
        }
    }
}

And when downloading the links:

private async void btnStartDownload_Click(object sender, EventArgs e)
{
    if (videosLinks.Count > 0)
    {
        for (int i = 0; i < videosLinks.Count; i++)
        {
            string fileName = System.IO.Path.GetFileName(videosLinks[i]);
            await DownloadFile(videosLinks[i], @"D:\Videos\videos\" + fileName);
        }
    }
}

But I want fileName to be the description of each link.

Tags: C#, .NET, regex, winforms, html-parsing

Comments

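1 vote Lance U. Matthews 10/20/2022
You aren't matching the entire <img /> tag so, unless you know something about the source HTML, you won't be able to ensure that an alt="" belongs to that <img /> tag, or even that the URL you matched is the src="" attribute of an <img /> in the first place. Have you considered using an HTML parser?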

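Answers:

-1 votes Anirudha Gupta 10/20/2022 #1

Doing this with a regular expression will cost more CPU cycles and run slower. Use a library such as AngleSharp instead.

I tried writing your code with AngleSharp. This is how I did it.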

        // Requires the AngleSharp package and "using AngleSharp;"; must run inside an async method.
        string test = "<img src=\"https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg\" alt=\"This is the description i want to get too\" >\r\n";
        var configuration = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(configuration);
        using var doc = await context.OpenAsync(req => req.Content(test));

        // The alt attribute of the first <img> element holds the description.
        string description = doc.QuerySelector("img").Attributes["alt"].Value;

1 vote Ibrahim Timimi 10/20/2022 #2

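You can use Html Agility Pack, an HTML parser written in C# that reads and writes the DOM and supports plain XPATH or XSLT. In the example below, the description is retrieved from the alt attribute, and other attributes can be read the same way.

Implementation: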

using HtmlAgilityPack;
using System;
                    
public class Program
{
    public static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        var html = "<img src=\"https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg\" alt=\"This is the description i want to get too\" >";
        doc.LoadHtml(html);
        HtmlNode image = doc.DocumentNode.SelectSingleNode("//img");

        Console.WriteLine("Source: {0}", image.Attributes["src"].Value);
        Console.WriteLine("Description: {0}", image.Attributes["alt"].Value);
        Console.Read();
    }
}

Demo:
https://dotnetfiddle.net/nAAZDL

Output:

Source: https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg
Description: This is the description i want to get too
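
The example above reads a single tag and would throw if the page had no <img> or the tag had no alt. As a rough sketch only, here is how the same idea might be extended to every image in a page. ImageDescriptions and Extract are hypothetical names for illustration, not part of Html Agility Pack, and the markup is assumed to be already loaded into a string.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class ImageDescriptions
{
    // Hypothetical helper: returns one (src, alt) pair per <img> that has a src attribute.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null)
            return result;

        foreach (HtmlNode image in images)
        {
            // GetAttributeValue falls back to the supplied default when the attribute is missing.
            string src = image.GetAttributeValue("src", "");
            string alt = image.GetAttributeValue("alt", "");
            result.Add(new KeyValuePair<string, string>(src, alt));
        }

        return result;
    }
}
```

Each pair keeps the URL and its description together, so the description is still available at the point where the download file name is chosen.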

Comments

1 vote Lance U. Matthews 10/20/2022
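This is the simple, reliable way to do it, and it can be made even simpler by using doc.LoadHtml(html); rather than a MemoryStream. Also, although the question doesn't state what assumptions can be made about the source HTML, this should handle attribute values enclosed in single quotes, double quotes, and no quotes at all, whereas the other, regular-expression answers and the question itself require double quotes.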
0 votes Ibrahim Timimi 10/20/2022
@LanceU.Matthews: I have updated my answer to include your input. Thank you, sir.

1 vote Lance U. Matthews 10/20/2022 #3

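Ibrahim's answer shows how simple this is with a proper HTML parser, but I suppose that if you just want to extract one tag from a single page, or don't want to take on an external dependency, then a regular expression is not unreasonable, particularly if you can make certain assumptions about the HTML you're matching.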

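Do note that the pattern and code below are for demonstration purposes only and are not meant to be a robust, exhaustive tag parser; it is up to the reader to enhance them as needed to handle the various quirks and idiosyncrasies of HTML they may encounter on the wild web. For example, the pattern won't match image tags whose attribute values are enclosed in single quotes or no quotes at all, and the code throws an exception if a tag has multiple attributes with the same name.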

The way I'd do this is with a pattern that matches an <img /> tag and all of its attribute pairs...

<img(?:\s+(?<name>[a-z]+)="(?<value>[^"]*)")*\s*/?>

...which you can then query to find the attributes you care about. You can use that pattern to extract the image attributes into a Dictionary<string, string> like this...

static IEnumerable<Dictionary<string, string>> EnumerateImageTags(string input)
{
    const string pattern =
@"
<img                     # Start of tag
    (?:                  # Attribute name/value pair: noncapturing group
        \s+              # One or more whitespace characters
        (?<name>[a-z]+)  # Attribute name: one or more letters
        =                # Literal equals sign
        ""               # Literal double quote
        (?<value>[^""]*) # Attribute value: zero or more non-double quote characters
        ""               # Literal double quote
    )*                   # Zero or more attributes are allowed
    \s*                  # Zero or more whitespace characters
/?>                      # End of tag with optional forward slash
";

    foreach (Match match in Regex.Matches(input, pattern, RegexOptions.IgnorePatternWhitespace))
    {
        string[] attributeValues = match.Groups["value"].Captures
            .Cast<Capture>()
            .Select(capture => capture.Value)
            .ToArray();
        // Create a case-insensitive dictionary mapping from each capture of the "name" group to the same-indexed capture of the "value" group
        Dictionary<string, string> attributes = match.Groups["name"].Captures
            .Cast<Capture>()
            .Select((capture, index) => new KeyValuePair<string, string>(capture.Value, attributeValues[index]))
            .ToDictionary(pair => pair.Key, pair => pair.Value, StringComparer.OrdinalIgnoreCase);

        yield return attributes;
    }
}
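
The two limitations mentioned earlier (double quotes only, and an exception on repeated attribute names) could be relaxed along the following lines. This is only a sketch under assumptions, not a replacement for a real parser: TolerantImageTags and its pattern are hypothetical, attributes without a value (such as a bare hidden) still prevent a tag from matching, and keeping the first of several same-named attributes is a choice made here for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TolerantImageTags
{
    // Hypothetical variant of the pattern above. Every alternative captures into the
    // same "value" group, so the name and value captures stay index-aligned.
    private static readonly Regex ImageTag = new Regex(@"
<img\b
    (?:
        \s+
        (?<name>[a-z][a-z0-9-]*)
        \s*=\s*
        (?:
            ""(?<value>[^""]*)""        # Double-quoted value
          | '(?<value>[^']*)'           # Single-quoted value
          | (?<value>[^\s""'=<>`]+)     # Unquoted value
        )
    )*
    \s*
/?>
", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    public static IEnumerable<Dictionary<string, string>> Enumerate(string input)
    {
        foreach (Match match in ImageTag.Matches(input))
        {
            CaptureCollection names = match.Groups["name"].Captures;
            CaptureCollection values = match.Groups["value"].Captures;
            var attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

            for (int i = 0; i < names.Count; i++)
            {
                // Keep the first occurrence of a repeated name instead of throwing.
                if (!attributes.ContainsKey(names[i].Value))
                    attributes[names[i].Value] = values[i].Value;
            }

            yield return attributes;
        }
    }
}
```

Filling the dictionary in a plain loop, rather than with ToDictionary, is what avoids the duplicate-key exception.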

Given SO74133924.html...

<html>
    <body>
        <p>This image comes from https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg:
        <img src="https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg" alt="This is the description i want to get too">
        <p>This image has additional attributes on multiple lines in a self-closing tag:
        <img
            first="abc"
            src="https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg"
            empty=""
            alt="This image has additional attributes on multiple lines in a self-closing tag"
            last="xyz"
        />
        <p>This image has empty alternate text:
        <img src="https://example.com/?message=This image has empty alternate text" alt="">
        <p>This image has no alternate text:
        <img src="https://example.com/?message=This image has no alternate text">
    </body>
</html>

...you'd use the attribute dictionary for each tag like this...

static void Main()
{
    string input = File.ReadAllText("SO74133924.html");

    foreach (Dictionary<string, string> imageAttributes in EnumerateImageTags(input))
    {
        foreach (string attributeName in new string[] { "src", "alt" })
        {
            string displayValue = imageAttributes.TryGetValue(attributeName, out string attributeValue)
                ? $"\"{attributeValue}\"" : "(null)";
            Console.WriteLine($"{attributeName}: {displayValue}");
        }
        Console.WriteLine();
    }
}

...which outputs this...

src: "https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg"
alt: "This is the description i want to get too"

src: "https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg"
alt: "This image has additional attributes on multiple lines in a self-closing tag"

src: "https://example.com/?message=This image has empty alternate text"
alt: ""

src: "https://example.com/?message=This image has no alternate text"
alt: (null)
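
To connect this back to the goal in the question (using the description as the download file name), the alt text would still need to be made safe for the file system, since it can be empty or contain characters that are not allowed in file names. The following is a minimal sketch, assuming the video URL has already been derived from src as in the question. DownloadNaming, ToSafeFileName and BuildTargetPath are hypothetical names introduced here for illustration.

```csharp
using System;
using System.IO;
using System.Linq;

static class DownloadNaming
{
    // Hypothetical helper: replaces characters that are invalid in file names with '_'
    // and falls back to the supplied name when the description is missing or blank.
    public static string ToSafeFileName(string description, string fallback)
    {
        if (string.IsNullOrWhiteSpace(description))
            return fallback;

        char[] invalid = Path.GetInvalidFileNameChars();
        string cleaned = new string(description
            .Select(c => invalid.Contains(c) ? '_' : c)
            .ToArray()).Trim();

        return cleaned.Length == 0 ? fallback : cleaned;
    }

    // Hypothetical helper: keeps the extension from the URL and uses the description as the base name.
    public static string BuildTargetPath(string folder, string videoUrl, string description)
    {
        string urlName = Path.GetFileName(videoUrl);
        string extension = Path.GetExtension(urlName);
        string baseName = ToSafeFileName(description, Path.GetFileNameWithoutExtension(urlName));
        return Path.Combine(folder, baseName + extension);
    }
}
```

In the question's download loop, the result of BuildTargetPath would take the place of the hard-coded folder plus fileName. That requires videosLinks to hold the description alongside each URL. Two images with the same description would also map to the same path, so some de-duplication (for example, appending a counter) may be needed.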