Extracting and parsing information from a website using html-agility-pack

Asked by: Jose Cabrera Zuniga | Asked: 9/11/2023 | Last edited by: Uwe Keim, Jose Cabrera Zuniga | Updated: 9/11/2023 | Views: 29

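Q:

The following code extracts data from the link

https://www.ncbi.nlm.nih.gov/myncbi/1dAdNxivfiO5l/bibliography/public/

which is a web page with a list of citations. My final goal is to extract that information and put it into a list of JSON objects, so that each object holds the information for one citation.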

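Although this code extracts each citation, at the moment it only ever extracts the first pmid value when using the following expression:

citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value

It keeps displaying 35491994, which is the PMID of the first citation found. Why is this happening? Shouldn't this value change for each object assigned to the citation variable?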

using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;
using System.Text;
using System.Xml.XPath;
        
// https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
// https://stackoverflow.com/questions/11017583/extract-content-from-div-class-div-tag-c-sharp-regex
// https://stackoverflow.com/questions/1289756/most-elegant-way-to-query-xml-string-using-xpath

public class authorCitation
{
    public String pmid { get; set; }

}

public class processPubReferences{
    
    public HtmlDocument getRawData(String ncbiId)
    {
        String url = "https://www.ncbi.nlm.nih.gov/myncbi/" + ncbiId + "/bibliography/public/";
        
        HtmlWeb web = new HtmlWeb();
        HtmlDocument htmlDoc = web.Load(@url);
        htmlDoc.OptionFixNestedTags = true;

        Console.WriteLine("getRawData>Data Type of htmlDoc is:");
        Console.WriteLine(htmlDoc.GetType());  

        return htmlDoc;
    }
    public HtmlNodeCollection getCitations(HtmlDocument htmlDoc)
    {
        
        HtmlNodeCollection nodetree = htmlDoc.DocumentNode.SelectNodes("//div[@class='citation']");

        Console.WriteLine("getCitations>type of nodetree is ");
        Console.WriteLine(nodetree.GetType());
        return nodetree;
    }
}



class TestClass
{
    public static void Main(string[] args)
    {
        processPubReferences pr = new processPubReferences();
        String ncbiId = "1dAdNxivfiO5l";
        var htmlDoc = pr.getRawData(ncbiId);        
        var citations = pr.getCitations(htmlDoc);
        // var pmidNode;

        foreach (var citation in citations)
        {
            Console.WriteLine("------------------------------Start Node Info-----------------------------");
            Console.WriteLine(citation.InnerText);
            Console.WriteLine(citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value);
            
            Console.WriteLine("------------------------------End Node Info -----------------------------");

        }        
    }
}
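
A likely explanation for the repeated PMID: in XPath, an expression that starts with `//` is evaluated from the document root, even when `SelectSingleNode` is called on a child node, so every iteration finds the first matching `input` in the whole page. Prefixing the expression with a dot (`.//`) makes it relative to the current citation node. The sketch below illustrates this and also serializes the result to JSON, which is the stated end goal. It is only a sketch under assumptions: the sample markup is hypothetical and much simpler than the real NCBI page, it assumes each `input` is nested inside its `div class='citation'`, the names `AuthorCitation`, `CitationSketch`, and `ExtractCitations` are invented for illustration, and `System.Text.Json` is assumed to be available (.NET Core 3.0 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using HtmlAgilityPack;

// Hypothetical DTO for illustration; the question's class only holds a pmid.
public class AuthorCitation
{
    public string Pmid { get; set; } = "";
}

public static class CitationSketch
{
    // Returns one AuthorCitation per <div class='citation'> that contains
    // an <input class='citation-check'> carrying a pmid attribute.
    public static List<AuthorCitation> ExtractCitations(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<AuthorCitation>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var citations = doc.DocumentNode.SelectNodes("//div[@class='citation']");
        if (citations == null) return result;

        foreach (var citation in citations)
        {
            // ".//" searches only the descendants of this citation node.
            // A bare "//" would restart at the document root on every iteration.
            var input = citation.SelectSingleNode(".//input[@class='citation-check']");
            if (input == null) continue;

            var pmid = input.GetAttributeValue("pmid", "");
            if (pmid != "") result.Add(new AuthorCitation { Pmid = pmid });
        }
        return result;
    }

    public static void Main()
    {
        // Assumed, simplified markup; the real page structure may differ.
        const string html = @"<html><body>
<div class='citation'><input class='citation-check' pmid='111'>First</div>
<div class='citation'><input class='citation-check' pmid='222'>Second</div>
</body></html>";

        var citations = ExtractCitations(html);
        Console.WriteLine(JsonSerializer.Serialize(citations));
    }
}
```

With the relative expression, each citation should yield its own PMID instead of the first one repeated. If the `input` element turns out to be a sibling of the citation `div` rather than a descendant, `.//` will return null and a different relative path would be needed.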
Tags: c#, html-agility-pack

A: No answers yet