Asked by: Jose Cabrera Zuniga | Asked: 9/11/2023 | Last edited by: Uwe Keim, Jose Cabrera Zuniga | Updated: 9/11/2023 | Views: 29
Extracting and parsing information from a website using html-agility-pack
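Question:
The code below extracts data from the page at
https://www.ncbi.nlm.nih.gov/myncbi/1dAdNxivfiO5l/bibliography/public/
which contains a list of citations. My final goal is to extract that information and put it into a list of JSON objects, so that each object holds the information for one citation.
The code does iterate over every citation, but the following expression always returns the first pmid value:
citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value
It keeps printing 35491994, which is the PMID of the first citation found. Why is that? Shouldn't this value change for each object assigned to the citation variable?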
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;
using System.Text;
using System.Xml.XPath;

// https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
// https://stackoverflow.com/questions/11017583/extract-content-from-div-class-div-tag-c-sharp-regex
// https://stackoverflow.com/questions/1289756/most-elegant-way-to-query-xml-string-using-xpath

public class authorCitation
{
    public String pmid { get; set; }
}

public class processPubReferences
{
    public HtmlDocument getRawData(String ncbiId)
    {
        String url = "https://www.ncbi.nlm.nih.gov/myncbi/" + ncbiId + "/bibliography/public/";
        HtmlWeb web = new HtmlWeb();
        HtmlDocument htmlDoc = web.Load(@url);
        htmlDoc.OptionFixNestedTags = true;
        Console.WriteLine("getRawData>Data Type of htmlDoc is:");
        Console.WriteLine(htmlDoc.GetType());
        return htmlDoc;
    }

    public HtmlNodeCollection getCitations(HtmlDocument htmlDoc)
    {
        HtmlNodeCollection nodetree = htmlDoc.DocumentNode.SelectNodes("//div[@class='citation']");
        Console.WriteLine("getCitations>type of nodetree is ");
        Console.WriteLine(nodetree.GetType());
        return nodetree;
    }
}

class TestClass
{
    public static void Main(string[] args)
    {
        processPubReferences pr = new processPubReferences();
        String ncbiId = "1dAdNxivfiO5l";
        var htmlDoc = pr.getRawData(ncbiId);
        var citations = pr.getCitations(htmlDoc);
        // var pmidNode;
        foreach (var citation in citations)
        {
            Console.WriteLine("------------------------------Start Node INfo-----------------------------");
            Console.WriteLine(citation.InnerText);
            Console.WriteLine(citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value);
            Console.WriteLine("------------------------------Ende Node Info -----------------------------");
        }
    }
}
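A possible explanation, sketched under assumptions rather than verified against the live page: in XPath, an expression that starts with `//` is evaluated from the document root even when it is called on a child node, so `citation.SelectSingleNode("//input[...]")` would return the first matching input in the whole document on every iteration. Prefixing the expression with a dot (`.//input[...]`) makes it relative to the current citation node. The sketch below uses a small inline HTML string as a stand-in for the real page (the markup shape and the PMIDs 111 and 222 are made up), and the names `RelativeXPathSketch`, `SampleHtml`, and `ExtractPmids` are hypothetical helpers introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RelativeXPathSketch
{
    // Inline stand-in for the real page; this markup is an assumption, not the actual NCBI HTML.
    public const string SampleHtml = @"
<div class='citation'><input class='citation-check' pmid='111' />First citation</div>
<div class='citation'><input class='citation-check' pmid='222' />Second citation</div>";

    // Runs inputXPath against each citation div and collects the pmid attribute values.
    public static List<string> ExtractPmids(string html, string inputXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pmids = new List<string>();
        var citations = doc.DocumentNode.SelectNodes("//div[@class='citation']");
        if (citations == null)
        {
            return pmids; // SelectNodes returns null when nothing matches
        }

        foreach (var citation in citations)
        {
            var input = citation.SelectSingleNode(inputXPath);
            if (input != null)
            {
                pmids.Add(input.GetAttributeValue("pmid", ""));
            }
        }
        return pmids;
    }

    public static void Main()
    {
        // Absolute path: searches from the document root on every iteration.
        Console.WriteLine(string.Join(",", ExtractPmids(SampleHtml, "//input[@class='citation-check']")));
        // Relative path: searches only below the current citation node.
        Console.WriteLine(string.Join(",", ExtractPmids(SampleHtml, ".//input[@class='citation-check']")));
    }
}
```

With the sample markup, the absolute expression yields the first PMID for both citations, while the relative one yields one PMID per citation. If the real page has the same structure, changing `//input` to `.//input` in the loop should be enough.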
Answers: no answers yet