How to parse the HTML of a website with PowerShell

Question:

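I'm trying to retrieve some information from a website: I want to look for a specific tag/class and then return the text it contains (innerHTML). This is what I have so far: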

$request = Invoke-WebRequest -Uri $url -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"
$src = $request.RawContent
$HTML.write($src)


foreach ($obj in $HTML.all) { 
    $obj.getElementsByClassName('some-class-name') 
}

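I think the problem lies in converting the HTML into an HTML object, because when I pipe the results to Select-Object I see a lot of undefined properties and empty results.

So, after spending two days on this, how am I supposed to parse HTML with PowerShell?

- I can't use the IHTMLDocument2 methods, since I don't have Office installed (see "Unable to use IHTMLDocument2").
- I can't use Invoke-WebRequest without -UseBasicParsing, since PowerShell hangs and spawns additional windows while accessing the ParsedHtml property (see "parsedhtml doesn't respond anymore" and "Using Invoke-Webrequest in PowerShell 3.0 spawns a Windows Security Warning").

So, since parsing HTML with regular expressions is such a big no-no, how do I do it otherwise? Nothing seems to work.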

PowerShell, DOM, HTML parsing

$searchClass = "banana" <# in this example we parse all elements of class "banana" but you can use any class name you wish #>
$myURI = "url.com" <# replace url.com with any website you want to scrape from #>

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12 <# using TLS 1.2 is vitally important #>
# Note: .ParsedHtml is only available in Windows PowerShell, and only when -UseBasicParsing is NOT used
$req = Invoke-WebRequest -Uri $myURI
$req.ParsedHtml.getElementsByClassName($searchClass) | ForEach-Object { Write-Host $_.innerHTML }

# for extra credit we can parse all the links
$req.ParsedHtml.getElementsByTagName('a') | ForEach-Object { Write-Host $_.href } # outputs all the links

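The PSParseHTML module wraps both the HTML Agility Pack [1] and the AngleSharp .NET libraries (NuGet packages). You can use either one for HTML parsing; the latter requires -Engine AngleSharp as an opt-in. As for their respective DOMs (object models):
- The HTML Agility Pack, which is used by default, provides an object model similar to the XML DOM provided by the standard System.Xml.XmlDocument .NET type ([xml]). See this answer for an example of its use.
- AngleSharp, which requires opt-in via -Engine AngleSharp, is built on the official W3C specification and therefore provides the HTML DOM that is available in web browsers. Notably, this means that its .QuerySelector() and .QuerySelectorAll() methods can be used with the usual CSS selectors, as shown in the AngleSharp example below.
An additional advantage of this module is that it is not only cross-edition but also cross-platform; that is, you can use it in both Windows PowerShell and PowerShell (Core) 7+, and, via the latter, on Unix-like platforms as well.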
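
For the default engine, a minimal sketch of what an XPath-based lookup by class might look like. It assumes that PSParseHTML is already installed, that ConvertFrom-Html accepts HTML text through a -Content parameter, and that it returns the HTML Agility Pack document's root node when -Engine is omitted. The sample markup and the variable names are made up for illustration.

```powershell
# Assumes the PSParseHTML module is already installed.
Import-Module PSParseHTML

# Hypothetical sample markup, used here instead of a live website.
$sampleHtml = @'
<html><body>
  <ul>
    <li class="fruit">apple</li>
    <li class="fruit">banana</li>
    <li class="other">carrot</li>
  </ul>
</body></html>
'@

# Without -Engine, the HTML Agility Pack is used. The result is assumed to be
# the document's root node (an HtmlAgilityPack.HtmlNode).
$rootNode = ConvertFrom-Html -Content $sampleHtml

# Select elements by class with an XPath query and collect their trimmed text.
$fruitTexts = $rootNode.SelectNodes("//li[@class='fruit']") |
    ForEach-Object { $_.InnerText.Trim() }

$fruitTexts
```

Note that the XPath test @class='fruit' matches the attribute value exactly, so elements carrying several classes would need a contains()-style expression instead; CSS selectors in the AngleSharp engine handle that case more conveniently.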

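A self-contained example based on the AngleSharp engine that parses the home page of the English Wikipedia and extracts all HTML elements whose class attribute value is vector-menu-content-list: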

# Install the PSParseHTML module on demand
If (-not (Get-Module -ErrorAction Ignore -ListAvailable PSParseHTML)) {
  Write-Verbose "Installing PSParseHTML module for the current user..."
  Install-Module -Scope CurrentUser PSParseHTML -ErrorAction Stop
}

# Using the AngleSharp engine, parse the home page of the English Wikipedia
# into an HTML DOM.
$htmlDom = ConvertFrom-Html -Engine AngleSharp -Url https://en.wikipedia.org

# Extract all HTML elements with a 'class' attribute value of 'vector-menu-content-list'
# and output their text content (.TextContent)
$htmlDom.QuerySelectorAll('.vector-menu-content-list').TextContent
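
The same CSS-selector approach can be tried offline against an HTML string, which is handy for experimenting before pointing it at a real site. The following is only a sketch: it assumes ConvertFrom-Html accepts the markup through a -Content parameter, and the sample markup, the .menu class, and the variable names are hypothetical.

```powershell
# Assumes the PSParseHTML module is already installed.
Import-Module PSParseHTML

# Hypothetical sample markup standing in for a downloaded page.
$sampleHtml = @'
<html><body>
  <div class="menu">
    <a href="/wiki/Main_Page">Main page</a>
    <a href="/wiki/Help">Help</a>
  </div>
  <a href="/elsewhere">Outside the menu</a>
</body></html>
'@

# Parse the string into an AngleSharp HTML DOM.
$htmlDom = ConvertFrom-Html -Engine AngleSharp -Content $sampleHtml

# Use a descendant CSS selector to pick only the links inside the .menu element,
# then project each one to its text and its raw href attribute value.
$menuLinks = $htmlDom.QuerySelectorAll('.menu a') | ForEach-Object {
    [pscustomobject]@{
        Text = $_.TextContent.Trim()
        Href = $_.GetAttribute('href')
    }
}

$menuLinks
```

GetAttribute('href') is used here to read the attribute exactly as written in the markup, without resolving it against a base URL.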
