Parsing the actual link from href using lxml [duplicate]

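Question:

Closed 27 days ago.

I am using Jupyter Notebook with Python 3. I am downloading a set of files from the web, mostly in bulk to local storage. The files are listed on a web page, but the file names sit in href attributes. The code I found gives me the link text but not the actual links (even though my understanding is that the code should return the links).

Here is what I have: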

import os
import requests
from lxml import html
from lxml import etree
import urllib.request
import urllib.parse
...
web_string = requests.get(url).content
parsed_content = html.fromstring(web_string)
td_list = [e for e in parsed_content.iter() if e.tag == 'td']

directive_list = []
for td_e in td_list:
   txt = td_e.text_content()
   
   directive_list.append(txt)

It is a long web page with a bunch of entries that look like <a href="file1.pdf"> text1 </a>

This code returns text1, text2, and so on, rather than file1.pdf, file2.pdf.
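The behavior described here can be reproduced on a small inline snippet. This is a minimal sketch: the two-row table in `sample` is a made-up stand-in for the real page, which is not shown in the question. It contrasts `text_content()`, which returns only the text nodes, with reading the `href` attribute of the `<a>` element.

```python
from lxml import html

# Hypothetical two-row table standing in for the real page
sample = """
<table>
  <tr><td><a href="file1.pdf"> text1 </a></td></tr>
  <tr><td><a href="file2.pdf"> text2 </a></td></tr>
</table>
"""

doc = html.fromstring(sample)
cells = [e for e in doc.iter() if e.tag == 'td']

# text_content() concatenates the text nodes under each <td>;
# attribute values such as href are not part of it
texts = [td.text_content().strip() for td in cells]

# The link target is stored in the href attribute of the <a> inside each cell
hrefs = [a.get('href') for td in cells for a in td.iter('a')]

print(texts)
print(hrefs)
```

The first list holds the visible labels and the second holds the file names, which is why iterating over `td` elements and calling `text_content()` never surfaces the links.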

How do I extract the links?

Tags: python, parsing, lxml

Answer:

import requests
from urllib.parse import urljoin
from lxml import html

url = 'YOUR_URL_HERE'  # Replace with your URL
web_string = requests.get(url).content
parsed_content = html.fromstring(web_string)

# Find all 'a' elements that carry an href, at any depth inside 'td' elements
links = parsed_content.xpath('//td//a[@href]')

directive_list = []
for link in links:
    # Read the href attribute; text_content() would return only the visible text
    href = link.get('href')

    # Relative links such as "file1.pdf" can be joined with the page URL:
    # href = urljoin(url, href)

    directive_list.append(href)

# Print the list of links
print(directive_list)
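Since the goal in the question is bulk downloading, it may help to turn the relative `href` values into absolute URLs and keep only the PDF entries. The following is a sketch under stated assumptions: `extract_links` is a hypothetical helper name, the `.pdf` filter is a guess at what the page contains, and `https://example.com/docs/index.html` is a placeholder base URL. It runs on an inline string so no network access is needed.

```python
from urllib.parse import urljoin
from lxml import html


def extract_links(page_source, base_url, suffix='.pdf'):
    """Return absolute URLs for <a> links inside <td> cells whose href ends with suffix.

    extract_links is a hypothetical helper, not part of lxml or requests.
    """
    doc = html.fromstring(page_source)
    # The XPath '//td//a/@href' selects the href attribute values directly
    hrefs = doc.xpath('//td//a/@href')
    return [urljoin(base_url, h) for h in hrefs if h.lower().endswith(suffix)]


# Made-up markup: one PDF link and one non-PDF link
sample = (
    '<table><tr>'
    '<td><a href="file1.pdf"> text1 </a></td>'
    '<td><a href="notes.html"> notes </a></td>'
    '</tr></table>'
)

# Placeholder base URL; in practice this would be the URL the page was fetched from
links = extract_links(sample, 'https://example.com/docs/index.html')
print(links)
```

Each resulting URL could then be passed to `requests.get` and written to disk. As an alternative to calling `urljoin` per link, lxml's HTML elements also provide `make_links_absolute(base_url)`, which rewrites every link in the parsed tree in place.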
