Why does my code print the same HTML link many times? - 解网

Q:

I am working on the following link-following exercise in Python (an assignment from the Coursera course on using Python to access web data). Here is the problem:

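In this assignment you will write a Python program that expands on http://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name you find.
Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Armen.html Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve. Hint: the first character of the name of the last page that you will load is: F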

So I basically copied most of the code from the given link and worked out the last part myself. It looks like this:

import urllib.request, urllib.parse, urllib.error
import collections
collections.Callable = collections.abc.Callable
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
for i in range(7):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        tag = tags[17]
        link = tag.get('href', None)
        print(link)
    url = link

I do get a result, but the output looks like this:

http://py4e-data.dr-chuck.net/known_by_Hailie.html......(it repeats what feels like a hundred times) http://py4e-data.dr-chuck.net/known_by_Hailie.html......http://py4e-data.dr-chuck.net/known_by_Felicity.html

Which part of my code is causing this? How can I stop the repetition?

python html web-scraping beautifulsoup html-parsing

A:

import urllib.request, urllib.parse, urllib.error
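import collections.abc
# Shim for older libraries that still look up collections.Callable,
# an alias that was removed in Python 3.10.
collections.Callable = collections.abc.Callable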
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
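for i in range(7):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Retrieve all of the anchor tags
    tags = soup('a')
    # Take the 18th anchor (index 17) once per page; no inner loop
    tag = tags[17]
    link = tag.get('href', None)
    print(link)
    # Follow that link on the next pass
    url = link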

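This code should print only 7 URLs, one for each page you follow. In your version, the inner loop (for tag in tags:) runs once for every anchor tag on the page, and each pass reassigns tag to tags[17], so the same link is printed once per anchor. Removing that loop leaves a single lookup and a single print per page.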
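
To see the repetition in isolation, here is a small offline sketch. The list of 20 made-up hrefs and the two helper functions are assumptions for illustration only; they stand in for the anchor tags that soup('a') returns and do not touch the network.

```python
# Stand-in for soup('a'): 20 made-up hrefs (hypothetical data, not the real page).
tags = [f"http://example.com/known_by_Person{n}.html" for n in range(1, 21)]


def links_with_inner_loop(tags, position):
    """Mimic the question's code: index inside a loop over every tag."""
    printed = []
    for tag in tags:                # runs once per anchor on the page
        tag = tags[position - 1]    # overwrites the loop variable each time
        printed.append(tag)         # so the same link is recorded on every pass
    return printed


def links_without_inner_loop(tags, position):
    """Mimic the answer's code: index once, no loop."""
    return [tags[position - 1]]


buggy = links_with_inner_loop(tags, 18)
fixed = links_without_inner_loop(tags, 18)
print(len(buggy), len(set(buggy)))  # 20 1
print(len(fixed))                   # 1
```

With 20 anchors, the looped version records the same link 20 times, which matches the pattern in the question's output: one distinct URL per page, repeated once for each anchor on that page.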

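If you would rather not hard-code the position and the repeat count, the same idea can be wrapped in a function. The following is only a sketch under some assumptions: follow_links, HrefCollector and fetch_over_http are hypothetical names, the standard library's html.parser is swapped in for BeautifulSoup so the example has no third-party dependency, and the fake_pages dictionary is made-up test data. It also assumes every href is an absolute URL, as in the output shown in the question; relative links would first need urllib.parse.urljoin.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def fetch_over_http(url):
    """Default fetcher: download the page and decode it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def follow_links(start_url, position, count, fetch=fetch_over_http):
    """Follow the link at 1-based `position`, `count` times; return the URLs followed."""
    url = start_url
    followed = []
    for _ in range(count):
        parser = HrefCollector()
        parser.feed(fetch(url))
        url = parser.hrefs[position - 1]  # one lookup per page, no inner loop
        followed.append(url)
    return followed


# Offline check with made-up pages (hypothetical data, not the course's files).
fake_pages = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="a.html">A</a> <a href="c.html">C</a>',
    "c.html": '<a href="a.html">A</a> <a href="b.html">B</a>',
}
print(follow_links("a.html", position=2, count=3, fetch=fake_pages.__getitem__))
# ['c.html', 'b.html', 'c.html']
```

For the assignment itself, the call would presumably be follow_links('http://py4e-data.dr-chuck.net/known_by_Armen.html', position=18, count=7), and the last entry of the returned list is the final page loaded. Passing the fetcher as a parameter is a design choice that lets the loop be checked without network access.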