How to avoid duplicate rows while creating dataframe from scraped data?

Asked by: Joao Coelho | Asked: 11/13/2023 | Last edited by: HedgeHog | Updated: 11/13/2023 | Views: 28

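Q:

This is just a simple script that extracts the US dollar index quote and its change. When I export the result to Excel, I get an additional row with the same values.

How can I eliminate this duplicate Excel entry?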

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.cnbc.com/quotes/.DXY'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

valores = soup.find('div', class_='QuoteStrip-lastPriceStripContainer')

cotacao = valores.find('span')

variacoes = soup.find('span', class_='QuoteStrip-changeDown')

variacao = variacoes.find('span')

print(cotacao.text)
print(variacao.text)

cotacao_dolar = []

for row in soup:
    dic = {}

    dic['Cambio'] = cotacao.text
    dic['Variacao'] = variacao.text

    cotacao_dolar.append(dic)

df = pd.DataFrame(cotacao_dolar)

df.to_csv(r'C:\teste\cotacao_dolar.csv')

Result:

[image omitted]

I tried removing the duplicates afterwards, but I would like to eliminate the extra row directly in the Python code.

Tags: python pandas dataframe web-scraping beautifulsoup

Comments

0 votes | HedgeHog | 11/13/2023
Welcome to SO. For future posts, please read How to Ask so that you can improve, edit, and format your question. Thanks.

A:

1 vote | HedgeHog | 11/13/2023 | #1

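The problem is that you are iterating over soup itself, that is, over its two top-level elements:

<class 'bs4.element.Doctype'>
<class 'bs4.element.Tag'>

Each iteration creates your dict and appends it to your list, so the same values end up in the DataFrame twice.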
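To see the effect in isolation, here is a minimal sketch that parses a small inline HTML string instead of the live page. The HTML and the value 105.63 are made-up stand-ins, not the real CNBC markup; the point is only what iterating over soup yields.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup is much larger.
html = "<!DOCTYPE html><html><body><span>105.63</span></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Iterating over soup walks its top-level children, not rows of data.
top_level = [type(node).__name__ for node in soup]
print(top_level)

# Appending the same values once per child yields one identical row per child.
rows = [{"Cambio": soup.find("span").text} for _ in soup]
print(len(rows), rows)
```

With this input there are two top-level children (the doctype and the html tag), which matches the two identical rows seen in the export.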

Remove the loop:

dic = {}
dic['Cambio'] = cotacao.text
dic['Variacao'] = variacao.text

df = pd.DataFrame([dic])
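As a quick check of the loop-free version, the sketch below builds the single-row frame from placeholder strings (standing in for cotacao.text and variacao.text) and writes it to an in-memory buffer rather than to C:\teste. Passing index=False is an optional addition, not part of the original answer; it simply keeps the pandas row index out of the file.

```python
import io

import pandas as pd

# Placeholder values standing in for cotacao.text and variacao.text.
dic = {"Cambio": "105.63", "Variacao": "-0.23"}

# A list containing one dict produces a DataFrame with exactly one row.
df = pd.DataFrame([dic])

# Write to an in-memory buffer so the sketch does not touch the file system.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the 0-based index column
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Note that the dict has to be wrapped in a list: with only scalar values, pd.DataFrame(dic) raises a ValueError because pandas cannot infer an index.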

Alternatively, the script can be simplified to:

import requests 
from bs4 import BeautifulSoup 
import pandas as pd

url = 'https://www.cnbc.com/quotes/.DXY'

response = requests.get(url) 
soup = BeautifulSoup(response.text, 'html.parser')

data = {
    e.get('class')[0]: e.text.split(' ')[0]
    for e in soup.select('.QuoteStrip-lastPriceStripContainer span[class]')
}

pd.DataFrame([data])
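The dict comprehension is dense, so here is a sketch of what it does on a small hand-written fragment. The fragment is an assumption modelled on the class names used in the question (the inner spans without a class mirror the question's variacoes.find('span') call); the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the class names are taken from the question.
html = (
    '<div class="QuoteStrip-lastPriceStripContainer">'
    '<span class="QuoteStrip-lastPrice">105.63</span>'
    '<span class="QuoteStrip-changeDown">'
    '<span>-0.23</span> <span>(-0.22%)</span>'
    '</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# span[class] matches only spans that carry a class attribute, so the
# unclassed inner spans are skipped. The first class name becomes the key,
# and the text up to the first space becomes the value.
data = {
    e.get("class")[0]: e.text.split(" ")[0]
    for e in soup.select(".QuoteStrip-lastPriceStripContainer span[class]")
}
print(data)
```

Because the keys come from the class names, the resulting columns would be named QuoteStrip-lastPrice and QuoteStrip-changeDown (or whatever classes the page uses at the time) rather than Cambio and Variacao.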