Asked by: Joao Coelho  Asked: 11/13/2023  Last edited by: HedgeHog, Joao Coelho  Updated: 11/13/2023  Views: 28
How to avoid duplicate rows while creating dataframe from scraped data?
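Q:
This is just simple code that extracts the dollar quote and its change. When I export to Excel, I get an additional row with the same values.
How can I eliminate this duplicate Excel entry?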
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.cnbc.com/quotes/.DXY'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
valores = soup.find('div', class_='QuoteStrip-lastPriceStripContainer')
cotacao = valores.find('span')
variacoes = soup.find('span', class_='QuoteStrip-changeDown')
variacao = variacoes.find('span')
print(cotacao.text)
print(variacao.text)
cotacao_dolar = []
for row in soup:
    dic = {}
    dic['Cambio'] = cotacao.text
    dic['Variacao'] = variacao.text
    cotacao_dolar.append(dic)
df = pd.DataFrame(cotacao_dolar)
df.to_csv(r'C:\teste\cotacao_dolar.csv')
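Result: the exported file contains two identical rows.
I tried removing the duplicates afterwards, but I would like to eliminate the extra row directly in the Python code.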
A:
1 upvote
HedgeHog
11/13/2023
#1
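The problem is that you are iterating over soup itself, that is, over its two top-level children:
<class 'bs4.element.Doctype'>
<class 'bs4.element.Tag'>
Each iteration creates your dict and appends it to your list, so the same values end up in the list twice.
Remove the loop: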
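To see why the loop runs twice, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for the real response, not the actual CNBC markup; it only assumes a doctype followed by a single html element.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page standing in for the real response (assumption).
html = "<!DOCTYPE html><html><body><span>105.66</span></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Iterating the soup yields its top-level children, not data rows.
kinds = [type(child).__name__ for child in soup]
print(kinds)  # ['Doctype', 'Tag']

value = soup.find("span").text

# The loop body runs once per top-level child, so the same record is appended twice.
looped = []
for _ in soup:
    looped.append({"Cambio": value})

# Without the loop there is exactly one record.
single = [{"Cambio": value}]
print(len(looped), len(single))  # 2 1
```

If the real page had extra whitespace or comments at the top level, the loop would produce even more identical rows, which is another reason not to iterate over soup here.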
dic = {}
dic['Cambio'] = cotacao.text
dic['Variacao'] = variacao.text
df = pd.DataFrame([dic])
Alternatively, the script could be simplified to:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.cnbc.com/quotes/.DXY'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
data = {e.get('class')[0]:e.text.split(' ')[0] for e in soup.select('.QuoteStrip-lastPriceStripContainer span[class]')}
pd.DataFrame([data])
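As a rough illustration of what the dict comprehension collects, the sketch below runs the same selector on a small static snippet. The markup and the values are made up for the example (the class name QuoteStrip-lastPrice in particular is an assumption); the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on the class names used above (assumption).
html = (
    '<div class="QuoteStrip-lastPriceStripContainer">'
    '<span class="QuoteStrip-lastPrice">105.66</span>'
    '<span class="QuoteStrip-changeDown">'
    '<span>-0.20</span> <span>(-0.19%)</span>'
    '</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# span[class] matches only spans that carry a class attribute,
# so the inner, class-less spans are skipped.
data = {
    e.get("class")[0]: e.text.split(" ")[0]
    for e in soup.select(".QuoteStrip-lastPriceStripContainer span[class]")
}
print(data)
```

Each key is the first class name of a matched span and each value is the first space-separated token of its text, so the result is a single dict and therefore a single row once wrapped in a list for pd.DataFrame.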