抓取 Barchart.com 财务数据

Scraping Barchart.com financial data

提问人:Adam 提问时间:9/24/2023 最后编辑:Adam 更新时间:9/24/2023 访问量:74

问:

我正在尝试抓取表格中的财务数据 https://www.barchart.com/stocks/quotes/IBM/income-statement/

使用 inspect 元素我没有看到任何 XHR/fetch 请求,但看起来数据是通过名为 global-MBHFEFVQ.js 的 js 文件生成的,但很难遵循混淆代码。

对于 barchart.com 网站上的其他一些数据,它似乎可以通过 API 调用进行抓取,如这篇文章中所述:如何从 barchart.com 中抓取这些股票代码?但我不确定损益表数据是否相同。任何帮助都是值得赞赏的,因为我是网络抓取的新手。

我将使用 php 来抓取数据,但其他语言也可以。

现在我只是获取整个页面并将我感兴趣的数据提取为子字符串,但这并不理想,因为我必须对网站中的所有其他开销进行排序,并且还要遍历 url 中的每个“reportPage”以获取每年的数据。

    $url = "https://www.barchart.com/stocks/quotes/IBM/income-statement/quarterly?reportPage=2";
    $html = file_get_contents($url);
    $date_start = stripos($html, "report__row-dates");
    $date_end = stripos($html, "</tr>", $offset = $date_start);
    $dates = substr($html, $date_start, $date_end - $date_start);
网页抓取 web-inspector

评论

2赞 Gilles Quénot 9/24/2023
你的代码在哪里?
0赞 Adam 9/24/2023
刚刚编辑了帖子以添加我当前的方法。目前的方法并不理想,因为我只是将整个网站作为一个字符串阅读,然后在其中搜索感兴趣的数据。理想情况下,有某种请求/api 调用只会返回我真正感兴趣的数据的 json。
0赞 Gilles Quénot 9/24/2023
你不认为使用适当的标签会是一个更好的问题吗?如果你不打扰你的问题,不要指望 ppl 会给你时间。php

答:

1赞 hanshenrik 9/24/2023 #1

看起来数据是通过名为 global-MBHFEFVQ.js 的 js 文件生成的

不知道你从哪里得到的,它都嵌入在 HTML 中,

<?php

declare(strict_types=1);
$html = file_get_contents('https://www.barchart.com/stocks/quotes/IBM/income-statement/annual');
//var_dump($html);die();
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$tbody = $xp->query('//tr[contains(@class,"bc-financial-report")]/parent::tbody')->item(0);
$trs = $xp->query('./tr', $tbody);
$data_keys = [];
foreach($trs->item(0)->getElementsByTagName('td') as $td){
    $data_keys[] = trim($td->textContent);
}
$data = [];
for($i=1;$i<$trs->length;++$i){
    $tr = $trs->item($i);
    $tds = $xp->query('./td', $tr);
    $row = [];
    foreach($tds as $td){
        $row[] = trim($td->textContent);
    }
    $data[] = array_combine($data_keys, $row);
}
var_export($data);

array (
  0 => 
  array (
    ' ' => 'Sales',
    '12-2022' => '60,530,000',
    '12-2021' => '57,350,000',
    '12-2020' => '55,179,000',
    '12-2019' => '57,714,000',
    '12-2018' => '79,591,000',
  ),
  1 => 
  array (
    ' ' => 'Cost of Goods',
    '12-2022' => '27,842,000',
    '12-2021' => '25,865,000',
    '12-2020' => '24,314,000',
    '12-2019' => '26,180,000',
    '12-2018' => '42,654,000',
  ),
  2 => 
  array (
    ' ' => 'Gross Profit',
    '12-2022' => '32,687,000',
    '12-2021' => '31,486,000',
    '12-2020' => '30,865,000',
    '12-2019' => '31,533,000',
    '12-2018' => '36,936,000',
  ),
  3 => 
  array (
    ' ' => 'Operating Expenses',
    '12-2022' => '25,176,000',
    '12-2021' => '25,233,000',
    '12-2020' => '26,823,000',
    '12-2019' => '24,634,000',
    '12-2018' => '24,745,000',
  ),
  4 => 
  array (
    ' ' => 'Operating Income',
    '12-2022' => '7,512,000',
    '12-2021' => '6,252,000',
    '12-2020' => '4,042,000',
    '12-2019' => '6,900,000',
    '12-2018' => '12,192,000',
  ),
  5 => 
  array (
    ' ' => 'Interest Expense',
    '12-2022' => '1,216,000',
    '12-2021' => '1,155,000',
    '12-2020' => '1,288,000',
    '12-2019' => '1,344,000',
    '12-2018' => '723,000',
  ),
  6 => 
  array (
    ' ' => 'Other Income',
    '12-2022' => '-5,140,000',
    '12-2021' => '-260,000',
    '12-2020' => '-182,000',
    '12-2019' => '1,650,000',
    '12-2018' => '-127,000',
  ),
  7 => 
  array (
    ' ' => 'Pre-tax Income',
    '12-2022' => '1,156,000',
    '12-2021' => '4,837,000',
    '12-2020' => '2,572,000',
    '12-2019' => '7,206,000',
    '12-2018' => '11,342,000',
  ),
  8 => 
  array (
    ' ' => 'Income Tax',
    '12-2022' => '-626,000',
    '12-2021' => '124,000',
    '12-2020' => '-1,360,000',
    '12-2019' => '60,000',
    '12-2018' => '2,619,000',
  ),
  9 => 
  array (
    ' ' => 'Net Income Continuous',
    '12-2022' => '1,783,000',
    '12-2021' => '4,712,000',
    '12-2020' => '3,932,000',
    '12-2019' => '7,146,000',
    '12-2018' => '8,723,000',
  ),
  10 => 
  array (
    ' ' => 'Net Income Discontinuous',
    '12-2022' => '-143,000',
    '12-2021' => '1,030,000',
    '12-2020' => '1,658,000',
    '12-2019' => '2,285,000',
    '12-2018' => '5,000',
  ),
  11 => 
  array (
    ' ' => 'Net Income',
    '12-2022' => '$1,640,000',
    '12-2021' => '$5,742,000',
    '12-2020' => '$5,590,000',
    '12-2019' => '$9,431,000',
    '12-2018' => '$8,728,000',
  ),
  12 => 
  array (
    ' ' => 'EPS Basic Total Ops',
    '12-2022' => '1.82',
    '12-2021' => '6.41',
    '12-2020' => '6.28',
    '12-2019' => '10.63',
    '12-2018' => '9.57',
  ),
  13 => 
  array (
    ' ' => 'EPS Basic Continuous Ops',
    '12-2022' => '1.97',
    '12-2021' => '5.26',
    '12-2020' => '4.42',
    '12-2019' => '8.05',
    '12-2018' => '9.56',
  ),
  14 => 
  array (
    ' ' => 'EPS Basic Discontinuous Ops',
    '12-2022' => '-0.16',
    '12-2021' => '1.15',
    '12-2020' => '1.86',
    '12-2019' => '2.58',
    '12-2018' => '0.01',
  ),
  15 => 
  array (
    ' ' => 'EPS Diluted Total Ops',
    '12-2022' => '1.80',
    '12-2021' => '6.35',
    '12-2020' => '6.23',
    '12-2019' => '10.56',
    '12-2018' => '9.52',
  ),
  16 => 
  array (
    ' ' => 'EPS Diluted Continuous Ops',
    '12-2022' => '1.95',
    '12-2021' => '5.21',
    '12-2020' => '4.38',
    '12-2019' => '8.00',
    '12-2018' => '9.51',
  ),
  17 => 
  array (
    ' ' => 'EPS Diluted Discontinuous Ops',
    '12-2022' => '-0.16',
    '12-2021' => '1.14',
    '12-2020' => '1.85',
    '12-2019' => '2.56',
    '12-2018' => '0.01',
  ),
  18 => 
  array (
    ' ' => 'EPS Diluted Before Non-Recurring Items',
    '12-2022' => '9.13',
    '12-2021' => '7.93',
    '12-2020' => '8.67',
    '12-2019' => '12.81',
    '12-2018' => '13.81',
  ),
  19 => 
  array (
    ' ' => 'EBITDA(a)',
    '12-2022' => '$12,314,000',
    '12-2021' => '$12,669,000',
    '12-2020' => '$10,737,000',
    '12-2019' => '$12,959,000',
    '12-2018' => '$16,672,000',
  ),
)

评论

0赞 Adam 10/22/2023
非常感谢。我想我希望在某个地方有一个 api 调用,我可以用它来获取所有数据的预格式化 json,而无需执行 getcontents 10 次来遍历所有年份的数据。但是您在提取我正在寻找的数据方面做得很好。