构建正确的 API 查询 URL,以按多个关键字从 clinicaltrials.gov 中筛选数据

Building correct API query URL to filter data from clinicaltrials.gov by multiple keywords

提问人:mandmeier 提问时间:11/11/2023 最后编辑:Brian Tompsett - 汤莱恩mandmeier 更新时间:11/17/2023 访问量:53

问:

我正在尝试从公共 API 获取一些数据,并且需要一些帮助来找出请求 URL 的正确查询语法。

下面是我的剧本。(没关系修复或改进功能,到目前为止它运行良好。

我需要的是正确的查询 URL。

我想从 clinicaltrials.gov 获取搜索词“EGFR”的临床研究列表,但缩小搜索范围,以便仅返回“OverallStatus”字段中具有“Recruiting”或“Active,not recruiting”的结果。以下是“OverallStatus”字段的可能值。

我很难弄清楚 API 文档。有一个包含“搜索表达式”和“语法”的页面,但它们没有说明如何搜索多个值。如何构建查询字符串以在字段中搜索多个可能的值?

我很欣赏任何见解!

library(tidyverse)
library(httr)
library(jsonlite)
library(glue)


get_studies_df <- function(query_url){
  
  # get clinical studies data
  res <- httr::GET(query_url)
  
  if(!httr::status_code(res) == 200){
    #if request failed return empty data frame
    empty_df <- stats::setNames(data.frame(matrix(ncol = 5, nrow = 0)), c("Rank", "NCTId", "Condition", "BriefTitle", "OverallStatus"))
    return(empty_df)
  }
  
  # get data from response obj
  data <- httr::content(res, as="text", encoding = "UTF-8") %>%
    jsonlite::fromJSON()
  
  # prepare clinical studies data frame
  studies_df <- data$StudyFieldsResponse$StudyFields %>%
    # combine conditions if there is more than one
    dplyr::rowwise() %>%
    mutate(Condition = paste(Condition, collapse = ", ")) %>%
    dplyr::ungroup()
  # unlist data frame columns to show full length text
  for (i in c(1:ncol(studies_df))){
    studies_df[,i] <- unlist(studies_df[,i])
  }
  
  return(studies_df)
  
}
 

### here are all the query strings I tried ###

# get all studies for EGFR (WORKING, but finds 5000+ studies, way too many)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

# get "Recruiting" studies only (WORKING)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

# get "Active" studies only (WORKING)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

### I'm trying to get "Recruiting" OR "Active" studies. These are NOT WORKING ###

# returns only "Active"
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

# returns nothing
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+RANGE[Recruiting,Active]&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

# returns only "Active"
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+AREA[OverallStatus]+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

# returns everything ("Recruiting", "Completed", "Unknown status", "Active, not recruiting") ??
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+OR+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"


df <- get_studies_df(query_url)

输出表:

enter image description here

R URL 查询字符串 httr

评论


答:

0赞 PattiCat 11/17/2023 #1

我在尝试了解如何使用经典 API 进行查询时遇到了同样的问题,而且我不喜欢他们的文档。也就是说,我发现他们的演示很有用,可以通过对搜索表达式进行一些试验和错误来构建正确的 url。

https://classic.clinicaltrials.gov/api/gui/demo/simple_study_fields

为了同时包含 RECRUITING 和 ACTIVE,我以这种方式填写了框(您必须将它们作为列表传递):


expr= EGFR 和 AREA[OverallStatus][RECRUITING,ACTIVE]


字段= NCTId,BriefTitle,Condition,OverallStatus


max_rank = 50,json格式,我得到了这个:

https://classic.clinicaltrials.gov/api/query/study_fields?expr=EGFR+AND+AREA%5BOverallStatus%5D%5BRECRUITING%2CACTIVE%5D&fields=NCTId%2CBriefTitle%2CCondition%2COverallStatus&min_rnk=1&max_rnk=50&fmt=json

注意:我提取的 50 项研究中没有任何一项是“招募”,不确定我的搜索时间是否太短,而且这些研究现在都没有真正招募还是什么。我想另一种方法是获取具有最大排名的所有内容,然后使用代码按 json 的 OverallStatus 值进行过滤。