提问人:mandmeier 提问时间:11/11/2023 最后编辑:Brian Tompsett - 汤莱恩mandmeier 更新时间:11/17/2023 访问量:53
构建正确的 API 查询 URL,以按多个关键字从 clinicaltrials.gov 中筛选数据
Building correct API query URL to filter data from clinicaltrials.gov by multiple keywords
问:
我正在尝试从公共 API 获取一些数据,并且需要一些帮助来找出请求 URL 的正确查询语法。
下面是我的剧本。(没关系修复或改进功能,到目前为止它运行良好。
我需要的是正确的查询 URL。
我想从 clinicaltrials.gov 获取搜索词“EGFR”的临床研究列表,但缩小搜索范围,以便仅返回“OverallStatus”字段中具有“Recruiting”或“Active,not recruiting”的结果。以下是“OverallStatus”字段的可能值。
我很难弄清楚 API 文档。有一个包含“搜索表达式”和“语法”的页面,但它们没有说明如何搜索多个值。如何构建查询字符串以在字段中搜索多个可能的值?
我很欣赏任何见解!
library(tidyverse)
library(httr)
library(jsonlite)
library(glue)
get_studies_df <- function(query_url){
# get clinical studies data
res <- httr::GET(query_url)
if(!httr::status_code(res) == 200){
#if request failed return empty data frame
empty_df <- stats::setNames(data.frame(matrix(ncol = 5, nrow = 0)), c("Rank", "NCTId", "Condition", "BriefTitle", "OverallStatus"))
return(empty_df)
}
# get data from response obj
data <- httr::content(res, as="text", encoding = "UTF-8") %>%
jsonlite::fromJSON()
# prepare clinical studies data frame
studies_df <- data$StudyFieldsResponse$StudyFields %>%
# combine conditions if there is more than one
dplyr::rowwise() %>%
mutate(Condition = paste(Condition, collapse = ", ")) %>%
dplyr::ungroup()
# unlist data frame columns to show full length text
for (i in c(1:ncol(studies_df))){
studies_df[,i] <- unlist(studies_df[,i])
}
return(studies_df)
}
### here are all the query strings I tried ###
# get all studies for EGFR (WORKING, but finds 5000+ studies, way too many)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
# get "Recruiting" studies only (WORKING)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
# get "Active" studies only (WORKING)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
### I'm trying to get "Recruiting" OR "Active" studies. These are NOT WORKING ###
# returns only "Active"
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
# returns nothing
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+RANGE[Recruiting,Active]&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
# returns only "Active"
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+AREA[OverallStatus]+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
# returns everything ("Recruiting", "Completed", "Unknown status", "Active, not recruiting") ??
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+OR+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
df <- get_studies_df(query_url)
输出表:
答:
我在尝试了解如何使用经典 API 进行查询时遇到了同样的问题,而且我不喜欢他们的文档。也就是说,我发现他们的演示很有用,可以通过对搜索表达式进行一些试验和错误来构建正确的 url。
https://classic.clinicaltrials.gov/api/gui/demo/simple_study_fields
为了同时包含 RECRUITING 和 ACTIVE,我以这种方式填写了框(您必须将它们作为列表传递):
expr= EGFR 和 AREA[OverallStatus][RECRUITING,ACTIVE]
字段= NCTId,BriefTitle,Condition,OverallStatus
max_rank = 50,json格式,我得到了这个:
注意:我提取的 50 项研究中没有任何一项是“招募”,不确定我的搜索时间是否太短,而且这些研究现在都没有真正招募还是什么。我想另一种方法是获取具有最大排名的所有内容,然后使用代码按 json 的 OverallStatus 值进行过滤。
评论