Asked by: Maria Mola · Asked: 11/15/2023 · Last edited by: Maria Mola · Updated: 11/17/2023 · Views: 87
Create new variable in R based on certain values 2 years in a row
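Q:
I am trying to use the UCDP Battle-Related Deaths Dataset, called BattleDeaths_v22_1_conf, from https://ucdp.uu.se/downloads/ (see UCDP Battle-Related Deaths Dataset version 23.1).
I want to create a new variable or dataset that contains only countries with 1000 deaths in 2 consecutive years, and only after 2008. However, I end up with a variable with no observations.
I used the dataset's "country" variable (location_id) and the battle deaths variable (bd_best).
So far I have done this in R: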
library(dplyr)
filtered_data <- subset(BattleDeaths_v22_1_conf, bd_best >= 1000 & year >= 2008)
filtered_data <- filtered_data %>%
  arrange(location_inc, year) %>%
  group_by(location_inc) %>%
  mutate(sum_deaths_two_years = lag(bd_best) + bd_best)
So far, so good.
final_data <- filtered_data %>%
group_by(location_inc) %>%
filter(all(sum_deaths_two_years >= 2000))
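Now I end up with a variable with 0 observations. However, I can see in the original dataset that some observations do meet my criteria.
Ideally, I would also like a separate variable that includes countries with 1000 battle-related deaths in one year and at least 25 battle-related deaths in the previous or following year. But I don't know how to do this in R; any help would be much appreciated.
A:
0 votes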
Adriano Mello
11/15/2023
#1
Try this:
library(dplyr)
# Data ------------------------------
example_df <- tibble::tribble(
~location_inc, ~year, ~bd_best,
"Iraq", 2009L, 1036L,
"Iraq", 2010L, 989L,
"Iraq", 2011L, 864L,
"Iraq", 2012L, 565L,
"Iraq", 2013L, 1870L, # Desired
"Iraq", 2014L, 13761L, # Desired
"Iraq", 2015L, 10981L, # Desired
"Iraq", 2016L, 9775L, # Desired
"Iraq", 2017L, 10025L, # Desired
"Iraq", 2018L, 866L,
"Iraq", 2019L, 498L,
"Iraq", 2020L, 671L,
"Iraq", 2021L, 707L,
"Iraq", 2022L, 335L,
"Sudan", 2009L, 353L,
"Sudan", 2010L, 1010L, # Desired
"Sudan", 2011L, 1404L, # Desired
"Sudan", 2012L, 1173L, # Desired
"Sudan", 2013L, 594L,
"Sudan", 2014L, 856L,
"Sudan", 2015L, 1264L, # Desired
"Sudan", 2016L, 1309L, # Desired
"Sudan", 2017L, 160L,
"Sudan", 2018L, 243L,
"Sudan", 2020L, 45L,
"Sudan", 2021L, 31L,
"Sudan", 2022L, 47L)
# Code ------------------------------
example_df <- filter(
  example_df,
  .by = location_inc,
  bd_best >= 1000,
  lag(bd_best, default = -1) >= 1000 | lead(bd_best, default = -1) >= 1000)
# Outcome ---------------------------
example_df
# A tibble: 10 × 3
   location_inc  year bd_best
   <chr>        <int>   <int>
 1 Iraq          2013    1870
 2 Iraq          2014   13761
 3 Iraq          2015   10981
 4 Iraq          2016    9775
 5 Iraq          2017   10025
 6 Sudan         2010    1010
 7 Sudan         2011    1404
 8 Sudan         2012    1173
 9 Sudan         2015    1264
10 Sudan         2016    1309
Source: https://ucdp.uu.se/downloads/brd/ucdp-brd-dyadic-231-xlsx.zip
1 vote
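As for why the attempt in the question returns zero rows, a likely cause is the combination of lag() and all(). After the subset() step every remaining row already has bd_best >= 1000, so every non-missing two-year sum is at least 2000. But the first row of each group has no predecessor, so lag() gives NA there, all() then evaluates to NA for every group, and filter() drops rows where the condition is NA. Here is a minimal sketch on made-up data (the toy values and the names toy, empty and kept are illustrative only, not taken from UCDP):

```r
library(dplyr)

# Toy data: hypothetical values, one country with three consecutive years
toy <- tibble::tibble(
  location_inc = c("A", "A", "A"),
  year         = c(2010L, 2011L, 2012L),
  bd_best      = c(1500L, 1200L, 1100L)
)

# Same steps as in the question
toy <- toy %>%
  arrange(location_inc, year) %>%
  group_by(location_inc) %>%
  mutate(sum_deaths_two_years = lag(bd_best) + bd_best)

# The first row of the group has no previous row, so its sum is NA
toy$sum_deaths_two_years

# all() over NA, TRUE, TRUE is NA, and filter() treats NA like FALSE
empty <- toy %>% filter(all(sum_deaths_two_years >= 2000))
nrow(empty)  # 0

# Asking for any() qualifying pair and ignoring the NA keeps the group
kept <- toy %>% filter(any(sum_deaths_two_years >= 2000, na.rm = TRUE))
nrow(kept)   # 3
```

Note that the any() variant keeps whole countries rather than the individual qualifying years, and that lag() looks at the previous row, which is not necessarily the previous calendar year once rows have been filtered out.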
LE__Visionnaire
11/16/2023
#2
Another solution, using the packages "data.table" and "magrittr", will give you the desired output:
library(data.table)
library(magrittr)
DT # your Excel sheet, already read into R
# Convert the sheet to a data.table, select the columns we need and rename them
DT %>% setDT() %>% .[, .(Country = location_inc, year, Deaths = bd_best)] %>%
  .[Deaths >= 1000 & year >= 2008] %>% # we filter using the desired conditions
  # for each country, we flag the rows whose following year is also in the filtered table
  .[, has_next := (year + 1) %in% year, by = Country] %>%
  # we keep only the rows that start a pair of consecutive years
  .[has_next == TRUE] %>%
  # we select the countries meeting this condition
  .[, .(Country = unique(Country))] %>%
  # we display the output
  .[]
If you want to keep all three columns (country, deaths and year), leave out the line that selects only the Country column.
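Neither answer covers the second request in the question (a year with at least 1000 deaths whose previous or following year has at least 25). Below is a rough dplyr sketch of one way to do it, under some assumptions: the data has one row per country and year (if the real file has several conflicts per country-year, it would need to be summed first), dplyr is version 1.1.0 or later for the .by argument, and the toy values and the helper columns prev_ok, next_ok and qualifies are made up for illustration. It also checks that the neighbouring row really is the adjacent calendar year, since lag() and lead() only look at neighbouring rows.

```r
library(dplyr)

# Toy data: hypothetical values; note that country B has no row for 2012
toy <- tibble::tribble(
  ~location_inc, ~year, ~bd_best,
  "A", 2010L,   30L,
  "A", 2011L, 1200L,
  "A", 2012L,   10L,
  "B", 2010L,   40L,
  "B", 2011L,    5L,
  "B", 2013L, 1500L,
  "B", 2014L,   20L
)

flagged <- toy %>%
  filter(year >= 2008) %>%
  arrange(location_inc, year) %>%
  mutate(
    .by = location_inc,
    # a neighbouring row counts only if it is the adjacent calendar year
    prev_ok = !is.na(lag(year))  & lag(year)  == year - 1L & lag(bd_best)  >= 25,
    next_ok = !is.na(lead(year)) & lead(year) == year + 1L & lead(bd_best) >= 25,
    # at least 1000 deaths this year and at least 25 in an adjacent year
    qualifies = bd_best >= 1000 & (prev_ok | next_ok)
  )

result <- filter(flagged, qualifies)
result
```

In this toy example only A in 2011 qualifies: B's 2013 row has 1500 deaths, but its previous row is 2011 rather than 2012, and 2014 has fewer than 25 deaths.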