Hello Florian,
here is the somewhat safer variant of the aggregation:
Code:
# sum the deaths (independent of sex) per cause
Cols <- c("Year", "Cause", as.character(c(0:5, seq(10, 95, 5))))
result <- aggregate(. ~ Year + Cause, data=data_cause[Cols], FUN = sum)
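One plausible reason why selecting the columns first is safer: with `. ~ Year + Cause` the dot expands to every column that is not on the right-hand side, so any leftover numeric column gets summed as well. A minimal sketch on a made-up data frame (the names `toy`, `age0`, `age5` and all values are assumptions for illustration, not the WHO data):

```r
# Toy data, invented for illustration only
toy <- data.frame(
  Year  = c(2000, 2000, 2001, 2001),
  Cause = c("A", "A", "A", "A"),
  Sex   = c(1, 2, 1, 2),          # coded sex, not meant to be summed
  age0  = c(1, 2, 3, 4),
  age5  = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Restrict to grouping columns plus the columns that should be summed
Cols <- c("Year", "Cause", "age0", "age5")
res_sel <- aggregate(. ~ Year + Cause, data = toy[Cols], FUN = sum)

# Without the restriction the dot also picks up Sex and sums its codes
res_all <- aggregate(. ~ Year + Cause, data = toy, FUN = sum)

print(res_sel)
print(res_all)
```

In `res_all` a `Sex` column with meaningless sums appears, while `res_sel` only contains the age columns.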
By the way, aggregate() also has a subset= parameter ...
Code:
# get data with vector of causes (many deaths) from all years
# data_cause = data_country[data_country$Cause %in% data_manyDeathsCause, ]
# sum the deaths (independent of sex) per cause
Cols <- c("Year", "Cause", as.character(c(0:5, seq(10, 95, 5))))
result <- aggregate(. ~ Year + Cause, FUN = sum, data=data_country[Cols], subset=Cause %in% data_manyDeathsCause)
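To check that subset= really replaces the separate filtering step, here is a small sketch on invented data (`toy`, `keep` and `age0` are hypothetical names, not taken from the WHO file). Both calls should give the same result:

```r
# Toy data, invented for illustration only
toy <- data.frame(
  Year  = c(2000, 2000, 2000, 2001),
  Cause = c("A", "B", "C", "A"),
  age0  = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
keep <- c("A", "C")   # causes of interest

# Variant 1: filter inside aggregate() via subset=
via_subset <- aggregate(. ~ Year + Cause, data = toy, FUN = sum,
                        subset = Cause %in% keep)

# Variant 2: filter the data frame beforehand
via_filter <- aggregate(. ~ Year + Cause,
                        data = toy[toy$Cause %in% keep, ], FUN = sum)

print(via_subset)
print(isTRUE(all.equal(via_subset, via_filter)))
```

The subset= expression is evaluated inside the data frame, so `Cause` refers to the column, while `keep` is looked up in the environment where the formula was created.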
At the moment I am wondering whether the part
Code:
# Set all missing values (NA) to 0
data_manyDeaths[is.na(data_manyDeaths)] = 0
is strictly necessary. Perhaps it can be dropped and replaced by two extra arguments to aggregate():
aggregate(..., na.action=na.pass, na.rm=TRUE)
The na.rm=TRUE is meant to be passed on to the function sum().
Done this way, result has four more rows, in which some values are NA:
Code:
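# Load the WHO CSV data
# stringsAsFactors = TRUE keeps Cause a factor, which droplevels() below needs (R >= 4.0 defaults to FALSE)
data0 = read.csv2("https://www.tastic.cc/Mortality_2015_1999_Germany_SouthAfrica.csv", header = TRUE, sep = ",", check.names = FALSE, stringsAsFactors = TRUE)
# work on a copy so the downloaded data stays untouched
data <- data0
# Get data for one country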
data_country = subset(data, Country == "South Africa")
# remove the total deaths
data_countryYear = subset(data_country, Cause != "TOT")
# Get data where many people died in age group 30
data_manyDeaths = subset(data_countryYear, `30` > 2000)
# Sort by age group 30 (disabled)
#data_manyDeaths = data_manyDeaths[order(data_manyDeaths$`30`), ]
# extract causes for death
# data_manyDeathsCause = data_manyDeaths$Cause
# remove duplicated causes
# data_manyDeathsCause = data_manyDeathsCause[!duplicated(data_manyDeathsCause)]
# drop unused factor levels and keep the remaining causes
data_manyDeathsCause <- levels(droplevels(data_manyDeaths$Cause))
# get data with vector of causes (many deaths) from all years
# data_cause = data_country[data_country$Cause %in% data_manyDeathsCause, ]
# sum the deaths (independent of sex) per cause
Cols <- c("Year", "Cause", as.character(c(0:5, seq(10, 95, 5))))
result <- aggregate(. ~ Year + Cause, FUN = sum, data=data_country[Cols],
                    subset=Cause %in% data_manyDeathsCause, na.action=na.pass, na.rm=TRUE)
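Why rows can appear or disappear here: by default aggregate() with a formula uses na.action=na.omit, which drops every row containing an NA before summing. A minimal sketch on invented data (`toy`, `age0`, `age5` are hypothetical names) comparing the default, the na.pass/na.rm variant, and the zero-filling approach:

```r
# Toy data, invented for illustration only
toy <- data.frame(
  Year  = c(2000, 2000, 2001, 2001, 2002),
  Cause = c("A", "A", "A", "A", "A"),
  age0  = c(1, NA, 3, 4, NA),
  age5  = c(10, 20, 30, 40, 5),
  stringsAsFactors = FALSE
)

# Default: rows with any NA are removed first (na.action = na.omit),
# so the 20 in year 2000 is lost and year 2002 vanishes completely
res_default <- aggregate(. ~ Year + Cause, data = toy, FUN = sum)

# Keep all rows and let sum() ignore the NAs instead
res_pass <- aggregate(. ~ Year + Cause, data = toy, FUN = sum,
                      na.action = na.pass, na.rm = TRUE)

# Zero-filling beforehand gives the same sums
toy0 <- toy
toy0[is.na(toy0)] <- 0
res_zero <- aggregate(. ~ Year + Cause, data = toy0, FUN = sum)

print(res_default)
print(res_pass)
print(res_zero)
```

On this toy data the na.pass/na.rm variant and the zero-filled variant agree, while the default silently loses values from rows that contain an NA in any column.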
... and because I like data.table so much, here is a matching variant:
Code:
library("data.table")
Data <- fread("https://www.tastic.cc/Mortality_2015_1999_Germany_SouthAfrica.csv")
Data_country <- Data[Country == "South Africa"]
Data_manyDeathsCause <- Data_country[Cause != "TOT" & `30` > 2000, unique(Cause)]
Result <- Data_country[Cause %in% Data_manyDeathsCause, lapply(.SD, sum, na.rm=TRUE),
                       by=.(Year, Cause), .SDcols=as.character(c(0:5, seq(10, 95, 5)))]
Instead of %in% one could have used a join ...
... or one gives the data.table Data_country the key Cause:
Code:
# with a join:
Result <- Data_country[data.table(Cause=Data_manyDeathsCause), on="Cause", lapply(.SD, sum, na.rm=TRUE),
                       by=.(Year, Cause), .SDcols=as.character(c(0:5, seq(10, 95, 5)))]
# with key "Cause" set on Data_country
setkey(Data_country, "Cause")
Result <- Data_country[Data_manyDeathsCause, lapply(.SD, sum, na.rm=TRUE),
                       by=.(Year, Cause), .SDcols=as.character(c(0:5, seq(10, 95, 5)))]
Regards, Jörg