r - Fill missing sequence values with dplyr -
i have data frame missing values "snap_id". i'd fill in missing values floating point values based on sequence previous non-missing value (lag()?). achieve using dplyr if possible.
assumptions:
- there never missing data first or last row i'm generating missing dates based on missing days between min , max of data set
- there can multiple gaps in data set
current data:
end snap_id 1 2015-06-26 12:59:00 365 2 2015-06-26 13:59:00 366 3 2015-06-27 00:01:00 na 4 2015-06-27 23:00:00 na 5 2015-06-28 00:01:00 na 6 2015-06-28 23:00:00 na 7 2015-06-29 09:00:00 367 8 2015-06-29 09:59:00 368
what want achieve:
end snap_id 1 2015-06-26 12:59:00 365.0 2 2015-06-26 13:59:00 366.0 3 2015-06-27 00:01:00 366.1 4 2015-06-27 23:00:00 366.2 5 2015-06-28 00:01:00 366.3 6 2015-06-28 23:00:00 366.4 7 2015-06-29 09:00:00 367.0 8 2015-06-29 09:59:00 368.0
as data frame:
df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260, 1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "utc", class = c("posixct", "posixt")), snap_id = c(365, 366, na, na, na, na, 367, 368)), .names = c("end", "snap_id"), row.names = c(na, -8l), class = "data.frame")
this attempt @ achieving goal, works first missing value:
df %>% arrange(end) %>% mutate(snap_id=ifelse(is.na(snap_id),lag(snap_id)+0.1,snap_id)) end snap_id 1 2015-06-26 12:59:00 365.0 2 2015-06-26 13:59:00 366.0 3 2015-06-27 00:01:00 366.1 4 2015-06-27 23:00:00 na 5 2015-06-28 00:01:00 na 6 2015-06-28 23:00:00 na 7 2015-06-29 09:00:00 367.0 8 2015-06-29 09:59:00 368.0
the outstanding answer @mathematical.coffee below:
df %>% arrange(end) %>% group_by(tmp=cumsum(!is.na(snap_id))) %>% mutate(snap_id=snap_id[1] + 0.1*(0:(length(snap_id)-1))) %>% ungroup() %>% select(-tmp)
edit: new version works number of na runs. 1 doesn't need zoo
, either.
first, notice tmp=cumsum(!is.na(snap_id))
groups snap_id
s such groups of same tmp
consist of 1 non-na value followed run of na values.
then group variable , add .1 first snap_id fill out nas:
df %>% arrange(end) %>% group_by(tmp=cumsum(!is.na(snap_id))) %>% mutate(snap_id=snap_id[1] + 0.1*(0:(length(snap_id)-1))) end snap_id tmp 1 2015-06-26 12:59:00 365.0 1 2 2015-06-26 13:59:00 366.0 2 3 2015-06-27 00:01:00 366.1 2 4 2015-06-27 23:00:00 366.2 2 5 2015-06-28 00:01:00 366.3 2 6 2015-06-28 23:00:00 366.4 2 7 2015-06-29 09:00:00 367.0 3 8 2015-06-29 09:59:00 368.0 4
then can drop tmp
column afterwards (add %>% select(-tmp)
end).
edit: old version doesn't work subsequent runs of na
s.
if aim fill each na previous value + 0.1, can use zoo
's na.locf
(which fills each na
previous value), along cumsum(is.na(snap_id))*0.1
add 0.1.
library(zoo) df %>% arrange(end) %>% mutate(snap_id=ifelse(is.na(snap_id), na.locf(snap_id) + cumsum(is.na(snap_id))*0.1, snap_id))
Comments
Post a Comment