R - Counting Frequencies Using (Logical?) Expressions
I have been teaching myself R from scratch, so please bear with me. I have found multiple ways to count observations; however, I am trying to figure out how to count frequencies using (logical?) expressions. I have a massive data set of approximately 1 million observations. The df is set up like so:
latitude  longitude  id           year  month  day  value
66.16667  -10.16667  cpuele25399  1979  1      7    0
66.16667  -10.16667  cpuele25399  1979  1      8    0
66.16667  -10.16667  cpuele25399  1979  1      9    0
There are 154 unique IDs and 154 unique lat/long pairs. I am focusing on the top 1% of values for each unique ID. For each unique ID I have calculated the 99th percentile using its associated values. I went further and calculated each ID's 99th percentile for individual years and months, e.g. for cpuele25399, 1979, month=1 the 99th percentile value is 3 (3 being the floor of the top 1%).
Using these threshold values, for each ID, each year, and each month, I need to count the number of times (per month per year) that value >= the ID's 99th percentile.
I have tried at least 100 different approaches, but I think I am fundamentally misunderstanding something, maybe in the syntax? This is the snippet of code that has gotten me the farthest:
ddply(total, c('latitude','longitude','id','year','month'),
      function(x) c(threshold=quantile(x$value, probs=.99, na.rm=TRUE),
                    frequency=nrow(x$value >= quantile(x$value, probs=.99, na.rm=TRUE))))
R throws a warning message saying that >= is not useful for factors? If anyone out there understands this convoluted message, I would be supremely grateful for your help.
You wrote: "Using these threshold values, for each ID, each year, and each month, I need to count the number of times (per month per year) that value >= the ID's 99th percentile."
Does this mean you want to:
- calculate the 99th percentile for each ID (i.e. disregarding month, year, etc.), and then
- work out the number of times you exceed this value, split by month and year for each ID?
(Note: your example code groups by lat/lon, but that is not mentioned in the question, so I am ignoring it. If you wish to add it in, just add it as a grouping variable in the appropriate places.)
In this case, you can use ddply to calculate the per-ID percentile first:
# calculate the 99th percentile for each id and attach it to every row
total <- ddply(total, .(id), transform, threshold=quantile(value, probs=.99, na.rm=TRUE))
Then you can group by (id, month, year) to see how many times you exceed it:
total <- ddply(total, .(id, month, year), summarize, freq=sum(value >= threshold))
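To see the two steps work end to end, here is a minimal sketch on a small made-up data frame. The name toy and its values are invented for illustration and are not your data; I have also added year to the grouping so the result is per month per year.

```r
library(plyr)

# Toy data, invented for illustration: two ids, one year, two months
toy <- data.frame(
  id    = rep(c("a", "b"), each = 6),
  year  = 1979,
  month = rep(rep(1:2, each = 3), times = 2),
  value = c(0, 1, 5, 4, 0, 2,   3, 3, 0, 1, 8, 9)
)

# Step 1: one threshold per id, repeated on every row of that id
toy <- ddply(toy, .(id), transform,
             threshold = quantile(value, probs = 0.99, na.rm = TRUE))

# Step 2: count the rows at or above the threshold per id/year/month.
# sum() of a logical vector counts its TRUE entries.
counts <- ddply(toy, .(id, year, month), summarize,
                freq = sum(value >= threshold))
print(counts)
```

Note that the count comes from sum() rather than nrow(): a comparison such as value >= threshold returns a logical vector, and nrow() of a vector is NULL.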
Note that summarize will return a data frame with one row per combination of .(id, month, year), i.e. it will drop the latitude/longitude columns. If you want to keep them, use transform instead of summarize, and then freq will be repeated for the different (lat, lon) for each (id, month, year) combination.
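The difference between the two is easiest to see side by side. This is a small sketch on an invented data frame (toy2 and its values are assumptions for illustration, with the threshold column already filled in):

```r
library(plyr)

# Invented example data with a precomputed per-id threshold
toy2 <- data.frame(
  id        = c("a", "a", "a", "b"),
  month     = c(1, 1, 2, 1),
  latitude  = c(66.2, 66.2, 66.2, 60.0),
  value     = c(0, 3, 1, 2),
  threshold = c(3, 3, 3, 2)
)

# summarize: one row per (id, month); latitude is dropped
by_group <- ddply(toy2, .(id, month), summarize,
                  freq = sum(value >= threshold))

# transform: every original row is kept, with freq repeated within each group
by_row <- ddply(toy2, .(id, month), transform,
                freq = sum(value >= threshold))

names(by_group)  # id, month, freq
nrow(by_row)     # same number of rows as toy2
```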
Notes on ddply:
- You can write .(id, month, year) rather than c('id', 'month', 'year') as you have done.
- If you just want to add extra columns, using something like summarize or mutate or transform lets you do it slickly without needing to put total$ in front of the column names.
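On the warning in the question: a message that >= is not meaningful for factors suggests that the value column was read in as a factor rather than as a number. That is an assumption on my part, so check it with str(total) first. If it is the case, a minimal sketch of the conversion looks like this (the vector v is invented for illustration):

```r
# Invented example: numbers stored as a factor
v <- factor(c("0", "3", "12"))

# Comparing a factor gives NA plus the warning about factors
cmp <- suppressWarnings(v >= 3)

# as.numeric() on a factor returns the internal level codes, not the numbers,
# so go through as.character() first
v_num <- as.numeric(as.character(v))

# Now the comparison behaves as expected and can be counted
n_exceed <- sum(v_num >= 3)
```

In your data frame that would be total$value <- as.numeric(as.character(total$value)), done once before the ddply calls.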