R - Use of ddply + mutate with a custom function?
I use ddply quite frequently, historically with summarize (occasionally mutate) and only basic functions like mean(), var1 - var2, etc. I have a dataset in which I'm trying to apply a custom, more involved function, and started trying to dig into how to do this with ddply. I've got a successful solution, but I don't understand why it works like this vs. for more "normal" functions.
Related: "Custom Function not recognized by ddply {plyr}..."; "How do I pass variables to a custom function in ddply?"; r-help: "[R] Correct use of ddply with own function" (I ended up basing my solution on this).
Here's an example data set:
library(plyr)
df <- data.frame(id = rep(letters[1:3], each = 3), value = 1:9)
Normally, I'd use ddply like so:
df_ply_1 <- ddply(df, .(id), mutate, mean = mean(value))
My visualization of this is that ddply splits df into "mini" data frames based on grouped combos of id, and then I add a new column by calling mean() on a column name that exists in df. So, my attempt to implement a function extended this idea:
# actually, the logical extension of the above use would be:
# ddply(..., mean = function(value) { mean(value) })
df_ply_2 <- ddply(df, .(id), mutate, mean = function(df) { mean(df$value) })
Error: attempt to replicate an object of type 'closure'
All the help on custom functions doesn't apply mutate, but that seems inconsistent, or at least annoying to me, as the analog to my implemented solution is:
df_mean <- function(df) {
  temp <- data.frame(mean = rep(mean(df$value), nrow(df)))
  temp
}
df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean
In-line, it looks like I have to do this:
df_ply_4 <- df
df_ply_4$mean <- ddply(df, .(id), function(x) {
  temp <- data.frame(mean = rep(mean(x$value), length(x$value)))
  temp
})$mean
Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?
Thanks for helping me "get it"!
Update after @Gregor's answer
Awesome answer, and I think I now get it. I was, indeed, confused about what mutate and summarize meant... I was thinking they were arguments to ddply regarding how to handle the result vs. actually being the functions themselves. So, thanks for that big insight.
Also, it really helped to understand that without mutate/summarize, I need to return a data.frame, which is the reason I have to cbind a column and then pull it out by name from the df that gets returned.
Lastly, if I do use mutate, it's helpful to now realize I can return a vector result and get the right result. Thus, I can do this, which I've now understood after reading your answer:
# I also caught that the code above doesn't do the right thing
# and recycles the single value returned by mean() vs. repeating it as
# expected. Now that I know it's taking a vector, I know I need to return
# a vector the same length as my mini df
custom_mean <- function(x) {
  rep(mean(x), length(x))
}
df_ply_5 <- ddply(df, .(id), mutate, mean = custom_mean(value))
Thanks once again for your in-depth answer!
Update per @Gregor's last comment
Hmmm. I used rep(mean(x), length(x)) due to this observation of df_ply_3's result (I admit to not actually looking at it closely when I ran it the first time while making this post, I just saw that it didn't give me an error!):
df_mean <- function(x) {
  data.frame(mean = mean(x$value))
}
df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean
df_ply_3
  id value mean
1  a     1    2
2  a     2    5
3  a     3    8
4  b     4    2
5  b     5    5
6  b     6    8
7  c     7    2
8  c     8    5
9  c     9    8
So, I'm thinking that my code only worked by accident, based on the fact that I had 3 id values, each repeated 3 times. The actual return was the equivalent of summarize (one row per id value), and it was then recycled. Testing that theory appears to confirm it if I update my data frame like so:
df <- data.frame(id = c(rep(letters[1:3], each = 3), "d"), value = 1:10)
I get an error when trying to use the df_ply_3 method with df_mean():
Error in `$<-.data.frame`(`*tmp*`, "mean", value = c(2, 5, 8, 10)) : replacement has 4 rows, data has 10
So, the mini df passed to df_mean returns a df where mean is the result of taking the mean of the value vector (which returns one value). So, my output was just a data.frame of 3 values, one per id group. I'm thinking the mutate way sort of "remembers" that it was passed a mini data frame, and then repeats the single output to match its length?
In any case, thanks for commenting on df_ply_5; indeed, if I remove the rep() bit and just return mean(x), it works great!
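The recycling accident described in this update can be reproduced without plyr. The following is a minimal sketch in base R, assuming the original 9-row df; tapply() and ave() are stand-ins chosen for illustration and are not what ddply does internally.

```r
# Original 9-row example data: 3 ids, 3 rows each
df <- data.frame(id = rep(letters[1:3], each = 3), value = 1:9)

# One mean per group (a summarize-like result): 2, 5, 8
group_means <- as.vector(tapply(df$value, df$id, mean))

# Assigning 3 values to a 9-row column recycles them: 2, 5, 8, 2, 5, 8, ...
wrong <- df
wrong$mean <- group_means

# What was actually wanted: each row gets the mean of its own group
right <- df
right$mean <- ave(df$value, df$id)  # ave() uses mean by default

print(wrong$mean)  # 2 5 8 2 5 8 2 5 8
print(right$mean)  # 2 2 2 5 5 5 8 8 8
```

The assignment only succeeds because 9 is a multiple of 3; with the 10-row data and 4 groups it fails, which matches the error shown above.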
You're basically right. ddply indeed breaks your data down into mini data frames based on the grouper, and applies a function to each piece.
With ddply, all the work is done with data frames, so the .fun argument must take a (mini) data frame as input and return a data frame as output.
mutate and summarize are functions that fit this bill (they take and return data frames). You can view their individual help pages, or run them on a data frame outside of ddply to see this, e.g.
mutate(mtcars, mean.mpg = mean(mpg))
summarize(mtcars, mean.mpg = mean(mpg))
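To make the "take and return data frames" point concrete, here is a small sketch that checks the class and shape of both results. It assumes plyr is installed and uses the built-in mtcars data; the variable names m and s are only for illustration.

```r
library(plyr)

# mutate keeps every row and appends the new column
m <- mutate(mtcars, mean.mpg = mean(mpg))

# summarize collapses the data frame to the new column(s) only
s <- summarize(mtcars, mean.mpg = mean(mpg))

is.data.frame(m)  # TRUE
dim(m)            # 32 rows, 12 columns (11 original + mean.mpg)
is.data.frame(s)  # TRUE
dim(s)            # 1 row, 1 column
```

Both are ordinary functions from data frame to data frame, which is exactly the contract ddply expects of its .fun argument.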
If you don't use mutate or summarize, that is, you only use a custom function, then your function needs to also take a (mini) data frame as an argument, and return a data frame.
If you do use mutate or summarize, any other functions you pass to ddply aren't used by ddply, they're just passed on to be used by mutate or summarize. And the functions used by mutate and summarize act on the columns of the data, not on the entire data.frame. This is why
ddply(mtcars, "cyl", mutate, mean.mpg = mean(mpg))
Notice we don't pass mutate a function. We don't say ddply(mtcars, "cyl", mutate, mean). We have to tell it what to take the mean of. In ?mutate, the description of ... is "named parameters giving definitions of new columns", not anything to do with functions. (Is mean() really different from any "custom function"? No.)
Thus it doesn't work with anonymous functions, or with functions at all. You pass it an expression! You can define a custom function beforehand.
custom_function <- function(x) {
  mean(x + runif(length(x)))
}
ddply(mtcars, "cyl", mutate, jittered.mean.mpg = custom_function(mpg))
ddply(mtcars, "cyl", summarize, jittered.mean.mpg = custom_function(mpg))
This extends well: you can have functions that take multiple arguments, and you can give them different columns as arguments, but if you're using mutate or summarize, you have to give the other functions arguments; you're not just passing the functions.
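As a sketch of the multiple-argument case, the hypothetical helper weighted_mean() below (not part of plyr) takes two columns of mtcars. It assumes plyr is installed and that the helper is defined at the top level of the session, since that is where the expressions passed through ddply look it up.

```r
library(plyr)

# Hypothetical two-argument helper, defined at top level
weighted_mean <- function(x, w) {
  sum(x * w) / sum(w)
}

# summarize: one row per cyl group
res_sum <- ddply(mtcars, "cyl", summarize,
                 wt.mean.mpg = weighted_mean(mpg, wt))

# mutate: all 32 rows kept, group value repeated within each group
res_mut <- ddply(mtcars, "cyl", mutate,
                 wt.mean.mpg = weighted_mean(mpg, wt))

print(res_sum)
```

In both calls the thing handed to mutate or summarize is the expression weighted_mean(mpg, wt), evaluated once per mini data frame, not the function weighted_mean itself.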
You seem to want to pass ddply a function that already "knows" which column to take the mean of. For that, I think you'd need to not use mutate or summarize, but you can hack your own version. For summarize-like behavior, return a data.frame with a single value; for mutate-like behavior, return the original data.frame with your extra value cbind-ed on.
mean.mpg.mutate = function(df) {
  cbind.data.frame(df, mean.mpg = mean(df$mpg))
}
mean.mpg.summarize = function(df) {
  data.frame(mean.mpg = mean(df$mpg))
}
ddply(mtcars, "cyl", mean.mpg.mutate)
ddply(mtcars, "cyl", mean.mpg.summarize)
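If hard-coding the column feels too rigid, one possible variation is to let the hand-rolled function take the column name as an extra argument; ddply forwards additional named arguments to .fun through its ... parameter. The helper add_group_mean() and its col argument are hypothetical names introduced for this sketch, which assumes plyr is installed.

```r
library(plyr)

# Hypothetical mutate-like helper: takes a mini data frame plus a column name,
# returns the mini data frame with the group mean cbind-ed on
add_group_mean <- function(piece, col) {
  cbind.data.frame(piece, group.mean = mean(piece[[col]]))
}

# Extra named arguments after .fun are passed along to add_group_mean
res <- ddply(mtcars, "cyl", add_group_mean, col = "mpg")

head(res)
```

The function still obeys the same rule as before: data frame in, data frame out.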
tl;dr
Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?
Quite the opposite! mutate and summarize take data frames as inputs and kick out data frames as returns. But mutate and summarize are the functions you're passing to ddply, not mean or whatever else.
mutate and summarize are convenience functions that you'll use 99% of the time you use ddply.
If you don't use mutate/summarize, then your function needs to take and return a data frame.
If you do use mutate/summarize, then you don't pass them functions, you pass them expressions that can be evaluated with your (mini) data frame. If it's mutate, the return should be a vector to be appended to the data (recycled as necessary). If it's summarize, the return should be a single value. You don't pass a function, like mean; you pass an expression, like mean(mpg).
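The "closure" error from the question can be reproduced without ddply at all, which may help show why a function is not a valid column definition. This is a sketch in base R under the assumption that mutate ends up assigning the evaluated expression into a data frame column; the exact wording of the error message may differ between R versions.

```r
df <- data.frame(id = rep(letters[1:3], each = 3), value = 1:9)

# Assigning a function (a closure) as a column: R tries to recycle it
# to 9 rows and fails
msg <- tryCatch({
  df[["mean"]] <- function(d) mean(d$value)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)  # expected to mention an object of type 'closure'

# Assigning the value of an expression works; a single value is recycled
df[["mean"]] <- mean(df$value)
print(df$mean)
```

A function object has no sensible way to be stretched over 9 rows, whereas the result of calling it on a column does.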
What about dplyr?
This was written before dplyr was a thing, or at least a big thing. dplyr removes a lot of the confusion from this process because it essentially replaces the nesting of ddply with mutate or summarize as arguments with sequential functions: group_by followed by mutate or summarize. The dplyr version of my answer would be
library(dplyr)
group_by(mtcars, cyl) %>%
  mutate(mean.mpg = mean(mpg))
With the new column creation passed directly to mutate (or summarize), there isn't confusion about which function does what.
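For completeness, a sketch of the same pattern with a custom function in dplyr, assuming dplyr is installed. The helper centered() is a hypothetical example, not part of dplyr.

```r
library(dplyr)

# Hypothetical custom function that works on a column (a vector)
centered <- function(x) {
  x - mean(x)
}

# mutate: one value per row, computed within each cyl group
res_mut <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg.centered = centered(mpg))

# summarize: one row per cyl group
res_sum <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean.mpg = mean(mpg))

print(res_sum)
```

As with plyr, what goes inside mutate or summarize is an expression such as centered(mpg), not the bare function.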