Sunday, 15 February 2015

r - Use of ddply + mutate with a custom function? -




I use ddply quite frequently, historically with summarize (occasionally mutate) and only basic functions like mean(), var1 - var2, etc. I have a dataset to which I'm trying to apply a custom, more involved function, and started trying to dig into how to do this with ddply. I've got a successful solution, but I don't understand why it works like this vs. for more "normal" functions.

Related

Custom function not recognized by ddply {plyr}...
How do I pass variables to a custom function in ddply?
r-help: [R] Correct use of ddply with own function (I ended up basing my solution on this)

Here's an example data set:

library(plyr)
df <- data.frame(id = rep(letters[1:3], each = 3), value = 1:9)

Normally, I'd use ddply like so:

df_ply_1 <- ddply(df, .(id), mutate, mean = mean(value))

My visualization of this is that ddply splits df into "mini" data frames based on grouped combos of id, and then I add a new column by calling mean() on a column name that exists in df. So, my attempt to implement a function extended that idea:

# actually, my logical extension of the above was to use:
# ddply(..., mean = function(value) { mean(value) })
df_ply_2 <- ddply(df, .(id), mutate, mean = function(df) { mean(df$value) })
Error: attempt to replicate an object of type 'closure'

None of the help on custom functions seems to apply to mutate, which seems inconsistent, or at least annoying to me, as the analog of my implemented solution is:

df_mean <- function(df) {
  temp <- data.frame(mean = rep(mean(df$value), nrow(df)))
  temp
}
df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean

In-line, it looks like I have to do this:

df_ply_4 <- df
df_ply_4$mean <- ddply(df, .(id), function(x) {
  temp <- data.frame(mean = rep(mean(x$value), length(x$value)))
  temp
})$mean

Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?

Thanks for helping me "get it"!

Update after @Gregor's answer

Awesome answer, and I think I now get it. I was, indeed, confused about what mutate and summarize meant... thinking they were arguments to ddply regarding how to handle the result vs. actually being the functions themselves. So, thanks for that big insight.

Also, it really helped to understand that without mutate/summarize, I need to return a data.frame, which is the reason I have to cbind a column with the name of the column in the df that gets returned.

Lastly, if I do use mutate, it's helpful to realize that I can return a vector result and get the right result. Thus, I can do this, which I've now understood after reading your answer:

# I also caught that the code above doesn't do the right thing
# and recycles the single value returned by mean() vs. repeating it as
# I expected. Now that I know it's taking a vector, I know I need to return
# a vector the same length as my mini df
custom_mean <- function(x) {
  rep(mean(x), length(x))
}
df_ply_5 <- ddply(df, .(id), mutate, mean = custom_mean(value))

Thanks once again for your in-depth answer!

Update per @Gregor's last comment

Hmmm. I used rep(mean(x), length(x)) due to this observation of df_ply_3's result (I admit to not actually looking at it closely when I ran it the first time while making this post, I just saw that it didn't give me an error!):

df_mean <- function(x) {
  data.frame(mean = mean(x$value))
}
df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean
df_ply_3
  id value mean
1  a     1    2
2  a     2    5
3  a     3    8
4  b     4    2
5  b     5    5
6  b     6    8
7  c     7    2
8  c     8    5
9  c     9    8

So, I'm thinking that my code only ran by accident, based on the fact that I had 3 id values repeated 3 times each. The actual return was the equivalent of summarize (one row per id value), and it was then recycled. Testing that theory, it appears accurate if I update my data frame like so:

df <- data.frame(id = c(rep(letters[1:3], each = 3), "d"), value = 1:10)

I get an error when trying to use the df_ply_3 method with df_mean():

Error in `$<-.data.frame`(`*tmp*`, "mean", value = c(2, 5, 8, 10)) : replacement has 4 rows, data has 10

So, the mini df passed to df_mean returns a df where mean is the result of taking the mean of the value vector (which returns one value). So, my output was just a data.frame with one value per id group. I'm thinking the mutate way sort of "remembers" that it was passed a mini data frame, and then repeats the single output to match its length?
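The recycling described above can be reproduced with base R alone. This is only a sketch: it uses base transform() as a stand-in for plyr's mutate (an assumption made for illustration, since the two behave similarly here), and the names mini, mini_out and wrong are hypothetical.

```r
df <- data.frame(id = rep(letters[1:3], each = 3), value = 1:9)

# The "mini" data frame for id "a": values 1, 2, 3
mini <- df[df$id == "a", ]

# transform() (base R, used here as a stand-in for plyr::mutate) recycles
# the length-1 result of mean() to the number of rows in the piece.
mini_out <- transform(mini, mean = mean(value))
mini_out$mean

# Assigning a 3-value vector into a 9-row data frame is also recycled
# silently, because 9 is a multiple of 3. This mirrors the df_ply_3
# approach: no error, but the values land on the wrong rows.
wrong <- df
wrong$mean <- c(2, 5, 8)
wrong$mean
```

With 10 rows and 4 group means the lengths no longer divide evenly, which is why the second data frame raises an error instead of recycling.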

In any case, thanks for commenting on df_ply_5; indeed, if I remove the rep() bit and just return mean(x), it works great!

You're right. ddply indeed breaks your data down into mini data frames based on the grouper, and applies a function to each piece.

With ddply, all the work is done with data frames, so the .fun argument must take a (mini) data frame as input and return a data frame as output.
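As a rough illustration of that contract, the split, apply and combine steps can be written out by hand in base R. This is a sketch of the idea, not plyr's actual implementation; group_mean is a hypothetical helper, and unlike ddply it has to add the id column itself.

```r
df <- data.frame(id = rep(letters[1:3], each = 3), value = 1:9)

# Hypothetical helper: takes a mini data frame, returns a data frame.
# ddply would add the grouping column for us; here we add it by hand.
group_mean <- function(mini) {
  data.frame(id = mini$id[1], mean = mean(mini$value))
}

pieces   <- split(df, df$id)            # one mini data frame per id
results  <- lapply(pieces, group_mean)  # apply the function to each piece
combined <- do.call(rbind, results)     # stack the returned data frames
rownames(combined) <- NULL
combined
```

Any function with the same shape (data frame in, data frame out) could be dropped in place of group_mean.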

mutate and summarize are functions that fit this bill (they take and return data frames). You can view their individual help pages, or run them on a data frame outside of ddply to see this, e.g.

mutate(mtcars, mean.mpg = mean(mpg))
summarize(mtcars, mean.mpg = mean(mpg))

If you don't use mutate or summarize, that is, you only use a custom function, then your function also needs to take a (mini) data frame as an argument, and return a data frame.

If you do use mutate or summarize, any other functions you pass to ddply aren't used by ddply, they're just passed on to be used by mutate or summarize. And the functions used by mutate and summarize act on the columns of the data, not on the entire data.frame. This is why we write

ddply(mtcars, "cyl", mutate, mean.mpg = mean(mpg))

Notice that we don't pass mutate a function. We don't say ddply(mtcars, "cyl", mutate, mean). We have to tell it what to take the mean of. In ?mutate, the description of ... is "named parameters giving definitions of new columns", not anything to do with functions. (Is mean() really different from any "custom function"? No.)

Thus it doesn't work with anonymous functions, or with bare functions at all. You pass it an expression! You can define a custom function beforehand.

custom_function <- function(x) {
  mean(x + runif(length(x)))
}
ddply(mtcars, "cyl", mutate, jittered.mean.mpg = custom_function(mpg))
ddply(mtcars, "cyl", summarize, jittered.mean.mpg = custom_function(mpg))

This extends well: you can have functions that take multiple arguments, and you can give them different columns as arguments. But if you're using mutate or summarize, you have to give the other functions their arguments; you're not just passing the functions.
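For instance, a two-argument function can be handed two different columns. The sketch below uses a hypothetical helper, weighted_avg, which is not part of the original answer, and assumes it is defined at the top level so that summarize can find it.

```r
library(plyr)

# Hypothetical helper (not from the original answer): a weighted mean
# that takes two vectors, the values and their weights.
weighted_avg <- function(x, w) {
  sum(x * w) / sum(w)
}

# Each argument is an expression naming a column of the mini data frame:
# mpg supplies the values, wt supplies the weights.
res <- ddply(mtcars, "cyl", summarize, wt.mean.mpg = weighted_avg(mpg, wt))
res
```

The call weighted_avg(mpg, wt) is still an expression evaluated inside each mini data frame; the function itself is never passed to ddply.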

You seem to want to pass ddply a function that already "knows" which column to take the mean of. For that, I think you'd need to not use mutate or summarize, but you can hack your own version. For summarize-like behavior, return a data.frame with a single value; for mutate-like behavior, return the original data.frame with your value cbinded on.

mean.mpg.mutate = function(df) {
  cbind.data.frame(df, mean.mpg = mean(df$mpg))
}
mean.mpg.summarize = function(df) {
  data.frame(mean.mpg = mean(df$mpg))
}
ddply(mtcars, "cyl", mean.mpg.mutate)
ddply(mtcars, "cyl", mean.mpg.summarize)

tl;dr

Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?

Quite the opposite! mutate and summarize take data frames as inputs and kick out data frames as returns. But mutate and summarize are the functions you're passing to ddply, not mean or whatever else.

mutate and summarize are convenience functions that you'll use 99% of the time you use ddply.

If you don't use mutate/summarize, then your function needs to take and return a data frame.

If you do use mutate/summarize, then you don't pass them functions, you pass them expressions that can be evaluated with your (mini) data frame. If it's mutate, the return should be a vector to be appended to the data (recycled as necessary). If it's summarize, the return should be a single value. You don't pass a function, like mean; you pass an expression, like mean(mpg).
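These rules can be checked side by side on the question's example data. This is a small sketch; the names m, s and failed are illustrative only, and the tryCatch() wrapper is there just to record that passing a function raises an error.

```r
library(plyr)
df <- data.frame(id = rep(letters[1:3], each = 3), value = 1:9)

# mutate: every row is kept; the length-1 result of mean(value) is
# recycled within each mini data frame.
m <- ddply(df, .(id), mutate, mean = mean(value))

# summarize: one row per group.
s <- ddply(df, .(id), summarize, mean = mean(value))

# Passing a function instead of an expression fails; tryCatch() turns
# the error into TRUE so the script keeps running.
failed <- tryCatch({
  ddply(df, .(id), mutate, mean = function(d) mean(d$value))
  FALSE
}, error = function(e) TRUE)

nrow(m)
nrow(s)
failed
```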

What about dplyr?

This was written before dplyr was a thing, or at least a big thing. dplyr removes a lot of the confusion from this process because it essentially replaces the nesting of ddply with mutate or summarize as arguments with sequential functions: group_by followed by mutate or summarize. The dplyr version of my answer would be

library(dplyr)
group_by(mtcars, cyl) %>%
  mutate(mean.mpg = mean(mpg))

With the new column creation passed directly to mutate (or summarize), there isn't confusion about which function does what.

