Friday, 15 July 2011

R, select rows according to the rank of a certain column




I have the R data frame below:

    name  score
    marry    98
    marry    77
    marry    87
    marry    96
    mark     99
    mark     44
    mark     79
    john     87
    john     77

For each name, I want to select the rows with the two highest scores. The result should be:

    name  score
    marry    98
    marry    96
    mark     99
    mark     79
    john     87
    john     77

Could anyone help? Many thanks!

Here's a possible base R approach:

    mydf[with(mydf, ave(-score, name, FUN = order)) %in% c(1, 2), ]
    #    name score
    # 1 marry    98
    # 4 marry    96
    # 5  mark    99
    # 7  mark    79
    # 8  john    87
    # 9  john    77
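One caveat worth keeping in mind about the one-liner above: `order` returns the permutation that sorts a vector, not the rank of each element, so using it as a per-row rank only works when the two happen to coincide, as they do for this sample. Below is a minimal sketch of a rank-based variant, assuming ties should be broken by row order; the helper name `rank_desc` is made up for illustration and is not part of the original answer.

```r
# Sample data from the question.
mydf <- data.frame(
  name  = c("marry", "marry", "marry", "marry", "mark", "mark", "mark", "john", "john"),
  score = c(98, 77, 87, 96, 99, 44, 79, 87, 77)
)

# order() and rank() differ in general:
x <- c(10, 30, 20)
order(-x)  # 2 3 1 (positions of the elements in sorted order)
rank(-x)   # 3 1 2 (rank of each element; 1 = highest score)

# Hypothetical helper: rank within a group, breaking ties by row order.
rank_desc <- function(x) rank(x, ties.method = "first")

# Keep the rows whose within-name rank of -score is 1 or 2.
top2 <- mydf[with(mydf, ave(-score, name, FUN = rank_desc)) <= 2, ]
print(top2)
```

With `ties.method = "first"`, exactly two rows are kept per name even when scores are tied; a different tie rule would need a different `ties.method`.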

For the curious, on timings, here's a little test...

Two sample datasets, both with 1M rows and 2 columns: one with 1,000 possible values for "name", the other with 10,000 possible values.

    set.seed(1)
    df1 <- data.frame(
      name = sample(1000, 1000000, TRUE),
      score = sample(0:100, 1000000, TRUE)
    )
    df2 <- data.frame(
      name = sample(10000, 1000000, TRUE),
      score = sample(0:100, 1000000, TRUE)
    )

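The functions to benchmark (I'll try to add "dplyr" later, after I reinstall it):

    library(data.table)  # needed for fun2 and fun3

    # Base R: per-name ordering via ave()
    fun1 <- function(mydf) {
      mydf[with(mydf, ave(-score, name, FUN = order)) %in% c(1, 2), ]
    }

    # data.table: sort by descending score, then take the first 2 rows per name
    fun2 <- function(mydf) {
      as.data.table(mydf)[order(-score), .SD[1:2], by = name]
    }

    # data.table: sort in place with setorder(), then head() per name
    fun3 <- function(mydf) {
      df <- as.data.table(mydf)
      setorder(df, -score)[, head(.SD, 2), by = name]
    }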

The benchmarking:

    library(microbenchmark)
    microbenchmark(fun1(df1), fun2(df1), fun3(df1),
                   fun1(df2), fun2(df2), fun3(df2), times = 20)
    # Unit: milliseconds
    #      expr        min         lq       mean     median         uq       max neval
    # fun1(df1)  502.76809  513.98317  569.47883  597.90488  603.34458  686.4302    20
    # fun2(df1)  733.12544  741.18777  796.67106  822.60824  828.88449  839.3837    20
    # fun3(df1)   87.80581   93.07012   95.34281   95.56374   97.49608  101.7991    20
    # fun1(df2)  672.60241  764.10237  764.60365  772.33959  780.14679  799.3505    20
    # fun2(df2) 6338.14881 6360.42621 6407.66675 6412.99278 6451.75626 6479.2681    20
    # fun3(df2)  354.24119  366.47396  382.58666  369.78597  374.01897  468.9197    20

