Tuesday, 15 September 2015

r - K-means clustering of a variable with multiple values




I have the sample data below, taken from a big data set, in which each participant is given a score under multiple conditions.

    participant <- c("p1", "p1", "p2", "p2", "p3", "p3")
    condition <- c("c1", "c2", "c1", "c2", "c1", "c2")
    score <- c(4, 5, 5, 7, 8, 2)
    t <- data.frame(participant, condition, score)

I am trying to use k-means clustering to split the participants into different groups. Is there a way to do it, considering that the condition is not numeric?

Thanks!

@anony has the right idea. You do have numeric data: there is (evidently) a c1 score and a c2 score for each participant. So you need to convert your data from "long" format (data in a single column (score) with a second column (condition) differentiating the scores) to "wide" format (scores under different conditions in separate columns). Then you can run kmeans clustering on the scores to group the participants.

Here is how to do that in R, using a larger example to demonstrate the clusters.
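As a minimal sketch of just the long-to-wide step, here is the six-row sample from the question converted with base R's reshape(). The names long and wide are illustrative only, and the sub() call is merely a cosmetic rename of the columns reshape() produces.

```r
# The six-row sample from the question, in "long" format
long <- data.frame(
  participant = c("p1", "p1", "p2", "p2", "p3", "p3"),
  condition   = c("c1", "c2", "c1", "c2", "c1", "c2"),
  score       = c(4, 5, 5, 7, 8, 2)
)

# Long -> wide: one row per participant, one score column per condition
wide <- reshape(long, idvar = "participant", timevar = "condition",
                direction = "wide")

# reshape() names the new columns score.c1 and score.c2; shorten them
names(wide) <- sub("^score\\.", "", names(wide))
rownames(wide) <- NULL
wide
```

With only three participants there is nothing meaningful to cluster, so this only shows the shape of the table that kmeans() expects: one row per participant and one numeric column per condition.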

    # example: 100 participants in 3 clusters
    set.seed(1)   # reproducible example
    t <- data.frame(participant = rep(paste0("p", sprintf("%03i", 1:100)), each = 2),
                    condition   = paste0("c", 1:2),
                    score       = c(rpois(70, c(10, 25)), rpois(70, c(25, 10)), rpois(60, c(15, 10))))
    head(t)
    #   participant condition score
    # 1        p001        c1     8
    # 2        p001        c2    25
    # 3        p002        c1     7
    # 4        p002        c2    27
    # 5        p003        c1    14
    # 6        p003        c2    28

    library(reshape2)   # for dcast(...)
    # convert from long to wide format
    result <- dcast(t, participant ~ condition, value.var = "score")
    # k-means on the columns containing scores - 3 clusters
    result$clust <- kmeans(result[, 2:ncol(result)], centers = 3)$cluster
    result[sample(1:100, 6), ]   # random sample of 6 rows
    #    participant c1 c2 clust
    # 12        p012 13 21     1
    # 24        p024  7 32     1
    # 85        p085 10  6     2
    # 43        p043 27  5     3
    # 48        p048 29 11     3
    # 66        p066 24 17     3

Now we can plot the scores, showing how the participants cluster.

    # plot the scores for each participant, color coded by cluster
    plot(c2 ~ c1, result, col = result$clust, pch = 20)

EDIT: Response to OP's comment.

OP wants to know what to do if there is more than one score per participant/condition. The answer depends on why there are multiple scores. If the replicates are random and have a central tendency, then taking the mean is justified, although in theory participants with more replicates should be more heavily weighted.
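A minimal sketch of the mean-of-replicates case, using base R's aggregate() on made-up data in which p1 has two replicate scores under c1. The data and the names long, means and wide are hypothetical.

```r
# Hypothetical data: p1 has two replicate scores under condition c1
long <- data.frame(
  participant = c("p1", "p1", "p1", "p2", "p2"),
  condition   = c("c1", "c1", "c2", "c1", "c2"),
  score       = c(4, 6, 5, 5, 7)
)

# Collapse replicates to one mean score per participant/condition
means <- aggregate(score ~ participant + condition, data = long, FUN = mean)

# Reshape to wide format (columns score.c1, score.c2), ready for kmeans()
wide <- reshape(means, idvar = "participant", timevar = "condition",
                direction = "wide")
wide
```

With reshape2, passing fun.aggregate = mean to dcast() should give the same result in a single step. Note that this sketch does not weight participants by their number of replicates.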

On the other hand, suppose these are test scores. Usually (but not always), scores go up with multiple sittings. These scores are not random; there is a trend. In that case it might be more meaningful to take the most recent score.

As a third example, if the scores are used to make a decision based on some policy (such as the SAT, where colleges use the highest score), then the most appropriate aggregating function might be max, not mean.
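The last two alternatives (most recent score, highest score) could be sketched as follows. Taking the most recent score needs some record of the order of the attempts, so the sitting column here is an assumption, as are the data themselves.

```r
# Hypothetical data: 'sitting' (1 = first attempt) is an assumed column
long <- data.frame(
  participant = c("p1", "p1", "p1", "p2", "p2"),
  condition   = c("c1", "c1", "c2", "c1", "c2"),
  sitting     = c(1, 2, 1, 1, 1),
  score       = c(6, 4, 5, 5, 7)
)

# Highest score per participant/condition
best <- aggregate(score ~ participant + condition, data = long, FUN = max)

# Most recent score: sort by sitting, then keep the last row of each group
ordered <- long[order(long$participant, long$condition, long$sitting), ]
is_last <- !duplicated(ordered[c("participant", "condition")], fromLast = TRUE)
latest  <- ordered[is_last, c("participant", "condition", "score")]
```

Either best or latest can then be converted to wide format and passed to kmeans() exactly as before.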

Finally, it might be the case that the number of replicates is in fact an important distinguishing characteristic. In that case you would include not just the scores but also the number of replicates for each participant/condition when clustering. This is relevant in certain kinds of standardized testing under NCLB, where students take the test over and over again until they pass.

BTW: This type of question (the one in the comment) definitely belongs on http://stats.stackexchange.com/.
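One way this might look, again on hypothetical data: compute both the mean score and the replicate count per participant/condition, then spread both into wide format so that the counts become extra clustering columns. The column name n is illustrative.

```r
# Hypothetical data: p1 has two replicate scores under condition c1
long <- data.frame(
  participant = c("p1", "p1", "p1", "p2", "p2"),
  condition   = c("c1", "c1", "c2", "c1", "c2"),
  score       = c(4, 6, 5, 5, 7)
)

# Mean score and number of replicates per participant/condition
avg <- aggregate(score ~ participant + condition, data = long, FUN = mean)
n   <- aggregate(score ~ participant + condition, data = long, FUN = length)
names(n)[3] <- "n"
both <- merge(avg, n)   # joins on the shared columns participant and condition

# One row per participant, with columns score.c1, n.c1, score.c2, n.c2
wide <- reshape(both, idvar = "participant", timevar = "condition",
                direction = "wide")

# Everything except the participant id would go to kmeans()
features <- wide[, -1]
```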

r cluster-analysis k-means mean
