Thursday, 15 August 2013

data manipulation - Merging Similar Observations in R




I'm trying to use R for data management.

I have a data frame with many variables (200+ columns) and many observations (10,000+ rows). There is a lot of missing data, along with duplicated or incomplete observations. One observation should correspond to one person (1 row = 1 unique person).

Here is an example dataset (thanks, @aosmith):

    dat = data.frame(
      email = c(rep(c("user1@hotmail.com", "user2@gmail.com"), each = 2), NA),
      name  = c(NA, "alfred c.", NA, "bob v.", "cathy l."),
      var1  = c(2, 2, NA, NA, 1),
      var2  = c(1, NA, 3, NA, 1),
      var3  = c(NA, NA, 1, 0, 2),
      var4  = c(0, NA, NA, NA, NA)
    )

I want to merge observations so that, in the end, one row equals one person. To identify a person I use the email. When there is no email, I want to keep the observation (so if the email is missing, I don't want R to delete the observation; every observation with no email is considered a unique observation).

Whenever the same email address appears more than once, I need R to fill in the missing fields of each variable with the data found on the other observations that share that email address. If there is already existing data for one or more variables, I want R to create a new variable each time to store the different values.

Here is an example to make this easier to understand.

We need to transform:

    email              name       var1  var2  var3  var4  ...  var200
    user1@hotmail.com  <NA>       2     1     NA    0     ...  .
    user1@hotmail.com  alfred c.  2     NA    NA    NA    ...  .
    user2@gmail.com    <NA>       NA    3     1     NA    ...  .
    user2@gmail.com    bob v.     NA    NA    0     NA    ...  .
    <NA>               cathy l.   1     1     2     NA    ...  .

into (combining rows with the same email so that the data for one person sits in one row, while still keeping the rows where the person cannot be identified by email. If the email is NA, the row has to be kept as if it were a unique person):

    email              name       var1  var2  var3a  var3b  var4  ...  var200
    user1@hotmail.com  alfred c.  2     1     NA     NA     0     .    .
    user2@gmail.com    bob v.     NA    3     1      0      NA    .    .
    <NA>               cathy l.   1     1     2      .      NA    .    .
    userx@email.com    .          .     etc   etc    etc    etc   etc  etc

Is there an easy way? I've been struggling with dplyr and tidyr for 2 days... In the end, one row should contain the data for one person that I'm able to identify using the email variable. I need to keep the other observations that I can't identify as belonging to one person.

Thanks for your help and time!

I came up with an alternative for the case where you don't know how many values each variable will have within a subject. You'll recognize some of the steps (making separate names for separate columns).

The process is to put the dataset into long format using gather, remove missing and duplicate values for each subject and variable combination, make the variable names unique when there is more than one value per variable (adding b, c, etc. to the ends of the variable names), and then put the dataset back into wide format with spread.

    dat = data.frame(
      email   = rep(c("user1@hotmail.com", "user2@gmail.com"), each = 2),
      twitter = c(NA, "user1", NA, "user2"),
      var1    = c(2, 2, NA, NA),
      var2    = c(1, NA, 3, NA),
      var3    = c(NA, NA, 1, 0),
      var4    = c(0, NA, NA, NA)
    )

    library(dplyr)
    library(tidyr)

    dat %>%
      gather(allvar, value, twitter:var4) %>%   # wide to long
      group_by(email, allvar) %>%
      filter(!is.na(value) & !duplicated(value)) %>%   # drop missing and repeated values
      mutate(allvar2 = paste0(allvar, c("", letters[2:26])[1:n()])) %>%   # var3, var3b, ...
      ungroup() %>%
      select(-allvar) %>%
      spread(allvar2, value, convert = TRUE)   # long to wide

    Source: local data frame [2 x 7]

                  email twitter var1 var2 var3 var3b var4
    1 user1@hotmail.com   user1    2    1   NA    NA    0
    2   user2@gmail.com   user2   NA    3    1     0   NA

Edit: new example for when email addresses are missing

I'm not entirely clear on whether you have either twitter or email data or both. If so, I think this could be simplified by filling in twitter with na.locf as in @jazurro's answer and working with the combination of email and twitter as the grouping variable.

To keep the rows with no email, filter them out, do what you need, and then rbind_list them back in. In this case, naming the duplicated variables as, e.g., var3 and var3b works out (it is possible to name them var3a, var3b instead, but that won't work with the rbinding method).
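The na.locf idea can be sketched roughly as follows. This is only an illustration, assuming the zoo package is installed; `fill_both` is a hypothetical helper name, and the backward pass (`fromLast = TRUE`) is an addition made here because in the example data the missing twitter value comes before the known one. Rows with no email are left untouched so that they are never merged with each other.

```r
library(zoo)

dat <- data.frame(
  email   = c(rep(c("user1@hotmail.com", "user2@gmail.com"), each = 2), NA),
  twitter = c(NA, "user1", NA, "user2", "user3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: carry the last known value forward,
# then carry the next known value backward for leading NAs.
fill_both <- function(x) {
  x <- na.locf(x, na.rm = FALSE)
  na.locf(x, na.rm = FALSE, fromLast = TRUE)
}

# Only fill within rows that have an email; rows without one stay as they are.
has_email <- !is.na(dat$email)
dat$twitter[has_email] <- ave(dat$twitter[has_email],
                              dat$email[has_email],
                              FUN = fill_both)

dat$twitter
```

With the example data, every row with an email ends up with the twitter handle recorded for that email, so email and twitter together could then serve as the grouping key.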

    dat = data.frame(
      email   = c(rep(c("user1@hotmail.com", "user2@gmail.com"), each = 2), NA),
      twitter = c(NA, "user1", NA, "user2", "user3"),
      var1    = c(2, 2, NA, NA, 1),
      var2    = c(1, NA, 3, NA, 1),
      var3    = c(NA, NA, 1, 0, 2),
      var4    = c(0, NA, NA, NA, NA)
    )

    dat %>%
      filter(!is.na(email)) %>%   # filter out rows missing email
      gather(allvar, value, twitter:var4, na.rm = TRUE) %>%
      group_by(email, allvar) %>%
      distinct(value) %>%
      # name duplicated variables, ex: var3, var3b
      # OP gets an error using n(); use length(value) instead
      mutate(allvar2 = paste0(allvar, c("", "b")[1:n()])) %>%
      ungroup() %>%
      select(-allvar) %>%
      spread(allvar2, value, convert = TRUE) %>%   # make sure spread converts variables appropriately
      rbind_list(., dat[is.na(dat$email), ])   # rbind rows missing email (bind_rows() in later dplyr versions)

    Source: local data frame [3 x 7]

                  email twitter var1 var2 var3 var3b var4
    1 user1@hotmail.com   user1    2    1   NA    NA    0
    2   user2@gmail.com   user2   NA    3    1     0   NA
    3                NA   user3    1    1    2    NA   NA
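For readers on newer package versions, here is a rough equivalent of the same steps, assuming dplyr 1.0 or later and tidyr 1.1 or later, where pivot_longer/pivot_wider supersede gather/spread and bind_rows replaces rbind_list. It is a sketch, not the original answer's code: the name `merged`, the explicit character conversion, and the assumption that all numeric columns start with "var" are choices made here for illustration.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(
  email   = c(rep(c("user1@hotmail.com", "user2@gmail.com"), each = 2), NA),
  twitter = c(NA, "user1", NA, "user2", "user3"),
  var1    = c(2, 2, NA, NA, 1),
  var2    = c(1, NA, 3, NA, 1),
  var3    = c(NA, NA, 1, 0, 2),
  var4    = c(0, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

merged <- dat %>%
  filter(!is.na(email)) %>%                       # set aside rows missing email
  mutate(across(-email, as.character)) %>%        # pivot_longer needs one common type
  pivot_longer(-email, names_to = "allvar", values_to = "value",
               values_drop_na = TRUE) %>%         # wide to long, dropping NAs
  distinct(email, allvar, value) %>%              # drop repeated values
  group_by(email, allvar) %>%
  mutate(allvar2 = paste0(allvar, c("", letters[2:26])[seq_len(n())])) %>%
  ungroup() %>%
  select(-allvar) %>%
  pivot_wider(names_from = allvar2, values_from = value,
              names_sort = TRUE) %>%              # long to wide, columns sorted by name
  mutate(across(starts_with("var"), as.numeric)) %>%   # assumption: var* columns are numeric
  bind_rows(filter(dat, is.na(email)))            # add back rows missing email

merged
```

Because bind_rows matches columns by name and fills the rest with NA, the row without an email simply gets NA for var3b, which is the same reason the var3/var3b naming works with the rbinding approach described above.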

r data-manipulation
