My Blog: data manipulation - Merging Similar Observations in R -

Thursday, 15 August 2013

data manipulation - Merging Similar Observations in R -

I'm trying to use R for data management.

I have a data frame with many variables (200+ columns) and many observations (10,000+ rows). There is a lot of missing data, and there are duplicated or incomplete observations. One observation should equal one person (1 row = 1 unique person).

Here is an example dataset (thanks, @aosmith):

    dat = data.frame(email = c(rep(c("user1@hotmail.com", "user2@gmail.com"), each = 2), NA),
                     name = c(NA, "alfred c.", NA, "bob v.", "cathy l."),
                     var1 = c(2, 2, NA, NA, 1),
                     var2 = c(1, NA, 3, NA, 1),
                     var3 = c(NA, NA, 1, 0, 2),
                     var4 = c(0, NA, NA, NA, NA))

I want to merge observations so that, in the end, one row equals one person. To identify a person I use the email. When there is no email, I want to keep the observations (so if the email is missing, I don't want R to delete the observation; every observation with no email is considered a unique observation).

When the same email address appears more than once, I need R to fill in the fields of each variable that are missing using the data found on the other observations with the same email address. If a variable already has a value and a different value is found, I want R to create a new variable each time to store the different values.

Here is an example to make this easier to understand.

We need to transform:

                email      name var1 var2 var3 var4 ... var200
    user1@hotmail.com      <NA>    2    1   NA    0 ...      .
    user1@hotmail.com alfred c.    2   NA   NA   NA ...      .
      user2@gmail.com      <NA>   NA    3    1   NA ...      .
      user2@gmail.com    bob v.   NA   NA    0   NA ...      .
                 <NA>  cathy l.    1    1    2   NA ...      .

into (combining rows with the same email so that the data for the same person ends up in one row, while keeping the data when we cannot identify a person by email address; if the email is NA, we have to keep the row as if it were a unique person):

                email      name var1 var2 var3a var3b var4 ... var200
    user1@hotmail.com alfred c.    2    1    NA    NA    0   .      .
      user2@gmail.com    bob v.   NA    3     1     0   NA   .      .
                 <NA>  cathy l.    1    1     2     .   NA   .      .
      userx@email.com         .    .  etc   etc   etc  etc etc    etc

Is there an easy way? I've been struggling with dplyr and tidyr for two days... In the end, one row should contain the data for one person that we are able to identify using the email variable. We need to keep the other observations that we cannot identify as belonging to one person.

Thank you for your help and time!

I came up with an alternative for the case where you don't know how many values each variable will have within a subject. You'll recognize some of the steps (making separate names for separate columns).

The process is to put the dataset into long format using gather, remove missing and duplicate values for each subject and variable combination, make new variable names when there is more than one value per variable (adding b, c, etc. to the ends of the variable names), and put the dataset back into wide format with spread.

    dat = data.frame(email = rep(c("user1@hotmail.com", "user2@gmail.com"), each = 2),
                     twitter = c(NA, "user1", NA, "user2"),
                     var1 = c(2, 2, NA, NA),
                     var2 = c(1, NA, 3, NA),
                     var3 = c(NA, NA, 1, 0),
                     var4 = c(0, NA, NA, NA))

    library(dplyr)
    library(tidyr)

    dat %>%
        gather(allvar, value, twitter:var4) %>% # long format: one row per email/variable/value
        group_by(email, allvar) %>%
        filter(!is.na(value) & !duplicated(value)) %>% # drop missing and repeated values
        mutate(allvar2 = paste0(allvar, c("", letters[2:26])[1:n()])) %>% # var3, var3b, var3c, ...
        ungroup() %>%
        select(-allvar) %>%
        spread(allvar2, value, convert = TRUE) # back to wide format

    Source: local data frame [2 x 7]

                  email twitter var1 var2 var3 var3b var4
    1 user1@hotmail.com   user1    2    1   NA    NA    0
    2   user2@gmail.com   user2   NA    3    1     0   NA
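In more recent tidyr releases, gather and spread have been superseded by pivot_longer and pivot_wider. The following is only a sketch of the same long-then-wide idea with those functions, assuming dplyr 1.0 or later and tidyr 1.0 or later; the result name `wide` and the explicit character conversion are choices made for this illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(email = rep(c("user1@hotmail.com", "user2@gmail.com"), each = 2),
                  twitter = c(NA, "user1", NA, "user2"),
                  var1 = c(2, 2, NA, NA),
                  var2 = c(1, NA, 3, NA),
                  var3 = c(NA, NA, 1, 0),
                  var4 = c(0, NA, NA, NA),
                  stringsAsFactors = FALSE)

wide <- dat %>%
    # pivot_longer needs a single type for the value column (assumption: character is acceptable)
    mutate(across(twitter:var4, as.character)) %>%
    pivot_longer(twitter:var4, names_to = "allvar", values_to = "value",
                 values_drop_na = TRUE) %>%
    # keep each distinct value once per email/variable combination
    distinct(email, allvar, value) %>%
    group_by(email, allvar) %>%
    # first value keeps the plain name, later values get b, c, ...
    mutate(allvar2 = paste0(allvar, c("", letters[2:26])[seq_len(n())])) %>%
    ungroup() %>%
    select(-allvar) %>%
    pivot_wider(names_from = allvar2, values_from = value) %>%
    # convert the character columns back to suitable types
    mutate(across(-email, ~ type.convert(.x, as.is = TRUE)))
```

The column order of `wide` follows the order in which the names first appear in the long data, so it may differ from the alphabetical order produced by spread.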

Edit: new example for when email addresses are missing

I'm not entirely clear whether you have either twitter or email data, or both. If so, I think this could be simplified by filling in twitter with na.locf as in @jazurro's answer and working with the combination of email and twitter as the grouping variable.

To keep the rows with no email, filter them out, do what you need, and then rbind_list them back in. In this case, naming the duplicated variables var3 and var3b, for example, works out (it would be possible to name them var3a, var3b instead, but that won't work with the rbinding method).

    dat = data.frame(email = c(rep(c("user1@hotmail.com", "user2@gmail.com"), each = 2), NA),
                     twitter = c(NA, "user1", NA, "user2", "user3"),
                     var1 = c(2, 2, NA, NA, 1),
                     var2 = c(1, NA, 3, NA, 1),
                     var3 = c(NA, NA, 1, 0, 2),
                     var4 = c(0, NA, NA, NA, NA))

    dat %>%
        filter(!is.na(email)) %>% # filter out rows with missing email
        gather(allvar, value, twitter:var4, na.rm = TRUE) %>%
        group_by(email, allvar) %>%
        distinct(value) %>%
        mutate(allvar2 = paste0(allvar, c("", "b")[1:n()])) %>% # name duplicated variables, e.g. var3, var3b
        # OP gets an error using n(); use length(value) instead
        ungroup() %>%
        select(-allvar) %>%
        spread(allvar2, value, convert = TRUE) %>% # make sure spread converts variables appropriately
        rbind_list(., dat[is.na(dat$email), ]) # rbind the rows with missing email

    Source: local data frame [3 x 7]

                  email twitter var1 var2 var3 var3b var4
    1 user1@hotmail.com   user1    2    1   NA    NA    0
    2   user2@gmail.com   user2   NA    3    1     0   NA
    3                NA   user3    1    1    2    NA   NA
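rbind_list was later deprecated in dplyr in favour of bind_rows, so on a current installation the last step may need to be swapped. As a small sketch of that final step only: the two data frames below, `merged` and `no_email`, are hypothetical stand-ins typed in by hand for the merged rows and the rows with missing email, and `result` is just an illustrative name.

```r
library(dplyr)

# Hypothetical stand-in for the merged rows (one row per known email)
merged <- data.frame(email = c("user1@hotmail.com", "user2@gmail.com"),
                     twitter = c("user1", "user2"),
                     var1 = c(2, NA), var2 = c(1, 3),
                     var3 = c(NA, 1), var3b = c(NA, 0),
                     var4 = c(0, NA),
                     stringsAsFactors = FALSE)

# Hypothetical stand-in for the rows with missing email (note: no var3b column)
no_email <- data.frame(email = NA_character_, twitter = "user3",
                       var1 = 1, var2 = 1, var3 = 2, var4 = NA_real_,
                       stringsAsFactors = FALSE)

# bind_rows matches columns by name and fills absent columns with NA
result <- bind_rows(merged, no_email)
```

Because columns are matched by name, the unsuffixed var3 in the rows without email lines up with var3 in the merged rows. If the merged columns were named var3a and var3b, var3 would end up as a separate column, which is the reason given above for preferring the var3, var3b naming.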

r data-manipulation
