string - Clustering a long list of words
I have the following problem at hand: I have a very long list of words, possibly names, surnames, etc. I need to cluster this word list so that similar words, for example words that are close in edit (Levenshtein) distance, appear in the same cluster. For example, "algorithm" and "alogrithm" should have a high chance of appearing in the same cluster.
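To make the similarity measure concrete, here is a minimal sketch of the standard dynamic-programming Levenshtein distance (insertions, deletions and substitutions each cost 1). The function name `levenshtein` is just an illustrative choice, not something from a particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("algorithm", "alogrithm"))  # 2
print(levenshtein("kitten", "sitting"))       # 3
```

Note that the swapped letters in "alogrithm" count as two substitutions under plain Levenshtein distance, so a threshold of 1 would not catch this pair.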
I am aware of the classical unsupervised clustering methods like k-means clustering and EM clustering in the pattern recognition literature. The problem here is that these methods work on points which reside in a vector space. I have words (strings) at hand here. It seems that the question of how to represent strings in a numerical vector space and how to calculate the "means" of string clusters has not been sufficiently answered, according to my survey efforts so far. A naive approach to attack this problem would be to combine k-means clustering with Levenshtein distance, but the question still remains: how do you represent the "means" of strings? There is a weight called the TF-IDF weight, but it seems to be mostly related to the area of "text document" clustering, not to the clustering of single words. It seems that there are some special string clustering algorithms in existence, like the one at http://pike.psu.edu/cleandb06/papers/cameraready_120.pdf
My search in this area is still ongoing, but I wanted to get ideas from here as well. What would you recommend in this case? Are you aware of any methods for this kind of problem?
Don't look at clustering. It's misleading. Most algorithms will (more or less forcefully) break your data into a predefined number of groups, no matter what. That k-means isn't the right type of algorithm for your problem should be rather obvious, shouldn't it?
What you describe sounds very similar to clustering; the difference is the scale. A clustering algorithm will produce "macro" clusters, e.g. split your data set into 10 clusters. What you probably want is that much of your data isn't clustered at all; you just want to merge near-duplicate strings, which may stem from errors, right?
Levenshtein distance with a threshold is probably all you need. You can try to accelerate this by using hashing techniques, for example.
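A minimal sketch of that idea, assuming a threshold of 2 edits and a plain all-pairs comparison: words whose distance is within the threshold are linked, and linked words are merged with a small union-find. The names `group_near_duplicates` and `max_distance` are hypothetical, and the length check is only a cheap pruning step, not one of the hashing techniques mentioned above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def group_near_duplicates(words, max_distance=2):
    """Merge words that are linked by pairs within max_distance edits."""
    parent = list(range(len(words)))  # union-find forest over word indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            # The length difference is a lower bound on the edit distance,
            # so pairs that differ too much in length can be skipped.
            if abs(len(words[i]) - len(words[j])) > max_distance:
                continue
            if levenshtein(words[i], words[j]) <= max_distance:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i, word in enumerate(words):
        groups.setdefault(find(i), []).append(word)
    return list(groups.values())


words = ["algorithm", "alogrithm", "algorthm", "smith", "smyth", "johnson"]
print(group_near_duplicates(words))
# [['algorithm', 'alogrithm', 'algorthm'], ['smith', 'smyth'], ['johnson']]
```

Unlike k-means, this does not force a fixed number of groups: a word with no close neighbour, such as "johnson" here, simply stays on its own. The all-pairs loop is quadratic in the list length, which is where an indexing or hashing step would have to come in for a very long list.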
Similarly, TF-IDF is the wrong tool. It's used for clustering texts, not strings. TF-IDF is the weight assigned to a single word (string; and it is assumed that this string does not contain spelling errors!) within a larger document. It doesn't work well on short documents, and it won't work at all on single-word strings.
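A small sketch of why it breaks down on single words, assuming each word is treated as its own one-token "document" and using one common smoothed IDF variant (log(N/df) + 1, chosen here only for illustration). Every vector then has a single non-zero entry, so any two different words get cosine similarity 0, however close their spelling is.

```python
import math


def tfidf_vectors(documents):
    """TF-IDF vectors over whitespace tokens; IDF variant is an assumption."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)
    doc_freq = {t: sum(t in tokens for tokens in tokenized) for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        vectors.append([
            (tokens.count(t) / len(tokens)) * (math.log(n_docs / doc_freq[t]) + 1.0)
            for t in vocabulary
        ])
    return vectors


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


words = ["algorithm", "alogrithm", "johnson"]
vectors = tfidf_vectors(words)
print(cosine(vectors[0], vectors[1]))  # 0.0, despite the near-identical spelling
print(cosine(vectors[0], vectors[2]))  # 0.0, same score as an unrelated word
```

The misspelling and the unrelated surname are equally "far" from "algorithm", so the weights carry no information about string similarity.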
string cluster-analysis k-means levenshtein-distance pattern-recognition