
text files clustering

Developer https://www.devze.com 2023-03-04 14:04 Source: Internet

I have text files as shown below.

Example:

file 1:

       yamaha
       gladiator 
       bike  

file 2:

       bajaj 
       pulsar
       bike

file 3:

       yamaha 
       gladiator
       india

I have to read these files individually and create clusters. That is, in the example above, file 1 and file 3 are similar and should form one cluster. I want at least a single word to match between two files for them to be put in the same cluster. So in the end I should get two clusters from the example above: 1: yamaha and 2: bajaj. Please help me with this.


Sounds like you simply need to read each file into a Set<String> of words and then look for intersections to build your clusters. That could be achieved, for example, by building a map of words to a count of occurrences (Map<String, Integer>) or a map of words to a set of filenames (Map<String, Set<String>>).

Not sure where your second example cluster comes from, as "bajaj" only exists in file 2.
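To make the Set<String> idea concrete, here is a minimal sketch of turning file contents into word sets and intersecting them. The class name WordSets and the helpers readWords, toWords and sharedWords are names made up for this illustration, and splitting on whitespace and lower-casing are assumptions about how a "word" should be defined.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class WordSets {
    // Reads a file and returns its distinct words (hypothetical helper).
    static Set<String> readWords(Path file) throws IOException {
        return toWords(Files.readAllLines(file));
    }

    // Splits lines on whitespace and lower-cases each token (an assumption
    // about what counts as a word; adjust as needed).
    static Set<String> toWords(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    words.add(token.toLowerCase());
                }
            }
        }
        return words;
    }

    // Returns the words present in both sets, sorted for stable printing.
    static Set<String> sharedWords(Set<String> a, Set<String> b) {
        Set<String> shared = new TreeSet<>(a);
        shared.retainAll(b);
        return shared;
    }

    public static void main(String[] args) {
        Set<String> file1 = toWords(Arrays.asList("   yamaha", "   gladiator ", "   bike  "));
        Set<String> file2 = toWords(Arrays.asList("   bajaj ", "   pulsar", "   bike"));
        Set<String> file3 = toWords(Arrays.asList("   yamaha ", "   gladiator", "   india"));

        System.out.println(sharedWords(file1, file3)); // [gladiator, yamaha]
        System.out.println(sharedWords(file1, file2)); // [bike]
    }
}
```

Note that file 1 and file 2 share "bike", so under a strict "at least one common word" rule they would be linked as well.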

EDIT: based on request to explain how Maps and Sets work

Instantiating a Map that maps strings (the word) to a set of filenames:

Map<String, Set<String>> wordsToFilenames = new HashMap<String, Set<String>>();

Adding a word found in a filename to this (assume we've read in a word from the file into the word variable and have the filename in a filename variable, both Strings):

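Set<String> filenamesForWord;

// Reuse the set already stored for this word, if there is one
if (wordsToFilenames.containsKey(word)) {
    filenamesForWord = wordsToFilenames.get(word);
}
else {
    // First time this word is seen: create its set and register it in the map
    filenamesForWord = new HashSet<String>();
    wordsToFilenames.put(word, filenamesForWord);
}

// Record that this file contains the word
filenamesForWord.add(filename);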
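Once the map is filled, one possible way to turn it into clusters is to treat files as connected whenever they share a word and collect the connected groups. The following is only a sketch under that assumption; the class FileClusters, its methods, and the ignoredWords parameter are hypothetical names introduced here, not part of any library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FileClusters {
    // Builds the word -> filenames map, skipping any ignored words.
    static Map<String, Set<String>> buildIndex(Map<String, Set<String>> wordsByFile,
                                               Set<String> ignoredWords) {
        Map<String, Set<String>> wordsToFilenames = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : wordsByFile.entrySet()) {
            for (String word : entry.getValue()) {
                if (!ignoredWords.contains(word)) {
                    wordsToFilenames.computeIfAbsent(word, k -> new TreeSet<>())
                                    .add(entry.getKey());
                }
            }
        }
        return wordsToFilenames;
    }

    // Groups files that are linked, directly or through other files,
    // by at least one shared (non-ignored) word.
    static List<Set<String>> cluster(Map<String, Set<String>> wordsByFile,
                                     Set<String> ignoredWords) {
        Map<String, Set<String>> index = buildIndex(wordsByFile, ignoredWords);
        Set<String> visited = new HashSet<>();
        List<Set<String>> clusters = new ArrayList<>();
        for (String start : new TreeSet<>(wordsByFile.keySet())) {
            if (!visited.add(start)) {
                continue; // already placed in an earlier cluster
            }
            Set<String> current = new TreeSet<>();
            Deque<String> queue = new ArrayDeque<>();
            queue.add(start);
            while (!queue.isEmpty()) {
                String file = queue.poll();
                current.add(file);
                for (String word : wordsByFile.get(file)) {
                    for (String other : index.getOrDefault(word, Collections.<String>emptySet())) {
                        if (visited.add(other)) {
                            queue.add(other);
                        }
                    }
                }
            }
            clusters.add(current);
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> files = new HashMap<>();
        files.put("file1", new HashSet<>(Arrays.asList("yamaha", "gladiator", "bike")));
        files.put("file2", new HashSet<>(Arrays.asList("bajaj", "pulsar", "bike")));
        files.put("file3", new HashSet<>(Arrays.asList("yamaha", "gladiator", "india")));

        // "bike" links file1 and file2, so everything ends up in one cluster.
        System.out.println(cluster(files, Collections.<String>emptySet()));
        // Ignoring "bike" gives the two clusters the question expects.
        System.out.println(cluster(files, new HashSet<>(Arrays.asList("bike"))));
    }
}
```

With the sample files, the first call prints [[file1, file2, file3]] and the second prints [[file1, file3], [file2]]. Whether common words such as "bike" should be ignored is a design decision that depends on your data.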


You can look at the naive Bayes classifier, which does quite well at document classification. For other algorithms, try searching for "text classification algorithm".
