Text files clustering

Developer https://www.devze.com 2023-03-04 14:04 Source: web

I have text files as shown below, for example:

file 1:

       yamaha
       gladiator 
       bike

file 2:

       bajaj 
       pulsar
       bike

file 3:

       yamaha 
       gladiator
       india

I have to read these files individually and create clusters. That is, in the example above, file 1 and file 3 are similar and should form one cluster. I want at least a single word to match between two files for them to form a cluster. So in the end I should get two clusters from the example above: 1: yamaha and 2: bajaj. Please help me with this.

Sounds like you simply need to read each file into a Set<String> of words and then look for intersections to build your clusters. That could be achieved, for example, by building a map from each word to a count of occurrences (Map<String, Integer>) or a map from each word to the set of filenames that contain it (Map<String, Set<String>>).

Not sure where your second example cluster comes from, as "bajaj" only exists in file 2.
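The Set<String> intersection idea can be sketched roughly as follows. This is only a minimal illustration: the class name WordOverlap and the helpers toWordSet and sharedWords are made up for the example, and the file contents are passed in as strings rather than read from disk (in real code they could come from something like Files.readAllLines).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of any library.
class WordOverlap {
    // Splits raw file content on whitespace and lower-cases each token.
    static Set<String> toWordSet(String content) {
        Set<String> words = new HashSet<>();
        for (String token : content.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase());
            }
        }
        return words;
    }

    // Returns a new set holding only the words present in both inputs.
    static Set<String> sharedWords(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b); // set intersection
        return shared;
    }

    public static void main(String[] args) {
        Set<String> file1 = toWordSet("yamaha\ngladiator\nbike");
        Set<String> file2 = toWordSet("bajaj\npulsar\nbike");
        Set<String> file3 = toWordSet("yamaha\ngladiator\nindia");

        System.out.println("1 & 3: " + sharedWords(file1, file3));
        System.out.println("1 & 2: " + sharedWords(file1, file2));
        System.out.println("2 & 3: " + sharedWords(file2, file3));
    }
}
```

With the sample data from the question, files 1 and 3 share "yamaha" and "gladiator", files 2 and 3 share nothing, and files 1 and 2 share "bike". So a strict "at least one common word" rule would also link file 1 to file 2.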

EDIT: based on a request to explain how Maps and Sets work

Instantiating a Map that maps strings (the word) to a set of filenames:

Map<String, Set<String>> wordsToFilenames = new HashMap<String, Set<String>>();

Adding a word found in a file to this map (assume we've read a word from the file into the word variable and have the filename in the filename variable, both Strings):

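Set<String> filenamesForWord;

// Reuse the existing set if this word has been seen before
if (wordsToFilenames.containsKey(word)) {
    filenamesForWord = wordsToFilenames.get(word);
}
else {
    // First time we see this word: create its set and register it in the map
    filenamesForWord = new HashSet<String>();
    wordsToFilenames.put(word, filenamesForWord);
}

// Record that the word occurs in this file
filenamesForWord.add(filename);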
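Going one step further, here is a rough sketch of how the word-to-filenames map could be turned into clusters, by merging any files that appear together under the same word. Everything here is an assumption made for illustration: the class FileClusters, its methods, and in particular the ignoredWords parameter, which is a hypothetical way to skip very common words such as "bike" (the question does not mention this). The merging uses a simple union-find over filenames, which is just one possible approach.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not a definitive implementation.
class FileClusters {
    // Builds the word -> filenames map from a filename -> words map.
    static Map<String, Set<String>> indexWords(Map<String, List<String>> wordsByFile) {
        Map<String, Set<String>> wordsToFilenames = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : wordsByFile.entrySet()) {
            for (String word : entry.getValue()) {
                wordsToFilenames.computeIfAbsent(word, k -> new HashSet<>()).add(entry.getKey());
            }
        }
        return wordsToFilenames;
    }

    // Follows parent links until reaching the representative of a file's group.
    static String find(Map<String, String> parent, String file) {
        while (!parent.get(file).equals(file)) {
            file = parent.get(file);
        }
        return file;
    }

    // Groups files transitively: files sharing a non-ignored word end up together.
    static List<Set<String>> cluster(Map<String, List<String>> wordsByFile, Set<String> ignoredWords) {
        Map<String, String> parent = new HashMap<>();
        for (String file : wordsByFile.keySet()) {
            parent.put(file, file); // every file starts in its own group
        }
        for (Map.Entry<String, Set<String>> entry : indexWords(wordsByFile).entrySet()) {
            if (ignoredWords.contains(entry.getKey())) {
                continue; // assumption: common words should not link files
            }
            String first = null;
            for (String file : entry.getValue()) {
                if (first == null) {
                    first = file;
                } else {
                    parent.put(find(parent, file), find(parent, first)); // merge groups
                }
            }
        }
        Map<String, Set<String>> clusters = new LinkedHashMap<>();
        for (String file : wordsByFile.keySet()) {
            clusters.computeIfAbsent(find(parent, file), k -> new TreeSet<>()).add(file);
        }
        return new ArrayList<>(clusters.values());
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("file1", Arrays.asList("yamaha", "gladiator", "bike"));
        files.put("file2", Arrays.asList("bajaj", "pulsar", "bike"));
        files.put("file3", Arrays.asList("yamaha", "gladiator", "india"));

        System.out.println(cluster(files, Collections.<String>emptySet()));
        System.out.println(cluster(files, Collections.singleton("bike")));
    }
}
```

On the sample files, clustering with no ignored words puts all three files in one group, because "bike" links file 1 and file 2. Ignoring "bike" gives the two groups the question asks for: file 1 with file 3, and file 2 on its own.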

You can look at the naïve Bayesian classifier, which does quite well at document classification. For other algorithms, try searching for "text classification algorithm".