Possible Duplicate:
Improving performance of preprocessing large set of documents
Hi, I have a document set containing about 100 documents. I have to preprocess each of these documents and compare them with each other. If I do it sequentially, it will take a huge amount of time. So I want to know which parallel algorithms can be used and how I can implement them in Java.
Regards, nuwan
There is a lot of literature about detecting document similarity. You need to do a literature search and/or a web search for software / algorithms / techniques that match your requirements.
Simply replacing a brute-force sequential pair-wise comparison with a brute-force parallel pair-wise comparison is not the answer. That approach gives you at best an O(P) speedup, where P is the number of processors, on a problem that costs O(N^2 * S^2), where N is the number of documents and S is the average document size.
For a start, the classic way of finding similarities between two large text files involves breaking each file into lines, calculating hashes of each file's lines, sorting the hashes and comparing them. This process is O(S log S).
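To make the line-hashing idea above concrete, here is a minimal sketch, assuming Java 11+ and assuming that String.hashCode is a good enough line hash (collisions are possible, so treat the score as approximate). The class and method names are made up for illustration, and scoring against the shorter document is just one possible choice.

```java
final class LineHashSimilarity {

    // Hash every non-blank line and sort the hashes: O(S log S) per document.
    static int[] sortedLineHashes(String document) {
        return document.lines()
                .map(String::strip)
                .filter(line -> !line.isEmpty())
                .mapToInt(String::hashCode)
                .sorted()
                .toArray();
    }

    // Count the hashes present in both sorted arrays with one merge-style pass.
    static int sharedLines(int[] a, int[] b) {
        int i = 0, j = 0, shared = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                shared++;
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return shared;
    }

    // Shared lines as a fraction of the shorter document (an assumed scoring rule).
    static double similarity(String d1, String d2) {
        int[] h1 = sortedLineHashes(d1);
        int[] h2 = sortedLineHashes(d2);
        int smaller = Math.min(h1.length, h2.length);
        return smaller == 0 ? 0.0 : (double) sharedLines(h1, h2) / smaller;
    }

    public static void main(String[] args) {
        System.out.println(similarity("alpha\nbeta\ngamma\n", "gamma\ndelta\nalpha\n"));
    }
}
```

Since each document's sorted hashes can be computed once and reused, only the cheap merge pass has to be repeated for every pair.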
If you have documents d1, d2, d3, d4 and you compared each document with all other documents, that would be O(N^2). However, I assume that comparing d1 to d2 is the same as comparing d2 to d1, so you can optimize there. So basically, you only need to compare d1-d2, d1-d3, d1-d4, d2-d3, d2-d4, d3-d4, which is N(N-1)/2 comparisons (6 for four documents, 4,950 for 100). That halves the work, although it is still O(N^2).
Perhaps start by building a map of all comparisons that need to be done. Then, split that map into X equal size collections, where X is the number of processes you want to run. Finally, spin off that many threads (or farm the work out to that many servers), and let them run, then merge the results back together.
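As a rough sketch of that in Java: the code below enumerates each unordered pair once and hands the pairs to a fixed-size thread pool. Note that it lets the pool's shared queue balance the work instead of pre-splitting the pairs into X collections, which is a substitution on my part. PairwiseComparer and compareAll are hypothetical names, and the comparison function is whatever you plug in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToDoubleBiFunction;

final class PairwiseComparer {

    // Returns a symmetric matrix where scores[i][j] is the result for documents i and j.
    static double[][] compareAll(List<String> docs,
                                 ToDoubleBiFunction<String, String> compare,
                                 int threads) {
        int n = docs.size();
        double[][] scores = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            // Only pairs with i < j, so each comparison runs exactly once.
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    final int a = i, b = j;
                    pending.add(pool.submit(() -> {
                        double s = compare.applyAsDouble(docs.get(a), docs.get(b));
                        scores[a][b] = s; // each task writes only its own two cells
                        scores[b][a] = s;
                    }));
                }
            }
            // Waiting on every future also makes the workers' writes visible here.
            for (Future<?> f : pending) {
                try {
                    f.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy comparison (difference in length), only to show the call shape.
        double[][] s = compareAll(List.of("a", "ab", "abc"),
                (x, y) -> Math.abs(x.length() - y.length()),
                Runtime.getRuntime().availableProcessors());
        System.out.println(s[0][2]);
    }
}
```

For 100 documents that is 4,950 small tasks, which a single pool handles comfortably. Farming the work out to several servers would need a different mechanism.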
If you need to preprocess each document individually (so the comparisons really don't matter at that point), then just break the problem up into as many processes as you want, and distribute that work across the processes. Without really knowing what kind of preprocessing, comparison and document types you're dealing with, I can't get into many more specifics than that.
I'm assuming you're looking for similarities between documents rather than identical documents. If you only needed to find identical documents, you could generate a checksum for each document in parallel, and comparing them would then be relatively easy.
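For the identical-documents case, a minimal sketch could look like this. It assumes the documents are already in memory as strings keyed by name, uses SHA-256 as the checksum (any strong digest would do), and relies on a parallel stream for the parallelism. DuplicateFinder and its methods are illustrative names only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class DuplicateFinder {

    // SHA-256 of the document text, Base64-encoded so it can be used as a map key.
    static String checksum(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Checksums are computed in parallel; names with equal checksums end up in one group.
    // Only groups with more than one member are returned. Order inside a group is not fixed.
    static List<List<String>> findDuplicates(Map<String, String> docsByName) {
        Map<String, List<String>> byChecksum = docsByName.entrySet().parallelStream()
                .collect(Collectors.groupingByConcurrent(
                        e -> checksum(e.getValue()),
                        Collectors.mapping(Map.Entry::getKey, Collectors.toList())));
        return byChecksum.values().stream()
                .filter(group -> group.size() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(findDuplicates(
                Map.of("a.txt", "hello", "b.txt", "world", "c.txt", "hello")));
    }
}
```

Grouping by checksum avoids pairwise comparison altogether: the work is one hash per document plus a map lookup.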
For similarities you could use a fingerprinting approach. I have a friend who uses this to look for text reuse in a large corpus of documents. You can calculate the fingerprints for each document in parallel, then load the fingerprints and do the matching in memory and in parallel.
Winnowing: Local Algorithms for Document Fingerprinting
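Here is a simplified sketch of the winnowing idea, not the paper's reference implementation. It hashes every k-gram of the normalized text, keeps the minimum hash of each window of w consecutive hashes, and compares documents by the Jaccard overlap of the selected hashes. The assumptions are mine: String.hashCode instead of a rolling hash, hash values kept without their positions, a crude lowercase alphanumeric normalization, and Jaccard as the match score. The values of k and w have to be tuned for your documents.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

final class Winnowing {

    // Select the minimum k-gram hash from every window of w consecutive hashes.
    static Set<Integer> fingerprint(String text, int k, int w) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        Set<Integer> selected = new HashSet<>();
        int count = s.length() - k + 1;          // number of k-grams
        if (count <= 0) {
            return selected;                     // text shorter than one k-gram
        }
        int[] hashes = new int[count];
        for (int i = 0; i < count; i++) {
            hashes[i] = s.substring(i, i + k).hashCode();
        }
        // With fewer than w hashes, treat the whole sequence as a single window.
        int windows = Math.max(1, count - w + 1);
        for (int start = 0; start < windows; start++) {
            int end = Math.min(start + w, count);
            int min = start;
            for (int i = start + 1; i < end; i++) {
                if (hashes[i] <= hashes[min]) {
                    min = i;                     // ties go to the rightmost minimum
                }
            }
            selected.add(hashes[min]);           // the set ignores repeated selections
        }
        return selected;
    }

    // Jaccard overlap of two fingerprint sets; 0.0 when both are empty.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    // Fingerprints are independent per document, so they can be computed in parallel.
    static List<Set<Integer>> fingerprintAll(List<String> docs, int k, int w) {
        return docs.parallelStream()
                .map(d -> fingerprint(d, k, w))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Set<Integer>> fps = fingerprintAll(List.of(
                "The quick brown fox jumps over the lazy dog",
                "A quick brown fox jumped over a lazy dog"), 5, 4);
        System.out.println(jaccard(fps.get(0), fps.get(1)));
    }
}
```

The fingerprint sets are much smaller than the documents, so holding all of them in memory for the matching step is realistic even for a large corpus.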