Using a Bloom filter, we get space optimization. The Cassandra framework also has an implementation of a Bloom filter. But in detail, how is this space optimization achieved?
You can understand how it saves space using this example: Let's say I work for Google, on the Chrome team, and I want to add a feature to the browser that notifies the user if the URL they have entered is malicious. I have a dataset of about 1 million malicious URLs, and the file is around 25 MB. Since that is quite big (big in comparison to the size of the browser itself), I store this data on a remote server.
Case 1: I use a hash function with a hash table. I decide on an efficient hash function and run all 1 million URLs through it to get hash keys. I then make a hash table (an array), where the hash key gives me the index at which to place each URL. Once I have hashed the URLs and filled the hash table, I check its size. I have stored all 1 million URLs in the hash table along with their keys, so the size is at least 25 MB. Due to its size, this hash table will be stored on a remote server. When a user comes along and enters a URL in the address bar, I need to check whether it is malicious. So I run the URL through the hash function (the browser itself can do this) and get a hash key for that URL. I now have to make a request to my remote server with that hash key, to check whether the URL stored in my hash table under that key is the same as what the user has entered. If yes, it is malicious; if no, it is not. Thus every time the user enters a URL, a request to the remote server has to be made to check whether it is malicious. This would take a lot of time and make my browser slow.
Case 2: I use a Bloom filter. The entire list of 1 million URLs is run through the Bloom filter using multiple hash functions, and the respective positions are marked as 1 in a huge array of 0s. Let's say we want a false positive rate of 1%. Using a Bloom filter calculator (http://hur.st/bloomfilter?n=1000000&p=0.01), we get a required filter size of only about 1.13 MB. This small size is expected: even though the array has many positions, we are only storing 1s and 0s, not the URLs themselves as in the hash table. Since there are only two values, the array can be treated as a bit array, setting individual bits instead of bytes, which reduces the space taken by a factor of 8.
This 1.13 MB Bloom filter, due to its small size, can be stored in the web browser itself! When a user enters a URL, we simply apply the required hash functions (in the browser itself) and check all the corresponding positions in the Bloom filter (which is stored in the browser). A value of 0 in any of the positions tells us that this URL is DEFINITELY NOT in the list of malicious URLs, and the user can proceed freely. We did not make a call to the server and hence saved time. If all the positions are 1, the URL MIGHT be in the list of malicious URLs. In these cases we make a call to the remote server, where we can use some other hash function with a hash table, as in the first case, to check whether the URL is actually present.
Since most of the time a URL is not malicious, the small Bloom filter in the browser figures that out and saves time by avoiding calls to the remote server. Only when the Bloom filter tells us that the URL MIGHT be malicious do we make a call to the server. With a 1% false positive rate, only about 1 in 100 URLs that are not in the list gets a 'MIGHT' and costs an unnecessary server call.
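The size reported by the calculator follows from the standard Bloom filter sizing formulas: for n expected items and a target false positive rate p, the filter needs about m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. Here is a minimal sketch of that arithmetic; the class and method names are made up for illustration and do not come from any library.

```java
// Hypothetical helper: names are illustrative, not from an existing library.
class BloomSizing {
    /** Bits needed for n expected items at false positive rate p. */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Number of hash functions that minimizes the false positive rate. */
    static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;   // malicious URLs in the example
        double p = 0.01;      // 1% false positive rate
        long m = optimalBits(n, p);
        int k = optimalHashes(m, n);
        // (m + 7) / 8 rounds the bit count up to whole bytes
        System.out.printf("bits=%d, bytes=%d, hashes=%d%n", m, (m + 7) / 8, k);
    }
}
```

For the example numbers this works out to roughly 9.6 million bits (about 1.2 million bytes) and 7 hash functions, the same ballpark as the calculator figure quoted above and far below the 25 MB list.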
So by using a small Bloom filter in the browser, we save a lot of time, as we do not need to make a server call for every URL entered.
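The lookup flow from Case 2 can be sketched roughly as below. Everything in it is an assumption made for illustration: the class name, the two simple string hashes combined by double hashing, and the `remoteCheck` predicate that stands in for the server call. It is not how Chrome actually implements this.

```java
import java.util.BitSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: local Bloom filter first, remote lookup only on "might contain".
class UrlChecker {
    private final BitSet bits;
    private final int m;                          // number of bits in the filter
    private final int k;                          // number of hash functions
    private final Predicate<String> remoteCheck;  // stand-in for the server call
    int remoteCalls = 0;                          // how often the "server" was asked

    UrlChecker(int m, int k, Iterable<String> maliciousUrls, Predicate<String> remoteCheck) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
        this.remoteCheck = remoteCheck;
        for (String url : maliciousUrls) {
            for (int i = 0; i < k; i++) {
                bits.set(index(url, i));
            }
        }
    }

    // i-th bit index derived from two base hashes (double hashing: h1 + i * h2).
    private int index(String s, int i) {
        return Math.floorMod(s.hashCode() + i * fnv1aStyle(s), m);
    }

    // FNV-1a-style hash over the string's chars, used as the second base hash.
    private static int fnv1aStyle(String s) {
        int h = 0x811c9dc5;
        for (int j = 0; j < s.length(); j++) {
            h ^= s.charAt(j);
            h *= 0x01000193;
        }
        return h;
    }

    boolean isMalicious(String url) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(url, i))) {
                return false;          // definitely not listed: no server call needed
            }
        }
        remoteCalls++;                 // might be listed: confirm with the server
        return remoteCheck.test(url);
    }

    public static void main(String[] args) {
        Set<String> bad = Set.of("http://evil.example/a", "http://evil.example/b");
        UrlChecker checker = new UrlChecker(1024, 3, bad, bad::contains);
        System.out.println(checker.isMalicious("http://evil.example/a"));
        System.out.println(checker.isMalicious("http://good.example/"));
        System.out.println("remote calls: " + checker.remoteCalls);
    }
}
```

A listed URL always reaches the remote check, because a Bloom filter has no false negatives. An unlisted URL reaches it only when the filter produces a false positive, which is where the time saving comes from.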
I have seen this question before, and I used the advice above, but it turned out to be way too slow for me. So I wrote my own. It is not fully general, but I am sure that anybody as desperate for performance as I am can make it more general themselves :)
I used a MurmurHash implementation that you can download here: http://d3s.mff.cuni.cz/~holub/sw/javamurmurhash/
The code:
    package uk.ac.cam.cl.ss958.SpringBoardSimulation;

    import ie.ucd.murmur.MurmurHash;
    import java.util.BitSet;
    import java.util.Random;

    public class FastBloomFilter {
        private final BitSet bs;
        final int[] hashSeeds;   // one random seed per hash function
        final int capacity;      // number of bit slots

        public FastBloomFilter(int slots, int hashFunctions) {
            bs = new BitSet(slots);
            Random r = new Random(System.currentTimeMillis());
            hashSeeds = new int[hashFunctions];
            for (int i = 0; i < hashFunctions; ++i) {
                hashSeeds[i] = r.nextInt();
            }
            capacity = slots;
        }

        // Big-endian bytes of the value, used as input for the hash.
        private static byte[] toBytes(int value) {
            return new byte[] {
                (byte) (value >>> 24),
                (byte) (value >>> 16),
                (byte) (value >>> 8),
                (byte) value};
        }

        // Bit index for the i-th hash function. Math.floorMod replaces
        // Math.abs(h) % capacity, since Math.abs(Integer.MIN_VALUE) is negative.
        private int slot(byte[] b, int i) {
            int h = MurmurHash.hash32(b, 4, hashSeeds[i]);
            return Math.floorMod(h, capacity);
        }

        public void add(int value) {
            byte[] b = toBytes(value);
            for (int i = 0; i < hashSeeds.length; ++i) {
                bs.set(slot(b, i), true);
            }
        }

        public void clear() {
            bs.clear();
        }

        public boolean mightContain(int value) {
            byte[] b = toBytes(value);
            for (int i = 0; i < hashSeeds.length; ++i) {
                if (!bs.get(slot(b, i))) {
                    return false;   // a clear bit means the value was never added
                }
            }
            return true;            // all bits set: the value was probably added
        }

        public static void main(String[] args) {
            FastBloomFilter bf = new FastBloomFilter(1000, 10);
            System.out.println("Query for 2000: " + bf.mightContain(2000));
            System.out.println("Adding 2000");
            bf.add(2000);
            System.out.println("Query for 2000: " + bf.mightContain(2000));
        }
    }
A Bloom filter isn't a "framework". It's really more of an algorithm (or data structure), and an implementation isn't very long.
Here's one in Java I've tried (.jar, source code and JavaDoc being all available):
"Stand alone Java implementations of Cuckoo Hashing and Bloom Filters" (you may want to Google for this in case the following link ain't working anymore):
http://lmonson.com/blog/?page_id=99
I wrote a short post about implementing a Bloom filter using Java 8 features, which I hope is relevant to the issue of space savings. I went a bit further and discussed how to bit-slice a collection of Bloom filters, as some information retrieval systems do, which is relevant to efficiency when you have lots of Bloom filters.
You can use a Bloom filter backed by a Redis server with the Redisson library. It is based on 128-bit HighwayHash. Here is an example:
RBloomFilter<SomeObject> bloomFilter = redisson.getBloomFilter("sample");
// initialize bloom filter once with
// expectedInsertions = 55000000
// falseProbability = 0.03
bloomFilter.tryInit(55000000L, 0.03);
bloomFilter.add(new SomeObject(someStateHere1));
bloomFilter.add(new SomeObject(someStateHere2));
// does it contain object?
bloomFilter.contains(new SomeObject(someStateHere3));
Bloom filters are probabilistic data structures that can tell you whether an entry is present in a database in time that depends only on the number of hash functions, not on the number of entries (effectively O(1)). They can, however, give some false positives. With a proper selection of hash functions and bit array size, the percentage of correct results can be as high as 99.99%. Whenever an entry is added to the database, you also populate the Bloom filter by setting to 1 the bits at the indices returned by the hash functions. Each hash function returns a value between the start and end index of the bit array. During lookup, the query parameter is passed through the same hash functions. If all the corresponding bits are 1, the entry is probably present in the database. If any of the bits is 0, the entry is definitely not present. Below is the code for a simple Bloom filter simulation:
    import java.util.HashSet;
    import java.util.Random;

    public class Bloom {
        static final int SIZE = 2000;   // every hash value falls in [0, SIZE)
        static final Random random = new Random();
        static int[] bloom = new int[SIZE];
        static HashSet<Integer> set = new HashSet<Integer>();   // ground truth
        static int[] result = new int[4];
        // true positive, true negative, false positive, false negative

        public static void main(String[] args) {
            populate();
            getLookUpResult();
            for (int i : result) {
                System.out.println(i);
            }
        }

        // The three hash functions. populate() and isPresent() must use the same
        // ones; the original lookup used number^3 for the first hash, which did not
        // match the insert and overflows int for large inputs.
        static int[] hashes(int n) {
            return new int[] {(n * n * 3) % SIZE, (n * 19) % SIZE, (n * n) % SIZE};
        }

        static void populate() {
            for (int i = 0; i < 1000; i++) {
                int numb = getRandom(0, 2000);
                set.add(numb);
                for (int h : hashes(numb)) {
                    bloom[h] = 1;
                }
            }
        }

        // Random integer in [l, h).
        public static int getRandom(int l, int h) {
            return random.nextInt(h - l) + l;
        }

        public static void getLookUpResult() {
            for (int i = 0; i < 2000; i++) {
                if (isPresent(i)) {
                    if (set.contains(i)) {   // true positive
                        result[0]++;
                    } else {                 // false positive
                        result[2]++;
                    }
                } else {
                    if (set.contains(i)) {   // false negative (stays 0 for a correct filter)
                        result[3]++;
                    } else {                 // true negative
                        result[1]++;
                    }
                }
            }
        }

        public static boolean isPresent(int number) {
            for (int h : hashes(number)) {
                if (bloom[h] != 1) {
                    return false;
                }
            }
            return true;
        }
    }
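To put the false positive count printed by a toy simulation like this into perspective, the usual approximation for a filter with m bits, k hash functions and n inserted items is p ≈ (1 - e^(-kn/m))^k. The sketch below just evaluates that formula. The class name is made up, and the formula assumes idealized, independent, uniform hash functions, which the simple arithmetic hashes above are not, so measured rates can differ noticeably.

```java
// Hypothetical helper: evaluates the textbook false positive approximation.
class BloomFalsePositiveRate {
    /** Approximate false positive rate for k hashes, n items, m bits. */
    static double expectedRate(int k, long n, long m) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        // Sizing from the malicious-URL example: comes out at about 1%.
        System.out.printf("%.4f%n", expectedRate(7, 1_000_000, 9_585_059));
        // Toy simulation: 3 hashes over 2000 slots; "roughly 800 distinct
        // inserted values" is an assumption, since the inputs are random.
        System.out.printf("%.4f%n", expectedRate(3, 800, 2000));
    }
}
```

The second figure is far higher than the first because the toy filter has only a couple of slots per inserted value, whereas the sizing in the URL example allows almost ten bits per item.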