intersection of two strings using Java HashSet_问答_开发者

I am trying to learn Java by doing some assignments from a Stanford class and am having trouble answering this 开发者_如何学Goquestion.

boolean stringIntersect(String a, String b, int len): Given 2 strings, consider all the substrings within them of length len. Returns true if there are any such substrings which appear in both strings. Compute this in O(n) time using a HashSet.

I can't figure out how to do it using a Hashset because you cannot store repeating characters. So stringIntersect(hoopla, loopla, 5) should return true.

thanks!

Edit: Thanks so much for all your prompt responses. It was helpful to see explanations as well as code. I guess I couldn't see why storing substrings in a hashset would make the algorithm more efficient. I originally had a solution like :

public static boolean stringIntersect(String a, String b, int len) {
    assert (len>=1);
    if (len>a.length() || len>b.length()) return false;
    String s1=new String(),s2=new String();
    if (a.length()<b.length()){
        s1=a;
        s2=b;
    }
    else {
        s1=b;
        s2=a;
    }
    int index = 0;
    while (index<=s1.length()-len){
        if (s2.contains(s1.substring(index,index+len)))return true;
        index++;
    }
    return false;
}

I'm not sure I understand what you mean by "you cannot store repeating characters" A hashset is a Set, so it can do two things: you can add value to it, and you can add values to it, and you can check if a value is already in it. In this case, the problem wants you to answer the question by storing strings, not chars, in the HashSet. To do this in java:

Set<String> stringSet = new HashSet<String>();

Try breaking this problem into two parts: 1. Generate all the substrings of length len of a string 2. Use this to solve the problem.

The hint for part two is: Step 1: For the first string enter the substrings into a hashset Step 2: For the second string, check the values in the hashset

Note (Advanced): this problem is poorly specified. Entering and checking strings in a hashtable is O the length of the string. For string a of length n you have O(n-k) substrings of length k. So for string a being a string of length n and string b being a string of length m you have O((n-k)*k+(m-k)*k) this is not really big Oh of n, since your running time for k = n/2 is O((n/2)*(n/2)) = O(n^2)

Edit: So what if you actually want to do this in O(n) (or perhaps O(n+m+k))? My belief is that the original homework was asking for something like the algorithm I described above. But we can do better. Whats more, we can do better and still make a HashSet the crucial tool for our algorithm. The idea is to perform our search using a "Rolling Hash." Wikipedia describes a couple: http://en.wikipedia.org/wiki/Rolling_hash, but we will implement our own.

A simple solution would be to XOR the values of the character hashes together. This could allow us to add a new char to the hash O(1) and remove one O(1) making computing the next hash trivial. But this simple algorithm wont work for two reasons

The character hashes may not provide sufficient entropy. Okay, we dont know if we will have this problem, but lets solve it anyways, just for fun.
We will hash permutations to the same value ... "abc" should not have the same hash as "cba"

To solve the first problem we can use an idea from AI, namely lets steel from Zobrist hashing. The idea is to assign every possible character a random value of a greater length. If we were using ASCI, we could easily create an array with all the ASCI characters, but that will run into problems when using unicode characters. The alternative is to assign values lazily.

object LazyCharHash{
  private val map = HashMap.empty[Char,Int]
  private val r = new Random
  def lHash(c: Char): Int = {
    val d = map.get(c)
    d match {
      case None => {
        map.put(c,r.nextInt)
        lHash(c)
      }
      case Some(v) => v
    }
  }
}

This is Scala code. Scala tends to be less verbose than Java, but still allows me to use Java collections, as such I will be using imperative style Scala through out. It wouldn't be that hard to translate.

The second problem can be solved aswell. First, instead of using a pure XOR, we combine our XOR with a shift, thus the hash function is now:

def fullHash(s: String) = {
  var h = 0
  for(i <- 0 until s.length){
    h = h >>> 1
    h = h ^ LazyCharHash.lHash(s.charAt(i))
  }
  h
}

Of-course, using fullHash wont give a performance advantage. It is just a specification

We need a way of using our hash function to store values in the HashSet (I promised we would use it). We can just create a wrapper class:

class HString(hash: Int, string: String){
  def getHash = hash
  def getString = string
  override def equals(otherHString: Any): Boolean = {
    otherHString match {
      case other: HString => (hash == other.getHash) && (string == other.getString)
      case _ => false
    }
  }
  override def hashCode = hash
}

Okay, to make the hashing function rolling, we just have to XOR the value associated with the character we will no longer be using. To that just takes shifting that value by the appropriate amount.

def stringIntersect(a: String, b: String, len: Int): Boolean = {
  val stringSet = new HashSet[HString]()
  var h = 0
  for(i <- 0 until len){
    h = h >>> 1
    h = h ^ LazyCharHash.lHash(a.charAt(i))
  }
  stringSet.add(new HString(h,a.substring(0,len)))
  for(i <- len until a.length){
    h = h >>> 1
    h = h ^ (LazyCharHash.lHash(a.charAt(i - len)) >>> (len))
    h = h ^ LazyCharHash.lHash(a.charAt(i))
    stringSet.add(new HString(h,a.substring(i - len + 1,i + 1)))
  }
  ...

You can figure out how to finish this code on your own.

Is this O(n)? Well, it matters what mean. Big Oh, big Omega, big Theta, are all metrics of bounds. They could serve as metrics of the worst case of the algorithm, the best case, or something else. In this case these modification gives expected O(n) performance, but this only holds if we avoid hash collisions. It still take O(n) to tell if two Strings are equals. This random approach works pretty well, and you can scale up the size of the random bit arrays to make it work better, but it does not have guaranteed performance.

You should not store characters in the Hashset, but substrings.

When considering string "hoopla": if you store the substrings "hoopl" and "oopla" in the Hashset (linear operation), then it's linear again to find if one of the substrings of "loopla" matches.

I don't know how they're thinking you're supposed to use the HashSet but I ended up doing a solution like this:

public class StringComparator {

  public static boolean compare( String a, String b, int len ) {

    Set<String> pieces = new HashSet<String>();

    for ( int x = 0; (x + len) <= b.length(); x++ ) {
        pieces.add( a.substring( x, x + len  ) );
    }

    for ( String piece : pieces ) {
        if ( b.contains(piece) ) {
            return true;
        }
    }

    return false;

}

}