Let's say that I am streaming non-empty strings (char[]/char*s) into my program. I would like to create a set of them. That is, for any element a in set S, a is unique in S.
I have thought to approach this in a few ways, but have run into issues.
If I knew the amount of items n I开发者_开发知识库 would be reading, I could just create a hash table, with all elements beginning as null, of the same size and if there was a collision, do not insert it into that table. When the insertions are done, I would iterate through the array of the hashtable, counting non-null values, size, and then create an array of that size, and then copy all the values to it.
I could use just use a single array and resize it before an element is added, using a search algorithm to check to see if an element already exists before resizing/adding it.
I realize the second method would work, but because the elements may not be sorted, could also take a very long time for large inputs because of choice of search algorithm and resizing, regardless.
Any input would be appreciated. Please feel free to ask questions in the comment box below if you need further information. Libraries would be very helpful! (Google searching "Sets in C" and similar things doesn't help very much.)
A hash table can work even if you didn't know the size of the number of elements that you are going to be inserting ... you would simply define you hash table to use "buckets" (i.e., each position is actually a linked list of elements that hash to the same value), and you would search through each "bucket" to make sure that each element has not already been inserted into the hash-table. The key to avoiding large "buckets" to search through would be a good hash algorithm.
You can also, if you can define a weak ordering of your objects, use a binary search tree. Then if !(A < B) and !(B < A), it can be assumed A == B, and you would therefore not insert any additional iterations of that object into the tree, which again would define a set.
While I know you're using C, consider the fact that in the C++ STL, std::set
uses a RB-tree (red-black tree which is a balanced binary search tree), and std::unordered_set
uses a hash-table.
Using an array is a bad idea ... resizing operations will take a long time, where-as insertions into a tree can be done in O(log N) time, and for a hash-table, ammortized O(1).
精彩评论